Releases: delph-in/srg
0.3.6
Small update to 0.3.5:
- Months of the year treated as time relations; still lacking the specific CARG values though.
- Regression with respect to hace as in hace dos semanas. I don't understand what is going on, but I hope to fix it again; I don't think I changed anything related to this.
- Treebanked more items in tbdb11-12. Some were just hard to find but with less ambiguity, it has become a bit easier, so we found more good trees. The accuracy in 11-12 is now 77-79%, similar to TBDB10.
- A couple of inexplicable regressions, probably due to processing quirks (RAM etc.). Literally only a couple, though.
0.3.5
Added AGR constraints to a number of types to reduce overgeneration.
Added a small test suite for agreement, with 16 grammatical and 16 ungrammatical items. Overgeneration needs to be assessed in tsdb++ using LKB as the parser. Accuracy is more reliably assessed using a pydelphin script (util/treebanking-scripts/report_stats.py): I can't figure out a reliable way to combine tsdb++ and fftb, and I think tsdb++ can report wrong numbers when dealing with fftb-treebanked corpora.
Other changes include:
- All participles now go through a vpart_ilr and then, if they are past participles, through a derivational ppart-lex-rule (of the appropriate kind).
- Ordinal numbers are now recognized as such
- Comparative adverbs now trigger not only adjectival lexical entries but also the adverbial ones
- Added a version of the colon that works like a copula (the type was already there but the lexical rule was not)
- Disabled some of the rules in srules.tdl which were unused, at least in the treebanks up to length 12. They are simply commented out and can be restored if needed (commit 7892ec3)
Overgeneration:
corpus | 0.3.4 | 0.3.5 |
---|---|---|
agreement | 0.75 | 0 |
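For reference, a minimal sketch of how an overgeneration figure like the one above can be computed, assuming it is the share of ungrammatical items that still receive at least one parse. The `(is_grammatical, n_readings)` layout and the toy counts are assumptions for illustration; they are not the tsdb++ profile format or the actual per-item results.

```python
def overgeneration(items):
    """Share of ungrammatical items that receive at least one parse.

    `items` is an iterable of (is_grammatical, n_readings) pairs. This layout
    is hypothetical; it is not how tsdb++ stores profiles.
    """
    ungrammatical = [n for ok, n in items if not ok]
    if not ungrammatical:
        return 0.0
    parsed = sum(1 for n in ungrammatical if n > 0)
    return parsed / len(ungrammatical)


# Toy data only: 16 grammatical items, and 16 ungrammatical items of which
# 12 still get a parse. The real per-item results are not listed in these notes.
toy = [(True, 1)] * 16 + [(False, 2)] * 12 + [(False, 0)] * 4
print(overgeneration(toy))  # 0.75
```

Grammatical items do not enter the ratio at all, so this number says nothing about coverage; that is tracked separately in the accuracy table.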
Accuracy:
corpus | 0.3.4 | 0.3.5 |
---|---|---|
mrs | 0.81 | 0.95 |
tbdb01 | 1.0 | 1.0 |
tbdb02 | 0.93 | 0.94 |
tbdb03 | 0.88 | 0.91 |
tbdb04 | 0.86 | 0.89 |
tbdb05 | 0.86 | 0.89 |
tbdb06 | 0.82 | 0.88 |
tbdb07 | 0.76 | 0.86 |
tbdb08 | 0.82 | 0.81 |
tbdb09 | 0.77 | 0.79 |
tbdb10 | 0.76 | 0.75 |
tbdb11 | 0.50 | 0.53 |
tbdb12 | 0.65 | 0.64 |
The problem with the treebanks with longer sentences is that they are less consistently verified (it's harder to establish that the structure is correct and easier to make a mistake). In many cases, the loss of an accepted/verified parse is in fact an improvement, in the sense that the previously accepted structure was not correct. On the other hand, I actually think there are a lot more correct structures in e.g. 11-12, but more time is needed to find them. (I'd expect their real accuracy to be more similar to 9-10...)
Performance (assessed with tsdb++, not sure how reliably):
corpus | time compared to 0.3.4 | edges compared to 0.3.4 |
---|---|---|
mrs | -31% | -30% |
tbdb01 | -26% | -15% |
tbdb02 | -12% | -11% |
tbdb03 | -31% | -26% |
tbdb04 | -43% | -34% |
tbdb05 | -46% | -34% |
tbdb06 | -68% | -48% |
tbdb07 | -76% | -59% |
tbdb08 | -67% | -56% |
tbdb09 | -75% | -65% |
tbdb10 | -87% | -68% |
tbdb11 | -72% | -65% |
tbdb12 | -204% | -38% |
0.3.4
- Adding treebanks: TIBIDABO sentences of length 11 and 12 (partially treebanked, as much as we were able to do now).
- Got rid of the outdated operator `:<`, which was still found in some places.
- Updated the use of exceptions to Freeling tags
- Removed WLING, CTO, CROM from MRS output
- Further modifications to the Freeling-LKB interface.
- AGR constrained between the subject and the complement of the copula v_ap_ser_synsem; no obvious regressions in TIBIDABO 1-12; modest improvement in ambiguity (e.g. 35.38->35.42 on tbdb06, which is due to 2 sentences, for which the ambiguity goes down from 9 to 8 and from 397 to 370); improvements in the learner treebank.
0.3.3
This version corresponds to the version presented at COLING-LREC 2024.
Added semi.vpm to update the MRS format (e.g. handle -> h). Reparsed and updated all previously released treebanks. In particular, they can now be added to the tsdb/gold/ directory under the grammar directory, and then LTDB can be used to access corpus statistics etc.
0.3.2
Releasing TIBIDABO treebank portions up to sentence length 10.
Accuracy figures according to util/report_stats.py:
Corpus new_mrs accuracy 86 out of 106 (0.81)
Corpus tbdb01 accuracy 65 out of 65 (1.00)
Corpus tbdb02 accuracy 166 out of 177 (0.94)
Corpus tbdb03 accuracy 161 out of 181 (0.89)
Corpus tbdb04 accuracy 189 out of 219 (0.86)
Corpus tbdb05 accuracy 199 out of 229 (0.87)
Corpus tbdb06 accuracy 175 out of 211 (0.83)
Corpus tbdb07 accuracy 187 out of 246 (0.76)
Corpus tbdb08 accuracy 228 out of 278 (0.82)
Corpus tbdb09 accuracy 253 out of 326 (0.78)
Corpus tbdb10 accuracy 273 out of 359 (0.76)
Total accuracy: 1982 out of 2397 (0.83)
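As a sanity check, the total can be recomputed from the per-corpus lines. The sketch below is not util/report_stats.py; it only parses the report text shown above with a regular expression and sums the counts.

```python
import re

REPORT = """\
Corpus new_mrs accuracy 86 out of 106 (0.81)
Corpus tbdb01 accuracy 65 out of 65 (1.00)
Corpus tbdb02 accuracy 166 out of 177 (0.94)
Corpus tbdb03 accuracy 161 out of 181 (0.89)
Corpus tbdb04 accuracy 189 out of 219 (0.86)
Corpus tbdb05 accuracy 199 out of 229 (0.87)
Corpus tbdb06 accuracy 175 out of 211 (0.83)
Corpus tbdb07 accuracy 187 out of 246 (0.76)
Corpus tbdb08 accuracy 228 out of 278 (0.82)
Corpus tbdb09 accuracy 253 out of 326 (0.78)
Corpus tbdb10 accuracy 273 out of 359 (0.76)
"""

# One report line looks like: "Corpus <name> accuracy <correct> out of <total> (<ratio>)"
LINE = re.compile(r"Corpus (\S+) accuracy (\d+) out of (\d+)")


def parse_report(text):
    """Map each corpus name to its (correct, total) counts."""
    return {m.group(1): (int(m.group(2)), int(m.group(3)))
            for m in LINE.finditer(text)}


counts = parse_report(REPORT)
correct = sum(c for c, _ in counts.values())
total = sum(t for _, t in counts.values())
print(f"Total accuracy: {correct} out of {total} ({correct / total:.2f})")
# Total accuracy: 1982 out of 2397 (0.83)
```

Note that the total is a micro-average over items, so the larger corpora (tbdb08-10) weigh more than tbdb01.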
Grammar changes:
- Converted some of the comments to docstrings throughout the grammar (for better practice and also to be able to better use LTDB).
- Further tweaks to Freeling interface, now preserving more ambiguity (in particular, in articles and verbs like ser).
- Fixes issues: #77 #76 #65 #42 #60 #58 #17 #79 #71 #72 #69 #53 #41
0.3.1
This release contains updated treebanks (MRS test suite and TIBIDABO 01-02) plus four new portions of TIBIDABO: 03-06 (sentences of length up to 6 words). The treebanking decisions are of two types: (i) old decisions carried over from the old version of the treebanks, where it was possible to do this automatically or by manually recovering the tree which matches the gold one according to the fftb treebanking tool; (ii) additional manual decisions based on inspecting the MRS of the most sensible trees.
Major change to the Freeling-SRG interface now allows the SRG to better process some of the clitics as well as a range of other situations:
- Freeling is now called without retokenization. This makes it possible to recognize clitics as being part of their host words (as opposed to separate tokens).
- A custom mapping file named `srg-freeling.dat` is now included (found in the `util/freeling_api` folder). This file is an updated version of `sppp.dat`, which the old version of the grammar used to override some of Freeling output. The current interface only makes use of the Replace and Fusion portions. This part of the interface is required to treat e.g. tag sequences found under Fusion. Such sequences have to be mapped to special single tags (representing morphological fusion) in order for the grammar to recognize them.
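To illustrate the Fusion idea, here is a minimal sketch of mapping tag sequences to single fused tags. It does not read `srg-freeling.dat` (whose format is not described here), and the tag names and the `fuse_tags` helper are hypothetical placeholders, not real Freeling tags or part of the interface.

```python
def fuse_tags(tags, fusion_map):
    """Replace each tag sequence listed in fusion_map with its single fused tag.

    Scans left to right and prefers the longest matching sequence.
    """
    longest = max((len(seq) for seq in fusion_map), default=0)
    out, i = [], 0
    while i < len(tags):
        # Try the longest candidate sequence first, down to length 2.
        for n in range(min(longest, len(tags) - i), 1, -1):
            seq = tuple(tags[i:i + n])
            if seq in fusion_map:
                out.append(fusion_map[seq])
                i += n
                break
        else:
            # No fusion entry starts here: keep the tag unchanged.
            out.append(tags[i])
            i += 1
    return out


# Hypothetical Fusion entries; the real ones live in srg-freeling.dat.
FUSION = {
    ("TAG_VERB", "TAG_CLITIC"): "TAG_VERB+CLITIC",
    ("TAG_VERB", "TAG_CLITIC", "TAG_CLITIC"): "TAG_VERB+2CLITICS",
}

print(fuse_tags(["TAG_VERB", "TAG_CLITIC", "TAG_CLITIC", "TAG_NOUN"], FUSION))
# ['TAG_VERB+2CLITICS', 'TAG_NOUN']
```

Longest-match-first is a design choice made for this sketch; whether the actual interface resolves overlapping sequences this way is not stated in these notes.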
0.3.0
- New interface Freeling4.2-LKB-FOS (thanks to John Carroll @john-a-carroll). In order to use it:
  - update the last line in lkb/Globals.lsp to the correct location of freeling2lkb.py
  - add srg/util to PYTHONPATH
- Releasing all of the original treebanks (under freeling3.0-deprecated). They cannot be used with this version of the LKB or with ACE; however, we want to release them because otherwise they do not have a public home.
- Releasing updated tbdb02 treebank
- Changes to the grammar: a few more generic lexical entries; started work on proper treatment of numbers and mathematical signs but that is not finished. The release is primarily for the treebanks and the LKB interface.
0.2.1
- Better use of Freeling API to match the grammar's expectations
- Updated MRS test suite: now includes the original version with reconciled treebanking decisions as well as the updated version with English items removed, new items added, some items modified, some old decisions rejected and some new analyses accepted.
- Fixes issues: #22 #11 #9 #7
0.2.0
Spanish Resource Grammar release 0.2.0 (minor update).
- Instead of using a script which calls Freeling's `analyze` binary as a subprocess, we now use the Python API for Freeling. This hopefully brings at least some performance gain, and more importantly, will allow for more flexibility in the future. It also fixes the bug where original character positions were lost, which led to poor visualization by e.g. fftb in cases when input was tokenized in non-obvious ways.
- The MRS test suite treebank is updated to better reflect the grammar's coverage. Some items which were previously marked as "accepted" are now rejected, and vice versa. A number of bugs were filed as a result of the test suite review: https://github.com/delph-in/srg/issues (see issues marked with label "mrs testsuite").
0.1.0
The initial release of the Spanish Resource Grammar updated to use Freeling 4.0.
Major changes:
- Freeling 3.0 tags in iflr.tdl were updated to Freeling 4.0 tags
- Token-mapping feature geometry was added (tmt.tdl, tmr/, and some new lexical rule types in letypes.tdl to ensure copying the TRAITS feature in types previously inheriting from basic- types).
- A script util/populate_tokens.py added to utils. The script takes a directory as input and updates [incr tsdb()] profiles stored in that directory such that each item in each test suite gets the field i-tokens populated with YY formatted Freeling output.
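For readers unfamiliar with YY input, the sketch below shows roughly what a YY-formatted token string looks like. It is not taken from util/populate_tokens.py: the field layout follows the general DELPH-IN YY convention as I understand it (id, start and end vertices, character span, path, form, ipos, lexical rule, tag with probability), and the helper names and the tags are hypothetical placeholders rather than real Freeling output.

```python
def yy_token(tok_id, start, end, cfrom, cto, form, tag, prob=1.0):
    """Format one token in a YY-style layout (assumed, simplified).

    start/end are lattice vertices; cfrom/cto are character offsets
    into the original input string.
    """
    return (f'({tok_id}, {start}, {end}, <{cfrom}:{cto}>, 1, '
            f'"{form}", 0, "null", "{tag}" {prob:.4f})')


def yy_sentence(tokens):
    """Join (form, cfrom, cto, tag) tuples into one YY-style string."""
    parts = []
    for i, (form, cfrom, cto, tag) in enumerate(tokens):
        parts.append(yy_token(i + 1, i, i + 1, cfrom, cto, form, tag))
    return " ".join(parts)


# Placeholder tags; a real profile would carry Freeling 4.0 tags here.
print(yy_sentence([("hola", 0, 4, "TAG_A"), ("mundo", 5, 10, "TAG_B")]))
# (1, 0, 1, <0:4>, 1, "hola", 0, "null", "TAG_A" 1.0000) (2, 1, 2, <5:10>, 1, "mundo", 0, "null", "TAG_B" 1.0000)
```

A string of this general shape is what ends up in the i-tokens field of each item, so that the parser can be run on pre-tokenized, pre-tagged input.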
This grammar version has been tested on the MRS test suite and on a small portion of the TIBIDABO treebank, specifically the one-word sentences. This updated grammar will at least sometimes behave differently compared to the old grammar which used Freeling 3.0. At the moment, the grammar yields gold parses for some but not all items in the treebanks. As the updates and re-treebanking progress, we will release new versions.