v0.2.2
bill-baumgartner released this 03 Jul 23:01
Changes include:
- Updated all Document Readers to validate annotations as they are imported. Two types of validation are implemented. 1) Discontinuous spans are validated in two ways. First, if a discontinuous span contains adjacent component spans, e.g. [35..43][44..52], or component spans that are separated only by whitespace, then the component spans are combined, e.g. [35..52]. Second, if the discontinuous span contains a component span that is nested in another component span, e.g. [78..92][88..92], then the nested span is removed, e.g. [78..92]. 2) Coreference identity chain annotations are checked for redundant annotations within a single chain and for annotations that are members of multiple chains. In all cases, the validation throws an Exception with a descriptive error message so that the issues can be easily addressed.
- Revised the CoNLLCoref Document Writer to exclude two annotation types that are included in the CRAFT coreference annotations, but that should not be included in the CoNLL-Coref 2011/12 file format, namely 'nonreferential pronoun' and 'partonymy relation'.
- Added discontinuous span validation for the CoNLLCorefDocumentWriter. Mapping spans to token boundaries can produce nested discontinuous spans, so the validation code for discontinuous spans was added to the CoNLL-Coref document writer. There was a case in 16628246.xml (coreference annotations) where "7.5 dbc embryos" was annotated as "7" .. "5 dbc embryos". In this case the "7" maps to the "7.5" token and the "5" also maps to the "7.5" token, so the final annotation had two instances of the "7.5" token span. The original annotation, i.e. the "7" .. "5" split, may be faulty, but that is how it appears in the corpus, so a fix was required.
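
The two validations described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in these notes, not the project's actual code: the function names `normalize_spans` and `validate_chains`, the `(start, end)` tuple representation with an exclusive end offset, and the use of `ValueError` are all assumptions made for the example. The span function returns the cleaned-up spans and does not model how the real readers report the problem.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def normalize_spans(spans: List[Span], text: str) -> List[Span]:
    """Combine adjacent or whitespace-separated component spans; drop nested ones."""
    result: List[Span] = []
    # Sort by start, longest first, so a containing span precedes spans nested in it.
    for start, end in sorted(spans, key=lambda s: (s[0], -s[1])):
        if result:
            prev_start, prev_end = result[-1]
            if end <= prev_end:
                # Nested in (or identical to) the previous span: remove it.
                continue
            if text[prev_end:start].strip() == "":
                # Nothing but whitespace (or nothing at all) between the spans.
                # Assumption: partially overlapping spans are merged as well.
                result[-1] = (prev_start, end)
                continue
        result.append((start, end))
    return result


def validate_chains(chains: Dict[str, List[str]]) -> None:
    """Raise ValueError on a repeated chain member or a member shared by chains."""
    owner: Dict[str, str] = {}  # annotation id -> id of the chain that holds it
    for chain_id, members in chains.items():
        seen = set()
        for annotation_id in members:
            if annotation_id in seen:
                raise ValueError(
                    f"chain {chain_id}: redundant annotation {annotation_id}"
                )
            seen.add(annotation_id)
            if annotation_id in owner:
                raise ValueError(
                    f"annotation {annotation_id} is a member of chains "
                    f"{owner[annotation_id]} and {chain_id}"
                )
            owner[annotation_id] = chain_id
```

Applied to the 16628246.xml case, the token-mapped spans for "7" and "5 dbc embryos" in the text `"7.5 dbc embryos"` would be `(0, 3)` and `(0, 15)` under these assumptions, and `normalize_spans` reduces them to the single span `(0, 15)`.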