newpar annotation does not follow the CoNLL-U specification #42
Comments
Yes, this annotation type is new in GUM. I recently pinged @dan-zeman for feedback on this and I'm waiting to hear back about his thoughts. I don't particularly mind renaming this annotation and/or changing its internal syntax in some way, but the root issue here is that the basic
So just saying The current notation is not necessarily the most elegant solution and I'm happy to think about alternatives, but it has the advantage of allowing us to deterministically reconstruct the nested block structure of the original data. Adding @lauren-lizzy-levine who implemented the current solution, in case you have any thoughts on this! |
My 2 cents: List items aren't necessarily paragraphs. They can be construed as within a paragraph, or as containing paragraphs. Given the markup in your example, why not have separate directives for list items, e.g. |
The primary issue here is that the current GUM The secondary issue is that XML tags spanning whole sentences are required to be annotated using A simple solution for both of these issues would be to use a different attribute than There are several other issues with the current GUM XML annotation, which I did not plan to discuss here, but in the end why not.
I think I can answer these questions:
Thanks for the explanations, Amir. |
For the first question we have many cases in the corpus - but the answer depends on whether it is a block tag which never breaks sentence hierarchy, or a coincidental one spanning a sentence, such as
In the second case we would use the XML misc annotation, under the understanding that although in this case it coincidentally spans a whole sentence, this is not really a block tag, and therefore should not be included in newpar:
# newpar = p (1 s)
1 ... XML=<b>
...
23 XML =</b>
For the second question, the answer is perhaps less satisfying: if a tag has been designated as a 'block' by the scheme, then sentence boundaries may not cross it. In other words, if we decide that is a block (and in GUM it is), then even if there is a syntactic structure that can be interpreted as a sentence spanning it, we split it into multiple sentences. This was done in the interest of consistency, since it is often hard to decide if list items form one sentence or an enumeration of fragments - GUM consistently takes the latter view, e.g. this would be 5 sentences in GUM, even though we could consider the sublist to be an object of the verb 'need':
It is likely that this decision was influenced by the early inclusion of how-to guides in the corpus, which sometimes have very long nested lists without overt coordination between the bullets, but you can also get weird paradoxes in textbooks, things like:
In this example, both bullets 1 and 2 include infinitives that look like complements of "able to", but there are intervening sentences that mess this up. So to avoid all these kinds of contortions, every bullet in GUM is always an independent sentence. However, this does not apply to non-block tags, so other types of XML markup can occur sentence-medially. It does apply to headings, paragraphs, captions etc., so a sentence can never begin in a heading and carry on into the paragraph, even if that seems syntactically right (though I have no such example - I think |
Thanks again for the explanations.
I missed the meaning of block when reading this for the first time. So if the XML tags are derived from HTML, we can distinguish block and inline elements. (GUM uses only TEI-derived XML tags, which can also be divided into block and inline.) Block elements can appear only at paragraph boundaries, or in other words paragraph boundaries are defined by presence of one or more block-level tags. Inline elements are always annotated using the List items ( So the only remaining question is the original one: what should we do?
Yes, it's exactly as you described! I don't feel passionately about which path to take, but I should perhaps explain why we didn't choose the second and third options you proposed: My initial instinct was to take the last one - represent everything in a single way using MISC and not bother with But the main reason we went with The other question of whether to simplify (only list newpar) or add all of the information also seemed to point towards adding it, because in reality newpars actually represent nestable blocks, and this is something we want to push CoNLL-U to allow us to represent in the future. Finally in terms of doing both (plain newpar AND represent blocks in more detail in XML), I would point out that double inclusion of information is never a good idea, since we can have corrupt, conflicting information (if there is XML |
From the name |
From my perspective the way to avoid corruption is with validation scripts, which GUM has anyway. :) If most UD treebanks had this rich XML structure it might be a different story, but assuming that it's just a few treebanks and they may have different kinds of XML, I would leave the standard fields like |
I proposed the Admittedly, the standard might look a bit different if it were a part of the original CoNLL-U specification and were discussed more thoroughly. But this is what we have, and it has been implemented in a number of treebanks in the meantime. I think I prefer to leave |
OK, it sounds like the consensus is that everyone would like to keep That said, I have seen numerous occasions where people have used UD data to reconstruct plain text representations of datasets, so I think that UD should make a recommendation about how nested blocks like headings and item lists should be represented. That way people who can and want to preserve this information will be able to do so in a consistent way, and ultimately having this information is useful for parsing, tagging and more (e.g. being inside a heading totally alters probabilities for POS tags and trees, not to mention sentence splitting and tokenization issues). For the solution that is adopted, however it looks, I would continue to argue that redundancies are potentially dangerous. Nathan is right that GUM has a build bot and validations which would prevent conflicts, but not everyone does, so for a future-proof recommendation, building it the right way is still advisable. |
The CoNLL-U specification says
However, GUM exploits the newpar lines for other kinds of markup such as
# newpar = p (1 s)
or
# newpar = list type:::"ordered" (4 s) | item n:::"1" (1 s)
as described in the README.
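To make the extended syntax concrete, here is a minimal Python sketch that splits such a line into its nested blocks. It is inferred only from the two example lines above, not from the GUM README: it assumes blocks are separated by " | ", that attributes have the form name:::"value", and that each block ends with a sentence count "(N s)". The function name parse_gum_newpar is hypothetical.

```python
import re

# Assumed shape of one block (inferred from the two examples only):
#   tag, optional name:::"value" attributes, then "(N s)".
BLOCK_RE = re.compile(
    r'^(?P<tag>\S+)(?P<attrs>(?:\s+\S+?:::"[^"]*")*)\s+\((?P<n>\d+) s\)$'
)
ATTR_RE = re.compile(r'(\S+?):::"([^"]*)"')


def parse_gum_newpar(line):
    """Parse a GUM-style '# newpar = ...' comment into a list of blocks.

    Blocks are returned outermost first. An attribute value containing
    ' | ' would break the simple split used here.
    """
    prefix = "# newpar = "
    if not line.startswith(prefix):
        raise ValueError("not an extended newpar line: %r" % line)
    blocks = []
    for chunk in line[len(prefix):].split(" | "):
        match = BLOCK_RE.match(chunk.strip())
        if match is None:
            raise ValueError("unrecognized block: %r" % chunk)
        blocks.append({
            "tag": match.group("tag"),
            "attrs": dict(ATTR_RE.findall(match.group("attrs"))),
            "sentences": int(match.group("n")),
        })
    return blocks


if __name__ == "__main__":
    print(parse_gum_newpar('# newpar = list type:::"ordered" (4 s) | item n:::"1" (1 s)'))
```

A plain "# newpar" or "# newpar id = ..." line is deliberately rejected by this sketch, since it only targets the extended form.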
Udapi strictly follows the CoNLL-U specification and allows only
^# (newpar|newdoc)(?:\s+id\s*=\s*(.+))?
(So udapy -s < in.conllu > out.conllu results in deleting the extra markup and keeping just # newpar.)
So what should we do?
- [...] newpar.
- [...] newpar for the XML annotations and keep newpar just for the original purpose. And possibly improve validate.py to check newpar using the above-mentioned regex.
I would prefer the latter because there may be other toolkits (not only Udapi) or one-liners which expect that newpar contains just the paragraph id (or nothing). Also, explaining the semantics of the XML-enhanced newpar would make the CoNLL-U specification too long/complicated (and allowing it without explaining the semantics seems strange, although I admit there could be a link "see GUM docs for details").
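A check of newpar lines against the regex quoted above could look roughly like the following Python sketch. It is not taken from validate.py or Udapi; the function name check_newpar_lines is hypothetical. Because the quoted pattern has no end anchor, the sketch uses fullmatch so that trailing material after "# newpar" is rejected rather than silently ignored.

```python
import re

# The pattern quoted in the issue; fullmatch() below supplies the end anchor.
STRICT_RE = re.compile(r'^# (newpar|newdoc)(?:\s+id\s*=\s*(.+))?')


def check_newpar_lines(lines):
    """Return (line_number, line) pairs for newpar/newdoc comments
    that do not fit the strict '# newpar' / '# newpar id = X' form."""
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if line.startswith(("# newpar", "# newdoc")) and not STRICT_RE.fullmatch(line):
            bad.append((number, line))
    return bad


if __name__ == "__main__":
    sample = [
        "# newdoc id = doc1",
        "# newpar",
        "# newpar id = p2",
        "# newpar = p (1 s)",
        "# text = Hello",
    ]
    for number, line in check_newpar_lines(sample):
        print("line %d: non-standard comment: %s" % (number, line))
```

On the sample above, only the GUM-style line "# newpar = p (1 s)" is reported; the plain and id-bearing forms pass.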