MWEs #13
Comments
This is part of issue #4. |
Nope, this is not solved. Issue #4 is a 'meta' issue; we should break it apart into small issues like this one. |
For higher-level parsing algorithms, MWE tokens are very beneficial, because they reduce the token distance between related language material. Many UD treebanks, including the English one, use lots of 1-token MWEs, in particular for closed-class items like prepositions, conjunctions and functional adverbs. Treebanks were originally motivated by linguistic interest in syntactic structure (not ML :)), hence the MWEs, which simplify recognition of syntactic structure. That said, a post-filter could, in theory, split all MWEs, leaving each MWE's tag on its first part and giving the later parts "compound" edge labels plus dependency links to the first part. This would, however, NOT be a true representation of the internal structure of the MWEs. "em_vez_de", for instance, does have internal structure: PRP @x + N @p< + PRP @n<. For ACDC, at Linguateca, we made a filter for many MWEs to break them down WITH internal structure, but that was without dependencies, and for a closed list. Personally, I like the MWEs, because things like "em vez de" really feel like single units to me, but IF we go for splitting, I think the only practicable solution is a shallow "everything-links-to-first-part". |
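To make the shallow "everything-links-to-first-part" idea concrete, here is a minimal sketch of such a post-filter in Python. It is not the filter used for Bosque or ACDC. The token layout (dicts with `id`, `form`, `upos`, `head`, `deprel`), the function names `parts_of` and `split_mwes`, the `_` placeholder tag for later parts, and the toy two-token sentence are all assumptions made for illustration.

```python
def parts_of(form):
    """Split an underscore-joined MWE form; a bare "_" token is left alone."""
    parts = [p for p in form.split("_") if p]
    return parts or [form]


def split_mwes(tokens, relation="compound"):
    """Split 1-token MWEs, linking every later part to the first part.

    tokens: list of dicts with keys id, form, upos, head, deprel
    (ids are 1-based, head 0 is the root). Returns a new, renumbered list.
    """
    # First pass: map every old id to the new id of its first part.
    new_id = {0: 0}
    next_id = 1
    for tok in tokens:
        new_id[tok["id"]] = next_id
        next_id += len(parts_of(tok["form"]))

    # Second pass: emit the parts with remapped heads.
    out = []
    for tok in tokens:
        parts = parts_of(tok["form"])
        first = new_id[tok["id"]]
        # The first part keeps the tag, head and relation of the whole MWE.
        out.append({"id": first, "form": parts[0], "upos": tok["upos"],
                    "head": new_id[tok["head"]], "deprel": tok["deprel"]})
        # Later parts get no tag of their own (assumption: "_" placeholder)
        # and attach to the first part with the chosen relation.
        for offset, part in enumerate(parts[1:], start=1):
            out.append({"id": first + offset, "form": part, "upos": "_",
                        "head": first, "deprel": relation})
    return out


# Toy input, invented for illustration: "em_vez_de isso"
sentence = [
    {"id": 1, "form": "em_vez_de", "upos": "ADP", "head": 2, "deprel": "case"},
    {"id": 2, "form": "isso", "upos": "PRON", "head": 0, "deprel": "root"},
]
for tok in split_mwes(sentence):
    print(tok["id"], tok["form"], tok["upos"], tok["head"], tok["deprel"])
```

Passing `relation="mwe"` instead would give the label from the UD `mwe` page. Either way the output is flat: it does not recover the internal structure (PRP + N + PRP) mentioned above.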
This is not an issue with Bosque 7.5 UD; it is just a matter of adapting its output so that it is compatible with what Freeling outputs. I'll remove the label, since the fix should be applied on our side. |
@fcbr, it is an issue with Bosque 7.5, since it deals with the UD format, isn't it? |
Valeria:
e
|
@EckhardBick this is what we are discussing today by email with the UD group, and your proposal is exactly what they suggest in the UD documentation. You said
And they say
http://universaldependencies.org/u/dep/mwe.html But we also have http://universaldependencies.org/u/dep/compound.html as a more generic relation than mwe and name. |
This is a modification that Dan Zeman made after the corpus was released, so there is a version that does not have this rewriting of the MWEs, if we want to check it out. However, Zeman's representation seems better to me than plain underscore-joined tokens. For instance, by grepping for MWE we can get many (most?) of the NEs in the corpus. |
Since @EckhardBick split the MWEs (tks!! 💃), I'm closing this issue and opening specific issues #72 for the problems I found in his split. |
In [1,2] the MWEs are tokenized as single words, but this is neither the UD recommendation nor how MWEs are annotated in [3,4]. The UD documentation proposes the dependency relations 'mwe' (http://universaldependencies.org/u/dep/mwe.html) or 'compound' (http://universaldependencies.org/u/dep/compound.html) to capture these words.
Example in [1]
Example in [3]:
For reference: