MWEs #13

Closed
livyreal opened this issue Oct 6, 2016 · 9 comments

@livyreal

livyreal commented Oct 6, 2016

In [1,2] MWEs are tokenized as a single word, but this is neither the UD recommendation nor how MWEs are annotated in [3,4]. The UD documentation proposes the dependency relations 'mwe' (http://universaldependencies.org/u/dep/mwe.html) or 'compound' (http://universaldependencies.org/u/dep/compound.html) to capture these expressions.

Example in [1]

9   primitive_dream_paintings   primitive_dream_paintings   NOUN    N_M_S_@P<   Number=Sing|Gender=Masc 6   nmod

Example in [3]:

9   primitive   primitive   NOUN    n|M|S   _   6   dobj    _   MWE=primitive_dream_paintings|MWEPOS=NOUN
10  dream   dream   NOUN    NOUN    _   9   compound    _   _
11  paintings   paintings   NOUN    NOUN    _   9   compound    _   _
12  »  »  PUNCT   punc    _   9   punct   _   _

For reference:

  1. Bosque 7.5 Universal Dependencies, file bosque_CP.udep.conll.gz
  2. Bosque 7.5 Universal Dependencies, file bosque_CF.udep.conll.gz
  3. Bosque version 7.3, converted by Dan Zeman, available at http://github.com/UniversalDependencies/UD_Portuguese

@arademaker
Collaborator

This is part of issue #4.

@livyreal livyreal closed this as completed Oct 6, 2016
@arademaker arademaker reopened this Oct 6, 2016
@arademaker
Collaborator

Nope, this is not solved. Issue #4 is a 'meta' issue; we should break it apart into small issues like this one.

@EckhardBick

For higher level parsing algorithms, MWE tokens are very beneficial, because they reduce the token distance between related language material. Many UD treebanks, including the English one, use lots of 1-token MWEs, in particular for closed-class items like prepositions, conjunctions and functional adverbs. Treebanks were originally motivated by linguistic interest in syntactic structure (not ML :)), hence the MWEs which simplify recognition of syntactic structure.

That said, a post-filter could, in theory, split all MWEs, leaving each MWE's tag on the first part and assigning "compound" edge labels to the later parts, plus dependency links to the first part. This would, however, NOT be a true representation of the internal structure of the MWEs. "em_vez_de", for instance, does have internal structure: PRP @x + N @p< + PRP @n<. For ACDC, at Linguateca, we made a filter that breaks many MWEs down WITH internal structure, but that was without dependencies and relied on a closed list.

Personally, I like the MWEs, because things like "em vez de" really feel like single units to me, but IF we go for splitting, I think the only practicable solution is a shallow "everything-links-to-first-part".

@fcbr
Contributor

fcbr commented Oct 7, 2016

This is not an issue with Bosque 7.5 UD; it is just a matter of adapting its output to be compatible with what Freeling outputs. I'll remove the label, since the fix should be applied on our side.

@fcbr fcbr removed the bosque-ud-7.5 label Oct 7, 2016
@claudiafreitas

@fcbr, it is an issue with Bosque 7.5, since it deals with the UD format, isn't it?
I totally agree with Eckhard's position (and the post-editing solution). However, 1-token MWEs are a problem for discontinuous MWEs, and there the UD formalism seems a good idea.
And: when splitting MWEs, the tokens must receive their "original" POS, which I think is a very weird solution:
"Isto é, a Unicre sustenta que os 130 escudos..."
Isto_ PRON
é_ VERB
...

@arademaker
Collaborator

Valeria:

Looking at the stats, I noticed that Dan has 'do' and 'da' among the ADPs, which seemed odd to me, given that Bosque splits de+o and de+a, but Dan does not do so inside MWEs, e.g.

# sent_id pt-s9
7   Junta   Junta   PROPN   prop|F|S    _   2   nmod    _   MWE=Junta_da_Justiça_do_Trabalho|MWEPOS=PROPN

and

# sent_id pt-s10
MWE=Sindicato_dos_Jornalistas|MWEPOS=PROPN
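
A quick way to check how often this happens is to look for contracted forms inside the `MWE=` values. The sketch below assumes the `MWE=...|MWEPOS=...` layout shown in the examples; the helper name `contracted_parts` and the small contraction table are made up for illustration and cover only the de+article cases.

```python
# Contractions of "de" + definite article; only these four are covered here.
CONTRACTIONS = {
    "do": ("de", "o"),
    "da": ("de", "a"),
    "dos": ("de", "os"),
    "das": ("de", "as"),
}


def contracted_parts(misc: str) -> list:
    """Return the parts of an MWE= value that are unsplit contractions."""
    for field in misc.split("|"):
        if field.startswith("MWE="):
            parts = field[len("MWE="):].split("_")
            return [p for p in parts if p.lower() in CONTRACTIONS]
    return []  # no MWE= field in this MISC value


for misc in ("MWE=Junta_da_Justiça_do_Trabalho|MWEPOS=PROPN",
             "MWE=Sindicato_dos_Jornalistas|MWEPOS=PROPN"):
    print(misc, contracted_parts(misc))
```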

@arademaker
Collaborator

@EckhardBick this is what we are discussing by email today with the UD group, and your proposal is exactly what they suggest in the UD documentation. You said:

I think the only practicable solution is a shallow "everything-links-to-first-part".

And they say:

Multiword expressions are annotated in a flat, head-initial structure, in which all words in the expression modify the first one using the mwe label.

http://universaldependencies.org/u/dep/mwe.html

But we also have http://universaldependencies.org/u/dep/compound.html as a more generic relation than mwe and name.

@vcvpaiva

vcvpaiva commented Oct 28, 2016

This is a modification that Dan Zeman made after the corpus was released, so there is a version without this rewriting of the MWEs, if we want to check it out. However, Zeman's representation seems better to me than plain underscore-joined tokens. For instance, by grepping for MWE we can get many (most?) of the NEs in the corpus.
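
As a rough sketch of that grep, the snippet below collects the `MWE=` / `MWEPOS=` pairs from CoNLL lines and keeps the `PROPN` ones as named-entity candidates. It assumes the annotation always has the `MWE=...|MWEPOS=...` shape seen in this thread; the function name `collect_mwes` is hypothetical.

```python
import re
from collections import Counter

# Matches the MISC annotation as it appears in the examples in this thread.
MWE_RE = re.compile(r"MWE=([^|\s]+)\|MWEPOS=([^|\s]+)")


def collect_mwes(lines):
    """Count (mwe, pos) pairs found in an iterable of CoNLL lines."""
    found = Counter()
    for line in lines:
        if line.startswith("#"):  # skip comment lines such as "# sent_id ..."
            continue
        match = MWE_RE.search(line)
        if match:
            found[(match.group(1), match.group(2))] += 1
    return found


sample = [
    "# sent_id pt-s9",
    "7\tJunta\tJunta\tPROPN\tprop|F|S\t_\t2\tnmod\t_\t"
    "MWE=Junta_da_Justiça_do_Trabalho|MWEPOS=PROPN",
    "9\tprimitive\tprimitive\tNOUN\tn|M|S\t_\t6\tdobj\t_\t"
    "MWE=primitive_dream_paintings|MWEPOS=NOUN",
]
# Keep only proper-noun MWEs as NE candidates.
ne_candidates = [mwe for (mwe, pos) in collect_mwes(sample) if pos == "PROPN"]
print(ne_candidates)
```

To run it on a real file, pass an open file handle (an iterable of lines) instead of `sample`.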

@livyreal
Author

livyreal commented Nov 1, 2016

Since @EckhardBick split the MWEs (tks!! 💃), I'm closing this issue and opening specific issues (#72) for the problems I found in his split.
