MWEs #13

livyreal · 2016-10-06T15:57:38Z

In [1,2] the MWEs are tokenized as a single word, but this is neither the UD recommendation nor how MWEs are annotated in [3,4]. The UD documentation proposes the dependency relation 'mwe' (http://universaldependencies.org/u/dep/mwe.html) or 'compound' (http://universaldependencies.org/u/dep/compound.html) to capture these expressions.

Example in [1]

9   primitive_dream_paintings   primitive_dream_paintings   NOUN    N_M_S_@P<   Number=Sing|Gender=Masc 6   nmod

Example in [3]:

9   primitive   primitive   NOUN    n|M|S   _   6   dobj    _   MWE=primitive_dream_paintings|MWEPOS=NOUN
10  dream   dream   NOUN    NOUN    _   9   compound    _   _
11  paintings   paintings   NOUN    NOUN    _   9   compound    _   _
12  »  »  PUNCT   punc    _   9   punct   _   _

For reference:

Bosque 7.5 Universal dependencies, file bosque_CP.udep.conll.gz,
Bosque 7.5 Universal dependencies, file bosque_CF.udep.conll.gz
Bosque version 7.3, converted by Dan Zeman available in http://github.com/UniversalDependencies/UD_Portuguese


arademaker · 2016-10-06T16:06:25Z

This is part of the issue #4

arademaker · 2016-10-06T16:24:04Z

Nope, this is not solved. Issue #4 is a 'meta' issue; we should break it up into small issues like this one.

EckhardBick · 2016-10-07T07:17:03Z

For higher level parsing algorithms, MWE tokens are very beneficial, because they reduce the token distance between related language material. Many UD treebanks, including the English one, use lots of 1-token MWEs, in particular for closed-class items like prepositions, conjunctions and functional adverbs. Treebanks were originally motivated by linguistic interest in syntactic structure (not ML :)), hence the MWEs which simplify recognition of syntactic structure.

That said, a post-filter could, in theory, split all MWEs, leaving its tag on the first part, and assigning "compound" edge labels to the later parts, plus dependency links to the first part. This would, however, NOT be a true representation of the internal structure of the MWEs. "em_vez_de", for instance, does have internal structure: PRP @x + N @p< + PRP @n<. For ACDC, at Linguateca, we made a filter for many MWEs to break them down WITH internal structure, but that was without dependency, and a closed list.

Personally, I like the MWEs, because things like "em vez de" really feel like single units to me, but IF we go for splitting, I think the only practicable solution is a shallow "everything-links-to-first-part".
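The shallow "everything-links-to-first-part" post-filter described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the filter that was actually used: the function name `split_mwes` is hypothetical, a sentence is assumed to be a list of rows whose first eight columns are ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, and MWE parts are assumed to be joined by underscores. The later parts get "_" as their tags, since their original POS is not recoverable from the joined token alone.

```python
def _parts(form, sep="_"):
    """Split a joined MWE form; leave the token whole if any part would be empty."""
    parts = form.split(sep)
    return parts if all(parts) else [form]


def split_mwes(rows, sep="_", label="compound"):
    """Split 1-token MWEs in one sentence (hypothetical helper).

    rows: list of column lists (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, ...).
    The first part keeps the tags, head and relation of the whole MWE;
    every later part attaches to the first part with `label`.
    """
    # First pass: map each old ID to the new ID of its first part.
    new_id = {0: 0}
    next_id = 1
    for row in rows:
        new_id[int(row[0])] = next_id
        next_id += len(_parts(row[1], sep))

    out = []
    for row in rows:
        parts = _parts(row[1], sep)
        first = new_id[int(row[0])]
        head = new_id[int(row[6])]
        # Simplification: the form of each part is reused as its lemma.
        out.append([str(first), parts[0], parts[0]] + row[3:6]
                   + [str(head), row[7]] + row[8:])
        extra = ["_"] * (len(row) - 8)  # pad any trailing columns
        for offset, part in enumerate(parts[1:], start=1):
            out.append([str(first + offset), part, part, "_", "_", "_",
                        str(first), label] + extra)
    return out


if __name__ == "__main__":
    sentence = [
        ["1", "em_vez_de", "em_vez_de", "ADP", "PRP", "_", "2", "case"],
        ["2", "isso", "isso", "PRON", "PRON", "_", "0", "root"],
    ]
    for row in split_mwes(sentence):
        print("\t".join(row))
```

Note that the heads of all other tokens have to be renumbered, which is why the sketch makes a first pass over the sentence before emitting anything. Passing `label="mwe"` instead of the default would give the other relation mentioned in this thread.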

fcbr · 2016-10-07T12:08:55Z

This is not an issue with Bosque 7.5 UD; it is just a matter of adapting its output so that it is compatible with what Freeling outputs. I'll remove the label, since the fix should be applied on our side.

claudiafreitas · 2016-10-07T15:45:01Z

@fcbr, it is an issue with Bosque 7.5, since it deals with the UD format, isn't it?
I totally agree with Eckhard's position (and post-editing solution). However, the 1-token MWE is a problem for discontinuous MWEs, and there the UD formalism seems a good idea.
And: when splitting an MWE, the tokens must receive their "original" POS - a very weird solution, I think:
"Isto é, a Unicre sustenta que os 130 escudos..."
Isto_ PRON
é_ VERB
...

arademaker · 2016-10-27T18:06:26Z

Valeria:

Looking at the stats, I noticed that Dan has 'do' and 'da' among the ADPs, which seemed strange to me, since Bosque splits de+o and de+a but Dan does not do so inside MWEs, e.g.

# sent_id pt-s9
7   Junta   Junta   PROPN   prop|F|S    _   2   nmod    _   MWE=Junta_da_Justiça_do_Trabalho|MWEPOS=PROPN

and

# sent_id pt-s10
MWE=Sindicato_dos_Jornalistas|MWEPOS=PROPN

arademaker · 2016-10-27T18:12:09Z

@EckhardBick this is what we are discussing today by email with the UD group, and your proposal is exactly what they suggest in the UD documentation. You said

I think the only practicable solution is a shallow "everything-links-to-first-part".

And they say

Multiword expressions are annotated in a flat, head-initial structure, in which all words in the expression modify the first one using the mwe label.

http://universaldependencies.org/u/dep/mwe.html

But we also have http://universaldependencies.org/u/dep/compound.html as a more generic relation than mwe and name.

vcvpaiva · 2016-10-28T01:32:25Z

This is a modification that Dan Zeman made after the corpus was released, so there is a version that does not have this rewriting of the MWEs, if we want to check it out. However, Zeman's representation seems better to me than plain underscore-joined tokens. For instance, by grepping for MWE we can get many (most?) of the NEs in the corpus.
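As a rough illustration of the "grepping for MWE" idea, the sketch below pulls the `MWE=...|MWEPOS=...` pairs out of CoNLL lines and keeps those tagged PROPN as likely named entities. The function name `mwe_annotations` is hypothetical, and the regular expression assumes the two attributes always appear together and in this order, as they do in the examples quoted in this thread.

```python
import re

# Matches the annotation as it appears in the examples in this thread.
MWE_RE = re.compile(r"MWE=([^|\s]+)\|MWEPOS=([^|\s]+)")


def mwe_annotations(lines):
    """Yield (mwe, pos) pairs for every line carrying an MWE annotation."""
    for line in lines:
        match = MWE_RE.search(line)
        if match:
            yield match.group(1), match.group(2)


if __name__ == "__main__":
    sample = [
        "9   primitive   primitive   NOUN    n|M|S   _   6   dobj    _   "
        "MWE=primitive_dream_paintings|MWEPOS=NOUN",
        "10  dream   dream   NOUN    NOUN    _   9   compound    _   _",
        "7   Junta   Junta   PROPN   prop|F|S    _   2   nmod    _   "
        "MWE=Junta_da_Justiça_do_Trabalho|MWEPOS=PROPN",
    ]
    # Keep only proper-noun MWEs as named-entity candidates.
    for mwe, pos in mwe_annotations(sample):
        if pos == "PROPN":
            print(mwe.replace("_", " "))
```

Whether PROPN-tagged MWEs cover most of the NEs in the corpus is something that would still have to be checked against the data.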

livyreal · 2016-11-01T19:31:11Z

Since @EckhardBick split the MWEs (tks!! 💃), I'm closing this issue and opening specific issues (#72) for the problems I found in his split.

livyreal closed this as completed Oct 6, 2016

arademaker reopened this Oct 6, 2016

livyreal added the bosque-ud-7.5 label Oct 6, 2016

fcbr removed the bosque-ud-7.5 label Oct 7, 2016

fcbr added the bosque-ud-7.5 label Oct 7, 2016

arademaker mentioned this issue Oct 8, 2016

compounded proper nouns #23

Closed

arademaker mentioned this issue Oct 28, 2016

dependence relations found in corpus, not in the UD #66

Closed

livyreal closed this as completed Nov 1, 2016

arademaker mentioned this issue Sep 15, 2021

DET followed by NOUN, but not related #345

Open
