compounded proper nouns #23

Closed
arademaker opened this issue Oct 8, 2016 · 12 comments
Labels
manual (manual correction needed)

Comments

@arademaker
Collaborator

arademaker commented Oct 8, 2016

The sentence is "A renda da noite de inauguração será doada ao Fundo Social de Solidariedade do Estado."

<s id="2114" ref="CF502-4" source="CETENFolha n=502 cad=Cotidiano sec=soc sem=94a" forest="1" text="...">

...
12      Fundo_Social_de_Solidariedade   Fundo_Social_de_Solidariedade   PROPN   PROP_M_S_@P<    Number=Sing|Gender=Masc 9       nmod
13      de      de      ADP     PRP_@N< _       15      case
14      o       o       DET     ART_M_S_@>N     PronType=Art|Number=Sing|Gender=Masc    15      det
15      Estado  estado  NOUN    N_M_S_@P<       Number=Sing|Gender=Masc 12      nmod

Note that #13 deals with the difference between UD and PALAVRAS in how they encode MWEs (a single token vs. dependency relations). Here the problem is the identification of the MWE, in particular of named entities (NE). This is only one case; we probably have more. How do we fix it? I am assuming it would need to be done manually.

@vcvpaiva

vcvpaiva commented Dec 9, 2016

Is this issue still an issue? I thought Eckhard's new version, which unpacks the MWEs (they are now flat), had solved this one.

@livyreal

livyreal commented Dec 12, 2016

Eckhard only split these words. The fact that "do Estado" is not part of the MWE still needs to be corrected by hand, and the individual POS tag of each word in the MWEs also needs to be set by hand. When he split the MWEs, Fundo_Social_de_Solidariedade (PROPN) became Fundo (PROPN) Social (PROPN) de (PROPN) Solidariedade (PROPN).

This is an issue that I expect to be resolved by #110.

Can this be closed, @arademaker, or is there something more specific here?
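For illustration, a minimal Python sketch of what a naive split of an underscore-joined MWE token does, and why the result still needs manual review. This is not Eckhard's actual script: `split_mwe` and the lowercase heuristic are hypothetical names and assumptions introduced here.

```python
def split_mwe(form, upos):
    """Split an underscore-joined MWE form into (form, upos) pairs.

    Every part simply inherits the tag of the original token, which is
    how a function word such as "de" ends up tagged PROPN.
    """
    return [(part, upos) for part in form.split("_")]


parts = split_mwe("Fundo_Social_de_Solidariedade", "PROPN")

# Hypothetical heuristic (an assumption, not a project rule): all-lowercase
# parts are probably function words whose tag should be checked by hand.
needs_review = [form for form, _ in parts if form.islower()]

for form, upos in parts:
    print(form, upos)
print("check by hand:", needs_review)
```

A heuristic like this can only point at suspicious tags. Deciding that "do Estado" belongs to the name is still a manual judgment.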

@arademaker
Collaborator Author

arademaker commented Dec 12, 2016

@livyreal no, it can't be closed. What will the MWE value in the misc field look like? "Fundo_Social_de_Solidariedade_de_o_Estado" or "Fundo_Social_de_Solidariedade_do_Estado"? In other words, how do we deal with contractions? That is why I am against keeping this MWE value in the misc field!

What happens to the relations of tokens 14-18? Are they all attached to 12 by flat:name? What are the upostags of these tokens? Can you edit the file CF 502.conllu and email me the edited version? I will then put it in the repo and close this issue.

edited: I agree that this is probably part of #110, but we have no easy way to search @claudiafreitas's lists as they currently stand to find out whether this name occurs in them. In one of them I found only the MWE 'Fundo_Social'; it must be exactly the one we are seeing here, which was split incorrectly.

arademaker added a commit that referenced this issue Dec 13, 2016
arademaker added a commit that referenced this issue Dec 13, 2016
@vcvpaiva

@arademaker you may have seen that there are 723 "fusions" in the UD_Portuguese corpus, as per stats in https://github.com/UniversalDependencies/UD_Portuguese/blob/master/stats.xml#L12.

@arademaker
Collaborator Author

arademaker commented Dec 13, 2016

@vcvpaiva yes, lines such as this. I am aware of that, and I know that we need to fix our data since we haven't encoded it. Actually, the English treebank hasn't encoded it either. The documentation about it is here and we recently discussed it in UniversalDependencies/docs#322

But I didn't understand the reason for the comment here.

EDITED: fusions are also called contractions.

@arademaker
Collaborator Author

@livyreal and @claudiafreitas do you agree with the changes in the file? Can I close this issue?

@vcvpaiva

But I didn't understand the reason for the comment here.

Well, it's just that you could create an issue to go over all the (not so many) 723 occurrences, if you wish.

@arademaker
Collaborator Author

arademaker commented Dec 13, 2016

In UD_Portuguese we have ~17K contractions!

$ egrep "^[0-9]+-[0-9]+\t" *conllu | wc -l
   17439

We first need to improve our library for reading and writing CoNLL-U files. It can't handle these lines yet.
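As a rough idea of what handling these lines involves, here is a minimal Python sketch that separates multiword-token range lines (IDs such as `13-14`) from ordinary word lines. It assumes the standard tab-separated ten-column CoNLL-U layout. `parse_sentence` is a hypothetical name, not the project's library, and the `13-14 do` range line is an illustrative encoding of the contraction in the sentence above, not a line taken from the repository.

```python
import re

RANGE_ID = re.compile(r"^(\d+)-(\d+)$")  # multiword-token line, e.g. "13-14"
WORD_ID = re.compile(r"^\d+$")           # ordinary syntactic word, e.g. "13"


def parse_sentence(lines):
    """Return (words, ranges) for the CoNLL-U lines of one sentence.

    words  : list of (id, form) for ordinary word lines
    ranges : list of (first_id, last_id, surface_form) for range lines
    Comment lines, blank lines and any other ID shape are skipped.
    """
    words, ranges = [], []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        m = RANGE_ID.match(cols[0])
        if m:
            ranges.append((int(m.group(1)), int(m.group(2)), cols[1]))
        elif WORD_ID.match(cols[0]):
            words.append((int(cols[0]), cols[1]))
    return words, ranges


# Illustrative input: the contraction "do" spanning the words "de" + "o".
sample = [
    "13-14\tdo\t_\t_\t_\t_\t_\t_\t_\t_",
    "13\tde\tde\tADP\t_\t_\t15\tcase\t_\t_",
    "14\to\to\tDET\t_\tPronType=Art\t15\tdet\t_\t_",
    "15\tEstado\testado\tNOUN\t_\t_\t12\tnmod\t_\t_",
]
words, ranges = parse_sentence(sample)
print(words)
print(ranges)
```

Keeping the range lines in a separate list leaves the dependency tree over the syntactic words untouched, while a writer can still emit each range line before its first word.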

@vcvpaiva

vcvpaiva commented Dec 13, 2016 via email

@arademaker
Collaborator Author

arademaker commented Dec 13, 2016

@vcvpaiva no: 723 unique fusions, but 17439 occurrences of fusions! Here is how they count fusions, with the same regex that I used above.
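The difference between the two numbers can be sketched in Python with the same pattern: every matching line is one occurrence, while the set of distinct surface forms gives the unique count. `count_fusions` is a hypothetical helper, the sample lines are made up, and comparing forms exactly as written (no case folding) is an assumption, not necessarily what the UD statistics script does.

```python
import re
from collections import Counter

# Same pattern as the egrep call above, plus a group capturing the FORM column.
RANGE_LINE = re.compile(r"^[0-9]+-[0-9]+\t([^\t]*)")


def count_fusions(lines):
    """Return (occurrences, unique_forms) over multiword-token range lines."""
    forms = Counter()
    for line in lines:
        m = RANGE_LINE.match(line)
        if m:
            forms[m.group(1)] += 1
    return sum(forms.values()), len(forms)


# Made-up sample: "do" occurs twice and "na" once.
sample = [
    "1-2\tdo\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tde\tde\tADP\t_\t_\t3\tcase\t_\t_",
    "2\to\to\tDET\t_\t_\t3\tdet\t_\t_",
    "5-6\tna\t_\t_\t_\t_\t_\t_\t_\t_",
    "9-10\tdo\t_\t_\t_\t_\t_\t_\t_\t_",
]
occurrences, unique = count_fusions(sample)
print(occurrences, unique)
```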

@vcvpaiva

vcvpaiva commented Dec 13, 2016 via email

arademaker pushed a commit that referenced this issue Aug 27, 2021
@arademaker
Collaborator Author

This issue started as a discussion of one particular case of a name MWE and then moved on to contractions. Since then, a new treatment has been adopted for name MWEs. So I am going to close this issue, taking the opportunity of the fix I made in 2ecb4aa.


3 participants