Text indexation on portuguese #32

albcunha · 2021-03-21T23:15:13Z

Hello! It looks like token.idx is not working correctly for Portuguese.

I think the cause is multiword tokens. In Portuguese, "da" (of the) is a contraction of "de" + "a".

I saw #17, which seems to be the same problem, but it does not seem to work for Portuguese.

This works (token.text and text slice are the same):

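import spacy_udpipe

# Assumes the English model was already fetched with spacy_udpipe.download("en")
nlp = spacy_udpipe.load("en")
text = "The language of peace can be a culture."
doc = nlp(text)
for token in doc:
    # Print the token text next to the slice of the original text at token.idx
    print(token.text, text[token.idx:token.idx + len(token.text)])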

The The
language language
of of
peace peace
can can
be be
a a
culture culture
. .

This does not work (token.text and the text slice differ from the multiword token onward):

nlp = spacy_udpipe.load("pt")
text = "A linguagem da paz pode ser uma cultura."
doc = nlp(text)
for token in doc:
    # Same comparison as above, now on a sentence containing the contraction "da"
    print(token.text, text[token.idx:token.idx + len(token.text)])

A A
linguagem linguagem
de da
a p
paz z p
pode de s
ser r u
uma a c
cultura ltura.
.

Any ideas on how to work around this?


asajatovic · 2021-03-24T18:29:26Z

@albcunha Thank you for reporting this issue. I'll try to look into it in more detail. In the meantime, which column is the desired one of the two for Portuguese? 😃

albcunha · 2021-03-28T00:55:27Z

Ideally, I think a general rule would be that, within a multiword token, token.idx for the first token covers only the first character, and the second token covers the rest of the word (the remaining characters). The output could have this format:

A A 0
linguagem linguagem 2
da de 12
a a 13
paz paz 15
pode pode 19
ser ser 24
uma uma 28
cultura cultura 32
. . 39
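
The rule above can be sketched in plain Python. This is only an illustration, not spacy-udpipe code: `assign_offsets` and its `groups` input are hypothetical names, and the sketch assumes each surface word is available together with the syntactic words it expands to (for example "da" with "de" and "a"). Within a multiword token, every word except the last covers one character and the last covers the remainder.

```python
def assign_offsets(text, groups):
    """Return (word, idx, covered_slice) for every syntactic word.

    `groups` is a hypothetical input: a list of (surface, words) pairs,
    where `surface` is the string as it appears in `text` and `words`
    are the syntactic words it expands to.
    """
    rows = []
    cursor = 0
    for surface, words in groups:
        # Locate the surface form in the original text, scanning forward only.
        start = text.index(surface, cursor)
        end = start + len(surface)
        for i, word in enumerate(words):
            is_last = i == len(words) - 1
            # Each non-final word covers one character; the last covers the rest.
            word_start = min(start + i, end)
            word_end = end if is_last else min(word_start + 1, end)
            rows.append((word, word_start, text[word_start:word_end]))
        cursor = end
    return rows


text = "A linguagem da paz pode ser uma cultura."
groups = [
    ("A", ["A"]), ("linguagem", ["linguagem"]), ("da", ["de", "a"]),
    ("paz", ["paz"]), ("pode", ["pode"]), ("ser", ["ser"]),
    ("uma", ["uma"]), ("cultura", ["cultura"]), (".", ["."]),
]
offsets = assign_offsets(text, groups)
for word, idx, covered in offsets:
    print(word, idx, repr(covered))
```

For this sentence the sketch gives "de" the index 12 and "a" the index 13, matching the list above. If a surface form is shorter than the number of words it expands to, the trailing words get an empty slice at the end of the surface form.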

There are many words in Portuguese where these contractions happen, and some are not identified by the model. Still, it happens a lot. The change, as suggested, would solve all the words I checked, such as these:
do, dos, da, das, dum, duns, duma, dumas, doutro, doutros, doutra, doutras, donde, no, nos, na, nas, num, nuns, numa, numas, noutro, noutros, pelo, pelos, pela, pelas.

There are other contractions that the model won't "catch", so I think those do not matter here.

Thanks for any help!

asajatovic added the bug (Something isn't working) label Mar 24, 2021

asajatovic assigned mariosasko Jul 26, 2021
