Hello! token.idx does not seem to work correctly for Portuguese.
I think the cause is multiword tokens. In Portuguese, "da" (of the) is a contraction of "de" + "a".
I saw #17, which seems to be the same problem, but it seems it won't work for Portuguese.
This works (token.text and text slice are the same):
import spacy_udpipe

nlp = spacy_udpipe.load("en")
text = "The language of peace can be a culture."
doc = nlp(text)
for token in doc:
    # token.text vs. the slice of the original text starting at token.idx
    print(token.text, text[token.idx:token.idx + len(token.text)])
The The
language language
of of
peace peace
can can
be be
a a
culture culture
. .
This won't work (token.text and the text slice are no longer the same after the multiword token):
nlp = spacy_udpipe.load("pt")
text = "A linguagem da paz pode ser uma cultura."
doc = nlp(text)
for token in doc:
    print(token.text, text[token.idx:token.idx + len(token.text)])
A A
linguagem linguagem
de da
a p
paz z p
pode de s
ser r u
uma a c
cultura ltura.
.
Any ideas of how to circumvent this?
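To make the mismatch easier to spot, the comparison in the loops above can be wrapped in a small helper. This is only a sketch: `misaligned` and `Tok` are hypothetical names, and the sample tokens are written by hand for illustration rather than taken from the model. Since the helper only reads `.text` and `.idx`, it should also accept a real `doc`.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token: only .text and .idx are used here.
Tok = namedtuple("Tok", ["text", "idx"])

def misaligned(text, tokens):
    """Return (position, token_text, text_slice) for every token whose
    slice of the original text differs from token.text."""
    problems = []
    for i, tok in enumerate(tokens):
        piece = text[tok.idx:tok.idx + len(tok.text)]
        if piece != tok.text:
            problems.append((i, tok.text, piece))
    return problems

text = "A linguagem da paz pode ser uma cultura."
# Hand-written sample tokens (an assumption, not real model output).
tokens = [Tok("A", 0), Tok("linguagem", 2), Tok("de", 12), Tok("a", 15)]
print(misaligned(text, tokens))
```

With a loaded pipeline, `misaligned(text, nlp(text))` would list every token affected by the offset drift.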
@albcunha Thank you for reporting this issue. I'll try to look into it in more detail. In the meantime, which column is the desired one of the two for Portuguese? 😃
Ideally, I think a general rule would be that token.idx for the first token of a multiword token points at the first character of the original word, and the second token takes the rest of the word (the remaining characters). They could have this format:
A A 0
linguagem linguagem 2
da de 12
a a 13
paz paz 15
pode pode 19
ser ser 24
uma uma 28
cultura cultura 32
. . 39
These contractions happen in many Portuguese words, and some of them are not identified by the model. Still, it happens a lot. The change, as suggested, would solve all the words I checked, such as these: do, dos, da, das, dum, duns, duma, umas, doutro, doutros, doutra, doutras, donde, no, nos, na, nas, num, nuns, numa, numas, noutro, noutros, pelo, pelos, pela, pelas.
There are other contractions that the model won't "catch", so I think they do not matter here.
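The rule suggested above can be sketched in plain Python. This is not how spacy_udpipe computes offsets; `proposed_offsets` and the `units` input (each surface word paired with the tokens it splits into) are hypothetical and written by hand here. The first token of a contraction gets the start of the surface word, and any later token gets the next character.

```python
def proposed_offsets(text, units):
    """Assign an offset to every token under the rule suggested above.

    units: list of (surface, parts), where parts are the tokens the
    surface word is split into (one part for ordinary words).
    """
    rows = []
    cursor = 0
    for surface, parts in units:
        start = text.index(surface, cursor)  # locate the surface word
        last = start + len(surface) - 1      # last character of that word
        rows.append((parts[0], start))       # first token: first character
        for part in parts[1:]:
            # later tokens: the rest of the word, capped for one-character words
            rows.append((part, min(start + 1, last)))
        cursor = start + len(surface)
    return rows

text = "A linguagem da paz pode ser uma cultura."
# Hand-written segmentation (an assumption): only "da" is a multiword token.
units = [("A", ["A"]), ("linguagem", ["linguagem"]), ("da", ["de", "a"]),
         ("paz", ["paz"]), ("pode", ["pode"]), ("ser", ["ser"]),
         ("uma", ["uma"]), ("cultura", ["cultura"]), (".", ["."])]
for token_text, idx in proposed_offsets(text, units):
    print(token_text, idx)
```

For this sentence the sketch gives de 12 and a 13, with every other token at its true position in the text, which matches the offsets listed above.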