Portuguese model mistakenly splits URLs into their own sentences #1423

busdriverbuddha · 2024-09-20T13:40:08Z

Describe the bug
When a Portuguese Pipeline processes text containing bare URLs (e.g. example.com), the URLs are split across sentences, as the dots are seemingly treated as full stops.

To Reproduce

import stanza
nlp = stanza.Pipeline(lang="pt", processors="tokenize", verbose=False)
text = "Olá, não deixe de visitar nossos sites em exemplo1.com, exemplo2.com e exemplo3.com.br"
doc = nlp(text)
print([s.text for s in doc.sentences])

Output:

['Olá, não deixe de visitar nossos sites em exemplo1.', 'com, exemplo2.', 'com e exemplo3.', 'com.br']

Expected behavior
As per the example above, a single sentence:

['Olá, não deixe de visitar nossos sites em exemplo1.com, exemplo2.com e exemplo3.com.br']

Environment (please complete the following information):

OS: Ubuntu
Python version: 3.10.12
Stanza version: 1.9.2

Additional context
This does not happen when the URLs start with www:

['Olá, não deixe de visitar nossos sites em www.exemplo1.com, www.exemplo2.com e www.exemplo3.com.br']


AngledLuffa · 2024-09-20T15:42:05Z

We use a regex for URL which allows certain patterns:

stanza/stanza/models/tokenization/utils.py

Line 198 in 539760c

URL_RAW_RE = r"""(?:https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s"]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s"]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s"]{2,}|www\.[a-zA-Z0-9]+\.[^\s"]{2,})"""

The issue you're finding is that it accepts www.example.com but doesn't accept example.com without the www. I suppose it would make sense to add example.com and example.com.TLD to the patterns it accepts. Those are unlikely to be false positives.
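To make the gap concrete, here is a small standalone check of the pattern quoted above, using only Python's `re` module. The helper name `find_urls` and the shortened test strings are illustrative assumptions. The snippet only shows which spans the pattern recognizes, not how stanza consumes the matches during tokenization.

```python
import re

# Pattern copied verbatim from the snippet quoted above.
URL_RAW_RE = r"""(?:https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s"]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s"]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s"]{2,}|www\.[a-zA-Z0-9]+\.[^\s"]{2,})"""
URL_RE = re.compile(URL_RAW_RE)


def find_urls(text):
    """Hypothetical helper: return every span the pattern recognizes."""
    return [m.group(0) for m in URL_RE.finditer(text)]


# Every alternative in the pattern requires either "http(s)://" or "www.",
# so bare domains such as exemplo1.com are never recognized.
with_www = find_urls("sites em www.exemplo1.com, www.exemplo2.com e www.exemplo3.com.br")
without_www = find_urls("sites em exemplo1.com, exemplo2.com e exemplo3.com.br")

print(len(with_www))     # three spans, each starting with "www."
print(len(without_www))  # no spans at all
```

Note that `[^\s"]{2,}` also swallows trailing punctuation, so a match can include the comma that follows a URL.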

AngledLuffa · 2024-09-20T22:44:39Z

4421213

busdriverbuddha · 2024-09-22T16:58:36Z

Ahh, I see! Might I recommend adding some other common suffixes like .org, .net, and .gov?
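As a rough illustration of that suggestion, a bare-domain pattern with a short TLD list and an optional two-letter country code could look like the sketch below. This is not the pattern from the referenced commits, whose exact contents are not shown in this thread. The names `COMMON_TLDS` and `BARE_DOMAIN_RE` and the TLD list itself are assumptions made for the example.

```python
import re

# Hypothetical TLD list for illustration; not taken from the stanza source.
COMMON_TLDS = ("com", "org", "net", "gov")

# A domain label, one of the listed TLDs, then an optional two-letter
# country code such as ".br". The trailing \b keeps "frase.comece"
# from being read as a domain ending in ".com".
BARE_DOMAIN_RE = re.compile(
    r"\b[a-zA-Z0-9][a-zA-Z0-9-]*\.(?:%s)(?:\.[a-z]{2})?\b" % "|".join(COMMON_TLDS)
)

text = "Olá, não deixe de visitar nossos sites em exemplo1.com, exemplo2.com e exemplo3.com.br"
matches = BARE_DOMAIN_RE.findall(text)
print(matches)  # ['exemplo1.com', 'exemplo2.com', 'exemplo3.com.br']
```

Restricting the match to a fixed TLD list is what keeps false positives low: an ordinary sentence boundary with a missing space, such as "frase.comece", does not match.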


busdriverbuddha added the bug label Sep 20, 2024

AngledLuffa added a commit that referenced this issue Sep 22, 2024

Add some more TLD to the tokenization RE (some of which actually get …

a545153

…country code TLD after them as well) #1423

AngledLuffa added a commit that referenced this issue Sep 22, 2024

Add some more TLD to the tokenization RE (some of which actually get …

f59ccd8

…country code TLD after them as well) #1423
