Portuguese model mistakenly splits URLs into their own sentences #1423

Open
busdriverbuddha opened this issue Sep 20, 2024 · 3 comments

@busdriverbuddha

Describe the bug
When a Pipeline processes Portuguese text containing bare URLs (e.g. example.com), each URL is split across sentence boundaries, apparently because the dots are treated as full stops.

To Reproduce

import stanza
nlp = stanza.Pipeline(lang="pt", processors="tokenize", verbose=False)
text = "Olá, não deixe de visitar nossos sites em exemplo1.com, exemplo2.com e exemplo3.com.br"
doc = nlp(text)
print([s.text for s in doc.sentences])

Output:

['Olá, não deixe de visitar nossos sites em exemplo1.', 'com, exemplo2.', 'com e exemplo3.', 'com.br']

Expected behavior
As per the example above, a single sentence:

['Olá, não deixe de visitar nossos sites em exemplo1.com, exemplo2.com e exemplo3.com.br']

Environment (please complete the following information):

  • OS: Ubuntu
  • Python version: 3.10.12
  • Stanza version: 1.9.2

Additional context
This does not happen when the URLs start with www:

['Olá, não deixe de visitar nossos sites em www.exemplo1.com, www.exemplo2.com e www.exemplo3.com.br']
@AngledLuffa
Collaborator

We use a regex for URLs which only allows certain patterns:

URL_RAW_RE = r"""(?:https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s"]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s"]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s"]{2,}|www\.[a-zA-Z0-9]+\.[^\s"]{2,})"""

The issue you're finding is that it accepts www.example.com, but doesn't accept example.com without the www. I suppose it would make sense to add example.com and example.com.TLD to the patterns it accepts. Those are unlikely to be false positives.
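
To see the gap concretely, here is a minimal sketch that compiles the regex quoted above with Python's `re` module and checks a few inputs. The helper name `matches_url` is made up for illustration and is not part of Stanza; how the tokenizer actually applies the pattern internally is not shown here.

```python
import re

# Pattern copied verbatim from the comment above.
URL_RAW_RE = r"""(?:https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s"]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s"]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s"]{2,}|www\.[a-zA-Z0-9]+\.[^\s"]{2,})"""
url_re = re.compile(URL_RAW_RE)


def matches_url(text):
    """Hypothetical helper: True if the pattern finds a URL anywhere in text."""
    return url_re.search(text) is not None


# Every alternative requires either a scheme (http:// or https://) or a
# leading "www.", so a bare domain never matches.
for candidate in ["www.exemplo1.com", "https://exemplo1.com", "exemplo1.com", "exemplo3.com.br"]:
    print(candidate, matches_url(candidate))
```

Only the first two candidates match, which is consistent with the www-prefixed sentence in the report staying in one piece.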

@AngledLuffa
Collaborator

4421213

@busdriverbuddha
Author

Ahh, I see! Might I recommend adding some other common suffixes like .org, .net, and .gov?
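
For illustration, a rough sketch of what a bare-domain pattern covering those suffixes might look like. This is an assumption for discussion, not the pattern from the referenced commit: the names `TLDS`, `BARE_DOMAIN_RE` and `find_bare_domains` are hypothetical, and the optional two-letter country suffix is just one way to cover cases like `.com.br`.

```python
import re

# Hypothetical suffix list; a real fix may choose a different set.
TLDS = ("com", "org", "net", "gov")

# A domain label, one of the listed suffixes, then an optional two-letter
# country code (e.g. ".br"). The word boundaries keep the trailing comma
# or sentence-final period out of the match.
BARE_DOMAIN_RE = re.compile(
    r"\b[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?\.(?:%s)(?:\.[a-zA-Z]{2})?\b"
    % "|".join(TLDS)
)


def find_bare_domains(text):
    """Hypothetical helper: return every bare domain found in text."""
    return BARE_DOMAIN_RE.findall(text)


text = "Olá, não deixe de visitar nossos sites em exemplo1.com, exemplo2.com e exemplo3.com.br"
print(find_bare_domains(text))
# ['exemplo1.com', 'exemplo2.com', 'exemplo3.com.br']
```

Restricting the match to a short list of known suffixes is what keeps ordinary sentence-final periods from being picked up, at the cost of missing any domain whose suffix is not listed.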

AngledLuffa added a commit that referenced this issue Sep 22, 2024