Portuguese model mistakenly splits URLs into their own sentences #1423
Comments
We use a regex for URLs which only allows certain patterns: stanza/stanza/models/tokenization/utils.py, line 198 in 539760c
The issue you're finding is that it accepts |
Ahh, I see! Might I recommend adding some other common suffixes like .org, .net, and .gov?
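For illustration, here is a rough sketch of what a URL pattern with a wider suffix list might look like. This is not the regex from stanza's utils.py; the names `COMMON_TLDS`, `URL_RE`, and `find_urls` are hypothetical, and the suffix list is just an assumption based on the suffixes mentioned above plus an optional two-letter country code.

```python
import re

# Hypothetical suffix list for illustration; not taken from the stanza source.
COMMON_TLDS = ("com", "org", "net", "gov", "edu")

# Optional scheme or "www." prefix, one or more dot-separated labels,
# a known suffix, then an optional two-letter country code (e.g. ".br").
# The trailing \b keeps "example.community" from matching as "example.com".
URL_RE = re.compile(
    r"(?:https?://|www\.)?"
    r"(?:[a-z0-9-]+\.)+"
    r"(?:" + "|".join(COMMON_TLDS) + r")"
    r"(?:\.[a-z]{2})?\b",
    re.IGNORECASE,
)


def find_urls(text):
    """Return every substring of `text` that matches the sketch pattern."""
    return [m.group(0) for m in URL_RE.finditer(text)]


print(find_urls("Visite example.com e exemplo.gov.br hoje."))
```

A tokenizer could treat such matches as single tokens so the inner dots are never considered as sentence boundaries, though how stanza actually wires its regex into the model is not shown here.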
…country code TLD after them as well) #1423
Describe the bug
When a Pipeline is run on Portuguese text containing URLs (e.g. example.com), each URL is split into its own sentence, as the dots are seemingly treated as full stops.
To Reproduce
Output:
Expected behavior
As per the example above, a single sentence:
Environment (please complete the following information):
Additional context
This does not happen when the URLs start with www.