I work with text that may contain URLs. I pre-process documents before feeding them into lingua-rs, and I use the linkify crate to find URL indices. Finding URLs is a tricky problem in its own right, and there are many ways to do it. linkify returns any string that is valid according to the specs, so there can be false positives. In addition, I validate domain names using addr.
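A rough sketch of one way such pre-processing could look, written in Python for brevity. It makes several assumptions that are not from the comment above: the simple `URL_RE` pattern stands in for a proper link finder such as linkify, the `mask_urls` helper is hypothetical, and blanking out URLs with spaces is only one possible strategy. Replacing each URL with the same number of spaces keeps every character offset unchanged, so indices reported by the detector still line up with the original text.

```python
import re

# Deliberately simple pattern (an assumption, not a full URL grammar):
# an optional http(s) scheme, a dotted host name, and an optional path.
# It will produce false positives and negatives on real-world text.
URL_RE = re.compile(
    r"(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/\S*)?",
    re.IGNORECASE,
)


def mask_urls(text):
    """Return (masked_text, spans).

    Every URL-like match is overwritten with spaces of the same length,
    so character offsets in masked_text match those in text.
    spans is a list of (start, end) pairs with end exclusive.
    """
    spans = [m.span() for m in URL_RE.finditer(text)]
    chars = list(text)
    for start, end in spans:
        chars[start:end] = " " * (end - start)
    return "".join(chars), spans


if __name__ == "__main__":
    masked, spans = mask_urls("Besuchen Sie somerandomwebsite.com heute")
    print(repr(masked))
    print(spans)
```

The masked text can then be passed to the language detector, and the returned spans can be used afterwards to decide how the URL ranges should be labeled.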
URLs tend to make the language detector switch to English. For example:
Results:
Notice also that the end of the URL, .html, is separated from the beginning of the URL and classified as yet another language change. The same happens if just the domain name somerandomwebsite.com is referenced in the text.
Would it be reasonable for the language detector to treat URIs as "language neutral" stretches of text that assume the language of the surrounding text? Treating URIs as atomic would also solve the issue of URIs being split by the language detector.
Note: It is also possible to do this by post-processing the results of Lingua. After receiving the start/end indices of each language segment from Lingua, I apply my URI regular expression to find the start/end indices of URIs and then modify the Lingua results accordingly.
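The post-processing idea in the note could be sketched roughly as follows. This is an illustration under stated assumptions, not Lingua's API: segments are represented as plain `(start, end, language)` tuples with an exclusive end that together cover the whole text, the `URI_RE` pattern is a simplified stand-in for a real URI regular expression, and `merge_uris` is a hypothetical helper name. Each URI is relabeled with the language of the nearest non-whitespace character before it (or after it, if the URI opens the text), and adjacent segments with the same language are then merged.

```python
import re

# Simplified URI pattern (an assumption): optional http(s) scheme,
# dotted host name, optional path. Real text needs something stricter.
URI_RE = re.compile(
    r"(?:https?://)?(?:[a-z0-9-]+\.)+[a-z]{2,}(?:/\S*)?",
    re.IGNORECASE,
)


def merge_uris(text, segments):
    """Make every URI atomic and give it the surrounding language.

    segments: sorted (start, end, language) tuples, end exclusive,
    assumed to cover the whole text without gaps.
    """
    # Expand the segments into one language label per character.
    labels = [None] * len(text)
    for start, end, lang in segments:
        labels[start:end] = [lang] * (end - start)

    for match in URI_RE.finditer(text):
        s, e = match.span()
        # Look left for the nearest non-whitespace character.
        i = s - 1
        while i >= 0 and text[i].isspace():
            i -= 1
        if i >= 0:
            lo, hi = i + 1, e
            # Also absorb whitespace that directly follows the URI.
            while hi < len(text) and text[hi].isspace():
                hi += 1
        else:
            # The URI opens the text, so look right instead.
            i = e
            while i < len(text) and text[i].isspace():
                i += 1
            if i >= len(text):
                continue  # nothing but the URI: keep the original labels
            lo, hi = 0, i
        labels[lo:hi] = [labels[i]] * (hi - lo)

    # Collapse runs of identical labels back into segments.
    merged = []
    for pos, lang in enumerate(labels):
        if merged and merged[-1][2] == lang:
            merged[-1][1] = pos + 1
        else:
            merged.append([pos, pos + 1, lang])
    return [tuple(seg) for seg in merged]


if __name__ == "__main__":
    text = "Das ist ein Satz https://somerandomwebsite.com/index.html und noch mehr Text."
    u, h, t = text.index("https"), text.index(".html"), text.index(" und")
    # Hypothetical detector output in which the URL is split in two.
    segments = [(0, u, "GERMAN"), (u, h, "ENGLISH"),
                (h, t, "FRENCH"), (t, len(text), "GERMAN")]
    print(merge_uris(text, segments))
```

With the hypothetical input above, the two segments that split the URL are absorbed into the surrounding German text, leaving a single segment. Working on per-character labels keeps the logic short at the cost of memory proportional to the text length; an interval-based version would be preferable for very large documents.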