Bias towards non-English languages? #270
Can you please give me some examples of the Irish-English strings? That would make it easier for me to examine what's going on.

Generally, there is no rule in the library's rule engine that explicitly underweights English. The rule engine looks for characters in the texts that are unique to one or more languages and then adds more weight to those languages. The following characters are treated as potential indicators for Irish (but also for a few others):

Another factor might be the ngram probabilities. If the sum of ngram probabilities for Irish is larger than the sum of ngram probabilities for English, then Irish will be returned.

If you give me some examples, I can tell you whether the rule engine or the statistical model is decisive for them.
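In case it helps to reproduce this, here is a minimal sketch of how to inspect those confidence values with the Python bindings used later in this thread. The two-language restriction and the sample sentence are my own illustration, not taken from this issue:

from lingua import Language, LanguageDetectorBuilder

# Restrict the detector to the two languages in question so that the
# confidence values are directly comparable.
detector = (
    LanguageDetectorBuilder
    .from_languages(Language.ENGLISH, Language.IRISH)
    .build()
)

# Hypothetical mixed English/Irish line, purely for illustration.
text = "Alice was beginning to get very tired / Bhí Eilís ag éirí an-tuirseach"

for confidence in detector.compute_language_confidence_values(text):
    print(confidence.language.name, round(confidence.value, 4))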
No worries, you are welcome. I'm always happy about feedback, especially if it is as friendly as yours. :)
Thank you very much. :) Feel free to open a pull request if you think that you have found useful optimizations.
Thanks. Here's an example of some print-outs. These are bilingual documents (an attempt by me, in fact, to translate Alice in Wonderland from English to Irish). So in one column there's the source (English) and in the other column the Irish translation. You can see there are distinct runs of proper text in English and Irish (not just jottings). This would obviously be an ideal candidate for using detect_multiple_languages_of.
I haven't had a chance to look at your source code yet (or to try the new version, which you said handles exotic Unicode better). By the way (because I haven't had a chance to examine things, I don't know whether you've already factored in this sort of thing), when a string is subjected to multiple-language analysis, maybe you should take account of (1) newlines, (2) full stops, and (3) semicolons, as signals that make a language fragment boundary more likely?
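To illustrate that suggestion: this is not how lingua works internally, just an external approximation using the Python API shown in this thread, with an arbitrary choice of delimiters and an assumed English/Irish language pair:

import re

from lingua import Language, LanguageDetectorBuilder

detector = (
    LanguageDetectorBuilder
    .from_languages(Language.ENGLISH, Language.IRISH)
    .build()
)

def detect_per_fragment(text):
    # Split on newlines, full stops and semicolons, as suggested above,
    # then classify each fragment on its own.
    fragments = [f.strip() for f in re.split(r"[\n.;]+", text) if f.strip()]
    return [(f, detector.detect_language_of(f)) for f in fragments]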
I think this is quite a different example, but one of the failures I have seen involves words borrowed from other languages:

>>> from lingua import LanguageDetectorBuilder
>>> detector = LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models().build()
>>> confidence_values = detector.compute_language_confidence_values("Just you and me in this digital tête-à-tête,")
>>> confidence_values
[ConfidenceValue(language=Language.FRENCH, value=0.0935808995725718), ConfidenceValue(language=Language.ENGLISH, value=0.083711183706532), ConfidenceValue(language=Language.DUTCH, value=0.049392064611581105), ConfidenceValue(language=Language.CATALAN, value=0.04836927544723105), ...]

Interestingly, the results are completely off for multi-language detection of this snippet:

>>> detector.detect_multiple_languages_of("Just you and me in this digital tête-à-tête,")
[DetectionResult(start_index=0, end_index=19, word_count=5, language=Language.FRENCH), DetectionResult(start_index=19, end_index=44, word_count=3, language=Language.ESPERANTO)]
>>> detector.detect_multiple_languages_of("Just you and me in this digital,")
[DetectionResult(start_index=0, end_index=32, word_count=7, language=Language.ENGLISH)]
>>> detector.detect_multiple_languages_of("tête-à-tête,")
[DetectionResult(start_index=0, end_index=12, word_count=3, language=Language.FRENCH)]

It recognises the sentence without the French words easily as English; the isolated loanwords then quickly throw lingua off. I'm using the Python bindings here (2.0.2) - I hope that doesn't affect the relevance to this discussion.
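One builder option that may help with near-ties like FRENCH 0.094 vs ENGLISH 0.084 above (my suggestion, not something proposed earlier in this thread) is with_minimum_relative_distance, which makes detect_language_of return None instead of guessing when the top candidates are too close together; it does not change detect_multiple_languages_of:

from lingua import LanguageDetectorBuilder

# Refuse to pick a winner when the leading confidence values are nearly tied.
# The 0.25 threshold is arbitrary and would need tuning.
detector = (
    LanguageDetectorBuilder
    .from_all_languages()
    .with_minimum_relative_distance(0.25)
    .build()
)

print(detector.detect_language_of("Just you and me in this digital tête-à-tête,"))
# -> None when no candidate is clearly ahead, otherwise a Language member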
I won't bombard you with any more issues. I just think this crate is really excellent and am excited by it. It's going to make my Elasticsearch indices and my use of them much better.
So most of the strings I'm subjecting to analysis are in the range 100 chars to maybe 1000 chars.
I have quite a few bilingual documents in my corpus, almost all between English and some other language, usually with English in one column and the other language in the other. So parsing a document tends to produce quite a bit of text with, say, Irish and English mixed.
In those cases Irish almost always seems to be chosen as "language with the highest confidence". So then I thought I'd examine the levels of confidence for all 6 languages for all these bilingual Irish-English strings. To my surprise, it is usually Irish 1.0 and English 0.0! Or occasionally Irish 0.88 and English 0.09, something like that.
This tends to suggest that if a non-English language is detected it is given a higher "weighting" than English.
But the thing is, if you are offering multiple-language detection (which I realise is an experimental feature at this stage), a bias against any language in this way is a bit unfortunate: it makes it harder to identify strings that appear to contain runs of more than one language, so that you can then switch to detect_multiple_languages_of for more detailed analysis. I'd be interested to hear what you have to say about this. Meanwhile I may well clone your repository and see if there are any obvious ways I might be able to tweak things a bit to address some of the issues I have currently.
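A rough sketch of the screening workflow described here, assuming a detector restricted to the bilingual pair and an arbitrary confidence-gap threshold; the reported Irish 1.0 / English 0.0 scores are exactly what would make this screening step unreliable at the moment:

from lingua import Language, LanguageDetectorBuilder

detector = (
    LanguageDetectorBuilder
    .from_languages(Language.ENGLISH, Language.IRISH)
    .build()
)

def analyse(text, gap_threshold=0.2):
    # compute_language_confidence_values returns values sorted in descending
    # order; treat a small gap between the top two candidates as a sign that
    # the string may contain runs of more than one language.
    values = detector.compute_language_confidence_values(text)
    if len(values) > 1 and values[0].value - values[1].value < gap_threshold:
        return detector.detect_multiple_languages_of(text)
    return values[0].language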