A Japanese sentence "人参はβ−カロテン含有量が高く栄養豊富" is detected as Chinese #406

Open
sueki1242 opened this issue Nov 28, 2024 · 0 comments

Comments

@sueki1242

Thank you for providing such a wonderful crate!
I found an accuracy problem in language detection. As the title says, the natural Japanese sentence 人参はβ−カロテン含有量が高く栄養豊富 (roughly, "Carrots are rich in β-carotene and highly nutritious") is detected as Chinese by the logic in detect_language_with_rules.

How to reproduce

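use lingua::{Language, LanguageDetectorBuilder};
let detector = LanguageDetectorBuilder::from_all_languages().build();
let result = detector.detect_language_of("人参はβ−カロテン含有量が高く栄養豊富");
// Current (unexpected) behavior: the result is Chinese, not Japanese.
assert_eq!(result, Some(Language::Chinese));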

Details

If the input text contains only words classified as Japanese, Chinese, or None, and the number of None words is small, the text is judged to be Japanese. This gives correct results in many cases for sentences that mix kanji and kana.

However, this sentence contains β, which is classified as Greek, so the above rule does not apply, and the text is judged to be Chinese by majority vote.
I think the rule should be changed to something like “If Chinese and Japanese words make up the majority (>90%?) of the words, the text is judged to be Japanese.”

If there are no problems with this policy, I will create a pull request.
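
The proposed rule can be sketched roughly as follows. This is only an illustration of the idea, not the crate's actual code: the `WordLanguage` and `Verdict` types, the `judge_cjk` function, and the `threshold` parameter are all hypothetical names. The sketch also assumes that at least one word must be classified as Japanese before the rule fires, which is not stated above.

```rust
/// Hypothetical per-word classification (not a type from the crate).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WordLanguage {
    Japanese,
    Chinese,
    /// Any other detected language, e.g. Greek for "β".
    Other,
    /// No language could be determined (the "None" case).
    Unknown,
}

/// Hypothetical result of the rule.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Japanese,
    /// The rule does not apply; fall through to the remaining logic.
    Undecided,
}

/// Judge the text to be Japanese when the share of words classified as
/// Japanese or Chinese exceeds `threshold` (e.g. 0.9).
/// Assumption: at least one word must be classified as Japanese.
fn judge_cjk(words: &[WordLanguage], threshold: f64) -> Verdict {
    if words.is_empty() {
        return Verdict::Undecided;
    }
    let japanese = words
        .iter()
        .filter(|w| **w == WordLanguage::Japanese)
        .count();
    let cjk = words
        .iter()
        .filter(|w| matches!(**w, WordLanguage::Japanese | WordLanguage::Chinese))
        .count();
    let share = cjk as f64 / words.len() as f64;
    if japanese > 0 && share > threshold {
        Verdict::Japanese
    } else {
        Verdict::Undecided
    }
}
```

With a rule like this, a single word classified as Greek among many kanji and kana words would no longer force the fallback to majority voting, as long as the Japanese and Chinese share stays above the threshold. The exact threshold is open for discussion.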
