A Japanese sentence "人参はβ−カロテン含有量が高く栄養豊富" is detected as Chinese #406

sueki1242 · 2024-11-28T11:53:15Z

Thank you for providing such a wonderful crate!
I found an accuracy issue in language detection. As the title says, the natural Japanese sentence 人参はβ−カロテン含有量が高く栄養豊富 (roughly, "carrots are high in β-carotene and rich in nutrients") is detected as Chinese by the logic in detect_language_with_rules.

How to reproduce

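use lingua::{Language, LanguageDetectorBuilder};

let detector = LanguageDetectorBuilder::from_all_languages().build();
let result = detector.detect_language_of("人参はβ−カロテン含有量が高く栄養豊富");
// Currently passes; the expected result is Some(Language::Japanese).
assert_eq!(result, Some(Language::Chinese));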

Details

If the input text contains only words classified as Japanese, Chinese, or None, and the number of None words is small, the text is judged to be Japanese. This gives correct results in many cases for sentences that mix kanji and kana.

However, this sentence contains β, which is classified as Greek, so the rule above does not apply and the text is judged to be Chinese by majority vote.
I think the rule should be changed to something like “If Chinese and Japanese words together make up the large majority (>90%?) of the words, the text is judged to be Japanese.”
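
The proposed rule could look roughly like the sketch below. It is only an illustration, not the crate's code: the WordLanguage enum and the is_japanese_by_share function are hypothetical names, and the 0.9 threshold is the tentative value from the proposal. The sketch also assumes that at least one word must be classified as Japanese, so that purely Chinese text is not relabelled.

```rust
/// Hypothetical per-word classification; not a type from the crate.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WordLanguage {
    Japanese,
    Chinese,
    Other, // e.g. a word classified as Greek because of "β"
}

/// Returns true when Japanese and Chinese words together exceed
/// `min_cjk_share` of all words and at least one word is Japanese.
/// The "at least one Japanese word" condition is an added assumption.
fn is_japanese_by_share(words: &[WordLanguage], min_cjk_share: f64) -> bool {
    if words.is_empty() {
        return false;
    }
    let japanese = words
        .iter()
        .filter(|w| **w == WordLanguage::Japanese)
        .count();
    let chinese = words
        .iter()
        .filter(|w| **w == WordLanguage::Chinese)
        .count();
    let share = (japanese + chinese) as f64 / words.len() as f64;
    japanese > 0 && share > min_cjk_share
}
```

With a share-based rule like this, a single stray word in another script no longer disables the Japanese check. Whether it fixes the sentence above depends on how the crate splits the text into words and on the threshold chosen.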

If there are no problems with this policy, I will create a pull request.
