Thank you for providing such a wonderful crate!
I found an issue with detection accuracy. As the title says, the natural Japanese sentence 人参はβ−カロテン含有量が高く栄養豊富 is detected as Chinese by the logic in detect_language_with_rules.
How to reproduce
let detector = LanguageDetectorBuilder::from_all_languages().build();
let result = detector.detect_language_of("人参はβ−カロテン含有量が高く栄養豊富");
// This assertion passes, which is the bug: the expected result is Some(Language::Japanese).
assert_eq!(result, Some(Language::Chinese));
Details
If the input text contains only words classified as Japanese, Chinese, or None, and the number of None words is small, the text is judged to be Japanese. This gives correct results in many cases for sentences that mix kanji and kana.
However, this sentence contains β, which is classified as Greek, so the rule above does not apply and the text is judged to be Chinese by majority vote.
I think the logic should be changed to something like “If Chinese and Japanese words together make up the large majority (>90%?) of the words, the text is judged to be Japanese.”
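To make the proposal concrete, here is a minimal sketch of the rule, not the crate's actual code. The names `WordLang`, `Verdict` and `apply_cjk_rule` are hypothetical and exist only for illustration, and the 0.9 threshold is the tentative value from above. Requiring at least one Japanese word is an extra assumption on my part, so that purely Chinese text is not caught by the rule.

```rust
/// Hypothetical per-word classification, for illustration only
/// (not a type from the crate).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WordLang {
    Japanese,
    Chinese,
    /// Any other detected language, e.g. Greek for "β".
    Other,
    /// Words for which no language could be determined (the None words).
    Unknown,
}

/// Outcome of the rule: either it decides Japanese, or it leaves the
/// decision to whatever logic runs next (e.g. the majority vote).
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Japanese,
    Undecided,
}

/// Returns `Verdict::Japanese` when Japanese and Chinese words together
/// make up at least `threshold` of all words and at least one word is
/// Japanese. All words, including `Other` and `Unknown`, count toward
/// the total.
fn apply_cjk_rule(words: &[WordLang], threshold: f64) -> Verdict {
    if words.is_empty() {
        return Verdict::Undecided;
    }

    let japanese = words.iter().filter(|w| **w == WordLang::Japanese).count();
    let chinese = words.iter().filter(|w| **w == WordLang::Chinese).count();
    let cjk_share = (japanese + chinese) as f64 / words.len() as f64;

    if japanese > 0 && cjk_share >= threshold {
        Verdict::Japanese
    } else {
        Verdict::Undecided
    }
}
```

With this shape, a single Greek word among many kanji and kana words no longer disables the rule, while a text with a substantial share of other languages still falls through to the existing logic.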
If there are no problems with this policy, I will create a pull request.