CV2-2559: use local language detection model #448
Conversation
Some minor tweaks you can take or leave; otherwise this looks good.
I think the langid tests might need to be updated to explicitly check fasttext? https://github.com/meedan/alegre/blob/develop/app/test/test_langid.py (some of it seems to test individual language models independently)
Good point. I've added that now 🙏
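For context, a minimal sketch of what an explicit fasttext check in app/test/test_langid.py might look like; the test class, test name, and asserted return shape are assumptions, not the code actually added:

```python
import unittest

from app.main.lib.langid import FastTextLangidProvider


class TestFastTextLangid(unittest.TestCase):
    def test_fasttext_detects_english(self):
        # get_language() is the call shown in the diff further down; the
        # shape of its return value asserted here is an assumption.
        result = FastTextLangidProvider.fasttext_model.get_language(
            "this is some english text"
        )
        self.assertEqual(result["result"]["language"], "en")
```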
Looks like one of the fasttext tests failed, and there was another error about missing Google credentials? (Maybe no longer needed.)
No, this was just me running tests to see if fasttext would agree with the expected CLD output in all cases. The answer is 'no' 😂 I think our language tests are not great because we have a list of strings and their expected languages, but the expected language is different for different models 🙃
I'm guessing that is why the test was structured not to assume that results are identical across models, as long as each model is consistent? i.e. something like the sketch below.
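(A hypothetical illustration, not the original snippet; the provider API assumed here may differ from the actual test code:)

```python
# Run the same provider twice on the same input and require identical output,
# without asserting that any two providers agree on a particular language.
def assert_consistent(provider, text):
    first = provider.langid(text)
    second = provider.langid(text)
    assert first["result"]["language"] == second["result"]["language"]
```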
Sure, but our tests used to accept […].

In this case, I believe fasttext hasn't been trained to detect languages written in Latin characters when those languages are not usually written in Latin script (i.e., transliterated text). Notably, it returns lower confidence scores for these examples. I'm updating the tests to just skip these cases for fasttext; I've left the other models alone for now. E.g., only Google gets this one right: […]

I think this is a larger issue with the tests as written, but I'm not going to address it in this PR. The purpose of the tests, as far as I'm concerned, is to ensure the model produces consistent output across runs.
I've updated the code to fall back to Google when the fasttext result has low confidence. Taking this conversation internal to check if we have support for this.
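A sketch of that fallback, assuming provider classes along these lines in app/main/lib/langid.py. HybridLangidProvider itself appears in the diff below, but the method names, the GoogleLangidProvider name, the result shape, and the 0.8 threshold are illustrative assumptions:

```python
from app.main.lib.langid import FastTextLangidProvider, GoogleLangidProvider


class HybridLangidProvider:
    CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff; the PR may use another value

    @staticmethod
    def langid(text):
        result = FastTextLangidProvider.langid(text)
        # Fall back to the Google API when fasttext is unsure, e.g. on
        # transliterated text it was never trained on.
        if result["result"]["confidence"] < HybridLangidProvider.CONFIDENCE_THRESHOLD:
            result = GoogleLangidProvider.langid(text)
        return result
```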
app/main/lib/langid.py
FastTextLangidProvider.fasttext_model.get_language("Some text to check")
return True

class HybridLangidProvider:
Assuming the idea is we run both for a little while to see if the agreement is good enough (disagreement probably only on edge cases) and then disable CLD?
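If that's the plan, a hypothetical sketch of the trial-period comparison; Cld3LangidProvider is an assumed name for the existing CLD provider, and the result shape is also an assumption:

```python
import logging

from app.main.lib.langid import FastTextLangidProvider
from app.main.lib.langid import Cld3LangidProvider  # assumed name


def langid_with_agreement_check(text):
    fasttext_result = FastTextLangidProvider.langid(text)
    cld_result = Cld3LangidProvider.langid(text)
    # Log disagreements so we can judge later whether CLD can be disabled.
    if fasttext_result["result"]["language"] != cld_result["result"]["language"]:
        logging.info(
            "langid disagreement on %r: fasttext=%s cld3=%s",
            text,
            fasttext_result["result"]["language"],
            cld_result["result"]["language"],
        )
    return fasttext_result
```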
Code Climate has analyzed commit 13c5c42 and detected 2 issues on this pull request. Here's the issue category breakdown:
The test coverage on the diff in this pull request is 87.8% (50% is the threshold). This pull request will bring the total coverage in the repository to 79.8% (0.0% change). View more on Code Climate.
Force-pushed from 13c5c42 to e20aba2
Description
The goal of this PR is to stop using Google's API for language identification and to use a local model instead. We previously implemented CLD3 but had not turned it on. After some ad-hoc testing, I felt a FastText model performed better; that model is included in this PR.
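For reference, a minimal sketch of local language ID with fasttext. The filename here is the publicly distributed lid.176.ftz model (under 1 MB), which may not be the exact file vendored in this PR:

```python
import fasttext

# Load the compressed fasttext language-identification model.
model = fasttext.load_model("lid.176.ftz")

# predict() returns parallel tuples of labels and scores.
labels, scores = model.predict("this is a test", k=1)
language = labels[0].replace("__label__", "")  # e.g. '__label__en' -> 'en'
confidence = float(scores[0])
```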
Longer term, language detection may belong in Presto. It's probably also not great to include the model in this repo, but in this case it's extremely small and including it here simplifies DevOps. Nonetheless, I welcome feedback on this point from the team.
Reference: CV2-2559
How has this been tested?
Tested locally via flask shell. The existing tests cover the code and pass.
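What the flask shell check might have looked like, mirroring the call in the diff above (the commented return shape is an assumption):

```python
# Run inside `flask shell`:
from app.main.lib.langid import FastTextLangidProvider

FastTextLangidProvider.fasttext_model.get_language("Some text to check")
# => e.g. {'result': {'language': 'en', 'confidence': ...}}  (assumed shape)
```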
Have you considered secure coding practices when writing this code?
N/A