Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CV2-2559: use local language detection model #448

Closed
wants to merge 0 commits into from

Conversation

computermacgyver
Copy link
Contributor

@computermacgyver computermacgyver commented Sep 5, 2024

Description

Goal of this PR is to stop using Google's API for language identification and use a local model instead. We previously implemented CLD3, but had not turned it on. After some ad-hoc testing, I felt a FastText model was better. That model is included in this PR.

Longer term, language detection may belong in Presto. It's probably also not great to include the model in this repo, but in this case it's extremely small and including it here simplifies DevOps. Nonetheless, I welcome feedback on this point from the team.

Reference: CV2-2559

How has this been tested?

Tested locally via flash shell. Existing tests cover the code and work.

Have you considered secure coding practices when writing this code?

N/A

app/main/lib/langid.py Outdated Show resolved Hide resolved
@computermacgyver computermacgyver marked this pull request as ready for review September 25, 2024 13:58
@computermacgyver computermacgyver changed the title CV2-2559: use CLD CV2-2559: use local language detection model Sep 25, 2024
Copy link
Contributor

@DGaffney DGaffney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some minor tweaks you can take or leave otherwise :shipit:

app/main/lib/elasticsearch.py Outdated Show resolved Hide resolved
app/main/lib/langid.py Outdated Show resolved Hide resolved
app/main/lib/langid.py Outdated Show resolved Hide resolved
app/main/lib/text_similarity.py Outdated Show resolved Hide resolved
Copy link
Contributor

@skyemeedan skyemeedan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the langid tests might need to be updated to explicitly check fasttext? https://github.com/meedan/alegre/blob/develop/app/test/test_langid.py (some if it seems to test individual languages models independently)

@computermacgyver
Copy link
Contributor Author

I think the langid tests might need to be updated to explicitly check fasttext? https://github.com/meedan/alegre/blob/develop/app/test/test_langid.py (some if it seems to test individual languages models independently)

Good point. I've added now 🙏

@skyemeedan
Copy link
Contributor

Looks like one of the fasttext tests failed

----------------------------------------------------------------------
Traceback (most recent call last):
  File "/app/app/test/test_langid.py", line 83, in test_langid_fasttext
    self.assertEqual(test['cld3'], result['result']['language'], test['text'])
AssertionError: 'hi-Latn' != 'nl'
- hi-Latn
+ nl
 : namaste mera naam Karim hai

and there was another error about missing google credentials? (maybe no longer needed)

@computermacgyver
Copy link
Contributor Author

Looks like one of the fasttext tests failed

----------------------------------------------------------------------
Traceback (most recent call last):
  File "/app/app/test/test_langid.py", line 83, in test_langid_fasttext
    self.assertEqual(test['cld3'], result['result']['language'], test['text'])
AssertionError: 'hi-Latn' != 'nl'
- hi-Latn
+ nl
 : namaste mera naam Karim hai

and there was another error about missing google credentials? (maybe no longer needed)

No this was me just running tests to see if fasttext would agree with the expected cld output in all cases. The answer is 'no' 😂

I think our language tests are not great because we have a list of strings and their expected languages, but the expected language is different for different models 🙃

@skyemeedan
Copy link
Contributor

I think our language tests are not great because we have a list of strings and their expected languages, but the expected language is different for different models 🙃

I'm guessing that is why the test was structured to not assume that results are identical across models, as long as they are consistent? i.e. hi and hi-latin are both OK, but nl seems wrong

@computermacgyver
Copy link
Contributor Author

I'm guessing that is why the test was structured to not assume that results are identical across models, as long as they are consistent? i.e. hi and hi-latin are both OK, but nl seems wrong

Sure, but our tests used to accept en from the Microsoft language id model. There are limitations with every model. CLD3 thinks, 'how to slice a banana' is haw, but fasttext identifies it correctly as en. There is no perfect model that will get every string right. Google's API does best, but we've been told to stop using.

In this case, I believe fasttext hasn't been trained to detect languages in Latin characters that are not usually written with Latin characters. Notably it returns lower confidence scores for these examples.

I'm updating the tests to just skip these cases for fasttext now. I've left the other models alone for now. E.g., only Google gets this one right:
{ 'fasttext': None, 'cld3': 'id', 'microsoft': 'fr', 'google': ['ta', 'ta-Latn'], 'text': 'vanakkam en peyar Karim' },

I think this is a larger issue with the tests as written, but I'm not going to address that in this PR. The purpose of the tests as far as I'm concerned is to ensure the model produces consistent output across runs.

@computermacgyver
Copy link
Contributor Author

I've updated the code to fallback to Google when the fasttext result how low confidence. Taking this conversation internal to check if we have support for this.

app/main/lib/langid.py Outdated Show resolved Hide resolved
app/main/lib/langid.py Outdated Show resolved Hide resolved
app/main/lib/langid.py Outdated Show resolved Hide resolved
app/main/lib/langid.py Outdated Show resolved Hide resolved
app/main/lib/langid.py Outdated Show resolved Hide resolved
FastTextLangidProvider.fasttext_model.get_language("Some text to check")
return True

class HybridLangidProvider:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the idea is we run both for a little while to see if the agreement is good enough (disagreement probably only on edge cases) and then disable CLD?

app/main/lib/langid.py Outdated Show resolved Hide resolved
app/main/lib/langid.py Outdated Show resolved Hide resolved
app/main/lib/langid.py Outdated Show resolved Hide resolved
app/main/lib/langid.py Outdated Show resolved Hide resolved
app/main/lib/langid.py Outdated Show resolved Hide resolved
Copy link

codeclimate bot commented Sep 30, 2024

Code Climate has analyzed commit 13c5c42 and detected 2 issues on this pull request.

Here's the issue category breakdown:

Category Count
Complexity 1
Style 1

The test coverage on the diff in this pull request is 87.8% (50% is the threshold).

This pull request will bring the total coverage in the repository to 79.8% (0.0% change).

View more on Code Climate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants