CV2-2559: use local language detection model #448

computermacgyver · 2024-09-05T11:33:04Z

Description

Goal of this PR is to stop using Google's API for language identification and use a local model instead. We previously implemented CLD3, but had not turned it on. After some ad-hoc testing, I felt a FastText model was better. That model is included in this PR.

Longer term, language detection may belong in Presto. It's probably also not great to include the model in this repo, but in this case it's extremely small and including it here simplifies DevOps. Nonetheless, I welcome feedback on this point from the team.

Reference: CV2-2559

How has this been tested?

Tested locally via flask shell. Existing tests cover the code and pass.

Have you considered secure coding practices when writing this code?

N/A

app/main/lib/langid.py

DGaffney

some minor tweaks you can take or leave otherwise

app/main/lib/elasticsearch.py

app/main/lib/langid.py

app/main/lib/text_similarity.py

skyemeedan

I think the langid tests might need to be updated to explicitly check fasttext? https://github.com/meedan/alegre/blob/develop/app/test/test_langid.py (some of it seems to test individual language models independently)

computermacgyver · 2024-09-25T17:58:04Z

I think the langid tests might need to be updated to explicitly check fasttext? https://github.com/meedan/alegre/blob/develop/app/test/test_langid.py (some of it seems to test individual language models independently)

Good point. I've added them now 🙏

skyemeedan · 2024-09-25T19:08:06Z

Looks like one of the fasttext tests failed

----------------------------------------------------------------------
Traceback (most recent call last):
  File "/app/app/test/test_langid.py", line 83, in test_langid_fasttext
    self.assertEqual(test['cld3'], result['result']['language'], test['text'])
AssertionError: 'hi-Latn' != 'nl'
- hi-Latn
+ nl
 : namaste mera naam Karim hai

and there was another error about missing google credentials? (maybe no longer needed)

computermacgyver · 2024-09-25T20:29:09Z

Looks like one of the fasttext tests failed

----------------------------------------------------------------------
Traceback (most recent call last):
  File "/app/app/test/test_langid.py", line 83, in test_langid_fasttext
    self.assertEqual(test['cld3'], result['result']['language'], test['text'])
AssertionError: 'hi-Latn' != 'nl'
- hi-Latn
+ nl
 : namaste mera naam Karim hai

and there was another error about missing google credentials? (maybe no longer needed)

No, this was just me running the tests to see if fasttext would agree with the expected CLD3 output in all cases. The answer is 'no' 😂

I think our language tests are not great because we have a list of strings and their expected languages, but the expected language is different for different models 🙃

skyemeedan · 2024-09-25T21:09:40Z

I think our language tests are not great because we have a list of strings and their expected languages, but the expected language is different for different models 🙃

I'm guessing that is why the test was structured not to assume that results are identical across models, as long as they are consistent? I.e., hi and hi-Latn are both OK, but nl seems wrong.

computermacgyver · 2024-09-25T21:42:47Z

I'm guessing that is why the test was structured not to assume that results are identical across models, as long as they are consistent? I.e., hi and hi-Latn are both OK, but nl seems wrong.

Sure, but our tests used to accept en from the Microsoft language ID model. There are limitations with every model. CLD3 thinks 'how to slice a banana' is haw, but fasttext identifies it correctly as en. There is no perfect model that will get every string right. Google's API does best, but we've been told to stop using it.

In this case, I believe fasttext hasn't been trained to detect romanized text in languages that are not usually written in Latin script. Notably, it returns lower confidence scores for these examples.
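For context, the confidence score mentioned here comes back alongside the label from the model's prediction call. The following is a minimal sketch, assuming the official `fasttext` Python bindings, where `predict(text, k=1)` returns a tuple of labels prefixed with `__label__` plus a matching sequence of probabilities. The `detect_language` helper and the `FakeModel` stand-in are hypothetical names for illustration and are not the code in this PR.

```python
def detect_language(model, text):
    """Return (language_code, confidence) for the top prediction.

    `model` is anything with a fastText-style predict(text, k) method,
    for example the object returned by fasttext.load_model(path).
    """
    # fastText predicts one line at a time and rejects embedded newlines,
    # so collapse all whitespace into single spaces first.
    cleaned = " ".join(text.split())
    labels, probs = model.predict(cleaned, k=1)
    if len(labels) == 0:
        return None, 0.0
    language = labels[0].replace("__label__", "", 1)
    return language, float(probs[0])


class FakeModel:
    """Hypothetical stand-in so the sketch runs without a model file."""

    def predict(self, text, k=1):
        # Fixed, made-up output in the same shape fastText uses.
        return ("__label__en",), [0.97]


language, confidence = detect_language(FakeModel(), "how to slice\na banana")
print(language, confidence)
```

With a real model, the same helper would surface the lower scores described above, which is what a confidence threshold could key on.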

I'm updating the tests to just skip these cases for fasttext now. I've left the other models alone for now. E.g., only Google gets this one right:
{ 'fasttext': None, 'cld3': 'id', 'microsoft': 'fr', 'google': ['ta', 'ta-Latn'], 'text': 'vanakkam en peyar Karim' },

I think this is a larger issue with the tests as written, but I'm not going to address that in this PR. The purpose of the tests as far as I'm concerned is to ensure the model produces consistent output across runs.
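The table entry quoted above can be read as one expectation per model, where `None` means the case is skipped for that model and a list means any of the listed codes is acceptable. Here is a sketch of a comparison helper under those assumptions; `matches_expected` is a hypothetical name, and the actual test file may be structured differently.

```python
def matches_expected(expected, detected):
    """Compare a detected language against one model's expectation.

    Returns None when the case is skipped (expected is None),
    otherwise True or False.
    """
    if expected is None:
        return None
    if isinstance(expected, (list, tuple)):
        # Several acceptable answers, e.g. ['ta', 'ta-Latn'].
        return detected in expected
    return detected == expected


case = {'fasttext': None, 'cld3': 'id', 'microsoft': 'fr',
        'google': ['ta', 'ta-Latn'], 'text': 'vanakkam en peyar Karim'}

print(matches_expected(case['fasttext'], 'nl'))     # skipped for fasttext
print(matches_expected(case['google'], 'ta-Latn'))  # one of the accepted codes
print(matches_expected(case['cld3'], 'id'))         # exact match
```

A test loop could then assert only on non-`None` results, which keeps each model's expectations independent of the others.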

computermacgyver · 2024-09-25T22:09:59Z

I've updated the code to fall back to Google when the fasttext result has low confidence. Taking this conversation internal to check if we have support for this.
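The fallback idea can be sketched roughly as follows. This is not the implementation in this PR: the `hybrid_detect` function, the result shape (a dict with `language` and `confidence` keys), and the `min_confidence` default of 0.7 are all assumptions made for illustration, and the stub detectors return made-up values rather than real model output.

```python
def hybrid_detect(text, local_detect, remote_detect, min_confidence=0.7):
    """Use the local result unless its confidence is below min_confidence.

    Both detectors are callables returning a dict with 'language' and
    'confidence' keys (an assumed shape, chosen for this sketch).
    """
    local = local_detect(text)
    if local["language"] is not None and local["confidence"] >= min_confidence:
        return dict(local, provider="local")
    # Low confidence or no answer: defer to the remote service.
    return dict(remote_detect(text), provider="remote")


# Stub detectors with made-up scores, only to exercise both branches.
def stub_local(text):
    confident = text == "how to slice a banana"
    return {"language": "en" if confident else "nl",
            "confidence": 0.95 if confident else 0.30}


def stub_remote(text):
    return {"language": "hi-Latn", "confidence": 1.0}


print(hybrid_detect("how to slice a banana", stub_local, stub_remote))
print(hybrid_detect("namaste mera naam Karim hai", stub_local, stub_remote))
```

Passing the detectors in as callables keeps the remote call out of the common path and makes the threshold easy to tune or test in isolation.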

app/main/lib/langid.py

skyemeedan · 2024-09-26T19:10:38Z

app/main/lib/langid.py

+    FastTextLangidProvider.fasttext_model.get_language("Some text to check")
+    return True
+
+class HybridLangidProvider:


Assuming the idea is we run both for a little while to see if the agreement is good enough (disagreement probably only on edge cases) and then disable CLD?
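If two providers are run side by side for a while, agreement could be measured along these lines. This is only a sketch: `primary_subtag` and `agreement_rate` are hypothetical helpers, the sample pairs are invented, and treating `hi` and `hi-Latn` as agreeing is a choice made here based on the earlier discussion in this thread.

```python
def primary_subtag(code):
    """Reduce a tag such as 'hi-Latn' to 'hi'; None stays None."""
    return code.split("-")[0].lower() if code else None


def agreement_rate(pairs):
    """Fraction of (a, b) label pairs that share a primary language subtag."""
    pairs = list(pairs)
    if not pairs:
        return None
    same = sum(1 for a, b in pairs if primary_subtag(a) == primary_subtag(b))
    return same / len(pairs)


# Made-up label pairs, purely illustrative.
sample = [("en", "en"), ("hi", "hi-Latn"), ("nl", "hi-Latn"), ("fr", "fr")]
print(agreement_rate(sample))
```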

app/main/lib/langid.py

codeclimate · 2024-09-30T08:32:52Z

Code Climate has analyzed commit 13c5c42 and detected 2 issues on this pull request.

Here's the issue category breakdown:

Category	Count
Complexity	1
Style	1

The test coverage on the diff in this pull request is 87.8% (50% is the threshold).

This pull request will bring the total coverage in the repository to 79.8% (0.0% change).

codeclimate bot reviewed Sep 25, 2024

app/main/lib/langid.py (outdated, resolved)

computermacgyver marked this pull request as ready for review September 25, 2024 13:58

computermacgyver requested review from DGaffney and skyemeedan as code owners September 25, 2024 13:58

computermacgyver changed the title from "CV2-2559: use CLD" to "CV2-2559: use local language detection model" Sep 25, 2024

DGaffney approved these changes Sep 25, 2024

app/main/lib/elasticsearch.py (outdated, resolved)

app/main/lib/langid.py (outdated, resolved)

app/main/lib/langid.py (outdated, resolved)

app/main/lib/text_similarity.py (outdated, resolved)

skyemeedan reviewed Sep 25, 2024

codeclimate bot reviewed Sep 25, 2024

skyemeedan approved these changes Sep 26, 2024

codeclimate bot reviewed Sep 27, 2024

app/main/lib/langid.py (outdated, resolved)

app/main/lib/langid.py (outdated, resolved)

app/main/lib/langid.py (outdated, resolved)

app/main/lib/langid.py (outdated, resolved)

codeclimate bot reviewed Sep 30, 2024

app/main/lib/langid.py (outdated, resolved)

computermacgyver closed this Sep 30, 2024

computermacgyver force-pushed the CV2-2559-use-cld branch from 13c5c42 to e20aba2 Compare September 30, 2024 10:04

computermacgyver deleted the CV2-2559-use-cld branch September 30, 2024 10:05
