Cv2 3408 language code standardization #32
Conversation
Great that this langcodes lib works!
lib/model/fasttext.py
Outdated
```diff
@@ -25,7 +27,7 @@ def respond(self, docs: Union[List[schemas.Message], schemas.Message]) -> List[s
     detectable_texts = [e.body.text for e in docs]
     detected_langs = []
     for text in detectable_texts:
-        detected_langs.append(self.model.predict(text)[0][0])
+        detected_langs.append(standardize_tag(self.model.predict(text)[0][0][9:], macro = True))
```
For clarity, I'd consider breaking this into a couple of lines to explain what it is doing (see the sketch below):
- predict the language of the text
- get the language code of the prediction (what is the data structure that `[0][0][9:]` is indexing?)
- convert the language code into a standardized language tag (what does `macro` do?)
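Something like this, perhaps (just a sketch; the variable names are illustrative, fastText's `predict` returns a `(labels, probabilities)` tuple, and `[9:]` strips the 9-character `__label__` prefix):

```python
from langcodes import standardize_tag

# Inside the loop over detectable_texts:
labels, probabilities = self.model.predict(text)  # fastText returns (labels, probs)
raw_label = labels[0]                  # e.g. '__label__eng_Latn'
lang_code = raw_label[9:]              # strip the 9-character '__label__' prefix
# macro=True collapses a specific code into its macrolanguage, e.g. 'cmn' -> 'zh'
detected_langs.append(standardize_tag(lang_code, macro=True))
```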
lib/model/fasttext.py
Outdated
```diff
@@ -3,6 +3,8 @@
 import fasttext
 from huggingface_hub import hf_hub_download
+
+from langcodes import *
```
Would `from langcodes import standardize_tag` work here, instead of the wildcard import?
test/lib/model/test_fasttext.py
Outdated
```diff
@@ -18,8 +18,8 @@ def test_respond(self):
     response = self.model.respond(query)
```
I think it would be great to test how this does for some extreme cases so we know how it will respond (see the test sketch below):
- None (probably will return an error, but let's make sure it is a useful error)
- "", i.e. no detectable language: what will it do?
- a mixed language string?
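A rough sketch of what those tests might look like (hypothetical additions to the existing TestCase in test_fasttext.py; the `_msg` helper is a stand-in based on the `e.body.text` access in `respond()`, and the real tests would build a `schemas.Message` instead):

```python
from types import SimpleNamespace

def _msg(text):
    # Stand-in for a schemas.Message; respond() only accesses e.body.text
    return SimpleNamespace(body=SimpleNamespace(text=text))

def test_respond_none_text(self):
    # Should fail loudly and usefully, not with an opaque fastText error
    with self.assertRaises(Exception):
        self.model.respond([_msg(None)])

def test_respond_empty_string(self):
    # No detectable language; should still return exactly one well-formed result
    response = self.model.respond([_msg("")])
    self.assertEqual(len(response), 1)

def test_respond_mixed_language(self):
    # Mixed input; the model should pick one language rather than crash
    response = self.model.respond([_msg("Hello world, bonjour le monde")])
    self.assertEqual(len(response), 1)
```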
For mixed-language strings, the model will return one of the languages.
For no detectable language (e.g. emoji, numbers, punctuation), the model will return a random language with low certainty. We're not currently outputting the model's certainty, just the language code, but if that would be useful for filtering we can add it.
I agree that adding tests for some more extreme values is good. Tests like these can serve a bit of a documentation function as well as ensure we have tried cases that will inevitably come up in the real world.
For the return value, I think we should return the model certainty as well. Could we return a JSON object like:
```json
[
  {
    "language": "zh",
    "script": "hans",
    "probability": 0.82
  },
  ...
]
```
Overall this is looking great!
yes we definitely need the probabilities - better to let downstream services make decisions on that information, should they need to
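For what that could look like, a sketch (a hypothetical helper, not the PR's code; it assumes fastText's `(labels, probabilities)` return shape and uses `langcodes.Language.get` to split the tag into its parts):

```python
from langcodes import Language, standardize_tag

def detect_with_probability(model, text):
    # Hypothetical helper illustrating the proposed return shape
    labels, probs = model.predict(text)               # fastText: (labels, probabilities)
    tag = standardize_tag(labels[0][9:], macro=True)  # strip '__label__', normalize
    lang = Language.get(tag)
    return {
        "language": lang.language,       # e.g. 'zh'
        "script": lang.script,           # e.g. 'Hans'; None when the default script applies
        "probability": float(probs[0]),
    }
```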
Using the Python package langcodes (https://pypi.org/project/langcodes/) to handle standardization of the model's language code output.
Should now return the standard form (2-letter codes where they exist, else 3-letter codes) with script tags where there is no default script (e.g. for Bhojpuri, Chinese, etc.).
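For example (expected outputs, assuming langcodes' suppress-script handling; the fastText-style input tags are illustrative):

```python
from langcodes import standardize_tag

standardize_tag("eng_Latn", macro=True)  # 'en'       (Latn is English's default script)
standardize_tag("zho_Hans", macro=True)  # 'zh-Hans'  (Chinese has no single default script)
standardize_tag("bho_Deva", macro=True)  # 'bho-Deva' (no 2-letter code, no default script)
```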