Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cv2 3408 language code standardization #32

Merged
merged 14 commits into from
Sep 22, 2023
Merged

Conversation

amydunphy
Copy link
Contributor

Using python package langcodes (https://pypi.org/project/langcodes/) to handle standardization of model language code output.

Should now return standard form (2-letter codes where they exist, else 3-letter code) with script tags where there is not a default script (eg. for bhojpuri, chinese, etc.)

Copy link
Contributor

@skyemeedan skyemeedan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great that this langcodes lib works!

@@ -25,7 +27,7 @@ def respond(self, docs: Union[List[schemas.Message], schemas.Message]) -> List[s
detectable_texts = [e.body.text for e in docs]
detected_langs = []
for text in detectable_texts:
detected_langs.append(self.model.predict(text)[0][0])
detected_langs.append(standardize_tag(self.model.predict(text)[0][0][9:], macro = True))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for clarity, I'd consider breaking this into a couple of lines to explain what it is doing

  • predict the language of the text
  • get the language code of the prediction ( what is the data structure that [0][0][9:] is indexing?)
  • convert the language code into a standardized language tag (what does macro do?)

@@ -3,6 +3,8 @@
import fasttext
from huggingface_hub import hf_hub_download

from langcodes import *
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would import standize_tag work here?

@@ -18,8 +18,8 @@ def test_respond(self):
response = self.model.respond(query)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think would be great to test how does for some extreme cases so we know how it will respond:

  • None (probably will return an error, but lets make sure it is a useful error)
  • "" i.e. no detectable language? what will it do
  • a mixed language string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For mixed language strings, model will return one of the languages.

For no detectable languages (eg. emoji, numbers, punctuation), model will return a random language with low certainty. We're not currently outputting the model's certainty, just the language code, but if that'd be useful for filtering we can add it.

Copy link
Collaborator

@computermacgyver computermacgyver Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that adding tests for some more extreme values is good. Tests in this way can help serve a bit of a documentation function as well as ensure we have tried cases like these that will inevitably come up in the real world.

For the return value, I think we should return the model certainty as well. Could we return a JSON object like

[ {
"language": "zh",
"script": "hans",
"probability": 0.82
},
...
]

Overall this is looking great!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we definitely need the probabilities - better to let downstream services make decisions on that information, should they need to

@DGaffney DGaffney merged commit a3c8071 into master Sep 22, 2023
2 checks passed
@DGaffney DGaffney deleted the cv2-3408-language-codes branch September 22, 2023 19:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants