Cv2 3408 language code standardization #32
Conversation
Great that this langcodes lib works!
lib/model/fasttext.py
Outdated
```diff
@@ -25,7 +27,7 @@ def respond(self, docs: Union[List[schemas.Message], schemas.Message]) -> List[s
     detectable_texts = [e.body.text for e in docs]
     detected_langs = []
     for text in detectable_texts:
-        detected_langs.append(self.model.predict(text)[0][0])
+        detected_langs.append(standardize_tag(self.model.predict(text)[0][0][9:], macro = True))
```
For clarity, I'd consider breaking this into a couple of lines to explain what it is doing (see the sketch below):
- predict the language of the text
- get the language code of the prediction (what is the data structure that `[0][0][9:]` is indexing?)
- convert the language code into a standardized language tag (what does `macro` do?)
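Something like this, perhaps (just a sketch; the variable names are illustrative, fastText's `predict` returns a `(labels, probabilities)` tuple, and `[9:]` strips the 9-character `__label__` prefix):

```python
from langcodes import standardize_tag

# Inside the loop over detectable_texts:
labels, probabilities = self.model.predict(text)  # fastText returns (labels, probs)
raw_label = labels[0]                  # e.g. '__label__eng_Latn'
lang_code = raw_label[9:]              # strip the 9-character '__label__' prefix
# macro=True collapses a specific code into its macrolanguage, e.g. 'cmn' -> 'zh'
detected_langs.append(standardize_tag(lang_code, macro=True))
```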
lib/model/fasttext.py
Outdated
```diff
@@ -3,6 +3,8 @@
 import fasttext
 from huggingface_hub import hf_hub_download
+
+from langcodes import *
```
Would `from langcodes import standardize_tag` work here, instead of the wildcard import?
test/lib/model/test_fasttext.py
Outdated
```diff
@@ -18,8 +18,8 @@ def test_respond(self):
     response = self.model.respond(query)
```
I think it would be great to test how this does for some extreme cases so we know how it will respond (see the test sketch below):
- None (probably will return an error, but let's make sure it is a useful error)
- "", i.e. no detectable language: what will it do?
- a mixed language string?
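A rough sketch of what those tests might look like (hypothetical additions to the existing TestCase in test_fasttext.py; the `_msg` helper is a stand-in based on the `e.body.text` access in `respond()`, and the real tests would build a `schemas.Message` instead):

```python
from types import SimpleNamespace

def _msg(text):
    # Stand-in for a schemas.Message; respond() only accesses e.body.text
    return SimpleNamespace(body=SimpleNamespace(text=text))

def test_respond_none_text(self):
    # Should fail loudly and usefully, not with an opaque fastText error
    with self.assertRaises(Exception):
        self.model.respond([_msg(None)])

def test_respond_empty_string(self):
    # No detectable language; should still return exactly one well-formed result
    response = self.model.respond([_msg("")])
    self.assertEqual(len(response), 1)

def test_respond_mixed_language(self):
    # Mixed input; the model should pick one language rather than crash
    response = self.model.respond([_msg("Hello world, bonjour le monde")])
    self.assertEqual(len(response), 1)
```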
For mixed-language strings, the model will return one of the languages.
For no detectable language (e.g. emoji, numbers, punctuation), the model will return a random language with low certainty. We're not currently outputting the model's certainty, just the language code, but if that would be useful for filtering we can add it.
I agree that adding tests for some more extreme values is good. Tests like these can serve a bit of a documentation function as well as ensure we have tried cases that will inevitably come up in the real world.
For the return value, I think we should return the model certainty as well. Could we return a JSON object like:
```json
[
  {
    "language": "zh",
    "script": "hans",
    "probability": 0.82
  },
  ...
]
```
Overall this is looking great!
yes we definitely need the probabilities - better to let downstream services make decisions on that information, should they need to
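For what that could look like, a sketch (a hypothetical helper, not the PR's code; it assumes fastText's `(labels, probabilities)` return shape and uses `langcodes.Language.get` to split the tag into its parts):

```python
from langcodes import Language, standardize_tag

def detect_with_probability(model, text):
    # Hypothetical helper illustrating the proposed return shape
    labels, probs = model.predict(text)               # fastText: (labels, probabilities)
    tag = standardize_tag(labels[0][9:], macro=True)  # strip '__label__', normalize
    lang = Language.get(tag)
    return {
        "language": lang.language,       # e.g. 'zh'
        "script": lang.script,           # e.g. 'Hans'; None when the default script applies
        "probability": float(probs[0]),
    }
```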
Using the Python package langcodes (https://pypi.org/project/langcodes/) to handle standardization of the model's language code output.
Should now return the standard form (2-letter codes where they exist, else 3-letter codes) with script tags where there is no default script (e.g. for Bhojpuri, Chinese, etc.).
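For example (expected outputs, assuming langcodes' suppress-script handling; the fastText-style input tags are illustrative):

```python
from langcodes import standardize_tag

standardize_tag("eng_Latn", macro=True)  # 'en'       (Latn is English's default script)
standardize_tag("zho_Hans", macro=True)  # 'zh-Hans'  (Chinese has no single default script)
standardize_tag("bho_Deva", macro=True)  # 'bho-Deva' (no 2-letter code, no default script)
```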