Add support for ICU tokenizer #61
Conversation
super().__init__(probability)
self.template = template
self.custom_detok_src = custom_detok_src
This is kind of brittle as an unstructured string. I think it would be better to parse the string and validate it first in case bad values are passed in. I would suggest splitting on ':' and optionally handling backwards-compatible values.
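A structured approach along these lines might look like the sketch below. The "name:lang" spec format, the KNOWN_DETOKENIZERS set, and the DetokenizerSpec class are illustrative assumptions, not the actual API in the PR:

```python
from dataclasses import dataclass

# Assumed set of supported detokenizers; the real project may differ
KNOWN_DETOKENIZERS = {"moses", "icu"}

@dataclass(frozen=True)
class DetokenizerSpec:
    """Parsed, validated form of a spec string such as 'icu:en'."""
    name: str
    lang: str

def parse_detokenizer_spec(spec: str) -> DetokenizerSpec:
    parts = spec.split(":")
    if len(parts) == 1:
        # Backwards-compatible value: a bare language code falls back
        # to a default detokenizer (assumption for illustration)
        name, lang = "moses", parts[0]
    elif len(parts) == 2:
        name, lang = parts
    else:
        raise ValueError(f"Malformed detokenizer spec: {spec!r}")
    if name not in KNOWN_DETOKENIZERS:
        raise ValueError(f"Unknown detokenizer: {name!r}")
    return DetokenizerSpec(name, lang)
```

Validation then happens once at parse time, and the rest of the code works with typed fields instead of re-splitting the raw string.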
It's validated in the next line inside make_detokenizer, which does the splitting and checks that the key is in a dictionary:

self.src_retokenizer = Retokenizer(
    detokenizer=make_detokenizer(custom_detok_src) ...
There is some validation, yes, but it's not ergonomic to use afterwards, since you are dealing with raw string manipulation rather than a structured class.
def tokenize(self, text: str) -> Tuple[TokenList, TokenSpanList]:
    from icu import BreakIterator, Locale

    bi = BreakIterator.createWordInstance(Locale(self.lang))
This is an issue as it's slow to create the BreakIterator. It should be created once and cached, probably in the constructor.
Based on the following code it's a stateful object and it should be created for every string, so it's correct to initialize it here. A user makes a tokenizer object once and then tokenizes many strings with it.
bi.setText(text)
tokens = []
start = bi.first()
for end in bi:
    token = text[start:end]
Also, there are no issues with performance: it's very fast, unlike the Moses tokenizer.
I still think it should be cached given how these things work, but if it's fast enough currently it's probably not a big deal. I believe the design principle for ICU objects is to cache them and call setText each time you use one.
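The caching pattern being suggested might look like the following sketch. It assumes PyICU is installed; the ICUTokenizer class name and the whitespace-skipping rule are illustrative, not taken from the PR:

```python
from typing import List, Tuple

class ICUTokenizer:
    """Sketch: cache the BreakIterator in the constructor, rebind per string.

    Note: BreakIterator is stateful (setText mutates it), so a cached
    instance must not be shared across threads.
    """

    def __init__(self, lang: str):
        # Deferred import, as in the PR, so PyICU is only needed when used
        from icu import BreakIterator, Locale
        self.lang = lang
        # Created once here instead of on every tokenize() call
        self.bi = BreakIterator.createWordInstance(Locale(lang))

    def tokenize(self, text: str) -> Tuple[List[str], List[Tuple[int, int]]]:
        self.bi.setText(text)  # rebind the cached iterator to the new string
        tokens, spans = [], []
        start = self.bi.first()
        # Iterating a PyICU BreakIterator yields successive boundary offsets
        for end in self.bi:
            token = text[start:end]
            if token.strip():  # skip whitespace-only segments (assumption)
                tokens.append(token)
                spans.append((start, end))
            start = end
        return tokens, spans
```

Whether the construction cost matters in practice is an empirical question, as the thread notes; the setText-per-string pattern keeps either choice correct for a tokenizer object that is created once and reused.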
76ce441 to 554b720
Add support for ICU-based detokenization for the Tag inline noise augmentation modifier.
The tokenizer is simple: reconstruction of the original text makes maintenance of detokenization trivial compared to the rules in the Sacremoses Python tokenizer.
It's being used and tested in Firefox Translation (related PR, model training test run). Based on the test run, it appears to work properly.
See the related discussion for an explanation of why it's useful.