
Add support for ICU tokenizer #61

Closed
wants to merge 7 commits

Conversation

Contributor

@eu9ene eu9ene commented Nov 22, 2024

Add support for ICU-based detokenization for the Tags inline noise augmentation modifier.

The tokenizer is:

  • Faster than Moses
  • Provides universal language support
  • Works better for CJK
  • Preserves spaces as a special symbol for lossless detokenization
  • Matches the one we use in Firefox at inference time (via the JS binding)

Simple reconstruction of the original text makes maintaining detokenization trivial compared to the rules in the Sacremoses Python tokenizer.
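Here's a minimal sketch of that lossless round trip, assuming PyICU is installed; the icu_tokenize helper is illustrative, not the PR's actual API:

    from icu import BreakIterator, Locale

    def icu_tokenize(text: str, lang: str) -> list[str]:
        # Word-level break iterator for the given locale.
        bi = BreakIterator.createWordInstance(Locale(lang))
        bi.setText(text)
        tokens = []
        start = bi.first()
        for end in bi:
            # Whitespace segments come back as tokens too, so no
            # information is lost.
            tokens.append(text[start:end])
            start = end
        return tokens

    tokens = icu_tokenize("Hello, world!", "en")
    assert "".join(tokens) == "Hello, world!"  # lossless reconstruction

Detokenization is then plain concatenation, with no language-specific rules to maintain.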

It's already being used and tested in Firefox Translation (related PR, model training test run). Based on the test run, it appears to work properly.

See the related discussion for an explanation of why it's useful.

        super().__init__(probability)

        self.template = template
        self.custom_detok_src = custom_detok_src

Contributor

This is kind of brittle as an unstructured string. I think it would be better to parse and validate the string first in case bad values are passed in. I would suggest splitting on ":" and optionally handling backwards-compatible values.

Contributor Author

@eu9ene eu9ene commented Dec 18, 2024

It's validated on the next line inside make_detokenizer, which does the splitting and checks that the key is in a dictionary:

        self.src_retokenizer = Retokenizer(
            detokenizer=make_detokenizer(custom_detok_src) ...

Contributor

There is some validation, yes, but it's not ergonomic to use it afterwards since you are dealing with raw string manipulation rather than working with a structured class.
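For illustration, a sketch of the structured approach being suggested, assuming a spec string shaped like "icu:en"; the DetokenizerSpec class and the set of known names are hypothetical, not the PR's code:

    from dataclasses import dataclass

    KNOWN_DETOKENIZERS = {"moses", "icu"}  # assumed names, for illustration

    @dataclass(frozen=True)
    class DetokenizerSpec:
        name: str
        lang: str

        @classmethod
        def parse(cls, spec: str) -> "DetokenizerSpec":
            # Validate up front and fail with a clear error, instead of
            # passing the raw string around.
            # A bare language code (no prefix) could map to the legacy
            # Moses path here for backwards compatibility.
            name, _, lang = spec.partition(":")
            if name not in KNOWN_DETOKENIZERS or not lang:
                raise ValueError(f"invalid detokenizer spec: {spec!r}")
            return cls(name, lang)

The rest of the code then works with spec.name and spec.lang instead of re-splitting the raw string.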

    def tokenize(self, text: str) -> Tuple[TokenList, TokenSpanList]:
        from icu import BreakIterator, Locale

        bi = BreakIterator.createWordInstance(Locale(self.lang))

Contributor

This is an issue as it's slow to create the BreakIterator. It should be created once and cached, probably in the constructor.

Contributor Author

Based on the following code, it's a stateful object and should be created for every string, so it's correct to initialize it here. A user makes a tokenizer object once and then tokenizes many strings with it.

bi.setText(text)
tokens = []
start = bi.first()
for end in bi:
    token = text[start:end]
    tokens.append(token)
    start = end

Also, there are no issues with performance: it's very fast, unlike the Moses tokenizer.

Contributor

I still think it should be cached given how these things work, but if it's fast enough currently it's probably not a big deal. As I understand it, the design principle for ICU objects is to create the iterator once and call setText on the cached instance each time you use it.
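For reference, a sketch of the cached variant being suggested, assuming PyICU (the IcuTokenizer class name is illustrative):

    from icu import BreakIterator, Locale

    class IcuTokenizer:
        def __init__(self, lang: str):
            # Created once per tokenizer; reused for every call.
            self.bi = BreakIterator.createWordInstance(Locale(lang))

        def tokenize(self, text: str) -> list[str]:
            # setText resets the iterator's state for the new string,
            # so one cached instance can serve many strings.
            self.bi.setText(text)
            tokens = []
            start = self.bi.first()
            for end in self.bi:
                tokens.append(text[start:end])
                start = end
            return tokens

One caveat: the cached iterator makes the tokenizer stateful, so a single instance shouldn't be shared across threads.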
