
Add support for ICU tokenizer #61

Closed
wants to merge 7 commits

Conversation

Contributor

@eu9ene eu9ene commented Nov 22, 2024

Add support for ICU-based detokenization for the Tags inline noise augmentation modifier.

The tokenizer is:

  • Faster than Moses
  • Provides universal language support
  • Works better for CJK
  • Preserves spaces as a special symbol for lossless detokenization
  • Matches the one we use in Firefox at inference time (via the JS binding)

Simple reconstruction of the original text makes maintaining detokenization trivial compared to the rules in the Sacremoses Python tokenizer.
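Here's a minimal sketch of that lossless round trip, assuming PyICU is installed; the icu_tokenize helper is illustrative, not the PR's actual API:

    from icu import BreakIterator, Locale

    def icu_tokenize(text: str, lang: str) -> list[str]:
        # Word-level break iterator for the given locale.
        bi = BreakIterator.createWordInstance(Locale(lang))
        bi.setText(text)
        tokens = []
        start = bi.first()
        for end in bi:
            # Whitespace segments come back as tokens too, so no
            # information is lost.
            tokens.append(text[start:end])
            start = end
        return tokens

    tokens = icu_tokenize("Hello, world!", "en")
    assert "".join(tokens) == "Hello, world!"  # lossless reconstruction

Detokenization is then plain concatenation, with no language-specific rules to maintain.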

It's already being used and tested in Firefox Translation (related PR, model training test run). Based on the test run, it appears to work properly.

See the related discussion for an explanation of why it's useful.

        super().__init__(probability)

        self.template = template
        self.custom_detok_src = custom_detok_src

Contributor

This is kind of brittle as an unstructured string. I think it would be better to parse and validate the string first in case bad values are passed in. I would suggest splitting on ":" and optionally handling backwards-compatible values.

Contributor Author

@eu9ene eu9ene commented Dec 18, 2024

It's validated on the next line inside make_detokenizer, which does the splitting and checks that the key is in a dictionary:

        self.src_retokenizer = Retokenizer(
            detokenizer=make_detokenizer(custom_detok_src) ...

Contributor

There is some validation, yes, but it's not ergonomic to use it afterwards since you are dealing with raw string manipulation rather than working with a structured class.
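For illustration, a sketch of the structured approach being suggested, assuming a spec string shaped like "icu:en"; the DetokenizerSpec class and the set of known names are hypothetical, not the PR's code:

    from dataclasses import dataclass

    KNOWN_DETOKENIZERS = {"moses", "icu"}  # assumed names, for illustration

    @dataclass(frozen=True)
    class DetokenizerSpec:
        name: str
        lang: str

        @classmethod
        def parse(cls, spec: str) -> "DetokenizerSpec":
            # Validate up front and fail with a clear error, instead of
            # passing the raw string around.
            # A bare language code (no prefix) could map to the legacy
            # Moses path here for backwards compatibility.
            name, _, lang = spec.partition(":")
            if name not in KNOWN_DETOKENIZERS or not lang:
                raise ValueError(f"invalid detokenizer spec: {spec!r}")
            return cls(name, lang)

The rest of the code then works with spec.name and spec.lang instead of re-splitting the raw string.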

    def tokenize(self, text: str) -> Tuple[TokenList, TokenSpanList]:
        from icu import BreakIterator, Locale

        bi = BreakIterator.createWordInstance(Locale(self.lang))

Contributor

This is an issue as it's slow to create the BreakIterator. It should be created once and cached, probably in the constructor.

Contributor Author

Based on the following code, it's a stateful object and should be created for every string, so it's correct to initialize it here. A user makes a tokenizer object once and then tokenizes many strings with it.

bi.setText(text)
tokens = []
start = bi.first()
for end in bi:
    token = text[start:end]
    tokens.append(token)
    start = end

Also, there are no issues with performance: it's very fast, unlike the Moses tokenizer.

Contributor

I still think it should be cached given how these things work, but if it's fast enough currently it's probably not a big deal. As I understand it, the design principle for ICU objects is to create the iterator once and call setText on the cached instance each time you use it.
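For reference, a sketch of the cached variant being suggested, assuming PyICU (the IcuTokenizer class name is illustrative):

    from icu import BreakIterator, Locale

    class IcuTokenizer:
        def __init__(self, lang: str):
            # Created once per tokenizer; reused for every call.
            self.bi = BreakIterator.createWordInstance(Locale(lang))

        def tokenize(self, text: str) -> list[str]:
            # setText resets the iterator's state for the new string,
            # so one cached instance can serve many strings.
            self.bi.setText(text)
            tokens = []
            start = self.bi.first()
            for end in self.bi:
                tokens.append(text[start:end])
                start = end
            return tokens

One caveat: the cached iterator makes the tokenizer stateful, so a single instance shouldn't be shared across threads.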
