When decoding an English sentence with the 'add_prefix_space' parameter set to 'False,' how can I add spaces? #1362

enze5088 · 2023-10-09T16:19:43Z

I trained a tokenizer with 'add_prefix_space' set to 'False'. How can I ensure that a BBPE (byte-level BPE) tokenizer handles spaces correctly when decoding a sequence?

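from tokenizers import decoders, normalizers, pre_tokenizers, processors
from tokenizers.normalizers import NFC, StripAccents
from tokenizers.pre_tokenizers import ByteLevel, Digits, Punctuation, UnicodeScripts, Whitespace

# `tokenizer` is an existing tokenizers.Tokenizer instance created earlier
tokenizer.normalizer = normalizers.Sequence([NFC(), StripAccents()])
# Pre-tokenizers run in order; ByteLevel comes last in the sequence
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [Whitespace(), Punctuation(), Digits(individual_digits=True), UnicodeScripts(),
     ByteLevel(add_prefix_space=False, use_regex=True)])
tokenizer.decoder = decoders.ByteLevel(add_prefix_space=False, use_regex=True)
tokenizer.post_processor = processors.ByteLevel()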


ArthurZucker · 2023-10-10T18:40:20Z

Hey! Could you elaborate on "How can I ensure that BBPE tokenizers correctly handle space division when decoding a sequence"? What is your concern / issue here?

enze5088 · 2023-10-13T03:34:09Z

I aim to develop a multilingual tokenizer. However, when processing multilingual text, especially text that is not segmented by spaces, such as Chinese, the tokenizer occasionally introduces erroneous spaces before certain characters. If I add whitespace splitting to the pre-tokenizer, the tokenizer does not preserve spaces when decoding generated English text.
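To make the space-loss symptom described above concrete, here is a minimal pure-Python sketch. It does not use the tokenizers library: `split_dropping_spaces` and `split_keeping_spaces` are hypothetical, simplified stand-ins for a whitespace-discarding split and a byte-level-style split, and the byte table follows the GPT-2 convention. The assumption it illustrates is that if a step discards spaces before the byte-level step sees them, the decoder has no space bytes left to restore.

```python
import re


def bytes_to_unicode():
    """GPT-2 style table mapping each byte value to a printable character."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes (including the space, 0x20) get shifted code points
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {b: chr(c) for b, c in zip(bs, cs)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def byte_level(piece):
    """Map a text piece to its byte-level representation."""
    return "".join(BYTE_TO_CHAR[b] for b in piece.encode("utf-8"))


def decode(pieces):
    """Invert byte_level over a list of pieces."""
    return bytes(CHAR_TO_BYTE[c] for c in "".join(pieces)).decode("utf-8")


def split_dropping_spaces(text):
    # Hypothetical stand-in: keeps word and punctuation runs, discards whitespace
    return re.findall(r"\w+|[^\w\s]+", text)


def split_keeping_spaces(text):
    # Hypothetical stand-in: a leading space stays attached to the next piece
    return re.findall(r" ?\w+| ?[^\w\s]+|\s+", text)


text = "Hello world 你好"
lossy = decode([byte_level(p) for p in split_dropping_spaces(text)])
exact = decode([byte_level(p) for p in split_keeping_spaces(text)])
print(repr(lossy))  # spaces are gone
print(repr(exact))  # identical to the input
```

In this sketch the space survives only when it reaches the byte-level step as part of a piece (it then shows up as the character "Ġ"). Whether the same ordering effect explains the behaviour of a given tokenizers pipeline should be checked against the library itself.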

ArthurZucker · 2023-10-13T15:23:31Z

Ok, the additional space addition is fixed by #1357! You should give it a try!

enze5088 · 2023-10-30T14:25:24Z

> Ok, the additional space addition is fixed by #1357! You should give it a try!

Thanks

enze5088 changed the title ~~When add_prefix_space=False, how to add space when decode English sentence.~~ When decoding an English sentence with the 'add_prefix_space' parameter set to 'False,' how can I add spaces? Oct 9, 2023

enze5088 closed this as completed Oct 30, 2023
