When decoding an English sentence with the 'add_prefix_space' parameter set to 'False,' how can I add spaces? #1362

Closed
enze5088 opened this issue Oct 9, 2023 · 4 comments


enze5088 commented Oct 9, 2023

I trained a tokenizer with 'add_prefix_space' set to 'False'. How can I ensure that BBPE tokenizers correctly handle space division when decoding a sequence?

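import tokenizers
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.normalizers import NFC, StripAccents
from tokenizers.pre_tokenizers import ByteLevel, Digits, Punctuation, UnicodeScripts, Whitespace

tokenizer = Tokenizer(BPE())  # assumed setup; the original snippet did not show how the tokenizer was created
normalizer = normalizers.Sequence([NFC(), StripAccents()])
tokenizer.normalizer = normalizer
tokenizer.pre_tokenizer = pre_tokenizers.Sequence(
    [Whitespace(), Punctuation(), Digits(individual_digits=True), UnicodeScripts(),
     ByteLevel(add_prefix_space=False, use_regex=True), ])
tokenizer.decoder = decoders.ByteLevel(add_prefix_space=False, use_regex=True)
tokenizer.post_processor = tokenizers.processors.ByteLevel()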
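Editor's sketch, not part of the original post: one likely reason the spaces do not come back on decoding is that a Whitespace-style split (the documented regex is `\w+|[^\w\s]+`) discards whitespace before the byte-level step ever sees it, so there is no space byte left for the decoder to restore. The standard-library emulation below illustrates that idea under stated assumptions. The GPT-2 style byte table is well known, but `round_trip`, `WHITESPACE_LIKE`, and `KEEP_SPACES` are hypothetical names for illustration, and `KEEP_SPACES` is a simplified stand-in for a byte-level split regex, not the library's actual pattern.

```python
import re


def bytes_to_unicode():
    """GPT-2 style table mapping each byte (0-255) to a printable character."""
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in keep}
    shift = 0
    for b in range(256):
        if b not in table:
            # Non-printable bytes (including the space, byte 32) are shifted past 255.
            table[b] = chr(256 + shift)
            shift += 1
    return table


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def byte_level_encode(piece):
    """Map the UTF-8 bytes of one pre-tokenized piece to printable characters."""
    return "".join(BYTE_TO_CHAR[b] for b in piece.encode("utf-8"))


def byte_level_decode(tokens):
    """Concatenate tokens and map the characters back to UTF-8 bytes."""
    data = bytes(CHAR_TO_BYTE[c] for c in "".join(tokens))
    return data.decode("utf-8", errors="replace")


# Emulates a Whitespace-style split: whitespace is dropped, not kept as a piece.
WHITESPACE_LIKE = re.compile(r"\w+|[^\w\s]+")
# Simplified, assumed alternative: the space stays attached to the following piece.
KEEP_SPACES = re.compile(r" ?\w+| ?[^\w\s]+|\s+")


def round_trip(text, pattern):
    """Split with `pattern`, byte-level encode each piece, then decode."""
    pieces = pattern.findall(text)
    tokens = [byte_level_encode(p) for p in pieces]
    return byte_level_decode(tokens)


text = "Hello world, 你好"
print(round_trip(text, WHITESPACE_LIKE))  # Helloworld,你好
print(round_trip(text, KEEP_SPACES))      # Hello world, 你好
```

If this reading is right, the spaces can only survive decoding when the split step keeps them inside the pieces, so that they are encoded as the 'Ġ' character rather than thrown away.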
enze5088 changed the title from "When add_prefix_space=False, how to add space when decode English sentence." to "When decoding an English sentence with the 'add_prefix_space' parameter set to 'False,' how can I add spaces?" on Oct 9, 2023
@ArthurZucker
Collaborator

Hey! Could you elaborate on "How can I ensure that BBPE tokenizers correctly handle space division when decoding a sequence"? What is your concern or issue here?

@enze5088
Author

I aim to develop a multilingual tokenizer. However, when processing multilingual text, especially text without space-based segmentation such as Chinese, the tokenizer occasionally introduces erroneous spaces before certain characters. If I add the Whitespace pre-tokenizer, the tokenizer does not correctly preserve spaces when decoding generated English text.

@ArthurZucker
Collaborator

Ok, the additional space addition is fixed by #1357! You should give it a try!

@enze5088
Author

> Ok, the additional space addition is fixed by #1357! You should give it a try!

Thanks
