Tokenizer adds an additional space after the added token #28218
Comments
Hey! Thanks for raising the issue. This is pretty much a duplicate of #26318 and will be fixed by #27883!
The PR was merged, let me check if this is fixed!
Okay, not fixed yet. I'll include it in #27717.
Hi @ArthurZucker, excited to hear about the progress! :) Also, I realised I have a follow-up question about what happens when there are multiple added tokens that could match the same text. Just a reminder that I am thinking of the Chinese language, where words are not separated by spaces, hence my seemingly weird example of adding a subword token.
There is no real way to do that yet; I think we check the longest first.
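For what it's worth, here is a small sketch of how one could check which of two overlapping added tokens gets matched; the checkpoint and example strings are illustrative and not taken from this thread:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any tokenizer would do for this check.
tok = AutoTokenizer.from_pretrained("t5-small")

# Two added tokens where one is a prefix of the other.
tok.add_tokens(["foo", "foobar"])

# If the longest added token is preferred, "foobar" should appear as a single
# token here instead of being split into "foo" plus the rest.
print(tok.tokenize("xfoobarx"))
```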
That is not fixed yet, but it can be fixed somewhat manually if we follow what was done for the SpmConverters:

```python
def pre_tokenizer(self, replacement, add_prefix_space):
    prepend_scheme = "always"
    if hasattr(self.original_tokenizer, "legacy") and not self.original_tokenizer.legacy:
        prepend_scheme = "first"
    return pre_tokenizers.Metaspace(
        replacement=replacement, add_prefix_space=add_prefix_space, prepend_scheme=prepend_scheme
    )
```

Setting `legacy` to `False` switches the prepend scheme to `"first"`, so the metaspace marker is only prepended to the first piece.
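As a reference, a minimal sketch of how one might apply this manually to a fast SentencePiece-based tokenizer, assuming the `Metaspace` signature used in the snippet above; the checkpoint name is a placeholder and not from the issue:

```python
from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

# Placeholder checkpoint; any SentencePiece-based fast tokenizer is meant here.
tok = AutoTokenizer.from_pretrained("t5-small")

# Override the backend pre-tokenizer so the metaspace marker is only prepended
# to the first piece, mirroring the converter snippet above.
tok.backend_tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
    replacement="▁", add_prefix_space=True, prepend_scheme="first"
)
```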
System Info

`transformers` version: 4.35.2

Who can help?
@ArthurZucker
Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Expected behavior
The output from my code:
The original post where I raised this potential bug and was asked to file an issue is here: https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564/5
For context, I am originally trying to add Chinese tokens to the tokenizer. However, for illustration purposes, I have demonstrated the “bug” in English. Chinese words are not separated by spaces and hence in the example you will see me trying to add a token that is a subword.
Evidently, `tokenizer.add_tokens()` works well when there is always a space after the added token, but it doesn't work as intended when there isn't one: the tokenizer then introduces an additional space on its own.
I read the docs and figured it is probably because the added tokens are isolated before the tokenization algorithm is applied, so I am not 100% sure whether this behaviour of the tokenizer is intended.
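A minimal sketch of the kind of snippet that shows the behaviour described above, assuming a SentencePiece-based checkpoint; `t5-small` and the example strings are stand-ins, not the exact code from the original report:

```python
from transformers import AutoTokenizer

# Stand-in checkpoint; the original report used a different tokenizer.
tok = AutoTokenizer.from_pretrained("t5-small")

# Add a token that can occur as a subword, i.e. with no space following it.
tok.add_tokens(["definit"])

ids = tok("definitely", add_special_tokens=False)["input_ids"]
print(tok.convert_ids_to_tokens(ids))
# The reported issue is that a space is re-introduced after the added token,
# so the decoded text comes back as "definit ely" rather than "definitely".
print(tok.decode(ids))
```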