How do I replace spare tokens? #31475
Comments
cc @itazap that would indeed be a good addition! More and more people pre-allocate some tokens, and we don't have a clean way to replace them.
PS: you can already replace directly in the `tokenizer.json` file.
Hey @ArthurZucker, I tried replacing a token in the `tokenizer.json`, but it didn't work as expected. Do you know what the problem might be?
You can ignore this, sorry. I found the issue: if you change the vocab in any way, you need to make sure you also update the merges accordingly.
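To make "update the merges accordingly" concrete, here is a hedged sketch of the consistency rule a BPE tokenizer needs, using a simplified in-memory stand-in for the relevant parts of a `tokenizer.json` (`check_merges` is an illustrative helper, not a library function):

```python
# Simplified in-memory stand-in for the BPE parts of a tokenizer.json.
tok = {
    "model": {
        "vocab": {"l": 0, "o": 1, "w": 2, "lo": 3, "low": 4},
        "merges": ["l o", "lo w"],
    }
}

def check_merges(tok):
    """Each merge's two parts, and the token they produce, must be in the
    vocab; editing the vocab without re-checking this breaks the tokenizer."""
    vocab = tok["model"]["vocab"]
    problems = []
    for merge in tok["model"]["merges"]:
        left, right = merge.split(" ")
        for piece in (left, right, left + right):
            if piece not in vocab:
                problems.append((merge, piece))
    return problems

print(check_merges(tok))  # [] -- consistent

# Renaming a token in the vocab alone leaves a dangling merge:
tok["model"]["vocab"]["LOW"] = tok["model"]["vocab"].pop("low")
print(check_merges(tok))  # [('lo w', 'low')]
```

This is why editing the vocab in isolation silently corrupts tokenization: the merge `"lo w"` still fires, but the token it produces no longer exists.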
huggingface/tokenizers#1570 should help.
Hi @ArthurZucker, do you mind elaborating on this? I'm experiencing the same issue as OP after modifying the vocab. I would love to see this feature go through!
Hi @itshuey, can you please share the model and token you are attempting this with (in a short snippet would be great!) so I can take a look? 😊 |
Sure @itazap, thank you. I am using the Mistral-7B-v0.3 tokenizer.
Reassignment is also futile.
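For context on why reassignment appears to do nothing: on the fast tokenizers, `vocab` is rebuilt from the Rust backend on each access, so writes into the returned dict never reach the tokenizer. A toy class illustrating that pattern (illustrative code, not the transformers source):

```python
class ToyFastTokenizer:
    """Toy model of a fast tokenizer whose vocab lives in a backend object."""
    def __init__(self):
        self._backend_vocab = {"[control_8]": 10}

    @property
    def vocab(self):
        # A fresh dict is built on every access, mirroring how the fast
        # tokenizers expose the backend's vocab to Python.
        return dict(self._backend_vocab)

tok = ToyFastTokenizer()
tok.vocab["[control_NEW]"] = 10   # mutates a throwaway copy
del tok.vocab["[control_8]"]      # likewise a no-op on the backend
print(tok.vocab)                  # {'[control_8]': 10}
```

Under this assumption, the only write path that sticks is editing the serialized files (or going through the backend's own API), which is what the rest of the thread converges on.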
Hi @itshuey, indeed this isn't fully supported with the slow tokenizer. Can you try `tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)` instead?
I'm getting an error when I try to instantiate the tokenizer that way. Regardless, I get the same result when trying to modify the vocab. Taking the tip into account, I tried using `PreTrainedTokenizerFast` as well.
Sorry, I wasn't very clear. For now, you can do this by:
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained('mistralai/Mistral-7B-v0.3')
tokenizer.vocab['[control_8]']  # 10
tokenizer.save_pretrained(temp_folder)

# Pause here and edit both tokenizer.json and tokenizer_config.json.
# The example below renamed [control_8] to [control_NEW].

tokenizer_reloaded = PreTrainedTokenizerFast.from_pretrained(temp_folder)
tokenizer_reloaded.vocab['[control_NEW]']  # 10
tokenizer_reloaded.added_tokens_decoder[10]  # AddedToken("[control_NEW]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)
tokenizer_reloaded.vocab['[control_8]']  # raises KeyError
```

Please let me know if you are able to reproduce!
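If you need this rename often, the manual "pause and edit both files" step can be scripted with plain `json`/string replacement. A sketch, using minimal stand-in files rather than the real Mistral tokenizer files (`rename_everywhere` is a hypothetical helper, and the file contents below are simplified):

```python
import json, os, tempfile

tmp = tempfile.mkdtemp()

# Minimal stand-ins for the two files save_pretrained writes.
with open(os.path.join(tmp, "tokenizer.json"), "w") as f:
    json.dump({"added_tokens": [{"id": 10, "content": "[control_8]"}]}, f)
with open(os.path.join(tmp, "tokenizer_config.json"), "w") as f:
    json.dump({"added_tokens_decoder": {"10": {"content": "[control_8]"}}}, f)

def rename_everywhere(folder, old, new):
    """Replace the quoted token string in both JSON files; ids stay untouched."""
    for name in ("tokenizer.json", "tokenizer_config.json"):
        path = os.path.join(folder, name)
        with open(path) as f:
            text = f.read()
        with open(path, "w") as f:
            # json.dumps adds the surrounding quotes, so only exact
            # token strings are replaced, never substrings of other keys.
            f.write(text.replace(json.dumps(old), json.dumps(new)))

rename_everywhere(tmp, "[control_8]", "[control_NEW]")

with open(os.path.join(tmp, "tokenizer.json")) as f:
    print(json.load(f)["added_tokens"][0]["content"])  # [control_NEW]
```

Since control tokens are added tokens (not produced by BPE merges), a pure rename like this should not require touching the merges.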
Thank you for the clarification @itazap, modifying the configs worked perfectly!
Awesome, I'm glad it worked! Thanks for your patience 🤗 |
System Info
I want to SFT Mistral-v0.3 with my own chat template.
So I followed this comment and replaced some [control_n] tokens with special tokens for the chat template.
However, the new vocabulary was actually added and the size of the vocabulary increased.
Is there any way to replace the vocabulary?
Who can help?
@ArthurZucker
Information
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
- tokenizer.json
- tokenizer_config.json
- test code
- output
Expected behavior
[control_n] tokens can be replaced with any token.
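That expectation can be stated as a small invariant, sketched here with a plain dict standing in for the tokenizer vocab (names like `<my_turn>` are illustrative):

```python
# A plain dict standing in for the tokenizer vocab.
vocab = {"[control_8]": 10}

# What add_special_tokens effectively does: append under a fresh id.
added = dict(vocab)
added["<my_turn>"] = max(added.values()) + 1

# What replacement should do: same size, same id, new string.
replaced = {("<my_turn>" if k == "[control_8]" else k): v for k, v in vocab.items()}

print(len(added))                            # 2  (vocab grew -- the reported bug)
print(len(replaced), replaced["<my_turn>"])  # 1 10  (the expected behavior)
```

A replacement keeps both the vocab size and the token id stable; only appending grows the vocab, which is what the reproduction above observes.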