Fast tokenizer breaks added tokens #27132
Comments
Hi @geronimi73, thanks for raising an issue! @ArthurZucker is off this week and is the main person who knows and works with the tokenizers, so you might have to wait until then for an answer. @Rocketknight1 any chance you know what's happening?
Hi @geronimi73, I'll wait for @ArthurZucker to return to give a full answer here, but in the meantime I think the issue is that when you add a normal token, the tokenizer may split it. If you want to preserve an important control token like <|im_start|>, add it as a special token instead:

tokenizer.add_special_tokens({"additional_special_tokens": ["<|im_start|>"]})
tokenizer.add_special_tokens({"eos_token": "<|im_end|>"})
Well, it's partly true, partly wrong 😅
ok, thanks!
it's somehow working now. just to sum this up for others who are struggling with this too:

this alone did not work (the fast tokenizer still split the token):

tokenizer.add_tokens(
    AddedToken("<|im_start|>", normalized=False)
)

this works:

from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("../models/llama2-7b", use_fast=True, legacy=False)
tokenizer.add_tokens(
    AddedToken("<|im_start|>", normalized=False, rstrip=True, lstrip=False)
)
tokenizer.add_special_tokens({"eos_token": "<|im_end|>"})
# ChatML-style template, see https://huggingface.co/docs/transformers/main/chat_templating
tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
]
chat = tokenizer.apply_chat_template(messages, tokenize=False)
chat_tokenized = tokenizer(chat, add_special_tokens=False)["input_ids"]
print("INPUT")
print(chat)
print("-"*30)
print("DECODE(ENCODE(INPUT))")
print(tokenizer.decode(chat_tokenized))
# INPUT
# <|im_start|>user
# Hi there!<|im_end|>
# <|im_start|>assistant
# Nice to meet you!<|im_end|>
# ------------------------------
# DECODE(ENCODE(INPUT))
# <|im_start|> user
# Hi there!<|im_end|>
# <|im_start|> assistant
# Nice to meet you!<|im_end|>
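Note the extra space after <|im_start|> in the decoded output above: the token itself appears to be encoded as a single id, and the space shows up only on the decode side. A quick check (a sketch, reusing the fast tokenizer set up above):

ids = tokenizer("<|im_start|>user\nHi!", add_special_tokens=False)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))
# the added token should appear as a single entry in the list, not split into pieces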
tokenizer = AutoTokenizer.from_pretrained("../models/llama2-7b", use_fast=False, legacy=False)
tokenizer.add_tokens(["<|im_start|>"])
...
chat_tokenized = tokenizer(chat, add_special_tokens=False)["input_ids"]
print(tokenizer.decode(chat_tokenized, spaces_between_special_tokens=False))
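And a round-trip sanity check for the slow-tokenizer variant (a sketch; exact whitespace handling can vary across versions):

decoded = tokenizer.decode(chat_tokenized, spaces_between_special_tokens=False)
print(decoded == chat)  # ideally True: the round trip should be lossless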
Thanks for the great explanation!
feel free to play with #26678 as well 🤗
System Info

transformers version: 4.35.0.dev0

Who can help?

@ArthurZucker

Reproduction
output: the first occurrence of <|im_start|> is correctly tokenized, the second one is split

Expected behavior

this is the correct output of the slow tokenizer:

AutoTokenizer.from_pretrained("models/llama2-7b", use_fast=False)

i guess this is known, sorry if I missed it in the existing issues
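A minimal reproduction along the lines described (a sketch; the model path is a placeholder and the exact split can depend on the transformers/tokenizers version):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("models/llama2-7b", use_fast=True)
tok.add_tokens(["<|im_start|>"])

text = "<|im_start|>user hi<|im_start|>assistant"
print(tok.tokenize(text))
# reported behavior: the first occurrence stays whole, the second gets split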