
Fast tokenizer breaks added tokens #27132

Closed · geronimi73 opened this issue Oct 29, 2023 · 7 comments · Fixed by #27313

Comments

@geronimi73

System Info

  • transformers version: 4.35.0.dev0
  • Platform: Linux-6.2.0-35-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.25.0.dev0
  • PyTorch version (GPU?): 2.1.0+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/llama2-7b", use_fast=True)

# add tokens for chatml
tokenizer.add_tokens(["<|im_start|>"])
tokenizer.add_special_tokens({"eos_token": "<|im_end|>"})

messages = [ {"role": "user", "content": "question"},
  {"role": "assistant", "content": "answer"} ]

# https://huggingface.co/docs/transformers/main/chat_templating
tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

chat = tokenizer.apply_chat_template(messages, tokenize=False)
chat_tokenized = tokenizer(chat, add_special_tokens=False)["input_ids"]

for token in chat_tokenized:
	print(f"{token} - \"{tokenizer.decode(token)}\"")

Output: the first occurrence of <|im_start|> is tokenized correctly, but the second one is split:

32000 - "<|im_start|>"
1792 - "user"
13 - "
"
12470 - "question"
32001 - "<|im_end|>"
29871 - ""
13 - "
"
29966 - "<"
29989 - "|"
326 - "im"
29918 - "_"
2962 - "start"
29989 - "|"
29958 - ">"
465 - "ass"
22137 - "istant"
13 - "
"
12011 - "answer"
32001 - "<|im_end|>"
29871 - ""
13 - "
"

Expected behavior

32000 - "<|im_start|>"
1404 - "user"
13 - "<0x0A>"
12470 - "question"
32001 - "<|im_end|>"
29871 - ""
13 - "<0x0A>"
32000 - "<|im_start|>"
20255 - "assistant"
13 - "<0x0A>"
12011 - "answer"
32001 - "<|im_end|>"
29871 - ""
13 - "<0x0A>"

This is the correct output, produced by the slow tokenizer: AutoTokenizer.from_pretrained("models/llama2-7b", use_fast=False).

  1. Why does this happen with the fast tokenizer but not the slow one?
  2. Is there any other solution than not using the fast tokenizer?

I guess this is already known; sorry if I missed it in the existing issues.

@amyeroberts
Collaborator

Hi @geronimi73, thanks for raising an issue!

@ArthurZucker is off this week and is the main person who knows and works on the tokenizers, so you might have to wait until he's back for a full answer.

@Rocketknight1 any chance you know what's happening?

@Rocketknight1
Member

Hi @geronimi73, I'll wait for @ArthurZucker to return to give a full answer here, but in the meantime I think the issue is that when you add a normal token, the tokenizer may split it. If you want to preserve an important control token like <|im_start|> you should make it a special token. Try doing this instead:

tokenizer.add_special_tokens({"additional_special_tokens": ["<|im_start|>"]})
tokenizer.add_special_tokens({"eos_token": "<|im_end|>"})

@ArthurZucker
Collaborator

Well, it's partly true, partly wrong 😅
When you add a token, if it is not special, it will be normalized by default. I'll add the add_tokens function back to the doc; it seems it was removed. But anyway, the Llama normalizer adds a SPIECE_UNDERLINE at the beginning of the added token's content, which thus becomes a different token. AddedTokens (special or not) should never be split, but the content of the added tokens is affected by the normalizer.
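
A small sketch of the difference this makes (model path as in the repro; exact ids depend on the checkpoint):

from transformers import AutoTokenizer, AddedToken

text = "question\n<|im_start|>assistant"

# default: the added token's content goes through the Llama normalizer
tok_default = AutoTokenizer.from_pretrained("models/llama2-7b")
tok_default.add_tokens(["<|im_start|>"])
print(tok_default(text, add_special_tokens=False)["input_ids"])  # <|im_start|> may be split here

# normalized=False: the token is matched against the raw text and stays intact
tok_raw = AutoTokenizer.from_pretrained("models/llama2-7b")
tok_raw.add_tokens(AddedToken("<|im_start|>", normalized=False))
print(tok_raw(text, add_special_tokens=False)["input_ids"])  # a single id for <|im_start|>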

@geronimi73
Author

ok, thanks!

@geronimi73
Author

it's somehow working now.

just to sum this up for others who are struggling with this too:

  • I raised this issue because the fast tokenizer breaks the ChatML tag <|im_start|> into several tokens even though it was added with tokenizer.add_tokens(["<|im_start|>"]); the slow tokenizer works fine.
  • As @ArthurZucker explains above, the Llama normalizer adds a SPIECE_UNDERLINE; indeed, the fast tokenizer encodes <|im_start|> correctly when the token is added with
tokenizer.add_tokens(
	AddedToken("<|im_start|>", normalized=False)
)
  • But there's a new problem: decoding now adds a space after added tokens. Example:
from transformers import AutoTokenizer, AddedToken

tokenizer = AutoTokenizer.from_pretrained("../models/llama2-7b", use_fast=True, legacy=False)

tokenizer.add_tokens(
	AddedToken("<|im_start|>",normalized=False, rstrip=True, lstrip=False)
)
tokenizer.add_special_tokens({"eos_token": "<|im_end|>"})

# https://huggingface.co/docs/transformers/main/chat_templating
tokenizer.chat_template = "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"

messages=[
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"}
]

chat = tokenizer.apply_chat_template(messages, tokenize=False)
chat_tokenized = tokenizer(chat, add_special_tokens=False)["input_ids"]

print("INPUT")
print(chat)
print("-"*30)
print("DECODE(ENCODE(INPUT))")
print(tokenizer.decode(chat_tokenized))

# INPUT
# <|im_start|>user
# Hi there!<|im_end|>
# <|im_start|>assistant
# Nice to meet you!<|im_end|>

# ------------------------------
# DECODE(ENCODE(INPUT))
# <|im_start|> user
# Hi there!<|im_end|> 
# <|im_start|> assistant
# Nice to meet you!<|im_end|> 
  • Fix for all of the above: use the slow tokenizer (use_fast=False, legacy=False), add tokens with tokenizer.add_tokens(["<|im_start|>"]), and decode with spaces_between_special_tokens=False, like this (full runnable sketch after this list):
tokenizer = AutoTokenizer.from_pretrained("../models/llama2-7b", use_fast=False, legacy=False)
tokenizer.add_tokens(["<|im_start|>"])
...
chat_tokenized = tokenizer(chat, add_special_tokens=False)["input_ids"]
print(tokenizer.decode(chat_tokenized, spaces_between_special_tokens=False))
  • using transformers 4.35.0 btw
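
For completeness, a runnable version of that last fix, following the snippets above (the template below is a simplified version of the ChatML template from the repro, without the add_generation_prompt branch):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("../models/llama2-7b", use_fast=False, legacy=False)
tokenizer.add_tokens(["<|im_start|>"])
tokenizer.add_special_tokens({"eos_token": "<|im_end|>"})
tokenizer.chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}"

messages = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Nice to meet you!"},
]

chat = tokenizer.apply_chat_template(messages, tokenize=False)
chat_tokenized = tokenizer(chat, add_special_tokens=False)["input_ids"]

# spaces_between_special_tokens=False avoids the extra space after added tokens when decoding
print(tokenizer.decode(chat_tokenized, spaces_between_special_tokens=False))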

@ArthurZucker
Collaborator

ArthurZucker commented Nov 13, 2023

Thanks for the great explanation!
Regarding the space added after added tokens, this PR will fix it: huggingface/tokenizers#1357 😉 I'll have to change the Llama paradigm a little bit to make sure it's compatible

@ArthurZucker
Collaborator

feel free to play with #26678 as well 🤗
