
[Potential Bug] Mistral Tokenizer Inconsistencies #1448

Closed
komninoschatzipapas opened this issue Feb 5, 2024 · 6 comments
@komninoschatzipapas

I have downloaded the Mistral 7B tokenizer locally and tried to compare different combinations of the legacy and use_fast options:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('.', legacy=False, use_fast=False)
tokenizerl = AutoTokenizer.from_pretrained('.', legacy=True, use_fast=False)
tokenizerf = AutoTokenizer.from_pretrained('.', legacy=False, use_fast=True)
tokenizerlf = AutoTokenizer.from_pretrained('.', legacy=True, use_fast=True)

s = "test<unk> This is a test phrase</s>"

print(
  f'Regular:\t{tokenizer.tokenize(s)}\n\t\t{tokenizer.decode(tokenizer.encode(s))}'
)

print(
  f'Legacy:\t\t{tokenizerl.tokenize(s)}\n\t\t{tokenizerl.decode(tokenizerl.encode(s))}'
)

print(
  f'Fast:\t\t{tokenizerf.tokenize(s)}\n\t\t{tokenizerf.decode(tokenizerf.encode(s))}'
)

print(
  f'Legacy Fast:\t{tokenizerlf.tokenize(s)}\n\t\t{tokenizerlf.decode(tokenizerlf.encode(s))}'
)

Which yields:

Regular:        ['▁test', '<unk>', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
                <s> test<unk> This is a test phrase</s>
Legacy:         ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
                <s> test<unk>  This is a test phrase</s>
Fast:           ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
                <s> test<unk>  This is a test phrase</s>
Legacy Fast:    ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']

You can find the full code here.

There seem to be inconsistencies in how legacy=False, use_fast=False tokenizes input compared to the other combinations.

If either option is set to True, an extra space is inserted after <unk> and other special tokens.

It seems to me that only legacy=False, use_fast=False tokenizes this input correctly.

We have a production app that extends Mistral with special tokens beyond <unk>, and extra spaces are added after those too.

So for now we have switched to legacy=False, use_fast=False, giving up the speed advantages of the Rust implementation.

Would appreciate any insight into what we are missing! And thank you for the enormous amount of work you have put into this library 🙏

@komninoschatzipapas komninoschatzipapas changed the title Mistral Tokenizer Inconsistencies [Potential Bug] Mistral Tokenizer Inconsistencies Feb 5, 2024
@ArthurZucker
Collaborator

Hey! Thanks. A fix can be derived from #1357 and huggingface/transformers#26678.
Everything you describe is covered there. TL;DR: using Metaspace with prepend_scheme="first" and no normalizer will be the end of your problems.
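To see why prepend_scheme matters here, the following is a pure-Python sketch that simulates only the metaspace step at the word level (it ignores BPE subword merges; metaspace_chunk and tokenize are illustrative helper names, not library APIs):

```python
import re

SPECIALS = ["<s>", "</s>", "<unk>"]

def metaspace_chunk(chunk, prepend):
    """Replace spaces with '▁' and optionally prepend one, as SentencePiece does."""
    s = chunk.replace(" ", "▁")
    if prepend:
        s = "▁" + s
    # Split so each word keeps its leading '▁' (no subword merges modeled here).
    return re.findall(r"▁[^▁]*|[^▁]+", s)

def tokenize(text, prepend_scheme):
    """prepend_scheme='always' mimics the legacy behavior, 'first' the fixed one."""
    pattern = "(" + "|".join(map(re.escape, SPECIALS)) + ")"
    pieces = [p for p in re.split(pattern, text) if p]
    tokens, first = [], True
    for p in pieces:
        if p in SPECIALS:
            tokens.append(p)
            continue
        prepend = prepend_scheme == "always" or (prepend_scheme == "first" and first)
        tokens.extend(metaspace_chunk(p, prepend))
        first = False
    return tokens

s = "test<unk> This is a test phrase</s>"
print(tokenize(s, "first"))   # ['▁test', '<unk>', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
print(tokenize(s, "always"))  # ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
```

With "always", every text chunk following a special token gets a '▁' prepended even though the chunk already starts with a space, which is exactly the stray '▁' token visible in the Legacy, Fast, and Legacy Fast outputs above; "first" prepends only at the very start of the input.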

@ArthurZucker
Collaborator

I have not had the time to change the default Llama fast tokenizer yet; I will try to do so ASAP.


github-actions bot commented Mar 8, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Mar 8, 2024
@komninoschatzipapas
Author

I think this is still relevant

@github-actions github-actions bot removed the Stale label Mar 9, 2024

github-actions bot commented Apr 8, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Apr 8, 2024
@github-actions github-actions bot closed this as not planned Apr 14, 2024
@ArthurZucker
Collaborator

This was fixed in transformers; you need to set legacy=False 🤗
