
[Potential Bug] Mistral Tokenizer Inconsistencies #1448

Closed
komninoschatzipapas opened this issue Feb 5, 2024 · 6 comments
@komninoschatzipapas

I have downloaded the Mistral 7B tokenizer locally and tried to compare different combinations of the legacy and use_fast options:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('.', legacy=False, use_fast=False)
tokenizerl = AutoTokenizer.from_pretrained('.', legacy=True, use_fast=False)
tokenizerf = AutoTokenizer.from_pretrained('.', legacy=False, use_fast=True)
tokenizerlf = AutoTokenizer.from_pretrained('.', legacy=True, use_fast=True)

s = "test<unk> This is a test phrase</s>"

print(
  f'Regular:\t{tokenizer.tokenize(s)}\n\t\t{tokenizer.decode(tokenizer.encode(s))}'
)

print(
  f'Legacy:\t\t{tokenizerl.tokenize(s)}\n\t\t{tokenizerl.decode(tokenizerl.encode(s))}'
)

print(
  f'Fast:\t\t{tokenizerf.tokenize(s)}\n\t\t{tokenizerf.decode(tokenizerf.encode(s))}'
)

print(
  f'Legacy Fast:\t{tokenizerlf.tokenize(s)}\n\t\t{tokenizerlf.decode(tokenizerlf.encode(s))}'
)

Which yields:

Regular:        ['▁test', '<unk>', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
                <s> test<unk> This is a test phrase</s>
Legacy:         ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
                <s> test<unk>  This is a test phrase</s>
Fast:           ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
                <s> test<unk>  This is a test phrase</s>
Legacy Fast:    ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']

You can find the full code here.

There seem to be inconsistencies in how legacy=False, use_fast=False tokenizes input compared to the other combinations.

If either option is set to True, an extra space is inserted after <unk> and other special tokens.

It seems to me that only legacy=False, use_fast=False tokenizes this input correctly.

We have a production app that extends Mistral with special tokens beyond <unk>, and extra spaces are added after those too.

So for now we have switched to legacy=False, use_fast=False, giving up the speed advantages of the Rust implementation.

Would appreciate any insight into what we are missing! And thank you for the enormous amount of work you have put into this library 🙏

@komninoschatzipapas komninoschatzipapas changed the title Mistral Tokenizer Inconsistencies [Potential Bug] Mistral Tokenizer Inconsistencies Feb 5, 2024
@ArthurZucker
Collaborator

Hey! Thanks. A fix can be derived from #1357 and huggingface/transformers#26678.
Everything you describe is covered there. TL;DR: using Metaspace with prepend_scheme="first" and no normalizer will be the end of your problems.
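To see why prepend_scheme matters here, the following is a pure-Python sketch that simulates only the metaspace step at the word level (it ignores BPE subword merges; metaspace_chunk and tokenize are illustrative helper names, not library APIs):

```python
import re

SPECIALS = ["<s>", "</s>", "<unk>"]

def metaspace_chunk(chunk, prepend):
    """Replace spaces with '▁' and optionally prepend one, as SentencePiece does."""
    s = chunk.replace(" ", "▁")
    if prepend:
        s = "▁" + s
    # Split so each word keeps its leading '▁' (no subword merges modeled here).
    return re.findall(r"▁[^▁]*|[^▁]+", s)

def tokenize(text, prepend_scheme):
    """prepend_scheme='always' mimics the legacy behavior, 'first' the fixed one."""
    pattern = "(" + "|".join(map(re.escape, SPECIALS)) + ")"
    pieces = [p for p in re.split(pattern, text) if p]
    tokens, first = [], True
    for p in pieces:
        if p in SPECIALS:
            tokens.append(p)
            continue
        prepend = prepend_scheme == "always" or (prepend_scheme == "first" and first)
        tokens.extend(metaspace_chunk(p, prepend))
        first = False
    return tokens

s = "test<unk> This is a test phrase</s>"
print(tokenize(s, "first"))   # ['▁test', '<unk>', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
print(tokenize(s, "always"))  # ['▁test', '<unk>', '▁', '▁This', '▁is', '▁a', '▁test', '▁phrase', '</s>']
```

With "always", every text chunk following a special token gets a '▁' prepended even though the chunk already starts with a space, which is exactly the stray '▁' token visible in the Legacy, Fast, and Legacy Fast outputs above; "first" prepends only at the very start of the input.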

@ArthurZucker
Collaborator

I have not had the time to change the default Llama fast tokenizer yet; I will try to do so ASAP.


github-actions bot commented Mar 8, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Mar 8, 2024
@komninoschatzipapas
Author

I think this is still relevant

@github-actions github-actions bot removed the Stale label Mar 9, 2024

github-actions bot commented Apr 8, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Apr 8, 2024
@github-actions github-actions bot closed this as not planned Apr 14, 2024
@ArthurZucker
Collaborator

This was fixed in transformers; you need to set legacy=False 🤗
