
[Bug] Inconsistent removal of leading and trailing whitespace for Metaspace pretokenizers #1250

Closed
xenova opened this issue May 15, 2023 · 11 comments

@xenova

xenova commented May 15, 2023

Description

There is a mismatch between the fast and slow implementations of Metaspace-based tokenizers (e.g., t5, t5-v1_1, flan-t5, ...) when dealing with inputs that contain leading and trailing whitespace.

Code to reproduce:

from transformers import AutoTokenizer

# Fast (Rust-backed) tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-small", use_fast=True)
print(tokenizer('   space   '))
# {'input_ids': [628, 3, 1], 'attention_mask': [1, 1, 1]}

# Slow (Python) tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-small", use_fast=False)
print(tokenizer('   space   '))
# {'input_ids': [628, 1], 'attention_mask': [1, 1]}

Expected:

I'm unsure which is correct, but the fast and slow tokenizers should match. If I had to make a guess, the slow tokenizer's output makes more sense since it's strange to keep a "prefix token" (id=3) by itself: ["space", "_", "</s>"]. The normal t5 tokenizers do not have this issue since they explicitly split on whitespace (they use a sequence: whitespace split followed by metaspace).
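
For what it's worth, the difference is visible at the pre-tokenizer level. A minimal sketch using the `tokenizers` Python package (default Metaspace arguments; whether they exactly match the T5 configs is an assumption on my part):

from tokenizers import pre_tokenizers

text = "   space   "

# Metaspace on its own (what t5-v1_1 appears to use):
metaspace_only = pre_tokenizers.Metaspace()
print(metaspace_only.pre_tokenize_str(text))
# leading/trailing spaces are kept and end up as standalone "▁" pieces

# WhitespaceSplit followed by Metaspace (what the normal t5 fast tokenizers use):
sequence = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Metaspace(),
])
print(sequence.pre_tokenize_str(text))
# outer whitespace is discarded before Metaspace runs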

@Narsil
Collaborator

Narsil commented May 15, 2023

@ArthurZucker Maybe ?

I don't know which version is "correct".

@chris-ha458
Contributor

A possible cause for this issue is that the fast version is based on Rust while the "slow" one is based on Python.
It might be possible to find the reason for the discrepancy hiding somewhere inside those two implementations, but unless your platform requirements absolutely dictate a pure Python implementation, just go with the Rust implementation (fast).
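
If anyone wants to double-check which backend they actually got, a quick sketch (nothing specific to this bug):

from transformers import AutoTokenizer

fast = AutoTokenizer.from_pretrained("google/t5-v1_1-small", use_fast=True)
slow = AutoTokenizer.from_pretrained("google/t5-v1_1-small", use_fast=False)
print(type(fast).__name__, fast.is_fast)  # T5TokenizerFast True (Rust-backed)
print(type(slow).__name__, slow.is_fast)  # T5Tokenizer False (pure Python)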

@Narsil
Collaborator

Narsil commented May 15, 2023

@chris-ha458 Thanks, I know this is the difference.

There are a lot of discrepancies between the 2 versions. Usually they are more subtle, though. Getting different encodings is really bad, since it changes how the end model behaves.

@chris-ha458
Contributor

@Narsil Ah, obviously, as one of the most important developers in the project you would know :) I was not trying to "mansplain" this project to you.
I was trying to help explain for @xenova's benefit, and to force myself to understand this project and codebase more.
I've been doing this a lot on other issues and PRs lately. If this is counterproductive for you or the project in either a general or specific context, feel free to point it out.

Coming back to this specific issue, is this something that you'd like a volunteer to investigate in detail?

@Narsil
Collaborator

Narsil commented May 22, 2023

I was trying to help explain for @xenova's benefit, and to force myself to understand this project and codebase more.
I've been doing this a lot on other issues and PRs lately. If this is counterproductive for you or the project in either a general or specific context, feel free to point it out.

Your help is most definitely appreciated, and it helps me out a ton since I don't have to answer every issue.
I think this particular point was also obvious to OP, so I didn't feel going on about it was necessary, but there was no animosity in my answer.

If you want to dig into this, you have my blessing. It's most likely linked to the number of spaces and how both do the preprocessing.
In general, if you feel like it, discrepancies between slow and fast tokenizers are generally a pain, and any help to mitigate them would be highly appreciated.

The "encode" method is the most important one. You can check out the test battery within transformers on the llama tokenizer to see what kind of tests are needed to reach parity, including utf-8 tokens, weird spacing and all.

One important thing to note is that since both slow and fast are now being used, "fixing" one might break existing things, which is why that area has been moving quite slowly.
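
To make the idea concrete, here is a rough sketch of such a parity check (not the actual transformers test battery, just an illustration of comparing both implementations on awkward inputs):

from transformers import AutoTokenizer

checkpoint = "google/t5-v1_1-small"
fast = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
slow = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)

samples = [
    "   space   ",
    "no extra space",
    "tabs\tand\nnewlines",
    "utf-8: déjà vu こんにちは",
    " leading only",
    "trailing only ",
]

for text in samples:
    fast_ids = fast(text).input_ids
    slow_ids = slow(text).input_ids
    marker = "OK  " if fast_ids == slow_ids else "DIFF"
    print(f"{marker} {text!r}  fast={fast_ids}  slow={slow_ids}")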

@chris-ha458
Contributor

chris-ha458 commented May 22, 2023

I was operating on the assumption that the Rust binding was somewhat of a "canonical" version, but it seems both the Python and Rust versions are of similar standing. It would be difficult to compare and contrast each divergence, especially without knowing which is "correct".

I think my approach would be to acquire more information, especially with my own tokenizers and corpora, so as to be able to speak with more confidence about what the encountered and expected behaviors might be.

Hopefully, some of the divergences come from outright bugs in either implementation that could be fixed.
Those cases that are not "wrong" but "different" would be more difficult to mitigate, but it would be useful to at least document them in a reproducible manner.

As for the test battery:
I assume you mean the "tests/models/llama/test_tokenization_llama.py" in transformers, right?

@Narsil
Collaborator

Narsil commented May 22, 2023

I assume you mean the "tests/models/llama/test_tokenization_llama.py" in transformers, right?

Indeed.

@lucasjinreal

I got an even weirder result:

Encoding one sentence as 2 parts and concatenating the ids does not give the same result as encoding the sentence alone:

huggingface/transformers#25285
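
For reference, the kind of comparison described there looks roughly like this (the sentence and split point are placeholders, not taken from the linked issue):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-small", use_fast=True)

sentence = "Hello world, this is a test"
part1, part2 = "Hello world,", " this is a test"

whole = tokenizer(sentence, add_special_tokens=False).input_ids
concat = (tokenizer(part1, add_special_tokens=False).input_ids
          + tokenizer(part2, add_special_tokens=False).input_ids)
print(whole)
print(concat)  # can differ, e.g. because each part gets its own prefix-space handling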

@ArthurZucker
Collaborator

ArthurZucker commented Jun 1, 2023

Also already mentioned in #16334


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Dec 26, 2023
@github-actions github-actions bot closed this as not planned Dec 31, 2023
@ArthurZucker ArthurZucker reopened this Jan 3, 2024
@ArthurZucker
Collaborator

I believe #1357 fixed it.
