
[Bug] Inconsistent removal of leading and trailing whitespace for Metaspace pretokenizers #1250

Closed
xenova opened this issue May 15, 2023 · 11 comments

@xenova

xenova commented May 15, 2023

Description

There is a mismatch between the fast and slow implementations of Metaspace-based tokenizers (e.g., t5, t5-v1_1, flan-t5, ...) when dealing with inputs that contain leading and trailing whitespace.

Code to reproduce:

from transformers import AutoTokenizer

# Fast (Rust-backed) tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-small", use_fast=True)
print(tokenizer('   space   '))
# {'input_ids': [628, 3, 1], 'attention_mask': [1, 1, 1]}

# Slow (Python) tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-small", use_fast=False)
print(tokenizer('   space   '))
# {'input_ids': [628, 1], 'attention_mask': [1, 1]}

Expected:

I'm unsure which is correct, but the fast and slow tokenizers should match. If I had to make a guess, the slow tokenizer's output makes more sense since it's strange to keep a "prefix token" (id=3) by itself: ["space", "_", "</s>"]. The normal t5 tokenizers do not have this issue since they explicitly split on whitespace (they use a sequence: whitespace split followed by metaspace).
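
For what it's worth, the difference is visible at the pre-tokenizer level. A minimal sketch using the `tokenizers` Python package (default Metaspace arguments; whether they exactly match the T5 configs is an assumption on my part):

from tokenizers import pre_tokenizers

text = "   space   "

# Metaspace on its own (what t5-v1_1 appears to use):
metaspace_only = pre_tokenizers.Metaspace()
print(metaspace_only.pre_tokenize_str(text))
# leading/trailing spaces are kept and end up as standalone "▁" pieces

# WhitespaceSplit followed by Metaspace (what the normal t5 fast tokenizers use):
sequence = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Metaspace(),
])
print(sequence.pre_tokenize_str(text))
# outer whitespace is discarded before Metaspace runs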

@Narsil
Collaborator

Narsil commented May 15, 2023

@ArthurZucker Maybe ?

I don't know which version is "correct".

@chris-ha458
Contributor

A possible cause for this issue is that the fast version is based on Rust while the "slow" one is based on Python.
It might be possible to find the reason for the discrepancy hiding somewhere inside those two implementations, but unless your platform requirements absolutely dictate a pure Python implementation, just go with the Rust implementation (fast).
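
If anyone wants to double-check which backend they actually got, a quick sketch (nothing specific to this bug):

from transformers import AutoTokenizer

fast = AutoTokenizer.from_pretrained("google/t5-v1_1-small", use_fast=True)
slow = AutoTokenizer.from_pretrained("google/t5-v1_1-small", use_fast=False)
print(type(fast).__name__, fast.is_fast)  # T5TokenizerFast True (Rust-backed)
print(type(slow).__name__, slow.is_fast)  # T5Tokenizer False (pure Python)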

@Narsil
Collaborator

Narsil commented May 15, 2023

@chris-ha458 Thanks, I know this is the difference.

There are a lot of discrepancies between the 2 versions. Usually they are more subtle, though. Getting different encodings is really bad, since it changes how the end model behaves.

@chris-ha458
Contributor

@Narsil Ah, obviously, as one of the most important developers in the project you would know :) I was not trying to "mansplain" this project to you.
I was trying to help explain for @xenova's benefit, and to force myself to understand this project and codebase more.
I've been doing this a lot on other issues and PRs lately. If this is counterproductive for you or the project in either a general or specific context, feel free to point it out.

Coming back to this specific issue, is this something that you'd like a volunteer to investigate in detail?

@Narsil
Collaborator

Narsil commented May 22, 2023

I was trying to help explain for @xenova's benefit, and to force myself to understand this project and codebase more.
I've been doing this a lot on other issues and PRs lately. If this is counterproductive for you or the project in either a general or specific context, feel free to point it out.

Your help is most definitely appreciated, and it helps me out a ton since I don't have to answer every issue.
I think this particular point was also obvious to OP, so I didn't feel going on about it was necessary, but there was no animosity in my answer.

If you want to dig into this, you have my blessing. It's most likely linked to the number of spaces and how both do the preprocessing.
In general, if you feel like it, discrepancies between slow and fast tokenizers are generally a pain, and any help to mitigate them would be highly appreciated.

The "encode" method is the most important one. You can check out the test battery within transformers on the llama tokenizer to see what kind of tests are needed to reach parity, including utf-8 tokens, weird spacing and all.

One important thing to note is that since both slow and fast are now being used, "fixing" one might break existing things, which is why that area has been moving quite slowly.
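
To make the idea concrete, here is a rough sketch of such a parity check (not the actual transformers test battery, just an illustration of comparing both implementations on awkward inputs):

from transformers import AutoTokenizer

checkpoint = "google/t5-v1_1-small"
fast = AutoTokenizer.from_pretrained(checkpoint, use_fast=True)
slow = AutoTokenizer.from_pretrained(checkpoint, use_fast=False)

samples = [
    "   space   ",
    "no extra space",
    "tabs\tand\nnewlines",
    "utf-8: déjà vu こんにちは",
    " leading only",
    "trailing only ",
]

for text in samples:
    fast_ids = fast(text).input_ids
    slow_ids = slow(text).input_ids
    marker = "OK  " if fast_ids == slow_ids else "DIFF"
    print(f"{marker} {text!r}  fast={fast_ids}  slow={slow_ids}")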

@chris-ha458
Contributor

chris-ha458 commented May 22, 2023

I was operating on the assumption that the Rust binding was somewhat of a "canonical" version, but it seems both the Python and Rust versions are of similar standing. It would be difficult to compare and contrast each divergence, especially without knowing which is "correct".

I think my approach would be to acquire more information, especially with my own tokenizers and corpora, so as to be able to speak with more confidence about what the encountered and expected behaviors might be.

Hopefully, some of the divergences come from outright bugs in either implementation that could be fixed.
Those cases that are not "wrong" but "different" would be more difficult to mitigate, but it would be useful to at least document them in a reproducible manner.

As for the test battery:
I assume you mean the "tests/models/llama/test_tokenization_llama.py" in transformers, right?

@Narsil
Collaborator

Narsil commented May 22, 2023

I assume you mean the "tests/models/llama/test_tokenization_llama.py" in transformers, right?

Indeed.

@lucasjinreal

I got an even weirder result:

Encoding one sentence as 2 parts and concatenating the ids does not give the same result as encoding the sentence alone:

huggingface/transformers#25285
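
For reference, the kind of comparison described there looks roughly like this (the sentence and split point are placeholders, not taken from the linked issue):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/t5-v1_1-small", use_fast=True)

sentence = "Hello world, this is a test"
part1, part2 = "Hello world,", " this is a test"

whole = tokenizer(sentence, add_special_tokens=False).input_ids
concat = (tokenizer(part1, add_special_tokens=False).input_ids
          + tokenizer(part2, add_special_tokens=False).input_ids)
print(whole)
print(concat)  # can differ, e.g. because each part gets its own prefix-space handling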

@ArthurZucker
Collaborator

ArthurZucker commented Jun 1, 2023

Also already mentioned in #16334


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Dec 26, 2023
@github-actions github-actions bot closed this as not planned Dec 31, 2023
@ArthurZucker ArthurZucker reopened this Jan 3, 2024
@ArthurZucker
Collaborator

I believe #1357 fixed it.
