[Whisper Tokenizer] Test timestamps #26053
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Force-pushed from bc86eab to be54cab
Hey!
Seems fixed on main, no?

```python
from transformers import WhisperTokenizer, WhisperTokenizerFast

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
tokenizer_fast = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

print(tokenizer.encode("<|0.00|> hey"))
print(tokenizer_fast.encode("<|0.00|> hey"))
```

Loading the tokenizer from the …
Hey @ArthurZucker - currently these tests fail because the Whisper tokenizer does not respect the space between added tokens and the subsequent token. See the following code snippet for an example where the space is removed:

```python
from transformers import WhisperTokenizerFast

tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

input_str = "<|0.00|> Whisper can do timestamps?<|2.60|>"
encoding = tokenizer(input_str, add_special_tokens=False).input_ids
decoded_str = tokenizer.decode(encoding, decode_with_timestamps=True)
print(decoded_str)
```

Print output:

```
<|0.00|>Whisper can do timestamps?<|2.60|>
```
Looking at the diff between this and the original:

```
<|0.00|> Whisper can do timestamps?<|2.60|>
<|0.00|>Whisper can do timestamps?<|2.60|>
```

we see that the space after the added token `<|0.00|>` is contracted. This is using the timestamp tokens already added to the Hub tokenizer. Even if we try adding them manually with `lstrip=True` and `rstrip=True`:

```python
from transformers import WhisperTokenizerFast, AddedToken

# specify a revision that does not have the timestamp tokens added
tokenizer = WhisperTokenizerFast.from_pretrained(
    "openai/whisper-tiny", revision="1135fb87afc1dd37439285d93c0388041612e9a4"
)

# add the timestamp tokens
timestamps = [
    AddedToken("<|%.2f|>" % (i * 0.02), lstrip=True, rstrip=True)
    for i in range(1500 + 1)
]
tokenizer.add_tokens(timestamps)

input_str = "<|0.00|> Whisper can do timestamps?<|2.60|>"
encoding = tokenizer(input_str, add_special_tokens=False).input_ids
decoded_str = tokenizer.decode(encoding, decode_with_timestamps=True)
print(decoded_str)
```

Print output:

```
<|0.00|>Whisper can do timestamps?<|2.60|>
```

we get the same contracted string.
This is because we get different encodings with and without timestamp tokens:

```python
encoding_with = tokenizer("<|0.00|> The", add_special_tokens=False).input_ids
encoding_without = tokenizer(" The", add_special_tokens=False).input_ids

print("With timestamps: ", encoding_with)
print("Without timestamps: ", encoding_without)
```

We can see the missing space when we decode these tokens:

```python
print(tokenizer.decode(encoding_with))
print(tokenizer.decode(encoding_without))
```

Print output:

```
<|0.00|>The
 The
```
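For reference, the stripping behaviour can be inspected on the tokenizer itself; a minimal sketch, assuming a transformers version that exposes `added_tokens_decoder`:

```python
from transformers import WhisperTokenizerFast

tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

# map the timestamp token to its id, then look up its AddedToken entry;
# the lstrip/rstrip flags on the entry control whether surrounding
# whitespace is absorbed by the token at encode time
ts_id = tokenizer.convert_tokens_to_ids("<|0.00|>")
print(tokenizer.added_tokens_decoder[ts_id])
```

If the entry shows `rstrip=True`, the space after `<|0.00|>` is swallowed during encoding, which matches the behaviour above.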
I am just not getting these on main 😵‍💫
As discussed offline, this is because we need to load the tokenizer, override the added tokens, and re-push it to the Hub. Will remove these special tokens and re-add them for the Whisper model. We can explore a proper fix in the lib as a long-term solution.
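A rough sketch of what that remove-and-re-add step could look like (hypothetical: it assumes re-adding the timestamp tokens with `lstrip=False`/`rstrip=False` preserves the surrounding spaces, and it reuses the pre-timestamp revision from the snippet above):

```python
from transformers import WhisperTokenizerFast, AddedToken

# start from the revision that does not yet have the timestamp tokens
tokenizer = WhisperTokenizerFast.from_pretrained(
    "openai/whisper-tiny", revision="1135fb87afc1dd37439285d93c0388041612e9a4"
)

# re-add the 1501 timestamp tokens, this time without stripping whitespace
timestamps = [
    AddedToken("<|%.2f|>" % (i * 0.02), lstrip=False, rstrip=False)
    for i in range(1500 + 1)
]
tokenizer.add_tokens(timestamps)

# re-push to the Hub (requires write access to the repo)
# tokenizer.push_to_hub("openai/whisper-tiny")
```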
Fixed by #26538!
Running the CI to see if we test the correct tokenizer encoding.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
What does this PR do?
Adds a test to check that the Whisper tokenizer correctly encodes/decodes the newly added timestamp tokens. This is quite a strict test, since we check that the encoded IDs match the expected IDs, as well as the target string.
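As a rough illustration of the round trip being tested (a sketch, not the actual test code from this PR; the hard-coded expected IDs checked in the PR are omitted here):

```python
from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

input_str = "<|0.00|> Whisper can do timestamps?<|2.60|>"
encoding = tokenizer(input_str, add_special_tokens=False).input_ids

# decode_with_timestamps=True keeps the timestamp tokens in the output string
decoded_str = tokenizer.decode(encoding, decode_with_timestamps=True)

# the round trip must preserve the string exactly, including the space
# after <|0.00|>
assert decoded_str == input_str
```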
Currently, this test fails: the slow tokenizer removes spaces after special token IDs (see #24476 (comment)), and the fast tokenizer gives the wrong IDs (see #24476 (comment)).
We can be confident the Whisper tokenizers are working as expected once this test is green.
cc @ArthurZucker