
[Whisper Tokenizer] Test timestamps #26053

Closed

Conversation

@sanchit-gandhi (Contributor) commented Sep 8, 2023

What does this PR do?

Adds a test to check that the Whisper tokenizer correctly encodes/decodes the newly added timestamp tokens. This is quite a strict test, since we check both that the encoded IDs match the expected IDs and that the decoded string matches the target string.

Currently, this test fails: the slow tokenizer removes spaces after special tokens (see #24476 (comment)), and the fast tokenizer gives the wrong IDs (see #24476 (comment)).

We can be confident the Whisper tokenizers are working as expected once this test is green.
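
For illustration, the round-trip property being tested looks roughly like this (a sketch reusing the example string from the discussion below, not the actual test code in this PR):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

input_str = "<|0.00|> Whisper can do timestamps?<|2.60|>"
ids = tokenizer(input_str, add_special_tokens=False).input_ids
decoded_str = tokenizer.decode(ids, decode_with_timestamps=True)

# At the time of writing this check fails: the decoded string loses
# the space after <|0.00|>.
assert decoded_str == input_str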

cc @ArthurZucker

@HuggingFaceDocBuilderDev (bot)

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ArthurZucker (Collaborator) commented Sep 28, 2023

Hey!

> the slow tokenizer removes spaces after special tokens

This can be fixed by updating the AddedTokens to have lstrip and rstrip set to False and pushing the tokenizer again.

> the fast tokenizer gives the wrong IDs

This seems fixed on main, no?

from transformers import WhisperTokenizer, WhisperTokenizerFast

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
tokenizer_fast = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

# Both tokenizers should produce identical ids for a timestamped string.
print(tokenizer.encode("<|0.00|> hey"))
print(tokenizer_fast.encode("<|0.00|> hey"))

Loading the tokenizer from the special_tokens_map.json and the added_tokens.json will be removed in transformers 5, it is kept for forward compatibility, but it is recommended to update your tokenizer_config.json by uploading it again. You will see the new added_tokens_decoder attribute that will store the relevant information.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[50258, 50363, 50364, 17230, 50257]
[50258, 50363, 50364, 17230, 50257]
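
For reference, a sketch of the fix described above, re-registering the timestamp tokens with stripping disabled (note the later comment in this thread that overriding already-registered tokens via add_tokens does not currently take effect):

from transformers import WhisperTokenizer, AddedToken

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# Timestamp tokens with lstrip/rstrip disabled, so surrounding spaces
# are left intact during encoding.
timestamps = [AddedToken("<|%.2f|>" % (i * 0.02), lstrip=False, rstrip=False) for i in range(1500 + 1)]
tokenizer.add_tokens(timestamps)

# The updated tokenizer would then be pushed to the Hub, e.g.:
# tokenizer.push_to_hub("openai/whisper-tiny")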

@sanchit-gandhi (Contributor, Author)

Hey @ArthurZucker - currently these tests fail because the Whisper tokenizer does not preserve the space between an added token and the subsequent token. See the following code snippet for an example where the space is removed:

from transformers import WhisperTokenizerFast

tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

input_str = "<|0.00|> Whisper can do timestamps?<|2.60|>"
encoding = tokenizer(input_str, add_special_tokens=False).input_ids
decoded_str = tokenizer.decode(encoding, decode_with_timestamps=True)

print(decoded_str)

Print Output:

<|0.00|>Whisper can do timestamps?<|2.60|>

Looking at the diff between this and the original:

<|0.00|> Whisper can do timestamps?<|2.60|>
<|0.00|>Whisper can do timestamps?<|2.60|>

we see that the space between the added token <|0.00|> and the first non-added token has been dropped.

This is using the timestamp tokens added with lstrip=rstrip=False, as per the code snippet: #24476 (comment)

Even if we try with lstrip=rstrip=True, we don't get the expected behaviour:

from transformers import WhisperTokenizerFast, AddedToken

# specify revision that does not have timestamp tokens added
tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny", revision="1135fb87afc1dd37439285d93c0388041612e9a4")

# add the timestamp tokens
timestamps = [AddedToken("<|%.2f|>" % (i * 0.02), lstrip=True, rstrip=True) for i in range(1500 + 1)]
tokenizer.add_tokens(timestamps)

input_str = "<|0.00|> Whisper can do timestamps?<|2.60|>"
encoding = tokenizer(input_str, add_special_tokens=False).input_ids
decoded_str = tokenizer.decode(encoding, decode_with_timestamps=True)

print(decoded_str)

Print Output:

<|0.00|>Whisper can do timestamps?<|2.60|>

This is because we get different encodings with and without timestamp tokens:

encoding_with = tokenizer("<|0.00|> The", add_special_tokens=False).input_ids
encoding_without = tokenizer(" The", add_special_tokens=False).input_ids

print("With timestamps:    ", encoding_with)
print("Without timestamps: ", encoding_without)

Print Output:

With timestamps:     [50364, 2471, 271, 610]
Without timestamps:  [41132, 610]

We can see the missing space in these tokens:

print(tokenizer.decode(encoding_with))
print(tokenizer.decode(encoding_without))

Print Output:

The
 The
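
Continuing the snippet above, a token-level view makes the lost space explicit; in Whisper's byte-level BPE vocabulary a leading Ġ marks a token that carries a preceding space:

# Inspect the raw tokens behind the two encodings. With the timestamp
# present, the following text is encoded without its leading space and
# therefore maps to different subword pieces.
print(tokenizer.convert_ids_to_tokens(encoding_with))
print(tokenizer.convert_ids_to_tokens(encoding_without))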

@ArthurZucker (Collaborator)

I am just not getting these results on main 😵‍💫

@sanchit-gandhi (Contributor, Author)

As discussed offline, this is because we need to load the tokenizer, override the added tokens, and re-push it with the added_tokens_decoder. However, there's currently a bug where all special token IDs are loaded with lstrip=rstrip=True, regardless of how they're saved in the tokenizer file, and they can't be overridden using tokenizer.add_tokens(...).

I'll remove these special tokens and re-add them for the Whisper model. We can explore a proper fix in the library as a long-term solution.
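
For anyone debugging this, a quick way to inspect the flags the tokenizer actually loaded (illustrative; assumes a transformers version that exposes added_tokens_decoder):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# Per the bug described above, lstrip/rstrip come back True regardless
# of what is saved in the tokenizer files.
for token in tokenizer.added_tokens_decoder.values():
    if token.content == "<|0.00|>":
        print(token.content, "lstrip:", token.lstrip, "rstrip:", token.rstrip)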

@ArthurZucker (Collaborator)

Fixed by #26538!

@github-actions github-actions bot closed this Nov 5, 2023
@huggingface huggingface deleted a comment from github-actions bot Dec 1, 2023
@sanchit-gandhi sanchit-gandhi reopened this Dec 1, 2023
@sanchit-gandhi (Contributor, Author) commented Dec 1, 2023

Running the CI to see if we test the correct tokenizer encoding.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Jan 3, 2024