
[Whisper Tokenizer] Test timestamps #26053

Closed

Conversation

@sanchit-gandhi (Contributor) commented Sep 8, 2023

What does this PR do?

Adds a test to check that the Whisper tokenizer correctly encodes/decodes the newly added timestamp tokens. This is quite a strict test, since we check both that the encoded IDs match the expected IDs and that the decoded string matches the target string.

Currently, this test fails: the slow tokenizer removes spaces after special tokens (see #24476 (comment)), and the fast tokenizer gives the wrong IDs (see #24476 (comment)).

We can be confident the Whisper tokenizers are working as expected once this test is green.
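
For illustration, the round-trip property being tested looks roughly like this (a sketch reusing the example string from the discussion below, not the actual test code in this PR):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

input_str = "<|0.00|> Whisper can do timestamps?<|2.60|>"
ids = tokenizer(input_str, add_special_tokens=False).input_ids
decoded_str = tokenizer.decode(ids, decode_with_timestamps=True)

# At the time of writing this check fails: the decoded string loses
# the space after <|0.00|>.
assert decoded_str == input_str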

cc @ArthurZucker

@HuggingFaceDocBuilderDev (bot)

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@ArthurZucker (Collaborator) commented Sep 28, 2023

Hey!

> the slow tokenizer removes spaces after special tokens

This can be fixed by updating the AddedTokens to have lstrip and rstrip set to False and pushing the tokenizer again.

> the fast tokenizer gives the wrong IDs

This seems fixed on main, no?

from transformers import WhisperTokenizer, WhisperTokenizerFast

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
tokenizer_fast = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

# Both tokenizers should produce identical ids for a timestamped string.
print(tokenizer.encode("<|0.00|> hey"))
print(tokenizer_fast.encode("<|0.00|> hey"))

Loading the tokenizer from the special_tokens_map.json and the added_tokens.json will be removed in transformers 5, it is kept for forward compatibility, but it is recommended to update your tokenizer_config.json by uploading it again. You will see the new added_tokens_decoder attribute that will store the relevant information.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[50258, 50363, 50364, 17230, 50257]
[50258, 50363, 50364, 17230, 50257]
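
For reference, a sketch of the fix described above, re-registering the timestamp tokens with stripping disabled (note the later comment in this thread that overriding already-registered tokens via add_tokens does not currently take effect):

from transformers import WhisperTokenizer, AddedToken

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# Timestamp tokens with lstrip/rstrip disabled, so surrounding spaces
# are left intact during encoding.
timestamps = [AddedToken("<|%.2f|>" % (i * 0.02), lstrip=False, rstrip=False) for i in range(1500 + 1)]
tokenizer.add_tokens(timestamps)

# The updated tokenizer would then be pushed to the Hub, e.g.:
# tokenizer.push_to_hub("openai/whisper-tiny")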

@sanchit-gandhi (Contributor, Author)

Hey @ArthurZucker - currently these tests fail because the Whisper tokenizer does not preserve the space between an added token and the subsequent token. See the following code snippet for an example where the space is removed:

from transformers import WhisperTokenizerFast

tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

input_str = "<|0.00|> Whisper can do timestamps?<|2.60|>"
encoding = tokenizer(input_str, add_special_tokens=False).input_ids
decoded_str = tokenizer.decode(encoding, decode_with_timestamps=True)

print(decoded_str)

Print Output:

<|0.00|>Whisper can do timestamps?<|2.60|>

Looking at the diff between this and the original:

<|0.00|> Whisper can do timestamps?<|2.60|>
<|0.00|>Whisper can do timestamps?<|2.60|>

we see that the space between the added token <|0.00|> and the first non-added token has been dropped.

This is using the timestamp tokens added with lstrip=rstrip=False, as per the code snippet: #24476 (comment)

Even if we try with lstrip=rstrip=True, we don't get the expected behaviour:

from transformers import WhisperTokenizerFast, AddedToken

# specify revision that does not have timestamp tokens added
tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny", revision="1135fb87afc1dd37439285d93c0388041612e9a4")

# add the timestamp tokens
timestamps = [AddedToken("<|%.2f|>" % (i * 0.02), lstrip=True, rstrip=True) for i in range(1500 + 1)]
tokenizer.add_tokens(timestamps)

input_str = "<|0.00|> Whisper can do timestamps?<|2.60|>"
encoding = tokenizer(input_str, add_special_tokens=False).input_ids
decoded_str = tokenizer.decode(encoding, decode_with_timestamps=True)

print(decoded_str)

Print Output:

<|0.00|>Whisper can do timestamps?<|2.60|>

This is because we get different encodings with and without timestamp tokens:

encoding_with = tokenizer("<|0.00|> The", add_special_tokens=False).input_ids
encoding_without = tokenizer(" The", add_special_tokens=False).input_ids

print("With timestamps:    ", encoding_with)
print("Without timestamps: ", encoding_without)

Print Output:

With timestamps:     [50364, 2471, 271, 610]
Without timestamps:  [41132, 610]

We can see the missing space in these tokens:

print(tokenizer.decode(encoding_with))
print(tokenizer.decode(encoding_without))

Print Output:

The
 The
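
Continuing the snippet above, a token-level view makes the lost space explicit; in Whisper's byte-level BPE vocabulary a leading Ġ marks a token that carries a preceding space:

# Inspect the raw tokens behind the two encodings. With the timestamp
# present, the following text is encoded without its leading space and
# therefore maps to different subword pieces.
print(tokenizer.convert_ids_to_tokens(encoding_with))
print(tokenizer.convert_ids_to_tokens(encoding_without))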

@ArthurZucker (Collaborator)

I am just not getting these results on main 😵‍💫

@sanchit-gandhi (Contributor, Author)

As discussed offline, this is because we need to load the tokenizer, override the added tokens, and re-push it with the added_tokens_decoder. However, there's currently a bug where all special token IDs are loaded with lstrip=rstrip=True, regardless of how they're saved in the tokenizer file, and they can't be overridden using tokenizer.add_tokens(...).

I'll remove these special tokens and re-add them for the Whisper model. We can explore a proper fix in the library as a long-term solution.
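
For anyone debugging this, a quick way to inspect the flags the tokenizer actually loaded (illustrative; assumes a transformers version that exposes added_tokens_decoder):

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny")

# Per the bug described above, lstrip/rstrip come back True regardless
# of what is saved in the tokenizer files.
for token in tokenizer.added_tokens_decoder.values():
    if token.content == "<|0.00|>":
        print(token.content, "lstrip:", token.lstrip, "rstrip:", token.rstrip)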

@ArthurZucker (Collaborator)

Fixed by #26538!

@github-actions github-actions bot closed this Nov 5, 2023
@huggingface huggingface deleted a comment from github-actions bot Dec 1, 2023
@sanchit-gandhi sanchit-gandhi reopened this Dec 1, 2023
@sanchit-gandhi (Contributor, Author) commented Dec 1, 2023

Running the CI to see if we test the correct tokenizer encoding.

@github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Jan 3, 2024