[Whisper] Fix whisper integration tests #34111
Conversation
Hey @eustlb, let me know when you need a proper review.
In the meantime, I've left some formatting comments and a question: these are the expected results when generating with the OpenAI code, right? Are they the results when padding up to 30s (like we do) or when appending 30s of zero-padded audio (as OpenAI does)?
gen_kwargs = {
    "return_timestamps": True,
    "no_speech_threshold": 0.6,  # necessary to trigger no speech detection
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "compression_ratio_threshold": 1.35,
    "condition_on_prev_tokens": False,
    "logprob_threshold": -2.0,  # necessary to avoid triggering temp fallback that will introduce randomness since we are comparing to openai EXPECTED_TEXT
}
Some of these are already used by default, let's remove them to improve readability
All the ones with numerical values here are not set by default (neither in generate nor in the model's generation_config.json). Concerning return_timestamps and condition_on_prev_tokens, I find it clearer to have them explicitly mentioned.
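For context, a minimal hedged sketch of how such a dict is consumed in a test: it is simply unpacked into Whisper's generate. The checkpoint and the dummy LibriSpeech sample are illustrative choices, not the exact setup from this PR.

from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
inputs = processor(ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt")

gen_kwargs = {  # same keys as the dict quoted in the diff above
    "return_timestamps": True,
    "no_speech_threshold": 0.6,
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    "compression_ratio_threshold": 1.35,
    "condition_on_prev_tokens": False,
    "logprob_threshold": -2.0,
}

# None of the numerical values are Whisper generate defaults; they only take
# effect because the dict is unpacked into the call.
generated_ids = model.generate(inputs.input_features, **gen_kwargs)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))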
Indeed, these are the expected results generated with the OAI code, inferred on mel input features extracted through our WhisperFeatureExtractor. See the detailed explanation in the PR description below ("why we work from a whisper fork") for how the two input cases are handled.
Co-authored-by: Yoach Lacombe <[email protected]>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
As mentioned here, we won't return special tokens anymore with generate for Whisper. Let's adapt the tests a bit for that.
Thanks for this PR @eustlb, this work, although a bit time-consuming, should have been done a long time ago! Congratulations on doing it so thoroughly.
Also, thanks for providing code to reproduce every result. It'll greatly help future efforts on Whisper.
Most of the comments I've made are tiny, the PR looks great to me!
I think we might want to merge this at the same time or a bit before your other PR, but in the meantime, cc @ArthurZucker and @LysandreJik for a final review
@@ -2866,6 +2833,7 @@ def test_whisper_longform_single_batch_beam(self):
    "compression_ratio_threshold": 1.35,
    "condition_on_prev_tokens": True,
    "logprob_threshold": -1.0,
    "renormalize_logits": True,  # necessary to match OAI beam search implementation
Great! Let's not forget to mention this somewhere in the docs
I am thinking about setting it to True by default in Whisper's generate (and adding it to the doc this way) and removing it from here. WDYT?
I think it would make more sense to leave it like that here, and change it (setting it to True by default in Whisper's generate and adding it to the doc this way) in this PR, since those changes are intended for it anyway.
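For illustration, a hedged sketch of where the flag enters a beam-search call; the checkpoint, dataset and kwargs are illustrative, not the exact test values. renormalize_logits asks generate to re-normalize scores after all logits processors have run, which is what is needed to match OAI's beam search.

from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
sample = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")[0]["audio"]
inputs = processor(sample["array"], sampling_rate=16_000, return_tensors="pt")

generated_ids = model.generate(
    inputs.input_features,
    num_beams=5,
    renormalize_logits=True,  # re-normalize scores after logits processors, as OAI's beam search does
    return_timestamps=True,
)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))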
self.assertListEqual(generated_ids.tolist()[0], generated_ids_forced.tolist()[0])

@slow
def test_generate_with_prompt_ids_task_language(self):
Great test here! 🔥
LGTM, thanks for the meticulous work!
Let's update to main and merge this just a tiny bit before the other PRs. Also, let's run the slow Whisper tests on this PR, so that we can verify that your two other PRs fix these new tests.
Thanks a lot for this PR. The whisper codebase evolved quite a lot since the first release and it's nice to freshen things up!
Very nice that you have a full reproducing recipe; this is something that was quite lacking on my end, thanks for improving our port! 🤗
If the tests fail, we get no help; using torch.testing.assert_close will tell us how close / far we are!
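To make the suggestion concrete, a small sketch of the difference; the token ids below are made up for illustration.

import torch

expected = torch.tensor([50258, 50363, 2425, 11, 452, 1536, 318])
generated = torch.tensor([50258, 50363, 2425, 11, 452, 1536, 319])

try:
    # Unlike assertListEqual, which only says the lists differ, this reports the
    # number of mismatched elements and the greatest absolute/relative difference.
    torch.testing.assert_close(generated, expected)
except AssertionError as err:
    print(err)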
* fix test_tiny_timestamp_generation
* fix test_large_timestamp_generation
* fix test_whisper_shortform_single_batch_prev_cond
* fix test_whisper_shortform_multi_batch_hard_prev_cond
* return_timestamps necessary with long form
* fix test_default_multilingual_transcription_long_form
* fix test_tiny_token_timestamp_generation_longform
* fix test_whisper_longform_multi_batch_hard
* Update tests/models/whisper/test_modeling_whisper.py
  Co-authored-by: Yoach Lacombe <[email protected]>
* fix typo
* do not expect special tokens
* fix test_whisper_longform_single_batch_beam
* fix test_whisper_longform_multi_batch_hard_prev_cond
* update test_whisper_longform_multi_batch_hard_prev_cond
* update test_whisper_longform_multi_batch_hard_prev_cond
* these tests do not make sense anymore
* this test does not make sense anymore
* make fixup
* suggested nits
* add test with forced_decoder_ids
* this test does not make sense anymore
* change assert for unittest test cases
* make fixup
* test with prompt_ids and task and language
* fix unittest test case call
* fix test_tiny_generation
* fix test_tiny_en_generation
* fix test_tiny_en_batched_generation
* fix test_tiny_longform_timestamps_generation
* fix test_tiny_timestamp_generation
* fix test_large_generation
* fix test_large_batched_generation
* fix test_large_generation_multilingual
* fix test_large_timestamp_generation
* fix test_large_timestamp_generation
* fix test_tiny_token_timestamp_generation_longform
* fix test_tiny_en_batched_generation
* make fixup
* [run-slow] whisper

Co-authored-by: Yoach Lacombe <[email protected]>
What does this PR do?
This PR fixes multiple errors in Whisper integration tests and expected outputs.
To compute the correct expected outputs, it is necessary to work from a very simple fork of the original OpenAI Whisper implementation. Indeed, the extraction of the mel spectrogram in WhisperFeatureExtractor diverges slightly from OpenAI's: we pad the audio array to 30sec / to the longest with 0.0s and then extract the spectrogram through a batched STFT, while OpenAI adds 30sec of 0.0s to the audio array (and does not pad to 30sec). This way, we are sure that the model inputs for our and OpenAI's implementations are exactly the same. With this, we can use the following protocol to compute the expected outputs for the tests:
Important
Code to reproduce the outputs for each of the verified tests can be found here.
Edit: some more details about why we work from a whisper fork
In Transformers, we have two input possibilities for Whisper:
case 1
With an audio <= 30sec, the difference between our implementation and OAI's is that we first pad the audio to 30sec with 0.0s, then extract features, and this is the input to the model's forward; OAI instead pads the audio by appending 30sec of 0.0s, extracts features, slices the mel to the exact number of frames covered by the audio, and then pads the mel spectrogram to 3000 frames with 0.0s.
To understand better, for an audio of 10secs (see the sketch after these two points):
Transformers: audio + 20sec of 0.0s → mel spectrogram of shape [80, 3000], where the last 2000 frames (from frame 1000 on) are close to, but not exactly, 0.0
OAI: audio + 30sec of 0.0s → mel spectrogram of shape [80, 4000] → sliced to the duration of the audio (so up to frame 1000) and then padded with 0.0s to 3000 frames: the last 2000 frames are exactly 0.0
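A hedged sketch of those two pipelines for the 10sec example, assuming a recent openai-whisper and the whisper-tiny feature extractor; the helper calls and numbers illustrate the description above and are not code from this PR.

import numpy as np
import whisper  # openai-whisper
from transformers import WhisperFeatureExtractor

SR = 16_000
audio = np.random.randn(10 * SR).astype(np.float32)  # stand-in for a 10 s clip

# Transformers: pad the waveform to 30 s with 0.0s, then a batched STFT -> (80, 3000)
fe = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
hf_mel = fe(audio, sampling_rate=SR, return_tensors="np").input_features[0]

# OpenAI: append 30 s of 0.0s to the waveform, compute the mel (-> (80, 4000)),
# slice to the audio duration (1000 frames), then zero-pad the mel itself to 3000 frames
oai_mel = whisper.log_mel_spectrogram(audio, padding=30 * SR)        # (80, 4000)
oai_segment = whisper.pad_or_trim(oai_mel[:, :1000], 3000).numpy()   # (80, 3000)

print(hf_mel.shape, oai_segment.shape)      # both (80, 3000)
print(np.abs(oai_segment[:, 1000:]).max())  # exactly 0.0: the mel itself was zero-padded
print(np.abs(hf_mel[:, 1000:]).max())       # per the description above, not exactly 0.0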
case 2
No differences (other than numerical differences due to the STFT implementation).
About the implementation in the simple whisper fork:
We just take the mel spectrogram and concatenate it with 3000 frames of 0.0s; this emulates the 30sec of 0.0s added originally (a minimal sketch follows below).
For case 1, the duration considered by OAI is 30sec (see this line), and therefore the audio segment given to the forward is the exact mel input that was given.
For case 2, likewise, the duration considered is that of the given mel input.
By running OAI inference directly on the mel spectrogram (either exactly 3000 frames, or more than 3000 frames), we ensure that each forward pass of OAI's Whisper and of ours gets the exact same mel spectrogram. This ensures that the expected results we have in the tests are indeed the results that should be expected from the OAI implementation given the same input mel.
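As a rough sketch of that idea (hypothetical code, not the actual fork; the real change lives inside the fork's transcribe):

import torch

def emulate_oai_padding(mel: torch.Tensor) -> torch.Tensor:
    """Append 3000 frames of 0.0 to a precomputed (n_mels, T) mel spectrogram,
    emulating the 30 s of zero samples OpenAI's transcribe() normally appends
    to the waveform before feature extraction."""
    return torch.cat([mel, torch.zeros(mel.shape[0], 3000, dtype=mel.dtype)], dim=-1)

# transcribe() then treats the original T frames as the audio content, so every
# forward pass sees exactly the mel we fed in (cases 1 and 2 above).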
Note
For tests that require batched inference, which is not supported by the OAI implementation, I simply run the samples sequentially to get the outputs.
TODO
Tests to be verified and corrected where needed
✅ for a correct test
❌ for an incorrect one
"all-segments"
has no equivalent in OAI implem