[Whisper] 🚨 Fix whisper decoding 🚨 #34135
Conversation
Even if we usually return the […]. For these reasons, I think it is better to stick with OpenAI's choice: return only the generated tokens. Moreover, this is the way it is currently implemented in Transformers: long-form generation does not return context tokens. As a consequence, this also comes with the advantage of requiring fewer changes in the current codebase, reducing the potential for mistakes. WDYT @ylacombe? Also pinging @ArthurZucker here since you've worked on Whisper integration.
Correction:
Sounds good!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks a lot @ylacombe for the review. I updated it based on your comments (see the updated PR comment)!
Thanks @eustlb for iterating! LGTM now!
Thanks for removing the wrapper!
The four failing slow tests are expected! Merging 🤗
What does this PR do?
This PR finalizes #30984, which enabled short-form (<= 30 sec) and long-form generation using temperature fallback. Indeed, the original OpenAI implementation uses the same decode_with_fallback for both short-form and long-form audio, while we used to apply temperature fallback only to long-form audio.
It aims to solve issues and divergences with respect to the original implementation:
• a miscalculation of avg_logprobs that triggered temperature fallback when it should not
• the choice of returned tokens, for which there are two options:
1. decoder_input_ids + predicted tokens (including the EOS token)
2. predicted tokens (without the EOS token)
Since short-form and long-form generation are now merged, we need a consistent way of returning outputs (at least for the tokens; we still need to differentiate for past_key_values, see here). To be consistent with the generate convention, and since the tokenizer has the skip_special_tokens argument, I went for option 1.
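To make the unified behavior concrete, here is a minimal, hedged sketch of calling generation with temperature fallback, which now follows the same path for short-form and long-form audio. The checkpoint name (openai/whisper-large-v3) and the threshold values are illustrative assumptions, not prescribed by this PR; the fleurs sample is taken from the evaluation sets listed further down.

```python
# Minimal sketch (assumed checkpoint and thresholds): with this PR, the same
# temperature-fallback path is used for short-form (<= 30 s) and long-form audio.
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

sample = load_dataset("google/fleurs", "en_us", split="test[:1]")[0]["audio"]
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")

predicted_ids = model.generate(
    inputs.input_features,
    # fallback schedule: retry decoding at higher temperatures when checks fail
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    logprob_threshold=-1.0,            # fall back if avg_logprob is below this
    compression_ratio_threshold=1.35,  # fall back on overly repetitive output
)

# special tokens can simply be stripped at decode time
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(text)
```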
🚨 Important changes 🚨
➡️ Previously:
• Short-form: returned a ModelOutput or torch.LongTensor, including decoder input IDs and the EOS token ID.
• Long-form: returned a Dict or torch.LongTensor, excluding decoder input IDs and the EOS token ID.
➡️ From now on:
• Short-form and long-form generation are now treated identically, meaning output differentiation based on these modes is no longer applicable.
• Decoder input IDs and the EOS token ID are never returned, except in two specific cases: when return_dict_in_generate=True and (return_timestamps=False or force_unique_generate_call=True). In these cases, the output is a ModelOutput, the result of the underlying call to GenerationMixin's generate. Indeed, return_timestamps=False ensures no seeking occurs and only a single call to generate is made, so this output includes both the decoder input IDs and the EOS token ID.
Testing
Note
This PR reconciles our implementation with OpenAI's. It should therefore be tested with the updated and verified tests introduced in #34111
Evaluations
Let's verify the effectiveness of this fix. We will evaluate both accuracy and inference speed on four short-form test sets and two long-form test sets (samples are effectively filtered as ≤30 sec and >30 sec). For the short-form sets, we will compare results with the current main branch of Transformers on the first 100 samples only: the avg_logprobs miscalculation on main triggers spurious temperature fallbacks, which makes running the full test sets too slow.
As explained in this PR, we need to use a simple Whisper fork to ensure both implementations are fed exactly the same input features.
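For reference, the short-form/long-form split by duration can be done along the lines of the sketch below. The dataset and config names are taken from the evaluation sets listed further down; the 30-second cutoff corresponds to Whisper's chunk length.

```python
# Sketch: partition a test set into short-form (<= 30 s) and long-form (> 30 s)
# samples by audio duration.
from datasets import load_dataset

ds = load_dataset("google/fleurs", "en_us", split="test")

def duration_s(example):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]

short_form = ds.filter(lambda ex: duration_s(ex) <= 30.0)
long_form = ds.filter(lambda ex: duration_s(ex) > 30.0)
print(len(short_form), len(long_form))
```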
Important
TL;DR: We achieve perfect 1-to-1 matching in prediction results with greedy decoding for both short-form and long-form samples, validating that our implementation matches OpenAI's original decoding algorithm. For short-form, this PR is ~5x faster than the current implementation, which wastes time on multiple incorrect temperature fallbacks. ❗
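For context, a greedy-decoding comparison along these lines can be sketched as below. Checkpoint names are assumptions, and, as noted above, exact 1-to-1 matching requires feeding both implementations the same input features (hence the Whisper fork), so this simplified version may still show small divergences from feature extraction.

```python
# Hedged sketch: compare greedy transcriptions from openai-whisper and
# Transformers on one short-form sample (checkpoint names are assumptions).
import numpy as np
import whisper  # pip install openai-whisper
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

sample = load_dataset("google/fleurs", "en_us", split="test[:1]")[0]["audio"]
audio = sample["array"].astype(np.float32)

# OpenAI reference, greedy decoding (a single temperature of 0.0 disables fallback)
oa_model = whisper.load_model("large-v3")
oa_text = oa_model.transcribe(audio, temperature=0.0)["text"]

# Transformers, greedy decoding
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
inputs = processor(audio, sampling_rate=sample["sampling_rate"], return_tensors="pt")
pred = model.generate(inputs.input_features, do_sample=False)
hf_text = processor.batch_decode(pred, skip_special_tokens=True)[0]

print(oa_text.strip())
print(hf_text.strip())
```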
Results 📊
→ short-form (first 100 samples)
Set 1: edinburghcstr/ami, config "ihm", split test[:100]
Set 2: distil-whisper/chime4, config "1-channel", split test[:100]
Set 3: google/fleurs, config "en_us", split test[:100]
Wandb results here!
→ short-form (full test sets)
Set 1: edinburghcstr/ami, config "ihm", split test
Set 2: distil-whisper/chime4, config "1-channel", split test
Set 3: google/fleurs, config "en_us", split test
Wandb results here!
→ long-form
Set 1: distil-whisper/tedlium-long-form, config "default", split test
Set 2: distil-whisper/meanwhile, config "default", split test
Wandb results here!