[Whisper] 🚨 Fix whisper decoding 🚨 #34135
Conversation
Even if we usually return the […]. For these reasons, I think it is better to stick with OpenAI's choice: return only the generated tokens. Moreover, this is the way it is currently implemented in Transformers: long-form generation does not return context tokens. As a consequence, this also comes with the advantage of requiring fewer changes in the current codebase, reducing the potential for mistakes. WDYT @ylacombe? Also pinging @ArthurZucker here since you've worked on Whisper integration.
Correction:
Sounds good!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks a lot @ylacombe for the review. I updated it based on your comments (see the updated PR comment)!
Thanks @eustlb for iterating! LGTM now!
Thanks for removing the wrapper!
The four failing slow tests are expected! Merging 🤗
What does this PR do?
This PR finalizes #30984, which enabled short-form (<= 30 sec) and long-form generation using temperature fallback. Indeed, the original OpenAI implementation uses the same decode_with_fallback for both short-form and long-form audio, while we used to apply temperature fallback only to long-form audio.
It aims to solve issues and divergences with respect to the original implementation:
• a miscalculation of avg_logprobs that triggered temperature fallback when it should not
• the choice of returned tokens, for which there are two options:
1. decoder_input_ids + predicted tokens (including the EOS token)
2. predicted tokens (without the EOS token)
Since short-form and long-form generation are now merged, we need a consistent way of returning outputs (at least for the tokens; we still need to differentiate for past_key_values, see here). To be consistent with the generate convention, and since the tokenizer has the skip_special_tokens argument, I went for option 1.
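To make the unified behavior concrete, here is a minimal, hedged sketch of calling generation with temperature fallback, which now follows the same path for short-form and long-form audio. The checkpoint name (openai/whisper-large-v3) and the threshold values are illustrative assumptions, not prescribed by this PR; the fleurs sample is taken from the evaluation sets listed further down.

```python
# Minimal sketch (assumed checkpoint and thresholds): with this PR, the same
# temperature-fallback path is used for short-form (<= 30 s) and long-form audio.
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

sample = load_dataset("google/fleurs", "en_us", split="test[:1]")[0]["audio"]
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")

predicted_ids = model.generate(
    inputs.input_features,
    # fallback schedule: retry decoding at higher temperatures when checks fail
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    logprob_threshold=-1.0,            # fall back if avg_logprob is below this
    compression_ratio_threshold=1.35,  # fall back on overly repetitive output
)

# special tokens can simply be stripped at decode time
text = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(text)
```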
🚨 Important changes 🚨
➡️ Previously:
• Short-form: returned a ModelOutput or torch.LongTensor, including decoder input IDs and the EOS token ID.
• Long-form: returned a Dict or torch.LongTensor, excluding decoder input IDs and the EOS token ID.
➡️ From now on:
• Short-form and long-form generation are now treated identically, meaning output differentiation based on these modes is no longer applicable.
• Decoder input IDs and the EOS token ID are never returned, except in two specific cases: when return_dict_in_generate=True and (return_timestamps=False or force_unique_generate_call=True). In these cases, the output is a ModelOutput, the result of the underlying call to GenerationMixin's generate. Indeed, return_timestamps=False ensures no seeking occurs and only a single call to generate is made, so this output includes both the decoder input IDs and the EOS token ID.
Testing
Note
This PR reconciles our implementation with OpenAI's. It should therefore be tested with the updated and verified tests introduced in #34111
Evaluations
Let's verify the effectiveness of this fix. We will evaluate both accuracy and inference speed on four short-form test sets and two long-form test sets (samples are effectively filtered as ≤30 sec and >30 sec). For the short-form sets, we will compare results with the current main branch of Transformers on the first 100 samples only: the avg_logprobs miscalculation on main triggers spurious temperature fallbacks, which makes running the full test sets too slow.
As explained in this PR, we need to use a simple Whisper fork to ensure both implementations are fed exactly the same input features.
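For reference, the short-form/long-form split by duration can be done along the lines of the sketch below. The dataset and config names are taken from the evaluation sets listed further down; the 30-second cutoff corresponds to Whisper's chunk length.

```python
# Sketch: partition a test set into short-form (<= 30 s) and long-form (> 30 s)
# samples by audio duration.
from datasets import load_dataset

ds = load_dataset("google/fleurs", "en_us", split="test")

def duration_s(example):
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]

short_form = ds.filter(lambda ex: duration_s(ex) <= 30.0)
long_form = ds.filter(lambda ex: duration_s(ex) > 30.0)
print(len(short_form), len(long_form))
```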
Important
TL;DR: We achieve perfect 1-to-1 matching in prediction results with greedy decoding for both short-form and long-form samples, validating that our implementation matches OpenAI's original decoding algorithm. For short-form, this PR is ~5x faster than the current implementation, which wastes time on multiple incorrect temperature fallbacks. ❗
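For context, a greedy-decoding comparison along these lines can be sketched as below. Checkpoint names are assumptions, and, as noted above, exact 1-to-1 matching requires feeding both implementations the same input features (hence the Whisper fork), so this simplified version may still show small divergences from feature extraction.

```python
# Hedged sketch: compare greedy transcriptions from openai-whisper and
# Transformers on one short-form sample (checkpoint names are assumptions).
import numpy as np
import whisper  # pip install openai-whisper
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

sample = load_dataset("google/fleurs", "en_us", split="test[:1]")[0]["audio"]
audio = sample["array"].astype(np.float32)

# OpenAI reference, greedy decoding (a single temperature of 0.0 disables fallback)
oa_model = whisper.load_model("large-v3")
oa_text = oa_model.transcribe(audio, temperature=0.0)["text"]

# Transformers, greedy decoding
processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")
inputs = processor(audio, sampling_rate=sample["sampling_rate"], return_tensors="pt")
pred = model.generate(inputs.input_features, do_sample=False)
hf_text = processor.batch_decode(pred, skip_special_tokens=True)[0]

print(oa_text.strip())
print(hf_text.strip())
```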
Results 📊
→ short-form (first 100 samples)
Set 1: edinburghcstr/ami, config "ihm", split test[:100]
Set 2: distil-whisper/chime4, config "1-channel", split test[:100]
Set 3: google/fleurs, config "en_us", split test[:100]
Wandb results here!
→ short-form (full test sets)
Set 1: edinburghcstr/ami, config "ihm", split test
Set 2: distil-whisper/chime4, config "1-channel", split test
Set 3: google/fleurs, config "en_us", split test
Wandb results here!
→ long-form
Set 1: distil-whisper/tedlium-long-form, config "default", split test
Set 2: distil-whisper/meanwhile, config "default", split test
Wandb results here!