Fix missing sequences_scores in the Whisper beam search output #32970
Conversation
@Nik-Kras Thank you for opening this issue 🤗 Regarding beam search: the core method was untouched, except for bug fixes on edge cases. In other words, a regression is possible but unlikely; our integration tests should have caught noticeable differences in quality 🤗 We did change the input preparation, though. If you can get us a snippet where the outputs differ between an old version and the current version, that would help us pin down the potential bug! 🐛 (Passing the questions in the issue along to @sanchit-gandhi / @kamilakesbi, as this is a Whisper-specific question :) )
Happy to contribute!
Above, I attached a code snippet to generate 5 hypotheses with beam search and print them along with sequences_scores. Here is the output for Transformers 4.25.1
Here is the output for Transformers 4.44.1 (after my fix to return sequences_scores)
I am only studying this topic, but according to articles that explain beam search:
At each step, you select the top-k candidate continuations for each beam, so the returned hypotheses should differ from one another. Tell me if I misunderstood beam search, but I've seen @patrickvonplaten explain it with top-k selection, which would result in different tokens. Couldn't find the source, unfortunately.
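To make that reasoning concrete, here is a minimal, self-contained sketch of a single beam search expansion step. It is illustrative only, not the transformers implementation; the toy scores and vocabulary size are made up:

```python
import torch

# One beam search step: expand each of the num_beams partial hypotheses with
# every vocabulary token, then keep the top num_beams candidates overall.
num_beams, vocab_size = 3, 5
beam_scores = torch.tensor([-0.1, -0.4, -0.9])  # cumulative log-probs of the current beams
next_token_logprobs = torch.randn(num_beams, vocab_size).log_softmax(dim=-1)

# Candidate score = beam score + next-token log-prob, flattened over (beam, token) pairs
candidate_scores = (beam_scores[:, None] + next_token_logprobs).view(-1)
top_scores, top_idx = candidate_scores.topk(num_beams)

beam_origin = top_idx // vocab_size  # which beam each surviving candidate extends
next_token = top_idx % vocab_size    # which token it appends
print(beam_origin, next_token, top_scores)
```

Because the survivors are distinct (beam, token) pairs, identical returned sequences would typically point to ties or to heuristics applied on top of this basic step.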
(the changes lgtm, but let's get the green light from one of our audio folks as well 🤗 )
@Nik-Kras ah, an important detail: Whisper has a more complex generation algorithm, which goes beyond the traditional beam search (i.e. it has many heuristics on top) 🤗 Have a look at the original paper, starting with Table 7.
@gante Thanks for sharing! I really appreciate your assistance, as it helps me learn more! I was sure that even if Whisper has additional algorithms on top of beam search, the results across different versions of Transformers should've been similar, if not the same. (I am just learning, so I might be wrong)
This is a valid argument that we're trying to honor in general! The exceptions are a) bug fixes and b) clear performance upgrades (quality or throughput). The changes for Whisper fall under clear performance upgrades -- see the original PR for the WER improvements.
Certainly! For instance, the modification to generate with a fallback, suggested in the paper, explicitly requires regenerating the output under certain conditions. It also overwrites a few parameters in some cases, including the number of beams for beam search.
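For intuition, here is a rough sketch of that temperature-fallback idea as described in the Whisper paper, not the actual transformers code. The helper `decode_segment` is a toy stand-in, and the thresholds follow the paper:

```python
# Sketch of the Whisper temperature-fallback heuristic from the paper.
# decode_segment is a hypothetical stand-in for a single decoding call.
temperatures = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]

def decode_segment(segment, num_beams, do_sample, temperature=0.0):
    # Toy stand-in returning a transcription plus the quality statistics
    # the fallback heuristic inspects.
    return {"text": "...", "avg_logprob": -0.5, "compression_ratio": 1.3}

def generate_with_fallback(segment, num_beams=5):
    result = None
    for temperature in temperatures:
        if temperature == 0.0:
            # Beam search is only used at temperature 0 ...
            result = decode_segment(segment, num_beams=num_beams, do_sample=False)
        else:
            # ... on fallback, the beam setting is dropped and sampling is used instead.
            result = decode_segment(segment, num_beams=1, do_sample=True, temperature=temperature)
        # Accept the result unless the heuristics flag it as a likely failure.
        if result["avg_logprob"] >= -1.0 and result["compression_ratio"] <= 2.4:
            return result
    return result  # last attempt if every temperature failed the checks

print(generate_with_fallback("dummy-segment"))
```

The relevant point for this discussion is the fallback branch: once it triggers, the beam search setting is effectively dropped, so outputs can legitimately differ from a plain beam search run.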
@Nik-Kras I wonder what exactly the "scores" and "sequences_scores" mean. I also wonder, if I pass output_logits=True to the generate method, where I can check the specific operation process. For example, if my label is [1, 12, 3] and I get the logits, how can I use the logits to calculate the CrossEntropy loss? Thanks!
I believe you are trying to find the source to help with your CrossEntropy issue. This pull request is quite different; it focuses on the beam search methodology for generating sequences. Read more about it here:
Below is my personal reasoning.
@Nik-Kras Thanks for the documents. I can get the logits, but for each timestep in beam search I wonder which index the beam search algorithm picks (sometimes not the largest one). I want to calculate the CrossEntropy loss at each timestep as a "metric" and finally get the N-best results. Or, to rephrase my problem: how can I get the N-best results with scores? I think it is inappropriate for me to ask questions under this pull request. Would it be better to ask such questions in issues? Thanks!!!
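Not part of this PR, but since the question came up: below is a minimal sketch of turning the per-step logits returned by generate(..., output_logits=True, return_dict_in_generate=True) into a token-level cross-entropy against a toy label sequence. The fake logits tensor stands in for outputs.logits, and the alignment assumptions (batch size 1, labels no longer than the generated steps) are mine:

```python
import torch
import torch.nn.functional as F

# Toy stand-in for outputs.logits from
#   model.generate(..., output_logits=True, return_dict_in_generate=True),
# which is a tuple with one (batch_size, vocab_size) tensor per generated step.
vocab_size, num_steps = 51865, 5
logits_per_step = tuple(torch.randn(1, vocab_size) for _ in range(num_steps))

labels = torch.tensor([1, 12, 3])                  # toy label sequence

step_logits = torch.stack(logits_per_step, dim=1)  # (batch, steps, vocab_size)
step_logits = step_logits[0, : labels.shape[0]]    # align the first len(labels) steps

loss_per_token = F.cross_entropy(step_logits, labels, reduction="none")
print(loss_per_token, loss_per_token.mean())
```

Note that with beam search the per-step scores cover all beams, so you would additionally need beam_indices (one of the outputs this PR restores) to pick out the rows belonging to each returned sequence.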
Hey @Nik-Kras, thanks for opening this PR!
You got your reasoning right, and we are indeed missing sequences_scores. The fix seems great to me!
Do you think:
- You could also take care of beam_indices, as indicated in #32373 ("Some Whisper beam search output (sequences_scores, etc.) is lost in _stack_split_outputs")?
- Add a test here to make sure the error doesn't happen again (a rough sketch of what such a test could check appears after this comment)?
Regarding the change in beam search behaviour, let's open a subsequent issue about it once your PR is merged! I believe it might be solved by #32336; don't hesitate to try it out and let us know if it solves your issue.
cc @eustlb for visibility
Also, how long is your input audio?
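For reference, here is a rough sketch of what such a regression test could check. It is illustrative only, not the test actually added in this PR; the model name, dummy input, and exact assertions are assumptions:

```python
import torch
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

def test_whisper_beam_search_output_fields():
    # Beam search with num_return_sequences > 1 should return sequences_scores
    # and beam_indices with consistent leading dimensions.
    processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
    model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

    dummy_audio = torch.zeros(16000).numpy()  # 1 second of silence at 16 kHz
    inputs = processor(dummy_audio, sampling_rate=16000, return_tensors="pt")

    outputs = model.generate(
        inputs["input_features"],
        num_beams=5,
        num_return_sequences=5,
        output_scores=True,
        return_dict_in_generate=True,
    )

    assert outputs.sequences_scores is not None
    assert outputs.sequences_scores.shape[0] == 5
    assert outputs.beam_indices is not None
    assert outputs.beam_indices.shape[0] == 5
```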
@ylacombe The audio is 7 seconds long. It is taken from Common Voice 17, and you can find it by the ground-truth transcription "How is Mozilla going to handle ambiguities like queue and cue?". I did put a link to the audio from HuggingFace, but it seems to be broken. Trying again: link. I am a bit distracted by other projects at the moment, but the test should be very easy to write; I will add it soon.
Hey @Nik-Kras, thanks for circling back on this!
@ylacombe I will allocate some time early next week, if that is fine with you |
@ylacombe All done. It was an easy fix; most of the time was spent setting it up :) However, I would like to draw your attention to a side issue I found: the inconsistency of Whisper's generation across different versions. I explained the comparison above, mentioning specific versions and my observations. Please have a look at the changes in how the output is generated. I am unsure how to spot the bug there, but I strongly believe there is one.
Thanks for the fix @Nik-Kras! LGTM! You should run make fixup to fix the code quality issues, as indicated here.
Regarding the change of behaviour, this is indeed really concerning. It might be related to #32246, which I've opened a PR for here: #32336
I'll try using #32336 on top of your PR to see if it solves your issue. I'll let you know how it goes.
If it doesn't solve the regression, then let's open another issue with a reproducer and the expected results!
@ylacombe Here are the results of an experiment.

Code:

```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import librosa

# Load the processor and model
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

# Load and preprocess the audio file
audio_path = "audio.mp3"
audio, sr = librosa.load(audio_path, sr=16000)  # Ensure the sample rate is 16 kHz

# Preprocess the audio to get the input features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate the transcription using beam search
beam_outputs = model.generate(
    inputs["input_features"],
    num_beams=5,               # Number of beams
    num_return_sequences=5,    # Number of hypotheses to return
    early_stopping=True,
    output_scores=True,
    return_dict_in_generate=True,
)

# Decode the generated transcriptions
hypotheses = [processor.decode(output_ids, skip_special_tokens=True) for output_ids in beam_outputs.sequences]

# Print out the hypotheses with their scores
for i, hypothesis in enumerate(hypotheses):
    print(f"Hypothesis {i + 1}: {hypothesis}. Score: {beam_outputs.sequences_scores[i]}")
```

transformers v4.25.1
transformers v4.44.1 + My Fix
transformers v4.44.1 + My Fix + Your Fix
There is something else: the core algorithm of beam search seems incorrect. The algorithm should NOT select the same tokens for different beams; that is the point of the top-k selection in beam search.
Nik-Kras, thanks for taking the time to test it. I have the same results on my side. I managed to identify why this is happening: it was introduced in #30984. PS: don't forget to run make fixup!
Thanks for iterating! LGTM! Would you like to open another issue for the problem you've identified?
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for fixing and adding a test!
Fix missing sequences_scores in the Whisper beam search output (huggingface#32970)
* added sequences_scores to the output
* added beam_indices to output
* added test to check for beam_indices, sequences_scores and their shape
* removed redundant whitespaces
* make fixup
I can still reproduce this bug with the latest transformers.
Indeed, it has not been solved yet. It's mainly due to the fact that the best sequences of beam search are ambiguous when it comes to Whisper's generation, as the underlying implementation makes multiple (and possibly independent) calls to GenerationMixin's generate (the method that handles beam search). A heuristic to concatenate and sort sequences after those multiple calls should be found, or we should revert the feature, as it's not compatible out of the box with Whisper.
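To make the missing piece concrete, here is one possible shape such a heuristic could take. This is purely illustrative, not an implemented or agreed-upon solution; segment_results is a hypothetical structure holding, per generate() call, the ranked sequences and their sequences_scores:

```python
import torch

def merge_segment_hypotheses(segment_results, num_return_sequences):
    # Pair the i-th ranked hypothesis of each independent generate() call
    # (e.g. one call per 30 s segment), concatenate the token sequences,
    # and average the per-segment sequences_scores.
    merged = []
    for rank in range(num_return_sequences):
        tokens = torch.cat([seg["sequences"][rank] for seg in segment_results])
        score = torch.stack([seg["sequences_scores"][rank] for seg in segment_results]).mean()
        merged.append((tokens, score))
    # Re-sort, since averaging across segments can change the relative order.
    merged.sort(key=lambda item: float(item[1]), reverse=True)
    return merged

# Toy usage with two hypothetical segments, each carrying 2 ranked hypotheses.
segments = [
    {"sequences": [torch.tensor([1, 2]), torch.tensor([1, 3])],
     "sequences_scores": torch.tensor([-0.2, -0.5])},
    {"sequences": [torch.tensor([7, 8]), torch.tensor([7, 9])],
     "sequences_scores": torch.tensor([-0.1, -0.4])},
]
print(merge_segment_hypotheses(segments, num_return_sequences=2))
```

Whether pairing by rank is even meaningful across independent calls is exactly the ambiguity described above, which is why a proper heuristic (or a revert) still needs to be decided.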
What does this PR do?
Fixes the missing sequences_scores in the Whisper beam search output.

Transformers v4.44.1 still has a bug where sequences_scores are missing even when these values are explicitly requested. Below is a code snippet that fails in the current version with an error due to the missing sequences_scores, while it works fine with version 4.25.1.

To reproduce the experiment, download any speech audio file, install transformers 4.44.1 and 4.25.1, and compare the output.
In case you have problems with 4.25.1 due to jax and jaxlib versions, remove the version-check exceptions and adjust the versions in setup.py.

Audio used for this example: link
Code to reproduce the issue
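For convenience, here is an abbreviated version of the reproducer snippet shown earlier in the conversation (the audio path is a placeholder):

```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import librosa

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

audio, _ = librosa.load("audio.mp3", sr=16000)  # any 16 kHz speech clip
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

beam_outputs = model.generate(
    inputs["input_features"],
    num_beams=5,
    num_return_sequences=5,
    output_scores=True,
    return_dict_in_generate=True,
)

# On v4.44.1 before this fix, sequences_scores is missing (None), so the print below fails.
for i, seq in enumerate(beam_outputs.sequences):
    text = processor.decode(seq, skip_special_tokens=True)
    print(f"Hypothesis {i + 1}: {text}. Score: {beam_outputs.sequences_scores[i]}")
```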
Output before the fix:
Expected output:
Solves issue #32373.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
PS: I am not sure that the beam search behaviour is working as expected. The beam search algorithm in Transformers version 4.25.1 looks more natural to me: at least there is more diversity in the beams, and they are not just repeated. I assume there is another bug in the beam search algorithm in the current version, but I am not confident enough to say that. Could you please review the output of the current version above and the output of the older version below and tell me whether this is expected behaviour, @patrickvonplaten, @zucchini-nlp and @gante?
Output of the same code above, but with Transformers version 4.25.1, where the hypotheses are more diverse (which is expected due to top-k selection). The scores are also less similar than with Transformers version 4.44.1.