
Fix missing sequences_scores in the Whisper beam search output #32970

Merged · 5 commits merged on Sep 17, 2024

Conversation

Nik-Kras (Contributor)

What does this PR do?

Fixes the missing sequences_scores in the Whisper beam search output.
Transformers v4.44.1 still has a bug where sequences_scores are missing from the output even when they are explicitly requested. The code snippet below raises an error in the current version because sequences_scores is None, while it works fine with version 4.25.1.

To reproduce the experiment, download any speech audio file, install transformers 4.44.1 and 4.25.1, and compare the outputs.
If you have problems installing 4.25.1 due to jax and jaxlib versions, remove the version-check exceptions and change the pins in setup.py to:

"jax>=0.4",
"jaxlib>=0.4.10",

Audio used for this example: link

Code to reproduce the issue

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import librosa

# Load the processor and model
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

# Load and preprocess the audio file
audio_path = "audio.mp3"
audio, sr = librosa.load(audio_path, sr=16000)  # Ensure the sample rate is 16kHz

# Preprocess the audio to get the input features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate the transcription using Beam Search with the model
beam_outputs = model.generate(
    inputs["input_features"],
    num_beams=5,  # Number of beams
    num_return_sequences=5,  # Number of hypotheses to return
    early_stopping=True,
    output_scores=True, 
    return_dict_in_generate=True,
)

# Decode the generated transcriptions
hypotheses = [processor.decode(output_ids, skip_special_tokens=True) for output_ids in beam_outputs.sequences]

# Print out the hypotheses
for i, hypothesis in enumerate(hypotheses):
    print(f"Hypothesis {i + 1}: {hypothesis}. Score: {beam_outputs.sequences_scores[i]}")

Output before the fix:

TypeError: 'NoneType' object is not subscriptable

Expected output:

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495032072067261
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495031476020813
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495032072067261
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495031476020813
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495032072067261

Solves issue #32373.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

PS: I am not sure the beam search behaviour is working as expected. The beam search algorithm in Transformers version 4.25.1 looks more natural to me: at least there is more diversity in the beams, and they are not simply repeated. I suspect there is another bug in the beam search algorithm in the current version, but I am not confident enough to say so. Could you please review the output of the current version above and the output of the older version below and tell me whether this is expected behaviour @patrickvonplaten, @zucchini-nlp and @gante?

Output of the same code above, but with Transformers version 4.25.1, where the hypotheses are more diverse (as expected from the top-k candidate selection in beam search). The scores are also less similar than in Transformers version 4.44.1.

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.4627407491207123
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you and Q.. Score: -0.4789799749851227
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you, and cute.. Score: -0.48414239287376404
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you and cute.. Score: -0.4972183108329773
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you, and Q.. Score: -0.5054414868354797

@gante (Member)

gante commented Aug 23, 2024

@Nik-Kras Thank you for opening this issue 🤗

Regarding beam search: the core method was untouched, except for bug fixes on edge cases. In other words, a regression is possible but unlikely; our integration tests should have caught noticeable differences in quality 🤗 We did, however, change the input preparation for generate in Whisper, which might explain the changes you're seeing.

If you can get us a snippet where the outputs are different between an old version and the current version, that could help us pin the potential bug! 🐛

(passing the remaining questions in the issue along to @sanchit-gandhi / @kamilakesbi, as this is a Whisper-specific question :) )

@Nik-Kras (Contributor, Author)

Nik-Kras commented Aug 23, 2024

@Nik-Kras Thank you for opening this issue 🤗

Happy to contribute!

If you can get us a snippet where the outputs are different between an old version and the current version, that could help us pin the potential bug! 🐛

Above, I attached a code snippet to generate 5 hypotheses with beam search and print them along with sequences_scores.

Here is the output for Transformers 4.25.1

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.4627407491207123
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you and Q.. Score: -0.4789799749851227
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you, and cute.. Score: -0.48414239287376404
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you and cute.. Score: -0.4972183108329773
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you, and Q.. Score: -0.5054414868354797

Here is the output for Transformers 4.44.1 (after my fix to return sequences_scores)

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495032072067261
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495031476020813
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495032072067261
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495031476020813
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495032072067261

I am only studying this topic, but according to articles that explain beam search:

At each step, you keep the top num_beams candidates instead of greedily selecting only the single best token, so the beams should end up containing different tokens. While the beams can converge over part of the sequence (if the other candidates score too low), in the end they should still differ according to the algorithm, and I am not sure a change in input preparation alone could alter that.

Tell me if I misunderstood beam search, but I've seen @patrickvonplaten explain it with top-k selection, which would result in different tokens. Unfortunately, I couldn't find the source.
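
To make my mental model concrete, here is a minimal sketch of a single beam-search step as I understand it (illustrative only, with made-up tensor names; this is not the transformers implementation):

import torch

# One simplified beam-search step.
num_beams, vocab_size = 5, 51865
beam_scores = torch.zeros(num_beams)  # running log-prob of each beam
log_probs = torch.log_softmax(torch.randn(num_beams, vocab_size), dim=-1)  # next-token log-probs

# Every beam is extended with every vocabulary token, then the top
# num_beams candidates are kept across all beams (not just top-1 per beam).
candidate_scores = (beam_scores[:, None] + log_probs).view(-1)
top_scores, top_indices = candidate_scores.topk(num_beams)

beam_ids = top_indices // vocab_size  # which beam each survivor extends
token_ids = top_indices % vocab_size  # which token was appended to it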

@gante (Member) left a comment


(the changes lgtm, but let's get the green light from one of our audio folks as well 🤗 )

@gante (Member)

gante commented Aug 23, 2024

@Nik-Kras ah, an important detail: Whisper has a more complex generation algorithm that goes beyond traditional beam search (i.e. it has many heuristics on top) 🤗 Have a look at the original paper, starting with Table 7.

@Nik-Kras (Contributor, Author)

@gante Thanks for sharing! I really appreciate your assistance, as it helps me to learn more!

I was sure that even if Whisper has additional algorithms on top of beam search, the results across different versions of Transformers should have been similar, if not the same.
Do you think Whisper's deviation from the original beam search implementation could be a valid explanation of the behaviour above?

(I am just learning, so I might be wrong.)

@gante (Member)

gante commented Aug 23, 2024

@Nik-Kras

the results over different versions of Transformers should've been similar

This is a valid principle that we try to honor in general! The exceptions are a) bug fixes and b) clear performance upgrades (quality or throughput). The Whisper changes fall under clear performance upgrades -- see the original PR for the WER improvements.

Do you think that Whisper's deviation from the original Beam Search implementation could be a valid explanation of the behaviour above?

Certainly! For instance, the modification to generate with a fallback suggested in the paper explicitly regenerates the output under certain conditions. It also overrides a few parameters in some conditions, including the number of beams for beam search.
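
To make that concrete, here is a rough sketch of the fallback loop described in the paper (purely illustrative; generate_segment and the result fields are hypothetical names, and the thresholds are the paper's heuristics, not the transformers API):

def generate_with_fallback(generate_segment, temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)):
    """Retry a segment at increasing temperatures until the output looks reliable."""
    for temperature in temperatures:
        # At temperature 0 the paper uses beam search (num_beams=5); at higher
        # temperatures it falls back to sampling, so the number of beams is
        # effectively overridden.
        result = generate_segment(temperature=temperature, use_beam_search=(temperature == 0.0))

        # Retry heuristics from the paper: regenerate if the text looks repetitive
        # (gzip compression ratio > 2.4) or low-confidence (average log-prob < -1.0).
        if result.compression_ratio <= 2.4 and result.avg_logprob >= -1.0:
            return result
    return result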

@dengchengxifrank

@Nik-Kras I wonder what exactly "scores" and "sequences_scores" mean. I also wonder, if I pass output_logits=True to the generate method, where I can check the specific computation. For example, if my label is [1, 12, 3] and I get the logits, how can I use the logits to calculate the cross-entropy loss? Thanks!

@Nik-Kras (Contributor, Author)

I believe you are looking for material to help with your cross-entropy question; this pull request is about something different: it focuses on the beam search method of generating sequences. Read more about it here:

@Nik-Kras I wonder what exactly the "scores", and "sequences_scores" mean.

Below is my personal reasoning.
scores are essentially the logits used to select the next token for each beam during beam search. sequences_scores is a rating of each generated sequence that helps select the best one; it is commonly computed by summing the log-probabilities of the selected tokens in the beam (equivalently, multiplying their probabilities), typically normalized by length. So if a generated sequence consists of low-probability tokens, there is probably something wrong with it and it should not be selected. The sequence score reduces this comparison to a single number per hypothesis.
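
A tiny numerical sketch of that idea (illustrative only; the actual transformers computation also applies a length penalty):

import torch

# Per-token log-probabilities of one hypothetical beam.
token_log_probs = torch.tensor([-0.1, -0.3, -0.05, -0.2])

sequence_log_prob = token_log_probs.sum()                     # log P(sequence)
length_normalized = sequence_log_prob / len(token_log_probs)  # roughly what sequences_scores reports

print(f"sum of log-probs: {sequence_log_prob:.3f}, length-normalized: {length_normalized:.3f}")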

@dengchengxifrank

@Nik-Kras Thanks for the documents. I can get the logits, but for each timestep in beam search I wonder which index the algorithm picks (sometimes not the largest one). I want to calculate the cross-entropy loss at each timestep as a "metric" and finally get the N-best results. Or, to rephrase my problem: how can I get the N-best results with scores? I think it is inappropriate for me to ask such questions under this pull request; would it be better to ask them in the issues? Thanks!!!

@LysandreJik (Member)

cc @eustlb if you have some bandwidth to take a look at this while @ylacombe is off!

@ylacombe (Contributor) left a comment


Hey @Nik-Kras, thanks for opening this PR!

Your reasoning is right, and we're indeed missing sequences_scores. The fix looks great to me!

Do you think:

  1. Could you also take care of beam_indices, as indicated in #32373 ("Some Whisper beam search output (sequences_scores, etc.) is lost in _stack_split_outputs")?
  2. Could you add a test to make sure the error doesn't happen again? (A rough sketch of such a test follows this list.)
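
For reference, a rough sketch of what such a test could look like (hypothetical; not necessarily the test that was eventually added in this PR):

import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration

def test_whisper_beam_search_returns_scores():
    processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

    # One second of silence is enough to exercise the generation path.
    dummy_audio = torch.zeros(16000).numpy()
    inputs = processor(dummy_audio, sampling_rate=16000, return_tensors="pt")

    outputs = model.generate(
        inputs["input_features"],
        num_beams=2,
        num_return_sequences=2,
        output_scores=True,
        return_dict_in_generate=True,
    )

    # The fields this PR restores should be present and consistently shaped.
    assert outputs.sequences_scores is not None
    assert outputs.beam_indices is not None
    assert outputs.sequences_scores.shape[0] == outputs.sequences.shape[0]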

Regarding the change in beam search behaviour, let's open a follow-up issue about it once your PR is merged! I believe it might be solved by #32336; don't hesitate to try it out and let us know if it solves your issue.

cc @eustlb for visibility

@ylacombe (Contributor)

ylacombe commented Sep 3, 2024

Also, how long is your input audio?

@Nik-Kras (Contributor, Author)

Nik-Kras commented Sep 4, 2024

@ylacombe The audio is 7 seconds long. It is taken from Common Voice 17, and you can find it by the ground-truth transcription "How is Mozilla going to handle ambiguities like queue and cue?". I did put a link to the audio on Hugging Face, but it seems to be broken. Trying again: link

I am a bit distracted by other projects at the moment, but the test should be very easy to write; I will add it soon.
I am not so confident about beam_indices, as I remember there are several bugs/issues around it. It would take some time, so I would leave it for another PR.

@ylacombe (Contributor)

ylacombe commented Sep 5, 2024

Hey @Nik-Kras, thanks for circling back on this!
Whisper is used by many users, so it'd be great to have the test added soon!
If you don't have the bandwidth in the coming days, let us know and we can supersede the PR to add the test and beam_indices - while keeping you as co-author, of course!

@Nik-Kras (Contributor, Author)

Nik-Kras commented Sep 6, 2024

@ylacombe I will allocate some time early next week, if that is fine with you

@Nik-Kras (Contributor, Author)

Nik-Kras commented Sep 10, 2024

@ylacombe All done. It was an easy fix; most of my time was spent setting everything up :)

However, I would like to draw your attention to a side issue I found: the inconsistency of Whisper's generation across versions. I explained the comparison above, mentioning the specific versions and my observations.
To fix beam_indices, I compared the output with version 4.25.1.
The type and shape are the same, but the data is significantly different.

I see that sequences_scores and beam_indices are not just computed differently; I strongly believe they are computed incorrectly. Given how beam search works, I don't expect Whisper to return 5 identical beams (with near-identical scores). To perform LLM post-correction, I need distinct transcription variants, which is the desired behaviour.

Please have a look at the changes in how the output is generated. I am unsure how to spot the bug there, but I strongly believe there is one.

Nik-Kras requested a review from ylacombe on September 10, 2024
@ylacombe (Contributor) left a comment


Thanks for the fix @Nik-Kras! LGTM! You should run make fixup to fix the code quality checks, as indicated here.

Regarding the change of behaviour, this is indeed really concerning. It might be related to #32246, which I've opened a PR for here: #32336.

I'll try using #32336 on top of your PR to see if it solves your issue. I'll let you know how it goes.

If it doesn't solve the regression, then let's open another issue with a reproducer and the expected results!

@Nik-Kras (Contributor, Author)

@ylacombe Here are the results of an experiment.
Correct transcription: "How is Mozilla going to handle ambiguities like queue and cue?"
The audio is taken from the Common Voice 17 dataset.

code

from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch
import librosa

# Load the processor and model
processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

# Load and preprocess the audio file
audio_path = "audio.mp3"
audio, sr = librosa.load(audio_path, sr=16000)  # Ensure the sample rate is 16kHz

# Preprocess the audio to get the input features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Generate the transcription using Beam Search with the model
beam_outputs = model.generate(
    inputs["input_features"],
    num_beams=5,  # Number of beams
    num_return_sequences=5,  # Number of hypotheses to return
    early_stopping=True,
    output_scores=True, 
    return_dict_in_generate=True,
)

# Decode the generated transcriptions
hypotheses = [processor.decode(output_ids, skip_special_tokens=True) for output_ids in beam_outputs.sequences]

# Print out the hypotheses
for i, hypothesis in enumerate(hypotheses):
    print(f"Hypothesis {i + 1}: {hypothesis}. Score: {beam_outputs.sequences_scores[i]}")

transformers v4.25.1

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.4627407491207123
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you and Q.. Score: -0.4789799749851227
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you, and cute.. Score: -0.48414239287376404
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you and cute.. Score: -0.4972183108329773
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you, and Q.. Score: -0.5054414868354797

transformers v4.44.1 + My Fix

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495038032531738
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495040416717529
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495036840438843
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495036244392395
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495033264160156

transformers v4.44.1 + My Fix + Your Fix

Hypothesis 1:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495038032531738
Hypothesis 2:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495040416717529
Hypothesis 3:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495036840438843
Hypothesis 4:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495036244392395
Hypothesis 5:  How is Mozilla going to handle and be with this? Thank you.. Score: -0.5495033264160156

There is something else: the core beam search algorithm seems incorrect. The algorithm should NOT select the same tokens for different beams; that is the point of the top-k candidate selection in beam search.
I guess another issue should be opened.

@ylacombe (Contributor)

ylacombe commented Sep 12, 2024

@Nik-Kras, thanks for taking the time to test it. I get the same results on my side.

I managed to identify why this is happening. It was introduced in #30984 and happens because _expand_variables_for_generation artificially expands the batch size to num_return_sequences. When this is then passed to GenerationMixin.generate, since batch_size=5, the model generates batch_size*num_beams beams but only keeps the most probable one for each element of the batch.

In other words, num_return_sequences is not compatible with short-form and long-form generation anymore. Now that I've identified the issue, I'll think about how to best solve it. Happy to get your thoughts here!
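
To illustrate the effect with a toy sketch (hypothetical helper names; this is not the actual transformers code):

def run_beam_search(features, num_beams, keep_top):
    # Dummy stand-in: pretend beam search ranks num_beams hypotheses and
    # returns the keep_top best ones.
    hypotheses = [f"hypothesis-{i}" for i in range(num_beams)]
    return hypotheses[:keep_top]

def expand_then_generate(input_features, num_beams=5, num_return_sequences=5):
    # _expand_variables_for_generation duplicates the input, so the "batch"
    # becomes num_return_sequences identical copies of the same audio.
    batch = [input_features] * num_return_sequences

    # Each copy runs its own beam search, but only its single best beam is
    # kept, so the returned "n-best list" collapses to n copies of the best
    # hypothesis instead of the n distinct beams one beam search would keep.
    return [run_beam_search(x, num_beams, keep_top=1)[0] for x in batch]

print(expand_then_generate("audio"))  # ['hypothesis-0', 'hypothesis-0', ...]
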
Also cc @eustlb

PS: don't forget to run make fixup to fix the code quality checks, as indicated here. Otherwise your PR won't be merged.

@ylacombe (Contributor)

Thanks for iterating! LGTM!

Would you like to open another issue for the problem you've identified?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator) left a comment


Thanks for fixing and adding a test!

@amyeroberts amyeroberts merged commit c29a869 into huggingface:main Sep 17, 2024
21 checks passed
itazap pushed a commit to NielsRogge/transformers that referenced this pull request Sep 20, 2024
…gingface#32970)

* added sequences_scores to the output

* added beam_indices to output

* added test to check for beam_indices, sequences_scores and their shape

* removed redundant whitespaces

* make fixup
amyeroberts pushed a commit to amyeroberts/transformers that referenced this pull request Oct 2, 2024
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024
@Daniel-Kelvich

I can still reproduce this bug with the latest transformers.

@eustlb (Contributor)

eustlb commented Dec 19, 2024

Indeed, it has not been solved yet. It's mainly due to the fact that the best sequences of beam search are ambiguous when it comes to Whisper's generation, as the underlying implementation makes multiple (and possibly independent) calls to GenerationMixin's generate (the method that handles beam search). Either a heuristic to concatenate and sort sequences after those multiple calls should be found, or we should revert the feature, as it's not compatible out of the box with Whisper.
Nevertheless, it is the next point on the bug-fixing roadmap; thanks for your patience!! 🤗
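
For intuition, one possible shape of such a heuristic (a sketch only, under the assumption that each internal generate call returns its own sequences and sequences_scores; not a commitment to an actual implementation):

import torch

def merge_nbest(chunk_outputs, num_return_sequences=5):
    """Concatenate hypotheses from several independent generate calls and keep
    the globally best-scoring ones. chunk_outputs is assumed to be a list of
    (sequences, sequences_scores) pairs, one per internal call."""
    all_sequences, all_scores = [], []
    for sequences, scores in chunk_outputs:
        all_sequences.extend(sequences)
        all_scores.append(scores)

    scores = torch.cat(all_scores)
    best = torch.argsort(scores, descending=True)[:num_return_sequences]
    return [all_sequences[i] for i in best], scores[best]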
