AutomaticSpeechRecognition pipeline cannot predict WORD timestamps for Whisper models finetuned without timestamps prediction #30148
Comments
cc @ylacombe too
I investigated a bit and got a more precise idea of the issue. There are actually two bugs with the pipeline, I think:
With the pipeline:
```python
from datasets import load_dataset, Audio
from transformers import AutomaticSpeechRecognitionPipeline, WhisperForConditionalGeneration, GenerationConfig
from transformers import AutoTokenizer, AutoFeatureExtractor

model_path = "BrunoHays/whisper-large-v3-french-illuin"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)
processor = AutoFeatureExtractor.from_pretrained(model_path)

ds = load_dataset("mozilla-foundation/common_voice_13_0", "fr", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
item = next(iter(ds["test"]))["audio"]
audio = item["array"]
sr = item["sampling_rate"]

pipe = AutomaticSpeechRecognitionPipeline(model=model, feature_extractor=processor, tokenizer=tokenizer)
transcript = pipe(audio, return_timestamps="word", generate_kwargs={})
print(transcript)
```
All the word timestamps are set to 29.98 s.

Without using the pipeline:
```python
features = processor(audio, return_tensors="pt",
                     truncation=False, sampling_rate=sr,
                     return_attention_mask=True)
generated = model.generate(features.input_features,
                           return_timestamps="word",
                           task="transcribe",
                           language="fr",
                           return_token_timestamps=True,
                           num_frames=int(len(audio) / processor.hop_length),  # <-- doesn't work without this
                           is_multilingual=True)
print(generated["token_timestamps"])
```
The word timestamps are now appropriate.
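A short aside on the `num_frames` argument that makes the manual call work, reusing `audio`, `processor`, and `sr` from the snippet above (this assumes the default Whisper feature extractor, whose `hop_length` is 160 samples, i.e. 10 ms per frame at 16 kHz):

```python
# Sketch, reusing variables from the snippet above. num_frames tells the
# token-timestamp (cross-attention alignment) logic how many feature frames
# correspond to real audio, so the padding up to the 30 s window is ignored.
num_frames = int(len(audio) / processor.hop_length)  # one frame per hop (~10 ms)
print(f"{num_frames} frames ≈ {num_frames * processor.hop_length / sr:.2f} s of audio")
```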
Code to reproduce:
```python
from datasets import load_dataset, Audio
from transformers import AutomaticSpeechRecognitionPipeline, WhisperForConditionalGeneration, GenerationConfig
from transformers import AutoTokenizer, AutoFeatureExtractor

model_path = "BrunoHays/whisper-large-v3-french-illuin"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)
processor = AutoFeatureExtractor.from_pretrained(model_path)

ds = load_dataset("BrunoHays/multilingual-TEDX-fr", "max", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
item = next(iter(ds["test"]))["audio"]
audio = item["array"]
sr = item["sampling_rate"]

pipe = AutomaticSpeechRecognitionPipeline(model=model, feature_extractor=processor, tokenizer=tokenizer)
transcript = pipe(audio, return_timestamps=False)  # No timestamps
print(transcript)
```
I think the audio chunking method for long-form should be used when timestamps are deactivated.
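A possible workaround for this second case, sketched under the assumption that forcing chunked inference at call time avoids the long-form algorithm entirely (not verified in the thread):

```python
# Assumption: passing chunk_length_s at call time switches the pipeline to the
# chunked algorithm, so each <=30 s chunk fits a single Whisper window and no
# timestamp tokens are needed to stitch segments together.
transcript = pipe(audio, return_timestamps=False, chunk_length_s=30, stride_length_s=5)
print(transcript)
```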
Gentle ping @ylacombe
Hi @Hubert-Bonisseur, thanks for sharing this issue!
Agreed @Hubert-Bonisseur and @kamilakesbi! In case you don't want to train another model, the only option for long-form transcription is doing chunked inference:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=25,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```
If you're happy training another model, but don't have timestamps in your data, you can try training with LoRA. Using LoRA reduces the amount of catastrophic forgetting, so even though we don't have timestamps in our fine-tuning data, the model remembers how to make timestamped predictions. You can see a guide on LoRA fine-tuning using the PEFT library here. Note that you want to run inference in half/full precision (not 8-bit), as outlined here.
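For reference, a minimal sketch of the LoRA setup mentioned above, assuming the PEFT library; the rank, alpha, and target modules below are illustrative choices rather than prescribed values:

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Illustrative LoRA configuration: low-rank adapters on the attention
# query/value projections; the base weights stay frozen, which limits
# catastrophic forgetting of timestamp prediction.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```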
I missed the notifications, sorry. If I understand correctly, @sanchit-gandhi, chunked transcription is activated once we pass the `chunk_length_s` argument?
Hi @Hubert-Bonisseur, you can indeed use chunked transcription by passing the `chunk_length_s` argument. The default behavior is to use the long-form algorithm as it is more efficient :) Hope it will help you!
System Info
At present, the AutomaticSpeechRecognition pipeline can predict timestamps either at the word level through cross-attention or by using the timestamp tokens predicted by Whisper. The concern is that opting for word-level prediction also activates timestamp-token prediction, which cannot be disabled as far as I know. This may inadvertently reduce timestamp accuracy or cause word timestamp prediction to fail entirely, particularly for models fine-tuned without timestamp prediction.
Other frameworks can predict word timestamps correctly with the appropriate arguments, for instance OpenAI's whisper framework:
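For instance, a sketch using the openai-whisper package (the model size and audio path are placeholders): `word_timestamps=True` requests cross-attention-based word timings, while `without_timestamps=True` suppresses timestamp-token prediction.

```python
import whisper

model = whisper.load_model("small")  # placeholder model size
result = model.transcribe(
    "audio.mp3",                 # placeholder path
    language="fr",
    word_timestamps=True,        # word timings from cross-attention alignment
    without_timestamps=True,     # do not predict <|timestamp|> tokens
)
for segment in result["segments"]:
    for word in segment["words"]:
        print(word["word"], word["start"], word["end"])
```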
Who can help?
@sanchit-gandhi
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Take any model fine-tuned without timestamps, and get bogus word timestamps because the attention is all over the place.
Expected behavior
It should be possible to disable timestamp generation when choosing word timestamps.
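Purely as an illustration of the requested behavior, a hypothetical call shape (not a supported API today):

```python
# Hypothetical, unsupported call: word-level timestamps from cross-attention,
# with timestamp-token prediction explicitly disabled during generation.
transcript = pipe(audio, return_timestamps="word", generate_kwargs={"return_timestamps": False})
```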