batching inference and forced decoding for speedup and multi-target #55
Comments
Hi! There's an implementation that supports batch inference: https://github.com/Vaibhavs10/insanely-fast-whisper
Yes, me neither. I would need a pointer to the function that takes two audio samples and processes them at once.
OK, I checked it. Insanely Fast Whisper is just a wrapper around Hugging Face Transformers. Example usage of batching is in huggingface/transformers#27658, and https://github.com/pe-trik/transformers/blob/online_decode/examples/pytorch/online-decoding/whisper-online-demo.py shows the forced decoding. So these are the starting points for working on this issue. I might do it in a few weeks, but anybody can go ahead :)
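For orientation, here is a minimal, untested sketch of what batched inference with Hugging Face Transformers could look like. The clip names (`audio_a`, `audio_b`), the checkpoint, and the language/task settings are assumptions, not part of this project; the exact long-form batched `generate` behaviour is what huggingface/transformers#27658 is about.

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
model.to("cuda").eval()

# The feature extractor pads/truncates each clip to 30 s of log-mel features,
# so clips of different lengths can be stacked into a single batch.
inputs = processor([audio_a, audio_b], sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(
        inputs.input_features.to("cuda"),
        language="en",
        task="transcribe",
    )

texts = processor.batch_decode(generated, skip_special_tokens=True)

# Forced decoding (scoring a fixed target sequence, as in the linked
# whisper-online-demo.py) can be approximated with a teacher-forced forward
# pass; `target_ids` is a hypothetical batch of decoder token ids.
# logits = model(input_features=inputs.input_features.to("cuda"),
#                decoder_input_ids=target_ids).logits
```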
Any news about this implementation, or any findings so we can try to work on it? I am trying to build a multi-client server, and batching would be nice for running more than one transcription on the same instance.
Wow, great! The easiest use case for batching is to decode the same audio twice in one batch: the whole buffer, and the whole buffer minus the last chunk.
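As an illustration of this use case, a rough sketch reusing the hypothetical `processor`/`model` from the sketch above; `audio_buffer` and `last_chunk_samples` are made-up names:

```python
import torch

full = audio_buffer                           # whole audio buffer so far
shorter = audio_buffer[:-last_chunk_samples]  # buffer without the last chunk

# The feature extractor pads both views to the same 30 s mel length,
# so they can share one encoder/decoder pass.
inputs = processor([full, shorter], sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(inputs.input_features.to(model.device))

hyp_full, hyp_shorter = processor.batch_decode(generated, skip_special_tokens=True)
# The two hypotheses can then be compared (e.g. by the LocalAgreement policy
# that whisper_streaming already uses) to decide which prefix is stable
# enough to commit.
```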
Sure, let's cooperate! My doubt is: decoding the same audio twice is for the speedup use case, right? I saw you mention multi-client in #42; would a decoding + batching backend API be necessary to parallelize multiple audios on the GPU? I could try to work on this batching backend layer using the whisper-streaming source code.
Yes. Just be aware that batching multiple audios can result in a slowdown. There will be independent audio buffers of different lengths; you need to pad the audio input to the longest one, and the processing time of the batch is that of the longest buffer. So you gain efficiency, but lose some speed.
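A small sketch of that padding cost, with made-up buffer names for several independent clients:

```python
import numpy as np

# Hypothetical per-client buffers of different lengths (16 kHz mono float arrays).
buffers = [client_a_audio, client_b_audio, client_c_audio]
max_len = max(len(b) for b in buffers)

# Zero-pad every buffer to the longest one before batching; the batch then
# takes roughly as long as its longest member, so short buffers pay the
# latency of the long ones.
padded = np.stack([np.pad(b, (0, max_len - len(b))) for b in buffers])
```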
So, how's your progress, @joaogabrieljunq?
Hello again @Gldkslfmsd, nice to know that you are making progress on the batch implementation research! I also spent yesterday researching possible implementations for this. I found WhisperS2T, which seems to support dynamic time lengths in batch inference, which would help with the padding problem you mentioned above. Perhaps this could help too: https://github.com/shashikg/WhisperS2T/blob/main/whisper_s2t/backends/ctranslate2/model.py
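For reference, WhisperS2T's batched API looks roughly like the following. This is paraphrased from its README and is an assumption; the exact function names and parameters may differ between versions.

```python
import whisper_s2t

# Assumed usage, roughly following the WhisperS2T README.
model = whisper_s2t.load_model(model_identifier="large-v2", backend="CTranslate2")

files = ["client_a.wav", "client_b.wav"]  # hypothetical input files
out = model.transcribe_with_vad(
    files,
    lang_codes=["en", "en"],
    tasks=["transcribe", "transcribe"],
    initial_prompts=[None, None],
    batch_size=16,
)
```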
Any news on this matter?
Any update on batching?
No, unfortunately it's not among my priorities anymore.
Batching inference should be used in Whisper-Streaming. It is currently not implemented.
This could work: huggingface/transformers#27658
Why batching: speedup and multi-target decoding, as in the issue title and the discussion above.