
Remove torch dependency, Faster numpy Feature extraction #1106

Merged (8 commits) on Nov 14, 2024

Conversation

MahmoudAshraf97
Collaborator

@MahmoudAshraf97 MahmoudAshraf97 commented Oct 31, 2024

Hi all,

This PR aims to remove the torch dependency, as torch is only used for feature extraction. There are two options:

  1. Use NumPy only and scrap GPU acceleration completely; this slightly simplifies the FE code by removing the device-handling and framework-handling code.
  2. Use NumPy + CuPy (the current implementation); this allows GPU acceleration for those who want it. We still need to decide whether CuPy should be included in the base requirements, as an optional requirement, or not at all.

These are performance figures on a 30s segment using this script

OMP_NUM_THREADS=4 OPENBLAS_NUM_THREADS=4 MKL_NUM_THREADS=4 VECLIB_MAXIMUM_THREADS=4 NUMEXPR_NUM_THREADS=4 taskset -c 0-3 python benchmark_fe.py
Cores: {0, 1, 2, 3}
Torch on CPU:
5.66 ms
Torch on GPU:
0.71 ms
New numpy on CPU:
10.36 ms
New CuPY on GPU:
2.19 ms
Old numpy on CPU:
39.55 ms

Edit: I decided to remove CuPy, as the speed difference is not worth the extra code.
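The benchmark script itself isn't reproduced here, but a minimal timing harness in this spirit might look like the following (the helper name, run count, and the rfft stand-in for the feature extractor are illustrative, not the actual benchmark_fe.py):

```python
import time
import numpy as np

def mean_ms(fn, arg, n_runs=50):
    """Return the mean wall-clock time of fn(arg) in milliseconds."""
    fn(arg)  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(n_runs):
        fn(arg)
    return (time.perf_counter() - start) / n_runs * 1e3

audio = np.random.randn(16000 * 30).astype(np.float32)  # 30 s at 16 kHz
t = mean_ms(lambda a: np.abs(np.fft.rfft(a)), audio)
print(f"{t:.2f} ms")
```

Pinning thread counts and cores via the environment variables and taskset, as in the command above, is what makes such numbers comparable across runs.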

@MahmoudAshraf97
Collaborator Author

Mentioning community reviewers, feel free to ignore this
@Purfview @Jiltseb @ozancaglayan @jordimas @hoonlight @zh-plus

@ozancaglayan
Contributor

ozancaglayan commented Oct 31, 2024

Hi,

  • What's the value for MKL_NUM_THREADS and OMP_NUM_THREADS here for the CPU tests? Or are they not set?
  • Is this for 80 mels or 128 mels case?

@ozancaglayan
Contributor

ozancaglayan commented Oct 31, 2024

I have a librosa.stft based implementation for this. I checked that your stft and librosa.stft produce the same results and perform at more or less the same speed (6 ms for unpadded 30 s of audio). Unfortunately, these implementations do not seem to use multi-threaded computation, whereas torch.stft efficiently uses all cores, which makes the difference: ~1 ms

Another main bottleneck here is the padding with 30 seconds.

  • I get 32ms on a sample of 30s of audio with n_mels=128 and 27ms with n_mels=80
  • With padding=False, 32ms goes down to 16ms obviously

But essentially the effect of padding the input is padding the output with some constant vector. So one could mitigate this doubling by just padding the stft/mel outputs cleverly.

We lose almost as much time as the STFT itself to np.log10(np.clip(mels, a_min=1e-10, a_max=None)), as np.log10 is horribly slow. I think they fixed this in NumPy 2: numpy/numpy#19478
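For context, the log-mel step being discussed is essentially the following (a sketch with a random stand-in spectrogram; the clamp/log10/compression sequence follows the original Whisper recipe):

```python
import numpy as np

# Toy mel power spectrogram standing in for the real one (128 mels x 3000 frames)
mels = np.random.rand(128, 3000).astype(np.float32)

# The costly step discussed above: clamp to avoid log(0), then take log10
log_spec = np.log10(np.clip(mels, a_min=1e-10, a_max=None))

# Whisper then compresses the dynamic range and rescales
log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
log_spec = (log_spec + 4.0) / 4.0
```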

@ozancaglayan
Contributor

I think doing this on GPU with torch is a bit overkill, as feature extraction is seldom the bottleneck in the whisper pipeline. Simplicity and dependency management are more important here. And once things move towards numpy2, it won't be that bad.

@ozancaglayan
Contributor

This looks interesting as well and should be quite fast:
https://github.com/IntelPython/mkl_fft

@MahmoudAshraf97
Collaborator Author

MahmoudAshraf97 commented Oct 31, 2024

Hi,

  • What's the value for MKL_NUM_THREADS and OMP_NUM_THREADS here for the CPU tests? Or are they not set?
  • Is this for 80 mels or 128 mels case?

I updated the figures along with the benchmarking script, limiting it to 4 cores for reproducibility.
I'm using 128 mels.

I think doing this on GPU with torch is a bit overkill, as feature extraction is seldom the bottleneck in the whisper pipeline. Simplicity and dependency management are more important here. And once things move towards numpy2, it won't be that bad.

We're removing torch, that's for sure; the choice here is between keeping both CuPy and NumPy or only one of them.

I have a librosa.stft based implementation for this. I checked that your stft and librosa.stft produce the same results and perform at more or less the same speed (6 ms for unpadded 30 s of audio). Unfortunately, these implementations do not seem to use multi-threaded computation, whereas torch.stft efficiently uses all cores, which makes the difference: ~1 ms

This implementation is ported from the C++ PyTorch implementation. It's very similar to the librosa implementation, but it has some extra functionality that I'm not sure we need here. The reason I went with this one is that the librosa implementation needs an extra dependency, and copying the function over isn't simple because it uses many librosa internal functions, which would massively increase the amount of code needed.

Another main bottleneck here is the padding with 30 seconds.

  • I get 32ms on a sample of 30s of audio with n_mels=128 and 27ms with n_mels=80
  • With padding=False, 32ms goes down to 16ms obviously

But essentially the effect of padding the input is padding the output with some constant vector. So one could mitigate this doubling by just padding the stft/mel outputs cleverly.

I'm working on removing the padding completely while still generating identical results, but I'm leaving that for another PR. Check here: it does indeed reduce the FE time by half.

We lose almost as much time as the STFT itself to np.log10(np.clip(mels, a_min=1e-10, a_max=None)), as np.log10 is horribly slow. I think they fixed this in NumPy 2: numpy/numpy#19478

Unfortunately, I don't know of a faster way to compute log10.
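One algebraic alternative worth trying: log10 can be rewritten in terms of the natural log. A sketch (whether this is actually faster depends on the NumPy build and the underlying libm, so it is only an idea to benchmark, not a measured improvement):

```python
import numpy as np

x = np.clip(np.random.rand(128, 3000).astype(np.float32), 1e-10, None)

# log10(x) == ln(x) / ln(10), so np.log10 can be replaced by np.log plus
# a scalar multiply. The results agree to within float32 precision.
ref = np.log10(x)
alt = np.log(x) * np.float32(1.0 / np.log(10.0))
```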

@jordimas
Contributor

jordimas commented Oct 31, 2024

Thanks for your work @MahmoudAshraf97, it's really impressive.

My thinking is in line with @ozancaglayan's: "Simplicity and dependency management are more important here. And once things move towards numpy2, it won't be that bad."

All implementations are good; it depends on what you optimize for. I would optimize for simplicity and fewer dependencies.

@ozancaglayan
Contributor

I'm working on removing the padding completely while still generating identical results, but I'm leaving that for another PR. huggingface/transformers#34111 (comment): it does indeed reduce the FE time by half.

Interesting. I'm always confused about one thing: given that Whisper is trained on 30-sec segments, the above PR and your idea are not about running Whisper freely on arbitrarily sized audio files, but rather about removing the "always add 30 more secs of 0.0s" logic in the FE, right?

I thought that to do proper seeking and timestamping, Whisper requires those +30 secs of zeros all the time; isn't that the case then? I.e., would the output of Whisper be exactly the same if we just pad the inputs to at most 30 secs?

@MahmoudAshraf97
Collaborator Author

MahmoudAshraf97 commented Oct 31, 2024

Interesting. I'm always confused about one thing: given that Whisper is trained on 30-sec segments, the above PR and your idea are not about running Whisper freely on arbitrarily sized audio files, but rather about removing the "always add 30 more secs of 0.0s" logic in the FE, right?

I thought that to do proper seeking and timestamping, Whisper requires those +30 secs of zeros all the time; isn't that the case then? I.e., would the output of Whisper be exactly the same if we just pad the inputs to at most 30 secs?

We don't even need to pad it to 30 s; adding hop_length samples is enough, since we remove the padding and then pad again with zeros. That means we only remove the last frame of the features instead of 3000+ frames.
I've verified that it's numerically identical, but I'm still thoroughly investigating whether there are further effects.

Edit: it produces exactly the same encoder outputs with NumPy, while CuPy needs at least 17 s of zero padding to achieve the same numerical accuracy for audios shorter than 30 s, which closes the speed gap between the two even further.

@MahmoudAshraf97 MahmoudAshraf97 changed the title Remove torch dependency, use CuPy for feature extraction Remove torch dependency, Faster numpy Feature extraction Oct 31, 2024
@MahmoudAshraf97
Collaborator Author

I implemented the reduced padding in bc86503; everyone, feel free to test it and report any issues.

@MahmoudAshraf97 MahmoudAshraf97 marked this pull request as ready for review October 31, 2024 22:27
@ozancaglayan
Contributor

Thanks! I probably won't have time next week, as I'll be on annual leave. One thing though: does this mean you'll no longer pad, say, a 10-second audio file to 30 seconds? People were saying that Whisper has a severe hallucination problem if you let it transcribe signals shorter than 30 seconds, because it's not trained with an attention mask. There are even papers that fine-tune it to mitigate this.

But I'm not 100% sure whether it's the same thing or not; I need to test it properly.

@MahmoudAshraf97
Collaborator Author

Thanks! I probably won't have time next week, as I'll be on annual leave. One thing though: does this mean you'll no longer pad, say, a 10-second audio file to 30 seconds? People were saying that Whisper has a severe hallucination problem if you let it transcribe signals shorter than 30 seconds, because it's not trained with an attention mask. There are even papers that fine-tune it to mitigate this.

But I'm not 100% sure whether it's the same thing or not; I need to test it properly.

In the original whisper implementation, the audio is padded with 30 s of zeros and converted to a log-mel spectrogram; then the features that correspond to the padding are removed, and the features are padded again with zeros until they are equivalent to 30 s.

What I found here is that padding with 10 ms instead of 30 s is mathematically equivalent and produces the same encoder input and output within the numerical tolerance of the data type used.

Why does this work? Because in the STFT calculation each window is independent and overlaps the next by hop_length, padding with hop_length is enough to simulate that the last frame has zeros after it. Any additional padding does not overlap the audio at all and is discarded later, so we can count it as redundant computation.
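The independence argument above can be sanity-checked with a simplified, non-centered framing sketch (this is not the actual faster-whisper STFT, just an illustration of why hop_length padding suffices; the frame sizes match Whisper's usual n_fft=400, hop_length=160 at 16 kHz):

```python
import numpy as np

hop_length, n_fft, sr = 160, 400, 16000
audio = np.random.randn(sr).astype(np.float32)  # 1 s of audio

def frame_signal(x):
    """Simplified non-centered framing: each frame is an independent window."""
    n_frames = 1 + (len(x) - n_fft) // hop_length
    return np.stack([x[i * hop_length : i * hop_length + n_fft]
                     for i in range(n_frames)])

full = frame_signal(np.pad(audio, (0, sr * 30)))      # 30 s of zero padding
short = frame_signal(np.pad(audio, (0, hop_length)))  # hop_length samples only

# Every frame that overlaps the audio is identical in both cases; the
# remaining frames of `full` never touch the audio, are all-zero, and
# would be discarded anyway.
n = short.shape[0]
assert np.array_equal(full[:n], short)
assert not full[n + 1:].any()
```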

@MahmoudAshraf97 MahmoudAshraf97 mentioned this pull request Nov 3, 2024
@MahmoudAshraf97
Collaborator Author

I want to merge this before Friday so we can release a new version on the 15th of November, so if anyone has any comments or reviews, please let me know.

@MahmoudAshraf97 MahmoudAshraf97 merged commit 3e0ba86 into SYSTRAN:master Nov 14, 2024
3 checks passed
mludolph added a commit to mludolph/faster-whisper that referenced this pull request Nov 14, 2024
@aligokalppeker

It is terrific to see that the community feedback worked and the torch dependencies are gone with PRs such as this, bringing faster-whisper back to its origins, @MahmoudAshraf97.

But one question: have you tested your new implementation with short chunks? I have tested the new feature extractor and it produces no transcripts when chunks are shorter than 30 secs.

The PyTorch implementation works fine and produces the correct results.

I see your discussions with @ozancaglayan and others about deprioritizing short-chunk processing, but we should not break an existing and expected feature.

@MahmoudAshraf97
Collaborator Author

This implementation should produce results identical to the old and original whisper implementations, and it's completely backwards compatible. I'm not sure I understand your description of the problem; can you write code to reproduce it?

@formater

I'm processing a massive number of short 1-3 second chunks, and until now I passed a torch array to transcribe to skip the slow audio decoding. Since I can no longer pass a torch array, I'm passing a NumPy array, but that, together with moving the FFT from the torch implementation to NumPy (and moving the load from GPU to CPU), slowed down the whole transcription a lot on my side. So maybe a use_pytorch boolean flag somewhere would be better, to keep PyTorch optionally available for audio input and for feature extraction.

@formater

And just to emphasize: removing the torch array as an input type for the transcribe function was a breaking change.

@aligokalppeker

@formater Removing the torch array is required to keep faster-whisper free of the torch dependency, and it is not a breaking change, as the latest release of faster-whisper (1.0.3) already has a NumPy-array interface. This PR is important for keeping it that way.

I think a boolean flag would make torch processing optional, but not the torch dependency itself, so your proposed solution is not applicable in the context of this PR.

One option might be to make FeatureExtractor an injectable component behind an interface, so people can plug in their own FeatureExtractor implementation, with its own package dependencies, if they don't like the default implementation provided by faster-whisper.

@formater

The injectable component is a good idea! :)

I wrote that it was a breaking change for me because I used a torch array as the transcriber input, which worked with 1.0.3 and no longer works with the latest. :)

@aligokalppeker

This implementation should produce results identical to the old and original whisper implementations, and it's completely backwards compatible. I'm not sure I understand your description of the problem; can you write code to reproduce it?

@MahmoudAshraf97 I had used the new feature extractor with version 1.0.3, since I saw that it has the same interface; my tests were based on that. When I switched the implementation to the new NumPy version, the transcripts for 1-3 second audio chunks were gone.

I am using a NumPy array as input instead of a filename, so that might help you reproduce the issue.

Let me try to give you more details on my experiment.

@MahmoudAshraf97
Collaborator Author

@formater if your use case is only audios shorter than 30 s, then using faster-whisper is very inefficient; use CTranslate2 directly, or TensorRT.
You can use the original whisper feature extraction, which uses torch on GPU, then stack the feature arrays, convert them to ct2.StorageView objects, and feed them directly to the CT2 model; this should be much faster than faster-whisper.
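A rough sketch of the batching part of that workflow (the batch_features helper is hypothetical, and the commented CTranslate2 calls should be checked against the CTranslate2 documentation before use):

```python
import numpy as np

def batch_features(feature_list):
    """Stack per-clip (n_mels, n_frames) arrays, zero-padding to a common length."""
    max_frames = max(f.shape[1] for f in feature_list)
    batch = np.zeros((len(feature_list), feature_list[0].shape[0], max_frames),
                     dtype=np.float32)
    for i, f in enumerate(feature_list):
        batch[i, :, : f.shape[1]] = f
    return batch

# Three short clips with different frame counts, batched together
feats = [np.random.randn(80, n).astype(np.float32) for n in (100, 250, 300)]
batch = batch_features(feats)  # shape (3, 80, 300)

# With CTranslate2 one would then, roughly (verify against the CT2 docs):
#   sv = ctranslate2.StorageView.from_array(batch)
#   model = ctranslate2.models.Whisper("path/to/ct2-model")
#   results = model.generate(sv, prompts)
```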

@aligokalppeker

aligokalppeker commented Nov 25, 2024

@MahmoudAshraf97 When I run the PyTorch or 1.0.3 version, I get an 80x3084 matrix with float32 data type when inputting 7039 bytes of data. The new FeatureExtractor extracts an 80x49 matrix from the same input, which looks quite wrong in dimension size and order.

This will result in two things:

  • For audio data longer than 30 seconds, the trailing portion that is shorter than a 30-second frame will be transcribed incorrectly, reducing the accuracy of faster-whisper.
  • For short chunk-based audio processing, it will give nothing.

You can reproduce a similar scenario by providing an nd-array through the faster-whisper API.

I don't think this version should be tagged as 1.1.0 if it has degraded accuracy and faults in a core feature. Let me open an issue for it.

@MahmoudAshraf97
Collaborator Author

The new feature extractor is expected to produce an array with a smaller dimension; the new dimensions are handled in transcribe.py.
If you want to test this PR with v1.0.3, you need to pull all the changes, not just the feature extractor.

@aligokalppeker

aligokalppeker commented Nov 25, 2024

Yes @MahmoudAshraf97, I had tested it with 1.1.0 before opening the issue. It seems to run OK with 1.1.0; there must have been a transformation step I missed from the PR when using it with 1.0.3.

@MahmoudAshraf97
Collaborator Author

If you are seeing this problem with 1.1.0, please open an issue with the previous and current transcriptions and share the audios to reproduce it, and I'll happily investigate.

@aligokalppeker

aligokalppeker commented Nov 25, 2024

But @MahmoudAshraf97, I see a strange performance difference here:

faster-whisper 1.0.3 + PyTorch CPU feature extractor: works, and it is fast, 10% faster than the 1.1.0 version on CPU.
faster-whisper 1.1.0 + PyTorch CPU feature extractor: works, but it is 2x slower than the new NumPy implementation.

Normally the PyTorch feature extractor would be expected to improve performance compared to the NumPy implementation on CPU, as tested with 1.0.3.

But I guess changes in the intermediate steps cause a 2x slowdown in 1.1.0; that might be due to some sizing updates and the like.

@MahmoudAshraf97
Collaborator Author

It all comes down to using the correct value of content_frames in transcribe.py, which depends on the feature extractor you are using.

@aligokalppeker

I think the main effect lies in subtracting the constant 1 instead of self.feature_extractor.nb_max_frames when determining content_frames.
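To illustrate the two conventions with hypothetical numbers (the frame counts below are made up; only the subtraction logic mirrors the discussion):

```python
# Hypothetical illustration of the two content_frames conventions discussed.
nb_max_frames = 3000  # 3000 feature frames == 30 s at a 10 ms hop

# Old extractor: the audio is padded with a full 30 s of zeros, so the
# padding contributes nb_max_frames extra frames that must be subtracted.
old_features_frames = 49 + nb_max_frames
content_frames_old = old_features_frames - nb_max_frames

# New extractor: the audio is padded with only hop_length samples, which
# adds a single extra frame, hence subtracting the constant 1.
new_features_frames = 49 + 1
content_frames_new = new_features_frames - 1

# Both conventions recover the same number of real content frames.
assert content_frames_old == content_frames_new == 49
```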

@aligokalppeker

aligokalppeker commented Nov 25, 2024

@MahmoudAshraf97 I understand it better now: in the latest implementation, the padding is not done according to the sample size but is kept very strict to avoid the performance cost.

I have updated the padding accordingly in the PyTorch implementation and it performed similarly.

@aligokalppeker

aligokalppeker commented Nov 25, 2024

One comment here though: even if we use the same feature extractors in 1.0.3 and 1.1.0, there is a slight difference in performance that needs to be tracked down.

Version 1.1.0 seems to perform 6-7% slower compared to 1.0.3, with everything else kept the same.

Great work removing the PyTorch dependencies, @MahmoudAshraf97

@MahmoudAshraf97
Collaborator Author

I received a similar report in #1167 but haven't dived deep into it, as it's probably related to CT2 more than to faster-whisper.

@aligokalppeker

aligokalppeker commented Nov 25, 2024

@MahmoudAshraf97 It would be good to dive into it. Thanks for that.

But I don't think it's related to the CT2 version or anything else there, as in my tests I kept the CT2 version fixed at 4.5.0.
