Add numpy alternative to FE using torchaudio #26339
Conversation
fbank = ta_kaldi.fbank(
    waveform,
    htk_compat=True,
    sample_frequency=self.sampling_rate,
    use_energy=False,
    window_type="hanning",
    num_mel_bins=self.num_mel_bins,
    dither=0.0,
    frame_shift=10,
)
I also took the opportunity to remove some unnecessary parameters here
The documentation is not available anymore as the PR was closed or merged.
Personally think it's better to remove the torchaudio dependency entirely and align these two outliers with the rest of the numpy-based audio feature extractors! Especially since we'll probably support a torch version in audio_utils in an upcoming PR, so the speed diff will be recovered.
    dither=0.0,
    frame_shift=10,
)
if self.use_torchaudio:
I don't think there's a need to have the use_torchaudio argument. IMO we can execute the torchaudio code if is_torchaudio_available() (thus maintaining backwards comp), and the NumPy code otherwise:

if is_torchaudio_available():
    # do legacy code
else:
    # do numpy code
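For concreteness, a minimal sketch of what such a dispatch could look like (assumptions: is_speech_available is the torchaudio check used elsewhere in this PR, the audio_utils.spectrogram call only illustrates the shape of the numpy branch rather than the exact parameters, and window / mel_filters are assumed to be precomputed, e.g. in the feature extractor's __init__):

import numpy as np

from transformers.audio_utils import spectrogram
from transformers.utils import is_speech_available

if is_speech_available():
    import torch
    import torchaudio.compliance.kaldi as ta_kaldi


def extract_fbank(waveform: np.ndarray, sampling_rate: int, num_mel_bins: int, window, mel_filters) -> np.ndarray:
    # waveform is expected to be a 1-d float32 array
    if is_speech_available():
        # legacy torchaudio path, kept for backwards compatibility and speed
        fbank = ta_kaldi.fbank(
            torch.from_numpy(waveform).unsqueeze(0),
            htk_compat=True,
            sample_frequency=sampling_rate,
            window_type="hanning",
            num_mel_bins=num_mel_bins,
        )
        return fbank.numpy()
    # numpy path built on audio_utils
    return spectrogram(
        waveform,
        window,
        frame_length=400,
        hop_length=160,
        fft_length=512,
        power=2.0,
        mel_filters=mel_filters,
        log_mel="log",
    ).T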
I'm also fine with removing the legacy torchaudio code altogether. I know this makes the feature extraction quite a bit slower, but I think this is fine to remove the extra dependencies to bring these models in-line with the rest of the audio library.
Personally, I would favour this approach over supporting both methods for feature extraction (torchaudio and numpy). IMO having both methods convolutes the code quite a lot, which is something we want to avoid.
Fine with me to remove the previous code, it won’t be performance wise backward compatible 🫠
Let's go with the first option here then? Decorate with is_torchaudio_available?
@@ -198,3 +235,16 @@ def __call__(
        padded_inputs = padded_inputs.convert_to_tensors(return_tensors)

        return padded_inputs

    def to_dict(self):
Is this method strictly necessary? If it is, shouldn't it go in the base FeatureExtractionMixin class, rather than copying it out for every feature extractor?
which method do you mean? to_dict?
Yes it's about time we add this to the base feature class!
(it's necessary if we support the numpy part)
Looks good to me, aligned with @sanchit-gandhi on not adding the np support
Hey @ArthurZucker and @sanchit-gandhi, thanks for the review! However, I'm not sure about what you meant here:

"Yes it's about time we add this to the base feature class! (it's necessary if we support the numpy part)"

And here:

"Fine with me to remove the previous code, it won't be performance wise backward compatible 🫠"

@sanchit-gandhi seems to be in favor of removing torchaudio support to only focus on the numpy port here, whereas @ArthurZucker seems to be in favor of not adding the numpy support. Maybe I misunderstood the comments here! Thanks for your help!
Sorry, I was confused! I agree that we should remove the old code, but I'm worried about the performance issue, since we had to re-introduce the torch STFT for Whisper, for example. (Performance-wise backward compatible.)
I've made a quick benchmark on AST, with results here. Basically, torchaudio is at least 19x faster than the numpy porting. If I haven't made any mistake in my benchmark, I'll be strongly in favor of keeping torchaudio compatibility. WDYT @ArthurZucker and @sanchit-gandhi? Can you also take a quick look at the benchmark code to make sure that my results are correct (or redirect me to an expert at HF haha)? For reference, here is the benchmark code:

from datasets import load_dataset
import pytest
from transformers import ASTFeatureExtractor

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
speech_samples = ds.sort("id").select(range(64))[:64]["audio"]
speech_samples = [x["array"] for x in speech_samples]


def torchaudio_unbatch():
    fe = ASTFeatureExtractor(use_torchaudio=True)
    for sample in speech_samples:
        input_features = fe(sample, padding=True, return_tensors="pt")


def np_unbatch():
    fe = ASTFeatureExtractor(use_torchaudio=False)
    for sample in speech_samples:
        input_features = fe(sample, padding=True, return_tensors="pt")


def torchaudio_batch_8():
    fe = ASTFeatureExtractor(use_torchaudio=True)
    for i in range(0, len(speech_samples), 8):
        samples = speech_samples[i : i + 8]
        input_features = fe(samples, padding=True, return_tensors="pt")


def np_batch_8():
    fe = ASTFeatureExtractor(use_torchaudio=False)
    for i in range(0, len(speech_samples), 8):
        samples = speech_samples[i : i + 8]
        input_features = fe(samples, padding=True, return_tensors="pt")


@pytest.mark.benchmark(min_rounds=5, disable_gc=True, warmup=False)
def test_torchaudio_unbatch(benchmark):
    benchmark(torchaudio_unbatch)


@pytest.mark.benchmark(min_rounds=5, disable_gc=True, warmup=False)
def test_torchaudio_batch_8(benchmark):
    benchmark(torchaudio_batch_8)


@pytest.mark.benchmark(min_rounds=5, disable_gc=True, warmup=False)
def test_np_unbatch(benchmark):
    benchmark(np_unbatch)


@pytest.mark.benchmark(min_rounds=5, disable_gc=True, warmup=False)
def test_np_batch_8(benchmark):
    benchmark(np_batch_8)
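As a rough cross-check that doesn't need pytest-benchmark, the same comparison can be sketched with timeit; use_torchaudio below refers to the argument as it exists in this version of the PR, and absolute numbers will of course differ from the benchmark above.

import timeit

from datasets import load_dataset
from transformers import ASTFeatureExtractor

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
speech_samples = [x["array"] for x in ds.sort("id").select(range(8))[:8]["audio"]]

for use_torchaudio in (True, False):
    fe = ASTFeatureExtractor(use_torchaudio=use_torchaudio)
    # time three passes over the 8 samples in a single batched call
    elapsed = timeit.timeit(lambda: fe(speech_samples, padding=True, return_tensors="pt"), number=3)
    print(f"use_torchaudio={use_torchaudio}: {elapsed:.2f}s")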
It's also possible that we can optimize our numpy code in audio_utils.
Alright, that's quite a significant difference - this probably requires overhauling the numpy functions in audio_utils.
Hey @ArthurZucker and @sanchit-gandhi, thanks for your help here.
To sum it up, I removed the torchaudio dependency for both FEs, but those FEs still use it if it is installed, to ensure speed.
I've also simulated torchaudio absence to make sure everything is in order.
I'm requesting your reviews again!
def to_dict(self) -> Dict[str, Any]:
    """
    Serializes this instance to a Python dictionary.

    Returns:
        `Dict[str, Any]`: Dictionary of all the attributes that make up this feature extractor instance.
    """
    output = copy.deepcopy(self.__dict__)
    output["feature_extractor_type"] = self.__class__.__name__

    if "mel_filters" in output:
        del output["mel_filters"]
    if "window" in output:
        del output["window"]
    return output

@classmethod
As requested, I've modified to_dict directly in feature_extraction_utils.py.
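A quick sanity check of the new behaviour could look like this (sketch; ASTFeatureExtractor is just an example of a feature extractor that stores window / mel_filters on the numpy path):

from transformers import ASTFeatureExtractor

fe = ASTFeatureExtractor()
state = fe.to_dict()
# the class name is recorded, and the large numpy arrays are stripped out
assert state["feature_extractor_type"] == "ASTFeatureExtractor"
assert "mel_filters" not in state and "window" not in state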
@unittest.mock.patch(
    "transformers.models.audio_spectrogram_transformer.feature_extraction_audio_spectrogram_transformer.is_speech_available",
    lambda: False,
)
This is how I simulate the absence of torchaudio in the test suite
def test_using_audio_utils(self):
    # Tests that it uses audio_utils instead of torchaudio
    feat_extract = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())

    self.assertTrue(hasattr(feat_extract, "window"))
    self.assertTrue(hasattr(feat_extract, "mel_filters"))
I'm also ensuring that we use the audio_utils package and that torchaudio is indeed not used in this class.
This on paper LGTM, and nice job on getting the tests working. My only thought is that maybe we should overhaul audio_utils.py with these changes, rather than do the if/else in the feature extraction code? This way, all the is_xxx_available logic stays in audio_utils (which is fine if it gets complex, since most people won't interact with it), and the feature extraction code can stay simple.
Open to either refactoring this PR to make this change, or merging this and doing it in a follow-up (along with #26119).
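To make the suggestion concrete, the layout being described would roughly look like the sketch below; kaldi_fbank is a hypothetical helper name (not an existing audio_utils function), used only to show where the branching would move:

from transformers import audio_utils  # assumes a future dispatching helper lives here


def _extract_fbank_features(self, waveform):
    # no is_speech_available() branching in the feature extractor itself:
    # the (hypothetical) audio_utils.kaldi_fbank decides between torchaudio and numpy
    return audio_utils.kaldi_fbank(
        waveform,
        sampling_rate=self.sampling_rate,
        num_mel_bins=self.num_mel_bins,
    )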
if is_speech_available():
    import torchaudio.compliance.kaldi as ta_kaldi

if is_torch_available():
In speech-to-text we bundle these imports into one:

if is_speech_available():
    import torchaudio.compliance.kaldi as ta_kaldi
    import torch

Should we do the same here, since we can only use torch if torchaudio is available?
Actually, torch is also used here even when torchaudio isn't used. I can maybe refactor the code to change that, but I'm not sure it's worth the time, WDYT?
Hey @sanchit-gandhi, thanks for the review here!
We'd have to create a separate check for that. In any case, I'd rather refactor that in another PR, which would maybe add the torch version of audio_utils.
Thanks! Looks good to me.
Aligned with @sanchit-gandhi on profiling our numpy code to see what our huge bottleneck is sometime soon!
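For reference, a minimal way to get such a profile later, assuming torchaudio is absent or mocked out so the numpy path actually runs:

import cProfile

import numpy as np
from transformers import ASTFeatureExtractor

fe = ASTFeatureExtractor()
audio = np.random.randn(10 * 16000).astype(np.float32)  # 10 s of synthetic audio
# sort by cumulative time to spot the expensive audio_utils calls
cProfile.run('fe(audio, sampling_rate=16000, return_tensors="np")', sort="cumtime")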
"transformers.models.audio_spectrogram_transformer.feature_extraction_audio_spectrogram_transformer.is_speech_available", | ||
lambda: False, | ||
) | ||
class ASTFeatureExtractionWithoutTorchaudioTest(SequenceFeatureExtractionTestMixin, unittest.TestCase): |
Very nice. I think you can either add # Copied from on top of some tests, or (not sure if it's possible) find a way to use parametrized to mock the absence of the package as a parameter, to avoid code duplication. Not very important, # Copied from will be great.
Hey @ydshieh, I'd love to have your take on how to best manage this.
In a few words, here I try to simulate that a library is missing in ASTFeatureExtractionTest. The thing is that now, I had to create another class, ASTFeatureExtractionWithoutTorchaudioTest, which is a copy of the previous one with a unittest.mock.patch decorator to simulate the library's absence.
I've looked around the internet for a way to avoid test duplication, but without success. Do you have any take on how to parametrize the library absence?
Thanks for your help!
Hello @ylacombe! There is something like unittest.mock.Mock(), but I never used it (yet) myself.
Search in tests/models/t5/test_modeling_t5.py:

def test_fp16_fp32_conversion(self):
    r"""
    A test to check whether the argument `keep_in_fp32_modules` correctly does its job
    """
    orig_import = __import__
    accelerate_mock = unittest.mock.Mock()

    # mock import of accelerate
    def import_accelerate_mock(name, *args, **kwargs):
        if name == "accelerate":
            if accelerate_available:
                return accelerate_mock
            else:
                raise ImportError
        return orig_import(name, *args, **kwargs)
and let me know how you feel. I can take a look too (good to learn anyway)
Hey @ydshieh, thanks for the quick response and for this example! I didn't know about it! I'm not sure this is the right fit for this purpose though.
The idea is really to run ASTFeatureExtractionTest twice, once as-is and once in the missing-library context!
Sorry, I missed the above comment. So here there are 2 classes instead of 2 test methods.
Is the code in the new ASTFeatureExtractionWithoutTorchaudioTest identical to the original ASTFeatureExtractionTest? If so, maybe try making it a subclass of ASTFeatureExtractionTest but decorated with unittest.mock.patch or something similar?
Thanks for this!
Making a subclass works for me, I'll try it
@@ -104,7 +103,213 @@ def _flatten(list_of_lists):
@require_torch
@require_torchaudio
class Speech2TextFeatureExtractionTest(SequenceFeatureExtractionTestMixin, unittest.TestCase):
missing some # Copied from here as well!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
# exact same tests as before, except that we simulate that torchaudio is not available
@require_torch
@unittest.mock.patch(
    "transformers.models.speech_to_text.feature_extraction_speech_to_text.is_speech_available", lambda: False
)
class Speech2TextFeatureExtractionWithoutTorchaudioTest(Speech2TextFeatureExtractionTest):
    def test_using_audio_utils(self):
        # Tests that it uses audio_utils instead of torchaudio
        feat_extract = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())

        self.assertTrue(hasattr(feat_extract, "window"))
        self.assertTrue(hasattr(feat_extract, "mel_filters"))
I just want to make sure that this inheritance works for you @ArthurZucker and/or @amyeroberts, following @ydshieh's suggestion!
As soon as I have approval, I'll merge!
Would be nice to assert that is_speech_available is False, so we are sure the patch works 😄
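Something along these lines inside the subclass would do (sketch, assuming the patch target shown above):

def test_is_speech_available_patched(self):
    # sanity check that the class-level unittest.mock.patch actually took effect
    from transformers.models.speech_to_text import feature_extraction_speech_to_text

    self.assertFalse(feature_extraction_speech_to_text.is_speech_available())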
(sometimes we have surprises ...)
self.mel_filters and self.window are not defined unless is_speech_available() is False, but it's best to be on the safe side
done
Looks alright with me!
* add audio_utils usage in the FE of SpeechToText
* clean unnecessary parameters of AudioSpectrogramTransformer FE
* add audio_utils usage in AST
* add serialization tests and function to FEs
* make style
* remove use_torchaudio and move to_dict to FE
* test audio_utils usage
* make style and fix import (remove torchaudio dependency import)
* fix torch dependency for jax and tensor tests
* fix typo
* clean tests with suggestions
* add lines to test if is_speech_available is False
What does this PR do?
Following on from #26182, which ported torchaudio.compliance.kaldi.fbank to numpy in audio_utils, this PR aims to enable the use of the numpy port in previous Feature Extractors (AST and SpeechToText) that used torchaudio. It was discussed here.

This serves two purposes: removing the hard torchaudio dependency from these two models, and paving the way to use audio_utils instead of torchaudio for future Feature Extractors.

A next step would be to port audio_utils to torch, which might be faster (cc @sanchit-gandhi), but this is still open to discussion. Is this really relevant? And will it really be faster?

cc @ArthurZucker and @sanchit-gandhi