
[Whisper] Fix slow tests #30152

Merged: 11 commits into huggingface:main from whisper-slow-tests, Apr 19, 2024
Conversation

sanchit-gandhi (Contributor) commented Apr 9, 2024

What does this PR do?

Fixes failing slow integration tests for the Whisper model. The majority of the failures were simply due to the order of the expected transcriptions not matching the order of the ground-truth ones. This PR fixes the ordering and updates the tests to use the latest .generate API, rather than the deprecated forced decoder ids approach.
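
As a rough sketch of that API migration (illustrative only - the model, dataset and decode step below are placeholders, not the exact diff in this PR):

    from datasets import load_dataset
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

    ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
    input_features = processor(
        ds[0]["audio"]["array"], sampling_rate=16_000, return_tensors="pt"
    ).input_features

    # Deprecated pattern: pin language/task through forced decoder ids
    # forced_decoder_ids = processor.get_decoder_prompt_ids(language="english", task="transcribe")
    # generated_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)

    # Current pattern: pass language/task directly to .generate
    generated_ids = model.generate(input_features, language="english", task="transcribe")
    print(processor.batch_decode(generated_ids, skip_special_tokens=True))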

cc @ydshieh

text="This part of the speech",
add_special_tokens=False,
return_tensors="pt",
sampling_rate=16_000,
sanchit-gandhi (Contributor, author) commented on this diff:
Adding the arg sampling_rate=16_000 whenever we call the processor significantly reduces the number of warnings emitted by the logger.
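
A minimal sketch of the difference (the silent dummy array is a stand-in, not audio from the tests):

    import numpy as np
    from transformers import WhisperProcessor

    processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
    audio = np.zeros(16_000, dtype=np.float32)  # 1 second of silence as a placeholder

    # Without sampling_rate, the feature extractor warns on every call that the
    # sampling rate was not provided and the input cannot be verified.
    inputs = processor(audio, return_tensors="pt")

    # Passing sampling_rate explicitly keeps the test logs quiet.
    inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")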


@sanchit-gandhi sanchit-gandhi requested a review from ydshieh April 9, 2024 22:50
ydshieh (Collaborator) commented Apr 10, 2024

Hi @sanchit-gandhi, thanks a lot! There are still 2 failures, but I guess that's because of an environment difference? Which machine did you use? If it is not a T4, I can update the expected results with a T4 run and push to this PR.

See below

2 failures

FAILED tests/models/whisper/test_modeling_whisper.py::WhisperModelIntegrationTests::test_large_generation_multilingual - huggingface_hub.utils._headers.LocalTokenNotFoundError: Token is required (`token=True`), but no token found. You need to provide a token or be logged in to Hugging Face with `huggingface-cli login` or `huggingface_hub.login`. See https://huggingface.co/settings/tokens.

FAILED tests/models/whisper/test_modeling_whisper.py::WhisperModelIntegrationTests::test_whisper_longform_multi_batch_hard_prev_cond - AssertionError: assert ' You know, f...ent. Me wild!' == ' You know, f...ent. Me, why?'

Full error log

======================================================================================================================================================== FAILURES =========================================================================================================================================================
_____________________________________________________________________________________________________________________________ WhisperModelIntegrationTests.test_large_generation_multilingual _____________________________________________________________________________________________________________________________

self = <tests.models.whisper.test_modeling_whisper.WhisperModelIntegrationTests testMethod=test_large_generation_multilingual>

    @slow
    def test_large_generation_multilingual(self):
        torch_device = "cpu"
        set_seed(0)
        processor = WhisperProcessor.from_pretrained("openai/whisper-large")
        model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
        model.to(torch_device)
    
        token = os.getenv("HF_HUB_READ_TOKEN", True)
>       ds = load_dataset("mozilla-foundation/common_voice_6_1", "ja", split="test", streaming=True, token=token)

tests/models/whisper/test_modeling_whisper.py:1760: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.8/dist-packages/datasets/load.py:2556: in load_dataset
    builder_instance = load_dataset_builder(
/usr/local/lib/python3.8/dist-packages/datasets/load.py:2228: in load_dataset_builder
    dataset_module = dataset_module_factory(
/usr/local/lib/python3.8/dist-packages/datasets/load.py:1879: in dataset_module_factory
    raise e1 from None
/usr/local/lib/python3.8/dist-packages/datasets/load.py:1824: in dataset_module_factory
    raise e
/usr/local/lib/python3.8/dist-packages/datasets/load.py:1797: in dataset_module_factory
    dataset_info = hf_api.dataset_info(
/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py:119: in _inner_fn
    return fn(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/huggingface_hub/hf_api.py:2280: in dataset_info
    headers = self._build_hf_headers(token=token)
/usr/local/lib/python3.8/dist-packages/huggingface_hub/hf_api.py:8411: in _build_hf_headers
    return build_hf_headers(
/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_validators.py:119: in _inner_fn
    return fn(*args, **kwargs)
/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_headers.py:126: in build_hf_headers
    token_to_send = get_token_to_send(token)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

token = True

    def get_token_to_send(token: Optional[Union[bool, str]]) -> Optional[str]:
        """Select the token to send from either `token` or the cache."""
        # Case token is explicitly provided
        if isinstance(token, str):
            return token
    
        # Case token is explicitly forbidden
        if token is False:
            return None
    
        # Token is not provided: we get it from local cache
        cached_token = get_token()
    
        # Case token is explicitly required
        if token is True:
            if cached_token is None:
>               raise LocalTokenNotFoundError(
                    "Token is required (`token=True`), but no token found. You"
                    " need to provide a token or be logged in to Hugging Face with"
                    " `huggingface-cli login` or `huggingface_hub.login`. See"
                    " https://huggingface.co/settings/tokens."
                )
E               huggingface_hub.utils._headers.LocalTokenNotFoundError: Token is required (`token=True`), but no token found. You need to provide a token or be logged in to Hugging Face with `huggingface-cli login` or `huggingface_hub.login`. See https://huggingface.co/settings/tokens.

/usr/local/lib/python3.8/dist-packages/huggingface_hub/utils/_headers.py:160: LocalTokenNotFoundError
-------------------------------------------------------------------------------------------------------------------------------------------------- Captured stderr call ---------------------------------------------------------------------------------------------------------------------------------------------------
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
______________________________________________________________________________________________________________________ WhisperModelIntegrationTests.test_whisper_longform_multi_batch_hard_prev_cond ______________________________________________________________________________________________________________________

self = <tests.models.whisper.test_modeling_whisper.WhisperModelIntegrationTests testMethod=test_whisper_longform_multi_batch_hard_prev_cond>

    @slow
    def test_whisper_longform_multi_batch_hard_prev_cond(self):
        # fmt: off
        EXPECTED_TEXT = [
            " Folks, if you watch the show, you know I spent a lot of time right over there. Patiently and astutely scrutinizing the boxwood and mahogany chest set of the day's biggest stories, developing the central headline pawns, definitely maneuvering an oh-so-topical night to F6, faming of classic Sicilian, named or variation on the news, all the while seeing eight moves deep and patiently marshalling the latest press releases into a Fisher shows in lip-nitsky attack that culminates in the elegant lethal slow-played, all-pass on checkmate that is my nightly monologue, but sometimes sometimes folks I sometimes I start to the wake-up side down in the monkey bars of a condemned playground on a super fun site, get all hepped up on goofballs, rummage that would discard a tag bag of defective toys, yank out a fistball of disembodied doll limbs, toss them on a stain kid's place mad from a defunct denies, set up a table inside a rusty cargo container down by the warf and challenge toothless drifters to the godless bughouse blitz of tournament that is my segment, meanwhile.",
            " Folks, I spent a lot of time right over there night after night, actually. Carefully selecting for you the day's newsiest, most aerodynamic headlines, stress testing on those topical anti-lock breaks and power steering, painstakingly stitching, leather seating, so soft, it would make JD power and her associates blush. To create the luxury sedan that is my nightly monologue, but sometimes I just sometimes focus. I lurched to consciousness in the back of an abandoned school bus and slapped myself awake with a crusty floor mat. Before using a mouse-bitten timing belt to strap some old plywood to a couple of discarded oil drums, then by the light of a heathen-moon render a gas tank out of an empty big gulp, filled with white claw and de-natured alcohol, then light a match and let her rip in the dis-mented one man, soapbox derby of news that is my segment.",
            " Ladies and gentlemen, you know, I spent a lot of time right over there, raising the finest hosting news cattle firmly, yet tenderly milking the latest headlines from their jokes, swollen teats, churning the daily stories into the decadent Provincil style triple cream-breed. It is my nightly monologue, but sometimes sometimes I stagger home hungry after being released by the police and root around in the neighbor's trash can for an old milk carton scrape out the blooming dairy residue into the remains of a wet cheese rod I won from a rat in a pre-drawn street fight. Put it in a discarded paint can to leave it to ferment next to a trash fire than a hunker down in hallucinate while eating the Listeria latent demon custard of news that is my segment.",
            " Folks, you watched this show, you know I spend most of my time right over there, carefully sorting through the days, big stories, and selecting only the most subtle, and unblemished ostrich and crocodile news leather, which I then entrust to artisan graduates of the Ickel Greg Waferandi, who carefully died them in a pallet of bright, zesty shades, and adorn them in the finest most topical inlay work, using hand tools and double magnifying glasses, then assemble them according to now classic and elegant geometry using our signature saddle stitching, and line it with bees, wax, coated linen, and finally attach a mallet hammered strap, purled hardware, and close-shet to create for you the one of a kind hope kutur, Ernme, is burkin bag that is my monologue, but sometimes, sometimes folks, sometimes. Sometimes I wake up in the last car of an abandoned rollercoaster at Coney Island where I'm hiding from the triads, I have some engine lubricants out of a safe way bag and staggered down the shore to tear the sail off a beach skoener, then I ripped the coaxial cable out of an RV and elderly couple from Utah, Hank, and Mabel, lovely folks, and use it to stitch the sail into a loose pouch-like rock sack, and I stow in the back of a garbage truck to the junkyard, where I pick through to the debris for only the broken toys that make me the saddest, until I have loaded for you, the hobo fugitives bug out bindle of news that",
            " You know, folks, I spent a lot of time crafting for you a bespoke playlist of the day's big stories right over there. meticulously selecting the most topical chakra affirming scented candles, using Feng Shui, to perfectly align the joke energy in the exclusive boutique yoga retreat that is my monologue, but sometimes just sometimes, I go to the dumpster behind the waffle house at three in the morning, take off my shirt, cover myself and use fry oil, wrap my hands and some old duct tape I stole from a broken car window, pound a six pack of blueberry hard-seller and a second pill, as I stole from a parked ambulance, then arm wrestle a raccoon in the back alley vision quest of news that is my segment.",
            " You know, folks, I spend most of my time right over there. Mining the days, biggest, most important stories, collecting the finest, most topical iron or hand hammering it into joke panels, then I craft sheets of bronze and blazing with patterns that tell an epic tale of conquest and glory. Then, using the Germanic tradition press, black process, I place thin sheets of foil against the scenes and by hammering or otherwise applying pressure from the back, I project these scenes into a pair of cheat cards and a face plate, and finally using fluted strips of white, alloyed molding, I divide the designs into framed panels and hold it all together using bronze rivets to create the beautiful and intimidating, Anglo-Saxon battle helm that is my nightly monologue. But sometimes, sometimes, folks. Sometimes, just sometimes, I come to my senses fully naked on the deck of a pirate-be-seed, melee, container ship that picked me up floating on the detached door of a porta-potty in the Indian Ocean. Then, after a sunstroke induced realization of the crew of this ship plans to sell me an exchange for a bag of oranges to fight off scurvy, I lead a mutiny using only a PVC pipe and a pool chain that accepting my new role as captain and declaring myself King of the Windark Seas. I grab a dirty mop bucket covered in barnacles and adorn it with the teeth of the vanquished to create these shopping wet pirate crown of news that is my segment. Me, why?",
            " Folks, if you watch this show, you know I spend most of my time right over there carefully blending for you the day's newsiest, most topical flower eggs, milk and butter. And straining into a fine batter to make delicate and informative comedy pancakes, then I glaze them in the juice and zest of the most relevant midnight valencio oranges. And doubts at all, and I find delimane de voyage cognac, before from bang and basting them tables, I deserve you the James Beard Award worthy creeps to ZET. That is my nightly monologue, but sometimes sometimes folks, I wake up in the baggage hole of Greyhound bus, it's being hoisted by the scrapyard claw toward the burn pit. Escape to a nearby abandoned price chopper where I scrounge for old bread scraps, busted up in bags of starfruit candies and expired eggs. Chuck it all on a dirty hubcap and slap it over a tire fire before using the legs of a strained pair of sweatpants and as ovenmets to extract and serve the demented transients pound cake of news that is my segment.",
            " Folks, if you watch the show and I hope you do, I spend a lot of time right over there. Tirelessly studying the lineage of the day's most important thoroughbred stories and whole-stiner headlines, working with the best trainers money can buy to rear their comedy offspring with a hand that is stern yet gentle into the triple crown winning equine specimen that is my nightly monologue. But sometimes sometimes folks I break into an unincorporated veterinary genetics lab. And grab whatever test tubes I can find and then under a grow light I got from a discarded chia pet. I mixed the pill for DNA of a horse and whatever was in a tube labeled Keith Cohen-Extra. Slurring the concoction with caffeine pills and a microwave bread bowl, I screamed sing a prayer to Janice initiator of human life and God of Transformation as a half horse, half man freak ceases to life before me and the hideous collection of loose animal parts and corrupted men tissue that is my segment. Meanwhile!"
        ]
        # fmt: on
    
        processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
        model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
        model = model.to(torch_device)
    
        ds = load_dataset("distil-whisper/meanwhile", "default")["test"]
        ds = ds.cast_column("audio", Audio(sampling_rate=16000))
    
        num_samples = 8
    
        audio = ds[:num_samples]["audio"]
        audios = [x["array"] for x in audio]
    
        inputs = processor(
            audios,
            return_tensors="pt",
            truncation=False,
            padding="longest",
            return_attention_mask=True,
            sampling_rate=16_000,
        )
        inputs = inputs.to(device=torch_device)
    
        gen_kwargs = {
            "return_timestamps": True,
            "no_speech_threshold": 0.6,
            "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
            "compression_ratio_threshold": 1.35,
            "condition_on_prev_tokens": True,
            "logprob_threshold": -1.0,
            "num_beams": 5,
        }
    
        torch.manual_seed(0)
        result = model.generate(**inputs, **gen_kwargs)
        decoded_all = processor.batch_decode(result, skip_special_tokens=True)
    
        for i in range(num_samples):
>           assert decoded_all[i] == EXPECTED_TEXT[i]
E           AssertionError: assert ' You know, f...ent. Me wild!' == ' You know, f...ent. Me, why?'
E             -  You know, folks, I spend most of my time right over there. Mining the days, biggest, most important stories, collecting the finest, most topical iron or hand hammering it into joke panels, then I craft sheets of bronze and blazing with patterns that tell an epic tale of conquest and glory. Then, using the Germanic tradition press, black process, I place thin sheets of foil against the scenes and by hammering or otherwise applying pressure from the back, I project these scenes into a pair of cheat cards and a face plate, and finally using fluted strips of white, alloy...
E             
E             ...Full output truncated (4 lines hidden), use '-vv' to show

tests/models/whisper/test_modeling_whisper.py:2684: AssertionError

sanchit-gandhi (Contributor, author) commented:

The first failure is because we haven't passed a token corresponding to a user that has accepted the dataset's terms of use: https://huggingface.co/datasets/mozilla-foundation/common_voice_6_1

If the token for the CI runner also hasn't accepted the terms of use for the gated dataset, I'm happy to update the dataset to one that's un-gated!
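
For reference, a sketch of the token fallback that trips the test, alongside the un-gated swap (both dataset calls are taken from the diff further down):

    import os
    from datasets import load_dataset

    # Falls back to token=True when HF_HUB_READ_TOKEN is unset; huggingface_hub
    # then requires a cached login and raises LocalTokenNotFoundError without one.
    token = os.getenv("HF_HUB_READ_TOKEN", True)

    # Gated dataset: requires accepted terms of use and a valid token
    # ds = load_dataset("mozilla-foundation/common_voice_6_1", "ja", split="test", streaming=True, token=token)

    # Un-gated alternative: no token needed
    ds = load_dataset("facebook/multilingual_librispeech", "german", split="test", streaming=True)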

The second failure does indeed look like an env + machine difference - do you have easy access to the T4? I found it pretty difficult to debug on this machine yesterday, given its unique Docker set-up.

ydshieh (Collaborator) commented Apr 11, 2024

For common voice, let's try not to use common_voice_6_1. Instead, like #27147, let's use something that doesn't require an extra step, if possible.

ydshieh (Collaborator) commented Apr 11, 2024

> do you have easy access to the T4? I found it pretty difficult to debug on this machine yesterday, given its unique Docker set-up

I can access it. I will update the second failing test and push.

ydshieh (Collaborator) commented Apr 11, 2024

> For common voice, let's try not to use common_voice_6_1. Instead, like #27147, let's use something that doesn't require an extra step, if possible.

@sanchit-gandhi I updated the PR so the 2nd failing test (test_whisper_longform_multi_batch_hard_prev_cond) is passing now.

I will let you handle the first one (the one with the dataset issue) 🙏.

Please request a review from me once it's ready, thanks.

ydshieh (Collaborator) commented Apr 15, 2024

@sanchit-gandhi WDYT about using "mozilla-foundation/common_voice_11_0"?

sanchit-gandhi (Contributor, author) commented Apr 15, 2024

This is also a gated dataset. In 5739f54 I've updated the test to use an exclusively un-gated dataset on the Hub, Multilingual LibriSpeech.

ydshieh (Collaborator) left a comment:

Thanks.

It looks like WhisperModelIntegrationTests::test_whisper_longform_multi_batch_hard_prev_cond gives different outputs when I run that single test vs. the whole WhisperModelIntegrationTests suite.

But @sanchit-gandhi is handling everything on his side.

Comment on lines 2676 to +2680
        for i in range(num_samples):
-           assert decoded_all[i] == EXPECTED_TEXT[i]
+           if isinstance(EXPECTED_TEXT[i], str):
+               assert decoded_all[i] == EXPECTED_TEXT[i]
+           elif isinstance(EXPECTED_TEXT[i], tuple):
+               assert decoded_all[i] in EXPECTED_TEXT[i]
ydshieh (Collaborator):

I was not able to get the same results on a T4 VM and on the AWS K8S T4 runner. The difference is "I screamed" vs. "I scream", so I decided to allow both expected values.
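
A self-contained sketch of the resulting pattern (the strings are hypothetical stand-ins for the real expected transcriptions):

    decoded_all = [" I scream sing a prayer", " an exact transcription"]
    EXPECTED_TEXT = [
        (" I screamed sing a prayer", " I scream sing a prayer"),  # either variant passes
        " an exact transcription",                                 # single accepted output
    ]

    for i in range(len(decoded_all)):
        if isinstance(EXPECTED_TEXT[i], str):
            assert decoded_all[i] == EXPECTED_TEXT[i]
        elif isinstance(EXPECTED_TEXT[i], tuple):
            assert decoded_all[i] in EXPECTED_TEXT[i]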

amyeroberts (Collaborator) left a comment:

Awesome work - thanks for fixing all of these!

        set_seed(0)
        processor = WhisperProcessor.from_pretrained("openai/whisper-large")
        model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large")
        model.to(torch_device)

-       token = os.getenv("HF_HUB_READ_TOKEN", True)
-       ds = load_dataset("mozilla-foundation/common_voice_6_1", "ja", split="test", streaming=True, token=token)
+       ds = load_dataset("facebook/multilingual_librispeech", "german", split="test", streaming=True)
A collaborator commented:

Why the change from Japanese?

A collaborator replied:

Sorry, I probably clicked the resolve button. See below for @sanchit-gandhi's previous comment:

#30152 (comment)

sanchit-gandhi (Contributor, author) replied:

The dataset used to load a Japanese sample is also gated. We've swapped to an un-gated dataset, as discussed in #30152 (comment).

@ydshieh ydshieh merged commit 4ed0e51 into huggingface:main Apr 19, 2024
19 checks passed
@sanchit-gandhi sanchit-gandhi deleted the whisper-slow-tests branch April 19, 2024 11:26
ArthurZucker pushed a commit that referenced this pull request Apr 22, 2024
* fix tests

* style

* more fixes

* move model to device

* move logits to cpu

* update expected values

* use ungated dataset

* fix

* fix

* update

---------

Co-authored-by: ydshieh <[email protected]>
ydshieh added a commit that referenced this pull request Apr 23, 2024
itazap pushed a commit that referenced this pull request May 14, 2024