
Static cache + torch.compile: better documentation for prefill static sequence length #29151

Closed · fxmarty opened this issue Feb 20, 2024 · 8 comments · Fixed by #30788
Labels: Cache, Compilation (issues related to torchdynamo and torchinductor), Generation

fxmarty (Contributor) commented Feb 20, 2024

Feature request

When using torch.compile, the prefill is recompiled for every new sequence length, which is slow. It may be nice to compile only for a fixed set of sequence lengths (1, 2, 4, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, etc.), padding inputs on the fly to the next supported length.
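A minimal sketch of what such length bucketing could look like; the `pad_to_bucket` helper and the bucket sizes are illustrative, not an existing transformers API:

```python
import torch

# Hypothetical bucket sizes: the compiled prefill only ever sees these lengths.
BUCKETS = (1, 2, 4, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096)

def pad_to_bucket(input_ids, attention_mask, pad_token_id):
    # Left-pad input_ids/attention_mask up to the next bucket size so that
    # torch.compile reuses an already-compiled graph instead of recompiling.
    seq_len = input_ids.shape[-1]
    target = next((b for b in BUCKETS if b >= seq_len), seq_len)
    pad_len = target - seq_len
    if pad_len == 0:
        return input_ids, attention_mask
    input_ids = torch.nn.functional.pad(input_ids, (pad_len, 0), value=pad_token_id)
    attention_mask = torch.nn.functional.pad(attention_mask, (pad_len, 0), value=0)
    return input_ids, attention_mask
```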

Motivation

Compilation with torch.compile is prohibitively slow, even with #29114.

If people want to use transformers + static cache + torch.compile, it should be FAST to run generate on new sequence lengths.

Your contribution

None for now

@fxmarty fxmarty changed the title Static cache: support prefill static sequence length Static cache + torch.compile: support prefill static sequence length Feb 20, 2024
amyeroberts (Collaborator) commented:
cc @gante

gante (Member) commented Feb 21, 2024

@fxmarty this is the same problem as we have in TF and Flax. There, we nudged users to use the pad_to_multiple_of argument in the tokenizer, which I believe solves the problem 🤗

How do you suggest we let users know about this feature, other than the docs?
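For reference, a minimal sketch of the pad_to_multiple_of approach combined with a static cache and a compiled forward pass; the checkpoint name, padding multiple, and device handling are placeholder choices, not the documented recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding needs a pad token

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
model.generation_config.cache_implementation = "static"  # fixed-size KV cache

# Compile the forward pass once; the static cache keeps the KV shapes constant.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

# Padding the prompt to a multiple of 64 bounds the number of distinct prefill
# lengths torch.compile can encounter, so recompilations become rare.
inputs = tokenizer(
    "Why is the compiled prefill sensitive to the prompt length?",
    return_tensors="pt",
    padding=True,
    pad_to_multiple_of=64,
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```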

fxmarty (Contributor, Author) commented Feb 22, 2024

@gante Supporting that in the tokenizer is already good, but I am wondering whether it would make sense to support it directly in generate. Have you seen any user requests for that?

gante (Member) commented Feb 26, 2024

@fxmarty I haven't.

I am also not a big fan of it:
a) it pushes the problem from forward to generate (i.e. forward would not see recompilations, but generate will, as it will have an input tensor with arbitrary length)
b) it hides the real behavior (padding) from the user, which may lead to issues due to behavior misunderstandings. An obvious one I can foresee is "my input has X length, I have set max_new_tokens=Y, why isn't the output length X+Y?"

pad_to_multiple_of avoids the problems I mentioned, but it is harder to discover 🤗 Still, I think it is preferable!

fxmarty (Contributor, Author) commented Feb 26, 2024

a) it pushes the problem from forward to generate (i.e. forward would not see recompilations, but generate will, as it will have an input tensor with arbitrary length)

Not really (at least not for torch.compile), as generate is simply not compiled.

b) it hides the real behavior (padding) from the user, which may lead to issues due to behavior misunderstandings. An obvious one I can foresee is "my input has X length, I have set max_new_tokens=Y, why isn't the output length X+Y?"

Fair enough. I think generate could show a warning about the feature (e.g. when the model is an OptimizedModule), and/or we could document the usage with torch.compile.
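A hedged sketch of what such a hint could look like; the function name and message are illustrative, not the actual transformers implementation:

```python
import logging
import torch

logger = logging.getLogger(__name__)

def _maybe_hint_about_padding(model, input_ids):
    # torch.compile wraps modules in OptimizedModule; if the user compiled the
    # model and feeds arbitrary prompt lengths, each new length triggers a
    # recompilation of the prefill graph.
    if isinstance(model, torch._dynamo.eval_frame.OptimizedModule):
        logger.warning(
            "The model is compiled with torch.compile; a prompt of length %d will "
            "trigger a recompilation for every new sequence length. Consider "
            "tokenizing with pad_to_multiple_of to reuse compiled graphs.",
            input_ids.shape[-1],
        )
```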

gante (Member) commented Feb 26, 2024

as generate is simply not compiled.

@fxmarty yet ;) Beam search has some heavy tensor operations that should be compiled, some logits processors are heavy, etc.

The difference between passing a flag to generate or to the tokenizer is small, but passing it to generate would restrict our ability to fully compile generate if we decide to go down that path at some point.

fxmarty (Contributor, Author) commented Feb 26, 2024

@gante Agreed, although @torch.compiler.disable is useful for that.
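A small sketch of how torch.compiler.disable can carve a function out of a compiled region; the helpers here are hypothetical, not part of generate:

```python
import torch

@torch.compiler.disable
def pick_next_token(logits):
    # Kept out of the compiled graph: dynamic Python control flow here would
    # otherwise cause graph breaks or recompilations.
    return torch.argmax(logits, dim=-1)

@torch.compile
def compiled_step(hidden, weight):
    # The heavy tensor math is compiled...
    logits = hidden @ weight
    # ...while the disabled helper runs eagerly even when called from here.
    return pick_next_token(logits)
```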

@fxmarty fxmarty changed the title Static cache + torch.compile: support prefill static sequence length Static cache + torch.compile: better documentation for prefill static sequence length Feb 26, 2024
@fxmarty fxmarty added the Compilation Issues related to torchdynamo and torchinductor label Feb 28, 2024
gante (Member) commented May 25, 2024

#30788 -- this PR adds documentation on using pad_to_multiple_of to avoid input-shape-related recompilation.

I'm assuming this issue can be closed after the PR gets merged :) In the generate refactor we will be separating the prefill step, and we can then move/enhance related documentation.
