Static cache + torch.compile: better documentation for prefill static sequence length #29151
Comments
cc @gante
@fxmarty this is the same problem as we have in TF and Flax. There, we nudged users to pad their inputs through the tokenizer. How do you suggest we let users know about this feature, other than docs?
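For illustration, a tokenizer-side sketch of that nudge might look like the snippet below. Whether `pad_to_multiple_of` is the exact feature meant above is an assumption (it is an existing tokenizer argument), and "gpt2" is only an example checkpoint:

```python
from transformers import AutoTokenizer

# Example checkpoint only; any causal LM tokenizer works the same way.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token

# Padding every batch to a multiple of 64 keeps the number of distinct
# prefill shapes small, so a compiled model recompiles far less often.
batch = tokenizer(
    ["Hello, my name is", "A much longer prompt with quite a few more tokens in it"],
    return_tensors="pt",
    padding=True,
    pad_to_multiple_of=64,
)
print(batch["input_ids"].shape)  # the sequence dimension is a multiple of 64
```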
@gante Supporting that in the tokenizer is already good, but I am wondering whether it would make sense to support it in generation directly. Have you seen any user requests about that?
@fxmarty I haven't. I am also not a big fan of it:
Not really (at least not for torch.compile), as generate is simply not compiled.
Fair enough. I think a warning could be shown in generate (e.g. in case the model is a compiled module).
@fxmarty yet ;) Beam search has some heavy tensor operations that should be compiled, some logits processors are heavy, etc. The difference between passing a flag to
@gante agreed, although
Feature request
When using torch.compile, the prefill is recompiled for every new sequence length, which is slow. It may be nice to be able to compile only for a fixed set of sequence lengths (1, 2, 4, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, etc.) on the fly depending on the input lengths, using some padding.
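As a rough illustration of the idea, a user-side workaround could look like the sketch below. The bucket list and the pad_to_bucket helper are made up for illustration (not an existing transformers API); "gpt2" is just an example checkpoint, and the static-cache generate call assumes a transformers version with StaticCache support:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative bucket lengths: the compiled prefill only ever sees these shapes.
BUCKETS = [1, 2, 4, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]

def pad_to_bucket(input_ids, attention_mask, pad_token_id):
    """Left-pad the prompt to the next bucket length (hypothetical helper)."""
    seq_len = input_ids.shape[1]
    target = next((b for b in BUCKETS if b >= seq_len), seq_len)
    pad_len = target - seq_len
    if pad_len == 0:
        return input_ids, attention_mask
    pad_ids = torch.full((input_ids.shape[0], pad_len), pad_token_id, dtype=input_ids.dtype)
    pad_mask = torch.zeros((attention_mask.shape[0], pad_len), dtype=attention_mask.dtype)
    return torch.cat([pad_ids, input_ids], dim=1), torch.cat([pad_mask, attention_mask], dim=1)

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
input_ids, attention_mask = pad_to_bucket(
    inputs.input_ids, inputs.attention_mask, tokenizer.eos_token_id
)
# With bucketed padding, prompts of length 90 and 100 both hit the 128 bucket,
# so the prefill is compiled once instead of twice.
out = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    cache_implementation="static",  # requires a transformers version with StaticCache
    max_new_tokens=20,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```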
Motivation
torch.compile compilation is prohibitively slow even with #29114. If people want to use transformers + static cache + torch.compile, it should be FAST to run generate on new sequence lengths.
Your contribution
None for now