Adding support for a static shape generate
#28075
Comments
cc @gante

Any update on this ticket?

There is an open PR: #27931

Thanks @oobabooga 🤗 and yes, this is my main focus; hoping to ship by the end of the week.
Many thanks! Do you need help with the PR? (Development/testing/writing examples on how to run a model with static shapes on the NPU?)

I don't really have access to an NPU currently, so feel free to test it. It's still in draft mode, so stay tuned for when it's ready for review!
Feature request
Many AI inference accelerators (Intel NPU, IPU, TPU, etc.) require static shapes to get maximum performance. Static shapes allow the NN graph compiler to improve memory management, scheduling, and overall network performance.

However, in `transformers` the `generate` function uses dynamic shapes and increases the size of the input (and KV-cache) at every successive step. I opened this issue to implement a way to still do LLM generation inference with the `transformers` API while maintaining static shapes.

The trick is to use left padding and to shift the KV-cached values left while doing inference. By setting the `position_ids` correctly we get a correct inference. Attached is a GIF that hopefully explains how it works.

At the first inference you pad left and run as usual; it is important to set the `attention_mask` and `position_ids` accordingly. In the KV-cached steps you only need to pass the new token and the proper `position_ids` and `attention_mask` while the cache values are shifted left. This works because in the MHA block the cached keys and values are concatenated on the left with the new ones, and the left padding makes the new token's key and value tensors adjacent to the cached values.

Here is a snippet for a function that implements this. The code is not production ready, but it is a POC that shows how it is supposed to work, both with and without KV-caching.
Motivation

Enabling AI inference accelerators to be used with the `generate` API.

Your contribution
I'll be happy to help integrate the code into the `transformers` library. Let me know how I can help.