feat: support attention sinks #1105
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Hello, may I ask whether attention sinks are still rather unstable at the moment? How should I use them?
Some tests are failing. Should the tests be changed, or is this a bug?
The implementation is not correct yet, as we do not cache before RoPE. I need to modify the paged attention kernels, but I have not yet found the time to do it.
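To illustrate the point about caching before RoPE: with attention sinks, keys should be rotated by their position *in the cache* rather than by their original text position, so the cache has to hold unrotated (pre-RoPE) keys and apply RoPE at attention time. A minimal sketch of that layout (hypothetical shapes and names, not TGI's actual paged-attention kernels):

```python
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rope_tables(max_pos, dim, base=10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(max_pos).float()
    freqs = torch.outer(t, inv_freq)         # [max_pos, dim/2]
    emb = torch.cat((freqs, freqs), dim=-1)  # [max_pos, dim]
    return emb.cos(), emb.sin()

def apply_rope(x, cos, sin):
    # standard rotary embedding: x * cos + rotate_half(x) * sin
    return x * cos + rotate_half(x) * sin

dim, n_sink, window = 64, 4, 8
cos, sin = rope_tables(1024, dim)

# Keys are cached *unrotated*; positions are assigned from each key's index
# in the cache, so they stay dense even after non-sink tokens are evicted.
k_cache = torch.randn(n_sink + window, dim)   # pre-RoPE keys (sinks + window)
q = torch.randn(1, dim)                       # query for the newest token

pos = torch.arange(k_cache.size(0))           # cache positions, not text positions
k = apply_rope(k_cache, cos[pos], sin[pos])   # rotate at attention time
q_rot = apply_rope(q, cos[pos[-1:]], sin[pos[-1:]])
scores = (q_rot @ k.T) / dim ** 0.5           # [1, n_sink + window]
```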
Any update on this one?
Looking forward to this as a better alternative to RoPE interpolation. In my experience, RoPE interpolation techniques such as NTK tend to cause hallucinations.
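(For context, NTK-aware scaling, the technique mentioned above, replaces position interpolation with a rescaled rotary base, so low frequencies stretch more than high ones. A sketch of the commonly used rescaling formula, assuming the standard NTK-aware variant:)

```python
# NTK-aware RoPE scaling: enlarge the rotary base instead of
# interpolating token positions.
def ntk_scaled_base(base: float, scale: float, dim: int) -> float:
    return base * scale ** (dim / (dim - 2))

# e.g. extending a model with head dim 128 to 4x its trained context:
print(ntk_scaled_base(10000.0, 4.0, 128))  # ~40890, the enlarged base
```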
Strange that this is not yet implemented, because I haven't seen or heard of a better attention technique so far. Please prove me wrong.
The reason is simple: this implementation doesn't work exactly like sinks. The problem is KV-cache handling. Attention sinks require sliding everything except the sinks themselves, which, as far as I know, invalidates the KV cache: as soon as you slide once, the attention of the first non-sink token is wrong, since it had previously attended to non-sink tokens that have since been evicted.
@Bec-k No one is saying you're wrong. That doesn't mean we can implement the feature easily (or that it's worth the engineering time or maintenance cost) at the moment.
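For illustration, a minimal sketch of the sink-aware eviction being described (a hypothetical `SinkWindowCache` helper, not TGI's cache): the first `num_sink` entries are pinned and a sliding window covers the rest. This sliding is exactly what invalidates cached keys that were rotated by their original positions.

```python
from collections import deque

class SinkWindowCache:
    """Keep the first `num_sink` entries forever; slide a window over the rest."""

    def __init__(self, num_sink: int, window: int):
        self.num_sink = num_sink
        self.sinks = []                      # never evicted
        self.window = deque(maxlen=window)   # maxlen auto-evicts the oldest entry

    def append(self, kv):
        if len(self.sinks) < self.num_sink:
            self.sinks.append(kv)
        else:
            self.window.append(kv)

    def tokens(self):
        return self.sinks + list(self.window)

cache = SinkWindowCache(num_sink=4, window=4)
for t in range(12):
    cache.append(t)
print(cache.tokens())  # [0, 1, 2, 3, 8, 9, 10, 11] -- tokens 4..7 were evicted
```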
Attention-sink caching has been implemented in transformers. Really looking forward to the release of this feature on Hugging Face.
@OlivierDehaene An attention-sink cache has been added to the Hugging Face transformers package, implemented as a custom cache. Can you refactor the "text-generation-inference" code to adopt this new feature? Does this solve the initial problem?
Possibly you could port that new SinkCache implementation from transformers and request any additional changes needed for it to be properly integrated into your runtime?
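For reference, a sketch of how the transformers `SinkCache` was used at the time of this thread (it shipped around transformers v4.36; the cache API has evolved since, and the model name below is just an example):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example; any RoPE-based model works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tok("Tell me a very long story.", return_tensors="pt")

# SinkCache pins `num_sink_tokens` initial tokens and keeps a sliding
# window of `window_length` tokens, re-rotating cached keys as it slides.
cache = SinkCache(window_length=1024, num_sink_tokens=4)
out = model.generate(**inputs, past_key_values=cache, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=True))
```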
See the StreamingLLM paper, "Efficient Streaming Language Models with Attention Sinks": https://arxiv.org/abs/2309.17453