feat: support attention sinks #1105

Closed · wants to merge 3 commits

Conversation

OlivierDehaene
Member

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@CEDIDataVault

Hello, may I ask whether attention sinks are still rather unstable at the moment? How should I use them?

@Bec-k

Bec-k commented Oct 25, 2023

A test is failing. Should the test be changed, or is this a bug?

@OlivierDehaene
Member Author

The implementation is not correct yet, as we do not cache keys before RoPE is applied. I need to modify the paged attention kernels, but I haven't found the time to do it yet.
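
For context, the StreamingLLM-style recipe this refers to caches keys before the rotary embedding and re-applies RoPE using each key's position inside the truncated cache rather than its original absolute position. The sketch below is only an illustration of that idea in plain PyTorch; `rope` and `sink_attention` are simplified stand-ins, not TGI's paged attention kernels.

```python
import torch

def rope(x, positions, base=10000.0):
    """Toy rotary embedding for a (seq, dim) tensor (dim must be even)."""
    dim = x.size(-1)
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions[:, None].float() * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def sink_attention(q, k_pre_rope, v, num_sinks=4, window=1020):
    """Attend over the sink tokens plus the most recent `window` tokens."""
    seq = k_pre_rope.size(0)
    # Evict the middle: keep the sinks and the most recent entries.
    keep = torch.cat([
        torch.arange(min(num_sinks, seq)),
        torch.arange(max(num_sinks, seq - window), seq),
    ])
    k, v = k_pre_rope[keep], v[keep]
    # Re-apply RoPE with positions *inside the truncated cache*; caching
    # post-RoPE keys would freeze the original absolute positions instead.
    pos = torch.arange(k.size(0))
    k = rope(k, pos)
    q = rope(q, pos[-1:])  # the current query sits at the last slot
    scores = (q @ k.T) / k.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: one query over a cache of 8 pre-RoPE keys/values of dim 16.
q = torch.randn(1, 16)
k_pre, v = torch.randn(8, 16), torch.randn(8, 16)
print(sink_attention(q, k_pre, v, num_sinks=2, window=4).shape)  # torch.Size([1, 16])
```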

@Bec-k

Bec-k commented Nov 20, 2023

Any update on this one?

@zfang

zfang commented Jan 4, 2024

Looking forward to this as a better alternative to RoPE interpolation. In my experience, RoPE interpolation techniques such as NTK scaling tend to cause hallucinations.

@Bec-k

Bec-k commented Jan 9, 2024

It's strange that this is not yet implemented, because I haven't seen or heard of a better attention technique so far. Please prove me wrong.

@Narsil
Collaborator

Narsil commented Jan 10, 2024

The reason is simple: this implementation doesn't work exactly like attention sinks. The problem is KV-cache handling: attention sinks require sliding everything except the sinks themselves, which, as far as I know, invalidates the KV cache (as soon as you slide once, the attention of the first non-sink token is wrong, since it had previously been attending to non-sink tokens that are now gone).

> Prove me wrong.

@Bec-k No one is saying you're wrong. That doesn't mean we can implement the feature easily (or that it's worth the engineering time or maintenance cost) at the moment.
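
To make the invalidation argument concrete, here is a tiny, self-contained illustration (assumed semantics, not TGI code): each cached entry was produced while attending to every token present at the time, so after the first slide the oldest surviving non-sink entry depends on tokens that are no longer in the cache.

```python
def seen_at_creation(pos):
    """Token `pos` attended to positions 0..pos when its KV entry was built."""
    return set(range(pos + 1))

def cache_after_step(step, num_sinks=2, window=4):
    """Indices kept by a sink cache once `step + 1` tokens have been seen."""
    seq = step + 1
    recent = range(max(num_sinks, seq - window), seq)
    return list(range(min(num_sinks, seq))) + list(recent)

kept = cache_after_step(step=7, num_sinks=2, window=4)   # [0, 1, 4, 5, 6, 7]
oldest_non_sink = kept[2]                                # token 4
evicted = seen_at_creation(oldest_non_sink) - set(kept)  # {2, 3}
print(f"token {oldest_non_sink} was built while attending to evicted tokens {sorted(evicted)}")
```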

@xuan1905

Attention sink caching has been implemented in transformers. Really looking forward to the release of this feature in text-generation-inference.

@Bec-k

Bec-k commented Feb 21, 2024

@OlivierDehaene An attention-sink cache has been added to the Hugging Face transformers package; it is implemented as a custom cache:
https://huggingface.co/docs/transformers/en/internal/generation_utils#transformers.SinkCache
https://github.com/huggingface/transformers/blob/main/src/transformers/cache_utils.py#L172
Integration PR in the transformers library:
https://github.com/huggingface/transformers/pull/26681/files

Can you refactor the text-generation-inference code to adopt this new feature? Does this solve the initial problem?
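
For reference, this is roughly what the SinkCache API linked above looks like when used directly with transformers' generate (outside of TGI). The model name and sizes below are placeholders, and the exact keyword names should be checked against the installed transformers version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, SinkCache

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Tell me a very long story.", return_tensors="pt").to(model.device)

# Keep 4 sink tokens plus a sliding window of recent tokens (sizes are illustrative).
cache = SinkCache(window_length=1024, num_sink_tokens=4)
output = model.generate(**inputs, past_key_values=cache, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```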

@Bec-k

Bec-k commented Feb 21, 2024

Perhaps you could build on that new SinkCache implementation in transformers and request whatever additional changes are needed for it to be properly integrated into your runtime?

github-actions bot added the Stale label on Mar 23, 2024
github-actions bot closed this on Mar 28, 2024