AutoModel supports FA2/paged attention #2133
Closed
As per title.
The models benefiting from it in Transformers are all models with `supports_cache_class = True` and `_supports_flash_attn_2 = True`, following huggingface/transformers#31446 and some further changes needed in Transformers to support a single dimension for the total sequence length, i.e. hidden states of shape `[total_seqlen, hidden_size]`.
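A quick sketch (not part of this PR) of how one could list the qualifying model classes; `models_with_fa2_and_cache_class` is a hypothetical helper, and the underscore-prefixed `_supports_cache_class` spelling is an assumption about how the flag is named in the Transformers codebase:

```python
# Hypothetical helper, not part of this PR: list Transformers model classes
# that declare both flags quoted above. `_supports_cache_class` is assumed
# to be the in-codebase spelling of the `supports_cache_class` flag.
import inspect

import transformers
from transformers import PreTrainedModel


def models_with_fa2_and_cache_class():
    supported = []
    for name in dir(transformers):
        try:
            # transformers is a lazy module; getattr may fail for
            # classes whose optional dependencies are missing.
            obj = getattr(transformers, name)
        except Exception:
            continue
        if (
            inspect.isclass(obj)
            and issubclass(obj, PreTrainedModel)
            and getattr(obj, "_supports_flash_attn_2", False)
            and getattr(obj, "_supports_cache_class", False)
        ):
            supported.append(name)
    return sorted(supported)


print("\n".join(models_with_fa2_and_cache_class()))
```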
`models/__init__.py` is kind of bloated and I guess is going to be refactored with the upcoming TRT-LLM support / multi-backend work.
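For reference, the `[total_seqlen, hidden_size]` layout mentioned above means sequences are packed along a single total-length dimension and delimited by cumulative sequence lengths, instead of being padded into a `[batch_size, seqlen, hidden_size]` tensor. A minimal sketch of how FlashAttention-2's varlen kernel consumes this layout (illustrative only, not code from this PR; assumes the `flash-attn` package and a CUDA device; the kernel sees the hidden dimension already split into heads, `[total_seqlen, num_heads, head_dim]`):

```python
import torch
from flash_attn import flash_attn_varlen_func

# Two sequences of lengths 3 and 5 packed into one tensor: no batch dim,
# just a single total-sequence-length dimension.
total_seqlen, num_heads, head_dim = 8, 4, 64
q = torch.randn(total_seqlen, num_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Cumulative sequence lengths delimit the packed sequences: [0, 3, 8].
cu_seqlens = torch.tensor([0, 3, 8], device="cuda", dtype=torch.int32)
max_seqlen = 5  # length of the longest packed sequence

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max_seqlen,
    max_seqlen_k=max_seqlen,
    causal=True,
)
print(out.shape)  # torch.Size([8, 4, 64])
```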