Open to contribution: adding `torch.nn.functional.scaled_dot_product_attention` support for more architectures #28005

fxmarty · 2023-12-13T12:35:52Z

Feature request

In Transformers 4.36, we started adding native support of torch.nn.functional.scaled_dot_product_attention (SDPA), enabled by default in Transformers: https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-and-memory-efficient-attention-through-pytorchs-scaleddotproductattention

SDPA allows to dispatch to memory-efficient attention, flash attention on supported GPUs (currently NVIDIA-only), and even on Intel CPUs.

For the record, here's a benchmark on some currently supported models:

Training benchmark, run on A100-SXM4-80GB.

Model	Batch size	Sequence length	Time per batch (`"eager"`, s)	Time per batch (`"sdpa"`, s)	Speedup	Peak memory (`"eager"`, MB)	Peak memory (`"sdpa"`, MB)	Memory savings
llama2 7b	4	1024	1.065	0.90	19.4%	73878.28	45977.81	60.7%
llama2 7b	4	2048	OOM	1.87	/	OOM	78394.58	SDPA does not OOM
llama2 7b	1	2048	0.64	0.48	32.0%	55557.01	29795.63	86.4%
llama2 7b	1	3072	OOM	0.75	/	OOM	37916.08	SDPA does not OOM
llama2 7b	1	4096	OOM	1.03	/	OOM	46028.14	SDPA does not OOM
llama2 7b	2	4096	OOM	2.05	/	OOM	78428.14	SDPA does not OOM

Inference benchmark, run on A100-SXM4-80GB.

Model	Batch size	Prompt length	Num new tokens	Per token latency `"eager"` (ms)	Per token latency `"sdpa"` (ms)	Speedup
llama2 13b	1	1024	1 (prefill)	178.66	159.36	12.11%
llama2 13b	1	100	100	40.35	37.62	7.28%
llama2 13b	8	100	100	40.55	38.06	6.53%
Whisper v3 large	1	/	62	20.05	18.90	6.10%
Whisper v3 large	8	/	77	25.42	24.77	2.59%
Whisper v3 large	16	/	77	28.51	26.32	8.34%

Previously, we had a partial support of SDPA in Optimum BetterTransformer but we are now looking to slowly deprecate it in favor of upstream support of SDPA directly in Transformers.

Here are the architectures for which support has been requested:

The integration could take inspiration from https://github.com/huggingface/optimum/blob/main/optimum/bettertransformer/models/decoder_models.py & https://github.com/huggingface/optimum/blob/main/optimum/bettertransformer/models/attention.py

Motivation

Faster training & inference, lower memory requirement

Your contribution

I may work on some at some point, but contributions are most welcome.

You should refer to #26572 to add the support of SDPA for a model, roughly following these steps:

Create a XxxSdpaAttention class inheriting from XxxAttention and implement the attention logic using SDPA
Use _prepare_4d_causal_attention_mask_for_sdpa instead of _prepare_4d_causal_attention_mask for SDPA
Use _prepare_4d_attention_mask_for_sdpa instead of _prepare_4d_attention_mask for SDPA
Add _supports_sdpa = True to XxxPreTrainedModel
Add "sdpa" key to XXX_ATTENTION_CLASSES in the model modeling file

The text was updated successfully, but these errors were encountered:

ENate · 2023-12-14T13:18:40Z

Hi @fxmarty I can take a look at this issue. Of I can ask questions if necessary. Or has anyone taken it already?

davidan5 · 2023-12-18T19:21:09Z

does someone know if longT5 and all T5 models are blocked by bias support in flash attention ?

Dao-AILab/flash-attention#617

ENate · 2023-12-19T08:14:41Z

Hi @davidan5 are you working on the implementation?

davidan5 · 2023-12-19T12:57:16Z

@ENate I was trying to understand the status and have an estimation of the code change to see if I can contribute.

ENate · 2023-12-19T13:11:47Z

I see.

hackyon · 2024-01-29T22:52:39Z

I'm interested in taking a look at this for the Mistral model if that's still needed. Otherwise, please let me know if there are any other models that still need some work. Thanks

ENate · 2024-01-29T23:35:53Z

Is LongT5 still open?

ArthurZucker · 2024-01-30T09:52:07Z

Mistral is already covered! LongT5 if it is like T5 and has attention bias that might not be supported

hackyon · 2024-01-30T22:44:19Z

Oh yea, looks like you added support for Mistral/Mixtral last month.

It doesn't seem to be supported for BERT yet (I think someone else is working on FA2 but not SDPA), so I'll take a crack at it. It looks like there is a config for relative position embeddings for BERT, so I'll just have it fallback to the original attention for configs using relative position embeddings.

@ArthurZucker - Please let me know if you know if someone else is already working on SDPA for BERT and I can look for something else to do. Thanks!

ArthurZucker · 2024-01-31T01:40:27Z

Not sure anyone is working on that but bert is already so small that I doubt it will have a lot of impact on perf!

abdulfatir · 2024-03-31T21:12:15Z

@ArthurZucker for the T5 family of models, attention bias is required, so flash-attention won't work for now but torch SDPA can still use the memory efficient kernel from xformers, right? I did some benchmarking with Chronos models (based on T5 architecture) here (amazon-science/chronos-forecasting#33) and there's a clear speedup when using torch SDPA.

fxmarty · 2024-04-02T09:00:24Z

@abdulfatir That's correct

abdulfatir · 2024-04-02T09:04:22Z

I can open a PR for T5 with SDPA then. Are there specific things that I should know of or a reference that can look at?

fxmarty · 2024-04-02T09:11:36Z

@abdulfatir For sure, some specific things that are good to know:

pytorch/pytorch#108108 (is_causal=True may not do what you expect)
pytorch/pytorch#110213 (You need

transformers/src/transformers/modeling_attn_mask_utils.py

Line 189 in 416711c

def _unmask_unattended(

)

example of a PR: #29108

ArthurZucker · 2024-04-02T09:15:53Z

FYI going forward we should rather use

transformers/src/transformers/models/llama/modeling_llama.py

Line 1058 in 416711c

def _update_causal_mask(self, attention_mask, input_tensor, cache_position):

as it is more self contained, easier to debug and maintain than the many paths in the atnn_mask utils

sayakpaul · 2024-04-19T10:48:29Z

Hey @abdulfatir just wanted to check in if you are still working on dropping a PR to add SDPA support for T5? It would tremendously help accelerating diffusion models that use T5.

avishaiElmakies · 2024-09-10T08:43:29Z

I can handle dinoV2

avishaiElmakies · 2024-09-25T12:25:34Z

I have some free time, I will add sdpa to Speech2text. will use bart as inspiration

michaelshekasta · 2024-09-30T11:58:29Z

@allenyummy Have you implement sdpa on debertav3? if not, is it sophisticated?

github-actions · 2024-10-25T08:12:58Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker · 2024-10-29T10:23:56Z

It should not be super complicated, and a good refactoring is quite needed. Once #22105 is merged, we can easily add it!

toneyfox · 2024-11-05T10:04:20Z

ailab_OmniGen
Phi3Transformer does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. Please request the support for this architecture: #28005. If you believe this error is a bug, please open an issue in Transformers GitHub repository and load your model with the argument attn_implementation="eager" meanwhile. Example: model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="eager")

HELP ME。
The boss asked how to solve this problem？

ENate · 2024-11-05T11:26:15Z

Hi there. Are there any open models?

michaelshekasta · 2024-11-05T13:05:38Z

It should not be super complicated, and a good refactoring is quite needed. Once #22105 is merged, we can easily add it!

@ArthurZucker
Okay, do we have something like a roadmap? I understand that we should finish the DeBERTaV2 refactoring before implementing FA2 for DeBERTaV3. Do I understand it correctly

ArthurZucker · 2024-11-25T12:57:55Z

Yep! The refactor is now merged, I'll have a look at #34826!

github-actions · 2024-12-20T08:15:30Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

fxmarty added the contributions-welcome label Dec 13, 2023

ArthurZucker mentioned this issue Jan 3, 2024

Pythia regression in transformers==4.36.2 vs transformers==4.30.1 #28316

Closed

4 tasks

huggingface deleted a comment from github-actions bot Jan 15, 2024

hackyon mentioned this issue Jan 31, 2024

[BERT] Add support for sdpa #28802

Merged

5 tasks

hackyon mentioned this issue Feb 19, 2024

[Phi] Add support for sdpa #29108

Merged

5 tasks

huggingface deleted a comment from github-actions bot Feb 26, 2024

lyaronskaya mentioned this issue Feb 27, 2024

add sdpa to ViT #29325

Closed

5 tasks

huggingface deleted a comment from github-actions bot Mar 22, 2024

avishaiElmakies mentioned this issue Sep 10, 2024

Sdpa dino v2 #33403

Merged

5 tasks

hanjidani mentioned this issue Sep 18, 2024

GPT2LMHeadModel does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet md-mohaiminul/VideoRecap#14

Open

OmarManzoor mentioned this issue Sep 19, 2024

Add sdpa for BioGpt #33592

Merged

xiongjun926g mentioned this issue Sep 22, 2024

Issue of install FastCoref in Colab shon-otmazgin/fastcoref#59

Open

avishaiElmakies mentioned this issue Sep 26, 2024

add sdpa and flash_attention2 support to speech2text #33716

Open

5 tasks

OmarManzoor mentioned this issue Sep 26, 2024

Add sdpa for DistilBert #33724

Merged

RUFFY-369 mentioned this issue Sep 27, 2024

Add sdpa for Vivit #33757

Merged

5 tasks

Rocketknight1 mentioned this issue Oct 29, 2024

FlashAttention2 implementation for OpenELM model #34485

Closed

2 tasks

mixoadrian mentioned this issue Nov 3, 2024

Phi3Transformer does not support an attention implementation AIFSH/OmniGen-ComfyUI#12

Open

tpolson mentioned this issue Nov 5, 2024

Phi3Transformer does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. 1038lab/ComfyUI-OmniGen#5

Closed

TAYLENHE mentioned this issue Nov 6, 2024

What is this problem? 1038lab/ComfyUI-OmniGen#6

Closed

zxtmzxtm mentioned this issue Nov 6, 2024

Phi3Transformer does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. 1038lab/ComfyUI-OmniGen#7

Closed

letscreate321 mentioned this issue Nov 7, 2024

Its not working for me 1038lab/ComfyUI-OmniGen#14

Open

SpoSer23 mentioned this issue Nov 10, 2024

SDPA not implemented error whyNLP/LCKV#9

Closed

gunia10 mentioned this issue Nov 11, 2024

ValueError: Phi3Transformer does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. 1038lab/ComfyUI-OmniGen#23

Open

izonewonyoung mentioned this issue Nov 15, 2024

Failed to create pipeline: Phi3Transformer does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. 1038lab/ComfyUI-OmniGen#27

Open

OmarManzoor mentioned this issue Nov 20, 2024

Add sdpa for Detr #34826

Closed

OmarManzoor mentioned this issue Nov 26, 2024

Add sdpa for Beit #34941

Merged

wzf03 mentioned this issue Nov 27, 2024

[ESM] Add support for sdpa. #34954

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Open to contribution: adding `torch.nn.functional.scaled_dot_product_attention` support for more architectures #28005

Open to contribution: adding `torch.nn.functional.scaled_dot_product_attention` support for more architectures #28005

fxmarty commented Dec 13, 2023 •

edited

Loading

ENate commented Dec 14, 2023

davidan5 commented Dec 18, 2023

ENate commented Dec 19, 2023

davidan5 commented Dec 19, 2023

ENate commented Dec 19, 2023

hackyon commented Jan 29, 2024

ENate commented Jan 29, 2024

ArthurZucker commented Jan 30, 2024

hackyon commented Jan 30, 2024 •

edited

Loading

ArthurZucker commented Jan 31, 2024

abdulfatir commented Mar 31, 2024

fxmarty commented Apr 2, 2024

abdulfatir commented Apr 2, 2024

fxmarty commented Apr 2, 2024

ArthurZucker commented Apr 2, 2024

sayakpaul commented Apr 19, 2024

avishaiElmakies commented Sep 10, 2024

avishaiElmakies commented Sep 25, 2024 •

edited

Loading

michaelshekasta commented Sep 30, 2024

github-actions bot commented Oct 25, 2024

ArthurZucker commented Oct 29, 2024

toneyfox commented Nov 5, 2024

ENate commented Nov 5, 2024

michaelshekasta commented Nov 5, 2024 •

edited

Loading

ArthurZucker commented Nov 25, 2024

github-actions bot commented Dec 20, 2024

Open to contribution: adding torch.nn.functional.scaled_dot_product_attention support for more architectures #28005

Open to contribution: adding torch.nn.functional.scaled_dot_product_attention support for more architectures #28005

Comments

fxmarty commented Dec 13, 2023 • edited Loading

Feature request

Motivation

Your contribution

ENate commented Dec 14, 2023

davidan5 commented Dec 18, 2023

ENate commented Dec 19, 2023

davidan5 commented Dec 19, 2023

ENate commented Dec 19, 2023

hackyon commented Jan 29, 2024

ENate commented Jan 29, 2024

ArthurZucker commented Jan 30, 2024

hackyon commented Jan 30, 2024 • edited Loading

ArthurZucker commented Jan 31, 2024

abdulfatir commented Mar 31, 2024

fxmarty commented Apr 2, 2024

abdulfatir commented Apr 2, 2024

fxmarty commented Apr 2, 2024

ArthurZucker commented Apr 2, 2024

sayakpaul commented Apr 19, 2024

avishaiElmakies commented Sep 10, 2024

avishaiElmakies commented Sep 25, 2024 • edited Loading

michaelshekasta commented Sep 30, 2024

github-actions bot commented Oct 25, 2024

ArthurZucker commented Oct 29, 2024

toneyfox commented Nov 5, 2024

ENate commented Nov 5, 2024

michaelshekasta commented Nov 5, 2024 • edited Loading

ArthurZucker commented Nov 25, 2024

github-actions bot commented Dec 20, 2024

Open to contribution: adding `torch.nn.functional.scaled_dot_product_attention` support for more architectures #28005

Open to contribution: adding `torch.nn.functional.scaled_dot_product_attention` support for more architectures #28005

fxmarty commented Dec 13, 2023 •

edited

Loading

hackyon commented Jan 30, 2024 •

edited

Loading

avishaiElmakies commented Sep 25, 2024 •

edited

Loading

michaelshekasta commented Nov 5, 2024 •

edited

Loading