Adding Flash Attention 2 Support for GPT2 #29226
Conversation
Wow, thanks for the great work! At a quick glance it seems you took very good care of the copy mechanism, which is quite a challenge for GPT2!
Please find the benchmarking script here: https://gist.github.com/younesbelkada/02f35734da906cc0f2389ae4f665c58f. I suggest trying it out for prefill only on a large sequence length - let us know, with @ArthurZucker and @fxmarty, how it goes.
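For readers following along, below is a minimal sketch of the kind of prefill benchmark that script runs; the linked gist is the authoritative version, and the sequence length, dtype, and timing loop here are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM

def time_prefill(attn_implementation: str, seq_len: int = 1024, n_iters: int = 10) -> float:
    """Average forward-pass (prefill) latency in ms for a single long prompt."""
    model = AutoModelForCausalLM.from_pretrained(
        "gpt2", torch_dtype=torch.float16, attn_implementation=attn_implementation
    ).to("cuda")
    # GPT-2 supports at most 1024 positions, so use that as the "large" prefill length.
    input_ids = torch.randint(0, model.config.vocab_size, (1, seq_len), device="cuda")

    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        model(input_ids)  # warmup
        torch.cuda.synchronize()
        start.record()
        for _ in range(n_iters):
            model(input_ids)
        end.record()
        torch.cuda.synchronize()
    return start.elapsed_time(end) / n_iters

for impl in ("eager", "flash_attention_2"):
    print(impl, f"{time_prefill(impl):.2f} ms")
```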
Hey, I don't have a GPU and I was renting an RTX 3090 on RunPod to work on this PR. Is it a problem to use the 3090 for benchmarking, or should I switch to an A100 (which I believe was the GPU used in the other benchmarks, at least the ones I've seen)?
Thanks @EduardoPach for getting back, I think using a 3090 is fine!
@ArthurZucker I believe it should be ready for review
Great work!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
LGTM but we need to add a test 😉
@EduardoPach thanks again, what @ArthurZucker meant is an integration test similar to the ones we have for the other Flash Attention 2 models.
Yeah, I will add the test in the following hours
Co-authored-by: Arthur <[email protected]>
Co-authored-by: Arthur <[email protected]>
Co-authored-by: Arthur <[email protected]>
The following hours became more like the following days haha, but it should be good now @ArthurZucker
Almost good, left a few nits
@@ -346,21 +572,25 @@ def forward(self, hidden_states: Optional[Tuple[torch.FloatTensor]]) -> torch.Fl
    return hidden_states


# Copied from transformers.models.gpt2.modeling_gpt2.GPT2Block with GPT2->DecisionTransformerGPT2
DECISIONTRANSFORMERGPT2_ATTENTION_CLASSES = {
Suggested change:
-DECISIONTRANSFORMERGPT2_ATTENTION_CLASSES = {
+DECISION_TRANSFORMER_GPT2_ATTENTION_CLASSES = {
# Copied from transformers.models.gpt2.modeling_gpt2.GPT2Block with GPT2->DecisionTransformerGPT2
DECISIONTRANSFORMERGPT2_ATTENTION_CLASSES = {
    "eager": DecisionTransformerGPT2Attention,
}
Where is DecisionTransformerGPT2FlashAttention2?
I hadn't added it there, but I added it now in 74fb9bd. However, DecisionTransformer does not support flash attention yet; I just had to make these modifications to make sure nothing would break with the # Copied from statements.
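For anyone reading along, the `*_ATTENTION_CLASSES` dicts are consumed roughly as in the sketch below when a block is built. The class names are simplified stand-ins; in this PR the real entries are GPT2Attention / GPT2FlashAttention2 (and only the eager entry for DecisionTransformer).

```python
import torch.nn as nn

class EagerAttention(nn.Module):
    """Stand-in for GPT2Attention (plain softmax attention)."""
    def __init__(self, config):
        super().__init__()

class FlashAttention2(nn.Module):
    """Stand-in for GPT2FlashAttention2 (same weights, flash-attn kernels in forward)."""
    def __init__(self, config):
        super().__init__()

# One mapping per model; the key comes from config._attn_implementation,
# which from_pretrained(..., attn_implementation=...) fills in.
ATTENTION_CLASSES = {
    "eager": EagerAttention,
    "flash_attention_2": FlashAttention2,
}

class Block(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Pick the attention module for the configured backend.
        attention_class = ATTENTION_CLASSES[config._attn_implementation]
        self.attn = attention_class(config)
```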
Co-authored-by: Arthur <[email protected]>
Co-authored-by: Arthur <[email protected]>
LGTM, one final nit: the test should have explicit expected values.
Thanks again! We just merged some fixes on main - could you rebase again 🙏? Then we should finally merge :D Sorry for all the iterations!
No worries! Done
Thanks! Hmm, I can't see the rebase commit in the history, perhaps you can try again?
I've done the rebase again
Thanks for adding this and making our models go brrr 🔥
Just a few small comments. The diffs in the READMEs will need to be resolved before we can merge
README_de.md (Outdated)
There shouldn't be README changes here. Can you make sure to rebase on main to include the most recent changes?
Co-authored-by: amyeroberts <[email protected]>
…transformers into add-flash-attn-gpt2
Thanks for the continued work on this!
Only thing left to do is make sure decision transformer has the updated documentation and tests
@require_torch_gpu
@pytest.mark.flash_attn_test
@slow
def test_flash_attn_2_generate_padding_left(self):
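For context, here is a sketch of what the body of such a test typically looks like on the GPT2 test class; the prompts and the simple equality check are illustrative, and the committed test additionally pins explicit expected strings as requested above.

```python
import pytest
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers.testing_utils import require_torch_gpu, slow

@require_torch_gpu
@pytest.mark.flash_attn_test
@slow
def test_flash_attn_2_generate_padding_left(self):
    # Reference model on the eager attention path.
    model = GPT2LMHeadModel.from_pretrained("gpt2", torch_dtype=torch.float16).to(0)
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"

    texts = ["hi", "Hello this is a very long sentence"]
    inputs = tokenizer(texts, return_tensors="pt", padding=True).to(0)
    output_native = model.generate(**inputs, max_new_tokens=20, do_sample=False)
    output_native = tokenizer.batch_decode(output_native)

    # Same checkpoint, Flash Attention 2 backend.
    model_fa2 = GPT2LMHeadModel.from_pretrained(
        "gpt2", torch_dtype=torch.float16, attn_implementation="flash_attention_2"
    ).to(0)
    output_fa2 = model_fa2.generate(**inputs, max_new_tokens=20, do_sample=False)
    output_fa2 = tokenizer.batch_decode(output_fa2)

    # Left-padded generation must match between the two code paths
    # (the real test also compares against hard-coded expected strings).
    self.assertListEqual(output_native, output_fa2)
```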
The equivalent test for decision transformer should also be added
Doesn't the test in GPT2 already cover Decision Transformer? The use of Flash Attention in Decision Transformer comes precisely from GPT2Model being embedded in its architecture.
Both models should be tested. This makes sure that they remain correct if anything changes upstream, for example the input preparation in DecisionTransformerModel.
While adding the test for DecisionTransformer I realized that the model has two distinct xxxPreTrainedModels, and that adding support for flash_attention_2 would be a bit more complicated, so I believe it would be better to add that support in a separate PR.
In this case, flash attention shouldn't be added at all for the model. You can use # Ignore copy on the init so the previous attention class's method is used.
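If it helps, the suggestion boils down to something like the sketch below in modeling_decision_transformer.py; the class bodies are trimmed stand-ins and the exact placement of the marker is governed by the repo's copy-checking tooling.

```python
import torch.nn as nn

class DecisionTransformerGPT2Attention(nn.Module):
    """Trimmed stand-in for the existing eager attention module."""
    def __init__(self, config=None, layer_idx=None):
        super().__init__()

# Copied from transformers.models.gpt2.modeling_gpt2.GPT2Block with GPT2->DecisionTransformerGPT2
class DecisionTransformerGPT2Block(nn.Module):
    # Ignore copy
    def __init__(self, config=None, layer_idx=None):
        super().__init__()
        # Keep the eager attention class unconditionally: DecisionTransformer
        # does not support flash_attention_2 yet, so there is no dispatch on
        # config._attn_implementation, and the marker above tells the copy
        # checker not to flag this divergence from GPT2Block.__init__.
        self.attn = DecisionTransformerGPT2Attention(config, layer_idx=layer_idx)
```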
@@ -60,6 +60,73 @@ This model was contributed by [thomwolf](https://huggingface.co/thomwolf). The o
- Enabling the *scale_attn_by_inverse_layer_idx* and *reorder_and_upcast_attn* flags will apply the training stability
  improvements from [Mistral](https://github.com/stanford-crfm/mistral/) (for PyTorch only).


## Usage example
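The usage example added to gpt2.md presumably reads along these lines; the prompt and generation settings here are placeholders, and the committed doc is authoritative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Half precision plus the Flash Attention 2 backend (requires flash-attn to be installed).
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).to(device)

inputs = tokenizer("Hello, my dog is cute and", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```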
We should have the equivalent added for decision transformer too
See message above
Thank you for your hard work! Many of us are excited about the GPT-2 model supporting flash attention. May I ask when the PR is expected to be merged?
Hey, I believe if @amyeroberts agrees with my latest message it should get merged right away 🤞
Thanks for iterating - a few final places to tidy up.
docs/source/en/perf_infer_gpu_one.md (Outdated)
@@ -40,8 +40,10 @@ FlashAttention-2 is currently supported for the following architectures:
* [Bark](https://huggingface.co/docs/transformers/model_doc/bark#transformers.BarkModel)
* [Bart](https://huggingface.co/docs/transformers/model_doc/bart#transformers.BartModel)
* [Cohere](https://huggingface.co/docs/transformers/model_doc/cohere#transformers.CohereModel)
* [DecisionTransformer](https://huggingface.co/docs/transformers/en/model_doc/decision_transformer)
This should be removed
@@ -548,25 +551,26 @@ def forward(
    position_ids = torch.arange(past_length, input_shape[-1] + past_length, dtype=torch.long, device=device)
    position_ids = position_ids.unsqueeze(0)

-    # GPT2Attention mask.
+    # Attention mask.
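For context on the review note that follows: once FA2 is an option, GPT2Model's mask preparation has to branch on the backend, roughly as in the simplified standalone sketch below (not the committed diff). DecisionTransformer should keep only the eager branch, hence the suggestion to use # Ignore copy here.

```python
import torch

def prepare_attention_mask(attention_mask, attn_implementation, dtype=torch.float16):
    """Simplified sketch of the per-backend attention-mask handling."""
    if attn_implementation == "flash_attention_2":
        # The flash-attn kernel consumes the 2D padding mask directly,
        # or no mask at all when nothing is padded.
        if attention_mask is None or not (attention_mask == 0).any():
            return None
        return attention_mask
    # Eager attention expects an additive 4D mask: 0.0 where attended,
    # a large negative value where masked.
    if attention_mask is None:
        return None
    mask = attention_mask[:, None, None, :].to(dtype)
    return (1.0 - mask) * torch.finfo(dtype).min
```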
Here I would use # ignore copy - the model shouldn't have FA2 logic
Addressed
@@ -575,7 +579,8 @@ def forward(
    encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
    if encoder_attention_mask is None:
        encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
    encoder_attention_mask = self.invert_attention_mask(encoder_attention_mask)
Same here above this line
Addressed
Thanks for adding this for GPT2 and iterating on a solution!
It seems everything is okay. May I kindly request that this PR be merged? I am really looking forward to speeding up my GPT-2. If my request has added to your workload, I apologize for any inconvenience.
cc @amyeroberts
* First commit to add flash attention 2 for GPT-2
* more improvements
* Make GPT2 pass tests and fixed Decison Transformers copies
* Fixed missing arg
* fix copies
* Added expected speedup
* Update src/transformers/models/gpt2/modeling_gpt2.py Co-authored-by: Arthur <[email protected]>
* Update src/transformers/models/gpt2/modeling_gpt2.py Co-authored-by: Arthur <[email protected]>
* Update src/transformers/models/gpt2/modeling_gpt2.py Co-authored-by: Arthur <[email protected]>
* Added test
* Fixed attn attribute
* Update docs/source/en/model_doc/gpt2.md Co-authored-by: Arthur <[email protected]>
* Update docs/source/en/model_doc/gpt2.md Co-authored-by: Arthur <[email protected]>
* Update Decision transformer attentions
* More updates
* Passing tests
* Fix copies
* Fix copies part 2
* Decision transformer updates
* Update src/transformers/models/gpt2/modeling_gpt2.py Co-authored-by: amyeroberts <[email protected]>
* Fix copies
* Decision transformer not supporting flash attn
* Addressed comments
* Addressed comments
* Addressed comments

Co-authored-by: Arthur <[email protected]>
Co-authored-by: amyeroberts <[email protected]>
What does this PR do?
Fixes #26350
Who can review?
Hey @younesbelkada, I added flash attention 2 support for GPT2. The only thing missing is the expected speedups section; could you share the code you used for the other models you added support for, to keep consistency?