
Add sdpa for Detr #34826

Closed

wants to merge 17 commits into from

Conversation

OmarManzoor
Contributor

@OmarManzoor OmarManzoor commented Nov 20, 2024

What does this PR do?

Towards #28005

  • Adds sdpa for Detr model

Who can review?

CC: @amyeroberts @ArthurZucker
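
For context, SDPA here means routing attention through PyTorch's torch.nn.functional.scaled_dot_product_attention instead of the eager matmul-softmax path. Below is a minimal sketch of that pattern; it is illustrative only, not the PR's actual DetrAttention code, and the class and argument names are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SdpaAttentionSketch(nn.Module):
    """Multi-head attention whose forward pass delegates to scaled_dot_product_attention."""

    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.dropout = dropout
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def _shape(self, x: torch.Tensor, bsz: int) -> torch.Tensor:
        # (batch, seq, embed_dim) -> (batch, num_heads, seq, head_dim)
        return x.view(bsz, -1, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, hidden_states, key_value_states=None, attention_mask=None):
        # key_value_states is set for cross-attention; otherwise this is self-attention.
        bsz = hidden_states.size(0)
        kv = key_value_states if key_value_states is not None else hidden_states
        q = self._shape(self.q_proj(hidden_states), bsz)
        k = self._shape(self.k_proj(kv), bsz)
        v = self._shape(self.v_proj(kv), bsz)
        # Dispatches to the flash, memory-efficient, or math kernel depending on inputs and hardware.
        attn = F.scaled_dot_product_attention(
            q, k, v, attn_mask=attention_mask, dropout_p=self.dropout if self.training else 0.0
        )
        attn = attn.transpose(1, 2).reshape(bsz, -1, self.num_heads * self.head_dim)
        return self.out_proj(attn)


# Example: 2 images' worth of 100 object queries with a hidden size of 256.
layer = SdpaAttentionSketch(embed_dim=256, num_heads=8)
out = layer(torch.randn(2, 100, 256))  # -> (2, 100, 256)
```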

@qubvel
Member

qubvel commented Nov 20, 2024

Hi @OmarManzoor! Thanks for taking this up, feel free to ping me when it's ready for review!

@OmarManzoor
Contributor Author

@qubvel I posted a question in the description. Could you kindly have a look?

@qubvel
Member

qubvel commented Nov 20, 2024

@OmarManzoor Yes, for some models we see that the threshold should be different. You can override the model tests.
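
For illustration, here is a hedged sketch of the kind of eager-vs-SDPA comparison where a per-model tolerance matters. The checkpoint name and thresholds are illustrative, and loading DETR with attn_implementation="sdpa" assumes this PR's support is in place.

```python
import torch
from transformers import DetrModel

pixel_values = torch.randn(1, 3, 224, 224)

eager = DetrModel.from_pretrained("facebook/detr-resnet-50", attn_implementation="eager").eval()
sdpa = DetrModel.from_pretrained("facebook/detr-resnet-50", attn_implementation="sdpa").eval()

with torch.no_grad():
    out_eager = eager(pixel_values=pixel_values).last_hidden_state
    out_sdpa = sdpa(pixel_values=pixel_values).last_hidden_state

# Some models need a looser atol/rtol here than the defaults used in the common tests.
torch.testing.assert_close(out_sdpa, out_eager, atol=1e-3, rtol=1e-3)
```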

@OmarManzoor OmarManzoor marked this pull request as ready for review November 21, 2024 11:34
@OmarManzoor
Contributor Author

@qubvel I tried to make the required adjustments. I'm not sure about these test failures.

@qubvel
Member

qubvel commented Nov 21, 2024

Hi @OmarManzoor! Thanks for iterating!

Re tests:

This one looks unrelated:

FAILED tests/models/xlm_roberta_xl/test_modeling_xlm_roberta_xl.py::XLMRobertaXLModelTest::test_assisted_decoding_matches_greedy_search_1_same - AssertionError: False is not true

However, this one might be related to the PR:

FAILED examples/pytorch/test_pytorch_examples.py::ExamplesTests::test_run_object_detection - AssertionError: 0.0152 not greater than or equal to 0.1

Can you also push an empty commit with the message [run-slow] detr to trigger the slow tests? (It should be the last commit in the sequence so I can approve the CI run.)

Member

@qubvel left a comment

Added some comments regarding changes in related models such as Conditional DETR and MaskFormer. We should not remove modules; instead, # Copied from should be used.
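
For reference, the # Copied from marker lets the repo's copy checker keep a duplicated module in sync with its source instead of deleting it. A minimal illustration of the pattern follows; the class names here are just an example.

```python
import torch.nn as nn


# `utils/check_copies.py` (run via `make fix-copies`) keeps this class identical to the
# referenced DETR class, applying the Detr->ConditionalDetr rename automatically.
# Copied from transformers.models.detr.modeling_detr.DetrAttention with Detr->ConditionalDetr
class ConditionalDetrAttention(nn.Module):
    ...
```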

@@ -198,6 +198,7 @@ class DetrModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMixin
# special case for head models
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)
inputs_dict.pop("decoder_input_ids", None)

Can you comment re this change?

src/transformers/models/maskformer/modeling_maskformer.py (review thread, outdated and resolved)
@OmarManzoor
Contributor Author

@qubvel I tried running some benchmarks for training. This is the script I used: train benchmark script.
However, it seems that SDPA is slowing things down instead of speeding them up.
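
For reference, a rough sketch of the kind of per-step timing loop such a training benchmark can use. The model calls, input sizes, and label format below are assumptions for illustration, not the linked script, and attn_implementation="sdpa" again assumes this PR's support.

```python
import time

import torch
from transformers import DetrForObjectDetection


def time_training_steps(attn_implementation: str, steps: int = 50, device: str = "cuda") -> float:
    model = DetrForObjectDetection.from_pretrained(
        "facebook/detr-resnet-50", attn_implementation=attn_implementation
    ).to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Dummy batch: 2 images with one annotated box each (normalized cx, cy, w, h).
    pixel_values = torch.randn(2, 3, 640, 1048, device=device)
    labels = [
        {"class_labels": torch.tensor([1], device=device),
         "boxes": torch.tensor([[0.5, 0.5, 0.2, 0.2]], device=device)}
        for _ in range(2)
    ]

    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        loss = model(pixel_values=pixel_values, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps


# eager_step = time_training_steps("eager"); sdpa_step = time_training_steps("sdpa")
```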

@qubvel
Member

qubvel commented Nov 21, 2024

Interesting observation! Did you try with larger images? I observed something similar for the RT-DETR model: the SDPA implementation did not give any speedups.

@OmarManzoor
Contributor Author

OmarManzoor commented Nov 21, 2024

> Interesting observation! Did you try with larger images?

No, not beyond 128. Does the script look okay though?
One thing that was different from the text models was that this line

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=True)

resulted in "no kernel found" errors. I had to set enable_math=True to run the code.
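
For reference, a minimal standalone version of that adjustment (tensor shapes and dtype are arbitrary; torch.backends.cuda.sdp_kernel is the context manager for selecting SDPA backends, superseded by torch.nn.attention.sdpa_kernel in newer torch versions):

```python
import torch
import torch.nn.functional as F

# Half-precision CUDA tensors so the flash / memory-efficient kernels are eligible at all.
q = k = v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Keeping enable_math=True leaves the math fallback available, which avoids the
# "no kernel found" error described above when no fused kernel matches the inputs.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True, enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v)
```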

@OmarManzoor
Contributor Author

OmarManzoor commented Nov 22, 2024

@qubvel I think the tests are generally looking okay now. I also pushed a [run-slow] commit for Detr.

Also, one thing I observed is that this test for the table transformer currently seems to be failing. It fails on the main branch as well when I tested locally, probably because the expected values don't match.

expected_logits = torch.tensor(
[[-6.7329, -16.9590, 6.7447], [-8.0038, -22.3071, 6.9288], [-7.2445, -20.9855, 7.3465]],
device=torch_device,
)
self.assertTrue(torch.allclose(outputs.logits[0, :3, :3], expected_logits, atol=1e-4))
expected_boxes = torch.tensor(
[[0.4868, 0.1764, 0.6729], [0.6674, 0.4621, 0.3864], [0.4720, 0.1757, 0.6362]], device=torch_device

Member

@qubvel left a comment

Hi @OmarManzoor, thanks for iterating! Can you post some of the benchmarks you have here? I'd recommend testing on real input; for DETR that might be an image of around 1300x800, and please use the image processor to prepare the inputs, at least for inference mode.

Can you push the run-slow message once again? It should be the last commit, otherwise I can't approve the CI run; please use [run-slow] detr, table_transformer. Thanks!
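
For reference, a sketch of preparing a realistically sized input with the image processor along the lines suggested above (the checkpoint name is assumed, and attn_implementation="sdpa" assumes this PR's support):

```python
import requests
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

# Standard COCO sample image used throughout the transformers docs.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
# By default the DETR processor resizes the shortest edge to 800 px (longest edge capped at 1333),
# which is close to the ~1300x800 input suggested above.
inputs = processor(images=image, return_tensors="pt")

model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50", attn_implementation="sdpa"
).eval()
with torch.no_grad():
    outputs = model(**inputs)
```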

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@OmarManzoor
Contributor Author

> Hi @OmarManzoor, thanks for iterating! Can you post some of the benchmarks you have here? I'd recommend testing on real input; for DETR that might be an image of around 1300x800, and please use the image processor to prepare the inputs, at least for inference mode.

I don't think I can run larger inputs, since I only have a GPU with 8 GB of memory available for the benchmarks.

@OmarManzoor
Contributor Author

OmarManzoor commented Nov 22, 2024

This is a benchmark I just ran for 1048 x 640 with float16 for training:

| num_training_steps | batch_size | image_size | is cuda | Time per batch (eager - s) | Time per batch (sdpa - s) | Speedup (%) | Eager peak mem (MB) | sdpa peak mem (MB) | Mem saving (%) |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 2 | (1048, 640) | True | 0.173 | 0.181 | -4.707 | 5948.894 | 6320.631 | -5.881 |

@qubvel
Member

qubvel commented Nov 22, 2024

Hi @OmarManzoor, thanks for sharing this! To move forward with this PR, we need to ensure it performs at least as fast as the eager implementation or offers better memory efficiency. I plan to test it on my side next week to see if the performance depends on specific hardware. Maybe worth benchmarking the compiled version as well.

If we can't achieve this improvement now, it might not be ideal to merge at this point, as SDPA is the default setting. Slower performance and additional maintenance code could introduce challenges. However, this approach might work better with newer GPUs and future torch versions, so even if we don’t identify a solution right away, we could revisit and integrate it at a more optimal time.

Looking forward to hearing your thoughts!
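
On the compiled-version suggestion above, a minimal sketch of what that extra benchmark variant could look like (assumes torch >= 2.0; this is not part of the PR):

```python
import torch
from transformers import DetrForObjectDetection

model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50", attn_implementation="sdpa"  # assumes this PR's SDPA support
).to("cuda").eval()
compiled = torch.compile(model)

pixel_values = torch.randn(2, 3, 800, 1333, device="cuda")
with torch.no_grad():
    _ = compiled(pixel_values=pixel_values)  # first call triggers compilation; time later calls
```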

@OmarManzoor
Contributor Author

I agree with you. However, I would be grateful if you could try some benchmarks on your end with better GPUs and configurations, and with some tweaks to the benchmarking script if required. That would confirm whether we are getting any performance gains. If the results are similar to what I observed, then I think we can close this PR for now. Also, I haven't really tested inference, because I never observed any improvement with the training script; even with float32 it was almost the same speed, with eager being slightly better.

@OmarManzoor
Contributor Author

Closing, as no clear performance improvements were observed for Detr. Thank you for your support on this, @qubvel. If you observe any performance improvements on your side, we can reopen.
