
Add sdpa for Detr #34826

Closed

wants to merge 17 commits into from

Conversation

OmarManzoor
Contributor

@OmarManzoor OmarManzoor commented Nov 20, 2024

What does this PR do?

Towards #28005

  • Adds sdpa for Detr model

Who can review?

CC: @amyeroberts @ArthurZucker
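
For context, SDPA here means routing attention through PyTorch's torch.nn.functional.scaled_dot_product_attention instead of the eager matmul-softmax path. Below is a minimal sketch of that pattern; it is illustrative only, not the PR's actual DetrAttention code, and the class and argument names are made up for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SdpaAttentionSketch(nn.Module):
    """Multi-head attention whose forward pass delegates to scaled_dot_product_attention."""

    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.dropout = dropout
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def _shape(self, x: torch.Tensor, bsz: int) -> torch.Tensor:
        # (batch, seq, embed_dim) -> (batch, num_heads, seq, head_dim)
        return x.view(bsz, -1, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, hidden_states, key_value_states=None, attention_mask=None):
        # key_value_states is set for cross-attention; otherwise this is self-attention.
        bsz = hidden_states.size(0)
        kv = key_value_states if key_value_states is not None else hidden_states
        q = self._shape(self.q_proj(hidden_states), bsz)
        k = self._shape(self.k_proj(kv), bsz)
        v = self._shape(self.v_proj(kv), bsz)
        # Dispatches to the flash, memory-efficient, or math kernel depending on inputs and hardware.
        attn = F.scaled_dot_product_attention(
            q, k, v, attn_mask=attention_mask, dropout_p=self.dropout if self.training else 0.0
        )
        attn = attn.transpose(1, 2).reshape(bsz, -1, self.num_heads * self.head_dim)
        return self.out_proj(attn)


# Example: 2 images' worth of 100 object queries with a hidden size of 256.
layer = SdpaAttentionSketch(embed_dim=256, num_heads=8)
out = layer(torch.randn(2, 100, 256))  # -> (2, 100, 256)
```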

@qubvel
Member

qubvel commented Nov 20, 2024

Hi @OmarManzoor! Thanks for taking this up, feel free to ping me when it's ready for review!

@OmarManzoor
Contributor Author

@qubvel I posted a question in the description. Could you kindly have a look?

@qubvel
Member

qubvel commented Nov 20, 2024

@OmarManzoor Yes, for some models we see that the threshold should be different. You can override the model tests.
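
For illustration, here is a hedged sketch of the kind of eager-vs-SDPA comparison where a per-model tolerance matters. The checkpoint name and thresholds are illustrative, and loading DETR with attn_implementation="sdpa" assumes this PR's support is in place.

```python
import torch
from transformers import DetrModel

pixel_values = torch.randn(1, 3, 224, 224)

eager = DetrModel.from_pretrained("facebook/detr-resnet-50", attn_implementation="eager").eval()
sdpa = DetrModel.from_pretrained("facebook/detr-resnet-50", attn_implementation="sdpa").eval()

with torch.no_grad():
    out_eager = eager(pixel_values=pixel_values).last_hidden_state
    out_sdpa = sdpa(pixel_values=pixel_values).last_hidden_state

# Some models need a looser atol/rtol here than the defaults used in the common tests.
torch.testing.assert_close(out_sdpa, out_eager, atol=1e-3, rtol=1e-3)
```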

@OmarManzoor OmarManzoor marked this pull request as ready for review November 21, 2024 11:34
@OmarManzoor
Contributor Author

@qubvel I tried to make the required adjustments. I'm not sure about these test failures.

@qubvel
Member

qubvel commented Nov 21, 2024

Hi @OmarManzoor! Thanks for iterating!

Re tests:

This one looks unrelated:

FAILED tests/models/xlm_roberta_xl/test_modeling_xlm_roberta_xl.py::XLMRobertaXLModelTest::test_assisted_decoding_matches_greedy_search_1_same - AssertionError: False is not true

However, this one might be related to the PR:

FAILED examples/pytorch/test_pytorch_examples.py::ExamplesTests::test_run_object_detection - AssertionError: 0.0152 not greater than or equal to 0.1

Can you also push an empty commit with the message [run-slow] detr to trigger the slow tests? (It should be the last commit in the sequence so I can approve the CI run.)

Member

@qubvel left a comment

Added some comments regarding changes in related models such as Conditional DETR and MaskFormer. We should not remove modules; instead, # Copied from should be used.
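
For reference, the # Copied from marker lets the repo's copy checker keep a duplicated module in sync with its source instead of deleting it. A minimal illustration of the pattern follows; the class names here are just an example.

```python
import torch.nn as nn


# `utils/check_copies.py` (run via `make fix-copies`) keeps this class identical to the
# referenced DETR class, applying the Detr->ConditionalDetr rename automatically.
# Copied from transformers.models.detr.modeling_detr.DetrAttention with Detr->ConditionalDetr
class ConditionalDetrAttention(nn.Module):
    ...
```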

@@ -198,6 +198,7 @@ class DetrModelTest(ModelTesterMixin, GenerationTesterMixin, PipelineTesterMixin
# special case for head models
def _prepare_for_class(self, inputs_dict, model_class, return_labels=False):
inputs_dict = super()._prepare_for_class(inputs_dict, model_class, return_labels=return_labels)
inputs_dict.pop("decoder_input_ids", None)

Can you comment re this change?

src/transformers/models/maskformer/modeling_maskformer.py (review thread, outdated and resolved)
@OmarManzoor
Contributor Author

@qubvel I tried running some benchmarks for training. This is the script I used: train benchmark script.
However, it seems that SDPA is slowing things down instead of speeding them up.
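
For reference, a rough sketch of the kind of per-step timing loop such a training benchmark can use. The model calls, input sizes, and label format below are assumptions for illustration, not the linked script, and attn_implementation="sdpa" again assumes this PR's support.

```python
import time

import torch
from transformers import DetrForObjectDetection


def time_training_steps(attn_implementation: str, steps: int = 50, device: str = "cuda") -> float:
    model = DetrForObjectDetection.from_pretrained(
        "facebook/detr-resnet-50", attn_implementation=attn_implementation
    ).to(device).train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    # Dummy batch: 2 images with one annotated box each (normalized cx, cy, w, h).
    pixel_values = torch.randn(2, 3, 640, 1048, device=device)
    labels = [
        {"class_labels": torch.tensor([1], device=device),
         "boxes": torch.tensor([[0.5, 0.5, 0.2, 0.2]], device=device)}
        for _ in range(2)
    ]

    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        loss = model(pixel_values=pixel_values, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / steps


# eager_step = time_training_steps("eager"); sdpa_step = time_training_steps("sdpa")
```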

@qubvel
Member

qubvel commented Nov 21, 2024

Interesting observation! Did you try with larger images? I observed something similar for the RT-DETR model: the SDPA implementation did not give any speedups.

@OmarManzoor
Contributor Author

OmarManzoor commented Nov 21, 2024

> Interesting observation! Did you try with larger images?

No, not beyond 128. Does the script look okay though?
One thing that was different from the text models was that this line

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=True)

resulted in "no kernel found" errors. I had to set enable_math=True to run the code.
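
For reference, a minimal standalone version of that adjustment (tensor shapes and dtype are arbitrary; torch.backends.cuda.sdp_kernel is the context manager for selecting SDPA backends, superseded by torch.nn.attention.sdpa_kernel in newer torch versions):

```python
import torch
import torch.nn.functional as F

# Half-precision CUDA tensors so the flash / memory-efficient kernels are eligible at all.
q = k = v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Keeping enable_math=True leaves the math fallback available, which avoids the
# "no kernel found" error described above when no fused kernel matches the inputs.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=True, enable_mem_efficient=True):
    out = F.scaled_dot_product_attention(q, k, v)
```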

@OmarManzoor
Contributor Author

OmarManzoor commented Nov 22, 2024

@qubvel I think the tests are generally looking okay now. I also pushed a [run-slow] commit for Detr.

Also, one thing I observed is that this test for the table transformer currently seems to be failing. It fails on the main branch as well when I tested locally, probably because the expected values don't match.

expected_logits = torch.tensor(
[[-6.7329, -16.9590, 6.7447], [-8.0038, -22.3071, 6.9288], [-7.2445, -20.9855, 7.3465]],
device=torch_device,
)
self.assertTrue(torch.allclose(outputs.logits[0, :3, :3], expected_logits, atol=1e-4))
expected_boxes = torch.tensor(
[[0.4868, 0.1764, 0.6729], [0.6674, 0.4621, 0.3864], [0.4720, 0.1757, 0.6362]], device=torch_device

Member

@qubvel left a comment

Hi @OmarManzoor, thanks for iterating! Can you post some of the benchmarks you have here? I'd recommend testing on real input; for DETR that might be an image of around 1300x800, and please use the image processor to prepare the inputs, at least for inference mode.

Can you push the run-slow message once again? It should be the last commit, otherwise I can't approve the CI run; please use [run-slow] detr, table_transformer. Thanks!
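
For reference, a sketch of preparing a realistically sized input with the image processor along the lines suggested above (the checkpoint name is assumed, and attn_implementation="sdpa" assumes this PR's support):

```python
import requests
import torch
from PIL import Image
from transformers import DetrForObjectDetection, DetrImageProcessor

# Standard COCO sample image used throughout the transformers docs.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
# By default the DETR processor resizes the shortest edge to 800 px (longest edge capped at 1333),
# which is close to the ~1300x800 input suggested above.
inputs = processor(images=image, return_tensors="pt")

model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50", attn_implementation="sdpa"
).eval()
with torch.no_grad():
    outputs = model(**inputs)
```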

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@OmarManzoor
Contributor Author

> Hi @OmarManzoor, thanks for iterating! Can you post some of the benchmarks you have here? I'd recommend testing on real input; for DETR that might be an image of around 1300x800, and please use the image processor to prepare the inputs, at least for inference mode.

I don't think I can run larger inputs, since I only have a GPU with 8 GB of memory available for the benchmarks.

@OmarManzoor
Contributor Author

OmarManzoor commented Nov 22, 2024

This is a benchmark I just ran for 1048 x 640 with float16 for training:

| num_training_steps | batch_size | image_size | is cuda | Time per batch (eager - s) | Time per batch (sdpa - s) | Speedup (%) | Eager peak mem (MB) | sdpa peak mem (MB) | Mem saving (%) |
|---|---|---|---|---|---|---|---|---|---|
| 50 | 2 | (1048, 640) | True | 0.173 | 0.181 | -4.707 | 5948.894 | 6320.631 | -5.881 |

@qubvel
Member

qubvel commented Nov 22, 2024

Hi @OmarManzoor, thanks for sharing this! To move forward with this PR, we need to ensure it performs at least as fast as the eager implementation or offers better memory efficiency. I plan to test it on my side next week to see if the performance depends on specific hardware. Maybe worth benchmarking the compiled version as well.

If we can't achieve this improvement now, it might not be ideal to merge at this point, as SDPA is the default setting. Slower performance and additional maintenance code could introduce challenges. However, this approach might work better with newer GPUs and future torch versions, so even if we don’t identify a solution right away, we could revisit and integrate it at a more optimal time.

Looking forward to hearing your thoughts!
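
On the compiled-version suggestion above, a minimal sketch of what that extra benchmark variant could look like (assumes torch >= 2.0; this is not part of the PR):

```python
import torch
from transformers import DetrForObjectDetection

model = DetrForObjectDetection.from_pretrained(
    "facebook/detr-resnet-50", attn_implementation="sdpa"  # assumes this PR's SDPA support
).to("cuda").eval()
compiled = torch.compile(model)

pixel_values = torch.randn(2, 3, 800, 1333, device="cuda")
with torch.no_grad():
    _ = compiled(pixel_values=pixel_values)  # first call triggers compilation; time later calls
```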

@OmarManzoor
Contributor Author

I agree with you. However, I would be grateful if you could try some benchmarks on your end with better GPUs and configurations, and with some tweaks to the benchmarking script if required. That would confirm whether we are getting any performance gains. If the results are similar to what I observed, then I think we can close this PR for now. Also, I haven't really tested inference, because I never observed any improvement with the training script; even with float32 it was almost the same speed, with eager being slightly better.

@OmarManzoor
Contributor Author

Closing, as no clear performance improvements were observed for Detr. Thank you for your support on this, @qubvel. If you observe any performance improvements on your side, we can reopen.
