Remove usage of torch.autograd.profiler_legacy for benchmarks #2149

Draft: anmyachev wants to merge 18 commits into main

Conversation

anmyachev (Contributor) commented Sep 6, 2024

Part of #1905

Closes #2150

Since #1905 is blocked for now, it's probably a good idea to switch the profiler to XPU `elapsed_time` in order to keep investigating the performance degradation caused by removing the IPEX import. I think we've come to terms with the increase in the absolute performance numbers at this point (see #1905 (comment)).
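For reference, here is a minimal sketch of the event-based timing this PR moves toward, assuming an upstream PyTorch build with XPU support where `torch.xpu.Event` mirrors the CUDA event API (the workload and shapes are placeholders, not the actual benchmark code):

```python
import torch

def time_with_events(fn, n_warmup=10, n_repeat=100):
    """Average wall time of `fn` in ms, measured with XPU events.

    Unlike the legacy profiler, which reported per-kernel device time,
    event-based timing covers everything submitted between the two
    records, including host-side preparation such as allocations.
    """
    for _ in range(n_warmup):
        fn()
    torch.xpu.synchronize()

    start = torch.xpu.Event(enable_timing=True)
    end = torch.xpu.Event(enable_timing=True)
    start.record()
    for _ in range(n_repeat):
        fn()
    end.record()
    torch.xpu.synchronize()
    return start.elapsed_time(end) / n_repeat

# Placeholder workload standing in for the Triton/XeTLA kernels.
x = torch.randn(4096, 4096, device="xpu")
print(time_with_events(lambda: torch.softmax(x, dim=-1)))
```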

anmyachev marked this pull request as ready for review September 6, 2024 19:49
anmyachev requested review from ESI-SYD, whitneywhtsang and chengjunlu and removed request for ESI-SYD September 6, 2024 19:49
anmyachev marked this pull request as draft September 6, 2024 22:49
anmyachev (Contributor, Author) commented Sep 6, 2024

Signed-off-by: Anatoly Myachev <[email protected]>
anmyachev marked this pull request as ready for review September 7, 2024 21:25
benchmarks/CMakeLists.txt (review thread, outdated, resolved)
benchmarks/setup.py (review thread, outdated, resolved)
whitneywhtsang (Contributor) left a comment

Any performance impact?

anmyachev (Contributor, Author) commented Sep 9, 2024

> Any performance impact?

Let's take the results (TFLOPS) from https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10760495839 and https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10763825651.

Note: unfortunately, the script from #2060 doesn't work for this PR.

Summary (old version vs new version):

  1. Softmax perf:
    • Absolute numbers:
      • Average difference: -17% (Triton), -14% (XeTLA)
      • Maximum difference: -44% (Triton), -48% (XeTLA)
    • Ratio (Triton/XeTLA) numbers:
      • Average difference: 2%
      • Maximum difference: 10%
  2. Attn perf:
    • Absolute numbers:
      • Average difference: about the same (<1%) (Triton), -93% (XeTLA)!!!
      • Maximum difference: about the same (Triton), -98% (XeTLA)!!!
  3. Matmul perf:
    • Absolute numbers:
      • Average difference: about the same (<1%) (Triton)
      • Maximum difference: 10% (Triton)

anmyachev (Contributor, Author) commented Sep 10, 2024

We can try another method. It looks like the previous measurement method only included the SYCL kernel time, while the new approach (`elapsed_time`) also includes everything done to prepare that kernel, for example, the large number of allocations in the case of the "Attn" benchmark. If we remove them from the measurement, the average difference becomes only 16%!!! instead of 93% (for "Attn").

I believe that by working in this direction we can achieve an acceptable deterioration in the absolute performance numbers while maintaining the ratio (within some acceptable limits). This could be an acceptable solution until the bugs in the other method are fixed. It also unblocks benchmarking on platforms that only work with upstream PyTorch.
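To illustrate, a hypothetical before/after sketch of hoisting an allocation out of the measured function (names and shapes are placeholders; the real benchmarks launch Triton/XeTLA kernels rather than `matmul`):

```python
import torch

q = torch.randn(1024, 64, device="xpu")
k = torch.randn(1024, 64, device="xpu")

# Before: the output buffer is allocated inside the timed function, so
# event-based (elapsed_time) timing charges the allocation to the kernel.
def scores_allocating():
    return q @ k.T  # allocates a fresh output tensor on every call

# After: allocate once up front; the timed function only launches the
# kernel into the preallocated buffer.
out = torch.empty(1024, 1024, device="xpu")

def scores_preallocated():
    return torch.matmul(q, k.T, out=out)
```

Timing both variants with an event-based helper like the one sketched earlier shows how much of the measured gap comes from allocations rather than from the kernel itself.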

New CI run on this PR with the latest changes (allocations moved out of the measured functions): https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10783177036/job/29904547469.

@etiotto @whitneywhtsang @alexbaden thoughts?

anmyachev added a commit that referenced this pull request Sep 11, 2024
Performance with the current approach remains
[unchanged](https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10794302056/job/29938275434),
but the change greatly improves the numbers when the `elapsed_time`
method is used.

Part of #2149

Closes #2198

Signed-off-by: Anatoly Myachev <[email protected]>
pbchekin changed the base branch from llvm-target to main September 14, 2024