Remove usage of torch.autograd.profiler_legacy for benchmarks #2149

Draft: anmyachev wants to merge 18 commits into main

Conversation

anmyachev (Contributor) commented Sep 6, 2024

Part of #1905

Closes #2150

Since #1905 is blocked for now, it's probably a good idea to switch the profiler to XPU `elapsed_time` in order to keep investigating the performance degradation caused by removing the IPEX import. I think we've come to terms with the increase in the absolute performance numbers at this point (see #1905 (comment)).
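For reference, here is a minimal sketch of the event-based timing this PR moves toward, assuming an upstream PyTorch build with XPU support where `torch.xpu.Event` mirrors the CUDA event API (the workload and shapes are placeholders, not the actual benchmark code):

```python
import torch

def time_with_events(fn, n_warmup=10, n_repeat=100):
    """Average wall time of `fn` in ms, measured with XPU events.

    Unlike the legacy profiler, which reported per-kernel device time,
    event-based timing covers everything submitted between the two
    records, including host-side preparation such as allocations.
    """
    for _ in range(n_warmup):
        fn()
    torch.xpu.synchronize()

    start = torch.xpu.Event(enable_timing=True)
    end = torch.xpu.Event(enable_timing=True)
    start.record()
    for _ in range(n_repeat):
        fn()
    end.record()
    torch.xpu.synchronize()
    return start.elapsed_time(end) / n_repeat

# Placeholder workload standing in for the Triton/XeTLA kernels.
x = torch.randn(4096, 4096, device="xpu")
print(time_with_events(lambda: torch.softmax(x, dim=-1)))
```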

anmyachev marked this pull request as ready for review September 6, 2024 19:49
anmyachev requested review from ESI-SYD, whitneywhtsang and chengjunlu and removed request for ESI-SYD September 6, 2024 19:49
anmyachev marked this pull request as draft September 6, 2024 22:49
anmyachev (Contributor, Author) commented Sep 6, 2024

Signed-off-by: Anatoly Myachev <[email protected]>
anmyachev marked this pull request as ready for review September 7, 2024 21:25
benchmarks/CMakeLists.txt (review thread, outdated, resolved)
benchmarks/setup.py (review thread, outdated, resolved)
whitneywhtsang (Contributor) left a comment

Any performance impact?

anmyachev (Contributor, Author) commented Sep 9, 2024

> Any performance impact?

Let's take the results (TFLOPS) from https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10760495839 and https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10763825651.

Note: unfortunately, the script from #2060 doesn't work for this PR.

Summary (old version vs new version):

  1. Softmax perf:
    • Absolute numbers:
      • Average difference: -17% (Triton), -14% (XeTLA)
      • Maximum difference: -44% (Triton), -48% (XeTLA)
    • Ratio (Triton/XeTLA) numbers:
      • Average difference: 2%
      • Maximum difference: 10%
  2. Attn perf:
    • Absolute numbers:
      • Average difference: about the same (<1%) (Triton), -93% (XeTLA)!!!
      • Maximum difference: about the same (Triton), -98% (XeTLA)!!!
  3. Matmul perf:
    • Absolute numbers:
      • Average difference: about the same (<1%) (Triton)
      • Maximum difference: 10% (Triton)

anmyachev (Contributor, Author) commented Sep 10, 2024

We can try another method. It looks like the previous measurement method only included the SYCL kernel time, while the new approach (`elapsed_time`) also includes everything done to prepare that kernel, for example, the large number of allocations in the case of the "Attn" benchmark. If we remove them from the measurement, the average difference becomes only 16%!!! instead of 93% (for "Attn").

I believe that by working in this direction we can achieve an acceptable deterioration in the absolute performance numbers while maintaining the ratio (within some acceptable limits). This could be an acceptable solution until the bugs in the other method are fixed. It also unblocks benchmarking on platforms that only work with upstream PyTorch.
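To illustrate, a hypothetical before/after sketch of hoisting an allocation out of the measured function (names and shapes are placeholders; the real benchmarks launch Triton/XeTLA kernels rather than `matmul`):

```python
import torch

q = torch.randn(1024, 64, device="xpu")
k = torch.randn(1024, 64, device="xpu")

# Before: the output buffer is allocated inside the timed function, so
# event-based (elapsed_time) timing charges the allocation to the kernel.
def scores_allocating():
    return q @ k.T  # allocates a fresh output tensor on every call

# After: allocate once up front; the timed function only launches the
# kernel into the preallocated buffer.
out = torch.empty(1024, 1024, device="xpu")

def scores_preallocated():
    return torch.matmul(q, k.T, out=out)
```

Timing both variants with an event-based helper like the one sketched earlier shows how much of the measured gap comes from allocations rather than from the kernel itself.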

New CI run on this PR with the latest changes (allocations moved out of the measured functions): https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10783177036/job/29904547469.

@etiotto @whitneywhtsang @alexbaden thoughts?

anmyachev added a commit that referenced this pull request Sep 11, 2024
Performance with the current approach remains
[unchanged](https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10794302056/job/29938275434),
but the change greatly improves the numbers when the `elapsed_time`
method is used.

Part of #2149

Closes #2198

Signed-off-by: Anatoly Myachev <[email protected]>
pbchekin changed the base branch from llvm-target to main September 14, 2024