[float8] Re-enable slow-accum in the bwd of axis-wise scaling schemes #1325
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1325
Note: Links to docs will display an error until the doc builds have completed.
✅ You can merge normally! (1 unrelated failure)
As of commit 6f4615b with merge base 1a0dbf1:
BROKEN TRUNK - The following job failed but was already failing on the merge base.
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
    and b_scale.shape == (1, b_data.shape[1])
    and not use_fast_accum
):
    # The rowwise CUTLASS-based kernel is so slow without fast-accum that
just curious, do we have any OSS shareable evidence (perf/accuracy) on doing this versus rowwise with fast-accum off that we can add here?
I ran a quick benchmark on my H100 with a recent-ish version of PyTorch (nightly from Nov 12). I sampled all MxNxK matmul shapes where each of M, N and K is a power of two between 512 and 16384. Here I'm plotting the slowdowns observed when activating slow-accum for the rowwise (CUTLASS-based) and tensorwise (cuBLAS-based) modes.
In summary: with tensorwise scaling the max slowdown is 50% (usually much less), while with rowwise scaling we are typically 2x slower, with peaks of 4.5x slower than fast-accum.
(I suspect that for very small shapes the benchmark was CPU-bound, hence slow-accum looks as fast as fast-accum, but that's probably misleading.)
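A minimal sketch of how such a comparison could be reproduced, assuming a recent PyTorch nightly where `torch._scaled_mm` accepts rowwise `(M, 1)`/`(1, N)` float32 scales; the shapes, timing helper, and dtypes below are illustrative and not the exact script behind the numbers above.

```python
# Hypothetical benchmark sketch (not the original script). Requires a CUDA GPU
# with fp8 support and a recent PyTorch nightly.
import torch

def bench(fn, iters=50):
    # Simple CUDA-event timer; very small shapes may end up CPU-bound here.
    for _ in range(5):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per call

M = N = K = 4096
a = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()  # column-major, as _scaled_mm requires

# Tensorwise (cuBLAS-backed): scalar scales. Rowwise (CUTLASS-backed): per-row/per-column scales.
configs = {
    "tensorwise": (torch.ones((), device="cuda"), torch.ones((), device="cuda")),
    "rowwise": (torch.ones(M, 1, device="cuda"), torch.ones(1, N, device="cuda")),
}

for name, (sa, sb) in configs.items():
    fast = bench(lambda: torch._scaled_mm(a, b, scale_a=sa, scale_b=sb,
                                          out_dtype=torch.bfloat16, use_fast_accum=True))
    slow = bench(lambda: torch._scaled_mm(a, b, scale_a=sa, scale_b=sb,
                                          out_dtype=torch.bfloat16, use_fast_accum=False))
    print(f"{name}: slow-accum / fast-accum = {slow / fast:.2f}x")
```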
Landing since Ruff is already broken on main
Superseded by #1377
Stack from ghstack (oldest at bottom):
This circumvents the issue with the slow CUTLASS kernel by using the cuBLAS kernel plus manual scaling.
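A minimal sketch of what "cuBLAS kernel + manual scaling" could look like, assuming the dispatch condition shown in the diff above: when rowwise scales are detected and fast-accum is off, call the tensorwise (cuBLAS) kernel with unit scales and apply the row/column scales to the high-precision output afterwards. The function and variable names are illustrative and do not mirror torchao's actual internals.

```python
# Illustrative sketch only; not the actual torchao implementation.
import torch

def scaled_mm_rowwise_slow_accum(a_data, b_data, a_scale, b_scale, out_dtype=torch.bfloat16):
    """Rowwise-scaled fp8 matmul without fast-accum.

    Instead of the slow-accum rowwise CUTLASS kernel, run the tensorwise
    cuBLAS kernel with unit scales and apply the (M, 1) and (1, N) dequant
    scales manually:
        out[m, n] = a_scale[m] * b_scale[n] * sum_k a[m, k] * b[k, n]
    """
    assert a_scale.shape == (a_data.shape[0], 1)
    assert b_scale.shape == (1, b_data.shape[1])
    one = torch.ones((), device=a_data.device, dtype=torch.float32)
    out = torch._scaled_mm(
        a_data, b_data,
        scale_a=one, scale_b=one,   # unit scalar scales -> tensorwise cuBLAS path
        out_dtype=out_dtype,
        use_fast_accum=False,       # slow-accum is cheap on the cuBLAS path
    )
    # Apply the per-row / per-column scales in float32, then cast back.
    return (out.to(torch.float32) * a_scale * b_scale).to(out_dtype)
```

Compared with forcing fast-accum, this keeps the slower but more accurate accumulation while avoiding the 2x-4.5x slowdown observed for the rowwise CUTLASS kernel in the benchmark above.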