add fused transpose and non-transpose kernel and use it for grad output #1497

Merged
danielvegamyhre merged 33 commits into main on Jan 8, 2025

Conversation

@danielvegamyhre (Contributor) commented on Jan 3, 2025

Stack Summary
The following stack of PRs completes a float8 training prototype with performance that slightly beats the production Float8Linear + torch.compile approach.

Changes in this PR
This PR implements a new kernel which reads in a row-major high-precision tensor and writes two outputs: an fp8 tensor in row-major format, and the fp8 transpose of that tensor, also in row-major format. This is useful for the backward pass, where grad_output must be converted to both formats.
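As a rough illustration only (not the actual Triton kernel added in this PR), the eager-mode sketch below shows the semantics the fused kernel is expected to implement. The function name, the choice of torch.float8_e5m2, and the tensorwise max-based scaling recipe are assumptions for illustration:

```python
import torch

def to_fp8_row_major_t_and_non_t(hp_tensor: torch.Tensor,
                                 float8_dtype=torch.float8_e5m2):
    """Reference (eager) version of the fused kernel's two outputs.

    The fused Triton kernel produces both outputs in a single pass over the
    input; this sketch uses separate ops purely to show the expected results.
    """
    # Tensorwise scale derived from the global amax (the kernel computes the
    # amax with either an atomic-max or a reduction based algorithm).
    amax = hp_tensor.abs().max().to(torch.float64)
    fp8_max = torch.finfo(float8_dtype).max
    scale = (fp8_max / torch.clamp(amax, min=1e-12)).to(torch.float32)

    # Scale into the representable fp8 range before casting.
    scaled = (hp_tensor.to(torch.float32) * scale).clamp(-fp8_max, fp8_max)

    fp8_row_major = scaled.to(float8_dtype)                     # row-major fp8
    fp8_row_major_t = scaled.t().contiguous().to(float8_dtype)  # row-major fp8 transpose
    return fp8_row_major, fp8_row_major_t
```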

Next steps:

  1. Add support for activation checkpointing
  2. Add test verifying this prototype is compatible with FSDP
  3. Integrate this prototype into torchtitan and benchmark the results

Test Plan

  1. pytest kernels/ - kernel-specific unit tests are passing (a sketch of one such test is shown below)
  2. pytest test/ - the end-to-end training test is passing
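A sketch of what one kernel-level correctness check could look like, assuming the illustrative to_fp8_row_major_t_and_non_t helper above (the actual tests under kernels/ may be organized differently):

```python
import pytest
import torch

@pytest.mark.parametrize("shape", [(16, 4096), (256, 4096), (4096, 4096)])
def test_fused_t_and_non_t_outputs_match(shape):
    if not torch.cuda.is_available():
        pytest.skip("requires CUDA")
    x = torch.randn(shape, dtype=torch.bfloat16, device="cuda")

    # Swap in the Triton kernel entry point here; the reference helper is used
    # only so this sketch has something concrete to call.
    fp8, fp8_t = to_fp8_row_major_t_and_non_t(x)

    # The transposed output should be the row-major transpose of the
    # non-transposed output, with identical fp8 values.
    assert fp8_t.shape == (shape[1], shape[0])
    assert fp8_t.is_contiguous()
    torch.testing.assert_close(
        fp8_t.to(torch.float32), fp8.to(torch.float32).t(), atol=0, rtol=0
    )
```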

Performance Benchmarking

Performance benchmarks show this implementation beating torch.compile by 1.72-4.45%, depending on the input tensor size, when using the ATOMIC_MAX kernel algorithm; the REDUCTION variant is slower than torch.compile. A sketch of a benchmarking harness is included after the table.

input_shape    kernel_algo                 high_precision_dtype      eager_time    compiled_time    float8nocompile
-------------  --------------------------  ----------------------  ------------  ---------------  -----------------
(16, 4096)     KernelAlgorithm.ATOMIC_MAX  torch.bfloat16               649.218          394.725            386.469
(256, 4096)    KernelAlgorithm.ATOMIC_MAX  torch.bfloat16               685.783          420.743            408.137
(4096, 4096)   KernelAlgorithm.ATOMIC_MAX  torch.bfloat16              1829.13          1053.64             977.858
(65536, 4096)  KernelAlgorithm.ATOMIC_MAX  torch.bfloat16             21554.2          12369.7            10813.3
(16, 4096)     KernelAlgorithm.REDUCTION   torch.bfloat16               650.026          394.951            696.221
(256, 4096)    KernelAlgorithm.REDUCTION   torch.bfloat16               684.865          421.144            729.459
(4096, 4096)   KernelAlgorithm.REDUCTION   torch.bfloat16              1826.42          1050.85            1596.12
(65536, 4096)  KernelAlgorithm.REDUCTION   torch.bfloat16             21584.7          12347.2            17290
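For context, a minimal sketch of how per-shape timings like those above could be collected (the PR's actual benchmark harness, kernel entry points, and time units may differ; the helper below assumes the illustrative conversion function sketched earlier):

```python
import torch
import triton

def bench_ms(fn, x):
    # triton.testing.do_bench handles warmup and repetition and returns the
    # runtime in milliseconds. Requires a CUDA device.
    return triton.testing.do_bench(lambda: fn(x))

x = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")

eager_ms = bench_ms(to_fp8_row_major_t_and_non_t, x)                    # eager baseline
compiled_ms = bench_ms(torch.compile(to_fp8_row_major_t_and_non_t), x)  # torch.compile baseline
print(f"eager: {eager_ms:.3f} ms  compiled: {compiled_ms:.3f} ms")
```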


pytorch-bot (bot) commented on Jan 3, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1497

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 01eedbf with merge base eb49333:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

danielvegamyhre added a commit that referenced this pull request Jan 3, 2025
ghstack-source-id: b8ffafdd2ff8428643045b9e7fb9046a0eab22c7
ghstack-comment-id: 2569853196
Pull Request resolved: #1497
@danielvegamyhre added the topic: not user facing label on Jan 3, 2025
@facebook-github-bot added the CLA Signed label on Jan 3, 2025
@danielvegamyhre requested a review from vkuzo on January 3, 2025
danielvegamyhre added a commit that referenced this pull request Jan 3, 2025
ghstack-source-id: be4465cbef3e93fa415d1acf65d9a889043ead0d
ghstack-comment-id: 2569853196
Pull Request resolved: #1497
@danielvegamyhre danielvegamyhre changed the base branch from gh/danielvegamyhre/14/head to main January 8, 2025 03:48
@danielvegamyhre danielvegamyhre merged commit 070345d into main Jan 8, 2025
44 checks passed