add axiswise scaling to Float8Linear #920
Conversation
Summary:

This PR: support scaling of all arguments of all gemms to be axiswise, and ensure that training with axiswise scaling works e2e.

Future PR: support more granular configurability, optimize performance, and add docs.

Feel free to ignore the UX introduced in this PR, it's just an intermediate step. See the next PR for the real UX.

Test Plan:

```
// tests pass
./test/float8/test_everything.sh

// sanity check on torchtitan with LLaMa 3 8B on 4 H100s with float8:
// 1. verify performance does not regress with tensorwise scaling
// 2. smoke test that axiswise scaling works and numerics are sane; performance isn't there yet
// logs: https://gist.github.com/vkuzo/70fa5eb3c23375f307d11e7bae48682f
```
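For readers unfamiliar with the distinction, here is a minimal sketch of tensorwise vs axiswise float8 scale computation. This is illustrative only: the function names, epsilon value, and shapes are made up and are not the torchao Float8Linear API.

```python
import torch

# Illustrative only: not the torchao Float8Linear API. Shows the conceptual
# difference between tensorwise and axiswise float8 scale computation.
FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max
EPS = 1e-12  # arbitrary epsilon to avoid division by zero

def tensorwise_scale(x: torch.Tensor) -> torch.Tensor:
    # a single scale for the whole tensor
    amax = x.abs().amax()
    return FP8_MAX / torch.clamp(amax, min=EPS)

def axiswise_scale(x: torch.Tensor, dim: int) -> torch.Tensor:
    # one scale per slice along `dim` (e.g. per row of an activation matrix)
    amax = x.abs().amax(dim=dim, keepdim=True)
    return FP8_MAX / torch.clamp(amax, min=EPS)

x = torch.randn(16, 32)
x_fp8_tensorwise = (x * tensorwise_scale(x)).to(FP8_DTYPE)
x_fp8_axiswise = (x * axiswise_scale(x, dim=-1)).to(FP8_DTYPE)
```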
@@ -39,6 +44,7 @@
from torch._dynamo.test_case import TestCase as DynamoTestCase
from torch._dynamo.testing import CompileCounterWithBackend

# TODO(future PR): standardize IS_H100 with the rest of the codebase
What we want is SM89 for testing cublas matmuls, since in theory we have hardware with that capability on CI/CD.
Yeah, I meant the variable name, not what it's checking.
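For context, a capability check along those lines could look like the following sketch; the variable name is hypothetical, not the one used in the codebase:

```python
import torch

# Hypothetical variable name; SM 8.9 (Ada) is the minimum capability for
# cuBLAS float8 matmuls.
IS_SM89_OR_LATER = (
    torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)
)
```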
# Cast grad_output to float8_e5m2 during backward
output = self.cast_output_to_float8_in_bw(output)

else:
Why do we need a separate path for this?
This is choosing to move fast on a single use case, at the expense of taking on some temporary tech debt.
LGTM! Left some minor readability nits
and other granularities in a separate PR.
"""

# TODO(this PR): types of inputs
Done, right?
# the reshapes are needed in order to make the shapes compatible with
# torch.mm
orig_shape = input_fp8.shape
input_fp8_reshaped = input_fp8.reshape(-1, orig_shape[-1])
Is this equivalent to input_fp8.flatten(0, -2)? If so, I find this more self-descriptive.
orig_shape = input_fp8.shape
input_fp8_reshaped = input_fp8.reshape(-1, orig_shape[-1])
res_bits = torch.mm(input_fp8_reshaped, weight_fp8_t)
res_bits = res_bits.reshape(*orig_shape[:-1], res_bits.shape[-1])
Conversely, res_bits.unflatten(0, orig_shape[:-1]).
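To make the suggested equivalence concrete, here is a small self-contained sketch comparing both formulations; the shapes and placeholder tensors are illustrative and not taken from the PR:

```python
import torch

# Stand-ins for the fp8 activation and transposed weight in the PR.
input_fp8 = torch.randn(2, 3, 8)
weight_fp8_t = torch.randn(8, 16)

orig_shape = input_fp8.shape
a = input_fp8.reshape(-1, orig_shape[-1])  # version used in the PR
b = input_fp8.flatten(0, -2)               # suggested equivalent
assert torch.equal(a, b)

res_bits = torch.mm(a, weight_fp8_t)
out_a = res_bits.reshape(*orig_shape[:-1], res_bits.shape[-1])  # PR version
out_b = res_bits.unflatten(0, orig_shape[:-1])                  # suggested equivalent
assert torch.equal(out_a, out_b)
```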
# the reshapes are needed in order to make the shapes compatible with
# torch.mm
grad_output_orig_shape = grad_output.shape
grad_output_reshaped = grad_output.reshape(
Same here and below