[cleanup][4/x] unify weight casting #1481

Open · wants to merge 6 commits into base: gh/vkuzo/15/head

Conversation


@vkuzo commented on Jan 2, 2025

Summary:

Removes redundant logic for weight casting
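
As the title suggests, "weight casting" here is the dynamic tensorwise conversion of a high-precision weight to float8 before the matmul. A minimal sketch of that operation, assuming a generic dynamic tensorwise cast (the helper name, dtype, and scaling details below are illustrative, not the torchao code being unified):

```python
import torch

def cast_weight_to_float8(w_hp: torch.Tensor, float8_dtype=torch.float8_e4m3fn):
    """Dynamic tensorwise cast: one scale for the whole weight tensor.

    Illustrative only. Presumably the point of "unify weight casting" is to
    have a single code path like this shared by all callers, instead of
    duplicated casting logic.
    """
    # scale so that the tensorwise amax maps to the max representable float8 value
    amax = w_hp.abs().max().float()
    scale = torch.finfo(float8_dtype).max / torch.clamp(amax, min=1e-12)
    w_fp8 = (
        (w_hp.float() * scale)
        .clamp(torch.finfo(float8_dtype).min, torch.finfo(float8_dtype).max)
        .to(float8_dtype)
    )
    return w_fp8, scale
```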

Performance / peak memory on torchtitan Llama 3 8B on 8 NVIDIA H100 GPUs (a config sketch for these settings follows the numbers):

Before this PR stack (every experiment uses float8 + compile):

ac (activation checkpointing): selective(op)

  • all_tensorwise with FSDP2 float8 all-gather: tps 7030, peak_mem 37.54 GiB
  • all_tensorwise with FSDP2 float8 all-gather with force_recompute_fp8_weight_in_bwd=True: tps 7060, peak_mem 37.54 GiB
  • all_axiswise: tps 6300, peak_mem 57.03 GiB
  • lw_axiswise_with_gw_hp: tps 5996, peak_mem 57.03 GiB

ac: none

  • all_tensorwise with FSDP2 float8 all-gather: tps 7321, peak_mem 56.56 GiB
  • all_tensorwise with FSDP2 float8 all-gather with force_recompute_fp8_weight_in_bwd=True: tps 7443, peak_mem 51.39 GiB

After this PR stack:

ac: selective(op)

  • all_tensorwise with FSDP2 float8 all-gather: tps 7050, peak_mem 37.54 GiB
  • all_tensorwise with FSDP2 float8 all-gather with force_recompute_fp8_weight_in_bwd=True: tps 7040, peak_mem 37.54 GiB
  • all_axiswise: tps 6300, peak_mem 57.03 GiB
  • lw_axiswise_with_gw_hp: tps 5996, peak_mem 57.03 GiB

ac: none

  • all_tensorwise with FSDP2 float8 all-gather: tps 7280, peak_mem 56.56 GiB
  • all_tensorwise with FSDP2 float8 all-gather with force_recompute_fp8_weight_in_bwd=True: tps 7283, peak_mem 56.56 GiB
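
For reference, the "all_tensorwise with FSDP2 float8 all-gather" rows correspond roughly to torchao's float8 training config. A minimal sketch, assuming the public torchao.float8 API around the time of this PR; the toy model is a placeholder and exact names/arguments may differ across versions:

```python
import torch
import torch.nn as nn
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# placeholder model; the benchmarks above use torchtitan Llama 3 8B
model = nn.Sequential(nn.Linear(4096, 4096, bias=False))

# roughly the "all_tensorwise with FSDP2 float8 all-gather" rows;
# force_recompute_fp8_weight_in_bwd is the flag toggled between runs.
# The all_axiswise / lw_axiswise_with_gw_hp rows use recipe-based configs instead.
config = Float8LinearConfig(
    enable_fsdp_float8_all_gather=True,
    force_recompute_fp8_weight_in_bwd=True,
)
convert_to_float8_training(model, config=config)
model = torch.compile(model)  # every experiment above uses float8 + compile
```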

Test Plan:

./test/float8/test_everything.sh


pytorch-bot commented on Jan 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1481

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e9cd02b with merge base 457c5b1:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vkuzo added a commit that referenced this pull request on Jan 2, 2025
Summary:

Not ready for review yet: there is a performance regression because the tensorwise abs+max and the weight cast happen twice between the forward and backward passes. Possibly a limitation of something in the PT2 stack?

ghstack-source-id: 5d789dd3ea6c508c767907951a38b905a745f3d7
ghstack-comment-id: 2568319095
Pull Request resolved: #1481
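
Concretely, "casting twice between fwd and bwd" means the backward pass re-runs the tensorwise abs().max() and the float8 cast instead of reusing the float8 weight produced in forward. A minimal autograd sketch of that pattern (illustrative only; the real code path runs under torch.compile and is partitioned by the PT2 stack, and none of the names below are torchao internals):

```python
import torch

FP8 = torch.float8_e4m3fn

def _tensorwise_cast(w_hp: torch.Tensor):
    # abs+max -> scale -> cast; this is the work that ends up running twice
    scale = torch.finfo(FP8).max / w_hp.abs().max().float().clamp(min=1e-12)
    w_fp8 = (w_hp.float() * scale).clamp(torch.finfo(FP8).min, torch.finfo(FP8).max).to(FP8)
    return w_fp8, scale

class LinearRecastWeightInBwd(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, w_hp):
        w_fp8, scale = _tensorwise_cast(w_hp)   # cast #1 (forward)
        ctx.save_for_backward(x, w_hp)          # saves the hp weight, not w_fp8
        # emulate the float8 matmul by dequantizing; a real kernel would
        # consume w_fp8 and scale directly (e.g. a scaled matmul)
        return x @ (w_fp8.float() / scale).t().to(x.dtype)

    @staticmethod
    def backward(ctx, grad_out):
        x, w_hp = ctx.saved_tensors
        w_fp8, scale = _tensorwise_cast(w_hp)   # cast #2 (backward)
        grad_x = (grad_out.float() @ (w_fp8.float() / scale)).to(x.dtype)
        grad_w = (grad_out.float().t() @ x.float()).to(w_hp.dtype)
        return grad_x, grad_w
```

This recompute-in-backward trade is what force_recompute_fp8_weight_in_bwd is meant to make deliberately (less peak memory in exchange for a second cast); the regression described above is, presumably, this happening even when it was not requested.
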
@facebook-github-bot added the CLA Signed label on Jan 2, 2025

vkuzo added commits that referenced this pull request on Jan 8, 2025 (updated ghstack revisions; same summary as above).

@vkuzo added the topic: not user facing label on Jan 8, 2025

vkuzo commented Jan 9, 2025

config.force_recompute_fp8_weight_in_bwd doesn't work properly with FSDP2 float8 all-gather yet; looking into it.

Labels: CLA Signed, topic: not user facing
2 participants