float8 delayed scaling: private API to fix user overriding buffers #1292
base: main
Conversation
Helpful Links: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1292.
Note: Links to docs will display an error until the docs builds have been completed. There is 1 currently active SEV; if your PR is affected, please review it. ✅ No failures as of commit faa1593 with merge base 56bf2e8. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Summary:

Context: pytorch/torchtitan#654

If the user has delayed scaling and FSDP float8 all-gather enabled, a subtle bug can occur when the user calls `model.to_empty(device="cuda")`:

1. `to_empty` recreates the buffers that track the weight amax and scale.
2. (1) leaves the buffers referenced by `Float8Linear.weight._amax_buffer`, etc., orphaned, because those references do not participate in `to_empty`.

I couldn't think of an easy and clean way to auto-fix this, since we can't expect `torch.nn.Module` to know that our logic holds multiple references to the same buffer, so this PR exposes a private API for now until we can think of something better. With the current fix, the user can call `_maybe_fixup_delayed_scaling_buffers` manually to relink the buffers to the correct new versions.

Test Plan: CI
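To make the relinking concrete, here is a minimal sketch (not the actual torchao implementation) of what a fixup like `_maybe_fixup_delayed_scaling_buffers` has to do after `to_empty` recreates the registered buffers. The names `fp8_scale_weight`, `_scale_buffer`, and `_amax_buffer` appear in this PR; `fp8_amax_weight`, `fp8_amax_history_weight`, and `_amax_history_buffer` are assumed names used only for illustration.

```python
import torch

def relink_delayed_scaling_buffers(m: torch.nn.Module) -> None:
    """Sketch: re-point the references stashed on m.weight at the module's
    registered buffers, which to_empty() has just recreated."""
    weight = m.weight
    # fp8_scale_weight / _scale_buffer / _amax_buffer come from this PR's test;
    # the other attribute names below are assumptions for illustration.
    if hasattr(m, "fp8_amax_weight"):
        weight._amax_buffer = m.fp8_amax_weight
    if hasattr(m, "fp8_amax_history_weight"):
        weight._amax_history_buffer = m.fp8_amax_history_weight
    if hasattr(m, "fp8_scale_weight"):
        weight._scale_buffer = m.fp8_scale_weight
```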
assert m_fp8[0].fp8_scale_weight is m_fp8[0].weight._scale_buffer

m_fp8.to_empty(device="cuda")
m_fp8[0]._maybe_fixup_delayed_scaling_buffers()
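In this test snippet, the assertion checks that the module's registered `fp8_scale_weight` buffer and the reference stashed at `weight._scale_buffer` are the same tensor; `to_empty` breaks that identity, and the call to `_maybe_fixup_delayed_scaling_buffers` is expected to restore it.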
we would need to call this inside torchtitan’s training loop?
yeah, which is definitely not ideal