fix bug in tl.store mask for kernel _to_fp8_row_major_t_and_non_t #1516

danielvegamyhre · 2025-01-07T18:39:06Z

I discovered this bug while prototyping the torchtitan integration with float8nocompile in pytorch/torchtitan#778. The symptom was that the grads went to NaN during training.

The issue was a mismatch between how the output offsets for writing the transposed tensor are calculated and the offsets used to build the output mask. The offsets used for writing were correct, but the offsets used in the output mask were accidentally reversed.
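
To make the failure mode concrete, here is a rough pure-Python sketch of a blocked, masked transposed store. It is not the Triton kernel from this PR: the function name `transposed_store`, the `swap_mask_dims` flag, and the 2x5 example input are all hypothetical and chosen only for illustration. It assumes the buggy mask compared the row and column offsets against the wrong output dimensions, which is one way the mask offsets could end up "backwards".

```python
def transposed_store(src, block, swap_mask_dims=False):
    """Simulate a blocked, masked store of src^T into a flat row-major buffer.

    Hypothetical illustration only; this is not the real Triton kernel.
    """
    num_rows, num_cols = len(src), len(src[0])
    out_rows, out_cols = num_cols, num_rows  # shape of the transposed output
    out = [0.0] * (out_rows * out_cols)
    for row_start in range(0, num_rows, block):
        for col_start in range(0, num_cols, block):
            for r in range(row_start, row_start + block):
                for c in range(col_start, col_start + block):
                    # Masked load: out-of-range lanes read a filler value.
                    in_bounds = r < num_rows and c < num_cols
                    value = src[r][c] if in_bounds else 0.0
                    # Write offset: input element (r, c) lands at output (c, r).
                    offset = c * out_cols + r
                    if swap_mask_dims:
                        # Assumed bug: offsets checked against the wrong dims.
                        mask = r < out_rows and c < out_cols
                    else:
                        # Mask agrees with the write offset above.
                        mask = c < out_rows and r < out_cols
                    # The length guard only keeps this simulation from raising;
                    # a real GPU store would write out of bounds instead.
                    if mask and offset < len(out):
                        out[offset] = value
    return out


src = [[1.0, 2.0, 3.0, 4.0, 5.0], [6.0, 7.0, 8.0, 9.0, 10.0]]
expected = [src[r][c] for c in range(5) for r in range(2)]
print(transposed_store(src, block=4) == expected)                       # True
print(transposed_store(src, block=4, swap_mask_dims=True) == expected)  # False
```

In this sketch the swapped mask both drops valid elements and lets filler values from out-of-range lanes overwrite valid output. For a square input that is an exact multiple of the block size, the two masks agree, so a mismatch like this only shows up on some shapes.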

danielvegamyhre · 2025-01-07T18:39:07Z

Stack from ghstack (oldest at bottom):

pytorch-bot · 2025-01-07T18:39:10Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1516

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 96ee5ee with merge base f86fda9:

NEW FAILURE - The following job has failed:

Run Regression Tests / test (CUDA 2.4, linux.g5.12xlarge.nvidia.gpu, torch==2.4.0, cuda, 12.1) / linux-job (gh)
RuntimeError: Command docker exec -t fb4d6860a3727f171085677a64a804406265c4269e6eefe27625371655fc4133 /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vkuzo · 2025-01-08T03:56:07Z

looks like there is a gap in test coverage, is it doable to add a test to this PR which fails before the fix and passes after the fix?

danielvegamyhre · 2025-01-08T15:54:10Z

> looks like there is a gap in test coverage, is it doable to add a test to this PR which fails before the fix and passes after the fix?

Ah, thanks for the reminder. I just updated the e2e training test with a new test case that uses large inputs; I've confirmed it fails without this change and passes with it.

danielvegamyhre added 23 commits January 3, 2025 13:41

Update f85618f
Update c603139
Update 9b42e69
Update fc301fd
Update a69fc66
Update 1d2ee55
Update 5870160
Update 36d8d17
Update 7e526fd
Update 2bbdf88
Update 58be437
Update 25298cb
Update 89c6b53
Update 0808acf
Update 3cc35df
Update ff6dad0
Update ddf1efc
Update 0536cb8
Update 10830d8
Update 5a47687
Update 8d52227
Update f485529
Update ad6d97b

danielvegamyhre mentioned this pull request Jan 7, 2025

add fp8 conversion kernel for transpose in column major format #1493

Merged

facebook-github-bot added the CLA Signed label Jan 7, 2025

This was referenced Jan 7, 2025

add fp8 conversion kernel that writes both row and column major outputs #1494

Merged

add torch autograd funcs wrapping new fp8 conversion kernels #1495

Merged

integrate new differentiable fp8 conversion funcs into Float8NoCompileLinear #1496

Merged

danielvegamyhre added 2 commits January 7, 2025 16:11

Update 6db778a
Update f585e44

danielvegamyhre mentioned this pull request Jan 8, 2025

float8nocompile: add e2e fsdp test #1523

Open

danielvegamyhre added 8 commits January 7, 2025 17:55

Update e11918d
Update f65e981
Update c0da780
Update 754c6bf
Update 49373f1
Update e459d25
Update ff6b91e
Update 3eb406f

danielvegamyhre mentioned this pull request Jan 8, 2025

fix linter errors #1525

Closed

danielvegamyhre added 5 commits January 7, 2025 18:27

Update c78a574
Update e5c69e7
Update 01eedbf
Update 74286fe
Update a356ac5

danielvegamyhre added 5 commits January 8, 2025 06:40

Update 7ee060a
Update 84cc74b
Update 2600ee4
Update d5666b2
Update 7a44bd9

vkuzo approved these changes Jan 8, 2025

danielvegamyhre added 3 commits January 8, 2025 10:05

Update 2e13197
Update e665139
Update 96ee5ee

danielvegamyhre changed the base branch from gh/danielvegamyhre/16/head to main January 8, 2025 21:35

danielvegamyhre merged commit 4738377 into main Jan 9, 2025
44 checks passed