
[float8nocompile] Add alternate Triton kernels for FP8 conversion which use atomic_max-based algo instead of reduction-based algo #1455

Open · wants to merge 2 commits into main
Conversation

@danielvegamyhre (Contributor) commented on Dec 20, 2024

Summary

  • Add new Triton kernels that use a different approach for calculating the global amax:
  • Have a single shared global amax tensor of size 1.
  • Each block computes its local amax, then performs a thread-safe tl.atomic_max against the shared global amax tensor, so the true global max is available once the kernel completes (see the sketch below).
  • The advantage of this strategy is that we no longer need to allocate a shared buffer of size num_elements // BLOCK_SIZE outside the kernels; that buffer prevented autotuning the block size, since the block size had to be known ahead of time to allocate it.

Also made the algorithm used configurable and added unit tests for both.
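
For readers skimming the diff, here is a minimal sketch of the atomic_max strategy described above. It is not the PR's actual kernel; the names `_amax_atomic_kernel` and `global_amax`, the default block size, and the contiguity assumption are all illustrative.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _amax_atomic_kernel(
    input_ptr,   # contiguous high-precision input tensor
    amax_ptr,    # single-element fp32 tensor holding the shared global amax
    num_elements,
    BLOCK_SIZE: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < num_elements

    # Each block computes the amax of its own tile...
    vals = tl.load(input_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    local_amax = tl.max(tl.abs(vals), axis=0)

    # ...then folds it into the shared global amax with a thread-safe atomic max,
    # instead of writing into a num_elements // BLOCK_SIZE scratch buffer.
    tl.atomic_max(amax_ptr, local_amax)


def global_amax(hp_tensor: torch.Tensor, block_size: int = 1024) -> torch.Tensor:
    # Single shared global amax tensor of size 1; abs values are >= 0, so 0 is a safe init.
    amax = torch.zeros(1, dtype=torch.float32, device=hp_tensor.device)
    n = hp_tensor.numel()
    grid = (triton.cdiv(n, block_size),)  # block_size must be a power of two for tl.arange
    _amax_atomic_kernel[grid](hp_tensor, amax, n, BLOCK_SIZE=block_size)
    return amax
```

Because no scratch buffer has to be sized ahead of time, BLOCK_SIZE can be left to triton.autotune rather than fixed by the caller.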

Test plan

  • pytest test/test.py - Passing
  • python3 benchmark/benchmark.py yields a large performance improvement, as seen below. The kernel is now markedly better than the production path in eager mode, but not yet as good as the compiled path.

Performance of the reduction-based kernel with an un-tuned block size of 8:

  input_size  high_precision_dtype      eager_time    compiled_time    float8nocompile
------------  ----------------------  ------------  ---------------  -----------------
65500         torch.float32                599.299          298.101        94446
65500         torch.bfloat16               649.674          394.535        94386.3
    1.05e+06  torch.float32                640.5            332.171       104449
    1.05e+06  torch.bfloat16               685.822          421.365       104372
    1.68e+07  torch.float32               1963.09          1214.32        280825
    1.68e+07  torch.bfloat16              1828.16          1051.67        261710
    2.68e+08  torch.float32              24129.8          16287.2            3.39791e+06
    2.68e+08  torch.bfloat16             21603.2          12389.9            3.39515e+06

Performance of the atomic_max-based kernel with an auto-tuned block size:

  input_size  high_precision_dtype      eager_time    compiled_time    float8nocompile
------------  ----------------------  ------------  ---------------  -----------------
65500         torch.float32                599.055          298.543            372.018
65500         torch.bfloat16               649.523          394.457            413.137
    1.05e+06  torch.float32                640.497          332.419            413.503
    1.05e+06  torch.bfloat16               685.584          421.296            453.472
    1.68e+07  torch.float32               1963.72          1215.19            1415.5
    1.68e+07  torch.bfloat16              1829.28          1051.86            1298.55
    2.68e+08  torch.float32              24126.2          16294.3            19124.8
    2.68e+08  torch.bfloat16             21592.5          12390.6            16485.7

pytorch-bot (bot) commented on Dec 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1455

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1c06b47 with merge base 3bac905:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label on Dec 20, 2024
danielvegamyhre added the topic: not user facing label on Dec 20, 2024
@drisspg (Contributor) left a comment

Looks good. I'm not sure I would completely throw away the old one, though, since I think tuning the split reduction could get us a lot of performance, and it is deterministic, which is nice.

@danielvegamyhre (Contributor, Author)

> Looks good. I'm not sure I would completely throw away the old one, though, since I think tuning the split reduction could get us a lot of performance, and it is deterministic, which is nice.

Hmm, ok. I will make the strategy used configurable.

@danielvegamyhre changed the title from "[float8nocompile] Refactor Triton kernels so triton.autotune is easily usable, and autotune the block size" to "[float8nocompile] Add alternate Triton kernels for FP8 conversion which use atomic max algo instead of reduction based algo" on Dec 20, 2024
@danielvegamyhre (Contributor, Author) commented on Dec 20, 2024

> Looks good. I'm not sure I would completely throw away the old one, though, since I think tuning the split reduction could get us a lot of performance, and it is deterministic, which is nice.

@drisspg I updated the PR to keep both implementations and make the kernel algorithm used configurable, and updated unit tests to exercise both paths.
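
For illustration only (the names below are hypothetical and may not match the PR), the configurable algorithm could be modeled as a simple enum that the conversion entry point switches on:

```python
from enum import Enum


class KernelAlgorithm(Enum):
    """Which Triton kernel strategy to use for computing the global amax."""

    # Original approach: each block writes its local amax into a pre-allocated
    # buffer of size num_elements // BLOCK_SIZE, which is then reduced.
    # Deterministic, but the buffer size fixes the block size ahead of time.
    REDUCTION = "reduction"

    # New approach: each block folds its local amax into a size-1 global amax
    # tensor via tl.atomic_max, so no buffer is needed and the block size can
    # be autotuned.
    ATOMIC_MAX = "atomic_max"
```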

One interesting thing I noticed, though, is that the atomic max strategy was failing non-deterministically; I had to add a call to torch.cuda.synchronize() after the global amax kernel to fix this.
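
Continuing the illustrative sketch from the summary (the helper names are still hypothetical), the fix amounts to synchronizing after the amax kernel launch and before the result is consumed:

```python
def global_amax_with_sync(hp_tensor: torch.Tensor, block_size: int = 1024) -> torch.Tensor:
    amax = torch.zeros(1, dtype=torch.float32, device=hp_tensor.device)
    n = hp_tensor.numel()
    grid = (triton.cdiv(n, block_size),)
    _amax_atomic_kernel[grid](hp_tensor, amax, n, BLOCK_SIZE=block_size)

    # The PR author reported non-deterministic failures without this call;
    # synchronizing ensures every block's tl.atomic_max has completed before
    # the amax is used downstream (e.g., to compute the FP8 scale).
    torch.cuda.synchronize()
    return amax
```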

@danielvegamyhre changed the title from "[float8nocompile] Add alternate Triton kernels for FP8 conversion which use atomic max algo instead of reduction based algo" to "[float8nocompile] Add alternate Triton kernels for FP8 conversion which use atomic_max-based algo instead of reduction-based algo" on Dec 21, 2024
Labels
CLA Signed · topic: not user facing