[float8nocompile] Add alternate Triton kernels for FP8 conversion which use atomic_max-based algo instead of reduction-based algo #1455
base: main
Conversation
🔗 Helpful Links: 🧪 see artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1455
✅ No failures as of commit 1c06b47 with merge base 3bac905. (This comment was automatically generated by Dr. CI and updates every 15 minutes.)
Force-pushed from 42f4a36 to 8683228.
Force-pushed from 8683228 to 40165e8.
Looks good. I'm not sure I would completely throw away the old one though, since I think tuning the split reduction could get us a lot of performance, and it's deterministic, which is nice.
Hmm, ok, I will make the strategy used configurable.
Force-pushed from d6295f4 to faf855f.
@drisspg I updated the PR to keep both implementations and make the kernel algorithm used configurable, and updated the unit tests to exercise both paths. One interesting thing I noticed is that the atomic max strategy was failing non-deterministically; I had to add a call
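For illustration, here is a minimal sketch of what such a configuration knob might look like. The `KernelAlgorithm` enum and `compute_amax` dispatch are hypothetical names invented for this example, not necessarily the PR's actual API, and the eager bodies are placeholders for the real kernel launches:

```python
from enum import Enum

import torch


class KernelAlgorithm(Enum):
    # Deterministic two-stage reduction through an intermediate per-block buffer.
    REDUCTION = "reduction"
    # Single pass that folds per-block maxima into one scalar via tl.atomic_max.
    ATOMIC_MAX = "atomic_max"


def compute_amax(x: torch.Tensor, algo: KernelAlgorithm = KernelAlgorithm.REDUCTION) -> torch.Tensor:
    """Dispatch to the selected amax strategy.

    Placeholder bodies: in a real implementation each branch would launch the
    corresponding Triton kernel(s); eager torch ops stand in here so the
    sketch runs anywhere.
    """
    if algo is KernelAlgorithm.REDUCTION:
        return torch.max(torch.abs(x))  # stand-in for the reduction-based kernels
    if algo is KernelAlgorithm.ATOMIC_MAX:
        return torch.max(torch.abs(x))  # stand-in for the atomic_max-based kernel
    raise ValueError(f"unsupported kernel algorithm: {algo}")


# Usage: tests can parametrize over both strategies, e.g.
# amax = compute_amax(torch.randn(4096), KernelAlgorithm.ATOMIC_MAX)
```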
Force-pushed from faf855f to 1c06b47.
Summary
- Adds alternate Triton kernels for FP8 conversion that use an atomic_max-based algorithm instead of the reduction-based one: each block performs a `tl.atomic_max` with the shared global max tensor, so the true global max is known once the kernel is complete.
- The reduction-based algorithm requires allocating an intermediate buffer of size `num_elements // BLOCK_SIZE` outside of the kernels, which prevents us from autotuning the block size, since we need to know it ahead of time to allocate the buffer.
- Also made the algorithm used configurable and added unit tests for both.
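To make the two strategies concrete, here is a minimal, self-contained sketch of each. These are not the PR's actual kernels; the kernel and helper names are assumptions made for this example. The atomic variant needs no host-side buffer tied to `BLOCK_SIZE`, while the reduction variant must allocate one before launch:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _block_amax_atomic_kernel(x_ptr, global_amax_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program computes max(abs(x)) over its block, then folds it into a
    # single shared scalar via tl.atomic_max. No intermediate buffer is needed,
    # so BLOCK_SIZE is free to be autotuned; the order of atomic updates across
    # blocks is not deterministic, though the final max value is correct.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    tl.atomic_max(global_amax_ptr, tl.max(tl.abs(x), axis=0))


@triton.jit
def _block_amax_reduction_kernel(x_ptr, block_amaxes_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Reduction-based variant: each program writes its local max into a buffer
    # of size cdiv(n_elements, BLOCK_SIZE) allocated by the host, which is why
    # BLOCK_SIZE must be known before launch rather than autotuned.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0).to(tl.float32)
    tl.store(block_amaxes_ptr + pid, tl.max(tl.abs(x), axis=0))


def amax_atomic(x: torch.Tensor, block_size: int = 1024) -> torch.Tensor:
    x = x.contiguous().view(-1)
    amax = torch.zeros(1, dtype=torch.float32, device=x.device)
    grid = (triton.cdiv(x.numel(), block_size),)
    _block_amax_atomic_kernel[grid](x, amax, x.numel(), BLOCK_SIZE=block_size)
    return amax


def amax_reduction(x: torch.Tensor, block_size: int = 1024) -> torch.Tensor:
    x = x.contiguous().view(-1)
    num_blocks = triton.cdiv(x.numel(), block_size)
    block_amaxes = torch.zeros(num_blocks, dtype=torch.float32, device=x.device)
    _block_amax_reduction_kernel[(num_blocks,)](x, block_amaxes, x.numel(), BLOCK_SIZE=block_size)
    # Second stage: reduce the per-block maxima to the global amax (done in
    # eager mode here for brevity).
    return block_amaxes.max().reshape(1)
```

Since the atomic path allocates nothing whose size depends on `BLOCK_SIZE`, the block size can be autotuned; the trade-off, as noted in the review discussion, is that the atomic updates arrive in a non-deterministic order.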
Test plan
- `pytest test/test.py` - Passing
- `python3 benchmark/benchmark.py` yields a large performance improvement, as seen below. It's now markedly better than the production-path eager execution, but not quite as good as compiled yet.

Performance of reduction-based kernel with untuned block size of 8:
Performance of atomic-max-based kernel with auto-tuned block size: