Problem Description
I have recently shared some benchmarking results in the ROCm Triton repo comparing different implementations of attention on an MI250X.
It is worth pointing out that the implementation in this repo is faster (on average) than the Triton counterpart in the forward pass, but is substantially slower in the backward pass. The detailed results are below for reference.
Forward pass: *(attached benchmark results)*
Forward pass followed by backward: *(attached benchmark results)*
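For reference, here is a minimal sketch of how such forward and forward+backward timings can be collected with this repo's `flash_attn` Python package. This is not the script behind the reported numbers; the tensor shapes and iteration counts below are illustrative assumptions.

```python
# Minimal timing sketch (illustrative, not the reporter's benchmark script).
# Assumes this repo's flash_attn package and PyTorch with ROCm/CUDA support.
import torch
from flash_attn import flash_attn_func

def bench_ms(fn, *args, backward=False, iters=50, warmup=10):
    """Average GPU time of fn(*args) in ms, optionally including backward."""
    for _ in range(warmup):
        out = fn(*args)
        if backward:
            out.sum().backward()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        out = fn(*args)
        if backward:
            out.sum().backward()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Illustrative problem size; flash_attn_func expects (batch, seqlen, nheads,
# headdim) tensors in fp16/bf16 on the GPU.
q, k, v = (torch.randn(8, 2048, 16, 64, device="cuda", dtype=torch.float16,
                       requires_grad=True) for _ in range(3))

print(f"forward:          {bench_ms(flash_attn_func, q, k, v):.3f} ms")
print(f"forward+backward: {bench_ms(flash_attn_func, q, k, v, backward=True):.3f} ms")
```

The same harness can be pointed at the Triton attention kernel to produce a side-by-side comparison under identical shapes.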
Operating System
Red Hat Enterprise Linux 8.8
CPU
AMD EPYC 7A53 64-Core Processor
GPU
AMD Instinct MI250X
ROCm Version
ROCm 5.7.1
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response
There is a WIP refactor of this repo using the newly developed ck_tile from composable_kernel, which will bring speedups on both MI200 and MI300. Stay tuned for updates.