Replies: 8 comments 2 replies
-
Hi @hwu36, we found ourselves extending CUTLASS's classes on several occasions: several times we had to modify existing classes whose properties were …
-
@danthe3rd, thank you very much for your summary. It is very cool. From what you said, it sounds like a fused attention kernel. You already did most of the things I would do if I wrote attention. What is the relation between this one and your flash attention? Collaboration is also very welcome. You can tell from our 2.10 release that CUTLASS needs a good fused attention.
-
This is indeed a fused attention kernel.
In xFormers, we expose a single … In terms of implementation, there are a few key differences with Flash: … In terms of performance, we outperform Flash on the FW for …
I'll send you an email to discuss that further.
[1] Performance comparison for BW: facebookresearch/xformers#469
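For readers coming at this from the PyTorch side, here is a minimal usage sketch, assuming the single exposed entry point is `xformers.ops.memory_efficient_attention` (check the xFormers docs for the exact signature in your version):

```python
import torch
import xformers.ops as xops

# Assumed tensor layout: [batch, seq_len, num_heads, head_dim]
B, M, H, K = 2, 1024, 8, 64
q = torch.randn(B, M, H, K, device="cuda", dtype=torch.half)
k = torch.randn(B, M, H, K, device="cuda", dtype=torch.half)
v = torch.randn(B, M, H, K, device="cuda", dtype=torch.half)

# One call covers the whole softmax(Q @ K^T / sqrt(K)) @ V computation in a
# single fused CUDA kernel, without materializing the attention matrix in
# global memory.
out = xops.memory_efficient_attention(q, k, v)  # -> [B, M, H, K]
```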
-
@danthe3rd would you mind adding me to that email?
-
@danthe3rd, @fmassa and I have been collaborating and swapping notes about FlashAttention / Mem-efficient attention.
-
Our b2b GEMM also has a version that stores the results in registers. It can be used in the short-sequence-length case.
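To make the register-resident idea concrete, here is a hedged NumPy illustration of what a back-to-back (b2b) GEMM computes; this is the math only, not CUTLASS code, and the ReLU epilogue is just a stand-in for whatever sits between the two GEMMs (softmax in the attention case):

```python
import numpy as np

# Shapes chosen for illustration; in the register-resident b2b GEMM the whole
# D0 tile for a threadblock must fit in the register file, which is why this
# variant targets short sequence lengths.
M, N0, N1, K = 128, 128, 64, 64
A  = np.random.randn(M, K).astype(np.float32)    # e.g. Q
B0 = np.random.randn(K, N0).astype(np.float32)   # e.g. K^T
B1 = np.random.randn(N0, N1).astype(np.float32)  # e.g. V

D0 = A @ B0             # first GEMM; in the fused kernel this never reaches global memory
D0 = np.maximum(D0, 0)  # stand-in epilogue between the two GEMMs
D1 = D0 @ B1            # second GEMM consumes D0 directly
```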
-
I am not going to call our current ex41 a fused implementation. It uses grouped GEMM to support variable sequence lengths well, but it is not fast for fixed sequence lengths because the kernel is not fully fused. In the long run, we want to use grouped GEMM together with FMHA to support both fixed and variable sequence lengths efficiently in one kernel.
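As a hedged sketch of why grouped GEMM fits variable sequence lengths: every sequence contributes its own GEMM problem size, so nothing has to be padded to a common length. The names below are illustrative, not the actual ex41 interface:

```python
import numpy as np

head_dim = 64
seq_lens = [37, 512, 129, 1024]  # one entry per sequence in the batch

# Each sequence i defines its own Q @ K^T problem of size (L_i, L_i, head_dim);
# a grouped GEMM runs all of these problem sizes in a single kernel launch.
problems = [(L, L, head_dim) for L in seq_lens]  # (M, N, K) per group

Qs = [np.random.randn(L, head_dim).astype(np.float32) for L in seq_lens]
Ks = [np.random.randn(L, head_dim).astype(np.float32) for L in seq_lens]

# Plain loop as a reference for the per-group shapes; a fused kernel would
# additionally apply softmax and the second GEMM with V per group.
S = [q @ k.T for q, k in zip(Qs, Ks)]  # attention logits, one matrix per sequence
```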
-
The forward pass is merged in #662.
-
So glad to see people using CUTLASS to implement their own kernels, especially something as complex as attention.
@danthe3rd @fmassa