Happy Holidays, Have some Flash Attention v2.3.6 wheel builds since they're not readily available #2313
-
How is performance compared to the latest PyTorch nightly (2.2), which I think has Flash Attention 2?
-
D:\AI\ComfyUI>call conda activate D:\AI\ComfyUI\venv-comfyui
Import times for custom nodes:
Starting server
To see the GUI go to: http://127.0.0.1:8188
-
D:\AI\ComfyUI>conda activate D:\AI\ComfyUI\venv-comfyui
(D:\AI\ComfyUI\venv-comfyui) D:\AI\ComfyUI>python
-
Since flash-attention really only supports two architectures (Ampere A100 and Hopper), with Ada Lovelace and consumer Ampere working only as side effects of the PTX it normally builds for sm_80, I decided to go the opposite direction and build pre-compiled code for the cards that don't cost $20,000-$40,000. The cards that do cost that much run in systems that can easily handle the 1 GB of RAM per HT core the build uses by default (and can build the PTX for themselves near-instantly), so those get PTX-only code: either PTX for the lower of the two models (in the case of the sm_80 + sm_86 package) or a combined cubin + PTX for the same arch in the sm_89 build, since I have no way of testing anything on Hopper anyway.
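If you're not sure which of these your card is, one quick way to check (a sketch, assuming torch with CUDA is already installed in your environment) is to ask torch for the device's compute capability:

```
D:\AI\ComfyUI>python -c "import torch; print(torch.cuda.get_device_capability())"
(8, 9)
```

(8, 0) is A100 / sm_80, (8, 6) is consumer/workstation Ampere / sm_86, (8, 9) is Ada Lovelace / sm_89, (9, 0) is Hopper.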
Update
Torch 2.1.2 came out, so I built xformers and somehow managed to get the integrated Flash Attention 2 working; now you can just install one package. It's two for the price of one!
xFormers + Flash Attention 2 for Torch 2.1.2-cu121 / Python 3.11
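Rough install sketch (the wheel path below is just a placeholder for whatever the release file is actually named); `python -m xformers.info` is xformers' own diagnostic and will list which memory-efficient attention backends, including the flash ops, it actually picked up:

```
:: wheel filename is a placeholder - use the actual file from the release
D:\AI\ComfyUI>pip install path\to\xformers_wheel.whl
D:\AI\ComfyUI>python -m xformers.info
```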
Older
Just install + upgrade xformers (some new Flash Attention 2 functions have landed in recent versions), install the highest-SM wheel that's directly compatible with your GPU for your Python version (3.10 / 3.11), and make sure you're running torch 2.1.1-cu121. flash-attention has no support for cards below sm_80 yet, so I'm not going to build those, and I can't test on Ampere. Those binaries are larger because they're dual-architecture:
You should be able to install from the GitHub link as well; a quick install and sanity-check sketch follows the list below.
Ada Lovelace (sm_89) optimized binary + compute_89 PTX for Python 3.10 or 3.11 - download if you've got Ada Lovelace and the current torch release
Ampere A100 (sm_80) optimized binary + compute_80 PTX in Python 3.10 (thanks to a script typo, oops), plus Ampere consumer / workstation cards (sm_86 compiled + compute_86 PTX, in a larger file) - download for Ampere; it should work with Ada too if you've got a 3000-series and a 4000-series in the same system.
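A minimal install + sanity-check sketch (the wheel name is a placeholder; substitute whichever build matches your GPU and Python version):

```
:: placeholder filename - use the wheel you downloaded
D:\AI\ComfyUI>pip install path\to\flash_attn_wheel.whl
D:\AI\ComfyUI>python -c "import flash_attn; print(flash_attn.__version__)"
2.3.6
```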
If somebody really needs it, I still have CUDA 11.8 installed alongside 12.3, so I can build against it if you're running a torch 2.1.1-cu118 build.
These were built with CUDA 12.3 so you can use the environment variable:
CUDA_MODULE_LOADING=lazy
to enable lazy loading of kernels. That keeps it from spending time compiling on first load on non-prebuilt arches (somebody probably has a Jetson sm_87 device) and from loading things you don't need when they're already built.
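For example, to set it for a single cmd session before launching (the `python main.py` line assumes you start ComfyUI that way; adjust to however you normally launch):

```
:: applies only to this cmd window
D:\AI\ComfyUI>set CUDA_MODULE_LOADING=lazy
D:\AI\ComfyUI>python main.py
```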
I will attempt to keep releases up to date with new official releases of flash attention (since upstream doesn't seem motivated to make wheels for Windows installs) and official torch releases, so the current one stays usable.
HOW DO I KNOW THIS DOESN'T HAVE A VIRUS OMG
This is the internet, so you don't. You should ask this of everything, whether you have the source or not. My repo for these builds is mostly empty except for links to the main flash-attn repo, because I don't need to change the source outside of the setup.py that configures build options; it was easier to do that than to sort out which environment variables the script actually used.
Why not build this myself?
Because I already did it, and it takes 2 minutes now that I know the best configurations to include and have read the nvcc manual to clarify what the three variants of every CUDA arch are meant to do. The only reason I built it originally was so it could be linked against CUDA 12.3 and lazy loading would work. I plan on building torch against 12.3 soon too.
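If you'd still rather build it yourself, the RAM-per-core issue mentioned above is usually handled by capping the parallel compile jobs; upstream flash-attn suggests something along these lines (the job count here is just an example, tune it to your RAM):

```
:: limit parallel compile jobs so the build doesn't eat all your RAM, then build from source
D:\AI\ComfyUI>set MAX_JOBS=4
D:\AI\ComfyUI>pip install flash-attn --no-build-isolation
```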
Why not a package with just Ampere sm_86 and Ada sm_89 compiled to cover all the consumer and workstation hardware, ya dummy?
That would have been the best thing to do in the first place, so I'm going to do it right now, but only for python 3.11. It'll be in the releases in a while.