Happy Holidays, Have some Flash Attention v2.3.6 wheel builds since they're not readily available #2313
-
How is performance compared to the latest PyTorch nightly (2.2), which I think has Flash Attention 2?
-
D:\AI\ComfyUI>call conda activate D:\AI\ComfyUI\venv-comfyui
Import times for custom nodes:
Starting server
To see the GUI go to: http://127.0.0.1:8188
-
D:\AI\ComfyUI>conda activate D:\AI\ComfyUI\venv-comfyui
(D:\AI\ComfyUI\venv-comfyui) D:\AI\ComfyUI>python
-
Since flash-attention really only supports two architectures (Ampere A100 and Hopper), with Ada Lovelace and consumer Ampere working only as side effects of the PTX it normally builds for sm_80, I decided to go the opposite direction and build pre-compiled code for the cards that don't cost $20,000-$40,000. The cards that do cost that much run in systems that can easily handle the 1 GB of RAM per HT core the build uses by default (and can build the PTX for themselves near-instantly), so those get PTX-only code: either PTX for the lower of the two models (in the case of the sm_80 + sm_86 package) or a combined cubin + PTX for the same arch in the sm_89 build, since I have no way of testing anything on Hopper anyway.
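If you're not sure which of these your card is, one quick way to check (a sketch, assuming torch with CUDA is already installed in your environment) is to ask torch for the device's compute capability:

```
D:\AI\ComfyUI>python -c "import torch; print(torch.cuda.get_device_capability())"
(8, 9)
```

(8, 0) is A100 / sm_80, (8, 6) is consumer/workstation Ampere / sm_86, (8, 9) is Ada Lovelace / sm_89, (9, 0) is Hopper.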
Update
Torch 2.1.2 came out, so I built xformers and somehow managed to get the integrated Flash Attention 2 working; now you can just install one package. It's two for the price of one!
xFormers + Flash Attention 2 for Torch 2.1.2-cu121 / Python 3.11
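Rough install sketch (the wheel path below is just a placeholder for whatever the release file is actually named); `python -m xformers.info` is xformers' own diagnostic and will list which memory-efficient attention backends, including the flash ops, it actually picked up:

```
:: wheel filename is a placeholder - use the actual file from the release
D:\AI\ComfyUI>pip install path\to\xformers_wheel.whl
D:\AI\ComfyUI>python -m xformers.info
```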
Older
Just install + upgrade xformers (some new Flash Attention 2 functions have landed in recent versions), install the highest-SM wheel that's directly compatible with your GPU for your Python version (3.10 / 3.11), and make sure you're running torch 2.1.1-cu121. flash-attention has no support for cards below sm_80 yet, so I'm not going to build those, and I can't test on Ampere. Those binaries are larger because they're dual-architecture:
You should be able to install from the GitHub link as well; a quick install and sanity-check sketch follows the list below.
Ada Lovelace (sm_89) optimized binary + compute_89 PTX for Python 3.10 or 3.11 - download if you've got Ada Lovelace and the current torch release
Ampere A100 (sm_80) optimized binary + compute_80 PTX in Python 3.10 (thanks to a script typo, oops), plus Ampere consumer / workstation cards (sm_86 compiled + compute_86 PTX, in a larger file) - download for Ampere; it should work with Ada too if you've got a 3000-series and a 4000-series in the same system.
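A minimal install + sanity-check sketch (the wheel name is a placeholder; substitute whichever build matches your GPU and Python version):

```
:: placeholder filename - use the wheel you downloaded
D:\AI\ComfyUI>pip install path\to\flash_attn_wheel.whl
D:\AI\ComfyUI>python -c "import flash_attn; print(flash_attn.__version__)"
2.3.6
```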
If somebody really needs it, I still have CUDA 11.8 installed alongside 12.3, so I can build against it if you're running a torch 2.1.1-cu118 build.
These were built with CUDA 12.3 so you can use the environment variable:
CUDA_MODULE_LOADING=lazy
to enable lazy loading of kernels. That keeps it from spending time compiling on first load on non-prebuilt arches (somebody probably has a Jetson sm_87 device) and from loading things you don't need when they're already built.
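For example, to set it for a single cmd session before launching (the `python main.py` line assumes you start ComfyUI that way; adjust to however you normally launch):

```
:: applies only to this cmd window
D:\AI\ComfyUI>set CUDA_MODULE_LOADING=lazy
D:\AI\ComfyUI>python main.py
```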
I will attempt to keep releases up to date with new official releases of flash attention (since upstream doesn't seem motivated to make wheels for Windows installs) and official torch releases, so the current one stays usable.
HOW DO I KNOW THIS DOESN'T HAVE A VIRUS OMG
This is the internet, so you don't. You should ask this of everything, whether you have the source or not. My repo for these builds is mostly empty except for links to the main flash-attn repo, because I don't need to change the source outside of the setup.py that configures build options; it was easier to do that than to sort out which environment variables the script actually used.
Why not build this myself?
Because I already did it, and it takes 2 minutes now that I know the best configurations to include and have read the nvcc manual to clarify what the three variants of every CUDA arch are meant to do. The only reason I built it originally was so it could be linked against CUDA 12.3 and lazy loading would work. I plan on building torch against 12.3 soon too.
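If you'd still rather build it yourself, the RAM-per-core issue mentioned above is usually handled by capping the parallel compile jobs; upstream flash-attn suggests something along these lines (the job count here is just an example, tune it to your RAM):

```
:: limit parallel compile jobs so the build doesn't eat all your RAM, then build from source
D:\AI\ComfyUI>set MAX_JOBS=4
D:\AI\ComfyUI>pip install flash-attn --no-build-isolation
```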
Why not a package with just Ampere sm_86 and Ada sm_89 compiled to cover all the consumer and workstation hardware, ya dummy?
That would have been the best thing to do in the first place, so I'm going to do it right now, but only for python 3.11. It'll be in the releases in a while.