Have some instructions for using Xformers on Win64 with the normally disabled 2:4 sparse tensor routines and Triton (Triton should also enable some missing items in torch) #2870
NeedsMoar started this conversation in Show and tell
Replies: 1 comment
After some testing, the main thing I've noticed is that very large batch sizes now seem to run at the same speed as smaller ones, depending on the resolution. For example, the "optimal" batch size for 512x512 images on the 4090 used to be around 20 when averaged per image; now a batch of 32x512x512 runs at the same total speed as 16x512x512, roughly 2s/it with AnimateDiff at both sizes, while non-power-of-two values in the 20s were slightly slower. I don't know exactly how that works, but I'll take it. Unless something has changed in Comfy or AnimateDiff that would explain it? I can't keep up with the huge number of changelists. :-)
The wheel for xformers on Windows includes flash attention now, but while checking the enabled-features list I was sad to see that the "disabled" section had gotten bigger. This saddens me because the whole point of bothering with a disaster of a scripting language like Python in the first place is keeping your project cross-platform without the usual messes. Unfortunately, most things are too slow to do in native Python, so practically all major functionality is implemented in C++ / CUDA / etc., and people who use Linux absolutely hate making things cross-platform, then use weird GCC extensions that make porting even harder. For the record, I did an internal port of LLVM and clang plus our product integration to Windows in under a week, before LLVM would even build for Windows out of the box; this was mainly possible because a huge number of ISO C++ members were working on it like crazy, presumably so they'd never have to look at the abomination that is the GCC codebase again. /rant
These instructions are for Win64, Torch 2.2, xformers 0.0.24 and Python 3.10 or 3.11.
The sparse tensor portion just requires two steps:
Now you're done with the hard part, apart from figuring out how to use the API in practice.
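As a starting point for "using the API in practice", here's a minimal sketch using the 2:4 semi-structured sparsity entry point that ships in Torch 2.2 (`torch.sparse.to_sparse_semi_structured`); the private `_FORCE_CUTLASS` flag is a guess about sidestepping the cuSPARSELt backend, which may not be present in a Windows build, not something from the steps above:

```python
import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor

# Assumption: force the CUTLASS kernels in case the cuSPARSELt backend
# isn't available on this platform.
SparseSemiStructuredTensor._FORCE_CUTLASS = True

# 2:4 semi-structured sparsity: in every contiguous group of four values
# along a row, at least two must be zero. Build a weight matrix that
# already satisfies that pattern.
mask = torch.Tensor([0, 0, 1, 1]).tile((128, 32)).half().cuda()  # 128x128
w = torch.rand(128, 128).half().cuda() * mask
x = torch.rand(128, 128).half().cuda()

w_sparse = to_sparse_semi_structured(w)  # compress into the 2:4 format
y = torch.mm(w_sparse, x)                # dispatches to the sparse kernels
assert torch.allclose(y, torch.mm(w, x), atol=1e-2)
```

If the compressed matmul runs without raising, the sparse kernels are actually being hit; on builds where they're still disabled, `to_sparse_semi_structured` should error out instead.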
There's no point in adding conditional code for macOS unless you have a machine old enough to take a CUDA card that's still useful for running Stable Diffusion models, or maybe one of the more recent Intel Mac Pros running Windows, assuming Apple didn't block NVIDIA hardware from working.
The triton portion is easier thanks to the wheel builds someone made:
Just download the Triton artifacts from here, extract the wheel for your version of Python, run pip install on it, and you're set.
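To sanity-check the wheel before moving on, the standard vector-add kernel from the Triton tutorials (nothing specific to these builds) makes a decent smoke test:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

n = 4096
x = torch.rand(n, device="cuda")
y = torch.rand(n, device="cuda")
out = torch.empty_like(x)
add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)  # passing means Triton can JIT and launch kernels
```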
After all this is done you can check that xformers sees everything by running
python -m xformers.info
which should output something like this:
I believe the triton flash attention functions are unavailable when the flash-attention-2 versions are in use, so that's normal, and the sequence_parallel stuff requires an old NVIDIA library that has never been supported on Windows and only benefits multi-GPU setups, so I think that's everything you can get. I haven't explored what this lets torch use, but I think the lack of triton was the reason inductor and possibly torch.compile didn't work right, so I'll leave that to somebody who likes messing around in Python to play with.
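For anyone who does want to poke at that, the experiment is short. A sketch, assuming the stock Torch 2.2 torch.compile API (whose default inductor backend generates GPU kernels through Triton; whether that fully works on Windows with these wheels is exactly the open question):

```python
import torch

def f(x):
    # Something trivially fusable, so inductor has real work to do.
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

compiled = torch.compile(f)  # default backend is inductor
x = torch.rand(1 << 20, device="cuda")
print(torch.allclose(compiled(x), f(x)))  # True if the generated kernel ran correctly
```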