-
Notifications
You must be signed in to change notification settings - Fork 645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ROCm support #756
Add ROCm support #756
Conversation
You have logic bug there, if |
Tested on Ubuntu 22.04.3 with RX 7900 XTX and passed with the following command: Without ROCM_HOME make cannot find proper headers to build. |
I get this error: |
Can compile using the official rocm pytorch docker image and then pip install on the original desktop environment and it works just fine though |
Not really that used to working with Makefiles directly. If it's empty it should skip that check since it doesn't matter what target you set as it won't be able to find libraries / binaries with ROCM_HOME being empty. I just tried to roughly copy the cuda checks since that also only takes one parameter. I agree it should throw an error, I'll see what I can do, thanks.
If ROCM_HOME isn't already set it will try to automatically find with the command which
hipblaslt is available since ROCM 5.6 (which is why I called my fork that). If you are below that this fork won't work and I'd recommend any another one that can be found in the linked issues (no 4bit support though). I dunno if ROCm 5.7 changed anything. From what I read 6.0 is gonna be the release where stuff isn't backwards compatible anymore so I don't think that should already be the case. |
FWIW, Arch Linux's ROCM doesn't seem to be distributing hipblaslt (yet), so I am also getting missing hipblaslt.h errors on 5.6 there. It doesn't seem to be a straightforward (for me) thing to build, either. |
You don't need the actual library, just headers, as igenn is still disabled for this anyways (they were needed for this to compile though, I could've probably just made some placeholder defines so it compiles, but since I had the header I decided not to). The headers can be found under https://github.com/ROCmSoftwarePlatform/hipBLASLt/tree/develop/library/include and you could just put those in /opt/rocm/include. I don't know if they have any other dependencies, but other than that it should work. |
That worked (at least as far as being able to execute |
How do you deal with multiple gpus of different targets. I have both MI100s (gfx908) and W6800s (gfx1030) in my machine. Can I use ROCM_TARGET=gfx908;gfx1030? |
Any joy with this patch? |
Hi, is there any update on this? Would love to have this merged! |
@arlo-phoenix pls add support for ROCM 5.7 |
7c77be4
to
3c9c262
Compare
Since there seems to now be an official plan to extend support to multiple platforms / hardware targets this will probably have to adjust if it's gonna be merged. I personally want to wait for ROCm 6.0 since that might break stuff. And even if this gets merged it would just be basic support without matrix cores as I don't have a more recent one. And even if I did, the hipblasLt project (at least according to their docs) only officially supports
While ROCm is annoying with having to compile for each arch, nothing should've broken between ROCm 5.6 and 5.7 (will likely happen with 6.0). Since many people still have problems with building this: git clone https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6
cd bitsandbytes-rocm-5.6
ROCM_TARGET=gfx1030 make hip
pip install . At least on my system you don't even need to set the ROCM_TARGET as it will just build it for all targets. I still recommend it for a faster build process. For finding it use
or just look for your GPU under https://www.llvm.org/docs/AMDGPUUsage.html#processors Edit: Just noticed 6.0 was already out .-., I'll update this once the official docker images are updated as well |
@arlo-phoenix rocm 6.0 docker has arrived https://hub.docker.com/r/rocm/pytorch/tags |
Thanks for the info! Only tested the basic stuff and will probably only further test after the holidays, but it still compiles, 4 bit works and all optimizers also all work (at least according to pytest). So summed up ROCm 6.0 breaks nothing in this after all. @Titus-von-Koeller, I only skimmed through #898, but from what I see the idea is to add the ability to have different backends with one of them being the current implementation now under a CudaBackend. From my perspective this won't really change this PR that much then (only gotta move some checks) since there isn't really a need for a separate backend for HIP and AMD GPU's should just use the CudaBackend as well.
|
One thing i should note about this pr is that since it dose not support wave64 it should really refuse to compile on those, or assert at run time, right now it produces incorrect results. all amd ai/compute focused gpus are wave64 only (ie mi25,mi50,mi100,m210 all the way to the latest mi300) its only consumer gpus newer than radeon VII that can do both wave64 and wave32 so this pr excludes the very gpus that are best suited to be used in ml. |
Hello, any news about 6.0 update? |
Seconding this ^ |
Status quo is still
not finetuning anything atm, but since it still compiles and tests succeed it should still work as expected. The only thing I expected to break was the makefile or some includes or defines becoming deprecated, but didn't see anything.
7900XTX should work, it's wavefront 64 that doesn't work and 7900XTX has the normal wavefront size 32. It would not become a battle here though as this doesn't support hipblaslt yet meaning no matrix cores are used and so the 7900XTX /MI300 wouldn't perform well at all. This isn't something I can implement/test myself so someone else will need to do that. The changes shouldn't be too large, just a small python check if the hip device supports hipblaslt where gemm support is checked and adjusting the Makefile to actually use the library.
That's interesting, didn't find anything last time because I didn't bother into looking into large architecture description PDF's just to look for a wavefront size, but you are right. Then it's a bit more important, I assumed it was only the CDNA1 that was just supporting wavefront size 64. I'll try to think of a good way to include them anyways. The wavefront size override won't actually affect how everything is executed, it's just that some compile time asserts are not triggered anymore (from what I remember). The define override should still be removed / only be called if something like FORCE_WAVEFRONT32 is set. I'll try to see if I can just trigger a trap in device code for the unsupported functions so it compiles. If that's actually the case it would be enough to add a one time warning with a fallback in the affected block size functions to actually use the next larger block size or throw an exception if that doesn't work. The problem with the second solution will be that most projects use the smallest BLOCK_SIZE for 4bit stuff which means e.g. https://github.com/TimDettmers/bitsandbytes/blob/f63abb5a0d0bc971d28972ba890a9e59596caac4/csrc/kernels.cu#L3976 for FP4 is called which doesn't work with the larger wavefront size of 64. So that fallback / exception if impossible with tensor size would need to go here https://github.com/TimDettmers/bitsandbytes/blob/f63abb5a0d0bc971d28972ba890a9e59596caac4/bitsandbytes/functional.py#L690 Not experienced at all with that so no idea if that even works, but if it doesn't work to just use the next larger BLOCK_SIZE we can always just throw an exception and nothing that doesn't work would be called. Same would need to be done for dequantize. |
@arlo-phoenix I can confirm that kQuantizeBlockwise works on gfx90* now and this solution is sane. However over all i am leaning towards amds solution of hipifying once and keeping the hip code seperate, as this would allow further optimization for gcn/cdna without a mess of ifdefs The amd repo is also interesting, the reason they haven't published it widely is presumably because it is still quite unfinished as evidenced by the myriad of disabled tests and the makefile jank. That said after patching the makefile as below to disable hipblaslt usage amds version also works (tested on gfx1030, gfx900, gfx906 and gfx908) for the purposes of the non-amd-disabled tests i tried, transformers bnb integration and the 8bit adamw.
|
ROCM_TARGET=gfx1030 make hip Uses define BITS_AND_BYTES_USE_ROCM to redefine cuda functions to ROCm equivalent credit to previous ports: Co-authored-by: broncotc <[email protected]> Co-authored-by: agrocylo <[email protected]>
disables igemm for now and adds path to compiled library libbitsandbytes_hip_nohipblaslt
the unrolls already somehow worked correctly before, but they shouldn't have.
I wanted to try this out, so I installed it, (and also https://github.com/ROCmSoftwarePlatform/triton which it needs) and I got:
If you happen to recignize this error, could you please set me straight? |
sry never needed to use triton for what I did so far. I assume you are on a newer GPU then with matrix core support. If so I recommend giving the official fork a try. There's been a lot of progress in the last weeks from what I saw and should give a more stable/faster experience. Edit: The docker had |
I agree. I'm keeping this PR open for now since the official fork still only works with hipblaslt without your patch so a working fork is easier to find, but definitely not something that should be merged anymore. The reason I opted for defines is that I really didn't like filling the pythoninterface.c with all the ifdefs and I was/am too inexperienced to come up with a solution for multiple backends. But now with there being an effort to create a proper backend system this won't really matter anymore. I skimmed through the source code and an ifdef solution wouldn't even have been possible since the API's differ a lot more after all. If anyone from AMD is reading this (issues aren't enabled in the fork), are there any plans for integrating ROCm upstream in bitsandbytes? There is currently an ongoing discussion on how backends should be integrated and I think it makes more sense for someone currently working on this to chime in there #898. |
Hey wonderful AMD people, we are try to enable rocm support, could one of you assist or put us in contact with a person? Given you currently have a fork it would be beneficial to have a long term up-streamed solution and save support effort overall.
@Lzy17 @howiejayz @CRobeck @kuhar @jerryyin @keryell @jeffdaily @keryell @dllehr-amd |
@Iron-Bound @amathews-amd is grabbing the right folks on our side. |
Thank you! This is huge! |
Thanks again everyone for supporting us on upstreaming AMD support. Just wanted to write again to renew our support in making that work out. Currently, there's a lot of work on deciding on how to best handle the backend abstraction in #1077 and #898. Once we're through with that relatively soon the path is free for follow-up PRs with enabling the individual functionalities in the AMD backend. Feel free to also chip in on the backend abstraction discussions. This decision will likely set some stuff "in stone", so it is important to get it right. |
The ROCm fork: https://github.com/ROCm/bitsandbytes/tree/rocm_enabled That fork should work:
for all ROCm supported GPUs. They also marked still failing tests for skipping so all tests under For installing from an empty venv:
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/rocm6.0
you can find your git clone https://github.com/ROCm/bitsandbytes.git
cd bitsandbytes
git checkout c037a306e97ced3c452570132f66aac4e2964056
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -DCOMPUTE_BACKEND=hip -DAMDGPU_TARGETS=gfx1030 -S .
cmake --build . --config Release
pip install . I think I've kept this open long enough so I'm closing this. The official fork now has better support for all devices and is easier to find than when I initially discovered it. |
Thanks @arlo-phoenix and everyone involved to get this done ❤️ |
what about 6.1? |
what about it? |
For those willing to alpha test, the ROCm backend is already available for that when compiling from source from the multi-backend-refactor branch. See today's change to the Readme for more details. |
Very interested |
we re gonna announce the alpha release with downloadable pre-compiled wheel file early next week |
Edit: See #756 (comment) for current status of ROCm support
Inspired by the llama.cpp ROCm port, I decided to try and use a similar approach for bitsandbytes and worked through the different hipified cuda functions/classes and just redefine them with the HIP equivalents. This only happens if
BNB_USE_HIP
is set and merging this shouldn't affect the CUDA code at all. It's also easier to maintain than keeping a parallel hip code base alive.This PR adds the target hip to make and works with the most recent version (0.42.0) with ROCm 5.6+ (6.0 included). For installing just do
It won't pass all tests as some are igemm or Cuda specific, but all optimizers work in both 8bit and 32bit. I also used this a lot with llama 4-bit interference, that also works. The tests that fail are beside those test_autograd.py and anything with double_quant in its name, I assume that also has to do with matrix multiplication and is expected to fail.
Besides that igemm / Matrix core support for the more recent AMD GPU's is still impossible because of missing instructions in hipBLASLt. There is also an official fork which tries to enable it, but doesn't seem finished yet. If you want to use the official fork without hipblasLt @IMbackK provided a patch for it which should work on all ROCm supported GPU's.
I'm making this a draft for now, as it is still not well tested and I haven't really updated the documentation yet. From an actual code standpoint not much will change on my side as I only own a gfx1030 GPU and thus can't test igemm support.
Closes #47, closes #107, closes #681