-
Notifications
You must be signed in to change notification settings - Fork 645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework CUDA/native-library setup and diagnostics #1041
Rework CUDA/native-library setup and diagnostics #1041
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Really nice refactoring, much easier to follow the code. I would argue the setup code should embrace further device abstraction, but I think this is being worked on already and this looks like a really nice improvement! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for this huge work and refactor ! 🙏
I quickly tried this branch out, and I managed to compile bnb on your branch, but when calling python -m bitsandbytes
I get:
Could not find the bitsandbytes CUDA binary at PosixPath('/bitsandbytes/bitsandbytes/libbitsandbytes_cuda121.so')
Could not load bitsandbytes native library: /bitsandbytes/bitsandbytes/libbitsandbytes_cpu.so: cannot open shared object file: No such file or directory
Traceback (most recent call last):
File "/bitsandbytes/bitsandbytes/cextension.py", line 110, in <module>
lib = get_native_library()
File "/bitsandbytes/bitsandbytes/cextension.py", line 97, in get_native_library
dll = ct.cdll.LoadLibrary(str(binary_path))
File "/opt/conda/envs/peft/lib/python3.8/ctypes/__init__.py", line 451, in LoadLibrary
return self._dlltype(name)
File "/opt/conda/envs/peft/lib/python3.8/ctypes/__init__.py", line 373, in __init__
self._handle = _dlopen(self._name, mode)
OSError: /bitsandbytes/bitsandbytes/libbitsandbytes_cpu.so: cannot open shared object file: No such file or directory
CUDA Setup failed despite CUDA being available. Please run the following command to get more information:
python -m bitsandbytes
Inspect the output of the command and see if you can locate CUDA libraries. You might need to add them
to your LD_LIBRARY_PATH. If you suspect a bug, please take the information from python -m bitsandbytes
and open an issue at: https://github.com/TimDettmers/bitsandbytes/issues
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
CUDA specs: CUDASpecs(highest_compute_capability=(8, 0), cuda_version_string='121', cuda_version_tuple=(12, 1))
PyTorch settings found: CUDA_VERSION=121, Highest Compute Capability: (8, 0).
Could not find the bitsandbytes CUDA binary at PosixPath('/bitsandbytes/bitsandbytes/libbitsandbytes_cuda121.so')
Traceback (most recent call last):
File "/opt/conda/envs/peft/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/envs/peft/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/bitsandbytes/bitsandbytes/__main__.py", line 4, in <module>
main()
File "/bitsandbytes/bitsandbytes/diagnostics/main.py", line 44, in main
print_cuda_diagnostics(cuda_specs)
File "/bitsandbytes/bitsandbytes/diagnostics/cuda.py", line 112, in print_cuda_diagnostics
if not binary_path.exists():
AttributeError: 'NoneType' object has no attribute 'exists'
Any idea what I did wrong and how to fix this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Below will solve the error but I am still not able to compile and use bnb on a docker image
116c910
to
42edea0
Compare
@younesbelkada Rebased and fixed your excellent catch :) Funnily enough this still works fine on my machine even if the diagnostics say
since what really matters for linkage (on Linux) is the
– I think the env path diagnostics should probably be reworked too. |
Ok, this is looking really good so far. We're planning to merge #898 after a final review from Tim and me this end of week. After that I would like to merge this one here and #1060. Also, I see a bunch of PRs from @rickardp: Any particular order you want these merged/ looked at? Please make me aware of any dependencies/ conflicts between all mentioned PRs, if you're aware of them or if you think anything should come first. Thanks again all for the valuable work your contributing :) |
The most important one of my PRs right now is #1050 but no one should block this one. I try to stay clear of the Python code as I know there's a lot going on there right now |
Rebased, incorporating #1064. |
So yeah, @Titus-von-Koeller, this is again mergeable 😅 |
174f575
to
20ed75b
Compare
Works on my Windows machine too (though the output really does need some cleaning, as noted in the PR description).
|
b76c8cb
to
e213618
Compare
from pathlib import Path | ||
import platform | ||
|
||
DYNAMIC_LIBRARY_SUFFIX = { | ||
"Darwin": ".dylib", | ||
"Linux": ".so", | ||
"Windows": ".dll", | ||
}.get(platform.system(), ".so") | ||
|
||
PACKAGE_DIR = Path(__file__).parent | ||
PACKAGE_GITHUB_URL = "https://github.com/TimDettmers/bitsandbytes" | ||
NONPYTORCH_DOC_URL = "https://github.com/TimDettmers/bitsandbytes/blob/main/docs/source/nonpytorchcuda.mdx" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
much cleaner like this, nice! also especially like the .get(platform.system(), ".so")
import torch | ||
|
||
|
||
@dataclasses.dataclass(frozen=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also, super nice and clean, the whole module! Really like how you factor out everything, much easier to follow and maintain now
return PACKAGE_DIR / library_name | ||
|
||
|
||
class BNBNativeLibrary: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this refactor here, I also find very useful, much more pythonic and maintainable. thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
of course there's still a lot of work, but this is a significant step forward
Ok, I just went through everything one more time and it's looking great, as always. Thanks for providing tweaks for the things that came up in yesterdays review and that we discussed in Slack this morning. I ran the test suite and despite the usual flakiness, there's only one issue:
The Transformer integration tests also came through clean, not that I was expecting an issue, but just ran them to be sure. I'm not sure to what extent this test is actually useful. Wdyt? We could also delete it, in case it doesn't. Anyways, maybe that's another thing to discuss in dev corner, to what extent we would like test coverage for this setup stuff and to what extent that is even practically testable and if this is useful or not. Otherwise, this is ready to merge. Great work! Really glad to have you on board at BNB ❤️ |
@Titus-von-Koeller Ah, good catch, yeah. Fixing that test actually uncovered a bug regarding the override mechanism (it was using an unused |
fd723b7
into
bitsandbytes-foundation:main
Well, good that we caught it! Nice additions to the tests as well. Thanks again for the great work on this cleanup! Glad to merge. |
- This PR removes $CUDA_HOME/lib from $LD_LIBRARY_PATH because bitsandbytes can follow PyTorch's CUDA runtime since bitsandbytes-foundation/bitsandbytes#1041, and setting $LD_LIBRARY_PATH for `bitsandbytes` is not necessary any more - Hard-coded `/run/opengl-driver/lib` are removed because nix-gl-host can print the driver path for both Ubuntu and NixOS since numtide/nix-gl-host#16 - `link-nvidia-drivers.nix` is removed because drivers are instead copied to a cache directory by nix-gl-host
This PR reworks the native-library setup code that was partially enmeshed with the diagnostics code. I'm sorry it's all in one big commit, this was a bit too hard to pull into smaller ones so it still made sense 😓
The main thing is that
cuda_setup/
is gone; in its place arecuda_specs
, which is a simple dataclass containing information about the current cuda environment, and the native-library loading code is now incextension.py
. (This would be further reworked when more backends are introduced, of course; see Enable common device abstraction for 8bits/4bits #898.)cextension
introduces a minimal proxy object for thectypes.CDLL
(a precursor to aBackend
), so a global CUDA-specific constantCOMPILED_WITH_CUDA
is not needed.diagnostics/
, which is a separate thing that's being run bypython -m bitsandbytes
.cuda_setup
and__main__.py
; I got rid of the latter implementation in favor of the stuff that had already been there incuda_setup
. As far as I could tell, though, that code was only ever used to sniff around for CUDA runtime libraries, but that information was only shown to the user as a diagnostic.nvcuda
orlibcudart
now smells like CUDA.########
headers turn into monstrous Markdown headings. That's for another PR, though.)Additionally, this PR trivially enables the Mac
libbitsandbytes_cpu.dylib
native library to be found by adding it to theDYNAMIC_LIBRARY_SUFFIX
mapping.I tested locally that:
libbitsandbytes_cuda121_nocublaslt.so
gets loaded as expected and tests pass as before.This is still in no way perfect, but it's better IMO :)
Refs #918