-
Notifications
You must be signed in to change notification settings - Fork 645
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable common device abstraction for 8bits/4bits #898
Enable common device abstraction for 8bits/4bits #898
Conversation
Co-authored-by: Jiong Gong <[email protected]>
Hi @TimDettmers, |
Thanks for your contribution! We'll look into it and get back to you soon. |
Thank you so much for this contribution. We discussed internally how to best integrate this and other libraries. We think it is best to abstract the In the next weeks, we will work on splitting the What do you think about this approach? @rickardp, @abhilash1910, and @arlo-phoenix: this is also relevant for your integration. Please give us feedback on the |
That sounds like a good approach to me. The only thing there I remember is that there was some specific code around device selection currently. Not sure if it is needed or not. |
Hi @Titus-von-Koeller . It's exciting to see that BNB is going to support more platforms! Both device abstraction designs are acceptable if one can be applied soon, so we can start the following work to optimize the CPU. Thanks : ) |
I think Common Device abstraction would be applicable in this case. Although a hybrid combination of solutions would be ultimately needed. This is because from the compiler side , CUDA runtime should be agnostic to SYCL runtime (although common headers/utility methods should be used). Maybe linking up separate compiler specific kernels to separate pybinds/cmakes keeping the interface common can be used? Seeing the progress in #1077 |
@jiqing-feng Just to give you a short interim update. I've spent quite some time looking into the matter this week and am in the middle of discussions with Tim this. There are still a few open questions. I'll get back to you with something concrete asap. |
Cross-posted reply to #898
The four sub-interfaces look good to me. We probably want to create higher-level abstractions for 4-bit quantization/dequantization and some other functions should be deprecated such as percentile clipping, and the non-tensor-core 8-bit matrix multiplication. It seems more people like the tensor-driven dispatch. I find it difficult to read and debug and it can lead to more accidental bugs compare to a singleton. There are also further problems. For example, for paged tensor, the device is For a tensor type dispatch, how would you handle the case where the user uses the wrong device? If a tensor is on the CPU, I guess it would silently be executed on the CPU (if supported). This would be slow without the user knowing that something is wrong and we have no way of detecting if something is wrong because we just dispatching based on device type.
Some functions specific to the device would be specializations in the base class and would not be provided in the interface. So these should be "private" functions for the base class. I think a runtime singleton would be overkill. The case that someone has multiple accelerators of different device type would be very rare (probably only developers have this). We should have a special manual hack for this rather than degrading the overall system for this use-case. So, a static one-time singleton would be preferable from my view.
Agreed. I am not sure if all device type would work the same way. For example, does
The PR #1107 was based on a discussion that I had with @Titus-von-Koeller. It was my preferred view at the time, but we should discuss what would be best for everyone and decide based on that. Cross-posted reply to #1077
I think this might be the way to go. If we use a singleton we do not need this, but for a dispatch scenario we need to do this and it is probably the best solution to do an if-else on
This would be more general, and we could even abstract it to disk -> unified memory architectures like Apple silicon. It requires more work, but would be a great feature for Apple silicon to work with larger models on smaller Macbooks and the iPhone etc. The main problem here is that it requires quite a bit of work, but I think it could well be worth it to create this abstraction as it is more general and quite powerful (it can be used beyond optimizers, for example also for matrix multiplication / weights).
I think this is the main advantage of non-static singleton. I would optimize for the user-side rather than developer-side. A singleton can prevent a lot of accidental bugs and enhances the user-experience for cases where the user forgot to move to the correct device. This can become a problem in particular if we support CPU devices. I think the main discussion point should be which architectural design will help with the best user experience. That is really what I want to optimize for with the design. Lets discuss. PS: @Titus-von-Koeller had more details on the sub-interfaces and what will be deprecated and we will post soon. |
Thanks for your replies! :) @TimDettmers
Got it, thus we would follow these four sub-interfaces first.
Yes, when dispatching on the wrong device, singleton is to be more clear than tensor-driven. However, is it proper to take an assumption of what is wrong or correct devices from the BNB side? For example, for Huggingface users, choosing/setting a correct device for BNB runtime shall already be taken care of from the transformers library (
Any backend library (like PyTorch) that supports multi-platforms would face the same question. The key here is if it is necessary or do we have the ability to define the wrong or correct devices for users from the view of the library serving for multi-backends? This is the also main limitation of using singleton, we would not know what is the correct device that users intend to run and init the backend singleton when importing BNB and init. Is there any great approach to handle such limitation of init backend as the singleton? The PR #1107 implemets one static init, which seems not sufficient. Tensor-driven would be clear from the view of serving as quantization backends that support multi-platforms under the use cases of calling BNB from the transformers library by the info of
Got it, so for the scope of device-specific functions/class, we would go with a static one-time singleton.
Got it, thanks, will refactor the |
I agree with it. The BNB has the same level as PyTorch because they are both fundamental libraries that support the OP level. Here are my opinions:
In conclusion, following the PyTorch design is the safest choice for us. We can apply it without struggle. Would like to hear the opinion from all of you @TimDettmers @Titus-von-Koeller @rickardp @akx @abhilash1910 @jgong5 @matthewdouglas @arlo-phoenix @younesbelkada |
Hi @Titus-von-Koeller @TimDettmers Refer to the comment. I am glad we finally agreed and decided to apply tensor-driven dispatch. This PR would be a good start. Do you think we should merge this PR since it has been open and discussed for 4 months... Hope we can see the progress soon. Thanks to all of you : ) |
I need to go back and review the changes in the PR again, but in general I just want to be sure we're not making changes to the public API that could cause any problems. Some of this may be redundant here, but here's my notes on what I'd consider the main part of the "public" API to be. I'm focusing only on
Then there's these which are more internal/backend-specific. I think they are probably safe to move around, but maybe worth exporting from bitsandbytes.functional with a deprecation warning if that's going to be done.
Some that have appeared in public documentation:
Some that I'm not too sure about where they fit:
Not every backend has to necessarily implement everything in |
Hi @jiqing-feng, I'd like to clarify that the decision to go with tensor-driven dispatch is not final, and there's been arguments for and against it across the community. I've personally tried championing it with Tim, which you can see in his last response in this PR. Tim is the owner of bitsandbytes and will need to give the final ok, once we have concretized the approach sufficiently. I’m confident that we accomplish this and we still need to put in significant work before. To summarize, I do think it makes sense to go forward with a tensor-driven dispatch, and we should work together to make sure this refactor doesn’t break any existing functionality and that we're all happy with the outcome. To this end, I feel we need to collaborate on a series of follow up PRs. Let me explain: My main and absolute concern regarding BNB as its lead maintainer is stability above everything else, and I still have concerns on that end which need addressing before this can be fully finalized. Due to the complexity of the codebase and the knowledge it requires, and the fact that this introduces a fundamental design change, we need to take it slowly so I can fully understand the changes and their implications. I’m very happy with the work you’ve already done on this PR and therefore I will merge it. However, the merge will be to a distinct branch off of Please resolve any conflicts between this branch and We’ll merge changes to Until then, all backend related PRs should be addressed at the I'm still working out details of what I believe needs changing, which I will share with you shortly in this Github Discussion, where I would like us to centralize all general discussions around the refactoring there as much as possible. My feedback will also address further comments from @jianan-gu @matthewdouglas and you. I’m looking forward to our further collaboration with the community and making sure we deal with enabling multi_backend as soon as we can. To this end we would like to please ask Intel to allocate an engineer to work with us to set up a BNB Github runner with Intel hardware (cc @yao-matrix). We’re currently also setting up Nvidia GPU enabled runners and this will be crucial for the repo to have immediate feedback on code correctness across platforms. |
P.S. Be sure to install the pre-commit hooks as they will make sure formatting is applied as specified in the repo. This likely accounts for the majority of the merge conflicts, due to the recently merged PR #1081. |
Thanks! Merging into the multi-backend-refactor is okay; we can build from the source to enable it. I've already tested it on language models and found no breakdown or regression. BTW do you think we should release a pip package for multi-backend after the CPU ops are done? It would be better to have the preview package so users can use it, and we can find more issues before the official release. |
Hi, thanks for your comments, will cross check with the scope we listed as a follow up. Besides, QuantState is linked back for |
Hi @Titus-von-Koeller . Would you please help to check if there is anything we need to change before merging? Thx! |
2ffa367
into
bitsandbytes-foundation:multi-backend-refactor
Thanks for all your hard work on this @jianan-gu! Really excited for this, and looking forward to what it might unlock down the line =) |
Yes, I think we should release a nightly release both of
Yes, agreed, thank you very much ❤️ |
As one of the plans in the RFC #894. This PR intends to add a device abstraction that allows the supports of non-CUDA devices to be added into bitsandbytes (for 8bits/4bits functionality) more commonly and easily.
To create a device backend abstraction, this PR creates a common backend folder, which contains the key kernel interfaces class to implement by each backend, a backend registration interface to add new device backends, and a kernel dispatching mechanism.
The common backends will further be used in bitsandbytes.functional to implement the 8bits/4bits functionality.
Currently, the backend only contains CUDA as it is without this PR.