Feature Request: Split model over multiple Vulkan GPUs #11004
Comments
I implemented Vulkan multi-GPU in #5321; it's been around for a while. It should work reasonably well, although there are probably optimizations left that could reduce device-to-device communication overhead.
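For reference, the split is driven by the same common CLI options as the other backends; a typical invocation might look roughly like this (flag names taken from llama.cpp's shared options; the binary name and exact syntax may vary by version):

```
# Offload all layers and split them across two Vulkan GPUs, ~60/40:
./llama-cli -m model.gguf -ngl 99 --split-mode layer --tensor-split 60,40
```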
Oh damn, wow! Thanks!
To be honest, I wouldn't be surprised if copying through RAM instead of directly from device to device is also doing a good job of working around various issues with Vulkan drivers and their memory-sharing implementations.
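As a toy illustration of the trade-off being discussed (this is not llama.cpp code; the "devices" here are plain byte vectors standing in for GPU allocations, since the point is the extra hop through RAM, not the Vulkan API itself):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

using Buffer = std::vector<uint8_t>; // stand-in for a GPU allocation

// Staged path: device 0 -> host staging buffer -> device 1.
// Works on any driver combination, but pays for two transfers.
void copy_via_host(const Buffer& gpu0, Buffer& gpu1) {
    std::vector<uint8_t> staging(gpu0.size());                // host RAM
    std::memcpy(staging.data(), gpu0.data(), gpu0.size());    // GPU0 -> RAM
    gpu1.resize(staging.size());
    std::memcpy(gpu1.data(), staging.data(), staging.size()); // RAM -> GPU1
}

// Direct path: a single transfer, but on real hardware this requires the
// drivers to support sharing memory between devices (e.g. Vulkan's
// external-memory extensions), which is exactly where implementations differ.
void copy_direct(const Buffer& gpu0, Buffer& gpu1) {
    gpu1.resize(gpu0.size());
    std::memcpy(gpu1.data(), gpu0.data(), gpu0.size());       // GPU0 -> GPU1
}
```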
Yes, but it also seems to prevent some of the performance scaling you'd expect with multi-GPU: some models I load across Vulkan GPUs aren't faster for prompt processing or text generation than a single GPU, as you might expect, but they do at least spread the memory load across devices. And yes, the method we talked about yesterday would probably require GPUs on the same driver at a minimum, and potentially identical GPUs.
Do you mean not being faster than running on the CPU, or on a single GPU?
Roughly the same speed as one GPU for me on 2x MI60 with RADV; YMMV on other GPUs/drivers.
Is there a serial dependency between the contents of each GPU in the current configuration? I.e., needing to complete layers 1, 2, and 3 on GPU 1 before handing the result on to layers 4, 5, and 6 on GPU 2, or is it instead split more down the middle?
It's a regular layer-split implementation: you split the model at some point, putting half the layers on the first GPU and half on the second. Once the first GPU is done with its part, the intermediate result gets copied to the second GPU, which continues from there. It's the same way it works on CUDA and ROCm by default.
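Schematically, the schedule looks like the following sketch (not the actual ggml scheduler; the GPU calls are stubs standing in for dispatching a device's compute graph):

```cpp
#include <cstdio>

struct Tensor { float value; }; // stand-in for an activation tensor

Tensor run_layer_on_gpu(int gpu, int layer, Tensor x) {
    std::printf("layer %2d on GPU %d\n", layer, gpu); // trace the schedule
    return x; // stub: would run this layer's kernels on that device
}

Tensor copy_between_gpus(int src, int dst, Tensor x) {
    std::printf("copy GPU %d -> GPU %d\n", src, dst); // the one transfer
    return x; // stub: would move the intermediate activations across
}

// Layers [0, split) run on GPU 0, layers [split, n_layers) on GPU 1, with
// exactly one intermediate-activation copy at the boundary.
Tensor forward_layer_split(Tensor x, int n_layers, int split) {
    for (int layer = 0; layer < n_layers; ++layer) {
        if (layer == split) x = copy_between_gpus(0, 1, x); // boundary
        x = run_layer_on_gpu(layer < split ? 0 : 1, layer, x);
    }
    return x; // the GPUs ran one after the other, never concurrently
}

int main() { forward_layer_split({0.0f}, 6, 3); }
```

Because each token's forward pass walks the layers in order, the second GPU sits idle while the first one works (and vice versa), which is why a single generation stream runs at roughly single-GPU speed.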
Ah, I guess running at the same speed is to be expected then, since there isn't any parallelization between the GPUs.
Feature Description
Related to #5259 (closed); if you want, I could move this there.
How hard would it be to implement splitting over Vulkan GPUs instead of CUDA/HIP?
I guess OpenCL could be another path if Vulkan is too hard, since there's now a maturing Rusticl driver that can be layered on top of Vulkan, as well as various native drivers, but it may not be fully baked enough yet to support llama.cpp (though maybe that's changing [1]). Also, AFAIK, mapping memory between GPUs in a multi-GPU config is still under active development/implementation.
[1] https://archive.fosdem.org/2024/events/attachments/fosdem-2024-3364-why-not-run-opencl-accelerated-llm-on-your-phone-/slides/22383/Why_not_run_OpenCL-accelerated_LLM_on_your_phon_nK2DudB.pdf
Motivation
This would be really helpful, as it's now not unreasonable to want to ditch NVIDIA's proprietary drivers for the open-source NVK Vulkan driver, and AMD's cards are also MUCH better supported with Vulkan on the RADV driver than with AMD's spotty/nonexistent ROCm/HIP support. Vulkan is also more universally supported, so this could enable someone to split a model over, e.g., an AMD and an NVIDIA GPU if that's what they have.
Possible Implementation
N/A