
Feature Request: Split model over multiple Vulkan GPUs #11004

Open
4 tasks done
wittypastoral opened this issue Dec 28, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@wittypastoral

wittypastoral commented Dec 28, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Related to #5259 (closed), if you want I could move this there.

How hard would it be to implement splitting over vulkan GPUs instead of CUDA/HIP?

I guess OpenCL could be another path if Vulkan is too hard, since there's now a maturing rusticl driver that can be layered on top of Vulkan as well as various native drivers, but it may not be mature enough yet to support llama.cpp (though maybe that's changing [1]). Also, AFAIK mapping memory between GPUs in a multi-GPU configuration is still under active development/implementation.

[1] https://archive.fosdem.org/2024/events/attachments/fosdem-2024-3364-why-not-run-opencl-accelerated-llm-on-your-phone-/slides/22383/Why_not_run_OpenCL-accelerated_LLM_on_your_phon_nK2DudB.pdf

Motivation

This would be really helpful, since it's no longer unreasonable to want to ditch NVIDIA's proprietary drivers for the open-source NVK Vulkan driver, and AMD's cards are also much better supported by Vulkan on the RADV driver than by AMD's spotty/nonexistent ROCm/HIP support. Vulkan is also more universally supported, so this could let someone split a model over, e.g., an AMD and an NVIDIA GPU if that's what they have.

Possible Implementation

N/A

@wittypastoral wittypastoral added the enhancement New feature or request label Dec 28, 2024
@0cc4m
Collaborator

0cc4m commented Dec 28, 2024

I implemented Vulkan multi-GPU support in #5321; it's been around for a while. It should work reasonably well, although there are probably optimizations left that could reduce device-to-device communication overhead.

@wittypastoral
Author

> I implemented Vulkan multi-GPU support in #5321; it's been around for a while. It should work reasonably well, although there are probably optimizations left that could reduce device-to-device communication overhead.

Oh damn, wow! Thanks!

@wittypastoral
Author

> I implemented Vulkan multi-GPU support in #5321; it's been around for a while. It should work reasonably well, although there are probably optimizations left that could reduce device-to-device communication overhead.

To be honest, I wouldn't be surprised if copying to RAM instead of directly from device to device is also doing a good job of working around various issues with Vulkan driver and memory-sharing implementations.

@cb88

cb88 commented Dec 29, 2024

> I implemented Vulkan multi-GPU support in #5321; it's been around for a while. It should work reasonably well, although there are probably optimizations left that could reduce device-to-device communication overhead.

> To be honest, I wouldn't be surprised if copying to RAM instead of directly from device to device is also doing a good job of working around various issues with Vulkan driver and memory-sharing implementations.

Yes, but it also seems to prevent some of the performance scaling you'd expect with multi-GPU: e.g. some models I load with Vulkan multi-GPU don't get faster prompt processing or text generation, though they do at least balance the memory load across the cards.

But yes, the one method we talked about yesterday would probably require GPUs on the same driver at a minimum, and potentially identical GPUs.

@wittypastoral
Author

> I implemented Vulkan multi-GPU support in #5321; it's been around for a while. It should work reasonably well, although there are probably optimizations left that could reduce device-to-device communication overhead.

> To be honest, I wouldn't be surprised if copying to RAM instead of directly from device to device is also doing a good job of working around various issues with Vulkan driver and memory-sharing implementations.

> Yes, but it also seems to prevent some of the performance scaling you'd expect with multi-GPU: e.g. some models I load with Vulkan multi-GPU don't get faster prompt processing or text generation, though they do at least balance the memory load across the cards.

Do you mean not being faster than running on the CPU, or on a single GPU?

@cb88

cb88 commented Dec 29, 2024

> Do you mean not being faster than running on the CPU, or on a single GPU?

Roughly the same speed as a single GPU for me on 2x MI60 with RADV; YMMV on other GPUs/drivers.

@wittypastoral
Author

wittypastoral commented Dec 29, 2024

> Do you mean not being faster than running on the CPU, or on a single GPU?

> Roughly the same speed as a single GPU for me on 2x MI60 with RADV; YMMV on other GPUs/drivers.

Is there a serial dependency between the contents of each GPU in the current configuration? I.e. do layers 1, 2, and 3 need to complete on GPU 1 before the result is handed on to layers 4, 5, and 6 on GPU 2, or is the work instead split more down the middle?

@0cc4m
Collaborator

0cc4m commented Dec 29, 2024

It's a regular layer-split implementation: you split the model at some point and put half the layers on the first GPU and half on the second. Once the first GPU is done with its part, the intermediate result gets copied to the second GPU, which continues from there. It's the same way it works on CUDA and ROCm by default.
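
For illustration, here is a minimal, self-contained C++ sketch of that layer-split pattern. It is not llama.cpp's actual backend code; the run_layer and copy_to_device helpers are invented stand-ins and the "devices" are just integers, but it shows the shape of the computation: the two GPUs take turns, with a single intermediate-activation transfer at the split boundary.

```cpp
// Illustrative sketch only (not llama.cpp's code): a "layer split" runs the
// model's layers as a pipeline across two devices, with one copy of the
// intermediate activations at the split point.
#include <cstdio>
#include <vector>

using Activations = std::vector<float>;

// Stand-in for running one transformer layer on a given device.
// Here it just adds the layer index so the example is runnable.
static Activations run_layer(int device, int layer, Activations h) {
    for (float &x : h) x += static_cast<float>(layer);
    std::printf("layer %2d on device %d\n", layer, device);
    return h;
}

// Stand-in for the device-to-device (or via host RAM) transfer.
static Activations copy_to_device(int dst_device, Activations h) {
    std::printf("copy intermediate result to device %d\n", dst_device);
    return h;
}

// Layers [0, split_at) run on device 0, layers [split_at, n_layers) on
// device 1. The devices never work on the same token at the same time,
// which is why generation tends to run at roughly single-GPU speed.
static Activations forward_layer_split(Activations h, int n_layers, int split_at) {
    for (int l = 0; l < split_at; ++l)        h = run_layer(0, l, h);
    h = copy_to_device(1, h);
    for (int l = split_at; l < n_layers; ++l) h = run_layer(1, l, h);
    return h;
}

int main() {
    Activations input(8, 0.0f);
    forward_layer_split(input, /*n_layers=*/8, /*split_at=*/4);
    return 0;
}
```

The relevant property for the discussion in this thread is that only a relatively small activation tensor crosses the device boundary per token, while the GPUs work sequentially rather than in parallel.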

@wittypastoral
Author

Ah, I guess running at the same speed is to be expected then, since there isn't any parallelization between the GPUs.
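
As a rough back-of-the-envelope illustration: if each of the two GPUs needs, say, 10 ms per token for its half of the layers, the second GPU can only start once the first has finished and handed over the intermediate result, so a token still takes about 20 ms end to end (plus the transfer), which is what a single GPU holding all the layers would need at the same per-layer speed. The win from this kind of split is capacity (the model's memory is spread across the cards), not single-stream speed.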
