
Feature Request: Split model over multiple Vulkan GPUs #11004

Open
4 tasks done
wittypastoral opened this issue Dec 28, 2024 · 9 comments
Labels
enhancement New feature or request

Comments

@wittypastoral

wittypastoral commented Dec 28, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Related to #5259 (closed), if you want I could move this there.

How hard would it be to implement splitting over vulkan GPUs instead of CUDA/HIP?

I guess OpenCL could be another path if Vulkan is too hard, since there's now a maturing rusticl driver that can be layered on top of Vulkan as well as various native drivers, but it may not be mature enough yet to support llama.cpp (though maybe that's changing [1]). Also, AFAIK mapping memory between GPUs in a multi-GPU configuration is still under active development/implementation.

[1] https://archive.fosdem.org/2024/events/attachments/fosdem-2024-3364-why-not-run-opencl-accelerated-llm-on-your-phone-/slides/22383/Why_not_run_OpenCL-accelerated_LLM_on_your_phon_nK2DudB.pdf

Motivation

This would be really helpful, since it's no longer unreasonable to want to ditch NVIDIA's proprietary drivers for the open-source NVK Vulkan driver, and AMD's cards are also much better supported by Vulkan on the RADV driver than by AMD's spotty/nonexistent ROCm/HIP support. Vulkan is also more universally supported, so this could let someone split a model over, e.g., an AMD and an NVIDIA GPU if that's what they have.

Possible Implementation

N/A

@wittypastoral wittypastoral added the enhancement New feature or request label Dec 28, 2024
@0cc4m
Collaborator

0cc4m commented Dec 28, 2024

I implemented Vulkan multi-GPU support in #5321; it's been around for a while. It should work reasonably well, although there are probably optimizations left that could reduce device-to-device communication overhead.

@wittypastoral
Author

> I implemented Vulkan multi-GPU support in #5321; it's been around for a while. It should work reasonably well, although there are probably optimizations left that could reduce device-to-device communication overhead.

Oh damn, wow! Thanks!

@wittypastoral
Author

> I implemented Vulkan multi-GPU support in #5321; it's been around for a while. It should work reasonably well, although there are probably optimizations left that could reduce device-to-device communication overhead.

To be honest, I wouldn't be surprised if copying to RAM instead of directly from device to device is also doing a good job of working around various issues with Vulkan driver and memory-sharing implementations.

@cb88

cb88 commented Dec 29, 2024

> I implemented Vulkan multi-GPU support in #5321; it's been around for a while. It should work reasonably well, although there are probably optimizations left that could reduce device-to-device communication overhead.

> To be honest, I wouldn't be surprised if copying to RAM instead of directly from device to device is also doing a good job of working around various issues with Vulkan driver and memory-sharing implementations.

Yes, but it also seems to prevent some of the performance scaling you'd expect with multi-GPU: e.g. some models I load with Vulkan multi-GPU don't get faster prompt processing or text generation, though they do at least balance the memory load across the cards.

But yes, the one method we talked about yesterday would probably require GPUs on the same driver at a minimum, and potentially identical GPUs.

@wittypastoral
Author

> I implemented Vulkan multi-GPU support in #5321; it's been around for a while. It should work reasonably well, although there are probably optimizations left that could reduce device-to-device communication overhead.

> To be honest, I wouldn't be surprised if copying to RAM instead of directly from device to device is also doing a good job of working around various issues with Vulkan driver and memory-sharing implementations.

> Yes, but it also seems to prevent some of the performance scaling you'd expect with multi-GPU: e.g. some models I load with Vulkan multi-GPU don't get faster prompt processing or text generation, though they do at least balance the memory load across the cards.

Do you mean not being faster than running on the CPU, or on a single GPU?

@cb88

cb88 commented Dec 29, 2024

> Do you mean not being faster than running on the CPU, or on a single GPU?

Roughly the same speed as a single GPU for me on 2x MI60 with RADV; YMMV on other GPUs/drivers.

@wittypastoral
Author

wittypastoral commented Dec 29, 2024

> Do you mean not being faster than running on the CPU, or on a single GPU?

> Roughly the same speed as a single GPU for me on 2x MI60 with RADV; YMMV on other GPUs/drivers.

Is there a serial dependency between the contents of each GPU in the current configuration? I.e. do layers 1, 2, and 3 need to complete on GPU 1 before the result is handed on to layers 4, 5, and 6 on GPU 2, or is the work instead split more down the middle?

@0cc4m
Collaborator

0cc4m commented Dec 29, 2024

It's a regular layer-split implementation: you split the model at some point and put half the layers on the first GPU and half on the second. Once the first GPU is done with its part, the intermediate result gets copied to the second GPU, which continues from there. It's the same way it works on CUDA and ROCm by default.
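
For illustration, here is a minimal, self-contained C++ sketch of that layer-split pattern. It is not llama.cpp's actual backend code; the run_layer and copy_to_device helpers are invented stand-ins and the "devices" are just integers, but it shows the shape of the computation: the two GPUs take turns, with a single intermediate-activation transfer at the split boundary.

```cpp
// Illustrative sketch only (not llama.cpp's code): a "layer split" runs the
// model's layers as a pipeline across two devices, with one copy of the
// intermediate activations at the split point.
#include <cstdio>
#include <vector>

using Activations = std::vector<float>;

// Stand-in for running one transformer layer on a given device.
// Here it just adds the layer index so the example is runnable.
static Activations run_layer(int device, int layer, Activations h) {
    for (float &x : h) x += static_cast<float>(layer);
    std::printf("layer %2d on device %d\n", layer, device);
    return h;
}

// Stand-in for the device-to-device (or via host RAM) transfer.
static Activations copy_to_device(int dst_device, Activations h) {
    std::printf("copy intermediate result to device %d\n", dst_device);
    return h;
}

// Layers [0, split_at) run on device 0, layers [split_at, n_layers) on
// device 1. The devices never work on the same token at the same time,
// which is why generation tends to run at roughly single-GPU speed.
static Activations forward_layer_split(Activations h, int n_layers, int split_at) {
    for (int l = 0; l < split_at; ++l)        h = run_layer(0, l, h);
    h = copy_to_device(1, h);
    for (int l = split_at; l < n_layers; ++l) h = run_layer(1, l, h);
    return h;
}

int main() {
    Activations input(8, 0.0f);
    forward_layer_split(input, /*n_layers=*/8, /*split_at=*/4);
    return 0;
}
```

The relevant property for the discussion in this thread is that only a relatively small activation tensor crosses the device boundary per token, while the GPUs work sequentially rather than in parallel.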

@wittypastoral
Author

Ah, I guess running at the same speed is to be expected then, since there isn't any parallelization between the GPUs.
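
As a rough back-of-the-envelope illustration: if each of the two GPUs needs, say, 10 ms per token for its half of the layers, the second GPU can only start once the first has finished and handed over the intermediate result, so a token still takes about 20 ms end to end (plus the transfer), which is what a single GPU holding all the layers would need at the same per-layer speed. The win from this kind of split is capacity (the model's memory is spread across the cards), not single-stream speed.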
