-
I am running dual NVIDIA 3060 GPUs, 24GB of VRAM in total, on an Ubuntu server in my dedicated AI setup, and I've found it to be quite effective. When running smaller models, or 8-bit or 4-bit quantized versions, I get between 10-15 tokens/s. It's also possible to run the full 16-bit Vicuna 13B model, although the generation rate drops to around 2 tokens/s and it consumes about 22GB of the 24GB of available VRAM. Nonetheless, it does run. In my case, I chose to prioritize running larger models at a slower pace rather than faster performance with smaller models. I'm quite satisfied with this decision and can attest to the effectiveness of LLaMA's dual-GPU support. This could be a viable option for you to consider instead of investing in a single large card.
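For anyone who wants to reproduce this kind of setup, here is a minimal sketch of loading a model sharded across both GPUs with Hugging Face transformers/accelerate; the model ID and generation settings are illustrative placeholders, not my exact configuration:

```python
# Sketch: sharding a 13B model across two 12GB GPUs with transformers/accelerate.
# The model ID and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"  # placeholder; substitute whatever model you run

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 weights are too large for a single 12GB card
    device_map="auto",          # let accelerate spread the layers across both GPUs
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```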
-
I have a system with 8x RTX 3090, 24GB each. I can't get models to run on more than one GPU. The UI shows all of them under "Transformers - gpu-memory for device X" but does nothing with them. If I load a model, it only goes into the first GPU.
-
Just a question of clarification:
-
Same problem here, plus I've found many related issues. Trying to load the model, I had to set a memory limit of 77GB on both GPUs to keep it from crashing with straight-up out-of-memory errors on the first inference attempt. It tries to allocate more while running the model, so if it fills the first GPU to the max at model load time, it then fails to allocate during inference. Could it be the cache allocation or something else? There's another related problem as well. So, multiple issues with the most recent version for sure.
-
I got it to load on a 3-GPU setup. See Issue #2543
-
Try specifying a memory size below the size of the actual model. If you do 24,24 it will load the whole model into the first GPU and stop. If you do 4,24 it will fill 4GB on the first GPU, then fill some of the second GPU, and then come back to fill more on the first if it needs to. I think many people assume they should max out the memory on both GPUs, and that's just not the way to get accelerate to distribute the model properly. Check with nvtop or nvidia-smi to see what actually happened and adjust from there.
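As far as I understand it, those gpu-memory values map onto accelerate's max_memory budgets, so the equivalent of "4,24" in plain transformers code would look roughly like this (model ID and limits are illustrative):

```python
# Sketch: capping per-GPU memory so accelerate is forced to spill layers
# onto the second GPU instead of packing everything onto device 0.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-13b-v1.5",                              # placeholder model ID
    device_map="auto",
    max_memory={0: "4GiB", 1: "24GiB", "cpu": "64GiB"},   # like "4,24" in the UI
)
print(model.hf_device_map)  # shows which layers landed on which device
```

Printing hf_device_map is the quickest way to confirm that layers actually landed on both devices before you start tuning the numbers.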
-
Hi there!
-
I'm having this same problem on Linux myself. When I load a model, it ignores the GPU RAM settings completely, attempts to load the model into device: 0, fills it up, and then crashes with CUDA out of memory without ever touching device: 1.
-
Help! Is there a tutorial or guide for running multiple GPUs? I'm having difficulty parsing people's various approaches in this thread. It would appear that just setting the gpu-split on the model page is not enough, at least for me: I don't see any increase in activity on my 2nd GPU. I'm running 2x RTX 4070 Ti Super (16GB each) and all activity seems to be occurring on GPU 0 (I'm using nvitop to monitor). Do I need to set flags in CMD_FLAGS.txt? Is there a list of possible CMD flags?
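For concreteness, my best guess at what CMD_FLAGS.txt would need is something like the line below, giving per-GPU VRAM in GB for the ExLlama-family loaders; I haven't verified the values, and if the Transformers loader is being used then --gpu-memory is apparently the flag to set instead:

```
--gpu-split 15,15
```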
-
I'm using LLaMA and want to use a bigger model. I have 11GB of VRAM and wondered whether layer splitting works well to split a model between 2 GPUs (see the sketch below for the kind of setup I mean).
I wonder if someone who has done this can share their tokens/s on a single GPU versus split across two, so I can understand the speed penalty (if any).
I'd like to avoid the expense of buying a 24GB card if I can just buy another 12GB card. Thanks.
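For reference, the kind of split I have in mind looks roughly like this with llama-cpp-python built with CUDA support; the model path, context size, and split ratios are placeholders only:

```python
# Sketch: splitting a GGUF model's layers across two GPUs with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-13b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,           # offload all layers to GPU
    tensor_split=[0.5, 0.5],   # proportion of the model placed on GPU 0 vs GPU 1
    n_ctx=4096,
)

out = llm("Q: How many GPUs am I using? A:", max_tokens=32)
print(out["choices"][0]["text"])
```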