-
I am running dual NVIDIA 3060 GPUs, 24GB of VRAM in total, on an Ubuntu server in my dedicated AI setup, and I've found it to be quite effective. When running smaller models, or 8-bit or 4-bit quantized versions, I get between 10-15 tokens/s. It's also possible to run the full 16-bit Vicuna 13B model, although the generation rate drops to around 2 tokens/s and it consumes about 22GB of the 24GB of available VRAM. Nonetheless, it does run. In my case, I chose to prioritize running larger models at a slower pace rather than faster performance with smaller models. I'm quite satisfied with this decision and can attest to the effectiveness of LLaMA's dual-GPU support. This could be a viable option for you to consider instead of investing in a single large card.
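For anyone who wants to reproduce this kind of setup, here is a minimal sketch of loading a model sharded across both GPUs with Hugging Face transformers/accelerate; the model ID and generation settings are illustrative placeholders, not my exact configuration:

```python
# Sketch: sharding a 13B model across two 12GB GPUs with transformers/accelerate.
# The model ID and prompt are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"  # placeholder; substitute whatever model you run

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 weights are too large for a single 12GB card
    device_map="auto",          # let accelerate spread the layers across both GPUs
)

inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```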
-
I have a system with 8x RTX 3090, 24GB each. I can't get models to run on more than one GPU. The UI shows all of them under "Transformers - gpu-memory for device X" but does nothing with them. If I load a model, it only goes into the first GPU.
-
Just a question of clarification:
-
Same problem here, plus I've found many related issues. Trying to load the model, I had to set a memory limit of 77GB on both GPUs to keep it from crashing with straight-up out-of-memory errors on the first inference attempt. It tries to allocate more while running the model, so if it fills the first GPU to the max at model load time, it then fails to allocate during inference. Could it be the cache allocation or something else? There's another related problem as well. So, multiple issues with the most recent version for sure.
-
I got it to load on a 3-GPU setup. See Issue #2543
-
Try specifying a memory size below the size of the actual model. If you do 24,24 it will load the whole model into the first GPU and stop. If you do 4,24 it will fill 4GB on the first GPU, then fill some of the second GPU, and then come back to fill more on the first if it needs to. I think many people assume they should max out the memory on both GPUs, and that's just not the way to get accelerate to distribute the model properly. Check with nvtop or nvidia-smi to see what actually happened and adjust from there.
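As far as I understand it, those gpu-memory values map onto accelerate's max_memory budgets, so the equivalent of "4,24" in plain transformers code would look roughly like this (model ID and limits are illustrative):

```python
# Sketch: capping per-GPU memory so accelerate is forced to spill layers
# onto the second GPU instead of packing everything onto device 0.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "lmsys/vicuna-13b-v1.5",                              # placeholder model ID
    device_map="auto",
    max_memory={0: "4GiB", 1: "24GiB", "cpu": "64GiB"},   # like "4,24" in the UI
)
print(model.hf_device_map)  # shows which layers landed on which device
```

Printing hf_device_map is the quickest way to confirm that layers actually landed on both devices before you start tuning the numbers.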
-
Hi there!
-
I'm having this same problem on Linux myself. When I load a model, it ignores the GPU RAM settings completely, attempts to load the model into device: 0, fills it up, and then crashes with CUDA out of memory without ever touching device: 1.
-
Help! Is there a tutorial or guide for running multiple GPUs? I'm having difficulty parsing people's various approaches in this thread. It would appear that just setting the gpu-split on the model page is not enough, at least for me: I don't see any increase in activity on my 2nd GPU. I'm running 2x RTX 4070 Ti Super (16GB each) and all activity seems to be occurring on GPU 0 (I'm using nvitop to monitor). Do I need to set flags in CMD_FLAGS.txt? Is there a list of possible CMD flags?
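For concreteness, my best guess at what CMD_FLAGS.txt would need is something like the line below, giving per-GPU VRAM in GB for the ExLlama-family loaders; I haven't verified the values, and if the Transformers loader is being used then --gpu-memory is apparently the flag to set instead:

```
--gpu-split 15,15
```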
-
I'm using LLaMA and want to use a bigger model. I have 11GB of VRAM and wondered whether layer splitting works well to split a model between 2 GPUs (see the sketch below for the kind of setup I mean).
I wonder if someone who has done this can share their tokens/s on a single GPU versus split across two, so I can understand the speed penalty (if any).
I'd like to avoid the expense of buying a 24GB card if I can just buy another 12GB card. Thanks.
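For reference, the kind of split I have in mind looks roughly like this with llama-cpp-python built with CUDA support; the model path, context size, and split ratios are placeholders only:

```python
# Sketch: splitting a GGUF model's layers across two GPUs with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/model-13b.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,           # offload all layers to GPU
    tensor_split=[0.5, 0.5],   # proportion of the model placed on GPU 0 vs GPU 1
    n_ctx=4096,
)

out = llm("Q: How many GPUs am I using? A:", max_tokens=32)
print(out["choices"][0]["text"])
```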