Hello,
I've been working with llama.cpp on my IBM S822L (8247-22L), configured as follows:
OS: Debian GNU/Linux 12 (bookworm) ppc64le
Host: IBM,8247-22L
Kernel: 6.11.5+bpo-powerpc64le-64k
CPU: POWER8 (architected) (128) @ 4.157GHz
GPU: NVIDIA Tesla T4
GPU: NVIDIA Tesla T4
GPU: NVIDIA Tesla T4
GPU: NVIDIA Tesla T4
Memory: 6046MiB / 785354MiB
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03 Driver Version: 535.216.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000010:01:00.0 Off | 0 |
| N/A 34C P8 9W / 70W | 0MiB / 15360MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000018:01:00.0 Off | 0 |
| N/A 34C P8 10W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000021:01:00.0 Off | 0 |
| N/A 33C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000029:01:00.0 Off | 0 |
| N/A 32C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
I've tried a number of different optimizations with both GCC and Clang, but this system's performance is about 10% of that of my dual-Xeon desktop fitted with just one of these Tesla T4 cards. I've also tested with a single card here, and the performance is about the same as below.
./llama-bench -m /storage/models/Meta-Llama-3-8B-Instruct.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
Device 1: Tesla T4, compute capability 7.5, VMM: yes
Device 2: Tesla T4, compute capability 7.5, VMM: yes
Device 3: Tesla T4, compute capability 7.5, VMM: yes

| model         |     size | params | backend | ngl | test  |           t/s |
| ------------- | -------: | -----: | ------- | --: | ----- | ------------: |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA    |  99 | pp512 | 191.08 ± 0.39 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA    |  99 | tg128 |   6.94 ± 0.00 |
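As a rough back-of-envelope check (my own reasoning, not part of the bench output): token generation is usually memory-bandwidth-bound, so multiplying the tg128 rate by the model size approximates the sustained read bandwidth the run actually achieved:

```python
# Back-of-envelope: in memory-bound token generation, each token requires
# reading roughly the whole quantized model once from VRAM, so
#   effective read bandwidth ≈ tokens/s * model size
model_size_gib = 4.33   # Q4_0 model size reported by llama-bench
tg_tok_per_s = 6.94     # tg128 result from this post
effective_bw = tg_tok_per_s * model_size_gib
print(f"~{effective_bw:.0f} GiB/s effective read bandwidth")
```

That comes out around 30 GiB/s, well below the roughly 320 GB/s a T4's GDDR6 can theoretically sustain, which lines up with the low GPU load I describe further down.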
The NVIDIA bandwidth benchmark results look good to me:
/bandwidthTest --device=all
[CUDA Bandwidth Test] - Starting...
!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!
Running on...
Device 0: Tesla T4
Device 1: Tesla T4
Device 2: Tesla T4
Device 3: Tesla T4
Quick Mode
Host to Device Bandwidth, 4 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 47.8
Device to Host Bandwidth, 4 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 52.7
Device to Device Bandwidth, 4 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 548.2
Result = PASS
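One thing worth noting when reading these numbers (my own interpretation of the `--device=all` cumulative mode): bandwidthTest sums across all four devices, so the per-card figures are a quarter of what is printed:

```python
# bandwidthTest --device=all reports cumulative bandwidth across all devices,
# so divide by the device count to get the per-card figure
devices = 4
h2d_cumulative_gb_s = 47.8   # Host to Device, from the run above
d2h_cumulative_gb_s = 52.7   # Device to Host, from the run above
print(f"H2D per card: ~{h2d_cumulative_gb_s / devices:.1f} GB/s")
print(f"D2H per card: ~{d2h_cumulative_gb_s / devices:.1f} GB/s")
```

That's about 12 GB/s host-to-device per card, which is normal PCIe Gen3 x16 territory, so the host-device links don't look like the bottleneck to me.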
There seems to be very little load on the GPUs and system CPUs while queries are running. I've done some Nsight profiling and found no obvious issues, but I'm admittedly not a pro at debugging CUDA. Has anyone run into this? I know this is an exotic architecture, so I'm not sure how many folks have been down this road.
Any thoughts or suggestions are appreciated.
Thanks!