Hello,
I've been working with llama.cpp on my IBM S822L (8247-22L), configured as follows:
OS: Debian GNU/Linux 12 (bookworm) ppc64le
Host: IBM,8247-22L
Kernel: 6.11.5+bpo-powerpc64le-64k
CPU: POWER8 (architected) (128) @ 4.157GHz
GPU: NVIDIA Tesla T4
GPU: NVIDIA Tesla T4
GPU: NVIDIA Tesla T4
GPU: NVIDIA Tesla T4
Memory: 6046MiB / 785354MiB
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03 Driver Version: 535.216.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 On | 00000010:01:00.0 Off | 0 |
| N/A 34C P8 9W / 70W | 0MiB / 15360MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla T4 On | 00000018:01:00.0 Off | 0 |
| N/A 34C P8 10W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 Tesla T4 On | 00000021:01:00.0 Off | 0 |
| N/A 33C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 Tesla T4 On | 00000029:01:00.0 Off | 0 |
| N/A 32C P8 9W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
I've tried a number of different optimizations with both GCC and Clang, but this system's performance is about 10% of that of my dual-Xeon desktop fitted with just one of these Tesla T4 cards. I've also tested with a single card here, and the performance is about the same as below.
./llama-bench -m /storage/models/Meta-Llama-3-8B-Instruct.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
Device 1: Tesla T4, compute capability 7.5, VMM: yes
Device 2: Tesla T4, compute capability 7.5, VMM: yes
Device 3: Tesla T4, compute capability 7.5, VMM: yes

| model         |     size | params | backend | ngl | test  |           t/s |
| ------------- | -------: | -----: | ------- | --: | ----- | ------------: |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA    |  99 | pp512 | 191.08 ± 0.39 |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | CUDA    |  99 | tg128 |   6.94 ± 0.00 |
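As a rough back-of-envelope check (my own reasoning, not part of the bench output): token generation is usually memory-bandwidth-bound, so multiplying the tg128 rate by the model size approximates the sustained read bandwidth the run actually achieved:

```python
# Back-of-envelope: in memory-bound token generation, each token requires
# reading roughly the whole quantized model once from VRAM, so
#   effective read bandwidth ≈ tokens/s * model size
model_size_gib = 4.33   # Q4_0 model size reported by llama-bench
tg_tok_per_s = 6.94     # tg128 result from this post
effective_bw = tg_tok_per_s * model_size_gib
print(f"~{effective_bw:.0f} GiB/s effective read bandwidth")
```

That comes out around 30 GiB/s, well below the roughly 320 GB/s a T4's GDDR6 can theoretically sustain, which lines up with the low GPU load I describe further down.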
The NVIDIA bandwidth benchmark results look good to me:
/bandwidthTest --device=all
[CUDA Bandwidth Test] - Starting...
!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!
Running on...
Device 0: Tesla T4
Device 1: Tesla T4
Device 2: Tesla T4
Device 3: Tesla T4
Quick Mode
Host to Device Bandwidth, 4 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 47.8
Device to Host Bandwidth, 4 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 52.7
Device to Device Bandwidth, 4 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 548.2
Result = PASS
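One thing worth noting when reading these numbers (my own interpretation of the `--device=all` cumulative mode): bandwidthTest sums across all four devices, so the per-card figures are a quarter of what is printed:

```python
# bandwidthTest --device=all reports cumulative bandwidth across all devices,
# so divide by the device count to get the per-card figure
devices = 4
h2d_cumulative_gb_s = 47.8   # Host to Device, from the run above
d2h_cumulative_gb_s = 52.7   # Device to Host, from the run above
print(f"H2D per card: ~{h2d_cumulative_gb_s / devices:.1f} GB/s")
print(f"D2H per card: ~{d2h_cumulative_gb_s / devices:.1f} GB/s")
```

That's about 12 GB/s host-to-device per card, which is normal PCIe Gen3 x16 territory, so the host-device links don't look like the bottleneck to me.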
There seems to be very little load on the GPUs and system CPUs while queries are running. I've done some Nsight profiling and found no obvious issues, but I'm admittedly not a pro at debugging CUDA. Has anyone run into this? I know this is an exotic architecture, so I'm not sure how many folks have been down this road.
Any thoughts or suggestions are appreciated.
Thanks!