RocM container fails on certain AMD systems. #497

smooge · 2024-12-02T19:04:47Z

While debugging https://bugzilla.redhat.com/show_bug.cgi?id=2329826, I found that the ROCm containers do not work with at least two AMD GPUs:

[ssmoogen@xenadu ~]$ ramalama --debug run granite "What is Fedora?"
exec_cmd:  podman run --rm -i --label RAMALAMA --security-opt=label=disable --name ramalama_wTQC4rmySL -t --device /dev/dri --device /dev/kfd -e HIP_VISIBLE_DEVICES=0 --mount=type=bind,src=/home/ssmoogen/.local/share/ramalama/models/huggingface/ibm-granite/granite-8b-code-instruct-GGUF/granite-8b-code-instruct.Q4_K_M.gguf,destination=/mnt/models/model.file,rw=false quay.io/ramalama/rocm:latest /bin/sh -c llama-cli -m /mnt/models/model.file --in-prefix '' --in-suffix '' -p 'What is Fedora?' -c 2048

rocBLAS error: Cannot read /opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx803
 List of available TensileLibrary Files :
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1010.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1012.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat"

This fails on the following GPUs:

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 2100]
07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev d3)

I could not figure out a way to force it to use just the CPU, so possibly add a --cpu flag that tells it not to attempt GPU acceleration?
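To illustrate the mismatch in the error output above: the image ships rocBLAS kernel libraries for six gfx architectures only, and gfx803 is not among them. A minimal sketch of how a launcher could decide to fall back to the CPU is shown here. The function names `supported_arches` and `pick_backend` are hypothetical and are not part of ramalama; the file list is copied from the error message, and how the GPU architecture string is detected is left out.

```python
import re

# File names copied from the rocBLAS error output in this report.
AVAILABLE_FILES = [
    "/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1010.dat",
    "/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1012.dat",
    "/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat",
    "/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat",
    "/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat",
    "/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat",
]


def supported_arches(filenames):
    """Extract the gfx architecture names from TensileLibrary file names."""
    pattern = re.compile(r"TensileLibrary_lazy_(gfx[0-9a-f]+)\.dat$")
    arches = set()
    for name in filenames:
        match = pattern.search(name)
        if match:
            arches.add(match.group(1))
    return arches


def pick_backend(gpu_arch, filenames):
    """Hypothetical helper: return "rocm" if kernels exist for gpu_arch, else "cpu"."""
    return "rocm" if gpu_arch in supported_arches(filenames) else "cpu"


print(pick_backend("gfx803", AVAILABLE_FILES))   # cpu
print(pick_backend("gfx1030", AVAILABLE_FILES))  # rocm
```

With a check like this, an unsupported GPU would lead to a CPU run instead of the rocBLAS abort, though it would not make the GPU itself usable.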


ericcurtin · 2024-12-02T23:10:02Z

This is gonna be a constant issue... A --cpu flag seems fine to me...

But of course it won't fix the issue of the GPU not working. We are not gonna work on every single GPU in the world, but if someone opens a PR to support this one, great!

I did deliberately remove support for a lot of older GPUs in the AMD Containerfile to save about 20G in container image size, but if people enable the extra ones they would like one by one, no big deal. The problem is that if you enable every little one, you get a huge image. Also, some GPUs will just prove to be headaches and a lot of effort.

smooge · 2024-12-03T13:09:16Z

The two ways I could see this being fixed are a --cpu flag OR adding the ROCm items only when --gpu is given on the command line, which would match the man pages and other documentation.

If we go with --cpu, I would say that --cpu and --gpu conflict as command line options. The documentation would then be fixed to say that --gpu is only for when the system is not running a container, and that --cpu will override that and give only local CPU performance.

Writing the above made me think that adding --cpu would make things more complicated than having --gpu checked with the container.
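For illustration, the "conflicting options" idea can be sketched with a mutually exclusive group from Python's standard argparse module. This is not ramalama's actual parser; the program name, the help strings, and the `wants_gpu_devices` helper are assumptions made for this example. It combines both proposals: GPU devices are requested only when --gpu is given, and passing --cpu together with --gpu is rejected.

```python
import argparse


def build_parser():
    """Build a hypothetical parser where --cpu and --gpu cannot be combined."""
    parser = argparse.ArgumentParser(prog="ramalama-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--cpu", action="store_true",
                       help="run on the CPU only (hypothetical flag)")
    group.add_argument("--gpu", action="store_true",
                       help="request GPU acceleration")
    return parser


def wants_gpu_devices(argv):
    """Return True only when --gpu was requested explicitly."""
    args = build_parser().parse_args(argv)
    return args.gpu


print(wants_gpu_devices(["--gpu"]))  # True
print(wants_gpu_devices(["--cpu"]))  # False
print(wants_gpu_devices([]))         # False
```

Passing both flags makes argparse print a usage error and exit, so the conflict is enforced in one place instead of being checked by hand.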
