RocM container fails on certain AMD systems. #497

Open
smooge opened this issue Dec 2, 2024 · 2 comments

Comments

@smooge
Collaborator

smooge commented Dec 2, 2024

While debugging https://bugzilla.redhat.com/show_bug.cgi?id=2329826 I found that the ROCm container does not work with at least two AMD chipsets:

[ssmoogen@xenadu ~]$ ramalama --debug run granite "What is Fedora?"
exec_cmd:  podman run --rm -i --label RAMALAMA --security-opt=label=disable --name ramalama_wTQC4rmySL -t --device /dev/dri --device /dev/kfd -e HIP_VISIBLE_DEVICES=0 --mount=type=bind,src=/home/ssmoogen/.local/share/ramalama/models/huggingface/ibm-granite/granite-8b-code-instruct-GGUF/granite-8b-code-instruct.Q4_K_M.gguf,destination=/mnt/models/model.file,rw=false quay.io/ramalama/rocm:latest /bin/sh -c llama-cli -m /mnt/models/model.file --in-prefix '' --in-suffix '' -p 'What is Fedora?' -c 2048

rocBLAS error: Cannot read /opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx803
 List of available TensileLibrary Files :
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1010.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1012.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat"

This fails on the following GPUs:

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 2100]
07:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev d3)
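
The mismatch in the log above can be confirmed mechanically. The following is a minimal sketch, not part of ramalama: the helper name parse_rocblas_error is hypothetical, and it assumes the error text keeps the wording shown in this report ("for GPU arch : ..." and the TensileLibrary_lazy_<arch>.dat file names).

```python
import re

# Excerpt of the rocBLAS output quoted in this issue.
LOG = '''rocBLAS error: Cannot read /opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx803
 List of available TensileLibrary Files :
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1010.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1012.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/opt/rocm-6.2.2/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat"
'''


def parse_rocblas_error(log):
    """Hypothetical helper: return (detected_arch, available_archs) from a rocBLAS error log."""
    # The arch rocBLAS detected on the host, e.g. "gfx803".
    match = re.search(r"for GPU arch\s*:\s*(gfx\w+)", log)
    detected = match.group(1) if match else None
    # The archs for which the image ships a Tensile library file.
    available = re.findall(r"TensileLibrary_lazy_(gfx\w+)\.dat", log)
    return detected, available


detected, available = parse_rocblas_error(LOG)
if detected is not None and detected not in available:
    print(f"{detected} is not among the shipped archs: {', '.join(available)}")
```

For the log in this report, the detected arch is gfx803, which is not in the list of shipped Tensile libraries.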

I could not figure out a way to force it to use only the CPU, so possibly add a --cpu flag that tells it not to attempt GPU acceleration?
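
One way to picture a CPU-only mode is to leave the GPU-related options out of the podman command. This is a rough sketch under stated assumptions, not ramalama's actual code: build_podman_args and its cpu_only parameter are hypothetical, and only the podman options that appear in the debug output above are used. Whether llama-cli inside the ROCm image then falls back to the CPU has not been verified here; a separate CPU-only image might be needed.

```python
def build_podman_args(image, cpu_only=False):
    """Hypothetical helper: assemble a podman command line, with or without GPU access."""
    args = ["podman", "run", "--rm", "-i", "--security-opt=label=disable"]
    if not cpu_only:
        # GPU passthrough options copied from the debug output in this issue.
        args += ["--device", "/dev/dri", "--device", "/dev/kfd"]
        args += ["-e", "HIP_VISIBLE_DEVICES=0"]
    args.append(image)
    return args


# The image name is a parameter; which image suits CPU-only use is an open question.
print(" ".join(build_podman_args("quay.io/ramalama/rocm:latest", cpu_only=True)))
```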

@ericcurtin
Collaborator

This is gonna be a constant issue... A --cpu flag seems fine to me...

But it won't fix the issue of the GPU not working, of course. We are not gonna work on every single GPU in the world, but if someone opens a PR to support this one, great!

I did deliberately remove support for a lot of older GPUs in the AMD Containerfile to save about 20G in container image size, but if people enable the extra ones they would like, one by one, no big deal. The problem is that if you enable every little one, you get a huge image. Also, some GPUs will just prove to be headaches and a lot of effort.

@smooge
Collaborator Author

smooge commented Dec 3, 2024

The two ways I could see this being fixed are a --cpu flag, OR adding the ROCm items only when --gpu is given on the command line, which would match the man pages and other documentation.

If we go with --cpu, I would say that --cpu and --gpu should conflict as command line options. The documentation would then be fixed to say that --gpu applies only when the system is not running a container, and that --cpu overrides that and gives only local CPU performance.

Writing the above made me think that adding --cpu would make things more complicated than having --gpu checked with the container.
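
For reference, the "--cpu and --gpu conflict" idea could be expressed with argparse's mutually exclusive groups. This is only a sketch under assumptions: the parser below is a stand-alone toy (prog name ramalama-sketch is made up), not ramalama's real argument handling, and the help strings are illustrative.

```python
import argparse


def make_parser():
    """Toy parser showing --cpu and --gpu as conflicting options (hypothetical)."""
    parser = argparse.ArgumentParser(prog="ramalama-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--cpu", action="store_true",
                       help="do not attempt GPU acceleration")
    group.add_argument("--gpu", action="store_true",
                       help="request GPU acceleration")
    return parser


args = make_parser().parse_args(["--cpu"])
print(args.cpu, args.gpu)
```

Passing both flags makes argparse exit with a usage error, which gives the conflict behaviour without extra checks. The second option from the comment above (only adding the ROCm pieces when --gpu is given) would need just the --gpu flag and no group at all.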
