feat: Support Moore Threads GPU #8383
Conversation
I am one of the primary llama.cpp CUDA developers. I would in principle be willing to buy a Moore Threads GPU and to test any code changes I do in order to assert that they don't break MUSA. On the Moore Threads website I only see a "Buy Now" button for the MTT S80. Would testing and performance optimization on that GPU be representative of an MTT S4000?
Thank you for checking out this PR! Yes, the current code changes were tested on the MTT S4000.
I think it would be better if the MUSA changes were completely separate from the CUDA logic in the Makefile/CMakeLists.txt. But I don't feel particularly strongly about this.
I should refine this. But my initial thought is to mostly reuse the CUDA compiler settings and only change the parts that differ. This is because the CUDA compiler is already well-configured and I don't want to mess with it.
After conducting some local testing, I found that if we separate out the MUSA changes from the CUDA logic in the Makefile/CMakeLists.txt, we would end up duplicating a significant amount of code. This would lead to maintenance issues down the line.
I'd love to explore alternative solutions but I'm not sure what the best way to proceed is. I'm open to suggestions.
As I said, I don't feel particularly strongly about this. But what I would like is a consistent implementation between HIP and MUSA. So if we take the original approach of this PR then we should also adjust the HIP logic (in a separate PR).
Ok, I understand.
Makefile (outdated):
CC := clang
CXX := clang++
GGML_CUDA := 1
GGML_NO_OPENMP := 1
I assume that the issue with OpenMP is that the MUSA compiler does not work well with it. Could the MUSA compiler be used only for the ggml-cuda files instead of completely disabling OpenMP?
Thank you for pointing this out! The MUSA compiler is shipped with our Clang variant, which is based on Clang 14.
- Ubuntu 20.04 Requirement: MUSA libraries require Ubuntu 20.04.
- OpenMP Version: The latest OpenMP version available on Ubuntu 20.04 (Focal) is libomp-12-dev.
I'm not sure if setting GGML_NO_OPENMP to 0 will cause runtime issues. Therefore, I have set GGML_NO_OPENMP to 1 to disable it. Once we upgrade to Ubuntu 22.04, we can re-enable it.
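For reference, a quick way to check what is available on Focal (illustrative commands, not taken from this thread):
# list the LLVM OpenMP packages in Ubuntu 20.04's default repositories
apt-cache search libomp | sort
# show the candidate version of the newest dev package, libomp-12-dev
apt-cache policy libomp-12-dev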
BTW, just saw this in my previous work log:
root@b7a6e65eec34:/ws# ldd llama-server
linux-vdso.so.1 (0x00007ffdff2de000)
libmusa.so.1.0 => /usr/local/musa/lib/libmusa.so.1.0 (0x00007f044742e000)
libmublas.so => /usr/local/musa/lib/libmublas.so (0x00007f043c1a0000)
libmusart.so.1.0 => /usr/local/musa/lib/libmusart.so.1.0 (0x00007f043c16e000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f043c144000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f043c13e000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f043c134000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f043bf50000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f043be01000)
libomp.so.5 => /lib/x86_64-linux-gnu/libomp.so.5 (0x00007f043bcff000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f043bce4000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f043baf2000)
/lib64/ld-linux-x86-64.so.2 (0x00007f04477a9000)
libelf.so.1 => /lib/x86_64-linux-gnu/libelf.so.1 (0x00007f043bad6000)
libsrv_um_MUSA.so => /ddk/usr/lib/x86_64-linux-gnu/musa/libsrv_um_MUSA.so (0x00007f043b1bf000)
libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f043b1b2000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f043b196000)
It will work without OpenMP, but the performance will be significantly worse in some scenarios such as -nkvo.
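For context, -nkvo is llama-bench's --no-kv-offload option: it keeps the KV cache in host memory, so more of the per-token work runs on CPU threads, which is where OpenMP helps most. An illustrative invocation (the model path here is just a placeholder):
# benchmark with the KV cache kept on the host (no KV offload)
./llama-bench -m ./models/llama-3.1-8b-q4_k_m.gguf -nkvo 1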
Yes, I understand. I can perform some tests in the runtime env and check if we can enable it in this PR.
I just want to confirm: will llama.cpp accept a hardcoded path like this?
diff --git a/Makefile b/Makefile
index d2d9ce0f..4d10bd74 100644
--- a/Makefile
+++ b/Makefile
@@ -530,7 +530,6 @@ ifdef GGML_MUSA
CC := clang
CXX := clang++
GGML_CUDA := 1
- GGML_NO_OPENMP := 1
MK_CPPFLAGS += -DGGML_USE_MUSA
endif
@@ -589,8 +588,8 @@ ifdef GGML_CUDA
CUDA_PATH ?= /usr/local/musa
endif
- MK_CPPFLAGS += -DGGML_USE_CUDA -I$(CUDA_PATH)/include
- MK_LDFLAGS += -lmusa -lmublas -lmusart -lpthread -ldl -lrt -L$(CUDA_PATH)/lib -L/usr/lib64
+ MK_CPPFLAGS += -DGGML_USE_CUDA -I$(CUDA_PATH)/include -I/usr/lib/llvm-12/lib/clang/12.0.0/include
+ MK_LDFLAGS += -lmusa -lmublas -lmusart -lpthread -ldl -lrt -L$(CUDA_PATH)/lib -L/usr/lib64 -L/usr/lib/llvm-12/lib
MK_NVCCFLAGS += -x musa -mtgpu --cuda-gpu-arch=mp_22
else
ifneq ('', '$(wildcard /opt/cuda)')
If that's always the path for the MUSA compiler, that may be OK; it depends entirely on the compiler, not on llama.cpp. If it is not, it may cause issues when other people try to build with MUSA.
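One possible way to avoid hardcoding that include path would be to ask the compiler for it at build time (just a sketch, not something this PR does):
# clang can report its resource directory, which contains its builtin headers
clang -print-resource-dir    # e.g. /usr/lib/llvm-12/lib/clang/12.0.0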
I switched to the default libomp-dev in the builder image and libomp5-10 in the runtime image. Now GGML_NO_OPENMP := 1 is removed from the Makefile (as well as set(GGML_OPENMP OFF) in CMake), and everything looks good in my local tests.
Please see the latest commits. Thanks.
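A rough sketch of the package change described above; the actual Dockerfiles are not shown in this thread, so the exact commands are an assumption:
# builder image (Ubuntu 20.04): OpenMP headers and compiler support
apt-get update && apt-get install -y libomp-dev
# runtime image: only the OpenMP runtime library
apt-get update && apt-get install -y libomp5-10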
I see that the .cu files are being compiled with the mcc compiler, so why change the C/C++ compiler to clang for the rest of the project?
mcc is actually a symlink to clang:
root@9627f6e94585:/ws# ll /usr/local/musa/bin/mcc
lrwxrwxrwx 1 root root 5 Jul 11 13:16 /usr/local/musa/bin/mcc -> clang*
The version of clang that ships with MUSA contains some extra modifications.
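Since mcc just points at the vendor clang, both of the following should print the same MUSA-patched, Clang 14 based version string (an illustrative check; the output is not captured here):
/usr/local/musa/bin/mcc --version
/usr/local/musa/bin/clang --version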
No problem. I can start working on this.
In an earlier post you said:
Have there been any updates on this?
I've encountered some compilation issues on the S80 toolchain and have opened several internal tickets with the compiler team. I'll monitor the progress and keep you updated. The S80 toolchain (rc2.1.0_Intel_CPU_Ubuntu_quyuan) I used is publicly available but still in the RC stage. Please refer to link.
My video card is an S80 and my CPU is an AMD 2600.
Please see the above comments:
We are still investigating this issue internally. Please expect a new release of the MUSA SDK and a corresponding llama.cpp PR.
* Update doc for MUSA
* Add GGML_MUSA in Makefile
* Add GGML_MUSA in CMake
* CUDA => MUSA
* MUSA adds support for __vsubss4
* Fix CI build failure
Signed-off-by: Xiaodong Ye <[email protected]>
Any progress on the S80?
I guess we'll have to wait for the next version of the SDK.
Yes, please give us more time.
@yeahdongcn please help - I've got a problem compiling llama.cpp using MUSA SDK rc2.0.0 on Ubuntu 20.04.6 LTS:
@yeahdongcn thank you for your reply! Will this code work with MTT S3000?
Haha, it seems that you're one of our business customers! MTT S3000 shares the same architecture as MTT S80, so I can test on MTT S3000 as well.
@yeahdongcn what speeds can we expect for ~8B models on the MTT S80?
~15 tokens/s (llama3.1:8b)
@yeahdongcn very good, thank you
Moore Threads, a cutting-edge GPU startup, introduces MUSA (Moore Threads Unified System Architecture) as its foundational technology. This pull request marks the initial integration of MTGPU support into llama.cpp, leveraging MUSA's capabilities to enhance LLM inference performance.
Similar to #1087, CUDA APIs are replaced by MUSA APIs using macros, and a new build option is added to Makefile and CMake.
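For anyone who wants to try it, the build would look roughly like this, assuming the MUSA SDK is installed under /usr/local/musa (the default path used in the Makefile changes); the CMake option name is taken from the "Add GGML_MUSA in CMake" commit:
# Makefile build
make GGML_MUSA=1 -j$(nproc)
# CMake build
cmake -B build -DGGML_MUSA=ON
cmake --build build --config Release -j$(nproc)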
I also sent a PR to Ollama to integrate MTGPU support, and all tests were performed through Ollama. Tested models are: