feat: Support Moore Threads GPU #8383
Conversation
I am one of the primary llama.cpp CUDA developers. I would in principle be willing to buy a Moore Threads GPU and to test any code changes I do in order to assert that they don't break MUSA. On the Moore Threads website I only see a "Buy Now" button for the MTT S80. Would testing and performance optimization on that GPU be representative of an MTT S4000?
Thank you for checking out this PR! Yes, the current code changes were tested on the MTT S4000.
I think it would be better if the MUSA changes were completely separate from the CUDA logic in the Makefile/CMakeLists.txt. But I don't feel particularly strongly about this.
I should refine this. But my initial thought is to mostly reuse the CUDA compiler settings and only change the parts that differ. This is because the CUDA compiler is already well-configured and I don't want to mess with it.
After conducting some local testing, I found that if we separate out the MUSA changes from the CUDA logic in the Makefile/CMakeLists.txt, we would end up duplicating a significant amount of code. This would lead to maintenance issues down the line.
I'd love to explore alternative solutions but I'm not sure what the best way to proceed is. I'm open to suggestions.
As I said, I don't feel particularly strongly about this. But what I would like is a consistent implementation between HIP and MUSA. So if we take the original approach of this PR then we should also adjust the HIP logic (in a separate PR).
Ok, I understand.
Makefile (outdated):
CC := clang
CXX := clang++
GGML_CUDA := 1
GGML_NO_OPENMP := 1
I assume that the issue with OpenMP is that the MUSA compiler does not work well with it. Could the MUSA compiler be used only for the ggml-cuda files instead of completely disabling OpenMP?
Thank you for pointing this out! The MUSA compiler is shipped with our Clang variant, which is based on Clang 14.
- Ubuntu 20.04 Requirement: MUSA libraries require Ubuntu 20.04.
- OpenMP Version: The latest OpenMP version available on Ubuntu 20.04 (Focal) is libomp-12-dev.
I'm not sure if setting GGML_NO_OPENMP to 0 will cause runtime issues. Therefore, I have set GGML_NO_OPENMP to 1 to disable it. Once we upgrade to Ubuntu 22.04, we can re-enable it.
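For reference, a quick way to check what is available on Focal (illustrative commands, not taken from this thread):
# list the LLVM OpenMP packages in Ubuntu 20.04's default repositories
apt-cache search libomp | sort
# show the candidate version of the newest dev package, libomp-12-dev
apt-cache policy libomp-12-dev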
BTW, just saw this in my previous work log:
root@b7a6e65eec34:/ws# ldd llama-server
linux-vdso.so.1 (0x00007ffdff2de000)
libmusa.so.1.0 => /usr/local/musa/lib/libmusa.so.1.0 (0x00007f044742e000)
libmublas.so => /usr/local/musa/lib/libmublas.so (0x00007f043c1a0000)
libmusart.so.1.0 => /usr/local/musa/lib/libmusart.so.1.0 (0x00007f043c16e000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f043c144000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f043c13e000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f043c134000)
libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f043bf50000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f043be01000)
libomp.so.5 => /lib/x86_64-linux-gnu/libomp.so.5 (0x00007f043bcff000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f043bce4000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f043baf2000)
/lib64/ld-linux-x86-64.so.2 (0x00007f04477a9000)
libelf.so.1 => /lib/x86_64-linux-gnu/libelf.so.1 (0x00007f043bad6000)
libsrv_um_MUSA.so => /ddk/usr/lib/x86_64-linux-gnu/musa/libsrv_um_MUSA.so (0x00007f043b1bf000)
libnuma.so.1 => /lib/x86_64-linux-gnu/libnuma.so.1 (0x00007f043b1b2000)
libz.so.1 => /lib/x86_64-linux-gnu/libz.so.1 (0x00007f043b196000)
It will work without OpenMP, but the performance will be significantly worse in some scenarios such as -nkvo.
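For context, -nkvo is llama-bench's --no-kv-offload option: it keeps the KV cache in host memory, so more of the per-token work runs on CPU threads, which is where OpenMP helps most. An illustrative invocation (the model path here is just a placeholder):
# benchmark with the KV cache kept on the host (no KV offload)
./llama-bench -m ./models/llama-3.1-8b-q4_k_m.gguf -nkvo 1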
Yes, I understand. I can perform some tests in the runtime env and check if we can enable it in this PR.
I just want to confirm: will llama.cpp accept a hardcoded path like this?
diff --git a/Makefile b/Makefile
index d2d9ce0f..4d10bd74 100644
--- a/Makefile
+++ b/Makefile
@@ -530,7 +530,6 @@ ifdef GGML_MUSA
CC := clang
CXX := clang++
GGML_CUDA := 1
- GGML_NO_OPENMP := 1
MK_CPPFLAGS += -DGGML_USE_MUSA
endif
@@ -589,8 +588,8 @@ ifdef GGML_CUDA
CUDA_PATH ?= /usr/local/musa
endif
- MK_CPPFLAGS += -DGGML_USE_CUDA -I$(CUDA_PATH)/include
- MK_LDFLAGS += -lmusa -lmublas -lmusart -lpthread -ldl -lrt -L$(CUDA_PATH)/lib -L/usr/lib64
+ MK_CPPFLAGS += -DGGML_USE_CUDA -I$(CUDA_PATH)/include -I/usr/lib/llvm-12/lib/clang/12.0.0/include
+ MK_LDFLAGS += -lmusa -lmublas -lmusart -lpthread -ldl -lrt -L$(CUDA_PATH)/lib -L/usr/lib64 -L/usr/lib/llvm-12/lib
MK_NVCCFLAGS += -x musa -mtgpu --cuda-gpu-arch=mp_22
else
ifneq ('', '$(wildcard /opt/cuda)')
If that's always the path for the MUSA compiler, that may be OK; it depends entirely on the compiler, not on llama.cpp. If it is not, it may cause issues when other people try to build with MUSA.
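One possible way to avoid hardcoding that include path would be to ask the compiler for it at build time (just a sketch, not something this PR does):
# clang can report its resource directory, which contains its builtin headers
clang -print-resource-dir    # e.g. /usr/lib/llvm-12/lib/clang/12.0.0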
I switched to the default libomp-dev in the builder image and libomp5-10 in the runtime image. Now GGML_NO_OPENMP := 1 is removed from the Makefile (as well as set(GGML_OPENMP OFF) in CMake), and everything looks good in my local tests.
Please see the latest commits. Thanks.
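A rough sketch of the package change described above; the actual Dockerfiles are not shown in this thread, so the exact commands are an assumption:
# builder image (Ubuntu 20.04): OpenMP headers and compiler support
apt-get update && apt-get install -y libomp-dev
# runtime image: only the OpenMP runtime library
apt-get update && apt-get install -y libomp5-10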
I see that the .cu files are being compiled with the mcc compiler, so why change the C/C++ compiler to clang for the rest of the project?
mcc is actually a symlink to clang:
root@9627f6e94585:/ws# ll /usr/local/musa/bin/mcc
lrwxrwxrwx 1 root root 5 Jul 11 13:16 /usr/local/musa/bin/mcc -> clang*
The version of clang that ships with MUSA contains some extra modifications.
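Since mcc just points at the vendor clang, both of the following should print the same MUSA-patched, Clang 14 based version string (an illustrative check; the output is not captured here):
/usr/local/musa/bin/mcc --version
/usr/local/musa/bin/clang --version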
No problem. I can start working on this.
In an earlier post you said:
Have there been any updates on this?
I've encountered some compilation issues on the S80 toolchain and have opened several internal tickets with the compiler team. I'll monitor the progress and keep you updated. The S80 toolchain (rc2.1.0_Intel_CPU_Ubuntu_quyuan) I used is publicly available but still in the RC stage. Please refer to link.
My video card is an S80 and my CPU is an AMD 2600.
Please see the above comments:
We are still investigating this issue internally. Please expect a new release of the MUSA SDK and a corresponding llama.cpp PR.
* Update doc for MUSA
* Add GGML_MUSA in Makefile
* Add GGML_MUSA in CMake
* CUDA => MUSA
* MUSA adds support for __vsubss4
* Fix CI build failure
Signed-off-by: Xiaodong Ye <[email protected]>
Any progress on the S80?
I guess we'll have to wait for the next version of the SDK.
Yes, please give us more time.
@yeahdongcn please help - I've got a problem compiling llama.cpp using MUSA SDK rc2.0.0 on Ubuntu 20.04.6 LTS:
@yeahdongcn thank you for your reply! Will this code work with MTT S3000?
Haha, it seems that you're one of our business customers! MTT S3000 shares the same architecture as MTT S80, so I can test on MTT S3000 as well.
@yeahdongcn what speeds can we expect for ~8B models on the MTT S80?
~15 tokens/s (llama3.1:8b)
@yeahdongcn very good, thank you
Moore Threads, a cutting-edge GPU startup, introduces MUSA (Moore Threads Unified System Architecture) as its foundational technology. This pull request marks the initial integration of MTGPU support into llama.cpp, leveraging MUSA's capabilities to enhance LLM inference performance.
Similar to #1087, CUDA APIs are replaced by MUSA APIs using macros, and a new build option is added to Makefile and CMake.
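For anyone who wants to try it, the build would look roughly like this, assuming the MUSA SDK is installed under /usr/local/musa (the default path used in the Makefile changes); the CMake option name is taken from the "Add GGML_MUSA in CMake" commit:
# Makefile build
make GGML_MUSA=1 -j$(nproc)
# CMake build
cmake -B build -DGGML_MUSA=ON
cmake --build build --config Release -j$(nproc)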
I also sent a PR to Ollama to integrate MTGPU support, and all tests were performed through Ollama. Tested models are: