Add Intel Advanced Matrix Extensions (AMX) support to ggml #8998

mingfeima · 2024-08-12T06:56:34Z

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

replacement of #7707 to trigger ggml-ci on amx

ggerganov · 2024-08-12T07:23:06Z

To trigger ggml-ci you need to include the string "ggml-ci" somewhere in the commit message. For example: 5ef07e2

mingfeima · 2024-08-14T07:39:31Z

@ggerganov could you please take a look at this one? I have moved the amx init code from ggml.c to ggml-amx/mmq.cpp according to previous comments.

slaren

Currently, GGML_AMX is always enabled with llama.cpp, and whether AMX is actually used depends entirely on using -march=native with GGML_NATIVE, and building on a machine with AMX support. This is not ideal because it is a departure from the way the other x86 instruction sets are handled, and it doesn't allow for cross-compilation. I think it would be better to handle this in the same way as any other x86 instruction set, and add an explicit compiler option to enable this architecture (-mamx-int8?) when using GGML_AMX without GGML_NATIVE, or let it be enabled automatically by -march=native when using GGML_NATIVE.

mingfeima · 2024-08-15T03:00:44Z

@slaren just updated cmake compiler options: -mamx-tile, -mamx-int8 and -mamx-bf16!

ggml/src/ggml-amx/mmq.cpp

mingfeima · 2024-09-25T07:21:42Z

@ggerganov could you please tell me why this CI failed? https://github.com/ggml-org/ci/tree/results/llama.cpp/28/cfc0ffdbfec9520a2c190d57025350229d340c/ggml-4-x86-cuda-v100

ggerganov · 2024-09-25T07:26:47Z

failure 124 means that the run timeout. There is currently a limit of 30 minutes for these runs and in this case it was exceeded.

It's not related to this PR - the CUDA build time has recently increased so this run is timeouting from time to time. I just restarted it to see if it would pass.

slaren

Other than the issues with is_host pointed below, the ggml-backend interface implementation looks good. Unfortunately fixing these issues may result in a small hit in performance since it may cause some additional copies when data needs to be moved between the CPU and AMX backends, and fixing that will require changes to the ggml-backend interface.

The llama.cpp side will probably need some changes. I expect that the current implementation won't work with KV quantization. I cannot test this, but I think changing it so that the AMX buffer type is only used for the weights may work better, while also avoiding the need to use -ngl:

diff --git a/src/llama.cpp b/src/llama.cpp
index b85e8acd..13d70ec1 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -3462,8 +3462,6 @@ static ggml_backend_buffer_type_t llama_default_buffer_type_offload(const llama_
     }
 #elif defined(GGML_USE_CANN)
     buft = ggml_backend_cann_buffer_type(local_gpu);
-#elif defined(GGML_USE_AMX)
-    buft = ggml_backend_amx_buffer_type();
 #endif

     if (buft == nullptr) {
@@ -6865,7 +6863,14 @@ static bool llm_load_tensors(

     // assign cpu layers
     for (int i = 0; i < i_gpu_start; ++i) {
+    #ifdef GGML_USE_AMX
+        model.buft_layer[i] = {
+            llama_default_buffer_type_cpu(true),
+            ggml_backend_amx_buffer_type()
+        };
+    #else
         model.buft_layer[i] = llama_default_buffer_type_cpu(true);
+    #endif
     }

     if (split_mode == LLAMA_SPLIT_MODE_LAYER) {
@@ -18587,11 +18592,6 @@ struct llama_model_params llama_model_default_params() {
     result.n_gpu_layers = 999;
 #endif

-#ifdef GGML_USE_AMX
-    // by default offload all layers to AMX
-    result.n_gpu_layers = 999;
-#endif
-
     return result;
 }

I also expect that this implementation will have issues when built with a GPU backend such as CUDA that allows the weights to be copied to VRAM when evaluating large batches (>32 tokens), although that could be fixed by implementing conversion back to standard ggml format in ggml_backend_amx_buffer_get_tensor.

ggml/src/ggml-amx.cpp

src/llama.cpp

mingfeima · 2024-09-30T02:37:55Z

@slaren after changing is_host to false from the AMX backend leads to an fault from ggml_backend_sched_backend_id_from_cur (log attached below). Do you have any insight how to fix it?

llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        AMX KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
ggml/src/ggml-backend.c:1204: pre-allocated tensor in a backend that cannot run the operation
[New LWP 2746117]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007f7d1a7205a2 in waitpid () from /lib64/libpthread.so.0
#0  0x00007f7d1a7205a2 in waitpid () from /lib64/libpthread.so.0
#1  0x000000000048a648 in ggml_print_backtrace () at ggml/src/ggml.c:282
282             waitpid(pid, &wstatus, 0);
#2  ggml_abort (file=file@entry=0x6788d4 "ggml/src/ggml-backend.c", line=line@entry=1204, fmt=fmt@entry=0x678c50 "pre-allocated tensor in a backend that cannot run the operation") at ggml/src/ggml.c:309
309         ggml_print_backtrace();
#3  0x00000000004cf025 in ggml_backend_sched_backend_id_from_cur (sched=0x172fd20, tensor=0x5305e10) at ggml/src/ggml-backend.c:1204
1204            GGML_ABORT("pre-allocated tensor in a backend that cannot run the operation");
#4  0x00000000004d127c in ggml_backend_sched_split_graph (sched=sched@entry=0x172fd20, graph=graph@entry=0x1ada190) at ggml/src/ggml-backend.c:1337
1337                *leaf_backend_id = ggml_backend_sched_backend_id_from_cur(sched, leaf);
#5  0x00000000004d2dde in ggml_backend_sched_split_graph (graph=0x1ada190, sched=<optimized out>) at ggml/src/ggml-backend.c:1327
1327        if (sched->ctx == NULL) {
#6  ggml_backend_sched_reserve (sched=0x172fd20, measure_graph=0x1ada190) at ggml/src/ggml-backend.c:1992
1992        ggml_backend_sched_split_graph(sched, measure_graph);
#7  0x000000000053204b in llama_new_context_with_model (model=0x1729f30, params=...) at src/llama.cpp:19176
19176               if (!ggml_backend_sched_reserve(ctx->sched, gf)) {
#8  0x000000000060a48d in llama_init_from_gpt_params (params=...) at common/common.cpp:843
843         llama_context * lctx = llama_new_context_with_model(model, cparams);
#9  0x000000000043894d in main (argc=<optimized out>, argv=0x7fffb9a8de18) at examples/main/main.cpp:200
200         llama_init_result llama_init = llama_init_from_gpt_params(params);
[Inferior 1 (process 2746116) detached]
./run_generate_cpu.sh: line 26: 2746116 Aborted                 (core dumped) $PREFIX $main -m ./models/Meta-Llama-3-8B-Instruct-GGUF/$MODEL -t $CORES -n 5 -p "$prompt" --no-mmap

slaren · 2024-09-30T02:59:47Z

I think that may be related to KV operations, which should be fixed with the change I suggested before. By making llama_default_buffer_type_offload return the AMX buffer type, it will cause the KV cache to be allocated on an AMX buffer, which is not good. If that doesn't fix it, please add some prints to show the tensor that is causing the error.

mingfeima · 2024-10-14T02:22:27Z

@slaren could you please help review this one again? just changed ggml_backend_buft_is_host to return false for amx backend.

ggml/src/ggml-backend.cpp

src/llama.cpp

ggml/src/ggml-amx.cpp

slaren · 2024-10-17T01:39:48Z

Looks good to me, feel free to merge this at any point.

ggerganov · 2024-10-17T06:30:16Z

@slaren Thank you for the detailed review.

@mingfeima Remember to squash the commits when merging as explained in the contributing guidelines. Btw, I just restarted the ggml-ci node with AMX instruction set support, so we might want to wait for ggml-ci to run before merging. Will run it on this branch shortly.

Edit: the AMX CI has passed

ggerganov · 2024-10-17T08:44:48Z

Would recommend using 4 spaces for indentation for conformance with the rest of the codebase.

add intel amx isa detection add vnni kernel for gemv cases add vnni and amx kernel support for block_q8_0 code cleanup fix packing B issue enable openmp fine tune amx kernel switch to aten parallel pattern add error message for nested parallelism code cleanup add f16 support in ggml-amx add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS update CMakeList update README fix some compilation warning fix compiler warning when amx is not enabled minor change ggml-ci move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp ggml-ci update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16 ggml-ci add amx as an ggml-backend update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h minor change update CMakeLists.txt minor change apply weight prepacking in set_tensor method in ggml-backend fix compile error ggml-ci minor change ggml-ci update CMakeLists.txt ggml-ci add march dependency minor change ggml-ci change ggml_backend_buffer_is_host to return false for amx backend ggml-ci fix supports_op use device reg for AMX backend ggml-ci minor change ggml-ci minor change fix rebase set .buffer_from_host_ptr to be false for AMX backend

mingfeima · 2024-10-18T03:44:28Z

@ggerganov changed to tab with 4 spaces. also the branch is rebased to squash into one.

ggerganov · 2024-10-18T05:20:04Z

Nice, great job! Feel free to merge this - you should have the access to do so.

mingfeima · 2024-10-18T05:35:32Z

@slaren thanks a lot for your review!

add intel amx isa detection add vnni kernel for gemv cases add vnni and amx kernel support for block_q8_0 code cleanup fix packing B issue enable openmp fine tune amx kernel switch to aten parallel pattern add error message for nested parallelism code cleanup add f16 support in ggml-amx add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS update CMakeList update README fix some compilation warning fix compiler warning when amx is not enabled minor change ggml-ci move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp ggml-ci update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16 ggml-ci add amx as an ggml-backend update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h minor change update CMakeLists.txt minor change apply weight prepacking in set_tensor method in ggml-backend fix compile error ggml-ci minor change ggml-ci update CMakeLists.txt ggml-ci add march dependency minor change ggml-ci change ggml_backend_buffer_is_host to return false for amx backend ggml-ci fix supports_op use device reg for AMX backend ggml-ci minor change ggml-ci minor change fix rebase set .buffer_from_host_ptr to be false for AMX backend

nai-kon · 2024-11-12T01:37:37Z

@mingfeima

Thank you for your work. I have a question.
The qtype_has_amx_kernels() returns TRUE with Q4_0 and Q4_1, so does it mean the AMX acceleration is only enabled with Q4_0 and Q4_1 model?
Is there any plan to support Q4_K or more?

qtype_has_amx_kernels

inline bool qtype_has_amx_kernels(const enum ggml_type type) {
    // TODO: fix padding for vnni format
    return (type == GGML_TYPE_Q4_0) ||
        (type == GGML_TYPE_Q4_1);
        //(type == GGML_TYPE_Q8_0) ||
        //(type == GGML_TYPE_Q4_K) ||
        //(type == GGML_TYPE_Q5_K) ||
        //(type == GGML_TYPE_Q6_K) ||
        //(type == GGML_TYPE_IQ4_XS);
}

nai-kon · 2024-11-12T06:02:41Z

@mingfeima
I have an additional question.
When I tried latest lamma.cpp in my environment, the prompt eval time became 2x faster compared to no AMX supported commit, but the eval time (text generation speed) remained almost unchanged.
(I tried it with gemma-2b-q4_0 on Intel Xeon with AMX.)

In your original PR, it was reported that eval time became 2x faster.

Do you have any idea that the reason of eval time does not become faster?

add intel amx isa detection add vnni kernel for gemv cases add vnni and amx kernel support for block_q8_0 code cleanup fix packing B issue enable openmp fine tune amx kernel switch to aten parallel pattern add error message for nested parallelism code cleanup add f16 support in ggml-amx add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS update CMakeList update README fix some compilation warning fix compiler warning when amx is not enabled minor change ggml-ci move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp ggml-ci update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16 ggml-ci add amx as an ggml-backend update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h minor change update CMakeLists.txt minor change apply weight prepacking in set_tensor method in ggml-backend fix compile error ggml-ci minor change ggml-ci update CMakeLists.txt ggml-ci add march dependency minor change ggml-ci change ggml_backend_buffer_is_host to return false for amx backend ggml-ci fix supports_op use device reg for AMX backend ggml-ci minor change ggml-ci minor change fix rebase set .buffer_from_host_ptr to be false for AMX backend

mingfeima · 2024-11-18T02:30:56Z

@nai-kon originally i wrote kernels for

        //(type == GGML_TYPE_Q8_0) ||
        //(type == GGML_TYPE_Q4_K) ||
        //(type == GGML_TYPE_Q5_K) ||
        //(type == GGML_TYPE_Q6_K) ||
        //(type == GGML_TYPE_IQ4_XS);

but later on, these are banned, because i did not figure out how to do the padding with ggml-backend (these formats require additional padding for the packed format). Anyway i suppose it can be done, it's just i don't have spare time for this at the moment.

As for the performance of generation, I suppose this is because the original cpu impl in ggml-quant has also been improved (the old code base has a huge thread sync overhead with pthread, it uses atomic; later on when openmp is used, the overhead is much smaller). Additionally, the current version of AMX kernels are actually slower than my first version, some optimizations are removed to rebase to use ggml-backend. My estimation is that, the current AMX kernels still have ~1/3 gap to optimal.

the work here is my personal interest (not company task), I will get back to this once I have spare time (add back the kernels for other quant types and improve the performance).

nai-kon · 2024-11-18T04:38:49Z

Thank you for your reply. It sounds wonderful that the generation speed can still be improved.
I truly appreciate your wonderful personal effort and work.

add intel amx isa detection add vnni kernel for gemv cases add vnni and amx kernel support for block_q8_0 code cleanup fix packing B issue enable openmp fine tune amx kernel switch to aten parallel pattern add error message for nested parallelism code cleanup add f16 support in ggml-amx add amx kernels for QK_K quant formats: Q4_K, Q5_K, Q6_K and IQ4_XS update CMakeList update README fix some compilation warning fix compiler warning when amx is not enabled minor change ggml-ci move ggml_amx_init from ggml.c to ggml-amx/mmq.cpp ggml-ci update CMakeLists with -mamx-tile, -mamx-int8 and -mamx-bf16 ggml-ci add amx as an ggml-backend update header file, the old path for immintrin.h has changed to ggml-cpu-impl.h minor change update CMakeLists.txt minor change apply weight prepacking in set_tensor method in ggml-backend fix compile error ggml-ci minor change ggml-ci update CMakeLists.txt ggml-ci add march dependency minor change ggml-ci change ggml_backend_buffer_is_host to return false for amx backend ggml-ci fix supports_op use device reg for AMX backend ggml-ci minor change ggml-ci minor change fix rebase set .buffer_from_host_ptr to be false for AMX backend

slaren · 2024-11-19T18:09:48Z

Additionally, the current version of AMX kernels are actually slower than my first version, some optimizations are removed to rebase to use ggml-backend.

Did you find any issue with the ggml-backend interface that forced you to remove these optimizations? The plan is to reintegrate the AMX backend in the CPU backend in the future (#10359), which may eliminate some overheads and allow using the optimized q8_0 quantization functions again.

mingfeima · 2024-11-26T03:02:42Z

@slaren the major problem that i have with ggml-backend is I didn't figure out how to do padding with the AMX backend (when the packing weight for AMX, e.g. vnni format, has a different memory size with default CPU backend. I run into a couple of issues when integrating AMX backend with ggml-backend. So I just leave the dtypes that does not require padding. Again, I think it should be able to be done elegantly, I just did not have the time to investigate recently.

From the performance wise, TODOs from my side are: fuse the quantization of A into the gemm loop; Q8_K quant method is a reference now (very slow); and so on.

slaren · 2024-11-28T19:15:07Z

@mingfeima as far as I can tell, you were already doing everything that's necessary to allow padding (returning the padded size in get_alloc_size is enough). I intend to move the AMX code to the CPU backend and enable all the supported types in #10570.

mingfeima marked this pull request as draft August 12, 2024 06:58

github-actions bot added build Compilation issues ggml changes relating to the ggml tensor library for machine learning labels Aug 12, 2024

mingfeima changed the title ~~Pr add intel amx support~~ Add Intel Advanced Matrix Extensions (AMX) support to ggml Aug 12, 2024

mingfeima force-pushed the pr_add_intel_amx_support branch from 39d84e3 to 0b4de32 Compare August 12, 2024 07:43

mingfeima mentioned this pull request Aug 14, 2024

Add Intel Advanced Matrix Extensions (AMX) support to ggml #7707

Open

mingfeima force-pushed the pr_add_intel_amx_support branch from 2c95fa5 to 37ccb9d Compare August 14, 2024 06:14

mingfeima marked this pull request as ready for review August 14, 2024 06:58

mingfeima requested review from ggerganov and slaren August 14, 2024 07:37

slaren reviewed Aug 14, 2024

View reviewed changes

ggerganov approved these changes Aug 16, 2024

View reviewed changes

ggerganov requested a review from slaren August 16, 2024 07:59

slaren reviewed Aug 16, 2024

View reviewed changes

ggml/src/ggml-amx/mmq.cpp Outdated Show resolved Hide resolved

mingfeima force-pushed the pr_add_intel_amx_support branch from a43f8e0 to 47b1a74 Compare September 18, 2024 06:16

mingfeima force-pushed the pr_add_intel_amx_support branch from c90a43a to 28cfc0f Compare September 25, 2024 06:27

mingfeima force-pushed the pr_add_intel_amx_support branch from b4d38b2 to 7c371fa Compare September 26, 2024 06:12

slaren reviewed Sep 30, 2024

View reviewed changes

ggml/src/ggml-amx.cpp Outdated Show resolved Hide resolved

ggml/src/ggml-amx.cpp Outdated Show resolved Hide resolved

src/llama.cpp Outdated Show resolved Hide resolved

mingfeima force-pushed the pr_add_intel_amx_support branch from 2f37e21 to fc709cf Compare October 11, 2024 08:12

slaren reviewed Oct 14, 2024

View reviewed changes

ggml/src/ggml-backend.cpp Outdated Show resolved Hide resolved

src/llama.cpp Outdated Show resolved Hide resolved

src/llama.cpp Show resolved Hide resolved

mingfeima force-pushed the pr_add_intel_amx_support branch from fc709cf to 45451e2 Compare October 16, 2024 07:58

slaren reviewed Oct 16, 2024

View reviewed changes

ggml/src/ggml-amx.cpp Outdated Show resolved Hide resolved

slaren reviewed Oct 16, 2024

View reviewed changes

ggml/src/ggml-amx.cpp Outdated Show resolved Hide resolved

mingfeima force-pushed the pr_add_intel_amx_support branch from 6072775 to 3366c22 Compare October 16, 2024 23:58

mingfeima requested a review from slaren October 17, 2024 00:20

slaren approved these changes Oct 17, 2024

View reviewed changes

mingfeima force-pushed the pr_add_intel_amx_support branch from d5e8aba to 2d3fc54 Compare October 18, 2024 03:36

mingfeima merged commit 60ce97c into master Oct 18, 2024
60 checks passed

ggerganov mentioned this pull request Oct 30, 2024

ggml: x64: implement AMX dot product #7487

Closed

nai-kon mentioned this pull request Nov 11, 2024

Update related llama.cpp to support Intel AMX instruction abetlen/llama-cpp-python#1827

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Intel Advanced Matrix Extensions (AMX) support to ggml #8998

Add Intel Advanced Matrix Extensions (AMX) support to ggml #8998

mingfeima commented Aug 12, 2024

ggerganov commented Aug 12, 2024

mingfeima commented Aug 14, 2024

slaren left a comment

mingfeima commented Aug 15, 2024

mingfeima commented Sep 25, 2024

ggerganov commented Sep 25, 2024

slaren left a comment

mingfeima commented Sep 30, 2024

slaren commented Sep 30, 2024

mingfeima commented Oct 14, 2024

slaren commented Oct 17, 2024

ggerganov commented Oct 17, 2024 •

edited

Loading

ggerganov commented Oct 17, 2024

mingfeima commented Oct 18, 2024

ggerganov commented Oct 18, 2024

mingfeima commented Oct 18, 2024

nai-kon commented Nov 12, 2024 •

edited

Loading

nai-kon commented Nov 12, 2024

mingfeima commented Nov 18, 2024

nai-kon commented Nov 18, 2024 •

edited

Loading

slaren commented Nov 19, 2024

mingfeima commented Nov 26, 2024

slaren commented Nov 28, 2024

Add Intel Advanced Matrix Extensions (AMX) support to ggml #8998

Add Intel Advanced Matrix Extensions (AMX) support to ggml #8998

Conversation

mingfeima commented Aug 12, 2024

ggerganov commented Aug 12, 2024

mingfeima commented Aug 14, 2024

slaren left a comment

Choose a reason for hiding this comment

mingfeima commented Aug 15, 2024

mingfeima commented Sep 25, 2024

ggerganov commented Sep 25, 2024

slaren left a comment

Choose a reason for hiding this comment

mingfeima commented Sep 30, 2024

slaren commented Sep 30, 2024

mingfeima commented Oct 14, 2024

slaren commented Oct 17, 2024

ggerganov commented Oct 17, 2024 • edited Loading

ggerganov commented Oct 17, 2024

mingfeima commented Oct 18, 2024

ggerganov commented Oct 18, 2024

mingfeima commented Oct 18, 2024

nai-kon commented Nov 12, 2024 • edited Loading

nai-kon commented Nov 12, 2024

mingfeima commented Nov 18, 2024

nai-kon commented Nov 18, 2024 • edited Loading

slaren commented Nov 19, 2024

mingfeima commented Nov 26, 2024

slaren commented Nov 28, 2024

ggerganov commented Oct 17, 2024 •

edited

Loading

nai-kon commented Nov 12, 2024 •

edited

Loading

nai-kon commented Nov 18, 2024 •

edited

Loading