llamafile_sgemm API - INT8 implementation #10912

Open

wants to merge 1 commit into master from the sgemm_q8 branch
Conversation

amritahs-ibm
Contributor

@amritahs-ibm amritahs-ibm commented Dec 20, 2024

This change upstreams llamafile's CPU matrix multiplication kernels for ppc64le, using MMA builtins for the quantised int8 data type.

This change results in a 10%-70% improvement in total speed (i.e. all tokens / total time) across various batch sizes.

The patch was tested with the Meta-Llama-3-8B, Mistral-7B, and Llama-2-7B-chat-hf models on an IBM POWER10 machine.
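For readers unfamiliar with the POWER10 MMA builtins, below is a minimal, hedged sketch (not code from this PR) along the lines of the rank-4 int8 accumulate pattern such kernels use; the function and variable names (mma_int8_tile, A, B, C) are hypothetical, and the real sgemm.cpp kernels additionally handle quantisation scales, blocking and edge cases.

```cpp
// Minimal sketch, assuming a POWER10 compiler (-mcpu=power10) with MMA builtins.
// Not the PR's actual kernel; names are hypothetical.
#include <altivec.h>

typedef vector unsigned char vec_t;

// Accumulate a 4x4 int32 tile: C += A(4xK) * B(Kx4), with K a multiple of 4.
// Each 16-byte input vector carries a 4x4 block of 8-bit values.
static void mma_int8_tile(const vec_t *A, const vec_t *B, int K, int32_t *C) {
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);              // zero the 512-bit accumulator

    for (int i = 0; i < K / 4; ++i) {
        // rank-4 update: one MMA instruction covers 4 k-values
        // (the instruction treats one operand as unsigned and the other as
        //  signed; a real kernel has to account for that)
        __builtin_mma_xvi8ger4pp(&acc, A[i], B[i]);
    }

    vector signed int rows[4];
    __builtin_mma_disassemble_acc(rows, &acc);  // move accumulator back to VSRs
    for (int r = 0; r < 4; ++r) {
        vec_xst(rows[r], 0, &C[4 * r]);         // store one row of the 4x4 tile
    }
}
```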

@github-actions github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels Dec 20, 2024
@amritahs-ibm amritahs-ibm force-pushed the sgemm_q8 branch 2 times, most recently from 85c5280 to d70f5fc Compare December 20, 2024 06:22
@amritahs-ibm
Contributor Author

Hi @ggerganov,
Could you please help review this PR, or suggest any actions required from me to get this patch reviewed?

@slaren
Collaborator

slaren commented Dec 23, 2024

We will need to merge #10714 first, since there may be some conflicts.

@amritahs-ibm
Contributor Author

Sure. Please let me know once that PR is merged. I will fix any conflicts in mine and resubmit my PR.

@Djip007
Contributor

Djip007 commented Dec 24, 2024

I'll try to submit it today. 🤞

@amritahs-ibm
Contributor Author

I have made the changes suggested by @Djip007 and pushed them. @slaren / @Djip007 / @ggerganov, please review the changes.

Contributor

@Djip007 Djip007 left a comment

These are quick personal comments; wait for @slaren and @ggerganov before making changes.
Also, I read it very quickly.

ggml/src/ggml-cpu/llamafile/sgemm.cpp: 5 review comments (outdated, resolved)
@Djip007
Contributor

Djip007 commented Dec 28, 2024

@amritahs-ibm
I didn't realize that MMA was "Matrix Multiply Accelerate".
Have you seen what was done with the amx or "aarch64" kernels? There is now a "simple" architecture for creating kernels that can "repack" the weights so they fit the layout needed by such matrix ops.
That way A is repacked once at load time, and only B needs to be repacked at runtime.
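(Not part of the thread itself: a rough sketch of the repacking idea described above, with hypothetical names and types, not ggml's amx/aarch64 code. The point is only that the weight matrix A can be converted once, when the model is loaded, into a tile-friendly layout, so that only the activations B have to be rearranged per matmul call.)

```cpp
// Illustrative only; hypothetical names.
#include <cstdint>
#include <vector>

struct PackedWeights {          // tile-friendly copy of A, built once at load time
    std::vector<int8_t> data;
    int rows, cols, tile;
};

// Repack A once: copy each `tile`-row band in column-major order so the
// kernel can later stream whole tiles contiguously.
PackedWeights repack_A_once(const int8_t *A, int rows, int cols, int tile) {
    PackedWeights p{std::vector<int8_t>(size_t(rows) * cols), rows, cols, tile};
    size_t out = 0;
    for (int r0 = 0; r0 < rows; r0 += tile)
        for (int c = 0; c < cols; ++c)
            for (int r = r0; r < r0 + tile && r < rows; ++r)
                p.data[out++] = A[size_t(r) * cols + c];
    return p;
}

// At runtime only B (the activations) still needs per-call packing.
void gemm_packed(const PackedWeights &A, const int8_t *B, int32_t *C, int n);
```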

Commit: llamafile_sgemm API - INT8 implementation

Signed-off-by: Amrita H S <[email protected]>
@amritahs-ibm
Contributor Author

All comments are addressed except for the last MMA one. The updated patch has been committed.
I will look into the MMA comment and get back to you.

private:

template<int RM, int RN>
void save_res(int ii, int jj, int idx, vector float* fin_res) {
Contributor

You can add inline. It may (or may not) give some more optimised code.

}

template<int size>
void compute(acc_t* ACC, int c_idx, int s_idx, std::array<int, size>& comparray, vector float* vs, vector float* fin_res) {
Contributor

You can add inline too.
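A minimal sketch of what the two suggestions above amount to (assuming the tinyBLAS-style class in sgemm.cpp that these snippets come from; bodies unchanged and omitted):

```cpp
// Sketch of the suggested change only.
template<int RM, int RN>
inline void save_res(int ii, int jj, int idx, vector float *fin_res) {
    /* ... existing body ... */
}

template<int size>
inline void compute(acc_t *ACC, int c_idx, int s_idx,
                    std::array<int, size> &comparray,
                    vector float *vs, vector float *fin_res) {
    /* ... existing body ... */
}
```

Member functions defined inside a class body are already implicitly inline, so the keyword mainly documents intent; whether it changes the generated code depends on the compiler's inlining heuristics.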

@amritahs-ibm
Contributor Author

@amritahs-ibm I didn't realize that MMA was "Matrix Multiply Accelerate". Have you seen what was done with the amx or "aarch64" kernels? There is now a "simple" architecture for creating kernels that can "repack" the weights so they fit the layout needed by such matrix ops. That way A is repacked once at load time, and only B needs to be repacked at runtime.

Are you referring to the gemm4xN and gemmMx4 functions in tinyBLAS_Q0_AVX?

Also, in the case of PowerPC's MMA for the int8 data type, the MMA engine requires the data to be packed in a different way, so I came up with a specific function for int8 (i.e. packNormal) to do the packing.
[image: illustration of the int8 data packing required by the MMA engine]

Please find below the MMA guide:
https://www.redbooks.ibm.com/redpapers/pdfs/redp5612.pdf
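(Added for illustration, not from the PR: a hedged sketch of the kind of rearrangement such a packing routine has to perform for the rank-4 int8 update, where each 16-byte vector feeds the MMA instruction four consecutive k-values from each of four rows. The helper name pack_k4 is hypothetical; the actual packNormal also handles the quantised block format and edge tails.)

```cpp
// Hypothetical packing helper; illustrative only.
#include <cstdint>

// src: 4 rows of an int8 matrix with leading dimension ld, K a multiple of 4.
// dst: one 16-byte group per rank-4 MMA step, laid out as
//      row0[k..k+3] | row1[k..k+3] | row2[k..k+3] | row3[k..k+3]
static void pack_k4(const int8_t *src, int ld, int K, int8_t *dst) {
    for (int k = 0; k < K; k += 4)
        for (int row = 0; row < 4; ++row)
            for (int i = 0; i < 4; ++i)
                *dst++ = src[row * ld + k + i];
}
```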
