llamafile_sgemm API - INT8 implementation #10912

Open

wants to merge 1 commit into master from the sgemm_q8 branch
Conversation

amritahs-ibm
Contributor

@amritahs-ibm amritahs-ibm commented Dec 20, 2024

This change upstreams llamafile's CPU matrix multiplication kernels for ppc64le, using MMA builtins for the quantised int8 data type.

This change results in a 10%-70% improvement in total speed (i.e. all tokens / total time) across various batch sizes.

The patch was tested with the Meta-Llama-3-8B, Mistral-7B, and Llama-2-7B-chat-hf models on an IBM POWER10 machine.
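For readers unfamiliar with the POWER10 MMA builtins, below is a minimal, hedged sketch (not code from this PR) along the lines of the rank-4 int8 accumulate pattern such kernels use; the function and variable names (mma_int8_tile, A, B, C) are hypothetical, and the real sgemm.cpp kernels additionally handle quantisation scales, blocking and edge cases.

```cpp
// Minimal sketch, assuming a POWER10 compiler (-mcpu=power10) with MMA builtins.
// Not the PR's actual kernel; names are hypothetical.
#include <altivec.h>

typedef vector unsigned char vec_t;

// Accumulate a 4x4 int32 tile: C += A(4xK) * B(Kx4), with K a multiple of 4.
// Each 16-byte input vector carries a 4x4 block of 8-bit values.
static void mma_int8_tile(const vec_t *A, const vec_t *B, int K, int32_t *C) {
    __vector_quad acc;
    __builtin_mma_xxsetaccz(&acc);              // zero the 512-bit accumulator

    for (int i = 0; i < K / 4; ++i) {
        // rank-4 update: one MMA instruction covers 4 k-values
        // (the instruction treats one operand as unsigned and the other as
        //  signed; a real kernel has to account for that)
        __builtin_mma_xvi8ger4pp(&acc, A[i], B[i]);
    }

    vector signed int rows[4];
    __builtin_mma_disassemble_acc(rows, &acc);  // move accumulator back to VSRs
    for (int r = 0; r < 4; ++r) {
        vec_xst(rows[r], 0, &C[4 * r]);         // store one row of the 4x4 tile
    }
}
```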

@github-actions github-actions bot added the testing (Everything test related) and ggml (changes relating to the ggml tensor library for machine learning) labels Dec 20, 2024
@amritahs-ibm amritahs-ibm force-pushed the sgemm_q8 branch 2 times, most recently from 85c5280 to d70f5fc Compare December 20, 2024 06:22
@amritahs-ibm
Contributor Author

Hi @ggerganov,
Could you please help review this PR, or suggest any actions required from me to get this patch reviewed?

@slaren
Collaborator

slaren commented Dec 23, 2024

We will need to merge #10714 first, since there may be some conflicts.

@amritahs-ibm
Contributor Author

Sure. Please let me know once that PR is merged. I will fix any conflicts in mine and resubmit my PR.

@Djip007
Contributor

Djip007 commented Dec 24, 2024

I'll try to submit it today. 🤞

@amritahs-ibm
Contributor Author

I have made the changes suggested by @Djip007 and pushed them. @slaren / @Djip007 / @ggerganov, please review the changes.

Contributor

@Djip007 Djip007 left a comment

These are quick personal comments; wait for @slaren and @ggerganov before making changes.
Also, I read it very quickly.

ggml/src/ggml-cpu/llamafile/sgemm.cpp: 5 review comments (outdated, resolved)
@Djip007
Contributor

Djip007 commented Dec 28, 2024

@amritahs-ibm
I didn't realize that MMA was "Matrix Multiply Accelerate".
Have you seen what was done with the amx or "aarch64" kernels? There is now a "simple" architecture for creating kernels that can "repack" the weights so they fit the layout needed by such matrix ops.
That way A is repacked once at load time, and only B needs to be repacked at runtime.
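(Not part of the thread itself: a rough sketch of the repacking idea described above, with hypothetical names and types, not ggml's amx/aarch64 code. The point is only that the weight matrix A can be converted once, when the model is loaded, into a tile-friendly layout, so that only the activations B have to be rearranged per matmul call.)

```cpp
// Illustrative only; hypothetical names.
#include <cstdint>
#include <vector>

struct PackedWeights {          // tile-friendly copy of A, built once at load time
    std::vector<int8_t> data;
    int rows, cols, tile;
};

// Repack A once: copy each `tile`-row band in column-major order so the
// kernel can later stream whole tiles contiguously.
PackedWeights repack_A_once(const int8_t *A, int rows, int cols, int tile) {
    PackedWeights p{std::vector<int8_t>(size_t(rows) * cols), rows, cols, tile};
    size_t out = 0;
    for (int r0 = 0; r0 < rows; r0 += tile)
        for (int c = 0; c < cols; ++c)
            for (int r = r0; r < r0 + tile && r < rows; ++r)
                p.data[out++] = A[size_t(r) * cols + c];
    return p;
}

// At runtime only B (the activations) still needs per-call packing.
void gemm_packed(const PackedWeights &A, const int8_t *B, int32_t *C, int n);
```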

Commit: llamafile_sgemm API - INT8 implementation

Signed-off-by: Amrita H S <[email protected]>
@amritahs-ibm
Contributor Author

All comments are addressed except for the last MMA one. The updated patch has been committed.
I will look into the MMA comment and get back to you.

private:

template<int RM, int RN>
void save_res(int ii, int jj, int idx, vector float* fin_res) {
Contributor

You can add inline. It may (or may not) give some more optimised code.

}

template<int size>
void compute(acc_t* ACC, int c_idx, int s_idx, std::array<int, size>& comparray, vector float* vs, vector float* fin_res) {
Contributor

You can add inline too.
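A minimal sketch of what the two suggestions above amount to (assuming the tinyBLAS-style class in sgemm.cpp that these snippets come from; bodies unchanged and omitted):

```cpp
// Sketch of the suggested change only.
template<int RM, int RN>
inline void save_res(int ii, int jj, int idx, vector float *fin_res) {
    /* ... existing body ... */
}

template<int size>
inline void compute(acc_t *ACC, int c_idx, int s_idx,
                    std::array<int, size> &comparray,
                    vector float *vs, vector float *fin_res) {
    /* ... existing body ... */
}
```

Member functions defined inside a class body are already implicitly inline, so the keyword mainly documents intent; whether it changes the generated code depends on the compiler's inlining heuristics.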

@amritahs-ibm
Contributor Author

@amritahs-ibm I didn't realize that MMA was "Matrix Multiply Accelerate". Have you seen what was done with the amx or "aarch64" kernels? There is now a "simple" architecture for creating kernels that can "repack" the weights so they fit the layout needed by such matrix ops. That way A is repacked once at load time, and only B needs to be repacked at runtime.

Are you referring to the gemm4xN and gemmMx4 functions in tinyBLAS_Q0_AVX?

Also, in the case of PowerPC's MMA for the int8 data type, the MMA engine requires the data to be packed in a different way, so I came up with a specific function for int8 (i.e. packNormal) to do the packing.
[image: illustration of the int8 data packing required by the MMA engine]

Please find below the MMA guide:
https://www.redbooks.ibm.com/redpapers/pdfs/redp5612.pdf
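(Added for illustration, not from the PR: a hedged sketch of the kind of rearrangement such a packing routine has to perform for the rank-4 int8 update, where each 16-byte vector feeds the MMA instruction four consecutive k-values from each of four rows. The helper name pack_k4 is hypothetical; the actual packNormal also handles the quantised block format and edge tails.)

```cpp
// Hypothetical packing helper; illustrative only.
#include <cstdint>

// src: 4 rows of an int8 matrix with leading dimension ld, K a multiple of 4.
// dst: one 16-byte group per rank-4 MMA step, laid out as
//      row0[k..k+3] | row1[k..k+3] | row2[k..k+3] | row3[k..k+3]
static void pack_k4(const int8_t *src, int ld, int K, int8_t *dst) {
    for (int k = 0; k < K; k += 4)
        for (int row = 0; row < 4; ++row)
            for (int i = 0; i < 4; ++i)
                *dst++ = src[row * ld + k + i];
}
```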
