llamafile_sgemm API - INT8 implementation #10912
base: master
Conversation
Force-pushed from 85c5280 to d70f5fc
Hi @ggerganov,
We will need to merge #10714 first, since there may be some conflicts.
Sure. Please let me know once that PR is merged. I will fix any conflicts on my side and resubmit my PR.
I'll try to submit it today. 🤞
Force-pushed from d70f5fc to a8d3700
I have made the changes suggested by @Djip007 and pushed them. @slaren / @Djip007 / @ggerganov, please review the changes.
These are quick personal comments; wait for @slaren / @ggerganov before making changes.
Also, I read it very quickly.
@amritahs-ibm
Force-pushed from a8d3700 to dc23ee5
This change upstreams llamafile's CPU matrix multiplication kernels for ppc64le using MMA builtins for the quantised int8 datatype. It results in a 10%-70% improvement in total speed (i.e. all tokens / total time) across various batch sizes. The patch was tested with the Meta-Llama-3-8B, Mistral-7B and Llama-2-7B-chat-hf models on an IBM POWER10 machine.
Signed-off-by: Amrita H S <[email protected]>
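For readers unfamiliar with the POWER10 MMA builtins this change is built on, here is a minimal sketch of an int8 tile multiply-accumulate, assuming GCC >= 10 with `-mcpu=power10`. This is not the PR's actual kernel: `tile_i8_gemm` and its parameters are hypothetical, and `A`/`B` are assumed to be pre-packed so that each 16-byte vector load holds one 4x4 byte block (rows of A, columns of B).

```cpp
#include <altivec.h>
#include <string.h>

// Accumulate one 4x4 int32 tile: C = A(4xK) * B(Kx4), K a multiple of 4.
void tile_i8_gemm(const unsigned char *A, const unsigned char *B,
                  int K, int C[4][4]) {
    __vector_quad acc;                 // 512-bit accumulator = 4x4 int32 tile
    __builtin_mma_xxsetaccz(&acc);     // zero the accumulator
    for (int k = 0; k < K; k += 4) {
        vector unsigned char va = vec_xl(0, A + 4 * k);  // 4 rows x 4 bytes
        vector unsigned char vb = vec_xl(0, B + 4 * k);  // 4 cols x 4 bytes
        // rank-4 update: each acc[i][j] accumulates a dot product of
        // 4 byte pairs (see Power ISA 3.1 for exact operand signedness)
        __builtin_mma_xvi8ger4pp(&acc, va, vb);
    }
    vector signed int rows[4];
    __builtin_mma_disassemble_acc(rows, &acc);  // spill acc into 4 VSRs
    memcpy(C, rows, sizeof(rows));
}
```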
Force-pushed from dc23ee5 to 4147962
All comments are addressed except for the last MMA one. The updated patch has been committed.
```cpp
private:

    template<int RM, int RN>
    void save_res(int ii, int jj, int idx, vector float* fin_res) {
```
You can add `inline`. It may (or may not) give some optimised code.
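For illustration, the suggested change would look like the sketch below, with a stub body and a hypothetical stand-in class for context. Note that a member template defined in-class is already implicitly inline-eligible, so `inline` here is only an optimisation hint the compiler may ignore.

```cpp
#include <altivec.h>  // for 'vector float' (requires a POWER target)

struct tinyBLAS_sketch {  // hypothetical stand-in for the PR's class
    template<int RM, int RN>
    inline void save_res(int ii, int jj, int idx, vector float* fin_res) {
        (void)ii; (void)jj; (void)idx; (void)fin_res;
        // ... store the RM x RN result tile ...
    }
};
```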
```cpp
    }

    template<int size>
    void compute(acc_t* ACC, int c_idx, int s_idx, std::array<int, size>& comparray, vector float* vs, vector float* fin_res) {
```
You can add `inline` too.
Are you referring to the gemm4xN and gemmMx4 functions in tinyBLAS_Q0_AVX? Also, in the case of PowerPC's MMA for the int8 data type, the MMA engine requires the data to be packed in a different way, so I came up with a specific function for int8 (i.e. packNormal) to do the packing. Please find the MMA guide below:
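For context, here is a minimal sketch of the kind of re-tiling such a pack step performs, assuming a row-major int8 source whose dimensions are multiples of 4. `pack_4x4_blocks` is a hypothetical name for illustration, not the PR's packNormal.

```cpp
#include <stddef.h>

// Re-tile a row-major int8 matrix into contiguous 4x4 blocks so that
// each 16-byte MMA vector load grabs exactly one block.
void pack_4x4_blocks(const signed char *src, size_t lda,
                     signed char *dst, int rows, int cols) {
    size_t out = 0;
    for (int i = 0; i < rows; i += 4)
        for (int j = 0; j < cols; j += 4)
            for (int r = 0; r < 4; ++r)       // rows inside the block
                for (int c = 0; c < 4; ++c)   // bytes inside the row
                    dst[out++] = src[(size_t)(i + r) * lda + (j + c)];
}
```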