llamafile_sgemm API - INT8 implementation · ggerganov/llama.cpp@4147962 · GitHub

llamafile_sgemm API - INT8 implementation

This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the quantised int8 data type.
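
For orientation, here is a minimal scalar sketch of the kind of
computation an int8 quantised matrix multiplication performs. It is
not the code from this patch and uses no MMA builtins. The names
BlockQ8, dot_q8 and gemm_q8 are hypothetical, and the block layout
(32 int8 values sharing one float scale) is a simplifying assumption
made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplified block: 32 int8 values sharing one float scale.
constexpr std::size_t kBlock = 32;

struct BlockQ8 {
    float scale;
    std::int8_t qs[kBlock];
};

// Dot product of two quantised rows, each made of nblocks blocks.
// The int8 products are accumulated in int32, then scaled to float.
float dot_q8(const BlockQ8* a, const BlockQ8* b, std::size_t nblocks) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < nblocks; ++i) {
        std::int32_t acc = 0;
        for (std::size_t j = 0; j < kBlock; ++j) {
            acc += static_cast<std::int32_t>(a[i].qs[j]) *
                   static_cast<std::int32_t>(b[i].qs[j]);
        }
        sum += a[i].scale * b[i].scale * static_cast<float>(acc);
    }
    return sum;
}

// C (m x n, row-major) = A * B^T, where A holds m rows and B holds n rows,
// each row consisting of nblocks quantised blocks.
void gemm_q8(const std::vector<BlockQ8>& A, const std::vector<BlockQ8>& B,
             std::vector<float>& C,
             std::size_t m, std::size_t n, std::size_t nblocks) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            C[i * n + j] = dot_q8(&A[i * nblocks], &B[j * nblocks], nblocks);
        }
    }
}
```

A vectorised kernel would typically replace the inner int8
multiply-accumulate loop with hardware instructions while keeping the
same integer-accumulate-then-scale structure.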

This change results in a 10% - 70% improvement
in total speed (i.e. all tokens / total time) across
various batch sizes.

The patch was tested with the Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <[email protected]>

amritahs-ibm committed Dec 30, 2024

1 parent 9ba399d commit 4147962