llamafile_sgemm API - INT8 implementation
This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the quantised int8 datatype.

This change results in a 10% - 70% improvement
in total speed (i.e. all tokens / total time)
across various batch sizes.

The patch was tested with the Meta-Llama-3-8B,
Mistral-7B and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.

Signed-off-by: Amrita H S <[email protected]>
amritahs-ibm committed Dec 30, 2024
1 parent 9ba399d commit 4147962
Showing 1 changed file with 773 additions and 69 deletions.