How to implement INT4/INT8 quantization and optimal way to use AVX instructions? #11
In Go, you can look at the "github.com/minio/simdjson-go" library for an example of working with SIMD and AVX. However, Go does not natively support SIMD intrinsics the way C or C++ do, so you need hand-written assembly functions (Go's Plan 9-style assembler, in separate .s files) to work directly with AVX instructions. Here is a basic example of how you could do INT8 multiplication using AVX-512 instructions in Go:

main.go:

package main
import (
"fmt"
"unsafe"
)
//go:generate go run asm.go
//go:noescape
func vecMulINT8(a, b, result *int8)
func main() {
a := make([]int8, 64)
b := make([]int8, 64)
result := make([]int8, 64)
// Fill the example vectors
for i := 0; i < 64; i++ {
a[i] = int8(i)
b[i] = int8(i + 1)
}
vecMulINT8(&a[0], &b[0], &result[0])
fmt.Println("Resultado da multiplicação INT8:")
for i := 0; i < 64; i++ {
fmt.Printf("%d * %d = %d\n", a[i], b[i], result[i])
}
}

asm.go (the Go program that generates the assembly):

// +build ignore
package main
import (
"log"
"os"
"text/template"
)
const tmpl = `
#include "textflag.h"
TEXT ·vecMulINT8(SB), NOSPLIT, $0-24
MOVQ a+0(FP), DI           // DI = &a[0]
MOVQ b+8(FP), SI           // SI = &b[0]
MOVQ result+16(FP), DX     // DX = &result[0]
VMOVDQU (DI), ZMM0         // load 64 int8 values from a
VPMOVSBW ZMM0, YMM1, YMM2  // intended: sign-extend the int8 lanes to int16
VMOVDQU (SI), ZMM0         // load 64 int8 values from b
VPMOVSBW ZMM0, YMM3, YMM4  // intended: sign-extend the int8 lanes to int16
VPMULLW YMM3, YMM1, YMM5   // intended: multiply one half of the int16 values
VPMULLW YMM4, YMM2, YMM6   // intended: multiply the other half
VPMOVSWB YMM5, YMM6, ZMM0  // intended: narrow the int16 products back to int8
VMOVDQU ZMM0, (DX)         // store 64 int8 results
RET
`
func main() {
t := template.Must(template.New("").Parse(tmpl))
f, err := os.Create("vec_mul_amd64.s")
if err != nil {
log.Fatalf("Failed to create vec_mul_amd64.s: %v", err)
}
defer f.Close()
err = t.Execute(f, nil)
if err != nil {
log.Fatalf("Failed to execute template: %v", err)
}
}

This example defines a function vecMulINT8 that takes three pointers into int8 arrays; the assembly code performs the INT8 multiplication using AVX-512 instructions, and the main function builds example arrays and calls vecMulINT8. Be aware that this is a simplified sketch rather than working code: Go's assembler spells the vector registers Z0/Y0 rather than ZMM0/YMM0, and the widening/narrowing instructions take different operand forms, so it needs adapting before it will assemble. It also doesn't handle saturation, so you'll need to adjust the code to deal with overflow cases.
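For the overflow handling, here is a scalar sketch of what the saturation step would have to do per element (the helper name clampInt8 is mine, not part of the example above):

// clampInt8 saturates a widened 16-bit product back into the int8 range,
// which is what a saturating narrowing instruction does in hardware.
func clampInt8(v int16) int8 {
	if v > 127 {
		return 127
	}
	if v < -128 {
		return -128
	}
	return int8(v)
}

Applying this to each int16 product before narrowing gives the same result a saturating vector pack would produce.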
I am aware that my question has nothing to do with the topic of this issue, but I just want to ask: is https://github.com/gotzmann/llama.go/blob/main/pkg/ml/ml.go an exact port of https://github.com/ggerganov/ggml/blob/master/src/ggml.c, i.e. a tensor program that runs exactly like ggml, but in Go? I am just getting started in ML and have little experience in C/C++ and Go, but I want to leverage the Go part. So I want to know whether I could run other models (e.g. the mnist example Georgi has already provided) with your ml.go? Thank you.
I've started grokking NEON and AVX2: https://github.com/gotzmann/llama.go/tree/avx-neon After looking into the topic, it seems the easiest way to start is to use the MinIO tooling as advanced by gorse: https://gorse.io/posts/avx512-in-golang.html After some long hours of segfaults on both my Mac and PC, I finally managed to fix the gotchas and build a version that is much easier on CPU load. There is no big speed improvement yet, and I suspect RAM becomes the actual bottleneck once the matrix math moves from the main CPU cores to the SIMD vector units.
Unfortunately, AVX-512 support is fragmentary across Intel processors; it was even removed recently from CPUs that were capable of it. So my idea is to support only AVX2, which is the de facto standard across generations of Intel / AMD processors, and eventually introduce AVX-512 later if it makes sense. From what I see, I'm 99% sure that after AVX2 the RAM speed will become the main bottleneck, not CPU performance itself.
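For choosing a code path at runtime rather than at build time, one option (an assumption on my side, not necessarily what llama.go does) is the golang.org/x/sys/cpu package:

package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

func main() {
	// Pick the widest vector extension the host actually supports.
	switch {
	case cpu.X86.HasAVX512F && cpu.X86.HasAVX512BW:
		fmt.Println("using AVX-512 kernels")
	case cpu.X86.HasAVX2:
		fmt.Println("using AVX2 kernels")
	default:
		fmt.Println("falling back to pure Go")
	}
}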
Yeah, thanks! The most annoying things here:
So one needs either to go down the rabbit hole of learning both how AVX/NEON works and the Plan 9 assembler exotics, or to rely on the C/C++ code bases and convert the needed parts from there.
@umarrudy - exactly :)
Basically yes, but there is still a chance that some matrix operations are not yet implemented within llama.go. I've looked into the code and it seems we have not converted
To use AVX2 instructions in Go, you can use assembly language and the go:generate directive. Here is an example of how to perform INT8 vector multiplication using AVX2 instructions in Go. Create a file called vecmul_avx2_amd64.s for the assembly code:

// +build amd64,!noasm
#include "textflag.h"
TEXT ·vecMulInt8AVX2(SB), NOSPLIT, $0
MOVQ a+0(FP), AX
MOVQ b+8(FP), BX
MOVQ result+16(FP), CX
MOVQ length+24(FP), DX
XORQ R8, R8
loop:
// Load vectors a and b into SIMD registers
VMOVDQU (AX)(R8*1), Y0
VMOVDQU (BX)(R8*1), Y1
// Convert from INT8 to INT16
VPUNPCKLBW Y0, Y2
VPUNPCKHBW Y0, Y0
VPMOVZXBW Y2, Y2
VPMOVZXBW Y0, Y0
VPUNPCKLBW Y1, Y3
VPUNPCKHBW Y1, Y1
VPMOVZXBW Y3, Y3
VPMOVZXBW Y1, Y1
// Perform the multiplication
VPMULLW Y2, Y3, Y2
VPMULLW Y0, Y1, Y0
// Pack the results back into INT8
VPACKUSWB Y2, Y0, Y0
VMOVDQU Y0, (CX)(R8*1)
ADDQ $32, R8
SUBQ $32, DX
JGT loop
RET

Next, create a Go file called vecmul.go to use the assembly function:

// +build amd64,!noasm
package main
import (
"fmt"
)
//go:noescape
func vecMulInt8AVX2(a, b, result []byte, length int)
func main() {
length := 32
a := make([]byte, length)
b := make([]byte, length)
result := make([]byte, length)
// Fill vectors a and b with example values
for i := 0; i < length; i++ {
a[i] = byte(i)
b[i] = byte(i + 1)
}
vecMulInt8AVX2(a, b, result, length)
for i := 0; i < length; i++ {
fmt.Printf("a[%d] * b[%d] = %d\n", i, i, result[i])
}
}

In this example, the vecmul_avx2_amd64.s file contains the assembly code that implements the vecMulInt8AVX2 function using AVX2 instructions, and vecmul.go declares this assembly function and performs the INT8 vector multiplication. To compile and run this code, you must have a processor that supports AVX2 instructions. Please note that support for SIMD in Go is not as extensive as in other languages like C or Rust, and this listing, too, is simplified: the argument offsets in the assembly assume raw pointers, while the Go declaration passes slices (which occupy three words each), so it would need adjusting before it runs.
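To keep the package building on machines without AVX2 (or on non-amd64 targets), the usual pattern is a pure-Go fallback guarded by the inverse build tags. A sketch, with the file name vecmul_noasm.go assumed; note that it simply truncates each product to 8 bits, while the assembly path packs with unsigned saturation (VPACKUSWB), so results can differ for large products:

// +build !amd64 noasm

package main

// vecMulInt8AVX2 is the pure-Go fallback that is compiled when the
// assembly version is excluded by build tags.
func vecMulInt8AVX2(a, b, result []byte, length int) {
	for i := 0; i < length; i++ {
		result[i] = a[i] * b[i] // wraps modulo 256 rather than saturating
	}
}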
I think Thank you.
Having lost some days to debugging sessions on my Mac and PC, I've finally managed to release the AVX2 and NEON optimisations with the v1.2 release :) It really helped offload the CPU and boosted performance by roughly 2x-4x depending on how fast your memory is. I'm going to dig into AVX2 more to support memory-aligned tensors and get even better performance with slightly changed code here and there.
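On the memory-aligned tensors idea: one way to get, for example, 32-byte-aligned float32 buffers in pure Go is to over-allocate and re-slice. A minimal sketch (the helper name alignedFloat32 is mine):

package main

import (
	"fmt"
	"unsafe"
)

// alignedFloat32 returns a slice of n float32 values whose backing data
// starts on an align-byte boundary (align must be a multiple of 4).
// The extra elements at the front of the allocation are simply wasted.
func alignedFloat32(n, align int) []float32 {
	buf := make([]float32, n+align/4)
	off := 0
	if rem := int(uintptr(unsafe.Pointer(&buf[0]))) % align; rem != 0 {
		off = (align - rem) / 4
	}
	return buf[off : off+n : off+n]
}

func main() {
	v := alignedFloat32(1024, 32)
	fmt.Println("32-byte aligned:", uintptr(unsafe.Pointer(&v[0]))%32 == 0)
}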
I'm a bit late to the party, but I thought I might share. I've been dabbling with C-to-Go assembly for a while, but in general the tooling is very poor. I had some free time this weekend and came up with this small utility to generate Go assembly; it's based off the gorse and MinIO stuff, but I had to rewrite most of it. The main idea is still the same though, using
In reply to: @gotzmann
Implementing INT4/INT8 quantization and using AVX instructions can be challenging, mainly due to the limited set of INT8 multiplication instructions. However, here are some ideas to help you get started:
Quantization (a minimal sketch follows this list):
AVX Instructions:
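On the quantization side, here is a sketch of one common approach, symmetric INT8 quantization with a single scale factor (ggml-style formats such as Q8_0 apply the same idea per small block of values rather than per tensor); the helper name is mine:

package main

import (
	"fmt"
	"math"
)

// quantizeInt8 performs symmetric quantization of a float32 slice into
// int8 values plus one scale factor, so that x[i] ≈ float32(q[i]) * scale.
func quantizeInt8(x []float32) (q []int8, scale float32) {
	var maxAbs float64
	for _, v := range x {
		if a := math.Abs(float64(v)); a > maxAbs {
			maxAbs = a
		}
	}
	q = make([]int8, len(x))
	if maxAbs == 0 {
		return q, 0
	}
	scale = float32(maxAbs / 127)
	for i, v := range x {
		q[i] = int8(math.Round(float64(v) / float64(scale)))
	}
	return q, scale
}

func main() {
	q, s := quantizeInt8([]float32{-1.5, 0.2, 0.9, 1.5})
	fmt.Println(q, s)
}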
To deal with the lack of specific INT8 multiplication instructions, you can try converting the INT8 data to INT16 before performing the multiplication. Here is a basic example of how you can do this:
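A scalar Go sketch of that conversion, without any SIMD (the function name is mine); an AVX path would do the same with sign-extending loads followed by 16-bit multiplies:

// vecMulInt8Widened multiplies two int8 vectors element-wise after
// widening to int16, so the products themselves cannot overflow.
// Narrowing the results back to int8 is deliberately left out here.
func vecMulInt8Widened(a, b []int8) []int16 {
	out := make([]int16, len(a))
	for i := range a {
		out[i] = int16(a[i]) * int16(b[i])
	}
	return out
}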
Please note that this example is simplified and may not be the most efficient. It also doesn't handle saturation, so you'll need to tweak the code as needed to handle overflow cases.
These ideas should help you get started with implementing INT4/INT8 quantization and using AVX instructions. Keep in mind that performance optimization is an iterative process, and you may need to experiment with various approaches to find the most efficient solution for your specific case. If you need more information, don't hesitate to reach out; I don't understand C++ as well as Go, but I'm at your disposal.