
ggml: x64: implement AMX dot product #7487

Closed · wanted to merge 4 commits

Conversation

ReinForce-II (Contributor) commented:

This PR introduces an AMX (Advanced Matrix Extensions) kernel for the vector dot product on the x64 architecture.
AMX is enabled when LLAMA_AMX=ON or LLAMA_NATIVE=ON is set in CMake on supported platforms.
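Based on the flags named above, a build enabling the kernel might look like the following (a sketch of the usual llama.cpp CMake workflow, not commands taken from the PR):

```shell
# Configure with the AMX kernel enabled (LLAMA_NATIVE=ON would also pick it
# up automatically on AMX-capable hardware).
cmake -B build -DLLAMA_AMX=ON
cmake --build build --config Release -j
```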

It performs a 16x16x16 tile dot product on 4-byte packed bf16 pairs instead of the quantized vector dot.

Here is the performance measured on a w9-3475X capped to a fixed 2.2 GHz frequency.

Q4_0

PR

./llama-bench -m ./Llama-2-7b-chat-q40.gguf -pg 0,0 -t 4,8,16

| model         | size     | params | backend | threads | test  | t/s          |
| ------------- | -------: | -----: | ------- | ------: | ----- | -----------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       4 | pp512 | 21.45 ± 0.54 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       4 | tg128 |  6.83 ± 0.26 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       8 | pp512 | 42.74 ± 2.58 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       8 | tg128 | 12.32 ± 0.71 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |      16 | pp512 | 84.09 ± 7.45 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |      16 | tg128 | 20.50 ± 1.52 |

build: ba1987f (2975)

master

./llama-bench -m ./Llama-2-7b-chat-q40.gguf -pg 0,0 -t 4,8,16

| model         | size     | params | backend | threads | test  | t/s          |
| ------------- | -------: | -----: | ------- | ------: | ----- | -----------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       4 | pp512 | 14.46 ± 0.02 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       4 | tg128 |  6.08 ± 0.24 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       8 | pp512 | 28.34 ± 1.19 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       8 | tg128 |  9.75 ± 0.62 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |      16 | pp512 | 55.79 ± 3.40 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |      16 | tg128 | 18.07 ± 1.35 |

build: cd93a28 (2972)

IQ4_XS

PR

./llama-bench -m ./Llama-2-7b-chat-iq4xs.gguf -pg 0,0 -t 4,8,16

| model                       | size     | params | backend | threads | test  | t/s          |
| --------------------------- | -------: | -----: | ------- | ------: | ----- | -----------: |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       4 | pp512 | 18.95 ± 0.44 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       4 | tg128 |  7.13 ± 0.35 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       8 | pp512 | 37.58 ± 2.00 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       8 | tg128 | 12.71 ± 0.73 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |      16 | pp512 | 74.06 ± 5.89 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |      16 | tg128 | 21.08 ± 1.91 |

build: ba1987f (2975)

master

./llama-bench -m ./Llama-2-7b-chat-iq4xs.gguf -pg 0,0 -t 4,8,16

| model                       | size     | params | backend | threads | test  | t/s          |
| --------------------------- | -------: | -----: | ------- | ------: | ----- | -----------: |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       4 | pp512 |  9.28 ± 0.12 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       4 | tg128 |  7.04 ± 0.29 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       8 | pp512 | 18.03 ± 0.40 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       8 | tg128 | 12.24 ± 0.87 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |      16 | pp512 | 34.11 ± 1.66 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |      16 | tg128 | 20.76 ± 1.84 |

build: cd93a28 (2972)

mofosyne added the labels "Review Complexity : High" (generally requires in-depth knowledge of LLMs or GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on May 23, 2024.
github-actions bot added the "build" (compilation issues) label on May 23, 2024.
github-actions bot (Contributor) commented on May 23, 2024:

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 554 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8403.85ms p(95)=22214.97ms fails=, finish reason: stop=505 truncated=49
  • Prompt processing (pp): avg=101.51tk/s p(95)=482.56tk/s
  • Token generation (tg): avg=46.61tk/s p(95)=46.65tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=amx_bf16 commit=0adedd712ed3959952db5147cbc271a2a42c2c7f

[Chart: llamacpp:prompt_tokens_seconds (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 554 iterations)]
[Chart: llamacpp:predicted_tokens_seconds (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 554 iterations)]

[Chart: llamacpp:kv_cache_usage_ratio (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 554 iterations)]
[Chart: llamacpp:requests_processing (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 554 iterations)]

teaalltr commented:

@ReinForce-II any news on this?

ggerganov (Owner) commented:

I think this work is superseded by the recent #8998

mingfeima closed this on Nov 8, 2024.