
ggml: x64: implement AMX dot product #7487

Closed · wanted to merge 4 commits

Conversation

ReinForce-II (Contributor) commented:

This PR introduces an AMX (Advanced Matrix Extensions) kernel for the vector dot product on the x64 architecture.
AMX is enabled when LLAMA_AMX=ON or LLAMA_NATIVE=ON is set in CMake on supported platforms.
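Based on the flags named above, a build enabling the kernel might look like the following (a sketch of the usual llama.cpp CMake workflow, not commands taken from the PR):

```shell
# Configure with the AMX kernel enabled (LLAMA_NATIVE=ON would also pick it
# up automatically on AMX-capable hardware).
cmake -B build -DLLAMA_AMX=ON
cmake --build build --config Release -j
```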

It performs a 16x16x16 tile dot product on 4-byte packed bf16 pairs instead of the quantized vector dot.

Here is the performance measured on a w9-3475X capped to a fixed 2.2 GHz frequency.

Q4_0

PR

./llama-bench -m ./Llama-2-7b-chat-q40.gguf -pg 0,0 -t 4,8,16

| model         | size     | params | backend | threads | test  | t/s          |
| ------------- | -------: | -----: | ------- | ------: | ----- | -----------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       4 | pp512 | 21.45 ± 0.54 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       4 | tg128 |  6.83 ± 0.26 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       8 | pp512 | 42.74 ± 2.58 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       8 | tg128 | 12.32 ± 0.71 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |      16 | pp512 | 84.09 ± 7.45 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |      16 | tg128 | 20.50 ± 1.52 |

build: ba1987f (2975)

master

./llama-bench -m ./Llama-2-7b-chat-q40.gguf -pg 0,0 -t 4,8,16

| model         | size     | params | backend | threads | test  | t/s          |
| ------------- | -------: | -----: | ------- | ------: | ----- | -----------: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       4 | pp512 | 14.46 ± 0.02 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       4 | tg128 |  6.08 ± 0.24 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       8 | pp512 | 28.34 ± 1.19 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |       8 | tg128 |  9.75 ± 0.62 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |      16 | pp512 | 55.79 ± 3.40 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU     |      16 | tg128 | 18.07 ± 1.35 |

build: cd93a28 (2972)

IQ4_XS

PR

./llama-bench -m ./Llama-2-7b-chat-iq4xs.gguf -pg 0,0 -t 4,8,16

| model                       | size     | params | backend | threads | test  | t/s          |
| --------------------------- | -------: | -----: | ------- | ------: | ----- | -----------: |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       4 | pp512 | 18.95 ± 0.44 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       4 | tg128 |  7.13 ± 0.35 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       8 | pp512 | 37.58 ± 2.00 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       8 | tg128 | 12.71 ± 0.73 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |      16 | pp512 | 74.06 ± 5.89 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |      16 | tg128 | 21.08 ± 1.91 |

build: ba1987f (2975)

master

./llama-bench -m ./Llama-2-7b-chat-iq4xs.gguf -pg 0,0 -t 4,8,16

| model                       | size     | params | backend | threads | test  | t/s          |
| --------------------------- | -------: | -----: | ------- | ------: | ----- | -----------: |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       4 | pp512 |  9.28 ± 0.12 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       4 | tg128 |  7.04 ± 0.29 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       8 | pp512 | 18.03 ± 0.40 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |       8 | tg128 | 12.24 ± 0.87 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |      16 | pp512 | 34.11 ± 1.66 |
| llama 7B IQ4_XS - 4.25 bpw  | 3.40 GiB | 6.74 B | CPU     |      16 | tg128 | 20.76 ± 1.84 |

build: cd93a28 (2972)

mofosyne added the labels "Review Complexity : High" (generally requires in-depth knowledge of LLMs or GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on May 23, 2024.
github-actions bot added the "build" (compilation issues) label on May 23, 2024.
github-actions bot (Contributor) commented on May 23, 2024:

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 554 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8403.85ms p(95)=22214.97ms fails=, finish reason: stop=505 truncated=49
  • Prompt processing (pp): avg=101.51tk/s p(95)=482.56tk/s
  • Token generation (tg): avg=46.61tk/s p(95)=46.65tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=amx_bf16 commit=0adedd712ed3959952db5147cbc271a2a42c2c7f

[Chart: llamacpp:prompt_tokens_seconds (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 554 iterations)]
[Chart: llamacpp:predicted_tokens_seconds (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 554 iterations)]

[Chart: llamacpp:kv_cache_usage_ratio (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 554 iterations)]
[Chart: llamacpp:requests_processing (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 554 iterations)]

teaalltr commented:

@ReinForce-II any news on this?

ggerganov (Owner) commented:

I think this work is superseded by the recent #8998

mingfeima closed this on Nov 8, 2024.