Misc. bug: Q4_0 with runtime repacking not working as expected (TYPE_Q4_0_4_4 REMOVED) #10757

Closed
smpurkis opened this issue Dec 10, 2024 · 24 comments · Fixed by #10890

Name and Version

❯ ./build/bin/llama-cli --version
version: 4295 (26a8406b)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for aarch64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-cli

Problem description & steps to reproduce

On my arm64 server, I updated llama.cpp yesterday and tried to run

./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct.Q4_0_4_4.gguf ...

and got the error message below. The error itself is expected because of PR #10446; it does not seem to be the bug that causes the slowdown.

gguf_init_from_file: tensor 'blk.0.attn_k.weight' of type 31: TYPE_Q4_0_4_4 REMOVED, use Q4_0 with runtime repacking

So I tried running a Q4_0 model instead, i.e. ./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct.q4_0.gguf, but it is much slower. I don't believe it is using runtime repacking; otherwise it would be as fast as previous builds running Q4_0_4_4.

To show what I mean, I ran four setups: llama.cpp before and after PR #10446, each with the Q4_0_4_4 and Q4_0 versions of a model. The models are from https://huggingface.co/bartowski/Qwen2.5-Coder-3B-Instruct-GGUF/tree/main

llama.cpp after PR

❯ ./build/bin/llama-cli --version
version: 4295 (26a8406b)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for aarch64-linux-gnu

llama.cpp before PR

❯ ./build/bin/llama-cli --version
version: 4067 (54ef9cfc)
built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for aarch64-linux-gnu

Running the commands

./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct.Q4_0.gguf -c 3200 --temp 0.0 --seed 0 -t 4 -n 200 -p "..."

and

./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct.Q4_0_4_4.gguf -c 3200 --temp 0.0 --seed 0 -t 4 -n 200 -p "..."

Key: (prompt t/s, generation t/s)

Model      Before PR          After PR
Q4_0       60 t/s, 10 t/s     9 t/s, 6 t/s
Q4_0_4_4   51 t/s, 11 t/s     N/A
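
To put the Q4_0 regression in relative terms, here is a small sketch that divides the "before" throughput by the "after" throughput. The numbers are copied from the table; the `slowdown` helper and the dictionary names are only illustrative.

```python
# Throughput figures for the Q4_0 model, copied from the table (tokens per second).
before = {"prompt": 60.0, "generation": 10.0}  # build 4067, before the PR
after = {"prompt": 9.0, "generation": 6.0}     # build 4295, after the PR


def slowdown(before_tps, after_tps):
    """Return how many times slower the newer build is."""
    return before_tps / after_tps


for phase in ("prompt", "generation"):
    factor = slowdown(before[phase], after[phase])
    print(f"{phase}: {factor:.1f}x slower")
```

With these figures, prompt processing comes out roughly 6.7x slower and generation roughly 1.7x slower.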

First Bad Commit

The first bad commit seems to be

❯ git bisect bad
c202cef1686182a78f8f4e253ab8d0c0ffe2fcc8 is the first bad commit
commit c202cef1686182a78f8f4e253ab8d0c0ffe2fcc8
Author: Shupei Fan <[email protected]>
Date:   Thu Nov 28 20:52:03 2024 +0800

    ggml-cpu: support IQ4_NL_4_4 by runtime repack (#10541)
    
    * ggml-cpu: support IQ4_NL_4_4 by runtime repack
    
    * ggml-cpu: add __ARM_FEATURE_DOTPROD guard

 ggml/include/ggml-cpu.h              |   1 +
 ggml/include/ggml.h                  |   3 +
 ggml/src/ggml-common.h               |   6 +
 ggml/src/ggml-cpu/ggml-cpu-aarch64.c | 321 +++++++++++++++++++++++++++++++++--
 ggml/src/ggml-cpu/ggml-cpu-aarch64.h |   2 +
 ggml/src/ggml-cpu/ggml-cpu.c         |  27 ++-
 ggml/src/ggml-cpu/ggml-cpu.cpp       |   2 +-
 ggml/src/ggml.c                      |   9 +
 8 files changed, 352 insertions(+), 19 deletions(-)

All commits checked

❯ git bisect log
git bisect start
# bad: [26a8406ba9198eb6fdd8329fa717555b4f77f05f] CUDA: fix shared memory access condition for mmv (#10740)
git bisect bad 26a8406ba9198eb6fdd8329fa717555b4f77f05f
# good: [811872a59daefb25fc0c4326bcb6d8ae893c2f7c] speculative : simplify the implementation (#10504)
git bisect good 811872a59daefb25fc0c4326bcb6d8ae893c2f7c
# bad: [5e1ed95583ca552a98d8528b73e1ff81249c2bf9] grammars : add English-only grammar (#10612)
git bisect bad 5e1ed95583ca552a98d8528b73e1ff81249c2bf9
# good: [2025fa67e94358deda4740a74fe9803916cb2f60] kompute : improve backend to pass test_backend_ops (#10542)
git bisect good 2025fa67e94358deda4740a74fe9803916cb2f60
# bad: [4b3242bbea172ac0980378496fbc676d44c4f459] ggml-cpu: fix typo in gemv/gemm iq4_nl_4_4 (#10580)
git bisect bad 4b3242bbea172ac0980378496fbc676d44c4f459
# bad: [6c595676899013102fdb0aa4b06a49954300c94a] server : (tests) don't use thread for capturing stdout/stderr, bump openai client library (#10568)
git bisect bad 6c595676899013102fdb0aa4b06a49954300c94a
# bad: [76b27d29c22af03172cf211a8a31025c7c828a57] ggml : fix row condition for i8mm kernels (#10561)
git bisect bad 76b27d29c22af03172cf211a8a31025c7c828a57
# bad: [eea986f215e1dc490654d012ccf2ab62fe8f606d] cmake : fix ARM feature detection (#10543)
git bisect bad eea986f215e1dc490654d012ccf2ab62fe8f606d
# bad: [c202cef1686182a78f8f4e253ab8d0c0ffe2fcc8] ggml-cpu: support IQ4_NL_4_4 by runtime repack (#10541)
git bisect bad c202cef1686182a78f8f4e253ab8d0c0ffe2fcc8
# first bad commit: [c202cef1686182a78f8f4e253ab8d0c0ffe2fcc8] ggml-cpu: support IQ4_NL_4_4 by runtime repack (#10541)
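
The bisect above was driven by hand. If someone wanted to automate it with `git bisect run`, the good/bad decision could be sketched roughly as below. This is not what was used for this report: the function name `bisect_exit_code` and the 30 t/s threshold are assumptions, with the threshold picked to sit between the roughly 50 t/s and 9 t/s prompt speeds seen in the good and bad runs. The exit codes follow the `git bisect run` convention (0 = good, 1 = bad, 125 = skip).

```python
import re

# Matches the prompt timing line printed at the end of a llama-cli run, e.g.
# "llama_perf_context_print: prompt eval time = ... ( ... ms per token, 50.76 tokens per second)"
PROMPT_RE = re.compile(r"prompt eval time\s*=.*?([\d.]+) tokens per second")


def bisect_exit_code(log_text, min_prompt_tps=30.0):
    """Map a llama-cli log to an exit code understood by `git bisect run`.

    0 = good, 1 = bad, 125 = skip (no timing line found, e.g. the build failed).
    The default threshold is an assumption, not a value from the project.
    """
    match = PROMPT_RE.search(log_text)
    if match is None:
        return 125
    return 0 if float(match.group(1)) >= min_prompt_tps else 1


good = "llama_perf_context_print: prompt eval time =    2147.19 ms /   109 tokens (   19.70 ms per token,    50.76 tokens per second)"
bad = "llama_perf_context_print: prompt eval time =   11496.71 ms /   109 tokens (  105.47 ms per token,     9.48 tokens per second)"
print(bisect_exit_code(good), bisect_exit_code(bad), bisect_exit_code(""))
```

A wrapper script would still have to rebuild at each commit, run llama-cli with the same prompt, capture its output (both stdout and stderr, to be safe), and exit with the returned code.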

Commands to check a good and a bad commit:
Compile by running rm -rdf build && cmake -B build && cmake --build build --config Release -j 4
Run using

./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct-Q4_0.gguf -c 3200 --temp 0.0 --seed 0 -t 4 -n 40 -p "Jane comes home from work and leaves her phone in the living room. She then goes out to the shops without her phone. Her husband, Dave, moves her phone from the living room to the bedroom. When Kane gets in, where does she look for her phone? 
    Select from the following options:
    A. The bedroom
    B. The kitchen
    C. The living room 
    D. The shed
    E. Under the cooker

    Think through the problem step by step before you give an answer.
    "

A good run looks like

build: 4206 (2025fa67) with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 39 key-value pairs and 434 tensors from ../text-generation-webui/models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 Coder 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = qwen-research
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Qwen2.5 Coder 3B
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv  13:                               general.tags arr[str,6]       = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv  14:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  15:                          qwen2.block_count u32              = 36
llama_model_loader: - kv  16:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  17:                     qwen2.embedding_length u32              = 2048
llama_model_loader: - kv  18:                  qwen2.feed_forward_length u32              = 11008
llama_model_loader: - kv  19:                 qwen2.attention.head_count u32              = 16
llama_model_loader: - kv  20:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  21:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                          general.file_type u32              = 2
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                      quantize.imatrix.file str              = /models_out/Qwen2.5-Coder-3B-Instruct...
llama_model_loader: - kv  36:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  37:             quantize.imatrix.entries_count i32              = 252
llama_model_loader: - kv  38:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  181 tensors
llama_model_loader: - type q4_0:  248 tensors
llama_model_loader: - type q4_1:    4 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 36
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.09 B
llm_load_print_meta: model size       = 1.70 GiB (4.72 BPW) 
llm_load_print_meta: general.name     = Qwen2.5 Coder 3B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors:  CPU_AARCH64 model buffer size =  1440.00 MiB
llm_load_tensors:   CPU_Mapped model buffer size =  1726.01 MiB
.......................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 3200
llama_new_context_with_model: n_ctx_per_seq = 3200
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (3200) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init:        CPU KV buffer size =   112.50 MiB
llama_new_context_with_model: KV self size  =  112.50 MiB, K (f16):   56.25 MiB, V (f16):   56.25 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =   300.75 MiB
llama_new_context_with_model: graph nodes  = 1266
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4

system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 

sampler seed: 0
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 3200, n_batch = 2048, n_predict = 40, n_keep = 0

Jane comes home from work and leaves her phone in the living room. She then goes out to the shops without her phone. Her husband, Dave, moves her phone from the living room to the bedroom. When Kane gets in, where does she look for her phone? 
  Select from the following options:
  A. The bedroom
  B. The kitchen
  C. The living room 
  D. The shed
  E. Under the cooker

  Think through the problem step by step before you give an answer.
   Let's analyze the situation step by step:
1. Kane comes home from work.
2. Dave moves Kane's phone from the living room to the bedroom.
3. Kane looks for her phone in

llama_perf_sampler_print:    sampling time =       6.83 ms /   149 runs   (    0.05 ms per token, 21828.30 tokens per second)
llama_perf_context_print:        load time =    1654.39 ms
llama_perf_context_print: prompt eval time =    2147.19 ms /   109 tokens (   19.70 ms per token,    50.76 tokens per second)
llama_perf_context_print:        eval time =    2831.59 ms /    39 runs   (   72.60 ms per token,    13.77 tokens per second)
llama_perf_context_print:       total time =    4996.44 ms /   148 tokens

A bad run looks like

build: 4207 (c202cef1) with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 39 key-value pairs and 434 tensors from ../text-generation-webui/models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 Coder 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = qwen-research
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Qwen2.5 Coder 3B
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv  13:                               general.tags arr[str,6]       = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv  14:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  15:                          qwen2.block_count u32              = 36
llama_model_loader: - kv  16:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  17:                     qwen2.embedding_length u32              = 2048
llama_model_loader: - kv  18:                  qwen2.feed_forward_length u32              = 11008
llama_model_loader: - kv  19:                 qwen2.attention.head_count u32              = 16
llama_model_loader: - kv  20:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  21:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                          general.file_type u32              = 2
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                      quantize.imatrix.file str              = /models_out/Qwen2.5-Coder-3B-Instruct...
llama_model_loader: - kv  36:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  37:             quantize.imatrix.entries_count i32              = 252
llama_model_loader: - kv  38:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  181 tensors
llama_model_loader: - type q4_0:  248 tensors
llama_model_loader: - type q4_1:    4 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 36
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.09 B
llm_load_print_meta: model size       = 1.70 GiB (4.72 BPW) 
llm_load_print_meta: general.name     = Qwen2.5 Coder 3B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors:   CPU_Mapped model buffer size =  1738.10 MiB
.......................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 3200
llama_new_context_with_model: n_ctx_per_seq = 3200
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (3200) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init:        CPU KV buffer size =   112.50 MiB
llama_new_context_with_model: KV self size  =  112.50 MiB, K (f16):   56.25 MiB, V (f16):   56.25 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =   300.75 MiB
llama_new_context_with_model: graph nodes  = 1266
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4

system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | 

sampler seed: 0
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 3200, n_batch = 2048, n_predict = 40, n_keep = 0

Jane comes home from work and leaves her phone in the living room. She then goes out to the shops without her phone. Her husband, Dave, moves her phone from the living room to the bedroom. When Kane gets in, where does she look for her phone? 
  Select from the following options:
  A. The bedroom
  B. The kitchen
  C. The living room 
  D. The shed
  E. Under the cooker

  Think through the problem step by step before you give an answer.
   Let's analyze the situation:
  1. Kane comes home from work and finds Dave's phone in the bedroom.
  2. Kane then goes to the living room to look for his own

llama_perf_sampler_print:    sampling time =       6.68 ms /   149 runs   (    0.04 ms per token, 22302.05 tokens per second)
llama_perf_context_print:        load time =     837.12 ms
llama_perf_context_print: prompt eval time =   11496.71 ms /   109 tokens (  105.47 ms per token,     9.48 tokens per second)
llama_perf_context_print:        eval time =    7783.24 ms /    39 runs   (  199.57 ms per token,     5.01 tokens per second)
llama_perf_context_print:       total time =   19297.38 ms /   148 tokens
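
One visible difference between the two logs is that the good run reports a "CPU_AARCH64 model buffer size" line under llm_load_tensors, while the bad run only reports CPU_Mapped. Treating that line as a sign that runtime repacking took place is an assumption drawn from comparing these two logs, not a documented guarantee. A minimal sketch for checking a saved log (the helper name `repack_buffer_line` is hypothetical):

```python
def repack_buffer_line(log_text):
    """Return the CPU_AARCH64 model-buffer line from a llama-cli log, or None."""
    for line in log_text.splitlines():
        if line.startswith("llm_load_tensors:") and "CPU_AARCH64 model buffer size" in line:
            return line.strip()
    return None


# Buffer lines copied from the good (build 4206) and bad (build 4207) runs.
good_log = (
    "llm_load_tensors:  CPU_AARCH64 model buffer size =  1440.00 MiB\n"
    "llm_load_tensors:   CPU_Mapped model buffer size =  1726.01 MiB"
)
bad_log = "llm_load_tensors:   CPU_Mapped model buffer size =  1738.10 MiB"

print("good run has repack buffer:", repack_buffer_line(good_log) is not None)
print("bad run has repack buffer:", repack_buffer_line(bad_log) is not None)
```

Note that both runs print AARCH64_REPACK = 1 in system_info, so that flag alone does not distinguish them.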

Relevant log output

❯ ./build/bin/llama-cli -m Qwen2.5-Coder-3B-Instruct-Q4_0.gguf -c 3200 --temp 0.0 --seed 0 -t 4 -n 200 -p "Jane comes home from work and leaves her phone in the living room. She then goes out to the shops without her phone. Her husband, Dave, moves her phone from the living room to the bedroom. When Kane gets in, where does she look for her phone? 
    Select from the following options:
    A. The bedroom
    B. The kitchen
    C. The living room 
    D. The shed
    E. Under the cooker

    Think through the problem step by step before you give an answer.
    "
build: 4295 (26a8406b) with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 39 key-value pairs and 434 tensors from ../text-generation-webui/models/Qwen2.5-Coder-3B-Instruct-Q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen2.5 Coder 3B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Qwen2.5-Coder
llama_model_loader: - kv   5:                         general.size_label str              = 3B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = qwen-research
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-3...
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Qwen2.5 Coder 3B
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-C...
llama_model_loader: - kv  13:                               general.tags arr[str,6]       = ["code", "codeqwen", "chat", "qwen", ...
llama_model_loader: - kv  14:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  15:                          qwen2.block_count u32              = 36
llama_model_loader: - kv  16:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv  17:                     qwen2.embedding_length u32              = 2048
llama_model_loader: - kv  18:                  qwen2.feed_forward_length u32              = 11008
llama_model_loader: - kv  19:                 qwen2.attention.head_count u32              = 16
llama_model_loader: - kv  20:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  21:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  22:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  23:                          general.file_type u32              = 2
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  31:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  32:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                      quantize.imatrix.file str              = /models_out/Qwen2.5-Coder-3B-Instruct...
llama_model_loader: - kv  36:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  37:             quantize.imatrix.entries_count i32              = 252
llama_model_loader: - kv  38:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  181 tensors
llama_model_loader: - type q4_0:  248 tensors
llama_model_loader: - type q4_1:    4 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 22
llm_load_vocab: token to piece cache size = 0.9310 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151936
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 36
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 3B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 3.09 B
llm_load_print_meta: model size       = 1.70 GiB (4.72 BPW) 
llm_load_print_meta: general.name     = Qwen2.5 Coder 3B Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors:   CPU_Mapped model buffer size =  1738.10 MiB
.......................................................................................
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 3200
llama_new_context_with_model: n_ctx_per_seq = 3200
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (3200) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init:        CPU KV buffer size =   112.50 MiB
llama_new_context_with_model: KV self size  =  112.50 MiB, K (f16):   56.25 MiB, V (f16):   56.25 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.58 MiB
llama_new_context_with_model:        CPU compute buffer size =   300.75 MiB
llama_new_context_with_model: graph nodes  = 1266
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4

system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

sampler seed: 0
sampler params: 
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = -1
        top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 3200, n_batch = 2048, n_predict = 200, n_keep = 0

Jane comes home from work and leaves her phone in the living room. She then goes out to the shops without her phone. Her husband, Dave, moves her phone from the living room to the bedroom. When Kane gets in, where does she look for her phone? 
  Select from the following options:
  A. The bedroom
  B. The kitchen
  C. The living room 
  D. The shed
  E. Under the cooker

  Think through the problem step by step before you give an answer.
   Let's analyze the situation:
  1. Kane comes home from work and finds Dave's phone in the bedroom.
  2. Kane then goes to the living room to look for his own phone.
  3. Since Kane's phone is in the living room, he would look for it there first.
  4. If Kane's phone is not in the living room, he would look for it in the bedroom.
  5. Since Kane's phone is in the bedroom, he would look for it there first.
  6. If Kane's phone is not in the bedroom, he would look for it in the living room.
  7. Since Kane's phone is in the bedroom, he would look for it there first.
  8. If Kane's phone is not in the bedroom, he would look for it in the living room.
  9. Since Kane's phone is in the bedroom, he would look for it there first.
 

llama_perf_sampler_print:    sampling time =      31.22 ms /   309 runs   (    0.10 ms per token,  9896.55 tokens per second)
llama_perf_context_print:        load time =     805.09 ms
llama_perf_context_print: prompt eval time =   12073.52 ms /   109 tokens (  110.77 ms per token,     9.03 tokens per second)
llama_perf_context_print:        eval time =   32584.03 ms /   199 runs   (  163.74 ms per token,     6.11 tokens per second)
llama_perf_context_print:       total time =   44734.56 ms /   308 tokens
@bartowski1182
Contributor

I would have thought maybe you compiled it wrong, but I see AARCH64_REPACK = 1, which I think means you've got everything set up properly. Hopefully this is an error on your end or an easy fix; I definitely enjoyed the performance uplift of the N_M models.

Can you try IQ4_NL? That should also have repacking, I think.

@Arnav0400

I am facing exactly the same issue. I am using a Raspberry Pi 5 (ARM Cortex-A76) for inference. Q4_0_4_4 seems to be much faster with an older commit, and the latest version no longer supports it (TYPE Q4_0_4_4 REMOVED). When using Q4_0 with the latest version, runtime repacking is not working at all -

llm_load_tensors: tensor 'token_embd.weight' (q4_0) (and 290 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: CPU_Mapped model buffer size = 3316.72 MiB

I went ahead and tested this on an x86_64 CPU and, to my surprise, the repacking worked as expected -

llm_load_tensors: tensor 'token_embd.weight' (q4_0) (and 66 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
llm_load_tensors: CPU_AARCH64 model buffer size = 2970.00 MiB
llm_load_tensors: CPU_Mapped model buffer size = 3292.53 MiB
repack: repack tensor blk.0.attn_q.weight with q4_0_8x8
repack: repack tensor blk.0.attn_k.weight with q4_0_8x8....

I can attach complete logs if needed, however @smpurkis has already uploaded the same in this thread.
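
For readers unfamiliar with what "repacking" refers to in the log above: the repacked layouts group several rows of a weight matrix so a kernel can load them together. The following is only a toy sketch of that row-interleaving idea on raw bytes. The function name and parameters are hypothetical, and this is not ggml's actual Q4_0 block layout (which also carries per-block scales).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy row interleave (hypothetical helper, NOT ggml's real layout).
 * src holds n_rows rows of row_bytes bytes each, stored one after another.
 * dst receives chunk 0 of every row, then chunk 1 of every row, and so on,
 * where a chunk is `chunk` consecutive bytes of one row. */
void interleave_rows(const uint8_t *src, uint8_t *dst,
                     size_t n_rows, size_t row_bytes, size_t chunk) {
    assert(chunk > 0 && row_bytes % chunk == 0);
    size_t out = 0;
    for (size_t c = 0; c < row_bytes / chunk; c++) {
        for (size_t r = 0; r < n_rows; r++) {
            memcpy(dst + out, src + r * row_bytes + c * chunk, chunk);
            out += chunk;
        }
    }
}
```

With two rows {0,1,2,3} and {10,11,12,13} and a chunk size of 2, the output order is 0,1,10,11,2,3,12,13. Grouping a fixed number of rows this way is also why the selection code checks that the row count is a multiple of 4 or 8.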

@Djip007
Contributor

Djip007 commented Dec 10, 2024

This PR added a check on "ggml_arm_arch_features.has_dotprod"

https://github.com/ggerganov/llama.cpp/pull/10541/files#diff-df54fa42aa6f755bff170db21c982e7d8ad853382ddbba98de7dde4789dbd53fL533

static const ggml::cpu::tensor_traits * ggml_aarch64_get_optimal_repack_type(const struct ggml_tensor * cur) {
    if (cur->type == GGML_TYPE_Q4_0) {
        if (ggml_cpu_has_avx2() || (ggml_cpu_has_sve() && ggml_cpu_has_matmul_int8() && ggml_cpu_get_sve_cnt() == QK8_0)) {
            if (cur->ne[1] % 8 == 0) {
                return &ggml::cpu::aarch64::q4_0_8x8_q8_0;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_matmul_int8()) {
            if (cur->ne[1] % 4 == 0) {
                return &ggml::cpu::aarch64::q4_0_4x8_q8_0;
            }
        }
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 4 == 0) {
                return &ggml::cpu::aarch64::q4_0_4x4_q8_0;
            }
        }
    } else if (cur->type == GGML_TYPE_IQ4_NL) {
        if (ggml_cpu_has_neon() && ggml_cpu_has_dotprod()) {
            if (cur->ne[1] % 4 == 0) {
                return &ggml::cpu::aarch64::iq4_nl_4x4_q8_0;
            }
        }
    }
    return nullptr;
}

#if defined(__linux__) && defined(__aarch64__)
uint32_t hwcap = getauxval(AT_HWCAP);
uint32_t hwcap2 = getauxval(AT_HWCAP2);
ggml_arm_arch_features.has_neon = !!(hwcap & HWCAP_ASIMD);
ggml_arm_arch_features.has_dotprod = !!(hwcap && HWCAP_ASIMDDP);
ggml_arm_arch_features.has_i8mm = !!(hwcap2 & HWCAP2_I8MM);
ggml_arm_arch_features.has_sve = !!(hwcap & HWCAP_SVE);
#if defined(__ARM_FEATURE_SVE)
ggml_arm_arch_features.sve_cnt = PR_SVE_VL_LEN_MASK & prctl(PR_SVE_GET_VL);
#endif

Can you check whether this line is wrong:

ggml_arm_arch_features.has_dotprod = !!(hwcap && HWCAP_ASIMDDP);
// => 
ggml_arm_arch_features.has_dotprod = !!(hwcap & HWCAP_ASIMDDP);
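
A minimal sketch of why the two forms differ. The DEMO_* constants and function names are illustrative and not part of ggml; the bit positions follow the Linux arm64 HWCAP layout (ASIMD is bit 1, ASIMDDP is bit 20).

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative capability bits; DEMO_* names avoid clashing with the
 * real HWCAP_* macros from the system headers. */
#define DEMO_HWCAP_ASIMD   (1u << 1)
#define DEMO_HWCAP_ASIMDDP (1u << 20)

/* Logical AND: true whenever hwcap is non-zero, so any CPU that reports
 * at least one capability is treated as having dotprod. */
int has_dotprod_logical_and(uint32_t hwcap) {
    return !!(hwcap && DEMO_HWCAP_ASIMDDP);
}

/* Bitwise AND: isolates the single dotprod capability bit. */
int has_dotprod_bitwise_and(uint32_t hwcap) {
    return !!(hwcap & DEMO_HWCAP_ASIMDDP);
}

/* A CPU with NEON but without dotprod shows the difference. */
void check_dotprod_detection(void) {
    uint32_t neon_only = DEMO_HWCAP_ASIMD;
    assert(has_dotprod_logical_and(neon_only) == 1); /* false positive */
    assert(has_dotprod_bitwise_and(neon_only) == 0); /* correct */
}
```

Note that the `&&` form can only over-report dotprod, so on its own it would not explain repacking being skipped.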

@smpurkis
Author

I just tried that change from commit 92f77a6 and it is still not applying the runtime repacking on arm64 cpu.

@slaren
Collaborator

slaren commented Dec 11, 2024

system_info: n_threads = 4 (n_threads_batch = 4) / 4 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

This indicates that llama.cpp was built without support for dotprod or i8mm, instruction sets that may be necessary to take advantage of the repacked types.
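
To make the consequence concrete, here is a simplified sketch of the NEON branches of the selection function quoted earlier in this thread. The struct, enum, and function names are hypothetical stand-ins for the ggml_cpu_has_*() queries, and the AVX2 and SVE paths are left out.

```c
#include <assert.h>

/* Hypothetical stand-in for the ggml_cpu_has_*() feature queries. */
struct cpu_features {
    int neon;
    int dotprod;
    int i8mm;
};

enum repack_kind {
    REPACK_NONE,      /* tensor stays in plain Q4_0 */
    REPACK_Q4_0_4X4,  /* needs NEON + dotprod */
    REPACK_Q4_0_4X8   /* needs NEON + i8mm */
};

/* Pick a repacked Q4_0 layout for a tensor with n_rows rows (ne[1]). */
enum repack_kind pick_q4_0_repack(struct cpu_features f, long n_rows) {
    assert(n_rows > 0);
    if (f.neon && f.i8mm && n_rows % 4 == 0) {
        return REPACK_Q4_0_4X8;
    }
    if (f.neon && f.dotprod && n_rows % 4 == 0) {
        return REPACK_Q4_0_4X4;
    }
    return REPACK_NONE;
}
```

With only NEON reported, as in the system_info line above, every call returns REPACK_NONE, which matches the observation that no tensor is repacked.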

@elliotkorte

elliotkorte commented Dec 12, 2024

I'm running on a single board computer which uses the Rockchip RK3588 chip. Q4_0 models are much slower than Q4_0_4_4 were, and I don't see anything in the verbose model output about repacking. I have confirmed the compiler flag GGML_USE_CPU_AARCH64 is being used.

Just curious, why completely remove support for Q4_0_X_X models? What's the harm in allowing both online repack and pretuned files for ARM optimizations? At runtime I know you can better ensure correct optimizations are applied for the platform being used, but for my use case fast model instantiation is a higher priority (how fast is the repack?). Seems like there should have at least been a deprecation with a warning so you can alert users and give them time to update instead of breaking their apps.

@bartowski1182
Contributor

Yeah, personally I also don't love it when thousands of files out in the wild are all simultaneously rendered useless by an update. I get that this was bleeding-edge stuff, but as mentioned, online repacking is slower because of the increased load time, and on top of that anyone who has a Q4_0_X_Y file now has to go find the equivalent Q4_0 and download it just to maintain functionality.

I can accept using online repacking for all support going forward, but removing the ability to run pre-repacked models feels off, especially because I would almost push for the ability to repack locally, AKA download Q4_0, run ./llama-cli repack -m MyModel-Q4_0.gguf -o MyRepackedModel and have it automatically repack the model into the optimized local format for faster loading.

@ekcrisp

ekcrisp commented Dec 12, 2024

I can accept using online repacking for all support going forward

I guess the question is what is meant by "support" here. Is there added complexity, or are there issues, with allowing both? The only justification for removal I can think of is that we want to ensure optimal repacks are always used. It seems the only difference in logic, if you let people run pre-repacked models, is that you don't have to perform the repack on instantiation. This wouldn't add complexity or make things harder to maintain in llama.cpp, right? If repacking is very fast (say, a fraction of a second on cheap hardware), then I suppose there isn't much of a tradeoff, but I don't know whether this is the case. @bartowski1182 your suggestion of doing a build-time repack seems like a good approach for applications which are not being deployed to different hardware platforms.

@slaren
Collaborator

slaren commented Dec 12, 2024

Added a note to the changelog. The goal is to reduce the number of quants that need to be made. We have several repack types and I expect that more will be added over time, and the current approach does not scale. Unfortunately a consequence of the refactor that will allow us to add more repacking types more easily is that it requires removing the tensor types, and support had to be dropped. It was going to happen sooner or later either way, so it is better to do it now.

If you have an issue with the performance of online repacking, open an issue and it will be improved over time.

@ekcrisp

ekcrisp commented Dec 12, 2024

The goal is to reduce the number of quants that need to be made.

What is wrong with having more quants for people to distribute if they choose? What about this doesn't scale? I think developers should be able to choose how they want to repack their models, no? If there is complexity in validating the tensor type up front, couldn't we have an explicit "skip repack" option and provide the metadata needed to run the gguf as if it were repacked at runtime? If devs do this incorrectly and have issues, it would be on them. Something like "Warning: runtime repack was skipped, tensor types cannot be validated". The build-time repack suggestion from @bartowski1182 definitely has value and would be better for applications that need fast instantiation and know what hardware they'll be running on.

@bartowski1182
Contributor

I do agree that adding potentially 3-6 repacked varieties for potentially every quant type available and all future ones is likely not scalable

I think what would be nice is similar to what I mentioned, allowing the saving of a quant type to repacked format after the fact, and then some way to recognize on load that it's a repacked model and how to load it accordingly

This would allow for Q4_0_4_4 to continue existing but without needing to be manually created each time, while benefiting from improved load times for people who care about it, especially as the RAM capacities of ARM machines scale and we start loading 70B+ models

@ekcrisp

ekcrisp commented Dec 12, 2024

It's not scalable if there are highly specific repacks for many different platforms and the only way to leverage them is to distribute them and have users cross-reference what they need with tensor type documentation. If you can essentially "cache" what is done at the repack stage, then you don't have to distribute many tensor types. If people choose to, so be it, but in llama.cpp it wouldn't matter: it would be treated as a user-supplied "cache" of a repack, and special load-time logic wouldn't be needed.

@Djip007
Contributor

Djip007 commented Dec 12, 2024

I just tried that change from commit 92f77a6 and it is still not applying the runtime repacking on arm64 cpu.

can you try to print the 4 values:

     ggml_arm_arch_features.has_neon = !!(hwcap & HWCAP_ASIMD); 
     ggml_arm_arch_features.has_dotprod = !!(hwcap && HWCAP_ASIMDDP); 
     ggml_arm_arch_features.has_i8mm = !!(hwcap2 & HWCAP2_I8MM); 
     ggml_arm_arch_features.has_sve  = !!(hwcap & HWCAP_SVE); 

It looks like dotprod support is not detected,

or it was not built with __ARM_FEATURE_DOTPROD

int ggml_cpu_has_dotprod(void) {
#if defined(__ARM_ARCH) && defined(__ARM_FEATURE_DOTPROD)
    return ggml_arm_arch_features.has_dotprod;
#else
    return 0;
#endif
}

https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html => +dotprod ?
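
A small standalone probe, as a sketch only, that separates the two conditions combined in the excerpt above: whether the compiler defined __ARM_FEATURE_DOTPROD for this build, and whether the kernel reports dotprod at runtime. The function names are hypothetical, the fallback value for HWCAP_ASIMDDP is an assumption taken from the Linux arm64 headers, and on non-aarch64 or non-Linux systems the runtime check simply reports 0.

```c
#include <assert.h>
#include <stdio.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#ifndef HWCAP_ASIMDDP
#define HWCAP_ASIMDDP (1 << 20) /* assumed value, from Linux arm64 headers */
#endif
#endif

/* 1 if the compiler targeted dotprod for this build (for example via a
 * -march value that includes +dotprod), 0 otherwise. */
int dotprod_compiled_in(void) {
#if defined(__ARM_FEATURE_DOTPROD)
    return 1;
#else
    return 0;
#endif
}

/* 1 if the running kernel reports dotprod; 0 elsewhere or when unknown. */
int dotprod_at_runtime(void) {
#if defined(__linux__) && defined(__aarch64__)
    return !!(getauxval(AT_HWCAP) & HWCAP_ASIMDDP);
#else
    return 0;
#endif
}

/* Both conditions must hold for the dotprod kernels to be selected. */
int dotprod_usable(void) {
    return dotprod_compiled_in() && dotprod_at_runtime();
}

void print_dotprod_report(void) {
    int usable = dotprod_usable();
    assert(!usable || dotprod_compiled_in()); /* usable implies compiled in */
    printf("compiled: %d, runtime: %d, usable: %d\n",
           dotprod_compiled_in(), dotprod_at_runtime(), usable);
}
```

If this prints runtime: 1 but compiled: 0, the hardware supports dotprod and the build flags are what prevent its use.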

@Djip007
Contributor

Djip007 commented Dec 12, 2024

set(MARCH_FLAGS "${MARCH_FLAGS}+dotprod")
list(APPEND ARCH_DEFINITIONS __ARM_FEATURE_DOTPROD)

is only added for APPLE

if I'm right:

if (${CMAKE_SYSTEM_PROCESSOR} MATCHES "armv8")
    # Android arm64-v8a
    # Raspberry Pi 3, 4, Zero 2 (32-bit)
    list(APPEND ARCH_FLAGS -mno-unaligned-access)
endif()

Nothing is done for the other cases: neither fp16 nor dot8 appears to be checked.
Did I miss something?

@smpurkis
Author

smpurkis commented Dec 13, 2024

I just tried that change from commit 92f77a6 and it is still not applying the runtime repacking on arm64 cpu.

can you try to print the 4 values:

     ggml_arm_arch_features.has_neon = !!(hwcap & HWCAP_ASIMD); 
     ggml_arm_arch_features.has_dotprod = !!(hwcap && HWCAP_ASIMDDP); 
     ggml_arm_arch_features.has_i8mm = !!(hwcap2 & HWCAP2_I8MM); 
     ggml_arm_arch_features.has_sve  = !!(hwcap & HWCAP_SVE); 

It looks like dotprod support is not detected,

or it was not built with __ARM_FEATURE_DOTPROD

int ggml_cpu_has_dotprod(void) {
#if defined(__ARM_ARCH) && defined(__ARM_FEATURE_DOTPROD)
    return ggml_arm_arch_features.has_dotprod;
#else
    return 0;
#endif
}

https://gcc.gnu.org/onlinedocs/gcc/ARM-Options.html => +dotprod ?

Running on commit 64ae065

Adding prints as follows:

    ggml_arm_arch_features.has_neon = !!(hwcap & HWCAP_ASIMD);
    ggml_arm_arch_features.has_dotprod = !!(hwcap & HWCAP_ASIMDDP);
    ggml_arm_arch_features.has_i8mm = !!(hwcap2 & HWCAP2_I8MM);
    ggml_arm_arch_features.has_sve  = !!(hwcap & HWCAP_SVE);

    printf("NEON: %d\n", ggml_arm_arch_features.has_neon);
    printf("DotProd: %d\n", ggml_arm_arch_features.has_dotprod);
    printf("I8MM: %d\n", ggml_arm_arch_features.has_i8mm);
    printf("SVE: %d\n", ggml_arm_arch_features.has_sve);

The output when I run the build and run commands from the bug report:

NEON: 1
DotProd: 1
I8MM: 0
SVE: 0
build: 4320 (64ae0655) with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for aarch64-linux-gnu
...
llama_perf_sampler_print:    sampling time =       6.43 ms /   149 runs   (    0.04 ms per token, 23176.23 tokens per second)
llama_perf_context_print:        load time =    1202.72 ms
llama_perf_context_print: prompt eval time =   11681.50 ms /   109 tokens (  107.17 ms per token,     9.33 tokens per second)
llama_perf_context_print:        eval time =    6734.92 ms /    39 runs   (  172.69 ms per token,     5.79 tokens per second)
llama_perf_context_print:       total time =   18434.23 ms /   148 tokens
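
For comparing runs across builds, the tokens-per-second figures in these logs are consistent with token count divided by elapsed time. A minimal sketch of that arithmetic follows; the function name is hypothetical and is not llama.cpp's own code.

```c
#include <assert.h>

/* Throughput in tokens per second from a token count and elapsed
 * milliseconds, as reported in the perf lines above. */
double tokens_per_second(int n_tokens, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    return n_tokens / (elapsed_ms / 1000.0);
}
```

For the run above, 109 prompt tokens in 11681.50 ms gives about 9.33 tokens per second, and 39 generated tokens in 6734.92 ms gives about 5.79, matching the log.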

@juliensimon

If you have an issue with the performance of online repacking, open an issue, and it will be improved over time.

FWIW, I see exactly the same problem with my Q4_0_4_4 models on Graviton. Online repacking doesn't seem to work with the default build.

CMAKE_ARGS="-DGGML_CPU_AARCH64=ON" cmake --build build --config Release

system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CPU : NEON = 1 | ARM_FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

Are there any extra flags we should build with?

@Djip007
Contributor

Djip007 commented Dec 13, 2024

Can't find the power supply for my ROCK5, so it is hard to test.

NEON: 1
DotProd: 1
I8MM: 0
SVE: 0

Looks good with the patch.

For RK3588 and Pi 5, can you test a build with

CMAKE_ARGS="-DGGML_CPU_AARCH64=ON" cmake --build build --config Release -march=armv8.2a+dotprod

But I don't know whether the defines __ARM_FEATURE_DOTPROD / __ARM_NEON are set, so maybe print what these return:

int ggml_cpu_has_neon(void)
int ggml_cpu_has_dotprod(void)

    ggml_arm_arch_features.has_neon = !!(hwcap & HWCAP_ASIMD);
    ggml_arm_arch_features.has_dotprod = !!(hwcap & HWCAP_ASIMDDP);
    ggml_arm_arch_features.has_i8mm = !!(hwcap2 & HWCAP2_I8MM);
    ggml_arm_arch_features.has_sve  = !!(hwcap & HWCAP_SVE);

    printf("NEON: %d/%d\n", ggml_arm_arch_features.has_neon, ggml_cpu_has_neon());
    printf("DotProd: %d/%d\n", ggml_arm_arch_features.has_dotprod, ggml_cpu_has_dotprod());
    printf("I8MM: %d\n", ggml_arm_arch_features.has_i8mm);
    printf("SVE: %d\n", ggml_arm_arch_features.has_sve);

Maybe use -march=native, but I don't know if it works on ARM.

If __ARM_FEATURE_DOTPROD was never enabled, even LLAMAFILE is not used with Q4_0. If someone can make the repacking work, I would be curious to see the performance of the different paths (ggml / llamafile / repacking).

It may also be necessary to deal with "ggml_cpu_has_dotprod/..." in the other kernels (llamafile / ...).

gcc:
-march=native causes the compiler to auto-detect the architecture of the build computer. At present, this feature is only supported on GNU/Linux, and not all architectures are recognized. If the auto-detect is unsuccessful the option has no effect.

So on Linux you can maybe try it.

@ekcrisp

ekcrisp commented Dec 14, 2024

Neither IQ4_NL_4_4 nor Q4_0 formats are repacking for me on Rockchip boards running Debian or Ubuntu

@Djip007
Contributor

Djip007 commented Dec 14, 2024

Neither IQ4_NL_4_4 nor Q4_0 formats are repacking for me on Rockchip boards running Debian or Ubuntu

What do you use for build?

(IQ4_NL_4_4 => IQ4_NL)

@ekcrisp

ekcrisp commented Dec 15, 2024

Neither IQ4_NL_4_4 nor Q4_0 formats are repacking for me on Rockchip boards running Debian or Ubuntu

What do you use for build?

(IQ4_NL_4_4 => IQ4_NL)

Yes I meant IQ4_NL. This was the model I used, also this one for Q4_0. I am running llama.cpp with commit hash 26a8406

@Djip007
Contributor

Djip007 commented Dec 17, 2024

What do you use for build?

I mean which cmake command.

CMAKE_ARGS="-DGGML_CPU_AARCH64=ON" cmake --build build --config Release -march=armv8.2a+dotprod

@smpurkis
Author

I got it working on the first bad commit, i.e. c202cef1686182a78f8f4e253ab8d0c0ffe2fcc8, with the following change:
in ggml/src/ggml-cpu/ggml-cpu-aarch64.c, remove ggml_cpu_has_dotprod() from the if statements on lines 533, 1118 and 3837.

Going to try it on latest master to see if the same thing works.

@smpurkis
Author

smpurkis commented Dec 18, 2024

It seems that on master the GEMV NEON asm code has been removed (in #10567), which significantly reduces the generation performance I get on the ARM Ampere A1 CPU, which I don't believe has dotprod. Making the same change as above does fix the prompt processing speed, which uses GEMM. This may warrant a separate issue, though.
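
For context on why the two speeds can move independently: generation handles one token at a time, so each weight multiply is a matrix-vector product (GEMV), while prompt processing handles a batch of tokens, which is a matrix-matrix product (GEMM). The sketch below shows that relationship with plain float reference loops. The function names are hypothetical, and these are not the quantized NEON kernels discussed here.

```c
#include <assert.h>
#include <stddef.h>

/* C[m x n] = A[m x k] * B[k x n], all row-major. Naive reference loop. */
void gemm_f32(size_t m, size_t n, size_t k,
              const float *a, const float *b, float *c) {
    assert(a && b && c);
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/* y[m] = A[m x k] * x[k]: the single-column (n == 1) case of GEMM. */
void gemv_f32(size_t m, size_t k, const float *a, const float *x, float *y) {
    gemm_f32(m, 1, k, a, x, y);
}
```

Optimized libraries usually keep separate GEMV and GEMM kernels because the two cases have different memory access patterns, so removing one kernel can slow generation while leaving prompt processing unaffected.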

@smpurkis
Author

smpurkis commented Dec 18, 2024

I've opened #10889 that restores the performance on current master, i.e. 7bbb5ac

@slaren slaren linked a pull request Dec 18, 2024 that will close this issue