Feature Request: add DeepSeek-v3 support #10981

Open
4 tasks done
RodriMora opened this issue Dec 26, 2024 · 26 comments · May be fixed by #11049
Labels
enhancement New feature or request

Comments

@RodriMora

RodriMora commented Dec 26, 2024

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • Version b4391
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Add support for DeepSeek-v3

https://huggingface.co/deepseek-ai/DeepSeek-V3

Currently not supported:

ERROR:hf-to-gguf:Model DeepseekV3ForCausalLM is not supported

Motivation

DeepSeek-v3 is a big MoE model of 685B params; llama.cpp support would be great, since offloading to RAM will be a must for most systems.

Possible Implementation

There is no model card or technical report yet. I don't know how much different from v2 it is.

Edit: they have uploaded the model card and paper:
https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/README.md

@RodriMora RodriMora added the enhancement New feature or request label Dec 26, 2024
@nisten

nisten commented Dec 26, 2024

The sigmoid routing is a bit different, but the rest of the arch is largely the same as DeepSeek 2.5, just larger.

There's no PR yet in HF transformers; it looks like they've built this atop transformers 4.33, so that will be quite a merge to get right, I guess.
So it's not that hard to implement given the modelling code, but it's still hard because, again, it's a "2nd gen" MoE with a routing pool.

@web-traveler

In case it helps: transformers 4.46.3 is listed here: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/requirements.txt

@arthurwolf

What's missing to get this to work, and can one do anything to help?

@Nottlespike

What's missing to get this to work, and can one do anything to help?

huggingface/transformers#35425

@cpumaxx
Contributor

cpumaxx commented Dec 31, 2024

Can a dev help break down for us what would be required in convert_hf_to_gguf.py to at least get a GGUF created with the new model bits, e.g. mlp.gate.e_score_correction_bias, etc.?
Without a GGUF, I think implementing multi-token prediction and the other novel parts of this model's inference architecture will be much harder.
I'm happy to help if I can, but the papers and deepseek2 diffs aren't giving me as much of a clue as I was hoping.

@web-traveler

web-traveler commented Dec 31, 2024

@cpumaxx
Contributor

cpumaxx commented Dec 31, 2024

@fairydreaming : How much more work is needed before you can accept collaborators and testers on your branch? I see on localllama that you have at least a PoC running.

@fairydreaming
Collaborator

fairydreaming commented Dec 31, 2024

@fairydreaming : How much more work is needed before you can accept collaborators and testers on your branch? I see on localllama that you have at least a PoC running.

I still have to add a new pre-tokenizer regex and test the tokenization. I'm not sure how many weird regex quirks I'll encounter along the way, but I estimate it will take a few days at most.

Edit: Also, I don't have MTP implemented, but it can be added later.

@Nottlespike

@fairydreaming : How much more work is needed before you can accept collaborators and testers on your branch? I see on localllama that you have at least a PoC running.

I still have to add a new pre-tokenizer regex and test the tokenization. I'm not sure how many weird regex quirks I'll encounter along the way, but I estimate it will take a few days at most.

Edit: Also, I don't have MTP implemented, but it can be added later.

You can do this without official HF transformers support and without trust_remote_code=True?
This is my main concern and why I'm working with HF on an official HF transformers implementation.
huggingface/transformers#35425 (comment)
What is the branch and how can I help?

@fairydreaming
Collaborator

My DeepSeek-V3 branch is here: https://github.com/fairydreaming/llama.cpp/tree/deepseek-v3

To convert the model to GGUF you need a dequantized DeepSeek V3. You can download it from HF (there are several BF16 DeepSeek V3 models available, but I didn't test any of them) or run the inference/fp8_cast_bf16.py script from the original model repo to convert it to bf16 (that's what I did). Note that it uses triton, so I think you need a GPU for this. In case you experience CUDA out-of-memory errors during conversion, check this: https://huggingface.co/deepseek-ai/DeepSeek-V3/discussions/17

There are some minor tokenization differences compared to the original model, but I think it's usable.
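
A rough sketch of that pipeline, with placeholder paths (the fp8_cast_bf16.py flag names are taken from the DeepSeek-V3 inference scripts and may differ in your checkout, so double-check before running):

# 1. Dequantize the original FP8 checkpoint to BF16 (uses triton, so a GPU is needed)
python inference/fp8_cast_bf16.py --input-fp8-hf-path /models/DeepSeek-V3 --output-bf16-hf-path /models/DeepSeek-V3-BF16

# 2. Convert the BF16 checkpoint to GGUF with the conversion script from the deepseek-v3 branch
python convert_hf_to_gguf.py /models/DeepSeek-V3-BF16 --outtype bf16 --outfile /models/deepseek-v3-bf16.gguf

# 3. Optionally quantize, e.g. to Q4_K_S as used for the perplexity numbers below
./build/bin/llama-quantize /models/deepseek-v3-bf16.gguf /models/deepseek-v3-Q4_K_S.gguf Q4_K_S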

@fairydreaming
Collaborator

Some initial perplexity values over wiki.test.raw (not a full run) with Q4_K_S quantized model:

$ ./build/bin/llama-perplexity --numa distribute -t 32 -m /mnt/md0/models/deepseek-v3-Q4_K_S.gguf --no-context-shift -f ../perplexity/wikitext-2-raw/wiki.test.raw
build: 4407 (ad77e9b3) with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
llama_model_loader: loaded meta data with 43 key-value pairs and 1025 tensors from /mnt/md0/models/deepseek-v3-Q4_K_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Models Deepseek Ai DeepSeek V3 Bf16
llama_model_loader: - kv   3:                         general.size_label str              = 256x20B
llama_model_loader: - kv   4:                      deepseek2.block_count u32              = 61
llama_model_loader: - kv   5:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv   6:                 deepseek2.embedding_length u32              = 7168
llama_model_loader: - kv   7:              deepseek2.feed_forward_length u32              = 18432
llama_model_loader: - kv   8:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv   9:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  10:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  12:                deepseek2.expert_used_count u32              = 8
llama_model_loader: - kv  13:                          general.file_type u32              = 14
llama_model_loader: - kv  14:        deepseek2.leading_dense_block_count u32              = 3
llama_model_loader: - kv  15:                       deepseek2.vocab_size u32              = 129280
llama_model_loader: - kv  16:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  17:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  18:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  19:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  20:       deepseek2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  21:                     deepseek2.expert_count u32              = 256
llama_model_loader: - kv  22:              deepseek2.expert_shared_count u32              = 1
llama_model_loader: - kv  23:             deepseek2.expert_weights_scale f32              = 2.500000
llama_model_loader: - kv  24:              deepseek2.expert_weights_norm bool             = true
llama_model_loader: - kv  25:               deepseek2.expert_gating_func u32              = 2
llama_model_loader: - kv  26:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  27:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  28:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  29: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  30: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  31:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  32:                         tokenizer.ggml.pre str              = deepseek-v3
llama_model_loader: - kv  33:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv  34:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  35:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv  36:                tokenizer.ggml.bos_token_id u32              = 0
llama_model_loader: - kv  37:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  38:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  39:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  40:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  41:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  42:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  361 tensors
llama_model_loader: - type q4_K:  652 tensors
llama_model_loader: - type q5_K:   11 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 818
llm_load_vocab: token to piece cache size = 0.8223 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = deepseek2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 129280
llm_load_print_meta: n_merges         = 127741
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 163840
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_layer          = 61
llm_load_print_meta: n_head           = 128
llm_load_print_meta: n_head_kv        = 128
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 192
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 24576
llm_load_print_meta: n_embd_v_gqa     = 16384
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18432
llm_load_print_meta: n_expert         = 256
llm_load_print_meta: n_expert_used    = 8
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.025
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 671B
llm_load_print_meta: model ftype      = Q4_K - Small
llm_load_print_meta: model params     = 671.03 B
llm_load_print_meta: model size       = 353.90 GiB (4.53 BPW) 
llm_load_print_meta: general.name     = Models Deepseek Ai DeepSeek V3 Bf16
llm_load_print_meta: BOS token        = 0 '<|begin▁of▁sentence|>'
llm_load_print_meta: EOS token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: EOT token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: PAD token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: LF token         = 131 'Ä'
llm_load_print_meta: FIM PRE token    = 128801 '<|fim▁begin|>'
llm_load_print_meta: FIM SUF token    = 128800 '<|fim▁hole|>'
llm_load_print_meta: FIM MID token    = 128802 '<|fim▁end|>'
llm_load_print_meta: EOG token        = 1 '<|end▁of▁sentence|>'
llm_load_print_meta: max token length = 256
llm_load_print_meta: n_layer_dense_lead   = 3
llm_load_print_meta: n_lora_q             = 1536
llm_load_print_meta: n_lora_kv            = 512
llm_load_print_meta: n_ff_exp             = 2048
llm_load_print_meta: n_expert_shared      = 1
llm_load_print_meta: expert_weights_scale = 2.5
llm_load_print_meta: expert_weights_norm  = 1
llm_load_print_meta: expert_gating_func   = sigmoid
llm_load_print_meta: rope_yarn_log_mul    = 0.1000
llm_load_tensors:   CPU_Mapped model buffer size = 362392.97 MiB
....................................................................................................
llama_new_context_with_model: n_seq_max     = 4
llama_new_context_with_model: n_ctx         = 2048
llama_new_context_with_model: n_ctx_per_seq = 512
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 10000.0
llama_new_context_with_model: freq_scale    = 0.025
llama_new_context_with_model: n_ctx_per_seq (512) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61
llama_kv_cache_init:        CPU KV buffer size =  9760.00 MiB
llama_new_context_with_model: KV self size  = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     1.97 MiB
llama_new_context_with_model:        CPU compute buffer size =   670.01 MiB
llama_new_context_with_model: graph nodes  = 5025
llama_new_context_with_model: graph splits = 1
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)

system_info: n_threads = 32 (n_threads_batch = 32) / 64 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 722.377 ms
perplexity: calculating perplexity over 569 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 115.76 seconds per pass - ETA 4 hours 34.45 minutes
[1]4.3360,[2]4.8332,[3]4.7754,[4]3.3915,[5]2.6790,[6]2.3034,[7]2.1953,[8]2.1407,[9]1.9801,[10]1.8571,[11]1.8597,[12]1.8759,[13]1.8085,[14]1.9133,[15]2.0826,[16]2.2035,[17]2.3526,[18]2.5832,[19]2.6840,[20]2.7049,[21]2.8202,[22]2.8124,[23]2.7522,[24]2.7122,[25]2.6636,[26]2.6258,[27]2.6398,[28]2.6945,[29]2.7111,[30]2.7600,[31]2.8559,[32]2.9283,[33]2.9444,[34]2.9556,[35]3.0040,[36]3.0353,[37]3.0683,[38]3.1502,[39]3.2140,[40]3.2328,[41]3.3069,[42]3.3719,[43]3.3812,[44]3.4136,[45]3.5169,[46]3.5905,[47]3.5758,[48]3.4899

@Nottlespike

THANKS! Will begin running https://github.com/EleutherAI/lm-evaluation-harness on it ASAP!

@fairydreaming
Collaborator

I ran farel-bench locally on the model, looks good! (first two are via OpenRouter, third is local)

Nr  Model               FaRel  child   parent  grand-child  sibling  grand-parent  great grand-child  niece or nephew  aunt or uncle  great grand-parent
1   deepseek-v3-sys     96.89  100.00  100.00  98.00        98.00    100.00        98.00              88.00            90.00          100.00
2   deepseek-v3         96.44  100.00  100.00  100.00       96.00    100.00        100.00             82.00            92.00          98.00
3   deepseek-v3-Q4_K_S  96.22  100.00  100.00  100.00       98.00    96.00         96.00              86.00            94.00          96.00

@Nottlespike

Nottlespike commented Jan 2, 2025

I ran farel-bench locally on the model, looks good! (first two are via OpenRouter, third is local)

What is your rig, specs-wise?

@fairydreaming
Collaborator

I ran farel-bench locally on the model, looks good! (first two are via OpenRouter, third is local)

What is your rig?

@Nottlespike Epyc 9374F, 384GB RAM. It took almost 5 hours to run all 450 prompts.

@Nottlespike

I ran farel-bench locally on the model, looks good! (first two are via OpenRouter, third is local)

What is your rig?

@Nottlespike Epyc 9374F, 384GB RAM. It took almost 5 hours to run all 450 prompts.

No GPUs? I've got 4x 3090 Ti FEs linked together with the hacked P2P driver, plus a Threadripper Pro with 8 channels of 128GB DDR4, so I should be able to run it MUCH faster! I've seen your work before and REALLY appreciate your contributions! Any way we can get in contact? I know @bartowski1182 very well, in case they have a way to reach you.

@fairydreaming
Collaborator

@Nottlespike Epyc 9374F, 384GB RAM. It took almost 5 hours to run all 450 prompts.

No GPU's? I got as 4x3090 Ti FE's linked together with the hacked P2P driver plus a ThreadRipper Pro 8 channels of 128GB DDR4 so I should be able to run it MUCH faster! I've seen your work before and REALLY appreciate your contributions! Any way we can get in contact? I know @bartowski1182 very well if they have a contact with you?

@Nottlespike I have a single RTX 4090, but I didn't use it here. What is your exact CPU model?

Regarding the contact I'm active on Reddit (mostly on r/LocalLLaMA) with the same username.

@Nottlespike


@Nottlespike I have a single RTX 4090, but I didn't use it here. What is your exact CPU model?

Regarding the contact I'm active on Reddit (mostly on r/LocalLLaMA) with the same username.

I have been informed I am "unpopular to hated" on r/LocalLLaMA... given that I'm basically using a "server" with 4 of the best consumer GPUs on the market, and I called the tinybox a grift at best and a scam at worst.

@fairydreaming fairydreaming linked a pull request Jan 2, 2025 that will close this issue
@Nottlespike


@Nottlespike I have a single RTX 4090, but I didn't use it here. What is your exact CPU model?

Regarding the contact I'm active on Reddit (mostly on r/LocalLLaMA) with the same username.

@fairydreaming Am I reading your PR correctly that you DON'T NEED trust_remote_code=True? How? Can you help us at HF out with an official HF transformers implementation? Also MASSIVE kudos... 131 lines of elegant code... I'm in shock and awe.

@fairydreaming
Collaborator

@fairydreaming Am I reading your PR correctly and you DON'T NEED trust_remote_code=True HOW? Can you help us at HF out on a offical HF transformers implementation? Also MASSIVE kudos.... 131 lines of ELEGANTE CODE.... I'm in shock and awe

@Nottlespike AFAIK llama.cpp conversion scripts only use HF transformers AutoTokenizer class and DeepSeek V3 has no custom tokenizer class implementation, so I guess there is no need for trust_remote_code=True - it simply doesn't run any.
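
A quick way to sanity-check that claim (a sketch; it needs network access to fetch the tokenizer files from the HF repo, and assumes the repo config really does point at a standard tokenizer class as described above):

python -c "from transformers import AutoTokenizer; t = AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-V3'); print(t.encode('Hello world'))"

If this runs without passing trust_remote_code=True, no custom tokenizer code is being executed.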

@Nottlespike

@fairydreaming Am I reading your PR correctly and you DON'T NEED trust_remote_code=True HOW? Can you help us at HF out on a offical HF transformers implementation? Also MASSIVE kudos.... 131 lines of ELEGANTE CODE.... I'm in shock and awe

@Nottlespike AFAIK llama.cpp conversion scripts only use HF transformers AutoTokenizer class and DeepSeek V3 has no custom tokenizer class implementation, so I guess there is no need for trust_remote_code=True - it simply doesn't run any.

@fairydreaming This is elegant.... props. The previous HF transformers "implementation" forced trust_remote_code.

@etafund

etafund commented Jan 2, 2025

EDIT: Ignore below, simple user error.

@fairydreaming, I'm running your convert_hf_to_gguf_update.py file to create a GGUF after dequantizing the model, but when I run the script, I get an error. Any advice on what I'm doing wrong?

python ./convert_hf_to_gguf_update.py /zfspool/user/models/DeepSeek-V3-BF16/ --outtype q8_0 --outfile /zfspool/user/models/

It always gives the same error, no matter what I run:
INFO:convert_hf_to_gguf_update:Usage: python convert_hf_to_gguf_update.py <huggingface_token>

Excited to replicate what you've done! Great work.

@bartowski1182
Contributor

@etafund that's the script for updating the conversion script; use the one without _update.

@etafund

etafund commented Jan 2, 2025

Thanks bartowski. Running the right script, I still get an error:

Python

python ./convert_hf_to_gguf.py /zfspool/user/models/DeepSeek-V3-BF16/ --outtype q8_0 --outfile /zfspool/user/models/

Note: I have also tried --outtype bf16, etc.

Error

Traceback (most recent call last):
  File "/zfspool/user/llama.cpp/convert_hf_to_gguf.py", line 4911, in <module>
    main()
  File "/zfspool/user/llama.cpp/convert_hf_to_gguf.py", line 4905, in main
    model_instance.write()
  File "/zfspool/user/llama.cpp/convert_hf_to_gguf.py", line 438, in write
    self.prepare_tensors()
  File "/zfspool/user/llama.cpp/convert_hf_to_gguf.py", line 3926, in prepare_tensors
    super().prepare_tensors()
  File "/zfspool/user/llama.cpp/convert_hf_to_gguf.py", line 298, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
  File "/zfspool/user/llama.cpp/convert_hf_to_gguf.py", line 3923, in modify_tensors
    return [(self.map_tensor_name(name), data_torch)]
  File "/zfspool/user/llama.cpp/convert_hf_to_gguf.py", line 214, in map_tensor_name
    raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.layers.3.mlp.gate.e_score_correction.bias'

Perhaps this is a dequantizing issue?
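
If that traceback comes from the mainline convert_hf_to_gguf.py rather than the deepseek-v3 branch, the ValueError is simply the converter saying it has no mapping for the new e_score_correction bias tensor. One thing to try (a guess, not a confirmed fix; the output file name is a placeholder) is rerunning the conversion from the branch linked earlier in the thread:

git clone --branch deepseek-v3 https://github.com/fairydreaming/llama.cpp llama.cpp-deepseek-v3
cd llama.cpp-deepseek-v3
python ./convert_hf_to_gguf.py /zfspool/user/models/DeepSeek-V3-BF16/ --outtype q8_0 --outfile /zfspool/user/models/deepseek-v3-q8_0.gguf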

@cpumaxx
Contributor

cpumaxx commented Jan 2, 2025

My DeepSeek-V3 branch is here: https://github.com/fairydreaming/llama.cpp/tree/deepseek-v3

To convert the model to GGUF you need dequantized DeepSeek V3. You can download it from HF (there are several BF16 DeepSeek V3 models available, but I didn't test any of them) or run inference/fp8_cast_bf16.py script from the original model to convert it to bf16 (that's what I did). Note that it uses triton, so I think you need a GPU for this. In case you experience CUDA out of memory errors during conversion check this: https://huggingface.co/deepseek-ai/DeepSeek-V3/discussions/17

There are some minor tokenization differences compared to the original model, but I think it's usable.

Thanks, @fairydreaming! Your updated conversion script is working perfectly going from BF16 to q8_0.

I'll update with inference results once the quanting finishes and I have a chance to run it through its paces.

@fraschm1998

@Nottlespike I have a single RTX 4090, but I didn't use it here. What is your exact CPU model?

What are your speeds with the 4090?
