Feature Request: add DeepSeek-v3 support #10981
Comments
The sigmoid routing is a bit different, but the rest of the arch is largely the same as DeepSeek 2.5, just larger. There's no PR yet in HF transformers; it looks like they've built this on top of transformers 4.33, so that will be quite a merge to get done properly, I guess.
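For anyone curious what the routing change amounts to, here is a rough sketch based on my reading of the V3 paper (not DeepSeek's code): V2 takes a softmax over all expert logits before selecting the top-k, while V3 scores each expert independently with a sigmoid and renormalizes only over the selected experts. Expert count and top-k below are placeholders.

```python
# Illustrative sketch of softmax vs. sigmoid expert gating; shapes and
# normalization details are assumptions, not the reference implementation.
import numpy as np

def route_softmax(logits: np.ndarray, k: int):
    """DeepSeek-V2-style gating: softmax over all experts, then top-k."""
    z = logits - logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    topk = np.argsort(probs)[-k:]
    return topk, probs[topk] / probs[topk].sum()

def route_sigmoid(logits: np.ndarray, k: int):
    """DeepSeek-V3-style gating: independent sigmoid score per expert,
    top-k selection, then renormalize over the selected experts."""
    scores = 1.0 / (1.0 + np.exp(-logits))
    topk = np.argsort(scores)[-k:]
    return topk, scores[topk] / scores[topk].sum()

rng = np.random.default_rng(0)
logits = rng.standard_normal(256)   # router logits for one token, 256 experts assumed
print(route_softmax(logits, 8))
print(route_sigmoid(logits, 8))
```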
In case it helps: transformers 4.46.3 is listed here: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/requirements.txt
What's missing to get this to work, and can one do anything to help?
Can a dev help break down for us what would be required in llama.cpp?
@fairydreaming: How much more work is needed before you can accept collaborators and testers on your branch? I see on r/LocalLLaMA that you have at least a PoC running.
I still have to add a new pre-tokenizer regex and test the tokenization. I'm not sure how many weird regex quirks I'll encounter along the way, but I estimate it will take a few days at most. Edit: Also, I don't have MTP implemented, but it can be added later.
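For context, the pre-tokenizer regex is the pattern applied to split text into chunks before BPE merges run. The toy illustration below uses the well-known GPT-2-style pattern as a stand-in; it is not the actual DeepSeek-V3 pattern.

```python
# Toy illustration of what a pre-tokenizer regex does. The pattern is the
# classic GPT-2-style split, NOT the DeepSeek-V3 one.
import regex  # third-party `regex` module: supports \p{L} and \p{N} classes

GPT2_STYLE = r"'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"

def pre_tokenize(text: str) -> list[str]:
    """Split text into pre-token chunks; BPE merges are applied afterwards."""
    return regex.findall(GPT2_STYLE, text)

print(pre_tokenize("DeepSeek-V3 uses 256 routed experts per MoE layer."))
```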
You can do this without official HF transformers support and without trust_remote_code.
My DeepSeek-V3 branch is here: https://github.com/fairydreaming/llama.cpp/tree/deepseek-v3

To convert the model to GGUF you need a dequantized DeepSeek-V3. You can download one from HF (there are several BF16 DeepSeek-V3 models available, but I didn't test any of them) or run the inference/fp8_cast_bf16.py script from the original model repo to convert it to BF16 (that's what I did). Note that it uses Triton, so I think you need a GPU for this. If you hit CUDA out-of-memory errors during conversion, check this: https://huggingface.co/deepseek-ai/DeepSeek-V3/discussions/17

There are some minor tokenization differences compared to the original model, but I think it's usable.
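To make the two-step workflow above concrete, here is a rough driver sketch. The paths are placeholders and the exact fp8_cast_bf16.py flag names are from memory and may differ, so check each script's --help before relying on this.

```python
# Sketch of the FP8 -> BF16 -> GGUF pipeline described above.
# Paths are placeholders; flag names of fp8_cast_bf16.py are assumed.
import subprocess

FP8_MODEL = "/models/DeepSeek-V3"           # original FP8 checkpoint
BF16_MODEL = "/models/DeepSeek-V3-bf16"     # dequantized output
GGUF_OUT = "/models/deepseek-v3-bf16.gguf"

# Step 1: dequantize FP8 -> BF16 with the script from the DeepSeek-V3 repo
# (needs a GPU, since it relies on Triton kernels).
subprocess.run(
    ["python", "inference/fp8_cast_bf16.py",
     "--input-fp8-hf-path", FP8_MODEL,
     "--output-bf16-hf-path", BF16_MODEL],
    check=True,
)

# Step 2: convert the BF16 HF checkpoint to GGUF with the llama.cpp script
# from the deepseek-v3 branch.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", BF16_MODEL,
     "--outfile", GGUF_OUT, "--outtype", "bf16"],
    check=True,
)
```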
Some initial perplexity values over wiki.test.raw (not a full run) with a Q4_K_S-quantized model:
THANKS! Will begin running https://github.com/EleutherAI/lm-evaluation-harness on it ASAP!
I ran farel-bench locally on the model and it looks good! (The first two results are via OpenRouter, the third is local.)
What is your rig, specs-wise?
@Nottlespike EPYC 9374F, 384 GB RAM. It took almost 5 hours to run all 450 prompts.
No GPUs? I've got 4x 3090 Ti FEs linked together with the hacked P2P driver, plus a Threadripper Pro with 8 channels of 128GB DDR4, so I should be able to run it MUCH faster! I've seen your work before and REALLY appreciate your contributions! Is there any way we can get in contact? I know @bartowski1182 very well if they have a contact with you.
@Nottlespike I have a single RTX 4090, but I didn't use it here. What is your exact CPU model? Regarding contact, I'm active on Reddit (mostly on r/LocalLLaMA) under the same username.
I have been informed I am "unpopular to hated" on r/LocalLLaMA... given that I am basically using a "server" with 4 of the best consumer GPUs on the market, and that I called the tinybox a grift at best and a scam at worst.
@fairydreaming Am I reading your PR correctly that you DON'T NEED trust_remote_code?
@Nottlespike AFAIK the llama.cpp conversion scripts only use the HF transformers AutoTokenizer class, and DeepSeek-V3 has no custom tokenizer class implementation, so I guess there is no need for trust_remote_code.
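A quick way to check that claim yourself, assuming the HF repo ships a standard fast tokenizer.json (which is what makes plain AutoTokenizer work without a custom class):

```python
# Sanity check: load the DeepSeek-V3 tokenizer with plain AutoTokenizer and
# no trust_remote_code. If this raises, a custom tokenizer class would be
# required after all.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V3")  # no trust_remote_code
text = "DeepSeek-V3 tokenization test"
ids = tok.encode(text)
print(ids)
print(tok.convert_ids_to_tokens(ids))
```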
@fairydreaming This is elegant... props. The previous HF transformers "implementation" forced trust_remote_code.
EDIT: Ignore below, simple user error. @fairydreaming, I'm running your convert_hf_to_gguf_update.py file to create a GGUF after dequantizing the model, but when I run the script, I get an error. Any advice on what I'm doing wrong?
It always gives the same error, no matter what I run. Excited to replicate what you've done! Great work.
@etafund That's the script for updating the conversion script; use the one without _update.
Thanks, bartowski. Running the right script, I still get an error.
Note: I have also tried --outtype bf16, etc.
Perhaps this is a dequantization issue?
Thanks, @fairydreaming! Your updated conversion script is working perfectly going from BF16 to Q8_0. I'll update with inference results once the quantization finishes and I have a chance to run it through its paces.
What are your speeds with the 4090?
Prerequisites
Feature Description
Add support for DeepSeek-v3
https://huggingface.co/deepseek-ai/DeepSeek-V3
Currently not supported:
ERROR:hf-to-gguf:Model DeepseekV3ForCausalLM is not supported
Motivation
DeepSeek-v3 is a big MoE model with 685B params; support would be great, as offloading to RAM will be a must for most systems.
Possible Implementation
There is no model card or technical report yet. I don't know how different it is from v2.
Edit: they have uploaded the model card and paper:
https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
https://huggingface.co/deepseek-ai/DeepSeek-V3/blob/main/README.md