quantization + sparsification - model outputs zeros #942
Comments
@nirey10 Hi! Can you share how you’re running the model? And share the model config?
Hey, exactly like the quantization_2of4_sparse_4w16a example, but without the finetune stage, and I used the llama3.1-8b-instruct model instead. I am running the model with `vllm serve` and using the /v1/completions and /v1/chat/completions routes. By the way, I even took one of the models uploaded to HF (which I believe was produced with this code), neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16, and it outputs only '!' when I do chat completions, the exact phenomenon I am seeing.
Hi @nirey10, if you’re running generation using vLLM, can you try setting the dtype to float16?
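For reference, a minimal sketch of what that suggestion looks like with vLLM's offline `LLM` API (the model id is the one mentioned in the thread; `vllm serve <model> --dtype float16` is the equivalent when serving):

```python
from vllm import LLM, SamplingParams

# Load the released 2:4-sparse W4A16 model with float16 instead of the
# default dtype, as suggested above.
llm = LLM(
    model="neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16",
    dtype="float16",
)

# Quick sanity check that generations are no longer just "!".
outputs = llm.generate(["San Francisco is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```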
@dsikka running with float16 actually fixed the released HF model, but my compressed model still outputs nonsense (at least not '!').
Hi @nirey10, can you share the code you're using when running the model on vLLM?
What version of vLLM is being used? We fixed some issues with the kernel in vLLM recently:
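A quick way to confirm the installed version (trivial sketch):

```python
import vllm

# Compare against the release that contains the kernel fixes mentioned above.
print(vllm.__version__)
```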
Hey, eventually I was able to run it with the YAML recipe, but with llmcompressor==0.1.0 and its corresponding 2_4 example from git. After some experiments I found that the fine-tuning stage is crucial for decent outputs, despite the fact that the original SparseGPT can produce decent results without fine-tuning. To sum it up, --dtype float16 actually helped with the results; I think it should be mentioned in the README. I also think there is a bug in the current 2_4 sparsification + quantization example: the model output path is not carried between stages of the pipeline, so instead of taking the sparse model into the fine-tuning stage, it looks for the input model name, which does not exist locally. Thanks for the help!
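For anyone hitting the same pipeline issue, a rough sketch of the workaround described above: run the two stages as separate `oneshot` passes so the quantization stage loads the locally saved sparse model instead of re-resolving the original model name. This follows the entry point used in llm-compressor's examples; the import path differs between versions (0.1.x exposes it under `llmcompressor.transformers`), and the dataset alias, output paths, and recipe file names below are placeholders, not values from this issue.

```python
from llmcompressor.transformers import oneshot  # `from llmcompressor import oneshot` in newer versions

SPARSE_DIR = "Llama-3.1-8B-Instruct-2of4-sparse"   # hypothetical local paths
QUANT_DIR = "Llama-3.1-8B-Instruct-2of4-w4a16"

# Stage 1: apply 2:4 sparsity only (the sparsity_stage portion of the recipe).
oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dataset="open_platypus",              # placeholder calibration dataset
    recipe="2of4_sparsity_recipe.yaml",   # hypothetical recipe file
    output_dir=SPARSE_DIR,
    num_calibration_samples=512,
)

# Stage 2: GPTQ W4A16 on the *saved sparse model*, not the original model id.
oneshot(
    model=SPARSE_DIR,
    dataset="open_platypus",
    recipe="w4a16_gptq_recipe.yaml",      # hypothetical recipe file
    output_dir=QUANT_DIR,
    num_calibration_samples=512,
)
```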
Describe the bug
When running quantization (GPTQ) after sparsification (SparseGPT 2:4), the model's accuracy and perplexity degrade severely and it outputs only zeros.
Expected behavior
Reasonable generated text.
Environment
To Reproduce
model: llama-3.1-8b-instruct
```yaml
recipe:
  sparsity_stage:
    run_type: oneshot
    sparsity_modifiers:
      SparseGPTModifier:
        sparsity: 0.5
        mask_structure: "2:4"
        sequential_update: false
  quantization_stage:
    run_type: oneshot
    quantization_modifiers:
      GPTQModifier:
        ignore: ["lm_head"]
        config_groups:
          group_0:
            weights:
              num_bits: 4
              type: "int"
              symmetric: true
              strategy: "channel"
            targets: ["Linear"]
```
Additional context
Quantization alone (GPTQ) gives reasonable results (though not as good as auto_gptq), but when it is combined with 2:4 sparsification first, the model outputs only zeros (or '!').
The only things that differ from your example are the model and the lack of fine-tuning.
I served the model with vLLM and asked for a simple completion like "san fransisco is:".