quantization + sparsification - model outputs zeros #942

Open
nirey10 opened this issue Nov 28, 2024 · 8 comments
Labels: bug Something isn't working

@nirey10

nirey10 commented Nov 28, 2024

Describe the bug
When running quantization (GPTQ) after sparsification (SparseGPT 2:4), the model's accuracy and perplexity degrade severely and it outputs only zeros.

Expected behavior
Reasonable text output.

Environment

  1. Ubuntu 22.04.3 LTS
  2. Python 3.12.4
  3. LLM Compressor 0.1.0/0.3.0
  4. Other Python package versions: vLLM 0.6.2/0.5.5, compressed-tensors 0.6.0/0.7.0/0.8.0
  5. CUDA 12.4
  6. GPU - A6000

To Reproduce
model: llama-3.1-8b-instruct
recipe:
```yaml
sparsity_stage:
  run_type: oneshot
  sparsity_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      mask_structure: "2:4"
      sequential_update: false
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "channel"
          targets: ["Linear"]
```
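
For context, a rough sketch of how the recipe gets applied, assuming it is saved as recipe.yaml and driven with llmcompressor's oneshot entry point (the dataset, output directory, and calibration settings below are placeholders, not my exact values):

```python
# Rough sketch: apply the multi-stage recipe above via llmcompressor's oneshot API.
# Dataset, output_dir, and calibration settings are placeholders.
from llmcompressor.transformers import oneshot

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dataset="open_platypus",              # placeholder calibration dataset
    recipe="recipe.yaml",                 # the recipe shown above
    output_dir="llama-3.1-8b-instruct-2of4-w4a16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```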

Additional context
Quantization alone (GPTQ) provides reasonable results (though not as good as auto_gptq), but when combining it with 2:4 sparsification first, the model outputs only zeros (or '!').
The only thing that differs from your example is the model and the lack of fine-tuning.
I served the model with vLLM and asked for a simple completion like "San Francisco is:".
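
Roughly, the completion request against the vLLM server looked like this (host, port, and served model name are placeholders):

```python
# Placeholder request against vLLM's OpenAI-compatible /v1/completions route.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "llama-3.1-8b-instruct-2of4-w4a16",  # served model name (placeholder)
        "prompt": "San Francisco is:",
        "max_tokens": 32,
        "temperature": 0.0,
    },
)
print(resp.json()["choices"][0]["text"])
```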

nirey10 added the bug label Nov 28, 2024
@dsikka
Collaborator

dsikka commented Nov 29, 2024

@nirey10 Hi! Can you share how you’re running the model? And share the model config?

@nirey10
Author

nirey10 commented Dec 1, 2024

Hey, exactly like the quantization_2of4_sparse_4w16a example, but without the finetune stage, and I used the llama-3.1-8b-instruct model instead.

I am running the model with `vllm serve` and using the /v1/completions and /v1/chat/completions routes.

By the way, I even took one of the models uploaded to HF (which I believe was produced with this code), neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16, and it outputs only '!' when I do chat completions, exactly the phenomenon I see with my own model.
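
The chat completion calls are along these lines (host and max_tokens/temperature values are placeholders):

```python
# Placeholder request against vLLM's OpenAI-compatible /v1/chat/completions route.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16",
        "messages": [{"role": "user", "content": "San Francisco is:"}],
        "max_tokens": 32,
        "temperature": 0.0,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```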

@dsikka
Collaborator

dsikka commented Dec 1, 2024

Hi @nirey10, if you’re running generation using vLLM, can you try setting the dtype to float16?
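
For a quick check, something along these lines with vLLM's offline API should show whether fp16 changes the output (the model path is a placeholder):

```python
# Quick sanity check with vLLM's offline API, forcing fp16 instead of the checkpoint dtype.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/compressed-model", dtype="float16")  # placeholder model path
outputs = llm.generate(
    ["San Francisco is:"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```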

@jiangjiadi

jiangjiadi commented Dec 2, 2024

@nirey10 @dsikka I had the same problem with the Qwen model. #926

@nirey10
Author

nirey10 commented Dec 2, 2024

@dsikka Running with float16 actually fixed the released HF model, but my own compressed model still outputs nonsense (at least not '!').
Also, when I run it with an output folder, the next stage does not pick up the appropriate {output_folder}/{stage name} path.
Can you please share the versions of vllm, compressed-tensors and llmcompressor that you used for this example?

@dsikka
Collaborator

dsikka commented Dec 2, 2024

Hi @nirey10, can you share the code you're using when running the model on vLLM?

@robertgshaw2-neuralmagic
Collaborator

What version of vLLM is being used? We fixed some issues with the kernel in vLLM recently:

@nirey10
Author

nirey10 commented Dec 8, 2024

Hey,
@robertgshaw2-neuralmagic I am using vllm==0.6.2.
@dsikka Just running `vllm serve {model_name} --dtype float16`.

Eventually I was able to run it with the YAML recipe, but using llmcompressor==0.1.0 and its corresponding 2:4 example from git. After some experiments I found that the fine-tuning stage is crucial for decent outputs, despite the fact that the original SparseGPT can provide decent results without fine-tuning.

To sum it up, --dtype float16 actually helped with the results; I think it should be mentioned in the README. I also think there is a bug in the current 2:4 sparsification + quantization example: the model output paths from the stages are not carried through the pipeline. Instead of taking the sparse model produced by the previous stage into the fine-tuning stage, it looks for the original input model name, which does not exist locally.

Thanks for the help!
