quantization + sparsification - model outputs zeros #942

Open
nirey10 opened this issue Nov 28, 2024 · 8 comments
Labels: bug Something isn't working

@nirey10

nirey10 commented Nov 28, 2024

Describe the bug
When running quantization (GPTQ) after sparsification (SparseGPT 2:4), the model's accuracy and perplexity degrade severely and it outputs only zeros.

Expected behavior
Reasonable text output.

Environment

  1. Ubuntu 22.04.3 LTS
  2. Python 3.12.4
  3. LLM Compressor 0.1.0/0.3.0
  4. Other Python package versions: vLLM 0.6.2/0.5.5, compressed-tensors 0.6.0/0.7.0/0.8.0
  5. CUDA 12.4
  6. GPU - A6000

To Reproduce
model: llama-3.1-8b-instruct
recipe:
```yaml
sparsity_stage:
  run_type: oneshot
  sparsity_modifiers:
    SparseGPTModifier:
      sparsity: 0.5
      mask_structure: "2:4"
      sequential_update: false
quantization_stage:
  run_type: oneshot
  quantization_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "channel"
          targets: ["Linear"]
```
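
For context, a rough sketch of how the recipe gets applied, assuming it is saved as recipe.yaml and driven with llmcompressor's oneshot entry point (the dataset, output directory, and calibration settings below are placeholders, not my exact values):

```python
# Rough sketch: apply the multi-stage recipe above via llmcompressor's oneshot API.
# Dataset, output_dir, and calibration settings are placeholders.
from llmcompressor.transformers import oneshot

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dataset="open_platypus",              # placeholder calibration dataset
    recipe="recipe.yaml",                 # the recipe shown above
    output_dir="llama-3.1-8b-instruct-2of4-w4a16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```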

Additional context
Quantization alone (GPTQ) provides reasonable results (though not as good as auto_gptq), but when combining it with 2:4 sparsification first, the model outputs only zeros (or '!').
The only thing that differs from your example is the model and the lack of fine-tuning.
I served the model with vLLM and asked for a simple completion like "San Francisco is:".
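
Roughly, the completion request against the vLLM server looked like this (host, port, and served model name are placeholders):

```python
# Placeholder request against vLLM's OpenAI-compatible /v1/completions route.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "llama-3.1-8b-instruct-2of4-w4a16",  # served model name (placeholder)
        "prompt": "San Francisco is:",
        "max_tokens": 32,
        "temperature": 0.0,
    },
)
print(resp.json()["choices"][0]["text"])
```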

nirey10 added the bug label Nov 28, 2024
@dsikka
Collaborator

dsikka commented Nov 29, 2024

@nirey10 Hi! Can you share how you’re running the model? And share the model config?

@nirey10
Author

nirey10 commented Dec 1, 2024

Hey, exactly like the quantization_2of4_sparse_4w16a example, but without the finetune stage, and I used the llama-3.1-8b-instruct model instead.

I am running the model with `vllm serve` and using the /v1/completions and /v1/chat/completions routes.

By the way, I even took one of the models uploaded to HF (which I believe was produced with this code), neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16, and it outputs only '!' when I do chat completions, exactly the phenomenon I see with my own model.
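
The chat completion calls are along these lines (host and max_tokens/temperature values are placeholders):

```python
# Placeholder request against vLLM's OpenAI-compatible /v1/chat/completions route.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "neuralmagic/Sparse-Llama-3.1-8B-ultrachat_200k-2of4-quantized.w4a16",
        "messages": [{"role": "user", "content": "San Francisco is:"}],
        "max_tokens": 32,
        "temperature": 0.0,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```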

@dsikka
Collaborator

dsikka commented Dec 1, 2024

Hi @nirey10, if you’re running generation using vLLM, can you try setting the dtype to float16?
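
For a quick check, something along these lines with vLLM's offline API should show whether fp16 changes the output (the model path is a placeholder):

```python
# Quick sanity check with vLLM's offline API, forcing fp16 instead of the checkpoint dtype.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/compressed-model", dtype="float16")  # placeholder model path
outputs = llm.generate(
    ["San Francisco is:"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```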

@jiangjiadi

jiangjiadi commented Dec 2, 2024

@nirey10 @dsikka I had the same problem with the Qwen model. #926

@nirey10
Author

nirey10 commented Dec 2, 2024

@dsikka Running with float16 actually fixed the released HF model, but my own compressed model still outputs nonsense (at least not '!').
Also, when I run it with an output folder, the next stage does not pick up the appropriate {output_folder}/{stage name} path.
Can you please share the versions of vllm, compressed-tensors and llmcompressor that you used for this example?

@dsikka
Collaborator

dsikka commented Dec 2, 2024

Hi @nirey10, can you share the code you're using when running the model on vLLM?

@robertgshaw2-neuralmagic
Collaborator

What version of vLLM is being used? We fixed some issues with the kernel in vLLM recently:

@nirey10
Author

nirey10 commented Dec 8, 2024

Hey,
@robertgshaw2-neuralmagic I am using vllm==0.6.2.
@dsikka Just running `vllm serve {model_name} --dtype float16`.

Eventually I was able to run it with the YAML recipe, but using llmcompressor==0.1.0 and its corresponding 2:4 example from git. After some experiments I found that the fine-tuning stage is crucial for decent outputs, despite the fact that the original SparseGPT can provide decent results without fine-tuning.

To sum it up, --dtype float16 actually helped with the results; I think it should be mentioned in the README. I also think there is a bug in the current 2:4 sparsification + quantization example: the model output paths from the stages are not carried through the pipeline. Instead of taking the sparse model produced by the previous stage into the fine-tuning stage, it looks for the original input model name, which does not exist locally.

Thanks for the help!
