Got Error when I load a 2of4 model using vllm. #926
Comments
@jiangjiadi Hi! Can you share the config that you're running in vllm? Running a model with 2:4 sparsity will require setting the dtype to float16.
@dsikka Here is my config:
Setting the dtype to …
Hi @jiangjiadi, this is the recipe. Do you mind sharing the config.json?
@dsikka The config.json for stage_sparsity model:
The config.json for stage_quantization model:
We only support running the model with both sparsity and quantization at the moment, so you will only be able to run the model produced after the quantization stage. Could you try running this model in vllm with dtype float16?
@dsikka I changed the torch_dtype to 'float16' in the config.json of the stage_quantization model, but it did not resolve the issue.
Hi @jiangjiadi sorry for being unclear, please change the dtype when calling vllm:
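A minimal sketch of such a call, with a placeholder model path:

```python
from vllm import LLM, SamplingParams

# Placeholder path to the model produced by the quantization stage.
llm = LLM(model="./output/stage_quantization", dtype="float16")

# Quick sanity check that generation looks reasonable.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```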
@dsikka The result is the same.
@dsikka I explicitly set the dtype when calling vllm, but the result did not change. There must be something wrong when compressing the model. When I use the code below to compress the uncompressed model, I encounter an error. Besides, when I use vllm to load the uncompressed model, I also encounter an error.
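For context, a minimal sketch of the kind of compression round-trip being described, assuming llm-compressor's save_pretrained wrapper with its save_compressed flag and placeholder paths (an illustration, not the exact snippet from the thread):

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM

# Placeholder path: the uncompressed (dense-format) quantization-stage checkpoint.
model = SparseAutoModelForCausalLM.from_pretrained(
    "./stage_quantization_uncompressed",
    torch_dtype="auto",
)

# Re-save the checkpoint in compressed form; llm-compressor extends
# save_pretrained with a save_compressed flag for this purpose.
model.save_pretrained("./stage_quantization_recompressed", save_compressed=True)
```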
Here is the config.json of the uncompressed model.
The only difference between the compressed and uncompressed models is the …
@dsikka Further investigation revealed that there was no issue with saving the model parameters; the problem lay in loading the model parameters. When calling 'SparseAutoModelForCausalLM.from_pretrained', the model parameters were not being loaded back. |
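A small check along these lines can make that symptom visible, assuming the placeholder path below points at the saved checkpoint:

```python
from llmcompressor.transformers import SparseAutoModelForCausalLM

# Placeholder path to the saved stage_quantization checkpoint.
model = SparseAutoModelForCausalLM.from_pretrained("./stage_quantization")

# If the checkpoint's weights are not actually loaded back, the inspected
# tensors will show freshly initialized values rather than the saved ones.
for name, param in list(model.named_parameters())[:3]:
    print(name, tuple(param.shape), param.float().abs().mean().item())
```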
Hi @jiangjiadi, we do not support decompression of marlin-24 models in compressed-tensors yet. You should be able to load the model in vllm, however. Do you mind sharing the code you're using to run it in vllm?
@dsikka Sure, you can follow the steps below to reproduce my issue.
Describe the bug
I'm compressing a qwen2.5_7b model using examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py, but I failed to load the stage_sparsity model. The error is shown below. And when I use the stage_quantization model for inference with vllm, the output is abnormal. See below.
Expected behavior
The stage_sparsity model should be loaded normally and the output of the stage_quantization model should be normal.
Environment
Include all relevant environment information:
LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]: 0.3.0
To Reproduce
Exact steps to reproduce the behavior:
1. Set model_stub = Qwen/Qwen2.5-7B-Instruct, then run examples/quantization_2of4_sparse_w4a16/llama7b_sparse_w4a16.py to get the model.
2. Use LLM(model_path) to load the model and run inference (see the sketch below).
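A minimal sketch of step 2, assuming model_path is a placeholder for the directory produced by the example script:

```python
from vllm import LLM, SamplingParams

# Placeholder: directory written by llama7b_sparse_w4a16.py for the final stage.
model_path = "./output/stage_quantization"

llm = LLM(model=model_path)
outputs = llm.generate(["What is the capital of France?"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```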