
Discuss the use of hyperparameters in the quantization_w8a8_int8 script #916

Closed
HelloCard opened this issue Nov 14, 2024 · 10 comments · May be fixed by #944

@HelloCard

What is the URL, file, or UI containing proposed doc change
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_int8

What is the current content or situation in question
Lack of recommendations on hyperparameters to use.

What is the proposed change
Add new content.

Additional context
Let's talk about some issues regarding the values of NUM_CALIBRATION_SAMPLES and smoothing_strength in the script.

@HelloCard added the documentation label Nov 14, 2024
@HelloCard
Author

I used this command to evaluate the accuracy after quantization:
```
lm_eval --model vllm --model_args pretrained="/root/autodl-tmp/output",add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,dtype=bfloat16 --tasks gsm8k --num_fewshot 5 --limit 250 --batch_size 'auto'
```

For calibration I used the HuggingFaceH4/ultrachat_200k dataset recommended in the script.
Since gsm8k is a math benchmark and ultrachat is conversational chat data, I expect them to be very different, so gsm8k should reflect how well a model calibrated on ultrachat retains (or stimulates) its abilities, without the worry that ultrachat injects math ability into the model and artificially inflates the gsm8k score.

@HelloCard
Author

First, let's talk about NUM_CALIBRATION_SAMPLES. I found that it is the main factor determining quantization time: when this value doubles, the overall quantization time roughly doubles as well.
Increasing NUM_CALIBRATION_SAMPLES does not reduce the accuracy of the final quantized model. In other words, ignoring quantization time and GPU memory usage, the larger this value the better. However, the benefit has a limit.
With other parameters unchanged, raising NUM_CALIBRATION_SAMPLES from 128 to 512 gives a clear improvement in the post-quantization score, while going from 512 to 1024 improves the score much less. I imagine the trend roughly follows a logarithmic curve.
NUM_CALIBRATION_SAMPLES=2048 is a fairly sufficient setting. With it, a 12~14B model takes about two hours to quantize on two 4090s, and a 22B model about three hours on three 4090s. I think this setting balances the accuracy of the quantized model against the cost of renting compute.
So all of my subsequent discussion is based on NUM_CALIBRATION_SAMPLES=2048; in fact, every model I quantized used this value, and I don't have much experience with other settings.
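
For reference, here is roughly where these knobs sit in the example script (a paraphrased sketch from memory, not the exact file; the dataset-loading details may differ between llm-compressor versions):

```python
from datasets import load_dataset

# Values I settled on; the shipped example uses a smaller default for the sample count.
NUM_CALIBRATION_SAMPLES = 2048   # quantization time scales roughly linearly with this
MAX_SEQUENCE_LENGTH = 2048

# Calibration data: the ultrachat split the example recommends.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
```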

@HelloCard
Author

The parameter coupled with NUM_CALIBRATION_SAMPLES is MAX_SEQUENCE_LENGTH. I always kept it at 2048, which may mean I missed something... Perhaps adjusting it to the model's maximum context length would yield interesting findings, or perhaps 2048 is simply sufficient. I hope someone with experience will comment on this.

@HelloCard
Author

Finally, smoothing_strength, the most complex hyperparameter.
Different smoothing_strength values greatly change the ceiling on the score that can be reached by increasing NUM_CALIBRATION_SAMPLES. For example, for a model with an original score of 80, at smoothing_strength=0.3 the quantized score stays below 40 no matter how much NUM_CALIBRATION_SAMPLES is increased, meaning the model's ability has been severely damaged.
I checked the SmoothQuant paper and found a discussion of the analogous hyperparameter there, but it doesn't match at all: as shown in the screenshot from the paper, the recommended range for this parameter (α) is 0.4~0.6, while the value recommended in the script is 0.8.
[Screenshot from the SmoothQuant paper showing the recommended α range of 0.4~0.6.]
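
For context, smoothing_strength is set on the SmoothQuant modifier in the recipe that the script passes to oneshot. A minimal sketch from memory of the llm-compressor API as I used it in late 2024 (import paths and defaults may differ in other versions; `model` and the calibration set `ds` are assumed to have been loaded earlier):

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    # smoothing_strength is the alpha discussed above; the example ships with 0.8
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,   # an AutoModelForCausalLM loaded earlier (not shown)
    dataset=ds,    # calibration set from the previous snippet
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
```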

@HelloCard
Author

So in the end I didn't find a definitive answer, but repeated testing did reveal some patterns.
First I varied smoothing_strength coarsely, e.g. 0.2, 0.4, 0.6, 0.8, and watched llm-compressor's output during quantization. A value labelled "error" changes dramatically with smoothing_strength: with a small setting such as 0.2, the "error" is very low while the first few layers are being quantized; conversely, with a large setting such as 0.9, the "error" is very low while the last few layers are being quantized.
In other words, smoothing_strength effectively designates a region of the model whose layers consistently get a smaller loss.

`logger.info("error %.2f" % torch.sum(Losses).item())`
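
In practice my search for a good value was just a manual sweep over smoothing_strength, re-running the whole quantization each time. Roughly like this hypothetical loop (names carried over from the snippets above; MODEL_ID and the output paths are placeholders):

```python
from transformers import AutoModelForCausalLM

# Hypothetical sweep loop: every point is a full re-quantization (about two hours for a
# 13B model on two 4090s at NUM_CALIBRATION_SAMPLES=2048), and each saved directory is
# then scored with the lm_eval command shown earlier.
for alpha in [0.2, 0.4, 0.6, 0.8, 0.9]:
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
    recipe = [
        SmoothQuantModifier(smoothing_strength=alpha),
        GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
    ]
    oneshot(model=model, dataset=ds, recipe=recipe,
            max_seq_length=MAX_SEQUENCE_LENGTH,
            num_calibration_samples=NUM_CALIBRATION_SAMPLES)
    model.save_pretrained(f"/root/autodl-tmp/output-{alpha}", save_compressed=True)
```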

@HelloCard
Author

HelloCard commented Nov 14, 2024

Then I started testing how well the quantized model's abilities are retained (or stimulated) at different smoothing_strength values.
This was quite tedious: every test means quantizing a 13B model again, i.e. two hours of waiting, then plotting another point on the graph and guessing where the peak is. It made me feel like an AI performing gradient descent by hand.
The hard work paid off, and I found some rules:
Different models have different preferences. In general, 8B models prefer smoothing_strength≈0.80, 12~14B models prefer smoothing_strength≈0.85, and 22B models prefer smoothing_strength≈0.88.

This is only a general rule and there are exceptions, which I discuss further below. But before that, a conjecture:
The position in the model singled out by smoothing_strength may be related to the different roles of an LLM's layers. I recall a paper finding that the top, bottom, and central layers of an LLM behave differently: the top layers convert tokens into richer semantic vectors for the central layers to process, and the bottom layers convert the output of the central layers into logits for the decoder. Perhaps the position indicated by smoothing_strength is the junction between the central layers and the top layers?

For abliterated models, using an appropriate smoothing_strength raises the score significantly, even beyond the score of the official, non-abliterated model. This may mean that SmoothQuant plays an annealing role after abliteration, bridging the "gap" that abliteration leaves in the model.

@HelloCard
Author

I designed a rule to verify my conjecture:
Take a 13B model with num_hidden_layers==40, such as Nemo. Its preferred smoothing_strength is 0.85, and the quantized model is here:
https://huggingface.co/noneUsername/Mistral-Nemo-Instruct-2407-abliterated-W8A8-Dynamic-Per-Token

1-(6/40)=0.85 — does this mean that, through smoothing_strength, I singled out layers 39, 38, 37, 36, 35, and 34 and distinguished them from the 34 layers numbered 0 to 33?
However, I did not verify this rule further, because I found that using the exact value given by the rule does not always produce the best result. For example, Qwen2.5-Coder-14B-Instruct-abliterated has num_hidden_layers==48 and its preferred smoothing_strength is 0.91, yet using 1-(4/48)=0.9166666666, i.e. smoothing_strength=0.9166666666, yields a lower score than smoothing_strength=0.91.
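
Written as a formula, my conjecture is smoothing_strength ≈ 1 - k/num_hidden_layers for some small layer count k; the two cases above work out like this (my own guess at a heuristic, not anything documented):

```python
# conjectured heuristic: alpha = 1 - k / num_hidden_layers
print(1 - 6 / 40)   # 0.85        -> matches the best value I found for Nemo (40 layers)
print(1 - 4 / 48)   # 0.91666...  -> predicted for Qwen2.5-Coder-14B (48 layers), yet 0.91 scored better
```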

@HelloCard
Author

vllm (pretrained=/root/autodl-tmp/Qwen2.5-Coder-14B-Instruct-abliterated,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.872|±  |0.0212|
|     |       |strict-match    |     5|exact_match|↑  |0.868|±  |0.0215|

vllm (pretrained=/root/autodl-tmp/output92,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.852|±  |0.0225|
|     |       |strict-match    |     5|exact_match|↑  |0.848|±  |0.0228|

vllm (pretrained=/root/autodl-tmp/output916666666,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.844|±  | 0.023|
|     |       |strict-match    |     5|exact_match|↑  |0.844|±  | 0.023|

vllm (pretrained=/root/autodl-tmp/output91,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.872|±  |0.0212|
|     |       |strict-match    |     5|exact_match|↑  |0.872|±  |0.0212|

vllm (pretrained=/root/autodl-tmp/output90,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.868|±  |0.0215|
|     |       |strict-match    |     5|exact_match|↑  |0.868|±  |0.0215|

vllm (pretrained=/root/autodl-tmp/output88,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.852|±  |0.0225|
|     |       |strict-match    |     5|exact_match|↑  |0.852|±  |0.0225|

vllm (pretrained=/root/autodl-tmp/output85,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.848|±  |0.0228|
|     |       |strict-match    |     5|exact_match|↑  |0.848|±  |0.0228|

@HelloCard
Author

This means there are rules for setting smoothing_strength that I am not yet aware of. Unfortunately, I can only find the best smoothing_strength through repeated attempts. For example, on Phi-3-medium-4k-instruct the best value of smoothing_strength is 0.885; manually searching to three decimal places is a truly crazy exercise.
https://huggingface.co/noneUsername/Phi-3-medium-4k-instruct-W8A8-Dynamic-Per-Token

vllm (pretrained=/root/autodl-tmp/Phi-3-medium-4k-instruct,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=2,enforce_eager=True), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.852|±  |0.0225|
|     |       |strict-match    |     5|exact_match|↑  |0.832|±  |0.0237|

vllm (pretrained=/root/autodl-tmp/output1,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.876|±  |0.0209|
|     |       |strict-match    |     5|exact_match|↑  |0.844|±  |0.0230|

The above is all of my experience with the hyperparameter settings in the W8A8 quantization script. More discussion is welcome. By the way, I quantized these models for erotic role play (ERP); the best one so far is noneUsername/Mistral-Nemo-Instruct-2407-abliterated-W8A8-Dynamic-Per-Token.

@kylesayrs
Collaborator

Hi @HelloCard! Thanks for posting your experience, I'm sure others will be able to use your observations in their own parameter tuning experiments!

As you mentioned, GSM8K is probably not the most representative evaluation set for calibration with ultrachat and for your role-playing use case; an evaluation set like MMLU might be more applicable.

Thanks again for your contribution! Feel free to open any other issues if you notice anything unexpected about the smoothquant modifier or otherwise :)
