
Discuss the use of hyperparameters in the quantization_w8a8_int8 script #916

Closed
HelloCard opened this issue Nov 14, 2024 · 10 comments · May be fixed by #944

@HelloCard

What is the URL, file, or UI containing proposed doc change
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_int8

What is the current content or situation in question
Lack of recommendations on hyperparameters to use.

What is the proposed change
Add new content.

Additional context
Let's talk about some issues regarding the values of NUM_CALIBRATION_SAMPLES and smoothing_strength in the script.

@HelloCard added the documentation label Nov 14, 2024
@HelloCard
Author

I used this command to evaluate the accuracy after quantization:
```
lm_eval --model vllm --model_args pretrained="/root/autodl-tmp/output",add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,dtype=bfloat16 --tasks gsm8k --num_fewshot 5 --limit 250 --batch_size 'auto'
```

For calibration I used the HuggingFaceH4/ultrachat_200k dataset recommended in the script.
Since gsm8k is a math benchmark and ultrachat is conversational chat data, I expect them to be very different, so gsm8k should reflect how well a model calibrated on ultrachat retains (or stimulates) its abilities, without the worry that ultrachat injects math ability into the model and artificially inflates the gsm8k score.

@HelloCard
Author

First, let's talk about NUM_CALIBRATION_SAMPLES. I found that it is the main factor determining quantization time: when this value doubles, the overall quantization time roughly doubles as well.
Increasing NUM_CALIBRATION_SAMPLES does not reduce the accuracy of the final quantized model. In other words, ignoring quantization time and GPU memory usage, the larger this value the better. However, the benefit has a limit.
With other parameters unchanged, raising NUM_CALIBRATION_SAMPLES from 128 to 512 gives a clear improvement in the post-quantization score, while going from 512 to 1024 improves the score much less. I imagine the trend roughly follows a logarithmic curve.
NUM_CALIBRATION_SAMPLES=2048 is a fairly sufficient setting. With it, a 12~14B model takes about two hours to quantize on two 4090s, and a 22B model about three hours on three 4090s. I think this setting balances the accuracy of the quantized model against the cost of renting compute.
So all of my subsequent discussion is based on NUM_CALIBRATION_SAMPLES=2048; in fact, every model I quantized used this value, and I don't have much experience with other settings.
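
For reference, here is roughly where these knobs sit in the example script (a paraphrased sketch from memory, not the exact file; the dataset-loading details may differ between llm-compressor versions):

```python
from datasets import load_dataset

# Values I settled on; the shipped example uses a smaller default for the sample count.
NUM_CALIBRATION_SAMPLES = 2048   # quantization time scales roughly linearly with this
MAX_SEQUENCE_LENGTH = 2048

# Calibration data: the ultrachat split the example recommends.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
```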

@HelloCard
Author

The parameter coupled with NUM_CALIBRATION_SAMPLES is MAX_SEQUENCE_LENGTH. I always kept it at 2048, which may mean I missed something... Perhaps adjusting it to the model's maximum context length would yield interesting findings, or perhaps 2048 is simply sufficient. I hope someone with experience will comment on this.

@HelloCard
Author

Finally, smoothing_strength, the most complex hyperparameter.
Different smoothing_strength values greatly change the ceiling on the score that can be reached by increasing NUM_CALIBRATION_SAMPLES. For example, for a model with an original score of 80, at smoothing_strength=0.3 the quantized score stays below 40 no matter how much NUM_CALIBRATION_SAMPLES is increased, meaning the model's ability has been severely damaged.
I checked the SmoothQuant paper and found a discussion of the analogous hyperparameter there, but it doesn't match at all: as shown in the screenshot from the paper, the recommended range for this parameter (α) is 0.4~0.6, while the value recommended in the script is 0.8.
[Screenshot from the SmoothQuant paper showing the recommended α range of 0.4~0.6.]
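
For context, smoothing_strength is set on the SmoothQuant modifier in the recipe that the script passes to oneshot. A minimal sketch from memory of the llm-compressor API as I used it in late 2024 (import paths and defaults may differ in other versions; `model` and the calibration set `ds` are assumed to have been loaded earlier):

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    # smoothing_strength is the alpha discussed above; the example ships with 0.8
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,   # an AutoModelForCausalLM loaded earlier (not shown)
    dataset=ds,    # calibration set from the previous snippet
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
```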

@HelloCard
Author

So in the end I didn't find a definitive answer, but repeated testing did reveal some patterns.
First I varied smoothing_strength coarsely, e.g. 0.2, 0.4, 0.6, 0.8, and watched llm-compressor's output during quantization. A value labelled "error" changes dramatically with smoothing_strength: with a small setting such as 0.2, the "error" is very low while the first few layers are being quantized; conversely, with a large setting such as 0.9, the "error" is very low while the last few layers are being quantized.
In other words, smoothing_strength effectively designates a region of the model whose layers consistently get a smaller loss.

`logger.info("error %.2f" % torch.sum(Losses).item())`
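
In practice my search for a good value was just a manual sweep over smoothing_strength, re-running the whole quantization each time. Roughly like this hypothetical loop (names carried over from the snippets above; MODEL_ID and the output paths are placeholders):

```python
from transformers import AutoModelForCausalLM

# Hypothetical sweep loop: every point is a full re-quantization (about two hours for a
# 13B model on two 4090s at NUM_CALIBRATION_SAMPLES=2048), and each saved directory is
# then scored with the lm_eval command shown earlier.
for alpha in [0.2, 0.4, 0.6, 0.8, 0.9]:
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
    recipe = [
        SmoothQuantModifier(smoothing_strength=alpha),
        GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
    ]
    oneshot(model=model, dataset=ds, recipe=recipe,
            max_seq_length=MAX_SEQUENCE_LENGTH,
            num_calibration_samples=NUM_CALIBRATION_SAMPLES)
    model.save_pretrained(f"/root/autodl-tmp/output-{alpha}", save_compressed=True)
```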

@HelloCard
Author

HelloCard commented Nov 14, 2024

Then I started testing how well the quantized model's abilities are retained (or stimulated) at different smoothing_strength values.
This was quite tedious: every test means quantizing a 13B model again, i.e. two hours of waiting, then plotting another point on the graph and guessing where the peak is. It made me feel like an AI performing gradient descent by hand.
The hard work paid off, and I found some rules:
Different models have different preferences. In general, 8B models prefer smoothing_strength≈0.80, 12~14B models prefer smoothing_strength≈0.85, and 22B models prefer smoothing_strength≈0.88.

This is only a general rule and there are exceptions, which I discuss further below. But before that, a conjecture:
The position in the model singled out by smoothing_strength may be related to the different roles of an LLM's layers. I recall a paper finding that the top, bottom, and central layers of an LLM behave differently: the top layers convert tokens into richer semantic vectors for the central layers to process, and the bottom layers convert the output of the central layers into logits for the decoder. Perhaps the position indicated by smoothing_strength is the junction between the central layers and the top layers?

For abliterated models, using an appropriate smoothing_strength raises the score significantly, even beyond the score of the official, non-abliterated model. This may mean that SmoothQuant plays an annealing role after abliteration, bridging the "gap" that abliteration leaves in the model.

@HelloCard
Author

I designed a rule to verify my conjecture:
Take a 13B model with num_hidden_layers==40, such as Nemo. Its preferred smoothing_strength is 0.85, and the quantized model is here:
https://huggingface.co/noneUsername/Mistral-Nemo-Instruct-2407-abliterated-W8A8-Dynamic-Per-Token

1-(6/40)=0.85 — does this mean that, through smoothing_strength, I singled out layers 39, 38, 37, 36, 35, and 34 and distinguished them from the 34 layers numbered 0 to 33?
However, I did not verify this rule further, because I found that using the exact value given by the rule does not always produce the best result. For example, Qwen2.5-Coder-14B-Instruct-abliterated has num_hidden_layers==48 and its preferred smoothing_strength is 0.91, yet using 1-(4/48)=0.9166666666, i.e. smoothing_strength=0.9166666666, yields a lower score than smoothing_strength=0.91.
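
Written as a formula, my conjecture is smoothing_strength ≈ 1 - k/num_hidden_layers for some small layer count k; the two cases above work out like this (my own guess at a heuristic, not anything documented):

```python
# conjectured heuristic: alpha = 1 - k / num_hidden_layers
print(1 - 6 / 40)   # 0.85        -> matches the best value I found for Nemo (40 layers)
print(1 - 4 / 48)   # 0.91666...  -> predicted for Qwen2.5-Coder-14B (48 layers), yet 0.91 scored better
```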

@HelloCard
Author

vllm (pretrained=/root/autodl-tmp/Qwen2.5-Coder-14B-Instruct-abliterated,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.872|±  |0.0212|
|     |       |strict-match    |     5|exact_match|↑  |0.868|±  |0.0215|

vllm (pretrained=/root/autodl-tmp/output92,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.852|±  |0.0225|
|     |       |strict-match    |     5|exact_match|↑  |0.848|±  |0.0228|

vllm (pretrained=/root/autodl-tmp/output916666666,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.844|±  | 0.023|
|     |       |strict-match    |     5|exact_match|↑  |0.844|±  | 0.023|

vllm (pretrained=/root/autodl-tmp/output91,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.872|±  |0.0212|
|     |       |strict-match    |     5|exact_match|↑  |0.872|±  |0.0212|

vllm (pretrained=/root/autodl-tmp/output90,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.868|±  |0.0215|
|     |       |strict-match    |     5|exact_match|↑  |0.868|±  |0.0215|

vllm (pretrained=/root/autodl-tmp/output88,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.852|±  |0.0225|
|     |       |strict-match    |     5|exact_match|↑  |0.852|±  |0.0225|

vllm (pretrained=/root/autodl-tmp/output85,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.848|±  |0.0228|
|     |       |strict-match    |     5|exact_match|↑  |0.848|±  |0.0228|

@HelloCard
Author

This means there are rules for setting smoothing_strength that I am not yet aware of. Unfortunately, I can only find the best smoothing_strength through repeated attempts. For example, on Phi-3-medium-4k-instruct the best value of smoothing_strength is 0.885; manually searching to three decimal places is a truly crazy exercise.
https://huggingface.co/noneUsername/Phi-3-medium-4k-instruct-W8A8-Dynamic-Per-Token

vllm (pretrained=/root/autodl-tmp/Phi-3-medium-4k-instruct,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=2,enforce_eager=True), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.852|±  |0.0225|
|     |       |strict-match    |     5|exact_match|↑  |0.832|±  |0.0237|

vllm (pretrained=/root/autodl-tmp/output1,add_bos_token=true,tensor_parallel_size=2,max_model_len=2048,gpu_memory_utilization=0.80,max_num_seqs=5), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: 5
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.876|±  |0.0209|
|     |       |strict-match    |     5|exact_match|↑  |0.844|±  |0.0230|

The above is all of my experience with the hyperparameter settings in the W8A8 quantization script. More discussion is welcome. By the way, I quantized these models for erotic role play (ERP); the best one so far is noneUsername/Mistral-Nemo-Instruct-2407-abliterated-W8A8-Dynamic-Per-Token.

@kylesayrs
Collaborator

Hi @HelloCard! Thanks for posting your experience, I'm sure others will be able to use your observations in their own parameter tuning experiments!

As you mentioned, GSM8K is probably not the most representative evaluation set for calibration with ultrachat and for your role-playing use case; an evaluation set like MMLU might be more applicable.

Thanks again for your contribution! Feel free to open any other issues if you notice anything unexpected about the smoothquant modifier or otherwise :)
