Discuss the use of hyperparameters in the quantization_w8a8_int8 script #916
I used this command to evaluate the accuracy after quantization, and I used the dataset HuggingFaceH4/ultrachat_200k recommended in the script as the calibration set.
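For anyone reproducing this, a GSM8K evaluation can be run roughly as in the sketch below. This is an illustrative guess, not necessarily the exact command used here: it assumes the lm-evaluation-harness (≥ 0.4) Python API with the vLLM backend installed, and the checkpoint path is a placeholder.

```python
# Hypothetical sketch: score a quantized checkpoint on GSM8K with
# lm-evaluation-harness. The pretrained path below is a placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",  # requires the vLLM backend to be installed
    model_args="pretrained=./Mistral-Nemo-Instruct-2407-W8A8-Dynamic-Per-Token,max_model_len=4096",
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])
```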
First, let's talk about NUM_CALIBRATION_SAMPLES. I found that it is the main factor determining quantization time: when this value doubles, the overall quantization time roughly doubles as well.
The related parameter is MAX_SEQUENCE_LENGTH. I always kept it at 2048, which may mean I missed something... Maybe adjusting it to match the model's maximum context length would reveal something interesting, or maybe 2048 is simply sufficient. I hope someone with more experience can weigh in.
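For reference, here is a condensed sketch of how these two knobs feed into the calibration set in the example script (paraphrased from the example; the upstream script may differ in details, and the model ID is a placeholder). Each calibration sample costs one forward pass of up to MAX_SEQUENCE_LENGTH tokens, which is why total time scales roughly linearly with NUM_CALIBRATION_SAMPLES.

```python
# Sketch of the calibration-set preparation, paraphrasing the example script.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "mistralai/Mistral-Nemo-Instruct-2407"  # placeholder
NUM_CALIBRATION_SAMPLES = 512   # calibration time grows roughly linearly with this
MAX_SEQUENCE_LENGTH = 2048      # each sample is truncated to this many tokens

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select a fixed number of chat samples and render them with the chat template.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)
```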
In the end I didn't find any definitive answers, but I did notice some patterns through repeated tests.
Then I started testing how well the quantized model's ability is retained (or even boosted) for different values of smoothing_strength. This is only a general trend and there are exceptions, which I will discuss further below. But first, a conjecture: for an abliterated model, with an appropriate smoothing_strength the score increases significantly, even exceeding the score of the official model that has not been abliterated. This may mean that SmoothQuant plays an annealing role after abliteration, bridging the "gap" that abliteration introduces into the model.
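The knob in question is set in the recipe passed to oneshot. A minimal sketch, assuming the SmoothQuantModifier/GPTQModifier API used by the example script at the time (import paths and defaults may have changed between llm-compressor versions; `model` and `ds` are the objects prepared in the sketch above):

```python
# Sketch of the recipe and oneshot call, continuing from the previous snippet.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),  # the value being swept in these tests
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,                 # model loaded in the earlier sketch
    dataset=ds,                  # tokenized calibration set from the earlier sketch
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
```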
I came up with a rule of thumb to test my conjecture: 1 - (6/40) = 0.85. Does this mean that by setting smoothing_strength = 0.85 I effectively singled out layers 34 through 39 and distinguished them from the 34 layers numbered 0 to 33?
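As background (this is from the SmoothQuant paper, not anything specific to this script, so it neither confirms nor rules out the layer-count reading): smoothing_strength corresponds to the migration strength α, which sets a per-input-channel smoothing scale rather than selecting layers:

$$
s_j = \frac{\max\left(|\mathbf{X}_j|\right)^{\alpha}}{\max\left(|\mathbf{W}_j|\right)^{1-\alpha}}
$$

where $\mathbf{X}_j$ is activation channel $j$ and $\mathbf{W}_j$ the corresponding weight channel; a larger α (e.g. 0.85 vs. 0.5) migrates more of the quantization difficulty from the activations into the weights.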
This suggests there are rules for setting smoothing_strength that I am not aware of. Unfortunately, I can only find the best smoothing_strength through repeated attempts. For example, on
That is all of my experience with the hyperparameter settings in the W8A8 quantization script; further discussion is welcome. By the way, I quantized these models for erotic role play (ERP), and the best model so far is noneUsername/Mistral-Nemo-Instruct-2407-abliterated-W8A8-Dynamic-Per-Token.
Hi @HelloCard! Thanks for posting your experience; I'm sure others will be able to use your observations in their own parameter-tuning experiments! As you mentioned, GSM8K is probably not the most representative evaluation set for calibration with ultrachat, and for your role-playing use case an evaluation set like MMLU might be more applicable. Thanks again for your contribution! Feel free to open any other issues if you notice anything unexpected about the SmoothQuant modifier or otherwise :)
What is the URL, file, or UI containing proposed doc change
https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_int8
What is the current content or situation in question
Lack of recommendations on hyperparameters to use.
What is the proposed change
Add new content.
Additional context
Let's talk about some issues regarding the values of NUM_CALIBRATION_SAMPLES and smoothing_strength in the script.