
BUG: Mixed-precision configuration not working with STATIC quantization #163

Open · sasha-hailo opened this issue Oct 27, 2024 · 10 comments
Labels: bug (Something isn't working)

@sasha-hailo

Dear LLMC team,
I've been trying to run mixed-precision PTQ quantization using RTN.
I suspect there's a bug, as the non-default settings in mix_bits are ignored.

My understanding of the code:

  • In the get_act_qparams() method of rtn.py, the values of qmax / qmin / scales / zeros are determined using the default quantizer bit precision.
  • These values are registered as buf_act_<xxx> buffers for all modules / layers.
  • At inference time, in the a_qdq() method of rtn.py, although the aquantizer object of each layer is configured correctly, it blindly loads the registered quantization parameters qmin / qmax / scales / zeros from the buffer and uses them, instead of values matching its own configuration (see the sketch after this list).
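
To make the suspected flow concrete, here is a minimal, hypothetical sketch (not LLMC's actual code; `default_quantizer`, `get_qparams()`, and the `buf_act_*` names are stand-ins for the behaviour described above):

```python
import torch

def fake_quant(x, scale, zero, qmin, qmax):
    # Standard affine quantize-dequantize.
    q = torch.clamp(torch.round(x / scale) + zero, qmin, qmax)
    return (q - zero) * scale

def calibrate(layers, default_quantizer, calib_acts):
    # Suspected problem: scales/zeros come from ONE quantizer at the
    # default bit-width, regardless of each layer's mix_bits override.
    for layer, acts in zip(layers, calib_acts):
        scale, zero, qmin, qmax = default_quantizer.get_qparams(acts)
        layer.register_buffer("buf_act_scales", scale)
        layer.register_buffer("buf_act_zeros", zero)
        layer.buf_act_qmin, layer.buf_act_qmax = qmin, qmax

def a_qdq(layer, x):
    # At inference, even if layer.aquantizer is configured with the right
    # bit-width, only the buffered params (computed at the default bits)
    # are used, so the mix_bits override has no effect.
    return fake_quant(x, layer.buf_act_scales, layer.buf_act_zeros,
                      layer.buf_act_qmin, layer.buf_act_qmax)
```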

What do you think?
Thanks in advance!

@Harahan
Collaborator

Harahan commented Nov 1, 2024

There's no get_act_qparams() in rtn.py. You can print the bit-width of each linear to check the code.

PS: This function hasn't been updated for a long time. If you confirm there's a bug, please feel free to contact me anytime.
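
A quick version of that check might look like the following sketch (the `aquantizer` and `bit` attribute names are assumptions about the wrapped modules, not a documented LLMC API):

```python
def print_act_bits(model):
    # Walk the quantized model and report each module's activation bit-width.
    for name, module in model.named_modules():
        aq = getattr(module, "aquantizer", None)  # assumed attribute name
        if aq is not None:
            print(f"{name}: act bits = {getattr(aq, 'bit', '?')}")
```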

@Harahan Harahan closed this as completed Nov 1, 2024
@sasha-hailo
Author

sasha-hailo commented Nov 4, 2024

Hi @Harahan,
Thank you for your response.
It turns out that a lot of changes have been made since my issue report (in this commit).
The functionality I was referring to as get_act_qparams() now resides in register_act_qparams(), in base_blockwise_quantization.py.

The bug, unfortunately, persists.

The "mechanism" is the same: function register_act_qparams() uses a single quantizer object (self.aquantizer) to determine the quantization parameters of all layers - and this quantizer is configured with the default settings. It determines the scale & zero point settings (w.r.t. incorrect bit width), and registers them via buf_act_scales / buf_act_zeros.

Note that the correct per-layer quantization configurations are loaded when the deploy() function is executed,
but they have no effect, because they use the incorrect scale and zero-point values determined in the previous stage!

To sum up: I think the core issue behind the [suspected] bug is that the calibration stage and register_act_qparams() are unaware of the configured mixed precision, and work with the default quantization config.
This probably works well for dynamic quantization, but not in a static-quantization scenario.
I also suspect that the same issue can occur with other quantization methods.
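
As an illustration of the direction I have in mind, a possible fix could look roughly like this (get_qparams() and the mix_bits mapping are assumed names, not a patch against LLMC):

```python
def register_act_qparams_per_layer(named_layers, calib_acts, default_quantizer, mix_bits):
    # named_layers: iterable of (name, layer); calib_acts: dict name -> calibration activations.
    for name, layer in named_layers:
        # Use the same quantizer that deploy() will later attach to this layer,
        # so the buffered scales/zeros match its actual bit-width.
        quantizer = mix_bits.get(name, default_quantizer)
        scale, zero = quantizer.get_qparams(calib_acts[name])
        layer.register_buffer("buf_act_scales", scale)
        layer.register_buffer("buf_act_zeros", zero)
```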

Can you please look into it?
Thanks in advance!

@sasha-hailo
Author

P.S.
An unrelated question:
I also noticed that the commit I mentioned above added limited support for additional quantization granularities, via the functions get_matmul_in_block(), get_softmax_in_block(), and get_act_fn_in_block().
Do you plan to extend this support to the more common LLM families like Qwen & Llama?
(This could be really cool.)

@Harahan
Collaborator

Harahan commented Nov 4, 2024

It depends on whether we encounter such a need or whether it will be used in our research. So, I'm not sure.

@sasha-hailo
Author

Did you succeed in reproducing the mix_bits problem I reported?
I believe the issue should be reopened as a bug...

@Harahan
Collaborator

Harahan commented Nov 5, 2024

I'm sorry, but we do not have enough time to do this. If you are sure there's a bug, post the log/evidence and reopen the issue.

@sasha-hailo
Author

LLMC_RTN_W8A8_MixedA16_Bug.txt
LLMC_RTN_W8A8.txt

I'm pretty sure this is a bug.
And I now suspect that the issue affects not only RTN, but nearly any method based on static quantization.
Can you please reopen the issue? I don't think I have the permissions for this.

Please find attached two LLMC logs with an RTN configuration.
One log refers to a configuration without mix_bits, the other with mix_bits.
If you compare the two files, you can see that:

  • The outputs of both runs are identical (same PPL score), hinting that the mix_bits configuration had no effect.
  • The mix_bits configuration of the deployed model is correct (see around line 2458 in the log)
    ==> the bug is not at the deployment stage, but at the calibration stage (see my explanation in earlier messages).

@sasha-hailo sasha-hailo changed the title Mixed-precision configuration not working with RTN? BUG: Mixed-precision configuration not working with STATIC quantization Nov 5, 2024
@Harahan Harahan reopened this Nov 7, 2024
@Harahan
Collaborator

Harahan commented Nov 7, 2024

I've reopened the issue. Since we currently have no requirement for static quantization, the bug may not be fixed for a long time. You'd best try other settings.

@Harahan Harahan added the bug Something isn't working label Nov 7, 2024
@nelaturuharsha

Hi,

I wanted to start using this library for a couple of things, but just to confirm, this bug affects situations where:
static quantization is applied layer-wise (with the intention of having different layers/components at different bit-widths).

Can you confirm that it does not apply when I want more or less the same bit-width for all components of the model, or only different bit-widths for activations vs. weights?

@sasha-hailo
Author

To the best of my understanding, if the quantization configuration is the same for all layers of the model, the bug does not apply.
