diff --git a/Readme.md b/Readme.md
index 44aa097..1d3d855 100755
--- a/Readme.md
+++ b/Readme.md
@@ -17,7 +17,7 @@ This repository contains the official implementation of Half-Quadratic Quantizat
  • HQQ is compatible with `peft` training.
  • We try to make HQQ fully compatible with `torch.compile` for faster inference and training.
  • What is the quality of the quantized models?
    We have detailed benchmarks on both language and vision models. Please refer to our blog posts: HQQ, HQQ+.
@@ -26,10 +26,10 @@ This repository contains the official implementation of Half-Quadratic Quantizat
  • What quantization settings should I use?
    You should start with `nbits=4, group_size=64, axis=1`. These settings offer a good balance between quality, VRAM usage and speed. If you want better results with the same VRAM usage, switch to `axis=0` and use the ATEN backend. If you want to use lower bits like `nbits=2`, you should use `axis=0` with a low group-size via HQQ+, meaning you add low-rank adapters and fine-tune them on a small dataset.
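As a concrete illustration, here is a minimal sketch of that recommended starting point. It assumes the `BaseQuantizeConfig`/`HQQLinear` API from this repository (including `axis` being accepted by the config); the linear layer itself is a hypothetical stand-in:

```Python
# Minimal sketch of the recommended defaults (nbits=4, group_size=64, axis=1).
# The layer below is a stand-in, not taken from the repository's examples.
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = torch.nn.Linear(4096, 4096)  # hypothetical fp32 layer to quantize

quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)
hqq_layer = HQQLinear(linear, quant_config=quant_config,
                      compute_dtype=torch.float16, device='cuda')

# forward pass runs on the dequantized weights
x = torch.randn(1, 4096, dtype=torch.float16, device='cuda')
y = hqq_layer(x)
```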
  • What does the `axis` parameter mean?
    The `axis` parameter is the axis along which grouping is performed. In general, `axis=0` gives better results than `axis=1`, especially at lower bits. However, the optimized inference runtime only supports `axis=1` for the moment.
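To make the grouping concrete, here is a small illustrative sketch in plain PyTorch (not the library's internal code) of how a weight matrix is split into groups along each axis, with one scale/zero-point pair per group:

```Python
import torch

W = torch.randn(128, 256)  # hypothetical [out_features, in_features] weight
group_size = 64

# axis=1: groups of `group_size` consecutive values along the input dim
groups_axis1 = W.reshape(-1, group_size)   # [512, 64] -> one (scale, zero) per row

# axis=0: the weight is instead reshaped the other way around
groups_axis0 = W.reshape(group_size, -1)   # [64, 512] -> one (scale, zero) per column

# e.g. asymmetric 4-bit: map each axis=1 group's [min, max] range onto [0, 15]
w_min = groups_axis1.min(dim=1, keepdim=True).values
w_max = groups_axis1.max(dim=1, keepdim=True).values
scale = 15.0 / (w_max - w_min)             # 2**4 - 1 quantization levels
```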
  • What is the difference between HQQ and HQQ+?
    HQQ+ is HQQ with trainable low-rank adapters to improve the quantization quality at lower bits.
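For intuition, here is a hypothetical sketch of that idea using the `peft` library: the quantized weights stay frozen and only small low-rank adapters are trained. The module names and hyper-parameters below are illustrative, and the exact adapter setup used for HQQ+ may differ:

```Python
# Hypothetical sketch: frozen HQQ-quantized weights + trainable low-rank adapters.
# `quantized_model` is assumed to be an HQQ-quantized causal LM.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative names
    task_type="CAUSAL_LM",
)
model = get_peft_model(quantized_model, lora_config)
# fine-tune `model` on a small dataset; only the adapters receive gradients
```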
@@ -65,9 +65,6 @@ The quantization parameters are set as follows:
 - ```nbits``` (int): supports 8, 4, 3, 2, 1 bits.
 - ```group_size``` (int): no restrictions as long as ```weight.numel()``` is divisible by the ```group_size```.
-- ```quant_zero``` (bool): if True, it quantizes the zero-point to 8-bit without grouping.
-- ```quant_scale``` (bool): if True, it quantizes the scaling factor to 8-bit with a group_size of 128.
-- ```offload_meta``` (bool): if True, meta-data is offloaded to the CPU.
 - ```view_as_float``` (bool): if True, the quantized parameter is viewed as a float instead of an int type.

 Setting ```offload_meta=True``` drastically decreases the GPU memory requirements but makes processing slower for smaller group-sizes. When turned on, you can run Llama2-70B and Mixtral with HQQ 2-bit using only 18.8GB and 13GB VRAM respectively.

@@ -76,9 +73,9 @@ Setting ```offload_meta=True``` drastically decreases the GPU memory requirement
 #### Native Backends
 The following native backends can be used by the `HQQLinear` module:
 ```Python
-HQQLinear.set_backend(HQQBackend.PYTORCH) #Pytorch backend
+HQQLinear.set_backend(HQQBackend.PYTORCH) #Pytorch backend - Default
 HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE) #Compiled Pytorch
-HQQLinear.set_backend(HQQBackend.ATEN) #Aten/CUDA backend
+HQQLinear.set_backend(HQQBackend.ATEN) #Aten/CUDA backend - only axis=0 supported
 ```

 The ```HQQBackend.ATEN``` backend is automatically installed and used by default when available. Note that ```HQQBackend.ATEN``` only supports `axis=0`. For `axis=1` you need to use ```HQQBackend.PYTORCH``` or ```HQQBackend.PYTORCH_COMPILE```.
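For example, a small sketch of picking a backend to match the `axis` setting, using only the `HQQLinear.set_backend` API shown above:

```Python
from hqq.core.quantize import HQQLinear, HQQBackend

axis = 1  # must match the axis used in the quantization config
if axis == 0:
    HQQLinear.set_backend(HQQBackend.ATEN)             # fastest, axis=0 only
else:
    HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)  # works with axis=1
```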
@@ -88,7 +85,7 @@ Below you can find the speed-up benchmark with various backends, ```HQQBackend.P
 Titan RTX
-A100
+A100
@@ -124,7 +121,7 @@ For usage with HF's transformers, see the example below from the