diff --git a/Readme.md b/Readme.md
index 44aa097..1d3d855 100755
--- a/Readme.md
+++ b/Readme.md
@@ -17,7 +17,7 @@ This repository contains the official implementation of Half-Quadratic Quantizat
HQQ is compatible with peft training.
We try to make HQQ fully compatible with `torch.compile` for faster inference and training.
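As a minimal sketch of what this looks like in practice (assuming `model` is an already HQQ-quantized model loaded elsewhere; compile coverage may vary by backend and version):
```Python
import torch

# Hypothetical: `model` is an HQQ-quantized model loaded elsewhere.
# torch.compile wraps the forward pass; speed-ups depend on the chosen HQQ backend.
model = torch.compile(model)
```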
-
+
What is the quality of the quantized models?
We have detailed benchmarks on both language and vision models. Please refer to our blog posts: HQQ, HQQ+.
@@ -26,10 +26,10 @@ This repository contains the official implementation of Half-Quadratic Quantizat
What quantization settings should I use?
You should start with `nbits=4, group_size=64, axis=1`. These settings offer a good balance between quality, VRAM usage, and speed. If you want better results with the same VRAM usage, switch to `axis=0` and use the ATEN backend. If you want to use lower bits like `nbits=2`, you should use `axis=0` with a low group-size via HQQ+, meaning adding low-rank adapters and fine-tuning with a small dataset.
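A rough sketch of these recommended settings (the names below follow the `hqq` API; exact defaults and argument handling may differ between versions):
```Python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Recommended starting point: 4-bit weights, group size 64, grouping along axis=1
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, axis=1)

# Quantize a single linear layer in place of the original nn.Linear
layer = torch.nn.Linear(4096, 4096, bias=False)
hqq_layer = HQQLinear(layer, quant_config=quant_config, compute_dtype=torch.float16, device='cuda')
```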
-
+
What does the `axis` parameter mean?
The `axis` parameter is the axis along which grouping is performed. In general `axis=0` gives better results than `axis=1`, especially at lower bits. However, the optimized inference runtime only supports `axis=1` for the moment.
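A toy illustration of the idea (not HQQ's internal code): grouping reshapes the weight so that scale/zero-point statistics are computed per group along the chosen axis.
```Python
import torch

W = torch.randn(128, 256)   # [out_features, in_features]
group_size = 64

# axis=1: groups run along the last dimension
groups_axis1 = W.reshape(-1, group_size)    # shape [512, 64]

# axis=0: groups run along the first dimension
groups_axis0 = W.reshape(group_size, -1)    # shape [64, 512]

# per-group statistics (e.g. min/max) would then drive the scale and zero-point
g_min, g_max = groups_axis1.amin(dim=1), groups_axis1.amax(dim=1)
```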
-
+
What is the difference between HQQ and HQQ+?
HQQ+ is HQQ with trainable low-rank adapters to improve the quantization quality at lower bits.
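A hedged sketch of the HQQ+ idea using HF transformers plus peft (the model name, rank, and target modules are placeholders, and argument names may differ across transformers versions; the exact peft integration in this repo may also differ):
```Python
from transformers import AutoModelForCausalLM, HqqConfig
from peft import LoraConfig, get_peft_model

# Low-bit base weights: frozen, HQQ-quantized
quant_config = HqqConfig(nbits=2, group_size=16, axis=0)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder model
    quantization_config=quant_config,
    device_map="auto",
)

# Trainable low-rank adapters on top of the frozen quantized weights
lora_config = LoraConfig(r=16, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```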
@@ -65,9 +65,6 @@ The quantization parameters are set as follows:
- ```nbits``` (int): supports 8, 4, 3, 2, 1 bits.
- ```group_size``` (int): no restrictions as long as ```weight.numel()``` is divisible by the ```group_size```.
-- ```quant_zero``` (bool): if True, it quantizes the zero-point to 8-bit without grouping.
-- ```quant_scale``` (bool): if True, it quantizes the scaling factor to 8-bit with a group_size of 128.
-- ```offload_meta``` (bool): if True, meta-data is offloaded to the CPU.
- ```view_as_float``` (bool): if True, the quantized parameter is viewed as a float type instead of an int type.
Setting ```offload_meta=True``` drastically decreases the GPU memory requirements but makes processing slower for smaller group-sizes. When turned on, you can run Llama2-70B and Mixtral with HQQ 2-bit using only 18.8GB and 13GB VRAM respectively.
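As a small, hedged example combining the parameters listed above (exact defaults may vary between versions):
```Python
from hqq.core.quantize import BaseQuantizeConfig

# 2-bit weights with a small group size; view_as_float stores the packed
# quantized parameter under a float view instead of an int view.
quant_config = BaseQuantizeConfig(nbits=2, group_size=32, view_as_float=True)
```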
@@ -76,9 +73,9 @@ Setting ```offload_meta=True``` drastically decreases the GPU memory requirement
#### Native Backends
The following native backends can be used by the `HQQLinear` module:
```Python
-HQQLinear.set_backend(HQQBackend.PYTORCH) #Pytorch backend
+HQQLinear.set_backend(HQQBackend.PYTORCH) #Pytorch backend - Default
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE) #Compiled Pytorch
-HQQLinear.set_backend(HQQBackend.ATEN) #Aten/CUDA backend
+HQQLinear.set_backend(HQQBackend.ATEN) #Aten/CUDA backend - only axis=0 supported
```
The ```HQQBackend.ATEN``` backend is automatically installed and used by default when available.
Note that ```HQQBackend.ATEN``` only supports `axis=0`. For `axis=1` you need to use ```HQQBackend.PYTORCH``` or ```HQQBackend.PYTORCH_COMPILE```.
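A short sketch of picking a backend consistent with the quantization axis (using only the calls shown above):
```Python
from hqq.core.quantize import HQQLinear, HQQBackend

axis = 0  # whatever your quant config uses

# ATEN is only valid for axis=0; fall back to the compiled PyTorch backend otherwise
HQQLinear.set_backend(HQQBackend.ATEN if axis == 0 else HQQBackend.PYTORCH_COMPILE)
```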
@@ -88,7 +85,7 @@ Below you can find the speed-up benchmark with various backends, ```HQQBackend.P
@@ -124,7 +121,7 @@ For usage with HF's transformers, see the example below from the