Releases · mobiusml/hqq
HQQ v0.1.5
New features
- Added support for multi-gpu FSDP QLoRA training (#17)
Issues
- `torch.compile` and the `PYTORCH_COMPILE` backend break with `view_as_float=True`. No known solution for the moment.
- Slightly slower inference with `view_as_float=True`. Solution: after training, the user can revert back to int bitpacking (see the sketch below).
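For context, here is a minimal sketch of how the new option might be used for FSDP QLoRA training. It assumes `view_as_float` is exposed as a flag on `BaseQuantizeConfig` and that `HQQLinear` wraps a standard `nn.Linear`; the exact parameter names and signatures are assumptions, not confirmed by these notes.

```python
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# 4-bit config; view_as_float (assumed flag name) stores the packed weights
# viewed as a float tensor so FSDP can shard them across GPUs.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, view_as_float=True)

# Quantize a toy linear layer (illustrative layer, not a real model).
layer = nn.Linear(4096, 4096, bias=False)
hqq_layer = HQQLinear(layer, quant_config)

# Per the note above: after training, reverting to integer bitpacking
# (view_as_float=False) recovers the faster inference path.
```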
v0.1.4
HQQ v0.1.3.post1
New features
- Added meta_offloading support: offloads quantization meta-data to the CPU, thereby achieving true n-bit storage on the GPU (see the sketch below).
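A hedged sketch of enabling meta-data offloading, assuming it is exposed as an `offload_meta` flag on `BaseQuantizeConfig`; the flag name and the `quant_scale`/`quant_zero` options are assumptions based on the feature description, not a confirmed API.

```python
import torch.nn as nn
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Low-bit config with quantized scales/zero-points and meta-data kept on the CPU,
# so only the n-bit packed weights occupy GPU memory (offload_meta is an assumed flag).
quant_config = BaseQuantizeConfig(nbits=2, group_size=16,
                                  quant_zero=True, quant_scale=True,
                                  offload_meta=True)

layer = nn.Linear(4096, 4096, bias=False)  # toy layer for illustration
hqq_layer = HQQLinear(layer, quant_config)
```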
v0.1.3
HQQ v0.1.2.post1
Bug fixes
- Fixed LoRA adapter loading.
v0.1.2
HQQ v0.1.1.post1
No improvements over v0.1.1; just removed PyTorch from the dependencies and updated the README.
v0.1.1
HQQ v0.1.0
Improvements
- Added compile backend support
- Added Aten C++ backend (experimental)
- Faster bit unpacking via a pre-allocated empty tensor
- Added vLLM support
- Refactored quantize_model() to be called on model instances (see the sketch after this list)
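As a rough sketch of the refactored flow and the new backends (the engine class, method names, and backend enum values are assumptions based on these notes, not a definitive API):

```python
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear, HQQBackend

# Load a Hugging Face model through the HQQ engine (model id is illustrative).
model = HQQModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# quantize_model() is now called on the model instance rather than as a free function.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
model.quantize_model(quant_config=quant_config)

# Optionally select the compile backend (the experimental Aten C++ backend
# would be HQQBackend.ATEN, assuming that enum value exists).
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)
```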
Supported models
- Llama (Hugging Face + vLLM)
- ViT-CLIP (timm)
Limitations
- Hugging Face only supports single-GPU runtime.
- vLLM only supports a single GPU with a single worker.
- The compile backend sometimes creates issues with the async runtime.
- Doesn't support PEFT (LoRA, etc.).
HQQ 0.1.0-alpha
Alpha version with basic Hugging Face/Timm support.
Supported models:
- Llama (Hugging Face)
- ViT (timm)
Limitations:
- Uses a pure PyTorch implementation without optimizations.
- Only supports single-GPU runtime.
- Doesn't support PEFT (LoRA, etc.) for custom training.