
Releases: mobiusml/hqq

v0.1.5

01 Mar 10:50

HQQ v0.1.5

New features

  • Added support for multi-GPU FSDP QLoRA training (#17); a usage sketch follows the issues list below.

Issues

  • torch.compile and the PYTORCH_COMPILE backend break with view_as_float=True. No known workaround at the moment.
  • Slightly slower inference with view_as_float=True. Solution: after training, revert to integer bitpacking.
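
A minimal sketch of the training-time setup, assuming the BaseQuantizeConfig/HQQLinear API from the project README; the view_as_float keyword and exact argument names are inferred from these notes:

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Assumed: view_as_float=True exposes the bit-packed buffer viewed as floats,
# which FSDP needs because it cannot shard integer parameters.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, view_as_float=True)

layer = torch.nn.Linear(4096, 4096, bias=False)
hqq_layer = HQQLinear(layer, quant_config=quant_config, compute_dtype=torch.bfloat16)

# After training, re-quantize with view_as_float=False to restore integer
# bitpacking and recover full inference speed (see the issues above).
```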

v0.1.4

28 Feb 09:55

HQQ v0.1.4

New features

  • Added 1-bit support with CUDA dequant kernels.
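
For illustration, a hedged sketch of requesting 1-bit weights; nbits=1 follows the BaseQuantizeConfig pattern from the README, and the small group_size is an assumption (extreme low-bit settings usually need finer groups):

```python
from hqq.core.quantize import BaseQuantizeConfig

# Assumed values: 1-bit weights with small quantization groups for accuracy.
# Dequantization then runs through the new CUDA kernels where available.
quant_config = BaseQuantizeConfig(nbits=1, group_size=32)
```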

v0.1.3.post1

20 Feb 16:41

HQQ v0.1.3.post1

New features

  • meta_offloading support: offloads quantization meta-data (scales/zero-points) to the CPU, achieving true n-bit storage on the GPU; see the sketch below.
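
A sketch of enabling the feature; offload_meta as a BaseQuantizeConfig keyword is an assumption based on this note:

```python
from hqq.core.quantize import BaseQuantizeConfig

# Assumed: offload_meta=True keeps scales/zero-points in CPU memory, so only
# the n-bit packed weights reside on the GPU.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
```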

v0.1.3

12 Feb 16:58

HQQ v0.1.3

New features

  • Added CUDA kernels for dequantization (up to 2-3x inference speed-up vs. PyTorch)
  • Added support for a compute_dtype parameter (useful for float32/bfloat16 LoRA training); see the sketch below
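
A minimal sketch of the compute_dtype parameter, assuming the HQQLinear wrapper signature from the README (argument names may differ slightly):

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# bfloat16 compute is handy for LoRA training; float16 is the usual inference choice.
layer = torch.nn.Linear(4096, 4096, bias=False)
hqq_layer = HQQLinear(layer, quant_config=quant_config, compute_dtype=torch.bfloat16)
```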

v0.1.2.post1

18 Jan 11:21

HQQ v0.1.2.post1

Bug fixes

  • Fixed LoRA adapter loading.

v0.1.2

08 Jan 17:20

HQQ v0.1.2

Improvements

  • Added LoRA support
  • Added LoRA with fake quantization support (experimental)
  • Optimizer V2 with scale update support
  • Some code refactoring in quantize.py
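
A heavily hedged sketch of the LoRA support; PeftUtils.add_lora and the per-layer parameter layout below are hypothetical, modeled on the project's examples rather than a documented API:

```python
import torch
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig
from hqq.core.peft import PeftUtils  # assumed module path

model = HQQModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
model.quantize_model(quant_config=BaseQuantizeConfig(nbits=4, group_size=64))

# Hypothetical adapter settings and layer mapping.
lora_params = {'r': 8, 'lora_alpha': 8, 'dropout': 0.05, 'train_dtype': torch.float32}
PeftUtils.add_lora(model, {'self_attn.q_proj': lora_params,
                           'self_attn.v_proj': lora_params})
```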

v0.1.1.post1

03 Jan 21:52

HQQ v0.1.1.post1

No improvements over v0.1.1. Just removed PyTorch from the dependencies and updated the README.

v0.1.1

18 Dec 13:16

HQQ v0.1.1

Improvements

  • Added Mixtral support for the Hugging Face backend.
  • Added support for layer-wise custom quantization configs (sketched below).
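
A sketch of a layer-wise config, assuming per-layer settings are passed as a dict keyed by linear-layer tags (the tag names below are illustrative Llama/Mixtral module names):

```python
from hqq.core.quantize import BaseQuantizeConfig

q4 = BaseQuantizeConfig(nbits=4, group_size=64)
q2 = BaseQuantizeConfig(nbits=2, group_size=16)

# Assumed layout: keep attention at 4-bit, quantize the MLP more aggressively.
quant_config = {
    'self_attn.q_proj': q4, 'self_attn.k_proj': q4,
    'self_attn.v_proj': q4, 'self_attn.o_proj': q4,
    'mlp.gate_proj': q2, 'mlp.up_proj': q2, 'mlp.down_proj': q2,
}
```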

v0.1.0

05 Dec 15:11

HQQ v0.1.0

Improvements

  • Added compile backend support
  • Added Aten C++ backend (experimental)
  • Faster bit unpacking via pre-allocated empty tensors
  • Added vLLM support
  • Refactored to call quantize_model() on model instances; see the sketch below
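
A sketch combining the instance-level quantize_model() call with backend selection; the HQQBackend enum values are assumptions, except PYTORCH_COMPILE, which the v0.1.5 notes mention by name:

```python
from hqq.engine.hf import HQQModelForCausalLM
from hqq.core.quantize import BaseQuantizeConfig, HQQBackend, HQQLinear

# quantize_model() is now called on the model instance rather than as a class method.
model = HQQModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
model.quantize_model(quant_config=BaseQuantizeConfig(nbits=4, group_size=64))

# Select the dequantization backend; ATEN is the experimental C++ path.
HQQLinear.set_backend(HQQBackend.PYTORCH_COMPILE)
```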

Supported models

  • Llama (Hugging Face + vLLM)
  • ViT-CLIP (timm)

Limitations

  • HF only supports single-GPU runtime.
  • vLLM only supports a single GPU with a single worker.
  • The compile backend sometimes creates issues with async runtime.
  • Doesn't support PEFT (LoRA, etc.).

0.1.0-alpha

01 Dec 16:35

HQQ 0.1.0-alpha

Alpha version with basic Hugging Face/timm support.

Supported models

  • Llama (Hugging Face)
  • ViT (timm)
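
For the timm path, a sketch using the later HQQtimm engine API for illustration; the entry points at alpha time may have differed, and the model id below is just an example:

```python
from hqq.engine.timm import HQQtimm  # assumed wrapper module
from hqq.core.quantize import BaseQuantizeConfig

model = HQQtimm.create_model('vit_large_patch14_clip_224.laion2b', pretrained=True)
model.quantize_model(quant_config=BaseQuantizeConfig(nbits=4, group_size=64))
```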

Limitations

  • Uses a pure PyTorch implementation without optimizations.
  • Only supports single-GPU runtime.
  • Doesn't support PEFT (LoRA, etc.) for custom training.