@mobicham mobicham released this 05 Dec 15:11
· 284 commits to master since this release

HQQ v0.1.0

Improvements

  • Added `torch.compile` backend support
  • Added ATen C++ backend (experimental)
  • Faster bit unpacking via a pre-allocated empty tensor
  • Added vLLM support
  • Refactored quantize_model() to be called on model instances
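The faster bit unpacking can be illustrated with a minimal sketch: instead of building intermediate arrays, both nibbles of each packed byte are written directly into a buffer allocated once with `empty`. NumPy is used here for illustration only; the actual HQQ backend operates on torch tensors, and the function names below are hypothetical, not part of the library's API.

```python
import numpy as np

def pack_4bit(vals: np.ndarray) -> np.ndarray:
    """Pack pairs of 4-bit values (0..15) into single uint8 bytes."""
    vals = vals.astype(np.uint8)
    return (vals[0::2] << 4) | vals[1::2]

def unpack_4bit(packed: np.ndarray) -> np.ndarray:
    """Unpack nibbles into a pre-allocated (uninitialized) output buffer."""
    out = np.empty(packed.size * 2, dtype=np.uint8)  # allocated once, no concatenation
    out[0::2] = packed >> 4    # high nibbles go to even positions
    out[1::2] = packed & 0x0F  # low nibbles go to odd positions
    return out
```

Writing into strided views of a single pre-allocated buffer avoids the temporary allocations and copies that a concatenation-based unpack would incur.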

Supported models

  • Llama (Hugging Face + vLLM)
  • ViT-CLIP (timm)

Limitations

  • The Hugging Face backend only supports single-GPU runtime.
  • vLLM only supports a single GPU with a single worker.
  • The `torch.compile` backend can sometimes cause issues with the async runtime.
  • PEFT (LoRA, etc.) is not supported.