This is an FP16 x Floatx mixed-precision matmul kernel optimized for IO-bound workloads, based on FP6-LLM. The actual CUDA kernel is located under csrc/cuda/fp6_llm/. This module provides helper functions to quantize FP32/FP16/BF16 weights to Floatx, as well as the integration with the torchao quantization API.
```python
from torchao.quantization import (
    quantize_,
    fpx_weight_only,
)

model = ...
model.half()  # not necessary, but recommended to maintain accuracy

# for generic Floatx EyMz where x = 1 + y + z
# fp6 with ebits = 3 and mbits = 2
quantize_(model, fpx_weight_only(3, 2))

# fully compatible with torch.compile()
model.compile(mode="max-autotune", fullgraph=True)
```
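For a concrete end-to-end run, here is a minimal sketch (the toy `nn.Sequential` model and its layer sizes are arbitrary choices for illustration; a CUDA device is assumed) that quantizes a model to FP6 E3M2 and runs FP16 activations through it:

```python
import torch
from torch import nn
from torchao.quantization import quantize_, fpx_weight_only

# toy model; layer sizes mirror the 1024x512 example below and are chosen to
# satisfy the kernel's shape requirements
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 512)).cuda()
model.half()  # the kernel computes in FP16, so keep weights and activations in FP16

# quantize all nn.Linear weights to FP6 E3M2 (1 sign + 3 exponent + 2 mantissa bits)
quantize_(model, fpx_weight_only(3, 2))

# inference with FP16 activations
x = torch.randn(1, 512, device="cuda", dtype=torch.half)
with torch.no_grad():
    y = model(x)
print(y.shape)  # torch.Size([1, 512])
```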
It's also possible to pre-process the weight and call the kernel directly.
```python
import torch
from torchao.dtypes.floatx import to_scaled_tc_floatx
from torchao.ops import quant_llm_linear

fp32_weight = torch.randn(1024, 512).cuda()
ebits, mbits = 3, 2

# pre-process the weight. this will quantize the weight to FP6 and pack it in a special
# layout for tensor cores. refer to the paper for more details.
fp6_weight, scales = to_scaled_tc_floatx(fp32_weight, ebits, mbits)

fp16_act = torch.randn(1, 512).cuda().half()
outputs = quant_llm_linear(ebits, mbits, fp16_act, fp6_weight, scales)  # shape (1, 1024)
```
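Continuing from the snippet above, a quick sanity-check sketch: quant_llm_linear computes a standard linear layer (out = act @ weight.T), so its output can be compared against an FP16 reference matmul. The error metric and byte-count comparison below are illustrative choices, not part of the torchao API:

```python
# reference result in FP16; quant_llm_linear computes out = act @ weight.T
reference = fp16_act @ fp32_weight.half().T

# FP6 stores only 6 bits per weight, so expect a small but non-zero quantization error
rel_err = (outputs - reference).abs().mean() / reference.abs().mean()
print(outputs.shape, rel_err.item())

# the packed FP6 weight (plus per-channel scales) is roughly 6/16 the size of the FP16
# weight, which is where the speedup for IO-bound (small-batch) workloads comes from
fp16_bytes = fp32_weight.numel() * 2
fp6_bytes = fp6_weight.numel() * fp6_weight.element_size() + scales.numel() * scales.element_size()
print(fp16_bytes, fp6_bytes)
```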
NOTE:
- Since this kernel's computation dtype is FP16, it is recommended to convert the model to FP16 (instead of BF16) before applying quantization and use FP16 for activations.
- Only FP6 E3M2 and FP5 E2M2 are tested and enabled in the official repo. We additionally enable support for FP6 E2M3 and FP5 E3M1.
- On most hardware, this kernel is faster than an FP16 linear layer for batch sizes from 1 to 128, and slower for batch sizes of 256 or more. See usyd-fsalab/fp6_llm#8 for a detailed discussion. See pytorch#223 for some microbenchmark results.
- FP6 is supported on >=SM80 (Ampere generation) GPUs as well as SM75 (Turing generation) GPUs. However, SM75 support requires manually compiling the C++/CUDA extensions (see the installation instructions in the README for details). A capability-guard sketch is shown below.
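Since the kernel is only available on sufficiently new GPUs, one option is to gate the quantization call on the device's compute capability. The helper below is a hypothetical convenience wrapper, not part of torchao; it simply skips quantization on unsupported devices:

```python
import torch
from torchao.quantization import quantize_, fpx_weight_only

def maybe_quantize_fpx(model, ebits=3, mbits=2, min_capability=(7, 5)):
    """Apply FPx weight-only quantization only when the GPU is new enough.

    Hypothetical helper: (7, 5) corresponds to SM75 / Turing and (8, 0) to
    SM80 / Ampere. Note that SM75 additionally requires the C++/CUDA
    extensions to be compiled manually.
    """
    if torch.cuda.is_available() and torch.cuda.get_device_capability() >= min_capability:
        quantize_(model, fpx_weight_only(ebits, mbits))
    return model
```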
Benchmarks are run on a machine with a single 4070Ti SUPER GPU using the scripts in _models/llama. tokens/s is measured with generate.py, which generates text in a latency-optimized way (batch size 1). wikitext perplexity is measured with eval.py, which uses lm_eval. The model used is meta-llama/Llama-2-7b-chat-hf.
Floatx quantization is run with --precision float16. The rest uses the default precision of bfloat16.
Quantization | wikitext perplexity | tokens/s |
---|---|---|
INT8 | 12.21 | 87.45 |
INT4-256 (tinygemm) | -- | 157.10 |
FP6 E3M2 | 12.34 | 106.76 |
FP6 E2M3 | 12.23 | 106.77 |
FP5 E3M1 | 12.55 | 122.69 |
FP5 E2M2 | 12.47 | 122.66 |
FP4 E3M0 | 14.58 | 145.55 |
FP4 E2M1 | 15.01 | 146.05 |
FP3 E2M0 | 74625.18 | 164.49 |
Credits to the FP6-LLM authors.