Question about KV Cache quantization #23

Open
SherrySwift opened this issue Sep 6, 2024 · 3 comments

@SherrySwift

Hi, thanks for your great work!
I have a small question about KV Cache quantization. Did you use PagedAttention to accelerate the 4-bit quantized KV Cache? If so, where is the corresponding CUDA kernel code? Thank you.

@happierpig
Collaborator

Hi @SherrySwift ,

Yes, we use paged attention with low-bit quantization to speed up memory loading. INT4 tokens are loaded and dequantized into FP16, and the attention arithmetic is still performed in FP16. Check the quantization code here: https://github.com/efeslab/Atom/blob/main/kernels/include/flashinfer/quantization.cuh#L10
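
For readers who want to picture the decode path described above, here is a minimal CUDA sketch (not Atom's actual kernel): a 32-bit word holding eight packed 4-bit values is loaded, each nibble is dequantized to FP16, and downstream attention math stays in FP16. The symbols (`scale`, `zero`) and the little-nibble-first layout are illustrative assumptions, not Atom's real interface.

```cuda
// Illustrative sketch only: dequantize 8 x INT4 packed in a uint32 into FP16.
#include <cuda_fp16.h>
#include <stdint.h>

constexpr int PACK_NUM = 8;  // eight 4-bit values per 32-bit word

__global__ void unpack_int4_to_fp16(const uint32_t* packed, half* out,
                                    float scale, float zero, int n_words) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_words) return;

    uint32_t word = packed[i];
    #pragma unroll
    for (int j = 0; j < PACK_NUM; ++j) {
        // Extract the j-th nibble (assumed little-nibble-first layout).
        int q = (word >> (4 * j)) & 0xF;
        // Asymmetric dequantization in FP32, then cast to FP16.
        out[i * PACK_NUM + j] = __float2half((static_cast<float>(q) - zero) * scale);
    }
}
```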

@SherrySwift
Author

Hi @happierpig,
Thanks for your quick reply; I have another question.
According to `constexpr size_t PACK_NUM = 8;`, 8 × 4-bit values are packed into one int32. But elsewhere in the code, it seems that the 4-bit KV Cache is stored as int8 rather than int32.
I'm a little confused about this.

@happierpig
Collaborator

Sure! We use the int8 data type to pack 2 int4 elements, since the minimum addressable data size on a GPU is one byte.
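
To make the byte-packing concrete, here is a minimal sketch of one possible layout (an assumption for illustration, not Atom's exact code): two 4-bit values share a single `uint8_t`, one in each nibble, because a byte is the smallest unit the GPU can address.

```cuda
// Illustrative sketch only: two int4 values packed into / unpacked from one byte.
#include <stdint.h>

__host__ __device__ inline uint8_t pack2_int4(int lo, int hi) {
    // lo and hi are already quantized to the 0..15 range.
    return static_cast<uint8_t>((lo & 0xF) | ((hi & 0xF) << 4));
}

__host__ __device__ inline void unpack2_int4(uint8_t byte, int& lo, int& hi) {
    lo = byte & 0xF;         // first element: low nibble
    hi = (byte >> 4) & 0xF;  // second element: high nibble
}
```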
