Question about KV Cache quantization #23

Open
SherrySwift opened this issue Sep 6, 2024 · 3 comments

@SherrySwift

Hi, thanks for your great work!
I have a small question about KV Cache quantization. Did you use PagedAttention to accelerate the 4-bit quantized KV Cache? If so, where is the corresponding CUDA kernel code? Thank you.

@happierpig
Collaborator

Hi @SherrySwift ,

Yes, we use paged attention with low-bit quantization to speed up memory loading. INT4 tokens are loaded and dequantized into FP16, and the attention arithmetic is still performed in FP16. Check the quantization code here: https://github.com/efeslab/Atom/blob/main/kernels/include/flashinfer/quantization.cuh#L10
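
For readers who want to picture the decode path described above, here is a minimal CUDA sketch (not Atom's actual kernel): a 32-bit word holding eight packed 4-bit values is loaded, each nibble is dequantized to FP16, and downstream attention math stays in FP16. The symbols (`scale`, `zero`) and the little-nibble-first layout are illustrative assumptions, not Atom's real interface.

```cuda
// Illustrative sketch only: dequantize 8 x INT4 packed in a uint32 into FP16.
#include <cuda_fp16.h>
#include <stdint.h>

constexpr int PACK_NUM = 8;  // eight 4-bit values per 32-bit word

__global__ void unpack_int4_to_fp16(const uint32_t* packed, half* out,
                                    float scale, float zero, int n_words) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_words) return;

    uint32_t word = packed[i];
    #pragma unroll
    for (int j = 0; j < PACK_NUM; ++j) {
        // Extract the j-th nibble (assumed little-nibble-first layout).
        int q = (word >> (4 * j)) & 0xF;
        // Asymmetric dequantization in FP32, then cast to FP16.
        out[i * PACK_NUM + j] = __float2half((static_cast<float>(q) - zero) * scale);
    }
}
```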

@SherrySwift
Author

Hi @happierpig,
Thanks for your quick reply; I have another question.
According to `constexpr size_t PACK_NUM = 8;`, 8 × 4-bit values are packed into one int32. But elsewhere in the code, it seems that the 4-bit KV Cache is stored as int8 rather than int32.
I'm a little confused about this.

@happierpig
Collaborator

Sure! We use the int8 data type to pack 2 int4 elements, since the minimum addressable data size on a GPU is one byte.
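
To make the byte-packing concrete, here is a minimal sketch of one possible layout (an assumption for illustration, not Atom's exact code): two 4-bit values share a single `uint8_t`, one in each nibble, because a byte is the smallest unit the GPU can address.

```cuda
// Illustrative sketch only: two int4 values packed into / unpacked from one byte.
#include <stdint.h>

__host__ __device__ inline uint8_t pack2_int4(int lo, int hi) {
    // lo and hi are already quantized to the 0..15 range.
    return static_cast<uint8_t>((lo & 0xF) | ((hi & 0xF) << 4));
}

__host__ __device__ inline void unpack2_int4(uint8_t byte, int& lo, int& hi) {
    lo = byte & 0xF;         // first element: low nibble
    hi = (byte >> 4) & 0xF;  // second element: high nibble
}
```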
