Flash attention / Context quantization. K and V quant settings are unavailable. #70

Open
GabeAl opened this issue Aug 30, 2024 · 2 comments

Comments

GabeAl commented Aug 30, 2024

The current series is unusable for me because I use the full 128k context window on a 16GB GPU. Doing this with Phi-medium or Llama 3.1-8B requires setting the K and V cache quants to q4_0.

Not only has this feature been removed, but memory usage is now roughly 4 times higher, and even reducing the cache size dramatically still leaves processing much slower for me.

Quantization is necessary for large caches to fit. The old series had this feature, and the new series, which would benefit the most from a quantized KV cache with flash attention, is completely missing it. Ironically, I'm resorting to the old version (the 0.2 series) to analyze documents and large texts.
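
For concreteness, here is a back-of-the-envelope sketch of the numbers involved. The model shape (32 layers, 8 KV heads, head dim 128) is my assumption based on the published Llama 3.1 8B architecture; the arithmetic itself is just KV-cache sizing:

```python
# KV-cache sizing sketch for Llama 3.1 8B at the full 128k context.
n_layers   = 32        # assumed model shape
n_kv_heads = 8
head_dim   = 128
n_ctx      = 131_072   # 128k context window

# K and V each store n_ctx * n_kv_heads * head_dim values per layer.
elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim

bytes_f16  = elems * 2          # f16: 2 bytes per element
bytes_q4_0 = elems * 18 / 32    # q4_0: 32 elements per 18-byte block

print(f"f16  KV cache: {bytes_f16  / 2**30:.1f} GiB")   # ~16.0 GiB
print(f"q4_0 KV cache: {bytes_q4_0 / 2**30:.1f} GiB")   # ~4.5 GiB
```

At f16 the cache alone is about 16 GiB before the weights are even loaded; at q4_0 it drops to roughly 4.5 GiB, which is what makes the full window usable on a 16 GB card.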

GabeAl commented Nov 30, 2024

Any progress on this?

anrgct commented Dec 3, 2024

I agree that an int4 KV cache is very useful for long contexts and large models, since it makes full use of GPU memory. I hope this feature is added back; I don't want to downgrade to an older version.
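
For reference, a minimal sketch of how the quantized KV cache is exposed when driving llama.cpp through recent llama-cpp-python bindings. Whether this app's backend surfaces the same options is an assumption on my part, and the model path below is hypothetical:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # hypothetical path
    n_ctx=131_072,     # full 128k context window
    n_gpu_layers=-1,   # offload all layers to the GPU
    flash_attn=True,   # quantized V cache requires flash attention in llama.cpp
    type_k=2,          # 2 == GGML_TYPE_Q4_0 for the K cache
    type_v=2,          # 2 == GGML_TYPE_Q4_0 for the V cache
)
```

Note that llama.cpp only allows a quantized V cache when flash attention is enabled, which is why the flash attention and cache quantization settings go together here.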
