Flash attention / Context quantization. K and V quant settings are unavailable. #70

Open
GabeAl opened this issue Aug 30, 2024 · 2 comments

Comments

GabeAl commented Aug 30, 2024

The current series is unusable for me because I use the full 128k context window on a 16GB GPU. Doing this with Phi-medium or Llama 3.1-8B requires setting the K and V cache quants to q4_0.

Not only has this feature been removed, but memory usage is now roughly 4 times higher, and even reducing the cache size dramatically still leaves processing much slower for me.

Quantization is necessary for large caches to fit. The old series had this feature, and the new series, which would benefit the most from a quantized KV cache with flash attention, is completely missing it. Ironically, I'm resorting to the old version (the 0.2 series) to analyze documents and large texts.
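
For concreteness, here is a back-of-the-envelope sketch of the numbers involved. The model shape (32 layers, 8 KV heads, head dim 128) is my assumption based on the published Llama 3.1 8B architecture; the arithmetic itself is just KV-cache sizing:

```python
# KV-cache sizing sketch for Llama 3.1 8B at the full 128k context.
n_layers   = 32        # assumed model shape
n_kv_heads = 8
head_dim   = 128
n_ctx      = 131_072   # 128k context window

# K and V each store n_ctx * n_kv_heads * head_dim values per layer.
elems = 2 * n_layers * n_ctx * n_kv_heads * head_dim

bytes_f16  = elems * 2          # f16: 2 bytes per element
bytes_q4_0 = elems * 18 / 32    # q4_0: 32 elements per 18-byte block

print(f"f16  KV cache: {bytes_f16  / 2**30:.1f} GiB")   # ~16.0 GiB
print(f"q4_0 KV cache: {bytes_q4_0 / 2**30:.1f} GiB")   # ~4.5 GiB
```

At f16 the cache alone is about 16 GiB before the weights are even loaded; at q4_0 it drops to roughly 4.5 GiB, which is what makes the full window usable on a 16 GB card.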

GabeAl commented Nov 30, 2024

Any progress on this?

anrgct commented Dec 3, 2024

I agree that an int4 KV cache is very useful for long contexts and large models, since it makes full use of GPU memory. I hope this feature is added back; I don't want to downgrade to an older version.
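
For reference, a minimal sketch of how the quantized KV cache is exposed when driving llama.cpp through recent llama-cpp-python bindings. Whether this app's backend surfaces the same options is an assumption on my part, and the model path below is hypothetical:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",  # hypothetical path
    n_ctx=131_072,     # full 128k context window
    n_gpu_layers=-1,   # offload all layers to the GPU
    flash_attn=True,   # quantized V cache requires flash attention in llama.cpp
    type_k=2,          # 2 == GGML_TYPE_Q4_0 for the K cache
    type_v=2,          # 2 == GGML_TYPE_Q4_0 for the V cache
)
```

Note that llama.cpp only allows a quantized V cache when flash attention is enabled, which is why the flash attention and cache quantization settings go together here.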
