I accidentally posted this bug in the CLI version of the bug tracker:
Bug fix: Flash Attention - KV cache quantization is stuck at FP16 with no way to revert to Q4_0
The gist of it is: no way to set the Flash Attention KV cache quant means no way to fit large contexts on the GPU, which is a regression. This leads to a whole series of usability problems.
Feel free to close the other one I made earlier: lmstudio-ai/lms#70
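For reference, here is a minimal sketch (not LM Studio code) of how the underlying llama.cpp C API exposes these settings, based on the `llama.h` fields from around this period (`type_k`, `type_v`, `flash_attn`); names may differ in newer releases, and the model path is just a placeholder. The ask is essentially to surface these in the app:

```c
/* Sketch only: shows the llama.cpp context parameters that control
 * KV cache quantization. Field names per llama.h circa 2024; newer
 * versions may have renamed or deprecated some of these calls. */
#include "llama.h"

int main(void) {
    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    struct llama_model *model =
        llama_load_model_from_file("model.gguf", mparams); /* placeholder path */

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 131072;          /* large context */
    cparams.flash_attn = true;            /* quantized V cache requires flash attention */
    cparams.type_k     = GGML_TYPE_Q4_0;  /* K cache quantization */
    cparams.type_v     = GGML_TYPE_Q4_0;  /* V cache quantization */

    struct llama_context *ctx = llama_new_context_with_model(model, cparams);
    /* ... run inference ... */
    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```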
I had to switch to ollama to run a 70B model with 128K context. Please add support for Q4/Q8 KV cache quantization.
@dmatora @GabeAl is the specific ask here a way to set the KV cache quantization level?
yes
Yes, absolutely. Many users have commented on this, including on Discord; it is one of the major bottlenecks holding users back.
I depend on large contexts, and large contexts need KV cache quantization (Q4) to be feasible on commodity hardware.
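To put rough numbers on that claim (back-of-the-envelope, assuming Llama-3-70B-style dimensions of 80 layers, 8 KV heads via GQA, head_dim 128, and ggml's effective bits per element including block scales): the KV cache at 128K context is about 40 GiB at FP16 but only about 11 GiB at Q4_0.

```c
/* Back-of-the-envelope KV cache size at 128K context.
 * Assumes Llama-3-70B-style dimensions and ggml storage costs
 * (f16 = 16 bits/elem, q8_0 = 8.5, q4_0 = 4.5, scales included). */
#include <stdio.h>

int main(void) {
    const double n_layers = 80, n_kv_heads = 8, head_dim = 128;
    const double n_ctx = 131072; /* 128K tokens */

    /* K + V elements stored per token, summed over all layers */
    const double elems_per_token = 2.0 * n_layers * n_kv_heads * head_dim;

    const struct { const char *name; double bits_per_elem; } types[] = {
        { "f16 ", 16.0 }, { "q8_0", 8.5 }, { "q4_0", 4.5 },
    };

    for (int i = 0; i < 3; i++) {
        double bytes = elems_per_token * n_ctx * types[i].bits_per_elem / 8.0;
        printf("%s KV cache @ 128K: %.1f GiB\n",
               types[i].name, bytes / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```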