
Issue: UI is missing option to change Flash Attention KV quantization setting #104

Open
GabeAl opened this issue Aug 30, 2024 · 4 comments



GabeAl commented Aug 30, 2024

I accidentally posted this bug in the CLI version of the bug tracker:

Bug fix: Flash Attention - KV cache quantization is stuck at FP16 with no way to revert to Q4_0

The gist of it: no way to set the Flash Attention KV cache quantization means no way to fit large contexts on the GPU, which is a regression.

This causes a series of usability regressions:

  1. Impossible to load a 128k context on the GPU (see the rough memory estimate after this list)
  2. Slower performance even with the reduced context that does fit on the GPU
  3. Inability to analyze long documents with the same accuracy as before (which is ironic, considering 0.3's aim to enable exactly that)
  4. All the issues that come with rolling back to the 0.2.x series, including chat history import
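
For a rough sense of the numbers, here is a back-of-the-envelope KV cache estimate (a sketch only; the 80-layer / 8-KV-head / 128-head-dim figures are assumed for a 70B-class GQA model, and Q4_0 is counted as ~4.5 bits per element):

```python
# Back-of-the-envelope KV cache size: a K and a V tensor for every layer,
# each holding n_kv_heads * head_dim values per token of context.
def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elems * bytes_per_elem / 2**30

ctx = 128 * 1024  # 128k context
print(kv_cache_gib(80, 8, 128, ctx, 2.0))     # FP16  -> ~40 GiB
print(kv_cache_gib(80, 8, 128, ctx, 0.5625))  # Q4_0 (~4.5 bits/elem) -> ~11 GiB
```

At FP16 the cache alone is around 40 GiB, which does not fit next to the weights on a single consumer GPU; at Q4_0 it drops to roughly 11 GiB.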

Feel free to close the other one I made earlier:
lmstudio-ai/lms#70


dmatora commented Sep 19, 2024

I had to switch to Ollama to run a 70B model at 128K context.
Please add support for Q4/Q8 KV cache quantization.

yagil (Member) commented Sep 19, 2024

@dmatora @GabeAl is the specific ask here a way to set the KV cache quantization level?


dmatora commented Sep 19, 2024

yes

GabeAl (Author) commented Sep 21, 2024

Yes, absolutely. Many users have commented on this, including on Discord. It is a major bottleneck holding users back.

I depend on large contexts, and large contexts need KV cache quantization (Q4) to be feasible on commodity hardware.
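
For reference, the underlying llama.cpp engine already exposes this; a UI setting would roughly map to flags like the following (a sketch; flag spellings can vary between llama.cpp versions, and `model.gguf` is just a placeholder):

```sh
./llama-server -m model.gguf -c 131072 --flash-attn \
  --cache-type-k q4_0 --cache-type-v q4_0
```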
