I accidentally posted this bug in the CLI version of the bug tracker:
Bug fix: Flash Attention - KV cache quantization is stuck at FP16 with no way to revert to Q4_0
The gist of it is: no way to set the Flash Attention KV cache quant means no way to fit large contexts on the GPU, which is a regression. This leads to a whole series of usability problems.
Feel free to close the other one I made earlier: lmstudio-ai/lms#70
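For reference, here is a minimal sketch (not LM Studio code) of how the underlying llama.cpp C API exposes these settings, based on the `llama.h` fields from around this period (`type_k`, `type_v`, `flash_attn`); names may differ in newer releases, and the model path is just a placeholder. The ask is essentially to surface these in the app:

```c
/* Sketch only: shows the llama.cpp context parameters that control
 * KV cache quantization. Field names per llama.h circa 2024; newer
 * versions may have renamed or deprecated some of these calls. */
#include "llama.h"

int main(void) {
    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    struct llama_model *model =
        llama_load_model_from_file("model.gguf", mparams); /* placeholder path */

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 131072;          /* large context */
    cparams.flash_attn = true;            /* quantized V cache requires flash attention */
    cparams.type_k     = GGML_TYPE_Q4_0;  /* K cache quantization */
    cparams.type_v     = GGML_TYPE_Q4_0;  /* V cache quantization */

    struct llama_context *ctx = llama_new_context_with_model(model, cparams);
    /* ... run inference ... */
    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```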
I had to switch to ollama to run a 70B model with 128K context. Please add support for Q4/Q8 KV cache quantization.
@dmatora @GabeAl is the specific ask here a way to set the KV cache quantization level?
yes
Yes, absolutely. Many users have commented on this, including on Discord; it is one of the major bottlenecks holding users back.
I depend on large contexts, and large contexts need KV cache quantization (Q4) to be feasible on commodity hardware.
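To put rough numbers on that claim (back-of-the-envelope, assuming Llama-3-70B-style dimensions of 80 layers, 8 KV heads via GQA, head_dim 128, and ggml's effective bits per element including block scales): the KV cache at 128K context is about 40 GiB at FP16 but only about 11 GiB at Q4_0.

```c
/* Back-of-the-envelope KV cache size at 128K context.
 * Assumes Llama-3-70B-style dimensions and ggml storage costs
 * (f16 = 16 bits/elem, q8_0 = 8.5, q4_0 = 4.5, scales included). */
#include <stdio.h>

int main(void) {
    const double n_layers = 80, n_kv_heads = 8, head_dim = 128;
    const double n_ctx = 131072; /* 128K tokens */

    /* K + V elements stored per token, summed over all layers */
    const double elems_per_token = 2.0 * n_layers * n_kv_heads * head_dim;

    const struct { const char *name; double bits_per_elem; } types[] = {
        { "f16 ", 16.0 }, { "q8_0", 8.5 }, { "q4_0", 4.5 },
    };

    for (int i = 0; i < 3; i++) {
        double bytes = elems_per_token * n_ctx * types[i].bits_per_elem / 8.0;
        printf("%s KV cache @ 128K: %.1f GiB\n",
               types[i].name, bytes / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```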