The current series is unusable for me because I use the full 128k context window on a 16 GB GPU. Doing that with Phi-medium or Llama 3.1-8B requires setting the K and V cache quants to q4_0.
Not only has this feature been removed, but memory usage is now about 4x higher, and even cutting the cache size dramatically just leaves me with much slower processing.
Quantization is necessary for caches this large to fit. The old series had this feature; the new series, which would benefit the most from a quantized cache with flash attention, is missing it entirely. Ironically, I'm falling back to the old version (0.2 series) to analyze documents and large texts.
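For reference, here is a minimal sketch of how the underlying llama.cpp runtime exposes this, assuming its C API (the same knobs exist on the llama.cpp CLI as `--cache-type-k`, `--cache-type-v`, and `--flash-attn`); the model filename is just a placeholder:

```c
// Sketch only: quantized K/V cache via the llama.cpp C API.
// The model path is hypothetical; field names follow llama.cpp's
// llama_context_params as of the 0.2-era builds.
#include "llama.h"

int main(void) {
    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;  // offload all layers to the 16 GB GPU

    struct llama_model * model =
        llama_load_model_from_file("llama-3.1-8b-instruct-q4_k_m.gguf", mparams);

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx      = 131072;          // full 128k context window
    cparams.flash_attn = true;            // required for a quantized V cache
    cparams.type_k     = GGML_TYPE_Q4_0;  // 4-bit K cache
    cparams.type_v     = GGML_TYPE_Q4_0;  // 4-bit V cache (~4x smaller than f16)

    struct llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize, decode, sample as usual ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```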
I agree that an int4 KV cache is very useful for long contexts and large models, since it makes full use of the GPU memory. I hope this feature is added back; I don't want to downgrade to an older version.
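To put rough numbers on that (assuming Llama 3.1-8B's published dimensions of 32 layers, 8 KV heads, and head dim 128): at f16 the K/V cache costs about 2 × 32 × 8 × 128 × 2 bytes ≈ 128 KiB per token, so a full 131072-token window needs roughly 16 GiB, which cannot fit on a 16 GB card alongside the weights. At q4_0 (about 4.5 bits per element) the same cache drops to roughly 4.5 GiB, which is where the ~4x figure comes from.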