llama: (proposal) propagating the results of graph_compute to the user interface #9525
Conversation
Will a failure during compute leave the KV cache in a consistent state? Otherwise this will not be very useful.
@slaren As far as I understand, right now in case of a failure or an abort, the kv_cache remains in the state in which it was configured in
force-pushed from 95ce058 to 20510ea
force-pushed from 5535683 to 059e78c
@slaren @ggerganov Sorry to bother you, but I would like to know whether this request can be accepted.
I think that making a copy of the entire state of the KV cache is too expensive to do on every evaluation. There must be a more lightweight way to do this, but that may also require more changes to the way the KV state is handled, at the very least whatever
I agree that copying the entire KV state should be avoided since it can incur significant overhead. Right now, the KV state is partially managed by the user code, since it needs to explicitly keep track of the sequence lengths. This makes it easy for the user code to discard any result from the last batch processing, by simply submitting the next batch with the appropriate sequence positions (as if the last failed batch was never submitted). However, this becomes a problem if in the future we want to delegate the logic for sequence positions internally to

Overall, I'm not sure how to proceed either. On one hand, I want to reimplement the KV state logic in a way that will allow us to support different KV cache implementations, but this can take a while to achieve.

Also, regarding the error propagation, it would be easier for me to think about this on a case-by-case basis. What errors do you encounter that you would like to be able to recover from? For example, the examples already demonstrate how to handle the error when the KV cache is full. In my experience, I haven't encountered other types of errors that need to be handled at runtime.
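As a concrete illustration of the pattern described above, here is a minimal sketch (not taken from the actual example programs; the function name and the "only last-token logits" policy are assumptions): because the caller tracks `n_past` itself, a failed `llama_decode` can be treated as if the batch had never been submitted simply by not advancing `n_past`.

```cpp
#include "llama.h"

#include <vector>

// Returns true on success. On failure, n_past is left untouched so the next
// batch can be submitted with the same positions, as if this one never happened.
static bool decode_chunk(llama_context * ctx, const std::vector<llama_token> & tokens, llama_pos & n_past) {
    const int32_t n_tokens = (int32_t) tokens.size();

    llama_batch batch = llama_batch_init(n_tokens, 0, 1);
    batch.n_tokens = n_tokens;
    for (int32_t i = 0; i < n_tokens; ++i) {
        batch.token   [i] = tokens[i];
        batch.pos     [i] = n_past + i;
        batch.n_seq_id[i] = 1;
        batch.seq_id  [i][0] = 0;
        batch.logits  [i] = i == n_tokens - 1; // only request logits for the last token
    }

    const int32_t ret = llama_decode(ctx, batch);
    llama_batch_free(batch);

    if (ret != 0) {
        // e.g. no KV slot was found for the batch: keep n_past as-is and let the
        // caller free some cache space or shrink the batch before retrying
        return false;
    }

    n_past += n_tokens;
    return true;
}
```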
What about the changes that
I think at the moment the only way the compute can fail is if the user cancels it with the abort callback. However, that's only because the backends just crash the application with an assert when something goes wrong; at some point that should be addressed by returning an error instead.
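For reference, this is the abort path being referred to. A minimal sketch of installing an abort callback through the public API (the flag and function names are illustrative); ggml polls the callback during graph compute and aborts when it returns true, and the point of this PR is that such an abort is then reported back through the `llama_decode`/`llama_encode` return value.

```cpp
#include "llama.h"

#include <atomic>

static std::atomic<bool> g_should_abort{false};

// ggml polls this callback during graph compute; returning true aborts the compute
static bool abort_cb(void * /*user_data*/) {
    return g_should_abort.load();
}

void enable_cancellation(llama_context * ctx) {
    llama_set_abort_callback(ctx, abort_cb, nullptr);
}
```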
Yes, actually they should be. I was incorrectly thinking only about the failure to find a KV slot. For non-recurrent models, these can be reverted with an appropriate set of
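Assuming the truncated reference is to the sequence-removal calls (e.g. `llama_kv_cache_seq_rm`), a hedged sketch of that kind of manual revert could look like the following; the helper name and the bookkeeping of pre-batch positions are assumptions, not part of this PR.

```cpp
#include "llama.h"

#include <utility>
#include <vector>

// Remove everything a failed batch added: for each affected sequence, drop the
// cells from the position recorded before the batch up to the end (p1 = -1).
void revert_failed_batch(llama_context * ctx,
                         const std::vector<std::pair<llama_seq_id, llama_pos>> & pos_before) {
    for (const auto & sp : pos_before) {
        llama_kv_cache_seq_rm(ctx, sp.first, sp.second, -1);
    }
}
```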
@ggerganov @slaren Thanks for sharing your thoughts. For me the situation looks like this: right now, in case of a non-critical failure in decoding/encoding (if

As for reverting changes in kv_cache.cells, I agree that simple copying is too expensive, and I think there are 2 options:
In the end, I would suggest, as part of this PR:
And in a separate PR I would suggest adding rollback of

WDYT?
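To make the rollback idea more concrete, here is a purely hypothetical sketch of what a lightweight revert (as opposed to copying the whole cache) could look like; `kv_cell`/`kv_cache` are stand-ins for the real internal structures, and the guard assumes the batch occupies a contiguous range of cells, which is not true in general.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct kv_cell  { int32_t pos = -1; /* seq ids, etc. */ };
struct kv_cache { std::vector<kv_cell> cells; uint32_t head = 0; uint32_t used = 0; };

// Saves only the cells a batch is about to occupy plus the head/used counters,
// and restores them on destruction unless commit() was called.
class kv_cache_rollback {
public:
    kv_cache_rollback(kv_cache & cache, uint32_t first, uint32_t count)
        : cache(cache), first(first), head(cache.head), used(cache.used),
          saved(cache.cells.begin() + first, cache.cells.begin() + first + count) {}

    void commit() { committed = true; }

    ~kv_cache_rollback() {
        if (!committed) {
            std::copy(saved.begin(), saved.end(), cache.cells.begin() + first);
            cache.head = head;
            cache.used = used;
        }
    }

private:
    kv_cache & cache;
    uint32_t   first;
    uint32_t   head;
    uint32_t   used;
    std::vector<kv_cell> saved;
    bool committed = false;
};
```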
The cache revert logic should apply to the entire batch, not just the latest ubatch. Otherwise, it is possible that only a fraction of the changes to the KV cache will be reverted, leaving the KV cache in an inconsistent state.
@slaren I updated the logic of
@Xarbirus thanks. I am not confident that I understand the logic of
@Xarbirus Sorry for the slow progress here. I'll be revisiting the KV cache implementation in the coming days and try to figure out how to refactor and improve it. I will also try to resolve the error propagation.
For hybrid models like Jamba (#7531), I also bumped into needing to keep both caches (recurrent states and self-attention KV) in a consistent state when one of them fails to allocate a slot. What I noticed is that checking whether the allocation will succeed can be done fully ahead of modifying the cache. There is no need to revert what can be avoided in the first place.

"Checking first" does not handle reverting the full batch, only the latest ubatch. Not necessarily the way to go, but another possible approach.
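A hypothetical sketch of this "check first" idea (not compilade's actual code; `kv_cell`/`kv_cache` are stand-ins for the internals, and the contiguous-slot requirement only applies to the non-recurrent case):

```cpp
#include <cstdint>
#include <vector>

struct kv_cell  { int32_t pos = -1; };                 // pos < 0 marks a free cell
struct kv_cache { std::vector<kv_cell> cells; };

// Returns true if n_tokens contiguous free cells are available. Only after this
// check would the real find-slot logic start writing to the cells, so a "no"
// answer never requires reverting anything.
static bool kv_cache_can_fit(const kv_cache & cache, uint32_t n_tokens) {
    uint32_t run = 0;
    for (const kv_cell & cell : cache.cells) {
        run = cell.pos < 0 ? run + 1 : 0;
        if (run >= n_tokens) {
            return true;
        }
    }
    return false;
}
```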
@compilade But, as you correctly noted, this will allow us to leave the cache in a consistent state, but not in the state expected by the user of this function (since the cache can end up "half-full"). Besides, in case of an error in

@ggerganov I would like to know how the refactoring is going. Should I wait for it, or do you have any ideas on how to improve the code in this PR?
force-pushed from bbf27cc to ee599f9
@Xarbirus Initially I was planning to start working on this when I wrote the last comment, but I got sidetracked with improving the Metal backend. I expect to keep working on the Metal backend this week, so I will probably get to the KV cache refactoring after that.

The latest version of this PR is OK to merge. Just add information in the comments in
@ggerganov Done, please check.
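For context, the kind of header documentation being asked for might look roughly like this; a sketch only, since the exact wording and value mapping in the merged `llama.h` may differ.

```cpp
// Process a batch of tokens. Return values (sketch; see llama.h for the final wording):
//    0    - success
//    1    - could not find a KV slot for the batch
//           (try reducing the size of the batch or increase the context)
//    2    - compute aborted via the abort callback
//   -1    - invalid input batch
//  < -1   - fatal error during graph compute
// On a non-zero return, the KV cache is reverted to its state before the call.
LLAMA_API int32_t llama_decode(struct llama_context * ctx, struct llama_batch batch);
```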
* llama: propagating the results of `graph_compute` to the user interface
* llama: reverting kv_cache in case of failed compute
* llama: `llama_kv_cache_state` was removed, only the result of `llama_graph_compute` is returned
* llama: restore a kv_cache in case of failed computation
* llama: correct reverting of the entire batch. also updates `llama_kv_cache_find_slot`, will correctly count the number of `used` cells for recurrent models
* llama: updated comments
* llama : add comments about KV cache state after error

Co-authored-by: Georgi Gerganov <[email protected]>
In PR #9434 it was proposed to use an enum instead of an error code. This PR proposes the first step towards this idea: passing the results of `ggml_backend_sched_graph_compute` to `llama_decode`/`llama_encode` so that they can be processed on the user side.
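As a rough illustration of what "processed on the user side" could mean in practice (a sketch; the numeric mapping shown is my reading of the PR and may differ in detail):

```cpp
#include "llama.h"

#include <cstdio>

// Decode one batch and translate the status into an application-level decision.
int process_batch(llama_context * ctx, llama_batch batch) {
    const int32_t ret = llama_decode(ctx, batch);
    switch (ret) {
        case 0:
            return 0;  // success
        case 1:
            fprintf(stderr, "no KV slot for this batch: shrink it or free cache space\n");
            return 1;  // recoverable
        case 2:
            fprintf(stderr, "compute was aborted via the abort callback\n");
            return 1;  // recoverable
        default:
            fprintf(stderr, "llama_decode failed with status %d\n", ret);
            return -1; // treat as fatal (invalid batch or backend error)
    }
}
```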