Is the support for the Mistral model inconsistent or incomplete compared with the official implementation? #29533
Comments
Possibly related: #29519
Hi @MrYxJ 👋 The sliding window logic from the pre-fill and chunking mechanism is implemented differently depending on the attention layer:
Regarding the rolling buffer cache mechanism, we can see that it is not needed for the correctness of the model -- the sliding window discards the same older values, i.e. the two mechanisms are redundant. The rolling buffer cache is, however, extremely beneficial for memory purposes. We are working on it (for all models), although currently only Llama and Gemma support it -- search for "static cache". (@ArthurZucker please correct if what I wrote above is not correct :) )
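To illustrate the "redundant" point, here is a minimal sketch (not transformers code; shapes and names are illustrative): with a window of size W, the sliding-window mask already discards everything older than the last W positions, so keeping the full concatenated cache and keeping only a W-sized rolling buffer produce the same attention output -- the rolling buffer just uses less memory.

```python
import torch

W = 4                                  # sliding window size
keys = torch.randn(10, 8)              # 10 cached key vectors, head dim 8
values = torch.randn(10, 8)
query = torch.randn(1, 8)              # current token's query

# (a) keep everything (DynamicCache-style), mask out positions older than W
scores = query @ keys.T                          # (1, 10)
scores[:, :-W] = float("-inf")                   # sliding-window mask
out_full = torch.softmax(scores, dim=-1) @ values

# (b) rolling-buffer style: only the last W entries are ever stored
scores_win = query @ keys[-W:].T                 # (1, W)
out_window = torch.softmax(scores_win, dim=-1) @ values[-W:]

# identical result, but (b) never stores more than W keys/values per layer
assert torch.allclose(out_full, out_window, atol=1e-6)
```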
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @gante, thank you very much for your detailed answer. Your answer has cleared this up for me. Thank you very much.
But StaticCache is still not implemented for Mistral yet, right?
#30642 is coming 😉 |
Feature request
Hello, thank you very much for creating the transformers library, which provides a very convenient way to use different models.
Recently, when I used the Mistral model with the transformers (4.38.2) library for inference, the execution did not seem to be completely consistent with the official Mistral paper and the code they provide.
Specifically, first, Mistral uses a pre-fill and chunking mechanism when encoding the prompt input (as shown in the following picture). Is this method included in the transformers library's implementation of Mistral's generate function? A rough sketch of my understanding of this mechanism follows.
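For reference, this is how I understand pre-fill and chunking from the paper; it is not the transformers implementation, and `forward_chunk` and the cache object below are hypothetical placeholders. The prompt is split into window-sized chunks that are fed through the model one at a time while reusing the KV cache, so the prompt pass never attends over more than one window at once.

```python
def prefill_in_chunks(model, prompt_ids, window_size, cache):
    """Hypothetical sketch of chunked pre-fill (not transformers API)."""
    logits = None
    for start in range(0, len(prompt_ids), window_size):
        chunk = prompt_ids[start:start + window_size]
        # each chunk attends to itself plus the cached keys/values from the
        # previous window; the cache is updated as the window slides forward
        logits, cache = model.forward_chunk(chunk, cache)  # hypothetical call
    return logits, cache
```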
Secondly, Mistral uses a rolling buffer cache mechanism so that keys and values are overwritten in the cache as the sliding window moves (as shown in the following picture).
When I read this part of the transformers source code, this mechanism is implemented using the DynamicCache update method, which actually concatenates all the past keys and values with the current keys and values: https://github.com/huggingface/transformers/blob/b338a6c3b8eda29610d4d472cad8cd87cbfdaaed/src/transformers/cache_utils.py#L126C1-L132C105
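In simplified form (this is a paraphrase of the linked lines for discussion, not a verbatim copy of cache_utils.py), the update essentially does the following, which grows the cache with every step instead of rolling over a fixed-size buffer:

```python
import torch

def dynamic_cache_update(key_cache, value_cache, key_states, value_states, layer_idx):
    # simplified sketch of DynamicCache.update: append on the first pass for a
    # layer, otherwise concatenate along the sequence-length dimension
    if len(key_cache) <= layer_idx:
        key_cache.append(key_states)
        value_cache.append(value_states)
    else:
        key_cache[layer_idx] = torch.cat([key_cache[layer_idx], key_states], dim=-2)
        value_cache[layer_idx] = torch.cat([value_cache[layer_idx], value_states], dim=-2)
    return key_cache[layer_idx], value_cache[layer_idx]
```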
Am I misreading the logic here, or is the implementation not fully consistent with the Mistral paper? In addition, I read the official Mistral source code and found that they release two kinds of inference code, one of which is a simple version. Is that what your current implementation corresponds to?
Motivation
I want to know how to use the transformers library with the Mistral model so that inference works the same way as described in the Mistral paper.
Did I misread the code, or do I need to make some additional changes on top of transformers?
Your contribution
I can try to fix this piece of code so that it matches the implementation in the Mistral paper, based on transformers.