Is the support for the Mistral model inconsistent or incomplete compared with the official implementation? #29533
Comments
Possibly related: #29519
Hi @MrYxJ 👋 The sliding window logic from the pre-fill and chunking mechanism is implemented differently depending on the attention layer:
Regarding the rolling buffer cache mechanism, we can see that it is not needed for the correctness of the model -- the sliding window discards the same older values, i.e. the two mechanisms are redundant. The rolling buffer cache is, however, extremely beneficial for memory purposes. We are working on it (for all models), although currently only Llama and Gemma support it -- search for "static cache". (@ArthurZucker please correct if what I wrote above is not correct :) )
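To illustrate the "redundant" point, here is a minimal sketch (not transformers code; shapes and names are illustrative): with a window of size W, the sliding-window mask already discards everything older than the last W positions, so keeping the full concatenated cache and keeping only a W-sized rolling buffer produce the same attention output -- the rolling buffer just uses less memory.

```python
import torch

W = 4                                  # sliding window size
keys = torch.randn(10, 8)              # 10 cached key vectors, head dim 8
values = torch.randn(10, 8)
query = torch.randn(1, 8)              # current token's query

# (a) keep everything (DynamicCache-style), mask out positions older than W
scores = query @ keys.T                          # (1, 10)
scores[:, :-W] = float("-inf")                   # sliding-window mask
out_full = torch.softmax(scores, dim=-1) @ values

# (b) rolling-buffer style: only the last W entries are ever stored
scores_win = query @ keys[-W:].T                 # (1, W)
out_window = torch.softmax(scores_win, dim=-1) @ values[-W:]

# identical result, but (b) never stores more than W keys/values per layer
assert torch.allclose(out_full, out_window, atol=1e-6)
```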
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @gante, thank you very much for your detailed answer. Your answer has cleared this up for me. Thank you very much.
But StaticCache is still not implemented for Mistral yet, right?
#30642 is coming 😉 |
Feature request
Hello, thank you very much for creating the transformers library, which provides a very convenient way to use different models.
Recently, when I used the Mistral model with the transformers (4.38.2) library for inference, the execution did not seem to be completely consistent with the official Mistral paper and the code they provide.
Specifically, first, Mistral uses a pre-fill and chunking mechanism when encoding the prompt input (as shown in the following picture). Is this method included in the transformers library's implementation of Mistral's generate function? A rough sketch of my understanding of this mechanism follows.
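For reference, this is how I understand pre-fill and chunking from the paper; it is not the transformers implementation, and `forward_chunk` and the cache object below are hypothetical placeholders. The prompt is split into window-sized chunks that are fed through the model one at a time while reusing the KV cache, so the prompt pass never attends over more than one window at once.

```python
def prefill_in_chunks(model, prompt_ids, window_size, cache):
    """Hypothetical sketch of chunked pre-fill (not transformers API)."""
    logits = None
    for start in range(0, len(prompt_ids), window_size):
        chunk = prompt_ids[start:start + window_size]
        # each chunk attends to itself plus the cached keys/values from the
        # previous window; the cache is updated as the window slides forward
        logits, cache = model.forward_chunk(chunk, cache)  # hypothetical call
    return logits, cache
```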
Secondly, Mistral uses a rolling buffer cache mechanism so that keys and values are overwritten in the cache as the sliding window moves (as shown in the following picture).
When I read this part of the transformers source code, this mechanism is implemented using the DynamicCache update method, which actually concatenates all the past keys and values with the current keys and values: https://github.com/huggingface/transformers/blob/b338a6c3b8eda29610d4d472cad8cd87cbfdaaed/src/transformers/cache_utils.py#L126C1-L132C105
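In simplified form (this is a paraphrase of the linked lines for discussion, not a verbatim copy of cache_utils.py), the update essentially does the following, which grows the cache with every step instead of rolling over a fixed-size buffer:

```python
import torch

def dynamic_cache_update(key_cache, value_cache, key_states, value_states, layer_idx):
    # simplified sketch of DynamicCache.update: append on the first pass for a
    # layer, otherwise concatenate along the sequence-length dimension
    if len(key_cache) <= layer_idx:
        key_cache.append(key_states)
        value_cache.append(value_states)
    else:
        key_cache[layer_idx] = torch.cat([key_cache[layer_idx], key_states], dim=-2)
        value_cache[layer_idx] = torch.cat([value_cache[layer_idx], value_states], dim=-2)
    return key_cache[layer_idx], value_cache[layer_idx]
```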
Am I misreading the logic here, or is the implementation not fully consistent with the Mistral paper? In addition, I read the official Mistral source code and found that they release two kinds of inference code, one of which is a simple version. Is that what your current implementation corresponds to?
Motivation
I want to know how to use the transformers library with the Mistral model so that inference works the same way as described in the Mistral paper.
Did I misread the code, or do I need to make some additional changes on top of transformers?
Your contribution
I can try to fix this piece of code so that it matches the implementation in the Mistral paper, based on transformers.