enable StaticCache for assisted generation #34797
base: main
hmm, I think it will be called on the assistant model when we call `assistant.generate()`, so there is no need here. We can just remove `self.generation_config.cache_implementation = None` in the candidate generator.
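To make the suggestion concrete, here is a minimal sketch of the proposed removal. The class shape is assumed for illustration; only the quoted line comes from this thread:

```python
# Illustrative sketch only: the surrounding class is hypothetical, and just
# the quoted line comes from this discussion.

class CandidateGenerator:
    def __init__(self, generation_config):
        self.generation_config = generation_config
        # Removed, per the suggestion above -- this override forced the
        # assistant back to a dynamic cache even when "static" was requested:
        # self.generation_config.cache_implementation = None
```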
The thing is: if we leave it to `assistant_model.generate` (which is called in `get_candidates`) to initialize the cache, then on the first call `max_new_tokens` will be set to `max_new_tokens = min(int(self.num_assistant_tokens), self.generation_config.max_length - new_cur_len - 1)`, so the cache length gets sized to `int(self.num_assistant_tokens) + prompt_len`. That is less than the cache length actually needed, `max_token_length + prompt_length`, and leads to an assertion failure during generation. So the key here is that the assistant model's cache length should match the main model's. I also noticed that this function takes `assistant_model` as an argument but doesn't use it; I think it may be there for cases like this. That's the rationale behind it.
Oh, I see, that makes sense. Then we can leave the cache init here.