Phi: static cache & compile compatibility #30688
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@zucchini-nlp to clarify:
This is with static cache AND compile, correct? Without compile it has no problems, correct? (I haven't seen them yet; if it happens without compile, a reproduction example would be helpful!)
Looks mostly good to me, a few nits to be addressed!
Also -- let's enable slow Phi tests in this PR 🔥
I found a pattern: it happens only in eager-fp32 precision for Phi models, while in half precision everything is okay. Since Llama is also compile compatible, I tested on that and found Llama has garbage generation in eager-fp16 😭 I am quite lost right now about what the issue might be; I will try to investigate more next week. If you have time, feel free to take a look. The commands below reproduce it with the script provided in the PR description:

python static.py --attn_implementation eager --static_cache_enabled --dtype fp16 --model_name_or_path meta-llama/Llama-2-7b-chat-hf -> for DynamicCache
python static.py --attn_implementation eager --static_cache_enabled --dtype fp32 --model_name_or_path microsoft/phi-2 -> for Compiled StaticCache
@gante as we discussed, I will not dig into the gibberish generation for fp32. In that case the PR should be ready to merge once we get the slow tests passing. Pushed a
Can you please port the changes to Phi3 as well? I can help test it if you want
@hegderavin sure, we will be porting models one by one (#28981). Right now I am waiting for this PR to be merged so that we can work on other models. I can add Phi3 as a separate PR around next week, if you want to pull the changes and compile the model :)
Updates:
Also, I wanted to suggest moving
@@ -171,7 +172,7 @@ def __init__(self, dim, config, device=None):

    @torch.no_grad()
    def forward(self, x, position_ids, seq_len=None):
-        seq_len = torch.max(position_ids) + 1
+        seq_len = position_ids.shape[-1]
Just realized that position ids are not always the same size as the input. Will come back to revert this later, which means that compile still doesn't work for rope scaling in Phi3.
Correct, with left-padding the maximum value of position_ids is not the sequence length. For that (the sequence length) we have cache_positions :)
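To make the left-padding point concrete, here is a small illustrative sketch (the position_ids construction mirrors the usual attention-mask-based recipe; the values are made up for illustration only):

import torch

# Left-padded prompt: 2 pad tokens + 3 real tokens (padded length 5).
attention_mask = torch.tensor([[0, 0, 1, 1, 1]])
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
# position_ids -> tensor([[1, 1, 0, 1, 2]])

print(torch.max(position_ids) + 1)   # 3 -> NOT the padded sequence length
print(position_ids.shape[-1])        # 5 -> the padded sequence length

# cache_position, in contrast, indexes every slot the (static) cache sees:
cache_position = torch.arange(attention_mask.shape[-1])
print(cache_position)                # tensor([0, 1, 2, 3, 4])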
Perhaps. It is not a model-architecture piece of code but rather an input-preparation one, so it could make sense for it to live in a shared module. Raise the discussion on Slack!
Have you looked into the
Perfect, this will do the work, thanks!
(lmk when it is ready for a re-review)
Hi @zucchini-nlp, @gante, is there any update on this PR? I want to use Phi-3 with static cache. This PR is super useful.
@helunwencser this PR moved a bit into #31421 (review), where we're handling the new cache format for all models. That PR will probably enable compile for all models, except for some cases where there's model-specific dynamic control flow. I am planning to deal with those after the PR is merged.

Also, regarding Phi-3: it has dynamic control flow if we use scaled RoPE embeddings, and I haven't decided how best to solve that yet.

EDIT: hehe, sorry, I forgot that we already merged Phi with the new cache format, so it should be compilable as long as you use the '4k' checkpoint, which has no scaling. For scaling, the PR is coming soon.
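For reference, a minimal sketch of compiling generation with a static cache, following the publicly documented transformers StaticCache recipe; the checkpoint name below is an assumption about which '4k' variant is meant, not something stated in this thread:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed '4k' (non-scaled-RoPE) checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# A static cache gives fixed-shape key/value tensors, which is what makes
# torch.compile with fullgraph=True feasible for the decoding step.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("The theory of relativity states that", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))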
Thanks! Unfortunately I need to use StaticCache for Phi3. Looking forward to having the new PR.
Closing as Phi will be compile compatible in #32617; for Phi3 we have to wait for dynamic control flow support from PyTorch.
What does this PR do?
This PR enables compile for Phi models. I checked correctness by running the speed benchmark script (the results are below) and a test that dynamic and static cache outputs match.
A few observations while testing the generation quality:
Benchmark results
Script to evaluate the text-level match between dynamic and static cache
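The actual script is collapsed in the PR description; a minimal sketch of such a dynamic-vs-static comparison, using a placeholder checkpoint and prompt rather than the ones used for the reported results, could look like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # placeholder checkpoint, not necessarily the one benchmarked
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("Tell me a short story about a cat.", return_tensors="pt").to("cuda")

# Greedy decoding, so both cache implementations should produce identical text.
with torch.no_grad():
    dynamic_out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    static_out = model.generate(
        **inputs, max_new_tokens=50, do_sample=False, cache_implementation="static"
    )

dynamic_text = tokenizer.batch_decode(dynamic_out, skip_special_tokens=True)
static_text = tokenizer.batch_decode(static_out, skip_special_tokens=True)
print("MATCH" if dynamic_text == static_text else "MISMATCH")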