[Misc] Add multi-step chunked-prefill support for FlashInfer #10467
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
"specific parameter.") | ||
|
||
if turn_prefills_into_decodes: | ||
# When mutli-Step is enabled with chunked-Prefill, prefills and |
Suggested change:
- # When mutli-Step is enabled with chunked-Prefill, prefills and
+ # When Multi-Step is enabled with Chunked-Prefill, prefills and
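For context, here is a minimal sketch of what the turn_prefills_into_decodes branch does conceptually; the class and function names below are illustrative stand-ins, not the actual vLLM metadata types. With multi-step plus chunked prefill, prefills scheduled in the first step produce one token each and then behave like decodes for the remaining steps.

```python
from dataclasses import dataclass


@dataclass
class AttnMetadataSketch:
    """Illustrative stand-in for the backend's attention metadata."""
    num_prefills: int
    num_prefill_tokens: int
    num_decode_tokens: int


def advance_step_sketch(meta: AttnMetadataSketch,
                        turn_prefills_into_decodes: bool) -> None:
    if turn_prefills_into_decodes:
        # After the first step, every sequence that was a prefill now has
        # exactly one new token to process, so it is counted as a decode
        # for the remaining steps of the multi-step schedule.
        meta.num_decode_tokens += meta.num_prefills
        meta.num_prefills = 0
        meta.num_prefill_tokens = 0
```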
Please fix the linting.
I've got the following error:
Could you please share the repro command? Thank you! @taegeonum
@elfiegg VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server --model <any_model> --quantization fp8 --kv-cache-dtype fp8 --gpu-memory-utilization 0.95 --num-scheduler-steps 10 --enable-chunked-prefill True --max-num-batched-tokens 512
@taegeonum apologies for the delay - the bug was in CUDA graph mode and has been fixed. Tests are configured.
@elfiegg Thanks! But I got another error:
Could you please share the repro command? Thanks! @taegeonum
VLLM_ATTENTION_BACKEND=FLASHINFER python3 -m vllm.entrypoints.openai.api_server --model model_path --tensor-parallel-size 8 --quantization fp8 --kv-cache-dtype fp8 --max-num-seqs 500 --max-model-len 32768 --max-num-batched-tokens 4096 --gpu-memory-utilization 0.95 --trust-remote-code --enable-chunked-prefill true --num-scheduler-steps 10
@elfiegg Hello, any progress? It would be good if we could use multi-step + chunked prefill on FlashInfer.
@taegeonum Sure - I'm just back from Thanksgiving vacation and will update this tomorrow.
Hello @JaheimLee, there seems to be a bug in multi-step + chunked prefill CUDA graph mode, where it schedules batched tokens during profiling when it shouldn't. The issue with the model config above seems unrelated to multi-step + chunked prefill on FlashInfer; using FlashAttn I observed a similar failure. Also, if you turn off CUDA graph mode, this will work. I'm trying to narrow down the issue, and it seems it might relate to this PR: https://github.com/vllm-project/vllm/pull/8645/files. cc @varun-sundar-rabindranath for more context if there is any.
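For readers trying to reproduce this: the usual way to turn off CUDA graph capture in vLLM is the enforce_eager option (--enforce-eager on the CLI). Whether that is the exact setting referenced above is an assumption, and the model name below is a placeholder.

```python
from vllm import LLM, SamplingParams

# enforce_eager=True disables CUDA graph capture, so the engine always runs
# in eager mode; useful for isolating CUDA-graph-specific bugs like this one.
llm = LLM(model="facebook/opt-125m", enforce_eager=True)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8)))
```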
Hello @JaheimLee, can you pull the latest changes and confirm if they fix the issue? Thanks!
@elfiegg I see the following error:

Is this what you were seeing when you mentioned the issue above?

When I change max_num_seqs to 512, however, the error goes away. I'll take a look.
@varun-sundar-rabindranath Yes, that's the error message. It seems to be a symptom of scheduling decode requests during profiling, since the KV cache shape is not yet determined during profiling.
@elfiegg Thanks for catching this. I arrived at the same conclusion from my debugging; your is_profile_run fix handles it. FWIW, I am adding some of my analysis below.

With FLASH_ATTN, I believe this is what is happening: the flow is fine when the engine is fully initialized, but during the profile_run this is not the case - the KV caches have not been initialized. When we add CUDA graph padding during the profile_run, flash_attn.py runs both prefill and decode attentions, and this blows up in vllm/attention/backends/flash_attn.py (line 678 at 2c97eca).

This manifests only when max_num_seqs is not a value in _BATCH_SIZES_TO_CAPTURE. When max_num_seqs is in _BATCH_SIZES_TO_CAPTURE, the profile_run doesn't add any graph padding.
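A minimal sketch of the guard being discussed, under the assumption that it mirrors the PR's approach of threading an in_profile_run flag into the CUDA graph pad-size computation; the function name and batch sizes below are illustrative, not the exact vLLM code.

```python
# Illustrative capture sizes; the real list lives in vllm/worker/model_runner.py.
_BATCH_SIZES_TO_CAPTURE = [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64]


def get_cuda_graph_pad_size_sketch(num_seqs: int,
                                   decode_only: bool,
                                   in_profile_run: bool = False) -> int:
    """Return how many dummy sequences to pad so the batch matches a captured
    CUDA graph size; -1 means 'run eagerly, no CUDA graph'."""
    if in_profile_run:
        # During the profile run the KV caches are not allocated yet, so
        # padding the batch and taking the CUDA-graph decode path would read
        # uninitialized cache tensors. Skip padding entirely.
        return -1
    if not decode_only:
        # CUDA graphs are only captured for decode-only batches.
        return -1
    for size in _BATCH_SIZES_TO_CAPTURE:
        if size >= num_seqs:
            return size - num_seqs
    return -1
```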
vllm/worker/model_runner.py (outdated)
@@ -1226,6 +1228,7 @@ def _prepare_model_input_tensors(

    @torch.inference_mode()
    def profile_run(self) -> None:
        self.is_profile_run = True
Suggestion: pass is_profile_run with a default of False as a parameter to the _get_cuda_graph_pad_size function, to prevent accidentally accessing and setting/disabling the profile run in other places.
Yep, thanks for the suggestion - the default value is False and is set in the runner: https://github.com/vllm-project/vllm/pull/10467/files#diff-d3df23c3e3bcfe97ee8507061c6de54f0eff23a8c75d7f5999062c42245290f8R1028
Whenever the value is accessed, it will default to False, if that's what you're suggesting. :)
Two minor suggestions:
- Rename to in_profile_run, because this is a temporary status.
- Use a context manager to make sure this value is always reset after this function (see the sketch below).
@elfiegg No exception during server boot-up. However, I've got an error when processing requests:
@taegeonum - apologies for the delay, and thanks so much for the patience. Can you pull the latest and try again? The issue you ran into was a CUDA graph padding issue that has been fixed.
@comaniac do you mind taking another look?
@elfiegg Now it works without exceptions. Thanks!!
Sorry for the delayed review. The change LGTM.
Also, is this PR only compatible with certain FlashInfer versions? If so, where do we document it?
Can you comment on the CI time for this file after the PR? Since you pretty much make the number of tests in this file 4x, I'm a bit worried about the impact on CI time.
Not sure about CI time, but from the time command, the diff:

After:
real 7m39.973s
user 9m35.828s
sys 0m17.893s

Before:
real 1m59.525s
user 2m42.044s
sys 0m9.140s

so it is about 4x.
-        if model_input.attn_metadata.use_cuda_graph:
+        use_cuda_graph = model_input.attn_metadata.use_cuda_graph
+        is_decode = model_input.attn_metadata.num_prefills == 0
+        if use_cuda_graph and is_decode:
Can you add a comment about why we look at use_cuda_graph and is_decode to update the state?
Done.
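For reference, a hedged sketch of the kind of comment that could sit next to that check (the exact wording in the PR may differ); the helper name is hypothetical, and attn_metadata follows the snippet quoted above.

```python
def should_update_cuda_graph_decode_state(attn_metadata) -> bool:
    # CUDA graphs are captured only for decode-only batches. With chunked
    # prefill enabled, a multi-step batch may still contain prefills, in
    # which case the model runs in eager mode and the captured decode
    # state must not be advanced.
    use_cuda_graph = attn_metadata.use_cuda_graph
    is_decode = attn_metadata.num_prefills == 0
    return use_cuda_graph and is_decode
```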
Can both v0 and v1 use this feature?
v1 doesn't have multi-step scheduling and we don't plan to add it.
Support multi-step scheduling with chunked prefill on FlashInfer, where prefill tokens are turned into decode tokens after the first step.
cc @comaniac @yzh199 @WoosukKwon @youkaichao
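Based on the reproduction commands earlier in the thread, a minimal sketch of enabling this combination from the Python API; the model name is a placeholder, and parameter availability depends on the vLLM version.

```python
import os

from vllm import LLM, SamplingParams

# Select the FlashInfer attention backend before the engine is created.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,   # chunked prefill
    num_scheduler_steps=10,        # multi-step scheduling
    max_num_batched_tokens=512,
    gpu_memory_utilization=0.95,
)
print(llm.generate(["Hello, world"], SamplingParams(max_tokens=16)))
```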