[pull] main from vllm-project:main #15
Conversation
Co-authored-by: Lei Wen <[email protected]>
Signed-off-by: Prashant Gupta <[email protected]> Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]> Co-authored-by: Woosuk Kwon <[email protected]> Co-authored-by: mgoin <[email protected]> Co-authored-by: Tyler Michael Smith <[email protected]> Co-authored-by: Cody Yu <[email protected]>
…int (#3467) Co-authored-by: Lily Liu <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
…#4494) Co-authored-by: Simon Mo <[email protected]>
… obtain the CUDA version. (#4173) Signed-off-by: AnyISalIn <[email protected]>
Signed-off-by: Travis Johnson <[email protected]>
Co-authored-by: Lei Wen <[email protected]>
Co-authored-by: Lei Wen <[email protected]> Co-authored-by: Sage Moore <[email protected]>
This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. Note that I removed num_warps and num_stages from the configurations for small batch sizes, since that improved performance and brought the benchmarks in that regime back on par with the previous numbers, ensuring this is a strict improvement over the status quo.

All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1 with 1000 input and 50 output tokens.

Before this PR (with static activation scaling):
- qps = 1: 9.8 ms ITL, 0.49 s e2e latency
- qps = 2: 9.7 ms ITL, 0.49 s e2e latency
- qps = 4: 10.1 ms ITL, 0.52 s e2e latency
- qps = 6: 11.9 ms ITL, 0.59 s e2e latency
- qps = 8: 14.0 ms ITL, 0.70 s e2e latency
- qps = 10: 15.7 ms ITL, 0.79 s e2e latency

After this PR (with static activation scaling):
- qps = 1: 9.8 ms ITL, 0.49 s e2e latency
- qps = 2: 9.7 ms ITL, 0.49 s e2e latency
- qps = 4: 10.2 ms ITL, 0.53 s e2e latency
- qps = 6: 11.9 ms ITL, 0.59 s e2e latency
- qps = 8: 11.9 ms ITL, 0.59 s e2e latency
- qps = 10: 12.1 ms ITL, 0.61 s e2e latency
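For readers unfamiliar with these tuning files: a fused_moe configuration maps a batch size to a set of Triton launch parameters. A minimal sketch of the kind of change described above is shown below; the shapes and values are purely illustrative, not the tuned numbers from this PR.

```python
# Illustrative sketch only: the general shape of a fused_moe tuning config.
# For small batch sizes, num_warps / num_stages are omitted so the kernel
# launch falls back to Triton's defaults, which the benchmarks above showed
# performs better in that regime.
example_config = {
    "1": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32, "BLOCK_SIZE_K": 64,
          "GROUP_SIZE_M": 1},  # no num_warps / num_stages
    "64": {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 8, "num_warps": 4, "num_stages": 4},
}
```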
Remove the device="cuda" declarations in mixtral as promised in #4343
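As a hypothetical before/after illustration of what removing a hard-coded device looks like (not the actual diff; shapes and names are placeholders):

```python
import torch

# Placeholder shapes/dtype purely for illustration.
hidden_size, intermediate_size = 4096, 14336
params_dtype = torch.float16

# Before (hypothetical): the device is pinned at tensor construction time.
if torch.cuda.is_available():
    w_before = torch.empty(hidden_size, intermediate_size,
                           dtype=params_dtype, device="cuda")

# After: no explicit device; the tensor is created on the current default
# device, so each worker/process can control placement externally.
w_after = torch.empty(hidden_size, intermediate_size, dtype=params_dtype)
```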
…n is not 1 and max_tokens is large & Add tests for preemption (#4451)
/ok-to-test
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Co-authored-by: Cade Daniel <[email protected]>
@pull[bot]: The following tests failed.
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Co-authored-by: Michael Goin <[email protected]>
Closed in favour of #18
This PR updates our grpc_server to add TGIS-style logs, similar to https://github.com/IBM/text-generation-inference/blob/main/router/src/grpc_server.rs#L504-L512. It also disables vLLM's per-request logging so that we don't double-log each request.

The timing info collected here is pretty rough: it doesn't plumb into the LLMEngine, it just times the generators to get the total time spent in the engine. We could do better, but this is a start.

Example logs:

```
INFO 04-09 21:51:01 logs.py:43] generate_stream{input=[b'This is the story of Obama ridin...'] prefix_id= input_chars=[70] params=sampling { } stopping { max_new_tokens: 200 min_new_tokens: 16 } response { } decoding { } tokenization_time=0.45ms queue_and_inference_time=1096.67ms time_per_token=5.48ms total_time=1097.12ms input_toks=16}: Streaming response generated 200 tokens before NOT_FINISHED, output 848 chars: b' California. The story is told i...'
INFO 04-09 21:51:08 logs.py:43] generate{input=[b'Lorem ipsum dolor sit amet, cons...', b'foooood man where is it'] prefix_id= input_chars=[469] params=sampling { } stopping { max_new_tokens: 20 min_new_tokens: 16 } response { } decoding { } tokenization_time=2.03ms queue_and_inference_time=122.23ms time_per_token=6.11ms total_time=124.26ms input_toks=124}: Sub-request 0 from batch of 2 generated 20 tokens before MAX_TOKENS, output 25 chars: b'?\\n\\n<!--\\n<!--\\n<!--\\n<!--\\n<!'
INFO 04-09 21:51:08 logs.py:43] generate{input=[b'Lorem ipsum dolor sit amet, cons...', b'foooood man where is it'] prefix_id= input_chars=[469] params=sampling { } stopping { max_new_tokens: 20 min_new_tokens: 16 } response { } decoding { } tokenization_time=2.07ms queue_and_inference_time=122.22ms time_per_token=6.11ms total_time=124.29ms input_toks=7}: Sub-request 1 from batch of 2 generated 20 tokens before MAX_TOKENS, output 70 chars: b"?\\nI don't know.\\nI don't know.\\nI ..."
```

---------
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
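To illustrate the "time the generators" approach described above, here is a minimal sketch (not the PR's actual code; the helper name and stats keys are hypothetical) of wrapping an async generator so that only time spent waiting on it is accumulated:

```python
import time
from typing import Any, AsyncIterator, Dict


async def timed(gen: AsyncIterator[Any], stats: Dict[str, float]) -> AsyncIterator[Any]:
    """Hypothetical wrapper: accumulate wall-clock time spent waiting on `gen`.

    Only the time spent inside the wrapped generator (roughly queueing plus
    inference) is counted, not time the caller spends between yields.
    """
    engine_time = 0.0
    tokens = 0
    it = gen.__aiter__()
    while True:
        start = time.monotonic()
        try:
            item = await it.__anext__()
        except StopAsyncIteration:
            engine_time += time.monotonic() - start
            break
        engine_time += time.monotonic() - start
        tokens += 1
        yield item
    stats["queue_and_inference_time_ms"] = engine_time * 1000
    stats["time_per_token_ms"] = (engine_time / tokens * 1000) if tokens else 0.0
```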
Correctly calculate the same value for the required number of cache blocks across all torchrun processes
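A minimal sketch of how such agreement can be reached (an illustration, not the actual patch): each process computes its local block count from free GPU memory, then the group takes the minimum so every rank allocates an identical KV cache.

```python
import torch
import torch.distributed as dist


def agree_on_num_gpu_blocks(local_num_blocks: int) -> int:
    """Illustrative helper: make all torchrun processes use the same block count.

    Ranks may compute slightly different numbers of available cache blocks
    (e.g. due to differing free memory); an all-reduce with MIN guarantees a
    single shared value. With the NCCL backend the tensor would need to be
    placed on the GPU before the all-reduce.
    """
    t = torch.tensor([local_num_blocks], dtype=torch.int64)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(t, op=dist.ReduceOp.MIN)
    return int(t.item())
```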
See Commits and Changes for more details.
Created by pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )