[pull] main from vllm-project:main #11
Conversation
- Co-authored-by: Lei Wen <[email protected]>
- Signed-off-by: Prashant Gupta <[email protected]>; Co-authored-by: Roger Wang <[email protected]>
- Co-authored-by: Philipp Moritz <[email protected]>; Co-authored-by: Woosuk Kwon <[email protected]>; Co-authored-by: mgoin <[email protected]>; Co-authored-by: Tyler Michael Smith <[email protected]>; Co-authored-by: Cody Yu <[email protected]>
- …int (#3467) Co-authored-by: Lily Liu <[email protected]>; Co-authored-by: Cyrus Leung <[email protected]>
- …#4494) Co-authored-by: Simon Mo <[email protected]>
- … obtain the CUDA version. (#4173) Signed-off-by: AnyISalIn <[email protected]>
- Signed-off-by: Travis Johnson <[email protected]>
- Co-authored-by: Lei Wen <[email protected]>
- Co-authored-by: Lei Wen <[email protected]>; Co-authored-by: Sage Moore <[email protected]>
This PR updates the tuning script for the fused_moe kernel to support FP8 and also adds configurations for TP4. Note that I removed num_warps and num_stages from the configurations for small batch sizes, since that improved performance and brought the benchmarks in that regime back on par with the previous numbers, so this is a strict improvement over the status quo. All the numbers below are for mistralai/Mixtral-8x7B-Instruct-v0.1 with 1000 input and 50 output tokens.

Before this PR (with static activation scaling):

| qps | ITL (ms) | e2e latency (s) |
|----:|---------:|----------------:|
| 1 | 9.8 | 0.49 |
| 2 | 9.7 | 0.49 |
| 4 | 10.1 | 0.52 |
| 6 | 11.9 | 0.59 |
| 8 | 14.0 | 0.70 |
| 10 | 15.7 | 0.79 |

After this PR (with static activation scaling):

| qps | ITL (ms) | e2e latency (s) |
|----:|---------:|----------------:|
| 1 | 9.8 | 0.49 |
| 2 | 9.7 | 0.49 |
| 4 | 10.2 | 0.53 |
| 6 | 11.9 | 0.59 |
| 8 | 11.9 | 0.59 |
| 10 | 12.1 | 0.61 |
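For context, the tuned fused_moe configurations in vLLM are JSON files keyed by batch size, with each entry holding the Triton launch parameters. The sketch below is purely illustrative (made-up block sizes, hypothetical pick_config helper); it only shows the shape of a config in which the small-batch entries omit num_warps and num_stages so the kernel falls back to Triton's defaults:

```python
import json

# Illustrative only: key names mirror the fused_moe JSON configs shipped
# with vLLM, but the values here are NOT the actual tuned numbers.
example_config = {
    # batch size (M) -> kernel launch parameters
    "1": {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 32, "BLOCK_SIZE_K": 64,
          "GROUP_SIZE_M": 1},                       # no num_warps / num_stages
    "64": {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 64,
           "GROUP_SIZE_M": 8},
    "1024": {"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
             "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 4},
}

def pick_config(configs: dict, m: int) -> dict:
    """Pick the tuned entry whose batch-size key is closest to m."""
    key = min(configs, key=lambda k: abs(int(k) - m))
    return configs[key]

# Small batches get an entry without num_warps/num_stages.
print(json.dumps(pick_config(example_config, 4), indent=2))
```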
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: pull[bot]. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Hi @pull[bot]. Thanks for your PR. I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
…default to None (#4586)
Co-authored-by: LiuXiaoxuanPKU <[email protected]>
… Dynamic/Static Activations) (#4527) Follow-on to #4332 to enable FP8 checkpoint loading for Mixtral; supersedes #4436. This PR enables the following checkpoint loading features for Mixtral:

- Loading FP8 checkpoints for Mixtral, such as the "nm-testing/Mixtral-8x7B-Instruct-v0.1-FP8" test model
- Static or dynamic activation quantization combined with static weight quantization (all per-tensor)
- Different scales for each expert weight
- FP8 in the QKV layer

Notes:

- The expert gate/router always runs at half/full precision for now.
- If there are different weight scales between the QKV projections (for separate QKV weights), they are re-quantized using layer.weight_scale.max() so we can use a single GEMM for performance (see the sketch below).
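A minimal sketch of that max-scale re-quantization idea, assuming per-tensor FP8 scales stored as 0-d tensors and torch >= 2.1 for float8 support; the function name and tensor layout are hypothetical, not the actual vLLM implementation:

```python
import torch

def requantize_to_max_scale(weights, scales):
    """Re-quantize separately scaled FP8 weights (e.g. q/k/v) to one shared scale.

    weights: list of torch.float8_e4m3fn tensors
    scales:  matching per-tensor scales (0-d tensors), where w_f16 ~= w_fp8.to(f16) * scale
    Returns the fused FP8 weight and the single shared scale, so the fused
    QKV projection can run as one FP8 GEMM.
    """
    max_scale = torch.stack(list(scales)).max()
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    requantized = []
    for w, s in zip(weights, scales):
        w_f16 = w.to(torch.float16) * s                      # dequantize with the old scale
        w_new = (w_f16 / max_scale).clamp(-fp8_max, fp8_max)  # re-quantize with the shared scale
        requantized.append(w_new.to(torch.float8_e4m3fn))
    return torch.cat(requantized, dim=0), max_scale
```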
Previously, FP8 static scaling only worked if the scales overestimated the maxima of all activation tensors seen during computation. However, this will not always be the case, even if the scales were calibrated very carefully. For example, with the activation scales in my checkpoint https://huggingface.co/pcmoritz/Mixtral-8x7B-v0.1-fp8-act-scale (which was calibrated on https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), I'm getting the following, mostly random, performance on MMLU:

| Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
|--------|---------|--------|-------:|--------|------:|---|-------:|
| mmlu | N/A | none | 0 | acc | 0.2295 | ± | 0.0035 |
| - humanities | N/A | none | 5 | acc | 0.2421 | ± | 0.0062 |
| - other | N/A | none | 5 | acc | 0.2398 | ± | 0.0076 |
| - social_sciences | N/A | none | 5 | acc | 0.2171 | ± | 0.0074 |
| - stem | N/A | none | 5 | acc | 0.2125 | ± | 0.0073 |

With the fix in this PR, where the scaled activations are clamped to [-std::numeric_limits<c10::Float8_e4m3fn>::max(), std::numeric_limits<c10::Float8_e4m3fn>::max()] to make sure there are no NaNs, the performance is:

| Groups | Version | Filter | n-shot | Metric | Value | | Stderr |
|--------|---------|--------|-------:|--------|------:|---|-------:|
| mmlu | N/A | none | 0 | acc | 0.7008 | ± | 0.0036 |
| - humanities | N/A | none | 5 | acc | 0.6453 | ± | 0.0065 |
| - other | N/A | none | 5 | acc | 0.7692 | ± | 0.0072 |
| - social_sciences | N/A | none | 5 | acc | 0.8083 | ± | 0.0070 |
| - stem | N/A | none | 5 | acc | 0.6115 | ± | 0.0083 |

This is not perfect yet, but it is getting very close to the FP16 / dynamic activation scale performance.
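The actual clamp lives in the CUDA scaled-quantization kernel; the snippet below is only a rough PyTorch rendering of the same numerics (hypothetical function name, assuming per-tensor static scales):

```python
import torch

def static_scaled_fp8_quant_ref(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reference for static FP8 activation quantization with clamping.

    The scaled values are clamped to the finite e4m3 range before the cast,
    so activations that exceed the calibrated scale can no longer turn into
    NaNs; they saturate at +/- fp8_max instead.
    """
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    scaled = (x / scale).clamp(-fp8_max, fp8_max)
    return scaled.to(torch.float8_e4m3fn)
```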
/ok-to-test
Had a user request for the Apache 2 license file to exist in the image we provide. Signed-off-by: Joe Runde <[email protected]>
See Commits and Changes for more details.
Created by pull[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )