[pull] main from vllm-project:main #11

Merged 65 commits on May 7, 2024. Changes from all commits are shown below.

Commits (65):
4bb53e2  [BugFix] fix num_lookahead_slots missing in async executor (#4165) (leiwen83, Apr 30, 2024)
b31a1fb  [Doc] add visualization for multi-stage dockerfile (#4456) (prashantgupta24, Apr 30, 2024)
111815d  [Kernel] Support Fp8 Checkpoints (Dynamic + Static) (#4332) (robertgshaw2-neuralmagic, Apr 30, 2024)
a494140  [Frontend] Support complex message content for chat completions endpo… (fgreinacher, Apr 30, 2024)
715c2d8  [Frontend] [Core] Tensorizer: support dynamic `num_readers`, update v… (alpayariyak, Apr 30, 2024)
dd1a50a  [Bugfix][Minor] Make ignore_eos effective (#4468) (bigPYJ1151, Apr 30, 2024)
6ad58f4  fix_tokenizer_snapshot_download_bug (#4493) (kingljl, Apr 30, 2024)
ee37328  Unable to find Punica extension issue during source code installation… (kingljl, May 1, 2024)
2e240c6  [Core] Centralize GPU Worker construction (#4419) (njhill, May 1, 2024)
f458112  [Misc][Typo] type annotation fix (#4495) (HarryWu99, May 1, 2024)
a822eb3  [Misc] fix typo in block manager (#4453) (Juelianqvq, May 1, 2024)
c3845d8  Allow user to define whitespace pattern for outlines (#4305) (robcaulk, May 1, 2024)
d6f4bd7  [Misc]Add customized information for models (#4132) (jeejeelee, May 1, 2024)
6f1df80  [Test] Add ignore_eos test (#4519) (rkooo567, May 1, 2024)
a88bb9b  [Bugfix] Fix the fp8 kv_cache check error that occurs when failing to… (AnyISalIn, May 1, 2024)
4dc8026  [Bugfix] Fix 307 Redirect for `/metrics` (#4523) (robertgshaw2-neuralmagic, May 1, 2024)
e491c7e  [Doc] update(example model): for OpenAI compatible serving (#4503) (fpaupier, May 1, 2024)
6990912  [Bugfix] Use random seed if seed is -1 (#4531) (sasha0552, May 1, 2024)
8b798ee  [CI/Build][Bugfix] VLLM_USE_PRECOMPILED should skip compilation (#4534) (tjohnson31415, May 1, 2024)
b38e42f  [Speculative decoding] Add ngram prompt lookup decoding (#4237) (leiwen83, May 1, 2024)
24750f4  [Core] Enable prefix caching with block manager v2 enabled (#4142) (leiwen83, May 1, 2024)
a657bfc  [Core] Add `multiproc_worker_utils` for multiprocessing-based workers… (njhill, May 1, 2024)
24bb4fe  [Kernel] Update fused_moe tuning script for FP8 (#4457) (pcmoritz, May 1, 2024)
c47ba4a  [Bugfix] Add validation for seed (#4529) (sasha0552, May 1, 2024)
3a922c1  [Bugfix][Core] Fix and refactor logging stats (#4336) (esmeetu, May 1, 2024)
6ef09b0  [Core][Distributed] fix pynccl del error (#4508) (youkaichao, May 1, 2024)
c9d852d  [Misc] Remove Mixtral device="cuda" declarations (#4543) (pcmoritz, May 1, 2024)
826b82a  [Misc] Fix expert_ids shape in MoE (#4517) (WoosukKwon, May 1, 2024)
b8afa8b  [MISC] Rework logger to enable pythonic custom logging configuration … (May 2, 2024)
0d62fe5  [Bug fix][Core] assert num_new_tokens == 1 fails when SamplingParams.… (rkooo567, May 2, 2024)
5e401bc  [CI]Add regression tests to ensure the async engine generates metrics… (ronensc, May 2, 2024)
cf8cac8  [mypy][6/N] Fix all the core subdirectory typing (#4450) (rkooo567, May 2, 2024)
2a85f93  [Core][Distributed] enable multiple tp group (#4512) (youkaichao, May 2, 2024)
7038e8b  [Kernel] Support running GPTQ 8-bit models in Marlin (#4533) (alexm-neuralmagic, May 2, 2024)
fb087af  [mypy][7/N] Cover all directories (#4555) (rkooo567, May 2, 2024)
5ad60b0  [Misc] Exclude the `tests` directory from being packaged (#4552) (itechbear, May 2, 2024)
1ff0c73  [BugFix] Include target-device specific requirements.txt in sdist (#4… (markmc, May 2, 2024)
5b8a7c1  [Misc] centralize all usage of environment variables (#4548) (youkaichao, May 2, 2024)
32881f3  [kernel] fix sliding window in prefix prefill Triton kernel (#4405) (mmoskal, May 2, 2024)
9b5c9f9  [CI/Build] AMD CI pipeline with extended set of tests. (#4267) (Alexei-V-Ivanov-AMD, May 2, 2024)
0f8a914  [Core] Ignore infeasible swap requests. (#4557) (rkooo567, May 2, 2024)
344a5d0  [Core][Distributed] enable allreduce for multiple tp groups (#4566) (youkaichao, May 3, 2024)
808632d  [BugFix] Prevent the task of `_force_log` from being garbage collecte… (Atry, May 3, 2024)
ce3f1ee  [Misc] remove chunk detected debug logs (#4571) (DefTruth, May 3, 2024)
2d7bce9  [Doc] add env vars to the doc (#4572) (youkaichao, May 3, 2024)
3521ba4  [Core][Model runner refactoring 1/N] Refactor attn metadata term (#4518) (rkooo567, May 3, 2024)
7e65477  [Bugfix] Allow "None" or "" to be passed to CLI for string args that … (mgoin, May 3, 2024)
f8e7add  Fix/async chat serving (#2727) (schoennenbeck, May 3, 2024)
43c413e  [Kernel] Use flashinfer for decoding (#4353) (LiuXiaoxuanPKU, May 3, 2024)
ab50275  [Speculative decoding] Support target-model logprobs (#4378) (cadedaniel, May 3, 2024)
344bf7c  [Misc] add installation time env vars (#4574) (youkaichao, May 3, 2024)
bc8ad68  [Misc][Refactor] Introduce ExecuteModelData (#4540) (comaniac, May 4, 2024)
36fb68f  [Doc] Chunked Prefill Documentation (#4580) (rkooo567, May 4, 2024)
2a05201  [Kernel] Support MoE Fp8 Checkpoints for Mixtral (Static Weights with… (mgoin, May 4, 2024)
021b1a2  [CI] check size of the wheels (#4319) (simon-mo, May 4, 2024)
4302987  [Bugfix] Fix inappropriate content of model_name tag in Prometheus me… (DearPlanet, May 4, 2024)
8d8357c  bump version to v0.4.2 (#4600) (simon-mo, May 5, 2024)
c7f2cf2  [CI] Reduce wheel size by not shipping debug symbols (#4602) (simon-mo, May 5, 2024)
0650e59  Disable cuda version check in vllm-openai image (#4530) (zhaoyang-star, May 5, 2024)
323f27b  [Bugfix] Fix `asyncio.Task` not being subscriptable (#4623) (DarkLight1337, May 6, 2024)
e186d37  [CI] use ccache actions properly in release workflow (#4629) (simon-mo, May 6, 2024)
19cb471  [CI] Add retry for agent lost (#4633) (cadedaniel, May 6, 2024)
bd99d22  Update lm-format-enforcer to 0.10.1 (#4631) (noamgat, May 6, 2024)
a98187c  [Kernel] Make static FP8 scaling more robust (#4570) (pcmoritz, May 7, 2024)
63575bc  [Core][Optimization] change python dict to pytorch tensor (#4607) (youkaichao, May 7, 2024)
Files changed:
36 changes: 36 additions & 0 deletions .buildkite/check-wheel-size.py (new file)

@@ -0,0 +1,36 @@
import os
import zipfile

MAX_SIZE_MB = 100


def print_top_10_largest_files(zip_file):
    with zipfile.ZipFile(zip_file, 'r') as z:
        file_sizes = [(f, z.getinfo(f).file_size) for f in z.namelist()]
        file_sizes.sort(key=lambda x: x[1], reverse=True)
        for f, size in file_sizes[:10]:
            print(f"{f}: {size/(1024*1024)} MBs uncompressed.")


def check_wheel_size(directory):
    for root, _, files in os.walk(directory):
        for f in files:
            if f.endswith(".whl"):
                wheel_path = os.path.join(root, f)
                wheel_size = os.path.getsize(wheel_path)
                wheel_size_mb = wheel_size / (1024 * 1024)
                if wheel_size_mb > MAX_SIZE_MB:
                    print(
                        f"Wheel {wheel_path} is too large ({wheel_size_mb} MB) "
                        f"compare to the allowed size ({MAX_SIZE_MB} MB).")
                    print_top_10_largest_files(wheel_path)
                    return 1
                else:
                    print(f"Wheel {wheel_path} is within the allowed size "
                          f"({wheel_size_mb} MB).")
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(check_wheel_size(sys.argv[1]))
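The Dockerfile change at the bottom of this diff wires this script into the image build (`RUN python3 check-wheel-size.py dist`), so an oversized wheel fails the build early. The same invocation works as a local sanity check after building a wheel:

```bash
# run the size check against the directory that setup.py writes wheels to
python3 .buildkite/check-wheel-size.py dist
```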
58 changes: 25 additions & 33 deletions .buildkite/run-amd-test.sh

@@ -1,10 +1,11 @@
-# This script build the ROCm docker image and run the API server inside the container.
-# It serves a sanity check for compilation and basic model usage.
+# This script build the ROCm docker image and runs test inside it.
 set -ex
 
 # Print ROCm version
 echo "--- ROCm info"
 rocminfo
 
+echo "--- Resetting GPUs"
+
 echo "reset" > /opt/amdgpu/etc/gpu_state
 
@@ -16,37 +17,28 @@ while true; do
         fi
 done
 
-# Try building the docker image
-docker build -t rocm -f Dockerfile.rocm .
-
-# Setup cleanup
-remove_docker_container() { docker rm -f rocm || true; }
-trap remove_docker_container EXIT
-remove_docker_container
-
-# Run the image
-export HIP_VISIBLE_DEVICES=1
-docker run --device /dev/kfd --device /dev/dri --network host -e HIP_VISIBLE_DEVICES --name rocm rocm python3 -m vllm.entrypoints.api_server &
-
-# Wait for the server to start
-wait_for_server_to_start() {
-  timeout=300
-  counter=0
-
-  while [ "$(curl -s -o /dev/null -w ''%{http_code}'' localhost:8000/health)" != "200" ]; do
-    sleep 1
-    counter=$((counter + 1))
-    if [ $counter -ge $timeout ]; then
-      echo "Timeout after $timeout seconds"
-      break
-    fi
-  done
-}
-wait_for_server_to_start
-
-# Test a simple prompt
-curl -X POST -H "Content-Type: application/json" \
-  localhost:8000/generate \
-  -d '{"prompt": "San Francisco is a"}'
+echo "--- Building container"
+sha=$(git rev-parse --short HEAD)
+container_name=rocm_${sha}
+docker build \
+  -t ${container_name} \
+  -f Dockerfile.rocm \
+  --progress plain \
+  .
+
+remove_docker_container() {
+  docker rm -f ${container_name} || docker image rm -f ${container_name} || true
+}
+trap remove_docker_container EXIT
+
+echo "--- Running container"
+
+docker run \
+  --device /dev/kfd --device /dev/dri \
+  --network host \
+  --rm \
+  -e HF_TOKEN \
+  --name ${container_name} \
+  ${container_name} \
+  /bin/bash -c $(echo $1 | sed "s/^'//" | sed "s/'$//")
5 changes: 5 additions & 0 deletions .buildkite/run-benchmarks.sh

@@ -53,6 +53,11 @@ echo '```' >> benchmark_results.md
 tail -n 20 benchmark_serving.txt >> benchmark_results.md # last 20 lines
 echo '```' >> benchmark_results.md
 
+# if the agent binary is not found, skip uploading the results, exit 0
+if [ ! -f /workspace/buildkite-agent ]; then
+    exit 0
+fi
+
 # upload the results to buildkite
 /workspace/buildkite-agent annotate --style "info" --context "benchmark-results" < benchmark_results.md
23 changes: 21 additions & 2 deletions .buildkite/test-pipeline.yaml

@@ -17,27 +17,38 @@ steps:
   - VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s basic_correctness/test_basic_correctness.py
   - VLLM_ATTENTION_BACKEND=XFORMERS pytest -v -s basic_correctness/test_chunked_prefill.py
   - VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s basic_correctness/test_chunked_prefill.py
+  - VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py
 
 - label: Core Test
+  mirror_hardwares: [amd]
   command: pytest -v -s core
 
 - label: Distributed Comm Ops Test
   command: pytest -v -s test_comm_ops.py
   working_dir: "/vllm-workspace/tests/distributed"
-  num_gpus: 2 # only support 1 or 2 for now.
+  num_gpus: 2
 
 - label: Distributed Tests
   working_dir: "/vllm-workspace/tests/distributed"
 
   num_gpus: 2 # only support 1 or 2 for now.
+  mirror_hardwares: [amd]
 
   commands:
   - pytest -v -s test_pynccl.py
   - pytest -v -s test_pynccl_library.py
   - TEST_DIST_MODEL=facebook/opt-125m pytest -v -s test_basic_distributed_correctness.py
   - TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest -v -s test_basic_distributed_correctness.py
   - TEST_DIST_MODEL=facebook/opt-125m pytest -v -s test_chunked_prefill_distributed.py
   - TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest -v -s test_chunked_prefill_distributed.py
 
+- label: Distributed Tests (Multiple Groups)
+  working_dir: "/vllm-workspace/tests/distributed"
+  num_gpus: 4
+  commands:
+  - pytest -v -s test_pynccl.py
+
 - label: Engine Test
+  mirror_hardwares: [amd]
   command: pytest -v -s engine tokenization test_sequence.py test_config.py test_logger.py
 
 - label: Entrypoints Test
@@ -48,6 +59,7 @@
 
 - label: Examples Test
   working_dir: "/vllm-workspace/examples"
+  mirror_hardwares: [amd]
   commands:
   # install aws cli for llava_example.py
   - pip install awscli
@@ -61,29 +73,35 @@
   parallelism: 4
 
 - label: Models Test
+  mirror_hardwares: [amd]
   commands:
   - bash ../.buildkite/download-images.sh
   - pytest -v -s models --ignore=models/test_llava.py --ignore=models/test_mistral.py
 
 - label: Llava Test
+  mirror_hardwares: [amd]
   commands:
   - bash ../.buildkite/download-images.sh
   - pytest -v -s models/test_llava.py
 
 - label: Prefix Caching Test
+  mirror_hardwares: [amd]
   commands:
   - pytest -v -s prefix_caching
 
 - label: Samplers Test
   command: pytest -v -s samplers
 
 - label: LogitsProcessor Test
+  mirror_hardwares: [amd]
   command: pytest -v -s test_logits_processor.py
 
 - label: Worker Test
+  mirror_hardwares: [amd]
   command: pytest -v -s worker
 
 - label: Speculative decoding tests
+  mirror_hardwares: [amd]
   command: pytest -v -s spec_decode
 
 - label: LoRA Test %N
@@ -101,6 +119,7 @@
 
 - label: Benchmarks
   working_dir: "/vllm-workspace/.buildkite"
+  mirror_hardwares: [amd]
   commands:
   - pip install aiohttp
   - bash run-benchmarks.sh
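Every `mirror_hardwares: [amd]` annotation above is consumed by the updated template in the next file: any step carrying it gets cloned into the new AMD group. A minimal sketch of opting a future step in (the label and test file are hypothetical):

```yaml
- label: My Feature Test              # hypothetical step
  mirror_hardwares: [amd]             # also run this step on the AMD queue
  command: pytest -v -s test_my_feature.py
```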
28 changes: 23 additions & 5 deletions .buildkite/test-template.j2

@@ -14,20 +14,33 @@ steps:
       automatic:
         - exit_status: -1 # Agent was lost
           limit: 5
+        - exit_status: -10 # Agent was lost
+          limit: 5
   - wait
 
-  - label: "AMD Test"
-    agents:
-      queue: amd
-    command: bash .buildkite/run-amd-test.sh
+  - group: "AMD Tests"
+    depends_on: ~
+    steps:
+    {% for step in steps %}
+    {% if step.mirror_hardwares and "amd" in step.mirror_hardwares %}
+      - label: "AMD: {{ step.label }}"
+        agents:
+          queue: amd
+        command: bash .buildkite/run-amd-test.sh "'cd {{ (step.working_dir or default_working_dir) | safe }} && {{ step.command or (step.commands | join(' && ')) | safe }}'"
+        env:
+          DOCKER_BUILDKIT: "1"
+    {% endif %}
+    {% endfor %}
 
   - label: "Neuron Test"
+    depends_on: ~
     agents:
       queue: neuron
     command: bash .buildkite/run-neuron-test.sh
     soft_fail: true
 
-  - label: "CPU Test"
+  - label: "Intel Test"
+    depends_on: ~
     command: bash .buildkite/run-cpu-test.sh
 
 {% for step in steps %}
@@ -42,9 +55,14 @@
       automatic:
         - exit_status: -1 # Agent was lost
           limit: 5
+        - exit_status: -10 # Agent was lost
+          limit: 5
     plugins:
       - kubernetes:
           podSpec:
+            {% if step.num_gpus %}
+            priorityClassName: gpu-priority-cls-{{ step.num_gpus }}
+            {% endif %}
            volumes:
              - name: dshm
                emptyDir:
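For a concrete picture of the loop above, the Core Test step from the pipeline would render roughly as follows, assuming `default_working_dir` resolves to `/vllm-workspace/tests` (an assumption; the actual default lives in the pipeline configuration):

```yaml
# hypothetical rendered output for the mirrored "Core Test" step
- label: "AMD: Core Test"
  agents:
    queue: amd
  command: bash .buildkite/run-amd-test.sh "'cd /vllm-workspace/tests && pytest -v -s core'"
  env:
    DOCKER_BUILDKIT: "1"
```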
8 changes: 4 additions & 4 deletions .github/workflows/mypy.yaml

@@ -33,6 +33,7 @@ jobs:
     - name: Mypy
       run: |
         mypy vllm/attention --config-file pyproject.toml
+        mypy vllm/core --config-file pyproject.toml
         mypy vllm/distributed --config-file pyproject.toml
         mypy vllm/entrypoints --config-file pyproject.toml
         mypy vllm/executor --config-file pyproject.toml
@@ -42,9 +43,8 @@
         mypy vllm/engine --config-file pyproject.toml
         mypy vllm/worker --config-file pyproject.toml
         mypy vllm/spec_decode --config-file pyproject.toml
-        mypy vllm/lora --config-file pyproject.toml
-        mypy vllm/model_executor --config-file pyproject.toml
-
-        # TODO(sang): Fix nested dir
-        mypy vllm/core/*.py --follow-imports=skip --config-file pyproject.toml
+        mypy vllm/lora --config-file pyproject.toml
+        mypy vllm/logging --config-file pyproject.toml
+        mypy vllm/model_executor --config-file pyproject.toml
5 changes: 5 additions & 0 deletions .github/workflows/publish.yml

@@ -58,6 +58,9 @@
 
     - name: Setup ccache
       uses: hendrikmuhs/ccache-action@v1.2
+      with:
+        create-symlink: true
+        key: ${{ github.job }}-${{ matrix.python-version }}-${{ matrix.cuda-version }}
 
     - name: Set up Linux Env
       if: ${{ runner.os == 'Linux' }}
@@ -79,6 +82,8 @@
 
     - name: Build wheel
       shell: bash
+      env:
+        CMAKE_BUILD_TYPE: Release # do not compile with debug symbol to reduce wheel size
       run: |
         bash -x .github/workflows/scripts/build.sh ${{ matrix.python-version }} ${{ matrix.cuda-version }}
         wheel_name=$(ls dist/*whl | xargs -n 1 basename)
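Setting `CMAKE_BUILD_TYPE: Release` drops the `-g` debug information that a debug-friendly build type would otherwise embed in the compiled extensions, which is where the wheel-size reduction in this PR comes from. A rough way to gauge the effect on an already-built extension (the file name is hypothetical):

```bash
# compare an extension's size before and after stripping debug info
du -h vllm/_C.abi3.so
strip --strip-debug vllm/_C.abi3.so
du -h vllm/_C.abi3.so
```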
2 changes: 1 addition & 1 deletion .github/workflows/scripts/create_release.js

@@ -8,7 +8,7 @@ module.exports = async (github, context, core) => {
     generate_release_notes: true,
     name: process.env.RELEASE_TAG,
     owner: context.repo.owner,
-    prerelease: false,
+    prerelease: true,
     repo: context.repo.repo,
     tag_name: process.env.RELEASE_TAG,
   });
16 changes: 12 additions & 4 deletions Dockerfile

@@ -1,9 +1,13 @@
 # The vLLM Dockerfile is used to construct vLLM image that can be directly used
 # to run the OpenAI compatible server.
 
+# Please update any changes made here to
+# docs/source/dev/dockerfile/dockerfile.rst and
+# docs/source/assets/dev/dockerfile-stages-dependency.png
+
 #################### BASE BUILD IMAGE ####################
 # prepare basic build environment
-FROM nvidia/cuda:12.1.0-devel-ubuntu22.04 AS dev
+FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS dev
 
 RUN apt-get update -y \
     && apt-get install -y python3-pip git
@@ -12,7 +16,7 @@ RUN apt-get update -y \
 # https://github.com/pytorch/pytorch/issues/107960 -- hopefully
 # this won't be needed for future versions of this docker image
 # or future versions of triton.
-RUN ldconfig /usr/local/cuda-12.1/compat/
+RUN ldconfig /usr/local/cuda-12.4/compat/
 
 WORKDIR /workspace
 
@@ -71,6 +75,10 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
     --mount=type=cache,target=/root/.cache/pip \
     python3 setup.py bdist_wheel --dist-dir=dist
 
+# check the size of the wheel, we cannot upload wheels larger than 100MB
+COPY .buildkite/check-wheel-size.py check-wheel-size.py
+RUN python3 check-wheel-size.py dist
+
 # the `vllm_nccl` package must be installed from source distribution
 # pip is too smart to store a wheel in the cache, and other CI jobs
 # will directly use the wheel from the cache, which is not what we want.
@@ -98,7 +106,7 @@ RUN pip --verbose wheel flash-attn==${FLASH_ATTN_VERSION} \
 
 #################### vLLM installation IMAGE ####################
 # image with vLLM installed
-FROM nvidia/cuda:12.1.0-base-ubuntu22.04 AS vllm-base
+FROM nvidia/cuda:12.4.1-base-ubuntu22.04 AS vllm-base
 WORKDIR /vllm-workspace
 
 RUN apt-get update -y \
@@ -108,7 +116,7 @@ RUN apt-get update -y \
 # https://github.com/pytorch/pytorch/issues/107960 -- hopefully
 # this won't be needed for future versions of this docker image
 # or future versions of triton.
-RUN ldconfig /usr/local/cuda-12.1/compat/
+RUN ldconfig /usr/local/cuda-12.4/compat/
 
 # install vllm wheel first, so that torch etc will be installed
 RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist \
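A minimal sketch of exercising the updated multi-stage build locally, assuming the final stage keeps its OpenAI-server entrypoint (the tag and model are illustrative):

```bash
# build through all stages; the new check fails the build early
# if the wheel exceeds the 100 MB limit
DOCKER_BUILDKIT=1 docker build -t vllm-local .

# run the OpenAI-compatible server from the resulting image
docker run --gpus all -p 8000:8000 vllm-local --model facebook/opt-125m
```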