[Misc] Minimum requirements for SageMaker compatibility #11575

Closed · wants to merge 12 commits · Changes from all commits
10 changes: 9 additions & 1 deletion Dockerfile
@@ -235,7 +235,9 @@ RUN mv vllm test_docs/

#################### OPENAI API SERVER ####################
# openai api server alternative
FROM vllm-base AS vllm-openai

# define the sagemaker stage first, so it is not the default target for `docker build`
FROM vllm-base AS vllm-sagemaker

# install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \
@@ -247,5 +249,11 @@ RUN --mount=type=cache,target=/root/.cache/pip \

ENV VLLM_USAGE_SOURCE production-docker-image

# port 8080 required by sagemaker, https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-inference-code.html#your-algorithms-inference-code-container-response
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server", "--port", "8080"]

# the default image is `vllm-openai`, which is identical to `vllm-sagemaker` except that it does not force the port
FROM vllm-sagemaker AS vllm-openai

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
#################### OPENAI API SERVER ####################
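
For anyone who wants to sanity-check the port contract locally, here is a minimal smoke test (a sketch, not part of this PR). It assumes an image built with `--target vllm-sagemaker` is already running with port 8080 published, e.g. `docker run -p 8080:8080 <image>`; it probes vLLM's `/health` endpoint to confirm the server answers on the port SageMaker requires (SageMaker itself probes `GET /ping`, per the AWS doc linked above).

```python
# Sketch: verify the container answers on port 8080 as SageMaker requires.
# Assumes a locally running container from the vllm-sagemaker build target,
# started with: docker run -p 8080:8080 <image>
import urllib.request

with urllib.request.urlopen("http://localhost:8080/health", timeout=5) as resp:
    # vLLM's OpenAI-compatible server returns 200 from /health once it is ready.
    assert resp.status == 200, f"expected HTTP 200, got {resp.status}"

print("server is listening on port 8080")
```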
3 changes: 2 additions & 1 deletion docs/source/conf.py
@@ -191,6 +191,7 @@ def linkcode_resolve(domain, info):

# Mock out external dependencies here, otherwise the autodoc pages may be blank.
autodoc_mock_imports = [
    "blake3",
    "compressed_tensors",
    "cpuinfo",
    "cv2",
@@ -207,7 +208,7 @@ def linkcode_resolve(domain, info):
"tensorizer",
"pynvml",
"outlines",
"xgrammar,"
"xgrammar",
"librosa",
"soundfile",
"gguf",
24 changes: 12 additions & 12 deletions docs/source/design/multimodal/multimodal_index.md
@@ -45,39 +45,39 @@ adding_multimodal_plugin
### Base Classes

```{eval-rst}
.. autodata:: vllm.multimodal.NestedTensors
.. automodule:: vllm.multimodal.base
    :members:
    :show-inheritance:
```

```{eval-rst}
.. autodata:: vllm.multimodal.BatchedTensorInputs
```
### Input Classes

```{eval-rst}
.. autoclass:: vllm.multimodal.MultiModalDataBuiltins
.. automodule:: vllm.multimodal.inputs
    :members:
    :show-inheritance:
```

```{eval-rst}
.. autodata:: vllm.multimodal.MultiModalDataDict
```
### Audio Classes

```{eval-rst}
.. autoclass:: vllm.multimodal.MultiModalKwargs
.. automodule:: vllm.multimodal.audio
    :members:
    :show-inheritance:
```

### Image Classes

```{eval-rst}
.. autoclass:: vllm.multimodal.MultiModalPlugin
.. automodule:: vllm.multimodal.image
    :members:
    :show-inheritance:
```

### Image Classes
### Video Classes

```{eval-rst}
.. automodule:: vllm.multimodal.image
.. automodule:: vllm.multimodal.video
    :members:
    :show-inheritance:
```
3 changes: 1 addition & 2 deletions docs/source/models/supported_models.md
@@ -755,8 +755,7 @@ vLLM currently only supports adding LoRA to the language backbone of multimodal
```

```{note}
To use {code}`TIGER-Lab/Mantis-8B-siglip-llama3`, you have to install their GitHub repo ({code}`pip install git+https://github.com/TIGER-AI-Lab/Mantis.git`)
and pass {code}`--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
To use {code}`TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass {code}`--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
```
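
For offline inference, a rough equivalent of the flag above (a sketch; it assumes the `LLM` constructor's `hf_overrides` parameter, the offline counterpart of the CLI flag):

```python
# Sketch: offline equivalent of passing --hf_overrides on the command line.
# Assumes vllm.LLM accepts an hf_overrides dict mirroring the CLI flag.
from vllm import LLM

llm = LLM(
    model="TIGER-Lab/Mantis-8B-siglip-llama3",
    hf_overrides={"architectures": ["MantisForConditionalGeneration"]},
)
```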

```{note}
79 changes: 78 additions & 1 deletion docs/source/serving/deploying_with_k8s.md
@@ -47,7 +47,11 @@ data:
token: "REPLACE_WITH_TOKEN"
```

Create a deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model:
Next, create a deployment file for vLLM to run the model server. The following example deploys the `Mistral-7B-Instruct-v0.3` model.

Below are two example deployments, one for NVIDIA GPUs and one for AMD GPUs.

- NVIDIA GPU

```yaml
apiVersion: apps/v1
@@ -119,6 +123,79 @@ spec:
          periodSeconds: 5
```

- AMD GPU

You can refer to the `deployment.yaml` below when using an AMD ROCm GPU such as the MI300X.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mistral-7b
  namespace: default
  labels:
    app: mistral-7b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mistral-7b
  template:
    metadata:
      labels:
        app: mistral-7b
    spec:
      volumes:
      # PVC
      - name: cache-volume
        persistentVolumeClaim:
          claimName: mistral-7b
      # vLLM needs to access the host's shared memory for tensor parallel inference.
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: "8Gi"
      hostNetwork: true
      hostIPC: true
      containers:
      - name: mistral-7b
        image: rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4
        securityContext:
          seccompProfile:
            type: Unconfined
          runAsGroup: 44
          capabilities:
            add:
            - SYS_PTRACE
        command: ["/bin/sh", "-c"]
        args: [
          "vllm serve mistralai/Mistral-7B-v0.3 --port 8000 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
        ]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "10"
            memory: 20G
            amd.com/gpu: "1"
          requests:
            cpu: "6"
            memory: 6G
            amd.com/gpu: "1"
        volumeMounts:
        - name: cache-volume
          mountPath: /root/.cache/huggingface
        - name: shm
          mountPath: /dev/shm
```

The full example, including steps and sample YAML files, is available at <https://github.com/ROCm/k8s-device-plugin/tree/master/example/vllm-serve>.

2. **Create a Kubernetes Service for vLLM**

Next, create a Kubernetes Service file to expose the `mistral-7b` deployment:
Expand Down
2 changes: 1 addition & 1 deletion docs/source/usage/structured_outputs.md
@@ -2,7 +2,7 @@

# Structured Outputs

vLLM supports the generation of structured outputs using [outlines](https://github.com/dottxt-ai/outlines) or [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer) as backends for the guided decoding.
vLLM supports the generation of structured outputs using [outlines](https://github.com/dottxt-ai/outlines), [lm-format-enforcer](https://github.com/noamgat/lm-format-enforcer), or [xgrammar](https://github.com/mlc-ai/xgrammar) as backends for guided decoding.
This document shows you some examples of the different options that are available to generate structured outputs.

## Online Inference (OpenAI API)
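
As a quick illustration (a sketch assuming a vLLM server already running on `localhost:8000`; the model name is a placeholder for whatever the server is hosting), guided decoding can be requested through the OpenAI client via `extra_body`:

```python
# Sketch: constrain generation to a fixed set of choices with vLLM's
# guided_choice extension to the OpenAI-compatible API. Assumes a server
# started with `vllm serve <model>` on localhost:8000; the model name
# below is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")

completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # placeholder
    messages=[{
        "role": "user",
        "content": "Classify the sentiment of: vLLM is wonderful!"
    }],
    extra_body={"guided_choice": ["positive", "negative"]},
)
print(completion.choices[0].message.content)
```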
15 changes: 14 additions & 1 deletion examples/offline_inference_vision_language.py
@@ -308,7 +308,20 @@ def run_mllama(question: str, modality: str):
        disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
    )

    prompt = f"<|image|><|begin_of_text|>{question}"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    messages = [{
        "role": "user",
        "content": [{
            "type": "image"
        }, {
            "type": "text",
            "text": f"{question}"
        }]
    }]
    prompt = tokenizer.apply_chat_template(messages,
                                           add_generation_prompt=True,
                                           tokenize=False)
    stop_token_ids = None
    return llm, prompt, stop_token_ids

1 change: 1 addition & 0 deletions tests/entrypoints/openai/test_serving_chat.py
@@ -33,6 +33,7 @@ class MockModelConfig:
    hf_config = MockHFConfig()
    logits_processor_pattern = None
    diff_sampling_param: Optional[dict] = None
    allowed_local_media_path: str = ""

    def get_diff_sampling_param(self):
        return self.diff_sampling_param or {}
4 changes: 2 additions & 2 deletions tests/entrypoints/openai/test_vision_embedding.py
@@ -91,5 +91,5 @@ async def test_image_embedding(server: RemoteOpenAIServer, model_name: str,
    assert len(embeddings.data) == 1
    assert len(embeddings.data[0].embedding) == 3072
    assert embeddings.usage.completion_tokens == 0
    assert embeddings.usage.prompt_tokens == 765
    assert embeddings.usage.total_tokens == 765
    assert embeddings.usage.prompt_tokens == 764
    assert embeddings.usage.total_tokens == 764
6 changes: 1 addition & 5 deletions tests/entrypoints/test_chat_utils.py
@@ -2,7 +2,6 @@
from typing import Optional

import pytest
from PIL import Image

from vllm.assets.image import ImageAsset
from vllm.config import ModelConfig
@@ -91,10 +90,7 @@ def _assert_mm_data_is_image_input(
    image_data = mm_data.get("image")
    assert image_data is not None

    if image_count == 1:
        assert isinstance(image_data, Image.Image)
    else:
        assert isinstance(image_data, list) and len(image_data) == image_count
    assert isinstance(image_data, list) and len(image_data) == image_count


def test_parse_chat_messages_single_image(
24 changes: 24 additions & 0 deletions tests/lora/conftest.py
@@ -4,6 +4,7 @@
from unittest.mock import MagicMock, patch

import pytest
import safetensors.torch
import torch
import torch.nn as nn
from huggingface_hub import snapshot_download
@@ -169,6 +170,29 @@ def mixtral_lora_files_all_target_modules():
    return snapshot_download(repo_id="dyang415/mixtral-lora-v0")


@pytest.fixture(scope="session")
def jamba_lora_files():
# some of the adapters have unnecessary weights for serving,
# hence we remove them
def remove_unnecessary_weights(path):
lora_path = f"{adapter_path}/adapter_model.safetensors"
tensors = safetensors.torch.load_file(lora_path)
nonlora_keys = []
for k in list(tensors.keys()):
if "lora" not in k:
nonlora_keys.append(k)
for k in nonlora_keys:
del tensors[k]
safetensors.torch.save_file(tensors, lora_path)

adapter_path = snapshot_download(
repo_id=
"hf-100/Jamba-1.5-mini-Spellbound-StoryWriter-0.1-6583896-ckpt53-lora")

remove_unnecessary_weights(adapter_path)
return adapter_path


@pytest.fixture(scope="session")
def gemma_lora_files():
return snapshot_download(repo_id="wskwon/gemma-7b-test-lora")
54 changes: 54 additions & 0 deletions tests/lora/test_jamba.py
@@ -0,0 +1,54 @@
from typing import List

import pytest
import torch

import vllm
from vllm.lora.request import LoRARequest

MODEL_PATH = "ai21labs/AI21-Jamba-1.5-Mini"

MAX_TOKENS = 40


def do_sample(llm: vllm.LLM, lora_path: str, lora_id: int,
              prompts: List[str]) -> List[str]:

    sampling_params = vllm.SamplingParams(temperature=0, max_tokens=MAX_TOKENS)
    outputs = llm.generate(
        prompts,
        sampling_params,
        lora_request=LoRARequest(str(lora_id), lora_id, lora_path)
        if lora_id else None)
    # Print the outputs.
    generated_texts: List[str] = []
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text.strip()
        generated_texts.append(generated_text)
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    return generated_texts


@pytest.mark.parametrize("tp_size", [4])
def test_jamba_lora(jamba_lora_files, tp_size):
    """Original test, the LoRA model has the common target modules, not all"""
    if torch.cuda.device_count() < tp_size:
        pytest.skip(f"Not enough GPUs for tensor parallelism {tp_size}")

    prompts = ["Write a story about a sheep and a goat."]

    llm = vllm.LLM(
        MODEL_PATH,
        enable_lora=True,
        max_num_seqs=16,
        max_loras=4,
        distributed_executor_backend="ray",
        tensor_parallel_size=tp_size,
    )

    expected_jamba_output = [
        """Once upon a time, in a lush green meadow, there lived a sheep named Clara and a goat named Billy. Clara was a gentle creature, always nibbling on the soft grass and humming"""  # noqa: E501
    ]
    assert do_sample(llm, jamba_lora_files, lora_id=1,
                     prompts=prompts) == expected_jamba_output
Expand Up @@ -30,7 +30,7 @@ def get_max_qwen2_vl_image_tokens():


@pytest.mark.parametrize("mm_processor_kwargs,expected_max_tokens", [
    ({}, 1225),
    ({}, 16384),
    ({
        MIN_PIXELS: 64**2,
        MAX_PIXELS: 512**2
4 changes: 3 additions & 1 deletion tests/models/decoder_only/vision_language/test_models.py
@@ -201,6 +201,7 @@
        vllm_output_post_proc=model_utils.fuyu_vllm_to_hf_output,
        num_logprobs=10,
        image_size_factors=[(), (0.25,), (0.25, 0.25, 0.25), (0.25, 0.2, 0.15)],
        marks=[large_gpu_mark(min_gb=48)],
    ),
    "glm4": VLMTestInfo(
        models=["THUDM/glm-4v-9b"],
@@ -212,7 +213,7 @@
dtype="bfloat16",
get_stop_token_ids=lambda tok: [151329, 151336, 151338],
patch_hf_runner=model_utils.glm_patch_hf_runner,
marks=[large_gpu_mark(min_gb=48)],
marks=[large_gpu_mark(min_gb=32)],
),
"h2ovl": VLMTestInfo(
models = [
@@ -261,6 +262,7 @@
dtype="bfloat16",
use_tokenizer_eos=True,
patch_hf_runner=model_utils.internvl_patch_hf_runner,
marks=[large_gpu_mark(min_gb=32)],
),
"llava_next": VLMTestInfo(
models=["llava-hf/llava-v1.6-mistral-7b-hf"],