[Doc] Minor documentation fixes #11580

Merged
merged 2 commits into from
Dec 28, 2024
6 changes: 3 additions & 3 deletions docs/source/contributing/dockerfile/dockerfile.md
@@ -11,11 +11,11 @@ Below is a visual representation of the multi-stage Dockerfile. The build graph

The edges of the build graph represent:

- FROM ... dependencies (with a solid line and a full arrow head)
- `FROM ...` dependencies (with a solid line and a full arrow head)

- COPY --from=... dependencies (with a dashed line and an empty arrow head)
- `COPY --from=...` dependencies (with a dashed line and an empty arrow head)

- RUN --mount=(.\*)from=... dependencies (with a dotted line and an empty diamond arrow head)
- `RUN --mount=(.\*)from=...` dependencies (with a dotted line and an empty diamond arrow head)

> ```{figure} ../../assets/dev/dockerfile-stages-dependency.png
> :align: center
2 changes: 1 addition & 1 deletion docs/source/contributing/overview.md
@@ -34,7 +34,7 @@ pytest tests/
```

```{note}
Currently, the repository does not pass the `mypy` tests.
Currently, the repository is not fully checked by `mypy`.
```

# Contribution Guidelines
2 changes: 1 addition & 1 deletion docs/source/getting_started/arm-installation.md
@@ -20,7 +20,7 @@ Contents:
## Requirements

- **Operating System**: Linux or macOS
- **Compiler**: gcc/g++ >= 12.3.0 (optional, but recommended)
- **Compiler**: `gcc/g++ >= 12.3.0` (optional, but recommended)
- **Instruction Set Architecture (ISA)**: NEON support is required

(arm-backend-quick-start-dockerfile)=
4 changes: 2 additions & 2 deletions docs/source/getting_started/cpu-installation.md
@@ -24,7 +24,7 @@ Table of contents:
## Requirements

- OS: Linux
- Compiler: gcc/g++>=12.3.0 (optional, recommended)
- Compiler: `gcc/g++>=12.3.0` (optional, recommended)
- Instruction set architecture (ISA) requirement: AVX512 (optional, recommended)

(cpu-backend-quick-start-dockerfile)=
@@ -69,7 +69,7 @@ $ VLLM_TARGET_DEVICE=cpu python setup.py install

```{note}
- AVX512_BF16 is an ISA extension that provides native BF16 data type conversion and vector product instructions, which brings some performance improvement over pure AVX512. The CPU backend build script checks the host CPU flags to determine whether to enable AVX512_BF16.
- If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable VLLM_CPU_AVX512BF16=1 before the building.
- If you want to force enable AVX512_BF16 for the cross-compilation, please set environment variable `VLLM_CPU_AVX512BF16=1` before the building.
```

(env-intro)=
8 changes: 5 additions & 3 deletions docs/source/getting_started/gaudi-installation.md
@@ -167,6 +167,8 @@ Currently in vLLM for HPU we support four execution modes, depending on selected
In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
```

(gaudi-bucketing-mechanism)=

### Bucketing mechanism

Intel Gaudi accelerators work best when operating on models with fixed tensor shapes. [Intel Gaudi Graph Compiler](https://docs.habana.ai/en/latest/Gaudi_Overview/Intel_Gaudi_Software_Suite.html#graph-compiler-and-runtime) is responsible for generating optimized binary code that implements the given model topology on Gaudi. In its default configuration, the produced binary code may be heavily dependent on input and output tensor shapes, and can require graph recompilation when encountering differently shaped tensors within the same topology. While the resulting binaries utilize Gaudi efficiently, the compilation itself may introduce a noticeable overhead in end-to-end execution.
@@ -185,7 +187,7 @@ INFO 08-01 21:37:59 hpu_model_runner.py:504] Decode bucket config (min, step, ma
INFO 08-01 21:37:59 hpu_model_runner.py:509] Generated 48 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048)]
```

`min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, interval between `min` and `step` has special handling - `min` gets multiplied by consecutive powers of two, until `step` gets reached. We call this the ramp-up phase and it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes.
`min` determines the lowest value of the bucket. `step` determines the interval between buckets, and `max` determines the upper bound of the bucket. Furthermore, interval between `min` and `step` has special handling -- `min` gets multiplied by consecutive powers of two, until `step` gets reached. We call this the ramp-up phase and it is used for handling lower batch sizes with minimum wastage, while allowing larger padding on larger batch sizes.
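
The rule described above can be sketched in a few lines of Python. This is only an illustration of the prose, not the vLLM HPU implementation: the function name `bucket_values` and its parameter names are hypothetical, and it assumes that the ramp-up phase doubles `min` while the value stays below `step`, after which buckets advance in multiples of `step` up to `max`.

```python
def bucket_values(min_value: int, step: int, max_value: int) -> list[int]:
    """Illustrative sketch (hypothetical helper, not vLLM code).

    Ramp-up phase: min_value is multiplied by consecutive powers of two
    until step is reached. Stable phase: multiples of step up to max_value.
    """
    if min_value <= 0 or step <= 0:
        raise ValueError("min_value and step must be positive")

    ramp_up = []
    value = min_value
    while value < step:  # ramp-up stops once step is reached
        ramp_up.append(value)
        value *= 2

    stable = list(range(step, max_value + 1, step))
    return ramp_up + stable


print(bucket_values(2, 32, 64))      # [2, 4, 8, 16, 32, 64]
print(bucket_values(128, 128, 512))  # [128, 256, 384, 512]
```

When `min` equals `step`, the ramp-up phase is empty and only the evenly spaced buckets remain.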

Example (with ramp-up)

@@ -214,7 +216,7 @@ If a request exceeds maximum bucket size in any dimension, it will be processed
As an example, if a request of 3 sequences with a max sequence length of 412 comes in to an idle vLLM server, it will be padded and executed as a `(4, 512)` prefill bucket: `batch_size` (number of sequences) will be padded to 4 (the closest batch_size dimension higher than 3), and the max sequence length will be padded to 512 (the closest sequence length dimension higher than 412). After the prefill stage, it will be executed as a `(4, 512)` decode bucket and will continue as that bucket until either the batch dimension changes (due to a request being finished), in which case it will become a `(2, 512)` bucket, or the context length increases above 512 tokens, in which case it will become a `(4, 640)` bucket.
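
The padding step in this example can be sketched as a lookup of the smallest bucket that fits. This is a rough illustration under stated assumptions, not vLLM's code: `pad_to_bucket` is a hypothetical helper, the bucket lists are taken from the decode buckets in the log above, an exact match is assumed to map to itself, and the helper simply returns `None` for values above the largest bucket rather than modeling what the server does in that case.

```python
from bisect import bisect_left


def pad_to_bucket(value: int, buckets: list[int]):
    """Return the smallest bucket >= value, or None if value exceeds all buckets.

    Hypothetical helper for illustration; `buckets` must be sorted ascending.
    """
    idx = bisect_left(buckets, value)
    return buckets[idx] if idx < len(buckets) else None


# Bucket dimensions matching the log above (assumed for this sketch).
batch_buckets = [1, 2, 4]
seq_buckets = list(range(128, 2048 + 1, 128))

# 3 sequences with max sequence length 412 -> (4, 512)
print((pad_to_bucket(3, batch_buckets), pad_to_bucket(412, seq_buckets)))

# One request finishes and the context grows past 512 tokens -> (2, 640)
print((pad_to_bucket(2, batch_buckets), pad_to_bucket(513, seq_buckets)))
```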

```{note}
Bucketing is transparent to a client - padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
Bucketing is transparent to a client -- padding in sequence length dimension is never returned to the client, and padding in batch dimension does not create new requests.
```

### Warmup
@@ -235,7 +237,7 @@ INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][47/48] batch_size
INFO 08-01 22:27:16 hpu_model_runner.py:1066] [Warmup][Decode][48/48] batch_size:1 seq_len:128 free_mem:55.43 GiB
```

This example uses the same buckets as in *Bucketing mechanism* section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.
This example uses the same buckets as in the [Bucketing Mechanism](#gaudi-bucketing-mechanism) section. Each output line corresponds to execution of a single bucket. When bucket is executed for the first time, its graph is compiled and can be reused later on, skipping further graph compilations.

```{tip}
Compiling all the buckets might take some time and can be turned off with `VLLM_SKIP_WARMUP=true` environment variable. Keep in mind that if you do that, you may face graph compilations once executing a given bucket for the first time. It is fine to disable warmup for development, but it's highly recommended to enable it in deployment.
2 changes: 1 addition & 1 deletion docs/source/getting_started/neuron-installation.md
@@ -26,7 +26,7 @@ Installation steps:
(build-from-source-neuron)=

```{note}
The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with vLLM >= 0.5.3. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
The currently supported version of Pytorch for Neuron installs `triton` version `2.1.0`. This is incompatible with `vllm >= 0.5.3`. You may see an error `cannot import name 'default_dump_dir...`. To work around this, run a `pip install --upgrade triton==3.0.0` after installing the vLLM wheel.
```

## Build from source
4 changes: 2 additions & 2 deletions docs/source/getting_started/quickstart.md
@@ -114,7 +114,7 @@ $ "temperature": 0
$ }'
```

Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` python package:
Since this server is compatible with OpenAI API, you can use it as a drop-in replacement for any applications using OpenAI API. For example, another way to query the server is via the `openai` Python package:

```python
from openai import OpenAI
@@ -151,7 +151,7 @@ $ ]
$ }'
```

Alternatively, you can use the `openai` python package:
Alternatively, you can use the `openai` Python package:

```python
from openai import OpenAI
2 changes: 1 addition & 1 deletion docs/source/getting_started/tpu-installation.md
@@ -103,7 +103,7 @@ Connect to your TPU using SSH:
gcloud compute tpus tpu-vm ssh TPU_NAME --zone ZONE
```

Install Miniconda
Install Miniconda:

```bash
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
6 changes: 3 additions & 3 deletions docs/source/models/supported_models.md
@@ -435,7 +435,7 @@ despite being described otherwise on its model card.
```

If your model is not in the above list, we will try to automatically convert the model using
:func:`vllm.model_executor.models.adapters.as_embedding_model`. By default, the embeddings
{func}`vllm.model_executor.models.adapters.as_embedding_model`. By default, the embeddings
of the whole prompt are extracted from the normalized hidden state corresponding to the last token.

#### Reward Modeling (`--task reward`)
@@ -463,7 +463,7 @@ of the whole prompt are extracted from the normalized hidden state corresponding
```

If your model is not in the above list, we will try to automatically convert the model using
:func:`vllm.model_executor.models.adapters.as_reward_model`. By default, we return the hidden states of each token directly.
{func}`vllm.model_executor.models.adapters.as_reward_model`. By default, we return the hidden states of each token directly.

```{important}
For process-supervised reward models such as {code}`peiyi9979/math-shepherd-mistral-7b-prm`, the pooling config should be set explicitly,
@@ -495,7 +495,7 @@ e.g.: {code}`--override-pooler-config '{"pooling_type": "STEP", "step_tag_id": 1
```

If your model is not in the above list, we will try to automatically convert the model using
:func:`vllm.model_executor.models.adapters.as_classification_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.
{func}`vllm.model_executor.models.adapters.as_classification_model`. By default, the class probabilities are extracted from the softmaxed hidden state corresponding to the last token.

#### Sentence Pair Scoring (`--task score`)

6 changes: 3 additions & 3 deletions docs/source/serving/deploying_with_cerebrium.md
@@ -33,7 +33,7 @@ docker_base_image_url = "nvidia/cuda:12.1.1-runtime-ubuntu22.04"
vllm = "latest"
```

Next, let us add our code to handle inference for the LLM of your choice(`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your main.py\`:
Next, let us add our code to handle inference for the LLM of your choice (`mistralai/Mistral-7B-Instruct-v0.1` for this example), add the following code to your `main.py`:

```python
from vllm import LLM, SamplingParams
@@ -55,13 +55,13 @@ def run(prompts: list[str], temperature: float = 0.8, top_p: float = 0.95):
    return {"results": results}
```

Then, run the following code to deploy it to the cloud
Then, run the following code to deploy it to the cloud:

```console
$ cerebrium deploy
```

If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case /run)
If successful, you should be returned a CURL command that you can call inference against. Just remember to end the url with the function name you are calling (in our case` /run`)

```python
curl -X POST https://api.cortex.cerebrium.ai/v4/p-xxxxxx/vllm/run \
2 changes: 1 addition & 1 deletion docs/source/serving/deploying_with_dstack.md
@@ -25,7 +25,7 @@ $ cd vllm-dstack
$ dstack init
```

Next, to provision a VM instance with LLM of your choice(`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:
Next, to provision a VM instance with LLM of your choice (`NousResearch/Llama-2-7b-chat-hf` for this example), create the following `serve.dstack.yml` file for the dstack `Service`:

```yaml
type: service
6 changes: 3 additions & 3 deletions docs/source/serving/distributed_serving.md
@@ -8,7 +8,7 @@ Before going into the details of distributed inference and serving, let's first

- **Single GPU (no distributed inference)**: If your model fits in a single GPU, you probably don't need to use distributed inference. Just use the single GPU to run the inference.
- **Single-Node Multi-GPU (tensor parallel inference)**: If your model is too large to fit in a single GPU, but it can fit in a single node with multiple GPUs, you can use tensor parallelism. The tensor parallel size is the number of GPUs you want to use. For example, if you have 4 GPUs in a single node, you can set the tensor parallel size to 4.
- **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallel together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.
- **Multi-Node Multi-GPU (tensor parallel plus pipeline parallel inference)**: If your model is too large to fit in a single node, you can use tensor parallel together with pipeline parallelism. The tensor parallel size is the number of GPUs you want to use in each node, and the pipeline parallel size is the number of nodes you want to use. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2.

In short, you should increase the number of GPUs and the number of nodes until you have enough GPU memory to hold the model. The tensor parallel size should be the number of GPUs in each node, and the pipeline parallel size should be the number of nodes.
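
As a small illustration of this rule of thumb, the sketch below turns a cluster shape into a `vllm serve` argument list. The helper name `build_serve_command` is hypothetical and exists only for this example; the two flags are the ones used in the commands further down this page, and the model path is a placeholder.

```python
def build_serve_command(model_path: str, gpus_per_node: int, num_nodes: int) -> list[str]:
    """Hypothetical helper: map cluster shape to vllm serve arguments.

    Tensor parallel size = GPUs in each node;
    pipeline parallel size = number of nodes.
    """
    return [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", str(gpus_per_node),
        "--pipeline-parallel-size", str(num_nodes),
    ]


# 16 GPUs in 2 nodes (8 GPUs per node)
print(" ".join(build_serve_command("/path/to/the/model/in/the/container", 8, 2)))
```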

@@ -77,15 +77,15 @@ Then you get a ray cluster of containers. Note that you need to keep the shells

Then, on any node, use `docker exec -it node /bin/bash` to enter the container, execute `ray status` to check the status of the Ray cluster. You should see the right number of nodes and GPUs.

After that, on any node, you can use vLLM as usual, just as you have all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:
After that, on any node, you can use vLLM as usual, just as you have all the GPUs on one node. The common practice is to set the tensor parallel size to the number of GPUs in each node, and the pipeline parallel size to the number of nodes. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 8 and the pipeline parallel size to 2:

```console
$ vllm serve /path/to/the/model/in/the/container \
$ --tensor-parallel-size 8 \
$ --pipeline-parallel-size 2
```

You can also use tensor parallel without pipeline parallel, just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8GPUs per node), you can set the tensor parallel size to 16:
You can also use tensor parallel without pipeline parallel, just set the tensor parallel size to the number of GPUs in the cluster. For example, if you have 16 GPUs in 2 nodes (8 GPUs per node), you can set the tensor parallel size to 16:

```console
$ vllm serve /path/to/the/model/in/the/container \
2 changes: 1 addition & 1 deletion docs/source/serving/runai_model_streamer.md
@@ -41,7 +41,7 @@ For reading from S3, it will be the number of client instances the host is openi
$ vllm serve /home/meta-llama/Llama-3.2-3B-Instruct --load-format runai_streamer --model-loader-extra-config '{"concurrency":16}'
```

You can controls the size of the CPU Memory buffer to which tensors are read from the file, and limit this size.
You can control the size of the CPU Memory buffer to which tensors are read from the file, and limit this size.
You can read further about CPU buffer memory limiting [here](https://github.com/run-ai/runai-model-streamer/blob/master/docs/src/env-vars.md#runai_streamer_memory_limit).

```console