[Doc] [1/N] Reorganize Getting Started section #11645

3 changes: 1 addition & 2 deletions docs/source/design/arch_overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -77,8 +77,7 @@ python -m vllm.entrypoints.openai.api_server --model <model>

That code can be found in <gh-file:vllm/entrypoints/openai/api_server.py>.

More details on the API server can be found in the {doc}`OpenAI Compatible
Server </serving/openai_compatible_server>` document.
More details on the API server can be found in the [OpenAI-Compatible Server](#openai-compatible-server) document.
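
As a rough illustration of how a client might talk to the server launched above, the sketch below builds a request for the `/v1/completions` route using only the standard library. The base URL `http://localhost:8000` and the helper name `build_completion_request` are assumptions made for this example; adjust them to your deployment.

```python
import json
import urllib.request

# Assumption: the server was started locally with the command above and
# listens on http://localhost:8000.
BASE_URL = "http://localhost:8000"


def build_completion_request(
    model: str, prompt: str, max_tokens: int = 16
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/completions route.

    This helper is hypothetical and exists only for illustration.
    """
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_completion_request("<model>", "Hello, my name is")
    # Sending requires a running server, so the call is left commented out:
    # with urllib.request.urlopen(request) as response:
    #     print(json.load(response)["choices"][0]["text"])
    print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the example runnable without a live server.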

## LLM Engine

Expand Down
2 changes: 1 addition & 1 deletion docs/source/design/multiprocessing.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

## Debugging

Please see the [Debugging Tips](#debugging-python-multiprocessing)
Please see the [Troubleshooting](#troubleshooting-python-multiprocessing)
page for information on known issues and how to solve them.

## Introduction
Expand Down
File renamed without changes.
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

# Installation for ARM CPUs

vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. This guide provides installation instructions specific to ARM. For additional details on supported features, refer to the x86 platform documentation covering:
vLLM has been adapted to work on ARM64 CPUs with NEON support, leveraging the CPU backend initially developed for the x86 platform. This guide provides installation instructions specific to ARM. For additional details on supported features, refer to the [x86 CPU documentation](#installation-x86) covering:

- CPU backend inference capabilities
- Relevant runtime environment variables
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
(installation-cpu)=
(installation-x86)=

# Installation with CPU
# Installation for x86 CPUs

vLLM initially supports basic model inferencing and serving on x86 CPU platform, with data types FP32, FP16 and BF16. vLLM CPU backend supports the following vLLM features:

Expand Down Expand Up @@ -151,4 +151,4 @@ $ python examples/offline_inference.py
$ VLLM_CPU_KVCACHE_SPACE=40 VLLM_CPU_OMP_THREADS_BIND="0-31|32-63" vllm serve meta-llama/Llama-2-7b-chat-hf -tp=2 --distributed-executor-backend mp
```

- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](../serving/deploying_with_nginx.md) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
- Using Data Parallel for maximum throughput: to launch an LLM serving endpoint on each NUMA node along with one additional load balancer to dispatch the requests to those endpoints. Common solutions like [Nginx](#nginxloadbalancer) or HAProxy are recommended. Anyscale Ray project provides the feature on LLM [serving](https://docs.ray.io/en/latest/serve/index.html). Here is the example to setup a scalable LLM serving with [Ray Serve](https://github.com/intel/llm-on-ray/blob/main/docs/setup.md).
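
The dispatch idea in the bullet above can be sketched in a few lines. This is only a minimal round-robin picker, not a replacement for Nginx, HAProxy, or Ray Serve; the endpoint addresses and the `round_robin` helper are hypothetical and assume one vLLM endpoint has already been started per NUMA node.

```python
from itertools import cycle
from typing import Iterator, List

# Hypothetical addresses: one vLLM serving endpoint per NUMA node.
ENDPOINTS = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]


def round_robin(endpoints: List[str]) -> Iterator[str]:
    """Return an endless iterator that yields the endpoints in turn."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    return cycle(endpoints)


if __name__ == "__main__":
    picker = round_robin(ENDPOINTS)
    for request_id in range(4):
        # A real balancer would forward the request; here we only show
        # which endpoint each request would be sent to.
        print(request_id, next(picker))
```

A production load balancer would also handle health checks and retries, which this sketch deliberately leaves out.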
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
(installation)=
(installation-cuda)=

# Installation
# Installation for CUDA

vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries.

Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
(installation-rocm)=

# Installation with ROCm
# Installation for ROCm

vLLM supports AMD GPUs with ROCm 6.2.

Expand Down
Original file line number Diff line number Diff line change
@@ -1,4 +1,6 @@
# Installation with Intel® Gaudi® AI Accelerators
(installation-gaudi)=

# Installation for Intel® Gaudi®

This README provides instructions on running vLLM with Intel Gaudi devices.

Expand Down
19 changes: 19 additions & 0 deletions docs/source/getting_started/installation/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
(installation-index)=

# Installation

vLLM supports the following hardware platforms:

```{toctree}
:maxdepth: 1

gpu-cuda
gpu-rocm
cpu-x86
cpu-arm
hpu-gaudi
tpu
xpu
openvino
neuron
```
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
(installation-neuron)=

# Installation with Neuron
# Installation for Neuron

vLLM 0.3.3 onwards supports model inferencing and serving on AWS Trainium/Inferentia with Neuron SDK with continuous batching.
Paged Attention and Chunked Prefill are currently in development and will be available soon.
Expand Down
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
(installation-openvino)=

# Installation with OpenVINO
# Installation for OpenVINO

vLLM powered by OpenVINO supports all LLM models from {doc}`vLLM supported models list <../models/supported_models>` and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)). OpenVINO vLLM backend supports the following advanced vLLM features:
vLLM powered by OpenVINO supports all LLM models from [vLLM supported models list](#supported-models) and can perform optimal model serving on all x86-64 CPUs with, at least, AVX2 support, as well as on both integrated and discrete Intel® GPUs ([the list of supported GPUs](https://docs.openvino.ai/2024/about-openvino/release-notes-openvino/system-requirements.html#gpu)). OpenVINO vLLM backend supports the following advanced vLLM features:

- Prefix caching (`--enable-prefix-caching`)
- Chunked prefill (`--enable-chunked-prefill`)
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
(installation-tpu)=

# Installation with TPU
# Installation for TPUs

Tensor Processing Units (TPUs) are Google's custom-developed application-specific
integrated circuits (ASICs) used to accelerate machine learning workloads. TPUs
Expand Down
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
(installation-xpu)=

# Installation with XPU
# Installation for XPUs

vLLM initially supports basic model inferencing and serving on Intel GPU platform.

Expand Down
2 changes: 1 addition & 1 deletion docs/source/getting_started/quickstart.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ $ conda activate myenv
$ pip install vllm
```

Please refer to the {ref}`installation documentation <installation>` for more details on installing vLLM.
Please refer to the [installation documentation](#installation-index) for more details on installing vLLM.

(offline-batched-inference)=

Expand Down
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
(debugging)=
(troubleshooting)=

# Debugging Tips
# Troubleshooting

This document outlines some debugging strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.
This document outlines some troubleshooting strategies you can consider. If you think you've discovered a bug, please [search existing issues](https://github.com/vllm-project/vllm/issues?q=is%3Aissue) first to see if it has already been reported. If not, please [file a new issue](https://github.com/vllm-project/vllm/issues/new/choose), providing as much relevant information as possible.

```{note}
Once you've debugged a problem, remember to turn off any debugging environment variables defined, or simply start a new shell to avoid being affected by lingering debugging settings. Otherwise, the system might be slow with debugging functionalities left activated.
Expand Down Expand Up @@ -47,6 +47,7 @@ You might also need to set `export NCCL_SOCKET_IFNAME=<your_network_interface>`
If vLLM crashes and the error trace captures it somewhere around `self.graph.replay()` in `vllm/worker/model_runner.py`, it is a CUDA error inside CUDAGraph.
To identify the particular CUDA operation that causes the error, you can add `--enforce-eager` to the command line, or `enforce_eager=True` to the {class}`~vllm.LLM` class to disable the CUDAGraph optimization and isolate the exact CUDA operation that causes the error.

(troubleshooting-incorrect-hardware-driver)=
## Incorrect hardware/driver

If GPU/CPU communication cannot be established, you can use the following Python script and follow the instructions below to confirm whether the GPU/CPU communication is working correctly.
Expand Down Expand Up @@ -139,7 +140,7 @@ A multi-node environment is more complicated than a single-node one. If you see
Adjust `--nproc-per-node`, `--nnodes`, and `--node-rank` according to your setup, being sure to execute different commands (with different `--node-rank`) on different nodes.
```

(debugging-python-multiprocessing)=
(troubleshooting-python-multiprocessing)=
## Python multiprocessing

### `RuntimeError` Exception
Expand All @@ -150,7 +151,7 @@ If you have seen a warning in your logs like this:
WARNING 12-11 14:50:37 multiproc_worker_utils.py:281] CUDA was previously
initialized. We must use the `spawn` multiprocessing start method. Setting
VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See
https://docs.vllm.ai/en/latest/getting_started/debugging.html#python-multiprocessing
https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing
for more information.
```
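
For context on the warning above: with the `spawn` start method, each worker process re-imports the main module, so entry-point code is usually placed under an `if __name__ == "__main__":` guard. The sketch below shows that pattern with the standard `multiprocessing` module only; the `square` and `run_workers` functions are hypothetical stand-ins for real work and do not use vLLM.

```python
import multiprocessing as mp
from typing import List


def square(x: int) -> int:
    """Toy workload executed in a worker process."""
    return x * x


def run_workers(values: List[int]) -> List[int]:
    # Request the `spawn` start method explicitly instead of relying on
    # the platform default.
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    # Without this guard, each spawned worker would re-run the code below
    # when it re-imports the main module.
    print(run_workers([1, 2, 3]))
```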

Expand Down
16 changes: 4 additions & 12 deletions docs/source/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -50,26 +50,19 @@ For more information, check out the following:
- [vLLM announcing blog post](https://vllm.ai) (intro to PagedAttention)
- [vLLM paper](https://arxiv.org/abs/2309.06180) (SOSP 2023)
- [How continuous batching enables 23x throughput in LLM inference while reducing p50 latency](https://www.anyscale.com/blog/continuous-batching-llm-inference) by Cade Daniel et al.
- {ref}`vLLM Meetups <meetups>`.
- [vLLM Meetups](#meetups)

## Documentation

```{toctree}
:caption: Getting Started
:maxdepth: 1

getting_started/installation
getting_started/amd-installation
getting_started/openvino-installation
getting_started/cpu-installation
getting_started/gaudi-installation
getting_started/arm-installation
getting_started/neuron-installation
getting_started/tpu-installation
getting_started/xpu-installation
getting_started/installation/index
getting_started/quickstart
getting_started/debugging
getting_started/examples/examples_index
getting_started/troubleshooting
getting_started/faq
```

```{toctree}
Expand Down Expand Up @@ -110,7 +103,6 @@ usage/structured_outputs
usage/spec_decode
usage/compatibility_matrix
usage/performance
usage/faq
usage/engine_args
usage/env_vars
usage/usage_stats
Expand Down
2 changes: 1 addition & 1 deletion docs/source/models/generative_models.md
Original file line number Diff line number Diff line change
Expand Up @@ -120,7 +120,7 @@ outputs = llm.chat(conversation, chat_template=custom_template)

## Online Inference

Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:

- [Completions API](#completions-api) is similar to `LLM.generate` but only accepts text.
- [Chat API](#chat-api) is similar to `LLM.chat`, accepting both text and [multi-modal inputs](#multimodal-inputs) for models with a chat template.
2 changes: 1 addition & 1 deletion docs/source/models/pooling_models.md
Original file line number Diff line number Diff line change
Expand Up @@ -106,7 +106,7 @@ A code example can be found here: <gh-file:examples/offline_inference_scoring.py

## Online Inference

Our [OpenAI Compatible Server](../serving/openai_compatible_server.md) provides endpoints that correspond to the offline APIs:
Our [OpenAI-Compatible Server](#openai-compatible-server) provides endpoints that correspond to the offline APIs:

- [Pooling API](#pooling-api) is similar to `LLM.encode`, being applicable to all types of pooling models.
- [Embeddings API](#embeddings-api) is similar to `LLM.embed`, accepting both text and [multi-modal inputs](#multimodal-inputs) for embedding models.
Expand Down
2 changes: 1 addition & 1 deletion docs/source/serving/distributed_serving.md
Original file line number Diff line number Diff line change
Expand Up @@ -95,7 +95,7 @@ $ --tensor-parallel-size 16
To make tensor parallel performant, you should make sure the communication between nodes is efficient, e.g. using high-speed network cards like Infiniband. To correctly set up the cluster to use Infiniband, append additional arguments like `--privileged -e NCCL_IB_HCA=mlx5` to the `run_cluster.sh` script. Please contact your system administrator for more information on how to set up the flags. One way to confirm if the Infiniband is working is to run vLLM with `NCCL_DEBUG=TRACE` environment variable set, e.g. `NCCL_DEBUG=TRACE vllm serve ...` and check the logs for the NCCL version and the network used. If you find `[send] via NET/Socket` in the logs, it means NCCL uses raw TCP Socket, which is not efficient for cross-node tensor parallel. If you find `[send] via NET/IB/GDRDMA` in the logs, it means NCCL uses Infiniband with GPU-Direct RDMA, which is efficient.

```{warning}
After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](../getting_started/debugging.md) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
After you start the Ray cluster, you'd better also check the GPU-GPU communication between nodes. It can be non-trivial to set up. Please refer to the [sanity check script](#troubleshooting-incorrect-hardware-driver) for more information. If you need to set some environment variables for the communication configuration, you can append them to the `run_cluster.sh` script, e.g. `-e NCCL_SOCKET_IFNAME=eth0`. Note that setting environment variables in the shell (e.g. `NCCL_SOCKET_IFNAME=eth0 vllm serve ...`) only works for the processes in the same node, not for the processes in the other nodes. Setting environment variables when you create the cluster is the recommended way. See <gh-issue:6803> for more information.
```
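
The log check described above (looking for `[send] via NET/Socket` versus `[send] via NET/IB/GDRDMA` in `NCCL_DEBUG=TRACE` output) could be automated roughly as follows. The function name, the returned labels, and the sample log lines are assumptions for illustration; only the two marker strings come from the text above.

```python
from typing import Set

# Marker substrings taken from the paragraph above, mapped to
# hypothetical labels chosen for this example.
MARKERS = {
    "via NET/Socket": "tcp-socket",
    "via NET/IB/GDRDMA": "infiniband-gdrdma",
}


def find_nccl_transports(log_text: str) -> Set[str]:
    """Return the set of transports whose marker appears in the log text."""
    return {label for marker, label in MARKERS.items() if marker in log_text}


if __name__ == "__main__":
    # Illustrative log fragment, not real NCCL output.
    sample = "rank 0 [send] via NET/Socket/0\nrank 1 [send] via NET/Socket/0\n"
    transports = find_nccl_transports(sample)
    if "tcp-socket" in transports:
        print("NCCL is using raw TCP sockets; cross-node TP may be slow.")
```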

```{warning}
Expand Down
4 changes: 2 additions & 2 deletions docs/source/usage/spec_decode.md
Original file line number Diff line number Diff line change
Expand Up @@ -182,7 +182,7 @@ speculative decoding, breaking down the guarantees into three key areas:
3. **vLLM Logprob Stability**
\- vLLM does not currently guarantee stable token log probabilities (logprobs). This can result in different outputs for the
same request across runs. For more details, see the FAQ section
titled *Can the output of a prompt vary across runs in vLLM?* in the {ref}`FAQs <faq>`.
titled *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).

**Conclusion**

Expand All @@ -195,7 +195,7 @@ can occur due to following factors:

**Mitigation Strategies**

For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the {ref}`FAQs <faq>`.
For mitigation strategies, please refer to the FAQ entry *Can the output of a prompt vary across runs in vLLM?* in the [FAQs](#faq).

## Resources for vLLM contributors

Expand Down
2 changes: 1 addition & 1 deletion docs/source/usage/structured_outputs.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ The following parameters are supported, which must be added as extra parameters:
- `guided_whitespace_pattern`: used to override the default whitespace pattern for guided json decoding.
- `guided_decoding_backend`: used to select the guided decoding backend to use.

You can see the complete list of supported parameters on the [OpenAI Compatible Server](../serving/openai_compatible_server.md) page.
You can see the complete list of supported parameters on the [OpenAI-Compatible Server](#openai-compatible-server)page.

Now let´s see an example for each of the cases, starting with the `guided_choice`, as it´s the easiest one:

Expand Down
2 changes: 1 addition & 1 deletion vllm/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -1938,7 +1938,7 @@ def _check_multiproc_method():
"the `spawn` multiprocessing start method. Setting "
"VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. "
"See https://docs.vllm.ai/en/latest/getting_started/"
"debugging.html#python-multiprocessing "
"troubleshooting.html#python-multiprocessing "
"for more information.")
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

Expand Down