-
-
Notifications
You must be signed in to change notification settings - Fork 5.1k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge branch 'main' into add_gritlm_vllm
Signed-off-by: Pooya Davoodi <[email protected]>
- Loading branch information
Showing
130 changed files
with
3,623 additions
and
890 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
name: Lint and Deploy Charts | ||
|
||
on: pull_request | ||
|
||
jobs: | ||
lint-and-deploy: | ||
runs-on: ubuntu-latest | ||
steps: | ||
- name: Checkout | ||
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2 | ||
with: | ||
fetch-depth: 0 | ||
|
||
- name: Set up Helm | ||
uses: azure/setup-helm@fe7b79cd5ee1e45176fcad797de68ecaf3ca4814 # v4.2.0 | ||
with: | ||
version: v3.14.4 | ||
|
||
#Python is required because ct lint runs Yamale and yamllint which require Python. | ||
- uses: actions/setup-python@0b93645e9fea7318ecaed2b359559ac225c90a2b # v5.3.0 | ||
with: | ||
python-version: '3.13' | ||
|
||
- name: Set up chart-testing | ||
uses: helm/chart-testing-action@e6669bcd63d7cb57cb4380c33043eebe5d111992 # v2.6.1 | ||
with: | ||
version: v3.10.1 | ||
|
||
- name: Run chart-testing (lint) | ||
run: ct lint --target-branch ${{ github.event.repository.default_branch }} --chart-dirs examples/chart-helm --charts examples/chart-helm | ||
|
||
- name: Setup minio | ||
run: | | ||
docker network create vllm-net | ||
docker run -d -p 9000:9000 --name minio --net vllm-net \ | ||
-e "MINIO_ACCESS_KEY=minioadmin" \ | ||
-e "MINIO_SECRET_KEY=minioadmin" \ | ||
-v /tmp/data:/data \ | ||
-v /tmp/config:/root/.minio \ | ||
minio/minio server /data | ||
export AWS_ACCESS_KEY_ID=minioadmin | ||
export AWS_SECRET_ACCESS_KEY=minioadmin | ||
export AWS_EC2_METADATA_DISABLED=true | ||
mkdir opt-125m | ||
cd opt-125m && curl -O -Ls "https://huggingface.co/facebook/opt-125m/resolve/main/{pytorch_model.bin,config.json,generation_config.json,merges.txt,special_tokens_map.json,tokenizer_config.json,vocab.json}" && cd .. | ||
aws --endpoint-url http://127.0.0.1:9000/ s3 mb s3://testbucket | ||
aws --endpoint-url http://127.0.0.1:9000/ s3 cp opt-125m/ s3://testbucket/opt-125m --recursive | ||
- name: Create kind cluster | ||
uses: helm/kind-action@0025e74a8c7512023d06dc019c617aa3cf561fde # v1.10.0 | ||
|
||
- name: Build the Docker image vllm cpu | ||
run: docker buildx build -f Dockerfile.cpu -t vllm-cpu-env . | ||
|
||
- name: Configuration of docker images, network and namespace for the kind cluster | ||
run: | | ||
docker pull amazon/aws-cli:2.6.4 | ||
kind load docker-image amazon/aws-cli:2.6.4 --name chart-testing | ||
kind load docker-image vllm-cpu-env:latest --name chart-testing | ||
docker network connect vllm-net "$(docker ps -aqf "name=chart-testing-control-plane")" | ||
kubectl create ns ns-vllm | ||
- name: Run chart-testing (install) | ||
run: | | ||
export AWS_ACCESS_KEY_ID=minioadmin | ||
export AWS_SECRET_ACCESS_KEY=minioadmin | ||
helm install --wait --wait-for-jobs --timeout 5m0s --debug --create-namespace --namespace=ns-vllm test-vllm examples/chart-helm -f examples/chart-helm/values.yaml --set secrets.s3endpoint=http://minio:9000 --set secrets.s3bucketname=testbucket --set secrets.s3accesskeyid=$AWS_ACCESS_KEY_ID --set secrets.s3accesskey=$AWS_SECRET_ACCESS_KEY --set resources.requests.cpu=1 --set resources.requests.memory=4Gi --set resources.limits.cpu=2 --set resources.limits.memory=5Gi --set image.env[0].name=VLLM_CPU_KVCACHE_SPACE --set image.env[1].name=VLLM_LOGGING_LEVEL --set-string image.env[0].value="1" --set-string image.env[1].value="DEBUG" --set-string extraInit.s3modelpath="opt-125m/" --set-string 'resources.limits.nvidia\.com/gpu=0' --set-string 'resources.requests.nvidia\.com/gpu=0' --set-string image.repository="vllm-cpu-env" | ||
- name: curl test | ||
run: | | ||
kubectl -n ns-vllm port-forward service/test-vllm-service 8001:80 & | ||
sleep 10 | ||
CODE="$(curl -v -f --location http://localhost:8001/v1/completions \ | ||
--header "Content-Type: application/json" \ | ||
--data '{ | ||
"model": "opt-125m", | ||
"prompt": "San Francisco is a", | ||
"max_tokens": 7, | ||
"temperature": 0 | ||
}'):$CODE" | ||
echo "$CODE" |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,146 @@ | ||
.. _generative_models: | ||
|
||
Generative Models | ||
================= | ||
|
||
vLLM provides first-class support for generative models, which covers most of LLMs. | ||
|
||
In vLLM, generative models implement the :class:`~vllm.model_executor.models.VllmModelForTextGeneration` interface. | ||
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate, | ||
which are then passed through :class:`~vllm.model_executor.layers.Sampler` to obtain the final text. | ||
|
||
Offline Inference | ||
----------------- | ||
|
||
The :class:`~vllm.LLM` class provides various methods for offline inference. | ||
See :ref:`Engine Arguments <engine_args>` for a list of options when initializing the model. | ||
|
||
For generative models, the only supported :code:`task` option is :code:`"generate"`. | ||
Usually, this is automatically inferred so you don't have to specify it. | ||
|
||
``LLM.generate`` | ||
^^^^^^^^^^^^^^^^ | ||
|
||
The :class:`~vllm.LLM.generate` method is available to all generative models in vLLM. | ||
It is similar to `its counterpart in HF Transformers <https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationMixin.generate>`__, | ||
except that tokenization and detokenization are also performed automatically. | ||
|
||
.. code-block:: python | ||
llm = LLM(model="facebook/opt-125m") | ||
outputs = llm.generate("Hello, my name is") | ||
for output in outputs: | ||
prompt = output.prompt | ||
generated_text = output.outputs[0].text | ||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") | ||
You can optionally control the language generation by passing :class:`~vllm.SamplingParams`. | ||
For example, you can use greedy sampling by setting :code:`temperature=0`: | ||
|
||
.. code-block:: python | ||
llm = LLM(model="facebook/opt-125m") | ||
params = SamplingParams(temperature=0) | ||
outputs = llm.generate("Hello, my name is", params) | ||
for output in outputs: | ||
prompt = output.prompt | ||
generated_text = output.outputs[0].text | ||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") | ||
A code example can be found in `examples/offline_inference.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference.py>`_. | ||
|
||
``LLM.beam_search`` | ||
^^^^^^^^^^^^^^^^^^^ | ||
|
||
The :class:`~vllm.LLM.beam_search` method implements `beam search <https://huggingface.co/docs/transformers/en/generation_strategies#beam-search-decoding>`__ on top of :class:`~vllm.LLM.generate`. | ||
For example, to search using 5 beams and output at most 50 tokens: | ||
|
||
.. code-block:: python | ||
llm = LLM(model="facebook/opt-125m") | ||
params = BeamSearchParams(beam_width=5, max_tokens=50) | ||
outputs = llm.generate("Hello, my name is", params) | ||
for output in outputs: | ||
prompt = output.prompt | ||
generated_text = output.outputs[0].text | ||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") | ||
``LLM.chat`` | ||
^^^^^^^^^^^^ | ||
|
||
The :class:`~vllm.LLM.chat` method implements chat functionality on top of :class:`~vllm.LLM.generate`. | ||
In particular, it accepts input similar to `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__ | ||
and automatically applies the model's `chat template <https://huggingface.co/docs/transformers/en/chat_templating>`__ to format the prompt. | ||
|
||
.. important:: | ||
|
||
In general, only instruction-tuned models have a chat template. | ||
Base models may perform poorly as they are not trained to respond to the chat conversation. | ||
|
||
.. code-block:: python | ||
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct") | ||
conversation = [ | ||
{ | ||
"role": "system", | ||
"content": "You are a helpful assistant" | ||
}, | ||
{ | ||
"role": "user", | ||
"content": "Hello" | ||
}, | ||
{ | ||
"role": "assistant", | ||
"content": "Hello! How can I assist you today?" | ||
}, | ||
{ | ||
"role": "user", | ||
"content": "Write an essay about the importance of higher education.", | ||
}, | ||
] | ||
outputs = llm.chat(conversation) | ||
for output in outputs: | ||
prompt = output.prompt | ||
generated_text = output.outputs[0].text | ||
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") | ||
A code example can be found in `examples/offline_inference_chat.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_chat.py>`_. | ||
|
||
If the model doesn't have a chat template or you want to specify another one, | ||
you can explicitly pass a chat template: | ||
|
||
.. code-block:: python | ||
from vllm.entrypoints.chat_utils import load_chat_template | ||
# You can find a list of existing chat templates under `examples/` | ||
custom_template = load_chat_template(chat_template="<path_to_template>") | ||
print("Loaded chat template:", custom_template) | ||
outputs = llm.chat(conversation, chat_template=custom_template) | ||
Online Inference | ||
---------------- | ||
|
||
Our `OpenAI Compatible Server <../serving/openai_compatible_server>`__ can be used for online inference. | ||
Please click on the above link for more details on how to launch the server. | ||
|
||
Completions API | ||
^^^^^^^^^^^^^^^ | ||
|
||
Our Completions API is similar to ``LLM.generate`` but only accepts text. | ||
It is compatible with `OpenAI Completions API <https://platform.openai.com/docs/api-reference/completions>`__ | ||
so that you can use OpenAI client to interact with it. | ||
A code example can be found in `examples/openai_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_completion_client.py>`_. | ||
|
||
Chat API | ||
^^^^^^^^ | ||
|
||
Our Chat API is similar to ``LLM.chat``, accepting both text and :ref:`multi-modal inputs <multimodal_inputs>`. | ||
It is compatible with `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__ | ||
so that you can use OpenAI client to interact with it. | ||
A code example can be found in `examples/openai_chat_completion_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_chat_completion_client.py>`_. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,99 @@ | ||
.. _pooling_models: | ||
|
||
Pooling Models | ||
============== | ||
|
||
vLLM also supports pooling models, including embedding, reranking and reward models. | ||
|
||
In vLLM, pooling models implement the :class:`~vllm.model_executor.models.VllmModelForPooling` interface. | ||
These models use a :class:`~vllm.model_executor.layers.Pooler` to aggregate the final hidden states of the input | ||
before returning them. | ||
|
||
.. note:: | ||
|
||
We currently support pooling models primarily as a matter of convenience. | ||
As shown in the :ref:`Compatibility Matrix <compatibility_matrix>`, most vLLM features are not applicable to | ||
pooling models as they only work on the generation or decode stage, so performance may not improve as much. | ||
|
||
Offline Inference | ||
----------------- | ||
|
||
The :class:`~vllm.LLM` class provides various methods for offline inference. | ||
See :ref:`Engine Arguments <engine_args>` for a list of options when initializing the model. | ||
|
||
For pooling models, we support the following :code:`task` options: | ||
|
||
- Embedding (:code:`"embed"` / :code:`"embedding"`) | ||
- Classification (:code:`"classify"`) | ||
- Sentence Pair Scoring (:code:`"score"`) | ||
- Reward Modeling (:code:`"reward"`) | ||
|
||
The selected task determines the default :class:`~vllm.model_executor.layers.Pooler` that is used: | ||
|
||
- Embedding: Extract only the hidden states corresponding to the last token, and apply normalization. | ||
- Classification: Extract only the hidden states corresponding to the last token, and apply softmax. | ||
- Sentence Pair Scoring: Extract only the hidden states corresponding to the last token, and apply softmax. | ||
- Reward Modeling: Extract all of the hidden states and return them directly. | ||
|
||
When loading `Sentence Transformers <https://huggingface.co/sentence-transformers>`__ models, | ||
we attempt to override the default pooler based on its Sentence Transformers configuration file (:code:`modules.json`). | ||
|
||
You can customize the model's pooling method via the :code:`override_pooler_config` option, | ||
which takes priority over both the model's and Sentence Transformers's defaults. | ||
|
||
``LLM.encode`` | ||
^^^^^^^^^^^^^^ | ||
|
||
The :class:`~vllm.LLM.encode` method is available to all pooling models in vLLM. | ||
It returns the aggregated hidden states directly. | ||
|
||
.. code-block:: python | ||
llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed") | ||
outputs = llm.encode("Hello, my name is") | ||
outputs = model.encode(prompts) | ||
for output in outputs: | ||
embeddings = output.outputs.embedding | ||
print(f"Prompt: {prompt!r}, Embeddings (size={len(embeddings)}: {embeddings!r}") | ||
A code example can be found in `examples/offline_inference_embedding.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_embedding.py>`_. | ||
|
||
``LLM.score`` | ||
^^^^^^^^^^^^^ | ||
|
||
The :class:`~vllm.LLM.score` method outputs similarity scores between sentence pairs. | ||
It is primarily designed for `cross-encoder models <https://www.sbert.net/examples/applications/cross-encoder/README.html>`__. | ||
These types of models serve as rerankers between candidate query-document pairs in RAG systems. | ||
|
||
.. note:: | ||
|
||
vLLM can only perform the model inference component (e.g. embedding, reranking) of RAG. | ||
To handle RAG at a higher level, you should use integration frameworks such as `LangChain <https://github.com/langchain-ai/langchain>`_. | ||
|
||
You can use `these tests <https://github.com/vllm-project/vllm/blob/main/tests/models/embedding/language/test_scoring.py>`_ as reference. | ||
|
||
Online Inference | ||
---------------- | ||
|
||
Our `OpenAI Compatible Server <../serving/openai_compatible_server>`__ can be used for online inference. | ||
Please click on the above link for more details on how to launch the server. | ||
|
||
Embeddings API | ||
^^^^^^^^^^^^^^ | ||
|
||
Our Embeddings API is similar to ``LLM.encode``, accepting both text and :ref:`multi-modal inputs <multimodal_inputs>`. | ||
|
||
The text-only API is compatible with `OpenAI Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`__ | ||
so that you can use OpenAI client to interact with it. | ||
A code example can be found in `examples/openai_embedding_client.py <https://github.com/vllm-project/vllm/blob/main/examples/openai_embedding_client.py>`_. | ||
|
||
The multi-modal API is an extension of the `OpenAI Embeddings API <https://platform.openai.com/docs/api-reference/embeddings>`__ | ||
that incorporates `OpenAI Chat Completions API <https://platform.openai.com/docs/api-reference/chat>`__, | ||
so it is not part of the OpenAI standard. Please see :ref:`this page <multimodal_inputs>` for more details on how to use it. | ||
|
||
Score API | ||
^^^^^^^^^ | ||
|
||
Our Score API is similar to ``LLM.score``. | ||
Please see `this page <../serving/openai_compatible_server.html#score-api-for-cross-encoder-models>`__ for more details on how to use it. |
Oops, something went wrong.