✨ Helm Chart for OpenVINO vLLM #403

Open
wants to merge 45 commits into main from chart/vllm-ov
Changes from all commits
Commits
45 commits
df8c195
✨ Added chart for vllm-openvino
krish918 Sep 5, 2024
d339c74
✨ Added charts for llm-vllm microservice
krish918 Sep 5, 2024
c8a420c
➕ Updated chatqna to have conditional dependency on tgi and vllm
krish918 Sep 6, 2024
21be6c9
🧪 Added tests for verifying pod sanity
krish918 Sep 6, 2024
25528c9
📝 Added docs for instruction to setup chatqna with vllm
krish918 Sep 6, 2024
140d1b5
🔥 removed unsupported env vars
krish918 Sep 6, 2024
815c51b
♻️ Removed global Model ID var | resolved readme conflicts
krish918 Sep 6, 2024
2621fa3
Merge branch 'main' into chart/vllm-ov
krish918 Sep 6, 2024
4ac8fb0
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 6, 2024
5fffdd0
📌 Bumped up the chart version
krish918 Sep 6, 2024
7497322
🔥 Removed unused vars and resources
krish918 Sep 10, 2024
4154f02
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 18, 2024
027923c
🔧 added openvino values files
krish918 Sep 18, 2024
1f513a4
Merge branch 'main' into chart/vllm-ov
krish918 Sep 18, 2024
8b911f5
🩹 minor fixes
krish918 Sep 18, 2024
2ba4c8f
🩹 renamed chart llm-vllm-uservice to avoid conflict
krish918 Sep 18, 2024
207d2bd
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Sep 18, 2024
b36ac56
Merge branch 'main' into chart/vllm-ov
krish918 Sep 19, 2024
20670b7
Merge branch 'main' into chart/vllm-ov
krish918 Sep 19, 2024
01eb2b4
updated vllm-openvino image
krish918 Sep 30, 2024
738ff59
🔖 updated tags for llm-vllm and ctrl-uservice
krish918 Oct 8, 2024
ad96222
Merge branch 'main' into chart/vllm-ov
krish918 Oct 8, 2024
e7de84c
🔖 added latest tag for llm-vllm and ctrl-uservice
krish918 Oct 8, 2024
86b8064
🩹 fixed openvino values issue for chatqna
krish918 Oct 8, 2024
34f71b6
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 8, 2024
e382dea
📄 added missing openvino values file
krish918 Oct 9, 2024
4065c9e
🔥 removed tags for conditional chart selection
krish918 Oct 9, 2024
81d269c
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 9, 2024
1fce716
Merge branch 'main' into chart/vllm-ov
krish918 Oct 9, 2024
05a2be2
📝 formatting fixes in readme files
krish918 Oct 9, 2024
f0dae33
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 9, 2024
a8b85d7
🎨 prettier formatting fixes
krish918 Oct 9, 2024
294f1a0
🎨 prettier formatting fixes for chatqna readme
krish918 Oct 9, 2024
9b12618
retrigger CI checks
krish918 Oct 9, 2024
e890448
📝 minor updates in readme files
krish918 Oct 10, 2024
ef59964
retrigger CI checks
krish918 Oct 10, 2024
acd9a47
💚 enabled ci checks for new values files
krish918 Oct 29, 2024
afc3d45
Merge branch 'main' into chart/vllm-ov
krish918 Oct 29, 2024
cbb8d65
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 29, 2024
d570861
🩹 fixed vllm charts multiple installation
krish918 Oct 29, 2024
26562a9
Merge branch 'main' into chart/vllm-ov
krish918 Oct 30, 2024
03b7d26
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Oct 30, 2024
9e1cdf1
increased helm rollout timeout in ci
krish918 Oct 30, 2024
df8261e
💚 fixes to enable ci for openvino-vllm
krish918 Nov 4, 2024
ac341ac
triggering CI checks
krish918 Nov 6, 2024
2 changes: 1 addition & 1 deletion .github/workflows/_helm-e2e.yaml
@@ -65,7 +65,7 @@ jobs:
echo "CHART_NAME=$CHART_NAME" >> $GITHUB_ENV
echo "RELEASE_NAME=${CHART_NAME}$(date +%Y%m%d%H%M%S)" >> $GITHUB_ENV
echo "NAMESPACE=${CHART_NAME}-$(date +%Y%m%d%H%M%S)" >> $GITHUB_ENV
echo "ROLLOUT_TIMEOUT_SECONDS=600s" >> $GITHUB_ENV
echo "ROLLOUT_TIMEOUT_SECONDS=1200s" >> $GITHUB_ENV
echo "TEST_TIMEOUT_SECONDS=600s" >> $GITHUB_ENV
echo "KUBECTL_TIMEOUT_SECONDS=60s" >> $GITHUB_ENV
echo "should_cleanup=false" >> $GITHUB_ENV
13 changes: 13 additions & 0 deletions helm-charts/chatqna/Chart.yaml
@@ -18,6 +18,19 @@ dependencies:
- name: tgi
version: 1.0.0
repository: "file://../common/tgi"
condition: tgi.enabled
- name: vllm
version: 1.0.0
repository: "file://../common/vllm"
condition: vllm.enabled
- name: llm-uservice
version: 1.0.0
repository: "file://../common/llm-uservice"
condition: tgi.enabled
- name: llm-ctrl-uservice
version: 1.0.0
repository: "file://../common/llm-ctrl-uservice"
condition: vllm.enabled
Comment on lines +26 to +33

Contributor

Why are you adding wrappers?

They were removed over a month ago for v1.1 (#474), are unnecessary, and the LLM wrapper uses a langserve component with a problematic license (opea-project/GenAIComps#264).

- name: tei
version: 1.0.0
repository: "file://../common/tei"
86 changes: 71 additions & 15 deletions helm-charts/chatqna/README.md
@@ -9,37 +9,91 @@ Helm chart for deploying ChatQnA service. ChatQnA depends on the following servi
- [redis-vector-db](../common/redis-vector-db/README.md)
- [reranking-usvc](../common/reranking-usvc/README.md)
- [teirerank](../common/teirerank/README.md)
- [llm-uservice](../common/llm-uservice/README.md)
- [tgi](../common/tgi/README.md)

For LLM inference, two more microservices are required. We can use either [TGI](https://github.com/huggingface/text-generation-inference) or [vLLM](https://github.com/vllm-project/vllm) as the LLM backend. Depending on that choice, the following microservices become part of the dependencies for the ChatQnA application.

1. For using **TGI** as the inference service, the following 2 microservices are required:

- [llm-uservice](../common/llm-uservice/README.md)
- [tgi](../common/tgi/README.md)

2. For using **vLLM** as the inference service, the following 2 microservices are required:

- [llm-ctrl-uservice](../common/llm-ctrl-uservice/README.md)
Comment on lines +13 to +22

Contributor

Ditto, why add wrappers?

Collaborator

This PR is from the 1.0 release timeframe, so it carries some old code.
I think it's better to merge it with #610, or make simple changes to support OpenVINO after #610 gets merged.

Contributor

Sounds good, but note that I'm testing my PR only with the vLLM Gaudi version.

I.e. currently both CPU and GPU/OpenVINO support need to be added / tested after it.

That PR also has quite a few comment TODOs about vLLM options where some feedback would be needed / appreciated.

- [vllm](../common/vllm/README.md)

> **_NOTE :_** Only one of the two LLM inference engines should be deployed at a time. To enforce this, conditional flags are defined on the chart dependencies: switch off the flag for the backend you don't want and switch on the flag for the other, so that all ChatQnA dependencies are set up correctly (see the values sketch below).
Contributor

Why couldn't there be multiple inference engines?

ChatQnA has 4 inferencing subservices for which it already uses 2 inference engines, TEI and TGI.

And I do not see why it could not use e.g. TEI for embed + rerank, TGI for guardrails, and vLLM for LLM.

Please rephrase.
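A minimal values override matching the note above (a sketch; the flags mirror this chart's `ci-vllm-values.yaml`) that selects vLLM instead of TGI:

```yaml
# Disable the default TGI backend and enable vLLM instead
tgi:
  enabled: false

vllm:
  enabled: true
```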


## Installing the Chart

To install the chart, run the following:
Please follow the steps below to install the ChatQnA chart:

1. Clone the GenAIInfra repository:

```bash
git clone https://github.com/opea-project/GenAIInfra.git
```

2. Set up the dependencies and required environment variables:

```console
```bash
cd GenAIInfra/helm-charts/
./update_dependency.sh
helm dependency update chatqna
export HFTOKEN="insert-your-huggingface-token-here"
export MODELDIR="/mnt/opea-models"
export MODELNAME="Intel/neural-chat-7b-v3-3"
```

3. Depending on the device targeted for running ChatQnA, please use one of the following installation commands:

```bash
# Install the chart on a Xeon machine

# If you would like to use the traditional UI, please change the image as well as the containerPort within the values
# append these at the end of the command "--set chatqna-ui.image.repository=opea/chatqna-ui,chatqna-ui.image.tag=latest,chatqna-ui.containerPort=5173"

helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME}
```

```bash
# To use Gaudi device
Contributor

Now that there's support for both TGI and vLLM, all these comments could state which one is used, e.g. like this:

Suggested change
# To use Gaudi device
# To use Gaudi device for TGI

#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-values.yaml
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/gaudi-values.yaml
```

```bash
# To use Nvidia GPU
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/nv-values.yaml
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} -f chatqna/nv-values.yaml
```

```bash
# To include guardrail component in chatqna on Xeon
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-values.yaml
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-values.yaml
```

```bash
# To include guardrail component in chatqna on Gaudi
#helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-gaudi-values.yaml
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} -f chatqna/guardrails-gaudi-values.yaml
```

> **_NOTE :_** The default installation uses [TGI (Text Generation Inference)](https://github.com/huggingface/text-generation-inference) as the inference engine. To use vLLM as the inference engine, please see below.

```bash
# To use the vLLM inference engine on a Xeon device

helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set llm-ctrl-uservice.LLM_MODEL_ID=${MODELNAME} --set vllm.LLM_MODEL_ID=${MODELNAME} --set tgi.enabled=false --set vllm.enabled=true

# To use the OpenVINO-optimized vLLM inference engine on a Xeon device

helm install -f ./chatqna/vllm-openvino-values.yaml chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set llm-ctrl-uservice.LLM_MODEL_ID=${MODELNAME} --set vllm.LLM_MODEL_ID=${MODELNAME} --set tgi.enabled=false --set vllm.enabled=true
```

### IMPORTANT NOTE

1. Make sure your `MODELDIR` exists on the node where your workload is scheduled, so the downloaded model can be cached and reused on the next run. Otherwise, set `global.modelUseHostPath` to `null` if you don't want to cache the model.

2. Please set the `http_proxy`, `https_proxy` and `no_proxy` values while installing the chart if you are behind a proxy, as shown below.
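For example (a sketch; the proxy keys are the `global` values defined in this chart's `values.yaml`):

```bash
# Propagate proxy settings to all subcharts when installing behind a corporate proxy
helm install chatqna chatqna \
  --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} \
  --set global.http_proxy=${http_proxy} \
  --set global.https_proxy=${https_proxy} \
  --set global.no_proxy=${no_proxy}
```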

Comment on lines +95 to +96

Contributor

IMHO duplicating general information to application READMEs is not maintainable, there are too many of them. Instead you could include a link to the general options (helm-charts/README.md).

## Verify

To verify the installation, run the command `kubectl get pod` to make sure all pods are running.
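Optionally (a sketch, not part of the chart's documented flow), wait for all pods to become Ready before testing:

```bash
# Block until every pod in the release namespace reports Ready, or give up after 10 minutes
kubectl wait --for=condition=Ready pod --all --timeout=600s
```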
@@ -52,8 +106,9 @@ Run the command `kubectl port-forward svc/chatqna 8888:8888` to expose the servi

Open another terminal and run the following command to verify the service is working:

```console
```bash
curl http://localhost:8888/v1/chatqna \
-X POST \
Contributor

Why add a redundant POST? `-d` already implies that (see man curl).

-H "Content-Type: application/json" \
-d '{"messages": "What is the revenue of Nike in 2023?"}'
```
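As the review comment above notes, `-d` already makes curl issue a POST, so an equivalent call (a sketch) can drop the explicit method flag:

```bash
curl http://localhost:8888/v1/chatqna \
  -H "Content-Type: application/json" \
  -d '{"messages": "What is the revenue of Nike in 2023?"}'
```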
@@ -71,12 +126,13 @@ Open a browser to access `http://<k8s-node-ip-address>:${port}` to play with the

## Values

| Key | Type | Default | Description |
| ----------------- | ------ | ----------------------------- | -------------------------------------------------------------------------------------- |
| image.repository | string | `"opea/chatqna"` | |
| service.port | string | `"8888"` | |
| tgi.LLM_MODEL_ID | string | `"Intel/neural-chat-7b-v3-3"` | Models id from https://huggingface.co/, or predownloaded model directory |
| global.monitoring | bop; | false | Enable usage metrics for the service components. See ../monitoring.md before enabling! |
| Key                        | Type   | Default                       | Description                                                                             |
| -------------------------- | ------ | ----------------------------- | --------------------------------------------------------------------------------------- |
| image.repository           | string | `"opea/chatqna"`              |                                                                                         |
| service.port               | string | `"8888"`                      |                                                                                         |
| tgi.LLM_MODEL_ID           | string | `"Intel/neural-chat-7b-v3-3"` | Model id from https://huggingface.co/, or a pre-downloaded model directory              |
| vllm-openvino.LLM_MODEL_ID | string | `"Intel/neural-chat-7b-v3-3"` | Model id from https://huggingface.co/, or a pre-downloaded model directory              |
| global.monitoring          | bool   | false                         | Enable usage metrics for the service components. See ../monitoring.md before enabling!  |

## Troubleshooting

25 changes: 25 additions & 0 deletions helm-charts/chatqna/ci-vllm-openvino-values.yaml
Contributor

As this is identical to the values file, it should be a symlink, not a copy of it.
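One possible way to do that (a sketch; the paths assume this PR's repository layout):

```bash
# Replace the copied CI values file with a symlink to the canonical one
cd helm-charts/chatqna
rm ci-vllm-openvino-values.yaml
ln -s vllm-openvino-values.yaml ci-vllm-openvino-values.yaml
```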

@@ -0,0 +1,25 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

tgi:
enabled: false

vllm:
enabled: true
openvino_enabled: true
image:
repository: opea/vllm-openvino
pullPolicy: IfNotPresent
# Overrides the image tag whose default is the chart appVersion.
tag: "latest"

extraCmdArgs: []

LLM_MODEL_ID: Intel/neural-chat-7b-v3-3

CUDA_GRAPHS: "0"
VLLM_CPU_KVCACHE_SPACE: 50
VLLM_OPENVINO_KVCACHE_SPACE: 32
OMPI_MCA_btl_vader_single_copy_mechanism: none

ov_command: ["/bin/bash"]
8 changes: 8 additions & 0 deletions helm-charts/chatqna/ci-vllm-values.yaml
@@ -0,0 +1,8 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

tgi:
enabled: false

vllm:
enabled: true
17 changes: 15 additions & 2 deletions helm-charts/chatqna/templates/deployment.yaml
@@ -33,12 +33,25 @@ spec:
containers:
- name: {{ .Release.Name }}
env:
{{- if .Values.vllm.enabled }}
- name: LLM_SERVICE_HOST_IP
value: {{ .Release.Name }}-llm-ctrl-uservice
- name: LLM_SERVER_HOST_IP
value: {{ .Release.Name }}-vllm
- name: LLM_MODEL
value: {{ .Values.vllm.LLM_MODEL_ID | quote }}
{{- else }}
- name: LLM_SERVICE_HOST_IP
value: {{ .Release.Name }}-llm-uservice
- name: LLM_SERVER_HOST_IP
value: {{ .Release.Name }}-tgi
- name: LLM_SERVER_PORT
value: "80"
- name: LLM_MODEL
value: {{ .Values.tgi.LLM_MODEL_ID | quote }}
{{- end }}
- name: RERANK_SERVICE_HOST_IP
value: {{ .Release.Name }}-reranking-usvc
- name: LLM_SERVER_PORT
value: "80"
- name: RERANK_SERVER_HOST_IP
value: {{ .Release.Name }}-teirerank
- name: RERANK_SERVER_PORT
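To check which branch of this conditional actually gets rendered (a sketch; the release name and chart path are assumptions), `helm template` can be run with the backend flags toggled:

```bash
# Render the ChatQnA deployment with vLLM enabled and inspect the LLM_* environment variables
helm template chatqna ./chatqna --set tgi.enabled=false --set vllm.enabled=true | grep -B1 -A1 "LLM_"
```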
18 changes: 17 additions & 1 deletion helm-charts/chatqna/values.yaml
@@ -22,6 +22,14 @@ nginx:
service:
type: NodePort

imagePullSecrets: []

podAnnotations: {}

podSecurityContext: {}

resources: {}

securityContext:
readOnlyRootFilesystem: true
allowPrivilegeEscalation: false
@@ -47,6 +55,14 @@ horizontalPodAutoscaler:
# Override values in specific subcharts
tgi:
LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
enabled: true

vllm:
LLM_MODEL_ID: Intel/neural-chat-7b-v3-3
enabled: false

llm-ctrl-uservice:
LLM_MODEL_ID: Intel/neural-chat-7b-v3-3

# disable guardrails-usvc by default
# See guardrails-values.yaml for guardrail related options
@@ -66,9 +82,9 @@ global:
https_proxy: ""
no_proxy: ""
HUGGINGFACEHUB_API_TOKEN: "insert-your-huggingface-token-here"

# set modelUseHostPath or modelUsePVC to use model cache.
modelUseHostPath: ""
# modelUseHostPath: /mnt/opea-models
# modelUsePVC: model-volume

# Install Prometheus serviceMonitors for service components
21 changes: 21 additions & 0 deletions helm-charts/chatqna/vllm-openvino-values.yaml
@@ -0,0 +1,21 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

vllm:
openvino_enabled: true
Contributor

Does not conform to Helm best practices: https://helm.sh/docs/chart_best_practices/values/

Should be either `openvinoEnabled: true` or `openvino: true`.
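A minimal sketch of the suggested rename (assuming the templates are updated to read the new key):

```yaml
vllm:
  # camelCase per Helm best practices, instead of openvino_enabled
  openvinoEnabled: true
```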

image:
repository: opea/vllm-openvino
pullPolicy: IfNotPresent
Contributor

Drop the value, it breaks CI testing for the latest tag (see #587).

# Overrides the image tag whose default is the chart appVersion.
tag: "latest"

extraCmdArgs: []

LLM_MODEL_ID: Intel/neural-chat-7b-v3-3

CUDA_GRAPHS: "0"
VLLM_CPU_KVCACHE_SPACE: 50
VLLM_OPENVINO_KVCACHE_SPACE: 32
OMPI_MCA_btl_vader_single_copy_mechanism: none

ov_command: ["/bin/bash"]
23 changes: 23 additions & 0 deletions helm-charts/common/llm-ctrl-uservice/.helmignore
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
14 changes: 14 additions & 0 deletions helm-charts/common/llm-ctrl-uservice/Chart.yaml
@@ -0,0 +1,14 @@
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

apiVersion: v2
name: llm-ctrl-uservice
description: A Helm chart for LLM controller microservice which connects with vLLM microservice to provide inferences.
type: application
version: 1.0.0
appVersion: "v1.0"
dependencies:
- name: vllm
version: 1.0.0
repository: file://../vllm
condition: vllm.enabled