[RHOAIENG-13566] Configure SSL for the communication between head and worker #314

Open
wants to merge 9 commits into
base: incubating
Changes from 7 commits
3 changes: 2 additions & 1 deletion config/overlays/odh/kustomization.yaml
@@ -2,9 +2,10 @@ apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../default
- ./ray_tls_resources.yaml

patches:
- path: odh_model_controller_manager_patch.yaml

configurations:
- params.yaml
- params.yaml
83 changes: 83 additions & 0 deletions config/overlays/odh/ray_tls_resources.yaml
@@ -0,0 +1,83 @@
apiVersion: v1
kind: Secret
metadata:
name: ray-ca-cert
labels:
opendatahub.io/managed: 'true'
data:
# output from cat ca.crt | base64
ca.crt: |
LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUZIekNDQXdlZ0F3SUJBZ0lVSlAwL1FCY0xTMFFFV1ZiRDE4NndyUnZjLzdFd0RRWUpLb1pJaHZjTkFRRUwKQlFBd0hqRWNNQm9HQTFVRUF3d1RjMlZzWmkxemFXZHVaV1F0WTJFdFkyVnlkREFnRncweU5ERXhNak13TXpReApOVEphR0E4eU1USTBNVEF6TURBek5ERTFNbG93SGpFY01Cb0dBMVVFQXd3VGMyVnNaaTF6YVdkdVpXUXRZMkV0ClkyVnlkRENDQWlJd0RRWUpLb1pJaHZjTkFRRUJCUUFEZ2dJUEFEQ0NBZ29DZ2dJQkFML0lpWVdabUc0Q05qOS8KMzV6RUJheU1tL2djeGhJSFArZ1M4S04wdm9YV2krdk1Kd1hEUWR3T0JOdEZaa1l0elpJc29ESHVHenRnN0RWNApyZXZONE5JOHRKTmc2b2Jma0tVcGQ3eHdvNHNIdXMwd0h6NWVwZGt1MDhNVzVzZFZOUFNRMkhpVnhialpXSnBRClpJYWYyQWRKa1psUFdtVDBaS1pPdFFEN2oySTRtM3VCeG1jYzhTNWhiNkpaYW9NbGNVVXhkNDFscG43T21iMGEKVjRBUGZiWS9vYytwZmVDczN1cG5xamxZamVGQjR2RTV4WU1ZV0FNeitJRGh4RTRxRGVSaXNMQnhhN1kvcFRScQo2OWVhVXN6Qjl5eEQ3R0FySTJsSDhyUCtVeGpGYUl1K2tBVjVtbjc4OXdlejh0TDVGNEErWlE5cGM5TVI2UXBuCmRkanlaRXcvcFpkdTcwVmo3WUE1MU91S2owcTF2dGw3d1BPcDBUc3lwUDhadW04ZkZSNG5KbmNPaFhMTWV3bGEKTWxBeFZaUWRiMEF5dEE2TUl0dFdXSjA5L3BEOWZ6SkdjRnZOL245ZzZWZ0o5NjNhbmRoTEwyYlJMc0VKRmxUTQpEdTJIeW1CNkErb0ZlcDdjZXNxOUpJRFhkVmFqR3NxMmgrZVpPNGdxWW5nWGNmQVl5ZUloYzlYNnFoT2QvVmlZCmY2eUZoOTNuUnRYTFFNdUJRN2E1WTFzRVN3RHp3WWJKdEtuK3NrcGg5SEtCMTdVRUVOU3BNNHJSNHdxekRQd3AKSmZZeWt2a2Iyd2w0TkNCb2pjaU9icDYwV2ZDQytRcTFsNEo4VXpSOEpvWmFiQ0IzOWxVcHBKa09qNVFxYnEwMApKaUFzWENQQlp3OCtnQnV0b2JBVUs0RklqMGVQQWdNQkFBR2pVekJSTUIwR0ExVWREZ1FXQkJTcmNlRWNhMjNxCjBUQm14VmtZTWtMQTNSeHpKekFmQmdOVkhTTUVHREFXZ0JTcmNlRWNhMjNxMFRCbXhWa1lNa0xBM1J4ekp6QVAKQmdOVkhSTUJBZjhFQlRBREFRSC9NQTBHQ1NxR1NJYjNEUUVCQ3dVQUE0SUNBUUJseW5IM2doVG9ObDQrTlQ1MgpNTTA1V1A3UCszVXJkQ0tGNEJCa0VzN0VueldSUjZ4bkVoVUY5VWhGZ0ZhTFBiQ1pacnlCS2krT1hrUHUva3JCCk12aE1LVGl0WnNWbVBzRktEWDYyVG9zMEJZV0VzanZ0VDM4WFhSZXA3T3BWR0lPQi85V09YVGl3VkpaT2tSZ2MKbHd3U2U1dnBQQXRpMzhUZ3BhM0FVSk5haG00bDhHNWF5WktRQWFnUDg1NHBFTjhPOW54Nk9odytWN1hzSGlNdQpwUmpvc0VTN0JJY1lXVGJxR05yNFR3eXo1cVUwOE9LOEUySFNFSnE5THA4YzZ6UTZnZzBhV1dLYWJyTUpNeSt2CkpIbjM5TEI0U3dONzJjVXJkRWYvQVVrWktNYTRKVFRjMTJnaGpKN1JvUENYUFdPOUJGZ09aZEdoUlpBYkZRMXgKcnB6b3BLZllkT0hWZ0tncG9MOVJIRm40TzZRaTBjbnBML0NZZEFGd0pX
NmZYcGEyekhobEJqWXlWdHk5T1Y4TQppV01IVUNXZnl4anVaSno5NFFWZGxLRGVrY2YzUFJzU0RBRGZ4TXlBYVdJQ1NnYXNHTVNPSnRoRTlGM0JhaXNvCnNYM1NLYzRFSEc4Sk1VM1hoeWYwbkhDY2hQdWVRblU5akFBVVBDMHRrVlZtOWhmMXVkdjlOTUk4bktncWRFMkMKK2ExNnR3RVBpZzhkS1pkaFRMOFdXMGwxS3FRcCs4SnBGdTdNWUM1SGNaa0F0NEE3QXlXOHRsYmlQQ1B4RVYwZwpsYkc0eXFyV2lIOG5rWE9tNFBKb0FEMDhzNjA5Y3lFY2Iyblgvck92KzBSdFdOOWFqZGNxYlM1Z0JJaEhRSVUwCko1N0cyTVFYZ0hHbU9QL1ZZTi8vMTlsaXd3PT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
Member

Is this only used for tests?

Contributor Author

@Jooho Jooho Nov 27, 2024

No, it will be used for internal SSL communication in Ray. It is just a self-signed certificate.
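
The `ray-ca-cert` Secret stores `ca.crt` and `ca.key` as base64 text (the inline comments mention `cat ca.crt | base64`). As a rough sketch of how such values might be produced and sanity-checked, the snippet below creates a throwaway self-signed CA in a temporary directory and round-trips the encoded certificate. The subject name, key size, lifetime and variable names are illustrative assumptions, not the values used in this PR.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)

# Create a throwaway self-signed CA (hypothetical subject and lifetime).
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout "$workdir/ca.key" -out "$workdir/ca.crt" \
  -subj "/CN=example-ray-ca" -days 1 2>/dev/null

# Encode both files as single-line base64, suitable for a Secret's data field.
ca_crt_b64=$(base64 < "$workdir/ca.crt" | tr -d '\n')
ca_key_b64=$(base64 < "$workdir/ca.key" | tr -d '\n')

# Round trip: decode the value again and confirm the certificate still parses.
decoded_subject=$(printf '%s' "$ca_crt_b64" | base64 -d | openssl x509 -noout -subject)
echo "$decoded_subject"
```

Decoding the value before applying the Secret is a cheap way to catch a truncated or double-encoded certificate.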

# output from cat ca.key | base64
ca.key: |
LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tCk1JSUpRd0lCQURBTkJna3Foa2lHOXcwQkFRRUZBQVNDQ1Mwd2dna3BBZ0VBQW9JQ0FRQy95SW1GbVpodUFqWS8KZjkrY3hBV3NqSnY0SE1ZU0J6L29FdkNqZEw2RjFvdnJ6Q2NGdzBIY0RnVGJSV1pHTGMyU0xLQXg3aHM3WU93MQplSzNyemVEU1BMU1RZT3FHMzVDbEtYZThjS09MQjdyTk1COCtYcVhaTHRQREZ1YkhWVFQwa05oNGxjVzQyVmlhClVHU0duOWdIU1pHWlQxcGs5R1NtVHJVQSs0OWlPSnQ3Z2NabkhQRXVZVytpV1dxREpYRkZNWGVOWmFaK3pwbTkKR2xlQUQzMjJQNkhQcVgzZ3JON3FaNm81V0kzaFFlTHhPY1dER0ZnRE0vaUE0Y1JPS2cza1lyQ3djV3UyUDZVMAphdXZYbWxMTXdmY3NRK3hnS3lOcFIvS3ovbE1ZeFdpTHZwQUZlWnArL1BjSHMvTFMrUmVBUG1VUGFYUFRFZWtLClozWFk4bVJNUDZXWGJ1OUZZKzJBT2RUcmlvOUt0YjdaZThEenFkRTdNcVQvR2Jwdkh4VWVKeVozRG9WeXpIc0oKV2pKUU1WV1VIVzlBTXJRT2pDTGJWbGlkUGY2US9YOHlSbkJiemY1L1lPbFlDZmV0MnAzWVN5OW0wUzdCQ1JaVQp6QTd0aDhwZ2VnUHFCWHFlM0hyS3ZTU0ExM1ZXb3hyS3RvZm5tVHVJS21KNEYzSHdHTW5pSVhQVitxb1RuZjFZCm1IK3NoWWZkNTBiVnkwRExnVU8ydVdOYkJFc0E4OEdHeWJTcC9ySktZZlJ5Z2RlMUJCRFVxVE9LMGVNS3N3ejgKS1NYMk1wTDVHOXNKZURRZ2FJM0lqbTZldEZud2d2a0t0WmVDZkZNMGZDYUdXbXdnZC9aVkthU1pEbytVS202dApOQ1lnTEZ3andXY1BQb0FicmFHd0ZDdUJTSTlIandJREFRQUJBb0lDQUFyYzkwaG9ud3VIWGI3ZmNtU0IxU3JZClZPWWt1WDl6aHQvRWxIb1E5cDNFSSswNWhWaFdCTmpMNjBvYXRuRlhtenk3emZtTWMyRTcyemlPam1OdmpvOGcKY1l4eDlMYmQycG5RWUlBWEJ0eDV5UUxJWUFaSUwySys3NjloRUlLYksvVzQxZG9wN05vekFMQm9MMW1FenlSZgpWS0hFU0ZDMHptS3hNOUpMYllYeWowMm9QbUhBY0NHdGJHdjFrZGZ4RkdjNldrZy80c0tnY05ld3NueUdTb0lICm8zd21ZSnkvSjUxTDF5QlhPL2J2Y1hobHNMd3djamNCQ0FNUUU0aE42UjJKUUwrdDBEWGt2SjBQcnZzRE9wa0kKakdzTlEzMWVPcEpERmdwL21zNlFNWnpObHhwdXNGQTVnNUNkaUpRMHNkSGpOdUtqTXhyeUxKRk1HY0l0OExEQwpRVzF2akxLR0l1UWtraGwxOWU1S1N2SDdjUjJja0pDME5vTzhnekpudzd0dTRGaHJaK0xQeXF2R3VSYU55a2RmCi9BKzNEOUE2RW1PNWRldFU2RzJkK0l2TmprdG91Z05UalZIUklDbk9oL01zRmlFQXdycHltVVNISzhKTjVpSjIKUm1rNFljNWlXUjhOUWs0Wkh6aVFGSHJSWkh0TW9DcEkvR2ZGcnYyRVE0bFpOOG5tZHdDWDR4a3JObUJ2ZnlIdgpLWW0yMU5VWDc5U3lRbHd5VS9lNUR6eTA3Si9zcTdoVU8xN3hVcXNzTVpZTGZCSFF0VFU0VVAwbnBOZmtxUFM1CjdJRUtIVWwyRlZudXR0THoxc0ZVVHhJTS90aE9lczRtWElrOExiYzI0Yk45VWlteXplVEN1bE83a0hZSDhTVkEKZDJqZFBTZXhZSTdMeWFVNnFHRzVBb0lCQVFEbHlVQk5CaTRNekdxVnh5
NjNCY1dyZC9rdVYrYTFLQ3MyYVhzagpLbVlMT0xrSkhUSjI0YU9EWkJBVGxEMllwclZEOUM1UThTeDdPQVdlQ3FqWHd1MndsOTlabXNIVkxiQTZMRUZ6CnBoYTNQVHhkaWFpMElwZVY2ZFpIQnQrdjVDVGsvSnpxSVpjc1J2REFnNTFHYzgxbERxTzFNbnVqbldBcGpSMmMKd05ZVXd6a3hicHVTc1ZHYzFBZ09tVHBHN1MrdWVVQ2FGU0NVaGkyVWVoblM5dkNrU3Y0QTZMRlpiaXhFeWp6aApycU9mN1d1TTVUWkFoTGM2RTJUQnVOeWJlWW9DblFMdHF3dnUxaFhDOGU4TGlQWFRlMVJ4U2x5dXA5RDhiWEZBCjVPVmFUZjAzcFFweURiOXNKeGhLN3FMbUgrSjlUeU5JamhTTUZQT2pKNEJFRTlPOUFvSUJBUURWcVhDMGdCVzUKYlNUWmUzc3l1QVltRi9hVDg1ZFh1NGFTMFBJR09MakE1M2h0RVdLUkJxd1JlU1prSFdtR05uNUIyOXVXTHg2UgpPZjFNOFJkY2NYSnlxMnp1TlBiWkpabllwS0x5N0FjeDBpc1RvMjdpUy9xRS85YndsNUo3QVU5UmZ2K2ZMK2RPCmxqUndRTGUvQ1dSVHVlTlNOSWpPUC96NWRra2J5Z1kvWHZHbmI0RUJheDY4K3J2a0NYbStGdFpXV3VoblM2Uy8KZHh3Ulo2VGRMd09RZTZQSzNzN3F4c2xWNmQ2dmwrSUpwa1VVZmRvWDNyWFlTeGx2cFlQYWJpWEpaVzdQWkZwRQpVQXc0VTFpSzVLMUt5d1ZjaHlhN2tQSlpRNUplS1pUT3lPL1d5ODZLak0vcUd3NUhDR2NOL2VMbDJKUUViUkwvClJiR0pGSmhUalpjN0FvSUJBQXlyNG0zYzcyRXBUSjloMG9PcFA5TksxR1RuMkFNWmFmaWdMSGd0K0Y2YURDb2kKZ0F2cU9YZ2ZabnVONnkrbDBjMGpoQUpXcWx0SkpaWW5oRlFSbmNYbE9oM1kyT09HbDNjOXhZWTVISHVTVnVmWgpsWUlKZms1NERLYnlEQmZJL3ZmWnJsV0M4TEV5WUVoZGVhak83ZjZxcGdCeC9qdHhqRUgrVkNtMndKZDRoSWpqClRwVHlUa3ZWclhRUW94UVNORlRzdnRGQVpRR0x2S3U1Wi84b092RDBhYmxuRzVDUThNUUNXd1VlK2tyeGJzTGcKU1BPWjNmakg1UUNCenppTHBUNnJwZU94VVFFa3NTS0U4T2V6NzhwdnZLSmF0VzIwTjJRVUxQQ2xMcmlpSUZxWApNVkpFeTgrTkFGdnhlTzR6eCt1ZEY1Y0Nyc05pekdTczR2ZmVHQWtDZ2dFQkFMdFRnbWdPd0gxQlR3U0t1Ym4vCkZBMEVCNEV5R2FlbTExY1RjSTY1M21ucXgyL0F4VTFucnlibXRCMGttR2Mra2JYR1FDRE5rUnc4M25NK0VZQlEKU3NwMHQ5MmxmQ05vVHhsZFJ5eDZlZGhaYnNFYUVsYS96SllkQk9NTjBUU2RNbUMrV3ZuRGN5WTRsU014NnFmSQpZVGp6Q25ZQmIweDlWNXVUOUljenVnU0hocEdKTm03Njd3azdQODZ2N0JnWVI3V1FvS0FuOXZxVFFIMldCRHFVClJLakJiaHFvL0h0azdCS3lLRGFGa0gxclZMZWhtN3cvMitrVjl1Z25FcEpJN2tKRDkwSkh0c2liOGdyVU1CWWUKWmp6a0FRQmQwanl5MlhnZndVMWpZWDluTnJoNUdjM3BwVVNZa2d6L05mTlRmRUtPZnovZUxjQzM1dTdMcXIzZQpydzhDZ2dFQkFMT2tsTkJNRVBmM20yTXBjaVRjRmNKb08vZzBMUUpHaTJtWkN6S1g3eDJFS0N2N1ZvVWVtRkk0CjRkRFVmSlBJWlBFTUpkTHRSUy9qUDEyZWkxek9lWHIrVGlUTklpUUVoemRtL0RZ
WUdjd2hyb0xLNDZVTFJKY0YKYzdxZ2xNQ1Z1MW9DTmtDdTJvZ08renczRm9makJzK1pqcE1BS2kyOTZ2ZDk2YVlYNThYR0RKekdmdjhuZEF1dwpEUmU1ZE5oQU5iaHZqSlM1VXJwNnhoMVMycTNYOHorTlFGWW9CNDM1Q2NXNW50WWMzemIxYVdzY0NxMWJsUGJGCjc0QTFLTHJNNlpvU0ZlcUVWZzhvajhpWjlDaitiTTJXYm9BREIvRTROM0kyNmFDK1dDRWxtdTd3ZDdQaExQT2IKN3RrTXh2Zm10dDE5T2dYbTRKZm9SZWlkMTNYbHFoZz0KLS0tLS1FTkQgUFJJVkFURSBLRVktLS0tLQo=
---
apiVersion: v1
kind: ConfigMap
metadata:
name: ray-tls-scripts
Member

scripts -> script since there's only one shell script

Contributor Author

There are 2 scripts in the ConfigMap, so I use the plural.

labels:
opendatahub.io/managed: 'true'
data:
gencert_ray.sh: |
#!/bin/sh
## Create tls.key
openssl genrsa -out /etc/ray/tls/tls.key 2048

## Write CSR Config
cat > /etc/ray/tls/csr.conf <<EOF
[ req ]
default_bits = 2048
prompt = no
default_md = sha256
req_extensions = req_ext
distinguished_name = dn

[ dn ]
C = US
ST = North Carolina
L = Raleigh
O = redhat
OU = redhat
CN = self-signed-cert

[ req_ext ]
subjectAltName = @alt_names

[ alt_names ]
DNS.1 = localhost
DNS.2 = *.${POD_NAMESPACE}.svc.cluster.local
IP.1 = 127.0.0.1
IP.2 = $POD_IP

EOF

## Create CSR using tls.key
openssl req -new -key /etc/ray/tls/tls.key -out /etc/ray/tls/ca.csr -config /etc/ray/tls/csr.conf

## Write cert config
cat > /etc/ray/tls/cert.conf <<EOF

authorityKeyIdentifier=keyid,issuer
basicConstraints=CA:FALSE
keyUsage = digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment
subjectAltName = @alt_names

[alt_names]
DNS.1 = localhost
DNS.2 = *.${POD_NAMESPACE}.svc.cluster.local
IP.1 = 127.0.0.1
IP.2 = $POD_IP

EOF

## create serial file
echo '01' > /tmp/ca.srl

## Generate tls.crt
openssl x509 -req \
-in /etc/ray/tls/ca.csr \
-CA /etc/ca/tls/ca.crt -CAkey /etc/ca/tls/ca.key \
-CAserial /tmp/ca.srl -out /etc/ray/tls/tls.crt \
-days 36500 \
-sha256 -extfile /etc/ray/tls/cert.conf
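
To check what `gencert_ray.sh` is expected to produce, the following sketch repeats its key, CSR and signing steps in a temporary directory and then verifies the result against the CA. It is an approximation under stated assumptions: a throwaway CA stands in for the mounted `/etc/ca/tls/ca.crt` and `ca.key`, the subject is passed inline instead of through `csr.conf`, and the `POD_NAMESPACE` and `POD_IP` defaults are hypothetical values.

```shell
#!/bin/sh
set -eu

tmp=$(mktemp -d)
# Hypothetical stand-ins for the values injected through the downward API.
POD_NAMESPACE=${POD_NAMESPACE:-demo-ns}
POD_IP=${POD_IP:-10.0.0.5}

# Throwaway CA standing in for the mounted ca.crt / ca.key.
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout "$tmp/ca.key" -out "$tmp/ca.crt" \
  -subj "/CN=example-ray-ca" -days 1 2>/dev/null

# Leaf key and CSR, as in gencert_ray.sh but with the subject given inline.
openssl genrsa -out "$tmp/tls.key" 2048 2>/dev/null
openssl req -new -key "$tmp/tls.key" -out "$tmp/tls.csr" -subj "/CN=self-signed-cert"

# Extension file; the unquoted heredoc expands the two pod variables.
cat > "$tmp/cert.conf" <<EOF
basicConstraints=CA:FALSE
keyUsage = digitalSignature, keyEncipherment
subjectAltName = @alt_names

[alt_names]
DNS.1 = localhost
DNS.2 = *.${POD_NAMESPACE}.svc.cluster.local
IP.1 = 127.0.0.1
IP.2 = ${POD_IP}
EOF

# Sign the CSR with the CA, attaching the SAN extensions.
openssl x509 -req -in "$tmp/tls.csr" \
  -CA "$tmp/ca.crt" -CAkey "$tmp/ca.key" -CAcreateserial \
  -out "$tmp/tls.crt" -days 1 -sha256 -extfile "$tmp/cert.conf" 2>/dev/null

# Check the chain and list the SAN entries of the issued certificate.
verify_result=$(openssl verify -CAfile "$tmp/ca.crt" "$tmp/tls.crt")
san_text=$(openssl x509 -in "$tmp/tls.crt" -noout -text | grep -A1 'Subject Alternative Name')
echo "$verify_result"
echo "$san_text"
```

Running the same two checks inside a pod, against `/etc/ca/tls/ca.crt` and `/etc/ray/tls/tls.crt`, would show whether the certificate chains to the CA and whether the pod IP and the namespace wildcard ended up in the SAN list.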
113 changes: 94 additions & 19 deletions config/runtimes/vllm-multinode-template.yaml
@@ -12,7 +12,7 @@ metadata:
template.openshift.io/documentation-url: https://github.com/opendatahub-io/vllm
template.openshift.io/long-description: This template defines resources needed to deploy vLLM ServingRuntime Multi-Node with KServe in Red Hat OpenShift AI
opendatahub.io/modelServingSupport: '["single"]'
opendatahub.io/apiProtocol: "REST"
opendatahub.io/apiProtocol: 'REST'
name: vllm-multinode-runtime-template
objects:
- apiVersion: serving.kserve.io/v1alpha1
@@ -26,8 +26,8 @@ objects:
opendatahub.io/dashboard: "false"
spec:
annotations:
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
prometheus.io/port: '8080'
prometheus.io/path: '/metrics'
multiModel: false
supportedModelFormats:
- autoSelect: true
@@ -36,11 +36,16 @@ objects:
containers:
- name: kserve-container
image: $(vllm-image)
command: [ "bash", "-c" ]
command: ['bash', '-c']
args:
- |
# Generate self signed certificate
if [[ $RAY_USE_TLS == "1" ]]; then
/etc/gen/tls/gencert_ray.sh
fi
ray start --head --disable-usage-stats --include-dashboard false
# wait for other node to join

# Wait for other node to join
until [[ $(ray status --address ${RAY_ADDRESS} | grep -c node_) -eq ${PIPELINE_PARALLEL_SIZE} ]]; do
echo "Waiting..."
sleep 1
@@ -49,33 +54,52 @@ objects:

export SERVED_MODEL_NAME=${MODEL_NAME}
export MODEL_NAME=${MODEL_DIR}

exec python3 -m vllm.entrypoints.openai.api_server --port=8080 --distributed-executor-backend ray --model=${MODEL_NAME} --served-model-name=${SERVED_MODEL_NAME} --tensor-parallel-size=${TENSOR_PARALLEL_SIZE} --pipeline-parallel-size=${PIPELINE_PARALLEL_SIZE} --disable_custom_all_reduce
env:
- name: RAY_USE_TLS
value: '1'
- name: RAY_TLS_SERVER_CERT
value: '/etc/ray/tls/tls.crt'
- name: RAY_TLS_SERVER_KEY
value: '/etc/ray/tls/tls.key'
- name: RAY_TLS_CA_CERT
value: '/etc/ca/tls/ca.crt'
- name: RAY_PORT
value: "6379"
value: '6379'
- name: RAY_ADDRESS
value: 127.0.0.1:6379
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: VLLM_NO_USAGE_STATS
value: "1"
value: '1'
- name: HOME
value: /tmp
- name: HF_HOME
value: /tmp/hf_home
resources:
limits:
cpu: "16"
cpu: '16'
memory: 48Gi
requests:
cpu: "8"
cpu: '8'
memory: 24Gi
volumeMounts:
- name: shm
mountPath: /dev/shm
- mountPath: /etc/ca/tls
name: ray-ca-cert
readOnly: true
- mountPath: /etc/ray/tls
name: ray-tls
- mountPath: /etc/gen/tls
name: gen-tls-script
livenessProbe:
failureThreshold: 2
periodSeconds: 5
@@ -133,7 +157,7 @@ objects:
echo "Unhealthy - Used: ${used_gpu}, Reserved: ${reserved_gpu}"
exit 1
fi

# Check model health
health_check=$(curl -o /dev/null -s -w "%{http_code}\n" http://localhost:8080/health)
if [[ ${health_check} != 200 ]]; then
@@ -158,7 +182,7 @@ objects:
echo "Unhealthy - Registered nodes count (${registered_node_count}) does not match PIPELINE_PARALLEL_SIZE (${PIPELINE_PARALLEL_SIZE})."
exit 1
fi

# Double check to make sure Model is ready to serve.
for i in 1 2; do
# Check model health
@@ -177,15 +201,33 @@ objects:
emptyDir:
medium: Memory
sizeLimit: 12Gi
- name: ray-ca-cert
secret:
secretName: ray-ca-cert
- name: ray-tls
emptyDir: {}
# The gencert_ray.sh can be prebaked into the docker container so the configMap is optional
- name: gen-tls-script
configMap:
name: ray-tls-scripts
defaultMode: 0777
# An array of keys from the ConfigMap to create as files
items:
- key: gencert_ray.sh
path: gencert_ray.sh
workerSpec:
pipelineParallelSize: 2
tensorParallelSize: 1
containers:
- name: worker-container
image: $(vllm-image)
command: [ "bash", "-c" ]
command: ['bash', '-c']
args:
- |
# Generate self signed certificate
if [[ $RAY_USE_TLS == "1" ]]; then
/etc/gen/tls/gencert_ray.sh
fi
SECONDS=0

while true; do
@@ -203,32 +245,51 @@ objects:
echo "$SECONDS seconds elapsed: Still waiting for Global Control Service(GCS) to be ready."
echo "For troubleshooting, refer to the FAQ at https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting/troubleshooting.html#kuberay-troubleshootin-guides"
fi

sleep 5
done

export RAY_HEAD_ADDRESS="${HEAD_SVC}.${POD_NAMESPACE}.svc.cluster.local:6379"
echo "Attempting to connect to Ray cluster at $RAY_HEAD_ADDRESS ..."
ray start --address="${RAY_HEAD_ADDRESS}" --block
env:
- name: RAY_USE_TLS
value: '1'
- name: RAY_TLS_SERVER_CERT
value: '/etc/ray/tls/tls.crt'
- name: RAY_TLS_SERVER_KEY
value: '/etc/ray/tls/tls.key'
- name: RAY_TLS_CA_CERT
value: '/etc/ca/tls/ca.crt'
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
resources:
limits:
cpu: "16"
cpu: '16'
memory: 48Gi
requests:
cpu: "8"
cpu: '8'
memory: 24Gi
volumeMounts:
- name: shm
mountPath: /dev/shm
- mountPath: /etc/ca/tls
name: ray-ca-cert
readOnly: true
- mountPath: /etc/ray/tls
name: ray-tls
- mountPath: /etc/gen/tls
name: gen-tls-script
livenessProbe:
failureThreshold: 2
periodSeconds: 5
@@ -244,7 +305,7 @@ objects:
if [[ ${registered_node_count} -ne "${PIPELINE_PARALLEL_SIZE}" ]]; then
echo "Unhealthy - Registered nodes count (${registered_node_count}) does not match PIPELINE_PARALLEL_SIZE (${PIPELINE_PARALLEL_SIZE})."
exit 1
fi
fi
startupProbe:
failureThreshold: 40
periodSeconds: 30
@@ -261,7 +322,7 @@ objects:
echo "Unhealthy - Registered nodes count (${registered_node_count}) does not match PIPELINE_PARALLEL_SIZE (${PIPELINE_PARALLEL_SIZE})."
exit 1
fi

# Double check to make sure Model is ready to serve.
for i in 1 2; do
# Check model health
@@ -276,4 +337,18 @@ objects:
- name: shm
emptyDir:
medium: Memory
sizeLimit: 12Gi
sizeLimit: 12Gi
- name: ray-tls
emptyDir: {}
- name: ray-ca-cert
secret:
secretName: ray-ca-cert
Contributor

Improvement for later: we should find a way to not mount the root CA certificate. This is for security reasons.

# The gencert_ray.sh can be prebaked into the docker container so the configMap is optional
- name: gen-tls-script
configMap:
name: ray-tls-scripts
defaultMode: 0777
# An array of keys from the ConfigMap to create as files
items:
- key: gencert_ray.sh
path: gencert_ray.sh
Contributor

Can the scripts be made part of the runtime image?

Otherwise, the template is no longer self-contained.
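
For readers following the head container's startup logic in `vllm-multinode-template.yaml`, here is a small sketch of its control flow: certificate generation is gated on `RAY_USE_TLS`, and the script then polls until the number of `node_` entries matches `PIPELINE_PARALLEL_SIZE`. The template itself uses bash `[[ ]]` and the real `ray status` command; this version uses POSIX `[ ]`, and `fake_gencert` and `fake_ray_status` are hypothetical stubs (with simplified output) so it can run without Ray or OpenSSL.

```shell
#!/bin/sh
set -eu

RAY_USE_TLS=1
PIPELINE_PARALLEL_SIZE=2

# Counter file so the stub can report one more node on every call.
state_file=$(mktemp)
echo 0 > "$state_file"

# Hypothetical stub for /etc/gen/tls/gencert_ray.sh.
cert_generated=no
fake_gencert() { cert_generated=yes; }

# Hypothetical stub for `ray status`: prints one "node_" line per joined node.
fake_ray_status() {
  n=$(cat "$state_file")
  n=$((n + 1))
  echo "$n" > "$state_file"
  i=0
  while [ "$i" -lt "$n" ]; do
    echo " 1 node_$i"
    i=$((i + 1))
  done
}

# Generate the certificate only when TLS is switched on.
if [ "$RAY_USE_TLS" = "1" ]; then
  fake_gencert
fi

# Wait until every expected node has registered.
waits=0
until [ "$(fake_ray_status | grep -c node_)" -eq "$PIPELINE_PARALLEL_SIZE" ]; do
  waits=$((waits + 1))
  echo "Waiting..."
done
echo "All $PIPELINE_PARALLEL_SIZE nodes registered after $waits wait(s)"
```

In the template the loop sleeps one second between polls; the stub omits the sleep so the sketch finishes immediately.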

6 changes: 6 additions & 0 deletions controllers/constants/constants.go
@@ -79,3 +79,9 @@ const (
NimRuntimeTemplateName = "nvidia-nim-serving-template"
NimPullSecretName = "nvidia-nim-image-pull"
)

// Ray
const (
RayCATlsSecretName = "ray-ca-cert"
RayTlsScriptConfigMapName = "ray-tls-scripts"
)