[RHOAIENG-13566] Configure SSL for the communication between head and worker #314
base: incubating
Changes from 7 commits
70ad46b
bfeb9fb
8c974e3
d3e769d
63c9e30
d2085fb
5024ef2
797907d
64fe75b
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,83 @@ | ||
apiVersion: v1 | ||
kind: Secret | ||
metadata: | ||
name: ray-ca-cert | ||
labels: | ||
opendatahub.io/managed: 'true' | ||
data: | ||
# output from cat ca.crt | base64 | ||
ca.crt: | | ||
LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUZIekNDQXdlZ0F3SUJBZ0lVSlAwL1FCY0xTMFFFV1ZiRDE4NndyUnZjLzdFd0RRWUpLb1pJaHZjTkFRRUwKQlFBd0hqRWNNQm9HQTFVRUF3d1RjMlZzWmkxemFXZHVaV1F0WTJFdFkyVnlkREFnRncweU5ERXhNak13TXpReApOVEphR0E4eU1USTBNVEF6TURBek5ERTFNbG93SGpFY01Cb0dBMVVFQXd3VGMyVnNaaTF6YVdkdVpXUXRZMkV0ClkyVnlkRENDQWlJd0RRWUpLb1pJaHZjTkFRRUJCUUFEZ2dJUEFEQ0NBZ29DZ2dJQkFML0lpWVdabUc0Q05qOS8KMzV6RUJheU1tL2djeGhJSFArZ1M4S04wdm9YV2krdk1Kd1hEUWR3T0JOdEZaa1l0elpJc29ESHVHenRnN0RWNApyZXZONE5JOHRKTmc2b2Jma0tVcGQ3eHdvNHNIdXMwd0h6NWVwZGt1MDhNVzVzZFZOUFNRMkhpVnhialpXSnBRClpJYWYyQWRKa1psUFdtVDBaS1pPdFFEN2oySTRtM3VCeG1jYzhTNWhiNkpaYW9NbGNVVXhkNDFscG43T21iMGEKVjRBUGZiWS9vYytwZmVDczN1cG5xamxZamVGQjR2RTV4WU1ZV0FNeitJRGh4RTRxRGVSaXNMQnhhN1kvcFRScQo2OWVhVXN6Qjl5eEQ3R0FySTJsSDhyUCtVeGpGYUl1K2tBVjVtbjc4OXdlejh0TDVGNEErWlE5cGM5TVI2UXBuCmRkanlaRXcvcFpkdTcwVmo3WUE1MU91S2owcTF2dGw3d1BPcDBUc3lwUDhadW04ZkZSNG5KbmNPaFhMTWV3bGEKTWxBeFZaUWRiMEF5dEE2TUl0dFdXSjA5L3BEOWZ6SkdjRnZOL245ZzZWZ0o5NjNhbmRoTEwyYlJMc0VKRmxUTQpEdTJIeW1CNkErb0ZlcDdjZXNxOUpJRFhkVmFqR3NxMmgrZVpPNGdxWW5nWGNmQVl5ZUloYzlYNnFoT2QvVmlZCmY2eUZoOTNuUnRYTFFNdUJRN2E1WTFzRVN3RHp3WWJKdEtuK3NrcGg5SEtCMTdVRUVOU3BNNHJSNHdxekRQd3AKSmZZeWt2a2Iyd2w0TkNCb2pjaU9icDYwV2ZDQytRcTFsNEo4VXpSOEpvWmFiQ0IzOWxVcHBKa09qNVFxYnEwMApKaUFzWENQQlp3OCtnQnV0b2JBVUs0RklqMGVQQWdNQkFBR2pVekJSTUIwR0ExVWREZ1FXQkJTcmNlRWNhMjNxCjBUQm14VmtZTWtMQTNSeHpKekFmQmdOVkhTTUVHREFXZ0JTcmNlRWNhMjNxMFRCbXhWa1lNa0xBM1J4ekp6QVAKQmdOVkhSTUJBZjhFQlRBREFRSC9NQTBHQ1NxR1NJYjNEUUVCQ3dVQUE0SUNBUUJseW5IM2doVG9ObDQrTlQ1MgpNTTA1V1A3UCszVXJkQ0tGNEJCa0VzN0VueldSUjZ4bkVoVUY5VWhGZ0ZhTFBiQ1pacnlCS2krT1hrUHUva3JCCk12aE1LVGl0WnNWbVBzRktEWDYyVG9zMEJZV0VzanZ0VDM4WFhSZXA3T3BWR0lPQi85V09YVGl3VkpaT2tSZ2MKbHd3U2U1dnBQQXRpMzhUZ3BhM0FVSk5haG00bDhHNWF5WktRQWFnUDg1NHBFTjhPOW54Nk9odytWN1hzSGlNdQpwUmpvc0VTN0JJY1lXVGJxR05yNFR3eXo1cVUwOE9LOEUySFNFSnE5THA4YzZ6UTZnZzBhV1dLYWJyTUpNeSt2CkpIbjM5TEI0U3dONzJjVXJkRWYvQVVrWktNYTRKVFRjMTJnaGpKN1JvUENYUFdPOUJGZ09aZEdoUlpBYkZRMXgKcnB6b3BLZllkT0hWZ0tncG9MOVJIRm40TzZRaTBjbnBML0NZZEFGd0pX
NmZYcGEyekhobEJqWXlWdHk5T1Y4TQppV01IVUNXZnl4anVaSno5NFFWZGxLRGVrY2YzUFJzU0RBRGZ4TXlBYVdJQ1NnYXNHTVNPSnRoRTlGM0JhaXNvCnNYM1NLYzRFSEc4Sk1VM1hoeWYwbkhDY2hQdWVRblU5akFBVVBDMHRrVlZtOWhmMXVkdjlOTUk4bktncWRFMkMKK2ExNnR3RVBpZzhkS1pkaFRMOFdXMGwxS3FRcCs4SnBGdTdNWUM1SGNaa0F0NEE3QXlXOHRsYmlQQ1B4RVYwZwpsYkc0eXFyV2lIOG5rWE9tNFBKb0FEMDhzNjA5Y3lFY2Iyblgvck92KzBSdFdOOWFqZGNxYlM1Z0JJaEhRSVUwCko1N0cyTVFYZ0hHbU9QL1ZZTi8vMTlsaXd3PT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo= | ||
# output from cat ca.key | base64 | ||
ca.key: | | ||
LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0tLS0tCk1JSUpRd0lCQURBTkJna3Foa2lHOXcwQkFRRUZBQVNDQ1Mwd2dna3BBZ0VBQW9JQ0FRQy95SW1GbVpodUFqWS8KZjkrY3hBV3NqSnY0SE1ZU0J6L29FdkNqZEw2RjFvdnJ6Q2NGdzBIY0RnVGJSV1pHTGMyU0xLQXg3aHM3WU93MQplSzNyemVEU1BMU1RZT3FHMzVDbEtYZThjS09MQjdyTk1COCtYcVhaTHRQREZ1YkhWVFQwa05oNGxjVzQyVmlhClVHU0duOWdIU1pHWlQxcGs5R1NtVHJVQSs0OWlPSnQ3Z2NabkhQRXVZVytpV1dxREpYRkZNWGVOWmFaK3pwbTkKR2xlQUQzMjJQNkhQcVgzZ3JON3FaNm81V0kzaFFlTHhPY1dER0ZnRE0vaUE0Y1JPS2cza1lyQ3djV3UyUDZVMAphdXZYbWxMTXdmY3NRK3hnS3lOcFIvS3ovbE1ZeFdpTHZwQUZlWnArL1BjSHMvTFMrUmVBUG1VUGFYUFRFZWtLClozWFk4bVJNUDZXWGJ1OUZZKzJBT2RUcmlvOUt0YjdaZThEenFkRTdNcVQvR2Jwdkh4VWVKeVozRG9WeXpIc0oKV2pKUU1WV1VIVzlBTXJRT2pDTGJWbGlkUGY2US9YOHlSbkJiemY1L1lPbFlDZmV0MnAzWVN5OW0wUzdCQ1JaVQp6QTd0aDhwZ2VnUHFCWHFlM0hyS3ZTU0ExM1ZXb3hyS3RvZm5tVHVJS21KNEYzSHdHTW5pSVhQVitxb1RuZjFZCm1IK3NoWWZkNTBiVnkwRExnVU8ydVdOYkJFc0E4OEdHeWJTcC9ySktZZlJ5Z2RlMUJCRFVxVE9LMGVNS3N3ejgKS1NYMk1wTDVHOXNKZURRZ2FJM0lqbTZldEZud2d2a0t0WmVDZkZNMGZDYUdXbXdnZC9aVkthU1pEbytVS202dApOQ1lnTEZ3andXY1BQb0FicmFHd0ZDdUJTSTlIandJREFRQUJBb0lDQUFyYzkwaG9ud3VIWGI3ZmNtU0IxU3JZClZPWWt1WDl6aHQvRWxIb1E5cDNFSSswNWhWaFdCTmpMNjBvYXRuRlhtenk3emZtTWMyRTcyemlPam1OdmpvOGcKY1l4eDlMYmQycG5RWUlBWEJ0eDV5UUxJWUFaSUwySys3NjloRUlLYksvVzQxZG9wN05vekFMQm9MMW1FenlSZgpWS0hFU0ZDMHptS3hNOUpMYllYeWowMm9QbUhBY0NHdGJHdjFrZGZ4RkdjNldrZy80c0tnY05ld3NueUdTb0lICm8zd21ZSnkvSjUxTDF5QlhPL2J2Y1hobHNMd3djamNCQ0FNUUU0aE42UjJKUUwrdDBEWGt2SjBQcnZzRE9wa0kKakdzTlEzMWVPcEpERmdwL21zNlFNWnpObHhwdXNGQTVnNUNkaUpRMHNkSGpOdUtqTXhyeUxKRk1HY0l0OExEQwpRVzF2akxLR0l1UWtraGwxOWU1S1N2SDdjUjJja0pDME5vTzhnekpudzd0dTRGaHJaK0xQeXF2R3VSYU55a2RmCi9BKzNEOUE2RW1PNWRldFU2RzJkK0l2TmprdG91Z05UalZIUklDbk9oL01zRmlFQXdycHltVVNISzhKTjVpSjIKUm1rNFljNWlXUjhOUWs0Wkh6aVFGSHJSWkh0TW9DcEkvR2ZGcnYyRVE0bFpOOG5tZHdDWDR4a3JObUJ2ZnlIdgpLWW0yMU5VWDc5U3lRbHd5VS9lNUR6eTA3Si9zcTdoVU8xN3hVcXNzTVpZTGZCSFF0VFU0VVAwbnBOZmtxUFM1CjdJRUtIVWwyRlZudXR0THoxc0ZVVHhJTS90aE9lczRtWElrOExiYzI0Yk45VWlteXplVEN1bE83a0hZSDhTVkEKZDJqZFBTZXhZSTdMeWFVNnFHRzVBb0lCQVFEbHlVQk5CaTRNekdxVnh5
NjNCY1dyZC9rdVYrYTFLQ3MyYVhzagpLbVlMT0xrSkhUSjI0YU9EWkJBVGxEMllwclZEOUM1UThTeDdPQVdlQ3FqWHd1MndsOTlabXNIVkxiQTZMRUZ6CnBoYTNQVHhkaWFpMElwZVY2ZFpIQnQrdjVDVGsvSnpxSVpjc1J2REFnNTFHYzgxbERxTzFNbnVqbldBcGpSMmMKd05ZVXd6a3hicHVTc1ZHYzFBZ09tVHBHN1MrdWVVQ2FGU0NVaGkyVWVoblM5dkNrU3Y0QTZMRlpiaXhFeWp6aApycU9mN1d1TTVUWkFoTGM2RTJUQnVOeWJlWW9DblFMdHF3dnUxaFhDOGU4TGlQWFRlMVJ4U2x5dXA5RDhiWEZBCjVPVmFUZjAzcFFweURiOXNKeGhLN3FMbUgrSjlUeU5JamhTTUZQT2pKNEJFRTlPOUFvSUJBUURWcVhDMGdCVzUKYlNUWmUzc3l1QVltRi9hVDg1ZFh1NGFTMFBJR09MakE1M2h0RVdLUkJxd1JlU1prSFdtR05uNUIyOXVXTHg2UgpPZjFNOFJkY2NYSnlxMnp1TlBiWkpabllwS0x5N0FjeDBpc1RvMjdpUy9xRS85YndsNUo3QVU5UmZ2K2ZMK2RPCmxqUndRTGUvQ1dSVHVlTlNOSWpPUC96NWRra2J5Z1kvWHZHbmI0RUJheDY4K3J2a0NYbStGdFpXV3VoblM2Uy8KZHh3Ulo2VGRMd09RZTZQSzNzN3F4c2xWNmQ2dmwrSUpwa1VVZmRvWDNyWFlTeGx2cFlQYWJpWEpaVzdQWkZwRQpVQXc0VTFpSzVLMUt5d1ZjaHlhN2tQSlpRNUplS1pUT3lPL1d5ODZLak0vcUd3NUhDR2NOL2VMbDJKUUViUkwvClJiR0pGSmhUalpjN0FvSUJBQXlyNG0zYzcyRXBUSjloMG9PcFA5TksxR1RuMkFNWmFmaWdMSGd0K0Y2YURDb2kKZ0F2cU9YZ2ZabnVONnkrbDBjMGpoQUpXcWx0SkpaWW5oRlFSbmNYbE9oM1kyT09HbDNjOXhZWTVISHVTVnVmWgpsWUlKZms1NERLYnlEQmZJL3ZmWnJsV0M4TEV5WUVoZGVhak83ZjZxcGdCeC9qdHhqRUgrVkNtMndKZDRoSWpqClRwVHlUa3ZWclhRUW94UVNORlRzdnRGQVpRR0x2S3U1Wi84b092RDBhYmxuRzVDUThNUUNXd1VlK2tyeGJzTGcKU1BPWjNmakg1UUNCenppTHBUNnJwZU94VVFFa3NTS0U4T2V6NzhwdnZLSmF0VzIwTjJRVUxQQ2xMcmlpSUZxWApNVkpFeTgrTkFGdnhlTzR6eCt1ZEY1Y0Nyc05pekdTczR2ZmVHQWtDZ2dFQkFMdFRnbWdPd0gxQlR3U0t1Ym4vCkZBMEVCNEV5R2FlbTExY1RjSTY1M21ucXgyL0F4VTFucnlibXRCMGttR2Mra2JYR1FDRE5rUnc4M25NK0VZQlEKU3NwMHQ5MmxmQ05vVHhsZFJ5eDZlZGhaYnNFYUVsYS96SllkQk9NTjBUU2RNbUMrV3ZuRGN5WTRsU014NnFmSQpZVGp6Q25ZQmIweDlWNXVUOUljenVnU0hocEdKTm03Njd3azdQODZ2N0JnWVI3V1FvS0FuOXZxVFFIMldCRHFVClJLakJiaHFvL0h0azdCS3lLRGFGa0gxclZMZWhtN3cvMitrVjl1Z25FcEpJN2tKRDkwSkh0c2liOGdyVU1CWWUKWmp6a0FRQmQwanl5MlhnZndVMWpZWDluTnJoNUdjM3BwVVNZa2d6L05mTlRmRUtPZnovZUxjQzM1dTdMcXIzZQpydzhDZ2dFQkFMT2tsTkJNRVBmM20yTXBjaVRjRmNKb08vZzBMUUpHaTJtWkN6S1g3eDJFS0N2N1ZvVWVtRkk0CjRkRFVmSlBJWlBFTUpkTHRSUy9qUDEyZWkxek9lWHIrVGlUTklpUUVoemRtL0RZ
WUdjd2hyb0xLNDZVTFJKY0YKYzdxZ2xNQ1Z1MW9DTmtDdTJvZ08renczRm9makJzK1pqcE1BS2kyOTZ2ZDk2YVlYNThYR0RKekdmdjhuZEF1dwpEUmU1ZE5oQU5iaHZqSlM1VXJwNnhoMVMycTNYOHorTlFGWW9CNDM1Q2NXNW50WWMzemIxYVdzY0NxMWJsUGJGCjc0QTFLTHJNNlpvU0ZlcUVWZzhvajhpWjlDaitiTTJXYm9BREIvRTROM0kyNmFDK1dDRWxtdTd3ZDdQaExQT2IKN3RrTXh2Zm10dDE5T2dYbTRKZm9SZWlkMTNYbHFoZz0KLS0tLS1FTkQgUFJJVkFURSBLRVktLS0tLQo= | ||
--- | ||
apiVersion: v1 | ||
kind: ConfigMap | ||
metadata: | ||
name: ray-tls-scripts | ||
scripts -> script since there's only one shell script
There are 2 scripts in the ConfigMap, so I used the plural.
labels: | ||
opendatahub.io/managed: 'true' | ||
data: | ||
gencert_ray.sh: | | ||
#!/bin/sh | ||
## Create tls.key | ||
openssl genrsa -out /etc/ray/tls/tls.key 2048 | ||
|
||
## Write CSR Config | ||
cat > /etc/ray/tls/csr.conf <<EOF | ||
[ req ] | ||
default_bits = 2048 | ||
prompt = no | ||
default_md = sha256 | ||
req_extensions = req_ext | ||
distinguished_name = dn | ||
|
||
[ dn ] | ||
C = US | ||
ST = North Carolina | ||
L = Raleigh | ||
O = redhat | ||
OU = redhat | ||
CN = self-signed-cert | ||
|
||
[ req_ext ] | ||
subjectAltName = @alt_names | ||
|
||
[ alt_names ] | ||
DNS.1 = localhost | ||
DNS.2 = *.${POD_NAMESPACE}.svc.cluster.local | ||
IP.1 = 127.0.0.1 | ||
IP.2 = $POD_IP | ||
|
||
EOF | ||
|
||
## Create CSR using tls.key | ||
openssl req -new -key /etc/ray/tls/tls.key -out /etc/ray/tls/ca.csr -config /etc/ray/tls/csr.conf | ||
|
||
## Write cert config | ||
cat > /etc/ray/tls/cert.conf <<EOF | ||
|
||
authorityKeyIdentifier=keyid,issuer | ||
basicConstraints=CA:FALSE | ||
keyUsage = digitalSignature, nonRepudiation, keyEncipherment, dataEncipherment | ||
subjectAltName = @alt_names | ||
|
||
[alt_names] | ||
DNS.1 = localhost | ||
DNS.2 = *.${POD_NAMESPACE}.svc.cluster.local | ||
IP.1 = 127.0.0.1 | ||
IP.2 = $POD_IP | ||
|
||
EOF | ||
|
||
## create serial file | ||
echo '01' > /tmp/ca.srl | ||
|
||
## Generate tls.crt | ||
openssl x509 -req \ | ||
-in /etc/ray/tls/ca.csr \ | ||
-CA /etc/ca/tls/ca.crt -CAkey /etc/ca/tls/ca.key \ | ||
-CAserial /tmp/ca.srl -out /etc/ray/tls/tls.crt \ | ||
-days 36500 \ | ||
-sha256 -extfile /etc/ray/tls/cert.conf |
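The signing flow in gencert_ray.sh can be sanity-checked outside a pod with something like the sketch below. It is not part of this PR: it swaps the mounted `/etc/ca/tls` and `/etc/ray/tls` paths for a temporary directory, creates a throwaway CA in place of the `ray-ca-cert` Secret, and uses made-up defaults for `POD_NAMESPACE` and `POD_IP`. The `WORKDIR` variable, the `throwaway-test-ca` name, and the trimmed config files are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: replay the CSR -> sign -> verify flow in a temp dir.
# WORKDIR, the throwaway CA and the POD_* defaults are illustrative assumptions.
set -eu

WORKDIR=$(mktemp -d)
trap 'rm -rf "$WORKDIR"' EXIT
POD_NAMESPACE=${POD_NAMESPACE:-demo}
POD_IP=${POD_IP:-10.0.0.1}

# Throwaway CA standing in for the mounted ca.crt / ca.key
cat > "$WORKDIR/ca.conf" <<EOF
[ req ]
prompt = no
distinguished_name = dn
x509_extensions = v3_ca

[ dn ]
CN = throwaway-test-ca

[ v3_ca ]
basicConstraints = critical, CA:TRUE
EOF
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -config "$WORKDIR/ca.conf" \
  -keyout "$WORKDIR/ca.key" -out "$WORKDIR/ca.crt" 2>/dev/null

# Leaf key and CSR (trimmed version of csr.conf)
openssl genrsa -out "$WORKDIR/tls.key" 2048 2>/dev/null
cat > "$WORKDIR/csr.conf" <<EOF
[ req ]
prompt = no
default_md = sha256
distinguished_name = dn

[ dn ]
CN = self-signed-cert
EOF
openssl req -new -key "$WORKDIR/tls.key" \
  -out "$WORKDIR/tls.csr" -config "$WORKDIR/csr.conf"

# Extensions applied at signing time (trimmed version of cert.conf)
cat > "$WORKDIR/cert.conf" <<EOF
basicConstraints=CA:FALSE
subjectAltName = @alt_names

[alt_names]
DNS.1 = localhost
DNS.2 = *.${POD_NAMESPACE}.svc.cluster.local
IP.1 = 127.0.0.1
IP.2 = ${POD_IP}
EOF

# Sign the CSR with the CA, using an explicit serial file as the script does
echo '01' > "$WORKDIR/ca.srl"
openssl x509 -req \
  -in "$WORKDIR/tls.csr" \
  -CA "$WORKDIR/ca.crt" -CAkey "$WORKDIR/ca.key" \
  -CAserial "$WORKDIR/ca.srl" -out "$WORKDIR/tls.crt" \
  -days 1 \
  -sha256 -extfile "$WORKDIR/cert.conf" 2>/dev/null

# Check that the leaf chains to the CA and carries the expected SANs
VERIFY_RESULT=$(openssl verify -CAfile "$WORKDIR/ca.crt" "$WORKDIR/tls.crt")
SAN_TEXT=$(openssl x509 -in "$WORKDIR/tls.crt" -noout -text \
  | grep -A1 "Subject Alternative Name")
echo "$VERIFY_RESULT"
echo "$SAN_TEXT"
```

Inside a running pod, the last two checks could be pointed at `/etc/ca/tls/ca.crt` and `/etc/ray/tls/tls.crt` instead of the temporary files.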
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -12,7 +12,7 @@ metadata: | |
template.openshift.io/documentation-url: https://github.com/opendatahub-io/vllm | ||
template.openshift.io/long-description: This template defines resources needed to deploy vLLM ServingRuntime Multi-Node with KServe in Red Hat OpenShift AI | ||
opendatahub.io/modelServingSupport: '["single"]' | ||
opendatahub.io/apiProtocol: "REST" | ||
opendatahub.io/apiProtocol: 'REST' | ||
name: vllm-multinode-runtime-template | ||
objects: | ||
- apiVersion: serving.kserve.io/v1alpha1 | ||
|
@@ -26,8 +26,8 @@ objects: | |
opendatahub.io/dashboard: "false" | ||
spec: | ||
annotations: | ||
prometheus.io/port: "8080" | ||
prometheus.io/path: "/metrics" | ||
prometheus.io/port: '8080' | ||
prometheus.io/path: '/metrics' | ||
multiModel: false | ||
supportedModelFormats: | ||
- autoSelect: true | ||
|
@@ -36,11 +36,16 @@ objects: | |
containers: | ||
- name: kserve-container | ||
image: $(vllm-image) | ||
command: [ "bash", "-c" ] | ||
command: ['bash', '-c'] | ||
args: | ||
- | | ||
# Generate self signed certificate | ||
if [[ $RAY_USE_TLS == "1" ]]; then | ||
/etc/gen/tls/gencert_ray.sh | ||
fi | ||
ray start --head --disable-usage-stats --include-dashboard false | ||
# wait for other node to join | ||
|
||
# Wait for other node to join | ||
until [[ $(ray status --address ${RAY_ADDRESS} | grep -c node_) -eq ${PIPELINE_PARALLEL_SIZE} ]]; do | ||
echo "Waiting..." | ||
sleep 1 | ||
|
@@ -49,33 +54,52 @@ objects: | |
|
||
export SERVED_MODEL_NAME=${MODEL_NAME} | ||
export MODEL_NAME=${MODEL_DIR} | ||
|
||
exec python3 -m vllm.entrypoints.openai.api_server --port=8080 --distributed-executor-backend ray --model=${MODEL_NAME} --served-model-name=${SERVED_MODEL_NAME} --tensor-parallel-size=${TENSOR_PARALLEL_SIZE} --pipeline-parallel-size=${PIPELINE_PARALLEL_SIZE} --disable_custom_all_reduce | ||
env: | ||
- name: RAY_USE_TLS | ||
value: '1' | ||
- name: RAY_TLS_SERVER_CERT | ||
value: '/etc/ray/tls/tls.crt' | ||
- name: RAY_TLS_SERVER_KEY | ||
value: '/etc/ray/tls/tls.key' | ||
- name: RAY_TLS_CA_CERT | ||
value: '/etc/ca/tls/ca.crt' | ||
- name: RAY_PORT | ||
value: "6379" | ||
value: '6379' | ||
- name: RAY_ADDRESS | ||
value: 127.0.0.1:6379 | ||
- name: POD_NAMESPACE | ||
valueFrom: | ||
fieldRef: | ||
fieldPath: metadata.namespace | ||
- name: POD_IP | ||
valueFrom: | ||
fieldRef: | ||
fieldPath: status.podIP | ||
- name: VLLM_NO_USAGE_STATS | ||
value: "1" | ||
value: '1' | ||
- name: HOME | ||
value: /tmp | ||
- name: HF_HOME | ||
value: /tmp/hf_home | ||
resources: | ||
limits: | ||
cpu: "16" | ||
cpu: '16' | ||
memory: 48Gi | ||
requests: | ||
cpu: "8" | ||
cpu: '8' | ||
memory: 24Gi | ||
volumeMounts: | ||
- name: shm | ||
mountPath: /dev/shm | ||
- mountPath: /etc/ca/tls | ||
name: ray-ca-cert | ||
readOnly: true | ||
- mountPath: /etc/ray/tls | ||
name: ray-tls | ||
- mountPath: /etc/gen/tls | ||
name: gen-tls-script | ||
livenessProbe: | ||
failureThreshold: 2 | ||
periodSeconds: 5 | ||
|
@@ -133,7 +157,7 @@ objects: | |
echo "Unhealthy - Used: ${used_gpu}, Reserved: ${reserved_gpu}" | ||
exit 1 | ||
fi | ||
|
||
# Check model health | ||
health_check=$(curl -o /dev/null -s -w "%{http_code}\n" http://localhost:8080/health) | ||
if [[ ${health_check} != 200 ]]; then | ||
|
@@ -158,7 +182,7 @@ objects: | |
echo "Unhealthy - Registered nodes count (${registered_node_count}) does not match PIPELINE_PARALLEL_SIZE (${PIPELINE_PARALLEL_SIZE})." | ||
exit 1 | ||
fi | ||
|
||
# Double check to make sure Model is ready to serve. | ||
for i in 1 2; do | ||
# Check model health | ||
|
@@ -177,15 +201,33 @@ objects: | |
emptyDir: | ||
medium: Memory | ||
sizeLimit: 12Gi | ||
- name: ray-ca-cert | ||
secret: | ||
secretName: ray-ca-cert | ||
- name: ray-tls | ||
emptyDir: {} | ||
# The gencert_ray.sh can be prebaked into the docker container so the configMap is optional | ||
- name: gen-tls-script | ||
configMap: | ||
name: ray-tls-scripts | ||
defaultMode: 0777 | ||
# An array of keys from the ConfigMap to create as files | ||
items: | ||
- key: gencert_ray.sh | ||
path: gencert_ray.sh | ||
workerSpec: | ||
pipelineParallelSize: 2 | ||
tensorParallelSize: 1 | ||
containers: | ||
- name: worker-container | ||
image: $(vllm-image) | ||
command: [ "bash", "-c" ] | ||
command: ['bash', '-c'] | ||
args: | ||
- | | ||
# Generate self signed certificate | ||
if [[ $RAY_USE_TLS == "1" ]]; then | ||
/etc/gen/tls/gencert_ray.sh | ||
fi | ||
SECONDS=0 | ||
|
||
while true; do | ||
|
@@ -203,32 +245,51 @@ objects: | |
echo "$SECONDS seconds elapsed: Still waiting for Global Control Service(GCS) to be ready." | ||
echo "For troubleshooting, refer to the FAQ at https://docs.ray.io/en/master/cluster/kubernetes/troubleshooting/troubleshooting.html#kuberay-troubleshootin-guides" | ||
fi | ||
|
||
sleep 5 | ||
done | ||
|
||
export RAY_HEAD_ADDRESS="${HEAD_SVC}.${POD_NAMESPACE}.svc.cluster.local:6379" | ||
echo "Attempting to connect to Ray cluster at $RAY_HEAD_ADDRESS ..." | ||
ray start --address="${RAY_HEAD_ADDRESS}" --block | ||
env: | ||
- name: RAY_USE_TLS | ||
value: '1' | ||
- name: RAY_TLS_SERVER_CERT | ||
value: '/etc/ray/tls/tls.crt' | ||
- name: RAY_TLS_SERVER_KEY | ||
value: '/etc/ray/tls/tls.key' | ||
- name: RAY_TLS_CA_CERT | ||
value: '/etc/ca/tls/ca.crt' | ||
- name: POD_NAME | ||
valueFrom: | ||
fieldRef: | ||
fieldPath: metadata.name | ||
- name: POD_IP | ||
valueFrom: | ||
fieldRef: | ||
fieldPath: status.podIP | ||
- name: POD_NAMESPACE | ||
valueFrom: | ||
fieldRef: | ||
fieldPath: metadata.namespace | ||
resources: | ||
limits: | ||
cpu: "16" | ||
cpu: '16' | ||
memory: 48Gi | ||
requests: | ||
cpu: "8" | ||
cpu: '8' | ||
memory: 24Gi | ||
volumeMounts: | ||
- name: shm | ||
mountPath: /dev/shm | ||
- mountPath: /etc/ca/tls | ||
name: ray-ca-cert | ||
readOnly: true | ||
- mountPath: /etc/ray/tls | ||
name: ray-tls | ||
- mountPath: /etc/gen/tls | ||
name: gen-tls-script | ||
livenessProbe: | ||
failureThreshold: 2 | ||
periodSeconds: 5 | ||
|
@@ -244,7 +305,7 @@ objects: | |
if [[ ${registered_node_count} -ne "${PIPELINE_PARALLEL_SIZE}" ]]; then | ||
echo "Unhealthy - Registered nodes count (${registered_node_count}) does not match PIPELINE_PARALLEL_SIZE (${PIPELINE_PARALLEL_SIZE})." | ||
exit 1 | ||
fi | ||
fi | ||
startupProbe: | ||
failureThreshold: 40 | ||
periodSeconds: 30 | ||
|
@@ -261,7 +322,7 @@ objects: | |
echo "Unhealthy - Registered nodes count (${registered_node_count}) does not match PIPELINE_PARALLEL_SIZE (${PIPELINE_PARALLEL_SIZE})." | ||
exit 1 | ||
fi | ||
|
||
# Double check to make sure Model is ready to serve. | ||
for i in 1 2; do | ||
# Check model health | ||
|
@@ -276,4 +337,18 @@ objects: | |
- name: shm | ||
emptyDir: | ||
medium: Memory | ||
sizeLimit: 12Gi | ||
sizeLimit: 12Gi | ||
- name: ray-tls | ||
emptyDir: {} | ||
- name: ray-ca-cert | ||
secret: | ||
secretName: ray-ca-cert | ||
Improvement for later: we should find a way to not mount the root CA certificate. This is for security reasons.
# The gencert_ray.sh can be prebaked into the docker container so the configMap is optional | ||
- name: gen-tls-script | ||
configMap: | ||
name: ray-tls-scripts | ||
defaultMode: 0777 | ||
# An array of keys from the ConfigMap to create as files | ||
items: | ||
- key: gencert_ray.sh | ||
path: gencert_ray.sh | ||
Can the scripts be made part of the runtime image? Otherwise, the template is no longer self-contained.
Is this only used for tests?
No, it will be used for internal SSL communication for Ray. It is just a self-signed certificate.
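For anyone who wants to replace the committed CA with their own, the sketch below shows one possible way to create a self-signed CA and produce single-line base64 values suitable for the `ca.crt` and `ca.key` fields of the Secret. This is an assumption-laden example rather than the procedure used in this PR: `CA_DIR`, the `example-ray-ca` common name, and the 365-day validity are all hypothetical. `openssl base64 -A` is used because plain `base64` wraps lines on some platforms.

```shell
#!/bin/sh
# Sketch: create a throwaway self-signed CA and encode it for a Secret's data field.
# CA_DIR, the CN and the validity period are illustrative assumptions.
set -eu

CA_DIR=$(mktemp -d)
trap 'rm -rf "$CA_DIR"' EXIT

cat > "$CA_DIR/ca.conf" <<EOF
[ req ]
prompt = no
distinguished_name = dn
x509_extensions = v3_ca

[ dn ]
CN = example-ray-ca

[ v3_ca ]
basicConstraints = critical, CA:TRUE
EOF

# Self-signed CA certificate plus an unencrypted private key
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -config "$CA_DIR/ca.conf" \
  -keyout "$CA_DIR/ca.key" -out "$CA_DIR/ca.crt" 2>/dev/null

# -A emits the base64 on a single line, which fits a YAML scalar
CA_CRT_B64=$(openssl base64 -A -in "$CA_DIR/ca.crt")
CA_KEY_B64=$(openssl base64 -A -in "$CA_DIR/ca.key")

# Round-trip the certificate to confirm the encoding is lossless
printf '%s' "$CA_CRT_B64" | openssl base64 -d -A > "$CA_DIR/ca.crt.decoded"
cmp -s "$CA_DIR/ca.crt" "$CA_DIR/ca.crt.decoded" && ROUND_TRIP=ok || ROUND_TRIP=mismatch
echo "round trip: $ROUND_TRIP"
```

The key is deliberately never printed. In a real setup, the two variables would be written into the Secret manifest or passed to a secret-management tool instead of being committed to the repository.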