-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RHOAIENG-13566] Configure SSL for the communication between head and worker #314
base: incubating
Are you sure you want to change the base?
Conversation
Signed-off-by: jooho lee <[email protected]>
Signed-off-by: jooho lee <[email protected]>
Signed-off-by: jooho lee <[email protected]>
Signed-off-by: jooho lee <[email protected]>
Signed-off-by: jooho lee <[email protected]>
Signed-off-by: jooho lee <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, change the target branch to incubating
.
The main
branch is now a stable/downstream branch.
data: | ||
# output from cat ca.crt | base64 | ||
ca.crt: | | ||
LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUZIekNDQXdlZ0F3SUJBZ0lVSlAwL1FCY0xTMFFFV1ZiRDE4NndyUnZjLzdFd0RRWUpLb1pJaHZjTkFRRUwKQlFBd0hqRWNNQm9HQTFVRUF3d1RjMlZzWmkxemFXZHVaV1F0WTJFdFkyVnlkREFnRncweU5ERXhNak13TXpReApOVEphR0E4eU1USTBNVEF6TURBek5ERTFNbG93SGpFY01Cb0dBMVVFQXd3VGMyVnNaaTF6YVdkdVpXUXRZMkV0ClkyVnlkRENDQWlJd0RRWUpLb1pJaHZjTkFRRUJCUUFEZ2dJUEFEQ0NBZ29DZ2dJQkFML0lpWVdabUc0Q05qOS8KMzV6RUJheU1tL2djeGhJSFArZ1M4S04wdm9YV2krdk1Kd1hEUWR3T0JOdEZaa1l0elpJc29ESHVHenRnN0RWNApyZXZONE5JOHRKTmc2b2Jma0tVcGQ3eHdvNHNIdXMwd0h6NWVwZGt1MDhNVzVzZFZOUFNRMkhpVnhialpXSnBRClpJYWYyQWRKa1psUFdtVDBaS1pPdFFEN2oySTRtM3VCeG1jYzhTNWhiNkpaYW9NbGNVVXhkNDFscG43T21iMGEKVjRBUGZiWS9vYytwZmVDczN1cG5xamxZamVGQjR2RTV4WU1ZV0FNeitJRGh4RTRxRGVSaXNMQnhhN1kvcFRScQo2OWVhVXN6Qjl5eEQ3R0FySTJsSDhyUCtVeGpGYUl1K2tBVjVtbjc4OXdlejh0TDVGNEErWlE5cGM5TVI2UXBuCmRkanlaRXcvcFpkdTcwVmo3WUE1MU91S2owcTF2dGw3d1BPcDBUc3lwUDhadW04ZkZSNG5KbmNPaFhMTWV3bGEKTWxBeFZaUWRiMEF5dEE2TUl0dFdXSjA5L3BEOWZ6SkdjRnZOL245ZzZWZ0o5NjNhbmRoTEwyYlJMc0VKRmxUTQpEdTJIeW1CNkErb0ZlcDdjZXNxOUpJRFhkVmFqR3NxMmgrZVpPNGdxWW5nWGNmQVl5ZUloYzlYNnFoT2QvVmlZCmY2eUZoOTNuUnRYTFFNdUJRN2E1WTFzRVN3RHp3WWJKdEtuK3NrcGg5SEtCMTdVRUVOU3BNNHJSNHdxekRQd3AKSmZZeWt2a2Iyd2w0TkNCb2pjaU9icDYwV2ZDQytRcTFsNEo4VXpSOEpvWmFiQ0IzOWxVcHBKa09qNVFxYnEwMApKaUFzWENQQlp3OCtnQnV0b2JBVUs0RklqMGVQQWdNQkFBR2pVekJSTUIwR0ExVWREZ1FXQkJTcmNlRWNhMjNxCjBUQm14VmtZTWtMQTNSeHpKekFmQmdOVkhTTUVHREFXZ0JTcmNlRWNhMjNxMFRCbXhWa1lNa0xBM1J4ekp6QVAKQmdOVkhSTUJBZjhFQlRBREFRSC9NQTBHQ1NxR1NJYjNEUUVCQ3dVQUE0SUNBUUJseW5IM2doVG9ObDQrTlQ1MgpNTTA1V1A3UCszVXJkQ0tGNEJCa0VzN0VueldSUjZ4bkVoVUY5VWhGZ0ZhTFBiQ1pacnlCS2krT1hrUHUva3JCCk12aE1LVGl0WnNWbVBzRktEWDYyVG9zMEJZV0VzanZ0VDM4WFhSZXA3T3BWR0lPQi85V09YVGl3VkpaT2tSZ2MKbHd3U2U1dnBQQXRpMzhUZ3BhM0FVSk5haG00bDhHNWF5WktRQWFnUDg1NHBFTjhPOW54Nk9odytWN1hzSGlNdQpwUmpvc0VTN0JJY1lXVGJxR05yNFR3eXo1cVUwOE9LOEUySFNFSnE5THA4YzZ6UTZnZzBhV1dLYWJyTUpNeSt2CkpIbjM5TEI0U3dONzJjVXJkRWYvQVVrWktNYTRKVFRjMTJnaGpKN1JvUENYUFdPOUJGZ09aZEdoUlpBYkZRMXgKcnB6b3BLZllkT0hWZ0tncG9MOVJIRm40TzZRaTBjbnBML0NZZEFGd0pXNmZYcGEyekhobEJqWXlWdHk5T1Y4TQppV01IVUNXZnl4anVaSno5NFFWZGxLRGVrY2YzUFJzU0RBRGZ4TXlBYVdJQ1NnYXNHTVNPSnRoRTlGM0JhaXNvCnNYM1NLYzRFSEc4Sk1VM1hoeWYwbkhDY2hQdWVRblU5akFBVVBDMHRrVlZtOWhmMXVkdjlOTUk4bktncWRFMkMKK2ExNnR3RVBpZzhkS1pkaFRMOFdXMGwxS3FRcCs4SnBGdTdNWUM1SGNaa0F0NEE3QXlXOHRsYmlQQ1B4RVYwZwpsYkc0eXFyV2lIOG5rWE9tNFBKb0FEMDhzNjA5Y3lFY2Iyblgvck92KzBSdFdOOWFqZGNxYlM1Z0JJaEhRSVUwCko1N0cyTVFYZ0hHbU9QL1ZZTi8vMTlsaXd3PT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this only used for tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it will be used for internal ssl communication for ray. It is just self-signed certificate
- -c | ||
- | | ||
# Check if the registered nodes count matches PIPELINE_PARALLEL_SIZE | ||
registered_node_count=$(ray status --address ${HEAD_SVC}.${POD_NAMESPACE}.svc.cluster.local:6379 | grep -c node_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uppercase registered_node_count is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, this is not about this Ray certificate. I used uppercase for environment variables and lowercase for variables in the script. I think this is just to make a little difference, so it shouldn't be a big problem.
apiVersion: v1 | ||
kind: ConfigMap | ||
metadata: | ||
name: ray-tls-scripts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scripts -> script since there's only one shell script
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are 2 scripts in the configmap so I use plural.
@spolti @terrytangyuan @israel-hdez any more comments? |
"app.kubernetes.io/name": "odh-model-controller", | ||
"app.kubernetes.io/component": "kserve", | ||
"app.kubernetes.io/part-of": "odh-model-serving", | ||
"app.kubernetes.io/managed-by": "odh-model-controller", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't owner reference needed on the configMaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it gives more information so I does not hurt .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:) It is recommended labels by rhoai team. https://github.com/opendatahub-io/odh-model-controller/pull/221/files#r1655337054
based on this, I just added it but it might not need this for manifests though.
But it is just more information about the object so it should not be an issue.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Jooho, spolti The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
nice work, minor comments. |
Signed-off-by: jooho lee <[email protected]>
controllers/utils/cert.go
Outdated
Labels: map[string]string{ | ||
"opendatahub.io/managed": "true", | ||
"app.kubernetes.io/name": "odh-model-controller", | ||
"app.kubernetes.io/component": "kserve", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
component: odh-model-controller
part-of: kserve
name: shouldn't it be something related to the secret or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is util function and it is not only for the ray. So I think we can set the name like self-signed-cert
. I will update the others as well.
Signed-off-by: jooho lee <[email protected]>
- name: gen-tls-script | ||
configMap: | ||
name: ray-tls-scripts | ||
defaultMode: 0777 | ||
# An array of keys from the ConfigMap to create as files | ||
items: | ||
- key: gencert_ray.sh | ||
path: gencert_ray.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the scripts be made part of the runtime image?
Otherwise, the template it is no longer self-contained.
emptyDir: {} | ||
- name: ray-ca-cert | ||
secret: | ||
secretName: ray-ca-cert |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improvement for later: we should find a way to not mount the root CA certificate. This is for security reasons.
} | ||
} | ||
var servingRuntimeList kservev1alpha1.ServingRuntimeList | ||
if err := r.client.List(ctx, &servingRuntimeList); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you get from all namespaces?
log.Info("Original Ray CA Cert Secret is updated", "name", constants.RayCASecretName, "namespace", req.Namespace) | ||
for _, sr := range servingRuntimeList.Items { | ||
if isMultiNodeServingRuntime(sr) { | ||
if err := r.cleanupRayResourcesByKind(ctx, log, sr.Namespace, "Secret"); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you expect only one multi-node SR per namespace?
The dashboard creates a copy per InferenceService, so it is proably a good perf improvement to do the cleanup once per namespace, in case servingRuntimeList
contains runtimes from all namespaces.
This comment also applies for the ConfigMap cleanup.
} | ||
} | ||
} | ||
if err := r.reconcileRayCACertSecret(ctx, log, srcSecret, controllerNs, req.Namespace, noMultiNodeSrExistInNs); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if servingRuntimeList
holds runtimes from all namespaces.... Looks like the clean-up is ran over all namespaces, but the creation of the Secret (and ConfigMap) is ran only on the target namespace?
Just trying to understand...
if err != nil { | ||
return ctrl.Result{}, err | ||
} | ||
err = r.reconcileRayCACertSecret(ctx, log, srcSecret, controllerNs, req.Namespace, !existMultiNodeServingRuntimeInNs(req.Namespace, servingRuntimeList)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
err = r.reconcileRayCACertSecret(ctx, log, srcSecret, controllerNs, req.Namespace, !existMultiNodeServingRuntimeInNs(req.Namespace, servingRuntimeList)) | |
err = r.reconcileRayCACertSecret(ctx, log, srcSecret, controllerNs, req.Namespace, noMultiNodeSrExistInNs) |
// TO-DO upstream Kserve 0.15 will have a new API WorkerSpec | ||
// So for now, it will check servingRuntime name, but after we move to 0.15, it needs to check workerSpec is specified or not.(RHOAIENG-16147) | ||
func isMultiNodeServingRuntime(servingRuntime kservev1alpha1.ServingRuntime) bool { | ||
return servingRuntime.Name == "vllm-multinode-runtime" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, slightly better to check for the presence of the multi-node fields (was it the workerSpec field?), like in KServe to keep it generic.
Expect(cli.Update(ctx, rayTlsScriptsUpdatedConfigMap)).Should(Succeed()) | ||
|
||
// Check if 'ray-tls-script' ConfigMap is rollback | ||
originalRayTlsScriptsConfigMap, err := waitForConfigMap(cli, testNs, constants.RayTlsScriptConfigMapName, 30, 1*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
originalRayTlsScriptsConfigMap, err := waitForConfigMap(cli, testNs, constants.RayTlsScriptConfigMapName, 30, 1*time.Second) | |
originalRayTlsScriptsConfigMap, err := waitForConfigMap(cli, WorkingNamespace, constants.RayTlsScriptConfigMapName, 30, 1*time.Second) |
return compareConfigMap(originalRayTlsScriptsConfigMap, updatedConfigMapFromTestNs) | ||
}, timeout, interval).Should(BeTrue()) | ||
}) | ||
It("should 'ray-tls-script' ConfigMap in the testNs when original one in the ctrlNs updated", func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It("should 'ray-tls-script' ConfigMap in the testNs when original one in the ctrlNs updated", func() { | |
It("should update 'ray-tls-script' ConfigMap in the testNs when original one in the ctrlNs updated", func() { |
// Check if all ray tls script are NOT removed | ||
_, err = waitForConfigMap(cli, testNs, constants.RayTlsScriptConfigMapName, 30, 1*time.Second) | ||
Expect(err).NotTo(HaveOccurred()) | ||
_, err = waitForSecret(cli, testNs, constants.RayCASecretName, 30, 1*time.Second) | ||
Expect(err).NotTo(HaveOccurred()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to surround these inside Consistently
to make sure the conditions remain true over a timespan.
Description
Between head node and worker node, we should configure certificate for SSL communication.
https://issues.redhat.com/browse/RHOAIENG-13566
How Has This Been Tested?
DataScienceCluster
with below onfiguration.Verify ray-tls-scripts configmap and ray-ca-cert secret are created in the
redhat-ods-applications
Create a MultiNode ServingRuntime in a test ns
Verify ray-tls-scripts configmap and ray-ca-cert secret are created in the
kserve-demo
Deploy a model with multi-node feature.
Merge criteria: