Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RHOAIENG-13566] Configure SSL for the communication between head and worker #314

Open
wants to merge 9 commits into
base: incubating
Choose a base branch
from

Conversation

Jooho
Copy link
Contributor

@Jooho Jooho commented Nov 26, 2024

Description

Between head node and worker node, we should configure certificate for SSL communication.
https://issues.redhat.com/browse/RHOAIENG-13566

How Has This Been Tested?

  • Setup DataScienceCluster with below onfiguration.
apiVersion: datasciencecluster.opendatahub.io/v1
kind: DataScienceCluster
metadata:
  labels:
    app.kubernetes.io/created-by: rhods-operator
    app.kubernetes.io/instance: default-dsc
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: datasciencecluster
    app.kubernetes.io/part-of: rhods-operator
  name: default-dsc
spec:
  components:
    codeflare:
      managementState: Managed
    dashboard:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
    kserve:
      devFlags:
        manifests:
          - contextDir: config
            sourcePath: overlays/odh
            uri: 'https://github.com/opendatahub-io/kserve/tarball/master'
          - contextDir: config
            sourcePath: ''
            uri: 'https://github.com/jooho/odh-model-controller/tarball/ray_tls_test'
      serving:
        ingressGateway:
          certificate:
            type: OpenshiftDefaultIngress
        managementState: Managed
        name: knative-serving
      managementState: Managed
    kueue:
      managementState: Managed
    modelmeshserving:
      managementState: Removed
    ray:
      managementState: Managed
    trainingoperator:
      managementState: Removed
    trustyai:
      managementState: Removed
    workbenches:
      managementState: Managed
  • Verify ray-tls-scripts configmap and ray-ca-cert secret are created in the redhat-ods-applications

  • Create a MultiNode ServingRuntime in a test ns

oc new-project kserve-demo
oc process vllm-multinode-runtime-template -n redhat-ods-applications|oc create -n kserve-demo -f - 
  • Verify ray-tls-scripts configmap and ray-ca-cert secret are created in the kserve-demo

  • Deploy a model with multi-node feature.

    • It should deploy without any errors.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Copy link
Contributor

@israel-hdez israel-hdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, change the target branch to incubating.
The main branch is now a stable/downstream branch.

@Jooho Jooho changed the base branch from main to incubating November 26, 2024 17:32
@spolti spolti requested a review from israel-hdez November 27, 2024 14:36
data:
# output from cat ca.crt | base64
ca.crt: |
LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUZIekNDQXdlZ0F3SUJBZ0lVSlAwL1FCY0xTMFFFV1ZiRDE4NndyUnZjLzdFd0RRWUpLb1pJaHZjTkFRRUwKQlFBd0hqRWNNQm9HQTFVRUF3d1RjMlZzWmkxemFXZHVaV1F0WTJFdFkyVnlkREFnRncweU5ERXhNak13TXpReApOVEphR0E4eU1USTBNVEF6TURBek5ERTFNbG93SGpFY01Cb0dBMVVFQXd3VGMyVnNaaTF6YVdkdVpXUXRZMkV0ClkyVnlkRENDQWlJd0RRWUpLb1pJaHZjTkFRRUJCUUFEZ2dJUEFEQ0NBZ29DZ2dJQkFML0lpWVdabUc0Q05qOS8KMzV6RUJheU1tL2djeGhJSFArZ1M4S04wdm9YV2krdk1Kd1hEUWR3T0JOdEZaa1l0elpJc29ESHVHenRnN0RWNApyZXZONE5JOHRKTmc2b2Jma0tVcGQ3eHdvNHNIdXMwd0h6NWVwZGt1MDhNVzVzZFZOUFNRMkhpVnhialpXSnBRClpJYWYyQWRKa1psUFdtVDBaS1pPdFFEN2oySTRtM3VCeG1jYzhTNWhiNkpaYW9NbGNVVXhkNDFscG43T21iMGEKVjRBUGZiWS9vYytwZmVDczN1cG5xamxZamVGQjR2RTV4WU1ZV0FNeitJRGh4RTRxRGVSaXNMQnhhN1kvcFRScQo2OWVhVXN6Qjl5eEQ3R0FySTJsSDhyUCtVeGpGYUl1K2tBVjVtbjc4OXdlejh0TDVGNEErWlE5cGM5TVI2UXBuCmRkanlaRXcvcFpkdTcwVmo3WUE1MU91S2owcTF2dGw3d1BPcDBUc3lwUDhadW04ZkZSNG5KbmNPaFhMTWV3bGEKTWxBeFZaUWRiMEF5dEE2TUl0dFdXSjA5L3BEOWZ6SkdjRnZOL245ZzZWZ0o5NjNhbmRoTEwyYlJMc0VKRmxUTQpEdTJIeW1CNkErb0ZlcDdjZXNxOUpJRFhkVmFqR3NxMmgrZVpPNGdxWW5nWGNmQVl5ZUloYzlYNnFoT2QvVmlZCmY2eUZoOTNuUnRYTFFNdUJRN2E1WTFzRVN3RHp3WWJKdEtuK3NrcGg5SEtCMTdVRUVOU3BNNHJSNHdxekRQd3AKSmZZeWt2a2Iyd2w0TkNCb2pjaU9icDYwV2ZDQytRcTFsNEo4VXpSOEpvWmFiQ0IzOWxVcHBKa09qNVFxYnEwMApKaUFzWENQQlp3OCtnQnV0b2JBVUs0RklqMGVQQWdNQkFBR2pVekJSTUIwR0ExVWREZ1FXQkJTcmNlRWNhMjNxCjBUQm14VmtZTWtMQTNSeHpKekFmQmdOVkhTTUVHREFXZ0JTcmNlRWNhMjNxMFRCbXhWa1lNa0xBM1J4ekp6QVAKQmdOVkhSTUJBZjhFQlRBREFRSC9NQTBHQ1NxR1NJYjNEUUVCQ3dVQUE0SUNBUUJseW5IM2doVG9ObDQrTlQ1MgpNTTA1V1A3UCszVXJkQ0tGNEJCa0VzN0VueldSUjZ4bkVoVUY5VWhGZ0ZhTFBiQ1pacnlCS2krT1hrUHUva3JCCk12aE1LVGl0WnNWbVBzRktEWDYyVG9zMEJZV0VzanZ0VDM4WFhSZXA3T3BWR0lPQi85V09YVGl3VkpaT2tSZ2MKbHd3U2U1dnBQQXRpMzhUZ3BhM0FVSk5haG00bDhHNWF5WktRQWFnUDg1NHBFTjhPOW54Nk9odytWN1hzSGlNdQpwUmpvc0VTN0JJY1lXVGJxR05yNFR3eXo1cVUwOE9LOEUySFNFSnE5THA4YzZ6UTZnZzBhV1dLYWJyTUpNeSt2CkpIbjM5TEI0U3dONzJjVXJkRWYvQVVrWktNYTRKVFRjMTJnaGpKN1JvUENYUFdPOUJGZ09aZEdoUlpBYkZRMXgKcnB6b3BLZllkT0hWZ0tncG9MOVJIRm40TzZRaTBjbnBML0NZZEFGd0pXNmZYcGEyekhobEJqWXlWdHk5T1Y4TQppV01IVUNXZnl4anVaSno5NFFWZGxLRGVrY2YzUFJzU0RBRGZ4TXlBYVdJQ1NnYXNHTVNPSnRoRTlGM0JhaXNvCnNYM1NLYzRFSEc4Sk1VM1hoeWYwbkhDY2hQdWVRblU5akFBVVBDMHRrVlZtOWhmMXVkdjlOTUk4bktncWRFMkMKK2ExNnR3RVBpZzhkS1pkaFRMOFdXMGwxS3FRcCs4SnBGdTdNWUM1SGNaa0F0NEE3QXlXOHRsYmlQQ1B4RVYwZwpsYkc0eXFyV2lIOG5rWE9tNFBKb0FEMDhzNjA5Y3lFY2Iyblgvck92KzBSdFdOOWFqZGNxYlM1Z0JJaEhRSVUwCko1N0cyTVFYZ0hHbU9QL1ZZTi8vMTlsaXd3PT0KLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this only used for tests?

Copy link
Contributor Author

@Jooho Jooho Nov 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it will be used for internal ssl communication for ray. It is just self-signed certificate

- -c
- |
# Check if the registered nodes count matches PIPELINE_PARALLEL_SIZE
registered_node_count=$(ray status --address ${HEAD_SVC}.${POD_NAMESPACE}.svc.cluster.local:6379 | grep -c node_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uppercase registered_node_count is better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this is not about this Ray certificate. I used uppercase for environment variables and lowercase for variables in the script. I think this is just to make a little difference, so it shouldn't be a big problem.

apiVersion: v1
kind: ConfigMap
metadata:
name: ray-tls-scripts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scripts -> script since there's only one shell script

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are 2 scripts in the configmap so I use plural.

@Jooho
Copy link
Contributor Author

Jooho commented Nov 28, 2024

@spolti @terrytangyuan @israel-hdez any more comments?

"app.kubernetes.io/name": "odh-model-controller",
"app.kubernetes.io/component": "kserve",
"app.kubernetes.io/part-of": "odh-model-serving",
"app.kubernetes.io/managed-by": "odh-model-controller",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't owner reference needed on the configMaps?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it gives more information so I does not hurt .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) It is recommended labels by rhoai team. https://github.com/opendatahub-io/odh-model-controller/pull/221/files#r1655337054
based on this, I just added it but it might not need this for manifests though.
But it is just more information about the object so it should not be an issue.

Copy link
Contributor

openshift-ci bot commented Nov 28, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Jooho, spolti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@spolti
Copy link
Member

spolti commented Nov 28, 2024

nice work, minor comments.

Labels: map[string]string{
"opendatahub.io/managed": "true",
"app.kubernetes.io/name": "odh-model-controller",
"app.kubernetes.io/component": "kserve",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

component: odh-model-controller
part-of: kserve
name: shouldn't it be something related to the secret or something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is util function and it is not only for the ray. So I think we can set the name like self-signed-cert. I will update the others as well.

Signed-off-by: jooho lee <[email protected]>
Comment on lines 347 to 354
- name: gen-tls-script
configMap:
name: ray-tls-scripts
defaultMode: 0777
# An array of keys from the ConfigMap to create as files
items:
- key: gencert_ray.sh
path: gencert_ray.sh
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the scripts be made part of the runtime image?

Otherwise, the template it is no longer self-contained.

emptyDir: {}
- name: ray-ca-cert
secret:
secretName: ray-ca-cert
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improvement for later: we should find a way to not mount the root CA certificate. This is for security reasons.

}
}
var servingRuntimeList kservev1alpha1.ServingRuntimeList
if err := r.client.List(ctx, &servingRuntimeList); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you get from all namespaces?

log.Info("Original Ray CA Cert Secret is updated", "name", constants.RayCASecretName, "namespace", req.Namespace)
for _, sr := range servingRuntimeList.Items {
if isMultiNodeServingRuntime(sr) {
if err := r.cleanupRayResourcesByKind(ctx, log, sr.Namespace, "Secret"); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you expect only one multi-node SR per namespace?
The dashboard creates a copy per InferenceService, so it is proably a good perf improvement to do the cleanup once per namespace, in case servingRuntimeList contains runtimes from all namespaces.

This comment also applies for the ConfigMap cleanup.

}
}
}
if err := r.reconcileRayCACertSecret(ctx, log, srcSecret, controllerNs, req.Namespace, noMultiNodeSrExistInNs); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if servingRuntimeList holds runtimes from all namespaces.... Looks like the clean-up is ran over all namespaces, but the creation of the Secret (and ConfigMap) is ran only on the target namespace?

Just trying to understand...

if err != nil {
return ctrl.Result{}, err
}
err = r.reconcileRayCACertSecret(ctx, log, srcSecret, controllerNs, req.Namespace, !existMultiNodeServingRuntimeInNs(req.Namespace, servingRuntimeList))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
err = r.reconcileRayCACertSecret(ctx, log, srcSecret, controllerNs, req.Namespace, !existMultiNodeServingRuntimeInNs(req.Namespace, servingRuntimeList))
err = r.reconcileRayCACertSecret(ctx, log, srcSecret, controllerNs, req.Namespace, noMultiNodeSrExistInNs)

// TO-DO upstream Kserve 0.15 will have a new API WorkerSpec
// So for now, it will check servingRuntime name, but after we move to 0.15, it needs to check workerSpec is specified or not.(RHOAIENG-16147)
func isMultiNodeServingRuntime(servingRuntime kservev1alpha1.ServingRuntime) bool {
return servingRuntime.Name == "vllm-multinode-runtime"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, slightly better to check for the presence of the multi-node fields (was it the workerSpec field?), like in KServe to keep it generic.

Expect(cli.Update(ctx, rayTlsScriptsUpdatedConfigMap)).Should(Succeed())

// Check if 'ray-tls-script' ConfigMap is rollback
originalRayTlsScriptsConfigMap, err := waitForConfigMap(cli, testNs, constants.RayTlsScriptConfigMapName, 30, 1*time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
originalRayTlsScriptsConfigMap, err := waitForConfigMap(cli, testNs, constants.RayTlsScriptConfigMapName, 30, 1*time.Second)
originalRayTlsScriptsConfigMap, err := waitForConfigMap(cli, WorkingNamespace, constants.RayTlsScriptConfigMapName, 30, 1*time.Second)

return compareConfigMap(originalRayTlsScriptsConfigMap, updatedConfigMapFromTestNs)
}, timeout, interval).Should(BeTrue())
})
It("should 'ray-tls-script' ConfigMap in the testNs when original one in the ctrlNs updated", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It("should 'ray-tls-script' ConfigMap in the testNs when original one in the ctrlNs updated", func() {
It("should update 'ray-tls-script' ConfigMap in the testNs when original one in the ctrlNs updated", func() {

Comment on lines +261 to +265
// Check if all ray tls script are NOT removed
_, err = waitForConfigMap(cli, testNs, constants.RayTlsScriptConfigMapName, 30, 1*time.Second)
Expect(err).NotTo(HaveOccurred())
_, err = waitForSecret(cli, testNs, constants.RayCASecretName, 30, 1*time.Second)
Expect(err).NotTo(HaveOccurred())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to surround these inside Consistently to make sure the conditions remain true over a timespan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants