
Clean up scaled down services based on hostname #775

Merged

Conversation

Contributor

@auniyal61 auniyal61 commented May 28, 2024

  • Update service objects in functional test api_fixture
  • Adds a check to wait for readycount to become equal to replica
  • Delete service host from nova DB by hostname on replica count update

Implements: OSPRH-6912

@openshift-ci openshift-ci bot requested review from dprince and olliewalsh May 28, 2024 12:25
Contributor

openshift-ci bot commented May 28, 2024

Hi @auniyal61. Thanks for your PR.

I'm waiting for an openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

controllers/common.go (outdated review thread, resolved)

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/6427ab4248be40098acfafd812f4e77f

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 49m 03s
nova-operator-kuttl FAILURE in 50m 39s
✔️ nova-operator-tempest-multinode SUCCESS in 2h 27m 21s
nova-operator-tempest-multinode-ceph FAILURE in 2h 02m 36s (non-voting)

@@ -177,7 +177,7 @@ func (f *NovaAPIFixture) ServicesList(w http.ResponseWriter, r *http.Request) {
"id": 4,
"binary": "nova-conductor",
"disabled_reason": "test4",
"host": "host2",
"host": "nova-conductor-1",
Contributor

If we have a single conductor replica then its hostname is always suffixed with -0.

Also if you look at the service list in your deployment:

oc exec openstackclient -- openstack compute service list
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
| ID                                   | Binary         | Host                   | Zone     | Status  | State | Updated At                 |
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
| abb15fa7-380b-4baf-85d1-adf25febae54 | nova-conductor | nova-cell0-conductor-0 | internal | enabled | up    | 2024-05-28T16:08:00.000000 |
| 5044f9ba-ace0-407e-b5f3-ae26af8ca53b | nova-scheduler | nova-scheduler-0       | internal | enabled | up    | 2024-05-28T16:08:08.000000 |
| 2fdf49f6-65d6-4584-a2f7-82711b717db7 | nova-conductor | nova-cell1-conductor-0 | internal | enabled | up    | 2024-05-28T16:08:00.000000 |
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+

then you will see that there are conductors for each cell (cell0 and cell1 in the default deployment). We have a separate statefulset per cell for the conductors as they need different DB and message bus configuration and they can be scaled independently.

oc get StatefulSet | grep conductor
nova-cell0-conductor     1/1     4m51s
nova-cell1-conductor     1/1     4m31s

oc get pods -lservice=nova-conductor
NAME                     READY   STATUS    RESTARTS   AGE
nova-cell0-conductor-0   1/1     Running   0          5m24s
nova-cell1-conductor-0   1/1     Running   0          5m4s

Bottom line, to make this test data realistic I would use nova-cell0-conductor-0

Contributor

Later I realized that we might have more complications in the test env regarding the hostnames of the services.

The hostname of the service in the nova cell DB comes from the name of the Pod running that service binary. The name of the Pod consists of the name of the StatefulSet plus the index of the replica. The name of the StatefulSet is set by the nova-operator. For the scheduler it is set by the scheduler controller via

Name: instance.Name,

so it matches the name of the NovaScheduler CR. The name of the NovaScheduler is set by the nova top level controller via

Name: instance.Name + "-scheduler",

so it contains the name of the Nova CR plus -scheduler. The name of the Nova CR in a real deployment is set by the openstack-operator https://github.com/openstack-k8s-operators/openstack-operator/blob/6daae482c03a10df034dddbfeb1fa9e96f6b9072/pkg/openstack/nova.go#L46 and it is hardcoded to "nova".

However, in the functional test env we don't run the openstack-operator, just the nova-operator, and the test code creates the nova related CRs directly. To avoid parallel test case executions interfering with each other, the names of the CRs created by the test are randomly generated. E.g. here we create a NovaScheduler CR

DeferCleanup(th.DeleteInstance, CreateNovaScheduler(novaNames.SchedulerName, spec))

and the passed-in novaNames struct has a random uuid, generated for each test case, that is included in the names of the CRs:
func GetNovaNames(novaName types.NamespacedName, cellNames []string) NovaNames {
	// NOTE(bogdando): use random UUIDs instead of static "nova" part of names.
	// These **must** replicate existing Nova*/Dataplane controllers suffixing/prefixing logic.
	// While dynamic UUIDs also provide enhanced testing coverage for "synthetic" cases,
	// which could not be caught for normal names with static "nova" prefixes.
	novaAPI := types.NamespacedName{
		Namespace: novaName.Namespace,
		Name:      fmt.Sprintf("%s-api", novaName.Name),
	}
	novaScheduler := types.NamespacedName{
		Namespace: novaName.Namespace,
		Name:      fmt.Sprintf("%s-scheduler", novaName.Name),

So in the functional env the NovaScheduler CR will have a name like <random uuid>-scheduler, and therefore the StatefulSet name will also be like that.

So most likely we cannot store hardcoded test data here any more; instead we should generate the test data from the test case, where we know the current uuid for that test case.
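
A minimal sketch (not part of the PR) of how the expected service hostname could be derived from the CR name, so the fixture data can be generated per test case instead of being hardcoded; the helper name is illustrative:

// expectedServiceHost derives the nova service hostname from the name of the
// CR owning the StatefulSet: the pod, and therefore the service host in the
// nova DB, is named "<StatefulSet name>-<replica index>", and the StatefulSet
// is named after the CR. In a real deployment crName is e.g. "nova-scheduler";
// in the functional tests it is "<random uuid>-scheduler".
func expectedServiceHost(crName string, replicaIndex int) string {
	return fmt.Sprintf("%s-%d", crName, replicaIndex)
}

For example, expectedServiceHost(novaNames.SchedulerName.Name, 0) would give the host value to put into the NovaAPIFixture service list.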

@@ -609,13 +609,14 @@ func (r *NovaConductorReconciler) cleanServiceFromNovaDb(
secret corev1.Secret,
l logr.Logger,
) error {
replica_count := int(*instance.Spec.Replicas)
Contributor

golang uses camelCase instead of snake_case. The full style guide is here https://go.dev/doc/effective_go

Contributor Author

ack, fixed name style

@auniyal61 auniyal61 requested a review from gibizer May 29, 2024 09:14
@auniyal61 auniyal61 marked this pull request as draft May 29, 2024 09:16

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/a47d546ccc2047608b0a9b03de6afa68

✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 20m 22s
nova-operator-kuttl FAILURE in 50m 46s
nova-operator-tempest-multinode FAILURE in 2h 15m 06s
✔️ nova-operator-tempest-multinode-ceph SUCCESS in 2h 58m 28s (non-voting)

@gibizer
Contributor

gibizer commented May 29, 2024

/ok-to-test


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/604dc6fe07db461d8037cae93d64cf82

✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 07m 39s
nova-operator-kuttl FAILURE in 50m 53s
✔️ nova-operator-tempest-multinode SUCCESS in 2h 08m 06s
✔️ nova-operator-tempest-multinode-ceph SUCCESS in 2h 47m 47s (non-voting)

@gibizer
Contributor

gibizer commented May 31, 2024

/test functional

@auniyal61 while the patch is in draft, the prow jobs do not run automatically when the PR is updated. You can always trigger them manually though.

@gibizer
Contributor

gibizer commented May 31, 2024

Without looking at the change, I checked the kuttl test results where we have scale down tests, and they are passing.

2024-05-31 08:27:30.786137 | controller |     logger.go:42: 08:27:30 | scale-tests/3-scale-down-nova | test step completed 3-scale-down-nova
2024-05-31 08:27:30.786290 | controller |     logger.go:42: 08:27:30 | scale-tests/4-reconfigure-service-downtime | starting test step 4-reconfigure-service-downtime
2024-05-31 08:27:30.844388 | controller |     logger.go:42: 08:27:30 | scale-tests/4-reconfigure-service-downtime | Nova:nova-kuttl-default/nova-kuttl updated
2024-05-31 08:27:30.844488 | controller |     logger.go:42: 08:27:30 | scale-tests/4-reconfigure-service-downtime | test step completed 4-reconfigure-service-downtime
2024-05-31 08:27:30.844501 | controller |     logger.go:42: 08:27:30 | scale-tests/5-wait-for-service-downtime | starting test step 5-wait-for-service-downtime
2024-05-31 08:27:30.844585 | controller |     logger.go:42: 08:27:30 | scale-tests/5-wait-for-service-downtime | running command: [sh -c sleep 25
2024-05-31 08:27:30.844617 | controller |         ]
2024-05-31 08:27:55.867521 | controller |     logger.go:42: 08:27:55 | scale-tests/5-wait-for-service-downtime | test step completed 5-wait-for-service-downtime
2024-05-31 08:27:55.868660 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | starting test step 6-trigger-service-cleanup
2024-05-31 08:27:55.908611 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | NovaConductor:nova-kuttl-default/nova-kuttl-cell0-conductor updated
2024-05-31 08:27:55.941777 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | NovaConductor:nova-kuttl-default/nova-kuttl-cell1-conductor updated
2024-05-31 08:27:55.973319 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | NovaScheduler:nova-kuttl-default/nova-kuttl-scheduler updated
2024-05-31 08:27:55.973623 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | running command: [sh -c set -euxo pipefail
2024-05-31 08:27:55.973643 | controller |         (( $(oc exec -n $NAMESPACE openstackclient -- openstack compute service list --service nova-scheduler -f value | wc -l) == 1 ))
2024-05-31 08:27:55.973654 | controller |         (( $(oc exec -n $NAMESPACE openstackclient -- openstack compute service list --service nova-conductor -f value | grep cell1 | wc -l) == 1 ))
2024-05-31 08:27:55.973664 | controller |         (( $(oc exec -n $NAMESPACE openstackclient -- openstack compute service list --service nova-conductor -f value | grep cell0 | wc -l) == 1 ))
2024-05-31 08:27:55.973674 | controller |         ]
2024-05-31 08:27:55.980207 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | ++ oc exec -n nova-kuttl-default openstackclient -- openstack compute service list --service nova-scheduler -f value
2024-05-31 08:27:55.981011 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | ++ wc -l
2024-05-31 08:28:05.357624 | controller |     logger.go:42: 08:28:05 | scale-tests/6-trigger-service-cleanup | + ((  1 == 1  ))
2024-05-31 08:28:05.358403 | controller |     logger.go:42: 08:28:05 | scale-tests/6-trigger-service-cleanup | ++ oc exec -n nova-kuttl-default openstackclient -- openstack compute service list --service nova-conductor -f value
2024-05-31 08:28:05.359139 | controller |     logger.go:42: 08:28:05 | scale-tests/6-trigger-service-cleanup | ++ wc -l
2024-05-31 08:28:05.361411 | controller |     logger.go:42: 08:28:05 | scale-tests/6-trigger-service-cleanup | ++ grep cell1
2024-05-31 08:28:09.207875 | controller |     logger.go:42: 08:28:09 | scale-tests/6-trigger-service-cleanup | + ((  1 == 1  ))
2024-05-31 08:28:09.208824 | controller |     logger.go:42: 08:28:09 | scale-tests/6-trigger-service-cleanup | ++ oc exec -n nova-kuttl-default openstackclient -- openstack compute service list --service nova-conductor -f value
2024-05-31 08:28:09.209806 | controller |     logger.go:42: 08:28:09 | scale-tests/6-trigger-service-cleanup | ++ wc -l
2024-05-31 08:28:09.212425 | controller |     logger.go:42: 08:28:09 | scale-tests/6-trigger-service-cleanup | ++ grep cell0
2024-05-31 08:28:11.369287 | controller |     logger.go:42: 08:28:11 | scale-tests/6-trigger-service-cleanup | + ((  1 == 1  ))
2024-05-31 08:28:11.369982 | controller |     logger.go:42: 08:28:11 | scale-tests/6-trigger-service-cleanup | test step completed 6-trigger-service-cleanup

https://logserver.rdoproject.org/75/775/8f3c7a0b7a64741b87ef26def3d1e1062c996447/github-check/nova-operator-kuttl/dbe4282/job-output.txt

Btw we can simplify the kuttl test now. Previously it needed to wait until the service was considered down by nova, as only down services were deleted. But we don't need that wait any more. I think steps 4, 5 and 6 can be deleted from https://github.com/openstack-k8s-operators/nova-operator/tree/main/test/kuttl/test-suites/default/scale-tests and the content of https://github.com/openstack-k8s-operators/nova-operator/blob/main/test/kuttl/test-suites/default/scale-tests/06-assert.yaml can be added directly to step 3 https://github.com/openstack-k8s-operators/nova-operator/blob/main/test/kuttl/test-suites/default/scale-tests/03-assert.yaml

// name index start from 0
// which means if replicaCount is 1, only nova-scheduler-0 is valid case
// so delete >= 1
if service.State == "down" || hostIndex >= int(replicaCount) {
Contributor

Instead of handling services in the down state, we should make sure that cleanServiceFromNovaDb is called after the deployment is fully up to date. You can do that by adding a condition before calling cleanServiceFromNovaDb in

result, err = r.ensureDeployment(ctx, h, instance, inputHash, serviceAnnotations)
if (err != nil || result != ctrl.Result{}) {
	return result, err
}
// clean up nova services from nova db should be always a last step in reconcile
// to make sure that
err = r.cleanServiceFromNovaDb(ctx, h, instance, secret, Log)
if err != nil {
	Log.Error(err, "Failed cleaning services from nova db")
}
and checking that the replica count requested in the spec is the same as the ready replicas reported by the statefulset. The ready replicas are stored in instance.Status here
instance.Status.ReadyCount = ss.GetStatefulSet().Status.ReadyReplicas
So the condition can be as simple as instance.Spec.Replicas == instance.Status.ReadyCount
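
For illustration, a sketch of the suggested ordering, reusing the reconcile snippet above (not necessarily the exact final code):

result, err = r.ensureDeployment(ctx, h, instance, inputHash, serviceAnnotations)
if (err != nil || result != ctrl.Result{}) {
	return result, err
}
// Only clean up the nova DB once the StatefulSet reports every requested
// replica ready; otherwise return and retry on a later reconcile.
if *instance.Spec.Replicas != instance.Status.ReadyCount {
	return ctrl.Result{}, nil
}
err = r.cleanServiceFromNovaDb(ctx, h, instance, secret, Log)
if err != nil {
	Log.Error(err, "Failed cleaning services from nova db")
}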

@@ -609,13 +609,14 @@ func (r *NovaConductorReconciler) cleanServiceFromNovaDb(
secret corev1.Secret,
l logr.Logger,
) error {
replicaCount := int(*instance.Spec.Replicas)
Contributor

I would drop the int cast and instead change the function parameter in cleanNovaServiceFromNovaDb to be int32. But keep the pointer deref so that cleanNovaServiceFromNovaDb does not need to deal with nil pointers.

Btw even though instance.Spec.Replicas is a pointer, we know it will not be nil, as when the Instance is read from etcd the field is defaulted to 1 if not provided. This behavior is defined by the kubebuilder comments on the field definition:

// +kubebuilder:validation:Optional
// +kubebuilder:default=1
// +kubebuilder:validation:Maximum=32
// +kubebuilder:validation:Minimum=0
// Replicas of the service to run
Replicas *int32 `json:"replicas"`

@@ -663,11 +663,12 @@ func (r *NovaSchedulerReconciler) cleanServiceFromNovaDb(
secret corev1.Secret,
l logr.Logger,
) error {
replicaCount := int(*instance.Spec.Replicas)
Contributor

ditto about casting.

Contributor Author

done

hostSplits = hostSplits[len(hostSplits)-2:]
host, hostIndexStr := hostSplits[0], hostSplits[1]

if strings.Contains(serviceName, host) {
Contributor

This should not be needed. We are already querying only the given service type from nova at L123. So when we call this from the scheduler controller we only list the scheduler services from the db.

It is possible that the functional test code did not simulate this precisely, I need to check...

Contributor

@gibizer gibizer May 31, 2024

Yeah, it was a shortcoming of the test code only. I pushed improvements to the nova API fixture to return only the services that use the requested binary: #778

Contributor Author

ack

Comment on lines 138 to 140
hostSplits := strings.Split(service.Host, "-")
hostSplits = hostSplits[len(hostSplits)-2:]
host, hostIndexStr := hostSplits[0], hostSplits[1]
Contributor

This is OK for the scheduler case, as a Nova deployment always has a single set of scheduler replicas. However for the conductor this is different: for each NovaCell we have a separate set of conductor replicas, and the different cells can have different conductor replica counts.
This also means that the delete logic for a given cell should only delete conductor services belonging to that cell. Right now this code deletes conductor services regardless of which cell they belong to.

Contributor

E.g. I changed the default deployment in a way that cell0 has 2 conductor replicas and cell1 has 3:

❯ oc get OpenStackControlPlane -o yaml| grep conductorServiceTemplate -B4 -A2
          cell0:
            cellDatabaseAccount: nova-cell0
            cellDatabaseInstance: openstack
            cellMessageBusInstance: rabbitmq
            conductorServiceTemplate:
              customServiceConfig: ""
              replicas: 2
--
          cell1:
            cellDatabaseAccount: nova-cell1
            cellDatabaseInstance: openstack-cell1
            cellMessageBusInstance: rabbitmq-cell1
            conductorServiceTemplate:
              customServiceConfig: ""
              replicas: 3

❯ oc get pods | grep conductor
nova-cell0-conductor-0                                          1/1     Running     0          113m
nova-cell0-conductor-1                                          1/1     Running     0          2m14s
nova-cell1-conductor-0                                          1/1     Running     0          113m
nova-cell1-conductor-1                                          1/1     Running     0          2m14s
nova-cell1-conductor-2                                          1/1     Running     0          2m14s

❯ openstack compute service list
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
| ID                                   | Binary         | Host                   | Zone     | Status  | State | Updated At                 |
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
| 26e569f7-66ab-4a87-9bfa-e48e01835f54 | nova-conductor | nova-cell1-conductor-0 | internal | enabled | up    | 2024-05-31T14:05:13.000000 |
| bb3164f0-c655-40c1-beb8-1592df024d10 | nova-conductor | nova-cell1-conductor-2 | internal | enabled | up    | 2024-05-31T14:05:11.000000 |
| ced135e2-7242-4e36-bf59-3ae29382b62d | nova-conductor | nova-cell1-conductor-1 | internal | enabled | up    | 2024-05-31T14:05:11.000000 |
| 4530443c-b5bc-4cfa-8d50-d880bbe74b09 | nova-conductor | nova-cell0-conductor-0 | internal | enabled | up    | 2024-05-31T14:05:19.000000 |
| 396ce09f-cc4f-4e95-a0d3-4535dc2023de | nova-scheduler | nova-scheduler-0       | internal | enabled | up    | 2024-05-31T14:05:12.000000 |
| 14bf553c-5958-4d60-bba9-40035f602a8f | nova-conductor | nova-cell0-conductor-1 | internal | enabled | up    | 2024-05-31T14:05:11.000000 |
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
❯ oc get NovaConductor -o custom-columns=NAME:.metadata.name,CELL:.spec.cellName,REPLICAS:.spec.replicas
NAME                   CELL    REPLICAS
nova-cell0-conductor   cell0   2
nova-cell1-conductor   cell1   3

Now if I change the conductor replica count in cell0 from 2 to 1, I don't want nova-cell1-conductor-1 or nova-cell1-conductor-2 to be deleted; only nova-cell0-conductor-1 should be deleted.
So the delete logic needs to depend on the cell name as well. You can read that from the conductor instance as instance.Spec.CellName.

Contributor Author

Patch #778 fixes the test for names; we do not need to extract host_name anymore, so I removed that.

if instance.Spec.Replicas != &instance.Status.ReadyCount {
return ctrl.Result{}, nil
}
// for instance.Spec.NovaServiceBase.Replicas
Contributor Author

need to remove

@auniyal61
Contributor Author

/test functional

@@ -676,7 +676,7 @@ var _ = Describe("NovaScheduler controller cleaning", func() {
condition.ReadyCondition,
corev1.ConditionTrue,
)
- Expect(novaAPIFixture.HasRequest("DELETE", "/compute/os-services/3", "")).To(BeTrue())
+ Expect(novaAPIFixture.HasRequest("DELETE", "/compute/os-services/3", "binary=nova-scheduler")).To(BeTrue())
Contributor Author

So this is failing because of our added condition for replicaCount;
now control never reaches cleanServiceFromNovaDb.

Contributor

Here you added binary=nova-scheduler, but the controller does not send a binary query param with the DELETE request; it only does that in the GET request when querying the list of services.

Contributor

If you remove that change and apply the fix to the condition above, then the functional test passes:

  2024-06-04T10:49:59.346+0200	INFO	Controllers.NovaScheduler	Deployment is not ready	{"controller": "novascheduler", "controllerGroup": "nova.openstack.org", "controllerKind": "NovaScheduler", "NovaScheduler": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}, "namespace": "586295de-5de7-48a9-b231-59ccd7b0b02b", "name": "435d1006-78aa-481a-bb1f-0-scheduler", "reconcileID": "87264e31-b83b-4dcb-9b52-99db1db29b34", "Status": {"replicas":0,"availableReplicas":0}}
  2024-06-04T10:49:59.346+0200	INFO	Controllers.NovaScheduler	Waiting for ReadyCount to equal Replicas before deleting scaled in services from the nova DB.	{"controller": "novascheduler", "controllerGroup": "nova.openstack.org", "controllerKind": "NovaScheduler", "NovaScheduler": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}, "namespace": "586295de-5de7-48a9-b231-59ccd7b0b02b", "name": "435d1006-78aa-481a-bb1f-0-scheduler", "reconcileID": "87264e31-b83b-4dcb-9b52-99db1db29b34", "replicas": 1, "ReadyCount": 0}
  2024-06-04T10:49:59.430+0200	INFO	---Test---	Simulated statefulset success	{"on": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}}
  2024-06-04T10:49:59.430+0200	INFO	---Test---	ExpectCondition	{"type": "Ready", "expected status": "True", "on": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}}
  2024-06-04T10:49:59.430+0200	INFO	Controllers.NovaScheduler	Reconciling	{"controller": "novascheduler", "controllerGroup": "nova.openstack.org", "controllerKind": "NovaScheduler", "NovaScheduler": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}, "namespace": "586295de-5de7-48a9-b231-59ccd7b0b02b", "name": "435d1006-78aa-481a-bb1f-0-scheduler", "reconcileID": "6eddc6b3-2f9d-4879-8284-97db2c579739"}
  2024-06-04T10:49:59.440+0200	INFO	Controllers.NovaScheduler	StatefulSet 435d1006-78aa-481a-bb1f-0-scheduler - updated	{"controller": "novascheduler", "controllerGroup": "nova.openstack.org", "controllerKind": "NovaScheduler", "NovaScheduler": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}, "namespace": "586295de-5de7-48a9-b231-59ccd7b0b02b", "name": "435d1006-78aa-481a-bb1f-0-scheduler", "reconcileID": "6eddc6b3-2f9d-4879-8284-97db2c579739"}
  2024-06-04T10:49:59.440+0200	INFO	Controllers.NovaScheduler	Deployment is ready	{"controller": "novascheduler", "controllerGroup": "nova.openstack.org", "controllerKind": "NovaScheduler", "NovaScheduler": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}, "namespace": "586295de-5de7-48a9-b231-59ccd7b0b02b", "name": "435d1006-78aa-481a-bb1f-0-scheduler", "reconcileID": "6eddc6b3-2f9d-4879-8284-97db2c579739"}
  2024-06-04T10:49:59.440+0200	INFO	---Test---	OpenStack API request	{"method": "GET", "URI": "/identity/"}
  2024-06-04T10:49:59.443+0200	INFO	---Test---	OpenStack API request	{"method": "GET", "URI": "/compute/os-services/?binary=nova-scheduler"}
  2024-06-04T10:49:59.444+0200	INFO	---Test---	OpenStack API request	{"method": "DELETE", "URI": "/compute/os-services/3"}
  2024-06-04T10:49:59.444+0200	INFO	Controllers.NovaScheduler	Deleted service	{"controller": "novascheduler", "controllerGroup": "nova.openstack.org", "controllerKind": "NovaScheduler", "NovaScheduler": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}, "namespace": "586295de-5de7-48a9-b231-59ccd7b0b02b", "name": "435d1006-78aa-481a-bb1f-0-scheduler", "reconcileID": "6eddc6b3-2f9d-4879-8284-97db2c579739", "service": {"binary":"nova-scheduler","disabled_reason":"","forced_down":false,"host":"nova-scheduler-1","state":"down","status":"enabled","zone":"internal"}}

So the test case of nova-scheduler works because when the novascheduler controller creates the StatefulSet with Replicas 1, the ReadyReplicas is 0 and the cleaning does not happen. Then the test code at L672 simulates that the statefulset is ready, which sets ReadyReplicas to 1 as well. The change of the statefulset triggers a new reconcile on the NovaScheduler, as it owns the statefulset, and in that reconcile your new condition (with my fix) allows the service deletion to happen.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/9a72a2a6a9844a538f2aaac3cf477375

✔️ openstack-k8s-operators-content-provider SUCCESS in 4h 29m 59s
nova-operator-kuttl FAILURE in 53m 04s
✔️ nova-operator-tempest-multinode SUCCESS in 2h 37m 09s
✔️ nova-operator-tempest-multinode-ceph SUCCESS in 3h 01m 45s

@openshift-ci openshift-ci bot requested a review from lewisdenny June 4, 2024 16:59
@auniyal61 auniyal61 force-pushed the OSPRH_6912 branch 3 times, most recently from b8bd2ba to 36025c8, June 5, 2024 10:20
Contributor

@gibizer gibizer left a comment

A couple of fixes, but I think we are close to being able to land this.

@@ -784,6 +784,9 @@ var _ = Describe("NovaConductor controller cleaning", func() {
Expect(novaAPIServer.HasRequest("DELETE", "/compute/os-services/5", "")).To(BeTrue())
Contributor

  • ID 5 is nova-cell0-conductor-0 so that should not be deleted by the controller.

// name index start from 0
// which means if replicaCount is 1, only nova-scheduler-0 is valid case
// so delete >= 1
if service.State == "down" || hostIndex >= int(replicaCount) {
Contributor

@gibizer gibizer Jun 5, 2024

We don't have to wait for the service to be marked down by nova. We know that the pod is deleted, as we waited for instance.Spec.Replicas == instance.Status.ReadyCount before we started the delete. So you can drop the down check.

This will make the test pass with the asserts I proposed below.
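
For illustration, the condition from the snippet above with the down check dropped (a sketch, not necessarily the exact final code):

// Spec.Replicas == Status.ReadyCount was already verified before we got here,
// so any service whose host index is at or beyond the requested replica count
// belongs to a scaled-in pod and can be removed from the nova DB.
if int32(hostIndex) >= replicaCount {
	// delete the service record
}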

@@ -115,6 +117,8 @@ func cleanNovaServiceFromNovaDb(
computeClient *gophercloud.ServiceClient,
serviceName string,
l logr.Logger,
replicaCount int32,
conductorCellName ...string,
Contributor

I would name this simply cellName, and would not try to make it optional via ellipsis. In general Go does not support optional parameters, and if we do need them we have to pass structs around with pointer fields for optionality, or use a builder pattern. Here I think it is OK to have a cellName string parameter; the scheduler can pass in "" as it is not in a cell.

Comment on lines 138 to 143
if strings.Contains(service.Host, "conductor") {
if !strings.Contains(service.Host, conductorCellName[0]) {
// delete only if target conductor service is from same target cell
continue
}
}
Contributor

I think this can be simplified to:

        replicaCount int32,
-       conductorCellName ...string,
+       cellName string,
 ) error {
        opts := services.ListOpts{
                Binary: serviceName,
@@ -135,11 +135,10 @@ func cleanNovaServiceFromNovaDb(
        }
 
        for _, service := range allServices {
-               if strings.Contains(service.Host, "conductor") {
-                       if !strings.Contains(service.Host, conductorCellName[0]) {
-                               // delete only if target conductor service is from same target cell
-                               continue
-                       }
+               // delete only if serviceHost is for our cell. If no cell is
+               // provided (e.g. scheduler cleanup case) then this check is a noop
+               if !strings.Contains(service.Host, cellName) {
+                       continue
                }
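
A call-site sketch for the simplified helper above, following the parameter order from the earlier diff (variable names are illustrative): the conductor controller passes its own cell name, while the scheduler passes an empty string, and since strings.Contains(host, "") is always true the filter becomes a no-op for the scheduler.

// conductor controller: only touch conductor services of its own cell
err := cleanNovaServiceFromNovaDb(computeClient, "nova-conductor", l, *instance.Spec.Replicas, instance.Spec.CellName)

// scheduler controller: not cell scoped, pass ""
err := cleanNovaServiceFromNovaDb(computeClient, "nova-scheduler", l, *instance.Spec.Replicas, "")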

@auniyal61
Contributor Author

/test functional

got
[FAIL] Samples when nova_v1beta1_nova-compute-fake.yaml sample is applied [DeferCleanup (Each)] Nova is created
[FAILED] Timed out after 10.000s.

Contributor

@gibizer gibizer left a comment

I have one more request for the test, and please amend the commit message and PR description. The current title is not really correct. I suggest using the first sentence of the main commit message as the title.

- Update service objects in functional test api_fixture
- Adds a check to wait for readycount to become equal to replica
- Delete service host from nova DB by hostname on replica count update
@auniyal61 auniyal61 changed the title from "Adds a check for scaled down service" to "Clean up scaled down services based on hostname" Jun 6, 2024
Contributor

@gibizer gibizer left a comment

Looks good to me. Thank you @auniyal61 !

Contributor

@mrkisaolamb mrkisaolamb left a comment

lgtm

@@ -280,6 +280,10 @@ func (r *NovaConductorReconciler) Reconcile(ctx context.Context, req ctrl.Reques
return ctrl.Result{}, err
}

if *instance.Spec.Replicas != instance.Status.ReadyCount {
Contributor

In a follow up we can also handle the generation.

Contributor

Yeah, I need to look at it again when #739 lands. Either we need a direct generation check here, or we can just rely on DeploymentReady only being set if the generation matches.
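
For illustration, one possible shape of that follow-up guard (a sketch only; the ObservedGeneration status field name is an assumption, not code from this PR):

// Skip the DB cleanup until the status reflects the latest spec generation
// and every requested replica is ready.
if instance.Generation != instance.Status.ObservedGeneration ||
	*instance.Spec.Replicas != instance.Status.ReadyCount {
	return ctrl.Result{}, nil
}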

Contributor

openshift-ci bot commented Jun 6, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: auniyal61, mrkisaolamb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jun 6, 2024

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/8e48a4b416894f6195f517b7f2e35544

✔️ openstack-k8s-operators-content-provider SUCCESS in 4h 43m 18s
✔️ nova-operator-kuttl SUCCESS in 48m 36s
nova-operator-tempest-multinode RETRY_LIMIT in 21m 02s
nova-operator-tempest-multinode-ceph FAILURE in 3h 13m 47s

@auniyal61
Contributor Author

/test functional

tempest test test_server_connectivity_live_migration failed while verifying downtime of VM after live migration

@auniyal61
Contributor Author

/test functional

this did not run the rdoproject.org/github-check job again

@gibizer
Contributor

gibizer commented Jun 7, 2024

recheck

/test functional

this did not run the rdoproject.org/github-check job again

the rdo jobs need "recheck"

@gibizer
Contributor

gibizer commented Jun 7, 2024

/test functional

tempest test test_server_connectivity_live_migration failed while verifying downtime of VM after live migration

Yepp, we saw that before, so that is unrelated.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/074a26a3358e4cacbeeaeb08fc5151cf

✔️ openstack-k8s-operators-content-provider SUCCESS in 4h 51m 11s
✔️ nova-operator-kuttl SUCCESS in 50m 05s
nova-operator-tempest-multinode FAILURE in 1h 47m 31s
nova-operator-tempest-multinode-ceph FAILURE in 2h 01m 48s

@auniyal61
Contributor Author

recheck

SSH issue while
TASK [artifacts : Copy logs from CRC VM output_dir={{ cifmw_artifacts_basedir }}/artifacts, script=scp -v -r -i {{ cifmw_artifacts_crc_sshkey }} root@{{ cifmw_artifacts_crc_host }}:/ostree/deploy/rhcos/var/log/pods {{ cifmw_artifacts_basedir }}/logs/crc/]

logs:
https://logserver.rdoproject.org/75/775/f34e93d99861e28d1afab587dfc8821d1db1d179/github-check/nova-operator-tempest-multinode/a44e32c/controller/ci-framework-data/logs/ci_script_013_copy_logs_from_crc.log
[email protected]: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
Connection closed

@gibizer
Contributor

gibizer commented Jun 7, 2024

It seems like CI is failing everywhere. But I haven't checked the other results yet. Let's see if the recheck helps

2024-06-07 09:04:55.777926 | controller | TASK [edpm_deploy : Wait for OpenStackDataPlaneDeployment become Ready _raw_params=oc wait OpenStackDataPlaneDeployment {{ cr_name }} --namespace={{ cifmw_install_yamls_defaults['NAMESPACE'] }} --for=condition=Ready --timeout={{ cifmw_edpm_deploy_timeout }}m] ***
2024-06-07 09:04:55.777931 | controller | Friday 07 June 2024  09:04:55 -0400 (0:00:00.817)       0:21:35.470 ***********
2024-06-07 10:04:56.078703 | controller | fatal: [localhost]: FAILED! => changed=true
2024-06-07 10:04:56.078748 | controller |   cmd:
2024-06-07 10:04:56.078755 | controller |   - oc
2024-06-07 10:04:56.078761 | controller |   - wait
2024-06-07 10:04:56.078767 | controller |   - OpenStackDataPlaneDeployment
2024-06-07 10:04:56.078772 | controller |   - edpm-deployment
2024-06-07 10:04:56.078776 | controller |   - --namespace=openstack
2024-06-07 10:04:56.078781 | controller |   - --for=condition=Ready
2024-06-07 10:04:56.078785 | controller |   - --timeout=60m
2024-06-07 10:04:56.078790 | controller |   delta: '1:00:00.111638'
2024-06-07 10:04:56.078794 | controller |   end: '2024-06-07 10:04:56.057461'
2024-06-07 10:04:56.078798 | controller |   msg: non-zero return code
2024-06-07 10:04:56.078803 | controller |   rc: 1
2024-06-07 10:04:56.078807 | controller |   start: '2024-06-07 09:04:55.945823'
2024-06-07 10:04:56.078812 | controller |   stderr: 'error: timed out waiting for the condition on openstackdataplanedeployments/edpm-deployment'

https://logserver.rdoproject.org/75/775/f34e93d99861e28d1afab587dfc8821d1db1d179/github-check/nova-operator-tempest-multinode/a44e32c/job-output.txt

ovn deployment failed on the EDPM node:

Error: statfs /var/lib/openstack/certs/ovn/default/ca.crt: no such file or directory

https://logserver.rdoproject.org/75/775/f34e93d99861e28d1afab587dfc8821d1db1d179/github-check/nova-operator-tempest-multinode/a44e32c/controller/ci-framework-data/logs/openstack-k8s-operators-openstack-must-gather/namespaces/openstack/pods/ovn-edpm-deployment-openstack-edpm-ipam-6jjvb/logs/openstackansibleee.log


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/9463417c42a34402846f6154cf0c5070

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 38m 36s
✔️ nova-operator-kuttl SUCCESS in 48m 09s
nova-operator-tempest-multinode FAILURE in 1h 44m 22s
nova-operator-tempest-multinode-ceph FAILURE in 1h 59m 50s

@gibizer
Contributor

gibizer commented Jun 10, 2024

recheck
openstack-k8s-operators/openstack-operator#834 merged, CI should be in better shape.

@openshift-merge-bot openshift-merge-bot bot merged commit f61fdd2 into openstack-k8s-operators:main Jun 10, 2024
7 checks passed
@auniyal61 auniyal61 deleted the OSPRH_6912 branch December 9, 2024 06:47