
Clean up scaled down services based on hostname #775

Merged

Conversation

Contributor

@auniyal61 auniyal61 commented May 28, 2024

  • Update service objects in functional test api_fixture
  • Adds a check to wait for readycount to become equal to replica
  • Delete service host from nova DB by hostname on replica count update

Implements: OSPRH-6912

@openshift-ci openshift-ci bot requested review from dprince and olliewalsh May 28, 2024 12:25
Contributor

openshift-ci bot commented May 28, 2024

Hi @auniyal61. Thanks for your PR.

I'm waiting for an openstack-k8s-operators member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

controllers/common.go (outdated review thread, resolved)

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/6427ab4248be40098acfafd812f4e77f

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 49m 03s
nova-operator-kuttl FAILURE in 50m 39s
✔️ nova-operator-tempest-multinode SUCCESS in 2h 27m 21s
nova-operator-tempest-multinode-ceph FAILURE in 2h 02m 36s (non-voting)

@@ -177,7 +177,7 @@ func (f *NovaAPIFixture) ServicesList(w http.ResponseWriter, r *http.Request) {
"id": 4,
"binary": "nova-conductor",
"disabled_reason": "test4",
"host": "host2",
"host": "nova-conductor-1",
Contributor

If we have a single conductor replica then its hostname is always suffixed with -0.

Also if you look at the service list in your deployment:

oc exec openstackclient -- openstack compute service list
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
| ID                                   | Binary         | Host                   | Zone     | Status  | State | Updated At                 |
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
| abb15fa7-380b-4baf-85d1-adf25febae54 | nova-conductor | nova-cell0-conductor-0 | internal | enabled | up    | 2024-05-28T16:08:00.000000 |
| 5044f9ba-ace0-407e-b5f3-ae26af8ca53b | nova-scheduler | nova-scheduler-0       | internal | enabled | up    | 2024-05-28T16:08:08.000000 |
| 2fdf49f6-65d6-4584-a2f7-82711b717db7 | nova-conductor | nova-cell1-conductor-0 | internal | enabled | up    | 2024-05-28T16:08:00.000000 |
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+

then you will see that there are conductors for each cell (cell0 and cell1 in the default deployment). We have a separate statefulset per cell for the conductors as they need different DB and message bus configuration and they can be scaled independently.

oc get StatefulSet | grep conductor
nova-cell0-conductor     1/1     4m51s
nova-cell1-conductor     1/1     4m31s

oc get pods -lservice=nova-conductor
NAME                     READY   STATUS    RESTARTS   AGE
nova-cell0-conductor-0   1/1     Running   0          5m24s
nova-cell1-conductor-0   1/1     Running   0          5m4s

Bottom line, to make this test data realistic I would use nova-cell0-conductor-0

Contributor

Later I realized that we might have more complications in the test env regarding the hostnames of the services.

The hostname of the service in the nova cell DB comes from the name of the Pod running that service binary. The name of the Pod consists of the name of the StatefulSet plus the index of the replica. The name of the StatefulSet is set by the nova-operator. For the scheduler it is set by the scheduler controller via

Name: instance.Name,

so it matches the name of the NovaScheduler CR. The name of the NovaScheduler is set by the nova top level controller via

Name: instance.Name + "-scheduler",

so it contains the name of the Nova CR plus -scheduler. The name of the Nova CR in a real deployment is set by the openstack-operator https://github.com/openstack-k8s-operators/openstack-operator/blob/6daae482c03a10df034dddbfeb1fa9e96f6b9072/pkg/openstack/nova.go#L46 and it is hardcoded to "nova".

However, in the functional test env we don't run the openstack-operator, just the nova-operator, and the test code creates the nova related CRs directly. To avoid parallel test case executions interfering with each other, the names of the CRs created by the test are randomly generated. E.g. here we create a NovaScheduler CR

DeferCleanup(th.DeleteInstance, CreateNovaScheduler(novaNames.SchedulerName, spec))

and the passed-in novaNames struct has a random uuid, generated for each test case, that is included in the names of the CRs:
func GetNovaNames(novaName types.NamespacedName, cellNames []string) NovaNames {
	// NOTE(bogdando): use random UUIDs instead of static "nova" part of names.
	// These **must** replicate existing Nova*/Dataplane controllers suffixing/prefixing logic.
	// While dynamic UUIDs also provide enhanced testing coverage for "synthetic" cases,
	// which could not be caught for normal names with static "nova" prefixes.
	novaAPI := types.NamespacedName{
		Namespace: novaName.Namespace,
		Name:      fmt.Sprintf("%s-api", novaName.Name),
	}
	novaScheduler := types.NamespacedName{
		Namespace: novaName.Namespace,
		Name:      fmt.Sprintf("%s-scheduler", novaName.Name),

So in the functional env the NovaScheduler CR will have a name like <random uuid>-scheduler, and therefore the StatefulSet name will also be like that.

So most likely we cannot store hardcoded test data here any more; instead we should generate the test data from the test case, where we know the current uuid for that test case.
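
A minimal sketch (not part of the PR) of how the expected service hostname could be derived from the CR name, so the fixture data can be generated per test case instead of being hardcoded; the helper name is illustrative:

// expectedServiceHost derives the nova service hostname from the name of the
// CR owning the StatefulSet: the pod, and therefore the service host in the
// nova DB, is named "<StatefulSet name>-<replica index>", and the StatefulSet
// is named after the CR. In a real deployment crName is e.g. "nova-scheduler";
// in the functional tests it is "<random uuid>-scheduler".
func expectedServiceHost(crName string, replicaIndex int) string {
	return fmt.Sprintf("%s-%d", crName, replicaIndex)
}

For example, expectedServiceHost(novaNames.SchedulerName.Name, 0) would give the host value to put into the NovaAPIFixture service list.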

@@ -609,13 +609,14 @@ func (r *NovaConductorReconciler) cleanServiceFromNovaDb(
secret corev1.Secret,
l logr.Logger,
) error {
replica_count := int(*instance.Spec.Replicas)
Contributor

golang uses camelCase instead of snake_case. The full style guide is here https://go.dev/doc/effective_go

Contributor Author

ack, fixed name style

@auniyal61 auniyal61 requested a review from gibizer May 29, 2024 09:14
@auniyal61 auniyal61 marked this pull request as draft May 29, 2024 09:16

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/a47d546ccc2047608b0a9b03de6afa68

✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 20m 22s
nova-operator-kuttl FAILURE in 50m 46s
nova-operator-tempest-multinode FAILURE in 2h 15m 06s
✔️ nova-operator-tempest-multinode-ceph SUCCESS in 2h 58m 28s (non-voting)

@gibizer
Contributor

gibizer commented May 29, 2024

/ok-to-test


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/604dc6fe07db461d8037cae93d64cf82

✔️ openstack-k8s-operators-content-provider SUCCESS in 3h 07m 39s
nova-operator-kuttl FAILURE in 50m 53s
✔️ nova-operator-tempest-multinode SUCCESS in 2h 08m 06s
✔️ nova-operator-tempest-multinode-ceph SUCCESS in 2h 47m 47s (non-voting)

@gibizer
Contributor

gibizer commented May 31, 2024

/test functional

@auniyal61 while the patch is in draft, the prow jobs do not run automatically when the PR is updated. You can always trigger them manually though.

@gibizer
Contributor

gibizer commented May 31, 2024

Without looking at the change, I checked the kuttl test results where we have scale down tests, and they are passing.

2024-05-31 08:27:30.786137 | controller |     logger.go:42: 08:27:30 | scale-tests/3-scale-down-nova | test step completed 3-scale-down-nova
2024-05-31 08:27:30.786290 | controller |     logger.go:42: 08:27:30 | scale-tests/4-reconfigure-service-downtime | starting test step 4-reconfigure-service-downtime
2024-05-31 08:27:30.844388 | controller |     logger.go:42: 08:27:30 | scale-tests/4-reconfigure-service-downtime | Nova:nova-kuttl-default/nova-kuttl updated
2024-05-31 08:27:30.844488 | controller |     logger.go:42: 08:27:30 | scale-tests/4-reconfigure-service-downtime | test step completed 4-reconfigure-service-downtime
2024-05-31 08:27:30.844501 | controller |     logger.go:42: 08:27:30 | scale-tests/5-wait-for-service-downtime | starting test step 5-wait-for-service-downtime
2024-05-31 08:27:30.844585 | controller |     logger.go:42: 08:27:30 | scale-tests/5-wait-for-service-downtime | running command: [sh -c sleep 25
2024-05-31 08:27:30.844617 | controller |         ]
2024-05-31 08:27:55.867521 | controller |     logger.go:42: 08:27:55 | scale-tests/5-wait-for-service-downtime | test step completed 5-wait-for-service-downtime
2024-05-31 08:27:55.868660 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | starting test step 6-trigger-service-cleanup
2024-05-31 08:27:55.908611 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | NovaConductor:nova-kuttl-default/nova-kuttl-cell0-conductor updated
2024-05-31 08:27:55.941777 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | NovaConductor:nova-kuttl-default/nova-kuttl-cell1-conductor updated
2024-05-31 08:27:55.973319 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | NovaScheduler:nova-kuttl-default/nova-kuttl-scheduler updated
2024-05-31 08:27:55.973623 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | running command: [sh -c set -euxo pipefail
2024-05-31 08:27:55.973643 | controller |         (( $(oc exec -n $NAMESPACE openstackclient -- openstack compute service list --service nova-scheduler -f value | wc -l) == 1 ))
2024-05-31 08:27:55.973654 | controller |         (( $(oc exec -n $NAMESPACE openstackclient -- openstack compute service list --service nova-conductor -f value | grep cell1 | wc -l) == 1 ))
2024-05-31 08:27:55.973664 | controller |         (( $(oc exec -n $NAMESPACE openstackclient -- openstack compute service list --service nova-conductor -f value | grep cell0 | wc -l) == 1 ))
2024-05-31 08:27:55.973674 | controller |         ]
2024-05-31 08:27:55.980207 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | ++ oc exec -n nova-kuttl-default openstackclient -- openstack compute service list --service nova-scheduler -f value
2024-05-31 08:27:55.981011 | controller |     logger.go:42: 08:27:55 | scale-tests/6-trigger-service-cleanup | ++ wc -l
2024-05-31 08:28:05.357624 | controller |     logger.go:42: 08:28:05 | scale-tests/6-trigger-service-cleanup | + ((  1 == 1  ))
2024-05-31 08:28:05.358403 | controller |     logger.go:42: 08:28:05 | scale-tests/6-trigger-service-cleanup | ++ oc exec -n nova-kuttl-default openstackclient -- openstack compute service list --service nova-conductor -f value
2024-05-31 08:28:05.359139 | controller |     logger.go:42: 08:28:05 | scale-tests/6-trigger-service-cleanup | ++ wc -l
2024-05-31 08:28:05.361411 | controller |     logger.go:42: 08:28:05 | scale-tests/6-trigger-service-cleanup | ++ grep cell1
2024-05-31 08:28:09.207875 | controller |     logger.go:42: 08:28:09 | scale-tests/6-trigger-service-cleanup | + ((  1 == 1  ))
2024-05-31 08:28:09.208824 | controller |     logger.go:42: 08:28:09 | scale-tests/6-trigger-service-cleanup | ++ oc exec -n nova-kuttl-default openstackclient -- openstack compute service list --service nova-conductor -f value
2024-05-31 08:28:09.209806 | controller |     logger.go:42: 08:28:09 | scale-tests/6-trigger-service-cleanup | ++ wc -l
2024-05-31 08:28:09.212425 | controller |     logger.go:42: 08:28:09 | scale-tests/6-trigger-service-cleanup | ++ grep cell0
2024-05-31 08:28:11.369287 | controller |     logger.go:42: 08:28:11 | scale-tests/6-trigger-service-cleanup | + ((  1 == 1  ))
2024-05-31 08:28:11.369982 | controller |     logger.go:42: 08:28:11 | scale-tests/6-trigger-service-cleanup | test step completed 6-trigger-service-cleanup

https://logserver.rdoproject.org/75/775/8f3c7a0b7a64741b87ef26def3d1e1062c996447/github-check/nova-operator-kuttl/dbe4282/job-output.txt

Btw we can simplify the kuttl test now. Previously it needed to wait until the service was considered down by nova, as only down services were deleted. But we don't need that wait any more. I think steps 4, 5 and 6 can be deleted from https://github.com/openstack-k8s-operators/nova-operator/tree/main/test/kuttl/test-suites/default/scale-tests and the content of https://github.com/openstack-k8s-operators/nova-operator/blob/main/test/kuttl/test-suites/default/scale-tests/06-assert.yaml can be added directly to step 3 https://github.com/openstack-k8s-operators/nova-operator/blob/main/test/kuttl/test-suites/default/scale-tests/03-assert.yaml

// name index start from 0
// which means if replicaCount is 1, only nova-scheduler-0 is valid case
// so delete >= 1
if service.State == "down" || hostIndex >= int(replicaCount) {
Contributor

Instead of handling services in the down state, we should make sure that cleanServiceFromNovaDb is called after the deployment is fully up to date. You can do that by adding a condition before calling cleanServiceFromNovaDb in

result, err = r.ensureDeployment(ctx, h, instance, inputHash, serviceAnnotations)
if (err != nil || result != ctrl.Result{}) {
	return result, err
}
// clean up nova services from nova db should be always a last step in reconcile
// to make sure that
err = r.cleanServiceFromNovaDb(ctx, h, instance, secret, Log)
if err != nil {
	Log.Error(err, "Failed cleaning services from nova db")
}
and checking that the replica count requested in the spec is the same as the ready replicas reported by the statefulset. The ready replicas are stored in instance.Status here
instance.Status.ReadyCount = ss.GetStatefulSet().Status.ReadyReplicas
So the condition can be as simple as instance.Spec.Replicas == instance.Status.ReadyCount
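
For illustration, a sketch of the suggested ordering, reusing the reconcile snippet above (not necessarily the exact final code):

result, err = r.ensureDeployment(ctx, h, instance, inputHash, serviceAnnotations)
if (err != nil || result != ctrl.Result{}) {
	return result, err
}
// Only clean up the nova DB once the StatefulSet reports every requested
// replica ready; otherwise return and retry on a later reconcile.
if *instance.Spec.Replicas != instance.Status.ReadyCount {
	return ctrl.Result{}, nil
}
err = r.cleanServiceFromNovaDb(ctx, h, instance, secret, Log)
if err != nil {
	Log.Error(err, "Failed cleaning services from nova db")
}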

@@ -609,13 +609,14 @@ func (r *NovaConductorReconciler) cleanServiceFromNovaDb(
secret corev1.Secret,
l logr.Logger,
) error {
replicaCount := int(*instance.Spec.Replicas)
Contributor

I would drop the int cast and instead change the function parameter in cleanNovaServiceFromNovaDb to be int32. But keep the pointer deref so that cleanNovaServiceFromNovaDb does not need to deal with nil pointers.

Btw even though instance.Spec.Replicas is a pointer, we know it will not be nil, as when the Instance is read from etcd the field is defaulted to 1 if not provided. This behavior is defined by the kubebuilder comments on the field definition:

// +kubebuilder:validation:Optional
// +kubebuilder:default=1
// +kubebuilder:validation:Maximum=32
// +kubebuilder:validation:Minimum=0
// Replicas of the service to run
Replicas *int32 `json:"replicas"`

@@ -663,11 +663,12 @@ func (r *NovaSchedulerReconciler) cleanServiceFromNovaDb(
secret corev1.Secret,
l logr.Logger,
) error {
replicaCount := int(*instance.Spec.Replicas)
Contributor

ditto about casting.

Contributor Author

done

hostSplits = hostSplits[len(hostSplits)-2:]
host, hostIndexStr := hostSplits[0], hostSplits[1]

if strings.Contains(serviceName, host) {
Contributor

This should not be needed. We are already querying only the given service type from nova at L123. So when we call this from the scheduler controller we only list the scheduler services from the db.

It is possible that the functional test code did not simulate this precisely, I need to check...

Contributor

@gibizer gibizer May 31, 2024

Yeah, it was a shortcoming of the test code only. I pushed improvements to the nova API fixture to return only the services that use the requested binary: #778

Contributor Author

ack

Comment on lines 138 to 140
hostSplits := strings.Split(service.Host, "-")
hostSplits = hostSplits[len(hostSplits)-2:]
host, hostIndexStr := hostSplits[0], hostSplits[1]
Contributor

This is OK for the scheduler case, as a Nova deployment always has a single set of scheduler replicas. However for the conductor this is different: for each NovaCell we have a separate set of conductor replicas, and the different cells can have different conductor replica counts.
This also means that the delete logic for a given cell should only delete conductor services belonging to that cell. Right now this code deletes conductor services regardless of which cell they belong to.

Contributor

E.g. I changed the default deployment in a way that cell0 has 2 conductor replicas and cell1 has 3:

❯ oc get OpenStackControlPlane -o yaml| grep conductorServiceTemplate -B4 -A2
          cell0:
            cellDatabaseAccount: nova-cell0
            cellDatabaseInstance: openstack
            cellMessageBusInstance: rabbitmq
            conductorServiceTemplate:
              customServiceConfig: ""
              replicas: 2
--
          cell1:
            cellDatabaseAccount: nova-cell1
            cellDatabaseInstance: openstack-cell1
            cellMessageBusInstance: rabbitmq-cell1
            conductorServiceTemplate:
              customServiceConfig: ""
              replicas: 3

❯ oc get pods | grep conductor
nova-cell0-conductor-0                                          1/1     Running     0          113m
nova-cell0-conductor-1                                          1/1     Running     0          2m14s
nova-cell1-conductor-0                                          1/1     Running     0          113m
nova-cell1-conductor-1                                          1/1     Running     0          2m14s
nova-cell1-conductor-2                                          1/1     Running     0          2m14s

❯ openstack compute service list
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
| ID                                   | Binary         | Host                   | Zone     | Status  | State | Updated At                 |
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
| 26e569f7-66ab-4a87-9bfa-e48e01835f54 | nova-conductor | nova-cell1-conductor-0 | internal | enabled | up    | 2024-05-31T14:05:13.000000 |
| bb3164f0-c655-40c1-beb8-1592df024d10 | nova-conductor | nova-cell1-conductor-2 | internal | enabled | up    | 2024-05-31T14:05:11.000000 |
| ced135e2-7242-4e36-bf59-3ae29382b62d | nova-conductor | nova-cell1-conductor-1 | internal | enabled | up    | 2024-05-31T14:05:11.000000 |
| 4530443c-b5bc-4cfa-8d50-d880bbe74b09 | nova-conductor | nova-cell0-conductor-0 | internal | enabled | up    | 2024-05-31T14:05:19.000000 |
| 396ce09f-cc4f-4e95-a0d3-4535dc2023de | nova-scheduler | nova-scheduler-0       | internal | enabled | up    | 2024-05-31T14:05:12.000000 |
| 14bf553c-5958-4d60-bba9-40035f602a8f | nova-conductor | nova-cell0-conductor-1 | internal | enabled | up    | 2024-05-31T14:05:11.000000 |
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
❯ oc get NovaConductor -o custom-columns=NAME:.metadata.name,CELL:.spec.cellName,REPLICAS:.spec.replicas
NAME                   CELL    REPLICAS
nova-cell0-conductor   cell0   2
nova-cell1-conductor   cell1   3

Now if I change the conductor replica count in cell0 from 2 to 1, I don't want nova-cell1-conductor-1 or nova-cell1-conductor-2 to be deleted; only nova-cell0-conductor-1 should be deleted.
So the delete logic needs to depend on the cell name as well. You can read that from the conductor instance as instance.Spec.CellName.

Contributor Author

Patch #778 fixes the test for names; we do not need to extract host_name anymore, so I removed that.

if instance.Spec.Replicas != &instance.Status.ReadyCount {
return ctrl.Result{}, nil
}
// for instance.Spec.NovaServiceBase.Replicas
Contributor Author

need to remove

@auniyal61
Contributor Author

/test functional

@@ -676,7 +676,7 @@ var _ = Describe("NovaScheduler controller cleaning", func() {
condition.ReadyCondition,
corev1.ConditionTrue,
)
- Expect(novaAPIFixture.HasRequest("DELETE", "/compute/os-services/3", "")).To(BeTrue())
+ Expect(novaAPIFixture.HasRequest("DELETE", "/compute/os-services/3", "binary=nova-scheduler")).To(BeTrue())
Contributor Author

So this is failing because of our added condition for replicaCount;
now control never reaches cleanServiceFromNovaDb.

Contributor

Here you added binary=nova-scheduler, but the controller does not send a binary query param with the DELETE request; it only does that in the GET request when querying the list of services.

Contributor

If you remove that change and apply the fix to the condition above, then the functional test passes:

  2024-06-04T10:49:59.346+0200	INFO	Controllers.NovaScheduler	Deployment is not ready	{"controller": "novascheduler", "controllerGroup": "nova.openstack.org", "controllerKind": "NovaScheduler", "NovaScheduler": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}, "namespace": "586295de-5de7-48a9-b231-59ccd7b0b02b", "name": "435d1006-78aa-481a-bb1f-0-scheduler", "reconcileID": "87264e31-b83b-4dcb-9b52-99db1db29b34", "Status": {"replicas":0,"availableReplicas":0}}
  2024-06-04T10:49:59.346+0200	INFO	Controllers.NovaScheduler	Waiting for ReadyCount to equal Replicas before deleting scaled in services from the nova DB.	{"controller": "novascheduler", "controllerGroup": "nova.openstack.org", "controllerKind": "NovaScheduler", "NovaScheduler": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}, "namespace": "586295de-5de7-48a9-b231-59ccd7b0b02b", "name": "435d1006-78aa-481a-bb1f-0-scheduler", "reconcileID": "87264e31-b83b-4dcb-9b52-99db1db29b34", "replicas": 1, "ReadyCount": 0}
  2024-06-04T10:49:59.430+0200	INFO	---Test---	Simulated statefulset success	{"on": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}}
  2024-06-04T10:49:59.430+0200	INFO	---Test---	ExpectCondition	{"type": "Ready", "expected status": "True", "on": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}}
  2024-06-04T10:49:59.430+0200	INFO	Controllers.NovaScheduler	Reconciling	{"controller": "novascheduler", "controllerGroup": "nova.openstack.org", "controllerKind": "NovaScheduler", "NovaScheduler": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}, "namespace": "586295de-5de7-48a9-b231-59ccd7b0b02b", "name": "435d1006-78aa-481a-bb1f-0-scheduler", "reconcileID": "6eddc6b3-2f9d-4879-8284-97db2c579739"}
  2024-06-04T10:49:59.440+0200	INFO	Controllers.NovaScheduler	StatefulSet 435d1006-78aa-481a-bb1f-0-scheduler - updated	{"controller": "novascheduler", "controllerGroup": "nova.openstack.org", "controllerKind": "NovaScheduler", "NovaScheduler": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}, "namespace": "586295de-5de7-48a9-b231-59ccd7b0b02b", "name": "435d1006-78aa-481a-bb1f-0-scheduler", "reconcileID": "6eddc6b3-2f9d-4879-8284-97db2c579739"}
  2024-06-04T10:49:59.440+0200	INFO	Controllers.NovaScheduler	Deployment is ready	{"controller": "novascheduler", "controllerGroup": "nova.openstack.org", "controllerKind": "NovaScheduler", "NovaScheduler": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}, "namespace": "586295de-5de7-48a9-b231-59ccd7b0b02b", "name": "435d1006-78aa-481a-bb1f-0-scheduler", "reconcileID": "6eddc6b3-2f9d-4879-8284-97db2c579739"}
  2024-06-04T10:49:59.440+0200	INFO	---Test---	OpenStack API request	{"method": "GET", "URI": "/identity/"}
  2024-06-04T10:49:59.443+0200	INFO	---Test---	OpenStack API request	{"method": "GET", "URI": "/compute/os-services/?binary=nova-scheduler"}
  2024-06-04T10:49:59.444+0200	INFO	---Test---	OpenStack API request	{"method": "DELETE", "URI": "/compute/os-services/3"}
  2024-06-04T10:49:59.444+0200	INFO	Controllers.NovaScheduler	Deleted service	{"controller": "novascheduler", "controllerGroup": "nova.openstack.org", "controllerKind": "NovaScheduler", "NovaScheduler": {"name":"435d1006-78aa-481a-bb1f-0-scheduler","namespace":"586295de-5de7-48a9-b231-59ccd7b0b02b"}, "namespace": "586295de-5de7-48a9-b231-59ccd7b0b02b", "name": "435d1006-78aa-481a-bb1f-0-scheduler", "reconcileID": "6eddc6b3-2f9d-4879-8284-97db2c579739", "service": {"binary":"nova-scheduler","disabled_reason":"","forced_down":false,"host":"nova-scheduler-1","state":"down","status":"enabled","zone":"internal"}}

So the test case of nova-scheduler works because when the novascheduler controller creates the StatefulSet with Replicas 1, the ReadyReplicas is 0 and the cleaning does not happen. Then the test code at L672 simulates that the statefulset is ready, which sets ReadyReplicas to 1 as well. The change of the statefulset triggers a new reconcile on the NovaScheduler, as it owns the statefulset, and in that reconcile your new condition (with my fix) allows the service deletion to happen.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/9a72a2a6a9844a538f2aaac3cf477375

✔️ openstack-k8s-operators-content-provider SUCCESS in 4h 29m 59s
nova-operator-kuttl FAILURE in 53m 04s
✔️ nova-operator-tempest-multinode SUCCESS in 2h 37m 09s
✔️ nova-operator-tempest-multinode-ceph SUCCESS in 3h 01m 45s

@openshift-ci openshift-ci bot requested a review from lewisdenny June 4, 2024 16:59
@auniyal61 auniyal61 force-pushed the OSPRH_6912 branch 3 times, most recently from b8bd2ba to 36025c8, June 5, 2024 10:20
Contributor

@gibizer gibizer left a comment

A couple of fixes, but I think we are close to being able to land this.

@@ -784,6 +784,9 @@ var _ = Describe("NovaConductor controller cleaning", func() {
Expect(novaAPIServer.HasRequest("DELETE", "/compute/os-services/5", "")).To(BeTrue())
Contributor

  • ID 5 is nova-cell0-conductor-0 so that should not be deleted by the controller.

// name index start from 0
// which means if replicaCount is 1, only nova-scheduler-0 is valid case
// so delete >= 1
if service.State == "down" || hostIndex >= int(replicaCount) {
Contributor

@gibizer gibizer Jun 5, 2024

We don't have to wait for the service to be marked down by nova. We know that the pod is deleted, as we waited for instance.Spec.Replicas == instance.Status.ReadyCount before we started the delete. So you can drop the down check.

This will make the test pass with the asserts I proposed below.
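
For illustration, the condition from the snippet above with the down check dropped (a sketch, not necessarily the exact final code):

// Spec.Replicas == Status.ReadyCount was already verified before we got here,
// so any service whose host index is at or beyond the requested replica count
// belongs to a scaled-in pod and can be removed from the nova DB.
if int32(hostIndex) >= replicaCount {
	// delete the service record
}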

@@ -115,6 +117,8 @@ func cleanNovaServiceFromNovaDb(
computeClient *gophercloud.ServiceClient,
serviceName string,
l logr.Logger,
replicaCount int32,
conductorCellName ...string,
Contributor

I would name this simply cellName, and would not try to make it optional via ellipsis. In general Go does not support optional parameters, and if we do need them we have to pass structs around with pointer fields for optionality, or use a builder pattern. Here I think it is OK to have a cellName string parameter; the scheduler can pass in "" as it is not in a cell.

Comment on lines 138 to 143
if strings.Contains(service.Host, "conductor") {
if !strings.Contains(service.Host, conductorCellName[0]) {
// delete only if target conductor service is from same target cell
continue
}
}
Contributor

I think this can be simplified to:

        replicaCount int32,
-       conductorCellName ...string,
+       cellName string,
 ) error {
        opts := services.ListOpts{
                Binary: serviceName,
@@ -135,11 +135,10 @@ func cleanNovaServiceFromNovaDb(
        }
 
        for _, service := range allServices {
-               if strings.Contains(service.Host, "conductor") {
-                       if !strings.Contains(service.Host, conductorCellName[0]) {
-                               // delete only if target conductor service is from same target cell
-                               continue
-                       }
+               // delete only if serviceHost is for our cell. If no cell is
+               // provided (e.g. scheduler cleanup case) then this check is a noop
+               if !strings.Contains(service.Host, cellName) {
+                       continue
                }
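
A call-site sketch for the simplified helper above, following the parameter order from the earlier diff (variable names are illustrative): the conductor controller passes its own cell name, while the scheduler passes an empty string, and since strings.Contains(host, "") is always true the filter becomes a no-op for the scheduler.

// conductor controller: only touch conductor services of its own cell
err := cleanNovaServiceFromNovaDb(computeClient, "nova-conductor", l, *instance.Spec.Replicas, instance.Spec.CellName)

// scheduler controller: not cell scoped, pass ""
err := cleanNovaServiceFromNovaDb(computeClient, "nova-scheduler", l, *instance.Spec.Replicas, "")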

@auniyal61
Contributor Author

/test functional

got
[FAIL] Samples when nova_v1beta1_nova-compute-fake.yaml sample is applied [DeferCleanup (Each)] Nova is created
[FAILED] Timed out after 10.000s.

Contributor

@gibizer gibizer left a comment

I have one more request for the test, and please amend the commit message and PR description. The current title is not really correct. I suggest using the first sentence of the main commit message as the title.

- Update service objects in functional test api_fixture
- Adds a check to wait for readycount to become equal to replica
- Delete service host from nova DB by hostname on replica count update
@auniyal61 auniyal61 changed the title from "Adds a check for scaled down service" to "Clean up scaled down services based on hostname" Jun 6, 2024
Contributor

@gibizer gibizer left a comment

Looks good to me. Thank you @auniyal61 !

Contributor

@mrkisaolamb mrkisaolamb left a comment

lgtm

@@ -280,6 +280,10 @@ func (r *NovaConductorReconciler) Reconcile(ctx context.Context, req ctrl.Reques
return ctrl.Result{}, err
}

if *instance.Spec.Replicas != instance.Status.ReadyCount {
Contributor

In a follow up we can also handle the generation.

Contributor

Yeah, I need to look at it again when #739 lands. Either we need a direct generation check here, or we can just rely on DeploymentReady only being set if the generation matches.
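
For illustration, one possible shape of that follow-up guard (a sketch only; the ObservedGeneration status field name is an assumption, not code from this PR):

// Skip the DB cleanup until the status reflects the latest spec generation
// and every requested replica is ready.
if instance.Generation != instance.Status.ObservedGeneration ||
	*instance.Spec.Replicas != instance.Status.ReadyCount {
	return ctrl.Result{}, nil
}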

Contributor

openshift-ci bot commented Jun 6, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: auniyal61, mrkisaolamb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jun 6, 2024

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/8e48a4b416894f6195f517b7f2e35544

✔️ openstack-k8s-operators-content-provider SUCCESS in 4h 43m 18s
✔️ nova-operator-kuttl SUCCESS in 48m 36s
nova-operator-tempest-multinode RETRY_LIMIT in 21m 02s
nova-operator-tempest-multinode-ceph FAILURE in 3h 13m 47s

@auniyal61
Contributor Author

/test functional

tempest test test_server_connectivity_live_migration failed while verifying downtime of VM after live migration

@auniyal61
Contributor Author

/test functional

this did not run the rdoproject.org/github-check job again

@gibizer
Contributor

gibizer commented Jun 7, 2024

recheck

/test functional

this did not run the rdoproject.org/github-check job again

the rdo jobs need "recheck"

@gibizer
Contributor

gibizer commented Jun 7, 2024

/test functional

tempest test test_server_connectivity_live_migration failed while verifying downtime of VM after live migration

Yepp, we saw that before, so that is unrelated.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/074a26a3358e4cacbeeaeb08fc5151cf

✔️ openstack-k8s-operators-content-provider SUCCESS in 4h 51m 11s
✔️ nova-operator-kuttl SUCCESS in 50m 05s
nova-operator-tempest-multinode FAILURE in 1h 47m 31s
nova-operator-tempest-multinode-ceph FAILURE in 2h 01m 48s

@auniyal61
Contributor Author

recheck

SSH issue while
TASK [artifacts : Copy logs from CRC VM output_dir={{ cifmw_artifacts_basedir }}/artifacts, script=scp -v -r -i {{ cifmw_artifacts_crc_sshkey }} root@{{ cifmw_artifacts_crc_host }}:/ostree/deploy/rhcos/var/log/pods {{ cifmw_artifacts_basedir }}/logs/crc/]

logs:
https://logserver.rdoproject.org/75/775/f34e93d99861e28d1afab587dfc8821d1db1d179/github-check/nova-operator-tempest-multinode/a44e32c/controller/ci-framework-data/logs/ci_script_013_copy_logs_from_crc.log
[email protected]: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
Connection closed

@gibizer
Contributor

gibizer commented Jun 7, 2024

It seems like CI is failing everywhere. But I haven't checked the other results yet. Let's see if the recheck helps

2024-06-07 09:04:55.777926 | controller | TASK [edpm_deploy : Wait for OpenStackDataPlaneDeployment become Ready _raw_params=oc wait OpenStackDataPlaneDeployment {{ cr_name }} --namespace={{ cifmw_install_yamls_defaults['NAMESPACE'] }} --for=condition=Ready --timeout={{ cifmw_edpm_deploy_timeout }}m] ***
2024-06-07 09:04:55.777931 | controller | Friday 07 June 2024  09:04:55 -0400 (0:00:00.817)       0:21:35.470 ***********
2024-06-07 10:04:56.078703 | controller | fatal: [localhost]: FAILED! => changed=true
2024-06-07 10:04:56.078748 | controller |   cmd:
2024-06-07 10:04:56.078755 | controller |   - oc
2024-06-07 10:04:56.078761 | controller |   - wait
2024-06-07 10:04:56.078767 | controller |   - OpenStackDataPlaneDeployment
2024-06-07 10:04:56.078772 | controller |   - edpm-deployment
2024-06-07 10:04:56.078776 | controller |   - --namespace=openstack
2024-06-07 10:04:56.078781 | controller |   - --for=condition=Ready
2024-06-07 10:04:56.078785 | controller |   - --timeout=60m
2024-06-07 10:04:56.078790 | controller |   delta: '1:00:00.111638'
2024-06-07 10:04:56.078794 | controller |   end: '2024-06-07 10:04:56.057461'
2024-06-07 10:04:56.078798 | controller |   msg: non-zero return code
2024-06-07 10:04:56.078803 | controller |   rc: 1
2024-06-07 10:04:56.078807 | controller |   start: '2024-06-07 09:04:55.945823'
2024-06-07 10:04:56.078812 | controller |   stderr: 'error: timed out waiting for the condition on openstackdataplanedeployments/edpm-deployment'

https://logserver.rdoproject.org/75/775/f34e93d99861e28d1afab587dfc8821d1db1d179/github-check/nova-operator-tempest-multinode/a44e32c/job-output.txt

ovn deployment failed on the EDPM node:

Error: statfs /var/lib/openstack/certs/ovn/default/ca.crt: no such file or directory

https://logserver.rdoproject.org/75/775/f34e93d99861e28d1afab587dfc8821d1db1d179/github-check/nova-operator-tempest-multinode/a44e32c/controller/ci-framework-data/logs/openstack-k8s-operators-openstack-must-gather/namespaces/openstack/pods/ovn-edpm-deployment-openstack-edpm-ipam-6jjvb/logs/openstackansibleee.log


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/9463417c42a34402846f6154cf0c5070

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 38m 36s
✔️ nova-operator-kuttl SUCCESS in 48m 09s
nova-operator-tempest-multinode FAILURE in 1h 44m 22s
nova-operator-tempest-multinode-ceph FAILURE in 1h 59m 50s

@gibizer
Contributor

gibizer commented Jun 10, 2024

recheck
openstack-k8s-operators/openstack-operator#834 merged, CI should be in better shape.

@openshift-merge-bot openshift-merge-bot bot merged commit f61fdd2 into openstack-k8s-operators:main Jun 10, 2024
7 checks passed
@auniyal61 auniyal61 deleted the OSPRH_6912 branch December 9, 2024 06:47