
Share cleanup job #337

Merged

Conversation

fmount
Collaborator

@fmount fmount commented Sep 23, 2024

When a backend config is removed, the "service" database entry in manila needs to be adjusted to account for the removal. This patch introduces a Job that runs a manila-manage command every time a share is scaled down. The main Job definition is now more flexible: a TTL can be passed to customize how long the Job is kept after it finishes.

Jira: https://issues.redhat.com/browse/OSPRH-6526
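
As a rough illustration of the TTL part only (not the actual patch): a minimal sketch using a plain Kubernetes batch/v1 Job, where the helper name cleanupJob, the namespace, the image, and the manila-manage invocation are all placeholders rather than the operator's real code.

```go
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cleanupJob is a hypothetical helper (not the operator's actual code) that
// builds a batch/v1 Job and attaches a TTL so Kubernetes garbage-collects the
// Job object `ttl` seconds after it finishes.
func cleanupJob(ttl int32) *batchv1.Job {
	return &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "manila-service-cleanup", // placeholder name
			Namespace: "openstack",              // placeholder namespace
		},
		Spec: batchv1.JobSpec{
			// Controls how long the finished Job is kept around.
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{{
						Name:  "manila-service-cleanup",
						Image: "manila-api-image", // placeholder image
						// The exact manila-manage invocation is assumed here
						// for illustration only.
						Command: []string{"/bin/bash", "-c", "manila-manage service cleanup"},
					}},
				},
			},
		},
	}
}

func main() {
	job := cleanupJob(120)
	fmt.Println(job.Name, *job.Spec.TTLSecondsAfterFinished)
}
```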

Contributor

openshift-ci bot commented Sep 23, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fmount

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fmount
Collaborator Author

fmount commented Sep 23, 2024

Holding for now to review and discuss this further.

@fmount
Collaborator Author

fmount commented Sep 23, 2024

/test functional

@silvacarloss
Contributor

silvacarloss commented Sep 23, 2024

Looks good from a manila point of view and the removal makes sense! Thank you for working on this. @gouthampacha might want to take a look and share his thoughts too :)

fmount added 2 commits October 4, 2024 13:54
When a share is deleted from the top-level CR, its service reference is not
deleted, and manila still sees it as down/not available. For this reason,
every time a scale-down operation is executed, we need to run a job that
performs a service cleanup, removing everything marked as down.

Jira: https://issues.redhat.com/browse/OSPRH-6526

Signed-off-by: Francesco Pantano <[email protected]>
When a backend config is removed, the "service" database entry in manila needs
to be adjusted to account for the removal. This patch introduces a Job that
runs a manila-manage command every time a share is scaled down. The main Job
definition is now more flexible: a TTL can be passed to customize how long the
Job is kept after it finishes.

Signed-off-by: Francesco Pantano <[email protected]>
@@ -48,6 +48,8 @@ const (
ManilaPublicPort int32 = 8786
// ManilaInternalPort -
ManilaInternalPort int32 = 8786
// ManilaServiceCleanupDelay -
ManilaServiceCleanupDelay int32 = 120
fmount (Collaborator Author)

@gouthampacha we delay according to this parameter.

@@ -62,12 +70,13 @@ func DbSyncJob(instance *manilav1.Manila, labels map[string]string, annotations
},
}

-args := []string{"-c", DBSyncCommand}
+delayCommand := fmt.Sprintf("sleep %d", delay)
+args := []string{"-c", fmt.Sprintf("%s && %s", delayCommand, jobCommand)}
fmount (Collaborator Author)

@gouthampacha the resulting command is the concatenation of the delayCommand and the jobCommand.
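
A tiny worked example of that concatenation (all values are hypothetical: the 120-second delay mirrors the ManilaServiceCleanupDelay constant above, and the manila-manage invocation is assumed for illustration):

```go
package main

import "fmt"

func main() {
	// Hypothetical values: delay mirrors ManilaServiceCleanupDelay (120s);
	// jobCommand stands in for whatever command the Job actually runs.
	delay := int32(120)
	jobCommand := "manila-manage service cleanup"
	delayCommand := fmt.Sprintf("sleep %d", delay)
	args := []string{"-c", fmt.Sprintf("%s && %s", delayCommand, jobCommand)}
	// The container entrypoint then runs: sleep 120 && manila-manage service cleanup
	fmt.Println(args[1])
}
```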

@fmount
Collaborator Author

fmount commented Oct 4, 2024

@gouthampacha @silvacarloss I ran a few more local tests related to the need to delay the job execution; here is a summary.

## Check Initial deploy

We have two shares: share0 connected to native_cephfs, and share1 backed by ceph-nfs.

sh-5.1$ openstack share service list
+----+------------------+---------------------+------+---------+-------+----------------------------+
| ID | Binary           | Host                | Zone | Status  | State | Updated At                 |
+----+------------------+---------------------+------+---------+-------+----------------------------+
|  1 | manila-scheduler | hostgroup           | nova | enabled | up    | 2024-10-04T11:41:41.814368 |
|  3 | manila-share     | hostgroup@cephfsnfs | nova | enabled | up    | 2024-10-04T11:41:48.227259 |
|  4 | manila-share     | hostgroup@cephfs    | nova | enabled | up    | 2024-10-04T11:41:43.552743 |
+----+------------------+---------------------+------+---------+-------+----------------------------+


sh-5.1$ openstack share pool list
+----------------------------+-----------+-----------+--------+
| Name                       | Host      | Backend   | Pool   |
+----------------------------+-----------+-----------+--------+
| hostgroup@cephfsnfs#cephfs | hostgroup | cephfsnfs | cephfs |
| hostgroup@cephfs#cephfs    | hostgroup | cephfs    | cephfs |
+----------------------------+-----------+-----------+--------+

We clean up share0, removing the native cephfs share backend.

## Pods (manila-share-share0 is not there anymore)

manila-api-0                                                   2/2     Running     0          20m
manila-db-sync-k5skw                                           0/1     Completed   0          4m3s
manila-scheduler-0                                             2/2     Running     0          9m55s
manila-service-cleanup-n5b5h655-9f7kl                          1/1     Running     0          9s
manila-share-share1-0                                          2/2     Running     0          3m47s

## Cleanup job

The job starts and waits for 120 seconds before running the command:
[stack@osp-storage-04 ~]$ oc logs -f manila-service-cleanup-n5b5h655-9f7kl
Cleaned up service hostgroup@cephfs
## Check services

$ openstack share service list                                                                                                                                                                                                                                            

+----+------------------+---------------------+------+---------+-------+----------------------------+
| ID | Binary           | Host                | Zone | Status  | State | Updated At                 |
+----+------------------+---------------------+------+---------+-------+----------------------------+
|  1 | manila-scheduler | hostgroup           | nova | enabled | up    | 2024-10-04T11:46:21.871310 |
|  3 | manila-share     | hostgroup@cephfsnfs | nova | enabled | up    | 2024-10-04T11:46:28.299363 |
+----+------------------+---------------------+------+---------+-------+----------------------------+

The previous "manila-share: hostgroup@cephfs" entry is not there anymore.

## Cleanup pool list

Due to an existing bug related to cached data, we need to restart manila-scheduler.

$ oc rollout restart statefulset manila-scheduler

$ openstack share pool list
+----------------------------+-----------+-----------+--------+
| Name                       | Host      | Backend   | Pool   |
+----------------------------+-----------+-----------+--------+
| hostgroup@cephfsnfs#cephfs | hostgroup | cephfsnfs | cephfs |
+----------------------------+-----------+-----------+--------+

With the current behavior I think we're good to go. WRT documentation, I think we need to mention that we might need to restart the scheduler. @gouthampacha do we have any upstream bug or Jira to fix the scheduler?

@fmount
Collaborator Author

fmount commented Oct 4, 2024

/test functional


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/a6dbd5c6a6e94ba69592d7a48f9c342e

openstack-k8s-operators-content-provider FAILURE in 8m 59s
⚠️ manila-operator-kuttl SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider (non-voting)
⚠️ manila-operator-tempest SKIPPED Skipped due to failed job openstack-k8s-operators-content-provider (non-voting)

If the share cleanup job is executed right after the share deletion, it is
possible that the other manila components have not yet been notified about the
event, and the job then has no effect. This patch introduces a parameter for
the ManilaJob that delays the command. A constant has been defined to properly
identify the delay time associated with the service cleanup.

Signed-off-by: Francesco Pantano <[email protected]>
@fmount
Collaborator Author

fmount commented Oct 4, 2024

/test functional

@gouthampacha
Contributor

> With the current behavior I think we're good to go. WRT documentation, I think we need to mention that we might need to restart the scheduler. @gouthampacha do we have any upstream bug or Jira to fix the scheduler?

Agreed; as a known issue: https://bugs.launchpad.net/manila/+bug/1915094

@gouthampacha
Contributor

/lgtm

Thank you @fmount

@openshift-ci openshift-ci bot added the lgtm label Oct 7, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit e746e34 into openstack-k8s-operators:main Oct 7, 2024
9 checks passed