Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the steps to reboot the computes after update. #2587

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sathlan
Copy link
Contributor

@sathlan sathlan commented Dec 5, 2024

This sequence implements reboot of the compute nodes after the update.

We have one instance created. If the hypervisor being rebooted has
the instance that instance will be live-migrated to another hypervisor
before the reboot and migrated back to that original hypervisor after
the reboot.

Some basic sanity checks are performed after the reboot and before the
migration back to ensure that the necessary services are up and
running.

During the reboot we start two scripts. One monitors and log the
reboot of the hypervisors. The other log where the instance is
currently running. The log files can be found in
~/ci-framework-data/tests/update/ in monitor_servers.log and
monitor_vm_placement.log respectively.

A note about node evacuation. We are still using node evaction from
the nova cli. This command has not been ported to the openstack
cli. There's a discussion about it on launchpad.

The official documentation mention only the live-migration path, but
as we also use the live-migration in the test sequence that part is
covered. We still expect customer to use the nova cli as it's way
more user friendly and is still currently working.

Closes: https://issues.redhat.com/browse/OSPRH-8937

@github-actions github-actions bot marked this pull request as draft December 5, 2024 10:25
Copy link

github-actions bot commented Dec 5, 2024

Thanks for the PR! ❤️
I'm marking it as a draft, once your happy with it merging and the PR is passing CI, click the "Ready for review" button below.

@sathlan
Copy link
Contributor Author

sathlan commented Dec 5, 2024

Current tested with ping test running in the background and found not loss of connectivity.

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/4f12a60d8dc64dad92d4f7b5f5bec990

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 40m 14s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 19m 06s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 28m 19s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 44s
cifmw-pod-pre-commit FAILURE in 8m 14s
✔️ build-push-container-cifmw-client SUCCESS in 21m 46s
cifmw-molecule-update FAILURE in 5m 04s

Copy link
Contributor

openshift-ci bot commented Dec 16, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign cescgina for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/bf06493eaeea430c898bf25520dfdd04

✔️ openstack-k8s-operators-content-provider SUCCESS in 1h 44m 32s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 15m 44s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 31m 55s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 7m 45s
cifmw-pod-pre-commit FAILURE in 7m 28s
✔️ build-push-container-cifmw-client SUCCESS in 21m 20s
cifmw-molecule-update FAILURE in 5m 20s

Copy link

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/caf50b2e762a4aaeb3326c9399dffd15

✔️ openstack-k8s-operators-content-provider SUCCESS in 2h 21m 02s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 18m 47s
✔️ cifmw-crc-podified-edpm-baremetal SUCCESS in 1h 28m 48s
✔️ noop SUCCESS in 0s
✔️ cifmw-pod-ansible-test SUCCESS in 8m 43s
cifmw-pod-pre-commit FAILURE in 7m 37s
✔️ build-push-container-cifmw-client SUCCESS in 36m 42s
cifmw-molecule-update FAILURE in 4m 29s

@sathlan sathlan force-pushed the update-reboot branch 11 times, most recently from c056d3c to 85e4367 Compare December 18, 2024 14:14
@sathlan sathlan added enhancement New feature or request and removed do-not-merge/work-in-progress labels Dec 19, 2024
@sathlan sathlan marked this pull request as ready for review December 19, 2024 08:45
@sathlan sathlan requested a review from a team as a code owner December 19, 2024 08:45
PATH: "{{ cifmw_path | default(ansible_env.PATH) }}"
ansible.builtin.command: >-
{{ cifmw_update_oc_cmd_prefix }}
get openstackdataplanedeployment {{ cifmw_reboot_dep_name }}
Copy link
Contributor

@ciecierski ciecierski Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
get openstackdataplanedeployment {{ cifmw_reboot_dep_name }}
wait openstackdataplanedeployment {{ cifmw_reboot_dep_name }}
--for=condition=ready
--timeout={{ cifmw_update_timeout_reboot }}m

With oc wait ansible log is more readable, as there no retires logged to output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch. Done.

edpm_reboot_strategy: force
ansibleLimit: {{ cifmw_update_hypervisor_short_name }}

- name: Apply the OpenStackDataPlaneDeployment CR to trigger a reboot
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- name: Apply the OpenStackDataPlaneDeployment CR to trigger a reboot
- name: Create the OpenStackDataPlaneDeployment CR to trigger a reboot

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@ciecierski ciecierski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one comment about waiting for reboot to complete

Copy link

github-actions bot commented Jan 4, 2025

This PR is stale because it has been for over 15 days with no activity.
Remove stale label or comment or this will be closed in 7 days.

@sathlan
Copy link
Contributor Author

sathlan commented Jan 9, 2025

/retest

This sequence implements reboot of the compute nodes after the update.

We have one instance created.  If the hypervisor being rebooted has
the instance that instance will be live-migrated to another hypervisor
before the reboot and migrated back to that original hypervisor after
the reboot.

Some basic sanity checks are performed after the reboot and before the
migration back to ensure that the necessary services are up and
running.

During the reboot we start two scripts. One monitors and log the
reboot of the hypervisors.  The other log where the instance is
currently running.  The log files can be found in
`~/ci-framework-data/tests/update/` in `monitor_servers.log` and
`monitor_vm_placement.log` respectively.

A note about node evacuation.  We are still using node evaction from
the nova cli.  This command has not been ported to the openstack
cli. There's a discussion about it [on launchpad](https://bugs.launchpad.net/python-openstackclient/+bug/2055552).

The official documentation mention only the live-migration path, but
as we also use the live-migration in the test sequence that part is
covered.  We still expect customer to use the nova cli as it's way
more user friendly and is still currently working.

Closes: https://issues.redhat.com/browse/OSPRH-8937
@jistr
Copy link
Contributor

jistr commented Jan 9, 2025

/lgtm

Copy link
Collaborator

@pablintino pablintino left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some minor suggestions.

- name: Define command for OpenStack client interactions
ansible.builtin.set_fact:
cifmw_update_openstack_cmd: >-
oc rsh -n {{ cifmw_update_namespace }} openstackclient openstack
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be a variable, not a default in the vars/main.yaml file.

set -o pipefail;
{{ cifmw_update_openstack_cmd }} volume service list -f json |
jq -r -c '.[] | select(.Binary | contains("cinder-volume")) | .Host'
register: storage_backend
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you already added changed_when: false to some tasks I guess it will be good to add it here too.


- name: Construct date string for the Reboot CR name
ansible.builtin.set_fact:
cifmw_update_cr_date: "{{ lookup('pipe', 'date +%Y%m%d%H%S') }}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's used one is usually less confusing to place it in the vars section of the task that uses it.

---
- name: Define command prefix for OpenShift operations
ansible.builtin.set_fact:
cifmw_update_oc_cmd_prefix: "oc -n {{ cifmw_update_namespace }}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another one that can be an internal var.

- name: Create the OpenStackDataPlaneDeployment CR used for reboot
ansible.builtin.copy:
dest: "{{ cifmw_update_artifacts_basedir }}/{{ cifmw_reboot_dep_name }}.yaml"
content: |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building a yaml (or json) with plain string manipulation is asking for trouble. Why not using to_nice yaml like this?

vars:
  _content:
      apiVersion: dataplane.openstack.org/v1beta1
      kind: OpenStackDataPlaneDeployment
      metadata:
        name: "{{ cifmw_reboot_dep_name }}"
        namespace: "{{ cifmw_update_namespace }}"
      spec:
        nodeSets: "{{ cifmw_update_node_sets.stdout | split('\n') }}"
        servicesOverride:
        - reboot-os
        ansibleExtraVars:
          edpm_reboot_strategy: force
        ansibleLimit: {{ cifmw_update_hypervisor_short_name }}
ansible.builtin.copy:
    dest: "{{ cifmw_update_artifacts_basedir }}/{{ cifmw_reboot_dep_name }}.yaml"
    content: "{{ _content | to_nice_yaml }}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants