Nova services adoption (no extra cell) #176
Conversation
Merge Failed. This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
done
Force-pushed from 7a45acc to d24d826
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/dcbaff645f6b4dff89affddb666903fd ❌ data-plane-adoption-github-rdo-centos-9-crc-single-node FAILURE in 1h 07m 15s
FWIW, if you'd like to merge this in multiple chunks, e.g. first ensuring that the control plane comes up fine and data plane Ansible executes successfully, and then have another story tracking "Make sure the workload survives undamaged", I think that would be fine too.
Force-pushed from 9f9be89 to 222e02e
I'll do my best to split this into commits.
This is a single unit of work according to the Nova team feedback. I can split this PR into commits for simplicity of reviewing it, but cannot split Jira stories. FFU is the only target state we agreed to accept; we cannot stop in intermediate states.
An update: this works now for my testing env. I'm going to respin it from the beginning just to confirm I didn't break the mariadb-related checks (I had to move them to the pull openstack configuration steps, before we stop TripleO services). Then I will switch to recomposition of commits without introducing functional changes.
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/4b4a338d4ee6472b9e6901ba8c5d7969 ❌ data-plane-adoption-github-rdo-centos-9-crc-single-node RETRY_LIMIT in 7m 50s
"a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character " ref: openstack-k8s-operators/data-plane-adoption#176 (comment) When Label wasn't provided it was breaking the AEE deploy Signed-off-by: Fabricio Aguiar <[email protected]>
First split (just quick and dirty Nova adoption) done: #191. @jistr @SeanMooney @GIBI PTAL
Build failed (check pipeline). Post https://review.rdoproject.org/zuul/buildset/465fc5da92ec425f916cdc8986a86100 ❌ data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 1h 30m 48s
```bash
oc exec -it mariadb-openstack-cell1 -- mysql --user=root --password=${PODIFIED_DB_ROOT_PASSWORD} \
  -e "select a.version from nova_cell1.services a join nova_cell1.services b where a.version!=b.version and a.binary='nova-compute';"
```
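As a side note, a hedged sketch (assuming the same pod name and credentials as in the query above) that turns this into a pass/fail check: an empty result means every nova-compute reports the same service version.

```bash
# Sketch only: fail if any two nova-compute services report different versions.
# -it is dropped so the captured output is clean for scripting; -sN suppresses
# column headers, so an empty string means "no mismatches".
MISMATCH=$(oc exec mariadb-openstack-cell1 -- mysql --user=root --password=${PODIFIED_DB_ROOT_PASSWORD} -sN \
  -e "select a.version from nova_cell1.services a join nova_cell1.services b where a.version!=b.version and a.binary='nova-compute';")
if [ -n "$MISMATCH" ]; then
  echo "ERROR: nova-compute service version mismatch detected" >&2
  exit 1
fi
```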
At this point you can observe that the compute was able to report status to the new control plane, so the service is now UP:
[gibi@osp-dev-01 ~]$ openstack compute service list
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
| ID | Binary | Host | Zone | Status | State | Updated At |
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
| 0954e21c-9022-4718-a570-a7d3eb0fd79f | nova-conductor | nova-cell0-conductor-0 | internal | enabled | up | 2023-11-09T09:51:43.000000 |
| c3ad2def-18ed-49e5-8af4-a7c1a0840171 | nova-scheduler | nova-scheduler-0 | internal | enabled | up | 2023-11-09T09:51:39.000000 |
| a7a20d50-b85d-4321-a576-5a12fea9bc8f | nova-compute | standalone.localdomain | nova | enabled | up | 2023-11-09T09:51:41.000000 |
| 9eb053f9-a404-4b02-92a8-d2a5fe339849 | nova-conductor | nova-cell1-conductor-0 | internal | enabled | up | 2023-11-09T09:51:45.000000 |
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
[gibi@osp-dev-01 ~]$ openstack hypervisor list
+--------------------------------------+------------------------+-----------------+-----------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP | State |
+--------------------------------------+------------------------+-----------------+-----------------+-------+
| d3d2be51-a0b9-4538-a298-62280a52fece | standalone.localdomain | QEMU | 192.168.122.100 | up |
+--------------------------------------+------------------------+-----------------+-----------------+-------+
I don't get this, sorry. What is expected to change along these lines?
I mean you can add a check here that shows the compute is UP from the nova-api perspective.
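For instance, a minimal sketch of such a check (assuming the `openstack` client is pointed at the new control plane and the compute host is `standalone.localdomain`, as in the output above):

```bash
# Assert that the adopted nova-compute is reported as "up" by the new nova-api.
openstack compute service list --service nova-compute -f value -c Host -c State \
  | grep -qE '^standalone\.localdomain up$' \
  || { echo "nova-compute on standalone.localdomain is not up" >&2; exit 1; }
```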
* Verify that Nova services control the existing VM instance:

```bash
openstack server list | grep -qF '| test | ACTIVE |' && openstack server stop test
```
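A slightly fuller sketch of this verification (hypothetical; it assumes the pre-created workload is named `test`), which also waits for the stop to take effect and then powers the instance back on:

```bash
# Stop the pre-existing instance, wait until nova reports it SHUTOFF,
# then start it again to confirm the adopted control plane drives the compute.
openstack server stop test
for i in $(seq 1 30); do
  STATUS=$(openstack server show test -f value -c status)
  [ "$STATUS" = "SHUTOFF" ] && break
  sleep 5
done
[ "$STATUS" = "SHUTOFF" ] || { echo "instance 'test' did not reach SHUTOFF" >&2; exit 1; }
openstack server start test
```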
Something is wrong, as at this point nova-compute produces a stack trace:
2023-11-09 09:55:54.826 2 DEBUG oslo_concurrency.lockutils [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Lock "4191c6c5-7c94-4715-ab88-64b27a7ad2c6" "released" by "nova.compute.manager.ComputeManager.stop_instance.<locals>.do_stop_instance" :: held 3.115s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:423
2023-11-09 09:55:54.978 2 DEBUG oslo_concurrency.lockutils [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Acquiring lock "compute_resources" by "nova.compute.resource_tracker.ResourceTracker.update_usage" inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:404
2023-11-09 09:55:54.980 2 DEBUG oslo_concurrency.lockutils [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker.update_usage" :: waited 0.002s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:409
2023-11-09 09:55:55.070 2 DEBUG nova.compute.provider_tree [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Inventory has not changed in ProviderTree for provider: d3d2be51-a0b9-4538-a298-62280a52fece update_inventory /usr/lib/python3.9/site-packages/nova/compute/provider_tree.py:180
2023-11-09 09:55:55.098 2 DEBUG nova.scheduler.client.report [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Inventory has not changed for provider d3d2be51-a0b9-4538-a298-62280a52fece based on inventory data: {'VCPU': {'total': 8, 'reserved': 0, 'min_unit': 1, 'max_unit': 8, 'step_size': 1, 'allocation_ratio': 16.0}, 'MEMORY_MB': {'total': 19744, 'reserved': 512, 'min_unit': 1, 'max_unit': 19744, 'step_size': 1, 'allocation_ratio': 1.0}, 'DISK_GB': {'total': 69, 'reserved': 1, 'min_unit': 1, 'max_unit': 69, 'step_size': 1, 'allocation_ratio': 1.0}} set_inventory_for_provider /usr/lib/python3.9/site-packages/nova/scheduler/client/report.py:940
2023-11-09 09:55:55.103 2 DEBUG oslo_concurrency.lockutils [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Lock "compute_resources" "released" by "nova.compute.resource_tracker.ResourceTracker.update_usage" :: held 0.123s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:423
2023-11-09 09:55:55.104 2 INFO nova.compute.manager [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] [instance: 4191c6c5-7c94-4715-ab88-64b27a7ad2c6] Successfully reverted task state from powering-off on failure for instance.
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Exception during message handling: nova.exception.InstanceNotFound: Instance 4191c6c5-7c94-4715-ab88-64b27a7ad2c6 could not be found.
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/host.py", line 690, in _get_domain
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server return conn.lookupByUUIDString(instance.uuid)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 193, in doit
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server result = proxy_call(self._autowrap, f, *args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 151, in proxy_call
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server rv = execute(f, *args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 132, in execute
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server six.reraise(c, e, tb)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/six.py", line 709, in reraise
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server raise value
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 86, in tworker
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server rv = meth(*args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib64/python3.9/site-packages/libvirt.py", line 5008, in lookupByUUIDString
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server raise libvirtError('virDomainLookupByUUIDString() failed')
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server libvirt.libvirtError: Domain not found: no domain with matching uuid '4191c6c5-7c94-4715-ab88-64b27a7ad2c6'
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/exception_wrapper.py", line 71, in wrapped
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server _emit_versioned_exception_notification(
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server self.force_reraise()
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server raise self.value
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/exception_wrapper.py", line 63, in wrapped
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 186, in decorated_function
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server LOG.warning("Failed to revert task state for instance. "
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server self.force_reraise()
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server raise self.value
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 157, in decorated_function
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/compute/utils.py", line 1439, in decorated_function
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 203, in decorated_function
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 3381, in stop_instance
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server do_stop_instance()
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 414, in inner
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server return f(*args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 3369, in do_stop_instance
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server self._power_off_instance(instance, clean_shutdown)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 3076, in _power_off_instance
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server self.driver.power_off(instance, timeout, retry_interval)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 4099, in power_off
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server self._clean_shutdown(instance, timeout, retry_interval)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 4059, in _clean_shutdown
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server guest = self._host.get_guest(instance)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/host.py", line 674, in get_guest
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server return libvirt_guest.Guest(self._get_domain(instance))
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/host.py", line 694, in _get_domain
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server raise exception.InstanceNotFound(instance_id=instance.uuid)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server nova.exception.InstanceNotFound: Instance 4191c6c5-7c94-4715-ab88-64b27a7ad2c6 could not be found.
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server
2023-11-09 09:55:56.053 2 DEBUG ovsdbapp.backend.ovs_idl.vlog [-] [POLLIN] on fd 19 __log_wakeup /usr/lib64/python3.9/site-packages/ovs/poller.py:263
Then I cannot start the instance up again:
[gibi@osp-dev-01 ~]$ openstack server start test
Cannot 'start' instance 4191c6c5-7c94-4715-ab88-64b27a7ad2c6 while it is in vm_state active (HTTP 409) (Request-ID: req-eb72c115-a9da-4e47-acbc-2457ddbf7607)
command terminated with exit code 1
this is weird, I haven't observed that during my testing :)
I need to go and reproduce it. I have a feeling that the cleanup of old libvirt services was incomplete in my case.
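One rough way to check that suspicion (a sketch only; it assumes the old compute services are still managed as `tripleo_*` systemd units, the TripleO standalone convention, and that the compute host is reachable as `standalone.localdomain`):

```bash
# List any leftover TripleO-managed nova/libvirt units on the compute host;
# after adoption none of them should still be loaded or active.
ssh root@standalone.localdomain \
  "systemctl list-units --all --no-legend 'tripleo_nova*' 'tripleo_*libvirt*'"
```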
Please move review to the PRs split out of this one.
First split (just quick and dirty Nova adoption) done: #191
FFU split done: #192 - the new commit on top is cf59540
Pre/post checks changes extracted here: #193 - commit 2cc55f3
Note about remapping cell DB names from OSP cells naming scheme
to the NG scheme with the superconductor layout.
Add a step to rename default cell as cell1, and to delete stale
Nova services records from cell1 DB during initial databases import,
to properly transition it into a superconductor layout later on.
Adjust minor gaps in the dependencies adoption docs (Placement,
Nova cells DB, OVN etc.)
Address the switch for service overrides spec instead of
externalEndpoints, where it is missing on the path to Nova adoption.
Remove Nova Metadata secret creation workarounds from the EDPM
adoption docs and test suites.
Provide workaround for renaming 'default' cell's DB during adoption.
Add test suites for Nova CP services adoption.
Update EDPM adoption docs and tests to execute Nova compute post-FFU.
Add missing nova and libvirt services for the edpm adoption tests.
Verify no dataplane disruptions during the adoption and upgrade
process.
Verify Nova services still control pre-created VM workload after
FFU/adoption is done.
Update and fix the composition of the services pre-check list to
execute it before stopping services.
Update and fix the composition of the list of the services to be
stopped (cannot pull data from stopped services).
Stop Nova services in stop_openstack_services instead of edpm_adoption
(that was too late to do that).
Get services topology specific configuration in
pull_openstack_configuration. Add missing role for that as well.
Also note about cleaning up delorean repos for tripleo standalone dev
env.
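To make the "delete stale Nova services records from cell1 DB during initial databases import" step above more concrete, here is a hedged sketch (the pod and database names mirror the earlier query and are assumptions, not the exact commands from the docs): once the old cell database has been imported as `nova_cell1`, only the `nova-compute` service records need to be kept, since conductor, scheduler and the other control-plane services are recreated by the new deployment.

```bash
# Sketch only: drop obsolete control-plane service records from the imported
# cell database, leaving just the nova-compute rows for the adopted hosts.
oc exec mariadb-openstack-cell1 -- mysql --user=root --password=${PODIFIED_DB_ROOT_PASSWORD} \
  -e "delete from nova_cell1.services where services.binary != 'nova-compute';"
```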