Nova adoption ffu (no extra cell) #192

Merged
merged 1 commit into from
Dec 4, 2023

Conversation

bogdando
Contributor

@bogdando bogdando commented Nov 6, 2023

Extracted the Nova services FFU step from #176.

Update EDPM adoption docs and tests to execute Nova compute post-FFU.

For that, deploy an additional nova-compute-ffu EDPM service and
patch openstack control plane CR for nova services.
Because of the different lifecycle management tooling used for both
actions, orchestrate FFU w/o a lock-step between nova compute EDPM
and podified control plane services.

Jira: OSPRH-338

Depends-on: https://review.rdoproject.org/r/c/rdo-jobs/+/50917


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/b03024da30604a15b1a498c1491770fc

data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 1h 36m 39s

@bogdando bogdando force-pushed the nova_adoption_ffu branch 2 times, most recently from 1dcded2 to d6baec4 Compare November 8, 2023 16:37

https://review.rdoproject.org/zuul/buildset/473fd3e628504e53ba1dff4e0408cf1f

data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 2h 01m 27s

@bogdando bogdando force-pushed the nova_adoption_ffu branch 5 times, most recently from 3ba8a12 to 3b256aa Compare November 13, 2023 15:31

https://review.rdoproject.org/zuul/buildset/375347d7b9ce40e48c615ac4c6d53b02

data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 2h 02m 23s

@bogdando bogdando force-pushed the nova_adoption_ffu branch 2 times, most recently from 626671d to 01af10f Compare November 14, 2023 17:00
@bogdando
Contributor Author

this has been tested on my env, let's fix CI...


https://review.rdoproject.org/zuul/buildset/a6888a92072c4e3595c31b7cb9da9a14

data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 2h 21m 12s


https://review.rdoproject.org/zuul/buildset/02bd8c038bb94cefa57fee73bcd432a9

data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 2h 03m 56s

tests/roles/dataplane_adoption/tasks/nova_ffu.yaml (outdated)
docs/openstack/edpm_adoption.md (outdated)
docs/openstack/edpm_adoption.md (outdated)
tests/roles/dataplane_adoption/tasks/main.yaml (outdated)

https://review.rdoproject.org/zuul/buildset/836fbc4237bd4aadbaa4a202b402eb41

data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 2h 33m 34s

@gibizer
Contributor

gibizer commented Nov 20, 2023

The adopted compute did not report a status, as the compute version is still Wallaby.

fatal: [localhost]: FAILED! => {"attempts": 20, "changed": true, "cmd": "set -euxo pipefail\n\n\nPODIFIED_DB_ROOT_PASSWORD=\"12345678\"\n\noc rsh mariadb-openstack-cell1 mysql --user=root --password=${PODIFIED_DB_ROOT_PASSWORD}  -e \"select a.version from nova_cell1.services a join nova_cell1.services b where a.version!=b.version and a.binary='nova-compute';\"\n", "delta": "0:00:00.251873", "end": "2023-11-17 14:51:43.574921", "msg": "", "rc": 0, "start": "2023-11-17 14:51:43.323048", "stderr": "+ PODIFIED_DB_ROOT_PASSWORD=12345678\n+ oc rsh mariadb-openstack-cell1 mysql --user=root --password=12345678 -e 'select a.version from nova_cell1.services a join nova_cell1.services b where a.version!=b.version and a.binary='\\''nova-compute'\\'';'", "stderr_lines": ["+ PODIFIED_DB_ROOT_PASSWORD=12345678", "+ oc rsh mariadb-openstack-cell1 mysql --user=root --password=12345678 -e 'select a.version from nova_cell1.services a join nova_cell1.services b where a.version!=b.version and a.binary='\\''nova-compute'\\'';'"], "stdout": "version\n56", "stdout_lines": ["version", "56"]}

https://review.rdoproject.org/zuul/build/df8d241f458747ea897046524cdf807e/log/controller/data-plane-adoption-tests-repo/data-plane-adoption/tests/logs/test_with_ceph_out_2023-11-17T14:15:42EST.log

I don't know where the nova-compute logs are in CI during the adoption testing. :/ So I only have indirect checks.

The DataPlaneService/nova looks good to me: https://logserver.rdoproject.org/92/192/67fec82340574efbfb56d234e7bf3a680888939f/github-check/data-plane-adoption-github-rdo-centos-9-extracted-crc/df8d241/controller/ci-framework-data/logs/quay-io-openstack-k8s-operators-openstack-must-gather-sha256-0812f031e363406238d47a3ba3cfb33412e9a2d143d2b4b9365c9796b80bb8aa/namespaces/openstack/crs/openstackdataplaneservices.dataplane.openstack.org/nova-compute-extraconfig.yaml

The execution log of the nova edpm role also looks clean: https://logserver.rdoproject.org/92/192/67fec82340574efbfb56d234e7bf3a680888939f/github-check/data-plane-adoption-github-rdo-centos-9-extracted-crc/df8d241/controller/ci-framework-data/logs/quay-io-openstack-k8s-operators-openstack-must-gather-sha256-0812f031e363406238d47a3ba3cfb33412e9a2d143d2b4b9365c9796b80bb8aa/namespaces/openstack/pods/nova.compute.extraconfig-openstack-tgdjr/logs/openstackansibleee.log

Without the compute logs it is hard to tell why the compute does not come up :/
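For readers following the thread, the version check in the failing task can be reproduced offline. The sketch below is only an illustration, assuming a stripped-down `services` table in an in-memory SQLite database instead of the real `nova_cell1` MariaDB schema. The helper names and the version value 66 are hypothetical; 56 is the value from the CI output above.

```python
import sqlite3

# Same shape as the query in the failing task. "binary" is quoted only to be
# safe in SQLite; the MariaDB query in the log leaves it unquoted.
VERSION_MISMATCH_QUERY = """
select a.version from services a join services b
where a.version != b.version and a."binary" = 'nova-compute'
"""


def make_services_table(rows):
    """Build an in-memory stand-in for nova_cell1.services (hypothetical, simplified)."""
    conn = sqlite3.connect(":memory:")
    conn.execute('create table services (host text, "binary" text, version integer)')
    conn.executemany("insert into services values (?, ?, ?)", rows)
    return conn


def mismatched_compute_versions(conn):
    """Return the distinct nova-compute versions that differ from any other service row."""
    return sorted({row[0] for row in conn.execute(VERSION_MISMATCH_QUERY)})


if __name__ == "__main__":
    not_upgraded = make_services_table([
        ("compute-0", "nova-compute", 56),      # value seen in the CI output
        ("conductor-0", "nova-conductor", 66),  # hypothetical newer version
    ])
    print(mismatched_compute_versions(not_upgraded))

    upgraded = make_services_table([
        ("compute-0", "nova-compute", 66),
        ("conductor-0", "nova-conductor", 66),
    ])
    print(mismatched_compute_versions(upgraded))
```

The query returns a row for as long as any nova-compute record carries a version different from some other service record, so an empty result is what the test waits for.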

@gibizer
Contributor

gibizer commented Nov 20, 2023

recheck
maybe it was just slow?

@fao89
Contributor

fao89 commented Nov 20, 2023

@gibizer
Contributor

gibizer commented Nov 20, 2023

@gibizer this is what we have for the logs so far: https://github.com/rdo-infra/rdo-jobs/blob/master/playbooks/data_plane_adoption/collect_logs_crc.yaml#L6

In the case of a greenfield job, ci-framework collects a bunch of logs from the computes. It would be good to reuse that logic somehow here in the adoption jobs.


https://review.rdoproject.org/zuul/buildset/a67f04155be7487d83ab6966f60165db

data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 38m 16s

@gibizer
Contributor

gibizer commented Nov 27, 2023

recheck PS2

@gibizer
Contributor

gibizer commented Nov 27, 2023

recheck

@gibizer
Contributor

gibizer commented Nov 27, 2023

recheck PS2


https://review.rdoproject.org/zuul/buildset/19bff67041a54aec92db93b6edba241f

data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 40m 51s

@gibizer
Contributor

gibizer commented Nov 27, 2023

recheck ps3


https://review.rdoproject.org/zuul/buildset/1f6eba7722c64b7d938462bd36b211b9

data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 41m 21s

@gibizer
Contributor

gibizer commented Nov 28, 2023

recheck ps5


https://review.rdoproject.org/zuul/buildset/c5afcfcf9d6f4683883dc742d95000c2

data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 2h 16m 59s


This change depends on a change that failed to merge.

Change https://review.rdoproject.org/r/c/rdo-jobs/+/50917 is needed.

@bogdando
Contributor Author

recheck


https://review.rdoproject.org/zuul/buildset/47c5b94f603744e782dc849df3b27702

data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 2h 05m 01s

@bogdando
Contributor Author

recheck


https://review.rdoproject.org/zuul/buildset/ad32854f5dac47dda1d1c1f197a2485a

data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 2h 26m 58s

@gibizer
Contributor

gibizer commented Dec 1, 2023

recheck we saw https://review.rdoproject.org/r/c/rdo-jobs/+/50917 make this PR pass CI.

Update EDPM adoption docs and tests to execute Nova compute post-FFU.

For that, deploy an additional nova-compute-ffu EDPM service and
patch openstack control plane CR for nova services.
Because of the different lifecycle management tooling used for both
actions, orchestrate FFU w/o a lock-step between nova compute EDPM
and podified control plane services.

Signed-off-by: Bohdan Dobrelia <[email protected]>

This change depends on a change that failed to merge.

Change https://review.rdoproject.org/r/c/rdo-jobs/+/50917 is needed.

@gibizer
Contributor

gibizer commented Dec 1, 2023

recheck rebased the dependency

@jistr jistr merged commit a7dbd5c into openstack-k8s-operators:main Dec 4, 2023
2 checks passed
{{ mariadb_copy_shell_vars }}
oc rsh mariadb-openstack-cell1 mysql --user=root --password=${PODIFIED_DB_ROOT_PASSWORD} \
-e "select a.version from nova_cell1.services a join nova_cell1.services b where a.version!=b.version and a.binary='nova-compute';"
register: records_check_results
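The excerpt above only shows the query and the `register` line; the retry settings of the task are not visible here, although the CI failure earlier in the thread reports `"attempts": 20`. As a rough sketch of that kind of "retry until the query returns nothing" loop, assuming a hypothetical `check` callable that returns the rows of the mismatch query, the polling could look like this:

```python
import time
from typing import Callable, Sequence


def wait_until_empty(check: Callable[[], Sequence], attempts: int = 20, delay: float = 1.0) -> bool:
    """Call check() up to `attempts` times and succeed once it returns no rows.

    `attempts=20` mirrors the count in the CI output; `delay` is an assumed value.
    """
    for attempt in range(1, attempts + 1):
        if not check():
            return True
        if attempt < attempts:
            time.sleep(delay)  # give the compute service time to report its new version
    return False


if __name__ == "__main__":
    # Simulated query results: two polls still see the old version, the third is clean.
    simulated = iter([[56], [56], []])
    print(wait_until_empty(lambda: next(simulated), attempts=5, delay=0))
```

A loop like this lets the control plane patch and the EDPM deployment proceed independently, with the test only asserting that the versions eventually converge.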
Contributor

new job is failing here https://logserver.rdoproject.org/53/50653/11/check/data-plane-adoption-OSP-17-to-extracted-crc/1a62ea3/controller/data-plane-adoption-tests-repo/data-plane-adoption/tests/logs/test_with_ceph_out_2023-12-04T05:20:41EST.log

https://logserver.rdoproject.org/53/50653/11/check/data-plane-adoption-OSP-17-to-extracted-crc/1a62ea3/controller/pod/nova-cell1-conductor-0-logs.txt

2023-12-04 10:32:38.433 1 DEBUG oslo_db.sqlalchemy.engines [None req-76342adf-c98e-4a74-91f9-71e7e86e87f2 - - - - - -] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/lib/python3.9/site-packages/oslo_db/sqlalchemy/engines.py:335
2023-12-04 10:32:38.434 1 ERROR nova.context [None req-76342adf-c98e-4a74-91f9-71e7e86e87f2 - - - - - -] Error gathering result from cell 292fe7d7-f10c-4546-876d-753875e67b77: sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1045, "Access denied for user 'nova_cell1'@'192.168.122.100' (using password: YES)")
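
When scanning many pod logs for this failure, it can help to pull the error code, user, and client host out of the message, since MariaDB matches grants against the `user@host` pair. The snippet below is a small sketch for that; the function name is hypothetical and the pattern is written only against the message format shown in the log line above (1045 is the MySQL/MariaDB "access denied" error code).

```python
import re

# Matches the pymysql error tuple as it appears in the conductor log above.
ACCESS_DENIED = re.compile(
    r"""\((?P<code>\d+), "Access denied for user '(?P<user>[^']+)'@'(?P<host>[^']+)'"""
)


def parse_access_denied(line):
    """Return code, user, and client host from an access-denied log line, or None."""
    match = ACCESS_DENIED.search(line)
    if match is None:
        return None
    return {
        "code": int(match.group("code")),
        "user": match.group("user"),
        "host": match.group("host"),
    }


if __name__ == "__main__":
    sample = (
        "sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) "
        "(1045, \"Access denied for user 'nova_cell1'@'192.168.122.100' (using password: YES)\")"
    )
    print(parse_access_denied(sample))
```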

Contributor

The conductor log shows that the issue occurs earlier than the compute adoption. So the new k8s control plane has a wrong or incomplete DB setup, as the conductor cannot talk to its DB. I wonder how the db sync on that same DB ran successfully.

Contributor

The Nova CR status is Ready, so there was a successful db sync run.

Contributor

I checked the logs and dumps, but I don't see why the cell1 conductor cannot connect to the DB. Unfortunately, all the passwords are masked in must-gather, so I cannot check those. @marios if you have a held node with this issue, then I can check the creds there.

Contributor

@gibizer thanks for having a look. Sorry, we were trying to get to a solution and missed your comments. In the end it was a DNS issue, resolved with https://github.com/openstack-k8s-operators/data-plane-adoption/pull/218/files

There is a green run there if you want to poke at the logs: https://review.rdoproject.org/zuul/build/87df5976f8814ea9a319eea1caececb2/artifacts

Contributor

The nova-compute logs from the green run look good to me!

@bogdando bogdando deleted the nova_adoption_ffu branch December 4, 2023 13:27