Nova adoption ffu (no extra cell) #192

Merged 1 commit on Dec 4, 2023
128 changes: 127 additions & 1 deletion docs/openstack/edpm_adoption.md
@@ -6,7 +6,12 @@

## Variables

(There are no shell variables necessary currently.)
Define the shell variables used in the fast-forward upgrade steps below.
The values are illustrative; use values that are correct for your environment:

```bash
PODIFIED_DB_ROOT_PASSWORD=$(oc get -o json secret/osp-secret | jq -r .data.DbRootPassword | base64 -d)
```
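
Because later steps pass this variable on the command line, it may help to stop early if the secret lookup produced an empty value. A minimal sketch; the `require_var` helper is hypothetical and not part of the adoption tooling:

```bash
# Hypothetical helper: fail if the named shell variable is empty or unset.
require_var() {
    # eval performs an indirect lookup of the variable whose name is in $1.
    eval "_rv_value=\${$1:-}"
    if [ -z "$_rv_value" ]; then
        echo "ERROR: $1 is empty or unset" >&2
        return 1
    fi
}

# Example: require_var PODIFIED_DB_ROOT_PASSWORD || exit 1
```

This only catches an empty result; it does not verify that the decoded value is the right password.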

## Pre-checks

@@ -308,3 +313,124 @@ EOF
```
oc wait --for condition=Ready osdpns/openstack --timeout=30m
```

## Nova compute services fast-forward upgrade from Wallaby to Antelope

A rolling upgrade of Nova services cannot be done during adoption in
lock-step with the Nova control plane services, because the two sides
are managed independently: Nova compute services by EDPM Ansible, and
Nova control plane services by Kubernetes operators.
Instead, the Nova service operator and the OpenStack Dataplane operator
allow each side to be upgraded independently of the other by configuring
`[upgrade_levels]compute=auto` for Nova services. Nova control plane
services apply the change right after the CR is patched. Nova compute EDPM
services pick up the same config change later, with the Ansible
deployment.

> **NOTE**: Additional orchestration around the FFU workarounds
> configuration for the Nova compute EDPM service is subject to future changes.

* Wait for the cell1 Nova compute EDPM services version to be updated (it may take some time):

```bash
oc exec -it mariadb-openstack-cell1 -- mysql --user=root --password="${PODIFIED_DB_ROOT_PASSWORD}" \
  -e "select a.version from nova_cell1.services a join nova_cell1.services b where a.version!=b.version and a.binary='nova-compute';"
```
The step is complete when the above query returns an empty result.
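
Since the version update may take some time, the check can be polled rather than rerun by hand. A minimal sketch of such a loop; the `wait_until_empty` helper is hypothetical, and it treats "exit code 0 with no output" as the completion criterion:

```bash
# Hypothetical helper: rerun a command until it succeeds with empty stdout.
# Usage: wait_until_empty RETRIES DELAY_SECONDS COMMAND [ARGS...]
wait_until_empty() {
    _wue_retries="$1"
    _wue_delay="$2"
    shift 2
    _wue_attempt=1
    while [ "$_wue_attempt" -le "$_wue_retries" ]; do
        # Done when the command exits 0 and prints nothing.
        if _wue_out="$("$@")" && [ -z "$_wue_out" ]; then
            return 0
        fi
        _wue_attempt=$((_wue_attempt + 1))
        sleep "$_wue_delay"
    done
    # All attempts used up without an empty, successful result.
    return 1
}
```

Calling it with 20 retries and a 6-second delay would mirror the `retries` and `delay` values used by the test task in this PR.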

* Remove pre-FFU workarounds for Nova control plane services:

```bash
oc patch openstackcontrolplane openstack -n openstack --type=merge --patch '
spec:
  nova:
    template:
      cellTemplates:
        cell0:
          conductorServiceTemplate:
            customServiceConfig: |
              [workarounds]
              disable_compute_service_check_for_ffu=false
Contributor

Here and below: you don't need an explicit disable_compute_service_check_for_ffu=false, as false is the default. I suggest just dropping the content of the customServiceConfig field.

Contributor

The edpm-nova config cleanup bug mentioned below does not affect the k8s control plane.

Contributor Author (@bogdando, Nov 20, 2023)

How am I supposed to drop this section for the EDPM side of things? I'm not certain patching osdpservices is a valid approach.

Neither can we do "removing the nova-compute-ffu from the OpenStackDataPlaneService and doing a deployment".
This is the first place we refer to it being deployed, so there is nothing to remove.
We could remove the nova-extra-config service from the existing osdpns and make it deploy the standard nova service. But isn't patching osdpns an antipattern?

        cell1:
          metadataServiceTemplate:
            customServiceConfig: |
              [workarounds]
              disable_compute_service_check_for_ffu=false
          conductorServiceTemplate:
            customServiceConfig: |
              [workarounds]
              disable_compute_service_check_for_ffu=false
      apiServiceTemplate:
        customServiceConfig: |
          [workarounds]
          disable_compute_service_check_for_ffu=false
      metadataServiceTemplate:
        customServiceConfig: |
          [workarounds]
          disable_compute_service_check_for_ffu=false
      schedulerServiceTemplate:
        customServiceConfig: |
          [workarounds]
          disable_compute_service_check_for_ffu=false
'
```

* Wait for Nova control plane services' CRs to become ready:

```bash
oc wait --for condition=Ready --timeout=300s Nova/nova
```

* Remove pre-FFU workarounds for Nova compute EDPM services:

```bash
oc apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: nova-compute-ffu
  namespace: openstack
data:
  20-nova-compute-cell1-ffu-cleanup.conf: |
    [workarounds]
    disable_compute_service_check_for_ffu=false
Comment on lines +383 to +395
Contributor

Add a note that this is currently needed due to a bug in the config handling in the edpm_nova role. The proper solution is removing the nova-compute-ffu from the OpenStackDataPlaneService and doing a deployment, as that is expected to remove the 19-nova...conf from the EDPM node and therefore remove the disable_compute_service_check_for_ffu configuration. Today the 19-nova...conf is left there due to the bug, and hence you need an explicit config, 20-nova...conf, that flips disable_compute_service_check_for_ffu back to false.

Contributor Author

I think I'm confused now. Please see my comment above.

---
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneService
metadata:
  name: nova-compute-ffu
  namespace: openstack
spec:
  label: nova.compute.ffu
  configMaps:
    - nova-compute-ffu
  secrets:
    - nova-cell1-compute-config
    - nova-migration-ssh-key
  playbook: osp.edpm.nova
---
apiVersion: dataplane.openstack.org/v1beta1
kind: OpenStackDataPlaneDeployment
metadata:
  name: openstack-nova-compute-ffu
  namespace: openstack
spec:
  nodeSets:
    - openstack
  servicesOverride:
    - nova-compute-ffu
EOF
```
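
The `20-` prefix in the cleanup file name matters: according to the review discussion, an earlier `19-nova...conf` file that enables the workaround is left on the EDPM node, so the cleanup file has to be read after it. The sketch below illustrates that last-wins behaviour, assuming config snippets in one directory are read in sorted file-name order, as oslo.config does for a config directory. The `effective_value` helper is hypothetical and deliberately simplified: it ignores section headers.

```bash
# Hypothetical helper: print the value an option ends up with when all *.conf
# files in a directory are read in sorted order and the last assignment wins.
# Simplification: section headers such as [workarounds] are ignored.
effective_value() {
    _ev_dir="$1"
    _ev_opt="$2"
    # The shell expands the glob in sorted order; cat preserves that order.
    cat "$_ev_dir"/*.conf 2>/dev/null \
        | sed -n "s/^[[:space:]]*${_ev_opt}[[:space:]]*=[[:space:]]*//p" \
        | tail -n 1
}

# Example: effective_value /path/to/conf.d disable_compute_service_check_for_ffu
```

The directory path in the example is a placeholder; the actual location of the nova-compute config snippets on an EDPM node is not stated here.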

* Wait for Nova compute EDPM service to become ready:

```bash
oc wait --for condition=Ready osdpd/openstack-nova-compute-ffu --timeout=5m
```

* Run Nova DB online migrations to complete FFU:

```bash
oc exec -it nova-cell0-conductor-0 -- nova-manage db online_data_migrations
oc exec -it nova-cell1-conductor-0 -- nova-manage db online_data_migrations
```
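
On large databases it can be useful to run the online data migrations in bounded batches. A minimal sketch, assuming the exit-code convention documented for `nova-manage db online_data_migrations` when used with `--max-count` (0: nothing left to migrate, 1: some rows were migrated and the command should be rerun, anything else: an error). The `run_until_done` helper and the batch size are hypothetical:

```bash
# Hypothetical helper: rerun a command while it exits 1 ("more work remains").
# Usage: run_until_done MAX_RUNS COMMAND [ARGS...]
# Returns 0 once the command exits 0, the command's exit code on any other
# failure, or 1 if MAX_RUNS is exhausted while work still remains.
run_until_done() {
    _rud_left="$1"
    shift
    while [ "$_rud_left" -gt 0 ]; do
        _rud_rc=0
        "$@" || _rud_rc=$?
        case "$_rud_rc" in
            0) return 0 ;;
            1) _rud_left=$((_rud_left - 1)) ;;
            *) return "$_rud_rc" ;;
        esac
    done
    return 1
}

# Example (not run here):
# run_until_done 50 oc exec nova-cell1-conductor-0 -- \
#     nova-manage db online_data_migrations --max-count 1000
```

Without `--max-count`, the plain commands above already run the migrations to completion, so this loop is only relevant when batching is wanted.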

4 changes: 4 additions & 0 deletions tests/roles/dataplane_adoption/tasks/main.yaml
@@ -308,3 +308,7 @@
    oc wait --for condition=Ready osdpns/openstack --timeout=40m
  # TODO: work on network configuration for making possible to run this task on other IP ranges
  when: "edpm_node_ip.startswith('192.168.122')"

- name: Complete Nova services Wallaby->Antelope FFU
  ansible.builtin.include_tasks:
    file: nova_ffu.yaml
121 changes: 121 additions & 0 deletions tests/roles/dataplane_adoption/tasks/nova_ffu.yaml
@@ -0,0 +1,121 @@
- name: set podified MariaDB copy shell vars
  no_log: "{{ use_no_log }}"
  ansible.builtin.set_fact:
    mariadb_copy_shell_vars: |
      PODIFIED_DB_ROOT_PASSWORD="{{ podified_db_root_password }}"

- name: wait for cell1 Nova compute EDPM services version updated
  ansible.builtin.shell: |
    {{ shell_header }}
    {{ oc_header }}
    {{ mariadb_copy_shell_vars }}
    oc rsh mariadb-openstack-cell1 mysql --user=root --password=${PODIFIED_DB_ROOT_PASSWORD} \
      -e "select a.version from nova_cell1.services a join nova_cell1.services b where a.version!=b.version and a.binary='nova-compute';"
  register: records_check_results
Contributor

new job is failing here https://logserver.rdoproject.org/53/50653/11/check/data-plane-adoption-OSP-17-to-extracted-crc/1a62ea3/controller/data-plane-adoption-tests-repo/data-plane-adoption/tests/logs/test_with_ceph_out_2023-12-04T05:20:41EST.log

https://logserver.rdoproject.org/53/50653/11/check/data-plane-adoption-OSP-17-to-extracted-crc/1a62ea3/controller/pod/nova-cell1-conductor-0-logs.txt

2023-12-04 10:32:38.433 1 DEBUG oslo_db.sqlalchemy.engines [None req-76342adf-c98e-4a74-91f9-71e7e86e87f2 - - - - - -] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/lib/python3.9/site-packages/oslo_db/sqlalchemy/engines.py:335
2023-12-04 10:32:38.434 1 ERROR nova.context [None req-76342adf-c98e-4a74-91f9-71e7e86e87f2 - - - - - -] Error gathering result from cell 292fe7d7-f10c-4546-876d-753875e67b77: sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (1045, "Access denied for user 'nova_cell1'@'192.168.122.100' (using password: YES)")

Contributor

The conductor log shows that the issue is earlier than the compute adoption. So the new k8s control plane has a wrong / incomplete DB setup, as the conductor cannot talk to its DB. I wonder how the db sync on that same DB ran successfully.

Contributor

The Nova CR status is Ready, so there was a successful db sync run.

Contributor

I checked the logs and dumps, but I don't see why the cell1 conductor cannot connect to the DB. Unfortunately all the passwords are masked in must gather, so I cannot check those. @marios, if you have a held node with this issue, I can check the creds there.

Contributor

@gibizer thanks for having a look. Sorry, we were trying to get to a solution and missed your comments. In the end it was a DNS issue, resolved with https://github.com/openstack-k8s-operators/data-plane-adoption/pull/218/files

There is a green run there if you want to poke at the logs: https://review.rdoproject.org/zuul/build/87df5976f8814ea9a319eea1caececb2/artifacts

Contributor

The nova-compute logs from the green run look good to me!

  until: records_check_results.rc == 0 and records_check_results.stdout_lines | length == 0
  retries: 20
  delay: 6

- name: remove pre-FFU workarounds for Nova control plane services
  ansible.builtin.shell: |
    {{ shell_header }}
    {{ oc_header }}
    oc patch openstackcontrolplane openstack -n openstack --type=merge --patch '
    spec:
      nova:
        template:
          cellTemplates:
            cell0:
              conductorServiceTemplate:
                customServiceConfig: |
                  [workarounds]
                  disable_compute_service_check_for_ffu=false
            cell1:
              metadataServiceTemplate:
                customServiceConfig: |
                  [workarounds]
                  disable_compute_service_check_for_ffu=false
              conductorServiceTemplate:
                customServiceConfig: |
                  [workarounds]
                  disable_compute_service_check_for_ffu=false
          apiServiceTemplate:
            customServiceConfig: |
              [workarounds]
              disable_compute_service_check_for_ffu=false
          metadataServiceTemplate:
            customServiceConfig: |
              [workarounds]
              disable_compute_service_check_for_ffu=false
          schedulerServiceTemplate:
            customServiceConfig: |
              [workarounds]
              disable_compute_service_check_for_ffu=false
    '

- name: Wait for Nova control plane services' CRs to become ready
  ansible.builtin.include_role:
    name: nova_adoption
    tasks_from: wait.yaml

- name: remove pre-FFU workarounds for Nova compute EDPM services
  ansible.builtin.shell: |
    {{ shell_header }}
    {{ oc_header }}
    oc apply -f - <<EOF
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nova-compute-ffu
      namespace: openstack
    data:
      20-nova-compute-cell1-ffu-cleanup.conf: |
        [workarounds]
        disable_compute_service_check_for_ffu=false
    ---
    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneService
    metadata:
      name: nova-compute-ffu
      namespace: openstack
    spec:
      label: nova.compute.ffu
      configMaps:
        - nova-compute-ffu
      secrets:
        - nova-cell1-compute-config
        - nova-migration-ssh-key
      playbook: osp.edpm.nova
    ---
    apiVersion: dataplane.openstack.org/v1beta1
    kind: OpenStackDataPlaneDeployment
    metadata:
      name: openstack-nova-compute-ffu
      namespace: openstack
    spec:
      nodeSets:
        - openstack
      servicesOverride:
        - nova-compute-ffu
    EOF

- name: wait for Nova compute EDPM services to become ready
  ansible.builtin.shell: |
    {{ shell_header }}
    {{ oc_header }}
    oc wait --for condition=Ready osdpd/openstack-nova-compute-ffu --timeout=5m
  register: nova_ffu_edpm_result
  until: nova_ffu_edpm_result is success
  retries: 10
  delay: 6

- name: run Nova DB migrations to complete Wallaby->Antelope FFU
  ansible.builtin.shell: |
    {{ shell_header }}
    {{ oc_header }}
    oc rsh nova-cell0-conductor-0 nova-manage db online_data_migrations
    oc rsh nova-cell1-conductor-0 nova-manage db online_data_migrations
  register: nova_exec_result
  until: nova_exec_result is success
  retries: 10
  delay: 6