
Add nw connectivity task to all playbooks #809

Closed

Conversation

@stuggi (Contributor) commented Nov 13, 2024

When using BGP, where pods use a network-attachment-definition, it takes a few seconds for the address to be advertised on the fabric. As a result, roles which don't have the connectivity check task will fail.

A rerun of the Ansible jobs will not fix this, because every time the deployment job pod gets created it receives a new IP from the ctlplane pool.

There were discussions on Slack about a more generic approach; for now, let's just add the check that is already present in the bootstrap playbook to all the playbooks:

ansible.builtin.wait_for_connection:
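
For reference, a minimal sketch of that task as it would appear under a play's tasks section (the timeout value is an illustrative assumption, not taken from the bootstrap playbook):

  tasks:
    - name: Check remote connections
      ansible.builtin.wait_for_connection:
        # Illustrative: wait up to 10 minutes for the host to become reachable
        timeout: 600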

Jira: https://issues.redhat.com/browse/OSPRH-8680

openshift-ci bot commented Nov 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stuggi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stuggi requested review from lmiccini, rabi and bshephar, and removed request for viroel and frenzyfriday on November 13, 2024 11:05
@rabi (Contributor) commented Nov 13, 2024

> There were discussions on Slack about a more generic approach; for now, let's just add the check that is already present in the bootstrap playbook to all the playbooks:

As mentioned in the Slack thread, we also have to ensure that all custom services have that task, and it has to be documented. If we know it's only for a few seconds, adding sleep x [1] or some relevant check to https://github.com/openstack-k8s-operators/edpm-ansible/blob/main/openstack_ansibleee/edpm_entrypoint.sh would possibly do the job, rather than changing all roles?

[1] https://github.com/openstack-k8s-operators/edpm-ansible/compare/main...rabi:edpm-ansible:wait_for_network?expand=1

hosts: "{{ edpm_override_hosts | default('all', true) }}"
strategy: linear
gather_facts: "{{ gather_facts | default(false) }}"
tasks:

If we decide to go this route, then we should also add an Ansible var to run the task only for BGP deployments.
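
For illustration, a gated version of the check might look like this (the edpm_wait_for_network variable name, its default, and the timeout are all hypothetical):

    - name: Check remote connections
      ansible.builtin.wait_for_connection:
        timeout: 600  # illustrative
      # Hypothetical toggle: only wait when explicitly enabled for BGP deployments
      when: edpm_wait_for_network | default(false) | bool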


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/3bb947da160646cda979c69c3c92e9b9

✔️ openstack-k8s-operators-content-provider SUCCESS in 4h 46m 43s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 21m 38s
❌ cifmw-crc-podified-edpm-baremetal RETRY_LIMIT in 13m 23s
✔️ edpm-ansible-tempest-multinode SUCCESS in 1h 42m 24s
✔️ edpm-ansible-molecule-edpm_bootstrap SUCCESS in 5m 47s
✔️ edpm-ansible-molecule-edpm_podman SUCCESS in 5m 52s
✔️ edpm-ansible-molecule-edpm_module_load SUCCESS in 5m 27s
✔️ edpm-ansible-molecule-edpm_kernel SUCCESS in 11m 16s
✔️ edpm-ansible-molecule-edpm_libvirt SUCCESS in 10m 12s
✔️ edpm-ansible-molecule-edpm_nova SUCCESS in 8m 47s
✔️ edpm-ansible-molecule-edpm_frr SUCCESS in 6m 21s
✔️ edpm-ansible-molecule-edpm_iscsid SUCCESS in 4m 04s
✔️ edpm-ansible-molecule-edpm_ovn_bgp_agent SUCCESS in 6m 25s
✔️ edpm-ansible-molecule-edpm_ovs SUCCESS in 11m 11s
✔️ edpm-ansible-molecule-edpm_tripleo_cleanup SUCCESS in 3m 34s
✔️ edpm-ansible-molecule-edpm_tuned SUCCESS in 5m 51s
✔️ edpm-ansible-molecule-edpm_telemetry_power_monitoring SUCCESS in 6m 17s
✔️ adoption-standalone-to-crc-ceph-provider SUCCESS in 3h 24m 38s

playbooks/ceph_client.yml (review thread, outdated)
@stuggi force-pushed the cache_wait_connection branch from 8976ff6 to 22678b5 on November 15, 2024 14:37
- name: Configure EDPM as client of Ceph
  hosts: "{{ edpm_override_hosts | default('all', true) }}"
  strategy: linear
  gather_facts: "{{ gather_facts | default(false) }}"
  any_errors_fatal: "{{ edpm_any_errors_fatal | default(true) }}"
  max_fail_percentage: "{{ edpm_max_fail_percentage | default(0) }}"
  tasks:
    - name: Check remote connections

This won't work when gather_facts is true, as it would try gathering facts and fail before calling the wait_for_connection task. That's why it was added as a separate play in [1] with gather_facts: false. We regressed that later, though.

We either create a separate play (with gather_facts: false) like before, or change it to false in every playbook (i.e. not honor the user-provided gather_facts Ansible var at all).

Also, what are we going to do for custom services? Are we not expecting people to use them in BGP deployments? Are we going to document that?

[1] 07a0815
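
For reference, a minimal sketch of the separate-play pattern (play names and the timeout are illustrative; the point is that only the first play hardcodes gather_facts: false, so the connectivity check runs before any fact gathering):

    - name: Wait for node connectivity
      hosts: "{{ edpm_override_hosts | default('all', true) }}"
      gather_facts: false
      tasks:
        - name: Check remote connections
          ansible.builtin.wait_for_connection:
            timeout: 600  # illustrative

    - name: Configure EDPM as client of Ceph
      hosts: "{{ edpm_override_hosts | default('all', true) }}"
      gather_facts: "{{ gather_facts | default(false) }}"
      tasks:
        # ... role tasks follow ...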

@stuggi (Author) replied:

Thanks for the pointer on gather_facts. I'll update it.

Re custom services, from my PoV I think it's OK to document it. @slagle, what's your take on this?


Why don't we just change our default Ansible configuration for SSH connection timeout/retries to these same values? Then we wouldn't need to include the extra task in every execution.

As a downside of doing it that way, if connectivity is lost during an execution for some reason, we could end up waiting 10 minutes to report that.

Otherwise, we'd need the docs.

@stuggi (Author) commented Nov 20, 2024:

We tried to just pass in the ANSIBLE_TIMEOUT env var via the nodeset, but it does not work:

  env:
    - name: ANSIBLE_FORCE_COLOR
      value: "True"
    - name: ANSIBLE_TIMEOUT
      value: "60"

It's being set properly on the deploy pod:

spec:
  containers:
  - args:
    - ansible-runner
    - run
    - /runner
    - -p
    - osp.edpm.download_cache
    - -i
    - download-cache-edpm-deployment-r2-compute-nodes-r2
    env:
    - name: ANSIBLE_FORCE_COLOR
      value: "True"
    - name: ANSIBLE_TIMEOUT
      value: "60"

Not sure whether the "No route to host" error causes the timeout to be handled the way we need.

$ oc logs download-cache-edpm-deployment-r2-compute-nodes-r2-fqvgp                                                                                                                        
Identity added: /runner/artifacts/download-cache-edpm-deployment-r2-compute-nodes-r2/ssh_key_data (cifmw_reproducer_key)                                                                                              
[WARNING]: Invalid characters were found in group names but not replaced, use
-vvvv to see details

PLAY [EDPM Download cache] *****************************************************

TASK [osp.edpm.edpm_download_cache : Validating arguments against arg spec 'main' - The main entry point for the edpm_download_cache role.] ***                                                                       
ok: [r2-compute-ib21ttu9-0]

TASK [osp.edpm.edpm_download_cache : Install role requirements] ****************
included: /usr/share/ansible/collections/ansible_collections/osp/edpm/roles/edpm_download_cache/tasks/install.yml for r2-compute-ib21ttu9-0                                                                           

TASK [Execute bootstrap command] ***********************************************
included: osp.edpm.edpm_bootstrap for r2-compute-ib21ttu9-0

TASK [osp.edpm.edpm_bootstrap : Re-read facts] *********************************
fatal: [r2-compute-ib21ttu9-0]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host r2-compute-ib21ttu9-0 port 22: No route to host", "unreachable": true}         

NO MORE HOSTS LEFT *************************************************************

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
r2-compute-ib21ttu9-0      : ok=3    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0   

@stuggi (Author) commented:

@slagle, do you have another idea for an Ansible configuration to test instead of what we tried above?


ANSIBLE_SSH_RETRIES and ANSIBLE_SSH_TIMEOUT (I believe the latter is the same as ANSIBLE_TIMEOUT, but it may be worth trying).
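
For illustration, the corresponding env entries on the nodeset might then look like this (the retry and timeout values are assumptions, not verified settings):

  env:
    - name: ANSIBLE_FORCE_COLOR
      value: "True"
    - name: ANSIBLE_SSH_RETRIES
      value: "10"  # assumed value
    - name: ANSIBLE_SSH_TIMEOUT
      value: "60"  # assumed value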

@stuggi (Author) replied:

Tests with setting ANSIBLE_SSH_RETRIES and ANSIBLE_SSH_TIMEOUT were successful. I'll close this PR. Thanks a lot, @slagle.

@stuggi force-pushed the cache_wait_connection branch from 22678b5 to 88c3e88 on November 18, 2024 09:52
When using BGP, where pods use a network-attachment-definition,
it takes a few seconds for the address to be advertised on the
fabric. As a result, roles which don't have the connectivity
check task will fail.

A rerun of the Ansible jobs will not fix this, because every time
the deployment job pod gets created it receives a new IP from
the ctlplane pool.

There were discussions on Slack about a more generic approach; for
now, let's add a connectivity check to the edpm_nodes_validation role
and call it from the playbooks, similar to:
https://github.com/openstack-k8s-operators/edpm-ansible/blob/6615889c5d708df2accb7dafec5aa099946292b8/playbooks/bootstrap.yml#L8

Jira: https://issues.redhat.com/browse/OSPRH-8680

Signed-off-by: Martin Schuppert <[email protected]>
@stuggi force-pushed the cache_wait_connection branch from 88c3e88 to c2cfaa2 on November 18, 2024 09:54
@stuggi (Author) commented Dec 19, 2024

Tests with setting ANSIBLE_SSH_RETRIES and ANSIBLE_SSH_TIMEOUT were successful. I'll close this PR. Thanks a lot, @slagle.

@stuggi closed this Dec 19, 2024