
Add nw connectivity task to all playbooks #809

Closed

Conversation

@stuggi (Contributor) commented Nov 13, 2024

When using BGP, where pods use a network-attachment-definition, it takes a few seconds for the address to be advertised on the fabric. As a result, roles which don't have the connectivity check task will fail.

A rerun of the Ansible jobs will not fix this, because every time the deployment job pod gets created it receives a new IP from the ctlplane pool.

There were discussions on Slack about a more generic approach; for now, let's just add the check that is already present in the bootstrap playbook to all the playbooks:

ansible.builtin.wait_for_connection:
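
For reference, a minimal sketch of that task as it would appear under a play's tasks section (the timeout value is an illustrative assumption, not taken from the bootstrap playbook):

  tasks:
    - name: Check remote connections
      ansible.builtin.wait_for_connection:
        # Illustrative: wait up to 10 minutes for the host to become reachable
        timeout: 600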

Jira: https://issues.redhat.com/browse/OSPRH-8680

openshift-ci bot commented Nov 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: stuggi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stuggi requested review from lmiccini, rabi and bshephar, and removed request for viroel and frenzyfriday on November 13, 2024 11:05
@rabi (Contributor) commented Nov 13, 2024

> There were discussions on Slack about a more generic approach; for now, let's just add the check that is already present in the bootstrap playbook to all the playbooks:

As mentioned in the Slack thread, we also have to ensure that all custom services have that task, and it has to be documented. If we know it's only for a few seconds, adding sleep x [1] or some relevant check to https://github.com/openstack-k8s-operators/edpm-ansible/blob/main/openstack_ansibleee/edpm_entrypoint.sh would possibly do the job, rather than changing all roles?

[1] https://github.com/openstack-k8s-operators/edpm-ansible/compare/main...rabi:edpm-ansible:wait_for_network?expand=1

hosts: "{{ edpm_override_hosts | default('all', true) }}"
strategy: linear
gather_facts: "{{ gather_facts | default(false) }}"
tasks:

If we decide to go this route, then we should also add an Ansible var to run the task only for BGP deployments.
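
For illustration, a gated version of the check might look like this (the edpm_wait_for_network variable name, its default, and the timeout are all hypothetical):

    - name: Check remote connections
      ansible.builtin.wait_for_connection:
        timeout: 600  # illustrative
      # Hypothetical toggle: only wait when explicitly enabled for BGP deployments
      when: edpm_wait_for_network | default(false) | bool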


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/3bb947da160646cda979c69c3c92e9b9

✔️ openstack-k8s-operators-content-provider SUCCESS in 4h 46m 43s
✔️ podified-multinode-edpm-deployment-crc SUCCESS in 1h 21m 38s
❌ cifmw-crc-podified-edpm-baremetal RETRY_LIMIT in 13m 23s
✔️ edpm-ansible-tempest-multinode SUCCESS in 1h 42m 24s
✔️ edpm-ansible-molecule-edpm_bootstrap SUCCESS in 5m 47s
✔️ edpm-ansible-molecule-edpm_podman SUCCESS in 5m 52s
✔️ edpm-ansible-molecule-edpm_module_load SUCCESS in 5m 27s
✔️ edpm-ansible-molecule-edpm_kernel SUCCESS in 11m 16s
✔️ edpm-ansible-molecule-edpm_libvirt SUCCESS in 10m 12s
✔️ edpm-ansible-molecule-edpm_nova SUCCESS in 8m 47s
✔️ edpm-ansible-molecule-edpm_frr SUCCESS in 6m 21s
✔️ edpm-ansible-molecule-edpm_iscsid SUCCESS in 4m 04s
✔️ edpm-ansible-molecule-edpm_ovn_bgp_agent SUCCESS in 6m 25s
✔️ edpm-ansible-molecule-edpm_ovs SUCCESS in 11m 11s
✔️ edpm-ansible-molecule-edpm_tripleo_cleanup SUCCESS in 3m 34s
✔️ edpm-ansible-molecule-edpm_tuned SUCCESS in 5m 51s
✔️ edpm-ansible-molecule-edpm_telemetry_power_monitoring SUCCESS in 6m 17s
✔️ adoption-standalone-to-crc-ceph-provider SUCCESS in 3h 24m 38s

playbooks/ceph_client.yml (review thread, outdated)
@stuggi force-pushed the cache_wait_connection branch from 8976ff6 to 22678b5 on November 15, 2024 14:37
- name: Configure EDPM as client of Ceph
  hosts: "{{ edpm_override_hosts | default('all', true) }}"
  strategy: linear
  gather_facts: "{{ gather_facts | default(false) }}"
  any_errors_fatal: "{{ edpm_any_errors_fatal | default(true) }}"
  max_fail_percentage: "{{ edpm_max_fail_percentage | default(0) }}"
  tasks:
    - name: Check remote connections

This won't work when gather_facts is true, as it would try gathering facts and fail before calling the wait_for_connection task. That's why it was added as a separate play in [1] with gather_facts: false. We regressed that later, though.

We either create a separate play (with gather_facts: false) like before, or change it to false in every playbook (i.e. not honor the user-provided gather_facts Ansible var at all).

Also, what are we going to do for custom services? Are we not expecting people to use them in BGP deployments? Are we going to document that?

[1] 07a0815
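
For reference, a minimal sketch of the separate-play pattern (play names and the timeout are illustrative; the point is that only the first play hardcodes gather_facts: false, so the connectivity check runs before any fact gathering):

    - name: Wait for node connectivity
      hosts: "{{ edpm_override_hosts | default('all', true) }}"
      gather_facts: false
      tasks:
        - name: Check remote connections
          ansible.builtin.wait_for_connection:
            timeout: 600  # illustrative

    - name: Configure EDPM as client of Ceph
      hosts: "{{ edpm_override_hosts | default('all', true) }}"
      gather_facts: "{{ gather_facts | default(false) }}"
      tasks:
        # ... role tasks follow ...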

@stuggi (Author) replied:

Thanks for the pointer on gather_facts. I'll update it.

Re custom services, from my PoV I think it's OK to document it. @slagle, what's your take on this?


Why don't we just change our default Ansible configuration for SSH connection timeout/retries to these same values? Then we wouldn't need to include the extra task in every execution.

As a downside of doing it that way, if connectivity is lost during an execution for some reason, we could end up waiting 10 minutes to report that.

Otherwise, we'd need the docs.

@stuggi (Author) commented Nov 20, 2024:

We tried to just pass in the ANSIBLE_TIMEOUT env var via the nodeset, but it does not work:

  env:
    - name: ANSIBLE_FORCE_COLOR
      value: "True"
    - name: ANSIBLE_TIMEOUT
      value: "60"

It's being set properly on the deploy pod:

spec:
  containers:
  - args:
    - ansible-runner
    - run
    - /runner
    - -p
    - osp.edpm.download_cache
    - -i
    - download-cache-edpm-deployment-r2-compute-nodes-r2
    env:
    - name: ANSIBLE_FORCE_COLOR
      value: "True"
    - name: ANSIBLE_TIMEOUT
      value: "60"

Not sure whether the "No route to host" error causes the timeout to be handled the way we need.

$ oc logs download-cache-edpm-deployment-r2-compute-nodes-r2-fqvgp                                                                                                                        
Identity added: /runner/artifacts/download-cache-edpm-deployment-r2-compute-nodes-r2/ssh_key_data (cifmw_reproducer_key)                                                                                              
[WARNING]: Invalid characters were found in group names but not replaced, use
-vvvv to see details

PLAY [EDPM Download cache] *****************************************************

TASK [osp.edpm.edpm_download_cache : Validating arguments against arg spec 'main' - The main entry point for the edpm_download_cache role.] ***                                                                       
ok: [r2-compute-ib21ttu9-0]

TASK [osp.edpm.edpm_download_cache : Install role requirements] ****************
included: /usr/share/ansible/collections/ansible_collections/osp/edpm/roles/edpm_download_cache/tasks/install.yml for r2-compute-ib21ttu9-0                                                                           

TASK [Execute bootstrap command] ***********************************************
included: osp.edpm.edpm_bootstrap for r2-compute-ib21ttu9-0

TASK [osp.edpm.edpm_bootstrap : Re-read facts] *********************************
fatal: [r2-compute-ib21ttu9-0]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host r2-compute-ib21ttu9-0 port 22: No route to host", "unreachable": true}         

NO MORE HOSTS LEFT *************************************************************

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
r2-compute-ib21ttu9-0      : ok=3    changed=0    unreachable=1    failed=0    skipped=0    rescued=0    ignored=0   

@stuggi (Author) commented:

@slagle, do you have another idea for an Ansible configuration to test instead of what we tried above?


ANSIBLE_SSH_RETRIES and ANSIBLE_SSH_TIMEOUT (I believe the latter is the same as ANSIBLE_TIMEOUT, but it may be worth trying).
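
For illustration, the corresponding env entries on the nodeset might then look like this (the retry and timeout values are assumptions, not verified settings):

  env:
    - name: ANSIBLE_FORCE_COLOR
      value: "True"
    - name: ANSIBLE_SSH_RETRIES
      value: "10"  # assumed value
    - name: ANSIBLE_SSH_TIMEOUT
      value: "60"  # assumed value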

@stuggi (Author) replied:

Tests with setting ANSIBLE_SSH_RETRIES and ANSIBLE_SSH_TIMEOUT were successful. I'll close this PR. Thanks a lot, @slagle.

@stuggi force-pushed the cache_wait_connection branch from 22678b5 to 88c3e88 on November 18, 2024 09:52
When using BGP, where pods use a network-attachment-definition,
it takes a few seconds for the address to be advertised on the
fabric. As a result, roles which don't have the connectivity
check task will fail.

A rerun of the Ansible jobs will not fix this, because every time
the deployment job pod gets created it receives a new IP from
the ctlplane pool.

There were discussions on Slack about a more generic approach; for
now, let's add a connectivity check to the edpm_nodes_validation role
and call it from the playbooks, similar to:
https://github.com/openstack-k8s-operators/edpm-ansible/blob/6615889c5d708df2accb7dafec5aa099946292b8/playbooks/bootstrap.yml#L8

Jira: https://issues.redhat.com/browse/OSPRH-8680

Signed-off-by: Martin Schuppert <[email protected]>
@stuggi force-pushed the cache_wait_connection branch from 88c3e88 to c2cfaa2 on November 18, 2024 09:54
@stuggi (Author) commented Dec 19, 2024

Tests with setting ANSIBLE_SSH_RETRIES and ANSIBLE_SSH_TIMEOUT were successful. I'll close this PR. Thanks a lot, @slagle.

@stuggi closed this Dec 19, 2024