Add nw connectivity task to all playbooks #809
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: stuggi. The full list of commands accepted by this bot can be found here. The pull request process is described here.
As mentioned in the Slack thread, we also have to ensure that all custom services have that task, and it has to be documented. If we know it's only for a few seconds, adding …
playbooks/configure_network.yml (outdated)

  hosts: "{{ edpm_override_hosts | default('all', true) }}"
  strategy: linear
  gather_facts: "{{ gather_facts | default(false) }}"
  tasks:
If we decide to go this route, then we should also add an Ansible var to run the task only for BGP deployments.
Force-pushed from 8976ff6 to 22678b5.
playbooks/ceph_client.yml (outdated)

- name: Configure EDPM as client of Ceph
  hosts: "{{ edpm_override_hosts | default('all', true) }}"
  strategy: linear
  gather_facts: "{{ gather_facts | default(false) }}"
  any_errors_fatal: "{{ edpm_any_errors_fatal | default(true) }}"
  max_fail_percentage: "{{ edpm_max_fail_percentage | default(0) }}"
  tasks:
    - name: Check remote connections
This won't work when gather_facts is true, as it would try gathering facts and fail before calling the wait_for_connection task. That's why it was added as a separate play in [1] with gather_facts: false. We regressed that later though.
We either create a separate play (with gather_facts: false) like earlier, or change it to false in every playbook (i.e. not honor the user-provided gather_facts Ansible var at all).
Also, what're we going to do for custom services? Are we not expecting people to use them in bgp deployments? Are we going to document that?
[1] 07a0815
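A separate play along the lines of [1] could look roughly like this (a minimal sketch; the play names and the delay/timeout values are illustrative assumptions, not taken from the PR):

```yaml
# Hypothetical sketch: run the connectivity check in its own play with
# gather_facts hard-disabled, so fact gathering cannot fail before the
# wait_for_connection task has a chance to run.
- name: Wait for node connectivity
  hosts: "{{ edpm_override_hosts | default('all', true) }}"
  strategy: linear
  gather_facts: false
  tasks:
    - name: Check remote connections
      ansible.builtin.wait_for_connection:
        delay: 5      # illustrative value
        timeout: 600  # illustrative value

# Subsequent plays can then safely honor the user-provided gather_facts var.
- name: Configure EDPM as client of Ceph
  hosts: "{{ edpm_override_hosts | default('all', true) }}"
  strategy: linear
  gather_facts: "{{ gather_facts | default(false) }}"
  tasks: []
```

The point of the split is ordering: fact gathering happens at play start, before any task, so the check must live in a play where gathering is unconditionally off.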
Thanks for the pointer on the gather_facts, I'll update it.
Re custom services, from my pov I think it's ok to document it. @slagle what's your take on this?
Why don't we just change our default ansible configuration for SSH connection timeout/retries to be these same values? Then we wouldn't need to include the extra task on every execution.
As a downside of doing it that way, if for some reason connectivity is lost during an execution, we could end up waiting 10 minutes to report that.
Otherwise, we'd need the docs.
We tried to just pass in the ANSIBLE_TIMEOUT env var via the nodeset, but it does not work.
env:
- name: ANSIBLE_FORCE_COLOR
value: "True"
- name: ANSIBLE_TIMEOUT
value: "60"
It's being set properly on the deploy pod:
spec:
containers:
- args:
- ansible-runner
- run
- /runner
- -p
- osp.edpm.download_cache
- -i
- download-cache-edpm-deployment-r2-compute-nodes-r2
env:
- name: ANSIBLE_FORCE_COLOR
value: "True"
- name: ANSIBLE_TIMEOUT
value: "60"
Not sure if the No route to host error results in the timeout being handled as we need it.
$ oc logs download-cache-edpm-deployment-r2-compute-nodes-r2-fqvgp
Identity added: /runner/artifacts/download-cache-edpm-deployment-r2-compute-nodes-r2/ssh_key_data (cifmw_reproducer_key)
[WARNING]: Invalid characters were found in group names but not replaced, use
-vvvv to see details
PLAY [EDPM Download cache] *****************************************************
TASK [osp.edpm.edpm_download_cache : Validating arguments against arg spec 'main' - The main entry point for the edpm_download_cache role.] ***
ok: [r2-compute-ib21ttu9-0]
TASK [osp.edpm.edpm_download_cache : Install role requirements] ****************
included: /usr/share/ansible/collections/ansible_collections/osp/edpm/roles/edpm_download_cache/tasks/install.yml for r2-compute-ib21ttu9-0
TASK [Execute bootstrap command] ***********************************************
included: osp.edpm.edpm_bootstrap for r2-compute-ib21ttu9-0
TASK [osp.edpm.edpm_bootstrap : Re-read facts] *********************************
fatal: [r2-compute-ib21ttu9-0]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host r2-compute-ib21ttu9-0 port 22: No route to host", "unreachable": true}
NO MORE HOSTS LEFT *************************************************************
NO MORE HOSTS LEFT *************************************************************
PLAY RECAP *********************************************************************
r2-compute-ib21ttu9-0 : ok=3 changed=0 unreachable=1 failed=0 skipped=0 rescued=0 ignored=0
@slagle do you have another idea for an Ansible configuration to test instead of what we tried above?
ANSIBLE_SSH_RETRIES and ANSIBLE_SSH_TIMEOUT (I believe that's the same as ANSIBLE_TIMEOUT, but it may be worth trying).
tests with setting ANSIBLE_SSH_RETRIES and ANSIBLE_SSH_TIMEOUT were successful. I'll close this pr. Thanks a lot @slagle
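Based on the failing snippet above, the working nodeset env would presumably look something like this (a sketch; the retry count is an illustrative assumption, not the value actually tested):

```yaml
# Hypothetical nodeset env sketch. ANSIBLE_SSH_TIMEOUT sets the ssh
# connection plugin's timeout, and ANSIBLE_SSH_RETRIES makes Ansible
# retry the connection on failures such as "No route to host" instead
# of marking the host unreachable on the first attempt.
env:
  - name: ANSIBLE_FORCE_COLOR
    value: "True"
  - name: ANSIBLE_SSH_TIMEOUT
    value: "60"
  - name: ANSIBLE_SSH_RETRIES
    value: "10"  # illustrative value
```

This is why plain ANSIBLE_TIMEOUT did not help: it only controls how long a single connection attempt waits, while the BGP advertisement delay needs the attempt to be repeated.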
Force-pushed from 22678b5 to 88c3e88.
When using BGP, where pods use a network-attachment-definition, it takes some seconds for the address to be advertised on the fabric. As a result, roles that don't have the connectivity check task will fail. A rerun of the Ansible jobs will not fix this, because every time the deployment job pod gets created it receives a new IP from the ctlplane pool. There were discussions on Slack about a more generic approach; for now, let's add a connectivity check to the edpm_nodes_validation role and call it from the playbooks, similar to: https://github.com/openstack-k8s-operators/edpm-ansible/blob/6615889c5d708df2accb7dafec5aa099946292b8/playbooks/bootstrap.yml#L8 Jira: https://issues.redhat.com/browse/OSPRH-8680 Signed-off-by: Martin Schuppert <[email protected]>
Force-pushed from 88c3e88 to c2cfaa2.
When using BGP, where pods use a network-attachment-definition, it takes some seconds for the address to be advertised on the fabric. As a result, roles that don't have the connectivity check task will fail.
A rerun of the Ansible jobs will not fix this, because every time the deployment job pod gets created it receives a new IP from the ctlplane pool.
There were discussions on Slack about a more generic approach; for now, let's just add the check to all the playbooks, which is already there in the bootstrap playbook:
edpm-ansible/playbooks/bootstrap.yml, line 8 in 6615889
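Pulled into the edpm_nodes_validation role, the check might look roughly like this (a sketch; the task file path and variable names are assumptions, not the merged implementation):

```yaml
# roles/edpm_nodes_validation/tasks/wait_for_connection.yml (hypothetical path)
# Wait for the node to become reachable, covering the window in which the
# deploy pod's freshly assigned ctlplane IP is still being advertised via BGP.
- name: Check remote connections
  ansible.builtin.wait_for_connection:
    delay: "{{ edpm_wait_for_connection_delay | default(5) }}"        # assumed var
    timeout: "{{ edpm_wait_for_connection_timeout | default(600) }}"  # assumed var
```

Each playbook would then include this task (or the role) in a gather_facts: false play before any fact-gathering plays, as discussed above.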
Jira: https://issues.redhat.com/browse/OSPRH-8680