Playbooks terminated unexpectedly after 4 hours #11805
Comments
Anything new in this topic? |
Hi, we are also seeing this issue on K3s using AWX 19.5.0 and 21.0.0.
|
@spireob maybe you can check below
I have the same issue: 4-hour jobs end with errors (in k3s) |
We also encounter an issue like this regularly. It seems awx-task just decides that the job is done and kills the pod. This is what I find in the awx-task log:
All the while, the task output just says "canceled". It is not related to the linked issues above. *edit: sorry, missing data about the system: AWX 21.0.0 running on K3S v1.23.6 |
exactly same issue |
can confirm that this also happens on RedHat Ansible Automation Platform 2.1 on OpenShift 4.8 |
Have you opened a case to RedHat? |
yes but no news yet |
Okay, could you please keep us posted on the status of this case? Also, there should be an SLA for a paid subscription. The issue is quite critical for me. |
Will do. |
I ran into a similar problem today; after four hours the task failed. |
For my installation I don't believe it's k3s related, as I have a 500 MB limit for logs. More than that, I don't even see log files created under /var/log/pods/, just empty folders. Also, I'm using a custom EE built with ansible 2.9 as suggested in some @AlanCoding repo, so I believe the issue is not related to ansible-runner but to awx-task, which seems to have some timeout while waiting for output from a task. |
Same issue on AWX: 21.0.0 running on K3S v1.23.6 :( |
Same problem with AWX 21.1.0 & k3s v1.21.7+k3s1. Well, I can only assume that the last task ended without error, as I don't see an "ok", "changed" or "failed"... |
@meis4h Is there any news from support? |
Does this problem also occur in the paid version, with no solution? |
We are also having this issue running the latest AWX, k3s, and the docker backend. The container log size is set to 500 MB and allowed to have 4 files (a single log file is generated and gets nowhere near 500 MB). This is a pretty big showstopper for long-running maintenance playbooks. Can we get an update on this? This issue has been open since February and I've seen numerous closed issues with the same problem. |
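For reference, a minimal sketch of how container log limits like those are typically configured on a k3s node; the file path and exact kubelet arguments below are assumptions, not taken from this thread, and they only govern kubelet log rotation rather than the 4-hour termination itself:

```yaml
# /etc/rancher/k3s/config.yaml (assumed location) - forward log-rotation limits to the kubelet.
# Mirrors the 500 MB / 4 files settings mentioned above; does not prevent the 4h cutoff.
kubelet-arg:
  - "container-log-max-size=500Mi"
  - "container-log-max-files=4"
```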
same here |
We have exactly the same issue. |
@meis4h could you please give an update? Also, let's probably call in active developers like @AlanCoding 😅 |
Hello, as many here, we have the same behaviour on our side (AWX 21.5.0 with k3s). We have some issues in our migration tests, but this one is on top. Is there anybody with a workaround for long-duration tasks? How do you proceed in the meantime? Thank you all. |
The only workaround that worked for me was to create a workflow and split the task into multiple jobs. It does not even require multiple job templates if you work smart with tags: mark some tasks with a tag like step1, the next with step2 and so on, and then include the same job template in the workflow multiple times, each time with a different tag. Passing variables between different steps can be realised with ansible.builtin.set_stats, yet this is still cumbersome and problematic with a single task that might run longer than 4h. For such a single task you have to use poll: 0, async: xxx, pass the registered variable via set_stats, and optionally query the progress with async_status from the next step in the workflow... This is doable, but the only real way to get around this is to redesign the part where the automation container is started. Instead of reading its output as it is now, it has to be read in a kind of while-true loop, until the container really finishes. Right now, the container gets aborted when the HTTPS connection to kubectl gets disconnected. |
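A minimal sketch of that pattern, with illustrative names only (the script path, tag names and timings are not from the comment above): the step1-tagged tasks launch the long operation with async/poll: 0 and publish the job id via set_stats, and a later node of the same workflow runs the same template with the step2 tag to wait on it with async_status.

```yaml
- hosts: all
  gather_facts: no
  tasks:
    # Workflow node 1 runs this template with job tag "step1".
    - name: Start the long-running operation in the background
      ansible.builtin.shell: /usr/local/bin/long_maintenance.sh   # placeholder command
      async: 43200        # allow up to 12h on the managed host
      poll: 0             # return immediately instead of waiting inside this job
      register: long_op
      tags: [step1]

    - name: Publish the async job id as a workflow artifact
      ansible.builtin.set_stats:
        data:
          long_op_jid: "{{ long_op.ansible_job_id }}"
      tags: [step1]

    # Workflow node 2 runs the same template with job tag "step2".
    - name: Wait for the background operation (well under 4h in this node)
      ansible.builtin.async_status:
        jid: "{{ long_op_jid }}"
      register: long_op_result
      until: long_op_result.finished
      retries: 180        # 180 * 60s = 3h at most in this node
      delay: 60
      tags: [step2]
```

Because set_stats data is passed along as workflow artifacts, long_op_jid is available to the step2 node without extra plumbing; if waiting takes longer than one node allows, the step2 node can simply be chained again.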
Thanks for your reply. |
Hi adpavlov, first of all, thank you for sharing your experience. I was wondering if you found any workaround or answer. I keep trying but I can't find a solution. Now I'm using CRI-O as the container runtime; I saw in another post that somebody tried with the Docker runtime too and it didn't work. Thank you. |
Unfortunately not. All proposed workarounds simply have no positive effect. |
After looking more deeply into this issue: AWX uses receptor to handle running k8s pods. It might be that the fix we need is calling receptor in a different way, or the issue should even be reported against receptor. In particular, receptor is called from the run_until_complete method, which submits the work request to receptor and queries for status. It also allocates some sockets for bidirectional communication in _run_internal, and this is what I'm afraid is running into the timeout. So basically, AWX uses external projects like receptor and Kubernetes, Kubernetes has a hardcoded static 4h timeout, and receptor seems to break off after this time. Now the big question is: who should fix this issue, and in which part of the code? Is it receptor that needs fixing, or maybe the way AWX uses it? One has to consider that AWX needs bi-directional communication with the running automation pod, for example in order to interactively pass passwords and so on. On the other hand, it should implement some re-attach mechanism after the connection breaks; the pod has to keep running even if the connection breaks, which means the reaper must also be told not to reap such detached pods. Also, once AWX re-connects to a running pod, it needs to figure out how far it had previously read the output and continue from that point, so that web pages watching the progress are updated without skipping or doubling tasks... This is far from a trivial task, at least as it looks to me. |
@meis4h :( Thank you for your answer. Let's keep looking for the solution. :) |
@bartowl I was already aware of the polling feature in Ansible, but I'm not sure it can be used in my case; same for the workflow. Maybe other people are in the same situation, so I would like to dig for a workaround. At the moment, I have a task which uses an until loop to query a REST API until the API returns "state: finished".
How could such a task use poll: 0, async: xxx like you suggested, since it's not really one long-running operation, but rather a task that executes many times over a long period? Any suggestion is welcome 😃 Thanks |
@nicolasbouchard-ubi here it is not the task itself that takes long, as the task is retried multiple times. It is the same situation as a long-running task for which you would use async_status: the task itself finishes rather quickly, but the until: condition is not met.
What this will do is: it will execute all tasks up to querying the agent for build status, then wait 1h (or less if you manually approve the step), and then continue from the point where it checks the status. Now the 4h limit applies to step 1 and step 3 separately, effectively 4h for each run. This should help you. |
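A rough sketch of that bounded-polling idea applied to the REST API case above; the endpoint URL, response field and timings are placeholders, not from the thread. Each workflow node polls for well under 4h and ends cleanly on timeout, so the workflow can run the same node again (optionally behind an approval step) until the API reports finished:

```yaml
- hosts: localhost
  gather_facts: no
  tasks:
    - name: Poll the deployment API for at most ~3h within this workflow node
      ansible.builtin.uri:
        url: "https://deploy.example.com/api/status"   # placeholder endpoint
        return_content: yes
      register: api_result
      until: api_result.json.state == "finished"       # placeholder response field
      retries: 180          # 180 * 60s = 3h maximum inside this single job
      delay: 60
      ignore_errors: true   # let this node finish so the next node can keep polling
```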
We ran into a really obscure issue when working on ansible/receptor#683. I'll try to make this at least somewhat digestible.

Due to a bug in Kubernetes, AWX can't currently run jobs longer than 4 hours when deployed into Kubernetes. More context on that in ansible/awx#11805. To address this issue, we needed a way to restart from a certain point in the logs. The only mechanism Kubernetes provides to do this is by passing "sinceTime" to the API endpoint for retrieving logs from a pod.

Our patch in ansible/receptor#683 worked when we ran it locally, but in OpenShift, jobs errored when unpacking the zip stream at the end of the results of "ansible-runner worker". Upon further investigation this was because the timestamps of the last 2 lines were exactly the same:

```
2022-11-09T00:07:46.851687621Z {"status": "successful", "runner_ident": "1"}
2022-11-08T23:07:58.648753832Z {"zipfile": 1330}
2022-11-08T23:07:58.648753832Z UEsDBBQAAAAIAPy4aFVGnUFkqQMAAIwK....
```

After squinting at this code for a bit I noticed that we weren't flushing the buffer here like we do in the event_handler and other callbacks that are fired in streaming.py. The end. Ugh.
I think this issue and the following issue are talking about the same bug: @TheRealHaoLiu left a comment that may be of interest to those on this thread: |
Update / Fix: this should address the 4 hour timeout limit as well as the log rotation issue. The fix applies to both K8S and OCP users. How to get this change:
Requirements for K8S
Requirements for OpenShift
|
Great news! Hope this will fix the problem permanently. If you cannot update to the fixed version for any reason, you might want to read a detailed description of how to work around this limitation here: https://thecattlecrew.net/2022/12/06/overcome-4h-runtime-problem-with-ansible-awx/ |
A couple of potential gotchas with our current fix:
NOTE: the default auto mode (which currently is not able to detect the OpenShift version and enable the fix correctly) will detect the kube version on a case-by-case basis (an explicit setting is sketched below).
|
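If relying on auto-detection is a concern in a particular cluster, a hedged option is to pin the mode explicitly. The RECEPTOR_KUBE_SUPPORT_RECONNECT variable and the ee_extra_env field of the AWX custom resource used below are assumptions based on the linked fix, so check the release notes for your versions before applying it:

```yaml
# Assumption: reconnect support is toggled via RECEPTOR_KUBE_SUPPORT_RECONNECT
# (enabled / disabled / auto), injected into the execution environment through
# the AWX Operator's ee_extra_env field.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
```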
fixed in ansible/receptor#683 |
We found an issue using this fix: we get huge cuts in the AWX job output. For example, ~2 hours are missing, with only a small part of the start and a small part of the end present (k8s 1.25.4-do.0):
Time: 22:04:32 -> 00:11:31
Time: 01:39:55 -> 01:59:38
Looks like AWX picks the first log chunk and the last one and skips everything in between |
Raised new issue #13376 |
|
Please confirm the following
Summary
Playbooks running longer than 4 hours are terminated unexpectedly. The jobs finish in an error state in the GUI.
The exit code is "137".
Same issue: closed without resolution.
Tested on versions: AWX 19.4.0 and AWX 20.0.0
AWX version
20.0.0
Select the relevant components
Installation method
kubernetes
Modifications
yes
Ansible version
core 2.13.0.dev0
Operating system
CentOS8
Web browser
Chrome
Steps to reproduce
```yaml
- hosts: all
  gather_facts: no
  tasks:
    - shell: |
        while(1){
          Write-Output "."
          start-sleep -seconds 1800
        }
      args:
        executable: /usr/bin/pwsh
      async: 43200
      poll: 900
      register: pwsh_output_job
      ignore_errors: true
```
Expected results
Playbook completes successfully
Actual results
Container running the job is terminated after running for 4 hours
Additional information
exitCode: 137