Playbooks terminated unexpectedly after 4 hours #11805

Closed
spireob opened this issue Feb 24, 2022 · 62 comments

@spireob

spireob commented Feb 24, 2022

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that AWX is open source software provided for free and that I might not receive a timely response.

Summary

Playbooks running longer than 4 hours are terminated unexpectedly. The jobs finish in an error state in the GUI, with exit code 137.
A similar issue was reported before and closed without resolution.
Tested on versions: AWX 19.4.0 and AWX 20.0.0

AWX version

20.0.0

Select the relevant components

  • UI
  • API
  • Docs

Installation method

kubernetes

Modifications

yes

Ansible version

core 2.13.0.dev0

Operating system

CentOS8

Web browser

Chrome

Steps to reproduce


```yaml
- hosts: all
  gather_facts: no

  tasks:
    - name: Run Job
      shell: |
        while(1){
          Write-Output "."
          start-sleep -seconds 1800
        }
      args:
        executable: /usr/bin/pwsh
      async: 43200
      poll: 900
      register: pwsh_output_job
      ignore_errors: true
```

Expected results

Playbook completes successfully

Actual results

Container running the job is terminated after running for 4 hours

Additional information

exitCode: 137

@spireob
Author

spireob commented Mar 17, 2022

Anything new on this topic?

@meis4h

meis4h commented Apr 28, 2022

Hi, we are also seeing this issue on K3s using AWX 19.5.0 and 21.0.0.
A few things we observed looking at this:

  • Jobs fail after the last task that runs past the 4 hour mark has completed
    • for example, a wait_for task with a timeout of 8 hours causes the job to fail with an error after 8 hours, and a 5 hour task fails after 5 hours (a minimal repro sketch is shown after this list)
  • The limit seems to be pretty much exactly 4 hours as jobs running 3h 50min complete successfully
  • Jobs are not continued in the background as the Pod is deleted instantly after the job errors
  • Also happens with job timeout set in the job template
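
A minimal repro sketch of that first observation, assuming a single task that simply sleeps past the 4 hour mark (the play layout and the 8 hour value are illustrative, not copied from any reporter's playbook):

```yaml
# Hypothetical repro: with only "timeout" set, ansible.builtin.wait_for just sleeps
# for that many seconds, so this single task runs well past the 4 hour mark.
- hosts: all
  gather_facts: no
  tasks:
    - name: Wait 8 hours in a single task
      ansible.builtin.wait_for:
        timeout: 28800   # 8 hours; per the observation above, the job errors at ~8h, not at 4h
```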

@kladiv

kladiv commented Apr 28, 2022

@spireob maybe you can check below

I have the same issue of jobs ending with errors at the 4 hour mark (in k3s)

@d-rupp

d-rupp commented May 19, 2022

We also encounter an issue like this regularly. It seems awx-task just decides that the job is done and kills the pod.

This is what I find in the awx-task log:

2022-05-19 13:20:17,503 INFO     [3abf5855276042c595518de57f670161] awx.main.commands.run_callback_receiver Event processing is finished for Job 15161, sending notifications
2022-05-19 13:20:17,503 INFO     [3abf5855276042c595518de57f670161] awx.main.commands.run_callback_receiver Event processing is finished for Job 15161, sending notifications
2022-05-19 13:20:18,107 DEBUG    [3abf5855276042c595518de57f670161] awx.main.tasks.jobs job 15161 (running) finished running, producing 382 events.
2022-05-19 13:20:18,107 DEBUG    [3abf5855276042c595518de57f670161] awx.main.dispatch task c950003b-4c05-49d1-9b45-43e671098931 starting awx.main.tasks.system.handle_success_and_failure_notifications(*[15161])
2022-05-19 13:20:18,109 DEBUG    [3abf5855276042c595518de57f670161] awx.analytics.job_lifecycle job-15161 post run
2022-05-19 13:20:18,238 DEBUG    [3abf5855276042c595518de57f670161] awx.analytics.job_lifecycle job-15161 finalize run
2022-05-19 13:20:18,342 WARNING  [3abf5855276042c595518de57f670161] awx.main.dispatch job 15161 (error) encountered an error (rc=None), please see task stdout for details.
2022-05-19 13:20:18,345 DEBUG    [3abf5855276042c595518de57f670161] awx.main.tasks.system Executing error task id ecbb37f9-809d-4317-9d01-af93846de8d6, subtasks: [{'type': 'job', 'id': 15161}]

All the while the task output just says "canceled".
If there is anything I can do to help analyze this, please tell me what to do.

It is not related to the linked issues above.

Edit: sorry, I was missing data about the system:

AWX: 21.0.0 running on K3S v1.23.6

@adpavlov

adpavlov commented Jun 2, 2022

Exactly the same issue.

@meis4h

meis4h commented Jun 2, 2022

can confirm that this also happens on RedHat Ansible Automation Platform 2.1 on OpenShift 4.8

@adpavlov

adpavlov commented Jun 2, 2022

Have you opened a case to RedHat?

@meis4h

meis4h commented Jun 2, 2022

Yes, but no news yet.

@adpavlov

adpavlov commented Jun 2, 2022

Okay, could you please keep us posted on the status of this case? Also, there should be an SLA for a paid subscription. This issue is quite critical for me.

@meis4h

meis4h commented Jun 2, 2022

Will do.
In the meantime we could largely work around the issue by splitting the job into multiple separate jobs connected via workflow.

@kiril18

kiril18 commented Jun 2, 2022

I hit a similar problem today; after four hours the task failed.

@adpavlov

adpavlov commented Jun 2, 2022

@spireob maybe you can check below

it could be related to those issues.

I've the same issue of 4hrs jobs end with errors (in k3s)

For my installation I don't believe it's k3s-related, as I have a 500 MB limit for logs. What's more, I don't even see log files created under /var/log/pods/, just empty folders.

Also, I'm using a custom EE built with Ansible 2.9, as suggested in one of @AlanCoding's repos, so I believe the issue is not related to ansible-runner but to awx-task, which seems to have some timeout while waiting for output from a task.

@cmatsis

cmatsis commented Jun 8, 2022

Same issue on AWX 21.0.0 running on K3s v1.23.6 :(
Any workaround for this problem?

@stefanpinter

stefanpinter commented Jun 17, 2022

Same problem with AWX 21.1.0 & k3s v1.21.7+k3s1.
For now, where I "know" that the last task ended the way it should, I re-run the playbook with only the remaining tags.

Well, I can only assume that the last task ended without error, as I don't see an "ok", "changed" or "failed"...

@adpavlov

@meis4h Is there any news from support?

@cmatsis

cmatsis commented Jul 6, 2022

Does this problem also occur in the paid version, with no solution?
Are there really so few people running jobs for more than 4 hours?

@d-rupp d-rupp mentioned this issue Jul 14, 2022
@3zAlb

3zAlb commented Aug 1, 2022

We are also having this issue running the latest AWX, k3s, and the Docker backend. Container log size is set to 500 MB with up to 4 files allowed (a single log file is generated and gets nowhere near 500 MB).

This is a pretty big showstopper for long-running maintenance playbooks.

Can we get an update on this? This issue has been open since February and I've seen numerous closed issues with the same problem.

@NadavShani

same here

@StefanSpecht

We have exactly the same issue.

@adpavlov

adpavlov commented Aug 3, 2022

@meis4h could you please update?

Also, let's probably ping active developers like @AlanCoding 😅

@sylvain-de-fuster

Hello,

As many here, we have the same behaviour on our side (AWX 21.5.0 with k3s).
The information given so far doesn't suggest a quick resolution.

We have several issues in our migration tests, but this one is at the top of the list.
We don't have many long-duration jobs, but they are very important.

Is there anybody with a workaround for long-duration tasks? How do you proceed in the meantime?

Thank you all.

@bartowl

bartowl commented Oct 4, 2022

The only workaround that worked for me was to create a workflow and split the job into multiple jobs. It does not even require multiple job templates if you work smartly with tags: mark some tasks with a tag like step1, the next with step2 and so on, and then include the same job template in the workflow multiple times, each time with a different tag. Passing variables between the different steps can be done with ansible.builtin.set_stats. This is still cumbersome, and problematic for a single task that might run longer than 4h. For such a single task you have to use poll: 0 and async: xxx, pass the registered variable via set_stats, and optionally query the progress from the next step in the workflow with async_status (a rough sketch of this pattern is shown below).

This is doable, but the only real way to get around the problem is to redesign the part where the automation container is started. Instead of reading its output once, as it does now, it has to be read in a kind of while-true loop until the container really finishes. At the moment, the container gets aborted when the HTTPS connection to kubectl gets disconnected.
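
A rough sketch of the async/set_stats hand-off described above, assuming two workflow nodes that run against the same hosts; the command, the 6 hour async value and the long_job_jid artifact name are placeholders, not taken from this thread:

```yaml
# Node 1 job template: start the long task without waiting for it.
- hosts: all
  gather_facts: no
  tasks:
    - name: Kick off the long-running command in the background
      ansible.builtin.shell: /usr/local/bin/long_maintenance.sh   # placeholder command
      async: 21600   # allow up to 6 hours
      poll: 0        # do not wait inside this job
      register: long_job

    - name: Publish the async job id as a workflow artifact
      ansible.builtin.set_stats:
        data:
          long_job_jid: "{{ long_job.ansible_job_id }}"

# Node 2 job template: the artifact arrives as an extra var and is polled here.
- hosts: all
  gather_facts: no
  tasks:
    - name: Wait for the background task started by the previous node
      ansible.builtin.async_status:
        jid: "{{ long_job_jid }}"
      register: job_result
      until: job_result.finished
      retries: 210   # ~3.5 hours at 60 second intervals, kept under the 4 hour limit
      delay: 60
```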

@sylvain-de-fuster

Thanks for your reply.
This is a very interesting workaround. We will experiment with that approach and see if it is compatible and not too painful for our users' use cases.

@arcsurf

arcsurf commented Oct 10, 2022

@adpavlov sadly there is nothing new to report 😕​

Hi @adpavlov, first of all, thank you for sharing your experience. I was wondering whether you ever found a workaround or an answer. I keep trying but can't find a solution. I'm now using CRI-O as the container runtime; I saw in another post that somebody tried the Docker runtime too and it didn't work. Thank you.

@adpavlov

Unfortunately not. All proposed workarounds simply have no positive effect.
Perhaps @meis4h got some response from Red Hat support? I bet the SLA has already been violated :)

@arcsurf

arcsurf commented Oct 10, 2022

I'm sorry @adpavlov, yes, I was trying to ask @meis4h. Maybe @meis4h got something from Red Hat.

Thank you everyone.

@bartowl

bartowl commented Oct 10, 2022

After looking more deeply into this issue: AWX uses Receptor to handle running the k8s pods. It might be that the fix we need is to call Receptor in a different way, or the issue may even need to be reported against Receptor. In particular, Receptor is called from the run_until_complete method, which submits the work request to Receptor and queries for status. It also allocates some sockets for bidirectional communication in _run_internal, and this is what I'm afraid is running into the timeout.

So basically, AWX uses external projects like Receptor and Kubernetes, Kubernetes declares a static hardcoded 4h timeout, and Receptor seems to break off after that time. Now the big question is: who should fix this, and in which part of the code? Is it Receptor that needs fixing, or maybe the way AWX uses it?

One has to consider that AWX needs bidirectional communication with the running automation pod, for example in order to interactively pass passwords and so on. On the other hand, it should implement some re-attach mechanism; the pod has to keep running even if the connection breaks. That also means telling the reaper not to reap such detached pods. And once AWX re-connects to a running pod, it needs to figure out how far it had read the output previously and continue from that point, so that web pages watching the progress are updated without skipping or doubling tasks... This is far from a trivial task, at least as far as I can tell.

@meis4h

meis4h commented Oct 10, 2022

@adpavlov @arcsurf sadly I don't have anything new to report from Red Hat. They said they also are tracking this internally and are working on a fix but there is no ETA since there is no SLA on bugs like this.

@arcsurf

arcsurf commented Oct 10, 2022

@meis4h :( Thank you for your answer. Let's keep looking for the solution. :)

@nicolasbouchard-ubi

nicolasbouchard-ubi commented Oct 14, 2022

@bartowl
Thanks for your workaround suggestion!

I was already aware of the polling feature in Ansible, but I'm not sure it can be used in my case; same for the workflow. Maybe other people are in the same situation, so I would like to dig for a workaround.

At the moment, I have a task that uses an until loop to query a REST API until the API returns "state: finished".

```yaml
- name: Wait for agent to finish running build
  ansible.windows.win_uri:
    url: "REDACTED"
    method: GET
    url_username: "REDACTED"
    url_password: "REDACTED"
    return_content: yes
    content_type: application/json
    headers:
      accept: application/json
  throttle: 25
  register: build_info
  until: build_info.json is defined and build_info.json.state == "finished"
  retries: 120
  delay: 60 # Check build status every minute for 2h, to try to prevent the 4h playbook disconnection (not really working because I have multiple plays)
```

How could such a task use poll: 0 and async: xxx like you suggested, since it's not really a single long-running operation but a task that executes many times over a long period?

Any suggestion is welcome 😃

Thanks

@bartowl

bartowl commented Oct 14, 2022

@nicolasbouchard-ubi here it is not the task itself that takes long; the task is retried multiple times (with a genuinely long-running task you would use async_status in the same way). The task itself finishes rather quickly, but the until: condition is not met.
What you could do, when you know that this task will take at least 1h, is split the job template at this point, marking all tasks up to this one with, for example, tag: stage1. Starting from this task, mark everything else stage2.
You will have to propagate all variables defined or changed by tasks in stage1 that are needed in stage2 via the set_stats module.
Then create a workflow template with 3 elements (a rough sketch follows below):

  1. this job template with tags: stage1
  2. an approval step with a 1h timeout and a default Accept approval (this will give you a 1h delay before the first status check)
  3. this job template with tags: stage2

What this will do is execute all tasks up to the point of querying the agent for build status, then wait 1h (or less if you manually approve the step), and then continue from the point where it checks the status. Now the 4h limit applies to steps 1 and 3 separately, effectively 4h for each run. This should help you.
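
A rough sketch of that split, assuming a hypothetical playbook; the Start-Build.ps1 call, the build_id variable and the URL are placeholders standing in for whatever actually precedes and follows the long wait:

```yaml
# Stage 1: everything up to (but not including) the long status poll.
- hosts: all
  gather_facts: no
  tasks:
    - name: Start the build on the agent
      ansible.windows.win_shell: C:\scripts\Start-Build.ps1   # placeholder
      register: build_start
      tags: [stage1]

    - name: Hand the build id over to the stage2 workflow node
      ansible.builtin.set_stats:
        data:
          build_id: "{{ build_start.stdout | trim }}"
      tags: [stage1]

    # Stage 2: the long wait, run again from a later workflow node.
    - name: Wait for agent to finish running build
      ansible.windows.win_uri:
        url: "https://REDACTED/builds/{{ build_id }}"   # placeholder URL
        method: GET
        return_content: yes
      register: build_info
      until: build_info.json is defined and build_info.json.state == "finished"
      retries: 120
      delay: 60
      tags: [stage2]

# Workflow template: node 1 runs this template with job tag "stage1",
# node 2 is an approval step with a 1h timeout, node 3 runs it again with job tag "stage2".
```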

shanemcd added a commit to shanemcd/ansible-runner that referenced this issue Nov 9, 2022
We ran into a really obscure issue when working on ansible/receptor#683.

I'll try to make this at least somewhat digestible.

Due to a bug in Kubernetes, AWX can't currently run jobs longer than 4 hours when deployed into Kubernetes. More context on that in ansible/awx#11805

To address this issue, we needed a way to restart from a certain point in the logs. The only mechanism Kubernetes provides to do this is by passing "sinceTime" to the API endpoint for retrieving logs from a pod.

Our patch in ansible/receptor#683 worked when we ran it locally, but in OpenShift, jobs errored when unpacking the zip stream at the end of the results of "ansible-runner worker". Upon further investigation this was because the timestamps of the last 2 lines were exactly the same:

```
2022-11-09T00:07:46.851687621Z {"status": "successful", "runner_ident": "1"}
2022-11-08T23:07:58.648753832Z {"zipfile": 1330}
2022-11-08T23:07:58.648753832Z UEsDBBQAAAAIAPy4aFVGnUFkqQMAAIwK....
```

After squinting at this code for a bit I noticed that we weren't flushing the buffer here like we do in the event_handler and other callbacks that are fired in streaming.py. The end. Ugh.
@rooftopcellist
Member

I think this issue and the following issue are talking about the same bug:

@TheRealHaoLiu left a comment that may be of interest to those on this thread:

@fosterseth
Member

fosterseth commented Dec 2, 2022

Update

Fix

PR ansible/receptor#683

This should address the 4 hour timeout limit as well as the log rotation issue

The fix applies to both K8S and OCP users

How to get this change

  • The fix is already in quay.io/ansible/awx-ee:latest
  • custom EEs to run jobs based on awx-ee do not need to be updated
  • only the awx-ee container running the AWX pod needs to be latest
  • Redeploying AWX should trigger awx-operator to re-pull awx-ee:latest to use as the control_plane_ee_image. This is all that most users should need to do to get the change.

Requirements for K8S

  • K8S server should be >= one of the following,
    • 1.23.14
    • 1.24.8
    • 1.25.4
  • use kubectl version to check your K8S server version
  • Receptor will auto detect the K8S server and enable the reconnect support feature if on a compatible version
  • if the cluster is not running at least one of those versions, Receptor will fall back to the older method and the 4 hour timeout issue will persist.

Requirements for OpenShift

  • Currently Receptor does not auto detect OCP versions, so enabling the fix must be manual
  • OCP server should be >= one of the following:
    • 4.10.42
    • 4.11.16
    • 4.12.0
  • To enable the fix on OCP, you should set the following variable in the AWX resource (a sketch of where this goes follows after this list):

```yaml
ee_extra_env: |
  - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
    value: enabled
```

  • it is critical that you only enable this if you are on a compatible version of OCP, otherwise jobs may fail unexpectedly.
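
For reference, a minimal sketch of where this lands in an operator-managed deployment, assuming the AWX custom resource and namespace are both named awx (placeholders); ee_extra_env is the spec field shown above:

```yaml
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx            # placeholder resource name
  namespace: awx       # placeholder namespace
spec:
  ee_extra_env: |
    - name: RECEPTOR_KUBE_SUPPORT_RECONNECT
      value: enabled
```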

@bartowl

bartowl commented Dec 6, 2022

Great news! Hope this will fix the problem permanently. If you cannot update to the fixed version for any reason, you might want to read a detailed description of how to work around this limitation here: https://thecattlecrew.net/2022/12/06/overcome-4h-runtime-problem-with-ansible-awx/

@TheRealHaoLiu
Member

TheRealHaoLiu commented Dec 6, 2022

A couple of potential gotchas with our current fix:

  1. Using the flag RECEPTOR_KUBE_SUPPORT_RECONNECT = "enabled" completely skips the kube version check for ALL container groups (including external container groups added by the user); if any of the clusters behind those container groups does not include the kube fix we rely on, job execution on that container group will fail.

NOTE: the default "auto" mode (which currently is not able to detect the OpenShift version and enable the fix correctly) detects the kube version on a case-by-case basis.

  2. The "auto" detection relies on the kubernetes apiserver version, but the fix we need from kube lives in the kubelet. It is possible for an individual worker node to have a kubelet out of sync with the kube-apiserver's version. In that case we will incorrectly detect support and enable the fix, which will cause job failures when running on the out-of-date worker node.

@TheRealHaoLiu
Member

fixed in ansible/receptor#683

@kzinas-adv

kzinas-adv commented Dec 22, 2022

We found an issue when using this fix. We get huge gaps in the AWX job output; for example, ~2 hours are missing, with only a small part of the start and a small part of the end:

k8s 1.25.4-do.0
AWX 21.10.1
receptor 1.3.0
Fix enabled

Time: 22:04:32 -> 00:11:31
Playbook run: 0:01:40.151 -> 2:08:39.984

Wednesday 21 December 2022  22:04:32 +0000 (0:00:29.791)       0:01:40.151 ****

TASK [create : Create var volume] ***************************************
ok: [db02.host.ge -> localhost]
ok: [db01.host.ge -> localhost]

changed: [localhost] => (item=www.random.org)
Thursday 22 December 2022  00:11:31 +0000 (0:00:03.724)       2:08:39.984 *****

Time: 01:39:55 -> 01:59:38
Playbook run: 0:08:04.813 -> 0:27:48.142

Thursday 22 December 2022  01:39:55 +0000 (0:00:00.186)       0:08:04.813 ***** 

TASK [generic-backend : Copy files to /opt/ab-reports] *************************
ok: [srv02.host.ge] => (item=/runner/project/private/nats/nats.ca.pem)
ok: [srv02.host.ge] => (item=/runner/project/private/nats/nats.client.pem)
ok: [srv02.host.ge] => (item=/runner/project/private/nats/nats.client.key)
Thursday 22 December 2022  01:59:38 +0000 (0:00:00.498)       0:27:48.142 ***** 

It looks like AWX picks up the first and last parts of the log and skips everything in between.

@kzinas-adv

Raised new issue #13376

