Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWX RuntimeError resulting in loss of worker task logs and true outcome #9961

Closed
duntonr opened this issue Apr 20, 2021 · 46 comments
Closed

Comments

@duntonr
Copy link

duntonr commented Apr 20, 2021

ISSUE TYPE
  • Bug Report
SUMMARY

AWX UI reporting stalls eventually errors out with unhelpful

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)

In reality, though, the job has actually continued and either completed successfully OR failed due to a normal (helpful) job/task/play error. This is found by looking at the spawned worker container logs.

The UI error IS raised at the same time as when the JOB completes.

Downloading logs from the UI results in the INCOMPLETE log set, eg not the apparent zip file the worker container uploads back

AWX reports the job as a failure in the job list, even though it succeeded from the actual container logs. This is VERY confusing.

ENVIRONMENT
  • AWX version: 19.0.0
  • AWX install method: AWX Operator 0.8.0 via microk8s 1.4.4
  • Ansible version: 2.11.0b4
  • Operating System: Ubuntu 20.04
  • Web Browser: Brave 1.23.71 / Chromium 90.0.4430.72
STEPS TO REPRODUCE
  1. Use standard AWX EE 0.1.1 or derived custom EE container for a long-running/verbose job
  2. Track job progress from both AWX UI as well as container logs from k8s (dashboard, kubectl, etc)
  3. After some point, the UI stops tracking
  4. Continue to monitor the container output of the job
  5. Compare results from job container output to AWX UI (note you need to adjust the cleanup timing in AWX or be on top of tailing / dumping the container logs as the container is cleaned up once the job exits
EXPECTED RESULTS

AWX UI Job output continues to stay synced with actual job output and status match

ACTUAL RESULTS

AWX UI logging falls out of sync from execution and reports job failed with .../tasks.py error, despite actual job outcome. If there was an actual Job error, this is not displayed (so could not tell WHY a job failed without worker pod logs).

ADDITIONAL INFORMATION

After continuously tailing the worker pod logs, I was able to see this:

...
{"uuid": "cfed869a-3ecb-4919-9301-5479b816bec1", "counter": 10218, "stdout": "\r\nPLAY RECAP *********************************************************************\r\n\u001b[0;33mconsul01.internal.therack.io\u001b[0m : \u001b[0;32mok=109 \u001b[0m \u001b[0;33mchanged=39  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=37  \u001b[0m rescued=0    \u001b[1;35mignored=4   \u001b[0m\r\n\u001b[0;33mconsul02.internal.therack.io\u001b[0m : \u001b[0;32mok=97  \u001b[0m \u001b[0;33mchanged=27  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=49  \u001b[0m rescued=0    \u001b[1;35mignored=1   \u001b[0m\r\n\u001b[0;33mconsul03.internal.therack.io\u001b[0m : \u001b[0;32mok=97  \u001b[0m \u001b[0;33mchanged=27  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=49  \u001b[0m rescued=0    \u001b[1;35mignored=1   \u001b[0m\r\n\u001b[0;33mmaas.internal.therack.io\u001b[0m   : \u001b[0;32mok=63  \u001b[0m \u001b[0;33mchanged=21  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=22  \u001b[0m rescued=0    \u001b[1;35mignored=3   \u001b[0m\r\n\u001b[0;33mmarathon.internal.therack.io\u001b[0m : \u001b[0;32mok=64  \u001b[0m \u001b[0;33mchanged=22  \u001b[0m unreachable=0    failed=0    \u001b[0;36mskipped=21  \u001b[0m rescued=0    \u001b[1;35mignored=3   \u001b[0m\r\n", "start_line": 30927, "end_line": 30935, "runner_ident": "209", "event": "playbook_on_stats", "job_id": 209, "pid": 20, "created": "2021-04-20T17:51:16.379887", "parent_uuid": "87632333-0e50-47fe-8754-3fe63b1b5216", "event_data": {"playbook": "playbooks/controlDeck/common/configureServices.yml", "playbook_uuid": "87632333-0e50-47fe-8754-3fe63b1b5216", "changed": {"consul01.internal.therack.io": 39, "consul02.internal.therack.io": 27, "consul03.internal.therack.io": 27, "marathon.internal.therack.io": 22, "maas.internal.therack.io": 21}, "dark": {}, "failures": {}, "ignored": {"consul01.internal.therack.io": 4, "consul02.internal.therack.io": 1, "consul03.internal.therack.io": 1, "marathon.internal.therack.io": 3, "maas.internal.therack.io": 3}, "ok": {"consul01.internal.therack.io": 109, "consul02.internal.therack.io": 97, "consul03.internal.therack.io": 97, "marathon.internal.therack.io": 64, "maas.internal.therack.io": 63}, "processed": {"consul01.internal.therack.io": 1, "consul02.internal.therack.io": 1, "consul03.internal.therack.io": 1, "marathon.internal.therack.io": 1, "maas.internal.therack.io": 1}, "rescued": {}, "skipped": {"consul01.internal.therack.io": 37, "consul02.internal.therack.io": 49, "consul03.internal.therack.io": 49, "marathon.internal.therack.io": 21, "maas.internal.therack.io": 22}, "artifact_data": {}, "uuid": "cfed869a-3ecb-4919-9301-5479b816bec1"}}
{"status": "successful", "runner_ident": "209"}
{"zipfile": 675955}
P)h>@6aWAK2mrv0lu{6R!x41^000*a000L7003ieZEaz0WUW?PZ=....<more binary dump, I'm assuming its a zip file>

As you can see, job 209 finished successfully.

However
image

I also inspected the container logs for the redis, awx-web, awx-task, and awx-ee in the AWX pod but didn't see anything immediately apparently around the time AWX UI stopped tracking

@shanemcd
Copy link
Member

Interesting. Some questions:

  • Does this happen consistently?
  • What can you tell us about the workload itself?
    • How many tasks?
    • Are you producing a lot of stdout?
    • Are you making heavy use of ansible facts?

@duntonr
Copy link
Author

duntonr commented Apr 20, 2021

Hi @shanemcd

  • Does this happen consistently?
    • So far, it seems intermittent or at least is not triggered if the job naturally hits an error earlier in the run. I have a limited amount of data so far (decided to rebuild my AWX environment from scratch due to issues upgrading from pre-18) and am working through configuring (and improving) my playbooks and templates along the way.
  • Workload
    • It is a rather heavy set of plays. It does things like configure partitions/fs/exports, install docker, gusterfs, consul, nomad, vault, do system updates, etc. This is across 5 hosts.
    • The UI seemed to only capture 23 plays and 445 tasks. Based on the pod log, its more like 80 plays
    • It's currently sent to verbose (level 3) output and if I download the incomplete file from UI, its ~2.5M. When downloading the worker pod logs, it was getting close to 10mb (but the formatting is different there)
    • It does make use of facts throughout the plays, eg to determine if a host is a gusterfs server or client, nomad server or client, etc

When running with AWX 17, via local Docker Compose, I did not have these issues for a very similar amount of tasks/stdio

I have some additional templates/playbooks to bring over, including a larger one that will run over ~36 hosts and will keep an eye out there for the same behavior

@shanemcd
Copy link
Member

@duntonr Thanks for the information. We did indeed make some pretty radical changes to the architecture of AWX in version 18, so that explains why you weren't seeing this issue before.

I will spend some time tomorrow trying to reproduce this, or at very least improve the error handling.

@shanemcd
Copy link
Member

@duntonr One more question: is this error the only thing you see in the stdout? Or was there output before the traceback?

@duntonr
Copy link
Author

duntonr commented Apr 20, 2021

Thanks @shanemcd !

For what its worth, I came across #9917 and https://groups.google.com/g/awx-project/c/MACNtPrGpV8 as those are the only google hits for File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run

Separately, I had similar issues as reported in 9917 but was able to eventually get a custom EE container builtt and working (to include Galaxy collections) by using the workarounds mentioned there. I needed to use your suggested --build-arg ANSIBLE_RUNNER_IMAGE=quay.io/ansible/awx-ee:0.1.1

If though, I tried the (very new) --build-arg ANSIBLE_RUNNER_IMAGE=quay.io/ansible/awx-ee:0.2.0, I would get

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Pod Running

I would get that same error if I tried the out-of-the-box 0.2.0 quay image as well. I can report the error over there, but mentioning it here in case it helps shed any light on the AWX Runtime vs worker pod relationships

@duntonr
Copy link
Author

duntonr commented Apr 20, 2021

@duntonr One more question: is this error the only thing you see in the stdout? Or was there output before the traceback?

That's an interesting thing as well... when the job starts, things are normal, eg the log starts with

Enter passphrase for /runner/artifacts/209/ssh_key_data: 
Identity added: /runner/artifacts/209/ssh_key_data ([email protected])
ansible-playbook [core 2.11.0b4] 
...

I can keep scrolling though the logs via AWX UI (double down carrots) until it stalls. At that point, the flashy green indicator keeps flashing to show the job is still running but no new logs.

When the job does end, the indicator will turn red, but again, no new logs/errors/etc at the bottom of the logging window

If I then refresh the Job Output screen, that's when the Traceback error shows up, at the TOP of the log

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished
Enter passphrase for /runner/artifacts/209/ssh_key_data: 
Identity added: /runner/artifacts/209/ssh_key_data ([email protected])
ansible-playbook [core 2.11.0b4] 
...

@duntonr
Copy link
Author

duntonr commented Apr 22, 2021

Just by way of update, I have been able to run a different playbook across different "stacks" of 4 hosts/stack WITHOUT issue so far.

I've made 5 runs of a template that consists of:

  • 4 hosts
  • 4 plays
  • ~528 tasks
  • ~ 37minutes
  • L3 logging, resulting in download size of ~71KB log txt file when finished
  • uses facts

This playbook/template is a slightly lighter version of the one I was having issues with. That said it's mostly the same software and actually imports a lot of the same task files....just these are worker nodes, vs the control/server hosts where the issue was.

This is a home lab, so there aren't any crazy vlaning/security issues running between the two (same subnet, etc) host groups. The only major difference I could think of is the problematic playbook runs against x86-64 hosts, whereas this "worker" playbook runs against arm64v8 hosts. Should not make a difference for this it seems, but figured I would mention it.

Once this rollout is finished, I will run a playbook that touches all ~35 hosts and see whats happens

@DrackThor
Copy link

DrackThor commented May 4, 2021

@duntonr I'm experiencing quite the same issues - I often get this output:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished

I've recently got my custom EE working on my test cluster (#9917 ), where I also dont get this error.
On my second cluster however I keep getting this RuntimeError quite frequently.

ENVIRONMENT
AWX version: 19.0.0
AWX install method: AWX Operator 0.8.0 via kubeadm 1.19.10
Ansible version - awx-ee: core 2.12.0.dev0
Operating System: Ubuntu 18.04
Web Browser: Chromium 90.0.4430.93

@nlvw
Copy link

nlvw commented May 4, 2021

I was receiving the same error, and behavior in 19.0.0. In 19.1.0 I no longer receive the error but the behavior remains. I cannot run any job with more then 3 hosts without the job failing mysteriously with cutoff output and no errors.

@DrackThor
Copy link

I was receiving the same error, and behavior in 19.0.0. In 19.1.0 I no longer receive the error but the behavior remains. I cannot run any job with more then 3 hosts without the job failing mysteriously with cutoff output and no errors.

I was able to get my EE running like I mentioned in #9917, maybe this also helps you 😄

@nlvw
Copy link

nlvw commented May 6, 2021

@DrackThor I'm using the builtin EE and not a custom one. I've read through the issue you mentioned but failed to see how it address the problem mentioned here.

@duntonr
Copy link
Author

duntonr commented May 7, 2021

Thanks @DrackThor . I'm using a custom EE too but the issue seems to happen with a custom EE (with Galaxy stuff installed in the image) or the standard EE (with Galaxy stuff cobbled right into the repo that hold my playbooks)

I did run a larger set of plays... 28 hosts, ~ 3.5hrs run time, etc... issue still occurs

It's somewhat interesting that error is injected at the TOP of the output in awx.

Also, the error seems to occur quicker on my play against the 5 x86 hosts vs the play against 28 arm64v8 hosts

@shanemcd
Copy link
Member

shanemcd commented May 7, 2021

@duntonr I tried to improve the error handling in 19.1.0. Can you try again and paste the new error?

@shanemcd
Copy link
Member

shanemcd commented May 12, 2021

I think we've finally gotten to the bottom of this. Testing patch at ansible/receptor#319

@rysolv-bot
Copy link

travelingladybug has contributed $50.00 to this issue on Rysolv.

The total bounty is now $50.00. Solve this issue on Rysolv to earn this bounty.

@duntonr
Copy link
Author

duntonr commented May 18, 2021

@shanemcd - Sorry for the delay but 19.1 did NOT solve the issue. The same issue/behavior remains with:

  • Operator 0.9
  • AWX 19.1
  • Custom awx-ee image, based off quay.io/ansible/awx-ee:0.2.0 (just using Dockerfile FROM, not using ansible-builder anymore)

I was excited by ansible/receptor#319, until I read that issues updates :( . It does "feel" kinda like a lock or contention type issue though

@shanemcd
Copy link
Member

shanemcd commented Jun 1, 2021

In terms of what it means to test this: we shouldn't see this error anymore.

This was fixed in the latest alpha release of Receptor, which is going out in AWX 19.2.0 sometime today or tomorrow.

@kdelee
Copy link
Member

kdelee commented Jun 28, 2021

should be fixed by 1ed170f

@shanemcd
Copy link
Member

To be clear, the unhelpful RuntimeError: Pod Running error should be fixed by the aforementioned commit.

@kazigk
Copy link

kazigk commented Jul 1, 2021

1ed170f didn't fix it for me. I'm currently running:

  • AWX Operator 0.12.0
  • AWX 19.2.2 (which includes said commit)
  • AWX EE 0.5.0

Update:
I tried once again on the same playbook and it finished without any error. 115 hosts, 2500 lines of logs (stdout).
Sadly I can't get logs from the previous run as pod is already deleted.
I will post an update if I get any issues in near future.

@kazigk
Copy link

kazigk commented Jul 2, 2021

I'm still facing the issue, but in my case I'm not getting any error nor play recap in automation-job pod logs. The job finishes like it should, but the output is incomplete, so AWX can't even mark hosts as failed, which is very unhelpful. Running the same job on smaller part of inventory (or just splicing it) does solve the issue, but it's less readable (a few different stdout logs to look at isn't ideal).

Maybe I didn't understand what the commit mentioned above should fix or I'm encountering a different issue that just fits the description?

@daenney
Copy link

daenney commented Aug 13, 2021

I'm observing this issue too. We initially had a run that ended with the following traceback and a bunch of log output truncated.

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished

It was really slow, so we ran it again with bigger instances powering it, and now we've ended up with:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Pod Running

I was following the logs with kubectl, and it ends with:

rpc error: code = Unknown desc = Error: No such container: 51b1416e42f635758859cfd2890a9f1021e48d56ff5f2006835c52bcb8483fd8

The logs right before it where just normal play output. Like the container just terminated.

Doing the same job in smaller batches seems to do the trick for us too.

@daenney
Copy link

daenney commented Aug 13, 2021

We've also had a case where running with a larger batch it makes it to the end of the run, but seems to crash when flushing the recap.

The last of the logs in the pod show:

{"status": "failed", "runner_ident": "29178"}
{"zipfile": 28832}
P)h>@6a [..] some output that looks like it's straight out of The Matrix [..] 3SpWb400{"eof": true}

The traceback was:

Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished

@davilmb
Copy link

davilmb commented Aug 23, 2021

Hi @shanemcd and everyone!

I tested the fix mentioned in version 19.2.0, but without success, the same happened when upgrading to 19.2.2 and the new one release 19.3.0.

For the tests I used the awx-ee default (0.5.0) and also the 0.6.0 even the latest tag, always the same problem, after 4 hours of execution the Job is marked with "Error", even if it's terminate successfully in container log.

As an example below:

image

image

Note: I also ran the test on a new installation with version 18.0.0 (where there was no report of the problem), but I had the same problem and this didn't happen in the version that I was using previously (14.1.0).

image

Is there any palliative measure that we can use in this cases ?

Awaiting further comments regarding this issue.

@nlvw
Copy link

nlvw commented Aug 29, 2021

I've only found 2 solutions to this problem. Either have your templates split your job using a high number of slices or revert back to an earlier version of AWX before all of this Kubernetes only refactor stuff began. I really hope this issue gets more attention as it has made the product unusable.

If you deploy AWX 16 (before the incomplete UI redesign for stability) you can update ansible within the docker container to 10.x relatively safely. This will give you the git collection downloads from the requirements.yml file. The steps are relatively simple just anter a bash shell in the ansible container, yum uninstall ansible, and then pip install the desired ansible version.

@nicovs
Copy link

nicovs commented Oct 15, 2021

See: #10366 (comment)

@kladiv
Copy link

kladiv commented Oct 17, 2021

Hi @nicovs ,
i tested the suggested changes (in #10366 comment) in our infrastructure (k3s) but the job still failing after 4:02 hours.

  • AWX 19.3.0
  • AWX EE (latest) built-in installed by AWX installation

Job is a Workflow Job, and one playbook runs a task like below:

  raw: "ps -ef | grep -w /opt/xensource/sm/LVMoISCSISR | grep -v grep | grep -wq vdi_delete"
  register: quiesce_ps
  failed_when: false
  until: quiesce_ps.rc == 1
  retries: "{{ quiesce_wait_max_retries }}"
  delay: "{{ quiesce_wait_retries_delay }}"
  become: no
  delegate_to: "{{ xen_pool_master_inventory_hostname }}"

or like below (i tried a change to check if got same error):

  shell: >-
    RC=0;
    while [ $RC -eq 0 ]; do
      sleep 60;
      ps -ef | grep -w /opt/xensource/sm/LVMoISCSISR | grep -v grep | grep -wq vdi_delete;
      RC=$?;
    done
  register: quiesce_status
  async: 10800 # 3 hrs
  poll: 60 # 1 min
  become: no
  delegate_to: "{{ xen_pool_master_inventory_hostname }}"

Both the until/retries/delay task and async/poll task fails the Job without any error after about 4hrs. Every time it runs, it fails after 4 hrs.

image

image

Another playbook task (it makes XenServer big VM export via command module) fails the Job after about 14hrs without any error:

image

Below the logs i see in /var/log/pods/default_awx-777854cdfb-z2bs4_b24bc0a5-ca74-4561-89ed-378ddbed4d08/awx-task/1.log

2021-10-17T07:43:48.929344109+08:00 stderr F 2021-10-16 23:43:48,928 INFO     [ba8cf785179d40f19847119a0a9dfef2] awx.main.commands.run_callback_receiver Event processing is finished for Job 1090, sending notifications
2021-10-17T07:43:48.929927511+08:00 stderr F 2021-10-16 23:43:48,928 INFO     [ba8cf785179d40f19847119a0a9dfef2] awx.main.commands.run_callback_receiver Event processing is finished for Job 1090, sending notifications
2021-10-17T07:43:49.42528382+08:00 stderr F 2021-10-16 23:43:49,409 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.tasks job 1090 (running) finished running, producing 389 events.
2021-10-17T07:43:49.425323136+08:00 stderr F 2021-10-16 23:43:49,421 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.analytics.job_lifecycle job-1090 post run
2021-10-17T07:43:49.612110438+08:00 stderr F 2021-10-16 23:43:49,611 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.analytics.job_lifecycle job-1090 finalize run
2021-10-17T07:43:49.640764078+08:00 stderr F 2021-10-16 23:43:49,639 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.dispatch task d2b391fd-10bc-488d-9269-c4d6c18fbfed starting awx.main.tasks.update_inventory_computed_fields(*[2])
2021-10-17T07:43:49.683846539+08:00 stderr F 2021-10-16 23:43:49,683 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.models.inventory Going to update inventory computed fields, pk=2
2021-10-17T07:43:49.714765297+08:00 stderr F 2021-10-16 23:43:49,714 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.models.inventory Finished updating inventory computed fields, pk=2, in 0.029 seconds
2021-10-17T07:43:50.735156488+08:00 stderr F 2021-10-16 23:43:50,733 WARNING  [ba8cf785179d40f19847119a0a9dfef2] awx.main.dispatch job 1090 (error) encountered an error (rc=None), please see task stdout for details.
2021-10-17T07:43:50.736988429+08:00 stderr F 2021-10-16 23:43:50,736 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.dispatch task 2b73b673-6d40-4c65-b1a1-7588fd1a9c5b starting awx.main.tasks.handle_work_error(*['2b73b673-6d40-4c65-b1a1-7588fd1a9c5b'])
2021-10-17T07:43:50.738166086+08:00 stderr F 2021-10-16 23:43:50,737 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.tasks Executing error task id 2b73b673-6d40-4c65-b1a1-7588fd1a9c5b, subtasks: [{'type': 'job', 'id': 1090}]
2021-10-17T07:43:50.775480607+08:00 stderr F 2021-10-16 23:43:50,770 DEBUG    [ba8cf785179d40f19847119a0a9dfef2] awx.main.dispatch task 2b73b673-6d40-4c65-b1a1-7588fd1a9c5b starting awx.main.tasks.handle_work_success(*[])

@mw-0
Copy link

mw-0 commented Nov 11, 2021

@smullenrga have you tried the same but change your log sizes. They are 16k default.

@smullenrga
Copy link

@mw-0 I'm looking into the options for log sizes, I'm simply a consumer of our k8s cluster so will possibly have to work with the cluster admins to make any global changes. Will update once I've had a chance to change log settings. Thanks for the reply.

@smullenrga
Copy link

smullenrga commented Nov 15, 2021

Cleaning up/deleting several earlier comments.

We're running Kubernetes and using the GELF driver to log to an ELK stack (configured via /etc/docker/daemon.json). AWX changed somewhere between 15 (worked fine in our environment) and 19.4 to create separate STDOUT lines for each event. In 19.4, very long STDOUT lines (>16K) are being generated which are being split up and then improperly reassembled somewhere. Per moby/moby#22982 a 16K limit / split was put on STDOUT lines in Docker 1.13.

In our environment, after upgrading from AWX 15 to AWX 19.4, AWX Jobs break when a STDOUT line >16K is encountered because whatever is re-assembling these docker-split long lines is failing to put a carriage return on the reassembled line that goes back to AWX. As a result, you get the 16+K JSON object rebuilt as expected however, you also end up with the following JSON even log line appended to the end of the prior long line and the JSON parsing breaks.

As a result of the failed log parsing, the jobs are marked failed regardless of true status.
01-GoodEvents
02-BadEvents
03-ParsedFormattedStart
04-Truncated end of output

@shanemcd
Copy link
Member

Can y'all take a look at #11338 (comment) ? Wondering if this is the same thing.

@daenney
Copy link

daenney commented Nov 19, 2021

@shanemcd Sure looks like it from my point of view. Unfortunately tweaking kubelet args isn't always an option (it's possible but very annoying with managed node pools etc. in certain clouds) so it would be nice to find a way around it. Ideally whatever is tailing/streaming the log needs to handle file rotation transparently.

@smullenrga
Copy link

smullenrga commented Nov 19, 2021

@shanemcd This does not look like my issue, perhaps I should open a new issue. I get all of the logs from kubectl, and I get the full JSON output at the top of the job output after the job's errored out. My overall log is only like 700K and that's only because I'm intentionally generating event data's over 16K (which happens naturally with a windows gather facts when the event body contais all of the host vars on at least some of our systems). My issue really seems to be about the log lines being split over 16K and whatever's reassembling them not putting a carriage return on the line. In the "stack trace" at the top of the failed job in AWX, I see the full event items in JSON format, each on its own line UNLESS the event is over 16K, as soon as it crosses 16K for the event JSON object, the next event's JSON object is appended to the end of the 16+K line and that's the point at which AWX lists it as failed and the pretty/formatted output stops.

@bschaeffer
Copy link

From #11511:

FWIW, this does seem to be related to container runtime. In GKE, we recently upgraded to 1.20 which defaults to containerd at the same time we saw this error. When we rolled back to 1.19, it was also broken, but we realised it was still using containerd://1.4.8.

When we switched the runtime to docker://19.3.15, it actually fixed our problem.

So things are now in a working state, but it will be EOL for us in about 5 months.

@kladiv
Copy link

kladiv commented Jan 18, 2022 via email

@smullenrga
Copy link

I think I've found my root cause - the GELF log driver in docker is not updated to handle the 16K log limit. Our enterprise cluster that AWX is on is configured to use the GELF driver to send logs to ELK and fails as noted above.

AWX on Docker Desktop with default configs works fine, no failure.

As soon as I change Docker Desktop to the GELF driver, I get the same failure.

Looking at the docker (moby) source code for the default jsonfilelog driver, it's updated to read PLogMetaData (which contains the partial log message info) and concatenate lines as needed.

The fluentd driver reads the metadata and passes it on to the log recipient.

The GELF driver has no processing of metadata or line concatenation logic from what I can see and therefore passes the bad partial messages through without any metadata needed for reassembly.

I don't know if AWX is written to handle the docker split log line metadata / reassembly itself or if it is expecting to receive the log lines already reassembled. I'm working on testing the fluentd driver to see if it breaks AWX as well. As far as I can tell, using the jsonfilelog log driver in docker will fix my issue but results in the problem of not being able to send logs to our logging systems as I'm required to do.

@shanemcd
Copy link
Member

Going to close in favor of #11338 since it pinpoints the underlying issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests