AWX RuntimeError resulting in loss of worker task logs and true outcome #9961
Comments
Interesting. Some questions:
|
Hi @shanemcd
When running with AWX 17, via local Docker Compose, I did not have these issues for a very similar amount of tasks/stdio. I have some additional templates/playbooks to bring over, including a larger one that will run over ~36 hosts, and will keep an eye out there for the same behavior. |
@duntonr Thanks for the information. We did indeed make some pretty radical changes to the architecture of AWX in version 18, so that explains why you weren't seeing this issue before. I will spend some time tomorrow trying to reproduce this, or at very least improve the error handling. |
@duntonr One more question: is this error the only thing you see in the stdout? Or was there output before the traceback? |
Thanks @shanemcd ! For what it's worth, I came across #9917 and https://groups.google.com/g/awx-project/c/MACNtPrGpV8 as those are the only Google hits for Separately, I had similar issues as reported in 9917 but was able to eventually get a custom EE container built and working (to include Galaxy collections) by using the workarounds mentioned there. I needed to use your suggested If though, I tried the (very new)
I would get that same error if I tried the out-of-the-box |
That's an interesting thing as well... when the job starts, things are normal, eg the log starts with
I can keep scrolling through the logs via the AWX UI (double down carets) until it stalls. At that point, the flashy green indicator keeps flashing to show the job is still running, but there are no new logs. When the job does end, the indicator will turn red, but again, there are no new logs/errors/etc. at the bottom of the logging window. If I then refresh the Job Output screen, that's when the Traceback error shows up, at the TOP of the log
|
Just by way of update, I have been able to run a different playbook across different "stacks" of 4 hosts/stack WITHOUT issue so far. I've made 5 runs of a template that consists of:
This playbook/template is a slightly lighter version of the one I was having issues with. That said, it's mostly the same software and actually imports a lot of the same task files; it's just that these are worker nodes, vs the control/server hosts where the issue was. This is a home lab, so there aren't any crazy VLAN/security issues running between the two host groups (same subnet, etc.). The only major difference I could think of is that the problematic playbook runs against x86-64 hosts, whereas this "worker" playbook runs against arm64v8 hosts. It seems this should not make a difference, but I figured I would mention it. Once this rollout is finished, I will run a playbook that touches all ~35 hosts and see what happens |
@duntonr I'm experiencing quite the same issues - I often get this output:
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished
I've recently got my custom EE working on my test cluster (#9917), where I also don't get this error.
ENVIRONMENT |
I was receiving the same error and behavior in 19.0.0. In 19.1.0 I no longer receive the error but the behavior remains. I cannot run any job with more than 3 hosts without the job failing mysteriously with cutoff output and no errors. |
I was able to get my EE running like I mentioned in #9917, maybe this also helps you 😄 |
@DrackThor I'm using the builtin EE and not a custom one. I've read through the issue you mentioned but failed to see how it addresses the problem mentioned here. |
Thanks @DrackThor . I'm using a custom EE too, but the issue seems to happen with either a custom EE (with Galaxy stuff installed in the image) or the standard EE (with Galaxy stuff cobbled right into the repo that holds my playbooks). I did run a larger set of plays (28 hosts, ~3.5 hrs run time, etc.) and the issue still occurs. It's somewhat interesting that the error is injected at the TOP of the output in AWX. Also, the error seems to occur more quickly on my play against the 5 x86 hosts vs the play against 28 arm64v8 hosts |
@duntonr I tried to improve the error handling in 19.1.0. Can you try again and paste the new error? |
I think we've finally gotten to the bottom of this. Testing patch at ansible/receptor#319 |
travelingladybug has contributed $50.00 to this issue on Rysolv. |
@shanemcd - Sorry for the delay but 19.1 did NOT solve the issue. The same issue/behavior remains with:
I was excited by ansible/receptor#319, until I read that issue's updates :( . It does "feel" kinda like a lock or contention type issue though |
In terms of what it means to test this: we shouldn't see this error anymore. This was fixed in the latest alpha release of Receptor, which is going out in AWX 19.2.0 sometime today or tomorrow. |
should be fixed by 1ed170f |
To be clear, the unhelpful |
1ed170f didn't fix it for me. I'm currently running:
Update: |
I'm still facing the issue, but in my case I'm not getting any error nor play recap in automation-job pod logs. The job finishes like it should, but the output is incomplete, so AWX can't even mark hosts as failed, which is very unhelpful. Running the same job on smaller part of inventory (or just splicing it) does solve the issue, but it's less readable (a few different stdout logs to look at isn't ideal). Maybe I didn't understand what the commit mentioned above should fix or I'm encountering a different issue that just fits the description? |
I'm observing this issue too. We initially had a run that ended with the following traceback and a bunch of log output truncated.
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished
It was really slow, so we ran it again with bigger instances powering it, and now we've ended up with:
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Pod Running
I was following the logs with kubectl, and it ends with:
The logs right before it were just normal play output, as if the container had simply terminated. Doing the same job in smaller batches seems to do the trick for us too. |
We've also had a case where running with a larger batch it makes it to the end of the run, but seems to crash when flushing the recap. The last of the logs in the pod show:
The traceback was:
Traceback (most recent call last):
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 1397, in run
    res = receptor_job.run()
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 2957, in run
    return self._run_internal(receptor_ctl)
  File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/main/tasks.py", line 3008, in _run_internal
    raise RuntimeError(detail)
RuntimeError: Finished |
Hi @shanemcd and everyone! I tested the fix mentioned in version 19.2.0, but without success. The same happened when upgrading to 19.2.2 and to the new release 19.3.0. For the tests I used the default awx-ee (0.5.0), then 0.6.0, and even the latest tag. It is always the same problem: after 4 hours of execution the Job is marked with "Error", even if it terminated successfully according to the container log. As an example below: Note: I also ran the test on a new installation with version 18.0.0 (where there was no report of the problem), but I had the same problem, and this didn't happen in the version that I was using previously (14.1.0). Is there any palliative measure that we can use in these cases? Awaiting further comments regarding this issue. |
I've only found 2 solutions to this problem. Either have your templates split your job using a high number of slices, or revert to an earlier version of AWX from before all of this Kubernetes-only refactor stuff began. I really hope this issue gets more attention, as it has made the product unusable. If you deploy AWX 16 (before the incomplete UI redesign, for stability) you can update ansible within the docker container to 10.x relatively safely. This will give you the git collection downloads from the requirements.yml file. The steps are relatively simple: just enter a bash shell in the ansible container, yum uninstall ansible, and then pip install the desired ansible version. |
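For readers unfamiliar with the slicing workaround mentioned above, here is a rough sketch of the idea: one large host list is divided into several smaller groups, and each group runs as its own job with its own (smaller) output stream. The function name `slice_hosts` and the stride-based distribution are assumptions made for illustration; this thread does not show how AWX itself assigns hosts to slices.

```python
from typing import List, Sequence


def slice_hosts(hosts: Sequence[str], slice_count: int) -> List[List[str]]:
    """Split hosts into slice_count groups whose sizes differ by at most one.

    Hypothetical helper: uses a simple stride, so slice i gets hosts
    i, i + slice_count, i + 2 * slice_count, and so on.
    """
    if slice_count < 1:
        raise ValueError("slice_count must be at least 1")
    return [list(hosts[i::slice_count]) for i in range(slice_count)]


# Example: 7 hosts split into 3 slices.
hosts = [f"host{n:02d}" for n in range(1, 8)]
for number, group in enumerate(slice_hosts(hosts, 3), start=1):
    print(f"slice {number}: {group}")
```

The trade-off reported elsewhere in this thread still applies: each slice produces its own stdout, so there are several logs to read instead of one.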
See: #10366 (comment) |
Hi @nicovs ,
Job is a Workflow Job, and one playbook runs a task like below:
or like below (i tried a change to check if got same error):
Both the until/retries/delay task and the async/poll task fail the Job without any error after about 4 hrs. Every time it runs, it fails after 4 hrs. Another playbook task (it makes a big XenServer VM export via the command module) fails the Job after about 14 hrs without any error. Below are the logs I see in /var/log/pods/default_awx-777854cdfb-z2bs4_b24bc0a5-ca74-4561-89ed-378ddbed4d08/awx-task/1.log
|
@smullenrga have you tried the same but with different log sizes? They are 16k by default. |
@mw-0 I'm looking into the options for log sizes, I'm simply a consumer of our k8s cluster so will possibly have to work with the cluster admins to make any global changes. Will update once I've had a chance to change log settings. Thanks for the reply. |
Cleaning up/deleting several earlier comments. We're running Kubernetes and using the GELF driver to log to an ELK stack (configured via /etc/docker/daemon.json). AWX changed somewhere between 15 (worked fine in our environment) and 19.4 to create separate STDOUT lines for each event. In 19.4, very long STDOUT lines (>16K) are being generated, which are being split up and then improperly reassembled somewhere. Per moby/moby#22982 a 16K limit / split was put on STDOUT lines in Docker 1.13. In our environment, after upgrading from AWX 15 to AWX 19.4, AWX Jobs break when a STDOUT line >16K is encountered, because whatever is reassembling these docker-split long lines is failing to put a carriage return on the reassembled line that goes back to AWX. As a result, you get the 16+K JSON object rebuilt as expected; however, you also end up with the following JSON event log line appended to the end of the prior long line, and the JSON parsing breaks. As a result of the failed log parsing, the jobs are marked failed regardless of true status. |
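To make the failure mode described above concrete, here is a minimal simulation. It is only a sketch of the reported behavior, not code from AWX, Docker, or any log driver: `split_line`, `reassemble_without_newline`, and `parse_events` are hypothetical names, and the event fields (`counter`, `stdout`) are placeholders.

```python
import json

CHUNK_LIMIT = 16 * 1024  # the 16K split size described above


def split_line(line: str, limit: int = CHUNK_LIMIT) -> list:
    """Split one log line into chunks of at most `limit` characters."""
    return [line[i:i + limit] for i in range(0, len(line), limit)] or [""]


def reassemble_without_newline(lines: list) -> str:
    """Simulated buggy reassembly: a split line is rejoined, but no
    line terminator is written after it."""
    out = []
    for line in lines:
        chunks = split_line(line)
        joined = "".join(chunks)
        # Only lines that were never split keep their terminator.
        out.append(joined if len(chunks) > 1 else joined + "\n")
    return "".join(out)


def parse_events(stream: str):
    """Parse line-delimited JSON.

    Returns (events, index_of_first_failing_line_or_None).
    """
    events = []
    for index, line in enumerate(stream.splitlines()):
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            return events, index
    return events, None


# One event larger than 16K followed by a small one.
big_event = json.dumps({"counter": 1, "stdout": "x" * (CHUNK_LIMIT + 100)})
next_event = json.dumps({"counter": 2, "stdout": "ok"})

stream = reassemble_without_newline([big_event, next_event])
events, failed_at = parse_events(stream)
print(f"parsed {len(events)} events, first failure at line {failed_at}")
```

In this simulation the oversized event and the event after it end up on the same line, so the line-by-line JSON parser rejects that line even though both objects are individually valid.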
Can y'all take a look at #11338 (comment) ? Wondering if this is the same thing. |
@shanemcd Sure looks like it from my point of view. Unfortunately tweaking kubelet args isn't always an option (it's possible but very annoying with managed node pools etc. in certain clouds) so it would be nice to find a way around it. Ideally whatever is tailing/streaming the log needs to handle file rotation transparently. |
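As an illustration of what "handling file rotation transparently" could look like, here is a small sketch of a reader that notices when the path it follows has been replaced by a new file. This is an assumption-laden example, not how AWX, Receptor, or the kubelet stream logs: the class name `RotationAwareReader` is hypothetical, rotation is assumed to happen by rename-and-recreate, and detection relies on comparing inode numbers.

```python
import os
import tempfile


class RotationAwareReader:
    """Reads appended lines from a path, reopening it when the file is replaced.

    Sketch only: a line still being written may be returned in two pieces.
    """

    def __init__(self, path: str):
        self.path = path
        self._fh = None
        self._inode = None

    def _open(self) -> None:
        self._fh = open(self.path, "r")
        self._inode = os.fstat(self._fh.fileno()).st_ino

    def read_new_lines(self) -> list:
        if self._fh is None:
            self._open()
        # Drain whatever is left in the file we already have open.
        lines = self._fh.readlines()
        try:
            current_inode = os.stat(self.path).st_ino
        except FileNotFoundError:
            return lines  # rotated away; the new file does not exist yet
        if current_inode != self._inode:
            # The path now points at a different file: switch to it.
            self._fh.close()
            self._open()
            lines.extend(self._fh.readlines())
        return lines


# Demo: write, read, rotate by rename, write to the new file, read again.
workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "0.log")
with open(log_path, "w") as fh:
    fh.write("event 1\n")

reader = RotationAwareReader(log_path)
first = reader.read_new_lines()

with open(log_path, "a") as fh:
    fh.write("event 2\n")
os.rename(log_path, log_path + ".1")
with open(log_path, "w") as fh:
    fh.write("event 3\n")

second = reader.read_new_lines()
print(first, second)
```

The key design point is draining the old handle before switching, so lines written just before the rotation are not lost.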
@shanemcd This does not look like my issue; perhaps I should open a new issue. I get all of the logs from kubectl, and I get the full JSON output at the top of the job output after the job has errored out. My overall log is only about 700K, and that's only because I'm intentionally generating event data over 16K (which happens naturally with a Windows gather facts when the event body contains all of the host vars, on at least some of our systems). My issue really seems to be about log lines over 16K being split and whatever is reassembling them not putting a carriage return on the line. In the "stack trace" at the top of the failed job in AWX, I see the full event items in JSON format, each on its own line UNLESS the event is over 16K. As soon as the event JSON object crosses 16K, the next event's JSON object is appended to the end of the 16+K line, and that's the point at which AWX lists it as failed and the pretty/formatted output stops. |
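For anyone trying to confirm this symptom in their own job output, one way to check whether a line holds several JSON objects fused together is to decode objects one after another with Python's `json.JSONDecoder.raw_decode`. This is a diagnostic sketch only; `split_fused_events` is a hypothetical helper, and nothing here reflects how AWX parses its event stream.

```python
import json


def split_fused_events(line: str) -> list:
    """Decode every JSON object packed back-to-back on a single line."""
    decoder = json.JSONDecoder()
    events = []
    pos = 0
    while pos < len(line):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(line) and line[pos].isspace():
            pos += 1
        if pos >= len(line):
            break
        event, pos = decoder.raw_decode(line, pos)
        events.append(event)
    return events


# Two events with no line terminator between them.
fused = json.dumps({"counter": 1}) + json.dumps({"counter": 2})
recovered = split_fused_events(fused)
print(f"{len(recovered)} objects found on one line")
```

A line that yields more than one object here matches the description above: the next event was appended to the end of the previous, oversized one.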
From #11511:
|
Hi,
Interesting. Containerd is the default CRI in most Kubernetes distributions,
so I hope this issue will be solved soon on the AWX side, because it could affect
all users in the near future.
Thank you
Best,
Claudio
On Tue, 18 Jan 2022 at 15:21, Braden Schaeffer <***@***.***> wrote:
From #11511:
FWIW, this does seem to be related to container runtime. In GKE, we
recently upgraded to 1.20 which defaults to containerd at the same time we
saw this error. When we rolled back to 1.19, it was also broken, but we
realised it was still using containerd://1.4.8.
When we switched the runtime to docker://19.3.15, it actually fixed our
problem.
So things are now in a working state, but it will be EOL for us in about 5
months.
|
I think I've found my root cause - the GELF log driver in docker is not updated to handle the 16K log limit. Our enterprise cluster that AWX is on is configured to use the GELF driver to send logs to ELK and fails as noted above. AWX on Docker Desktop with default configs works fine, no failure. As soon as I change Docker Desktop to the GELF driver, I get the same failure. Looking at the docker (moby) source code for the default jsonfilelog driver, it's updated to read PLogMetaData (which contains the partial log message info) and concatenate lines as needed. The fluentd driver reads the metadata and passes it on to the log recipient. The GELF driver has no processing of metadata or line concatenation logic from what I can see and therefore passes the bad partial messages through without any metadata needed for reassembly. I don't know if AWX is written to handle the docker split log line metadata / reassembly itself or if it is expecting to receive the log lines already reassembled. I'm working on testing the fluentd driver to see if it breaks AWX as well. As far as I can tell, using the jsonfilelog log driver in docker will fix my issue but results in the problem of not being able to send logs to our logging systems as I'm required to do. |
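The reassembly step that the comment above says is missing from the GELF path can be sketched as follows. This is an assumption-heavy illustration, not moby's code: `LogRecord` and its `last` flag are hypothetical stand-ins for whatever partial-message metadata a driver exposes, and the event fields are placeholders.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class LogRecord:
    """Hypothetical record: one chunk of output plus a flag that marks
    the final chunk of a logical line."""
    message: str
    last: bool = True


def reassemble(records: Iterable[LogRecord]) -> Iterator[str]:
    """Yield one complete line per run of chunks ending in last=True."""
    buffer = []
    for record in records:
        buffer.append(record.message)
        if record.last:
            yield "".join(buffer)
            buffer.clear()
    if buffer:
        # Stream ended mid-line: emit what we have rather than drop it.
        yield "".join(buffer)


records = [
    LogRecord('{"counter": 1, "stdout": "aaaa', last=False),
    LogRecord('bbbb"}', last=True),
    LogRecord('{"counter": 2, "stdout": "ok"}'),
]
lines = list(reassemble(records))
parsed = [json.loads(line) for line in lines]
print(len(lines), parsed[0]["stdout"])
```

The point of the sketch is that reassembly needs the per-chunk metadata: if only the raw chunks are forwarded, the receiver cannot tell where one logical line ends and the next begins.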
Going to close in favor of #11338 since it pinpoints the underlying issue. |
ISSUE TYPE
SUMMARY
AWX UI reporting stalls eventually errors out with unhelpful
In reality, though, the job has actually continued and either completed successfully OR failed due to a normal (helpful) job/task/play error. This is found by looking at the spawned worker container logs.
The UI error IS raised at the same time as when the JOB completes.
Downloading logs from the UI results in the INCOMPLETE log set, eg not the apparent zip file the worker container uploads back
AWX reports the job as a failure in the job list, even though it succeeded from the actual container logs. This is VERY confusing.
ENVIRONMENT
STEPS TO REPRODUCE
EXPECTED RESULTS
AWX UI Job output continues to stay synced with actual job output and status match
ACTUAL RESULTS
AWX UI logging falls out of sync from execution and reports job failed with .../tasks.py error, despite actual job outcome. If there was an actual Job error, this is not displayed (so could not tell WHY a job failed without worker pod logs).
ADDITIONAL INFORMATION
After continuously tailing the worker pod logs, I was able to see this:
As you can see, job 209 finished successfully.
However
I also inspected the container logs for the redis, awx-web, awx-task, and awx-ee in the AWX pod but didn't see anything immediately apparent around the time AWX UI stopped tracking