Process hanging around after container stopped #1489
I imagine this is the problem: Docker is sending a SIGKILL, which abruptly interrupts the process rather than giving it a chance to shut down gracefully. I did move GoodJob over entirely to a heartbeat model for detecting active processes (previously the process would take an Advisory Lock: #1451). A heartbeat results in slightly slower cleanup of process records, but that's more of an accounting issue. The underlying problem is that the process is being stopped ungracefully. Do you know what signal is being sent to the processes to shut them down? https://github.com/bensheldon/good_job?tab=readme-ov-file#Interrupts-graceful-shutdown-and-SIGKILL
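For reference, one way to answer that question from the host is sketched below; `<container-id>` is a placeholder, not a value from this thread:

```bash
# Show the stop signal the container is configured with (empty output means
# the Docker default, SIGTERM).
docker inspect --format '{{.Config.StopSignal}}' <container-id>

# Send a TERM manually and watch whether good_job logs a graceful shutdown.
docker kill --signal=SIGTERM <container-id>
docker logs -f <container-id>
```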
Here is what I've found so far:

```json
"State": {
    "Status": "exited",
    "Running": false,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false,
    "Dead": false,
    "Pid": 0,
    "ExitCode": 137,
    "Error": "",
    "StartedAt": "2024-10-09T18:26:38.770016471Z",
    "FinishedAt": "2024-10-09T21:22:42.292104872Z"
},
```
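Worth noting: exit code 137 is 128 + 9, i.e. the container's main process was killed with SIGKILL, which matches the ungraceful-shutdown theory above. The block above looks like the `State` section of `docker inspect`; a sketch of how to pull just that part (the container name is a placeholder, and `jq` is assumed to be installed):

```bash
# Print only the State block of the container's metadata.
docker inspect --format '{{json .State}}' <container-id> | jq
```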
OK, back at it, debugging some more, and I think I'm close to finding the actual problem. There are more people with similar problems, apparently, and I think it all boils down to another process "wrapping" good_job and not passing the signal down. I thought that would be weird because I'm explicitly setting the command in my Kamal config:

```yaml
job:
  hosts:
    - web
  env:
    clear:
      CRON_ENABLED: 1
      GOOD_JOB_MAX_THREADS: 10 # Number of threads to process background jobs
      DB_POOL: <%= 10 + 1 + 2 %> # GOOD_JOB_MAX_THREADS + 1 + 2
  cmd: bin/good_job start --probe-port=7001
  options:
    health-cmd: curl -f http://localhost:7001/status/connected || exit 1
    health-retries: 15
    health-interval: 5s
```

Running `docker top` on the host shows:

```
kamal@web:~$ docker top a7bd263ec291
UID      PID        PPID       C   STIME   TTY   TIME       CMD
kamal    4163342    4163321    3   15:59   ?     00:00:09   ruby /rails/bin/good_job start --probe-port=7001
```

Apparently, the command still runs through the image's entrypoint script:

```bash
#!/bin/bash -e

# If running the rails server then create or migrate existing database
if [ "${*}" == "bundle exec thrust ./bin/rails server" ]; then
  ./bin/rails db:prepare
fi

exec "${@}"
```

which I also tried to remove, but that didn't change anything.

So, tl;dr: does that make any sense? If so, it might be a Kamal thing and I will follow up with an issue on their side.
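A simple way to test the "wrapper swallows the signal" theory from the host is sketched below; it is not taken from this deployment, and the container name is a placeholder:

```bash
# PID 1 inside the container should be the good_job process itself, not a
# bash wrapper; `ps` must be available in the image for this to work.
docker exec <container-id> ps -ef

# If SIGTERM is handled, the stop returns almost immediately; if it takes
# ~10s (Docker's default stop timeout), Docker fell back to SIGKILL, which
# would explain the 137 exit code above.
time docker stop <container-id>
```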
hey (again) 👋🏻
I'm seeing a strange behavior after a deployment with Kamal: the old process is still showing up, even though the container has stopped.
There are a number of job containers, though they are all stopped.
After a couple of minutes, the old process disappears automagically.
Before I dig deeper into it, I need to ask a couple of questions:
- Is this new behavior? The old container stops gracefully, from what I can tell, and I don't see any exceptions in its logs. It just stops and that's it.
- Is this view purely for presentation purposes, or does it affect processing (e.g. jobs being enqueued by this "ghost" process)?
I initially suspected it had something to do with Kamal and its healthcheck, but I see from the logs that the container is indeed passing the healthcheck and the old container is stopped with `docker stop job-container-name`. Could it have something to do with the way Docker stops containers and the way GoodJob traps the signal? I don't know if I'm making sense haha