Process hanging around after container stopped #1489
I imagine this is the problem: Docker is sending a SIGKILL, which abruptly interrupts the process rather than giving it a chance to shut down gracefully. I did move GoodJob over entirely to a heartbeat model for detecting active processes (previously the process would take an Advisory Lock: #1451). A heartbeat results in slightly slower cleanup of process records, but that's more of an accounting issue. The underlying problem is that the process is being stopped ungracefully. Do you know what signal is being sent to the processes to shut them down? https://github.com/bensheldon/good_job?tab=readme-ov-file#Interrupts-graceful-shutdown-and-SIGKILL
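For reference, one way to answer that question from the host is sketched below; `<container-id>` is a placeholder, not a value from this thread:

```bash
# Show the stop signal the container is configured with (empty output means
# the Docker default, SIGTERM).
docker inspect --format '{{.Config.StopSignal}}' <container-id>

# Send a TERM manually and watch whether good_job logs a graceful shutdown.
docker kill --signal=SIGTERM <container-id>
docker logs -f <container-id>
```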
Here is what I've found so far:

```json
"State": {
    "Status": "exited",
    "Running": false,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": false,
    "Dead": false,
    "Pid": 0,
    "ExitCode": 137,
    "Error": "",
    "StartedAt": "2024-10-09T18:26:38.770016471Z",
    "FinishedAt": "2024-10-09T21:22:42.292104872Z"
},
```
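Worth noting: exit code 137 is 128 + 9, i.e. the container's main process was killed with SIGKILL, which matches the ungraceful-shutdown theory above. The block above looks like the `State` section of `docker inspect`; a sketch of how to pull just that part (the container name is a placeholder, and `jq` is assumed to be installed):

```bash
# Print only the State block of the container's metadata.
docker inspect --format '{{json .State}}' <container-id> | jq
```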
OK, back at it, debugging some more, and I think I'm close to finding the actual problem. There are more people with similar problems, apparently, and I think it all boils down to another process "wrapping" good_job and not passing the signal down. I thought that would be weird because I'm explicitly setting the command in my Kamal config:

```yaml
job:
  hosts:
    - web
  env:
    clear:
      CRON_ENABLED: 1
      GOOD_JOB_MAX_THREADS: 10 # Number of threads to process background jobs
      DB_POOL: <%= 10 + 1 + 2 %> # GOOD_JOB_MAX_THREADS + 1 + 2
  cmd: bin/good_job start --probe-port=7001
  options:
    health-cmd: curl -f http://localhost:7001/status/connected || exit 1
    health-retries: 15
    health-interval: 5s
```

Running `docker top` on the host shows:

```
kamal@web:~$ docker top a7bd263ec291
UID      PID        PPID       C   STIME   TTY   TIME       CMD
kamal    4163342    4163321    3   15:59   ?     00:00:09   ruby /rails/bin/good_job start --probe-port=7001
```

Apparently, the command still runs through the image's entrypoint script:

```bash
#!/bin/bash -e

# If running the rails server then create or migrate existing database
if [ "${*}" == "bundle exec thrust ./bin/rails server" ]; then
  ./bin/rails db:prepare
fi

exec "${@}"
```

which I also tried to remove, but that didn't change anything.

So, tl;dr: does that make any sense? If so, it might be a Kamal thing and I will follow up with an issue on their side.
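A simple way to test the "wrapper swallows the signal" theory from the host is sketched below; it is not taken from this deployment, and the container name is a placeholder:

```bash
# PID 1 inside the container should be the good_job process itself, not a
# bash wrapper; `ps` must be available in the image for this to work.
docker exec <container-id> ps -ef

# If SIGTERM is handled, the stop returns almost immediately; if it takes
# ~10s (Docker's default stop timeout), Docker fell back to SIGKILL, which
# would explain the 137 exit code above.
time docker stop <container-id>
```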
hey (again) 👋🏻
I'm seeing a strange behavior after a deployment with Kamal: the old process is still showing up, even though the container has stopped.
There are a number of job containers, though they are all stopped.
After a couple of minutes, the old process disappears automagically.
Before I dig deeper into it, I need to ask a couple of questions:
- Is this new behavior? The old container stops gracefully, from what I can tell, and I don't see any exceptions in its logs. It just stops and that's it.
- Is this view purely for presentation purposes, or does it affect processing (e.g. jobs being enqueued by this "ghost" process)?
I initially suspected it had something to do with Kamal and its healthcheck, but I see from the logs that the container is indeed passing the healthcheck and the old container is stopped with `docker stop job-container-name`. Could it have something to do with the way Docker stops containers and the way GoodJob traps the signal? I don't know if I'm making sense haha