Should worker process exit when the worker commits seppuku? #501

Open · psugihara opened this issue Nov 4, 2024 · 9 comments · May be fixed by #474

@psugihara

Summary

I have a single graphile-worker process that processes a small queue of mostly low priority web scraper stuff with occasional high priority items (emails).

I'm seeing my worker die without the process exiting. There's a supervisor that is supposed to restart the process if it fails, but because the process never exits, the supervisor never gets triggered. I can work around this by parsing the log output (e.g. watching for "seppuku" and restarting), but I'm guessing I'm just missing something in my setup. Is there a setting that will make the process exit when it commits seppuku?

Additional context

Node v20.15.1

Here's the script I run on the server to start the worker:
bunx --bun graphile-worker --cleanup DELETE_PERMAFAILED_JOBS,GC_TASK_IDENTIFIERS,GC_JOB_QUEUES && bunx --bun graphile-worker --no-prepared-statements

Here's an example of the error I see when it stops processing without exiting the process:

[job(worker-6f1b0ec5b9b6945331: enrich-job-posts{39050})] ERROR: [scrapeJob] error scraping job data: Request failed with status code 429

[worker(worker-6f1b0ec5b9b6945331)] ERROR: Failed task 39050 (enrich-job-posts, 11853.30ms, attempt 2 of 25) with error 'Request failed with status code 429':

  Error

      at new AxiosError (/opt/render/project/src/node_modules/.pnpm/[email protected]/node_modules/axios/lib/core/AxiosError.js:21:4)

      at settle (/opt/render/project/src/node_modules/.pnpm/[email protected]/node_modules/axios/lib/core/settle.js:2)

      at handleStreamEnd (/opt/render/project/src/node_modules/.pnpm/[email protected]/node_modules/axios/lib/adapters/http.js:599:9)

      at endReadableNT (native)

      at processTicksAndRejections (native)

      at /opt/render/project/src/node_modules/.pnpm/[email protected]/node_modules/axios/lib/core/Axios.js:48:12

      at asyncFunctionResume (native)

      at promiseReactionJobWithoutPromiseUnwrapAsyncContext (native)

      at promiseReactionJob (native)

      at processTicksAndRejections (native)

[worker(worker-6f1b0ec5b9b6945331)] ERROR: Failed to release job '39050' after failure 'Request failed with status code 429'; committing seppuku

Failed to connect

[core] ERROR: Worker exited, but pool is in continuous mode, is active, and is not shutting down... Did something go wrong?

41 |     rej = reject

42 |   }).catch((err) => {

43 |     // replace the stack trace that leads to `TCP.onStreamRead` with one that leads back to the

44 |     // application that created the query

45 |     Error.captureStackTrace(err)

46 |     throw err

       ^

ECONNREFUSED: Failed to connect

 syscall: "connect"


      at object.<anonymous> (/opt/render/project/src/node_modules/.pnpm/[email protected][email protected]/node_modules/pg-pool/index.js:46:3)

      at promiseReactionJob (native:1:1)

      at processTicksAndRejections (native:1:1)


      at /opt/render/project/src/node_modules/.pnpm/[email protected][email protected]/node_modules/graphile-worker/dist/main.js:610:13

[core] ERROR: Worker exited with error: ECONNREFUSED: Failed to connect
@benjie
Member

benjie commented Nov 4, 2024

What version of worker?

@psugihara
Author

[email protected]

@benjie
Member

benjie commented Nov 4, 2024

Can you reproduce the issue using Node rather than bun?

@psugihara
Author

I haven't tried yet or done much debugging other than confirming that the process is still around. I can try to put together a minimal repro at some point. For clarification, the process is supposed to die after the seppuku message?

Thanks for the quick help (and this amazing tool)!

@benjie
Member

benjie commented Nov 4, 2024

It doesn’t explicitly exit the process, it just stops running the worker; that should mean everything then shuts down relatively cleanly and then the process should exit.
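A possible workaround in the meantime is to run the worker in library mode and exit explicitly when a worker dies, so an external supervisor can restart the process. The sketch below is not an official recipe: it assumes the `events` option, the `WorkerEvents` export, and the `worker:fatalError` event name from the graphile-worker events documentation; verify all of these against the version you have installed, and treat the `enrich-job-posts` task body as a placeholder.

```ts
// Library-mode sketch, not the bunx CLI invocation above. Assumptions to
// verify: the `events` option, the WorkerEvents export, and the
// "worker:fatalError" event name.
import { EventEmitter } from "node:events";
import { run, type WorkerEvents } from "graphile-worker";

async function main() {
  const events: WorkerEvents = new EventEmitter();

  // When a worker aborts fatally (the "seppuku" path), exit the whole process
  // so the external supervisor can restart it.
  events.on("worker:fatalError", ({ error }) => {
    console.error("Worker fatal error; exiting so the supervisor restarts us:", error);
    process.exit(1);
  });

  const runner = await run({
    connectionString: process.env.DATABASE_URL,
    concurrency: 5,
    events,
    taskList: {
      // Placeholder task named after the logs above; replace with your real tasks.
      "enrich-job-posts": async (payload, helpers) => {
        helpers.logger.info(`processing ${JSON.stringify(payload)}`);
      },
    },
  });

  await runner.promise;
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

With something like this in place, the supervisor sees a real exit code and can bring up a fresh process instead of watching a zombie that has stopped working jobs.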

@JosephHalter

I just had a similar issue with version 0.16.5: I had 2 scheduled tasks configured in the crontab; they both committed seppuku and just stopped running entirely, despite being scheduled to run every 5 minutes. I didn't expect a job failure to cause them to stop being scheduled entirely; I only noticed after many days, and it caused a lot of issues.

Is there any way to ensure that a job failing doesn't prevent it from running at the next scheduled time?

@benjie
Member

benjie commented Nov 22, 2024

@JosephHalter can you provide a reproduction?

@benjie
Member

benjie commented Nov 22, 2024

It's not 100% clear how to handle this. A force exit would kill all concurrently working jobs, which isn't ideal. A graceful shutdown might take forever. Spinning up a new worker to replace the faulty one may cause a thundering herd problem.

Here's my first punt at it:

fdd041c
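For readers following along, here is a rough sketch of what the "graceful first, forceful after a deadline" compromise could look like on the application side in library mode. It is not the linked commit; it assumes the exported `Runner` type and that `runner.stop()` waits for in-flight jobs, and the 30-second grace period is an arbitrary example value.

```ts
// Sketch only: graceful shutdown with a hard deadline, assuming library mode.
import type { Runner } from "graphile-worker";

export async function stopThenForceExit(runner: Runner, graceMs = 30_000) {
  // If the graceful stop takes longer than the grace period, force the
  // process down so a supervisor can start a fresh worker.
  const forceTimer = setTimeout(() => {
    console.error(`Graceful shutdown exceeded ${graceMs}ms; forcing exit`);
    process.exit(1);
  }, graceMs);
  forceTimer.unref(); // don't let the timer itself keep the process alive

  try {
    await runner.stop(); // lets currently-running jobs finish
    process.exit(0);
  } catch (err) {
    console.error("Error during graceful shutdown:", err);
    process.exit(1);
  }
}
```

A configurable grace period covers both preferences discussed in this thread: a long delay for those who want running jobs to finish, and a short one for those who prefer the process to restart quickly.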

benjie linked a pull request Nov 22, 2024 that will close this issue
@JosephHalter

JosephHalter commented Nov 23, 2024

@benjie Sadly it's pretty hard to reproduce; I only have the logs to show that the tasks failed and seppuku happened, but I don't know exactly how they failed. It's also in a private repo that I can't share. Failure in itself can happen and isn't a big issue; the problem was that the job stopped being scheduled entirely. If it were in the system crontab, new jobs would have been spawned no matter how the previous ones failed. Indeed it's possible to have herd issues if you keep spawning new jobs, but that would have been easier to notice. I have no problem with force-killing concurrently working jobs, although, like you suggested in the comments of your pull request, going for a graceful exit and only forcefully killing it after a delay would be best. Even better if the delay is configurable, and if you put 0 then it doesn't forcefully kill the workers at all, for people who don't want the behavior.

Thank you for taking this issue seriously and immediately creating a pull request 👍
