Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wait for ongoing uploads and downloads when shutting down the worker #882

Closed
wants to merge 1 commit into from

Conversation

peterjan
Copy link
Member

@peterjan peterjan commented Jan 9, 2024

This PR makes the worker shut down gracefully and allows ongoing uploads and downloads to finish.

@peterjan peterjan self-assigned this Jan 9, 2024
worker/interactions.go Outdated Show resolved Hide resolved
worker/interactions.go Outdated Show resolved Hide resolved
worker/worker.go Outdated Show resolved Hide resolved
@peterjan peterjan marked this pull request as draft January 12, 2024 16:17
@peterjan peterjan changed the title Avoid logging context canceled errors when they involve the shutdown context Wait for ongoing uploads and downloads when shutting down the worker Jan 12, 2024
@peterjan
Copy link
Member Author

peterjan commented Jan 12, 2024

I moved it to DRAFT since it's more of a proposal now.

I propose we register ongoing uploads and downloads and wait for them to finish when shutting down the worker, this means we'll only use the shutdown context to prevent starting new uploads/downloads when we were blocking on memory. I added a middleware to check if the shutdown context was cancelled, and if that is the case we return service unavailable... this way we make sure we only handle ongoing requests and not take on any new ones.

I also want to change the way we handle the shutdown timeout, we now divide it by the number of services we have but that seems a little arbitrary. I see how it allows every service to shut down gracefully but at the same time it might needlessly prevent a service from shutting down gracefully simply because we restrict it to shutdownTimeout/10.

Maybe we can allocate (an arbitrary) 5 seconds to every service and reserve that. So we subtract that time from the shutdown timeout and then we just allow all services to use the remaining time, if we do timeout we still have that reserve to ensure we can gracefully shut down the rest. I feel this would give time to those services that need it, while ensuring we're not giving the opportunity to following services to complete their shutdown.

internal/testing/cluster_test.go Outdated Show resolved Hide resolved
internal/testing/cluster_test.go Outdated Show resolved Hide resolved
internal/testing/cluster_test.go Outdated Show resolved Hide resolved
worker/download.go Outdated Show resolved Hide resolved
worker/download.go Show resolved Hide resolved
cmd/renterd/main.go Outdated Show resolved Hide resolved
cmd/renterd/main.go Outdated Show resolved Hide resolved
worker/mocks_test.go Outdated Show resolved Hide resolved
worker/worker.go Outdated Show resolved Hide resolved
@peterjan peterjan marked this pull request as ready for review February 6, 2024 16:36
Copy link
Member

@ChrisSchinnerl ChrisSchinnerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you manually tested this a few times to see if it throws any unexpected errors?

autopilot/autopilot.go Outdated Show resolved Hide resolved
worker/upload.go Outdated Show resolved Hide resolved
@peterjan
Copy link
Member Author

Have you manually tested this a few times to see if it throws any unexpected errors?

I manually tested it a bunch of times. Downloads and uploads are handled gracefully, and we do reserve a portion of the shutdown ctx to try and shut down the remaining services in case we time out. If the worker shut down times out though, we might not record some interactions and then we get:

2024-02-14T10:20:50+01:00       ERROR   worker.worker   worker/interactions.go:91       failed to record 1 price table updates on worker shutdown

@peterjan peterjan marked this pull request as draft February 14, 2024 09:41
@peterjan
Copy link
Member Author

Converting it to DRAFT to prevent merge, ran into a 2024-02-14T10:40:07+01:00 ERROR uploadmanager worker/contract_lock.go:109 failed to send keepalive testing a shorter shutdown timeout.

@peterjan peterjan marked this pull request as ready for review February 14, 2024 11:05
@peterjan peterjan changed the base branch from dev to pj/fix-packed-slab-uploads February 27, 2024 16:41
Base automatically changed from pj/fix-packed-slab-uploads to dev February 28, 2024 12:16
@peterjan peterjan marked this pull request as draft February 29, 2024 10:02
@peterjan
Copy link
Member Author

peterjan commented Mar 5, 2024

This was never scheduled and is not currently a priority.

@peterjan peterjan closed this Mar 5, 2024
@ChrisSchinnerl ChrisSchinnerl deleted the pj/shutdown-err branch July 3, 2024 07:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants