Draining CC-api VMs should let local-worker jobs finish #496
+24
−4
A short explanation of the proposed change:
On a graceful shutdown of a CC API VM, local-worker now waits before shutting down if jobs are still running on the local-worker queue. The default grace period is 5 minutes; it is configurable by setting a value for:
cc.jobs.local.local_worker_grace_period_seconds
The order of the shutdown is now:
Previously, the shutdown order was local_worker first, then nginx, then cloud_controller.
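The grace-period wait described above can be sketched roughly as follows. This is only an illustration in Python, not the code in this PR: the function name `wait_for_local_jobs`, the `running_job_count` callback, and the polling interval are hypothetical. Only the 5-minute default comes from the description.

```python
import time
from typing import Callable

# 5 minutes: the default grace period stated in this PR description.
DEFAULT_GRACE_PERIOD_SECONDS = 300


def wait_for_local_jobs(
    running_job_count: Callable[[], int],  # hypothetical: returns number of jobs still running
    grace_period_seconds: float = DEFAULT_GRACE_PERIOD_SECONDS,
    poll_interval_seconds: float = 5.0,  # assumption: polling interval is not given in the PR
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Block until no local jobs are running or the grace period elapses.

    Returns True if all jobs finished in time, False if the grace period expired
    while jobs were still running.
    """
    deadline = clock() + grace_period_seconds
    while running_job_count() > 0:
        remaining = deadline - clock()
        if remaining <= 0:
            return False  # grace period used up; shutdown proceeds anyway
        # Never sleep past the deadline.
        sleep(min(poll_interval_seconds, remaining))
    return True
```

The clock and sleep functions are injectable so the wait can be exercised in tests without real delays. A caller would proceed with shutdown once the function returns, regardless of the result.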
An explanation of the use cases your change solves:
It seems that a graceful shutdown of a CC API VM (e.g., during an update) does not properly account for draining the worker jobs on the API VM that handle file uploads.
When the CC API VM is restarted or recreated while a local worker on that VM is processing an upload job (transferring files from disk to the blobstore), the package status gets stuck in PROCESSING_UPLOAD. The upload job seems to have the standard timeout of 4h configured, so deployments hang until they are finally stopped by client-side timeouts.
With the proposed change, if jobs are still running, the local-worker waits for the grace period (5 minutes by default) before shutdown is performed. That gives the upload job more time to finish.
Links to any other associated PRs
I have viewed, signed, and submitted the Contributor License Agreement
I have made this pull request to the develop branch
[ ] I have run CF Acceptance Tests on bosh lite