Use finalizers to avoid losing delete events and to ensure full resource cleanup #941
Conversation
I added some more commits to close the loop around the Finalizer:
As for the …
(force-pushed from 4386c2d to 8e7c4ed)
Two more additions:
@FxKu are you still considering this PR for 1.6? If so, should I rebase it then?
@FxKu could you kindly give me an indication if I should rebase this PR at some point, or if you don't like it enough to even consider merging it?
@FxKu @CyberDem0n @Jan-M is this PR still in any way in focus for you? See e.g. https://kubernetes.io/blog/2021/05/14/using-finalizers-to-control-deletion/ on why finalizers are quite sensible to have.
@FxKu @CyberDem0n @Jan-M may I bug you once more and ask if you are still interested in this change?
@frittentheke sorry to keep you hanging here for so long; we hesitated because we fear it could break things. But since we have delete protection on the operator side (and also in our internal admission controller), we should be safe to allow cleanups when there was a delete event. Hope I will find time next week to look at it again.
@FxKu May I ask whether you might find the time to look at this? Is there anything we could assist with for this PR?
Maybe something to look into for 1.9 (https://github.com/zalando/postgres-operator/milestone/12) @FxKu @Jan-M ?
@frittentheke I'll put it on the list again 😃 @SimonDreher it would be great if you could add tests - like a unit and e2e test. For the e2e test, you could maybe implement a finalizer test for the second cluster that we create. For the other one we already test the normal delete procedure at the end.
I've rebased this patchset against current master here for anyone interested.
@FxKu are you still interested in this addition / PR?
Hey there 😃 . I have not forgotten you. Before I give this PR another review I have one request: can we make this feature optional with a global …
Just wanted to add my interest in seeing this merged, having been a quiet spectator for a few months now. This would be such a joy to have in contexts like https://tilt.dev/ local environment setups. Being able to rapidly tear down and set up a cluster without manual deletion intervention would be a massive boon for us!
… Run Delete() and remove finalizer when done.
…t entries from their maps - just like with Secrets
…dpoint, LogicalBackup CronJob, PDB and Secret) - this is not a real error when deleting things
…ccessfully first. Otherwise the next sync run should let us try again
(force-pushed from 0db169d to 4f6f0de)
👍
So finally, we have merged your PR @frittentheke. Thanks for this contribution and all the patience over the last years (!). There are still a few minor things I would change, but I will take it from here.
After putting way more work into it than anticipated (refactoring, adding sync support, writing unit tests and docs) I have to say, I don't see the appeal of using finalizers. At least not with our operator and the way we listen to events. With finalizers there can be a huge delay until the cluster is finally gone. And then it will emit the DELETE event to the queue when the job is already done - so logs look a little confusing now. Anyway, all of this might soon be history if we switch to K8s controller runtime.
Currently the operator fully depends on the events it receives while running. It does have a sync loop that iterates over clusters again and again to sync or repair them, but if a delete event is lost, the CR is deleted while the resources might not be, leaving StatefulSets, Secrets and the like dangling. A user usually does not have to interact with those resources, yet it then becomes their responsibility to clean them up.
To protect a CR from being deleted before proper cleanup has happened, finalizers are used (https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#finalizers) and are considered best practice.
This PR works towards issue #450: it adds a finalizer to the CR when a cluster is first created and only removes it once the cluster's resources have been deleted, which then causes the CR to be garbage-collected by Kubernetes.
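For illustration only, here is a minimal Go sketch of what idempotent finalizer handling on the CR's finalizer list could look like; the finalizer name and helper functions are placeholders, not the operator's actual code:

```go
package main

import "fmt"

// Placeholder finalizer name; the identifier the operator actually registers may differ.
const finalizerName = "postgres-operator.zalando.org/finalizer"

// addFinalizer appends finalizerName unless it is already present,
// so repeated sync runs stay idempotent.
func addFinalizer(finalizers []string) []string {
	for _, f := range finalizers {
		if f == finalizerName {
			return finalizers
		}
	}
	return append(finalizers, finalizerName)
}

// removeFinalizer strips finalizerName so Kubernetes can garbage-collect the CR.
func removeFinalizer(finalizers []string) []string {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != finalizerName {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	finalizers := addFinalizer(nil)
	fmt.Println(finalizers)                  // [postgres-operator.zalando.org/finalizer]
	fmt.Println(removeFinalizer(finalizers)) // []
}
```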
A deleted postgres resource will then only be "marked" with a deletion timestamp, but remains in place. To reconcile cases where the deletion event was "missed" (operator not running), I added a check during the sync loop (and the initial sync of all clusters) for an existing deletion timestamp; if one is set, the operator triggers the deletion.
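A minimal sketch of that check, assuming the sync run has access to the CR's ObjectMeta (the function and cluster names below are hypothetical):

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// needsDeletion reports whether the CR was marked for deletion, e.g. while the
// operator was not running and the delete event was missed; the sync loop can
// then branch into the cleanup path instead of a regular sync.
func needsDeletion(meta metav1.ObjectMeta) bool {
	return meta.DeletionTimestamp != nil
}

func main() {
	now := metav1.Now()
	marked := metav1.ObjectMeta{Name: "acid-test-cluster", DeletionTimestamp: &now}
	fmt.Println(needsDeletion(marked))                             // true
	fmt.Println(needsDeletion(metav1.ObjectMeta{Name: "acid-ok"})) // false
}
```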
I took the liberty of giving the Delete function a returnable error, but I only added one failure condition (finalizer removal failed). To fully tackle #450 (delete event missed), and even more importantly #498 (resource cleanup incomplete/failing), full error handling should be added to the various resource deletions. If the operator cannot clean up all the resources it created, it should report this (logs, events on the CR) but NOT remove the finalizer / CR until it can.
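A hedged sketch of that error-handling flow, with a stand-in Cluster type and per-resource cleanup stubs that are purely hypothetical (the real PR deletes the StatefulSet, Services, Endpoints, LogicalBackup CronJob, PDB and Secrets):

```go
package main

import (
	"errors"
	"fmt"
)

// Cluster is a stand-in for the operator's cluster object; everything below is
// an illustrative sketch, not the PR's actual code.
type Cluster struct {
	secretsGone bool
}

func (c *Cluster) deleteStatefulSet() error { return nil }
func (c *Cluster) deleteSecrets() error {
	if !c.secretsGone {
		return errors.New("secret deletion failed")
	}
	return nil
}
func (c *Cluster) removeFinalizer() error { return nil }

// Delete collects cleanup errors and removes the finalizer only when every
// step succeeded; otherwise the CR stays and the next sync run retries.
func (c *Cluster) Delete() error {
	var errs []error
	for _, cleanup := range []func() error{c.deleteStatefulSet, c.deleteSecrets} {
		if err := cleanup(); err != nil {
			errs = append(errs, err)
		}
	}
	if len(errs) > 0 {
		return fmt.Errorf("cluster cleanup incomplete, keeping finalizer: %v", errs)
	}
	// Only now let Kubernetes garbage-collect the CR.
	return c.removeFinalizer()
}

func main() {
	c := &Cluster{}
	fmt.Println(c.Delete()) // cleanup incomplete, finalizer kept
	c.secretsGone = true
	fmt.Println(c.Delete()) // <nil>, finalizer removed
}
```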
This also prevents a new CR with the same (cluster) name from being created and conflicting with the not-yet-cleaned-up resources, since the old CR will not be deleted while the finalizer is still on it.
I don't know if this relates at all to #387, but @sdudoladov mentioned finalizers there - maybe this PR also helps by providing some initial scaffolding.
closes #450