
Use finalizers to avoid losing delete events and to ensure full resource cleanup #941

Merged: 9 commits merged into zalando:master on Jan 4, 2024

Conversation

@frittentheke (Contributor) commented Apr 28, 2020

Currently the operator fully depends on the events it receives while running. Although it has a sync loop that iterates over the clusters again and again to sync or repair them, if a delete event is lost the CR is gone but its resources might not be, leaving resources such as StatefulSets or Secrets dangling. A user usually does not have to interact with those resources directly, yet it then becomes their responsibility to clean them up.
To protect a CR from being deleted before proper cleanup has happened, finalizers (https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#finalizers) are used; this is established best practice.

This PR works towards issue #450: it adds a finalizer to the CR when a cluster is first created and only removes it once its resources have been deleted, at which point the CR is garbage collected by Kubernetes.

A deleted postgres resource is then only "marked" with a deletion timestamp but remains in the API (with the deletion timestamp set). To reconcile cases where the deletion event was "missed" (e.g. the operator was not running), I added a check during the sync loop (and the initial sync of all clusters) for an existing deletion timestamp; if one is set, the operator then triggers the deletion.

I took the liberty of giving the Delete function an error return value, but so far I only added one failure condition (finalizer removal failed). To fully tackle #450 (missed delete event) and, even more importantly, #498 (incomplete/failing resource cleanup), full error handling should be added to the various resource deletions. If the operator cannot clean up all the resources it created, it should report this (logs, events on the CR) but NOT remove the finalizer / CR until it can.
This prevents a new CR with the same (cluster) name from being created (since the old one is not deleted while the finalizer is set) and conflicting with the resources that were not cleaned up. A rough sketch of the intended flow follows below.
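
To make the intended flow a bit more concrete, here is a minimal, self-contained sketch of the logic described above (this is not the operator's actual code; names like finalizerName, Cluster and deleteResources are made up for illustration):

```go
package finalizer

const finalizerName = "postgres-operator.example.org/cleanup"

// Cluster stands in for the postgresql custom resource.
type Cluster struct {
	Finalizers        []string
	DeletionRequested bool // stands in for a non-zero metadata.deletionTimestamp
}

func hasFinalizer(c *Cluster) bool {
	for _, f := range c.Finalizers {
		if f == finalizerName {
			return true
		}
	}
	return false
}

func removeFinalizer(c *Cluster) {
	kept := make([]string, 0, len(c.Finalizers))
	for _, f := range c.Finalizers {
		if f != finalizerName {
			kept = append(kept, f)
		}
	}
	c.Finalizers = kept
}

// reconcile runs on every received event and on every sync loop iteration.
func reconcile(c *Cluster, deleteResources func() error) error {
	if !c.DeletionRequested {
		// Live cluster: make sure the finalizer is present so the CR cannot
		// be garbage collected before cleanup has run.
		if !hasFinalizer(c) {
			c.Finalizers = append(c.Finalizers, finalizerName)
		}
		return nil
	}
	// Deletion requested, either via a delete event or discovered during a
	// later sync: delete the owned resources first and keep the finalizer
	// on failure, so the next sync run retries.
	if err := deleteResources(); err != nil {
		return err
	}
	// Everything is cleaned up: drop the finalizer so Kubernetes removes the CR.
	removeFinalizer(c)
	return nil
}
```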

I don't know if this relates to #387 at all, but @sdudoladov mentioned finalizers there - maybe this PR also helps by providing initial scaffolding.

closes #450

@frittentheke changed the title from "Use finalizers avoid losing deletions [starts on #450]" to "Use finalizers to avoid losing delete events and to allow full cleanup" on Apr 28, 2020
@FxKu added this to the 1.6 milestone on Apr 29, 2020
@frittentheke (Contributor, Author) commented:

@FxKu while testing this on a local build we ran into issues with nil pointers in the delete functions.
This is why I added commit f36bba3 to make all of those map-accessing functions behave similarly and not crash when map entries for Services are nil.

@frittentheke (Contributor, Author) commented May 5, 2020

I added some more commits to close the loop around the Finalizer:

  1. 05fdbb2 makes all the deleteXYZresource() functions not treat ResourceNotFound as an actual error

  2. f567667 sends out events for (actual) errors to the postgres resource so the user knows there are issues removing things

  3. 9bc3d8f ensures the finalizer is not removed until all resources could be deleted successfully (or do not exist anyway).

As for the anyErrors boolean, this is the smallest footprint in my opinion. Certainly the error handling and propagation could be straightened out much further, e.g. by collecting all the errors and handling them jointly in the calling function, but reworking the whole error logic can always happen later. I wanted this PR to be about improving the way the operator carefully ensures that all created resources are deleted again, and about using the finalizer to a) allow for retries (on the next sync run) and b) ensure no delete event is ever missed, even when the operator is not running.
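
Roughly, the anyErrors pattern looks like the sketch below (helper names and the map of per-resource delete functions are illustrative assumptions, not the actual code; apierrors is k8s.io/apimachinery/pkg/api/errors):

```go
package cleanup

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// deleteAll runs every per-resource delete function and reports whether all
// of them succeeded. A "not found" error is fine when deleting; any other
// error is surfaced as an event so the user sees it, and the caller keeps
// the finalizer in place so the next sync run retries the cleanup.
func deleteAll(deleters map[string]func() error, emitEvent func(msg string)) bool {
	anyErrors := false
	for name, deleteResource := range deleters {
		err := deleteResource()
		if err == nil || apierrors.IsNotFound(err) {
			continue
		}
		anyErrors = true
		emitEvent("could not delete " + name + ": " + err.Error())
	}
	return !anyErrors
}
```

The caller would only strip the finalizer from the CR when deleteAll returns true.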

@frittentheke force-pushed the finalizer branch 4 times, most recently from 4386c2d to 8e7c4ed, on May 5, 2020 at 17:48
@frittentheke (Contributor, Author) commented:

Two more additions:

  1. 8e7c4ed - When deleting a whole cluster and some resources are simply not there (anymore), we should not treat this as an error. Log it and carry on deleting the other things that belong to this cluster to reach a clean state.

  2. f8a06fe - Maybe a bit late to the game of implementing the finalizer logic, but when a finalizer is set on a resource the operator's informer does not receive a delete event from the Kubernetes API, only an update to the resource with the deletionTimestamp set. This commit implements that logic so the operator handles such "deletes" just like a regular deletion (see the sketch below).
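
A tiny sketch of what that update handling boils down to (stub type and queue functions are assumptions for illustration, not the operator's actual API):

```go
package events

import "time"

// PostgresqlStub stands in for the postgresql custom resource; only the
// field relevant here is modelled.
type PostgresqlStub struct {
	DeletionTimestamp *time.Time // metadata.deletionTimestamp
}

// onUpdate is the informer's update callback. With a finalizer in place the
// API server keeps the object around after a delete request and only sends
// an UPDATE with deletionTimestamp set, so that case has to be routed into
// the delete handling instead of a regular sync.
func onUpdate(prev, cur *PostgresqlStub, queueDelete, queueSync func(*PostgresqlStub)) {
	if cur.DeletionTimestamp != nil {
		queueDelete(cur) // treat like a regular deletion of the cluster
		return
	}
	queueSync(cur)
}
```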

@frittentheke changed the title from "Use finalizers to avoid losing delete events and to allow full cleanup" to "Use finalizers to avoid losing delete events and to ensure full resource cleanup" on May 5, 2020
@frittentheke (Contributor, Author) commented Jul 30, 2020

@FxKu are you still considering this PR for 1.6? If so, should I rebase it then?

@FxKu removed this from the 1.6 milestone on Aug 26, 2020
@frittentheke (Contributor, Author) commented:

@FxKu could you kindly give me an indication whether I should rebase this PR at some point, or whether you don't like it enough to even consider merging it?

@frittentheke (Contributor, Author) commented:

@FxKu @CyberDem0n @Jan-M is this PR still in any way in focus for you?

See e.g. https://kubernetes.io/blog/2021/05/14/using-finalizers-to-control-deletion/ on why finalizers are quite sensible to have.

@frittentheke (Contributor, Author) commented:

@FxKu @CyberDem0n @Jan-M may I bug you once more and ask whether you are still interested in this change?

@FxKu (Member) commented Dec 3, 2021

@frittentheke sorry to keep you hanging here for so long; we feared this could break things. But since we have delete protection on the operator side (and also in our internal admission controller), we should be safe to allow cleanups when there was a delete event. I hope I will find time next week to look at it again.

@SimonDreher commented:

@FxKu may I ask whether you might find the time to look at this? Is there anything we could assist with for this PR?

@frittentheke (Contributor, Author) commented:

Maybe something to look into for 1.9 (https://github.com/zalando/postgres-operator/milestone/12) @FxKu @Jan-M ?

@FxKu (Member) commented May 19, 2022

@frittentheke I'll put it on the list again 😃 @SimonDreher it would be great if you could add tests - a unit test and an e2e test, say. For the e2e test, you could maybe implement a finalizer test for the second cluster that we create; for the other one we already test the normal delete procedure at the end.

@FxKu FxKu added this to the 1.9 milestone May 19, 2022
@bwrobc commented Jun 8, 2023

I've rebased this patchset against current master here for anyone interested.

@frittentheke (Contributor, Author) commented:

> I've rebased this patchset against current master here for anyone interested.

@bwrobc I did not check out your branch yet ... but could I reset my PR to your kindly provided rebase to keep this PR current?

@frittentheke (Contributor, Author) commented:

> I've rebased this patchset against current master here for anyone interested.
>
> @bwrobc I did not check out your branch yet ... but could I reset my PR to your kindly provided rebase to keep this PR current?

@FxKu are you still interested in this addition / PR?
I'd then spend the time to rebase it again ... likely using @bwrobc's work.

@bwrobc commented Jul 21, 2023

> I've rebased this patchset against current master here for anyone interested.
>
> @bwrobc I did not check out your branch yet ... but could I reset my PR to your kindly provided rebase to keep this PR current?

Yes, please feel free.

@FxKu (Member) commented Aug 1, 2023

Hey there 😃. I have not forgotten you. Before I give this PR another review I have one request: can we make this feature optional with a global enable_finalizers option? That way everybody can start testing it and then use it at their own discretion. We would feel more comfortable if finalizers are optional, and then we can finally merge this PR.
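
One way such an opt-in could look on the configuration side; this is only a sketch under the assumption of a nullable boolean that defaults to off, and the actual field name and wiring may end up different:

```go
package config

// Config models just the piece of the operator configuration relevant here;
// the field name EnableFinalizers is an assumption for illustration.
type Config struct {
	EnableFinalizers *bool `json:"enable_finalizers,omitempty"`
}

// FinalizersEnabled defaults to false, so existing installations keep the
// current delete behaviour unless they explicitly opt in.
func (c *Config) FinalizersEnabled() bool {
	return c.EnableFinalizers != nil && *c.EnableFinalizers
}
```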

@KunJakob commented:

Just wanted to add my interest in seeing this merged, having been a quiet spectator for a few months now. This would be such a joy to have in contexts like https://tilt.dev/ local environment setups. Being able to rapidly tear down and set up a cluster without manual deletion intervention would be a massive boon for us!

@frittentheke (Contributor, Author) commented:

@FxKu @bwrobc @KunJakob sorry for the delay. I pushed a rebased PR now and added a config option to enable finalizers. I'll have to wait for CI to see if / where I messed up somewhere ... but PTAL.

@FxKu (Member) commented Jan 4, 2024

👍

@idanovinda (Member) commented:

👍

@FxKu merged commit 743aade into zalando:master on Jan 4, 2024
3 of 4 checks passed
@FxKu (Member) commented Jan 4, 2024

So finally, we have merged your PR, @frittentheke. Thanks for this contribution and all the patience over the last years (!). There are still a few minor things I would change, but I will take it from here.

@FxKu (Member) commented Jan 24, 2024

After putting way more work into this than anticipated (refactoring, adding sync support, writing unit tests and docs), I have to say I don't see the appeal of using finalizers. At least not with our operator and the way we listen to events. With finalizers there can be a huge delay until the cluster is finally gone, and the DELETE event is then emitted to the queue when the job is already done - so the logs look a little confusing now.

Anyway, all of this might soon be history if we switch to K8s controller runtime.
