
Use finalizers to avoid losing delete events and to ensure full resource cleanup #941

Merged: 9 commits merged into zalando:master on Jan 4, 2024

Conversation

@frittentheke (Contributor) commented Apr 28, 2020

Currently the operator fully depends on the events it receives while running. Although it has a sync loop that iterates over the clusters again and again to sync or repair them, if a delete event is lost the CR is gone but its resources might not be, leaving resources such as StatefulSets or Secrets dangling. A user usually does not have to interact with those resources directly, yet it then becomes their responsibility to clean them up.
To protect a CR from being deleted before proper cleanup has happened, finalizers (https://kubernetes.io/docs/tasks/access-kubernetes-api/custom-resources/custom-resource-definitions/#finalizers) are used; this is established best practice.

This PR works towards issue #450: it adds a finalizer to the CR when a cluster is first created and only removes it once its resources have been deleted, at which point the CR is garbage collected by Kubernetes.

A deleted postgres resource is then only "marked" with a deletion timestamp but remains in the API (with the deletion timestamp set). To reconcile cases where the deletion event was "missed" (e.g. the operator was not running), I added a check during the sync loop (and the initial sync of all clusters) for an existing deletion timestamp; if one is set, the operator then triggers the deletion.

I took the liberty of giving the Delete function an error return value, but so far I only added one failure condition (finalizer removal failed). To fully tackle #450 (missed delete event) and, even more importantly, #498 (incomplete/failing resource cleanup), full error handling should be added to the various resource deletions. If the operator cannot clean up all the resources it created, it should report this (logs, events on the CR) but NOT remove the finalizer / CR until it can.
This prevents a new CR with the same (cluster) name from being created (since the old one is not deleted while the finalizer is set) and conflicting with the resources that were not cleaned up. A rough sketch of the intended flow follows below.
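
To make the intended flow a bit more concrete, here is a minimal, self-contained sketch of the logic described above (this is not the operator's actual code; names like finalizerName, Cluster and deleteResources are made up for illustration):

```go
package finalizer

const finalizerName = "postgres-operator.example.org/cleanup"

// Cluster stands in for the postgresql custom resource.
type Cluster struct {
	Finalizers        []string
	DeletionRequested bool // stands in for a non-zero metadata.deletionTimestamp
}

func hasFinalizer(c *Cluster) bool {
	for _, f := range c.Finalizers {
		if f == finalizerName {
			return true
		}
	}
	return false
}

func removeFinalizer(c *Cluster) {
	kept := make([]string, 0, len(c.Finalizers))
	for _, f := range c.Finalizers {
		if f != finalizerName {
			kept = append(kept, f)
		}
	}
	c.Finalizers = kept
}

// reconcile runs on every received event and on every sync loop iteration.
func reconcile(c *Cluster, deleteResources func() error) error {
	if !c.DeletionRequested {
		// Live cluster: make sure the finalizer is present so the CR cannot
		// be garbage collected before cleanup has run.
		if !hasFinalizer(c) {
			c.Finalizers = append(c.Finalizers, finalizerName)
		}
		return nil
	}
	// Deletion requested, either via a delete event or discovered during a
	// later sync: delete the owned resources first and keep the finalizer
	// on failure, so the next sync run retries.
	if err := deleteResources(); err != nil {
		return err
	}
	// Everything is cleaned up: drop the finalizer so Kubernetes removes the CR.
	removeFinalizer(c)
	return nil
}
```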

I don't know if this relates to #387 at all, but @sdudoladov mentioned finalizers there - maybe this PR also helps by providing initial scaffolding.

closes #450

@frittentheke changed the title from "Use finalizers avoid losing deletions [starts on #450]" to "Use finalizers to avoid losing delete events and to allow full cleanup" on Apr 28, 2020
@FxKu added this to the 1.6 milestone on Apr 29, 2020
@frittentheke (Contributor, Author) commented:

@FxKu while testing this on a local build we ran into issues with nil pointers in the delete functions.
This is why I added commit f36bba3 to make all of those map-accessing functions behave similarly and not crash when map entries for Services are nil.

@frittentheke (Contributor, Author) commented May 5, 2020

I added some more commits to close the loop around the Finalizer:

  1. 05fdbb2 makes all the deleteXYZresource() functions not treat ResourceNotFound as an actual error

  2. f567667 sends out events for (actual) errors to the postgres resource so the user knows there are issues removing things

  3. 9bc3d8f ensures the finalizer is not removed until all resources could be deleted successfully (or do not exist anyway).

As for the anyErrors boolean, this is the smallest footprint in my opinion. Certainly the error handling and propagation could be straightened out much further, e.g. by collecting all the errors and handling them jointly in the calling function, but reworking the whole error logic can always happen later. I wanted this PR to be about improving the way the operator carefully ensures that all created resources are deleted again, and about using the finalizer to a) allow for retries (on the next sync run) and b) ensure no delete event is ever missed, even when the operator is not running.
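
Roughly, the anyErrors pattern looks like the sketch below (helper names and the map of per-resource delete functions are illustrative assumptions, not the actual code; apierrors is k8s.io/apimachinery/pkg/api/errors):

```go
package cleanup

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// deleteAll runs every per-resource delete function and reports whether all
// of them succeeded. A "not found" error is fine when deleting; any other
// error is surfaced as an event so the user sees it, and the caller keeps
// the finalizer in place so the next sync run retries the cleanup.
func deleteAll(deleters map[string]func() error, emitEvent func(msg string)) bool {
	anyErrors := false
	for name, deleteResource := range deleters {
		err := deleteResource()
		if err == nil || apierrors.IsNotFound(err) {
			continue
		}
		anyErrors = true
		emitEvent("could not delete " + name + ": " + err.Error())
	}
	return !anyErrors
}
```

The caller would only strip the finalizer from the CR when deleteAll returns true.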

@frittentheke force-pushed the finalizer branch 4 times, most recently from 4386c2d to 8e7c4ed, on May 5, 2020 at 17:48
@frittentheke (Contributor, Author) commented:

Two more additions:

  1. 8e7c4ed - When deleting a whole cluster and some resources are simply not there (anymore), we should not treat this as an error. Log it and carry on deleting the other things that belong to this cluster to reach a clean state.

  2. f8a06fe - Maybe a bit late to the game of implementing the finalizer logic, but when a finalizer is set on a resource the operator's informer does not receive a delete event from the Kubernetes API, only an update to the resource with the deletionTimestamp set. This commit implements that logic so the operator handles such "deletes" just like a regular deletion (see the sketch below).
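
A tiny sketch of what that update handling boils down to (stub type and queue functions are assumptions for illustration, not the operator's actual API):

```go
package events

import "time"

// PostgresqlStub stands in for the postgresql custom resource; only the
// field relevant here is modelled.
type PostgresqlStub struct {
	DeletionTimestamp *time.Time // metadata.deletionTimestamp
}

// onUpdate is the informer's update callback. With a finalizer in place the
// API server keeps the object around after a delete request and only sends
// an UPDATE with deletionTimestamp set, so that case has to be routed into
// the delete handling instead of a regular sync.
func onUpdate(prev, cur *PostgresqlStub, queueDelete, queueSync func(*PostgresqlStub)) {
	if cur.DeletionTimestamp != nil {
		queueDelete(cur) // treat like a regular deletion of the cluster
		return
	}
	queueSync(cur)
}
```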

@frittentheke changed the title from "Use finalizers to avoid losing delete events and to allow full cleanup" to "Use finalizers to avoid losing delete events and to ensure full resource cleanup" on May 5, 2020
@frittentheke (Contributor, Author) commented Jul 30, 2020

@FxKu are you still considering this PR for 1.6? If so, should I rebase it then?

@FxKu removed this from the 1.6 milestone on Aug 26, 2020
@frittentheke (Contributor, Author) commented:

@FxKu could you kindly give me an indication whether I should rebase this PR at some point, or whether you don't like it enough to even consider merging it?

@frittentheke (Contributor, Author) commented:

@FxKu @CyberDem0n @Jan-M is this PR still in any way in focus for you?

See e.g. https://kubernetes.io/blog/2021/05/14/using-finalizers-to-control-deletion/ on why finalizers are quite sensible to have.

@frittentheke (Contributor, Author) commented:

@FxKu @CyberDem0n @Jan-M may I bug you once more and ask whether you are still interested in this change?

@FxKu (Member) commented Dec 3, 2021

@frittentheke sorry to keep you hanging here for so long; we feared this could break things. But since we have delete protection on the operator side (and also in our internal admission controller), we should be safe to allow cleanups when there was a delete event. I hope I will find time next week to look at it again.

@SimonDreher commented:

@FxKu may I ask whether you might find the time to look at this? Is there anything we could assist with for this PR?

@frittentheke (Contributor, Author) commented:

Maybe something to look into for 1.9 (https://github.com/zalando/postgres-operator/milestone/12) @FxKu @Jan-M ?

@FxKu (Member) commented May 19, 2022

@frittentheke I'll put it on the list again 😃 @SimonDreher it would be great if you could add tests - a unit test and an e2e test, say. For the e2e test, you could maybe implement a finalizer test for the second cluster that we create; for the other one we already test the normal delete procedure at the end.

@FxKu FxKu added this to the 1.9 milestone May 19, 2022
@bwrobc commented Jun 8, 2023

I've rebased this patchset against current master here for anyone interested.

@frittentheke (Contributor, Author) commented:

> I've rebased this patchset against current master here for anyone interested.

@bwrobc I did not check out your branch yet ... but could I reset my PR to your kindly provided rebase to keep this PR current?

@frittentheke (Contributor, Author) commented:

> I've rebased this patchset against current master here for anyone interested.
>
> @bwrobc I did not check out your branch yet ... but could I reset my PR to your kindly provided rebase to keep this PR current?

@FxKu are you still interested in this addition / PR?
I'd then spend the time to rebase it again ... likely using @bwrobc's work.

@bwrobc commented Jul 21, 2023

> I've rebased this patchset against current master here for anyone interested.
>
> @bwrobc I did not check out your branch yet ... but could I reset my PR to your kindly provided rebase to keep this PR current?

Yes, please feel free.

@FxKu (Member) commented Aug 1, 2023

Hey there 😃. I have not forgotten you. Before I give this PR another review I have one request: can we make this feature optional with a global enable_finalizers option? That way everybody can start testing it and then use it at their own discretion. We would feel more comfortable if finalizers are optional, and then we can finally merge this PR.
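
One way such an opt-in could look on the configuration side; this is only a sketch under the assumption of a nullable boolean that defaults to off, and the actual field name and wiring may end up different:

```go
package config

// Config models just the piece of the operator configuration relevant here;
// the field name EnableFinalizers is an assumption for illustration.
type Config struct {
	EnableFinalizers *bool `json:"enable_finalizers,omitempty"`
}

// FinalizersEnabled defaults to false, so existing installations keep the
// current delete behaviour unless they explicitly opt in.
func (c *Config) FinalizersEnabled() bool {
	return c.EnableFinalizers != nil && *c.EnableFinalizers
}
```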

@KunJakob commented:

Just wanted to add my interest in seeing this merged, having been a quiet spectator for a few months now. This would be such a joy to have in contexts like https://tilt.dev/ local environment setups. Being able to rapidly tear down and set up a cluster without manual deletion intervention would be a massive boon for us!

@frittentheke (Contributor, Author) commented:

@FxKu @bwrobc @KunJakob sorry for the delay. I pushed a rebased PR now and added a config option to enable finalizers. I'll have to wait for CI to see if / where I messed up somewhere ... but PTAL.

@FxKu (Member) commented Jan 4, 2024

👍

@idanovinda (Member) commented:

👍

@FxKu merged commit 743aade into zalando:master on Jan 4, 2024
3 of 4 checks passed
@FxKu (Member) commented Jan 4, 2024

So finally, we have merged your PR, @frittentheke. Thanks for this contribution and all the patience over the last years (!). There are still a few minor things I would change, but I will take it from here.

@FxKu (Member) commented Jan 24, 2024

After putting way more work into this than anticipated (refactoring, adding sync support, writing unit tests and docs), I have to say I don't see the appeal of using finalizers. At least not with our operator and the way we listen to events. With finalizers there can be a huge delay until the cluster is finally gone, and the DELETE event is then emitted to the queue when the job is already done - so the logs look a little confusing now.

Anyway, all of this might soon be history if we switch to K8s controller runtime.
