feat: add console manager supervisor logic w/ restart option #270

floreks · 2024-09-17T15:16:45Z

Added supervised controller startup that can restart a controller when its last poll/reconcile time indicates that it may have died.
Updated the Reconciler interface and Controller implementation so that a controller can be restarted.
Replaced the three different logger implementations used across the codebase with a single klog logger.
The poll/refresh/jitter interval args are now correctly used by the controllers.
Cleaned up the console Reconcilers; some struct fields were not used anywhere.
Refactored the gate cache queue into a standalone cache that can now be safely reused by multiple goroutines.
Refactored queue usage across the console reconcilers to use a getter instead of a reference to a variable.
Refactored PollUntilContextCancel usage in the console controller manager so that it no longer relies on our internal method implementation to decide when to stop polling. The internal method now only returns an error, which can be logged, while the poll function always returns false, nil (never stop).
Added a controller restart metric counter to track the number of restarts per controller (if any).

linear · 2024-09-17T15:16:48Z

PROD-2611 deployment operator service reconcilers died

michaeljguarino

This lgtm for the most part, but I think @zreigz should give it a close review as well.

One thing I'm wondering about, though, is whether the heartbeat approach is unnecessary. You only really need heartbeats for multi-process communication (monitoring liveness in a distributed system). Couldn't we have some wrapper class like:

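import "sync/atomic"

type MonitoredRoutine struct {
  alive    atomic.Bool // atomic so Alive() is safe to call from other goroutines
  Runnable func()
}

func (r *MonitoredRoutine) Run() {
  r.alive.Store(true)
  defer func() {
    // catch panics so the routine is marked dead instead of crashing the process
    _ = recover()
    r.alive.Store(false)
  }()

  r.Runnable()
}

func (r *MonitoredRoutine) Alive() bool { return r.alive.Load() }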

and then if any of them die, just force restart the controller according to some mechanism?

I suppose part of the problem here is that you don't have a natural way to restart child goroutines too, but this seems like a more robust and general API for liveness than a heartbeat. It could also write to a channel to signal a parent goroutine that it needs to restart.
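The channel-based variant mentioned above could look roughly like the following sketch. The names `runMonitored` and `supervise`, and the `maxRestarts` cap, are hypothetical and exist only for illustration; swallowing the panic with `recover` is likewise an assumption of the sketch.

```go
package main

import "fmt"

// runMonitored starts run in its own goroutine and reports on died when it
// exits, whether it returned normally or panicked.
func runMonitored(name string, run func(), died chan<- string) {
	go func() {
		defer func() {
			_ = recover() // swallow the panic so the exit can be reported
			died <- name
		}()
		run()
	}()
}

// supervise keeps run alive by restarting it each time it exits, up to
// maxRestarts times, and returns the number of restarts performed.
func supervise(name string, run func(), maxRestarts int) int {
	died := make(chan string, 1)
	runMonitored(name, run, died)

	restarts := 0
	for {
		<-died // parent goroutine is signalled that the child is gone
		if restarts == maxRestarts {
			return restarts
		}
		restarts++
		runMonitored(name, run, died)
	}
}

func main() {
	calls := 0
	restarts := supervise("service-reconciler", func() {
		calls++
		panic("boom")
	}, 3)
	fmt.Println("calls:", calls, "restarts:", restarts)
}
```

This only detects goroutines that exit; a goroutine that is blocked forever never sends on the channel, so it stays invisible to this kind of supervisor.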

floreks · 2024-09-18T18:09:37Z

What might be problematic with this approach is detecting whether the controller is still running. The heartbeat in this case is the last poll time. Since we know how often polling should be executed, we can compare the last poll time with the current time to see whether the controller could be dead.

Recovering from a panic technically does not help us much: if the controller panics, the app should crash and the pod will be restarted anyway.

We should try to avoid a situation where there is no panic but the controller, for some unknown reason, has stopped polling/reconciling.

maciaszczykm · 2024-09-19T08:36:40Z

I reviewed it as well, and then we talked about it with @floreks and @zreigz. It looks good to me; pollers getting stuck for any reason should no longer be an issue. One thing that could be added is validation of the args, to avoid situations such as the poll interval or jitter being too short.

feat: add console manager supervisor logic w/ restart option

00b715f

github-actions bot added the size/XXL label Sep 17, 2024

seemywingz approved these changes Sep 17, 2024

floreks added 2 commits September 18, 2024 14:52

restart only a single controller instead of all if only one fails

2ba374e

fix unit tests and lint

2f4ca12

floreks changed the title ~~wip: feat: add console manager supervisor logic w/ restart option~~ feat: add console manager supervisor logic w/ restart option Sep 18, 2024

cleanup

4f54af5

floreks self-assigned this Sep 18, 2024

floreks added 2 commits September 18, 2024 16:28

remove test code

a05df4c

update comment and heartbeat deadline

a283eb1

michaeljguarino reviewed Sep 18, 2024

zreigz approved these changes Sep 19, 2024

maciaszczykm approved these changes Sep 19, 2024

floreks added 8 commits September 19, 2024 10:40

set min value for poll/jitter interval to at least 10 seconds

c3b46bc

fix lint

c67d9a1

fix reconcile default polling time

019a6e0

add additional last reconcile time check to service controller

ad96f54

fix lint

483661c

cleanup code

dab1c8a

add controller restart metric counter

89a2b48

Merge remote-tracking branch 'origin/main' into sebastian/prod-2611-deployment-operator-service-reconcilers-died

08432d8

floreks merged commit f7fa8a7 into main Sep 25, 2024
33 checks passed

floreks deleted the sebastian/prod-2611-deployment-operator-service-reconcilers-died branch September 25, 2024 10:04
