-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add console manager supervisor logic w/ restart option #270
feat: add console manager supervisor logic w/ restart option #270
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mostly lgtm for the most part but I think @zreigz should give it a close review as well.
One thing i'm wondering about though is if the heartbeat approach is unnecessary. You only really need heartbeats for multi-process communication (monitoring liveness in a distributed system). Couldn't we have some wrapper class like:
type MonitoredRoutine struct {
alive bool
Runnable func()
}
func (r *MonitoredRoutine) Run() {
defer func() {
// catch panics
r.alive = false
}()
r.Runnable()
}
func (r *MonitoredRoutine) Alive() bool { return r.alive }
and then if any of them die, just force restart the controller according to some mechanism?
I suppose part of the problem here is you don't have a natural way to restart child goroutines too, but this seems like a more robust and general api for liveness than a heartbeat, could also write to a channel to signal a parent goroutine that it needs to restart.
What might be problematic with this approach is detecting if the controller is still running or not. Heartbeat in this case is the last poll time. Since we have information about how often polling should be executed, we can calculate the time difference between last poll time and current time to see if controller could be dead. Recovering from panic technically does not help us much since if it will panic the app should crash and pod will be restarted anyway. We should try to avoid a situation where there is no panic but controller for some unknown reason stopped polling/reconciling. |
…eployment-operator-service-reconcilers-died
PollUntilContextCancel
usage in the console controller manager not to rely on our internal method implementation when deciding when to stop polling. Internal method will only return error now that can be logged but the poll function will always returnfalse, nil
(never stop).