Have a max-recoveries-per-hour limitation or similar. Even across clusters, we may wish to get a human involved when too many things are breaking concurrently.
So: stop automatic recovery once you hit X failures in S seconds.
Releasing this brake should (configurably) require manual action. That is, once the brake is applied you MUST explicitly disable it.
I notice there's a --noop option, but using it requires getting onto the orchestrator server, changing orchestrator.conf.json, and restarting orchestrator. If you're in a cluster, another orchestrator process is likely to take over, so this does not take effect as immediately as you might hope.
Consequently I'd be tempted to have a storage setting for "globalAutomaticRecoveryDisabled" which is read every few seconds, so that the running/active node takes it into consideration. The GUI should also have a way to change this setting, "GlobalAutomaticRecovery: Disabled/Enabled", which updates this table, and there should be an appropriate CLI entry to query/enable/disable this behaviour, perhaps with a hook to notify people of the change in state.
This is a long list of things I would like to see. It may not seem useful to have all of this but a global failure such as a DC failure may make this sort of brake quite useful.