Rate limit orchestrator recoveries even across topologies #206

shlomi-noach · 2016-05-03T18:19:49Z

Have a max-recoveries-per-hour limitation or similar. Even across clusters, we may wish to get a human involved when too many things are breaking concurrently.

sjmudd · 2016-05-04T12:11:37Z

For situations like this a brake is good.

So limit automatic recovery if you hit X failures in S seconds.
This brake should (configurably) require a manual release: once the brake is applied, you MUST explicitly disable it.
I notice there's a --noop option, but that requires getting onto the orchestrator server, changing orchestrator.conf.json, and restarting orchestrator. If you're in a cluster, another orchestrator process is likely to take over, so this does not work as immediately as you might hope.
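
To make the "X failures in S seconds, then stay off until a human says otherwise" idea concrete, here is a minimal Python sketch. It is not orchestrator code: the class name `RecoveryBrake` and its methods are hypothetical, chosen only to illustrate a sliding-window counter combined with a latch that needs an explicit release.

```python
import time
from collections import deque


class RecoveryBrake:
    """Hypothetical brake: engages after max_failures within window_seconds
    and stays engaged until release() is called explicitly."""

    def __init__(self, max_failures, window_seconds, clock=time.monotonic):
        self.max_failures = max_failures      # "X" in the comment above
        self.window_seconds = window_seconds  # "S" in the comment above
        self.clock = clock                    # injectable for testing
        self.failures = deque()               # timestamps of recent failures
        self.engaged = False

    def record_failure(self):
        now = self.clock()
        self.failures.append(now)
        # Forget failures that have fallen out of the window.
        while self.failures and now - self.failures[0] > self.window_seconds:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            self.engaged = True  # latch: never cleared automatically

    def may_recover(self):
        """Automatic recovery is allowed only while the brake is not engaged."""
        return not self.engaged

    def release(self):
        """The explicit, manual step that re-enables automatic recovery."""
        self.failures.clear()
        self.engaged = False
```

The point of the latch is that the passage of time alone never re-enables recovery; only `release()` does, which is the "you MUST explicitly disable it" behaviour.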

Consequently I'd be tempted to have a stored setting, "globalAutomaticRecoveryDisabled", which is read every few seconds and which the running/active node takes into consideration. The GUI should also have a way to change this setting ("GlobalAutomaticRecovery: Disabled/Enabled") which updates this table, and there should be an appropriate CLI entry to query/enable/disable this behaviour, perhaps with a hook to notify people of the change in state.
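
A rough Python sketch of the stored flag, again purely illustrative: it uses an in-process SQLite database as a stand-in for the backend store, and the table name `global_settings`, the helper functions, and the `CachedFlag` class are all assumptions rather than anything that exists in orchestrator.

```python
import sqlite3
import time

FLAG_NAME = "globalAutomaticRecoveryDisabled"


def init_store(conn):
    # Hypothetical key/value table shared by all nodes.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS global_settings "
        "(name TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.commit()


def set_recovery_disabled(conn, disabled):
    # What a GUI toggle or CLI enable/disable command would write.
    conn.execute(
        "INSERT OR REPLACE INTO global_settings (name, value) VALUES (?, ?)",
        (FLAG_NAME, "1" if disabled else "0"),
    )
    conn.commit()


def read_recovery_disabled(conn):
    # A missing row is treated as "recovery enabled".
    row = conn.execute(
        "SELECT value FROM global_settings WHERE name = ?", (FLAG_NAME,)
    ).fetchone()
    return row is not None and row[0] == "1"


class CachedFlag:
    """Re-reads the stored flag at most once every refresh_seconds."""

    def __init__(self, conn, refresh_seconds=5.0, clock=time.monotonic):
        self.conn = conn
        self.refresh_seconds = refresh_seconds
        self.clock = clock
        self._last_read = None
        self._value = False

    def is_disabled(self):
        now = self.clock()
        if self._last_read is None or now - self._last_read >= self.refresh_seconds:
            self._value = read_recovery_disabled(self.conn)
            self._last_read = now
        return self._value
```

Because every node reads the same table, whichever node is active picks up a change within one refresh interval, with no config edit or restart. The trade-off is that a change can take up to `refresh_seconds` to be noticed.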

This is a long list of things I would like to see. It may not seem useful to have all of this but a global failure such as a DC failure may make this sort of brake quite useful.
