Can initial cluster startup time be improved? #72
Comments
Hi Dan, I have not been bothered by start-up times because we are mostly on AWS EC2, where things really tend to add up (e.g. the Auto Scaling Group creating instances, etc.), and therefore I hadn't thought of any solutions until now. You're right about the trade-offs regarding retries/timeouts and ensuring we don't get false negatives, as that could be pretty disastrous. Depending on your target, there might be a few things we can do to reach it without having to introduce shared state (although we could definitely imagine a world where a centralized state is published to some resource in Kubernetes to accelerate operations, if that would meet your requirements). Dumbest idea: we could make the check interval dynamic based on which step we're on (e.g. on "START" steps we could continuously loop without a waiting period). Another way we could speed things up would be to take …
We could also introduce some sort of watcher triggering …
Thanks for the ideas! Something I think needs to be accounted for is the fact that the evaluate step actually collects three discrete pieces of data the state machine needs in order to execute:
As long as each of those components is coupled into the same logical evaluation function, it seems the state machine won't be able to safely make decisions until the etcd health check completes, even if the checking is asynchronous. For example, if the state machine executes concurrently with the initial asynchronous evaluation, it could act on an uninitialized cluster-size value, which is dangerous (e.g. incorrectly computing the quorum size). To achieve safe async behavior, I wonder if the three evaluation components need to be decoupled into their own async routines, and also represented with a structure that enables the logic to distinguish uninitialized values.
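A minimal sketch of that decoupling, assuming Go, might be a status structure whose fields are pointers so "never set" is distinguishable from a zero value. The field and method names here are illustrative, not the project's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// ClusterStatus holds the evaluation components separately so the state
// machine can tell an uninitialized value from a real one. Each field
// would be populated by its own async routine.
type ClusterStatus struct {
	mu sync.Mutex

	healthy     *bool   // etcd health result; nil until the first check completes
	clusterSize *int    // nil until discovered, so quorum math never sees a zero default
	memberState *string // nil until the member's state is known
}

// SetHealth records an etcd health result from its own async routine.
func (c *ClusterStatus) SetHealth(h bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.healthy = &h
}

// SetClusterSize records the discovered size from its own async routine.
func (c *ClusterStatus) SetClusterSize(n int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.clusterSize = &n
}

// QuorumSize returns (0, false) while the size is still unknown, so
// callers cannot accidentally act on an uninitialized value.
func (c *ClusterStatus) QuorumSize() (int, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.clusterSize == nil {
		return 0, false
	}
	return *c.clusterSize/2 + 1, true
}

func main() {
	var s ClusterStatus
	if _, ok := s.QuorumSize(); !ok {
		fmt.Println("size not yet known; state machine must wait") // uninitialized case
	}
	s.SetClusterSize(3)
	if q, ok := s.QuorumSize(); ok {
		fmt.Println("quorum:", q) // prints "quorum: 2" for a 3-node cluster
	}
}
```

With this shape, the state machine can start executing immediately and simply refuse any transition whose inputs are still `nil`, rather than blocking on one monolithic evaluation.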
This is just one way I thought of to try to resolve it, but do you agree with the general challenge around coupling the status component collections? It happens that for my use case I need to optimize startup times to improve the bottom-line, end-to-end time to create lightweight ephemeral (but HA) clusters, so I'm willing to incur some additional complexity to achieve the gains, but I can understand that your looser constraints could make it harder to justify the risk of serious architectural changes. I really appreciate you taking the time to think through it!
I implemented a prototype of the decoupling I described, and so far the results are promising. I have a lot of testing ahead before I would say I'm confident that I haven't introduced some other issues, and I have some other refactoring ideas if the basic idea turns out to be solid.
In my testing, I've observed that the cold-start time for a new 3-node cluster can range from 1 to 2 minutes depending on the timing of instance startup. This seems to result from the unconditional, fixed etcd health-checking interval (worst case 30s, given retries and timeouts) and the absence of any assumptions that would allow the program to distinguish between an unreachable cluster and one that never existed (i.e. not worth checking and assumed to be unhealthy). Two minutes is the average I observe over repeated testing in a more real-world environment (a StatefulSet on Kubernetes with persistent volumes, etc.). The usual case is that each instance must incur multiple futile etcd health checks before the state machines converge to the startup conditions.
I've been thinking about how to optimize initial startup time, but I haven't thought of a good solution yet that doesn't introduce some kind of shared state or come with some other poor trade-off (e.g. reducing the etcd health retries/timeouts and increasing the possibility of false-negative health check results).
I suspect that the delay is basically a tradeoff that comes with the stateless architecture, but I wonder if this is something you've thought about before. Very curious to hear what you think!