Can initial cluster startup time be improved? #72
Comments
Hi Dan, I have not been bothered by start-up times because we are mostly on AWS EC2, where things really tend to add up (e.g. the Auto Scaling Group creating instances, etc.), and therefore I hadn't thought of any solutions until now. You're right about the trade-offs regarding retries/timeouts and ensuring we don't get false negatives, as that could be pretty disastrous. Depending on your target, there might be a few things we can do to reach it without having to introduce shared state (although we could definitely imagine a world where a centralized state is published to some resource in Kubernetes to accelerate operations, if that would meet your requirements). Dumbest idea: we could make the check interval dynamic based on which step we're on (e.g. on "START" steps we could continuously loop without a waiting period). Another way we could speed things up would be to take …
We could also introduce some sort of watcher triggering …
Thanks for the ideas! Something I think needs to be accounted for is the fact that the evaluate step actually collects three discrete pieces of data the state machine needs in order to execute:
As long as each of those components is coupled into the same logical evaluation function, it seems the state machine won't be able to safely make decisions until the etcd health check completes, even if the checking is asynchronous. For example, if the state machine executes concurrently with the initial asynchronous evaluation, it could act on an uninitialized cluster-size value, which is dangerous (e.g. incorrectly computing the quorum size). To achieve safe async behavior, I wonder if the three evaluation components need to be decoupled into their own async routines, and also represented with a structure that enables the logic to distinguish uninitialized values.
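A minimal sketch of that decoupling, assuming Go, might be a status structure whose fields are pointers so "never set" is distinguishable from a zero value. The field and method names here are illustrative, not the project's actual API:

```go
package main

import (
	"fmt"
	"sync"
)

// ClusterStatus holds the evaluation components separately so the state
// machine can tell an uninitialized value from a real one. Each field
// would be populated by its own async routine.
type ClusterStatus struct {
	mu sync.Mutex

	healthy     *bool   // etcd health result; nil until the first check completes
	clusterSize *int    // nil until discovered, so quorum math never sees a zero default
	memberState *string // nil until the member's state is known
}

// SetHealth records an etcd health result from its own async routine.
func (c *ClusterStatus) SetHealth(h bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.healthy = &h
}

// SetClusterSize records the discovered size from its own async routine.
func (c *ClusterStatus) SetClusterSize(n int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.clusterSize = &n
}

// QuorumSize returns (0, false) while the size is still unknown, so
// callers cannot accidentally act on an uninitialized value.
func (c *ClusterStatus) QuorumSize() (int, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.clusterSize == nil {
		return 0, false
	}
	return *c.clusterSize/2 + 1, true
}

func main() {
	var s ClusterStatus
	if _, ok := s.QuorumSize(); !ok {
		fmt.Println("size not yet known; state machine must wait") // uninitialized case
	}
	s.SetClusterSize(3)
	if q, ok := s.QuorumSize(); ok {
		fmt.Println("quorum:", q) // prints "quorum: 2" for a 3-node cluster
	}
}
```

With this shape, the state machine can start executing immediately and simply refuse any transition whose inputs are still `nil`, rather than blocking on one monolithic evaluation.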
This is just one way I thought of to try to resolve it, but do you agree with the general challenge around coupling the status component collections? It happens that for my use case I need to optimize startup times to improve the bottom-line, end-to-end time to create lightweight ephemeral (but HA) clusters, so I'm willing to incur some additional complexity to achieve the gains, but I can understand that your looser constraints could make it harder to justify the risk of serious architectural changes. I really appreciate you taking the time to think through it!
I implemented a prototype of the decoupling I described, and so far the results are promising. I have a lot of testing ahead before I would say I'm confident that I haven't introduced some other issues, and I have some other refactoring ideas if the basic idea turns out to be solid.
In my testing, I've observed that the cold-start time for a new 3-node cluster can range from 1 to 2 minutes depending on the timing of instance startup. This seems to result from the unconditional, fixed etcd health-checking interval (worst case 30s, given retries and timeouts) and the absence of any assumptions that would allow the program to distinguish between an unreachable cluster and one that never existed (i.e. not worth checking and assumed to be unhealthy). Two minutes is the average I observe over repeated testing in a more real-world environment (a StatefulSet on Kubernetes with persistent volumes, etc.). The usual case is that each instance must incur multiple futile etcd health checks before the state machines converge to the startup conditions.
I've been thinking about how to optimize initial startup time, but I haven't thought of a good solution yet that doesn't introduce some kind of shared state or come with some other poor trade-off (e.g. reducing the etcd health retries/timeouts and increasing the possibility of false-negative health check results).
I suspect that the delay is basically a tradeoff that comes with the stateless architecture, but I wonder if this is something you've thought about before. Very curious to hear what you think!