Drift Exponential Wave-Based Rollout #1775
Labels
deprovisioning
Issues related to node deprovisioning
kind/feature
Categorizes issue or PR as related to a new feature.
Description
What problem are you trying to solve?
When a NodePool is changed, every node owned by that NodePool can become drifted, at which point Karpenter begins automatically upgrading all of those nodes. Depending on how NodePools are architected in a cluster, this could be a large percentage, if not all, of the nodes in the cluster.
Thankfully, Karpenter natively implements ways to rate-limit the speed at which these upgrades happen. do-not-disrupt annotations limit how quickly nodes can be drained, never violating the user-defined minimum level of application availability.
Yet this doesn't solve all rollout cases. Users with quick drain times (fewer restrictions on pod evictions) may actually see rollouts happen too quickly, since Karpenter drains nodes as fast as it can, with no "bake time" between upgrades. This is particularly painful for node image issues that only present themselves after some period of time or under some level of stress/load.
As such, I'm proposing that Drift be rolled out to a cluster in waves. The waves would be computed automatically from the number of nodes in a NodePool/Cluster, the number of nodes that are drifted, and sane defaults for the total amount of time to roll out a cluster and for the number of waves.
As an example, take 100 nodes in a NodePool. If I were to drift the nodes in the NodePool, and let's say I wanted the rollout to take 24 hours, I could "leak in" nodes that can be considered drifted at regular intervals. With an increasing factor of 2, we could drift all 100 nodes in 8 waves, where each number is the cumulative count of nodes allowed to be drifted (1 -> 2 -> ... -> 64 -> 100). We can divide the 24 hours into 8 time intervals of 3 hours each, so that in the 0-3h time frame only one node is driftable, in the 18-21h time frame up to 64 nodes could be drifted, and in the final 21-24h time frame all 100 nodes could be drifted.
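To make the arithmetic in this example concrete, here is a minimal Python sketch of how such a schedule could be computed. It is not Karpenter code; the names wave_caps, wave_schedule, and Wave are hypothetical, and it assumes each wave's number is a cumulative cap on driftable nodes and that the rollout window is split into equal intervals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wave:
    """One rollout wave (hypothetical type, for illustration only)."""
    start_hour: float
    end_hour: float
    max_drifted: int  # cumulative cap on nodes that may be treated as drifted


def wave_caps(total_nodes: int, factor: int = 2) -> list[int]:
    """Return cumulative caps 1, factor, factor**2, ..., ending at total_nodes."""
    if total_nodes < 1:
        return []
    if factor < 2:
        raise ValueError("factor must be at least 2")
    caps = []
    cap = 1
    while cap < total_nodes:
        caps.append(cap)
        cap *= factor
    # The last wave always opens the rollout to every node.
    caps.append(total_nodes)
    return caps


def wave_schedule(total_nodes: int, rollout_hours: float = 24.0,
                  factor: int = 2) -> list[Wave]:
    """Split rollout_hours into equal intervals, one per cumulative cap."""
    caps = wave_caps(total_nodes, factor)
    if not caps:
        return []
    interval = rollout_hours / len(caps)
    return [
        Wave(start_hour=i * interval, end_hour=(i + 1) * interval, max_drifted=cap)
        for i, cap in enumerate(caps)
    ]


if __name__ == "__main__":
    for wave in wave_schedule(100, rollout_hours=24.0):
        print(f"{wave.start_hour:g}-{wave.end_hour:g}h: up to {wave.max_drifted} nodes")
```

For 100 nodes over 24 hours this yields 8 waves of 3 hours each, with caps 1, 2, 4, 8, 16, 32, 64, and 100. A cumulative cap (rather than a per-wave batch size) is one possible reading of the proposal; a per-wave interpretation would need a different schedule.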
Configuration could be all of or a subset of any of the following: