

feat(options): Add consolidation timeout options #1754

Open

Pokom wants to merge 3 commits into base: main
Conversation

@Pokom Pokom commented Oct 15, 2024

Adds two options for the following timeouts:

  • multinodeconsolidation
  • singlenodeconsolidation

These are exposed in the following ways:

  • --multi-node-consolidation-timeout or MULTI_NODE_CONSOLIDATION_TIMEOUT
  • --single-node-consolidation-timeout or SINGLE_NODE_CONSOLIDATION_TIMEOUT

This was primarily tested by building the image and running it in dev and production clusters within the Grafana Labs fleet.


Fixes #N/A


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


- refs kubernetes-sigs#1733

Signed-off-by: pokom <[email protected]>
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 15, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Pokom
Once this PR has been reviewed and has the lgtm label, please assign njtran for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @Pokom!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 15, 2024
@k8s-ci-robot
Contributor

Hi @Pokom. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 15, 2024

@nikimanoledaki nikimanoledaki left a comment


Spotted a few changes that are needed.
🚀 let's go!

@@ -119,7 +117,7 @@ func (m *MultiNodeConsolidation) firstNConsolidationOption(ctx context.Context,
 	lastSavedCommand := Command{}
 	lastSavedResults := scheduling.Results{}
 	// Set a timeout
-	timeout := m.clock.Now().Add(MultiNodeConsolidationTimeoutDuration)
+	timeout := m.clock.Now().Add(options.FromContext(ctx).SinglenodeConsolidationTimeout)


This should be:

Suggested change
-timeout := m.clock.Now().Add(options.FromContext(ctx).SinglenodeConsolidationTimeout)
+timeout := m.clock.Now().Add(options.FromContext(ctx).MultiNodeConsolidationTimeout)


@nikimanoledaki nikimanoledaki Oct 16, 2024


Fixing this might fix the failing test (specifically this) - not sure 🤔

@@ -96,6 +98,8 @@ func (o *Options) AddFlags(fs *FlagSet) {
 	fs.DurationVar(&o.BatchMaxDuration, "batch-max-duration", env.WithDefaultDuration("BATCH_MAX_DURATION", 10*time.Second), "The maximum length of a batch window. The longer this is, the more pods we can consider for provisioning at one time which usually results in fewer but larger nodes.")
 	fs.DurationVar(&o.BatchIdleDuration, "batch-idle-duration", env.WithDefaultDuration("BATCH_IDLE_DURATION", time.Second), "The maximum amount of time with no new pending pods that if exceeded ends the current batching window. If pods arrive faster than this time, the batching window will be extended up to the maxDuration. If they arrive slower, the pods will be batched separately.")
 	fs.StringVar(&o.FeatureGates.inputStr, "feature-gates", env.WithDefaultString("FEATURE_GATES", "SpotToSpotConsolidation=false"), "Optional features can be enabled / disabled using feature gates. Current options are: SpotToSpotConsolidation")
+	fs.DurationVar(&o.MultinodeConsolidationTimeout, "multi-node-consolidation-timeout", env.WithDefaultDuration("MULTI_NODE_CONSOLIDATION_TIMEOUT", 1*time.Minute), "The maximum amount of time that can be spent doing multinode consolidation before timing out. Defaults to 1 minute")
+	fs.DurationVar(&o.SinglenodeConsolidationTimeout, "single-node-consolidation-timeout", env.WithDefaultDuration("SINGLE_NODE_CONSOLIDATION_TIMEOUT", 3*time.Minute), "The maximum amount of time that can be spent doing single node consolidation before timing out. Defaults to 3 minute")


Nit 😄

Suggested change
-fs.DurationVar(&o.SinglenodeConsolidationTimeout, "single-node-consolidation-timeout", env.WithDefaultDuration("SINGLE_NODE_CONSOLIDATION_TIMEOUT", 3*time.Minute), "The maximum amount of time that can be spent doing single node consolidation before timing out. Defaults to 3 minute")
+fs.DurationVar(&o.SinglenodeConsolidationTimeout, "single-node-consolidation-timeout", env.WithDefaultDuration("SINGLE_NODE_CONSOLIDATION_TIMEOUT", 3*time.Minute), "The maximum amount of time that can be spent doing single node consolidation before timing out. Defaults to 3 minutes")

Author


I'm going to remove those altogether, since they're redundant with the output of the --help flag:

  -multi-node-consolidation-timeout duration
        The maximum amount of time that can be spent doing multinode consolidation before timing out. Defaults to 1 minute (default 1m0s)
  -single-node-consolidation-timeout duration
        The maximum amount of time that can be spent doing single node consolidation before timing out. Defaults to 3 minute (default 3m0s)

Comment on lines 64 to 65
MultinodeConsolidationTimeout time.Duration
SinglenodeConsolidationTimeout time.Duration


Could we rename these to MultiNodeConsolidationTimeoutDuration (capitalised N for Node) and SingleNodeConsolidationTimeout? This would be similar to the previous const and keep consistency with the MultiNodeConsolidation struct.

@coveralls

Pull Request Test Coverage Report for Build 11371495375

Details

  • 4 of 4 (100.0%) changed or added relevant lines in 3 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.02%) to 80.934%

Files with Coverage Reduction | New Missed Lines | %
pkg/scheduling/requirements.go | 2 | 98.01%
Totals Coverage Status
Change from base Build 11332670114: 0.02%
Covered Lines: 8494
Relevant Lines: 10495

💛 - Coveralls

@Pokom Pokom marked this pull request as ready for review October 24, 2024 11:14
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 24, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 7, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

@njtran njtran left a comment


The main problem I have with this is that it locks in API that looks at our implementation. If we change the implementation in the future, it makes it tougher to do so. I could see this being an "alpha" feature, but I also wonder if there's a way to make this apply to both types without specifically naming them?

@Pokom
Copy link
Author

Pokom commented Nov 8, 2024

The main problem I have with this is that it locks in API that looks at our implementation. If we change the implementation in the future, it makes it tougher to do so. I could see this being an "alpha" feature, but I also wonder if there's a way to make this apply to both types without specifically naming them?

That's fair. I'm not tied to any implementation here, and having one variable + config is certainly easier to manage than multiple. I'll be at KubeCon next week, so it'll be a bit of time before I can pick this up again.


This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants