

feat(options): Add consolidation timeout options #1754

Open

Pokom wants to merge 3 commits into base: main
Conversation

@Pokom Pokom commented Oct 15, 2024

Adds two options for the following timeouts:

  • multinodeconsolidation
  • singlenodeconsolidation

These are exposed in the following ways:

  • --multi-node-consolidation-timeout or MULTI_NODE_CONSOLIDATION_TIMEOUT
  • --single-node-consolidation-timeout or SINGLE_NODE_CONSOLIDATION_TIMEOUT

This was primarily tested by building the image and running it in dev and production clusters within the Grafana Labs fleet.


Fixes #N/A


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


- refs kubernetes-sigs#1733

Signed-off-by: pokom <[email protected]>
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 15, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Pokom
Once this PR has been reviewed and has the lgtm label, please assign njtran for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @Pokom!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 15, 2024
@k8s-ci-robot
Contributor

Hi @Pokom. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 15, 2024

@nikimanoledaki nikimanoledaki left a comment


Spotted a few changes that are needed.
🚀 let's go!

@@ -119,7 +117,7 @@ func (m *MultiNodeConsolidation) firstNConsolidationOption(ctx context.Context,
 	lastSavedCommand := Command{}
 	lastSavedResults := scheduling.Results{}
 	// Set a timeout
-	timeout := m.clock.Now().Add(MultiNodeConsolidationTimeoutDuration)
+	timeout := m.clock.Now().Add(options.FromContext(ctx).SinglenodeConsolidationTimeout)


This should be:

Suggested change
-timeout := m.clock.Now().Add(options.FromContext(ctx).SinglenodeConsolidationTimeout)
+timeout := m.clock.Now().Add(options.FromContext(ctx).MultiNodeConsolidationTimeout)


@nikimanoledaki nikimanoledaki Oct 16, 2024


Fixing this might fix the failing test (specifically this) - not sure 🤔

@@ -96,6 +98,8 @@ func (o *Options) AddFlags(fs *FlagSet) {
 	fs.DurationVar(&o.BatchMaxDuration, "batch-max-duration", env.WithDefaultDuration("BATCH_MAX_DURATION", 10*time.Second), "The maximum length of a batch window. The longer this is, the more pods we can consider for provisioning at one time which usually results in fewer but larger nodes.")
 	fs.DurationVar(&o.BatchIdleDuration, "batch-idle-duration", env.WithDefaultDuration("BATCH_IDLE_DURATION", time.Second), "The maximum amount of time with no new pending pods that if exceeded ends the current batching window. If pods arrive faster than this time, the batching window will be extended up to the maxDuration. If they arrive slower, the pods will be batched separately.")
 	fs.StringVar(&o.FeatureGates.inputStr, "feature-gates", env.WithDefaultString("FEATURE_GATES", "SpotToSpotConsolidation=false"), "Optional features can be enabled / disabled using feature gates. Current options are: SpotToSpotConsolidation")
+	fs.DurationVar(&o.MultinodeConsolidationTimeout, "multi-node-consolidation-timeout", env.WithDefaultDuration("MULTI_NODE_CONSOLIDATION_TIMEOUT", 1*time.Minute), "The maximum amount of time that can be spent doing multinode consolidation before timing out. Defaults to 1 minute")
+	fs.DurationVar(&o.SinglenodeConsolidationTimeout, "single-node-consolidation-timeout", env.WithDefaultDuration("SINGLE_NODE_CONSOLIDATION_TIMEOUT", 3*time.Minute), "The maximum amount of time that can be spent doing single node consolidation before timing out. Defaults to 3 minute")


Nit 😄

Suggested change
-fs.DurationVar(&o.SinglenodeConsolidationTimeout, "single-node-consolidation-timeout", env.WithDefaultDuration("SINGLE_NODE_CONSOLIDATION_TIMEOUT", 3*time.Minute), "The maximum amount of time that can be spent doing single node consolidation before timing out. Defaults to 3 minute")
+fs.DurationVar(&o.SinglenodeConsolidationTimeout, "single-node-consolidation-timeout", env.WithDefaultDuration("SINGLE_NODE_CONSOLIDATION_TIMEOUT", 3*time.Minute), "The maximum amount of time that can be spent doing single node consolidation before timing out. Defaults to 3 minutes")

Author


I'm going to remove those altogether, since they're redundant with the output of the --help flag:

  -multi-node-consolidation-timeout duration
        The maximum amount of time that can be spent doing multinode consolidation before timing out. Defaults to 1 minute (default 1m0s)
  -single-node-consolidation-timeout duration
        The maximum amount of time that can be spent doing single node consolidation before timing out. Defaults to 3 minute (default 3m0s)

Comment on lines 64 to 65
MultinodeConsolidationTimeout time.Duration
SinglenodeConsolidationTimeout time.Duration


Could we rename these to MultiNodeConsolidationTimeoutDuration (capitalised N for Node) and SingleNodeConsolidationTimeout? This would be similar to the previous const and keep consistency with the MultiNodeConsolidation struct.

@coveralls

Pull Request Test Coverage Report for Build 11371495375

Details

  • 4 of 4 (100.0%) changed or added relevant lines in 3 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.02%) to 80.934%

Files with Coverage Reduction | New Missed Lines | %
pkg/scheduling/requirements.go | 2 | 98.01%
Totals Coverage Status
Change from base Build 11332670114: 0.02%
Covered Lines: 8494
Relevant Lines: 10495

💛 - Coveralls

@Pokom Pokom marked this pull request as ready for review October 24, 2024 11:14
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 24, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 7, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

@njtran njtran left a comment


The main problem I have with this is that it locks in API that looks at our implementation. If we change the implementation in the future, it makes it tougher to do so. I could see this being an "alpha" feature, but I also wonder if there's a way to make this apply to both types without specifically naming them?

@Pokom
Copy link
Author

Pokom commented Nov 8, 2024

The main problem I have with this is that it locks in API that looks at our implementation. If we change the implementation in the future, it makes it tougher to do so. I could see this being an "alpha" feature, but I also wonder if there's a way to make this apply to both types without specifically naming them?

That's fair. I'm not tied to any implementation here, and having one variable + config is certainly easier to manage than multiple. I'll be at KubeCon next week, so it'll be a bit of time before I can pick this up again.


This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 23, 2024
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants