[develop][wip-1] Job Level Scaling for Node Sharing case #558

Merged
lukeseawalker merged 5 commits into aws:develop from the wip/nodeSharingJLS branch on Sep 7, 2023

Conversation

lukeseawalker
Contributor

@lukeseawalker lukeseawalker commented Sep 5, 2023

Description of changes

A series of changes needed for the work-in-progress feature "Job Level Scaling for Node Sharing".

  • Add a temporary config option to enable node-sharing job-level scaling (JLS), useful while the feature is under development

  • Split the oversubscribed job list so that the following are readily available:
    *  oversubscribed jobs with a single-node allocation
    *  oversubscribed jobs with a multi-node allocation
    *  the list of oversubscribed single nodes
    *  the list of oversubscribed multi nodes

  • Change the retry logic on DescribeInstances to exponential backoff plus a random number between 0 and 0.5.
    The random number adds jitter to avoid waves of simultaneous requests.
    The number of retries has been increased from 4 to 5.

  • Accumulate unused capacity across separate instance launch calls when the full requested allocation cannot be assigned to a job
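
As a rough illustration of the last item, the sketch below collects instances from launch calls that could not cover a job's full request. The function name `assign_or_accumulate`, the `launch` callable, and the all-or-nothing assignment rule are assumptions made for this example; they are not taken from the code in this PR.

```python
def assign_or_accumulate(requests, launch):
    """Assign launched instances to jobs, pooling capacity that cannot be used.

    requests: list of (job_id, nodes_needed) tuples (hypothetical shape).
    launch:   callable taking a count and returning a list of instance ids,
              possibly shorter than requested (hypothetical interface).
    """
    assigned = {}
    unused = []  # capacity accumulated across all launch calls
    for job_id, needed in requests:
        launched = launch(needed)
        if len(launched) >= needed:
            # Full allocation available: assign it, keep any surplus.
            assigned[job_id] = launched[:needed]
            unused.extend(launched[needed:])
        else:
            # Partial allocation: the job gets nothing, capacity is pooled.
            unused.extend(launched)
    return assigned, unused
```

How the pooled capacity is consumed afterwards is left out here, since the description does not say.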

Tests

  • unit tests added
  • manual tests performed on running cluster

References

  • n/a

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Add a temporary config option to enable node-sharing job-level scaling (JLS), useful while the feature is under development

Signed-off-by: Luca Carrogu <[email protected]>
Split the oversubscribed job list so that the following are readily available:
*  oversubscribed jobs with a single-node allocation
*  oversubscribed jobs with a multi-node allocation
*  the list of oversubscribed single nodes
*  the list of oversubscribed multi nodes

Signed-off-by: Luca Carrogu <[email protected]>
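
A minimal sketch of the split described in this commit message, for readers who want something concrete. The `Job` dataclass, the `split_oversubscribed_jobs` name, and the choice to de-duplicate node names while preserving order are all assumptions for illustration; the input is assumed to already contain only oversubscribed jobs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    # Hypothetical minimal job model: an id and the node names allocated to it.
    job_id: int
    nodes: List[str]


def split_oversubscribed_jobs(jobs):
    """Return (single_node_jobs, multi_node_jobs, single_nodes, multi_nodes)."""
    single_node_jobs = [job for job in jobs if len(job.nodes) == 1]
    multi_node_jobs = [job for job in jobs if len(job.nodes) > 1]
    # dict.fromkeys drops duplicate node names while keeping first-seen order.
    single_nodes = list(dict.fromkeys(n for job in single_node_jobs for n in job.nodes))
    multi_nodes = list(dict.fromkeys(n for job in multi_node_jobs for n in job.nodes))
    return single_node_jobs, multi_node_jobs, single_nodes, multi_nodes
```

With node sharing, two single-node jobs can land on the same node, which is why the node lists are de-duplicated in this sketch.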
@codecov

codecov bot commented Sep 5, 2023

Codecov Report

Patch coverage: 97.50% and project coverage change: +0.09% 🎉

Comparison is base (e28032f) 89.44% compared to head (7795ad3) 89.53%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #558      +/-   ##
===========================================
+ Coverage    89.44%   89.53%   +0.09%     
===========================================
  Files           16       16              
  Lines         2633     2656      +23     
===========================================
+ Hits          2355     2378      +23     
  Misses         278      278              
| Flag | Coverage Δ |
|---|---|
| unittests | 89.53% <97.50%> (+0.09%) ⬆️ |


| Files Changed | Coverage Δ |
|---|---|
| src/slurm_plugin/instance_manager.py | 99.50% <96.42%> (+0.02%) ⬆️ |
| src/slurm_plugin/fleet_manager.py | 92.30% <100.00%> (+0.07%) ⬆️ |
| src/slurm_plugin/resume.py | 76.47% <100.00%> (+0.19%) ⬆️ |
| src/slurm_plugin/slurm_resources.py | 95.48% <100.00%> (+0.02%) ⬆️ |


@lukeseawalker lukeseawalker force-pushed the wip/nodeSharingJLS branch 3 times, most recently from a215e8b to 16b8026 Compare September 6, 2023 14:08
@lukeseawalker lukeseawalker changed the title [develop] Job Level Scaling for Node Sharing case [develop][wip-1] Job Level Scaling for Node Sharing case Sep 7, 2023
@lukeseawalker lukeseawalker marked this pull request as ready for review September 7, 2023 14:12
@lukeseawalker lukeseawalker requested review from a team as code owners September 7, 2023 14:12
Change the retry logic on DescribeInstances to exponential backoff plus a random number between 0 and 0.5.
The random number adds jitter to avoid waves of simultaneous requests.

The number of retries has been increased from 4 to 5.

Signed-off-by: Luca Carrogu <[email protected]>
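
To make the retry change concrete, here is a minimal sketch of exponential backoff with additive jitter. It is not the code from this PR: the helper names, the 1-second base delay, treating the jitter as seconds, and counting 5 as retries after the first attempt are all assumptions. The wrapped callable stands in for a DescribeInstances request.

```python
import random
import time


def backoff_delays(max_retries=5, base=1.0, max_jitter=0.5, rng=random.random):
    """Yield one delay per retry: base * 2**attempt plus jitter in [0, max_jitter)."""
    for attempt in range(max_retries):
        yield base * (2 ** attempt) + rng() * max_jitter


def call_with_retries(func, max_retries=5, base=1.0, max_jitter=0.5, sleep=time.sleep):
    """Call func(); on any exception, wait and retry up to max_retries times."""
    delays = backoff_delays(max_retries, base, max_jitter)
    while True:
        try:
            return func()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: re-raise the last error
            sleep(delay)
```

The jitter matters when many callers fail at the same moment: without it they would all retry on the same schedule and hit the API together again.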
Accumulate unused capacity across separate instance launch calls when the full requested allocation cannot be assigned to a job

Signed-off-by: Luca Carrogu <[email protected]>
@lukeseawalker lukeseawalker merged commit f6d545e into aws:develop Sep 7, 2023
12 checks passed
2 participants