Include MemSpecLimit when calculating defmem #3300

Open: wants to merge 1 commit into base: develop

@wiktorn wiktorn commented Nov 21, 2024

To prevent the OOM killer from killing random processes on a node, it is possible to define MemSpecLimit, which reserves some of the node's memory for the system and limits job memory to less than the node's total.

This example reserves 1024 MB of RAM for the system on the node:

      - id: debug_nodeset
        source: community/modules/compute/schedmd-slurm-gcp-v6-nodeset
        use: [network]
        settings:
          disk_size_gb: 30
          machine_type: n2d-standard-2
          node_conf:
            MemSpecLimit: 1024

But with such a definition, running a job without an explicit memory request fails with:

$ srun -p debug hostname
srun: error: Unable to allocate resources: Requested node configuration is not available

However, running the job with an explicit, smaller memory request:

$ srun -p debug --mem 100 hostname

succeeds, because it requests less memory than the node has left after the reservation.

This change subtracts the reserved memory from the total memory available on the instance before calculating DefMemPerCPU, so the default memory claim stays within the available memory.

The problem is most visible on 1-CPU nodes, but on larger nodes it can still leave at least one CPU unavailable for scheduling.
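The arithmetic behind the change can be sketched as below. This is only an illustration, not the code in this PR: the helper names `def_mem_per_cpu` and `default_job_fits` are hypothetical, the 8000 MB / 2 CPU figures are made-up round numbers rather than the real values for n2d-standard-2, and it assumes a job that uses every CPU on the node with the default per-CPU memory.

```python
def def_mem_per_cpu(real_memory_mb: int, cpus: int, mem_spec_limit_mb: int = 0) -> int:
    """Hypothetical helper: default memory per CPU, in MB.

    Subtracts the memory reserved via MemSpecLimit before dividing by the
    CPU count, which is the idea described in this PR.
    """
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    schedulable_mb = max(real_memory_mb - mem_spec_limit_mb, 0)
    return schedulable_mb // cpus


def default_job_fits(real_memory_mb: int, cpus: int, mem_spec_limit_mb: int, defmem_mb: int) -> bool:
    """Hypothetical check: can a job using all CPUs at the default memory run?"""
    return cpus * defmem_mb <= real_memory_mb - mem_spec_limit_mb


# Illustrative numbers only (assumed, not measured on a real node).
REAL_MEMORY_MB, CPUS, MEM_SPEC_LIMIT_MB = 8000, 2, 1024

# Ignoring the reservation: the default claim exceeds what is schedulable.
old_defmem = def_mem_per_cpu(REAL_MEMORY_MB, CPUS)
old_fits = default_job_fits(REAL_MEMORY_MB, CPUS, MEM_SPEC_LIMIT_MB, old_defmem)

# Accounting for the reservation: the default claim fits.
new_defmem = def_mem_per_cpu(REAL_MEMORY_MB, CPUS, MEM_SPEC_LIMIT_MB)
new_fits = default_job_fits(REAL_MEMORY_MB, CPUS, MEM_SPEC_LIMIT_MB, new_defmem)

print(old_defmem, old_fits)
print(new_defmem, new_fits)
```

With these assumed numbers, the default computed without the reservation asks for all 8000 MB across two CPUs while only 6976 MB is schedulable, so one CPU's worth of default memory cannot be granted; with the reservation subtracted, both CPUs fit.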

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@wiktorn wiktorn requested a review from mr0re1 November 21, 2024 15:41
@wiktorn wiktorn added the release-improvements Added to release notes under the "Improvements" heading. label Nov 21, 2024