Instance Memory calculation #17

Open
ghost opened this issue Mar 8, 2021 · 2 comments
Labels
bug Something isn't working

Comments

@ghost

ghost commented Mar 8, 2021

I recently tried to start up a c5.12xlarge instance on AWS and ran into a case where /mnt/shared/etc/slurm.conf claims the instance should have RealMem=94992, but when the node comes up, slurmctld.log shows that it has less memory than slurm.conf indicates, so Slurm rejects the node (puts it in DRAIN state):

[2021-03-05T18:40:19.392] _slurm_rpc_submit_batch_job: JobId=9 InitPrio=4294901757 usec=551
[2021-03-05T18:40:19.879] sched: Allocate JobId=9 NodeList=nice-wolf-c5-12xlarge-0001 #CPUs=48 Partition=compute
[2021-03-05T18:41:58.852] error: Node nice-wolf-c5-12xlarge-0001 has low real_memory size (94256 < 94992)
[2021-03-05T18:41:58.852] Node nice-wolf-c5-12xlarge-0001 now responding
[2021-03-05T18:41:58.852] error: Setting node nice-wolf-c5-12xlarge-0001 state to DRAIN
[2021-03-05T18:41:58.852] drain_nodes: node nice-wolf-c5-12xlarge-0001 state set to DRAIN
[2021-03-05T18:41:58.852] error: _slurm_rpc_node_registration node=nice-wolf-c5-12xlarge-0001: Invalid argument
[2021-03-05T18:41:59.855] error: Node nice-wolf-c5-12xlarge-0001 has low real_memory size (94256 < 94992)
[2021-03-05T18:41:59.855] error: _slurm_rpc_node_registration node=nice-wolf-c5-12xlarge-0001: Invalid argument

This led me to the following calculation for expected RealMem for AWS:

"memory": d["MemoryInfo"]["SizeInMiB"]

        "memory": d["MemoryInfo"]["SizeInMiB"] - int(math.pow(d["MemoryInfo"]["SizeInMiB"], 0.7) * 0.9 + 500),

Contrast this to GCP memory calculation:

"memory": int(math.pow(mt["memoryMb"], 0.7) * 0.9 + 500),

        "memory": int(math.pow(mt["memoryMb"], 0.7) * 0.9 + 500),

It appears that the AWS config is attempting to estimate how much memory will actually be available (versus what is advertised), but the code for GCP is drastically underestimating.
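
To make that concrete, here is a rough side-by-side (purely illustrative; 16384 MB is an arbitrary machine size, not one from this thread) of what the two expressions report:

    import math

    mem_mb = 16384  # arbitrary example machine size

    aws_style = mem_mb - int(math.pow(mem_mb, 0.7) * 0.9 + 500)
    gcp_style = int(math.pow(mem_mb, 0.7) * 0.9 + 500)

    print(aws_style)  # 15082 -- advertised memory minus an estimated overhead
    print(gcp_style)  # 1302  -- only the overhead term, a tiny fraction of the real memory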

Heavily under-estimating the amount of available memory makes Slurm more tolerant of nodes that don't quite meet their advertised claims; however, it can cause issues when jobs request a specific amount of memory. These two cloud calculations should probably be consistent, but I think the estimates need to be more conservative (lower) than AWS currently calculates, as shown by the c5.12xlarge example above.

@milliams milliams added the bug Something isn't working label Mar 8, 2021
@milliams
Member

milliams commented Mar 8, 2021

When we made these calculations, we didn't have any Google credits to do a full survey to base them on. I did some calculations based on a bunch of AWS nodes and eye-balled a relationship:

[attached image: plot of the eye-balled relationship between advertised and available memory across AWS instance types]

For AWS, we didn't include C5 nodes in the mix so I guess they're just outside the bounds of what works. I'll try to put together the data we collected and the code to visualise it in the wiki in this repo so that we can re-evaluate.
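
In case it helps when we re-evaluate, here is a sketch (not project code) of how the overhead curve could be re-fitted from (advertised, registered) pairs, keeping the same overhead ≈ a * advertised**b + c shape the current code assumes. The arrays below are synthetic placeholders generated from the current formula, purely so the snippet runs; the real survey measurements would go in their place:

    import numpy as np
    from scipy.optimize import curve_fit

    def overhead(advertised_mib, a, b, c):
        # same functional form as the current estimate: a * advertised**b + c
        return a * np.power(advertised_mib, b) + c

    # Placeholder data from the current formula; swap in real (advertised, registered) pairs.
    advertised = np.array([4096.0, 8192.0, 16384.0, 32768.0, 65536.0, 98304.0])
    measured_overhead = 0.9 * np.power(advertised, 0.7) + 500

    (a, b, c), _ = curve_fit(overhead, advertised, measured_overhead, p0=[1.0, 0.7, 400.0])
    print(a, b, c)  # recovers roughly 0.9, 0.7, 500 for this synthetic data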

@boegel

boegel commented Jun 15, 2023

I was hitting problems with c5.4xlarge as well, where the node frequently went into DRAIN state in Slurm due to Low RealMemory.

To work around this, I've tweaked the +500 in /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py to +1000, which effectively tells Slurm there's ~500MB less memory available and reduces the chance that Slurm will report the node as low on memory when it comes up.

$ diff -u /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py.orig /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py
--- /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py.orig	2023-06-15 21:26:00.303448073 +0000
+++ /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py	2023-06-15 21:21:49.035548872 +0000
@@ -96,7 +96,7 @@
     return {
         s: {
             "memory": d["MemoryInfo"]["SizeInMiB"]
-            - int(math.pow(d["MemoryInfo"]["SizeInMiB"], 0.7) * 0.9 + 500),
+            - int(math.pow(d["MemoryInfo"]["SizeInMiB"], 0.7) * 0.9 + 1000),
             "cores_per_socket": d["VCpuInfo"].get(
                 "DefaultCores", d["VCpuInfo"]["DefaultVCpus"]
             ),

Before this change (30965MB of memory):

[citc@mgmt ~]$ list_nodes
NODELIST                                STATE       REASON                        CPUS S:C:T   MEMORY    AVAIL_FEATURES                          GRES                NODE_ADDR           TIMESTAMP
fair-mastodon-c5-4xlarge-0001           idle~       none                          16   1:8:2   30965     shape=c5.4xlarge,ad=None,arch=x86_64    (null)              10.0.44.152         Unknown

After this change (30465MB of memory, so effectively -500):

[citc@mgmt ~]$ list_nodes
NODELIST                                STATE       REASON                        CPUS S:C:T   MEMORY    AVAIL_FEATURES                          GRES                NODE_ADDR           TIMESTAMP
fair-mastodon-c5-4xlarge-0001           idle~       none                          16   1:8:2   30465     shape=c5.4xlarge,ad=None,arch=x86_64    (null)              10.0.44.152         Unknown
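
For what it's worth, both numbers fall straight out of the formula for the 32768 MiB (32 GiB) that a c5.4xlarge advertises:

    import math

    size_mib = 32768  # advertised memory of a c5.4xlarge

    print(size_mib - int(math.pow(size_mib, 0.7) * 0.9 + 500))   # 30965 (before the tweak)
    print(size_mib - int(math.pow(size_mib, 0.7) * 0.9 + 1000))  # 30465 (after the tweak)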
