Instance Memory calculation #17

Open
ghost opened this issue Mar 8, 2021 · 2 comments
Labels
bug Something isn't working

Comments

@ghost

ghost commented Mar 8, 2021

I recently tried to start up a c5.12xlarge instance on AWS and ran into a case where /mnt/shared/etc/slurm.conf claims the instance should have RealMem=94992, but when the node comes up, slurmctld.log shows that it has less memory than slurm.conf indicates, so Slurm rejects the node (puts it in DRAIN state):

[2021-03-05T18:40:19.392] _slurm_rpc_submit_batch_job: JobId=9 InitPrio=4294901757 usec=551
[2021-03-05T18:40:19.879] sched: Allocate JobId=9 NodeList=nice-wolf-c5-12xlarge-0001 #CPUs=48 Partition=compute
[2021-03-05T18:41:58.852] error: Node nice-wolf-c5-12xlarge-0001 has low real_memory size (94256 < 94992)
[2021-03-05T18:41:58.852] Node nice-wolf-c5-12xlarge-0001 now responding
[2021-03-05T18:41:58.852] error: Setting node nice-wolf-c5-12xlarge-0001 state to DRAIN
[2021-03-05T18:41:58.852] drain_nodes: node nice-wolf-c5-12xlarge-0001 state set to DRAIN
[2021-03-05T18:41:58.852] error: _slurm_rpc_node_registration node=nice-wolf-c5-12xlarge-0001: Invalid argument
[2021-03-05T18:41:59.855] error: Node nice-wolf-c5-12xlarge-0001 has low real_memory size (94256 < 94992)
[2021-03-05T18:41:59.855] error: _slurm_rpc_node_registration node=nice-wolf-c5-12xlarge-0001: Invalid argument

This led me to the following calculation for expected RealMem for AWS:

"memory": d["MemoryInfo"]["SizeInMiB"]

        "memory": d["MemoryInfo"]["SizeInMiB"] - int(math.pow(d["MemoryInfo"]["SizeInMiB"], 0.7) * 0.9 + 500),

Contrast this to GCP memory calculation:

"memory": int(math.pow(mt["memoryMb"], 0.7) * 0.9 + 500),

        "memory": int(math.pow(mt["memoryMb"], 0.7) * 0.9 + 500),

It appears that the AWS config is attempting to estimate how much memory will actually be available (versus what is advertised), but the code for GCP is drastically underestimating.
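
To make that concrete, here is a rough side-by-side (purely illustrative; 16384 MB is an arbitrary machine size, not one from this thread) of what the two expressions report:

    import math

    mem_mb = 16384  # arbitrary example machine size

    aws_style = mem_mb - int(math.pow(mem_mb, 0.7) * 0.9 + 500)
    gcp_style = int(math.pow(mem_mb, 0.7) * 0.9 + 500)

    print(aws_style)  # 15082 -- advertised memory minus an estimated overhead
    print(gcp_style)  # 1302  -- only the overhead term, a tiny fraction of the real memory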

Heavily under-estimating the amount of available memory makes Slurm more tolerant of nodes that don't quite meet their advertised claims; however, it can cause issues when jobs request a specific amount of memory. These two cloud calculations should probably be consistent, but I think the estimates need to be more conservative (lower) than AWS currently calculates, as shown by the c5.12xlarge example above.

@milliams milliams added the bug Something isn't working label Mar 8, 2021
@milliams
Member

milliams commented Mar 8, 2021

When we made these calculations, we didn't have any Google credits to do a full survey to base them on. I did some calculations based on a bunch of AWS nodes and eye-balled a relationship:

[attached image: plot of the eye-balled relationship between advertised and available memory across AWS instance types]

For AWS, we didn't include C5 nodes in the mix so I guess they're just outside the bounds of what works. I'll try to put together the data we collected and the code to visualise it in the wiki in this repo so that we can re-evaluate.
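
In case it helps when we re-evaluate, here is a sketch (not project code) of how the overhead curve could be re-fitted from (advertised, registered) pairs, keeping the same overhead ≈ a * advertised**b + c shape the current code assumes. The arrays below are synthetic placeholders generated from the current formula, purely so the snippet runs; the real survey measurements would go in their place:

    import numpy as np
    from scipy.optimize import curve_fit

    def overhead(advertised_mib, a, b, c):
        # same functional form as the current estimate: a * advertised**b + c
        return a * np.power(advertised_mib, b) + c

    # Placeholder data from the current formula; swap in real (advertised, registered) pairs.
    advertised = np.array([4096.0, 8192.0, 16384.0, 32768.0, 65536.0, 98304.0])
    measured_overhead = 0.9 * np.power(advertised, 0.7) + 500

    (a, b, c), _ = curve_fit(overhead, advertised, measured_overhead, p0=[1.0, 0.7, 400.0])
    print(a, b, c)  # recovers roughly 0.9, 0.7, 500 for this synthetic data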

@boegel

boegel commented Jun 15, 2023

I was hitting problems with c5.4xlarge as well, where the node frequently went into DRAIN state in Slurm due to Low RealMemory.

To work around this, I've tweaked the +500 in /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py to +1000, which effectively tells Slurm there's ~500MB less memory available and reduces the chance that Slurm will report the node as low on memory when it comes up.

$ diff -u /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py.orig /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py
--- /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py.orig	2023-06-15 21:26:00.303448073 +0000
+++ /opt/cloud_sdk/lib64/python3.8/site-packages/citc/aws.py	2023-06-15 21:21:49.035548872 +0000
@@ -96,7 +96,7 @@
     return {
         s: {
             "memory": d["MemoryInfo"]["SizeInMiB"]
-            - int(math.pow(d["MemoryInfo"]["SizeInMiB"], 0.7) * 0.9 + 500),
+            - int(math.pow(d["MemoryInfo"]["SizeInMiB"], 0.7) * 0.9 + 1000),
             "cores_per_socket": d["VCpuInfo"].get(
                 "DefaultCores", d["VCpuInfo"]["DefaultVCpus"]
             ),

Before this change (30965MB of memory):

[citc@mgmt ~]$ list_nodes
NODELIST                                STATE       REASON                        CPUS S:C:T   MEMORY    AVAIL_FEATURES                          GRES                NODE_ADDR           TIMESTAMP
fair-mastodon-c5-4xlarge-0001           idle~       none                          16   1:8:2   30965     shape=c5.4xlarge,ad=None,arch=x86_64    (null)              10.0.44.152         Unknown

After this change (30465MB of memory, so effectively -500):

[citc@mgmt ~]$ list_nodes
NODELIST                                STATE       REASON                        CPUS S:C:T   MEMORY    AVAIL_FEATURES                          GRES                NODE_ADDR           TIMESTAMP
fair-mastodon-c5-4xlarge-0001           idle~       none                          16   1:8:2   30465     shape=c5.4xlarge,ad=None,arch=x86_64    (null)              10.0.44.152         Unknown
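
For what it's worth, both numbers fall straight out of the formula for the 32768 MiB (32 GiB) that a c5.4xlarge advertises:

    import math

    size_mib = 32768  # advertised memory of a c5.4xlarge

    print(size_mib - int(math.pow(size_mib, 0.7) * 0.9 + 500))   # 30965 (before the tweak)
    print(size_mib - int(math.pow(size_mib, 0.7) * 0.9 + 1000))  # 30465 (after the tweak)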
