argo_init crashes with large memory request #17
Comments
Could you please add the error output you get? |
It just returns 1 without any output. |
How much memory do your nodes have each, physically, and how much memory is available for files under /dev/shm/? |
We have 252G per node. An extra 95G can be stored in /dev/shm per node (still larger than 64G). What does ArgoDSM use /dev/shm for? |
When using ARGO_VM_SHM (the default), it creates a file descriptor in /dev/shm for mapping pages. See https://github.com/etascale/argodsm/blob/master/src/virtual_memory/shm.cpp for the use of /dev/shm. The first thing to try is to enable another VM handler. Currently available are ARGO_VM_MEMFD (requires Linux kernel 3.17 or newer) and ARGO_VM_ANONYMOUS. Please set one of these to ON (and ARGO_VM_SHM to OFF) using cmake, and recompile. If both of these show the same behaviour, I'll need more information about your system. |
With MEMFD, it failed even with a small memory request. The output is:
With ANONYMOUS, it doesn't crash. But it consumes around 160G of memory when I request only 64G, and it takes around 2 minutes to initialize. I hope I don't have to use ANONYMOUS. |
Ideally we would be able to use the ARGO_VM_* option that provides the best performance, but we have to adapt the system a bit depending on what can and cannot be changed on a given machine. We will need more information about the system you are using to be able to debug this. Can you tell us what hardware you are running this on? Would it be possible for us to get access to the machine to debug this?
An easy way to confirm that the /dev/shm size is the issue for ARGO_VM_SHM would be to initialize with 45G (and see that it works) and then with just over half the free memory in /dev/shm, e.g. 48G (and see that it fails). What is the output of
I am not sure the initialization time is that much different from what can be expected for this amount of memory; for us, initialization is usually dominated by the time MPI needs to register the memory range. As for the "swallowed memory", I do not know what numbers you are looking at, but they can be deceiving. Unless you actually cannot allocate the memory you have, I don't think larger reported numbers for memory in use are an issue. |
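To see how much room /dev/shm really has on a node before calling argo_init, the filesystem can be queried with POSIX statvfs. The following is a minimal sketch and not part of ArgoDSM: the helper name available_bytes is hypothetical, and it assumes a POSIX system where /dev/shm is a mounted tmpfs.

```cpp
#include <sys/statvfs.h>
#include <cstdint>

// Hypothetical helper (not an ArgoDSM function): stores in `out` the number
// of bytes available to unprivileged users on the filesystem containing
// `path`. Returns false if statvfs fails (e.g. the path does not exist).
bool available_bytes(const char* path, std::uint64_t& out) {
    struct statvfs st {};
    if (statvfs(path, &st) != 0) {
        return false;
    }
    // f_bavail counts free blocks in units of f_frsize bytes.
    out = static_cast<std::uint64_t>(st.f_bavail) * st.f_frsize;
    return true;
}
```

Calling available_bytes("/dev/shm", n) on each node and comparing n with the intended argo_init size gives a quick sanity check before starting the MPI job.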
You are right. It's the /dev/shm size issue. The allocation occupies approximately 1.5x the requested memory in /dev/shm, so it's OK to request 60G on our system. It would be better if ArgoDSM reported a detailed error message. |
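The arithmetic behind this observation can be sketched as follows. The 1.5x factor is the reporter's approximate measurement on one system, not a documented ArgoDSM figure, and the names request_fits and max_request are hypothetical. With 95G free in /dev/shm, a 64G request would need about 96G and fail, while a 60G request needs about 90G and fits.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t GiB = 1ull << 30;

// Assumption taken from the thread: the /dev/shm footprint is roughly
// 1.5x (3/2) the size passed to argo_init. Integer arithmetic avoids
// floating-point rounding: requested * 3/2 <= available.
bool request_fits(std::uint64_t requested, std::uint64_t available) {
    assert(requested <= UINT64_MAX / 3);  // guard against overflow
    assert(available <= UINT64_MAX / 2);  // guard against overflow
    return requested * 3 <= available * 2;
}

// Largest request (in bytes) that still fits under the same 1.5x assumption.
std::uint64_t max_request(std::uint64_t available) {
    assert(available <= UINT64_MAX / 2);  // guard against overflow
    return available * 2 / 3;
}
```

Under this assumption, max_request(95 * GiB) comes out slightly above 63 GiB, which matches the report that 60G works and 64G does not.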
I agree. The code is supposed to print an error message, so I would be very interested in finding out why this does not happen. On my machines, this has not happened thus far, so I would require your assistance to find out more about it. |
Hi again, it is possible your issues disappear with the patches in #19. |
argo_init((size_t)64 * 1024 * 1024 * 1024) crashes but argo_init((size_t)32 * 1024 * 1024 * 1024) works fine. I guess it's easy to reproduce because argo_init is the first statement in the program. Tested using the master branch, with the MPI backend, using 2 nodes with 1 process each.