Program Hangs When Number of Processes Exceeds 308 #12993
Comments
Any idea about this issue? Thank you in advance!
Are you saying SLURM schedules hyperthreads instead of cores? What if you run the
No. I mean, when the number of processes is smaller than or equal to 308, the result is correct (some errors are printed to the screen, but they do not affect the final result). However, when it is larger than 308, it stalls without ANY output. The result is like this:
naonao@hpc01:~$ mpicc hello.c -o hello -O3
naonao@hpc01:~$ srun -n308 --mpi=pmix hello
[hpc04:614492] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[hpc04:614492] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[hpc04:614492] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[hpc04:614492] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[hpc04:614492] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[hpc08:613816] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[hpc08:613816] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[hpc08:613816] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[hpc08:613816] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[hpc08:613816] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[hpc05:585696] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[hpc05:585696] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[hpc05:585696] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[hpc05:585696] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[hpc05:585696] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[hpc07:594351] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[hpc07:594351] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[hpc07:594351] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[hpc07:594351] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[hpc07:594351] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[hpc06:578574] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[hpc06:578574] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[hpc06:578574] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[hpc06:578574] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[hpc06:578574] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[hpc01:992494] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[hpc01:992494] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[hpc01:992494] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[hpc01:992494] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[hpc01:992494] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[hpc02:580514] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[hpc02:580514] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[hpc02:580514] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[hpc02:580514] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[hpc02:580514] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
[hpc03:588183] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1958
[hpc03:588183] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../../src/mca/gds/shmem/gds_shmem.c at line 1966
[hpc03:588183] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 238
[hpc03:588183] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../../../src/mca/gds/base/gds_base_fns.c at line 255
[hpc03:588183] PMIX ERROR: PMIX_ERR_NOMEM in file ../../../src/server/pmix_server.c at line 3409
Hello, world, I am 288 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 290 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 296 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 298 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 300 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 302 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 280 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 305 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 282 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 289 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 291 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 293 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 295 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 281 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 283 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 285 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 287 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 297 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
Hello, world, I am 299 of 308, (Open MPI v4.1.6, package: Debian OpenMPI, ident: 4.1.6, repo rev: v4.1.6, Sep 30, 2023, 87)
# ... And so on
naonao@hpc01:~$ srun -n309 --mpi=pmix hello
# Stalls here
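For reference, a minimal hello.c of the kind that produces the output above (an assumption — the exact source is not shown in the thread; this mirrors the standard Open MPI hello-world example):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, len;
    char version[MPI_MAX_LIBRARY_VERSION_STRING];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_library_version(version, &len);
    printf("Hello, world, I am %d of %d, (%s, %d)\n", rank, size, version, len);
    MPI_Finalize();
    return 0;
}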
You are heavily oversubscribing your node, which might slow down your application tremendously. Let's make sure it really stalls and isn't just running extremely slowly. Can you add some output to your loop to see if there is any progress?
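For example, a minimal sketch of the kind of progress check being suggested (hypothetical code, not the reporter's actual program) prints and flushes from inside the work loop so a slow run can be told apart from a hang:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 100; i++) {
        /* ... the real work of each iteration would go here ... */
        if (rank == 0) {
            printf("progress: finished iteration %d\n", i);
            fflush(stdout);  /* flush so the message appears even if the job hangs later */
        }
        MPI_Barrier(MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}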
It doesn't seem to be oversubscribing the nodes: even when the number of processes is 308, the CPU utilization is not 100% on every node (about 70% on the last node). When the number of processes is 309, the CPU utilization on each node is only about 2%, which suggests the tasks are not actually running. In addition, -n308 finishes almost in a flash, while -n309 produces no output at all; I don't think one extra process can make such a difference. Also, when mixing Open MPI and OpenMP, the job can utilize all of the cores and threads.
Wherever I add the "print" statement, there is no output. I think it might be a process-launching issue.
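As a sanity check (an assumed command, not one run in the thread), Slurm's own view of each node's CPU count can be compared against the expected 8 * 40 = 320 hardware threads:

naonao@hpc01:~$ scontrol show node hpc01 | grep -E 'CPUTot|CPUAlloc|ThreadsPerCore'
# CPUTot=40 and ThreadsPerCore=2 on each of the 8 nodes would confirm that
# all 320 hardware threads are visible to Slurm.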
Have you tried launching it with mpirun? Also, your output shows that Slurm is using PMIx v5.x. I'm not sure if this will work, but you might try putting
If I use mpirun instead of srun, there is nothing wrong. Strange, though.
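(An assumed form of that workaround, launching with mpirun inside a Slurm allocation rather than through srun; the exact invocation is not shown in the thread:)

naonao@hpc01:~$ salloc -N 8 -n 309
naonao@hpc01:~$ mpirun -np 309 ./hello
# mpirun uses Open MPI's own launcher and internal PMIx server inside the
# allocation instead of Slurm's PMIx plugin, which is why this path may behave differently.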
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
v4.1.6
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From an operating system distribution package: Ubuntu APT
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
N/A
Please describe the system on which you are running
uname -a
output: Linux hpc01 6.8.0-51-generic #52-Ubuntu SMP PREEMPT_DYNAMIC Thu Dec 5 13:09:44 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Details of the problem
I am using Open MPI and slurm-wlm 23.11.4 with PMIx to perform distributed parallel computing. All of the software and dependencies are installed from the official Ubuntu APT source and have been upgraded to the newest version. My cluster has 8 nodes, and each node has 2 CPUs (10 cores / 20 threads each), so theoretically it should be able to launch at most 20 * 2 * 8 = 320 processes simultaneously. When running non-MPI programs like "hostname", there is nothing wrong. However, when running MPI programs with NUM_PROCS larger than 308, the program hangs without any output, and the cluster monitor shows low CPU and memory utilization. If NUM_PROCS is less than or equal to 308, the program behaves correctly.
Tested programs: HPL, HPCG, and various simple testing programs like array addition and summation.
The command is as follows:
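(Presumably along the lines of the srun invocation shown earlier in the thread, e.g.:)

naonao@hpc01:~$ mpicc hello.c -o hello -O3
naonao@hpc01:~$ srun -n309 --mpi=pmix hello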