cuSOLVERMp hangs on mppotrs when running on subset of nodes #234

s769 · 2024-11-15T03:45:41Z

I was testing the mp_potrf_potrs example (with fixed SPD matrix generation code) on several configurations on Perlmutter. When I request 1 node (4 GPUs), running

srun -u -n 4 --gpus-per-node 4 ./mp_potrf_potrs -p 4 -q 1 -ia 1 -ja 1 -ib 1 -jb 1 -mbA 2500 -nbA 2500 -mbB 2500 -nbB 2500 -n 10000

works fine. However, if I request 2 nodes and run the same command, it hangs after the potrf step (potrf completes successfully, but potrs hangs). I also tried running srun -n 8 ... (keeping -p 4 -q 1), but that appears to hang at the scatter from host to device. If I decrease n to 1000 (and the tile sizes to 250), the code runs successfully with srun -n 4 .... I don't think it's an out-of-memory issue, though, since the same code with n=10000 runs when I request only one node.
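For reference, here is a rough sketch of the per-rank memory arithmetic behind the "not out-of-memory" reasoning. It follows the standard ScaLAPACK NUMROC rule for 2D block-cyclic layouts. It assumes double-precision elements (8 bytes) and a zero source process, and it counts only the local panel of A, not B or any workspace that potrf/potrs allocate. The helper names `numroc` and `local_bytes` are illustrative and are not taken from the sample.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows (or columns) of an n-long block-cyclic dimension owned by
    process iproc, with block size nb over nprocs processes.
    Mirrors the logic of ScaLAPACK's NUMROC."""
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of full blocks
    count = (nblocks // nprocs) * nb       # full rounds every process gets
    extra = nblocks % nprocs               # leftover full blocks
    if mydist < extra:
        count += nb                        # one extra full block
    elif mydist == extra:
        count += n % nb                    # the trailing partial block
    return count


def local_bytes(n, mb, nb, p, q, prow, pcol, elem_size=8):
    """Bytes held by process (prow, pcol) for an n x n matrix.
    elem_size=8 assumes double precision (an assumption here)."""
    rows = numroc(n, mb, prow, 0, p)
    cols = numroc(n, nb, pcol, 0, q)
    return rows * cols * elem_size


# Configuration from the command above: -p 4 -q 1, 2500 x 2500 tiles, n = 10000
for prow in range(4):
    size_mb = local_bytes(10000, 2500, 2500, 4, 1, prow, 0) / 1e6
    print(f"rank row {prow}: {size_mb:.0f} MB for the local part of A")
```

Under these assumptions each rank holds a 2500 x 10000 panel of A, so the distributed matrix itself is small relative to GPU memory on either node count. That is consistent with the single-node run succeeding, although it says nothing about workspace or communication buffers.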

It's also interesting that the potrf completes but the potrs hangs; not sure what could be causing that.

I'll keep trying different configurations; please let me know if you would like any log output.

Module list:

1) craype-x86-milan
2) libfabric/1.15.2.0
3) craype-network-ofi
4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta
5) perftools-base/23.12.0
6) cpe/23.12
7) craype-accel-nvidia80
8) gpu/1.0
9) craype/2.7.30 (c)
10) cray-dsmml/0.2.2
11) cray-libsci/23.12.5 (math)
12) PrgEnv-nvidia/8.5.0 (cpe)
13) cray-hdf5-parallel/1.12.2.3 (io)
14) conda/Miniconda3-py311_23.11.0-2
15) evp-patch
16) python/3.11 (dev)
17) cudatoolkit/12.4 (g)
18) nvidia/24.5 (g,c)
19) cray-mpich/8.1.28 (mpi)

JanuszL added the cuSolverMp label Nov 15, 2024
