cuSOLVERMp hangs on mppotrs when running on subset of nodes #234

Open
s769 opened this issue Nov 15, 2024 · 0 comments
s769 commented Nov 15, 2024

I was testing the mp_potrf_potrs example (with fixed SPD matrix generation code) on several configurations on Perlmutter. When I request 1 node (4 GPUs), running

srun -u -n 4 --gpus-per-node 4 ./mp_potrf_potrs -p 4 -q 1 -ia 1 -ja 1 -ib 1 -jb 1 -mbA 2500 -nbA 2500 -mbB 2500 -nbB 2500 -n 10000

works fine. However, if I request 2 nodes and run the same command, it hangs after the potrf step: potrf completes successfully, but potrs hangs. I also tried running srun -n 8 ... (keeping -p 4 -q 1), but this seems to hang at the scatter from host to device. If I decrease n to 1000 (and the tile sizes to 250), the code runs successfully with srun -n 4 .... I don't think it's an out-of-memory issue, though, since the same code with n=10000 runs when I request only one node.
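As a rough check of the memory argument and of the launch geometry, the sketch below computes the local tile of A that each rank would own under a standard 2D block-cyclic distribution, using the ScaLAPACK NUMROC formula. It assumes 8-byte double-precision elements, a source process of 0 in both grid dimensions, and that the sample expects exactly p*q ranks; none of these are stated in the report, and solver workspace is not counted. The helper names (numroc, local_tile, grid_matches) are illustrative only and are not part of cuSOLVERMp.

```python
def numroc(n, nb, iproc, isrcproc, nprocs):
    """Rows (or columns) owned by process iproc in a block-cyclic layout.

    Python port of the ScaLAPACK NUMROC formula.
    """
    mydist = (nprocs + iproc - isrcproc) % nprocs
    nblocks = n // nb                      # number of full blocks
    count = (nblocks // nprocs) * nb       # full rounds over the process grid
    extra = nblocks % nprocs               # leftover full blocks
    if mydist < extra:
        count += nb                        # this process gets one more full block
    elif mydist == extra:
        count += n % nb                    # this process gets the partial block
    return count


def local_tile(n, mb, nb, p, q, myrow, mycol, itemsize=8):
    """Local (rows, cols, bytes) of an n x n matrix on grid position (myrow, mycol).

    itemsize=8 assumes double precision (an assumption, not from the report).
    """
    rows = numroc(n, mb, myrow, 0, p)
    cols = numroc(n, nb, mycol, 0, q)
    return rows, cols, rows * cols * itemsize


def grid_matches(nranks, p, q):
    """True when the number of launched ranks equals the p x q grid size."""
    return nranks == p * q


if __name__ == "__main__":
    # Configuration from the report: -p 4 -q 1 -mbA 2500 -nbA 2500 -n 10000
    for myrow in range(4):
        rows, cols, nbytes = local_tile(10000, 2500, 2500, 4, 1, myrow, 0)
        print(f"grid row {myrow}: {rows} x {cols}, {nbytes / 1e6:.0f} MB")
    print("srun -n 4 matches grid:", grid_matches(4, 4, 1))
    print("srun -n 8 matches grid:", grid_matches(8, 4, 1))
```

Under these assumptions each rank holds a 2500 x 10000 slice of A, about 200 MB, which is consistent with the view that this is not an out-of-memory problem. The check also shows that srun -n 8 launches twice as many ranks as the 4 x 1 grid describes, which may be related to the different hang in that configuration.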

It's also interesting that the potrf completes but the potrs hangs; not sure what could be causing that.

I'll keep trying different configurations; please let me know if you would like any log output.
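One piece of log output that might help narrow this down is how the ranks land on nodes and GPUs in the 1-node and 2-node allocations. The sketch below is a small diagnostic, not part of the sample: it prints the standard Slurm variables SLURM_PROCID, SLURM_LOCALID, and SLURM_NODEID together with CUDA_VISIBLE_DEVICES for each rank. The script name placement.py and the function placement_line are hypothetical.

```python
import os
import socket


def placement_line(env, hostname):
    """Format one line describing where this rank was placed.

    env is a mapping such as os.environ; missing variables are shown
    explicitly rather than guessed.
    """
    rank = env.get("SLURM_PROCID", "?")        # global rank within the job step
    local = env.get("SLURM_LOCALID", "?")      # rank index on this node
    node = env.get("SLURM_NODEID", "?")        # index of this node in the allocation
    gpus = env.get("CUDA_VISIBLE_DEVICES", "(unset)")
    return (f"rank={rank} local_rank={local} node_id={node} "
            f"host={hostname} CUDA_VISIBLE_DEVICES={gpus}")


if __name__ == "__main__":
    print(placement_line(os.environ, socket.gethostname()), flush=True)
```

Launched with the same srun options as the failing case (for example srun -u -n 4 --gpus-per-node 4 python3 placement.py), this shows whether the four ranks share one node or are split across the two nodes, which would be useful context alongside any cuSOLVERMp logs.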

Module list:

1) craype-x86-milan
2) libfabric/1.15.2.0
3) craype-network-ofi
4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta
5) perftools-base/23.12.0
6) cpe/23.12
7) craype-accel-nvidia80
8) gpu/1.0
9) craype/2.7.30 (c)
10) cray-dsmml/0.2.2
11) cray-libsci/23.12.5 (math)
12) PrgEnv-nvidia/8.5.0 (cpe)
13) cray-hdf5-parallel/1.12.2.3 (io)
14) conda/Miniconda3-py311_23.11.0-2
15) evp-patch
16) python/3.11 (dev)
17) cudatoolkit/12.4 (g)
18) nvidia/24.5 (g,c)
19) cray-mpich/8.1.28 (mpi)