I was testing the `mp_potrf_potrs` example (with fixed SPD matrix generation code) on several configurations on Perlmutter. When I request 1 node (4 GPUs), running the example works fine. However, if I request 2 nodes and run the same thing, it hangs after the potrf step (i.e., potrf completes successfully, but the potrs hangs). I also tried running `srun -n 8 ...` (keeping `-p 4 -q 1`), but this seems to hang at the scatter from host to device. If I decrease `n` to 1000 (and the tile sizes to 250), the code runs successfully with `srun -n 4 ...`. I don't think it's an out-of-memory issue, though, since the same code with `n=10000` runs when I only request one node.

It's also interesting that the potrf completes but the potrs hangs; I'm not sure what could be causing that.

I'll keep trying different configurations; please let me know if you would like any log output.
Module list: