GASNet errors with AMD GPUs on El Capitan (early access) #16

Open
unpyoukz opened this issue Feb 9, 2023 · 7 comments

Comments

@unpyoukz

unpyoukz commented Feb 9, 2023

I am running in-house Regent/Pygion/Legion codes on Tioga (the El Capitan early-access system at LLNL) using AMD GPUs, with various versions of GASNet and the control_replication branch of Legion (28db23b4). I am getting the following errors.

  1. With GASNet-2022.3.0:
*** FATAL ERROR (proc 0): in gasnetc_segment_register() at anguage/gasnet/GASNet-2022.3.0/ofi-conduit/gasnet_ofi.c:1246: fi_mr_reg for rdma failed: -22(Invalid argument)

Tracing of the single rank case is here: trace.dat.

This error shows up frequently when using a single rank and almost always with multiple ranks. It did not happen before, and I wonder whether it is due to recent changes in the machine configuration or some other compatibility issue.

A similar application worked last year with an older commit of Legion (5a77dcbf) and GASNet-2022.3.0. The app works fine on various machines with NVIDIA GPUs and Intel or IBM CPUs.
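
For anyone triaging a similar log, here is a minimal Python sketch that splits the fatal-error line above into its parts and maps the numeric code to an errno name. It assumes the line layout shown in this report (other GASNet messages may differ) and that the negative return value follows the usual libfabric convention of a negated errno value. `parse_fatal` and `FATAL_RE` are hypothetical names, not part of GASNet or Legion.

```python
import errno
import os
import re

# Pattern for the "*** FATAL ERROR" layout seen in this report only;
# treating it as a general GASNet log format is an assumption.
FATAL_RE = re.compile(
    r"\*\*\* FATAL ERROR \(proc (?P<proc>\d+)\): in (?P<func>\w+)\(\) "
    r"at (?P<file>\S+):(?P<line>\d+): (?P<msg>.*?): "
    r"(?P<code>-?\d+)\((?P<text>.*)\)\s*$"
)


def parse_fatal(line):
    """Return the fields of a fatal-error line as a dict, or None if no match."""
    m = FATAL_RE.search(line)
    if m is None:
        return None
    code = int(m.group("code"))
    return {
        "proc": int(m.group("proc")),
        "function": m.group("func"),
        "file": m.group("file"),
        "line": int(m.group("line")),
        "message": m.group("msg"),
        "code": code,
        # Assumption: the code is a negated errno value, so -22 maps to EINVAL.
        "errno_name": errno.errorcode.get(abs(code), "unknown"),
        "errno_text": os.strerror(abs(code)),
    }


if __name__ == "__main__":
    sample = (
        "*** FATAL ERROR (proc 0): in gasnetc_segment_register() at "
        "anguage/gasnet/GASNet-2022.3.0/ofi-conduit/gasnet_ofi.c:1246: "
        "fi_mr_reg for rdma failed: -22(Invalid argument)"
    )
    info = parse_fatal(sample)
    print(info["function"], info["message"], info["errno_name"])
```

Run over a multi-rank log, this makes it easy to count which ranks hit the `fi_mr_reg` failure and whether they all report the same code.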

  2. With GASNet-2022.9.2:
test.exec: symbol lookup error: [Legion]/bindings/regent/libregent.so: undefined symbol: PMI_Allgather

Naturally, the application does not even start, regardless of the number of ranks used. The executable (.exec) of the app seems to run on a single rank when executed interactively without srun.
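
One way to confirm which library leaves `PMI_Allgather` unresolved is to list the undefined dynamic symbols with `nm -D --undefined-only`. The Python sketch below wraps that call. It assumes GNU binutils `nm` is on the PATH, and the helper names (`undefined_symbols`, `library_needs`) are hypothetical.

```python
import subprocess


def undefined_symbols(nm_output):
    """Return the names marked 'U' (undefined) in `nm -D` output."""
    names = set()
    for raw in nm_output.splitlines():
        parts = raw.split()
        # Undefined entries look like "                 U PMI_Allgather".
        if len(parts) >= 2 and parts[-2] == "U":
            # Drop a symbol-version suffix such as "@GLIBC_2.14" if present.
            names.add(parts[-1].split("@")[0])
    return names


def library_needs(path, symbol):
    """True if the shared object at `path` has `symbol` as an undefined reference."""
    out = subprocess.run(
        ["nm", "-D", "--undefined-only", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return symbol in undefined_symbols(out)
```

For example, `library_needs("libregent.so", "PMI_Allgather")` (with the real path to the library substituted) would report whether `libregent.so` itself carries the unresolved reference.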

I am using the following modules for these cases:

rocm/4.5.0   gcc/11.2.0   cray-pmi/6.1.3   cray-mpich/8.1.2

@elliottslaughter
Contributor

I'm wondering if @PHHargrove or @bonachea can comment. I'm scratching my head on this one.

@unpyoukz just to confirm, the GASNet-2022.9.2 one is a clean build?

@unpyoukz
Author

unpyoukz commented Feb 9, 2023

@elliottslaughter Yes, it is a clean build.

@PHHargrove
Contributor

If I understand correctly, you need advice to resolve the issue with PMI_Allgather being undefined. If that is the case, I recommend adding --enable-pmi-rpath to the GASNet configure command line. That will make a difference if Legion is using the link options selected by GASNet's configure script. If Legion is not using those link options, then one can try injecting -Wl,-rpath=/opt/cray/pe/pmi/default/lib into the link command that generates libregent.so, by whatever means are appropriate.
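
After rebuilding with either option, one way to check whether the rpath actually landed in `libregent.so` is to inspect its dynamic section with `readelf -d`. The Python sketch below does that. It assumes GNU binutils `readelf` is available, and the helper names (`runpath_dirs`, `has_rpath`) are hypothetical.

```python
import re
import subprocess

# Matches both the older RPATH and the newer RUNPATH dynamic entries, e.g.
#  0x000000000000001d (RUNPATH)  Library runpath: [/some/dir:/other/dir]
_RPATH_RE = re.compile(
    r"\((?:RPATH|RUNPATH)\)\s+Library (?:rpath|runpath): \[(.*?)\]"
)


def runpath_dirs(readelf_output):
    """Return the directories listed in RPATH/RUNPATH entries of `readelf -d` output."""
    dirs = []
    for m in _RPATH_RE.finditer(readelf_output):
        dirs.extend(d for d in m.group(1).split(":") if d)
    return dirs


def has_rpath(path, wanted):
    """True if the ELF file at `path` lists the directory `wanted` in its rpath."""
    out = subprocess.run(
        ["readelf", "-d", path],
        check=True, capture_output=True, text=True,
    ).stdout
    return wanted in runpath_dirs(out)
```

For example, `has_rpath("libregent.so", "/opt/cray/pe/pmi/default/lib")` (with the real library path substituted) should return True once the link flag has taken effect.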

@PHHargrove
Contributor

FWIW, GASNet-EX version 2022.3.0 predates support for GPUs (AMD or otherwise) on the HPE Slingshot 11 network.
So, I would be surprised if multi-node runs were ever successful with that version of GASNet-EX and the network of El Capitan.

@unpyoukz
Author

@PHHargrove I pulled up a log, and I believe the app was running on 4 nodes on Tioga back in Oct 2022, with the following warning messages. I wonder if something was different at that time (I am not able to reproduce this now).

WARNING: Using GASNet's ofi-conduit, which exists for portability convenience.
WARNING: This system appears to contain recognized network hardware: Cray Gemini (XE and XK) or Aries (XC)
WARNING: which is supported by a GASNet native conduit, although
WARNING: it was not detected at configure time (missing drivers?)
WARNING: You should *really* use the high-performance native GASNet conduit
WARNING: if communication performance is at all important in this program run.
WARNING: ofi-conduit is experimental and should not be used for
performance measurements.
Please see `ofi-conduit/README` for more details.

Tioga underwent a major software update at the end of last week (I am not sure whether this is related to the present issue). In any event, I am trying the recommended procedures.

@PHHargrove
Contributor

@unpyoukz Looking at this page, I am confident that ofi-conduit and the cxi provider are the right choice for Tioga as it is described today.

@elliottslaughter
Contributor

@unpyoukz Have you had an opportunity to try this again with recent versions? We've been running extensively on Frontier and while there are still issues, I think we understand all the failure modes at this point.
