GASNet errors with AMD GPUs on El Capitan (early access) #16
Comments
I'm wondering if @PHHargrove or @bonachea can comment. I'm scratching my head on this one. @unpyoukz, just to confirm, the GASNet-2022.9.2 one is a clean build?
@elliottslaughter Yes, it is a clean build.
If I understand correctly, you need advice to resolve the issue with
FWIW, GASNet-EX version 2022.3.0 predates support for GPUs (AMD or otherwise) on the HPE Slingshot 11 network.
@PHHargrove I pulled up a log, and I believe the app was indeed running on 4 nodes on Tioga back in Oct 2022 with the following warning messages. I wonder whether something was different at that time (I am not able to reproduce this now).
Tioga underwent a major software update at the end of last week (I am not sure whether this is related to the present issue). In any event, I am trying the recommended procedures.
@unpyoukz Have you had an opportunity to try this again with recent versions? We've been running extensively on Frontier and while there are still issues, I think we understand all the failure modes at this point.
I am running in-house Regent/Pygion/Legion codes on Tioga (early access of El Capitan at LLNL) using AMD GPUs with various versions of GASNet, with the control_replication branch of Legion (28db23b4). I am getting the following errors.
Tracing of the single rank case is here: trace.dat.
This error shows up frequently when using a single rank and almost always with multiple ranks. This did not happen before, and I wonder whether it is due to recent changes in the machine configuration or to some other compatibility issue.
A similar application worked last year with an old commit of Legion (5a77dcbf) and GASNet-2022.3.0. The app works fine on various machines with NVIDIA GPUs and Intel or IBM CPUs.
Naturally the application does not even start, regardless of the number of ranks used. The executable (.exec) of the app seems to run on a single rank when executed interactively without srun.
I am using the following modules for these cases: