Problems with ucx-conduit+PSHM in CI #7
I'll need to dig into this a bit. We've been seeing some failures like this every once in a while, but haven't been able to reproduce them reliably. The Realm quiescence check is not time-based, but instead assumes that a gasnet collective is not measurably faster than the network latency of an AM request. If that's not true with pshm enabled, I may need to go back to rolling my own AM-based "collective" operation to flush things.
@streichler wrote:
Not sure exactly what this means. On average and at large scale, gasnet collectives should usually have higher overall latency than an AM request. However there are no implicit ordering constraints between in-flight communication operations; in particular they are NOT guaranteed to complete in the order they were initiated (this applies to all non-blocking GASNet operations). What this means in practice is that in rare/unusual cases an AM request may be delayed and have a higher end-to-end latency than the overall latency of a concurrent collective. Also worth noting: I say "overall latency" because in some cases of load imbalance where a straggler process arrives "last" at a collective, it could see that collective complete almost immediately (without any wire delay), an interval which could be much shorter than any off-node AM latency. Now if we're talking about small scale, collectives may have overall latencies comparable to an AM request, even on average. PSHM activates hierarchical collectives with very fast on-node coordination, so a mostly on-node collective could easily rival the performance of an off-node AM request or possibly even a single on-node AM. If this performance behavior breaks the quiescence algorithm, then I agree some adjustment may be needed.
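To make the straggler case concrete, here is a toy model, not GASNet's actual collective implementation. It assumes an idealized barrier that completes when the last process arrives, plus a fixed on-node coordination cost. The function name, the `coordination_cost` parameter, and the arrival times are all hypothetical and the time units are arbitrary.

```python
def observed_barrier_latency(arrival_times, coordination_cost=0.0):
    """Latency each process observes for an idealized barrier.

    Assumption: the barrier completes when the last process has arrived,
    plus a fixed coordination cost. Real collectives are more complex.
    """
    completion = max(arrival_times) + coordination_cost
    # Each process waits from its own arrival until global completion.
    return [completion - t for t in arrival_times]


# Two early processes and one straggler (hypothetical arrival times).
latencies = observed_barrier_latency([0.0, 0.1, 5.0], coordination_cost=0.01)
for rank, latency in enumerate(latencies):
    print(f"rank {rank}: observed barrier latency = {latency:.3f}")
```

In this model the straggler (rank 2) observes only the coordination cost, which can be far below any off-node AM latency, while the early arrivers observe roughly the full skew.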
It is not possible to write anything with AM that is guaranteed to reliably "flush things". Concurrently in-flight AM's are unordered, full stop. Point-to-point AMs of similar size between the same source and target will often arrive in the order they were issued, but this is not in any way guaranteed.
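As a rough illustration of why an AM-based "flush" cannot be relied on, the sketch below simulates messages whose delivery latencies are independent of each other. This is a toy model rather than real GASNet behavior: the uniform latency distribution, the function names, and all parameters are assumptions chosen only for the demonstration.

```python
import random


def simulate_flush(num_payloads, seed, max_latency=10.0):
    """Return the payload messages that arrive AFTER a later-sent flush.

    Assumption: each message gets an independent random latency, a toy
    stand-in for the lack of ordering between in-flight AMs.
    """
    rng = random.Random(seed)
    arrivals = []
    for i in range(num_payloads):
        send_time = float(i)  # payloads are sent one time unit apart
        arrival = send_time + rng.uniform(0.0, max_latency)
        arrivals.append((arrival, f"payload-{i}"))
    # The flush is sent strictly after every payload.
    flush_arrival = float(num_payloads) + rng.uniform(0.0, max_latency)
    return [name for (t, name) in arrivals if t > flush_arrival]


def count_unsafe_runs(trials=1000, num_payloads=4):
    """Count simulated runs in which the flush overtakes some payload."""
    return sum(1 for seed in range(trials) if simulate_flush(num_payloads, seed))


unsafe = count_unsafe_runs()
print(f"flush overtook at least one payload in {unsafe} of 1000 simulated runs")
```

Whenever the flush overtakes a payload, a receiver that treats the flush as "everything before this has been delivered" would declare quiescence too early.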
Background
This issue is forked from issue #6, where the GASNet-EX configure default of --enable-pshm was restored for most Realm build configurations in 7a073d3, thereby enabling GASNet's efficient shared-memory transport, which provides huge speedups for intranode comms when running multiple processes per node. Unfortunately, initial CI testing with ucx-conduit+PSHM led to some new failures, and as a result PSHM support was quickly re-disabled for the ucx-conduit configuration in f9d1a06. This issue exists to triage and hopefully solve the CI failures, so the PSHM enable can be restored in configs/config.ucx.release.
It's worth noting that ucx-conduit currently remains an "experimental" conduit (and is likely to remain that way in the near term), for reasons of both stability and performance. As of the current GASNet v2022.9.0 release there are very few use cases where ucx-conduit might be preferable to either ibv-conduit (on InfiniBand systems) or ofi-conduit (on Slingshot-10 systems). Those production-quality conduits are currently both more robust and more performant than ucx-conduit in basically all our testing. So IMHO Legion users should never be using ucx-conduit in production, meaning this issue to polish Legion's use of ucx-conduit is probably low-priority.
Initial requests:
GASNET_VERSION to the current GASNet release, so I'm guessing this is an accidental oversight in the CI scripting. There have been non-trivial improvements made to both ucx-conduit and PSHM internals since 2021.3.0, so can we please re-run against the current GASNet-EX 2022.9.0 release to avoid potentially wasting time triaging already-fixed defects?
--enable-debug aka GASNET_DEBUG mode to enable assertions and envvar GASNET_BACKTRACE=1 to get backtraces? This might help us narrow down what's happening (e.g. if Realm happens to be breaking any checkable preconditions on GASNet calls).
ucx-conduit/terra failure mode
The ucx+PSHM failure point on the two terra tests looks like this:
Based on the message, I'm assuming something in the Realm logic decided to "give up" on test program exit quiescence, presumably based on some heuristic (of which I have no knowledge). Could someone explain how that works? In particular, does it use real wallclock time (does 0.223742 indicate it gave up after a roughly 200 ms timeout?), or does it rely primarily on the latency/overheads of GASNet AM (which differ wildly between the UCX and shared-memory transports, meaning the heuristic might just need adjustment)?
Recommendation: Investigate the quiescence heuristic, and in particular the time basis for the abort condition
CC: @streichler @elliottslaughter @PHHargrove