Per PE Fence #314
Comments
PR describing the interface to address the issue |
I think this is an interesting feature, but users need some additional information about the implementation to understand the cases in which it will actually be cheaper than shmem_fence. Consider a message-passing example in which a PE is sending, in order, data and a flag:

void message_exchange(void *dst, const void *src, size_t nbytes,
                      uint64_t *flags, int *targets, int num_targets) {
  // fence_pe < fence: each flag only needs to be ordered against the data
  // sent to that particular target
  for (int i = 0; i < num_targets; i++)
    shmem_putmem(dst, src, nbytes, targets[i]);
  for (int i = 0; i < num_targets; i++) {
    shmem_pe_fence(targets[i]);
    shmem_atomic_set(flags, 1, targets[i]);
  }

  // fence_pe == fence: a single global fence orders the data to all targets
  for (int i = 0; i < num_targets; i++)
    shmem_putmem(dst, src, nbytes, targets[i]);
  shmem_fence();
  for (int i = 0; i < num_targets; i++)
    shmem_atomic_set(flags, 1, targets[i]);
} |
@nspark Thanks for the nice example. I agree, the performance query will be helpful. |
for (int i = 0; i < num_targets; i++) {
  shmem_pe_fence(targets[i]);
  shmem_atomic_set(flags, 1, targets[i]);
}

If the implementation tracks whether any unordered operations are outstanding to targets[i], the shmem_pe_fence call here could be made a no-op when there are none. |
@naveen-rn Agree that this optimization can be used. IIUC, it would require a flag to be set in RMA/AMO operations to track whether the PE is still in the fenced/quieted state. In a multithreaded PE, that flag would require atomics. So, there is a tradeoff between the performance of per-PE fence/quiet and the RMA/AMO rate for small messages. For the NVSHMEM or NUMA case I mentioned, it may not be so straightforward: it will be platform dependent whether O(N) finer-grain (e.g., GPU- or socket-local) fence/quiet operations will be cheaper than a single O(1) system-level operation. |
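To make the tradeoff above concrete, here is a minimal sketch of the tracking idea, assuming a hypothetical implementation-internal "pending" flag per target PE (all names below are illustrative, not from any existing implementation): every RMA/AMO marks its target as unordered, and the per-PE fence returns immediately when nothing is pending. The atomic flag is exactly the per-operation overhead mentioned for multithreaded PEs.

#include <stdatomic.h>
#include <stdbool.h>

#define MAX_PES 4096   /* illustrative upper bound on the number of PEs */

/* Hypothetical implementation-internal state: one flag per target PE.
 * Atomics are needed because a multithreaded PE may issue operations and
 * fences concurrently -- this is the small-message-rate cost noted above. */
static atomic_bool pending[MAX_PES];

/* Stub for the transport-level per-target fence (platform specific). */
static void issue_network_fence(int pe) { (void)pe; }

/* Called on the issuing path of every RMA/AMO targeting 'pe'. */
static inline void track_unordered_op(int pe) {
  atomic_store_explicit(&pending[pe], true, memory_order_release);
}

/* Per-PE fence: skip the expensive transport fence when no unordered
 * operations have been issued to 'pe' since the last fence/quiet. */
void example_pe_fence(int pe) {
  if (atomic_exchange_explicit(&pending[pe], false, memory_order_acq_rel))
    issue_network_fence(pe);
  /* else: already ordered with respect to 'pe'; nothing to do */
}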
@nspark - Users would have to be really careful about how they use it. If you use it without understanding the system topology, performance will be hurt. In the case of shared memory, we would have to issue a "global" memory barrier regardless of which routine is used. If you have to do more than one per-PE fence/quiet, you are better off using the regular fence/quiet. |
(I'm late to this.) The example above can pretty much be replaced by put-with-signal. It's more interesting if there are multiple puts being fenced. What about non-blocking fences? Has that been part of this discussion?
Actually, if it's non-blocking, does it matter as much whether it's shmem_pe_fence or shmem_fence? DMAPP did a "bundled" put where you could put multiple times with a flag on the last put, guaranteed to be delivered after the "bundle".
|
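For reference, a sketch of the put-with-signal formulation mentioned above, assuming the shmem_putmem_signal routine from OpenSHMEM 1.5; the function and parameter names mirror the earlier message_exchange example and are only illustrative.

#include <shmem.h>
#include <stdint.h>

/* Each data transfer and its flag update are combined into a single
 * put-with-signal, so no explicit fence call is needed at all. */
void message_exchange_signal(void *dst, const void *src, size_t nbytes,
                             uint64_t *flags, int *targets, int num_targets) {
  for (int i = 0; i < num_targets; i++)
    shmem_putmem_signal(dst, src, nbytes, flags, 1, SHMEM_SIGNAL_SET,
                        targets[i]);
}

As noted above, this covers the single-put case; the per-PE fence becomes more interesting when several separate puts must be ordered before the flag update.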
@bcernohous - Do the puts have to target the same PE, or can they go to different PEs?
|
It was pretty limited. The same PE. And since it was internally an FMA chain, the chain would be broken by non-bundled puts. I'm not sure it was ever used, but it was a requested experimental feature. I'm not proposing it for SHMEM, just reminded of it by the per-PE fence/flag example. There's an implicit PE fence on the last bundled put, similar to put-with-signal. I'm really looking for clarification on the use case(s) driving this feature. The example was basically put-with-signal. |
No, but I also don't think there's anything about the per-PE fence proposal that would preclude a non-blocking variant. |
That wasn't the point of the example, which acknowledged that the single-put pattern can be expressed with put-with-signal. |
This feature is an integral part of the Verbs and IBTA spec (post list). It is actually the default API for sending messages. It definitely has performance benefits and provides some opportunities to reduce the number of memory barriers. It also comes with some overheads. |
Something like a post-list of requests is much more powerful. The user gives the library a clear "plan" of the next steps, and the library can optimize them. It is difficult to optimize for a fence. |
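As a concrete illustration of the post-list idea (a sketch only; queue-pair and memory-registration setup are assumed to exist elsewhere, and a single rkey is assumed to cover both remote locations), two RDMA write work requests, data then flag, are chained through the next pointer and handed to the provider in one ibv_post_send call:

#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Post a data write followed by a flag write as one linked list of work
 * requests, so the provider sees the whole "plan" at once. */
int post_data_then_flag(struct ibv_qp *qp,
                        struct ibv_sge *data_sge, uint64_t data_raddr,
                        struct ibv_sge *flag_sge, uint64_t flag_raddr,
                        uint32_t rkey) {
  struct ibv_send_wr data_wr, flag_wr, *bad_wr = NULL;
  memset(&data_wr, 0, sizeof data_wr);
  memset(&flag_wr, 0, sizeof flag_wr);

  data_wr.opcode              = IBV_WR_RDMA_WRITE;
  data_wr.sg_list             = data_sge;
  data_wr.num_sge             = 1;
  data_wr.wr.rdma.remote_addr = data_raddr;
  data_wr.wr.rdma.rkey        = rkey;
  data_wr.next                = &flag_wr;          /* chain the flag write */

  flag_wr.opcode              = IBV_WR_RDMA_WRITE;
  flag_wr.sg_list             = flag_sge;
  flag_wr.num_sge             = 1;
  flag_wr.wr.rdma.remote_addr = flag_raddr;
  flag_wr.wr.rdma.rkey        = rkey;
  flag_wr.send_flags          = IBV_SEND_SIGNALED;  /* completion for the chain */
  flag_wr.next                = NULL;

  return ibv_post_send(qp, &data_wr, &bad_wr);
}

On a reliably connected QP, the two writes are executed at the target in order, so the flag write plays the role of the fenced flag update in the earlier example.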
@nspark - the cost might be runtime dependent, since each allocation can give you a different layout of nodes and, as a result, the cost will be highly inconsistent. The cost of the operation will depend on the transports that are used and on locality. |
I agree it's difficult to optimize, which is why I'd like the use case spelled out. I question the value of tracking PE completions (cost) versus per-PE fence (benefits) without hardware support. |
@bcernohous the main use case driving this interface is very close to what Nick has posted. The DMAPP interface you mention could be used for it; in fact, as was pointed out, it is supported in Verbs as well. I chose the per-PE fence interface because it covers the use case that Nick posted, and it also extends the semantics to AMOs and memory operations. Besides that, for network paths where packets do not arrive at the destination in order, this provides a potentially low-cost option compared to fencing all PEs. Also, for such network paths, it can be leveraged in multi-NIC scenarios, i.e., you need to order operations to one PE (1 * M HCAs) rather than to all PEs (NPEs * M HCAs). This is advantageous for OpenSHMEM over p2p/network transports, but not necessarily for OpenSHMEM over shared memory (the granularity of the memory barrier is much coarser on shared memory systems). As folks have pointed out, it can have variable performance and, with the wrong usage, can hurt the performance of the overall application. In the working group, we discussed the idea of using a performance query. I'm not sure how to design and standardize that query yet. |
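Purely as a strawman for the performance-query idea (the query routine below is hypothetical and not part of any OpenSHMEM specification; shmem_pe_fence is the interface proposed in this issue), a program might select between the two fence flavors at runtime:

#include <shmem.h>

/* HYPOTHETICAL query, for discussion only: a real implementation would
 * report whether per-PE ordering is cheaper than global ordering on the
 * underlying transport. Stubbed here so the sketch is self-contained. */
static int shmemx_pe_fence_is_cheap(void) { return 1; }

void order_data_before_flags(int *targets, int num_targets) {
  if (shmemx_pe_fence_is_cheap()) {
    for (int i = 0; i < num_targets; i++)
      shmem_pe_fence(targets[i]);   /* proposed per-PE ordering routine */
  } else {
    shmem_fence();                  /* single global fence is no worse here */
  }
}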
Problem:
To achieve ordering of RMA operations between two PEs, OpenSHMEM programs have to use shmem_fence. The shmem_fence routine, however, orders all RMA operations from the calling PE to all other PEs. Though this achieves the intended result, it is an expensive operation on unordered networks.
Proposal:
Introduce PE-specific ordering routines that order RMA operations from the calling PE to a target PE:
shmem_pe_fence(target_pe)
shmem_ctx_pe_fence(ctx, target_pe)
Impact on Users:
This provides a lightweight alternative to shmem_fence when an OpenSHMEM program needs an ordering guarantee only from the calling PE to a specific PE. There is no change in the semantics of the current shmem_fence interface.
Impact on Implementations:
Implementations will have to provide the new interface shmem_pe_fence(target_pe) and its context variant shmem_ctx_pe_fence(ctx, target_pe).
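To make the implementation impact concrete, here is a minimal, correct-but-unoptimized sketch (assuming the proposed names above): an implementation can initially satisfy the per-PE semantics by falling back to the existing global fence, and then specialize on transports that support cheaper per-target ordering.

#include <shmem.h>

/* Trivially correct fallback: ordering operations to all PEs also orders
 * operations to target_pe. Transports with per-destination ordering can
 * replace these bodies with a cheaper per-target fence. */
void shmem_pe_fence(int target_pe) {
  (void)target_pe;
  shmem_fence();
}

void shmem_ctx_pe_fence(shmem_ctx_t ctx, int target_pe) {
  (void)target_pe;
  shmem_ctx_fence(ctx);
}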