Need for local completion and remote commit #468

Open
naveen-rn opened this issue May 24, 2021 · 10 comments
@naveen-rn
Contributor

naveen-rn commented May 24, 2021

Motivation

In general, implementing shmem_quiet-based memory ordering semantics is expensive. With the introduction of processors with weak memory models and support for multiple NICs per node, the cost of performing remote completion and committing any previously posted RMA and AMO events is becoming very high. This introduces the need to perform dummy read-like operations to commit any outstanding operations into the remote target's memory.

Solution

As part of this proposal, we would like to introduce explicit options to perform local completion in OpenSHMEM. To complete the API we also would like to introduce the option to explicitly perform the remote commit operation. We can implement the existing shmem_quiet semantics as a combination of the local completion and remote commit operation.

Proposed API

The following new routines are proposed:

# Additions to OpenSHMEM Memory Ordering Operations
void shmem_local_complete(void);
void shmem_ctx_local_complete(shmem_ctx_t ctx);
void shmem_remote_commit(void);
void shmem_ctx_remote_commit(shmem_ctx_t ctx);

# Additions to OpenSHMEM collective operations
void shmem_team_remote_commit(shmem_team_t team);

API Semantics

shmem_local_complete and shmem_ctx_local_complete

The shmem_local_complete routine ensures the local completion of all operations on symmetric data objects issued by the calling PE on a given context. Local completion means that all previously posted operations on symmetric data objects have completed from the calling PE's perspective; it does not guarantee that those operations are visible when shmem_local_complete returns. After local completion, the symmetric data objects used by all previously posted operations can be reused for other operations.

shmem_remote_commit and shmem_ctx_remote_commit

The shmem_remote_commit routine ensures the global visibility of all previously locally completed operations. Note that this routine ensures global visibility only for operations that are already locally complete. Local completion can be attained implicitly through OpenSHMEM routines (such as blocking put and AMO) or explicitly by calling the shmem_local_complete routines.

shmem_team_remote_commit

This is a collective variant of the shmem_remote_commit operation. This routine registers the arrival of a PE at a shmem_team_remote_commit operation and blocks the PE until all other PEs arrive at the same shmem_team_remote_commit operation. It also ensures that all locally completed operations on all PEs are made globally visible.
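
One way to read the semantics above is as a per-operation state machine. The following C sketch is a toy model only, not an OpenSHMEM implementation: the `op_state` enum and the `on_local_complete`, `on_remote_commit`, `on_quiet`, and `demo_single_op` helpers are hypothetical names introduced for illustration. It assumes that an operation moves from posted, to locally complete, to globally visible, and that remote commit affects only operations that are already locally complete.

```c
#include <assert.h>

/* Toy model: states one RMA/AMO operation passes through under the proposal. */
typedef enum {
    OP_POSTED,           /* issued, e.g. by an nbi routine; not yet locally complete */
    OP_LOCALLY_COMPLETE, /* local completion reached; remote visibility not guaranteed */
    OP_GLOBALLY_VISIBLE  /* update is visible in the target PE's memory */
} op_state;

/* Models shmem_local_complete: a posted operation becomes locally complete. */
op_state on_local_complete(op_state s) {
    return (s == OP_POSTED) ? OP_LOCALLY_COMPLETE : s;
}

/* Models shmem_remote_commit: only locally complete operations are committed. */
op_state on_remote_commit(op_state s) {
    return (s == OP_LOCALLY_COMPLETE) ? OP_GLOBALLY_VISIBLE : s;
}

/* Models shmem_quiet as local completion followed by remote commit. */
op_state on_quiet(op_state s) {
    return on_remote_commit(on_local_complete(s));
}

/* Walks a single nonblocking put through the model. */
void demo_single_op(void) {
    op_state s = OP_POSTED;

    /* Remote commit alone is a no-op for an operation that is only posted. */
    assert(on_remote_commit(s) == OP_POSTED);

    s = on_local_complete(s);
    assert(s == OP_LOCALLY_COMPLETE);

    s = on_remote_commit(s);
    assert(s == OP_GLOBALLY_VISIBLE);

    /* Quiet reaches the same final state in one call. */
    assert(on_quiet(OP_POSTED) == OP_GLOBALLY_VISIBLE);
}
```

In this model a blocking put would simply start in `OP_LOCALLY_COMPLETE`, which is why a following remote commit is sufficient for it.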

@naveen-rn naveen-rn assigned naveen-rn, manjugv and nspark and unassigned naveen-rn May 24, 2021
@nspark
Contributor

nspark commented May 24, 2021

static long target;
long base;
shmem_atomic_fetch_add_nbi(ctx, &base, &target, value, target_pe);

shmem_ctx_local_complete(ctx);
// The 'base' object has been updated on the calling PE.

shmem_ctx_remote_commit(ctx);
// The update to 'target' is now visible in memory on the target PE.

@naveen-rn
Contributor Author

naveen-rn commented May 24, 2021

Some examples to clarify the local complete and remote commit semantics:

1. shmem_put_nbi
2. shmem_remote_commit // remote commit is a no-op here - local completion of previous put is not provided

1. shmem_put_nbi
2. shmem_local_complete
3. shmem_remote_commit // remote commit guarantees global visibility of target buffer from step(1)

1. shmem_put
2. shmem_remote_commit // remote commit guarantees global visibility of target buffer from step(1)
                       // because implicit local completion is available as part of blocking put operation

1. shmem_put_nbi
2. shmem_local_complete
3. shmem_put
4. shmem_remote_commit  // target buffers from step(1) and (3) are made globally visible
                        // because implicit local completion for blocking put in step(3) and explicit local
                        // completion in step(2) for nbi put operation in step(1) are available

1. shmem_put_nbi
2. shmem_put
3. shmem_remote_commit  // target buffer only from step(2) is globally visible and not from step(1)
                        // implicit local complete semantics in blocking put does not guarantee local completion
                        // of other operations

1. shmem_get_nbi
2. shmem_local_complete // guarantees the availability of received value with return from local complete
                        // local completion of the get operation guarantees the actual completion of operation

// Nick's example
1. shmem_atomic_fetch_add_nbi
2. shmem_local_complete // fetched value is made available on returning from local complete
                        // but global visibility of target buffer from the AMO is not guaranteed
3. shmem_remote_commit  // global visibility of target buffer from the AMO is guaranteed

@manjugv
Collaborator

manjugv commented May 24, 2021

"2. shmem_local_complete // fetched value is made available on returning from local complete
                        // but global visibility of target buffer from the AMO is not guaranteed" 

FYI - From an implementation perspective, this requires remote completion, and it will have the latency of remote completion.

@naveen-rn
Contributor Author

FYI - From an implementation perspective, this requires remote completion, and it will have the latency of remote completion.

@manjugv Does that mean - every FAMO in your implementation provides global visibility guarantees? If so, aren't you providing more guarantees than what OSM-1.5 expects?

AFAIU, a local completion operation is not used to create delayed execution. That is for the shmem_session to handle. It just provides a way for delayed remote completion.

Meaning, you can try to implement all NBI and blocking operations by maintaining a local staging buffer. But you would definitely need to post all these operations from the local staging buffer into the NIC during local_complete and make sure they have reached a state in the NIC where they are safe from retransmission requests.

@nspark
Contributor

nspark commented Jun 1, 2021

I was thinking about this proposal today; in particular, how it seems to give rise to a set of "equivalences:"

  1. shmem_put ≡ shmem_put_nbi + shmem_local_complete
  2. shmem_quiet ≡ shmem_local_complete + shmem_remote_commit
  3. shmem_barrier_all
    ≡ shmem_quiet + shmem_sync_all
    ≡ shmem_local_complete + shmem_remote_commit + shmem_team_sync
    ≡ shmem_local_complete + shmem_team_remote_commit

On one hand, I think that thinking about how existing OpenSHMEM operations can be translated into equivalent forms could be helpful. On the other hand, I think the put_nbi + put + remote_commit example is a good counterexample that shows the limited "scope" of the put ≡ put_nbi + local_complete equivalence.
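
The limited scope of that equivalence can be illustrated with a small model. The sketch below is a toy simulation, not OpenSHMEM code: `toy_ctx`, the `toy_*` routines, and the two `first_put_visible_*` functions are hypothetical names, and the model assumes that a blocking put locally completes only itself, while an explicit local complete covers every operation posted so far.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_OPS 8

typedef enum { TOY_POSTED, TOY_LOCALLY_COMPLETE, TOY_GLOBALLY_VISIBLE } toy_state;

/* Toy context: the operations issued so far and their states. */
typedef struct {
    toy_state ops[TOY_MAX_OPS];
    size_t count;
} toy_ctx;

/* Models shmem_put_nbi: posts an operation and returns its index. */
size_t toy_put_nbi(toy_ctx *c) {
    assert(c->count < TOY_MAX_OPS);
    c->ops[c->count] = TOY_POSTED;
    return c->count++;
}

/* Models blocking shmem_put: only this operation is locally complete on return. */
size_t toy_put(toy_ctx *c) {
    size_t i = toy_put_nbi(c);
    c->ops[i] = TOY_LOCALLY_COMPLETE;
    return i;
}

/* Models shmem_local_complete: every posted operation becomes locally complete. */
void toy_local_complete(toy_ctx *c) {
    for (size_t i = 0; i < c->count; i++)
        if (c->ops[i] == TOY_POSTED)
            c->ops[i] = TOY_LOCALLY_COMPLETE;
}

/* Models shmem_remote_commit: commits only locally complete operations. */
void toy_remote_commit(toy_ctx *c) {
    for (size_t i = 0; i < c->count; i++)
        if (c->ops[i] == TOY_LOCALLY_COMPLETE)
            c->ops[i] = TOY_GLOBALLY_VISIBLE;
}

/* put_nbi; put; remote_commit. Returns 1 if the first put is visible. */
int first_put_visible_with_blocking_put(void) {
    toy_ctx c = { .count = 0 };
    size_t first = toy_put_nbi(&c);
    toy_put(&c);
    toy_remote_commit(&c);
    return c.ops[first] == TOY_GLOBALLY_VISIBLE;
}

/* Same sequence with the blocking put rewritten as put_nbi + local_complete. */
int first_put_visible_with_rewritten_put(void) {
    toy_ctx c = { .count = 0 };
    size_t first = toy_put_nbi(&c);
    toy_put_nbi(&c);
    toy_local_complete(&c);
    toy_remote_commit(&c);
    return c.ops[first] == TOY_GLOBALLY_VISIBLE;
}
```

Under these assumptions the two sequences disagree about the first put: the rewritten form commits it, the original does not. The substitution therefore holds for a single operation in isolation but not when other nonblocking operations are outstanding.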

@nspark
Contributor

nspark commented Jun 1, 2021

Separately, I'm a little nervous that we're adding complexity here that may be hard to reconcile with any eventual memory model. I think we had a reasonably clear mapping of AMOs and fence/quiet to the C++ memory model. I feel less confident about the mapping in terms of local_complete and remote_commit.

@manjugv
Collaborator

manjugv commented Jun 7, 2021

@nspark
Contributor

nspark commented Jun 7, 2021

On today's call, it seemed like:

  1. Not everyone loves the name shmem_local_complete, but most generally support the concept.
  2. Not everyone thinks that splitting shmem_quiet into the semantic pieces of shmem_local_complete + shmem_remote_commit provides a benefit.

While I understand @naveen-rn's rationale for all three new APIs, I wonder whether this issue—in particular, the need for an efficient successor to shmem_ctx_quiet + shmem_team_sync—is best handled by focusing primarily on the team-based synchronization aspect.

It seems to me (perhaps naively) that this issue could really be two mostly independent features: shmem_local_complete (or some renamed variant) and shmem_team_barrier. If anything, the originating motivation seems to be primarily for the latter.

@nspark
Contributor

nspark commented Jun 7, 2021

Separately, there was a lot of discussion about completion semantics and how they're implemented. As an application user, I feel like libfabric has reasonably understandable language regarding completion semantics. (See "Completion Event Semantics" under man fi_cq.) In my understanding,

  • FI_INJECT_COMPLETE ≈ what we call "local completion" for puts
  • FI_DELIVERY_COMPLETE ≈ what we call "remote completion" for puts + "local completion" for gets
  • FI_COMMIT_COMPLETE ≈ RDMA flush (which mashes persistence and global visibility together)
  • OpenSHMEM doesn't have anything analogous to libfabric's FI_TRANSMIT_COMPLETE and FI_MATCH_COMPLETE

Likely someone can correct me, but it doesn't seem like libfabric has anything quite analogous to shmem_quiet's "globally visible" requirement—unless that's FI_COMMIT_COMPLETE but for non-persistent memory.

@naveen-rn
Contributor Author

The status of this PR as of June 25, before the Spec Meeting:

  1. It is good to split the shmem_quiet semantics
  2. There was good acceptance of the shmem_local_complete semantics, though the name of the routine is still being discussed
  3. shmem_remote_commit does not seem that useful; there is no pressing use case
  4. shmem_team_remote_commit is option 1 for addressing the deprecated shmem_barrier routine
  5. shmem_team_barrier, with semantics similar to shmem_barrier but with the flush semantics available only on shared contexts and not on private contexts, is option 2 for addressing the deprecated shmem_barrier routine
