Supporting multiple SHMEM initialize/finalize calls #263
Comments
(quotes: emphasis mine)
I think we were careful in specifying that the success or failure of Is it (expected to be) permitted that |
@nspark Thanks for the comments. I somehow misunderstood the condition of a successful For the change of thread level at subsequent calls to |
@minsii Clarifying my question from Threads WG today. Per 3 in the outlined spec changes, only one instance of the library will be active. This issue does not capture the use case where two different OpenSHMEM libraries are used concurrently. What we agreed was that this is already a use supported by the spec. Also, as far as how to support |
AFAIU, the specification doesn't speak about any interoperability between different SHMEM implementations. If we simplify the use case above, the question becomes whether it is possible to use RMA/AMO from one implementation and just collectives from another. The spec doesn't allow this, or rather, the spec doesn't address this usage model. |
@anshumang Thanks for the clarification. If I understand your use case correctly, the SHMEM init/finalize calls are still performed only on the host process, and multiple devices (e.g., GPUs) may share a single SHMEM portion initialized by the host process. Is that correct? So the program might look like: CPU side:
GPU side:
I think both of us are OK that the second init call in I think the word "portion" is not well defined in this proposal... If we consider it an isolated communication environment (e.g., each portion always uses different network resources, and shmem synchronization in one portion does not interfere with the other), then the above program might contain two "portions" (i.e., the CPU initializes two sets of network resources, one for the CPU and the other for the GPU, As @naveen-rn said, the current spec does not yet define the above model. I am not sure how to accurately describe this kind of communication environment. Let me think about how we can work around it in the proposal. |
@minsii Thanks for expanding on this. This is very useful. My comments below.
Does not have to be multiple GPUs. It could be a single GPU per host that calls into a SHMEM library but that library is initialized by the CPU. Alongside, the CPU thread could also use another CPU-side SHMEM library. So I will modify the CPU side code example from
to
and
Correct |
@anshumang What confuses me is the notion that GPU-side OpenSHMEM calls are something different from regular SHMEM calls. Aren't kernel-initiated operations just another feature of a SHMEM library? Let's assume OpenSHMEM implementation A supports both CPU-side and GPU-side operations, while implementation B supports only CPU-side operations. For users to interoperate them, both implementations need to provide some form of support. At least in this case, implementation A has to block its CPU-side operations. I'm not sure whether any implementation is capable of doing this. Exposing features through implementation-specific APIs is beyond the control of the specification. Please correct me if my understanding is wrong. |
@naveen-rn It makes sense for kernel-initiated SHMEM calls to operate out of a symmetric heap physically located in GPU memory. Support for CPU-side and GPU-side operations means that both calls use the GPU SHEAP. As such, an app may need to use another SHMEM library to use a SHEAP backed by system memory. The interoperability support could be available for "free" if the GPU-side SHMEM library only provides GPU-side operations (quite possible). Compiler annotations are going to distinguish the SHMEM APIs in the GPU-side library from those in the CPU-side library. In such an interop scenario, do you think there is something for the spec to clarify? |
AFAIU, you are referring to interoperating two different OpenSHMEM implementations. In general, I feel that supporting this usage model requires a broader look at the specification. A small change like this PR wouldn't be sufficient.
If I understand correctly, I think this will work only when one of the implementations supports GPU-side kernel-initiated operations and the other supports only CPU-side operations. For example, to change the previous example, suppose both implementations A and B support CPU- and GPU-initiated operations. In this case, I don't think compilers could differentiate operations from these two implementations. If we need to support this usage model, then we would require different levels of OpenSHMEM compliance, where an implementation could be designed to be modular in such a way that users could pick and choose features from different implementations and interoperate. For example, if we say that there are three levels of OpenSHMEM compliance:
PS: I would prefer not to go in this direction. Just stating an example based on my understanding. |
I think there are only a few conflict causing APIs or APIs that would reasonably be always invoked from the CPU thread for CPU or other backends (GPU, FPGA). These APIs are |
@anshumang According to your code examples, all the GPU-side APIs (e.g., I agree with @naveen-rn that the interoperability of two SHMEM implementations is out of the scope of this proposal, and we will need a much broader discussion covering all SHMEM APIs (not only init/finalize/malloc/free) if we want to support this case in the specification.
I would think that this is only one possible implementation approach; others may exist (e.g., what if one wants to support collectives among CPUs and devices?). This cannot be addressed by simply considering a subset of SHMEM APIs. |
I wrote them as
Agree that this needs to be covered in a separate proposal. I think |
Trying to draft some text, here's what I have so far. (Note, these are not the complete description of the listed routines; I've tried to limit it to the most relevant sections.) For
For
|
Some thoughts on what's missing:
|
@nspark Thanks for driving the draft. The text looks great to me. A few comments on the missing items.
Not sure I understand correctly: is your intention to support reinitialization? I am afraid that is hard for many implementations.
I feel they are still useful. E.g., the user program may want to check whether any library has initialized SHMEM so that it can issue a PUT. |
No. But, if an init-finalize pair happens inside another init-finalize pair, the inner finalize shouldn't cause the library to be "completely finalized." (I need better or more precise terminology here.) For example, this should be allowed (time flows top to bottom; fini == finalize):
The following should also be allowed:
However, the following should be disallowed:
|
Drafting the latest updates for these changes has me thinking: Is multithread initialization/finalization sanely permissible? In a sense, it seemed like we were trending toward allowing:
    #pragma omp parallel
    {
        shmem_init_thread(SHMEM_THREAD_MULTIPLE, ...);
        #pragma omp parallel
        // ...do shmem stuff...
        #pragma omp barrier
        shmem_finalize();
    }
However, |
Multithreaded init/finalize -- we can make this work by putting an init/finalize mutex into the library and only allowing one thread to enter the routine. But, unless there is a use case driving it, I'd rather not distract implementors with this. We should clearly specify that the threading level returned by a call to shmem_init_thread must be greater than or equal to any threading level previously returned. |
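To make the mutex idea above concrete, here is a minimal sketch, not taken from any real OpenSHMEM implementation. All names (`sketch_init_thread`, `sketch_finalize`, `init_lock`, `init_count`, `current_level`, and the `LVL_*` constants) are hypothetical stand-ins, and the sketch assumes the simplest way to satisfy the "greater than or equal" rule: the level is fixed by the first init and every later call in the same portion reports that same level.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for the SHMEM_THREAD_* levels, ordered low to high. */
enum { LVL_SINGLE, LVL_FUNNELED, LVL_SERIALIZED, LVL_MULTIPLE };

static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER;
static int init_count = 0;             /* init calls not yet finalized */
static int current_level = LVL_SINGLE; /* level chosen by the first init */

/* Models shmem_init_thread: only one thread at a time may enter. */
int sketch_init_thread(int requested, int *provided)
{
    pthread_mutex_lock(&init_lock);
    if (init_count == 0)
        current_level = requested;     /* assumption: a real library would negotiate here */
    init_count++;
    *provided = current_level;         /* same level for every call in this portion */
    pthread_mutex_unlock(&init_lock);
    return 0;
}

/* Models shmem_finalize under the same lock. */
void sketch_finalize(void)
{
    pthread_mutex_lock(&init_lock);
    assert(init_count > 0);            /* each finalize must match an earlier init */
    init_count--;
    pthread_mutex_unlock(&init_lock);
}
```

With this shape, a second caller asking for a lower level still gets the level that was already provided, so the returned value never decreases within a portion.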
Is there a use case that may need increased thread level with multiple init calls? |
Slides from today: Multiple Init_Finalize.pdf |
Need to clarify that init/finalize must be called by all PEs. That is, in a nested usage case, you can't call init on a subset of the PEs. |
Goal
Allow the user program to initialize and finalize SHMEM multiple times, in order to support the scenario where SHMEM is used as the communication runtime of other libraries.
Problem Description
The current SHMEM spec defines that (1) multiple calls to `shmem_init|shmem_init_thread` within a program result in undefined behavior, and (2) `shmem_finalize` must be the last OpenSHMEM library call encountered in the OpenSHMEM portion of a program.
These semantics allow a program to have at most one SHMEM portion, and interleaved calls to init or finalize are illegal. Thus, SHMEM cannot be used as the communication runtime for multiple libraries. The following two examples show typical usages of init/finalize in this scenario; both are prohibited by the current spec (`FOO` and `BAR` can each be either a library or the main program).
Example-1
Example-2
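As an illustration only, two call patterns of this kind might look like the sketch below. It assumes `FOO` and `BAR` each wrap init/finalize in their own setup and teardown routines. The names `foo_init`, `foo_finalize`, `bar_init`, `bar_finalize`, `nested_usage`, and `interleaved_usage` are hypothetical, and the SHMEM calls are replaced by counting stubs so the snippet compiles without an OpenSHMEM installation.

```c
#include <assert.h>

/* Stubs standing in for shmem_init() and shmem_finalize(); they only count calls. */
static int init_calls = 0;
static int finalize_calls = 0;
void stub_shmem_init(void)     { init_calls++; }
void stub_shmem_finalize(void) { finalize_calls++; }

/* Hypothetical libraries that each set up and tear down SHMEM on their own. */
void foo_init(void)     { stub_shmem_init(); }
void foo_finalize(void) { stub_shmem_finalize(); }
void bar_init(void)     { stub_shmem_init(); }
void bar_finalize(void) { stub_shmem_finalize(); }

/* Nested: BAR's init/finalize pair sits inside FOO's pair. */
void nested_usage(void)
{
    foo_init();
    bar_init();
    bar_finalize();
    foo_finalize();
    assert(init_calls == finalize_calls);  /* every init was matched */
}

/* Interleaved: FOO finalizes while BAR is still using SHMEM. */
void interleaved_usage(void)
{
    foo_init();
    bar_init();
    foo_finalize();
    bar_finalize();
    assert(init_calls == finalize_calls);  /* every init was matched */
}
```

Both patterns issue two init calls in one program, which is exactly what rule (1) leaves undefined, and the first finalize in each pattern is not the last OpenSHMEM call, which conflicts with rule (2).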
Proposed Solution
Change in specification
1. An OpenSHMEM portion starts with a `shmem_init|shmem_init_thread` call, and ends with a call to `shmem_finalize`.
2. The first `shmem_init|shmem_init_thread` call of an OpenSHMEM portion allocates and initializes resources for OpenSHMEM communication; the last `shmem_finalize` call of the same portion releases all resources initialized in this portion.
3. Calling `shmem_init|shmem_init_thread` more than once within a program is permitted. However, calling `init, init, finalize, finalize` in a program will only initialize a single OpenSHMEM portion, even if the calls are made by different threads.
4. A repeated call to `shmem_init_thread|shmem_init` has no subsequent effect. The thread level cannot be changed after initialization.
5. The `provided` parameter of `shmem_init_thread|shmem_query_thread` returns the thread level initialized in the current OpenSHMEM portion.
Possible implementation
- Maintain a `refcount` global variable, which is increased at every `shmem_init|shmem_init_thread` call, and decreased at every `shmem_finalize` call.
- The resource is initialized at a `shmem_init|shmem_init_thread` call only if `refcount==0`; the resource is released at a `shmem_finalize` call only if `refcount==0`.
Requirement to user
- An unbalanced sequence such as `init, init, finalize` may cause unreleased resources in an OpenSHMEM portion, and subsequent calls to OpenSHMEM (except the call to `shmem_finalize`) result in undefined behavior.
Current Progress
This issue is separated from ticket #243. See past discussion at #243.
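The reference-counting scheme under "Possible implementation" could be sketched roughly as follows. This is only a model of the bookkeeping, not code from any SHMEM library: `model_init_thread`, `model_finalize`, `resources_live`, and `portion_level` are hypothetical names, and the sketch assumes the `refcount==0` test happens before the increment on init and after the decrement on finalize.

```c
#include <assert.h>
#include <stdbool.h>

/* All names here are hypothetical; they only model the bookkeeping. */
static int  refcount = 0;            /* init calls not yet matched by a finalize */
static bool resources_live = false;  /* stands in for network and heap resources */
static int  portion_level = -1;      /* thread level fixed by the first init */

/* Models shmem_init_thread: only the first call of a portion allocates. */
void model_init_thread(int requested, int *provided)
{
    if (refcount == 0) {             /* first init of the portion */
        resources_live = true;
        portion_level = requested;
    }
    refcount++;
    *provided = portion_level;       /* later calls cannot change the level */
}

/* Models shmem_finalize: only the last call of a portion releases. */
void model_finalize(void)
{
    assert(refcount > 0);            /* finalize without a matching init is a user error */
    refcount--;
    if (refcount == 0) {             /* last finalize of the portion */
        resources_live = false;
        portion_level = -1;
    }
}
```

Under this model, `init, init, finalize` leaves `refcount` at 1 and the resources allocated, which matches the unreleased-resource case described under "Requirement to user".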