We are using the OffloadActivations context manager with separate streams and pinned memory but currently don't see any overlap between the streams.
The default stream has a D2H transfer (a cudaMemcpyAsync followed by a cudaStreamSynchronize, as described here). We observed that when we use pinned memory for activation offloading and perform the transfer in a separate stream, the cudaStreamSynchronize on the default stream blocks, so the two streams do not overlap.
Our current hypothesis is that this is due to implicit synchronization semantics: a page-locked (i.e. pinned) host memory allocation issued before the cudaStreamSynchronize on the default stream causes that cudaStreamSynchronize to block. So our conjecture is that offloading with pinned memory does NOT pre-allocate the pinned memory but instead allocates it on the fly, which forces synchronization on other streams to block.
Is this behavior expected, and is our hypothesis about why it happens correct? If so, is there any way to have offloading use preallocated pinned memory instead of allocating pinned memory on the fly, to avoid this synchronization?
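To make the "preallocated" idea concrete, here is a minimal sketch of a reusable host-buffer pool. It is only an illustration under our own assumptions: PinnedBufferPool and its methods are hypothetical names, not part of OffloadActivations or any library, and the allocator is passed in so the pooling logic can be shown without a GPU. In practice the allocator might be something like torch.empty(n, dtype=torch.uint8, pin_memory=True).

```python
from collections import defaultdict
from typing import Callable, Dict, List


class PinnedBufferPool:
    """Hypothetical pool that reuses host buffers keyed by size in bytes.

    The goal is to pay the (potentially synchronizing) pinned allocation
    cost once up front instead of on every offload.
    """

    def __init__(self, alloc: Callable[[int], object]) -> None:
        # alloc is an assumption: any callable returning a buffer of n bytes.
        self._alloc = alloc
        self._free: Dict[int, List[object]] = defaultdict(list)
        self.num_allocations = 0  # counts real allocations, for inspection

    def _new(self, nbytes: int) -> object:
        self.num_allocations += 1
        return self._alloc(nbytes)

    def preallocate(self, nbytes: int, count: int) -> None:
        """Allocate `count` buffers of `nbytes` before the hot path starts."""
        for _ in range(count):
            self._free[nbytes].append(self._new(nbytes))

    def acquire(self, nbytes: int) -> object:
        """Return a free buffer of this size, allocating only if none is free."""
        free = self._free[nbytes]
        return free.pop() if free else self._new(nbytes)

    def release(self, nbytes: int, buf: object) -> None:
        """Hand a buffer back so a later offload can reuse it."""
        self._free[nbytes].append(buf)


# Stand-in allocator so the sketch runs anywhere; swap in a pinned allocator.
pool = PinnedBufferPool(lambda n: bytearray(n))
pool.preallocate(1024, 2)
buf = pool.acquire(1024)   # served from the pool, no new allocation
pool.release(1024, buf)    # returned for reuse by the next offload
```

With something like this, the copy on the side stream would write into an already pinned buffer, so no page-locked allocation would be issued while the default stream is waiting in its cudaStreamSynchronize. Whether that actually removes the blocking we see is exactly what we would like to confirm.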
Thanks! Happy to answer any further questions/share profile traces.
Thanks for investigating it and sharing this info! @janeyx99 is our PoC for offloading. She is currently on vacation, so we may not hear back from her for a few days.
Hi @amogkam thanks for sharing your detailed findings on the issue. To piggyback on @felipemello1's comment, in the meantime we can debug some as well. A couple questions/comments for you:
are you just running the OffloadActivations context manager with use_pin_memory=True and use_streams=True, or are you doing some additional customization beyond that?
regarding your offer to share profile traces, that would be quite helpful, and
is there a simple repro we can use to observe the same behavior you mentioned? (e.g. using one of our default configs)