`OffloadActivations` MemCpy Stream Sync behavior #2076

amogkam · 2024-11-27T00:25:16Z

We are using the OffloadActivations context manager with separate streams and pinned memory but currently don't see any overlap between the streams.

The default stream has a D2H transfer (which is a cudaMemcpyAsync followed by a cudaStreamSynchronize, as described here). We observed that if we use pinned memory for activation offloading and do the transfer in a separate stream, the cudaStreamSynchronize on the default stream blocks, leading to no overlap.

Our current hypothesis is that this is due to implicit synchronization semantics: specifically, a page-locked (i.e. pinned) host memory allocation issued before the cudaStreamSynchronize on the default stream would cause that cudaStreamSynchronize to block. So our conjecture is that offloading with pinned memory does NOT pre-allocate the pinned memory but rather allocates it on the fly, which forces synchronization on other streams to block.

Is this behavior expected, and is our hypothesis about why it happens correct? If so, is there any way to have offloading use preallocated pinned memory, instead of allocating pinned memory on the fly, to avoid this synchronization?
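
To make the preallocation idea concrete, here is a minimal sketch of a reusable host-buffer pool. It is purely illustrative and is not how `OffloadActivations` is implemented: `PinnedBufferPool`, `alloc_fn`, `warm_up`, `acquire`, and `release` are hypothetical names. The default allocator is a plain `bytearray` so the sketch runs without a GPU; on a CUDA machine one could presumably pass an allocator that returns pinned tensors instead, as noted in the comment.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class PinnedBufferPool:
    """Hypothetical pool that reuses host buffers keyed by size in bytes."""

    def __init__(self, alloc_fn: Callable[[int], object] = bytearray):
        # alloc_fn(nbytes) returns a 1-D host buffer of that length.
        # Assumption: with PyTorch on a CUDA machine this could be
        #   lambda n: torch.empty(n, dtype=torch.uint8, pin_memory=True)
        self._alloc_fn = alloc_fn
        self._free: Dict[int, List[object]] = defaultdict(list)
        self.num_allocations = 0  # counts calls to alloc_fn

    def warm_up(self, nbytes: int, count: int) -> None:
        # Allocate up front, before the training loop starts.
        for _ in range(count):
            self.num_allocations += 1
            self._free[nbytes].append(self._alloc_fn(nbytes))

    def acquire(self, nbytes: int):
        # Reuse a free buffer when one exists; allocate only as a fallback.
        free = self._free[nbytes]
        if free:
            return free.pop()
        self.num_allocations += 1
        return self._alloc_fn(nbytes)

    def release(self, buf) -> None:
        # Return the buffer to the pool, keyed by its length.
        self._free[len(buf)].append(buf)


pool = PinnedBufferPool()
pool.warm_up(1024, count=2)   # two allocations happen here
a = pool.acquire(1024)        # served from the pool
b = pool.acquire(1024)        # served from the pool
pool.release(a)
c = pool.acquire(1024)        # reuses the buffer released above
```

With a pool like this, the only host allocations in the steady state are the ones made during `warm_up`, so any synchronization caused by allocating pinned memory would, under our hypothesis, happen before the overlapped transfers begin.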

Thanks! Happy to answer any further questions/share profile traces.

felipemello1 · 2024-11-27T02:36:50Z

Thanks for investigating it and sharing this info! @janeyx99 is our PoC for offloading. She is currently on vacation, so we may not hear back from her for a few days.

ebsmothers · 2024-11-27T05:19:56Z

Hi @amogkam thanks for sharing your detailed findings on the issue. To piggyback on @felipemello1's comment, in the meantime we can debug some as well. A couple questions/comments for you:

- Are you just running the OffloadActivations context manager with use_pin_memory=True and use_streams=True, or are you doing some additional customization beyond that?
- Regarding your offer to share profile traces, that would be quite helpful.
- Is there a simple repro we can use to observe the same behavior you mentioned (e.g. using one of our default configs)?

amogkam · 2024-11-27T21:05:07Z

Hi @ebsmothers, @felipemello1 thanks for getting back!

Yes, I am running just the OffloadActivations context manager with those two args set.

Yes, let me create a small example with profile traces and I will get back to you.
