OffloadActivations MemCpy Stream Sync behavior #2076

Open
amogkam opened this issue Nov 27, 2024 · 3 comments

@amogkam

amogkam commented Nov 27, 2024

We are using the OffloadActivations context manager with separate streams and pinned memory but currently don't see any overlap between the streams.

The default stream has a D2H transfer (a cudaMemcpyAsync followed by a cudaStreamSynchronize, as described here). We observed that if we use pinned memory for activation offloading and transfer it in a separate stream, the cudaStreamSynchronize on the default stream blocks, leading to no overlap.

Our current hypothesis is that this is due to implicit synchronization semantics: specifically, a page-locked (i.e. pinned) host memory allocation issued before the cudaStreamSynchronize on the default stream causes that cudaStreamSynchronize to block. So our conjecture is that offloading with pinned memory does NOT pre-allocate the pinned memory but rather allocates it on the fly, which forces synchronization on other streams to block.

Is this behavior expected, and is our hypothesis about why it happens correct? If so, is there any way to have offloading use preallocated pinned memory instead of allocating pinned memory on the fly, to avoid this synchronization?
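To make the question concrete, here is a minimal sketch of the two allocation patterns being contrasted, written in plain PyTorch. It is not torchtune's implementation: the helper names `offload_preallocated`, `offload_on_the_fly`, and `demo` are hypothetical, and whether the second pattern actually reaches the CUDA pinned-memory allocator on a given call depends on PyTorch's host-memory caching, which this sketch does not control.

```python
import torch


def offload_preallocated(activation, pinned_buf, copy_stream):
    """Copy a GPU tensor into an existing pinned host buffer on a side stream."""
    # The side stream must not start copying before the producer stream is done.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        pinned_buf.copy_(activation, non_blocking=True)
    # Tell the caching allocator the tensor is still in use on copy_stream.
    activation.record_stream(copy_stream)
    return pinned_buf


def offload_on_the_fly(activation, copy_stream):
    """Same copy, but the pinned host buffer is requested at offload time."""
    copy_stream.wait_stream(torch.cuda.current_stream())
    # Assumption: this request may turn into a page-locked allocation,
    # which is the step hypothesized above to cause the blocking sync.
    host = torch.empty(activation.shape, dtype=activation.dtype, pin_memory=True)
    with torch.cuda.stream(copy_stream):
        host.copy_(activation, non_blocking=True)
    activation.record_stream(copy_stream)
    return host


def demo(shape=(1024, 1024)):
    """Run both patterns once; returns None when no GPU is available."""
    if not torch.cuda.is_available():
        return None
    act = torch.randn(shape, device="cuda")
    stream = torch.cuda.Stream()
    # Pinned buffer allocated once, before any offloading happens.
    buf = torch.empty(shape, dtype=act.dtype, pin_memory=True)
    out_pre = offload_preallocated(act, buf, stream)
    out_fly = offload_on_the_fly(act, stream)
    stream.synchronize()  # wait for both asynchronous D2H copies
    expected = act.cpu()
    return torch.equal(out_pre, expected) and torch.equal(out_fly, expected)
```

Running each helper in a loop under a profiler should show whether a pinned allocation appears next to the blocked cudaStreamSynchronize in the second case but not the first.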

Thanks! Happy to answer any further questions/share profile traces.

@felipemello1
Contributor

Thanks for investigating it and sharing this info! @janeyx99 is our PoC for offloading. She is currently on vacation, so we may not hear back from her for a few days.

@ebsmothers
Contributor

Hi @amogkam thanks for sharing your detailed findings on the issue. To piggyback on @felipemello1's comment, in the meantime we can debug some as well. A couple questions/comments for you:

  1. are you just running the OffloadActivations context manager with use_pin_memory=True and use_streams=True, or are you doing some additional customization beyond that?
  2. regarding your offer to share profile traces, that would be quite helpful, and
  3. is there a simple repro we can use to observe the same behavior you mentioned? (e.g. using one of our default configs)

@amogkam
Author

amogkam commented Nov 27, 2024

Hi @ebsmothers, @felipemello1 thanks for getting back!

  1. Yes, I am running just the OffloadActivations context manager with those 2 args set.

Yes, let me create a small example with profile traces and I will get back to you.
