Introduce CUDA backend specification (revised) #420

Open
wants to merge 5 commits into base: main
Conversation

Collaborator

@AerialMantis AerialMantis commented Jun 1, 2023

This pull request is a revised version of the original pull request for introducing the CUDA backend (#197). I created a new PR as the original PR was becoming very difficult to continue rebasing.

Introduces the CUDA backend specification, covering a mapping of the platform, execution and memory models, and interoperability with the CUDA API for SYCL implementations targeting a CUDA backend.

The backend specification has been added as an appendix to the SYCL 2020 specification.

A pull request introducing a test plan to the SYCL CTS for the CUDA backend interoperability defined in this backend specification is here: KhronosGroup/SYCL-CTS#207. It will still need to be updated with the recent changes.

Contributors: @AerialMantis, @AidanBeltonS, @npmiller, @hdelan, @t4c1, @jchlanda, @JackAKirk, @gmlueck, @illuhad.

@AerialMantis AerialMantis requested a review from gmlueck June 1, 2023 12:43
@AerialMantis
Collaborator Author

Since #197 I've made some additional changes based on feedback:

  • Added a get_error_code free function for retrieving native CUDA error codes.
  • Removed the optional features table as it's not necessary.
  • Clarified that the CUDA backend can use either the CUDA Driver API or the CUDA Runtime API.
  • Clarified the context and queue mappings.
  • Improved wording on synchronization of streams for interop.
  • Specified that a queue created from a stream should be in-order.
  • Added host task interop section.
  • Removed image interop.
  • Added sections describing synchronization and setting the current context.
  • Added note about using the primary context.
  • Clarified that it's defined as a KHR extension.
  • Cleaned up some duplicated wording and fixed merge conflicts.

|====
| Function | Description
| [code]#bool cuda::has_native_event()# | Returns true if the SYCL [code]#event# has a native [code]#CUevent# associated with it.
|====
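As a rough illustration of how this query might be used, here is a minimal sketch. The [code]#backend::cuda# enumerator comes from this proposed appendix, and it is assumed here that [code]#cuda::has_native_event# lives under the [code]#sycl# namespace and takes the event to query as an argument; neither is settled wording.

#include <sycl/sycl.hpp>
#include <cuda.h>

void wait_via_native_event(sycl::event ev) {
  // Assumed signature: sycl::cuda::has_native_event(const event&) -> bool.
  if (sycl::cuda::has_native_event(ev)) {
    // Only valid when a native CUevent is associated with the SYCL event.
    CUevent nativeEvent = sycl::get_native<sycl::backend::cuda>(ev);
    cuEventSynchronize(nativeEvent);
  } else {
    ev.wait();  // no native CUevent; fall back to the portable wait
  }
}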
Contributor

How does this relate to sycl::error_category_for (see #164)? If the idea is that each backend interop spec will provide a get_error_code function like this, then can we remove the ill-defined sycl::error_category_for from the core spec?

It would be very good to resolve #164 because we deferred any CTS testing of sycl::error_category_for due to its unclear specification.

Collaborator Author

Sorry I missed your comment here. Are you referring to cuda::get_error_code(exception &) and how that relates to sycl::error_category_for?

Yeah, I think the consensus was that we should move towards per-backend get_error_code functions, so that we can then remove sycl::error_category_for. I'm not sure it was decided exactly what we should do with sycl::error_category_for, whether it is removed from SYCL 2020 immediately or deprecated first, but I think there is justification for removing it immediately, as it has been found not to work as intended.
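For reference, a minimal sketch of what the per-backend approach could look like. The [code]#cuda::get_error_code(const exception&)# signature, a [code]#CUresult# return type, and the [code]#backend::cuda# enumerator are all assumptions based on this proposal, not settled API.

#include <sycl/sycl.hpp>
#include <cuda.h>

void run_and_report(sycl::queue &q) {
  try {
    q.single_task([] { /* kernel body */ }).wait();
  } catch (const sycl::exception &e) {
    if (q.get_backend() == sycl::backend::cuda) {
      // Retrieve the native error directly, rather than going through
      // sycl::error_category_for.
      CUresult nativeError = sycl::cuda::get_error_code(e);
      (void)nativeError;  // handle or log the native CUDA error code
    }
    throw;
  }
}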

@keryell
Member

keryell commented Jun 15, 2023

SYCL WG meeting 2023/06/15:

No recent progress, need to come back in a couple of weeks

@tomdeakin
Contributor

Discussed on Oct 5, 2023 - @AerialMantis to say when to discuss again.

single thread is a SYCL work-item.

CUDA primarily synchronizes the threads through two functions,
[code]#cudaStreamSynchronize()# and [code]#\__syncthreads()#.
Contributor
@hdelan hdelan Feb 22, 2024

I'm pretty sure all of the mentions of cudaStreamSynchronize in this paragraph should be the more general cudaDeviceSynchronize/cuCtxSynchronize

| [code]#device# | [code]#CUdevice# | [code]#CUdevice# | A SYCL [code]#device# created from a [code]#CUdevice# will return a copy of an existing device with that native [code]#CUdevice#. The [code]#CUdevice# retrieved from a SYCL [code]#device# will be the native [code]#CUdevice# associated with it.
| [code]#context# | [code]#CUcontext# | [code]#std::vector<CUcontext># | A SYCL [code]#context# created from a single [code]#CUcontext# will encapsulate that context. All [code]#CUcontext#s associated with a SYCL [code]#context# are retrieved from a SYCL [code]#context#. The native [code]#CUcontext#s may contain one or more CUDA primary contexts.
| [code]#queue# | [code]#CUstream# | [code]#CUstream# | A SYCL [code]#queue# created from a [code]#CUstream# will encapsulate that stream and will have the [code]#property::queue::in_order# property. A single [code]#CUstream# is retrieved from a SYCL [code]#queue#, and before it is returned all native [code]#CUstream#s associated with the SYCL [code]#queue# synchronize with the calling thread.
| [code]#event# | [code]#CUevent# | [code]#CUevent# | A SYCL [code]#event# created from a [code]#CUstream# will encapsulate that event. A single [code]#CUevent# is potentially retrieved from a SYCL [code]#event# if there is a valid native [code]#CUevent# associated with it, otherwise [code]#nullptr# is returned instead. The CUDA backend-specific free function [code]#cuda::has_native_event# can be used to query whether the SYCL [code]#event# has a valid native [code]#CUevent# associated with it.
Contributor

Suggested change
| [code]#event# | [code]#CUevent# | [code]#CUevent# | A SYCL [code]#event# created from a [code]#CUstream# will encapsulate that event. A single [code]#CUevent# is potentially retrieved from a SYCL [code]#event# if there is a valid native [code]#CUevent# associated with it, otherwise [code]#nullptr# is returned instead. The CUDA backend-specific free function [code]#cuda::has_native_event# can be used to query whether the SYCL [code]#event# has a valid native [code]#CUevent# associated with it.
| [code]#event# | [code]#CUevent# | [code]#CUevent# | A SYCL [code]#event# created from a [code]#CUevent# will encapsulate that event. A single [code]#CUevent# is potentially retrieved from a SYCL [code]#event# if there is a valid native [code]#CUevent# associated with it, otherwise [code]#nullptr# is returned instead. The CUDA backend-specific free function [code]#cuda::has_native_event# can be used to query whether the SYCL [code]#event# has a valid native [code]#CUevent# associated with it.
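For context, a short sketch of the queue row in the table above, using the SYCL 2020 backend interop functions. It assumes the proposed [code]#backend::cuda# enumerator and that [code]#backend_input_t<backend::cuda, queue># is simply a [code]#CUstream#, per the "create" column; the final appendix may define this differently.

#include <sycl/sycl.hpp>
#include <cuda.h>

sycl::queue wrap_stream(CUstream nativeStream, const sycl::context &ctx) {
  // Per the table, the resulting queue encapsulates the stream and behaves
  // as if it were created with property::queue::in_order.
  return sycl::make_queue<sycl::backend::cuda>(nativeStream, ctx);
}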


==== Synchronization

When retireving a [code]#CUstream# from a SYCL [code]#queue# the SYCL runtime
Contributor

Suggested change
When retireving a [code]#CUstream# from a SYCL [code]#queue# the SYCL runtime
When retrieving a [code]#CUstream# from a SYCL [code]#queue# the SYCL runtime


When retireving a [code]#CUstream# from a SYCL [code]#queue# the SYCL runtime
must synchronize all [code]#CUstream#s associated with the SYCL [code]#queue#
in order to guarnatee consistent ordering of commands previously enqueued to
Contributor

Suggested change
in order to guarnatee consistent ordering of commands previously enqueued to
in order to guarantee consistent ordering of commands previously enqueued to

must synchronize all [code]#CUstream#s associated with the SYCL [code]#queue#
in order to guarnatee consistent ordering of commands previously enqueued to
the SYCL [code]#queue# in relation to any commands enqueued using the native
[code]#CUstream#. The native [code]#stream# must be synchornized with and
Contributor

Suggested change
[code]#CUstream#. The native [code]#stream# must be synchornized with and
[code]#CUstream#. The native [code]#stream# must be synchronized with and
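Putting the above together, a sketch of the intended behaviour when pulling the native stream out of a queue. [code]#backend::cuda# is the enumerator proposed by this appendix, and the CUDA Driver API calls here are only representative native work.

#include <sycl/sycl.hpp>
#include <cuda.h>

void enqueue_native_memset(sycl::queue &q, CUdeviceptr ptr, size_t bytes) {
  // As described above, get_native synchronizes all CUstreams associated
  // with the SYCL queue before returning, so commands previously submitted
  // to the queue are ordered before work enqueued on the returned stream.
  CUstream stream = sycl::get_native<sycl::backend::cuda>(q);
  cuMemsetD8Async(ptr, 0, bytes, stream);  // native async work on the stream
  cuStreamSynchronize(stream);             // wait for the native work
}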

@hdelan
Contributor

hdelan commented Mar 5, 2024

Some things missing here:

  • When we call ih.get_native_queue(), are we always guaranteed to get the same CUstream/cudaStream? I think this should be a yes.
  • From within a HT function, is it necessary to set the context? This is a bit tricky because we are essentially asking: is a HT function executed in a new thread? If the context needs to be set, then how should this happen? If we call an ih.get_native... method, should this set the cuCtx? (A sketch of this pattern follows after this list.)
  • How do we reconcile the CUDA types that are equivalent with SYCL types if different implementations may use CUDA driver types or CUDA runtime types? Since the CUDA driver can mix and match with runtime, maybe it is better to make SYCL types map to CUDA runtime types instead of driver types
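A minimal sketch of the host task pattern these questions are about. interop_handle::get_native_queue and get_native_context are SYCL 2020 APIs; backend::cuda, the std::vector<CUcontext> return for contexts, and whether the context must be made current explicitly are exactly the open points above, so this is illustrative only.

#include <sycl/sycl.hpp>
#include <cuda.h>

void host_task_interop(sycl::queue &q) {
  q.submit([&](sycl::handler &cgh) {
    cgh.host_task([=](sycl::interop_handle ih) {
      // If a host task may run on a fresh thread, the CUDA context may need
      // to be made current before any Driver API calls (open question above).
      // Per the mapping table, get_native_context returns std::vector<CUcontext>;
      // a single underlying context is assumed here.
      auto nativeContexts = ih.get_native_context<sycl::backend::cuda>();
      cuCtxSetCurrent(nativeContexts.front());

      // Open question above: is this the same CUstream on every call?
      CUstream stream = ih.get_native_queue<sycl::backend::cuda>();
      cuStreamSynchronize(stream);  // representative native call on the stream
    });
  });
}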

@t4c1

t4c1 commented Mar 5, 2024

Why the preference for the same cuda stream? I think for performance reasons it is better to leave this unspecified to allow different streams to be returned.

@hdelan
Contributor

hdelan commented Mar 5, 2024

If I want ih.get_native_queue to give me different streams for any two separate HTs, then I have no way of checking that this has indeed happened. It's better to always return the same stream for every call to ih.get_native_queue, or indeed to give all streams associated with the queue. However, I think the latter is still a bit confusing, as the stream pool may grow or shrink according to the implementation's preferences (unless we specified that it doesn't, which is over-prescriptive I think), so it would be tricky to reason about which stream is which in separate HT funcs.

I think if a secondary stream is needed for a queue then it can be created within a HT, or the user can use in-order SYCL queues as CUstreams, which seems to be what a lot of users do anyway.

@t4c1

t4c1 commented Mar 5, 2024

There is no need to check if that is not specified. And I think this is the way to go - letting implementations optimize by providing different streams when they can, but not requiring it. I agree returning all underlying streams has little value.

@hdelan
Contributor

hdelan commented Mar 5, 2024

The primary motivation is that something like AdaptiveCpp's enqueue_custom_operation (being able to submit async work in a HT-like lambda) can only work if we have a guarantee that we will always get the same stream from get_native_queue. Similar behaviour might be desirable in core SYCL in the future.

This is a small step towards being able to submit async work in the HT lambda.

At the minimum I think we should define this for in-order queues, whatever we decide for out-of-order ones.

@hdelan
Contributor

hdelan commented Mar 7, 2024

@t4c1, @jle-quel and I were talking about this offline. We think it's important to guarantee that, for an in-order queue, ih.get_native_queue returns the same stream every time, whereas out-of-order queues have no such guarantee, even within the same host task func.

So for out of order queues:

q.host_task(...
   bool sameStream = ih.get_native_queue() == ih.get_native_queue();
});

Might be false, whereas for in-order queues this is always true.

And

q.host_task(...
   stream_acc[0] = ih.get_native_queue();
}).wait();
q.host_task(...
   bool sameStream = ih.get_native_queue() == stream_acc[0];
   // Using the stream stored in previous HT lambda through accessor
});

Will always be true for in-order queues. This may or may not evaluate to true for out-of-order queues.

Maybe also worth mentioning that, if dispatching to the same CUstream across different HTs, it is important to make sure the correct command dependencies are used, so that async work is dispatched to the CUstream in the correct order.
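A sketch of that last point, for a queue where both host tasks see the same CUstream (guaranteed for an in-order queue per the discussion above); the depends_on makes the required ordering explicit. backend::cuda and cuMemsetD8Async are just representative here.

#include <sycl/sycl.hpp>
#include <cuda.h>

void ordered_native_submissions(sycl::queue &q, CUdeviceptr buf, size_t n) {
  sycl::event first = q.submit([&](sycl::handler &cgh) {
    cgh.host_task([=](sycl::interop_handle ih) {
      CUstream s = ih.get_native_queue<sycl::backend::cuda>();
      cuMemsetD8Async(buf, 1, n, s);  // first native submission
    });
  });

  q.submit([&](sycl::handler &cgh) {
    cgh.depends_on(first);  // orders the second host task after the first
    cgh.host_task([=](sycl::interop_handle ih) {
      CUstream s = ih.get_native_queue<sycl::backend::cuda>();
      cuMemsetD8Async(buf, 2, n, s);  // guaranteed to be enqueued second
    });
  });
}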
