Introduce CUDA backend specification (revised) #420

Open
wants to merge 5 commits into base: main
Conversation

Collaborator

@AerialMantis AerialMantis commented Jun 1, 2023

This pull request is a revised version of the original pull request for introducing the CUDA backend (#197). I created a new PR as the original PR was becoming very difficult to continue rebasing.

Introduces the CUDA backend specification, covering a mapping of the platform, execution and memory models, and interoperability with the CUDA API for SYCL implementations targeting a CUDA backend.

The backend specification has been added as an appendix to the SYCL 2020 specification.

A pull request introducing a test plan to the SYCL CTS for the CUDA backend interoperability defined in this backend specification is here: KhronosGroup/SYCL-CTS#207. It will still need to be updated with the recent changes.

Contributors: @AerialMantis, @AidanBeltonS, @npmiller, @hdelan, @t4c1, @jchlanda, @JackAKirk, @gmlueck, @illuhad.

@AerialMantis AerialMantis requested a review from gmlueck June 1, 2023 12:43
@AerialMantis
Collaborator Author

Since #197 I've made some additional changes based on feedback:

  • Added a get_error_code free function for retrieving native CUDA error codes.
  • Removed the optional features table as it's not necessary.
  • Clarified that the CUDA backend can use either the CUDA Driver API or the CUDA Runtime API.
  • Clarified the context and queue mappings.
  • Improved wording on synchronization of streams for interop.
  • Specified that a queue created from a stream should be in-order.
  • Added host task interop section.
  • Removed image interop.
  • Added sections describing synchronization and setting the current context.
  • Added note about using the primary context.
  • Clarified that it's defined as a KHR extension.
  • Cleaned up some duplicated wording and fixed merge conflicts.

|====
| Function | Description
| [code]#bool cuda::has_native_event()# | Returns true if the SYCL [code]#event# has a native [code]#CUevent# associated with it.
|====
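As a rough illustration of how this query might be used, here is a minimal sketch. The [code]#backend::cuda# enumerator comes from this proposed appendix, and it is assumed here that [code]#cuda::has_native_event# lives under the [code]#sycl# namespace and takes the event to query as an argument; neither is settled wording.

#include <sycl/sycl.hpp>
#include <cuda.h>

void wait_via_native_event(sycl::event ev) {
  // Assumed signature: sycl::cuda::has_native_event(const event&) -> bool.
  if (sycl::cuda::has_native_event(ev)) {
    // Only valid when a native CUevent is associated with the SYCL event.
    CUevent nativeEvent = sycl::get_native<sycl::backend::cuda>(ev);
    cuEventSynchronize(nativeEvent);
  } else {
    ev.wait();  // no native CUevent; fall back to the portable wait
  }
}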
Contributor

How does this relate to sycl::error_category_for (see #164)? If the idea is that each backend interop spec will provide a get_error_code function like this, then can we remove the ill-defined sycl::error_category_for from the core spec?

It would be very good to resolve #164 because we deferred any CTS testing of sycl::error_category_for due to its unclear specification.

Collaborator Author

Sorry I missed your comment here. Are you referring to cuda::get_error_code(exception &) and how that relates to sycl::error_category_for?

Yeah, I think the consensus was that we should move towards per-backend get_error_code functions, so that we can then remove sycl::error_category_for. I'm not sure it was decided exactly what we should do with sycl::error_category_for, whether it is removed from SYCL 2020 immediately or deprecated first, but I think there is justification for removing it immediately, as it has been found not to work as intended.
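For reference, a minimal sketch of what the per-backend approach could look like. The [code]#cuda::get_error_code(const exception&)# signature, a [code]#CUresult# return type, and the [code]#backend::cuda# enumerator are all assumptions based on this proposal, not settled API.

#include <sycl/sycl.hpp>
#include <cuda.h>

void run_and_report(sycl::queue &q) {
  try {
    q.single_task([] { /* kernel body */ }).wait();
  } catch (const sycl::exception &e) {
    if (q.get_backend() == sycl::backend::cuda) {
      // Retrieve the native error directly, rather than going through
      // sycl::error_category_for.
      CUresult nativeError = sycl::cuda::get_error_code(e);
      (void)nativeError;  // handle or log the native CUDA error code
    }
    throw;
  }
}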

@keryell
Member

keryell commented Jun 15, 2023

SYCL WG meeting 2023/06/15:

No recent progress, need to come back in a couple of weeks

@tomdeakin
Contributor

Discussed on Oct 5, 2023 - @AerialMantis to say when to discuss again.

single thread is a SYCL work-item.

CUDA primarily synchronizes the threads through two functions,
[code]#cudaStreamSynchronize()# and [code]#\__syncthreads()#.
Contributor
@hdelan hdelan Feb 22, 2024

I'm pretty sure all of the mentions of cudaStreamSynchronize in this paragraph should be the more general cudaDeviceSynchronize/cuCtxSynchronize

| [code]#device# | [code]#CUdevice# | [code]#CUdevice# | A SYCL [code]#device# created from a [code]#CUdevice# will return a copy of an existing device with that native [code]#CUdevice#. The [code]#CUdevice# retrieved from a SYCL [code]#device# will be the native [code]#CUdevice# associated with it.
| [code]#context# | [code]#CUcontext# | [code]#std::vector<CUcontext># | A SYCL [code]#context# created from a single [code]#CUcontext# will encapsulate that context. All [code]#CUcontext#s associated with a SYCL [code]#context# are retrieved from a SYCL [code]#context#. The native [code]#CUcontext#s may contain one or more CUDA primary contexts.
| [code]#queue# | [code]#CUstream# | [code]#CUstream# | A SYCL [code]#queue# created from a [code]#CUstream# will encapsulate that stream and will have the [code]#property::queue::in_order# property. A single [code]#CUstream# is retrieved from a SYCL [code]#queue#, and before it is returned all native [code]#CUstream#s associated with the SYCL [code]#queue# synchronize with the calling thread.
| [code]#event# | [code]#CUevent# | [code]#CUevent# | A SYCL [code]#event# created from a [code]#CUstream# will encapsulate that event. A single [code]#CUevent# is potentially retrieved from a SYCL [code]#event# if there is a valid native [code]#CUevent# associated with it, otherwise [code]#nullptr# is returned instead. The CUDA backend-specific free function [code]#cuda::has_native_event# can be used to query whether the SYCL [code]#event# has a valid native [code]#CUevent# associated with it.
Contributor

Suggested change
| [code]#event# | [code]#CUevent# | [code]#CUevent# | A SYCL [code]#event# created from a [code]#CUstream# will encapsulate that event. A single [code]#CUevent# is potentially retrieved from a SYCL [code]#event# if there is a valid native [code]#CUevent# associated with it, otherwise [code]#nullptr# is returned instead. The CUDA backend-specific free function [code]#cuda::has_native_event# can be used to query whether the SYCL [code]#event# has a valid native [code]#CUevent# associated with it.
| [code]#event# | [code]#CUevent# | [code]#CUevent# | A SYCL [code]#event# created from a [code]#CUevent# will encapsulate that event. A single [code]#CUevent# is potentially retrieved from a SYCL [code]#event# if there is a valid native [code]#CUevent# associated with it, otherwise [code]#nullptr# is returned instead. The CUDA backend-specific free function [code]#cuda::has_native_event# can be used to query whether the SYCL [code]#event# has a valid native [code]#CUevent# associated with it.
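For context, a short sketch of the queue row in the table above, using the SYCL 2020 backend interop functions. It assumes the proposed [code]#backend::cuda# enumerator and that [code]#backend_input_t<backend::cuda, queue># is simply a [code]#CUstream#, per the "create" column; the final appendix may define this differently.

#include <sycl/sycl.hpp>
#include <cuda.h>

sycl::queue wrap_stream(CUstream nativeStream, const sycl::context &ctx) {
  // Per the table, the resulting queue encapsulates the stream and behaves
  // as if it were created with property::queue::in_order.
  return sycl::make_queue<sycl::backend::cuda>(nativeStream, ctx);
}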


==== Synchronization

When retireving a [code]#CUstream# from a SYCL [code]#queue# the SYCL runtime
Contributor

Suggested change
When retireving a [code]#CUstream# from a SYCL [code]#queue# the SYCL runtime
When retrieving a [code]#CUstream# from a SYCL [code]#queue# the SYCL runtime


When retireving a [code]#CUstream# from a SYCL [code]#queue# the SYCL runtime
must synchronize all [code]#CUstream#s associated with the SYCL [code]#queue#
in order to guarnatee consistent ordering of commands previously enqueued to
Contributor

Suggested change
in order to guarnatee consistent ordering of commands previously enqueued to
in order to guarantee consistent ordering of commands previously enqueued to

must synchronize all [code]#CUstream#s associated with the SYCL [code]#queue#
in order to guarnatee consistent ordering of commands previously enqueued to
the SYCL [code]#queue# in relation to any commands enqueued using the native
[code]#CUstream#. The native [code]#stream# must be synchornized with and
Contributor

Suggested change
[code]#CUstream#. The native [code]#stream# must be synchornized with and
[code]#CUstream#. The native [code]#stream# must be synchronized with and
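Putting the above together, a sketch of the intended behaviour when pulling the native stream out of a queue. [code]#backend::cuda# is the enumerator proposed by this appendix, and the CUDA Driver API calls here are only representative native work.

#include <sycl/sycl.hpp>
#include <cuda.h>

void enqueue_native_memset(sycl::queue &q, CUdeviceptr ptr, size_t bytes) {
  // As described above, get_native synchronizes all CUstreams associated
  // with the SYCL queue before returning, so commands previously submitted
  // to the queue are ordered before work enqueued on the returned stream.
  CUstream stream = sycl::get_native<sycl::backend::cuda>(q);
  cuMemsetD8Async(ptr, 0, bytes, stream);  // native async work on the stream
  cuStreamSynchronize(stream);             // wait for the native work
}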

@hdelan
Contributor

hdelan commented Mar 5, 2024

Some things missing here:

  • When we call ih.get_native_queue(), are we always guaranteed to get the same CUstream/cudaStream? I think this should be a yes.
  • From within a HT function, is it necessary to set the context? This is a bit tricky because we are essentially asking: is a HT function executed in a new thread? If the context needs to be set, then how should this happen? If we call an ih.get_native... method, should this set the cuCtx? (A sketch of this pattern follows after this list.)
  • How do we reconcile the CUDA types that are equivalent with SYCL types if different implementations may use CUDA driver types or CUDA runtime types? Since the CUDA driver can mix and match with runtime, maybe it is better to make SYCL types map to CUDA runtime types instead of driver types
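A minimal sketch of the host task pattern these questions are about. interop_handle::get_native_queue and get_native_context are SYCL 2020 APIs; backend::cuda, the std::vector<CUcontext> return for contexts, and whether the context must be made current explicitly are exactly the open points above, so this is illustrative only.

#include <sycl/sycl.hpp>
#include <cuda.h>

void host_task_interop(sycl::queue &q) {
  q.submit([&](sycl::handler &cgh) {
    cgh.host_task([=](sycl::interop_handle ih) {
      // If a host task may run on a fresh thread, the CUDA context may need
      // to be made current before any Driver API calls (open question above).
      // Per the mapping table, get_native_context returns std::vector<CUcontext>;
      // a single underlying context is assumed here.
      auto nativeContexts = ih.get_native_context<sycl::backend::cuda>();
      cuCtxSetCurrent(nativeContexts.front());

      // Open question above: is this the same CUstream on every call?
      CUstream stream = ih.get_native_queue<sycl::backend::cuda>();
      cuStreamSynchronize(stream);  // representative native call on the stream
    });
  });
}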

@t4c1

t4c1 commented Mar 5, 2024

Why the preference for the same cuda stream? I think for performance reasons it is better to leave this unspecified to allow different streams to be returned.

@hdelan
Contributor

hdelan commented Mar 5, 2024

If I want ih.get_native_queue to give me different streams for any two separate HTs, then I have no way of checking that this has indeed happened. It's better to always return the same stream for every call to ih.get_native_queue, or indeed to give all streams associated with the queue. However, I think the latter is still a bit confusing, as the stream pool may grow or shrink according to the implementation's preferences (unless we specified that it doesn't, which is over-prescriptive I think), so it would be tricky to reason about which stream is which in separate HT funcs.

I think if a secondary stream is needed for a queue then it can be created within a HT, or the user can use in-order SYCL queues as CUstreams, which seems to be what a lot of users do anyway.

@t4c1

t4c1 commented Mar 5, 2024

There is no need to check if that is not specified. And I think this is the way to go - letting implementations optimize by providing different streams when they can, but not requiring it. I agree returning all underlying streams has little value.

@hdelan
Contributor

hdelan commented Mar 5, 2024

The primary motivation is that something like AdaptiveCpp's enqueue_custom_operation (being able to submit async work in a HT-like lambda) can only work if we have a guarantee that we will always get the same stream from get_native_queue. Similar behaviour might be desirable in core SYCL in the future.

This is a small step towards being able to submit async work in the HT lambda.

At the minimum I think we should define this for in-order queues, whatever we decide for out-of-order ones.

@hdelan
Contributor

hdelan commented Mar 7, 2024

@t4c1, @jle-quel and I were talking about this offline. We think it's important to guarantee that, for an in-order queue, ih.get_native_queue returns the same stream every time, whereas out-of-order queues have no such guarantee, even within the same host task func.

So for out of order queues:

q.host_task(...
   bool sameStream = ih.get_native_queue() == ih.get_native_queue();
});

Might be false, whereas for in-order queues this is always true.

And

q.host_task(...
   stream_acc[0] = ih.get_native_queue();
}).wait();
q.host_task(...
   bool sameStream = ih.get_native_queue() == stream_acc[0];
   // Using the stream stored in previous HT lambda through accessor
});

Will always be true for in-order queues. This may or may not evaluate to true for out-of-order queues.

Maybe also worth mentioning that, if dispatching to the same CUstream across different HTs, it is important to make sure the correct command dependencies are used, so that async work is dispatched to the CUstream in the correct order.
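A sketch of that last point, for a queue where both host tasks see the same CUstream (guaranteed for an in-order queue per the discussion above); the depends_on makes the required ordering explicit. backend::cuda and cuMemsetD8Async are just representative here.

#include <sycl/sycl.hpp>
#include <cuda.h>

void ordered_native_submissions(sycl::queue &q, CUdeviceptr buf, size_t n) {
  sycl::event first = q.submit([&](sycl::handler &cgh) {
    cgh.host_task([=](sycl::interop_handle ih) {
      CUstream s = ih.get_native_queue<sycl::backend::cuda>();
      cuMemsetD8Async(buf, 1, n, s);  // first native submission
    });
  });

  q.submit([&](sycl::handler &cgh) {
    cgh.depends_on(first);  // orders the second host task after the first
    cgh.host_task([=](sycl::interop_handle ih) {
      CUstream s = ih.get_native_queue<sycl::backend::cuda>();
      cuMemsetD8Async(buf, 2, n, s);  // guaranteed to be enqueued second
    });
  });
}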
