
Introduce CUDA backend specification #197

Closed

Conversation

AerialMantis
Collaborator

@AerialMantis AerialMantis commented Oct 22, 2021

Introduce a first draft of a CUDA backend specification covering a mapping of the platform, execution and memory models, and interoperability with the CUDA API, for SYCL implementations targeting a CUDA backend.

The backend specification has been added as an appendix to the SYCL 2020 specification.

A pull request for introducing a test plan to the SYCL CTS for the CUDA backend interoperability defined in this backend specification is here - KhronosGroup/SYCL-CTS#207

@AerialMantis changed the title from "Introduce first draft of CUDA backend specification and interop API" to "Introduce CUDA backend specification" on Oct 22, 2021
@CLAassistant

CLAassistant commented Oct 22, 2021

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
5 out of 6 committers have signed the CLA.

✅ npmiller
✅ hdelan
✅ JackAKirk
✅ AidanBeltonS
✅ AerialMantis
❌ t4c1

@illuhad
Contributor

illuhad commented Oct 22, 2021

hipSYCL will not be able to follow this specification because it contradicts the meaning of e.g. queue and context in hipSYCL and how they are mapped to CUDA. I'm not convinced there is a need for this spec if only DPC++ is going to implement it.

@AerialMantis
Collaborator Author

Thanks for the feedback @illuhad. This is just a first draft, based initially on the DPC++ CUDA backend implementation, as that is what we are most familiar with; we appreciate it may not map the same way in hipSYCL. Our goal is to converge on a specification that all SYCL implementations providing a CUDA backend can conform to, so we are keen to collaborate with you on incorporating the design of hipSYCL's CUDA backend as well.

To that end we'd like to understand where it would be difficult for hipSYCL to conform to the specification as currently written, so we can work with you to adapt it; perhaps there are places where we have been too prescriptive and need to be more flexible.

@illuhad
Contributor

illuhad commented Oct 25, 2021

Hi @AerialMantis, thanks for the clarification regarding your plans, I'm happy to hear you want to converge on a common specification.
I think for kernel code (execution model, etc.) there is strong overlap, and any potential differences could probably be fixed easily. The differences are mostly on the runtime side. From my quick read of the spec so far, I think there are three main concerns:

  1. hipSYCL deliberately decouples sycl::queue and sycl::context from backend objects. For example, it maintains a pool of CUDA streams, and then distributes the work that comes in through all sycl::queue objects across the pool. This has the advantage that it can already benefit from concurrent kernel execution, concurrent memory copies, or overlap of data transfers and compute even if the user code only utilizes a single sycl::queue. Additionally, the performance behavior is independent of the number of SYCL queues, so the user does not need to worry about how to best utilize queues, and can instead focus on expressing the work as part of the SYCL task graph. Similarly, decoupling context allows for an implicit context management that prevents the user from accidentally spawning new backend contexts (e.g. using the default queue constructor). The downside of this approach is that in general there is no well-defined 1:1 mapping between SYCL queue, context and backend objects, which limits interoperability.
  2. Similarly, there might not be a direct relation between backend events and sycl::event, because hipSYCL might elide operations early on in the task graph processing as an optimization, in which case the operation becomes "virtual" in hipSYCL terminology, and then the event is not associated with any real backend event. Maybe it might be possible to at least construct a sycl::event on top of a CUDA event, but that's still unclear. Also, there might be other mechanisms in play for synchronization, such as callbacks instead of CUDA events, and whether the SYCL event comes from the CUDA or e.g. the host backend might also be unclear for the user (e.g. a host task might return an event from the CPU backend). Because of all these complexities, hipSYCL currently does not support event interop.
  3. While DPC++ relies exclusively on the CUDA driver API, hipSYCL mostly utilizes the CUDA runtime API (except for a couple of places where the driver API is used). Most runtime types are a direct typedef for a driver type, so e.g. interop with driver types might "just work", but it's unclear how well this is guaranteed to be the case and we should check to make sure that we don't miss anything.

Of course we could just make all the problematic interop optional to support both the hipSYCL and DPC++ designs, but my understanding is that one of the main arguments for such a specification would be precisely to make guarantees for users when they wish to interop with the backend.

@AerialMantis
Collaborator Author

@illuhad thank you for the detailed response, this is really helpful.

  1. hipSYCL deliberately decouples sycl::queue and sycl::context from backend objects. For example, it maintains a pool of CUDA streams, and then distributes the work that comes in through all sycl::queue objects across the pool. This has the advantage that it can already benefit from concurrent kernel execution, concurrent memory copies, or overlap of data transfers and compute even if the user code only utilizes a single sycl::queue. Additionally, the performance behavior is independent of the number of SYCL queues, so the user does not need to worry about how to best utilize queues, and can instead focus on expressing the work as part of the SYCL task graph. Similarly, decoupling context allows for an implicit context management that prevents the user from accidentally spawning new backend contexts (e.g. using the default queue constructor). The downside of this approach is that in general there is no well-defined 1:1 mapping between SYCL queue, context and backend objects, which limits interoperability.

So this is a really good point, and we thought about this quite a bit. Our thinking for the mapping of the SYCL queue to CUDA streams was that it should be loosely defined to allow either a 1:1 or a 1:n mapping, so that an implementation can, for example, use different streams to overlap kernel execution and data movement, or even provide a separate stream for interop to make it easier to run concurrently with work done via interop. We believe this should extend to an n:p mapping where there is a fixed pool of CUDA streams regardless of how many SYCL queues are constructed, and we can tweak the definition to ensure this case is included.

For the mapping of the SYCL context to CUDA contexts, we chose a mapping of 1:n in order to support the multi-device context use case, where you have a context representing multiple devices. Since CUDA contexts are associated with a single device, it made sense to allow a SYCL context to map to multiple CUDA contexts. However, if you were to have a pool of CUDA streams then it makes sense that this would naturally extend to the CUDA contexts as well.

For the interop API, we wanted to keep the mapping of the SYCL queue somewhat separate from the interop API requirements: even though a SYCL queue could map to multiple CUDA streams, the interop API is still only required to provide a single CUDA stream, and the implementation can choose which to provide. I believe this should work for hipSYCL. However, for the SYCL context I think it's still necessary to return all CUDA contexts, as they are unique for each CUDA device, though perhaps this will still work for hipSYCL if you can retrieve the CUDA contexts from the pool corresponding to the devices of the SYCL context?
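To make this concrete, here is a minimal sketch (not part of the proposal text) of what such interop could look like in application code, assuming the mapping discussed in this thread: a queue hands back a single `CUstream` chosen by the implementation, and a context hands back all of its `CUcontext`s. The `backend::cuda` enumerator follows the usage elsewhere in this thread and may differ in a final spec.

```c++
#include <sycl/sycl.hpp>
#include <cuda.h>
#include <vector>

int main() {
  sycl::queue q{sycl::gpu_selector_v};

  // The implementation may back this queue with several CUDA streams;
  // the interop API only has to return one of them.
  CUstream stream = sycl::get_native<sycl::backend::cuda>(q);

  // A SYCL context can encapsulate one CUDA context per device,
  // so the context interop returns all of them.
  std::vector<CUcontext> contexts =
      sycl::get_native<sycl::backend::cuda>(q.get_context());

  (void)stream;
  (void)contexts;
  return 0;
}
```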

Where I think this gets tricky is in the interop API for creating SYCL objects, make_queue and make_context: if a CUDA backend is implemented using a pool of CUDA streams or CUDA contexts, then it becomes difficult to have the created objects represent only the input CUDA streams or CUDA contexts. However, I see a couple of ways this could work:

  • Create a new object with its own CUDA streams or CUDA contexts which are independent of the pools.
  • Add the CUDA stream or CUDA context from make_queue or make_context to the pools, but tag them as only usable by the interop-constructed SYCL objects.
  2. Similarly, there might not be a direct relation between backend events and sycl::event, because hipSYCL might elide operations early on in the task graph processing as an optimization, in which case the operation becomes "virtual" in hipSYCL terminology, and then the event is not associated with any real backend event. Maybe it might be possible to at least construct a sycl::event on top of a CUDA event, but that's still unclear. Also, there might be other mechanisms in play for synchronization, such as callbacks instead of CUDA events, and whether the SYCL event comes from the CUDA or e.g. the host backend might also be unclear for the user (e.g. a host task might return an event from the CPU backend). Because of all these complexities, hipSYCL currently does not support event interop.

This is a good point, and I think most implementations will do this at some point or another, so it definitely needs to be accounted for. It's difficult to emulate in CUDA, as it's not possible to create a user event to represent virtual or composite operations in the runtime. Perhaps we may need to loosen the mapping of the SYCL event to the CUDA event, and say instead that a SYCL event may have an underlying CUDA event associated with it, but it may not. Then perhaps we could require the interop API to be defined, but make it implementation-defined when a CUDA event is returned?

Though as you said, it does reduce the usefulness of having interoperability in this case, since users wouldn't be able to rely on it always being available, so perhaps it makes less sense to have interop for the SYCL event. There may be other ways to achieve what you would do with a CUDA event; for example, since CUDA streams are in-order, you could simply wait on the stream, as sketched below. Though this is not quite the same, it may be sufficient. We'll need to think about this some more.
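A rough illustration of that "wait on the stream" alternative (a sketch only; it assumes the queue interop discussed above and that the returned stream actually carries the previously submitted work, which, as noted elsewhere in this thread, may not hold for pooled-stream implementations):

```c++
#include <sycl/sycl.hpp>
#include <cuda.h>

// Wait for everything previously submitted to the queue's underlying
// stream, as a stand-in for sycl::event interop.
void wait_on_native_stream(sycl::queue &q) {
  CUstream stream = sycl::get_native<sycl::backend::cuda>(q);
  cuStreamSynchronize(stream);
}
```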

  3. While DPC++ relies exclusively on the CUDA driver API, hipSYCL mostly utilizes the CUDA runtime API (except for a couple of places where the driver API is used). Most runtime types are a direct typedef for a driver type, so e.g. interop with driver types might "just work", but it's unclear how well this is guaranteed to be the case and we should check to make sure that we don't miss anything.

That's a good point; we'll need to investigate this and make sure that these types are compatible. Otherwise we may need to consider a way for the interop API to use one or the other type, depending on which API the implementation is using. We'll need to look into this some more.

@illuhad
Contributor

illuhad commented Oct 26, 2021

Thanks for your thoughts @AerialMantis !

For the interop API, we wanted to keep the mapping of the SYCL queue separate somewhat from the interop API requirements in that, even though a SYCL queue could map to multiple CUDA streams, the interop API is still only required to provide a single CUDA stream and the implementation can choose which to provide. I believe this should work for hipSYCL.

I could of course just return some stream from the pool, but I'm not sure whether this would do more harm than good, because I don't think it's what users would expect. In that case, there'd be no guarantees that the returned stream synchronizes in any way with previous operations submitted to the SYCL queue (they may have been scheduled to different streams, even for in-order queues), and neither would subsequent operations be guaranteed to synchronize with anything the user submits to the stream via interop.

However, for SYCL context I think it's still necessary to return all CUDA contexts as they are unique for each CUDA device, though perhaps this will still work for hipSYCL if you can retrieve the CUDA contexts from the pool corresponding to the devices of the SYCL context?

That might work. In most workflows, hipSYCL just uses the default context from the CUDA runtime API, which I think can be extracted if the device is known. There are some caveats since hipSYCL also supports multi-device queues, but those are not covered by the SYCL standard anyway.

Where I think this gets tricky is in the interop API for creating SYCL objects, make_queue and make_context: if a CUDA backend is implemented using a pool of CUDA streams or CUDA contexts, then it becomes difficult to have the created objects represent only the input CUDA streams or CUDA contexts. However, I see a couple of ways this could work:

I'm not sure those functions make much sense for hipSYCL. The hipSYCL interpretation of e.g. a queue is really more of a high-level interface to the SYCL task graph, with some functionality to manage synchronization of groups of tasks (e.g. wait()). To make this more clear I am pondering whether we should introduce the alias using task_group = queue; in hipSYCL. Relating a task group to a stream does not sound very useful semantically, and adding streams at runtime for interop purposes breaks some of the assumptions that the scheduler currently can make. I'll have to think some more about this. For context it might be even more complicated as we couldn't rely on the CUDA runtime API context anymore in the workflows that use it, and this in turn might break other interop use cases that currently "just work" for SYCL+CUDA interop applications that use the runtime API.

This is a good point, and I think most implementations will do this at some point or another, so it definitely needs to be accounted for. It's difficult to emulate in CUDA, as it's not possible to create a user event to represent virtual or composite operations in the runtime. Perhaps we may need to loosen the mapping of the SYCL event to the CUDA event, and say instead that a SYCL event may have an underlying CUDA event associated with it, but it may not. Then perhaps we could require the interop API to be defined, but make it implementation-defined when a CUDA event is returned?

Sure, that could be implemented - assuming this is still useful for users then.

I wonder if we should take a step back and consider what we want to accomplish with the spec from a user's point of view, to make sure that a common spec targeting the least common denominator between DPC++ and hipSYCL is really better for users than two separate implementation backend specs specifically built to match the DPC++ and hipSYCL designs.

| [code]#template<backend Backend> +
device +
make_device(const backend_input_t<Backend, device> &backendObject);#
| Create a SYCL `device` from a CUDA device.
Contributor

I think it would be good to clarify that the device returned from this API is just a copy of one of the devices in device::get_devices(). For example, I think you could add wording like this (inspired from similar wording we have for the Level Zero interop spec):

The SYCL execution environment for the CUDA backend contains a fixed number of devices that are enumerated via sycl::device::get_devices(). Calling this function does not create a new device. Rather it merely creates a sycl::device object that is a copy of one of the devices from that enumeration.
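For illustration, a small sketch of what that wording means for application code (assuming the `backend::cuda` enumerator used elsewhere in this thread and that the native input type for a device is a `CUdevice`; both are assumptions rather than settled spec text):

```c++
#include <sycl/sycl.hpp>
#include <cuda.h>

// make_device does not create a new device; it returns a copy of one of
// the devices already enumerated by sycl::device::get_devices().
sycl::device device_from_native(CUdevice nativeDevice) {
  return sycl::make_device<sycl::backend::cuda>(nativeDevice);
}
```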

Collaborator Author

The wording for this has been updated to reflect that make_device doesn't create a new device, it just returns a device object that is a copy of one of the existing devices, so I think this is resolved now.


When creating a SYCL object from a native CUDA object SYCL does not take
ownership of the object and it is up to the application to dispose of them when
appropriate.
Contributor

The Level Zero backend has a similar issue about ownership. We addressed this by adding an ownership field to the interop APIs that create SYCL objects, which allows the application to say whether it keeps ownership of the native handle vs. transferring ownership to the implementation. I think it might make sense to use the same technique here. You can see the wording we used in the Level Zero spec at:

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/LevelZeroBackend/LevelZeroBackend.md#44-level-zero-handles-ownership-and-thread-safety
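For reference, the kind of mechanism being suggested could look roughly like this for the CUDA backend. This is purely illustrative: the `native_ownership` enum and the input-struct shape are hypothetical, modeled on the Level Zero extension linked above rather than on anything in this proposal.

```c++
#include <cuda.h>

// Hypothetical sketch of an ownership tag on the interop input.
// If ownership == keep, the application remains responsible for calling
// cuStreamDestroy(); if ownership == transfer, the SYCL runtime destroys
// the stream when the last copy of the created sycl::queue is released.
enum class native_ownership { keep, transfer };

struct cuda_queue_input {
  CUstream stream;            // native handle supplied by the application
  native_ownership ownership; // who is responsible for destroying it
};
```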

Contributor

Yes, we looked at that, and I do think it makes sense as well. However, it does use an extension header, which is fine as the Level Zero spec is maintained separately, but it's not ideal if we want the CUDA backend spec to be part of the main SYCL specification.

So we decided to go with a simpler approach for now for the CUDA backend spec, with the idea in mind that we should probably look into codifying what Level Zero is doing directly in the SYCL interop API, or at least a similar system that could be re-used by the non-OpenCL backends to handle ownership.

Contributor

However, it does use an extension header, which is fine as the Level Zero spec is maintained separately, but it's not ideal if we want the CUDA backend spec to be part of the main SYCL specification.

Sorry, I'm not following. What do you mean by "it does use an extension header"? How does this preclude the interop API from being part of the main SYCL specification?

Contributor

It doesn't preclude it; we just thought it might be better done separately, as a generic change to the interop API that could be used by all backends.

Contributor

I'm not sure it makes sense to try to solve this in a general way in the core spec, since only some backends have this issue of "ownership". (For example, OpenCL doesn't have this issue because objects are reference counted.) For backends that have this issue (e.g. Level Zero, CUDA), I think it makes sense for each backend to define the mechanism for deciding who "owns" an object. Of course, it would be better if all backends had a similar mechanism, though.

If the Level Zero mechanism seems good to others, can we use that for CUDA too? If it has flaws, we could discuss them and potentially adopt a consistent mechanism across both backends.

Contributor

I would argue that the OpenCL backend also has this issue of ownership, it's just that it happens to be easily solved by the OpenCL reference counting. And that the Level Zero approach could easily be implemented by the OpenCL backend, where the ownership transfer would just dictate how SYCL calls clRetain* and clRelease* on the OpenCL objects.

I could be completely wrong on that, but looking at the interop API with make_* and get_*, it seems to me that ownership questions would naturally come up, and I get the feeling that maybe it wasn't part of the original interop API simply because it was solved so easily by reference counting for the OpenCL backend.

are non-core CUDA features. Therefore, the runtime must be able to determine what aspects CUDA
devices have. This can be performed by querying `cudaDeviceProp::major` and `cudaDeviceProp::minor`
to find out the compute capability. The compute capability indicates what extensions are
available to the device, and therefore what aspects are available.
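For reference, the query this paragraph describes is roughly the following (CUDA runtime API; a sketch of what an implementation might do, not spec wording):

```c++
#include <cuda_runtime.h>

// Returns the compute capability as major*10 + minor (e.g. 53 for sm_53),
// which an implementation could map onto SYCL aspects such as fp16.
int compute_capability(int cudaDeviceOrdinal) {
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, cudaDeviceOrdinal);
  return prop.major * 10 + prop.minor;
}
```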
Contributor

This paragraph seems like a design document or a description of the implementation. I don't think it's relevant for an API spec.

Contributor

This is a valid point. We have omitted/rephrased some things in the latest iteration, which we will be posting shortly.

Collaborator Author

This wording has been removed, so I think this is resolved now.

[code]#half4#, [code]#half8# and [code]#half16# must be
available at compile-time. However a kernel using these types is only
supported on devices that have [code]#aspect::fp16#, i.e. compute capability
5.3 or greater.
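In application code this guard is just the standard SYCL aspect check; a minimal sketch (nothing here is CUDA-specific):

```c++
#include <sycl/sycl.hpp>

// Kernels that use sycl::half vector types may only be submitted to
// devices reporting aspect::fp16 (compute capability 5.3+ on CUDA).
void maybe_run_fp16_kernel(sycl::queue &q) {
  if (q.get_device().has(sycl::aspect::fp16)) {
    q.parallel_for(sycl::range<1>{64}, [=](sycl::id<1> i) {
      sycl::half4 v{1.0f, 2.0f, 3.0f, 4.0f};
      (void)v;
      (void)i;
    }).wait();
  }
}
```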
Contributor

I assume you are talking about the SYCL types here (e.g. sycl::half), correct?

If so, none of this is specific to CUDA. The core SYCL spec already says that kernels using these types are only compatible with a device that has aspect::fp16. I don't see why we need to say this again in the CUDA backend spec.

Contributor

This is correct. We have rephrased this section in the next iteration.

Collaborator Author

This wording has been updated, so I think this is resolved now.

5.3 or greater.

[[sub:cuda:extensions]]
=== Extensions
Contributor

Delete this empty section? It seems like you already talked about extensions above.

Contributor

This is a placeholder section for later extensions

Collaborator Author

This wording has been updated, so I think this is resolved now.

[[sub:cuda:platform_model]]
=== Platform Model

All CUDA enabled devices which can be executed on are represented by a single `CUdevice`. A SYCL device maps to a single CUDA device.
Contributor

This first sentence seems confusing to me. Should it read "Each CUDA enabled device ..."? Why is it important to say that, though? It seems like it is just talking about CUDA in general, not anything specific to SYCL interoperability. Maybe this first sentence should just be deleted?

Collaborator Author

This wording has been updated to address this, so I think this is resolved now.

Fix editing typo and improve make_device docs

See merge request oneapi-core/SYCL-Docs-mirror!12
| [code]#context# | `CUcontext` | `std::vector<CUcontext>` | A SYCL context can encapsulate multiple CUDA contexts; however, it is not possible to create a SYCL context from multiple CUDA contexts.
| [code]#queue# | `CUstream` | `CUstream` | A SYCL queue encapsulates a CUDA stream.
| [code]#event# | `CUevent` | `CUevent` | A SYCL event encapsulates a CUDA event.
| [code]#buffer# | `struct { CUdeviceptr ptr; size_t size; }` | `CUdeviceptr` | A SYCL buffer encapsulates a CUDA device pointer.
Contributor

A question came up recently about the return value of get_native() for buffer. What value is returned if buffer is a sub-buffer? Is it a pointer to the original (non-sub) buffer, or is it an offset to the start of the sub-buffer? It would be nice if there was a table describing the behavior of get_native() for each of these objects. That table would be a good place to clarify things like this.

Contributor

Another question about the interop for buffer ... the OpenCL interop has a vector of cl_mem objects because there might be multiple OpenCL devices on the system and the buffer might be represented on more than one device. (At least I think that's the reasoning.) Is it correct that there's just a single CUdeviceptr here, or should there be a vector of them?

Collaborator Author

So we've been discussing application interop for buffer as well and for a few reasons we came to the conclusion that we should not support it and instead encourage buffer interop via interop_handle::get_native_mem:

  • As you mentioned, as a buffer can represent multiple devices it would need to return a vector of CUdeviceptr, but because it allocates data lazily, this would also mean we would either need to return a partial list of memory objects depending on the current state of the runtime or enforce allocation within the get_native.
  • If the return value has multiple memory objects, we would need some way to reflect which one has the latest copy of the data.
  • A buffer can also represent memory on the host and potentially across multiple backends, and representing this would add further complexity to the return type.

In effect we came to the conclusion we would need to replicate much of the inner workings of a buffer in order to make it really useful to users, and the use cases we had could be supported via host task interop.
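A rough sketch of that host-task route (assuming the proposed mapping of buffer memory to a `CUdeviceptr` from the table earlier in this thread, and the `backend::cuda` enumerator used elsewhere in the discussion):

```c++
#include <sycl/sycl.hpp>
#include <cuda.h>

void buffer_interop_via_host_task(sycl::queue &q, sycl::buffer<float, 1> &buf) {
  q.submit([&](sycl::handler &cgh) {
    sycl::accessor acc{buf, cgh, sycl::read_write};
    cgh.host_task([=](sycl::interop_handle ih) {
      // Only the allocation for this backend is handed back, and only
      // inside the host task, where it is guaranteed to be valid.
      CUdeviceptr ptr = ih.get_native_mem<sycl::backend::cuda>(acc);
      (void)ptr; // e.g. pass to a native CUDA library call here
    });
  });
}
```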

pvchupin pushed a commit to intel/llvm that referenced this pull request Jun 15, 2022
This PR adds part of the CUDA backend spec interop proposed in KhronosGroup/SYCL-Docs#197. The changes work with the CUDA CTS interop checks in KhronosGroup/SYCL-CTS#336.

This PR just adds the queue interop.

llvm-test-suite: intel/llvm-test-suite#1054
context +
make_context(const backend_input_t<Backend, context> &backendObject,
const async_handler asyncHandler = {});#
| Create a SYCL `context` from a CUDA context.
Contributor

This probably cannot be supported by SYCL implementations relying on the CUDA runtime API instead of the driver API such as hipSYCL.

Collaborator Author

@illuhad you make a good point, thanks for raising this. I think we could resolve this by clarifying the wording such that for some SYCL implementations the mapping of the SYCL context to the CUDA context is limited to the CUDA primary context. This would limit the usefulness of the feature when the implementation is based on the runtime API, but would allow us to maintain a generic interface between the two implementation approaches.

So the wording would be clarified such that get_native<context, backend::cuda> is permitted to return the CUDA primary context and make_context<backend::cuda> is permitted to only accept the CUDA primary context, and throw an exception otherwise.

However, in order to allow users to reason about this there would need to be some kind of query which exposes whether the implementation supports interop with non-primary CUDA contexts. Aside from this particular use case, there is another situation where having such a query would be important. Within a host task, when interoperating with the native objects, it's important to know whether the SYCL implementation initializes non-primary contexts, in order to ensure that the native API calls within the host task are compatible.

I see a couple of ways we could do this:

  • A. Introduce a query which tells the user specifically whether the SYCL implementation maps all SYCL contexts to the CUDA primary context or whether unique contexts are allocated; this provides the specific information we need in this case, but is perhaps less scalable.
  • B. Introduce a broader query which tells the user whether the SYCL implementation is using the driver API or the runtime API; this could be useful for other places where there is a difference between the driver API and the runtime API, though it may not reflect the case where an implementation combines the two APIs.

Contributor

@illuhad illuhad Jul 28, 2022

Thanks for your thoughts @AerialMantis! I agree that what you are proposing are possible solutions. I have a hunch that in practice it might be easier for users to work with if they know whether an implementation is runtime or driver API based, rather than making the query more fine-grained and only about context interop.
E.g. if you know that your CUDA backend is runtime API-based, you can make other assumptions about backend interop. In that case, you know that backend interop is generally just going to work without having to worry about contexts at all, which I think can be very valuable.
I wonder if it might be easier to understand and clearer if the backend specification just had two flavors, or profiles, so that we can expose the behavior of both approaches in a more obvious way.

I suspect that even if an implementation were to utilize both APIs, you could probably figure out a definition to classify them, most likely around how they behave regarding context interop, and whether the current CUDA device from cudaSetDevice() plays a role.
For example, hipSYCL already uses some driver API components for manual management of PTX modules, but still follows the interop behavior of the runtime API-approach.
Do you think there are use cases where such a classification is not possible?

Also, I wonder: Do we believe that there will actually be a use case to invoke make_context with the same context obtained from get_native()? Or will having this query for runtime API backends just tempt users to do things that are unsupported?
My guess is that the appropriate approach to backend interop might be quite different for a runtime and driver API-based implementation. I guess the core question is: Can we have backend interop in client code that works well across both approaches, or would client code need to be structured differently anyway? In the latter case, it might not be useful to stick to a common backend interop interface.

Contributor

I wonder if it might be easier to understand and clearer if the backend specification just had two flavors, or profiles, so that we can expose the behavior of both approaches in a more obvious way.

Just throwing this out there ... it is also an option to have two different CUDA backends. This means:

  • There would be two different enumerators in enum backend.
  • Each backend would be responsible for defining its own interop specification (if it wanted to support interop).

Having a single backend interop specification is nice, but only if developers can reasonably write interop code that is portable across both CUDA backends. If we think the two CUDA backends are just too different, it might be easier for users if each one had its own separate interop specification.

Collaborator Author

@illuhad @gmlueck I think you make a good point, I am wary of introducing different profiles of the CUDA backend specification if possible, I think it would be valuable to have a single specification that all implementations can follow, though I recognize this may not be possible if there are cases we can't handle generically.

I've clarified the wording to state that both APIs are supported. The biggest concern there was regarding the context and whether the primary context is used. I've added a note that the CUcontexts that are returned from interop can include the primary context, and in fact, this is now what the DPC++ CUDA backend does. I suspect we may still need a query for whether the implementation is using the CUDA Driver API or CUDA Runtime API, but I'd like to avoid this if possible to avoid having any bifurcation of the specification.

Also, I wonder: Do we believe that there will actually be a use case to invoke make_context with the same context obtained from get_native()? Or will having this query for runtime API backends just tempt users to do things that are unsupported?

This is a good point; I suspect we don't want to support this, as it would likely cause havoc with the ownership model.

steffenlarsen pushed a commit to intel/llvm that referenced this pull request Jun 20, 2022
This PR adds part of the CUDA backend spec interop proposed in KhronosGroup/SYCL-Docs#197. The changes work with the CUDA CTS interop checks in KhronosGroup/SYCL-CTS#336.

This PR just adds the event interop. 

llvm-test-suite: intel/llvm-test-suite#1053
AerialMantis and others added 7 commits June 22, 2022 10:16
…fication'

Use device version to report compute capabilities

See merge request oneapi-core/SYCL-Docs-mirror!23
…pecification'

Clarify active contexts and devices

See merge request oneapi-core/SYCL-Docs-mirror!28
…ation'

Remove is_backend_active

See merge request oneapi-core/SYCL-Docs-mirror!29
# Conflicts:
#   adoc/chapters/cuda_backend.adoc
…ackend-specification'

Updating section on index flipping

See merge request oneapi-core/SYCL-Docs-mirror!27
# Conflicts:
#   adoc/chapters/cuda_backend.adoc
updated MR for cuda backend spec

See merge request oneapi-core/SYCL-Docs-mirror!31
@fraggamuffin

still need to discuss how to handle context interop

steffenlarsen pushed a commit to intel/llvm that referenced this pull request Sep 6, 2022
This patch fixes: #6635

In #6483, the implementation of `get_native` for device for the CUDA plugin was mistakenly moved to the experimental interface header, and so it was no longer available for the regular interface, causing build issues.

For the CUDA plugin there are currently two interfaces for the CUDA interop: the "legacy" one, which is used by projects such as oneMKL and oneDNN, and the "experimental" one, defined in the `sycl/ext/oneapi/experimental/backend/cuda.hpp` header, which implements the interop as described in the CUDA backend specification proposed here: KhronosGroup/SYCL-Docs#197
[width="100%",options="header",cols="40%,30%,30%"]
|====
| SYCL shared USM advice | CUDA managed memory advice | Processor the advice is set for
| [code]#sycl::cuda::advice::cuda_mem_advise_set_read_mostly# | [code]#cudaMemAdviseSetReadMostly# | device associated with the queue/handler
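For illustration, usage through the standard `mem_advise` entry point might look like the following sketch; the `sycl::cuda::advice` enumerator is taken from the table above and, as the comment below notes, differs from what DPC++ currently implements.

```c++
#include <sycl/sycl.hpp>

void advise_read_mostly(sycl::queue &q, void *sharedAllocation, size_t bytes) {
  // SYCL 2020 mem_advise takes an int whose meaning is backend-specific;
  // here it carries the CUDA managed-memory advice from the table above.
  q.mem_advise(sharedAllocation, bytes,
               static_cast<int>(sycl::cuda::advice::cuda_mem_advise_set_read_mostly));
}
```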
Contributor

@AerialMantis FYI these advise names seem to be different than those used in DPC++: https://github.com/t4c1/llvm-test-suite/blob/1a5a4e9d72cccf290d88f2498c08314d8bfb271e/SYCL/USM/memadvise_cuda.cpp#L39

Do we need to update one or other of the documentation / dpc++ impl?

Collaborator Author

Good catch, these are different, but I think we may want to update DPC++ to match what we have in here. I have also updated the definitions here to follow the correct extension convention.

YuriPlyakhin pushed a commit to oneapi-src/SYCLomatic that referenced this pull request Sep 30, 2022
* [SYCL][CUDA] Fix get_native interop for device (#6649)

This patch fixes: https://github.com/intel/llvm/issues/6635

In https://github.com/intel/llvm/pull/6483, the implementation of `get_native` for device for the CUDA plugin was mistakenly moved to the experimental interface header, and so it was no longer available for the regular interface, causing build issues.

For the CUDA plugin there are currently two interfaces for the CUDA interop: the "legacy" one, which is used by projects such as oneMKL and oneDNN, and the "experimental" one, defined in the `sycl/ext/oneapi/experimental/backend/cuda.hpp` header, which implements the interop as described in the CUDA backend specification proposed here: https://github.com/KhronosGroup/SYCL-Docs/pull/197

* [llvm/CodeGen] Enable the ExpandLargeDivRem pass for X86, Arm and AArch64

This adds the ExpandLargeDivRem to the default pass pipeline.
The limit at which it expands div/rem instructions is configured
via a new TargetTransformInfo hook (default: no expansion)
X86, Arm and AArch64 backends implement this hook to expand div/rem
instructions with more than 128 bits.

Differential Revision: https://reviews.llvm.org/D130076

* [SelectionDAG] Rename CallSiteDbgInfo to NodeExtraInfo

For information infrequently attached to SDNodes, it is useful to
provide a way to add this information out-of-line. This is already done
for call-site specific information.

Rename CallSiteDbgInfo to NodeExtraInfo in preparation of adding
additional information not necessarily related to call sites only.

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D130880

* [SelectionDAG] Properly copy ExtraInfo on RAUW

During SelectionDAG legalization SDNodes with associated extra info may
be replaced with a new SDNode. Preserve associated extra info on
ReplaceAllUsesWith and remove entries in DeallocateNode.

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D130881

* Add parantheses to silence warning.

* [AArch64] Additional tests for sinking splats to muls. NFC

* Fix "[llvm/CodeGen] Enable the ExpandLargeDivRem pass for X86, Arm and AArch64" compilation on Windows

* Fix AMDGPU test failures due to "[llvm/CodeGen] Enable the ExpandLargeDivRem pass for X86, Arm and AArch64"

* [tsan] Replace mem intrinsics with calls to interceptors

After https://reviews.llvm.org/rG463aa814182a23 tsan replaces llvm
intrinsics with calls to glibc functions. However this approach is
fragile, as slight changes in pipeline can return llvm intrinsics back.
In particular InstCombine can do that.

Msan/Asan already declare own version of these memory
functions for the similar purpose.

KCSAN, or anything that uses something else than compiler-rt, needs to
implement this callbacks.

Reviewed By: melver

Differential Revision: https://reviews.llvm.org/D133268

* Fix remaining test failures for "[llvm/CodeGen] Enable the ExpandLargeDivRem pass for X86, Arm and AArch64"

* [gn build] port 5dbc7cf7cac44

* [bazel] port 5dbc7cf7cac44

* Revert "[lldb][bindings] Fix module_access handling of regex"

This reverts commit 75f05fccbbdd91393bdc7b6183b9dd2b1e859f8e.

This commit broke the windows lldb bot: https://lab.llvm.org/buildbot/#/builders/83/builds/23284

* Fix OpenMP Opt for target without a parallel region.

Remove ctx redeclaration.

Format code.

Remove parallel check. Modify tests. Clean-up code.

Fix another test.

Move code to helper functions.

Format file.

Minor fixes.

* [InstCombine] add tests for add of select with 0 and negate arms; NFC

* [InstCombine] add/move tests for add with select operands that simplify; NFC

* [InstCombine] remove dead code for add (select cond, (sub), 0); NFC

This pattern is handled more generally in SimplifySelectsFeedingBinaryOp().
Tests confirming that were added to the add.ll test file in the previous commit.

* Add docs for Mach-O lld

I wasn't able to find any docs for Mach-O in `lld/docs`, so here's an attempt at adding basic docs. One of my goals here is to make it easy for users who are unfamiliar with linkers to successfully use lld.

Reviewed By: #lld-macho, int3

Differential Revision: https://reviews.llvm.org/D132893

* [CostModel][X86] Add CostKinds handling for ctpop ops

This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (although it still struggles with avx512 predicate numbers, which had to be done manually).

Some of the pre-AVX values still aren't great - atom/slm worst case numbers for ctpop expansion really affect these (especially throughput/latency), so we need to clean them up in a more consistent way - it's a pity we don't have models for more of the older cpus (merom/nehalem etc.) as other examples.

* [clang] fix profiling of template arguments of template and declaration kind

Template arguments of template and declaration kind were being profiled
only by their canonical properties, which would cause incorrect
uniquing of constrained AutoTypes, leading to a crash in some cases.

This exposed some places in CheckTemplateArgumentList where non-canonical
arguments were being pushed into the resulting converted list.

We also throw in some asserts to catch these cases early and explain the crashes.

Note that the fix for the 'declaration' kind is untestable at this point,
because there should be no cases right now in the AST where we try
to unique a non-canonical converted template argument.

This fixes GH55567.

Signed-off-by: Matheus Izvekov <[email protected]>

Differential Revision: https://reviews.llvm.org/D133072

* [RISCV] Improve vector fround lowering by changing FRM.

This is a follow up to D133238 which did this for ceil/floor.

Reviewed By: arcbbb, frasercrmck

Differential Revision: https://reviews.llvm.org/D133335

* [mlir][sparse] codegen for sparse alloc

Reviewed By: Peiming

Differential Revision: https://reviews.llvm.org/D133241

* Revert "[tsan] Replace mem intrinsics with calls to interceptors"

Breaks
http://45.33.8.238/macm1/43944/step_4.txt
https://lab.llvm.org/buildbot/#/builders/70/builds/26926

This reverts commit 77654a65a373da9c4829de821e7b393ea811ee40.

* [mlir][sparse] Refactoring: renaming StorageNewOp to StorageOp

To address comment in https://reviews.llvm.org/D133241

Reviewed By: aartbik

Differential Revision: https://reviews.llvm.org/D133363

* [ConstraintElimination] Replace pair with named struct (NFC).

This slightly improves the readability and allows further extensions in
follow-ups.

* [libc++] Avoid instantiating type_trait classes

Use `using` aliases to avoid instantiating lots of types

Reviewed By: ldionne, #libc

Spies: libcxx-commits, miyuki

Differential Revision: https://reviews.llvm.org/D132785

* [AArch64] Add an option to reserve physical registers from RA

This patch adds an option, --reserve-regs-for-regalloc, so we can reserve a list
of physical registers. These registers will not be used by the register allocator,
but can still be used for ABI requirements such as passing arguments to function
calls.

Its main purpose is simulating high register pressure by reserving many physical
registers, which makes it much easier to test and debug register allocation
changes.

Differential Revision: https://reviews.llvm.org/D132717

* Revert "[SCCP] convert signed div/rem to unsigned for non-negative operands"

This reverts commit fe1f3cfc2669aca387a45c8ce615b45c1db50fc6.

It looks like this commit breaks building llvm-test-suite.

To reproduce, run `opt -passes=ipsccp` on the IR below.

    @g = internal global i32 256, align 4

    define void @test() {
    entry:
      %0 = load i32, ptr @g, align 4
      %div = sdiv i32 %0, undef
      ret void
    }

* [clang] Implement setting crash_diagnostics_dir through env variable

This implements setting the equivalent of `-fcrash-diagnostics-dir`
through the environment variable `CLANG_CRASH_DIAGNOSTICS_DIR`.
If present, the flag still takes precedence.

This helps integration with test frameworks and pipelines.

With this feature, we change the libcxx bootstrapping build
pipeline to produce clang crash reproducers as artifacts.

Signed-off-by: Matheus Izvekov <[email protected]>

Differential Revision: https://reviews.llvm.org/D133082

* [mlir][spirv] Add base classes for vendor ops

This is the first patch in the series to rename vendor ops from
`spv.NameVENDOR` to `spv.VENDOR.Name`. The goal is to make the SPIR-V
dialect more internally consistent.

Issue: https://github.com/llvm/llvm-project/issues/56863

* [mlir][spirv] Refactor vendor op definitions

Use dedicated vendor op classes/categories. This is so that we can later
change the mnemonics of all vendor ops by changing the base class: `SPV_VendorOp`.

Issue: https://github.com/llvm/llvm-project/issues/56863

* [mlir][spirv] Change vendor op mnemonics to `spv.VENDOR.name`

Make vendor ops more consistent with the naming scheme within the SPIR-V
dialect.

Issue: https://github.com/llvm/llvm-project/issues/56863

Reviewed By: antiagainst

Differential Revision: https://reviews.llvm.org/D133247

* [mlir] Make bit enum operators constexpr

This allows using the | operator on the values of enum attributes
in compile-time constants.

Reviewed By: antiagainst

Differential Revision: https://reviews.llvm.org/D133159

* [lldb] Fix CommunicationKDP following D133251

Add `m_bytes` and `m_bytes_mutex` to `CommunicationKDP`, following refactoring
in D133251.

Differential Revision: https://reviews.llvm.org/D133365

* [ADT] Remove is_splat

`is_splat` was superseded by `all_equal` and marked as deprecated.
See the discussion thread for more details:
https://discourse.llvm.org/t/adt-is-splat-and-empty-ranges/64692
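
A minimal sketch of the migration, assuming a caller that previously used `is_splat` (the helper function shown is illustrative):

    #include "llvm/ADT/STLExtras.h"
    #include <vector>

    bool allSame(const std::vector<int> &Vals) {
      // Previously: return llvm::is_splat(Vals);
      return llvm::all_equal(Vals); // true when every element compares equal
    }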

Reviewed By: dblaikie

Differential Revision: https://reviews.llvm.org/D132336

* [OpenMP] Fix `omp_get_wtime` function being marked incorrectly as readonly

OpenMP has a list of optimistic attributes that can be attached to
known runtime functions to aid some analysis. The `omp_get_wtime`
function incorrectly used the `readonly` attribute. This is not correct
as the `omp_get_wtime` function changes values depending on some
external state. This is more correctly modeled with
`inaccessiblememonly`, meaning that the value does not depend on anything
within the module, but cannot be removed as it depends on external
state.

Fixes #57578

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D133360

* [SYCL] Fix accessor's CTAD for g++ host compiler (#6673)

* [SYCL] Replace hardcoded namespaces with attribute (#6674)

Namespaces were hardcoded and used in the compiler to
check for various SYCL types including accessors,
spec_constants, etc. This patch implements an
attribute to uniquely identify the types instead.
The attribute argument is an identifier which denotes
each type.

E.g. __attribute__((sycl_type(accessor))) is used
to mark the accessor class.

The attribute has been implemented with an
accepted list of arguments via EnumArg. The attribute
definition should be updated to support any new types.

The attribute takes one argument.
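
A hypothetical sketch of how a header class could be marked, based on the accessor example above (the class shown is illustrative, not the actual header code):

    template <typename DataT, int Dimensions>
    class __attribute__((sycl_type(accessor))) accessor {
      // ... regular accessor members ...
    };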

Fixes: https://github.com/intel/llvm/issues/5186

Signed-off-by: Elizabeth Andrews <[email protected]>

* [MC] Emit Stackmaps before debug info

This patch is essentially an alternative to https://reviews.llvm.org/D75836 and was mentioned by @lhames in a comment.

The gist of the issue is that Mach-O has restrictions on which kinds of sections are allowed after debug info has been emitted, which is also properly asserted within LLVM. The problem is that stack maps are currently emitted as one of the last sections in each target-specific AsmPrinter so far, which would cause the assertion to trigger. The current approach of special casing for the `__LLVM_STACKMAPS` section is not viable either, as downstream users can overwrite the stackmap format using plugins, which may want to use different sections.

This patch fixes the issue by emitting the stack map earlier, right before debug info is emitted. The way this is implemented is by taking the choice of when to emit the StackMap away from the target AsmPrinter and doing so in the base class. The only disadvantage of this approach is that the `StackMaps` member is now part of the base class, even for targets that do not support them. This is functionally not a problem however, as emitting an empty `StackMaps` is a no-op.

Differential Revision: https://reviews.llvm.org/D132708

* [OpenMP] Remove use of removed '-f[no-]openmp-new-driver' flag

The changes in D130020 removed all support for the old method of
compiling OpenMP offloading programs. This means that
`-fopenmp-new-driver` has no effect and `-fno-openmp-new-driver` does
not work. This patch removes the use and documentation of this flag.
Note that the `--offload-new-driver` flag still exists for using the new
driver optionally with CUDA and HIP.

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D133367

* [SYCL][ABI-Break] Remove deprecated OptionalDevice input type of make_queue and remove deprecated make_queue ABIs (#6628)

We introduced a tentative OptionalDevice input type for the make_queue API to support the SYCL 2020 API as well as the legacy API, which did not require a Device to be passed as a parameter.
This PR intends to remove this tentative input type.
Extended: this PR now also removes the deprecated make_queue ABIs and makes make_queue take a pointer to pi_device, which becomes optional again.

Signed-off-by: Byoungro So <[email protected]>

* [SYCL] Emit an error on attempt to compile in less than C++17 mode (#6678)

* [CI][NFC] Rename workflow job (#6572)

Just pure renaming of "resolve test matrix" to "generate test matrix"
Follow up on a comment made at https://github.com/intel/llvm/pull/6528

* [CI][NFC] Make lint for commits in PR change only  (#6722)

Make lint check only the changes in the PR's commits, without taking into
account the sycl branch HEAD. This should fix issues like
https://github.com/intel/llvm/pull/6705, where lint reported errors for
files not affected by the PR. To unblock PRs currently stuck because of this,
they need to use the ignore-lint tag until the PR is rebased on the devops
directory changes made here.

* [CI][NFC] Fixed GFX driver update PR (#6723)

The generate test matrix job now correctly uses
github.event.pull_request.head.sha, which is required for
pull_request_target.

* [SYCL] Improve sycl-post-link performance with -split=kernel (#6689)

Right now we are computing a new callgraph in every call to
extractCallGraph. extractCallGraph is called every time we do a module
split, so for -split=kernel, that would be once per kernel. For modules
with many kernels, this can take a very long time. We only need to
compute this once because the input IR doesn't seem to change between
splits.

This improves performance of sycl-post-link from ~45min to ~7min for an
example with 13k kernels

Signed-off-by: Sarnie, Nick <[email protected]>

* [SYCL] Deprecate SYCL 1.2.1 device selectors (#6599)

Now that we have added the SYCL 2020 callable device selectors, we need
to prepare for the removal of the older SYCL 1.2.1 `device_selector`
class. The first step is to add the deprecation message to the 1.2.1
style device selectors, which this PR does. It also removes the usage of
those from our own codebase so as to not trip on our own messages in the
future.

Signed-off-by: Chris Perkins <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Co-authored-by: Larsen, Steffen <[email protected]>

* [SYCL][CUDA][HIP] Add support for intel extension free-memory (#6709)

The support is added for NVIDIA and AMD devices.

Co-authored-by: Abhishek Bagusetty <[email protected]>

* [SYCL][Windows] Improve windows sycl.lib linking (#6699)

This patch does two things. First, it makes `-fsycl` ignore `-nostdlib`
when linking the SYCL library. This is necessary because for Clang on
Windows CMake will generate link commands using `-nostdlib` and
explicitly list the system libraries, but of course it doesn't do so for
SYCL, so we currently end up never linking the SYCL library when this is
used.

Ignoring `-nostdlib` for `-fsycl` on Windows seems like a reasonable
solution, as this is also what is done for the OpenMP runtime
libraries.

See the CMake module:
*
https://github.com/Kitware/CMake/blob/aa2de7cd2a04699744a224ab84e0ca483559c5d3/Modules/Platform/Windows-Clang.cmake#L79

In addition, this patch adds a linker parameter to help clang find
the `sycl.lib` file without requiring users to tweak their environments
to link against it.

* [SYCL][Windows] Enable building lld by default on Windows (#6701)

The Windows-Clang CMake module uses `lld-link` by default, so
having lld built on windows is helpful for people wanting to build SYCL
applications using CMake.

Related to: https://github.com/intel/llvm/issues/6026

* [SYCL] Allow builtin_assume_aligned to be called from device code. (#6705)

Prior to this PR, if there is a direct call to __builtin_assume_aligned
from within the device code, we would get the following error:
`error: SYCL kernel cannot call a variadic function`

This PR allows `__builtin_assume_aligned` to be invoked from within
device code.
The reason for doing so is that, for example, if we load a float* pointer
from a data structure, the compiler has no way of knowing the underlying
alignment of the data the pointer points to, and therefore it cannot
combine the loads/stores; one ends up with many unnecessary load/store
instructions, lots of unnecessary pointer arithmetic, and increased
register pressure. We see this all over the place in our kernels. With
`__builtin_assume_aligned` we can tell the compiler to safely assume a
certain alignment, therefore implicitly forcing coalescing.
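
A minimal sketch of the pattern this enables in device code (the 32-byte alignment and the function shown are illustrative):

    // Tell the compiler the pointer is 32-byte aligned so it can coalesce
    // the subsequent loads and stores.
    void scale(const float *in, float *out, int n) {
      const float *alignedIn =
          static_cast<const float *>(__builtin_assume_aligned(in, 32));
      for (int i = 0; i < n; ++i)
        out[i] = alignedIn[i] * 2.0f;
    }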

* [SYCL][NFC] Remove unnecessary NULL check for 'Fn' (#6726)

Klocwork thinks that it is possible for 'Fn' to be a nullptr due to this
check but there's already an assert at the beginning of 
CodeGenFunction::GenerateCode to avoid that possibility.

* [SYCL] Fix device comparison in removeDuplicateDevices (#6730)

Kernel bundles attempt to remove duplicate devices from a passed device
list. This is done through the `removeDuplicateDevices` function which
creates a set through comparing devices by their native handles.
However, the `getNative` member function on `device_impl` used to get
these handles will retain the native devices if the backend is OpenCL.
For root devices this will not have an effect, but on sub-devices this
can potentially lead to a leak of the devices. As a fix this commit
compares the PI devices rather than the native handles.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Remove cl namespace deprecation warning (#6735)

According to the SYCL 2020 specification, CL/sycl.hpp should supply the
cl prefix namespace for backwards compatibility, but it does not specify
that the namespace is deprecated. This commit removes the deprecation
message.

Signed-off-by: Larsen, Steffen <[email protected]>

* [GHA] Uplift Linux GPU RT version to 22.35.24055 (#6704)

Scheduled drivers uplift

Co-authored-by: GitHub Actions <[email protected]>

* [BuildBot] Uplift GPU RT version for Linux CI Process (#6697)

Uplift GPU RT version for Linux to 22.35.24055

Signed-off-by: bb-sycl <[email protected]>

* Update test to match -fvisibility=arg option requirements (#7098)

* [SYCL][ESIMD] Introduce predicates for lsc_block_store/load (#6688)

* Update for LLVM Optional API changes

Update for LLVM commit b5f8d42efe3e ("[ADT] Deprecate
Optional::{hasValue,getValue} (NFC)", 2022-08-07).

This is a mechanical replacement of `hasValue` to `has_value` and
`getValue` to `value`.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/ffeb4df

* Restore getArgIndex in OCLTypeToSPIRV (#1567)

* Update for LLVM Optional API changes

Update for LLVM commit b5f8d42efe3e ("[ADT] Deprecate
Optional::{hasValue,getValue} (NFC)", 2022-08-07).

This is a mechanical replacement of `hasValue` to `has_value` and
`getValue` to `value`.

* Restore argument tracing in OCLTypeToSPIRV

Commit 4a9c78ee ("Prepare SPIRVWriter for type conversion without
opaque pointers. (#1499)", 2022-06-20) removed `getArgIndex` in favor
of passing `Idx`, but this leads to incorrect adaptation of argument
types if sampler arguments are not in the same position between
different functions.

We might be able to drop `adaptArgumentsBySamplerUse` entirely, as we
don't represent samplers as i32 anymore; but for now just fix the
regression.

Fixes https://github.com/KhronosGroup/SPIRV-LLVM-Translator/issues/1562

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/87f8a58

* Move to C++17

LLVM has switched to C++17 with commit b1356504e63a ("[LLVM] Update
C++ standard to 17", 2022-08-06).

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/75d16c2

* .clang-tidy: temporarily disable misc-const-correctness

This is a relatively new check added to clang-tidy by 46ae26e7eb70
("[clang-tidy] implement new check 'misc-const-correctness' to add
'const' to unmodified variables", 2022-07-24).  Currently the code
base doesn't follow the practice of declaring variables `const`
where possible, which makes the check quite noisy, so disable it for
now.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/4dd494e

* Drop the JointMatrixINTEL struct-renaming pass when opaque pointers are enabled. (#1570)

The frontend is being changed to lower the struct name to the correct LLVM name
directly, obviating the need for this check. See
https://github.com/intel/llvm/pull/6535 for this change.

This marks the removal of the final call to the deprecated method
Type::getPointerElementType, although there remains some code that is not fully
working with opaque pointers enabled.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/e2bb479

* Use TypedPointerType in lieu of PointerIndirectPair in mangleBuiltin. (#1568)

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/42cf770

* Remove 2 unused SPCV_ macros

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/160f013

* Upgrade to Ubuntu 20.04

The Ubuntu 18.04 image is marked deprecated [1], so move to a newer
image.

[1] https://github.com/actions/runner-images

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/20a1fd7

* Fix a -Wmaybe-uninitialized warning

`MDWrapper::get` may not fill its argument in error cases, so
initialize `Arg` to an all-ones value.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/2c247b7

* Dot product bugfix to include more floating point types (#1578)

Switched the visitCallDot check to use isFloatingPointTy for scalar
floating point operands. Bugfix for previous change regarding
integer dot product.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/71e01b5

* Mass add -emit-opaque-pointers for tests that don't require changes.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/1ea77d2

* Add support for toolchain compilation with LLVM_LINK_LLVM_DYLIB option (#1543)

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/40fd741

* Update SPV_INTEL_vector_compute to rev 15

This adds NamedBarrierCountINTEL Execution Mode,
see more in https://github.com/intel/llvm/pull/1612

Co-authored-by: nrudenko <[email protected]>

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/21c7a30

* Remove ExecutionModeNamedBarrierCountINTEL from spirv_internal (#1604)

It was actually upstreamed to SPIR-V Headers already.

Signed-off-by: Sidorov, Dmitry <[email protected]>

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/3335c25

* Mass add -emit-opaque-pointers for tests that require some changes.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/6e15642

* Implement SPIR-V support for max_reinvocation_delay attribute (#1594)

The attribute generates the llvm.loop.intel.max_reinvocation_delay.count metadata in LLVM IR.
There is one positive integer literal operand denoting the maximum number of cycles allowed
between loop invocations.

Spec: KhronosGroup/SPIRV-Registry#163

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/d4ec010

* Fix type scavenger for variable arguments and multiple-uses-of-types cases. (#1606)

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/401d124

* Add support for split barriers extension SPV_INTEL_split_barrier (#1424)

This PR adds support for split barriers and the SPV_INTEL_split_barrier extension.

The related SPIR-V extension spec can be found here:

* https://github.com/KhronosGroup/SPIRV-Registry/pull/136

The related OpenCL C extension spec can be found here:

* https://github.com/KhronosGroup/OpenCL-Docs/pull/765

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/e3cd296

* Reinstate deprecated-declarations warning as error (#1609)

This reverts commit 55d1de820841e6d9d1c6ca0cd534323d69a1cbf1.

Now that the last use of the deprecated `getPointerElementType` has
been removed, treat uses of deprecated functions as errors again.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/6811488

* Rework the demangler to support recovering more pointer element types. (#1556)

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/b412ae5

* Translate function pointer from global variable as pointer, not as declaration (#1608)

This patch helps to avoid invalid SPV generation.

When a global variable contains a pointer to a function, the translator tries to
translate it as a declaration. Then it translates this function a second time
when going through the function list. This leads to double translation
of the same function and to the use of the same IDs in the SPIR-V file.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/2b4ce42

* Add SPV_INTEL_masked_gather_scatter extension (#1580)

This extension allows TypeVector to have a Physical Pointer Type
Component Type and introduces gather/scatter instructions.
It will be useful for explicitly vectorized kernels.

Spec: https://github.com/intel/llvm/pull/6613

Signed-off-by: Sidorov, Dmitry <[email protected]>

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/49b08e8

* [SYCL][Doc] Remove stale SPIR-V extension spec (#6741)

This SPIR-V extension was promoted to a KHR, and that specification is
on the SPIR-V registry. The DPC++ compiler uses the KHR version (not
this INTEL version), so this old specification can be removed.

* [SYCL][CUDA] Make `piextKernelSetArgMemObj` set an error message (#6521)

This patch makes `cuda_piextKernelSetArgMemObj` set an error message
instead of calling `std::terminate` in case the image format is not
supported. This error message is encapsulated in an exception thrown by
the RT.

This allows the SYCL-CTS execution to continue for tests using
unsupported channel types, see
https://github.com/intel/llvm/issues/2119#issuecomment-1201548912.

* [SYCL][Doc] Add sycl complex to complex algorithms extension (#6717)

This PR extends the complex algorithms extension to support
`sycl::ext::oneapi::complex` and `marray<sycl::ext::oneapi::complex>`.
Additionally it adds the `multiplies` operator as a valid binary
operation for complex values when reducing and scanning across work
items. This PR has a dependency upon #6550.

* [SYCL][NFC] Add SYCLPropagateAspectsUsage pass (#6670)

Added a pass which is part of the optional kernel features design: it uses
information provided by the FE & headers about aspects used in device code
to propagate it through the call graph and mark all kernels and functions
with the list of aspects they use.

Co-authored-by: Maksim Sabianin <[email protected]>

* [NFC] Fix the expected assert message in LIT test gather_scatter_rgba.cpp

Signed-off-by: Vyacheslav N Klochkov <[email protected]>

* [SYCL] Make intel specific device info descriptors namespace qualified (#6639)

Conform to SYCL 2020 specification sections 6.3.1 and 4.6.4.2 by making
extension information descriptors templated and placing them in the correct
namespace.

- Also moved deprecated info descriptors for device into a separate file

- Changed the namespace of the recently added [device memory
extension](https://github.com/intel/llvm/pull/6604) to
ext::intel::info::device

Signed-off-by: Rauf, Rana <[email protected]>
Co-authored-by: Steffen Larsen <[email protected]>

* [SYCL] Fix USM free for descendent devices (#6733)

Now that use of descendent devices of context members is supported, we
cannot rely on always choosing the only device in single-device
contexts. Remove this branch to always fetch the device from the
platform.

* [SYCL] Test must pass -opaque-pointers explicitly

Not yet the default for sycl.

* Add opaque pointers switch to llvm/test/Verifier/dllstorage.ll

* Remove clang/test/Driver/openmp-sycl-interop.c test from testing.
The test is specific to behaviors involving SYCL and OpenMP offloading.
Since OpenMP offload has moved to the new offloading model, the test is
not relevant anymore.

* [SYCL] Pass /Zc:__cplusplus in -fsycl-host-compiler-options in some tests (#6751)

By default MSVC reports 199711L as the standard being used and needs
that option to report C++ version properly. This fixes current post-commit
failures on the tests modified.

* Delete obsolete Clang::Driver tests

* Revert "Delete obsolete Clang::Driver tests"

This reverts commit 84be9c2cb06b5c98225ae0da80e32aa161483187.

* The XFAIL tests are specific to the old OpenMP offloading model, which was removed.

* Disable sycl-libspirv for amdgcn-amd-amdhsa target in clang/test/Driver/sycl-amdgcn-sqrt.cpp

* [SYCL][ABI-Break][NFC] Remove unusued members (#6731)

This commit removes the following:

- The unused member function `kernel_bundle::join_impl` overload.
- The unused member variable `Offset` from `stream`.
- The unused member function `buffer_impl::constructorNotification`
overload.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Fix memory leak in program link (#6641)

This PR fixes a memory leak caused by a missing call to program
release.

Signed-off-by: Byoungro So <[email protected]>

* [SYCL] Remove mentions of host device from in-tree LIT tests (#6683)

* [ESIMD] Fix invoke_simd calls case with pointer passed to it (#6696)

The helper function created during translation of invoke_simd must
accept a pointer to a function, not a reference to a pointer to a function.
That additional level of indirection is automatically resolved by the compiler
for invoke_simd, but needs to be manually resolved/adjusted for the helper
function.

Signed-off-by: Vyacheslav N Klochkov <[email protected]>

* [SYCL] Remove CG/handler extended members mechanism (#6759)

Now that the extended members have been promoted to proper fields of
CG/handler classes, the extended member mechanism can be removed until
it's needed again.

* Suppress some clang-tidy errors

Split the .clang-tidy check lists out over multiple lines to improve
readability.

Suppress `misc-non-private-member-variables-in-classes` as the code
currently contains many instances that fail this check.

Drop `constexpr` from `LoopControlLoopCountINTELMask` after clang
started diagnosing this with b36453530418 ("[Clang] Diagnose
ill-formed constant expression when setting ...", 2022-07-28).
Removing `constexpr` is just a workaround, the long term fix would be
to upstream the new enum value to `spirv.hpp`.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/c5b29f2

* Update LLVM version from 15 to 16

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/7aa1fd2

* [SYCL] Assign code owners for invoke_simd.hpp and simd.hpp. (#6746)

* [SYCL] Fix memory leak (queue_impl)  due to #5901 (#6707)

The cross dependency between event_impl and queue_impl prevented the objects from being released.
event_impl now holds only a weak pointer to the queue.

Signed-off-by: Tikhomirova, Kseniya <[email protected]>

* Fix the Clang sphinx build bot; NFC

This should address the failure introduced in:
https://lab.llvm.org/buildbot/#/builders/92/builds/32377

* [SYCL] Move accessor_impl to source directory (#6698)

Moving the impl part of the host accessor implementation to avoid exposing
implementation details in the headers. This allows for more changes
in accessor without breaking the ABI.

Also updated the gdb xmethods, since they were relying on
impl details which are no longer available to gdb (unless libsycl.so is
built with debug symbols). Instead of accessing members of the
impl directly, the gdb printers now access helper methods. To prevent
the compiler from discarding these methods, there are dummy references which
are active when NDEBUG is not defined.

* [SYCL] Silence -Wctad-maybe-unsupported warning

This might be an FE bug. Will follow up separately. Workaround should be
fine to unblock pulldown.

* [sycl-bisect] Add missing exit (#6771)

This script has been missing one call to exit, which causes control to
fall through and run successful tests a second time when using --command
without --command-allow-bisect-codes. This didn't change the results of
the script, but it did make it slower if the test command takes a while
to run.

* [ESIMD] Change LSC API to improve template argument type deduction (#6764)

* [CI] Fix dependabot alert (#6725)

* [SYCL] More changes to silence -Wctad-maybe-unsupported

Mutex changes could be reverted if/when this change lands:
https://reviews.llvm.org/D133425

* [SYCL][ABI-Break] Merge DeviceBinaryImage and RTDeviceBinaryImage (#6768)

To avoid future issues with ABI for DeviceBinaryImage, this commit makes
the following changes:
* Moves RTDeviceBinaryImage to a source header.
* Merges DeviceBinaryImage into RTDeviceBinaryImage.
* Promotes the common property ranges from DeviceBinaryImage into new
members.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Fix get_pointer_device for cases with descendent devices (#6719)

Looking through context members alone when searching for a specific
device isn't enough anymore since now descendent devices of context
members can be used within that context as well. Change the logic to
look for the device in the cache instead.

* [Matrix][SYCL] Add use argument for joint_matrix and add another feature macro for it (#5835)

* [SYCL] Remove unused argument from getDeviceImpl (#6780)

* [SYCL][Windows] Fix debug build in non cl mode  (#6721)

This fixes building SYCL programs in Debug mode with
`Windows-Clang.cmake`.

The issue is that the code was using `OPT__SLASH_MDd` to select
`sycld.lib`, but under the `!C.getDriver.IsCLMode()` condition this flag
will never be set; `OPT_g_Flag` should be used instead (`-g` rather than
`/MDd`).

Note that using the regular `clang` command line to manually build with
`-g` still doesn't work as it will link against `msvcrt` rather than
`msvcrtd` and will miss required defines for debug builds on Windows
(`_DEBUG`). This is correctly done by the CMake module or simply when
using `clang-cl`.

* [SYCL][DOC] Add extension for FPGA kernel interface properties (#5715)

## This SYCL extension contains the following new kernel properties
- `streaming_interface<...>`
- `register_map_interface<...>`
The first two properties take enum arguments that provide the compiler with
information about whether the logic downstream of the kernel will
back-pressure the kernel or not.
- `pipelined<N>`
Takes an integer; non-zero values specify the minimum number of cycles between
kernel invocations, and 0 specifies that pipelining should be disabled.

Co-authored-by: GarveyJoe <[email protected]>

* [SYCL] Fix llvm.used removal when used with opaque pointers. (#6773)

The code previously assumed that all functions, when used in @llvm.used,
would be wrapped within a bitcast of <fnptr type> to i8*; with opaque
pointers, the values would be functions directly, causing a crash since
functions don't have any operands.

* [SYCL] Add MAJOR_VERSION to the name of the sycl library on Win (#6745)

* [SYCL] Silence -Wctad-maybe-unsupported for check-sycl

Add deduction guide to kernel_bundle

* [SYCL] Add --host-target flag and remove --arm flag (#6620)

This patch adds a new flag to `configure.py` to allow changing the host
target to build. This makes it easier to build on different hosts such
as ARM or PowerPC.

This patch also removes the `--arm` flag as it is now redundant,
`--host-target=ARM` or `--host-target=AArch64` should be used instead.

This is slightly different from the original `--arm` flag as it only
allows enabling one of the ARM platforms. But I'm not sure what the use
case was for enabling both, as only one will be the host platform for a
given build. And compilation time was originally given as a reason to drop
the X86 architecture, so only enabling the correct ARM architecture
should also help with that.

Co-authored-by: Alexey Bader <[email protected]>

* [NFC] Update sycl-force-target test to play nicely for 32-bit host (#6785)

Update the target for the test to use 64-bit to allow the test to
emit the expected device target when unbundling.

* [SYCL] Use copy engine for memory read/write operations (#6783)

In the present state of the L0 plugin, MemBufferWrite and MemBufferRead
operations are being executed using compute engines. This patch changes
this behavior for the operations to be executed using copy engines.
It is expected to improve performance.

Signed-off-by: Arvind Sudarsanam <[email protected]>

* [SYCL][FPGA] Support Intel FPGA simulator device selector (#6715)

Create **fpga_simulator_selector** to be used to select the FPGA
simulator device. This assumes that the user has compiled their program
with -Xssimulation (as well as -fintelfpga) to prepare for the
simulator.

An object of class fpga_simulator_selector must be constructed early in
the execution of the host program in order to make the simulator
available at runtime.
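
A hedged sketch of that usage, assuming the selector is provided by the `sycl/ext/intel/fpga_extensions.hpp` header:

    #include <sycl/sycl.hpp>
    #include <sycl/ext/intel/fpga_extensions.hpp>

    int main() {
      // Construct the simulator selector early so the simulator device is
      // available at runtime, then create a queue on it.
      sycl::ext::intel::fpga_simulator_selector selector;
      sycl::queue q{selector};
      // submit kernels to q ...
    }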

The current implementation has the effect that if an object of class
sycl::ext::intel::fpga_simulator_selector is defined, FPGA hardware
devices selected using sycl::ext::intel::fpga_selector will select
simulator devices. This will be documented in the release notes and the
specification. We expect this behaviour to be eliminated in the future.

* [SYCL][CUDA] Add support of CUDA XPTI tracing (#6373)

Fork of https://github.com/intel/llvm/pull/5797

This patch rebases and finalizes (similarly to
https://github.com/intel/llvm/pull/6023) the draft in
https://github.com/intel/llvm/pull/5797, which already contained the
most important commits thanks to @alexbatashev.

The most relevant additions of this patch were done in CMake files, in
particular
 
- switch from the deprecated `FindCUDA` CMake module to the
`FindCUDAToolkit` one in order to find the cupti library by means of
`CUDA_cupti_LIBRARY`. This is advisable because on some systems
`FindCUDA` fails to find `CUDA_cupti_LIBRARY`. This is also the case for
the CI, see the
[log](https://github.com/intel/llvm/runs/7115612243?check_suite_focus=true)
for when `FindCUDA` is used.
- find `generated_cuda_meta.h` for generating the CUDA printer
definitions, since the location of this header file seems to vary
depending on the system; if this file is not found, a warning is
printed and no errors are thrown.

Co-authored-by: Alexander Batashev <[email protected]>

* [SYCL] Add diagnostic test for global_variable_allowed attribute (#6777)

Add cases for this attribute independent of device_global

* [SYCL][ABI-Break] Implement property interface for local_accessor & usm_allocator (#6737)

* [SYCL][Windows] Fix DataMovement test (#6790)

Using `-g` with the regular `clang` command line is not supported on
Windows.

On Windows, `clang-cl` and `/MDd` should be used instead.

However it doesn't seem like this test is testing anything to do with debug
info and I couldn't find any reason for having it in the history, so
removing `-g` is the simplest solution to make the test work on both
Linux and Windows.

This should fix the post-commit issue that showed up in:
https://github.com/intel/llvm/pull/6721

* [SYCL][ABI-Break] Add SYCL 2020 kernel_device_specific::max_sub_group_size (#6782)

SYCL 2020 promotes the info::kernel_device_specific::max_sub_group_size
query on kernels, but removes the additional argument. This commit adds
an overload with no additional argument and deprecates the old variant.
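
A minimal sketch of the new overload, assuming `k` is a `sycl::kernel` obtained from a kernel bundle and `dev` is the target device:

    // SYCL 2020 query: no additional range argument.
    size_t maxSubGroupSize =
        k.get_info<sycl::info::kernel_device_specific::max_sub_group_size>(dev);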

* [SYCL][NFC] Add another option for the lambda size mismatch message (#6794)

* [SYCL] Introduce a fully-mocked PI plugin for unit tests (#6684)

The idea behind this PR is to introduce an infrastructure which allows
writing unit tests that are not dependent on the presence of actual
backends (like OpenCL, L0, etc.) or devices (like CPU, GPU, etc.).

Motivation for the patch: the host device is going to be removed and we have
a number of in-tree LIT tests which can't pass if there are no devices
available, so they are likely to be removed/disabled in the short term
(#6683) unless we move them into llvm-test-suite or into unit tests with the
new infrastructure suggested here.

Signed-off-by: Larsen, Steffen <[email protected]>
Co-authored-by: Sachkov, Alexey <[email protected]>

* [SYCL][ESIMD][EMU] Enable fp64/double type support (#6796)

* For PI_DEVICE_INFO_EXTENSIONS, 'piDeviceGetInfo' returns 'cl_khr_fp64'
as the type can be supported as a native 'double' type
* Fixes current pre-commit issue with unexpected pass on SYCL :: spec_const/spec_const_double.cpp

* [SYCL] Fix macro definition conflicting with MSVC (#6798)

The _CONCAT macro is defined in MSVC headers so to avoid redefinition
warnings this commit changes the naming in the PI mock plugin from
_CONCAT to _PI_MOCK_PLUGIN_CONCAT.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Update DPCPP library major version (#6801)

* [SYCL] Add assert for device_global without device_image_scope (#6791)

device_global is currently not fully supported, but for backends that
support it, it should be usable on device only when the
device_image_scope property is present. This commit adds a temporary
static assert to prevent the use of device_global without
device_image_scope until proper initialization has been implemented.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Fix unittests on MSVC (#6805)

This commit makes two changes:
* Fixes a cast of incompatible size in the PI mock plugin that caused a
warning on MSVC.
* Changes the definition of the captureless lambda function in the
RedefineAPI PiMock unittest to avoid MSVC considering it equal to the
function defined by the PI mock plugin.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL][Doc] Closing the ABI breaking changes window (#6800)

The next one is expected to be no earlier than June 2023.

* [SYCL] Fix deprecation warning for headers (#6808)

Using pragmas to emit warnings didn't work because SYCL headers are
considered to be system headers and any warnings in them are suppressed.

Use "#warning" instead. Unfortunately, MSVC doesn't support it (although
it's part of C23/C++23 and they'll have to add support eventually), so
we need some #if guards. Also, #warning cannot be put inside a macro
definition, thus we have to have some code duplication. Luckily, entire
headers deprecations aren't as often and we can be a little bit verbose.
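
A minimal sketch of the guarded pattern described above (the warning text is illustrative):

    // MSVC does not support #warning yet, so guard it.
    #ifndef _MSC_VER
    #warning "This header is deprecated, include <sycl/sycl.hpp> instead"
    #endif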

* [SYCL][DOC] Extension to add hints to stop batching and start executing (#6465)

Signed-off-by: James Brodman <[email protected]>
Co-authored-by: Greg Lueck <[email protected]>
Co-authored-by: John Pennycook <[email protected]>

* [SYCL] Make host device inaccessible through SYCL API (#6685)

This commit removes the host device from the device list and as such the
host device will no longer be available in user code. The following
changes are a result of this:
* Device filters using 'host' as either backend or device type will
cause a warning at runtime. Since there is no host device selectable for
these filters, the resulting device list will not contain a host
device.
* is_host() on SYCL objects has been deprecated. Any internal use of it
on a host device should cause an assertion to fail.
* host_selector deprecation message has been changed to better reflect
that there is no alternative.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Fix typos in xmethods script (#6814)

* [SYCL][NFC] Fix aspects detection of AllocaInst when opaque pointers are enabled (#6767)

* [NFC][Sema] Minor code quality change in SemaSYCL.cpp (#6827)

Signed-off-by: Elizabeth Andrews <[email protected]>

* [SYCL] Remove device-dependent tests from in-tree LIT (#6829)

This commit removes the currently failing in-tree LIT tests that were
previously dependent on a device being present. The removed tests are
either moved to the test-suite or were already there.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL][NFC] Fix unused argument warning in host_selector (#6831)

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Make the mock plugin report cl_khr_il_program as supported (#6823)

To make build-log tests not depend on the binary type, this commit makes
the mock plugin report that cl_khr_il_program is supported. Together
with adding PI_DEVICE_INFO_NAME to mock_piDeviceGetInfo, the
corresponding redefinition from the build-log tests can be removed.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL][NFC] Reintroduce missing Windows symbols (#6830)

https://github.com/intel/llvm/pull/6685 unintentionally caused the
windows library to lose symbols for handler::verifyKernelInvoc and
handler::is_host. This patch introduces these as unused members to avoid
the patch being ABI-breaking.

Signed-off-by: Larsen, Steffen <[email protected]>

* Fix available memory reporting for Arc devices (#6825)

The current PI Level Zero implementation assumed device memory is only HBM
and hence returned 0 for Alchemist GPUs.

* [CI] Enable independent build of libclc (#6833)

* Update CMakeLists to build libclc based on project presence in cmake
command
* Add remangled versions build in --ci-defaults
* Fix an issue with possible concatenation of libclc target strings

* [SYCL][DOCS] Remove references to host device in top-level docs (#6836)

This commit removes references to the host device from the Getting
Started Guide and the FAQ documents.

Signed-off-by: Larsen, Steffen <[email protected]>

* [CI] Enable CUDA SYCL CTS tests (#6439)

Signed-off-by: Yin Yang <[email protected]>
Co-authored-by: Alexey Bader <[email protected]>

* [ESIMD] Do simd<T, N>* to <N x T> arg/ret type conv when possible. (#6835)

Today, `simd<T, N> foo(simd<T, N> x)` is codegenerated by clang as
   `void foo(simd<T, N>* sret(simd<T, N>) %res, simd<T, N>* %x)`
for the SPIRV target (unless `__regcall` is specified), which is then
converted to
  `void foo(<N x T>* sret(<N x T>) %res, <N x T>* %x)`
in the LowerESIMDVecArg, then to
  `<N x T> foo(<N x T> %x)`
in the VC BE.

With the opaque pointers this becomes impossible, and the optimization
must happen in the "ESIMD FE". This patch implements it.

It also changes `lowerEsimdConstructs` in sycl-post-link.cpp to use the new
pass manager, to avoid the old pass-manager-related boilerplate code in the
new ESIMD transformations.

Signed-off-by: Konstantin S Bobrovsky <[email protected]>

* [SYCL][L0] Use compute engine for me…
@keryell keryell added enhancement New feature or request and removed Agenda To be discussed during a SYCL committee meeting labels Feb 17, 2023
@keryell
Copy link
Member

keryell commented Mar 16, 2023

Gordon to be back on this in a few weeks.

@fraggamuffin
Copy link

a good F2F

AerialMantis added a commit that referenced this pull request May 30, 2023
CUDA backend specification doc pulled in from #197.
@AerialMantis
Copy link
Collaborator Author

@illuhad apologies that it took me so long to get back to this, I've created a significant update to the pull request and actually moved this to a new PR as I ran into issues rebasing the changes.

I could of course just return some stream from the pool, but I'm not sure if this might not do more harm than good because I don't think it's what users would expect. In that case, there'd be no guarantees that the returned stream synchronizes in any way with previous operations submitted to the SYCL queue (they may have been scheduled to different streams, even for in-order queues) and neither would subsequent operations be guaranteed to synchronize with anything the user submits to the stream via interop.

Regarding your comment about which CUDA stream is returned when interoperating with a SYCL queue, and how to correctly synchronize with this, I clarified the wording such that both the application interop and host task interop require synchronization with the SYCL queue prior to retrieving the native CUDA stream or executing the host task. This is quite strict, and perhaps not the most efficient approach, however, it does ensure correctness, and I think we could address this at a later point with a general improvement to the core specification with a feature like your extension for custom operations or a similar extension that we've been working on at Codeplay.
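
A hedged sketch of the host-task interop direction discussed here, using the SYCL 2020 `interop_handle` API with the DPC++ `ext_oneapi_cuda` backend enumerator (`useStreamInCudaLibrary` is a hypothetical user function):

    q.submit([&](sycl::handler &cgh) {
      cgh.host_task([=](sycl::interop_handle ih) {
        // Per the wording above, the queue is synchronized before the native
        // stream is handed to the host task.
        auto stream = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
        useStreamInCudaLibrary(stream); // hypothetical
      });
    });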

I'm not sure those functions make much sense for hipSYCL. The hipSYCL interpretation of e.g. a queue is really more of a high-level interface to the SYCL task graph, with some functionality to manage synchronization of groups of tasks (e.g. wait()). To make this more clear I am pondering whether we should introduce the alias using task_group = queue; in hipSYCL. Relating a task group to a stream does not sound very useful semantically, and adding streams at runtime for interop purposes breaks some of the assumptions that the scheduler currently can make. I'll have to think some more about this. For context it might be even more complicated as we couldn't rely on the CUDA runtime API context anymore in the workflows that use it, and this in turn might break other interop use cases that currently "just work" for SYCL+CUDA interop applications that use the runtime API.

So I think the way to support this is to have the SYCL queue or context created from a native CUstream or CUcontext kept independent of the rest of the CUstreams or CUcontexts, so if there is a pool, these would be kept separate from it. The high-level scheduling based on dependencies could still apply as normal.
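
A hedged sketch of that direction using the SYCL 2020 `make_queue` interop function with the DPC++ `ext_oneapi_cuda` backend enumerator; exact spellings and the required context argument may differ between implementations (`nativeStream` and `cudaCtx` are assumed to already exist):

    // Wrap an existing CUstream in a SYCL queue; any implementation-managed
    // stream pool stays separate from this queue.
    sycl::queue q =
        sycl::make_queue<sycl::backend::ext_oneapi_cuda>(nativeStream, cudaCtx);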

@AerialMantis
Copy link
Collaborator Author

@illuhad @gmlueck @JackAKirk @hdelan FYI I've had to open a new PR for this as I ran into some issues rebasing this, coming back to it after some time. The latest version is now in #420.

@AerialMantis
Copy link
Collaborator Author

Closing this, as a new PR has been opened for it: #420.
