
Introduce CUDA backend specification #197

Closed

Conversation

AerialMantis
Collaborator

@AerialMantis AerialMantis commented Oct 22, 2021

Introduce a first draft of a CUDA backend specification covering a mapping of the platform, execution and memory models, and interoperability with the CUDA API, for SYCL implementations targeting a CUDA backend.

The backend specification has been added as an appendix to the SYCL 2020 specification.

A pull request for introducing a test plan to the SYCL CTS for the CUDA backend interoperability defined in this backend specification is here - KhronosGroup/SYCL-CTS#207

@AerialMantis changed the title from "Introduce first draft of CUDA backend specification and interop API" to "Introduce CUDA backend specification" on Oct 22, 2021
@CLAassistant

CLAassistant commented Oct 22, 2021

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
5 out of 6 committers have signed the CLA.

✅ npmiller
✅ hdelan
✅ JackAKirk
✅ AidanBeltonS
✅ AerialMantis
❌ t4c1

@illuhad
Contributor

illuhad commented Oct 22, 2021

hipSYCL will not be able to follow this specification because it contradicts the meaning of e.g. queue and context in hipSYCL and how they are mapped to CUDA. I'm not convinced there is a need for this spec if only DPC++ is going to implement it.

@AerialMantis
Collaborator Author

Thanks for the feedback @illuhad. This is just a first draft, based initially on the DPC++ CUDA backend implementation, as that is what we are most familiar with; we appreciate it may not map the same way in hipSYCL. Our goal is to converge on a specification that all SYCL implementations providing a CUDA backend can conform to, so we are keen to collaborate with you on incorporating the design of hipSYCL's CUDA backend as well.

To that end we'd like to understand where it would be difficult for hipSYCL to conform to the specification as currently written, so we can work with you to adapt it; perhaps there are places where we have been too prescriptive and need to be more flexible.

@illuhad
Contributor

illuhad commented Oct 25, 2021

Hi @AerialMantis, thanks for the clarification regarding your plans, I'm happy to hear you want to converge on a common specification.
I think for kernel code (execution model, etc.) there is strong overlap, and any potential differences could probably be fixed easily. The differences are mostly on the runtime side. From my quick read of the spec so far, I think there are three main concerns:

  1. hipSYCL deliberately decouples sycl::queue and sycl::context from backend objects. For example, it maintains a pool of CUDA streams, and then distributes the work that comes in through all sycl::queue objects across the pool. This has the advantage that it can already benefit from concurrent kernel execution, concurrent memory copies, or overlap of data transfers and compute even if the user code only utilizes a single sycl::queue. Additionally, the performance behavior is independent of the number of SYCL queues, so the user does not need to worry about how to best utilize queues, and can instead focus on expressing the work as part of the SYCL task graph. Similarly, decoupling context allows for an implicit context management that prevents the user from accidentally spawning new backend contexts (e.g. using the default queue constructor). The downside of this approach is that in general there is no well-defined 1:1 mapping between SYCL queue, context and backend objects, which limits interoperability.
  2. Similarly, there might not be a direct relation between backend events and sycl::event, because hipSYCL might elide operations early on in the task graph processing as an optimization, in which case the operation becomes "virtual" in hipSYCL terminology, and then the event is not associated with any real backend event. Maybe it might be possible to at least construct a sycl::event on top of a CUDA event, but that's still unclear. Also, there might be other mechanisms in play for synchronization, such as callbacks instead of CUDA events, and whether the SYCL event comes from the CUDA or e.g. the host backend might also be unclear for the user (e.g. a host task might return an event from the CPU backend). Because of all these complexities, hipSYCL currently does not support event interop.
  3. While DPC++ relies exclusively on the CUDA driver API, hipSYCL mostly utilizes the CUDA runtime API (except for a couple of places where the driver API is used). Most runtime types are a direct typedef for a driver type, so e.g. interop with driver types might "just work", but it's unclear how well this is guaranteed to be the case and we should check to make sure that we don't miss anything.

Of course we could just make all the problematic interop optional to support both the hipSYCL and DPC++ designs, but my understanding is that one of the main arguments for such a specification would be precisely to make guarantees for users when they wish to interop with the backend.

@AerialMantis
Collaborator Author

@illuhad thank you for the detailed response, this is really helpful.

  1. hipSYCL deliberately decouples sycl::queue and sycl::context from backend objects. For example, it maintains a pool of CUDA streams, and then distributes the work that comes in through all sycl::queue objects across the pool. This has the advantage that it can already benefit from concurrent kernel execution, concurrent memory copies, or overlap of data transfers and compute even if the user code only utilizes a single sycl::queue. Additionally, the performance behavior is independent of the number of SYCL queues, so the user does not need to worry about how to best utilize queues, and can instead focus on expressing the work as part of the SYCL task graph. Similarly, decoupling context allows for an implicit context management that prevents the user from accidentally spawning new backend contexts (e.g. using the default queue constructor). The downside of this approach is that in general there is no well-defined 1:1 mapping between SYCL queue, context and backend objects, which limits interoperability.

So this is a really good point, and we thought about this quite a bit. Our thinking for the mapping of the SYCL queue to CUDA streams was that it should be loosely defined to allow either a 1:1 or a 1:n mapping, so that an implementation can, for example, use different streams to overlap kernel execution and data movement, or even provide a separate stream for interop to make it easier to run concurrently with work done via interop. We believe this should extend to an n:p mapping where there is a fixed pool of CUDA streams regardless of how many SYCL queues are constructed, and we can tweak the definition to ensure this case is included.

For the mapping of the SYCL context to CUDA contexts, we chose a mapping of 1:n in order to support the multi-device context use case, where you have a context representing multiple devices. Since CUDA contexts are associated with a single device, it made sense to allow a SYCL context to map to multiple CUDA contexts. However, if you were to have a pool of CUDA streams then it makes sense that this would naturally extend to the CUDA contexts as well.

For the interop API, we wanted to keep the mapping of the SYCL queue somewhat separate from the interop API requirements: even though a SYCL queue could map to multiple CUDA streams, the interop API is still only required to provide a single CUDA stream, and the implementation can choose which to provide. I believe this should work for hipSYCL. However, for the SYCL context I think it's still necessary to return all CUDA contexts, as they are unique for each CUDA device, though perhaps this will still work for hipSYCL if you can retrieve the CUDA contexts from the pool corresponding to the devices of the SYCL context?
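To make this concrete, here is a minimal sketch (not part of the proposal text) of what such interop could look like in application code, assuming the mapping discussed in this thread: a queue hands back a single `CUstream` chosen by the implementation, and a context hands back all of its `CUcontext`s. The `backend::cuda` enumerator follows the usage elsewhere in this thread and may differ in a final spec.

```c++
#include <sycl/sycl.hpp>
#include <cuda.h>
#include <vector>

int main() {
  sycl::queue q{sycl::gpu_selector_v};

  // The implementation may back this queue with several CUDA streams;
  // the interop API only has to return one of them.
  CUstream stream = sycl::get_native<sycl::backend::cuda>(q);

  // A SYCL context can encapsulate one CUDA context per device,
  // so the context interop returns all of them.
  std::vector<CUcontext> contexts =
      sycl::get_native<sycl::backend::cuda>(q.get_context());

  (void)stream;
  (void)contexts;
  return 0;
}
```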

Where I think this gets tricky is in the interop API for creating SYCL objects, make_queue and make_context: if a CUDA backend is implemented using a pool of CUDA streams or CUDA contexts, then it becomes difficult to have the created objects represent only the input CUDA streams or CUDA contexts. However, I see a couple of ways this could work:

  • Create a new object with its own CUDA streams or CUDA contexts which are independent of the pools.
  • Add the CUDA stream or CUDA context from make_queue or make_context to the pools, but tag them as only usable by the interop-constructed SYCL objects.
  2. Similarly, there might not be a direct relation between backend events and sycl::event, because hipSYCL might elide operations early on in the task graph processing as an optimization, in which case the operation becomes "virtual" in hipSYCL terminology, and then the event is not associated with any real backend event. Maybe it might be possible to at least construct a sycl::event on top of a CUDA event, but that's still unclear. Also, there might be other mechanisms in play for synchronization, such as callbacks instead of CUDA events, and whether the SYCL event comes from the CUDA or e.g. the host backend might also be unclear for the user (e.g. a host task might return an event from the CPU backend). Because of all these complexities, hipSYCL currently does not support event interop.

This is a good point, and I think most implementations will do this at some point or another, so it definitely needs to be accounted for. It's difficult to emulate in CUDA, as it's not possible to create a user event to represent virtual or composite operations in the runtime. Perhaps we may need to loosen the mapping of the SYCL event to the CUDA event, and say instead that a SYCL event may have an underlying CUDA event associated with it, but it may not. Then perhaps we could require the interop API to be defined, but make it implementation-defined when a CUDA event is returned?

Though as you said, it does reduce the usefulness of having interoperability in this case, since users wouldn't be able to rely on it always being available, so perhaps it makes less sense to have interop for the SYCL event. There may be other ways to achieve what you would do with a CUDA event; for example, since CUDA streams are in-order, you could simply wait on the stream, as sketched below. Though this is not quite the same, it may be sufficient. We'll need to think about this some more.
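A rough illustration of that "wait on the stream" alternative (a sketch only; it assumes the queue interop discussed above and that the returned stream actually carries the previously submitted work, which, as noted elsewhere in this thread, may not hold for pooled-stream implementations):

```c++
#include <sycl/sycl.hpp>
#include <cuda.h>

// Wait for everything previously submitted to the queue's underlying
// stream, as a stand-in for sycl::event interop.
void wait_on_native_stream(sycl::queue &q) {
  CUstream stream = sycl::get_native<sycl::backend::cuda>(q);
  cuStreamSynchronize(stream);
}
```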

  3. While DPC++ relies exclusively on the CUDA driver API, hipSYCL mostly utilizes the CUDA runtime API (except for a couple of places where the driver API is used). Most runtime types are a direct typedef for a driver type, so e.g. interop with driver types might "just work", but it's unclear how well this is guaranteed to be the case and we should check to make sure that we don't miss anything.

That's a good point; we'll need to investigate this and make sure that these types are compatible. Otherwise we may need to consider a way for the interop API to use one or the other type, depending on which API the implementation is using. We'll need to look into this some more.

@illuhad
Contributor

illuhad commented Oct 26, 2021

Thanks for your thoughts @AerialMantis !

For the interop API, we wanted to keep the mapping of the SYCL queue separate somewhat from the interop API requirements in that, even though a SYCL queue could map to multiple CUDA streams, the interop API is still only required to provide a single CUDA stream and the implementation can choose which to provide. I believe this should work for hipSYCL.

I could of course just return some stream from the pool, but I'm not sure whether this would do more harm than good, because I don't think it's what users would expect. In that case, there'd be no guarantees that the returned stream synchronizes in any way with previous operations submitted to the SYCL queue (they may have been scheduled to different streams, even for in-order queues), and neither would subsequent operations be guaranteed to synchronize with anything the user submits to the stream via interop.

However, for SYCL context I think it's still necessary to return all CUDA contexts as they are unique for each CUDA device, though perhaps this will still work for hipSYCL if you can retrieve the CUDA contexts from the pool corresponding to the devices of the SYCL context?

That might work. In most workflows, hipSYCL just uses the default context from the CUDA runtime API, which I think can be extracted if the device is known. There are some caveats since hipSYCL also supports multi-device queues, but those are not covered by the SYCL standard anyway.

Where I think this gets tricky is in the interop API for creating SYCL objects, make_queue and make_context: if a CUDA backend is implemented using a pool of CUDA streams or CUDA contexts, then it becomes difficult to have the created objects represent only the input CUDA streams or CUDA contexts. However, I see a couple of ways this could work:

I'm not sure those functions make much sense for hipSYCL. The hipSYCL interpretation of e.g. a queue is really more of a high-level interface to the SYCL task graph, with some functionality to manage synchronization of groups of tasks (e.g. wait()). To make this more clear I am pondering whether we should introduce the alias using task_group = queue; in hipSYCL. Relating a task group to a stream does not sound very useful semantically, and adding streams at runtime for interop purposes breaks some of the assumptions that the scheduler currently can make. I'll have to think some more about this. For context it might be even more complicated as we couldn't rely on the CUDA runtime API context anymore in the workflows that use it, and this in turn might break other interop use cases that currently "just work" for SYCL+CUDA interop applications that use the runtime API.

This is a good point, and I think most implementations will do this at some point or another, so it definitely needs to be accounted for. It's difficult to emulate in CUDA, as it's not possible to create a user event to represent virtual or composite operations in the runtime. Perhaps we may need to loosen the mapping of the SYCL event to the CUDA event, and say instead that a SYCL event may have an underlying CUDA event associated with it, but it may not. Then perhaps we could require the interop API to be defined, but make it implementation-defined when a CUDA event is returned?

Sure, that could be implemented - assuming this is still useful for users then.

I wonder if we should take a step back and consider what we want to accomplish with the spec from a user's point of view, to make sure that a common spec targeting the least common denominator between DPC++ and hipSYCL is really better for users than two separate implementation backend specs specifically built to match the DPC++ and hipSYCL designs.

| [code]#template<backend Backend> +
device +
make_device(const backend_input_t<Backend, device> &backendObject);#
| Create a SYCL `device` from a CUDA device.
Contributor

I think it would be good to clarify that the device returned from this API is just a copy of one of the devices in device::get_devices(). For example, I think you could add wording like this (inspired from similar wording we have for the Level Zero interop spec):

The SYCL execution environment for the CUDA backend contains a fixed number of devices that are enumerated via sycl::device::get_devices(). Calling this function does not create a new device. Rather it merely creates a sycl::device object that is a copy of one of the devices from that enumeration.
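For illustration, a small sketch of what that wording means for application code (assuming the `backend::cuda` enumerator used elsewhere in this thread and that the native input type for a device is a `CUdevice`; both are assumptions rather than settled spec text):

```c++
#include <sycl/sycl.hpp>
#include <cuda.h>

// make_device does not create a new device; it returns a copy of one of
// the devices already enumerated by sycl::device::get_devices().
sycl::device device_from_native(CUdevice nativeDevice) {
  return sycl::make_device<sycl::backend::cuda>(nativeDevice);
}
```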

Collaborator Author

The wording for this has been updated to reflect that make_device doesn't create a new device, it just returns a device object that is a copy of one of the existing devices, so I think this is resolved now.


When creating a SYCL object from a native CUDA object SYCL does not take
ownership of the object and it is up to the application to dispose of them when
appropriate.
Contributor

The Level Zero backend has a similar issue about ownership. We addressed this by adding an ownership field to the interop APIs that create SYCL objects, which allows the application to say whether it keeps ownership of the native handle vs. transferring ownership to the implementation. I think it might make sense to use the same technique here. You can see the wording we used in the Level Zero spec at:

https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/LevelZeroBackend/LevelZeroBackend.md#44-level-zero-handles-ownership-and-thread-safety
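For reference, the kind of mechanism being suggested could look roughly like this for the CUDA backend. This is purely illustrative: the `native_ownership` enum and the input-struct shape are hypothetical, modeled on the Level Zero extension linked above rather than on anything in this proposal.

```c++
#include <cuda.h>

// Hypothetical sketch of an ownership tag on the interop input.
// If ownership == keep, the application remains responsible for calling
// cuStreamDestroy(); if ownership == transfer, the SYCL runtime destroys
// the stream when the last copy of the created sycl::queue is released.
enum class native_ownership { keep, transfer };

struct cuda_queue_input {
  CUstream stream;            // native handle supplied by the application
  native_ownership ownership; // who is responsible for destroying it
};
```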

Contributor

Yes, we looked at that, and I do think it makes sense as well. However, it does use an extension header, which is fine as the Level Zero spec is maintained separately, but it's not ideal if we want the CUDA backend spec to be part of the main SYCL specification.

So we decided to go with a simpler approach for now for the CUDA backend spec, with the idea in mind that we should probably look into codifying what Level Zero is doing directly in the SYCL interop API, or at least a similar system that could be re-used by the non-OpenCL backends to handle ownership.

Contributor

However, it does use an extension header, which is fine as the Level Zero spec is maintained separately, but it's not ideal if we want the CUDA backend spec to be part of the main SYCL specification.

Sorry, I'm not following. What do you mean by "it does use an extension header"? How does this preclude the interop API from being part of the main SYCL specification?

Contributor

It doesn't preclude it; we just thought it might be better done separately, as a generic change to the interop API that could be used by all backends.

Contributor

I'm not sure it makes sense to try to solve this in a general way in the core spec, since only some backends have this issue of "ownership". (For example, OpenCL doesn't have this issue because objects are reference counted.) For backends that have this issue (e.g. Level Zero, CUDA), I think it makes sense for each backend to define the mechanism for deciding who "owns" an object. Of course, it would be better if all backends had a similar mechanism, though.

If the Level Zero mechanism seems good to others, can we use that for CUDA too? If it has flaws, we could discuss them and potentially adopt a consistent mechanism across both backends.

Contributor

I would argue that the OpenCL backend also has this issue of ownership, it's just that it happens to be easily solved by the OpenCL reference counting. And that the Level Zero approach could easily be implemented by the OpenCL backend, where the ownership transfer would just dictate how SYCL calls clRetain* and clRelease* on the OpenCL objects.

I could be completely wrong on that, but looking at the interop API with make_* and get_*, it seems to me that ownership questions would naturally come up, and I get the feeling that maybe it wasn't part of the original interop API simply because it was solved so easily by reference counting for the OpenCL backend.

are non-core CUDA features. Therefore, the runtime must be able to determine what aspects CUDA
devices have. This can be performed by querying `cudaDeviceProp::major` and `cudaDeviceProp::minor`
to find out the compute capability. The compute capability indicates what extensions are
available to the device, and therefore what aspects are available.
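For reference, the query this paragraph describes is roughly the following (CUDA runtime API; a sketch of what an implementation might do, not spec wording):

```c++
#include <cuda_runtime.h>

// Returns the compute capability as major*10 + minor (e.g. 53 for sm_53),
// which an implementation could map onto SYCL aspects such as fp16.
int compute_capability(int cudaDeviceOrdinal) {
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, cudaDeviceOrdinal);
  return prop.major * 10 + prop.minor;
}
```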
Contributor

This paragraph seems like a design document or a description of the implementation. I don't think it's relevant for an API spec.

Contributor

This is a valid point. We have omitted/rephrased some things in the latest iteration, which we will be posting shortly.

Collaborator Author

This wording has been removed, so I think this is resolved now.

[code]#half4#, [code]#half8# and [code]#half16# must be
available at compile-time. However a kernel using these types is only
supported on devices that have [code]#aspect::fp16#, i.e. compute capability
5.3 or greater.
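In application code this guard is just the standard SYCL aspect check; a minimal sketch (nothing here is CUDA-specific):

```c++
#include <sycl/sycl.hpp>

// Kernels that use sycl::half vector types may only be submitted to
// devices reporting aspect::fp16 (compute capability 5.3+ on CUDA).
void maybe_run_fp16_kernel(sycl::queue &q) {
  if (q.get_device().has(sycl::aspect::fp16)) {
    q.parallel_for(sycl::range<1>{64}, [=](sycl::id<1> i) {
      sycl::half4 v{1.0f, 2.0f, 3.0f, 4.0f};
      (void)v;
      (void)i;
    }).wait();
  }
}
```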
Contributor

I assume you are talking about the SYCL types here (e.g. sycl::half), correct?

If so, none of this is specific to CUDA. The core SYCL spec already says that kernels using these types are only compatible with a device that has aspect::fp16. I don't see why we need to say this again in the CUDA backend spec.

Contributor

This is correct. We have rephrased this section in the next iteration.

Collaborator Author

This wording has been updated, so I think this is resolved now.

5.3 or greater.

[[sub:cuda:extensions]]
=== Extensions
Contributor

Delete this empty section? It seems like you already talked about extensions above.

Contributor

This is a placeholder section for later extensions

Collaborator Author

This wording has been updated, so I think this is resolved now.

[[sub:cuda:platform_model]]
=== Platform Model

All CUDA enabled devices which can be executed on are represented by a single `CUdevice`. A SYCL device maps to a single CUDA device.
Contributor

This first sentence seems confusing to me. Should it read "Each CUDA enabled device ..."? Why is it important to say that, though? It seems like it is just talking about CUDA in general, not anything specific to SYCL interoperability. Maybe this first sentence should just be deleted?

Collaborator Author

This wording has been updated to address this, so I think this is resolved now.

Fix editing typo and improve make_device docs

See merge request oneapi-core/SYCL-Docs-mirror!12
| [code]#context# | `CUcontext` | `std::vector<CUcontext>` | A SYCL context can encapsulate multiple CUDA contexts; however, it is not possible to create a SYCL context from multiple CUDA contexts.
| [code]#queue# | `CUstream` | `CUstream` | A SYCL queue encapsulates a CUDA stream.
| [code]#event# | `CUevent` | `CUevent` | A SYCL event encapsulates a CUDA event.
| [code]#buffer# | `struct { CUdeviceptr ptr; size_t size; }` | `CUdeviceptr` | A SYCL buffer encapsulates a CUDA device pointer.
Contributor

A question came up recently about the return value of get_native() for buffer. What value is returned if buffer is a sub-buffer? Is it a pointer to the original (non-sub) buffer, or is it an offset to the start of the sub-buffer? It would be nice if there was a table describing the behavior of get_native() for each of these objects. That table would be a good place to clarify things like this.

Contributor

Another question about the interop for buffer ... the OpenCL interop has a vector of cl_mem objects because there might be multiple OpenCL devices on the system and the buffer might be represented on more than one device. (At least I think that's the reasoning.) Is it correct that there's just a single CUdeviceptr here, or should there be a vector of them?

Collaborator Author

So we've been discussing application interop for buffer as well and for a few reasons we came to the conclusion that we should not support it and instead encourage buffer interop via interop_handle::get_native_mem:

  • As you mentioned, as a buffer can represent multiple devices it would need to return a vector of CUdeviceptr, but because it allocates data lazily, this would also mean we would either need to return a partial list of memory objects depending on the current state of the runtime or enforce allocation within the get_native.
  • If the return value has multiple memory objects, we would need some way to reflect which one has the latest copy of the data.
  • A buffer can also represent memory on the host and potentially across multiple backends, and representing this would add further complexity to the return type.

In effect we came to the conclusion we would need to replicate much of the inner workings of a buffer in order to make it really useful to users, and the use cases we had could be supported via host task interop.
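A rough sketch of that host-task route (assuming the proposed mapping of buffer memory to a `CUdeviceptr` from the table earlier in this thread, and the `backend::cuda` enumerator used elsewhere in the discussion):

```c++
#include <sycl/sycl.hpp>
#include <cuda.h>

void buffer_interop_via_host_task(sycl::queue &q, sycl::buffer<float, 1> &buf) {
  q.submit([&](sycl::handler &cgh) {
    sycl::accessor acc{buf, cgh, sycl::read_write};
    cgh.host_task([=](sycl::interop_handle ih) {
      // Only the allocation for this backend is handed back, and only
      // inside the host task, where it is guaranteed to be valid.
      CUdeviceptr ptr = ih.get_native_mem<sycl::backend::cuda>(acc);
      (void)ptr; // e.g. pass to a native CUDA library call here
    });
  });
}
```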

pvchupin pushed a commit to intel/llvm that referenced this pull request Jun 15, 2022
This PR adds part of the CUDA backend spec interop proposed in KhronosGroup/SYCL-Docs#197. The changes work with the CUDA CTS interop checks in KhronosGroup/SYCL-CTS#336.

This PR just adds the queue interop.

llvm-test-suite: intel/llvm-test-suite#1054
context +
make_context(const backend_input_t<Backend, context> &backendObject,
const async_handler asyncHandler = {});#
| Create a SYCL `context` from a CUDA context.
Contributor

This probably cannot be supported by SYCL implementations relying on the CUDA runtime API instead of the driver API such as hipSYCL.

Collaborator Author

@illuhad you make a good point, thanks for raising this. I think we could resolve this by clarifying the wording such that for some SYCL implementations the mapping of the SYCL context to the CUDA context is limited to the CUDA primary context. This would limit the usefulness of the feature when the implementation is based on the runtime API, but would allow us to maintain a generic interface between the two implementation approaches.

So the wording would be clarified such that get_native<context, backend::cuda> is permitted to return the CUDA primary context and make_context<backend::cuda> is permitted to only accept the CUDA primary context, and throw an exception otherwise.

However, in order to allow users to reason about this there would need to be some kind of query which exposes whether the implementation supports interop with non-primary CUDA contexts. Aside from this particular use case, there is another situation where having such a query would be important. Within a host task, when interoperating with the native objects, it's important to know whether the SYCL implementation initializes non-primary contexts, in order to ensure that the native API calls within the host task are compatible.

I see a couple of ways we could do this:

  • A. Introduce a query which tells the user specifically whether the SYCL implementation maps all SYCL contexts to the CUDA primary context or whether unique contexts are allocated; this provides the specific information we need in this case, but is perhaps less scalable.
  • B. Introduce a broader query which tells the user whether the SYCL implementation is using the driver API or the runtime API; this could be useful for other places where there is a difference between the driver API and the runtime API, though it may not reflect the case where an implementation combines the two APIs.

Contributor

@illuhad illuhad Jul 28, 2022

Thanks for your thoughts @AerialMantis! I agree that what you are proposing are possible solutions. I have a hunch that in practice it might be easier for users to work with if they know whether an implementation is runtime or driver API based, rather than making the query more fine-grained and only about context interop.
E.g. if you know that your CUDA backend is runtime API-based, you can make other assumptions about backend interop. In that case, you know that backend interop is generally just going to work without having to worry about contexts at all, which I think can be very valuable.
I wonder if it might be easier to understand and clearer if the backend specification just had two flavors, or profiles, so that we can expose the behavior of both approaches in a more obvious way.

I suspect that even if an implementation were to utilize both APIs, you could probably figure out a definition to classify them, most likely around how they behave regarding context interop, and whether the current CUDA device from cudaSetDevice() plays a role.
For example, hipSYCL already uses some driver API components for manual management of PTX modules, but still follows the interop behavior of the runtime API-approach.
Do you think there are use cases where such a classification is not possible?

Also, I wonder: Do we believe that there will actually be a use case to invoke make_context with the same context obtained from get_native()? Or will having this query for runtime API backends just tempt users to do things that are unsupported?
My guess is that the appropriate approach to backend interop might be quite different for a runtime and driver API-based implementation. I guess the core question is: Can we have backend interop in client code that works well across both approaches, or would client code need to be structured differently anyway? In the latter case, it might not be useful to stick to a common backend interop interface.

Contributor

I wonder if it might be easier to understand and clearer if the backend specification just had two flavors, or profiles, so that we can expose the behavior of both approaches in a more obvious way.

Just throwing this out there ... it is also an option to have two different CUDA backends. This means:

  • There would be two different enumerators in enum backend.
  • Each backend would be responsible for defining its own interop specification (if it wanted to support interop).

Having a single backend interop specification is nice, but only if developers can reasonably write interop code that is portable across both CUDA backends. If we think the two CUDA backends are just too different, it might be easier for users if each one had its own separate interop specification.

Collaborator Author

@illuhad @gmlueck I think you make a good point, I am wary of introducing different profiles of the CUDA backend specification if possible, I think it would be valuable to have a single specification that all implementations can follow, though I recognize this may not be possible if there are cases we can't handle generically.

I've clarified the wording to state that both APIs are supported. The biggest concern there was regarding the context and whether the primary context is used. I've added a note that the CUcontexts that are returned from interop can include the primary context, and in fact, this is now what the DPC++ CUDA backend does. I suspect we may still need a query for whether the implementation is using the CUDA Driver API or CUDA Runtime API, but I'd like to avoid this if possible to avoid having any bifurcation of the specification.

Also, I wonder: Do we believe that there will actually be a use case to invoke make_context with the same context obtained from get_native()? Or will having this query for runtime API backends just tempt users to do things that are unsupported?

This is a good point; I suspect we don't want to support this, as it would likely cause havoc with the ownership model.

steffenlarsen pushed a commit to intel/llvm that referenced this pull request Jun 20, 2022
This PR adds part of the CUDA backend spec interop proposed in KhronosGroup/SYCL-Docs#197. The changes work with the CUDA CTS interop checks in KhronosGroup/SYCL-CTS#336.

This PR just adds the event interop. 

llvm-test-suite: intel/llvm-test-suite#1053
AerialMantis and others added 7 commits June 22, 2022 10:16
…fication'

Use device version to report compute capabilities

See merge request oneapi-core/SYCL-Docs-mirror!23
…pecification'

Clarify active contexts and devices

See merge request oneapi-core/SYCL-Docs-mirror!28
…ation'

Remove is_backend_active

See merge request oneapi-core/SYCL-Docs-mirror!29
# Conflicts:
#   adoc/chapters/cuda_backend.adoc
…ackend-specification'

Updating section on index flipping

See merge request oneapi-core/SYCL-Docs-mirror!27
# Conflicts:
#   adoc/chapters/cuda_backend.adoc
updated MR for cuda backend spec

See merge request oneapi-core/SYCL-Docs-mirror!31
@fraggamuffin

still need to discuss how to handle context interop

steffenlarsen pushed a commit to intel/llvm that referenced this pull request Sep 6, 2022
This patch fixes: #6635

In #6483, the implementation of `get_native` for device for the CUDA plugin was mistakenly moved to the experimental interface header, and so it was no longer available for the regular interface, causing build issues.

For the CUDA plugin there are currently two interfaces for the CUDA interop: the "legacy" one, which is used by projects such as oneMKL and oneDNN, and the "experimental" one, defined in the `sycl/ext/oneapi/experimental/backend/cuda.hpp` header, which implements the interop as described in the CUDA backend specification proposed here: KhronosGroup/SYCL-Docs#197
[width="100%",options="header",cols="40%,30%,30%"]
|====
| SYCL shared USM advice | CUDA managed memory advice | Processor the advice is set for
| [code]#sycl::cuda::advice::cuda_mem_advise_set_read_mostly# | [code]#cudaMemAdviseSetReadMostly# | device associated with the queue/handler
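For illustration, usage through the standard `mem_advise` entry point might look like the following sketch; the `sycl::cuda::advice` enumerator is taken from the table above and, as the comment below notes, differs from what DPC++ currently implements.

```c++
#include <sycl/sycl.hpp>

void advise_read_mostly(sycl::queue &q, void *sharedAllocation, size_t bytes) {
  // SYCL 2020 mem_advise takes an int whose meaning is backend-specific;
  // here it carries the CUDA managed-memory advice from the table above.
  q.mem_advise(sharedAllocation, bytes,
               static_cast<int>(sycl::cuda::advice::cuda_mem_advise_set_read_mostly));
}
```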
Contributor

@AerialMantis FYI these advise names seem to be different than those used in DPC++: https://github.com/t4c1/llvm-test-suite/blob/1a5a4e9d72cccf290d88f2498c08314d8bfb271e/SYCL/USM/memadvise_cuda.cpp#L39

Do we need to update one or other of the documentation / dpc++ impl?

Collaborator Author

Good catch, these are different, but I think we may want to update DPC++ to match what we have in here. I have also updated the definitions here to follow the correct extension convention.

YuriPlyakhin pushed a commit to oneapi-src/SYCLomatic that referenced this pull request Sep 30, 2022
* [SYCL][CUDA] Fix get_native interop for device (#6649)

This patch fixes: https://github.com/intel/llvm/issues/6635

In https://github.com/intel/llvm/pull/6483, the implementation of `get_native` for device for the CUDA plugin was mistakenly moved to the experimental interface header, and so it was no longer available for the regular interface, causing build issues.

For the CUDA plugin there are currently two interfaces for the CUDA interop: the "legacy" one, which is used by projects such as oneMKL and oneDNN, and the "experimental" one, defined in the `sycl/ext/oneapi/experimental/backend/cuda.hpp` header, which implements the interop as described in the CUDA backend specification proposed here: https://github.com/KhronosGroup/SYCL-Docs/pull/197

* [llvm/CodeGen] Enable the ExpandLargeDivRem pass for X86, Arm and AArch64

This adds the ExpandLargeDivRem to the default pass pipeline.
The limit at which it expands div/rem instructions is configured
via a new TargetTransformInfo hook (default: no expansion)
X86, Arm and AArch64 backends implement this hook to expand div/rem
instructions with more than 128 bits.

Differential Revision: https://reviews.llvm.org/D130076

* [SelectionDAG] Rename CallSiteDbgInfo to NodeExtraInfo

For information infrequently attached to SDNodes, it is useful to
provide a way to add this information out-of-line. This is already done
for call-site specific information.

Rename CallSiteDbgInfo to NodeExtraInfo in preparation of adding
additional information not necessarily related to call sites only.

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D130880

* [SelectionDAG] Properly copy ExtraInfo on RAUW

During SelectionDAG legalization SDNodes with associated extra info may
be replaced with a new SDNode. Preserve associated extra info on
ReplaceAllUsesWith and remove entries in DeallocateNode.

Reviewed By: vitalybuka

Differential Revision: https://reviews.llvm.org/D130881

* Add parantheses to silence warning.

* [AArch64] Additional tests for sinking splats to muls. NFC

* Fix "[llvm/CodeGen] Enable the ExpandLargeDivRem pass for X86, Arm and AArch64" compilation on Windows

* Fix AMDGPU test failures due to "[llvm/CodeGen] Enable the ExpandLargeDivRem pass for X86, Arm and AArch64"

* [tsan] Replace mem intrinsics with calls to interceptors

After https://reviews.llvm.org/rG463aa814182a23 tsan replaces llvm
intrinsics with calls to glibc functions. However this approach is
fragile, as slight changes in pipeline can return llvm intrinsics back.
In particular InstCombine can do that.

Msan/Asan already declare own version of these memory
functions for the similar purpose.

KCSAN, or anything that uses something else than compiler-rt, needs to
implement this callbacks.

Reviewed By: melver

Differential Revision: https://reviews.llvm.org/D133268

* Fix remaining test failures for "[llvm/CodeGen] Enable the ExpandLargeDivRem pass for X86, Arm and AArch64"

* [gn build] port 5dbc7cf7cac44

* [bazel] port 5dbc7cf7cac44

* Revert "[lldb][bindings] Fix module_access handling of regex"

This reverts commit 75f05fccbbdd91393bdc7b6183b9dd2b1e859f8e.

This commit broke the windows lldb bot: https://lab.llvm.org/buildbot/#/builders/83/builds/23284

* Fix OpenMP Opt for target without a parallel region.

Remove ctx redeclaration.

Format code.

Remove parallel check. Modify tests. Clean-up code.

Fix another test.

Move code to helper functions.

Format file.

Minor fixes.

* [InstCombine] add tests for add of select with 0 and negate arms; NFC

* [InstCombine] add/move tests for add with select operands that simplify; NFC

* [InstCombine] remove dead code for add (select cond, (sub), 0); NFC

This pattern is handled more generally in SimplifySelectsFeedingBinaryOp().
Tests confirming that were added to the add.ll test file in the previous commit.

* Add docs for Mach-O lld

I wasn't able to find any docs for Mach-O in `lld/docs`, so here's an attempt at adding basic docs. One of my goals here is to make it easy for users who are unfamiliar with linkers to successfully use lld.

Reviewed By: #lld-macho, int3

Differential Revision: https://reviews.llvm.org/D132893

* [CostModel][X86] Add CostKinds handling for ctpop ops

This was achieved with an updated version of the 'cost-tables vs llvm-mca' script D103695 (although it still struggles with avx512 predicate numbers, which had to be done manually).

Some of the pre-AVX values still aren't great - atom/slm worst case numbers for ctpop expansion really affect these (especially throughput/latency), so we need to clean them up in a more consistent way - it's a pity we don't have models for more of the older cpus (merom/nehalem etc.) as other examples.

* [clang] fix profiling of template arguments of template and declaration kind

Template arguments of template and declaration kind were being profiled
only by their canonical properties, which would cause incorrect
uniquing of constrained AutoTypes, leading to a crash in some cases.

This exposed some places in CheckTemplateArgumentList where non-canonical
arguments were being pushed into the resulting converted list.

We also throw in some asserts to catch these cases early and explain the crashes.

Note that the fix for the 'declaration' kind is untestable at this point,
because there should be no cases right now in the AST where we try
to unique a non-canonical converted template argument.

This fixes GH55567.

Signed-off-by: Matheus Izvekov <[email protected]>

Differential Revision: https://reviews.llvm.org/D133072

* [RISCV] Improve vector fround lowering by changing FRM.

This is a follow up to D133238 which did this for ceil/floor.

Reviewed By: arcbbb, frasercrmck

Differential Revision: https://reviews.llvm.org/D133335

* [mlir][sparse] codegen for sparse alloc

Reviewed By: Peiming

Differential Revision: https://reviews.llvm.org/D133241

* Revert "[tsan] Replace mem intrinsics with calls to interceptors"

Breaks
http://45.33.8.238/macm1/43944/step_4.txt
https://lab.llvm.org/buildbot/#/builders/70/builds/26926

This reverts commit 77654a65a373da9c4829de821e7b393ea811ee40.

* [mlir][sparse] Refactoring: renaming StorageNewOp to StorageOp

To address comment in https://reviews.llvm.org/D133241

Reviewed By: aartbik

Differential Revision: https://reviews.llvm.org/D133363

* [ConstraintElimination] Replace pair with named struct (NFC).

This slightly improves the readability and allows further extensions in
follow-ups.

* [libc++] Avoid instantiating type_trait classes

Use `using` aliases to avoid instantiating lots of types

Reviewed By: ldionne, #libc

Spies: libcxx-commits, miyuki

Differential Revision: https://reviews.llvm.org/D132785

* [AArch64] Add an option to reserve physical registers from RA

This patch adds an option, --reserve-regs-for-regalloc, so we can reserve a list
of physical registers. These registers will not be used by the register allocator,
but can still be used for ABI requirements such as passing arguments to function
calls.

Its main purpose is simulating high register pressure by reserving many physical
registers, which makes it much easier to test and debug register allocation
changes.

Differential Revision: https://reviews.llvm.org/D132717

* Revert "[SCCP] convert signed div/rem to unsigned for non-negative operands"

This reverts commit fe1f3cfc2669aca387a45c8ce615b45c1db50fc6.

It looks like this commit breaks building llvm-test-suite.

To reproduce, run `opt -passes=ipsccp` on the IR below.

    @g = internal global i32 256, align 4

    define void @test() {
    entry:
      %0 = load i32, ptr @g, align 4
      %div = sdiv i32 %0, undef
      ret void
    }

* [clang] Implement setting crash_diagnostics_dir through env variable

This implements setting the equivalent of `-fcrash-diagnostics-dir`
through the environment variable `CLANG_CRASH_DIAGNOSTICS_DIR`.
If present, the flag still takes precedence.

This helps integration with test frameworks and pipelines.

With this feature, we change the libcxx bootstrapping build
pipeline to produce clang crash reproducers as artifacts.

Signed-off-by: Matheus Izvekov <[email protected]>

Differential Revision: https://reviews.llvm.org/D133082

* [mlir][spirv] Add base classes for vendor ops

This is the first patch in the series to rename vendor ops from
`spv.NameVENDOR` to `spv.VENDOR.Name`. The goal is to make the SPIR-V
dialect more internally consistent.

Issue: https://github.com/llvm/llvm-project/issues/56863

* [mlir][spirv] Refactor vendor op definitions

Use dedicated vendor op classes/categories. This is so that we can later
change the mnemonics of all vendor ops by changing the base class: `SPV_VendorOp`.

Issue: https://github.com/llvm/llvm-project/issues/56863

* [mlir][spirv] Change vendor op mnemonics to `spv.VENDOR.name`

Make vendor ops more consistent with the naming scheme within the SPIR-V
dialect.

Issue: https://github.com/llvm/llvm-project/issues/56863

Reviewed By: antiagainst

Differential Revision: https://reviews.llvm.org/D133247

* [mlir] Make bit enum operators constexpr

This allows using the | operator on the values of enum attributes
in compile-time constants.

Reviewed By: antiagainst

Differential Revision: https://reviews.llvm.org/D133159

* [lldb] Fix CommunicationKDP following D133251

Add `m_bytes` and `m_bytes_mutex` to `CommunicationKDP`, following refactoring
in D133251.

Differential Revision: https://reviews.llvm.org/D133365

* [ADT] Remove is_splat

`is_splat` was superseded by `all_equal` and marked as deprecated.
See the discussion thread for more details:
https://discourse.llvm.org/t/adt-is-splat-and-empty-ranges/64692
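
A minimal sketch of the migration, assuming a caller that previously used `is_splat` (the helper function shown is illustrative):

    #include "llvm/ADT/STLExtras.h"
    #include <vector>

    bool allSame(const std::vector<int> &Vals) {
      // Previously: return llvm::is_splat(Vals);
      return llvm::all_equal(Vals); // true when every element compares equal
    }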

Reviewed By: dblaikie

Differential Revision: https://reviews.llvm.org/D132336

* [OpenMP] Fix `omp_get_wtime` function being marked incorrectly as readonly

OpenMP has a list of optimistic attributes that can be attached to
known runtime functions to aid some analysis. The `omp_get_wtime`
function incorrectly used the `readonly` attribute. This is not correct
as the `omp_get_wtime` function changes values depending on some
external state. This is more correctly modeled with
`inaccessiblememonly`, meaning that the value does not depend on anything
within the module, but cannot be removed as it depends on external
state.

Fixes #57578

Reviewed By: tianshilei1992

Differential Revision: https://reviews.llvm.org/D133360

* [SYCL] Fix accessor's CTAD for g++ host compiler (#6673)

* [SYCL] Replace hardcoded namespaces with attribute (#6674)

Namespaces were hardcoded and used in the compiler to
check for various SYCL types including accessors,
spec_constants, etc. This patch implements an
attribute to uniquely identify the types instead.
The attribute argument is an identifier which denotes
each type.

E.g. __attribute__((sycl_type(accessor))) is used
to mark the accessor class.

The attribute has been implemented with an
accepted list of arguments via EnumArg. The attribute
definition should be updated to support any new types.

The attribute takes one argument.
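
A hypothetical sketch of how a header class could be marked, based on the accessor example above (the class shown is illustrative, not the actual header code):

    template <typename DataT, int Dimensions>
    class __attribute__((sycl_type(accessor))) accessor {
      // ... regular accessor members ...
    };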

Fixes: https://github.com/intel/llvm/issues/5186

Signed-off-by: Elizabeth Andrews <[email protected]>

* [MC] Emit Stackmaps before debug info

This patch is essentially an alternative to https://reviews.llvm.org/D75836 and was mentioned by @lhames in a comment.

The gist of the issue is that Mach-O has restrictions on which kinds of sections are allowed after debug info has been emitted, which is also properly asserted within LLVM. The problem is that stack maps are currently emitted as one of the last sections in each target-specific AsmPrinter so far, which would cause the assertion to trigger. The current approach of special casing for the `__LLVM_STACKMAPS` section is not viable either, as downstream users can overwrite the stackmap format using plugins, which may want to use different sections.

This patch fixes the issue by emitting the stack map earlier, right before debug info is emitted. The way this is implemented is by taking the choice of when to emit the StackMap away from the target AsmPrinter and doing so in the base class. The only disadvantage of this approach is that the `StackMaps` member is now part of the base class, even for targets that do not support them. This is functionally not a problem however, as emitting an empty `StackMaps` is a no-op.

Differential Revision: https://reviews.llvm.org/D132708

* [OpenMP] Remove use of removed '-f[no-]openmp-new-driver' flag

The changes in D130020 removed all support for the old method of
compiling OpenMP offloading programs. This means that
`-fopenmp-new-driver` has no effect and `-fno-openmp-new-driver` does
not work. This patch removes the use and documentation of this flag.
Note that the `--offload-new-driver` flag still exists for using the new
driver optionally with CUDA and HIP.

Reviewed By: tra

Differential Revision: https://reviews.llvm.org/D133367

* [SYCL][ABI-Break] Remove deprecated OptionalDevice input type of make_queue and remove deprecated make_queue ABIs (#6628)

We introduced a tentative OptionalDevice input type for the make_queue API to support the SYCL 2020 API as well as the legacy API, which did not require a Device to be passed as a parameter.
This PR intends to remove this tentative input type.
Extended: this PR now also removes the deprecated make_queue ABIs and makes make_queue take a pointer to pi_device, which becomes optional again.

Signed-off-by: Byoungro So <[email protected]>

* [SYCL] Emit an error on attempt to compile in less than C++17 mode (#6678)

* [CI][NFC] Rename workflow job (#6572)

Just pure renaming of "resolve test matrix" to "generate test matrix"
Follow up on a comment made at https://github.com/intel/llvm/pull/6528

* [CI][NFC] Make lint for commits in PR change only  (#6722)

Make lint check only the changes in the PR's commits, without taking into
account the sycl branch HEAD. This should fix issues like
https://github.com/intel/llvm/pull/6705, where lint reported errors for
files not affected by the PR. To unblock PRs currently stuck because of this,
they need to use the ignore-lint tag until the PR is rebased on the devops
directory changes made here.

* [CI][NFC] Fixed GFX driver update PR (#6723)

The generate test matrix job now correctly uses
github.event.pull_request.head.sha, which is required for
pull_request_target.

* [SYCL] Improve sycl-post-link performance with -split=kernel (#6689)

Right now we are computing a new callgraph in every call to
extractCallGraph. extractCallGraph is called every time we do a module
split, so for -split=kernel, that would be once per kernel. For modules
with many kernels, this can take a very long time. We only need to
compute this once because the input IR doesn't seem to change between
splits.

This improves performance of sycl-post-link from ~45min to ~7min for an
example with 13k kernels

Signed-off-by: Sarnie, Nick <[email protected]>

* [SYCL] Deprecate SYCL 1.2.1 device selectors (#6599)

Now that we have added the SYCL 2020 callable device selectors, we need
to prepare for the removal of the older SYCL 1.2.1 `device_selector`
class. The first step is to add the deprecation message to the 1.2.1
style device selectors, which this PR does. It also removes the usage of
those from our own codebase so as to not trip on our own messages in the
future.

Signed-off-by: Chris Perkins <[email protected]>
Signed-off-by: Larsen, Steffen <[email protected]>
Co-authored-by: Larsen, Steffen <[email protected]>

* [SYCL][CUDA][HIP] Add support for intel extension free-memory (#6709)

The support is added for NVIDIA and AMD devices.

Co-authored-by: Abhishek Bagusetty <[email protected]>

* [SYCL][Windows] Improve windows sycl.lib linking (#6699)

This patch does two things. First, it makes `-fsycl` ignore `-nostdlib`
when linking the SYCL library. This is necessary because for Clang on
Windows CMake will generate link commands using `-nostdlib` and
explicitly list the system libraries, but of course it doesn't do so for
SYCL, so we currently end up never linking the SYCL library when this is
used.

Ignoring `-nostdlib` for `-fsycl` on Windows seems like a reasonable
solution, as this is also what is done for the OpenMP runtime
libraries.

See the CMake module:
*
https://github.com/Kitware/CMake/blob/aa2de7cd2a04699744a224ab84e0ca483559c5d3/Modules/Platform/Windows-Clang.cmake#L79

In addition, this patch adds a linker parameter to help clang find
the `sycl.lib` file without requiring users to tweak their environments
to link against it.

* [SYCL][Windows] Enable building lld by default on Windows (#6701)

The Windows-Clang CMake module uses `lld-link` by default, so
having lld built on windows is helpful for people wanting to build SYCL
applications using CMake.

Related to: https://github.com/intel/llvm/issues/6026

* [SYCL] Allow builtin_assume_aligned to be called from device code. (#6705)

Prior to this PR, if there is a direct call to __builtin_assume_aligned
from within the device code, we would get the following error:
`error: SYCL kernel cannot call a variadic function`

This PR allows `__builtin_assume_aligned` to be invoked from within
device code.
The reason for doing so is that, for example, if we load a float* pointer
from a data structure, the compiler has no way of knowing the underlying
alignment of the data the pointer points to, and therefore it cannot
combine the loads/stores; one ends up with many unnecessary load/store
instructions, lots of unnecessary pointer arithmetic, and increased
register pressure. We see this all over the place in our kernels. With
`__builtin_assume_aligned` we can tell the compiler to safely assume a
certain alignment, therefore implicitly forcing coalescing.
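
A minimal sketch of the pattern this enables in device code (the 32-byte alignment and the function shown are illustrative):

    // Tell the compiler the pointer is 32-byte aligned so it can coalesce
    // the subsequent loads and stores.
    void scale(const float *in, float *out, int n) {
      const float *alignedIn =
          static_cast<const float *>(__builtin_assume_aligned(in, 32));
      for (int i = 0; i < n; ++i)
        out[i] = alignedIn[i] * 2.0f;
    }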

* [SYCL][NFC] Remove unnecessary NULL check for 'Fn' (#6726)

Klocwork thinks that it is possible for 'Fn' to be a nullptr due to this
check but there's already an assert at the beginning of 
CodeGenFunction::GenerateCode to avoid that possibility.

* [SYCL] Fix device comparison in removeDuplicateDevices (#6730)

Kernel bundles attempt to remove duplicate devices from a passed device
list. This is done through the `removeDuplicateDevices` function which
creates a set through comparing devices by their native handles.
However, the `getNative` member function on `device_impl` used to get
these handles will retain the native devices if the backend is OpenCL.
For root devices this will not have an effect, but on sub-devices this
can potentially lead to a leak of the devices. As a fix this commit
compares the PI devices rather than the native handles.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Remove cl namespace deprecation warning (#6735)

According to the SYCL 2020 specification, CL/sycl.hpp should supply the
cl prefix namespace for backwards compatibility, but it does not specify
that the namespace is deprecated. This commit removes the deprecation
message.

Signed-off-by: Larsen, Steffen <[email protected]>

* [GHA] Uplift Linux GPU RT version to 22.35.24055 (#6704)

Scheduled drivers uplift

Co-authored-by: GitHub Actions <[email protected]>

* [BuildBot] Uplift GPU RT version for Linux CI Process (#6697)

Uplift GPU RT version for Linux to 22.35.24055

Signed-off-by: bb-sycl <[email protected]>

* Update test to match -fvisibility=arg option requirements (#7098)

* [SYCL][ESIMD] Introduce predicates for lsc_block_store/load (#6688)

* Update for LLVM Optional API changes

Update for LLVM commit b5f8d42efe3e ("[ADT] Deprecate
Optional::{hasValue,getValue} (NFC)", 2022-08-07).

This is a mechanical replacement of `hasValue` to `has_value` and
`getValue` to `value`.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/ffeb4df

* Restore getArgIndex in OCLTypeToSPIRV (#1567)

* Update for LLVM Optional API changes

Update for LLVM commit b5f8d42efe3e ("[ADT] Deprecate
Optional::{hasValue,getValue} (NFC)", 2022-08-07).

This is a mechanical replacement of `hasValue` to `has_value` and
`getValue` to `value`.

* Restore argument tracing in OCLTypeToSPIRV

Commit 4a9c78ee ("Prepare SPIRVWriter for type conversion without
opaque pointers. (#1499)", 2022-06-20) removed `getArgIndex` in favor
of passing `Idx`, but this leads to incorrect adaptation of argument
types if sampler arguments are not in the same position between
different functions.

We might be able to drop `adaptArgumentsBySamplerUse` entirely, as we
don't represent samplers as i32 anymore; but for now just fix the
regression.

Fixes https://github.com/KhronosGroup/SPIRV-LLVM-Translator/issues/1562

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/87f8a58

* Move to C++17

LLVM has switched to C++17 with commit b1356504e63a ("[LLVM] Update
C++ standard to 17", 2022-08-06).

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/75d16c2

* .clang-tidy: temporarily disable misc-const-correctness

This is a relatively new check added to clang-tidy by 46ae26e7eb70
("[clang-tidy] implement new check 'misc-const-correctness' to add
'const' to unmodified variables", 2022-07-24).  Currently the code
base doesn't follow the practice of declaring variables `const`
where possible, which makes the check quite noisy, so disable it for
now.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/4dd494e

* Drop the JointMatrixINTEL struct-renaming pass when opaque pointers are enabled. (#1570)

The frontend is being changed to lower the struct name to the correct LLVM name
directly, obviating the need for this check. See
https://github.com/intel/llvm/pull/6535 for this change.

This marks the removal of the final call to the deprecated method
Type::getPointerElementType, although there remains some code that is not fully
working with opaque pointers enabled.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/e2bb479

* Use TypedPointerType in lieu of PointerIndirectPair in mangleBuiltin. (#1568)

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/42cf770

* Remove 2 unused SPCV_ macros

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/160f013

* Upgrade to Ubuntu 20.04

The Ubuntu 18.04 image is marked deprecated [1], so move to a newer
image.

[1] https://github.com/actions/runner-images

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/20a1fd7

* Fix a -Wmaybe-uninitialized warning

`MDWrapper::get` may not fill its argument in error cases, so
initialize `Arg` to an all-ones value.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/2c247b7

* Dot product bugfix to include more floating point types (#1578)

Switched the visitCallDot check to use isFloatingPointTy for scalar
floating point operands. Bugfix for previous change regarding
integer dot product.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/71e01b5

* Mass add -emit-opaque-pointers for tests that don't require changes.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/1ea77d2

* Add support for toolchain compilation with LLVM_LINK_LLVM_DYLIB option (#1543)

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/40fd741

* Update SPV_INTEL_vector_compute to rev 15

This adds NamedBarrierCountINTEL Execution Mode,
see more in https://github.com/intel/llvm/pull/1612

Co-authored-by: nrudenko <[email protected]>

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/21c7a30

* Remove ExecutionModeNamedBarrierCountINTEL from spirv_internal (#1604)

It was actually upstreamed to SPIR-V Headers already.

Signed-off-by: Sidorov, Dmitry <[email protected]>

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/3335c25

* Mass add -emit-opaque-pointers for tests that require some changes.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/6e15642

* Implement SPIR-V support for max_reinvocation_delay attribute (#1594)

The attribute generates the llvm.loop.intel.max_reinvocation_delay.count metadata in LLVM IR.
There is one positive integer literal operand denoting the maximum number of cycles allowed
between loop invocations.

Spec: KhronosGroup/SPIRV-Registry#163

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/d4ec010

* Fix type scavenger for variable arguments and multiple-uses-of-types cases. (#1606)

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/401d124

* Add support for split barriers extension SPV_INTEL_split_barrier (#1424)

This PR adds support for split barriers and the SPV_INTEL_split_barrier extension.

The related SPIR-V extension spec can be found here:

* https://github.com/KhronosGroup/SPIRV-Registry/pull/136

The related OpenCL C extension spec can be found here:

* https://github.com/KhronosGroup/OpenCL-Docs/pull/765

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/e3cd296

* Reinstate deprecated-declarations warning as error (#1609)

This reverts commit 55d1de820841e6d9d1c6ca0cd534323d69a1cbf1.

Now that the last use of the deprecated `getPointerElementType` has
been removed, treat uses of deprecated functions as errors again.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/6811488

* Rework the demangler to support recovering more pointer element types. (#1556)

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/b412ae5

* Translate function pointer from global variable as pointer, not as declaration (#1608)

This patch helps to avoid invalid SPV generation.

When a global variable contains a pointer to a function, the translator tries to
translate it as a declaration. Then it translates this function a second time
when going through the function list. This leads to double translation
of the same function and to the use of the same IDs in the SPIR-V file.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/2b4ce42

* Add SPV_INTEL_masked_gather_scatter extension (#1580)

This extension allows TypeVector to have a Physical Pointer Type
Component Type and introduces gather/scatter instructions.
It will be useful for explicitly vectorized kernels.

Spec: https://github.com/intel/llvm/pull/6613

Signed-off-by: Sidorov, Dmitry <[email protected]>

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/49b08e8

* [SYCL][Doc] Remove stale SPIR-V extension spec (#6741)

This SPIR-V extension was promoted to a KHR, and that specification is
on the SPIR-V registry. The DPC++ compiler uses the KHR version (not
this INTEL version), so this old specification can be removed.

* [SYCL][CUDA] Make `piextKernelSetArgMemObj` set an error message (#6521)

This patch makes `cuda_piextKernelSetArgMemObj` set an error message
instead of calling `std::terminate` in case the image format is not
supported. This error message is encapsulated in an exception thrown by
the RT.

This allows the SYCL-CTS execution to continue for tests using
unsupported channel types, see
https://github.com/intel/llvm/issues/2119#issuecomment-1201548912.

* [SYCL][Doc] Add sycl complex to complex algorithms extension (#6717)

This PR extends the complex algorithms extension to support
`sycl::ext::oneapi::complex` and `marray<sycl::ext::oneapi::complex>`.
Additionally it adds the `multiplies` operator as a valid binary
operation for complex values when reducing and scanning across work
items. This PR has a dependency upon #6550.

* [SYCL][NFC] Add SYCLPropagateAspectsUsage pass (#6670)

Added a pass which is part of the optional kernel features design: it uses
information provided by the FE & headers about aspects used in device code
to propagate it through the call graph and mark all kernels and functions
with the list of aspects they use.

Co-authored-by: Maksim Sabianin <[email protected]>

* [NFC] Fix the expected assert message in LIT test gather_scatter_rgba.cpp

Signed-off-by: Vyacheslav N Klochkov <[email protected]>

* [SYCL] Make intel specific device info descriptors namespace qualified (#6639)

Conform to SYCL 2020 specification sections 6.3.1 and 4.6.4.2 by making
extension information descriptors templated and placing them in the correct
namespace.

- Also moved deprecated info descriptors for device into a separate file

- Changed the namespace of the recently added [device memory
extension](https://github.com/intel/llvm/pull/6604) to
ext::intel::info::device

Signed-off-by: Rauf, Rana <[email protected]>
Co-authored-by: Steffen Larsen <[email protected]>

* [SYCL] Fix USM free for descendent devices (#6733)

Now that use of descendent devices of context members is supported, we
cannot rely on always choosing the only device in single-device
contexts. Remove this branch to always fetch the device from the
platform.

* [SYCL] Test must pass -opaque-pointers explicitly

Not yet the default for sycl.

* Add opaque pointers switch to llvm/test/Verifier/dllstorage.ll

* Remove clang/test/Driver/openmp-sycl-interop.c test from testing.
The test is specific to behaviors involving SYCL and OpenMP offloading.
Since OpenMP offload has moved to the new offloading model, the test is
not relevant anymore.

* [SYCL] Pass /Zc:__cplusplus in -fsycl-host-compiler-options in some tests (#6751)

By default MSVC reports 199711L as the standard being used and needs
that option to report C++ version properly. This fixes current post-commit
failures on the tests modified.

* Delete obsolete Clang::Driver tests

* Revert "Delete obsolete Clang::Driver tests"

This reverts commit 84be9c2cb06b5c98225ae0da80e32aa161483187.

* The XFAIL tests are specific to the old OpenMP offloading model, which was removed.

* Disable sycl-libspirv for amdgcn-amd-amdhsa target in clang/test/Driver/sycl-amdgcn-sqrt.cpp

* [SYCL][ABI-Break][NFC] Remove unusued members (#6731)

This commit removes the following:

- The unused member function `kernel_bundle::join_impl` overload.
- The unused member variable `Offset` from `stream`.
- The unused member function `buffer_impl::constructorNotification`
overload.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Fix memory leak in program link (#6641)

This PR fixes a memory leak caused by a missing call to program
release.

Signed-off-by: Byoungro So <[email protected]>

* [SYCL] Remove mentions of host device from in-tree LIT tests (#6683)

* [ESIMD] Fix invoke_simd calls case with pointer passed to it (#6696)

The helper function created during translation of invoke_simd must
accept a pointer to a function, not a reference to a pointer to a function.
That additional level of indirection is automatically resolved by the compiler
for invoke_simd, but needs to be manually resolved/adjusted for the helper
function.

Signed-off-by: Vyacheslav N Klochkov <[email protected]>

* [SYCL] Remove CG/handler extended members mechanism (#6759)

Now that the extended members have been promoted to proper fields of
CG/handler classes, the extended member mechanism can be removed until
it's needed again.

* Suppress some clang-tidy errors

Split the .clang-tidy check lists out over multiple lines to improve
readability.

Suppress `misc-non-private-member-variables-in-classes` as the code
currently contains many instances that fail this check.

Drop `constexpr` from `LoopControlLoopCountINTELMask` after clang
started diagnosing this with b36453530418 ("[Clang] Diagnose
ill-formed constant expression when setting ...", 2022-07-28).
Removing `constexpr` is just a workaround, the long term fix would be
to upstream the new enum value to `spirv.hpp`.

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/c5b29f2

* Update LLVM version from 15 to 16

Original commit:
https://github.com/KhronosGroup/SPIRV-LLVM-Translator/commit/7aa1fd2

* [SYCL] Assign code owners for invoke_simd.hpp and simd.hpp. (#6746)

* [SYCL] Fix memory leak (queue_impl)  due to #5901 (#6707)

The cross dependency between event_impl and queue_impl prevented the objects from being released.
event_impl now holds only a weak pointer to the queue.

Signed-off-by: Tikhomirova, Kseniya <[email protected]>

* Fix the Clang sphinx build bot; NFC

This should address the failure introduced in:
https://lab.llvm.org/buildbot/#/builders/92/builds/32377

* [SYCL] Move accessor_impl to source directory (#6698)

Moving the impl part of the host accessor implementation to avoid exposing
implementation details in the headers. This allows for more changes
in accessor without breaking the ABI.

Also updated the gdb xmethods, since they were relying on
impl details which are no longer available to gdb (unless libsycl.so is
built with debug symbols). Instead of accessing members of the
impl directly, the gdb printers now access helper methods. To prevent
the compiler from discarding these methods, there are dummy references which
are active when NDEBUG is not defined.

* [SYCL] Silence -Wctad-maybe-unsupported warning

This might be an FE bug. Will follow up separately. Workaround should be
fine to unblock pulldown.

* [sycl-bisect] Add missing exit (#6771)

This script has been missing one call to exit, which causes control to
fall through and run successful tests a second time when using --command
without --command-allow-bisect-codes. This didn't change the results of
the script, but it did make it slower if the test command takes a while
to run.

* [ESIMD] Change LSC API to improve template argument type deduction (#6764)

* [CI] Fix dependabot alert (#6725)

* [SYCL] More changes to silence -Wctad-maybe-unsupported

Mutex changes could be reverted if/when this change lands:
https://reviews.llvm.org/D133425

* [SYCL][ABI-Break] Merge DeviceBinaryImage and RTDeviceBinaryImage (#6768)

To avoid future issues with ABI for DeviceBinaryImage, this commit makes
the following changes:
* Moves RTDeviceBinaryImage to a source header.
* Merges DeviceBinaryImage into RTDeviceBinaryImage.
* Promotes the common property ranges from DeviceBinaryImage into new
members.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Fix get_pointer_device for cases with descendent devices (#6719)

Looking through context members alone when searching for a specific
device isn't enough anymore since now descendent devices of context
members can be used within that context as well. Change the logic to
look for the device in the cache instead.

* [Matrix][SYCL] Add use argument for joint_matrix and add another feature macro for it (#5835)

* [SYCL] Remove unused argument from getDeviceImpl (#6780)

* [SYCL][Windows] Fix debug build in non cl mode  (#6721)

This fixes building SYCL programs in Debug mode with
`Windows-Clang.cmake`.

The issue is that the code was using `OPT__SLASH_MDd` to select
`sycld.lib`, but under the `!C.getDriver.IsCLMode()` condition this flag
will never be set; `OPT_g_Flag` should be used instead (`-g` rather than
`/MDd`).

Note that using the regular `clang` command line to manually build with
`-g` still doesn't work as it will link against `msvcrt` rather than
`msvcrtd` and will miss required defines for debug builds on Windows
(`_DEBUG`). This is correctly done by the CMake module or simply when
using `clang-cl`.

* [SYCL][DOC] Add extension for FPGA kernel interface properties (#5715)

## This SYCL extension contains the following new kernel properties
- `streaming_interface<...>`
- `register_map_interface<...>`
The first two properties take enum arguments that provide the compiler with
information about whether the logic downstream of the kernel will
back-pressure the kernel or not.
- `pipelined<N>`
Takes an integer; non-zero values specify the minimum number of cycles between
kernel invocations, and 0 specifies that pipelining should be disabled.

Co-authored-by: GarveyJoe <[email protected]>

* [SYCL] Fix llvm.used removal when used with opaque pointers. (#6773)

The code previously assumed that all functions, when used in @llvm.used,
would be wrapped within a bitcast of <fnptr type> to i8*; with opaque
pointers, the values would be functions directly, causing a crash since
functions don't have any operands.

* [SYCL] Add MAJOR_VERSION to the name of the sycl library on Win (#6745)

* [SYCL] Silence -Wctad-maybe-unsupported for check-sycl

Add deduction guide to kernel_bundle

* [SYCL] Add --host-target flag and remove --arm flag (#6620)

This patch adds a new flag to `configure.py` to allow changing the host
target to build. This makes it easier to build on different hosts such
as ARM or PowerPC.

This patch also removes the `--arm` flag as it is now redundant,
`--host-target=ARM` or `--host-target=AArch64` should be used instead.

This is slightly different from the original `--arm` flag as it only
allows enabling one of the ARM platforms. But I'm not sure what the use
case was for enabling both, as only one will be the host platform for a
given build. And compilation time was originally given as a reason to drop
the X86 architecture, so only enabling the correct ARM architecture
should also help with that.

Co-authored-by: Alexey Bader <[email protected]>

* [NFC] Update sycl-force-target test to play nicely for 32-bit host (#6785)

Update the target for the test to use 64-bit to allow the test to
emit the expected device target when unbundling.

* [SYCL] Use copy engine for memory read/write operations (#6783)

In the present state of the L0 plugin, MemBufferWrite and MemBufferRead
operations are being executed using compute engines. This patch changes
this behavior for the operations to be executed using copy engines.
It is expected to improve performance.

Signed-off-by: Arvind Sudarsanam <[email protected]>

* [SYCL][FPGA] Support Intel FPGA simulator device selector (#6715)

Create **fpga_simulator_selector** to be used to select the FPGA
simulator device. This assumes that the user has compiled their program
with -Xssimulation (as well as -fintelfpga) to prepare for the
simulator.

An object of class fpga_simulator_selector must be constructed early in
the execution of the host program in order to make the simulator
available at runtime.
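
A hedged sketch of that usage, assuming the selector is provided by the `sycl/ext/intel/fpga_extensions.hpp` header:

    #include <sycl/sycl.hpp>
    #include <sycl/ext/intel/fpga_extensions.hpp>

    int main() {
      // Construct the simulator selector early so the simulator device is
      // available at runtime, then create a queue on it.
      sycl::ext::intel::fpga_simulator_selector selector;
      sycl::queue q{selector};
      // submit kernels to q ...
    }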

The current implementation has the effect that if an object of class
sycl::ext::intel::fpga_simulator_selector is defined, FPGA hardware
devices selected using sycl::ext::intel::fpga_selector will select
simulator devices. This will be documented in the release notes and the
specification. We expect this behaviour to be eliminated in the future.

* [SYCL][CUDA] Add support of CUDA XPTI tracing (#6373)

Fork of https://github.com/intel/llvm/pull/5797

This patch rebases and finalizes (similarly to
https://github.com/intel/llvm/pull/6023) the draft in
https://github.com/intel/llvm/pull/5797, which already contained the
most important commits thanks to @alexbatashev.

The most relevant additions of this patch were done in CMake files, in
particular
 
- switch from the deprecated `FindCUDA` CMake module to the
`FindCUDAToolkit` one in order to find the cupti library by means of
`CUDA_cupti_LIBRARY`. This is advisable because on some systems
`FindCUDA` fails to find `CUDA_cupti_LIBRARY`. This is also the case for
the CI, see the
[log](https://github.com/intel/llvm/runs/7115612243?check_suite_focus=true)
for when `FindCUDA` is used.
- find `generated_cuda_meta.h` for generating the CUDA printer
definitions, since the location of this header file seems to vary
depending on the system; if this file is not found, a warning is
printed and no errors are thrown.

Co-authored-by: Alexander Batashev <[email protected]>

* [SYCL] Add diagnostic test for global_variable_allowed attribute (#6777)

Add cases for this attribute independent of device_global

* [SYCL][ABI-Break] Implement property interface for local_accessor & usm_allocator (#6737)

* [SYCL][Windows] Fix DataMovement test (#6790)

Using `-g` with the regular `clang` command line is not supported on
Windows.

On Windows, `clang-cl` and `/MDd` should be used instead.

However it doesn't seem like this test is testing anything to do with debug
info and I couldn't find any reason for having it in the history, so
removing `-g` is the simplest solution to make the test work on both
Linux and Windows.

This should fix the post-commit issue that showed up in:
https://github.com/intel/llvm/pull/6721

* [SYCL][ABI-Break] Add SYCL 2020 kernel_device_specific::max_sub_group_size (#6782)

SYCL 2020 promotes the info::kernel_device_specific::max_sub_group_size
query on kernels, but removes the additional argument. This commit adds
an overload with no additional argument and deprecates the old variant.
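
A minimal sketch of the new overload, assuming `k` is a `sycl::kernel` obtained from a kernel bundle and `dev` is the target device:

    // SYCL 2020 query: no additional range argument.
    size_t maxSubGroupSize =
        k.get_info<sycl::info::kernel_device_specific::max_sub_group_size>(dev);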

* [SYCL][NFC] Add another option for the lambda size mismatch message (#6794)

* [SYCL] Introduce a fully-mocked PI plugin for unit tests (#6684)

The idea behind this PR is to introduce an infrastructure which allows
writing unit tests that are not dependent on the presence of actual
backends (like OpenCL, L0, etc.) or devices (like CPU, GPU, etc.).

Motivation for the patch: the host device is going to be removed and we have
a number of in-tree LIT tests which can't pass if there are no devices
available, so they are likely to be removed/disabled in the short term
(#6683) unless we move them into llvm-test-suite or into unit tests with the
new infrastructure suggested here.

Signed-off-by: Larsen, Steffen <[email protected]>
Co-authored-by: Sachkov, Alexey <[email protected]>

* [SYCL][ESIMD][EMU] Enable fp64/double type support (#6796)

* For PI_DEVICE_INFO_EXTENSIONS, 'piDeviceGetInfo' returns 'cl_khr_fp64'
as the type can be supported as a native 'double' type
* Fixes current pre-commit issue with unexpected pass on SYCL :: spec_const/spec_const_double.cpp

* [SYCL] Fix macro definition conflicting with MSVC (#6798)

The _CONCAT macro is defined in MSVC headers so to avoid redefinition
warnings this commit changes the naming in the PI mock plugin from
_CONCAT to _PI_MOCK_PLUGIN_CONCAT.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Update DPCPP library major version (#6801)

* [SYCL] Add assert for device_global without device_image_scope (#6791)

device_global is currently not fully supported, but for backends that
support it, it should be usable on device only when the
device_image_scope property is present. This commit adds a temporary
static assert to prevent the use of device_global without
device_image_scope until proper initialization has been implemented.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Fix unittests on MSVC (#6805)

This commit makes two changes:
* Fixes a cast of incompatible size in the PI mock plugin that caused a
warning on MSVC.
* Changes the definition of the captureless lambda function in the
RedefineAPI PiMock unittest to avoid MSVC considering it equal to the
function defined by the PI mock plugin.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL][Doc] Closing the ABI breaking changes window (#6800)

The next one is expected to be no earlier than June 2023.

* [SYCL] Fix deprecation warning for headers (#6808)

Using pragmas to emit warnings didn't work because SYCL headers are
considered to be system headers and any warnings in them are suppressed.

Use "#warning" instead. Unfortunately, MSVC doesn't support it (although
it's part of C23/C++23 and they'll have to add support eventually), so
we need some #if guards. Also, #warning cannot be put inside a macro
definition, thus we have to have some code duplication. Luckily, entire
headers deprecations aren't as often and we can be a little bit verbose.
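
A minimal sketch of the guarded pattern described above (the warning text is illustrative):

    // MSVC does not support #warning yet, so guard it.
    #ifndef _MSC_VER
    #warning "This header is deprecated, include <sycl/sycl.hpp> instead"
    #endif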

* [SYCL][DOC] Extension to add hints to stop batching and start executing (#6465)

Signed-off-by: James Brodman <[email protected]>
Co-authored-by: Greg Lueck <[email protected]>
Co-authored-by: John Pennycook <[email protected]>

* [SYCL] Make host device inaccessible through SYCL API (#6685)

This commit removes the host device from the device list and as such the
host device will no longer be available in user code. The following
changes are a result of this:
* Device filters using 'host' as either backend or device type will
cause a warning at runtime. Since there is no host device selectable for
these filters, the resulting device list will not contain a host
device.
* is_host() on SYCL objects has been deprecated. Any internal use of it
on a host device should cause an assertion to fail.
* host_selector deprecation message has been changed to better reflect
that there is no alternative.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Fix typos in xmethods script (#6814)

* [SYCL][NFC] Fix aspects detection of AllocaInst when opaque pointers are enabled (#6767)

* [NFC][Sema] Minor code quality change in SemaSYCL.cpp (#6827)

Signed-off-by: Elizabeth Andrews <[email protected]>

* [SYCL] Remove device-dependent tests from in-tree LIT (#6829)

This commit removes the currently failing in-tree LIT tests that were
previously dependent on a device being present. The removed tests are
either moved to the test-suite or were already there.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL][NFC] Fix unused argument warning in host_selector (#6831)

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL] Make the mock plugin report cl_khr_il_program as supported (#6823)

To make build-log tests not depend on the binary type, this commit makes
the mock plugin report that cl_khr_il_program is supported. Together
with adding PI_DEVICE_INFO_NAME to mock_piDeviceGetInfo, the
corresponding redefinition from the build-log tests can be removed.

Signed-off-by: Larsen, Steffen <[email protected]>

* [SYCL][NFC] Reintroduce missing Windows symbols (#6830)

https://github.com/intel/llvm/pull/6685 unintentionally caused the
windows library to lose symbols for handler::verifyKernelInvoc and
handler::is_host. This patch introduces these as unused members to avoid
the patch being ABI-breaking.

Signed-off-by: Larsen, Steffen <[email protected]>

* Fix available memory reporting for Arc devices (#6825)

The current PI Level Zero implementation assumed device memory is only HBM
and hence returned 0 for Alchemist GPUs.

* [CI] Enable independent build of libclc (#6833)

* Update CMakeLists to build libclc based on project presence in cmake
command
* Add remangled versions build in --ci-defaults
* Fix an issue with possible concatenation of libclc target strings

* [SYCL][DOCS] Remove references to host device in top-level docs (#6836)

This commit removes references to the host device from the Getting
Started Guide and the FAQ documents.

Signed-off-by: Larsen, Steffen <[email protected]>

* [CI] Enable CUDA SYCL CTS tests (#6439)

Signed-off-by: Yin Yang <[email protected]>
Co-authored-by: Alexey Bader <[email protected]>

* [ESIMD] Do simd<T, N>* to <N x T> arg/ret type conv when possible. (#6835)

Today, `simd<T, N> foo(simd<T, N> x)` is codegenerated by clang as
   `void foo(simd<T, N>* sret(simd<T, N>) %res, simd<T, N>* %x)`
for the SPIRV target (unless `__regcall` is specified), which is then
converted to
  `void foo(<N x T>* sret(<N x T>) %res, <N x T>* %x)`
in the LowerESIMDVecArg, then to
  `<N x T> foo(<N x T> %x)`
in the VC BE.

With the opaque pointers this becomes impossible, and the optimization
must happen in the "ESIMD FE". This patch implements it.

It also changes `lowerEsimdConstructs` in sycl-post-link.cpp to use the new
pass manager, to avoid the old pass-manager-related boilerplate code in the
new ESIMD transformations.

Signed-off-by: Konstantin S Bobrovsky <[email protected]>

* [SYCL][L0] Use compute engine for me…
@keryell keryell added enhancement New feature or request and removed Agenda To be discussed during a SYCL committee meeting labels Feb 17, 2023
@keryell
Copy link
Member

keryell commented Mar 16, 2023

Gordon to be back on this in a few weeks.

@fraggamuffin
Copy link

a good F2F

AerialMantis added a commit that referenced this pull request May 30, 2023
CUDA backend specification doc pulled in from #197.
@AerialMantis
Copy link
Collaborator Author

@illuhad apologies that it took me so long to get back to this, I've created a significant update to the pull request and actually moved this to a new PR as I ran into issues rebasing the changes.

I could of course just return some stream from the pool, but I'm not sure if this might not do more harm than good because I don't think it's what users would expect. In that case, there'd be no guarantees that the returned stream synchronizes in any way with previous operations submitted to the SYCL queue (they may have been scheduled to different streams, even for in-order queues) and neither would subsequent operations be guaranteed to synchronize with anything the user submits to the stream via interop.

Regarding your comment about which CUDA stream is returned when interoperating with a SYCL queue, and how to correctly synchronize with this, I clarified the wording such that both the application interop and host task interop require synchronization with the SYCL queue prior to retrieving the native CUDA stream or executing the host task. This is quite strict, and perhaps not the most efficient approach, however, it does ensure correctness, and I think we could address this at a later point with a general improvement to the core specification with a feature like your extension for custom operations or a similar extension that we've been working on at Codeplay.
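
A hedged sketch of the host-task interop direction discussed here, using the SYCL 2020 `interop_handle` API with the DPC++ `ext_oneapi_cuda` backend enumerator (`useStreamInCudaLibrary` is a hypothetical user function):

    q.submit([&](sycl::handler &cgh) {
      cgh.host_task([=](sycl::interop_handle ih) {
        // Per the wording above, the queue is synchronized before the native
        // stream is handed to the host task.
        auto stream = ih.get_native_queue<sycl::backend::ext_oneapi_cuda>();
        useStreamInCudaLibrary(stream); // hypothetical
      });
    });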

I'm not sure those functions make much sense for hipSYCL. The hipSYCL interpretation of e.g. a queue is really more of a high-level interface to the SYCL task graph, with some functionality to manage synchronization of groups of tasks (e.g. wait()). To make this more clear I am pondering whether we should introduce the alias using task_group = queue; in hipSYCL. Relating a task group to a stream does not sound very useful semantically, and adding streams at runtime for interop purposes breaks some of the assumptions that the scheduler currently can make. I'll have to think some more about this. For context it might be even more complicated as we couldn't rely on the CUDA runtime API context anymore in the workflows that use it, and this in turn might break other interop use cases that currently "just work" for SYCL+CUDA interop applications that use the runtime API.

So I think the way to support this is to have the SYCL queue or context created from a native CUstream or CUcontext kept independent of the rest of the CUstreams or CUcontexts, so if there is a pool, these would be kept separate from it. The high-level scheduling based on dependencies could still apply as normal.
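
A hedged sketch of that direction using the SYCL 2020 `make_queue` interop function with the DPC++ `ext_oneapi_cuda` backend enumerator; exact spellings and the required context argument may differ between implementations (`nativeStream` and `cudaCtx` are assumed to already exist):

    // Wrap an existing CUstream in a SYCL queue; any implementation-managed
    // stream pool stays separate from this queue.
    sycl::queue q =
        sycl::make_queue<sycl::backend::ext_oneapi_cuda>(nativeStream, cudaCtx);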

@AerialMantis
Copy link
Collaborator Author

@illuhad @gmlueck @JackAKirk @hdelan FYI I've had to open a new PR for this as I ran into some issues rebasing this, coming back to it after some time. The latest version is now in #420.

@AerialMantis
Copy link
Collaborator Author

Closing this, as a new PR has been opened for it: #420.
