This repository has been archived by the owner on May 27, 2024. It is now read-only.

Requesting intrinsics for DS_Consume and DS_Append in HCC #3

Open
dragontamer opened this issue Jan 25, 2019 · 3 comments

Comments

@dragontamer

dragontamer commented Jan 25, 2019

Since this is a minor feature request, I don't believe it requires an RFC. If the team thinks a formal RFC would be best, just let me know and I'll do the "pull request" dance.

Summary

Add "DS_Consume" and "DS_Append" intrinsics to HCC.

Motivation

DS_Consume and DS_Append can be used to implement highly efficient, compact queues in LDS within a wavefront. Support for these instructions seems to exist as early as GCN 1.0.

Detailed design

According to the GCN ISA, DS_Append increments an LDS variable by the popcount of the execution mask. For example, if 40 threads are active, DS_Append would add 40 to the location. DS_Consume is the inverse: it decrements the location by the population count of the execution mask.

HCC already implements a number of intrinsics, such as __amdgcn_ds_bpermute. Following the convention, the functions would look something like this:

int __amdgcn_ds_append(tile_static int& val);
int __amdgcn_ds_consume(tile_static int& val);

The return value is the pre-operation value, as per the ISA.
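To make the semantics above concrete, here is a minimal host-side model in plain C++. It is only a sketch of the behavior described in this request, not the HCC intrinsic itself: the names `popcount64`, `model_ds_append`, `model_ds_consume`, and `lane_slot` are hypothetical, and the execution mask is passed explicitly as a 64-bit integer because ordinary host code has no hardware EXEC register.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model only: count the set bits of a 64-bit execution mask.
inline int popcount64(std::uint64_t mask) {
    int n = 0;
    for (; mask != 0; mask &= mask - 1) ++n;  // clears the lowest set bit each pass
    return n;
}

// Model of DS_Append as described above: add popcount(exec) to the
// counter and return the pre-operation value.
inline int model_ds_append(int& counter, std::uint64_t exec) {
    const int old = counter;
    counter += popcount64(exec);
    return old;
}

// Model of DS_Consume: subtract popcount(exec) from the counter and
// return the pre-operation value.
inline int model_ds_consume(int& counter, std::uint64_t exec) {
    const int old = counter;
    counter -= popcount64(exec);
    return old;
}

// Hypothetical helper for a wavefront-level queue: an active lane's slot
// is the returned base plus the number of active lanes below it.
inline int lane_slot(int base, std::uint64_t exec, int lane) {
    assert(lane >= 0 && lane < 64);
    const std::uint64_t lower = exec & ((std::uint64_t{1} << lane) - 1);
    return base + popcount64(lower);
}
```

With lanes 0, 1, and 3 active and the counter at 0, the modeled append returns 0 and leaves the counter at 3; `lane_slot` then gives those lanes slots 0, 1, and 2, which is one way the single returned base could be turned into compact per-lane queue positions.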

Drawbacks

DS_Consume and DS_Append are somewhat obscure functions of the hardware. I'm not sure if many people would be aware of how to use the functions.

Alternatives

The functions could take a pointer instead, like this:

int __amdgcn_ds_append(tile_static int* val);

The pointer version is more C-like, while the reference version is more C++-like.

Unresolved questions

These functions also can be used with GDS memory, but I don't know how GDS memory works.

@b-sumner

Thank you for your request. We are looking into this.

@b-sumner

Support was just committed to LLVM trunk. The committed clang builtins are:

int __builtin_amdgcn_ds_append(__local int *v);
int __builtin_amdgcn_ds_consume(__local int *v);

(__local here means address space 3).

I think a tile_static int * (or __shared__ int *) argument will work, but an explicit cast to (__attribute__((address_space(3))) int *) is possible.

It might be best to consider this function similar to a relaxed atomic add. However, one significant difference from atomic add is that for all active lanes in the wavefront executing the call, the argument must be the same. If this is known not to be true, then explicit logic is necessary, e.g.

if (this_lane_is_accessing_var_1)
... __builtin_amdgcn_ds_append(var_1_ptr) ...
else
... __builtin_amdgcn_ds_append(var_2_ptr) ...
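
The effect of that split can be sketched with a small host-side model in plain C++. This is an illustration under stated assumptions rather than device code: `count_active`, `append_uniform`, `SplitResult`, and `append_split` are hypothetical names, the execution mask and the per-lane condition are passed as 64-bit masks, and -1 is used here only to mark a branch that no lane took.

```cpp
#include <cstdint>

// Host-side model only: number of set bits in a lane mask.
inline int count_active(std::uint64_t mask) {
    int n = 0;
    for (; mask != 0; mask &= mask - 1) ++n;  // clears the lowest set bit each pass
    return n;
}

// Model of one append call whose active lanes all name the same counter.
inline int append_uniform(int& counter, std::uint64_t active) {
    const int old = counter;
    counter += count_active(active);
    return old;
}

// Pre-operation values seen by each branch (-1 if the branch had no lanes).
struct SplitResult {
    int base_1;
    int base_2;
};

// Mirrors the if/else above: lanes that select var_1 execute one call,
// the remaining active lanes execute the other, so each call sees a
// uniform argument across its own active lanes.
inline SplitResult append_split(int& var_1, int& var_2,
                                std::uint64_t exec,
                                std::uint64_t uses_var_1) {
    const std::uint64_t then_mask = exec & uses_var_1;
    const std::uint64_t else_mask = exec & ~uses_var_1;
    SplitResult r{-1, -1};
    if (then_mask != 0) r.base_1 = append_uniform(var_1, then_mask);
    if (else_mask != 0) r.base_2 = append_uniform(var_2, else_mask);
    return r;
}
```

In this model, eight active lanes split evenly between the two counters advance each counter by 4, and each group receives its own base value.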

Finally, please note that while this has been checked in, it is not guaranteed to appear in the next release. It should be appearing relatively soon though.

@dragontamer
Author

dragontamer commented Jan 29, 2019

Those restrictions make sense, given how the assembly statement works. It doesn't seem like HCC has a way to represent ISPC "uniform" variables. So just checking for that at compile time / assembly time is the best bet for now.

__local int * uniform would probably be the type in ISPC: a uniform pointer to an int in the __local address space.

Anyway, I'll await the release of the feature. Thanks for accepting my request! In the meantime, I can use inline-assembly to access the functionality.
