Navi card subgroup shuffle support for gemm #512

fancyIX · 2023-10-28T22:53:32Z

WIP

Use inline assembly for single precision on Navi cards.

Question: how to handle all realN types?

CNugteren

Question: how to handle all realN types?

Hmm, from your implementation it seems like the NVIDIA case does a similar thing, so apparently that does work. What compilation errors do you get?

CNugteren · 2023-10-30T15:42:28Z

src/kernels/level3/xgemm_part1.opencl

@@ -138,6 +138,25 @@ R"(
  #endif
 #endif

+#if USE_SUBGROUP_SHUFFLING == 1 && SUBGROUP_SHUFFLING_GCN == 1
+  #define SUBGROUP_SIZE 32              // Assumes subgroup size is always 4 on AMD GCN GPUs


On the left you write 32, on the right you write 4; one of them is probably incorrect?

fancyIX · 2023-10-30

Yes. It should be 32 (for Navi cards). Will change the comment.

CNugteren · 2023-10-30T15:45:37Z

src/utilities/compile.cpp

+  if (device.IsGPU() && device.IsAMD() && device.Name().find("gfx1") != std::string::npos) {
+    header_string += "#define USE_SUBGROUP_SHUFFLING 1\n";
+    header_string += "#define SUBGROUP_SHUFFLING_GCN 1\n";
+  }


Implementing it like this forces it on for all these AMD GPUs. Did you verify that using subgroup shuffling is always better compared to not using subgroup shuffling? The easiest way to test is to run the XGEMM kernel tuner with and without this modification and compare execution times of a good portion of the first chunk of kernels.

The alternative to what you implemented here is to make it an AMD-specific tuning parameter, and then at tuning time it will decide whether subgroup shuffling was a good idea or not.

fancyIX · 2023-10-30

It is only supposed to turn on for device names beginning with "gfx1", which I assume applies only to Navi cards, such as gfx1010, gfx1030, etc.

I have run the XGEMM tuner and it seems a little better on an RX 6900 XT card. I can test on an RX 5700 XT later too.
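
As a minimal sketch of the matching rule in the diff: `find` reports a match anywhere in the device name, which is slightly broader than "beginning with". A strict prefix test is shown next to it for comparison. The helper names `ContainsGfx1` and `StartsWithGfx1` are made up for illustration and are not part of CLBlast.

```cpp
#include <string_view>

// Hypothetical helpers, not CLBlast code: they only mirror the condition in the diff.

// Same test as the diff: true if "gfx1" occurs anywhere in the device name.
constexpr bool ContainsGfx1(std::string_view name) {
  return name.find("gfx1") != std::string_view::npos;
}

// Stricter variant: true only if the device name begins with "gfx1".
constexpr bool StartsWithGfx1(std::string_view name) {
  return name.substr(0, 4) == "gfx1";
}
```

Both checks accept a name such as "gfx1030" and reject "gfx906". A name with a leading vendor string (for example the made-up "AMD gfx1010") would pass only the first check.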

fancyIX · 2023-11-01T03:32:45Z

@CNugteren code is ready for review. Have run xgemm tuner on 6900 XT.
Performance improvement on round 3:
With change:

Found best result 1.12 ms: 1913.8 GFLOPS
Best parameters: GEMMK=1 KREG=1 KWG=1 KWI=1 MDIMA=8 MDIMC=8 MWG=64 NDIMB=4 NDIMC=4 NWG=32 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1
Without change:
Found best result 1.39 ms: 1545.4 GFLOPS
Best parameters: GEMMK=1 KREG=4 KWG=1 KWI=1 MDIMA=8 MDIMC=8 MWG=16 NDIMB=16 NDIMC=16 NWG=32 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=2
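
For reference, the relative gain implied by these two tuner results can be worked out directly. This is only arithmetic on the numbers quoted above; the helper and constant names are made up for illustration.

```cpp
// Hypothetical helper: ratio of the new throughput to the baseline throughput.
constexpr double Speedup(double baseline_gflops, double new_gflops) {
  return new_gflops / baseline_gflops;
}

// Values copied from the two tuner outputs above.
constexpr double kWithChange = 1913.8;     // GFLOPS at 1.12 ms
constexpr double kWithoutChange = 1545.4;  // GFLOPS at 1.39 ms

// Roughly 1.24x in favour of the run with the change.
constexpr double kGain = Speedup(kWithoutChange, kWithChange);
```

Note that the two best results use different parameter sets, so this ratio compares the best configuration found in each run, not the same kernel with and without shuffling.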

CNugteren · 2023-11-01T16:15:34Z

Have run xgemm tuner on 6900 XT.
Performance improvement on round 3:

Good to see performance increases by quite a lot! But also the best parameters are now different (which could be coincidence). So I would like you to test a bit more in depth, regarding what I said above:

Implementing it like this forces it on for all these AMD GPUs. Did you verify that using subgroup shuffling is always better compared to not using subgroup shuffling? The easiest way to test is to run the XGEMM kernel tuner with and without this modification and compare execution times of a good portion of the first chunk of kernels.

So in other words, the way you've implemented it now assumes this is always a good idea. So if you can run the tuner (the first part with the non-random configurations), compare the before & after for each individual run, and ensure that they are either equal or better with the new code (within some noise margin of course), then this is fine. Otherwise we'll need to change the way you've implemented this and make this a tuning parameter instead, which complicates things.
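
The per-run comparison described here could be sketched as follows. This is not part of the CLBlast tuner: `CountRegressions` is a hypothetical helper, and it assumes the two arrays hold execution times in ms for the same configurations in the same order, with `noise` given as a fraction (0.05 for 5%).

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: counts configurations that became slower than the
// baseline by more than the allowed noise margin.
template <std::size_t N>
constexpr std::size_t CountRegressions(const std::array<double, N>& before_ms,
                                       const std::array<double, N>& after_ms,
                                       double noise) {
  std::size_t regressions = 0;
  for (std::size_t i = 0; i < N; ++i) {
    // A run only counts as a regression if it exceeds baseline * (1 + noise).
    if (after_ms[i] > before_ms[i] * (1.0 + noise)) { ++regressions; }
  }
  return regressions;
}
```

A result of zero over the non-random configurations would support forcing the feature on; any other result would argue for making it a tuning parameter.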

fancyIX · 2023-11-01T21:06:05Z

compare the before & after for each individual run

There is existing logic that undefines the shuffle flag if the subgroup size requirement is not met, so I guess only a few runs need to be considered, not each run, since many of them don't meet the subgroup requirement to turn on the shuffle flag.
I mean this code:
https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part1.opencl#L141C15-L141C15

fancyIX · 2023-11-02T04:37:51Z

Taking a closer look at the constraint:

#if NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE
  #undef USE_SUBGROUP_SHUFFLING
  #define USE_SUBGROUP_SHUFFLING 0     // Disables subgroups in case the assumptions don't hold
#endif

Seems like if SUBGROUP_SIZE the tuner will never satisfy this condition. Don't know how nvidia's 32 subgroup size can work.
In any case, I will check how we can cope with this condition check for AMD.
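
To see which tuning parameters survive this guard, the condition can be mirrored as a small predicate. This is a sketch under two assumptions: that NWI is defined as NWG / NDIMC, as in the XGEMM kernel definitions, and that the subgroup size is 32 as in the diff. The names `kSubgroupSize` and `ShuffleStaysEnabled` are hypothetical.

```cpp
// Hypothetical mirror of the preprocessor guard quoted above; not CLBlast code.
constexpr int kSubgroupSize = 32;  // Assumed value, taken from the diff for Navi

// Assumption: NWI = NWG / NDIMC, with NWG an integer multiple of NDIMC.
// The guard disables shuffling when NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE,
// so shuffling stays enabled only when both of those conditions are false.
constexpr bool ShuffleStaysEnabled(int nwg, int ndimc, int mdimc) {
  return (nwg / ndimc) == kSubgroupSize && mdimc >= kSubgroupSize;
}
```

Under these assumptions, the best configuration reported earlier in this thread (NWG=32, NDIMC=4, MDIMC=8) gives NWI = 8 and would not pass the guard, whereas an illustrative combination such as NWG=128, NDIMC=4, MDIMC=32 would.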

CNugteren · 2023-11-02T08:32:02Z

so I guess only a few runs need to be considered, not each run, since many of them don't meet the subgroup requirement to turn on the shuffle flag.

Indeed, not each run has to be checked, but I thought it was easier to just check everything and make sure it is either unchanged or better. But feel free to look only at the cases that use subgroup shuffling. Furthermore, keep in mind that there might be other constraints, because even if USE_SUBGROUP_SHUFFLING is set to 1, there might be cases where the code that uses subgroup shuffling isn't actually called. I believe it is called only if GEMMK == 1 (see e.g. https://github.com/CNugteren/CLBlast/blob/master/src/kernels/level3/xgemm_part3.opencl#L75), but there might be more constraints; I would have to take a closer look.

Seems like if SUBGROUP_SIZE the tuner will never satisfy this condition. Don't know how nvidia's 32 subgroup size can work.

Good point, I will also have a more detailed look later.

CNugteren · 2023-11-03T20:50:52Z

Seems like if SUBGROUP_SIZE the tuner will never satisfy this condition. Don't know how nvidia's 32 subgroup size can work.

You are right that this can be an issue, as noted even in the original post of the PR that introduced it (#297). However, a bit further down in that thread I do some analysis (#297 (comment)) which concludes that there are some combinations that trigger it, although not many, and that they might have to be changed to increase the chances, which I believe we never did. However, a user is of course free to modify the tuner files manually.

Regarding the PR itself, apart from my earlier comment (to see if shuffling here is always a good idea), I have 3 more comments before we can merge this in:

Could you add a short statement about this to the Changelog file such that it is incorporated in the release notes of the next version? Thanks!
Could you also test performance on at least one other AMD Navi GPU if you have access to it? Or perhaps someone else can do that?
Could you also run the CLBlast XGEMM tests on your GPU with your changes and with the new best tuning parameters?

Again, thanks a lot for your contribution!

fancyIX · 2023-11-04T02:17:09Z

Thanks for the info.
One thing I don't understand is that in the comment of #297 it says that:
In summary, we'll thus need: an NWG that is 32 times as large as NDIMC, and an MDIMC which is smaller than 32.
However the macro we see is:
#if NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE
#undef USE_SUBGROUP_SHUFFLING
Based on the comment, shouldn't we change it to:
#if NWI != SUBGROUP_SIZE || MDIMC >= SUBGROUP_SIZE

tangjinchuan · 2023-11-05T12:09:17Z

If you need to test it on an AMD 7800 XT, please send me your compiled Windows tuner files with shfl enabled for Navi. I have already posted the previous tuning results for the 7800 XT to GitHub.

CNugteren · 2023-11-08T19:09:04Z

One thing I don't understand is ...

I re-read it and looked at the code again, and I don't understand my own comment either. I would rather trust the code than the comment. Perhaps I made a typo in the comment and meant larger instead of smaller; I might have been confused by the fact that the code says MDIMC < SUBGROUP_SIZE, but that is of course the condition to disable it.

tangjinchuan · 2024-02-22T17:12:37Z

Also, is this ready to roll?

tangjinchuan · 2024-06-08T14:20:33Z

7800xt.txt
Using the latest AMD ROCm 6.1.2 on Ubuntu 24.04 with the repository at ed32af0 with AMD shuffle, no problems were reported.

tuningresults.zip
In terms of tuning performance, if I read correctly, the AMD shuffle version reaches a best performance of 10 TFLOPS, while the latest version without shuffle on today's CLBlast master branch reaches 9.7 TFLOPS.
I have attached all outputs, including the tuning process from the terminal as well as the JSON files for the XGEMM kernels.
Also, please ignore the device name gfx1100, which would be a 7900 XTX. I am using a 7800 XT (gfx1101) here, but changed the device name as guided by the ROCm GitHub project to enable the ROCm libraries for PyTorch. I do have a 7900 XTX if you need another test.

fancyIX · 2024-06-09T08:25:13Z

@tangjinchuan Sorry, I haven't had time to finish this PR recently. To me, applying the AMD assembly shuffle is not mature yet. If you read the previous posts in this thread, only certain configurations of matrix size will trigger the assembly shuffle. For most cases it is not triggered. I am not sure if shuffle is needed at all.

tangjinchuan · 2024-06-09T13:38:47Z

@fancyIX Thank you very much for this message. It's fine. I believe the community highly appreciates any commitment you have been making.

fancyIX added 6 commits October 23, 2023 20:25

Some draft

fde7943

modifiy draft with sub_group func

4e35c89

Add Navi subgroup shuffle support

a871605

Question: how to process all kinds of realN type

try to fix build error

b5fe348

try to fix build error

3ef0f88

try to fix build error of too long string

27648f5

CNugteren reviewed Oct 30, 2023

only apply asm for shuffle for single precision

ed32af0
