
Navi card subgroup shuffle support for gemm #512

Open

fancyIX wants to merge 7 commits into master
Conversation


@fancyIX commented Oct 28, 2023

WIP

Use inline assembly for single precision on Navi cards.

@CNugteren (Owner) left a comment


> Question: how to handle all realN types?

Hmm, from your implementation it seems like the NVIDIA case does a similar thing, so apparently that does work. What compilation errors do you get?

@@ -138,6 +138,25 @@ R"(
#endif
#endif

#if USE_SUBGROUP_SHUFFLING == 1 && SUBGROUP_SHUFFLING_GCN == 1
#define SUBGROUP_SIZE 32 // Assumes subgroup size is always 4 on AMD GCN GPUs
CNugteren (Owner):

On the left you write 32, on the right you write 4, one of them probably is incorrect?

fancyIX (Author):

Yes. It should be 32 (for Navi cards). Will change the comment.

Comment on lines 81 to 84
if (device.IsGPU() && device.IsAMD() && device.Name().find("gfx1") != std::string::npos) {
header_string += "#define USE_SUBGROUP_SHUFFLING 1\n";
header_string += "#define SUBGROUP_SHUFFLING_GCN 1\n";
}
CNugteren (Owner):

Implementing it like this forces it on for all these AMD GPUs. Did you verify that using subgroup shuffling is always better compared to not using subgroup shuffling? The easiest way to test is to run the XGEMM kernel tuner with and without this modification and compare execution times of a good portion of the first chunk of kernels.

The alternative to what you implemented here is to make it a AMD-specific tuning parameter, and then at tuning time it will decide whether subgroup shuffling was a good idea or not.

fancyIX (Author):

It is only supposed to turn on for device names beginning with "gfx1", which I assume applies only to Navi cards, like gfx1010, gfx1030, etc.

fancyIX (Author):

I have run the xgemm tuner and it seems a little bit better on the RX 6900 XT card. I can test on an RX 5700 XT too later.

@fancyIX commented Nov 1, 2023

@CNugteren code is ready for review. Have run xgemm tuner on 6900 XT.
Performance improvement on round 3:
With change:

  • Found best result 1.12 ms: 1913.8 GFLOPS
  • Best parameters: GEMMK=1 KREG=1 KWG=1 KWI=1 MDIMA=8 MDIMC=8 MWG=64 NDIMB=4 NDIMC=4 NWG=32 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1

Without change:

  • Found best result 1.39 ms: 1545.4 GFLOPS
  • Best parameters: GEMMK=1 KREG=4 KWG=1 KWI=1 MDIMA=8 MDIMC=8 MWG=16 NDIMB=16 NDIMC=16 NWG=32 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=2

@CNugteren (Owner):

> Have run xgemm tuner on 6900 XT.
> Performance improvement on round 3:

Good to see performance increases by quite a lot! But the best parameters are now also different (which could be a coincidence). So I would like you to test a bit more in depth, regarding what I said above:

> Implementing it like this forces it on for all these AMD GPUs. Did you verify that using subgroup shuffling is always better compared to not using subgroup shuffling? The easiest way to test is to run the XGEMM kernel tuner with and without this modification and compare execution times of a good portion of the first chunk of kernels.

So in other words, the way you've implemented it now assumes this is always a good idea. If you can run the tuner (the first part, with the non-random configurations) and compare the before & after for each individual run, and ensure that they are either equal or better with the new code (within some noise margin, of course), then this is fine. Otherwise we'll need to change the way you've implemented this and make it a tuning parameter instead, which complicates things.

@fancyIX commented Nov 1, 2023

> compare the before & after for each individual run

There is existing logic that undefines the shuffle flag if the subgroup-size requirement is not met, so I guess only a few runs need to be considered, not each run, since many of them don't meet the subgroup requirement to turn on the shuffle flag.
I mean this code:
https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part1.opencl#L141C15-L141C15

@fancyIX commented Nov 2, 2023

Taking a closer look at the constraint:

#if NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE
  #undef USE_SUBGROUP_SHUFFLING
  #define USE_SUBGROUP_SHUFFLING 0     // Disables subgroups in case the assumptions don't hold
#endif

Seems like with SUBGROUP_SIZE at 32 the tuner will never satisfy this condition. Don't know how NVIDIA's 32 subgroup size can work.
In any case, I will check how we cope with this condition check for AMD.

@CNugteren (Owner):

> so I guess only a few runs need to be considered, not each run, since many of them don't meet the subgroup requirement to turn on the shuffle flag.

Indeed, not each run has to be checked, but I thought it was easier to just check everything and make sure each is either unchanged or better. But feel free to look at just the cases that use subgroup shuffling. Furthermore, keep in mind that there might be other constraints: even if USE_SUBGROUP_SHUFFLING is set to 1, there might be cases where the code that uses subgroup shuffling isn't actually called. I believe only when GEMMK == 1 (see e.g. https://github.com/CNugteren/CLBlast/blob/master/src/kernels/level3/xgemm_part3.opencl#L75), but there might be more constraints; I would have to take a closer look.

> Seems like with SUBGROUP_SIZE at 32 the tuner will never satisfy this condition. Don't know how NVIDIA's 32 subgroup size can work.

Good point, I will also have a more detailed look later.

@CNugteren (Owner):

> Seems like with SUBGROUP_SIZE at 32 the tuner will never satisfy this condition. Don't know how NVIDIA's 32 subgroup size can work.

You are right that this can be an issue, as was noted even in the original post of the PR that introduced it (#297). However, a bit further down in that thread I did some analysis (#297 (comment)) which concludes that there are some combinations that trigger it, although not many, and that they might have to be changed to increase the chances, which I believe we never did. However, a user is free to modify the tuner files manually, of course.

Regarding the PR itself, apart from my earlier comment (to see if shuffling here is always a good idea), I have 3 more comments before we can merge this in:

  1. Could you add a short statement about this to the Changelog file such that it is incorporated in the release notes of the next version? Thanks!
  2. Could you also test performance on at least one other AMD Navi GPU if you have access to it? Or perhaps someone else can do that?
  3. Could you also run the CLBlast XGEMM tests on your GPU with your changes and with the new best tuning parameters?

Again, thanks a lot for your contribution!

@fancyIX commented Nov 4, 2023

Thanks for the info.
One thing I don't understand is that a comment in #297 says:

> In summary, we'll thus need: an NWG that is 32 times as large as NDIMC, and an MDIMC which is smaller than 32.

However, the macro we see is:

#if NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE
#undef USE_SUBGROUP_SHUFFLING

Based on the comment, shouldn't we change it to:

#if NWI != SUBGROUP_SIZE || MDIMC >= SUBGROUP_SIZE

@tangjinchuan:
If you need to test it on an AMD 7800 XT, please send me your compiled Windows tuner files with shfl enabled for Navi. I have already posted the previously tuned 7800 XT results to GitHub.

@CNugteren commented Nov 8, 2023

> One thing I don't understand is ...

I re-read it and looked at the code again, and I don't understand my own comment either. I would rather trust the code than the comment; perhaps I made a typo in the comment and meant larger instead of smaller. I might have gotten confused by the fact that it says MDIMC < SUBGROUP_SIZE, but that is to disable it, of course.

@tangjinchuan:
Also, is this ready to roll?

@tangjinchuan commented Jun 8, 2024

7800xt.txt
Using the latest AMD ROCm 6.1.2 on Ubuntu 24.04 with the repository at ed32af0 with AMD shuffle. There are no problems reported.

tuningresults.zip
In terms of tuning performance, if I read correctly, the AMD shuffle version has a best performance of 10 TFLOPS, while the latest non-shuffle version on today's CLBlast master branch reaches 9.7 TFLOPS.
I have attached all outputs, including the tuning process from the terminal as well as the JSON files for the xGEMM kernels.
Also, please ignore the device name gfx1100, which would normally be a 7900 XTX. I am actually using a 7800 XT (gfx1101) but changed the device name, as guided by the ROCm GitHub project, to enable the ROCm libraries for PyTorch. I do have a 7900 XTX if you need another test.

@fancyIX commented Jun 9, 2024

@tangjinchuan Sorry, I haven't had time to finish this PR recently. To me, the AMD assembly shuffle isn't mature enough to apply yet. If you read the previous posts in this thread, only certain tuning configurations trigger the assembly shuffle; in most cases it's not triggered. I am not sure if shuffle is needed at all.

@tangjinchuan:
@fancyIX Thank you very much for this message. It's fine. I believe the community highly appreciates any commitment you have been making.
