-
-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Navi card subgroup shuffle support for gemm #512
base: master
Are you sure you want to change the base?
Conversation
Question: how to process all kinds of realN type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: how to handle all realN types?
Hmm, from your implementation it seems like the NVIDIA case does a similar thing, so apparently that does work. What compilation errors do you get?
@@ -138,6 +138,25 @@ R"( | |||
#endif | |||
#endif | |||
|
|||
#if USE_SUBGROUP_SHUFFLING == 1 && SUBGROUP_SHUFFLING_GCN == 1 | |||
#define SUBGROUP_SIZE 32 // Assumes subgroup size is always 4 on AMD GCN GPUs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On the left you write 32, on the right you write 4, one of them probably is incorrect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It should be 32 (for Navi cards). Will change the comment.
src/utilities/compile.cpp
Outdated
if (device.IsGPU() && device.IsAMD() && device.Name().find("gfx1") != std::string::npos) { | ||
header_string += "#define USE_SUBGROUP_SHUFFLING 1\n"; | ||
header_string += "#define SUBGROUP_SHUFFLING_GCN 1\n"; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementing it like this forces it on for all these AMD GPUs. Did you verify that using subgroup shuffling is always better compared to not using subgroup shuffling? The easiest way to test is to run the XGEMM kernel tuner with and without this modification and compare execution times of a good portion of the first chunk of kernels.
The alternative to what you implemented here is to make it a AMD-specific tuning parameter, and then at tuning time it will decide whether subgroup shuffling was a good idea or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It only is supposed to turn on with device name beginning with "gfx1", which I assume only applies on Navi cards, like gfx1010, gfx 1030, etc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have run xgemm tuner and it seems a little bit better on rx 6900 xt card. I can test on rx 5700 xt too later.
@CNugteren code is ready for review. Have run xgemm tuner on 6900 XT.
|
Good to see performance increases by quite a lot! But also the best parameters are now different (which could be coincidence). So I would like you to test a bit more in depth, regarding what I said above:
So in other words, the way you've implemented it now assumes this is always a good idea. So if you can run the tuner (the first part with the non-random configurations) and can compare the before & after for each individual run and assure that they are either equal or better with the new code (within some noise margin of course), then this is fine. Otherwise we'll need to change the way you've implemented this and make this is a tuning parameter instead, which complicates things. |
There is existing logic that undef shuffle flag if the subgroup size is not met, so I guess only a few runs need to be consider, not each run since many of them doesn't meed the subgroup requirement to turn on shuffle flag. |
Taking closer look at the constraint: #if NWI != SUBGROUP_SIZE || MDIMC < SUBGROUP_SIZE
#undef USE_SUBGROUP_SHUFFLING
#define USE_SUBGROUP_SHUFFLING 0 // Disables subgroups in case the assumptions don't hold
#endif Seems like if SUBGROUP_SIZE the tuner will never satisfy this condition. Don't know how nvidia's 32 subgroup size can work. |
Indeed, not each run has to be checked, but I thought it was easier to just check everything, and make sure it is easier unchanged or better. But feel free to look at the cases that use subgroup shuffling indeed. Furthermore, keep in mind that there might be other constraints, because even if
Good point, I will also have a more detailed look later. |
You are right that this can be an issue, as said even in the original post of the PR that introduced it (#297). However, a bit further down in that thread I do make some analysis (#297 (comment)) which concludes that there are some combinations that trigger it, although not many, and that they might have to be changed to increase the chances - which we never did I believe. However a user is free to modify the tuner files manually of course. Regarding the PR itself, apart from my earlier comment (to see if shuffling here is always a good idea), I have 3 more comments before we can merge this in:
Again, thanks a lot for your contribution! |
Thanks for the info. |
If you need to test it on a AMD 7800XT, please send me your compiled windows tunner files with shfl enabled for Navi. I have already posted the previously tunned 7800xt to github. |
I re-read it and looked at the code again and I don't understand my own comment either. I would rather trust the code than the comment, perhaps I made a typo in the comment and I meant larger instead of smaller, I might have gotten confused by the fact that it says |
Also, is this ready to roll? |
7800xt.txt tuningresults.zip |
@tangjinchuan Sorry I don't have time to finish this PR recently. To me it's not mature to apply AMD assembly shuffle yet. If you read previous posts in this thread, only certain configuration of matrix size will trigger assembly shuffle. For most cases it's not triggered. I am not sure if shuffle is needed at all. |
@fancyIX Thank you very much for this message. It's fine. I believe the community highly appreciates any commitment you have been making. |
WIP
Use inline assembly for single precision on Navi cards.