-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AMD] fixed the ReorderInstructions
pass
#5254
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments below
if (isPureMatmulProblem(funcOp)) { | ||
scheduleGlobalLoadLocalStore(funcOp); | ||
sinkSecondLoad(funcOp); | ||
const bool independentGlobalLoadStages = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be independent of whether pipelining has happened or not. It previously applied to all loads in the function
, not just in the for loop.
// Best perf on GEMM when these precede global loads. | ||
funcOp.walk([&](ttg::LocalStoreOp op) { moveOps.push_back(op); }); | ||
|
||
if (independentGlobalLoadStages) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already accounted for in the test for local_store
and leadsToLoad
on line 260.
Therefore, the order here can universally be:
- local_stores
- global_loads
Then reversed (line 233) so the local_stores will be moved last and to the top of the loop if they are independent of global loads.
de1d3e2
to
9255bc6
Compare
forOp.walk([&](ttg::LocalStoreOp op) { moveOps.push_back(op); }); | ||
|
||
// Move global loads early to prefetch. This may increase register pressure | ||
// but it enables issuing global loads early. | ||
forOp.walk([&](triton::LoadOp op) { moveOps.push_back(op); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sjw36, If I understood you correctly. We only need this change, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really the only source change needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's just do the swap for now and fixup tests.
@@ -58,7 +53,7 @@ findEarlyInsertionPoint(Block *block, Operation *move) { | |||
if (isa<triton::AtomicRMWOp, triton::AtomicCASOp>(wop)) | |||
ipnt = bi; | |||
// Break at barrier | |||
if (isa<gpu::BarrierOp>(wop)) | |||
if (isa<mlir::gpu::BarrierOp>(wop)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be unnecessary since it built before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happened because of the include of "third_party/amd/lib/TritonAMDGPUToLLVM/Utility.h"
. In its turn, it include "triton/Conversion/TritonGPUToLLVM/Utility.h"
which has the following definition
triton/include/triton/Conversion/TritonGPUToLLVM/Utility.h
Lines 210 to 217 in e2dc77b
namespace gpu { | |
Type getFunctionType(Type resultType, ValueRange operands); | |
LLVM::LLVMFuncOp appendOrGetExternFuncOp(RewriterBase &rewriter, Operation *op, | |
StringRef funcName, Type funcType, | |
StringRef libname = "", | |
StringRef libpath = ""); | |
} // namespace gpu |
As you can see it is gpu
namespace. So, we need to be explicit about the namespace in ReorderInstructions.cpp
file. Otherwise, we have a compilation error.
@@ -214,14 +209,15 @@ static void moveUpTranspose(triton::FuncOp funcOp) { | |||
} | |||
|
|||
// Schedule global load and local store ops for better GEMM performance. | |||
static void scheduleGlobalLoadLocalStore(triton::FuncOp funcOp) { | |||
static void scheduleGlobalLoadLocalStore(scf::ForOp forOp) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may have implications for other workloads, where it is beneficial to apply outside of for loops. Let's keep it as before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember we had a problem with persistent_streamk
kernel because there was 2 nested scf.ForOp
. The inner one (which was pure gemm) was not captured by our pattern matcher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sjw36, this function is only executed when we are dealing with pureMatmulProblems
:
triton/third_party/amd/lib/TritonAMDGPUTransforms/ReorderInstructions.cpp
Lines 380 to 383 in e2dc77b
if (isPureMatmulProblem(funcOp)) { | |
scheduleGlobalLoadLocalStore(funcOp); | |
sinkSecondLoad(funcOp); | |
} |
forOp.walk([&](ttg::LocalStoreOp op) { moveOps.push_back(op); }); | ||
|
||
// Move global loads early to prefetch. This may increase register pressure | ||
// but it enables issuing global loads early. | ||
forOp.walk([&](triton::LoadOp op) { moveOps.push_back(op); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really the only source change needed.
ReorderInstructions
pass to handle special casesReorderInstructions
pass
9255bc6
to
98b4f9e
Compare
Hi @sjw36, I did another revert. I will add a fix for |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Let's run perf to verify no regressions.
New contributor declaration
I am not making a trivial change, such as fixing a typo in a comment.
I have written a PR description following these
rules.
I have run
pre-commit run --from-ref origin/main --to-ref HEAD
.Select one of the following.
/test
forlit
tests/unittest
for C++ tests/python/test
for end-to-end testsSelect one of the following.
lit
tests.lit
tests I have added follow these best practices,including the "tests should be minimal" section. (Usually running Python code
and using the instructions it generates is not minimal.)