[AMD] fixed the ReorderInstructions pass #5254

Draft · wants to merge 1 commit into main

Conversation

Contributor

@ravil-mobile ravil-mobile commented Nov 25, 2024

  • fixed local store and global load ordering for GEMM kernels

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because the existing tests are expected to cover the code changes.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

Contributor

@sjw36 sjw36 left a comment

See comments below

if (isPureMatmulProblem(funcOp)) {
  scheduleGlobalLoadLocalStore(funcOp);
  sinkSecondLoad(funcOp);
  const bool independentGlobalLoadStages =
Contributor

This should be independent of whether pipelining has happened or not. It previously applied to all loads in the function, not just in the for loop.

// Best perf on GEMM when these precede global loads.
funcOp.walk([&](ttg::LocalStoreOp op) { moveOps.push_back(op); });

if (independentGlobalLoadStages) {
Contributor

This is already accounted for in the test for local_store and leadsToLoad on line 260.

Therefore, the order here can universally be:

  • local_stores
  • global_loads

The list is then reversed (line 233), so the local_stores are moved last and end up at the top of the loop if they are independent of the global loads.
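
A minimal sketch of that collect-then-reverse flow (helper names follow the diffs in this thread; the local_store/leadsToLoad dependence test is approximated as a boolean helper here, and findEarlyInsertionPoint is assumed to return the op to move after):

// Illustrative only. local_stores are collected first, then global loads;
// iterating the list in reverse hoists the global loads first and the
// local_stores last, so an independent local_store lands above the loads
// at the top of the loop.
SmallVector<Operation *> moveOps;
funcOp.walk([&](ttg::LocalStoreOp op) { moveOps.push_back(op); });
funcOp.walk([&](triton::LoadOp op) { moveOps.push_back(op); });

for (Operation *op : llvm::reverse(moveOps)) {
  // A local_store that feeds a later global load must stay below it.
  if (isa<ttg::LocalStoreOp>(op) && leadsToLoad(op))
    continue;
  // Otherwise hoist it to the earliest legal point in its block.
  if (Operation *anchor = findEarlyInsertionPoint(op->getBlock(), op))
    op->moveAfter(anchor);
}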

Comment on lines 216 to 220
forOp.walk([&](ttg::LocalStoreOp op) { moveOps.push_back(op); });

// Move global loads early to prefetch. This may increase register pressure
// but it enables issuing global loads early.
forOp.walk([&](triton::LoadOp op) { moveOps.push_back(op); });
Contributor Author

@sjw36, if I understood you correctly, we only need this change, right?

Contributor

This is really the only source change needed.

Contributor

@sjw36 sjw36 left a comment

Let's just do the swap for now and fix up the tests.

@@ -58,7 +53,7 @@ findEarlyInsertionPoint(Block *block, Operation *move) {
     if (isa<triton::AtomicRMWOp, triton::AtomicCASOp>(wop))
       ipnt = bi;
     // Break at barrier
-    if (isa<gpu::BarrierOp>(wop))
+    if (isa<mlir::gpu::BarrierOp>(wop))
Contributor

This should be unnecessary since it built before.

Contributor Author

This happened because of the include of "third_party/amd/lib/TritonAMDGPUToLLVM/Utility.h". That header, in turn, includes "triton/Conversion/TritonGPUToLLVM/Utility.h", which contains the following declaration:

namespace gpu {
Type getFunctionType(Type resultType, ValueRange operands);
LLVM::LLVMFuncOp appendOrGetExternFuncOp(RewriterBase &rewriter, Operation *op,
                                         StringRef funcName, Type funcType,
                                         StringRef libname = "",
                                         StringRef libpath = "");
} // namespace gpu

As you can see, this declares a nested gpu namespace. So we need to be explicit about the namespace in the ReorderInstructions.cpp file; otherwise, we get a compilation error.
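
A reduced, compilable stand-in for the clash (all types here are illustrative placeholders, and the using-directives are assumed to mirror the .cpp file): once both namespaces are visible, the unqualified gpu is ambiguous, so isa<gpu::BarrierOp> no longer compiles.

namespace mlir {
namespace gpu {
struct BarrierOp {}; // stands in for the MLIR GPU dialect op
} // namespace gpu
namespace triton {
namespace gpu { // the nested namespace pulled in via Utility.h
struct LocalStoreOp {};
} // namespace gpu
} // namespace triton
} // namespace mlir

using namespace mlir;
using namespace mlir::triton;

int main() {
  // gpu::BarrierOp b1;    // error: reference to 'gpu' is ambiguous
  mlir::gpu::BarrierOp b2; // explicit qualification, as in the diff above
  (void)b2;
}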

@@ -214,14 +209,15 @@ static void moveUpTranspose(triton::FuncOp funcOp) {
 }

 // Schedule global load and local store ops for better GEMM performance.
-static void scheduleGlobalLoadLocalStore(triton::FuncOp funcOp) {
+static void scheduleGlobalLoadLocalStore(scf::ForOp forOp) {
Contributor

This may have implications for other workloads, where it is beneficial to apply outside of for loops. Let's keep it as before.

Contributor Author

I remember we had a problem with the persistent_streamk kernel because there were 2 nested scf.ForOps. The inner one (which was a pure GEMM loop) was not captured by our pattern matcher.
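
A hedged sketch of the shape of that fix, walking every loop instead of matching at the function level (isPureMatmulLoop is a hypothetical per-loop predicate, not the existing isPureMatmulProblem):

// persistent_streamk-style kernels have an outer work-distribution loop
// wrapping an inner pure-GEMM loop; a function-level check misses the
// inner one. Visiting each scf::ForOp individually would capture it.
funcOp.walk([&](scf::ForOp forOp) {
  if (isPureMatmulLoop(forOp)) { // hypothetical per-loop predicate
    scheduleGlobalLoadLocalStore(forOp);
    sinkSecondLoad(forOp);
  }
});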

Contributor Author

@sjw36, this function is only executed when we are dealing with pureMatmulProblems:

if (isPureMatmulProblem(funcOp)) {
  scheduleGlobalLoadLocalStore(funcOp);
  sinkSecondLoad(funcOp);
}

@ravil-mobile ravil-mobile changed the title from "[AMD] extended the ReorderInstructions pass to handle special cases" to "[AMD] fixed the ReorderInstructions pass" on Nov 26, 2024
@ravil-mobile
Contributor Author

Hi @sjw36,

I did another revert. I will add the fix for persistent_streamk in a separate PR (that way it would be easy to revert if needed).

Contributor

@sjw36 sjw36 left a comment

Looks good. Let's run perf to verify no regressions.
