Generalize loop pre-header creation and loop hoisting #62560

BruceForstall · 2021-12-09T01:13:07Z

A loop pre-header is a block that flows directly (and only) to the loop entry
block. The loop pre-header is the only non-loop predecessor of the entry block.
Loop invariant code can be hoisted to the loop pre-header where it is
guaranteed to be executed just once (per loop entry).

Currently, a loop pre-header has a number of restrictions:

it is only created for a "do-while" (top entry) loop, not for a mid-loop entry.
it isn't created if the current loop head and the loop entry block are in
different EH try regions

Additionally, partially due those restrictions, loop hoisting has restrictions:

it requires a "do-while" (top entry) loop
it requires the existing head block to dominate the loop entry block
it requires the existing head block to be in the same EH region as the entry block
it won't hoist if the entry block is the first block of a handler

This change removes all these restrictions.

Previously, even if we did create a pre-header, the definition of a pre-header
was a little weaker: an entry predecessor could be a non-loop block and also
not the pre-header, if the predecessor was dominated by the entry block. This
is more complicated to reason about, so I change the pre-header creation to
force entry block non-loop predecessors to branch to the pre-header instead.
This case only rarely occurs, when we have what looks like an outer loop back
edge but the natural loop recognition package doesn't recognize it as an outer loop.

I added a "stress mode" to always create a loop pre-header immediately after
loop recognition. This is disabled currently because loop cloning doesn't
respect the special status and invariants of a pre-header, and so inserts
all the cloning conditions and copied blocks after the pre-header, triggering
new loop structure asserts. This should be improved in the future.

A lot more checking of the loop table and loop annotations on blocks has been
added. This revealed a number of problems with loop unrolling leaving things
in a bad state for downstream phases. Loop unrolling has been updated to fix
this, in particular, the loop table is rebuilt if it is detected that we unroll
a loop that contains nested loops, since there will now be multiple copies of
those nested loops. This is the first case where we might rebuild the loop
table, so it lays the groundwork for potentially rebuilding the loop table in
other cases, such as after loop cloning where we don't add the "slow path"
loops to the table.

There is some code refactoring to simplify the "find loops" code as well.

Some change details:

optSetBlockWeights is elevated to a "phase" that runs prior to loop recognition.
LoopFlags is simplified:
- LPFLG_DO_WHILE is removed; call lpIsTopEntry instead
- LPFLG_ONE_EXIT is removed; check lpExitCnt == 1 instead
- LPFLG_HOISTABLE is removed (there are no restrictions anymore)
- LPFLG_CONST is removed: check lpFlags & (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT) == (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT) instead (only used in one place
- bool lpContainsCall is removed and replaced by LPFLG_CONTAINS_CALL
Added a lpInitBlock field to the loop table. For constant and variable
initialization loops, code assumed that these expressions existed in the
head block. This isn't true anymore if we insert a pre-header block.
So, capture the block where these actually exist when we determine that
they do exist, and explicitly use this block pointer where needed.
Added fgComputeReturnBlocks() to extract this code out of fgComputeReachability into a function
Added optFindAndScaleGeneralLoopBlocks() to extract this out of loop recognition to its own function.
Added optResetLoopInfo() to reset the loop table and block annotations related to loops
Added fgDebugCheckBBNumIncreasing() to allow asserting that the bbNum
order of blocks is increasing. This should be used in phases that depend
on this order to do bbNum comparisons.
Add a lot more loop table validation in fgDebugCheckLoopTable()

ghost · 2021-12-09T01:13:14Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

A loop pre-header is a block that flows directly (and only) to the loop entry
block. The loop pre-header is the only non-loop predecessor of the entry block.
Loop invariant code can be hoisted to the loop pre-header where it is
guaranteed to be executed just once (per loop entry).

Currently, a loop pre-header has a number of restrictions:

it is only created for a "do-while" (top entry) loop, not for a mid-loop entry.
it isn't created if the current loop head and the loop entry block are in
different EH try regions

Additionally, partially due those restrictions, loop hoisting has restrictions:

it requires a "do-while" (top entry) loop
it requires the existing head block to dominate the loop entry block
it requires the existing head block to be in the same EH region as the entry block
it won't hoist if the entry block is the first block of a handler

This change removes all these restrictions.

Previously, even if we did create a pre-header, the definition of a pre-header
was a little weaker: an entry predecessor could be a non-loop block and also
not the pre-header, if the predecessor was dominated by the entry block. This
is more complicated to reason about, so I change the pre-header creation to
force entry block non-loop predecessors to branch to the pre-header instead.
This case only rarely occurs, when we have what looks like an outer loop back
edge but the natural loop recognition package doesn't recognize it as an outer loop.

I added a "stress mode" to always create a loop pre-header immediately after
loop recognition. This is disabled currently because loop cloning doesn't
respect the special status and invariants of a pre-header, and so inserts
all the cloning conditions and copied blocks after the pre-header, triggering
new loop structure asserts. This should be improved in the future.

A lot more checking of the loop table and loop annotations on blocks has been
added. This revealed a number of problems with loop unrolling leaving things
in a bad state for downstream phases. Loop unrolling has been updated to fix
this, in particular, the loop table is rebuilt if it is detected that we unroll
a loop that contains nested loops, since there will now be multiple copies of
those nested loops. This is the first case where we might rebuild the loop
table, so it lays the groundwork for potentially rebuilding the loop table in
other cases, such as after loop cloning where we don't add the "slow path"
loops to the table.

There is some code refactoring to simplify the "find loops" code as well.

Some change details:

optSetBlockWeights is elevated to a "phase" that runs prior to loop recognition.
LoopFlags is simplified:
- LPFLG_DO_WHILE is removed; call lpIsTopEntry instead
- LPFLG_ONE_EXIT is removed; check lpExitCnt == 1 instead
- LPFLG_HOISTABLE is removed (there are no restrictions anymore)
- LPFLG_CONST is removed: check lpFlags & (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT) == (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT) instead (only used in one place
- bool lpContainsCall is removed and replaced by LPFLG_CONTAINS_CALL
Added a lpInitBlock field to the loop table. For constant and variable
initialization loops, code assumed that these expressions existed in the
head block. This isn't true anymore if we insert a pre-header block.
So, capture the block where these actually exist when we determine that
they do exist, and explicitly use this block pointer where needed.
Added fgComputeReturnBlocks() to extract this code out of fgComputeReachability into a function
Added optFindAndScaleGeneralLoopBlocks() to extract this out of loop recognition to its own function.
Added optResetLoopInfo() to reset the loop table and block annotations related to loops
Added fgDebugCheckBBNumIncreasing() to allow asserting that the bbNum
order of blocks is increasing. This should be used in phases that depend
on this order to do bbNum comparisons.
Add a lot more loop table validation in fgDebugCheckLoopTable()

Author:	BruceForstall
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

A loop pre-header is a block that flows directly (and only) to the loop entry block. The loop pre-header is the only non-loop predecessor of the entry block. Loop invariant code can be hoisted to the loop pre-header where it is guaranteed to be executed just once (per loop entry). Currently, a loop pre-header has a number of restrictions: - it is only created for a "do-while" (top entry) loop, not for a mid-loop entry. - it isn't created if the current loop head and the loop entry block are in different EH try regions Additionally, partially due those restrictions, loop hoisting has restrictions: - it requires a "do-while" (top entry) loop - it requires the existing `head` block to dominate the loop entry block - it requires the existing `head` block to be in the same EH region as the entry block - it won't hoist if the `entry` block is the first block of a handler This change removes all these restrictions. Previously, even if we did create a pre-header, the definition of a pre-header was a little weaker: an entry predecessor could be a non-loop block and also not the pre-header, if the predecessor was dominated by the entry block. This is more complicated to reason about, so I change the pre-header creation to force entry block non-loop predecessors to branch to the pre-header instead. This case only rarely occurs, when we have what looks like an outer loop back edge but the natural loop recognition package doesn't recognize it as an outer loop. I added a "stress mode" to always create a loop pre-header immediately after loop recognition. This is disabled currently because loop cloning doesn't respect the special status and invariants of a pre-header, and so inserts all the cloning conditions and copied blocks after the pre-header, triggering new loop structure asserts. This should be improved in the future. A lot more checking of the loop table and loop annotations on blocks has been added. This revealed a number of problems with loop unrolling leaving things in a bad state for downstream phases. Loop unrolling has been updated to fix this, in particular, the loop table is rebuilt if it is detected that we unroll a loop that contains nested loops, since there will now be multiple copies of those nested loops. This is the first case where we might rebuild the loop table, so it lays the groundwork for potentially rebuilding the loop table in other cases, such as after loop cloning where we don't add the "slow path" loops to the table. There is some code refactoring to simplify the "find loops" code as well. Some change details: - `optSetBlockWeights` is elevated to a "phase" that runs prior to loop recognition. - LoopFlags is simplified: - LPFLG_DO_WHILE is removed; call `lpIsTopEntry` instead - LPFLG_ONE_EXIT is removed; check `lpExitCnt == 1` instead - LPFLG_HOISTABLE is removed (there are no restrictions anymore) - LPFLG_CONST is removed: check `lpFlags & (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT) == (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT)` instead (only used in one place - bool lpContainsCall is removed and replaced by LPFLG_CONTAINS_CALL - Added a `lpInitBlock` field to the loop table. For constant and variable initialization loops, code assumed that these expressions existed in the `head` block. This isn't true anymore if we insert a pre-header block. So, capture the block where these actually exist when we determine that they do exist, and explicitly use this block pointer where needed. - Added `fgComputeReturnBlocks()` to extract this code out of `fgComputeReachability` into a function - Added `optFindAndScaleGeneralLoopBlocks()` to extract this out of loop recognition to its own function. - Added `optResetLoopInfo()` to reset the loop table and block annotations related to loops - Added `fgDebugCheckBBNumIncreasing()` to allow asserting that the bbNum order of blocks is increasing. This should be used in phases that depend on this order to do bbNum comparisons. - Add a lot more loop table validation in `fgDebugCheckLoopTable()`

BruceForstall · 2021-12-09T02:06:33Z

/azp run runtime-coreclr jitstress, runtime-coreclr outerloop

azure-pipelines · 2021-12-09T02:07:01Z

Azure Pipelines successfully started running 2 pipeline(s).

BruceForstall · 2021-12-09T02:07:50Z

@AndyAyersMS @dotnet/jit-contrib PTAL

BruceForstall · 2021-12-09T02:08:57Z

I'll report asmdiffs (and re-kick-off the asmdiffs job) after the spmi collection gets rebuilt; currently, they won't be too useful.

BruceForstall · 2021-12-09T18:08:13Z

A few fixes to make:

A couple jitstress asserts:
Assertion failed 'loopEntryHasHeadPred' in 'Microsoft.CodeAnalysis.CSharp.Symbols.SourceMemberMethodSymbol:CheckModifiers(Microsoft.CodeAnalysis.Location,Microsoft.CodeAnalysis.DiagnosticBag):this' during 'Hoist loop code' (IL size 912)
a GCC compilation failure.

src/coreclr/jit/fgdiagnostic.cpp

1. Change `dspSuccs()` to not call code that will call `GetDescriptorForSwitch()` 2. Change `GetDescriptorForSwitch()` to use the correct max block number while inlining. We probably don't or shouldn't call GetDescriptorForSwitch while inlining, especially after (1), but this change doesn't hurt.

src/coreclr/jit/fgdiagnostic.cpp

src/coreclr/jit/loopcloning.cpp

AndyAyersMS · 2021-12-09T21:14:40Z

src/coreclr/jit/optimizer.cpp

+// 4. The loop iteration variable can't be address exposed.
+// 5. The loop iteration variable can't be a promoted struct field.
+// 6. We must be able to calculate the total constant iteration count.
+// 7. On x86, there is a limit to the number of return blocks. So if there are return blocks in the loop that


Hopefully by now we've moved any such blocks out of the loop range. Can you look into whether this is still a true constraint?

As far as I can tell, we have moved all BBJ_RETURN out of the loop body. I can't see any case where we haven't, but I don't know how to prove that currently. I could follow-up by adding an assert to the new loop checker that there are no BBJ_RETURN blocks.

Actually, it looks like there are cases where we don't move BBJ_RETURN out of the loop body. I don't know the precise conditions. SPMI found coreclr test Program:M49() and System.Security.Cryptography.Pkcs.Rfc3161TimestampToken:TryGetCertIds() (which has lots of nested loops + return + EH).

This test creates a case that fails:

for (int i = 0; i < Vector<int>.Count; i++) { a += 1; if (a > 12000) return a; try { a += 2; } finally { a += 3; } }

The block table is:

----------------------------------------------------------------------------------------------------------------------------------------- BBnum BBid ref try hnd preds weight lp [IL range] [jump] [EH region] [flags] ----------------------------------------------------------------------------------------------------------------------------------------- BB01 [0000] 1 1 [000..006)-> BB06 ( cond ) i BB02 [0001] 2 BB01,BB05 0.50 [006..012)-> BB04 ( cond ) i bwd bwd-target BB03 [0002] 1 BB02 0.50 [012..014) (return) i BB04 [0004] 1 0 BB02 0.50 [014..01B) T0 try { } keep i try bwd BB05 [0011] 1 BB04 0.50 [01B..02C)-> BB02 ( cond ) keep i bwd cfb cfe BB06 [0008] 2 BB01,BB05 0.50 [02C..02E) (return) i +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ funclets follow BB07 [0005] 1 0 0 [01B..020) (finret) H0 F fault { } keep i rare flet bwd -----------------------------------------------------------------------------------------------------------------------------------------

and the loop is from BB02..BB05 with an exit at BB03.

Interestingly, presuming we ever would unroll with an embedded return, we check to avoid doing so on x86 if it would exceed the max allowed epilog limit, but not elsewhere. Elsewhere, we still limit return blocks to 4 in fgAddInternal, but it's not a hard limit, so presumably we could exceed it in unrolling.

AndyAyersMS · 2021-12-09T21:17:38Z

src/coreclr/jit/optimizer.cpp

                        INDEBUG(++unrollFailures);
                        JITDUMP("Failed to unroll loop " FMT_LP ": block cloning failed on " FMT_BB "\n", lnum,
                                block->bbNum);
                        goto DONE_LOOP;
                    }

+                    // All blocks in the unrolled loop and all children loops will now be marked


children here confuses me -- are we unrolling a loop that has children...?

Seems like maybe we unroll a loop that has a cloned loop inside where we've unrolled the clone fast path? -- so both the unrolled fast path and the (loop but not marked as loop slow path) get reparented? Still, seems like a lot of code expansion, especially if we create N copies of the slow path.

We will unroll a loop that has nested child loops. If there are child loops, I later will force a re-generation of the loop table, so this code to apply new loop number to the unrolled loop block will only be permanent if there are no child loops. The idea here is that if there are no child loops, it is "easy" to set the correct loop number, and there is no need to pay the expense of rebuilding the loop table.

Now, it turns out that we (currently) very rarely unroll loops, but one case where we always unroll is a loop statically determined to execute exactly once. That still goes through all the mechanism, and there might be nested loops within that.

I could expand on the comment here if you thing that would help.

A more expansive comment would be helpful.

Is it just the "single iteration" case where we might have nested loops?

I guess unrolling there makes sense. Seems like for that case -- if we're not already doing so -- we should change the loop exit test to always exit, instead of cloning and then deleting the original loop.

I suppose if unrolling were smart enough to make N-1 copies this would just fall out, and we would not need to zap the originals either -- just repair the loop exit branch as above.

Again, maybe this already happens?

We can have nested loops in any case, not just the single iteration case. Consider:

public static int test_01() { int a = 0; for (int i = 0; i < 2; i++) { a += 9; for (int j = 0; j < 10000; j++) { a += 1; } } return a; }

With our current heuristics, we won't unroll this. With COMPlus_JitStressModeNames=STRESS_UNROLL_LOOPS we will unroll the outer loop but not the inner. Replace 2 by 0 (zero) and we'll remove the outer loop but also need to remove the inner.

It turns out you can "game" our heuristics as well, e.g.:

public static int test_02() { int a = 0; for (int i = 0; i < Vector<int>.Count; i++) { a += 9; for (int j = 0; j < 10000; j++) { a += 1; } } return a; }

We will always unroll a constant loop with Vector<T>.Count constant limit, here 8, but we won't unroll the inner loop.

The current code always creates N copies of the original code and then eliminates the original loop. It does seem like it could make N-1 copies instead.

src/coreclr/jit/optimizer.cpp

AndyAyersMS · 2021-12-09T21:38:21Z

src/coreclr/jit/optimizer.cpp

+
+    // If the block we're hoisting from and the pre-header are in different EH regions, don't hoist.
+    // TODO: we could probably hoist things that won't raise exceptions, such as constants.
+    if (!BasicBlock::sameTryRegion(optLoopTable[lnum].lpHead, treeBb))


I take it the preheader should always at least be in the same try region as the loop top/entry? So this is only filtering out hoists from a try that's within a loop?

src/coreclr/jit/optimizer.cpp

AndyAyersMS · 2021-12-09T21:42:40Z

src/coreclr/jit/optimizer.cpp

+//
+// Currently, if you create a pre-header but don't put any code in it, any subsequent fgUpdateFlowGraph()
+// pass might choose to compact the empty pre-header with a predecessor block. That is, a pre-header
+// block might disappear if not used.


I think we should reconsider this compaction -- those edges tend to be critical edges and we're probably better off having blocks there, at least for most of the phases.

Eg it might help prevent LSRA from putting its resolution move blocks in strange places.

We'd probably need a late flow opts phase that then does the cond->(empty) uncond opt.

It might not be hard to prevent compaction or cond->(empty)uncond opt. when a pre-header is involved, but as you point out, we might need a late flow opts. phase that we currently don't have to clean these up. So put it on the list for future work?

Yes. We should consider deferring these simple flow graph updates until some (new) late phase.

There was an assertion when considering an existing `head` block as a potential pre-header, that the `entry` block have the `head` block as a predecessor. However, we early exit if we find a non-head, non-loop edge. This could happen before we encounter the `head` block, making the assert incorrect. We don't want to run the entire loop just for the purpose of the assert (at least not here), so just remove the assert.

Change: ``` compIsForInlining() ? impInlineInfo->InlinerCompiler->fgBBNumMax : fgBBNumMax ``` to: ``` impInlineRoot()->fgBBNumMax ```

BruceForstall · 2021-12-10T00:33:09Z

/azp run runtime-coreclr jitstress, runtime-coreclr outerloop

azure-pipelines · 2021-12-10T00:33:38Z

Azure Pipelines successfully started running 2 pipeline(s).

1. Added loop epoch concept. Currently set but never checked. 2. Added disabled assert about return blocks always being moved out of line of loop body. 3. Fixed bug checking `totalIter` after it was decremented to zero as a loop variable 4. Added more comments on incremental loop block `bbNatLoopNum` setting.

BruceForstall · 2021-12-11T03:39:00Z

/azp run runtime-coreclr jitstress, runtime-coreclr outerloop

azure-pipelines · 2021-12-11T03:39:29Z

Azure Pipelines successfully started running 2 pipeline(s).

When considering converting an existing loop `head` block to a pre-header, verify that the block has the same EH try region that we would create for a new pre-header. If not, we go ahead and create the new pre-header.

BruceForstall · 2021-12-14T01:32:22Z

@AndyAyersMS Is there any unaddressed question or concern left here?

AndyAyersMS

No other concerns.

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 9, 2021

BruceForstall force-pushed the GeneralizeCreatePreHeaderSquashMain branch from 5488052 to 622e701 Compare December 9, 2021 02:06

dotnet deleted a comment from azure-pipelines bot Dec 9, 2021

BruceForstall marked this pull request as ready for review December 9, 2021 02:07

BruceForstall requested a review from AndyAyersMS December 9, 2021 02:07

runfoapp bot mentioned this pull request Dec 9, 2021

jit/intrinsics/createspan_il/createspan_il.sh #62285

Closed

Inline fgBuildBlockNumMap to allow using _alloca

ccd73cc

am11 reviewed Dec 9, 2021

View reviewed changes

src/coreclr/jit/fgdiagnostic.cpp Show resolved Hide resolved

AndyAyersMS reviewed Dec 9, 2021

View reviewed changes

BruceForstall added 4 commits December 9, 2021 15:00

Formatting

592c1a1

Use _alloca instead of alloca name

4b57715

Convert fgBBNumMax usage

80754a4

Change: ``` compIsForInlining() ? impInlineInfo->InlinerCompiler->fgBBNumMax : fgBBNumMax ``` to: ``` impInlineRoot()->fgBBNumMax ```

This was referenced Dec 10, 2021

Certain loops do not get recorded in optLoopTable #43713

Closed

Change loop cloning to work with pre-header blocks #62665

Closed

Add EH condition when converting head to pre-header

aef7927

When considering converting an existing loop `head` block to a pre-header, verify that the block has the same EH try region that we would create for a new pre-header. If not, we go ahead and create the new pre-header.

AndyAyersMS approved these changes Dec 14, 2021

View reviewed changes

BruceForstall merged commit 955a681 into dotnet:main Dec 14, 2021

BruceForstall deleted the GeneralizeCreatePreHeaderSquashMain branch December 14, 2021 07:03

ghost locked as resolved and limited conversation to collaborators Jan 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generalize loop pre-header creation and loop hoisting #62560

Generalize loop pre-header creation and loop hoisting #62560

BruceForstall commented Dec 9, 2021

ghost commented Dec 9, 2021

BruceForstall commented Dec 9, 2021

azure-pipelines bot commented Dec 9, 2021

BruceForstall commented Dec 9, 2021

BruceForstall commented Dec 9, 2021

BruceForstall commented Dec 9, 2021 •

edited

Loading

AndyAyersMS Dec 9, 2021

BruceForstall Dec 11, 2021

BruceForstall Dec 11, 2021

BruceForstall Dec 11, 2021

AndyAyersMS Dec 9, 2021

BruceForstall Dec 10, 2021

AndyAyersMS Dec 10, 2021

BruceForstall Dec 11, 2021

AndyAyersMS Dec 9, 2021

AndyAyersMS Dec 9, 2021

BruceForstall Dec 11, 2021

AndyAyersMS Dec 11, 2021

BruceForstall commented Dec 10, 2021

azure-pipelines bot commented Dec 10, 2021

BruceForstall commented Dec 11, 2021

azure-pipelines bot commented Dec 11, 2021

BruceForstall commented Dec 14, 2021

AndyAyersMS left a comment

Generalize loop pre-header creation and loop hoisting #62560

Generalize loop pre-header creation and loop hoisting #62560

Conversation

BruceForstall commented Dec 9, 2021

ghost commented Dec 9, 2021

BruceForstall commented Dec 9, 2021

azure-pipelines bot commented Dec 9, 2021

BruceForstall commented Dec 9, 2021

BruceForstall commented Dec 9, 2021

BruceForstall commented Dec 9, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

BruceForstall commented Dec 10, 2021

azure-pipelines bot commented Dec 10, 2021

BruceForstall commented Dec 11, 2021

azure-pipelines bot commented Dec 11, 2021

BruceForstall commented Dec 14, 2021

AndyAyersMS left a comment

Choose a reason for hiding this comment

BruceForstall commented Dec 9, 2021 •

edited

Loading