-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalize loop pre-header creation and loop hoisting #62560
Generalize loop pre-header creation and loop hoisting #62560
Conversation
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsA loop pre-header is a block that flows directly (and only) to the loop entry Currently, a loop pre-header has a number of restrictions:
Additionally, partially due those restrictions, loop hoisting has restrictions:
This change removes all these restrictions. Previously, even if we did create a pre-header, the definition of a pre-header I added a "stress mode" to always create a loop pre-header immediately after A lot more checking of the loop table and loop annotations on blocks has been There is some code refactoring to simplify the "find loops" code as well. Some change details:
|
A loop pre-header is a block that flows directly (and only) to the loop entry block. The loop pre-header is the only non-loop predecessor of the entry block. Loop invariant code can be hoisted to the loop pre-header where it is guaranteed to be executed just once (per loop entry). Currently, a loop pre-header has a number of restrictions: - it is only created for a "do-while" (top entry) loop, not for a mid-loop entry. - it isn't created if the current loop head and the loop entry block are in different EH try regions Additionally, partially due those restrictions, loop hoisting has restrictions: - it requires a "do-while" (top entry) loop - it requires the existing `head` block to dominate the loop entry block - it requires the existing `head` block to be in the same EH region as the entry block - it won't hoist if the `entry` block is the first block of a handler This change removes all these restrictions. Previously, even if we did create a pre-header, the definition of a pre-header was a little weaker: an entry predecessor could be a non-loop block and also not the pre-header, if the predecessor was dominated by the entry block. This is more complicated to reason about, so I change the pre-header creation to force entry block non-loop predecessors to branch to the pre-header instead. This case only rarely occurs, when we have what looks like an outer loop back edge but the natural loop recognition package doesn't recognize it as an outer loop. I added a "stress mode" to always create a loop pre-header immediately after loop recognition. This is disabled currently because loop cloning doesn't respect the special status and invariants of a pre-header, and so inserts all the cloning conditions and copied blocks after the pre-header, triggering new loop structure asserts. This should be improved in the future. A lot more checking of the loop table and loop annotations on blocks has been added. This revealed a number of problems with loop unrolling leaving things in a bad state for downstream phases. Loop unrolling has been updated to fix this, in particular, the loop table is rebuilt if it is detected that we unroll a loop that contains nested loops, since there will now be multiple copies of those nested loops. This is the first case where we might rebuild the loop table, so it lays the groundwork for potentially rebuilding the loop table in other cases, such as after loop cloning where we don't add the "slow path" loops to the table. There is some code refactoring to simplify the "find loops" code as well. Some change details: - `optSetBlockWeights` is elevated to a "phase" that runs prior to loop recognition. - LoopFlags is simplified: - LPFLG_DO_WHILE is removed; call `lpIsTopEntry` instead - LPFLG_ONE_EXIT is removed; check `lpExitCnt == 1` instead - LPFLG_HOISTABLE is removed (there are no restrictions anymore) - LPFLG_CONST is removed: check `lpFlags & (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT) == (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT)` instead (only used in one place - bool lpContainsCall is removed and replaced by LPFLG_CONTAINS_CALL - Added a `lpInitBlock` field to the loop table. For constant and variable initialization loops, code assumed that these expressions existed in the `head` block. This isn't true anymore if we insert a pre-header block. So, capture the block where these actually exist when we determine that they do exist, and explicitly use this block pointer where needed. - Added `fgComputeReturnBlocks()` to extract this code out of `fgComputeReachability` into a function - Added `optFindAndScaleGeneralLoopBlocks()` to extract this out of loop recognition to its own function. - Added `optResetLoopInfo()` to reset the loop table and block annotations related to loops - Added `fgDebugCheckBBNumIncreasing()` to allow asserting that the bbNum order of blocks is increasing. This should be used in phases that depend on this order to do bbNum comparisons. - Add a lot more loop table validation in `fgDebugCheckLoopTable()`
5488052
to
622e701
Compare
/azp run runtime-coreclr jitstress, runtime-coreclr outerloop |
Azure Pipelines successfully started running 2 pipeline(s). |
@AndyAyersMS @dotnet/jit-contrib PTAL |
I'll report asmdiffs (and re-kick-off the asmdiffs job) after the spmi collection gets rebuilt; currently, they won't be too useful. |
A few fixes to make:
|
1. Change `dspSuccs()` to not call code that will call `GetDescriptorForSwitch()` 2. Change `GetDescriptorForSwitch()` to use the correct max block number while inlining. We probably don't or shouldn't call GetDescriptorForSwitch while inlining, especially after (1), but this change doesn't hurt.
// 4. The loop iteration variable can't be address exposed. | ||
// 5. The loop iteration variable can't be a promoted struct field. | ||
// 6. We must be able to calculate the total constant iteration count. | ||
// 7. On x86, there is a limit to the number of return blocks. So if there are return blocks in the loop that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully by now we've moved any such blocks out of the loop range. Can you look into whether this is still a true constraint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, we have moved all BBJ_RETURN out of the loop body. I can't see any case where we haven't, but I don't know how to prove that currently. I could follow-up by adding an assert to the new loop checker that there are no BBJ_RETURN blocks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it looks like there are cases where we don't move BBJ_RETURN out of the loop body. I don't know the precise conditions. SPMI found coreclr test Program:M49()
and System.Security.Cryptography.Pkcs.Rfc3161TimestampToken:TryGetCertIds()
(which has lots of nested loops + return + EH).
This test creates a case that fails:
for (int i = 0; i < Vector<int>.Count; i++)
{
a += 1;
if (a > 12000) return a;
try {
a += 2;
} finally {
a += 3;
}
}
The block table is:
-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds weight lp [IL range] [jump] [EH region] [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000] 1 1 [000..006)-> BB06 ( cond ) i
BB02 [0001] 2 BB01,BB05 0.50 [006..012)-> BB04 ( cond ) i bwd bwd-target
BB03 [0002] 1 BB02 0.50 [012..014) (return) i
BB04 [0004] 1 0 BB02 0.50 [014..01B) T0 try { } keep i try bwd
BB05 [0011] 1 BB04 0.50 [01B..02C)-> BB02 ( cond ) keep i bwd cfb cfe
BB06 [0008] 2 BB01,BB05 0.50 [02C..02E) (return) i
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ funclets follow
BB07 [0005] 1 0 0 [01B..020) (finret) H0 F fault { } keep i rare flet bwd
-----------------------------------------------------------------------------------------------------------------------------------------
and the loop is from BB02..BB05 with an exit at BB03.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, presuming we ever would unroll with an embedded return, we check to avoid doing so on x86 if it would exceed the max allowed epilog limit, but not elsewhere. Elsewhere, we still limit return blocks to 4 in fgAddInternal, but it's not a hard limit, so presumably we could exceed it in unrolling.
src/coreclr/jit/optimizer.cpp
Outdated
INDEBUG(++unrollFailures); | ||
JITDUMP("Failed to unroll loop " FMT_LP ": block cloning failed on " FMT_BB "\n", lnum, | ||
block->bbNum); | ||
goto DONE_LOOP; | ||
} | ||
|
||
// All blocks in the unrolled loop and all children loops will now be marked |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
children
here confuses me -- are we unrolling a loop that has children...?
Seems like maybe we unroll a loop that has a cloned loop inside where we've unrolled the clone fast path? -- so both the unrolled fast path and the (loop but not marked as loop slow path) get reparented? Still, seems like a lot of code expansion, especially if we create N copies of the slow path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will unroll a loop that has nested child loops. If there are child loops, I later will force a re-generation of the loop table, so this code to apply new loop number to the unrolled loop block will only be permanent if there are no child loops. The idea here is that if there are no child loops, it is "easy" to set the correct loop number, and there is no need to pay the expense of rebuilding the loop table.
Now, it turns out that we (currently) very rarely unroll loops, but one case where we always unroll is a loop statically determined to execute exactly once. That still goes through all the mechanism, and there might be nested loops within that.
I could expand on the comment here if you thing that would help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A more expansive comment would be helpful.
Is it just the "single iteration" case where we might have nested loops?
I guess unrolling there makes sense. Seems like for that case -- if we're not already doing so -- we should change the loop exit test to always exit, instead of cloning and then deleting the original loop.
I suppose if unrolling were smart enough to make N-1 copies this would just fall out, and we would not need to zap the originals either -- just repair the loop exit branch as above.
Again, maybe this already happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can have nested loops in any case, not just the single iteration case. Consider:
public static int test_01()
{
int a = 0;
for (int i = 0; i < 2; i++)
{
a += 9;
for (int j = 0; j < 10000; j++)
{
a += 1;
}
}
return a;
}
With our current heuristics, we won't unroll this. With COMPlus_JitStressModeNames=STRESS_UNROLL_LOOPS
we will unroll the outer loop but not the inner. Replace 2
by 0
(zero) and we'll remove the outer loop but also need to remove the inner.
It turns out you can "game" our heuristics as well, e.g.:
public static int test_02()
{
int a = 0;
for (int i = 0; i < Vector<int>.Count; i++)
{
a += 9;
for (int j = 0; j < 10000; j++)
{
a += 1;
}
}
return a;
}
We will always unroll a constant loop with Vector<T>.Count
constant limit, here 8, but we won't unroll the inner loop.
The current code always creates N
copies of the original code and then eliminates the original loop. It does seem like it could make N-1
copies instead.
|
||
// If the block we're hoisting from and the pre-header are in different EH regions, don't hoist. | ||
// TODO: we could probably hoist things that won't raise exceptions, such as constants. | ||
if (!BasicBlock::sameTryRegion(optLoopTable[lnum].lpHead, treeBb)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I take it the preheader should always at least be in the same try region as the loop top/entry? So this is only filtering out hoists from a try that's within a loop?
// | ||
// Currently, if you create a pre-header but don't put any code in it, any subsequent fgUpdateFlowGraph() | ||
// pass might choose to compact the empty pre-header with a predecessor block. That is, a pre-header | ||
// block might disappear if not used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should reconsider this compaction -- those edges tend to be critical edges and we're probably better off having blocks there, at least for most of the phases.
Eg it might help prevent LSRA from putting its resolution move blocks in strange places.
We'd probably need a late flow opts phase that then does the cond->(empty) uncond opt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might not be hard to prevent compaction or cond->(empty)uncond opt. when a pre-header is involved, but as you point out, we might need a late flow opts. phase that we currently don't have to clean these up. So put it on the list for future work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. We should consider deferring these simple flow graph updates until some (new) late phase.
There was an assertion when considering an existing `head` block as a potential pre-header, that the `entry` block have the `head` block as a predecessor. However, we early exit if we find a non-head, non-loop edge. This could happen before we encounter the `head` block, making the assert incorrect. We don't want to run the entire loop just for the purpose of the assert (at least not here), so just remove the assert.
Change: ``` compIsForInlining() ? impInlineInfo->InlinerCompiler->fgBBNumMax : fgBBNumMax ``` to: ``` impInlineRoot()->fgBBNumMax ```
/azp run runtime-coreclr jitstress, runtime-coreclr outerloop |
Azure Pipelines successfully started running 2 pipeline(s). |
1. Added loop epoch concept. Currently set but never checked. 2. Added disabled assert about return blocks always being moved out of line of loop body. 3. Fixed bug checking `totalIter` after it was decremented to zero as a loop variable 4. Added more comments on incremental loop block `bbNatLoopNum` setting.
/azp run runtime-coreclr jitstress, runtime-coreclr outerloop |
Azure Pipelines successfully started running 2 pipeline(s). |
When considering converting an existing loop `head` block to a pre-header, verify that the block has the same EH try region that we would create for a new pre-header. If not, we go ahead and create the new pre-header.
@AndyAyersMS Is there any unaddressed question or concern left here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No other concerns.
A loop pre-header is a block that flows directly (and only) to the loop entry
block. The loop pre-header is the only non-loop predecessor of the entry block.
Loop invariant code can be hoisted to the loop pre-header where it is
guaranteed to be executed just once (per loop entry).
Currently, a loop pre-header has a number of restrictions:
different EH try regions
Additionally, partially due those restrictions, loop hoisting has restrictions:
head
block to dominate the loop entry blockhead
block to be in the same EH region as the entry blockentry
block is the first block of a handlerThis change removes all these restrictions.
Previously, even if we did create a pre-header, the definition of a pre-header
was a little weaker: an entry predecessor could be a non-loop block and also
not the pre-header, if the predecessor was dominated by the entry block. This
is more complicated to reason about, so I change the pre-header creation to
force entry block non-loop predecessors to branch to the pre-header instead.
This case only rarely occurs, when we have what looks like an outer loop back
edge but the natural loop recognition package doesn't recognize it as an outer loop.
I added a "stress mode" to always create a loop pre-header immediately after
loop recognition. This is disabled currently because loop cloning doesn't
respect the special status and invariants of a pre-header, and so inserts
all the cloning conditions and copied blocks after the pre-header, triggering
new loop structure asserts. This should be improved in the future.
A lot more checking of the loop table and loop annotations on blocks has been
added. This revealed a number of problems with loop unrolling leaving things
in a bad state for downstream phases. Loop unrolling has been updated to fix
this, in particular, the loop table is rebuilt if it is detected that we unroll
a loop that contains nested loops, since there will now be multiple copies of
those nested loops. This is the first case where we might rebuild the loop
table, so it lays the groundwork for potentially rebuilding the loop table in
other cases, such as after loop cloning where we don't add the "slow path"
loops to the table.
There is some code refactoring to simplify the "find loops" code as well.
Some change details:
optSetBlockWeights
is elevated to a "phase" that runs prior to loop recognition.lpIsTopEntry
insteadlpExitCnt == 1
insteadlpFlags & (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT) == (LPFLG_CONST_INIT | LPFLG_CONST_LIMIT)
instead (only used in one placelpInitBlock
field to the loop table. For constant and variableinitialization loops, code assumed that these expressions existed in the
head
block. This isn't true anymore if we insert a pre-header block.So, capture the block where these actually exist when we determine that
they do exist, and explicitly use this block pointer where needed.
fgComputeReturnBlocks()
to extract this code out offgComputeReachability
into a functionoptFindAndScaleGeneralLoopBlocks()
to extract this out of loop recognition to its own function.optResetLoopInfo()
to reset the loop table and block annotations related to loopsfgDebugCheckBBNumIncreasing()
to allow asserting that the bbNumorder of blocks is increasing. This should be used in phases that depend
on this order to do bbNum comparisons.
fgDebugCheckLoopTable()