JIT: Port loop cloning to the new loop representation #95326
Conversation
Run the new loop recognition at the same time as old loop recognition and cross validate that the information matches the old loop information.

Implement induction analysis on the new representation. Validate that it matches the old loop induction analysis. Add a few quirks to ensure this.

Port loop cloning to run on the new loop representation. Add a number of quirks to keep the logic close to the old cloning. One notable difference is that the old loop recognition considers some blocks to be part of loops when they actually are not; for example, blocks that return or break out of the loop are not part of the loop's SCC, but can be part of the lexical range. This means that the old loop cloning will clone those blocks and apply static optimizations to them, while the new loop cloning won't. However, this does not seem to cause a significant number of regressions.

A few diffs are expected due to considering fewer blocks to be part of the loop and thus cloning fewer blocks. Also, throughput regressions are expected since we are doing both loop identifications for now. I expect to incrementally work towards replacing everything with the new representation, at which point we can get rid of the old loop identification and hopefully most of the reachability/dominator computations. I expect us to make back the TP at that point.
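To make the lexical-range vs. SCC distinction concrete, here is an illustrative loop shape (C-style source, not taken from the PR): the early-return block sits lexically between the loop's first and last blocks, but it has no path back to the header, so it is not part of the loop's strongly connected component. The old, lexical recognition treats it as a loop block and would clone it; the SCC-based recognition does not.

```cpp
// Hypothetical source shape, for illustration only.
int FirstNegative(const int* values, int count)
{
    for (int i = 0; i < count; i++)
    {
        if (values[i] < 0)
        {
            // This block is lexically inside the loop, but it exits the
            // method: it has no edge back to the loop header, so it is not
            // part of the loop's strongly connected component.
            return i;
        }
    }
    return -1;
}
```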
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

Azure Pipelines successfully started running 2 pipeline(s).
cc @dotnet/jit-contrib PTAL @BruceForstall

No diffs except for throughput regressions. I was able to quirk the rest of the differences away -- it turns out the blocks that are lexically within the loop but not actually in the loop are always empty.
In general, this looks really good. It's good to see that converting from the existing loop structure to the new code is relatively straightforward. It will be interesting to see what additional improvements can happen by removing the quirks.
Here are some comments regarding both this change, and the previous change to introduce the `FlowGraphNaturalLoop` type (and friends), which I wasn't able to previously review.

- I'd like to see more comments on all the new class/struct types: `FlowGraphNaturalLoop`, `FlowGraphDfsTree`, `NaturalLoopIterInfo`, etc. Including comments for the member declarations.
- IIUC, `FlowGraphNaturalLoop` defines a loop as a single entry block, named `header`, and a bitset of all member blocks indexed by the block reverse postorder number (compressed); see the sketch after this list. (Presumably the edges could be re-derived from this.)
  - How often will this block postorder number be recomputed, if ever? (Every time it gets recomputed the loop table needs to be rebuilt.)
  - Should the set size be compressed further by subtracting the minimum block member postorder number? E.g., `loop->m_blocksSize = loop->m_header->bbPostorderNum + 1 - loop->m_MinimumBlockPostorderNumber;`? Presumably if you have a small loop at the top of a very large function (say, at the top level, main-line control flow), its member blocks would have tightly packed post-order numbers, but they would all be large.
  - nit: should it be `m_entry` instead of `m_header`? To me, "header" implies something lexically earlier, but lexicality is ignored in constructing the loop graph.
- Do you think the lexicality functions (`VisitLoopBlocksLexical`, `GetLexicallyTopMostBlock`, `GetLexicallyBottomMostBlock`) could be removed if we get rid of the concept of "fall-through" (`BBJ_NONE`/`BBJ_COND` fall-through) early in the JIT?
- Do we need to worry about adding new blocks to the flow graph and having them appear in the loop graph as well, without rebuilding the loop graph? That might not be possible with post-order block number indices, unless there is some provision for excess built in. E.g., `fgDominate` works conservatively for new blocks, but I don't know if the new system can tolerate this. How would new blocks get added to the DFS tree, if at all?
- Does the new loop graph actually recognize exactly the same set of loops as the old? That's somewhat surprising to me.
- Improve JIT loop optimizations (.NET 9) #93144 should be updated to account for this and previous related work.
- Some of the functions that took `loopNum` and now take `FlowGraphNaturalLoop*` need their header comments updated.
- Should the `BitVecTraits` for the loop blocks be stored/cached with the loop and not recomputed with the `LoopBlockTraits()` function?
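A minimal sketch of the header-relative bitset indexing discussed in the second bullet, assuming member names `m_blocks`, `m_blocksSize`, and `m_header` (the real declarations may differ):

```cpp
// Sketch only: in a natural loop the header finishes last in the DFS, so it
// has the highest postorder number of any member block, and member indices
// can be stored relative to it.
unsigned FlowGraphNaturalLoop::LoopBlockBitVecIndex(BasicBlock* block)
{
    unsigned index = m_header->bbPostorderNum - block->bbPostorderNum;
    assert(index < m_blocksSize); // currently m_header->bbPostorderNum + 1
    return index;
}

bool FlowGraphNaturalLoop::ContainsBlock(BasicBlock* block)
{
    // Blocks numbered after the header in postorder cannot be members.
    if (block->bbPostorderNum > m_header->bbPostorderNum)
    {
        return false;
    }
    BitVecTraits traits = LoopBlockTraits();
    return BitVecOps::IsMember(&traits, m_blocks, LoopBlockBitVecIndex(block));
}
```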
I'll add some more documentation. What style of comments are you looking for on the declarations? I'm a bit reluctant to duplicate documentation when it's already there on the definitions.
We need to recompute it after cloning and unrolling (provided they made flow graph changes). We already recompute DFS, dominators and dense reachability in those cases, so I expect it is going to be cheaper overall since we won't need dominators and reachability anymore, and I do not think the actual loop finding is that expensive.
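A minimal sketch of that recompute-on-invalidation lifecycle; `modifiedFlow` is a hypothetical flag standing in for whatever the phase reports, while the two calls themselves appear later in this conversation:

```cpp
// Sketch: after a phase that changed the flow graph (cloning, unrolling),
// discard and rebuild the DFS tree and loop structures rather than trying
// to update them incrementally.
if (modifiedFlow) // hypothetical flag set by the phase
{
    m_dfs   = fgComputeDfs();
    m_loops = FlowGraphNaturalLoops::Find(m_dfs);
}
```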
I actually did this initially, but it felt a bit premature. In particular it's a bit messy since reducing the size can change the representation of the bit vectors. I still think the optimization makes sense, but we need a sanctioned API in
I prefer "header" since we already use "preheader". This was Andy's terminology, and it matches clang/GCC, which I like.
There are a few sources of uses left:
The "block reordering" issue is probably going to be the biggest one to get rid of the lexical visitor.
It's a good question. My gut feeling from working with it is no: we already redo DFS to rebuild dominators, and redo reachability, and I think overall we are going to be in the green when all is said and done.

It might be possible in some cases to incrementally rebuild the DFS tree, if we also store a preorder of blocks. For example, when adding preheaders I think that might be possible. I'm not convinced the complication is worth it. Loop unrolling may end up with some intermediate recomputations that we didn't have before, in particular when we unroll a nested loop and then also want to unroll the parent loop. I don't think there's an innate need for this; I could teach loop unrolling about where to find the newly created nested unrolled blocks, but it seemed like a complication that is not worth it since we do not expect a lot of nested unrolling to happen.
It recognizes a superset of the old code. There are quirks in this PR to avoid cloning for the additional new loops that it recognizes. I think when we are done that #43713 and #58941 will be fixed.
Will do.
Will do another pass to make sure these are updated.
Hmm, let me look at it in a follow-up. I was under the impression that the constructor of
In particular, for the classes/structs: a summary of the purpose of the class. Possibly a note on the implementation. Comments on the member variables. For functions, don't duplicate the definition comments. It might be worthwhile writing a short comment at the function declarations; that's a judgement call to provide clarity when reading the declarations but not duplicating the definitions.
Is this because you need to add blocks to the loop before knowing all the blocks in the loop, hence not knowing what the lowest block number is? I suppose the loop finder could have a single "full size" block set it uses/re-uses while constructing a loop, then generate the "compressed" block set afterwards, copying over the bits as necessary.
Gotcha, that makes a lot of sense. Will do that.
Yeah, we do not know the lowest postorder number before we have run the code in
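For illustration, a sketch of the scratch-set-then-compress approach suggested above; `loopBlocks`, `header`, `dfsTree`, and `comp` are placeholders, and the real loop finder may differ:

```cpp
// Sketch: mark loop members in a scratch set sized for the whole function
// while the loop is being discovered...
BitVecTraits fullTraits(dfsTree->GetPostOrderCount(), comp);
BitVec       scratch = BitVecOps::MakeEmpty(&fullTraits);

unsigned lowestNum = header->bbPostorderNum;
for (BasicBlock* block : loopBlocks) // placeholder for the discovery walk
{
    BitVecOps::AddElemD(&fullTraits, scratch, block->bbPostorderNum);
    if (block->bbPostorderNum < lowestNum)
    {
        lowestNum = block->bbPostorderNum;
    }
}

// ...then, once the extent is known, copy the bits into a compact set that
// only spans [lowestNum, header->bbPostorderNum].
BitVecTraits loopTraits(header->bbPostorderNum + 1 - lowestNum, comp);
BitVec       blocks = BitVecOps::MakeEmpty(&loopTraits);

BitVecOps::Iter iter(&fullTraits, scratch);
unsigned        num = 0;
while (iter.NextElem(&num))
{
    BitVecOps::AddElemD(&loopTraits, blocks, num - lowestNum);
}
```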
@BruceForstall I've added some more documentation and comments on the fields/types. Can you please take another look?
Thanks for adding the comments; those are helpful.

Regarding adding new blocks: the new loop graph benefits from having the old loop recognition insert pre-headers before the new loop graph is constructed. If the old loop recognition is removed, then pre-header creation will have to be rebuilt on the new loop graph. This in itself might require a loop graph re-build if there isn't incremental update capability. (The same will also apply to the current set of loop canonicalizations, such as splitting loop header blocks.)

nit: you use

Related:

which, IMO, should be:

Looks like

We currently have
m_dfs = fgComputeDfs();
m_loops = FlowGraphNaturalLoops::Find(m_dfs);
The "TODO" above points out that we don't rebuild the loop table, but obviously here we do rebuild the new loop table.
Does anyone use this after this point?
If they used it and it wasn't recomputed, would it be incorrect, or just missing data?
It looks like ProfileSynthesis has its own copy of `FlowGraphDfsTree`, `FlowGraphNaturalLoops`. Should it just use the ones in `Compiler`?
No, this is unused after this point. We are going to need to do this recomputation eventually, so I simply added it now to get a more proper view of the TP cost.
I'll get rid of the TODO as part of removing the old loop stuff.
> If they used it and it wasn't recomputed, would it be incorrect, or just missing data?

If you change the flow graph and then recompute the DFS, then `BasicBlock::bbPostorderNum` will become invalid and mappings into `m_dfs` and `m_loops` will give nonsensical data. That's why I think it makes sense to assert that `m_dfs` matches the current flow graph in post-phase checks (and null it out when we then no longer need it, i.e. after RBO). As long as the DFS isn't recomputed you can add and remove new blocks, and queries in the DFS tree/loop structures will just give incorrect (outdated) results for the affected blocks. For example, if you add a new block that is clearly reachable, `m_dfs->Contains(newBlock)` will return false incorrectly. In loop cloning we process loops parents before children, which ensures we do not invalidate any of the stored data for any future loop we will see; thus we can keep acting on it. Loop unrolling (as I mentioned above) will be a bit harder.
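A sketch of what such a post-phase check could look like; `fgDebugCheckDfsTreeIsCurrent` is a hypothetical name, and a real check could compare more cheaply than a full recompute:

```cpp
// Sketch: recompute a fresh DFS and assert the cached tree still agrees
// with it, catching any flow graph change that forgot to invalidate m_dfs.
void Compiler::fgDebugCheckDfsTreeIsCurrent()
{
    FlowGraphDfsTree* fresh = fgComputeDfs();
    assert(fresh->GetPostOrderCount() == m_dfs->GetPostOrderCount());
    for (unsigned i = 0; i < fresh->GetPostOrderCount(); i++)
    {
        // Same blocks in the same postorder positions implies the cached
        // bbPostorderNum-based mappings are still valid.
        assert(fresh->GetPostOrder(i) == m_dfs->GetPostOrder(i));
    }
}
```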
> It looks like ProfileSynthesis has its own copy of `FlowGraphDfsTree`, `FlowGraphNaturalLoops`. Should it just use the ones in `Compiler`?
I don't think so. I think if we can avoid keeping ambient state like this then we should; otherwise we need to take care to invalidate the information in the right places.
Correct, we will need to re-find the loops. I do not think it is a problem -- the old loop finding is also doing renumbering, DFS, dominators and dense reachability in these cases. The only thing it avoids is the loop finding, but I expect loop finding to be cheaper than dominators and reachability.
I'll rename it, but I'd prefer to do it as part of the follow-up #95589, since I'm going to have to merge that anyway. FWIW, given that this is properly encapsulated and used in a class that only deals with the flow graph I personally do not see any confusion from this. If it was publicly exposed I'd be more inclined to consider it an issue.
I think it makes sense to add a post-phase check that `m_dfs` matches the current flow graph.
no problem
Thanks for the review @BruceForstall! I opened #95649, #95651 and #95652 for the outstanding issues (will do the renaming as part of #95589, so I didn't open an issue for it).