
JIT: Port loop cloning to the new loop representation #95326

Merged: jakobbotsch merged 14 commits into dotnet:main on Dec 5, 2023

Conversation

jakobbotsch
Member

Run the new loop recognition at the same time as old loop recognition and cross validate that the information matches the old loop information.

Implement induction analysis on the new representation. Validate that it matches the old loop induction analysis. Add a few quirks to ensure this.

Port loop cloning to run on the new loop representation. Add a number of quirks to keep the logic close to the old cloning. One notable difference is that the old loop recognition considers some blocks to be part of a loop when they actually are not; for example, blocks that return or break out of the loop are not part of the loop's SCC, but they can be part of the lexical range. This means that the old loop cloning will clone those blocks and apply static optimizations to them, while the new loop cloning won't. I still need to investigate how much of a problem this is -- the regressions do not seem to be that big.

A few diffs are expected due to considering fewer blocks to be part of the loop and thus cloning fewer blocks. Also, throughput regressions are expected since we are doing both loop identifications for now. I expect to incrementally work towards replacing everything with the new representation, at which point we can get rid of the old loop identification and hopefully most of the reachability/dominator computations. I expect us to make back the TP at that point.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 28, 2023
@ghost ghost assigned jakobbotsch Nov 28, 2023
@ghost

ghost commented Nov 28, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


@jakobbotsch
Member Author

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress


Azure Pipelines successfully started running 2 pipeline(s).

@jakobbotsch
Member Author

cc @dotnet/jit-contrib PTAL @BruceForstall

No diffs except for throughput regressions. I was able to quirk the rest of the differences away -- it turns out the blocks that are lexically within the loop but not actually in the loop are always empty BBJ_ALWAYS blocks today (maybe the old loop finding ensures that -- not sure). So we don't lose any optimization from the fact that we no longer clone and optimize them.

@BruceForstall (Member) left a comment

In general, this looks really good. It's good to see that converting from the existing loop structure to the new code is relatively straightforward. It will be interesting to see what additional improvements can happen by removing the quirks.

Here are some comments regarding both this change, and the previous change to introduce the FlowGraphNaturalLoop type (and friends) which I wasn't able to previously review.

  • I'd like to see more comments on all the new class/struct types: FlowGraphNaturalLoop, FlowGraphDfsTree, NaturalLoopIterInfo, etc. Including comments for the member declarations.
  • IIUC, FlowGraphNaturalLoop defines a loop as a single entry block, named header, and a bitset of all member blocks indexed by the block reverse postorder number (compressed). (Presumably the edges could be re-derived from this.) A rough sketch of this indexing idea follows this list.
    • How often will this block postorder number be recomputed, if ever? (Every time it gets recomputed the loop table needs to be rebuilt.)
    • Should the set size be compressed further by subtracting the minimum block member postorder number? e.g., loop->m_blocksSize = loop->m_header->bbPostorderNum + 1 - loop->m_MinimumBlockPostorderNumber;? E.g., presumably if you have a small loop at the top of a very large function (say, at the top level, main-line control flow), its member blocks would have tightly packed post-order numbers but they would all be large.
    • nit: should it be m_entry instead of m_header? To me, "header" implies something lexically earlier, but lexicality is ignored in constructing the loop graph.
  • Do you think the lexicality functions (VisitLoopBlocksLexical, GetLexicallyTopMostBlock, GetLexicallyBottomMostBlock) could be removed if we get rid of the concept of "fall-through" (BBJ_NONE/BBJ_COND fall-through) early in the JIT?
  • Do we need to worry about adding new blocks to the flow graph and having them appear in the loop graph as well, without building the loop graph? That might not be possible with post-order block number indices. Unless there is some provision for excess built in? E.g., fgDominate works conservatively for new blocks, but I don't know if the new system can tolerate this. E.g., how would new blocks get added to the DFS tree, if at all?
  • Does the new loop graph actually recognize exactly the same set of loops as the old? That's somewhat surprising to me.
  • Improve JIT loop optimizations (.NET 9) #93144 should be updated to account for this and previous related work.
  • Some of the functions that took loopNum and now take FlowGraphNaturalLoop* now need their header comments updated.
  • Should the BitVecTraits for the loop blocks be stored/cached with the loop instead of being recomputed by the LoopBlockTraits() function?
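
To make the membership-set indexing in the first bullet concrete, here is a minimal stand-alone sketch. This is not the JIT's actual FlowGraphNaturalLoop code; the Block/Loop types, field names, and std::vector<bool> storage are simplified stand-ins for BasicBlock, FlowGraphNaturalLoop, and BitVec, and it assumes (as the size formula above implies) that member blocks have postorder numbers no larger than the header's. It keeps one bit per member block, indexed relative to the lowest member postorder number, which is the compression suggested above.

    // Simplified stand-ins -- not the JIT's real types.
    #include <cassert>
    #include <vector>

    struct Block
    {
        unsigned postorderNum; // position of the block in the DFS postorder
    };

    struct Loop
    {
        Block*            header;    // single entry block of the natural loop
        unsigned          lowestNum; // smallest postorder number among member blocks
        std::vector<bool> blocks;    // membership bits, indexed relative to lowestNum

        // With the compression suggested above, the set needs
        // (header->postorderNum - lowestNum + 1) bits instead of
        // (header->postorderNum + 1) bits.
        unsigned Index(const Block* b) const
        {
            assert((b->postorderNum >= lowestNum) && (b->postorderNum <= header->postorderNum));
            return b->postorderNum - lowestNum;
        }

        bool Contains(const Block* b) const
        {
            unsigned num = b->postorderNum;
            return (num >= lowestNum) && (num <= header->postorderNum) && blocks[Index(b)];
        }
    };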

@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Dec 5, 2023
@jakobbotsch
Member Author

  • I'd like to see more comments on all the new class/struct types: FlowGraphNaturalLoop, FlowGraphDfsTree, NaturalLoopIterInfo, etc. Including comments for the member declarations.

I'll add some more documentation. What style of comments are you looking for on the declarations? I'm a bit reluctant to duplicate documentation when it's already there on the definitions.

  • IIUC, FlowGraphNaturalLoop defines a loop as a single entry block, named header, and a bitset of all member blocks indexed by the block reverse postorder number (compressed). (Presumably the edges could be re-derived from this.)

    • How often will this block postorder number be recomputed, if ever? (Every time it gets recomputed the loop table needs to be rebuilt.)

We need to recompute it after cloning and unrolling (provided they made flow graph changes). We already recompute DFS, dominators and dense reachability in those cases, so I expect it is going to be cheaper overall since we won't need dominators and reachability anymore, and I do not think the actual loop finding is that expensive.
I don't think we will need to recompute it anywhere else; we might still want to (e.g. it seems like it would be beneficial to do loop alignment later). Also, we'll be able to get rid of the DFS that SSA is computing and reuse the one computed for loops.

  • Should the set size be compressed further by subtracting the minimum block member postorder number? e.g., loop->m_blocksSize = loop->m_header->bbPostorderNum + 1 - loop->m_MinimumBlockPostorderNumber;? E.g., presumably if you have a small loop at the top of a very large function (say, at the top level, main-line control flow), its member blocks would have tightly packed post-order numbers but they would all be large.

I actually did this initially, but it felt a bit premature. In particular it's a bit messy since reducing the size can change the representation of the bit vectors. I still think the optimization makes sense, but we need a sanctioned API in BitVecOps to do it. Something that I think we can look at separately.

  • nit: should it be m_entry instead of m_header? To me, "header" implies something lexically earlier, but lexicality is ignored in constructing the loop graph.

I prefer "header" since we already use "preheader". This was Andy's terminology, and it matches clang/GCC which I like.

  • Do you think the lexicality functions (VisitLoopBlocksLexical, GetLexicallyTopMostBlock, GetLexicallyBottomMostBlock) could be removed if we get rid of the concept of "fall-through" (BBJ_NONE/BBJ_COND fall-through) early in the JIT?

There are a few sources of uses left:

  • The blocks are cloned in lexical order. We still need to handle fall through, and I had to add extra code to do that, so I do not think there is a correctness requirement here, and I think it could be switched to RPO. However, I did not have enough faith in our block reordering code to do that. There would be a correctness requirement for EH regions if we cloned loops with EH regions.
  • Loop alignment looks at the top block to set loop alignment flag.
  • The IV analysis is currently looking at the bottom-most block, but this one should be switched to look at exiting blocks instead in a future PR.

The "block reordering" issue is probably going to be the biggest one to get rid of the lexical visitor.

  • Do we need to worry about adding new blocks to the flow graph and having them appear in the loop graph as well, without building the loop graph? That might not be possible with post-order block number indices. Unless there is some provision for excess built in? E.g., fgDominate works conservatively for new blocks, but I don't know if the new system can tolerate this. E.g., how would new blocks get added to the DFS tree, if at all?

It's a good question. My gut feeling from working with it is no: we already redo DFS to rebuild dominators, and redo reachability, and I think overall we are going to be in the green when all is said and done.

It might be possible in some cases to incrementally rebuild the DFS tree, if we also store a preorder of blocks. For example when adding preheaders I think that might be possible. I'm not convinced the complication is worth it.

Loop unrolling may end up with some intermediate recomputations that we didn't have before, in particular when we unroll a nested loop and then also want to unroll the parent loop. I don't think there's an innate need for this; I could teach loop unrolling where to find the newly created nested unrolled blocks, but it seemed like a complication that is not worth it since we do not expect a lot of nested unrolling to happen.

  • Does the new loop graph actually recognize exactly the same set of loops as the old? That's somewhat surprising to me.

It recognizes a superset of the loops recognized by the old code. There are quirks in this PR to avoid cloning the additional loops that it recognizes. I think that #43713 and #58941 will be fixed once we are done.
I admit I was a bit surprised when my asserts that all old loops were recognized by the new code passed.

  • Improve JIT loop optimizations (.NET 9) #93144 should be updated to account for this and previous related work.

Will do.

  • Some of the functions that took loopNum and now take FlowGraphNaturalLoop* now need their header comments updated.

Will do another pass to make sure these are updated.

  • Should the BitVecTraits for the loop blocks be stored/cached with the loop instead of being recomputed by the LoopBlockTraits() function?

Hmm, let me look at it in a follow-up. I was under the impression that the constructor of BitVecTraits is simple enough that the native compiler benefits from seeing it, at least if the following bit vector operations get inlined.
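
For illustration of that tradeoff only (the Traits type below is a simplified stand-in, not the real BitVecTraits, and LoopBlockTraits here just mirrors the shape of such an accessor): as long as the traits constructor and the bit-vector operations that consume it are inlined, the native compiler sees the size as a plain value, so recomputing the traits on each call may cost nothing compared to caching them in the loop object.

    // Simplified stand-ins -- not the JIT's real BitVecTraits/FlowGraphNaturalLoop.
    struct Traits
    {
        unsigned size; // number of bits tracked by vectors using these traits
    };

    struct Loop
    {
        unsigned blocksSize;

        // Recomputed on every call; trivially cheap when inlined together with
        // the bit-vector operations that consume it.
        Traits LoopBlockTraits() const
        {
            return Traits{blocksSize};
        }

        // Alternative under discussion: cache a Traits member here instead.
    };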

@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Dec 5, 2023
@BruceForstall
Member

What style of comments are you looking for on the declarations? I'm a bit reluctant to duplicate documentation when it's already there on the definitions.

In particular, for the classes/structs: a summary of the purpose of the class. Possibly a note on the implementation. Comments on the member variables. For functions, don't duplicate the definition comments. It might be worthwhile writing a short comment at the function declarations; that's a judgement call to provide clarity when reading the declarations but not duplicating the definitions.

  • Should the set size be compressed further by subtracting the minimum block member postorder number? ...

I actually did this initially, but it felt a bit premature. In particular it's a bit messy since reducing the size can change the representation of the bit vectors. I still think the optimization makes sense, but we need a sanctioned API in BitVecOps to do it. Something that I think we can look at separately.

Is this because you need to add blocks to the loop before knowing all the blocks in the loop, hence not knowing what the lowest block number is? I suppose the loop finder could have a single "full size" block set it uses/re-uses while constructing a loop, then generate the "compressed" block set afterwards, copying over the bits as necessary.

@jakobbotsch
Member Author

What style of comments are you looking for on the declarations? I'm a bit reluctant to duplicate documentation when it's already there on the definitions.

In particular, for the classes/structs: a summary of the purpose of the class. Possibly a note on the implementation. Comments on the member variables. For functions, don't duplicate the definition comments. It might be worthwhile writing a short comment at the function declarations; that's a judgement call to provide clarity when reading the declarations but not duplicating the definitions.

Gotcha, that makes a lot of sense. Will do that.

  • Should the set size be compressed further by subtracting the minimum block member postorder number? ...

I actually did this initially, but it felt a bit premature. In particular it's a bit messy since reducing the size can change the representation of the bit vectors. I still think the optimization makes sense, but we need a sanctioned API in BitVecOps to do it. Something that I think we can look at separately.

Is this because you need to add blocks to the loop before knowing all the blocks in the loop, hence not knowing what the lowest block number is? I suppose the loop finder could have a single "full size" block set it uses/re-uses while constructing a loop, then generate the "compressed" block set afterwards, copying over the bits as necessary.

Yeah, we do not know the lowest postorder number before we have run the code in FindNaturalLoopBlocks, and that function needs a bit vector. So what I was doing was tracking the lowest postorder number during that search, and then resizing the vector afterwards. Using a single full-size bit vector and then copying it makes a lot of sense, I think; I'll add it to my TODO list to try this out.
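
For what it's worth, a rough sketch of that two-pass idea, using plain C++ containers as stand-ins for the JIT's BitVec machinery (the names and types here are illustrative, not the actual FindNaturalLoopBlocks code): discover the loop blocks into a scratch set sized for the whole DFS while tracking the smallest postorder number seen, then allocate a compressed set sized to the loop's postorder span and copy the bits across.

    #include <vector>

    struct CompressedBlockSet
    {
        unsigned          lowestNum; // smallest member postorder number
        unsigned          headerNum; // header's postorder number
        std::vector<bool> bits;      // indexed by (postorderNum - lowestNum)
    };

    // scratch is the full-size set filled in while walking the loop's blocks.
    CompressedBlockSet CompressBlockSet(const std::vector<bool>& scratch,
                                        unsigned                 headerNum,
                                        unsigned                 lowestNum)
    {
        CompressedBlockSet result;
        result.lowestNum = lowestNum;
        result.headerNum = headerNum;
        result.bits.resize(headerNum - lowestNum + 1, false);

        // Copy only the span that can contain loop members.
        for (unsigned num = lowestNum; num <= headerNum; num++)
        {
            if (scratch[num])
            {
                result.bits[num - lowestNum] = true;
            }
        }
        return result;
    }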

@jakobbotsch
Member Author

@BruceForstall I've added some more documentation and comments on the fields/types. Can you please take another look?

@BruceForstall
Member

Thanks for adding the comments; those are helpful.

Regarding adding new blocks: the new loop graph benefits from having the old loop recognition insert pre-headers before the new loop graph is constructed. If the old loop recognition is removed, then pre-header creation will have to be rebuilt on the new loop graph. This in itself might require a loop graph rebuild if there isn't incremental update capability. (The same will also apply to the current set of loop canonicalizations, such as splitting loop header blocks.)

nit: you use m_tree to store the DFS tree in various places. That name seems a little too generic (might refer to a GenTree, for example). Maybe m_dfsTree instead? In some places it's m_dfs. Also change to m_dfsTree to be consistent?

Related: FlowGraphDominatorTree has:

    const FlowGraphDfsTree* m_dfs;
    const DomTreeNode* m_tree;

which, IMO, should be:

    const FlowGraphDfsTree* m_dfsTree;
    const DomTreeNode* m_domTree;

Looks like GetDfsTree is unused.

@BruceForstall
Member

We currently have fgDebugCheckLoopTable. What kind of similar checking should be added for the new loop table?

Comment on lines +3272 to +3273
m_dfs = fgComputeDfs();
m_loops = FlowGraphNaturalLoops::Find(m_dfs);
Member

The "TODO" above points out that we don't rebuild the loop table, but obviously here we do rebuild the new loop table.

Does anyone use this after this point?

If they used it and it wasn't recomputed, would it be incorrect, or just missing data?

It looks like ProfileSynthesis has its own copy of FlowGraphDfsTree, FlowGraphNaturalLoops. Should it just use the ones in Compiler?

Member Author

No, this is unused after this point. We are going to need to do this recomputation eventually, so I simply added it now to get a more accurate view of the TP cost.
I'll get rid of the TODO as part of removing the old loop stuff.

If they used it and it wasn't recomputed, would it be incorrect, or just missing data?

If you change the flow graph and then recompute the DFS, then BasicBlock::bbPostorderNum will become invalid and mappings into m_dfs and m_loops will give nonsensical data. This is why I think it makes sense to assert that m_dfs matches the current flow graph in post-phase checks (and null it out once we no longer need it, i.e. after RBO). As long as the DFS isn't recomputed you can add and remove blocks, and queries on the DFS tree/loop structures will just give incorrect (outdated) results for the affected blocks. For example, if you add a new block that is clearly reachable, m_dfs->Contains(newBlock) will incorrectly return false. In loop cloning we process parent loops before their children, which ensures we do not invalidate any of the stored data for any future loop we will see; thus we can keep acting on it. Loop unrolling (as I mentioned above) will be a bit harder.
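
As a purely illustrative sketch of the post-phase check described here (fgComputeDfs and m_dfs appear elsewhere in this PR; the function name and the GetPostOrderCount/GetPostOrder members below are assumptions, not the actual Compiler API): recompute a DFS over the current flow graph and compare it with the stored one, so any block added or removed since m_dfs was built shows up as a mismatch.

    // Hypothetical debug check -- a sketch, not the real implementation.
    void Compiler::fgDebugCheckDfsTreeIsCurrent()
    {
        if (m_dfs == nullptr)
        {
            return; // nothing to validate once the DFS has been invalidated
        }

        FlowGraphDfsTree* fresh = fgComputeDfs();

        // Same number of reachable blocks...
        assert(fresh->GetPostOrderCount() == m_dfs->GetPostOrderCount());

        // ...and the same blocks in the same postorder positions.
        for (unsigned i = 0; i < fresh->GetPostOrderCount(); i++)
        {
            assert(fresh->GetPostOrder(i) == m_dfs->GetPostOrder(i));
        }
    }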

It looks like ProfileSynthesis has its own copy of FlowGraphDfsTree, FlowGraphNaturalLoops. Should it just use the ones in Compiler?

I don't think so. I think if we can avoid keeping ambient state like this then we should; otherwise we need to take care to invalidate the information in the right places.

@jakobbotsch
Member Author

Regarding adding new blocks: the new loop graph benefits from having the old loop recognition insert pre-headers before the new loop graph is constructed. If the old loop recognition is removed, then pre-header creation will have to be rebuilt on the new loop graph. This in itself might require a loop graph rebuild if there isn't incremental update capability. (The same will also apply to the current set of loop canonicalizations, such as splitting loop header blocks.)

Correct, we will need to re-find the loops. I do not think it is a problem -- the old loop finding also redoes renumbering, DFS, dominators and dense reachability in these cases; the only thing it avoids redoing is the loop finding itself, and I expect loop finding to be cheaper than dominators and reachability.

nit: you use m_tree to store the DFS tree in various places. That name seems a little too generic (might refer to a GenTree, for example). Maybe m_dfsTree instead? In some places it's m_dfs. Also change to m_dfsTree to be consistent?

I'll rename it, but I'd prefer to do it as part of the follow-up #95589, since I'm going to have to merge that anyway.

FWIW, given that this is properly encapsulated and used in a class that only deals with the flow graph I personally do not see any confusion from this. If it was publicly exposed I'd be more inclined to consider it an issue.

We currently have fgDebugCheckLoopTable. What kind of similar checking should be added for the new loop table?

I think it makes sense to add a post-phase check that m_dfs is valid and up-to-date with the current flow graph. I do not see a reason to assert anything on m_loops -- it is a snapshot of the loops in m_dfs and the validity is asserted during its construction. Do you mind if I do it as a separate PR?

@BruceForstall
Member

Do you mind if I do it as a separate PR?

no problem

@jakobbotsch jakobbotsch merged commit b4f1e7c into dotnet:main Dec 5, 2023
129 checks passed
@jakobbotsch jakobbotsch deleted the loop-clone-new-representation branch December 5, 2023 18:17
@jakobbotsch
Member Author

Thanks for the review @BruceForstall! I opened #95649, #95651 and #95652 for the outstanding issues (will do the renaming as part of #95589, so I didn't open an issue for it)

@github-actions github-actions bot locked and limited conversation to collaborators Jan 5, 2024