This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroll] Directly update DT instead of DTU.
ClosedPublic

Authored by fhahn on Jan 11 2023, 5:44 AM.

Details

Summary

The scope of DT updates are very limited when unrolling loops: the DT
should only need updating for

  • new blocks added
  • exiting blocks we simplified branches

This can be done manually without too much extra work.
MergeBlockIntoPredecessor and changeToUnreachable also need to be
updated to support direct DT updates.

This fixes excessive time spent in DTU for same cases. In an internal
example, time spent in LoopUnroll with this patch goes from ~200s to 2s.

It also is slightly positive for CTMark:

  • NewPM-O3: -0.13%
  • NewPM-ReleaseThinLTO: -0.11%
  • NewPM-ReleaseLTO-g: -0.13%

Notable improvements are mafft (~ -0.50%) and lencod (~ -0.30%), with no
workload regressed.

https://llvm-compile-time-tracker.com/compare.php?from=78a9ee7834331fb4360457cc565fa36f5452f7e0&to=687e08d011b0dc6d3edd223612761e44225c7537&stat=instructions:u

Diff Detail

Event Timeline

fhahn created this revision.Jan 11 2023, 5:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 11 2023, 5:44 AM
fhahn requested review of this revision.Jan 11 2023, 5:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 11 2023, 5:44 AM
nikic added inline comments.Jan 11 2023, 6:57 AM
llvm/include/llvm/Transforms/Utils/BasicBlockUtils.h
100

Here and elsewhere, please add a comment -- normally, if there's a DT rather than DTU argument, I'd expect the DT to be used rather than updated.

llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
239

Do we need to check whether PredNode / BBNode are null? A bit degenerate, but we might be merging two unreachable blocks, no?

fhahn updated this revision to Diff 488227.Jan 11 2023, 8:01 AM

Document DT argument, check if DT nodes are null (=unreachable)

fhahn marked 2 inline comments as done.Jan 11 2023, 8:03 AM
fhahn added inline comments.
llvm/include/llvm/Transforms/Utils/BasicBlockUtils.h
100

Good point, added a comment here and for changeToUnreachable

llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
239

Added a check. I think it should be sufficient to check if PRedNode is reachable (!= nullptr), as BB must have PredBB as unique predecessor.

Also added a check to updateDominatorTreeUsingPredecessors.

nikic added inline comments.Jan 11 2023, 8:26 AM
llvm/lib/Transforms/Utils/Local.cpp
3530

What about the case where BB has a (now also unreachable) successor? Don't we have to erase all those nodes as well?

fhahn updated this revision to Diff 488282.Jan 11 2023, 10:32 AM
fhahn marked 2 inline comments as done.

Update to remove all nodes connected to node that became unreachable.

fhahn marked an inline comment as done.Jan 11 2023, 10:33 AM
fhahn added inline comments.
llvm/lib/Transforms/Utils/Local.cpp
3530

I am not sure what the exact invariant in the DT is, but I updated the code to remove all connected nodes to be on the safe side. Same in MergeBlockIntoPredecessor

I'm curious where all the time is spent in the 200s to 2s case. Is Eager strategy just as slow?

fhahn marked an inline comment as done.Jan 12 2023, 2:30 AM

I'm curious where all the time is spent in the 200s to 2s case. Is Eager strategy just as slow?

The DTU here was using the lazy strategy.

A profile is below (note that this is on a different (slower) machine, so it takes longer than 200s :)) . AFACIT most of the time is spent discovering reachability after deleting edges. I think the main reason manual updates are much quicker here is that we can use the loop structure to avoid determining reachability; after updating the DT for nodes connected to the exits, no other nodes in the DT should need updating.

8.06 min   99.9%	0 s	 	             llvm::UnrollLoop(llvm::Loop*, llvm::UnrollLoopOptions, llvm::LoopInfo*, llvm::ScalarEvolution*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::TargetTransformInfo const*, llvm::OptimizationRemarkEmitter*, bool, llvm::Loop**)
8.04 min   99.8%	0 s	 	              llvm::DomTreeUpdater::getDomTree()
8.04 min   99.7%	0 s	 	               llvm::DomTreeUpdater::applyDomTreeUpdates()
8.04 min   99.7%	0 s	 	                llvm::DominatorTreeBase<llvm::BasicBlock, false>::applyUpdates(llvm::ArrayRef<llvm::cfg::Update<llvm::BasicBlock*> >)
8.04 min   99.7%	0 s	 	                 void llvm::DomTreeBuilder::ApplyUpdates<llvm::DominatorTreeBase<llvm::BasicBlock, false> >(llvm::DominatorTreeBase<llvm::BasicBlock, false>&, llvm::GraphDiff<llvm::DominatorTreeBase<llvm::BasicBlock, false>::NodePtr, llvm::DominatorTreeBase<llvm::BasicBlock, false>::IsPostDominator>&, llvm::GraphDiff<llvm::DominatorTreeBase<llvm::BasicBlock, false>::NodePtr, llvm::DominatorTreeBase<llvm::BasicBlock, false>::IsPostDominator>*)
8.04 min   99.7%	2.00 ms	 	                  llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::ApplyUpdates(llvm::DominatorTreeBase<llvm::BasicBlock, false>&, llvm::GraphDiff<llvm::BasicBlock*, false>&, llvm::GraphDiff<llvm::BasicBlock*, false>*)
8.04 min   99.7%	3.00 ms	 	                   llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::ApplyNextUpdate(llvm::DominatorTreeBase<llvm::BasicBlock, false>&, llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::BatchUpdateInfo&)
7.10 min   88.0%	6.00 ms	 	                    llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::DeleteEdge(llvm::DominatorTreeBase<llvm::BasicBlock, false>&, llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::BatchUpdateInfo*, llvm::BasicBlock*, llvm::BasicBlock*)
7.09 min   88.0%	5.00 ms	 	                     llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::DeleteReachable(llvm::DominatorTreeBase<llvm::BasicBlock, false>&, llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::BatchUpdateInfo*, llvm::DomTreeNodeBase<llvm::BasicBlock>*, llvm::DomTreeNodeBase<llvm::BasicBlock>*)
4.61 min   57.2%	2.88 s	 	                      unsigned int llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::runDFS<false, llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::DeleteReachable(llvm::DominatorTreeBase<llvm::BasicBlock, false>&, llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::BatchUpdateInfo*, llvm::DomTreeNodeBase<llvm::BasicBlock>*, llvm::DomTreeNodeBase<llvm::BasicBlock>*)::'lambda'(llvm::BasicBlock*, llvm::BasicBlock*)>(llvm::BasicBlock*, unsigned int, llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::DeleteReachable(llvm::DominatorTreeBase<llvm::BasicBlock, false>&, llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::BatchUpdateInfo*, llvm::DomTreeNodeBase<llvm::BasicBlock>*, llvm::DomTreeNodeBase<llvm::BasicBlock>*)::'lambda'(llvm::BasicBlock*, llvm::BasicBlock*), unsigned int, llvm::DenseMap<llvm::BasicBlock*, unsigned int, llvm::DenseMapInfo<llvm::BasicBlock*, void>, llvm::detail::DenseMapPair<llvm::BasicBlock*, unsigned int> > const*)
1.52 min   18.9%	5.86 s	 	                      llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::runSemiNCA(llvm::DominatorTreeBase<llvm::BasicBlock, false>&, unsigned int)
41.53 s    8.5%	1.07 s	 	                      llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::reattachExistingSubtree(llvm::DominatorTreeBase<llvm::BasicBlock, false>&, llvm::DomTreeNodeBase<llvm::BasicBlock>*)
15.22 s    3.1%	4.00 ms	 	                      llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::~SemiNCAInfo()
452.00 ms    0.0%	0 s	 	                      0xfffffffffffffffe
64.00 ms    0.0%	64.00 ms	 	                      DYLD-STUB$$llvm::DominatorTreeBase<llvm::BasicBlock, false>::getNode(llvm::BasicBlock const*) const
31.00 ms    0.0%	0 s	 	                      llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::SemiNCAInfo(llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::BatchUpdateInfo*)
29.00 ms    0.0%	29.00 ms	 	                      DYLD-STUB$$llvm::SmallVectorBase<unsigned int>::empty() const
3.00 ms    0.0%	0 s	 	                      llvm::DominatorTreeBase<llvm::BasicBlock, false>::findNearestCommonDominator(llvm::BasicBlock*, llvm::BasicBlock*) const
1.00 ms    0.0%	1.00 ms	 	                      DYLD-STUB$$llvm::DomTreeNodeBase<llvm::BasicBlock>::getLevel() const
1.00 ms    0.0%	1.00 ms	 	                      DYLD-STUB$$llvm::DomTreeNodeBase<llvm::BasicBlock>::setIDom(llvm::DomTreeNodeBase<llvm::BasicBlock>*)
1.00 ms    0.0%	0 s	 	                      llvm::DominatorTreeBase<llvm::BasicBlock, false>::getNode(llvm::BasicBlock const*) const
90.00 ms    0.0%	1.00 ms	 	                     llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::DeleteEdge(llvm::DominatorTreeBase<llvm::BasicBlock, false>&, llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::BatchUpdateInfo*, llvm::BasicBlock*, llvm::BasicBlock*)::'lambda'(llvm::BasicBlock*, llvm::BasicBlock*)::operator()(llvm::BasicBlock*, llvm::BasicBlock*) const
35.00 ms    0.0%	1.00 ms	 	                     llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::DeleteUnreachable(llvm::DominatorTreeBase<llvm::BasicBlock, false>&, llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::BatchUpdateInfo*, llvm::DomTreeNodeBase<llvm::BasicBlock>*)
24.00 ms    0.0%	1.00 ms	 	                     llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::HasProperSupport(llvm::DominatorTreeBase<llvm::BasicBlock, false>&, llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::BatchUpdateInfo*, llvm::DomTreeNodeBase<llvm::BasicBlock>*)
23.00 ms    0.0%	0 s	 	                     llvm::DominatorTreeBase<llvm::BasicBlock, false>::findNearestCommonDominator(llvm::BasicBlock*, llvm::BasicBlock*) const
11.00 ms    0.0%	2.00 ms	 	                     llvm::DominatorTreeBase<llvm::BasicBlock, false>::getNode(llvm::BasicBlock const*) const
1.00 ms    0.0%	1.00 ms	 	                     llvm::DomTreeNodeBase<llvm::BasicBlock>::getIDom() const
56.59 s   11.7%	1.00 ms	 	                    llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::InsertEdge(llvm::DominatorTreeBase<llvm::BasicBlock, false>&, llvm::DomTreeBuilder::SemiNCAInfo<llvm::DominatorTreeBase<llvm::BasicBlock, false> >::BatchUpdateInfo*, llvm::BasicBlock*, llvm::BasicBlock*)
54.00 ms    0.0%	7.00 ms	 	                    llvm::GraphDiff<llvm::BasicBlock*, false>::popUpdateForIncrementalUpdates()
1.00 ms    0.0%	1.00 ms	 	                    llvm::cfg::Update<llvm::BasicBlock*>::getFrom() const
kuhar added a comment.Jan 12 2023, 2:09 PM

Some thoughts:

  1. If possible, it'd be great if we could have a repro to use as a benchmark for the updater. And experiment with. In the past, I found that manually instrumenting the construction & updater algorithms with event counters (e.g., number of nodes visited by DFS, number of updates in-flight, initial/final tree size) to give more insight than perf traces.
  2. I find it very hard to prove that manual updates are correct. We found a lot of very rare corner cases that required combination of things like infinite loops, multi-edges, irreducible loops, etc., that broke seemingly correct manual update functions. This makes me strongly prefer not to add more manual update code, unless we have a very good justification. Here, 200s -> 2s seems convincing for me if we cannot find other workarounds.
  3. I haven't looked into this specific case closely, but I'll list some generic alternative ideas:
    • See if we can determine to recalculate in this case and if this would be faster than running the updater.
    • Detect the unreachable node case in the updater and handle it there.
  4. Could you add assertions to verify the tree just before and after applying the updates manually? This can be guarded with EXPENSIVE_CHECKS.
llvm/include/llvm/Transforms/Utils/BasicBlockUtils.h
93

nit: s/use it to update the dominator tree directly/update it directly

llvm/include/llvm/Transforms/Utils/Local.h
345

nit: also here

501

nit: s/it's/its

501–503

It's not clear to me what the exact assumptions are here. I'm thinking about things like predecessors having edges to other predecessors and other corner cases.

In addition, could you add some ascii diagrams to illustrate these two cases (new/deleted)?

llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
242

nit: either BBNode->children() in the loop directly or use llvm::to_vector?

llvm/lib/Transforms/Utils/LoopUnroll.cpp
697

nit: also here

777–780

nit: could you add comments with parameter names?

kuhar added inline comments.Jan 12 2023, 2:11 PM
llvm/include/llvm/Transforms/Utils/Local.h
504

Also, could we have some unit tests? I'm thinking about both having some concrete examples that we can look at and possibly death test to make sure that the assertions catch what we wanted to assume.

fhahn updated this revision to Diff 489015.Jan 13 2023, 7:59 AM
fhahn marked 7 inline comments as done.

Thanks for taking a look!

Some thoughts:

  1. If possible, it'd be great if we could have a repro to use as a benchmark for the updater. And experiment with. In the past, I found that manually instrumenting the construction & updater algorithms with event counters (e.g., number of nodes visited by DFS, number of updates in-flight, initial/final tree size) to give more insight than perf traces.

I'll work with the person who reported the issue to see if it is possible to come up with a case that can be easily shared publicly.

The general structure is a function with ~70 smallish consecutive loops, where each of them can be fully unrolled (with trip counts of 32 or somethingn.

  1. I find it very hard to prove that manual updates are correct. We found a lot of very rare corner cases that required combination of things like infinite loops, multi-edges, irreducible loops, etc., that broke seemingly correct manual update functions. This makes me strongly prefer not to add more manual update code, unless we have a very good justification. Here, 200s -> 2s seems convincing for me if we cannot find other workarounds.

Agreed that it is very difficult to make sure manual updates are correct in general. In the loop-unroll case, the types of changes we make are limited, which hopefully makes it more tractable.

  1. I haven't looked into this specific case closely, but I'll list some generic alternative ideas:
    • See if we can determine to recalculate in this case and if this would be faster than running the updater.

Recalculating the DTs from scratch is much quicker than DTU (~18s), but still noticeable slower than manual updates (~2s).

    • Detect the unreachable node case in the updater and handle it there.
  1. Could you add assertions to verify the tree just before and after applying the updates manually? This can be guarded with EXPENSIVE_CHECKS.

There are already DT verification assertions just before the begin of the manual updates and at the end. I can add some in-between as well if you think that's beneficial, but I am a bit worried about doing so too liberally even with expensive checks.

llvm/include/llvm/Transforms/Utils/BasicBlockUtils.h
93

I simplified it a bit further: If \p DT is not nullptr, update it directly;

llvm/include/llvm/Transforms/Utils/Local.h
345

updated, thanks!

501

fixed, thanks!

501–503

I updated the comment to hopefully be more clear. The key thing is that it assumes the dominator tree to be up-to-date for all predecessors:

The function assumes the dominator tree is valid for all predecessors and sets \p BB's immediate dominator to the nearest common dominator of all predecessors. \p BB cannot be a predecessor of itself.

I also added an assertion that BB cannot have itself as predecessor; in that instance, the assumption that the DT is up-to-date for all predecessors doesn't hold, as we just asked to update the DT for BB.

In addition, could you add some ascii diagrams to illustrate these two cases (new/deleted)?

I am not sure if that would help too much as the main requirement is that the DT is up-to-date for the predecessors, hopefully the comment update adds enough info.

504

I added some small example tests in llvm/unittests/Transforms/Utils/LocalTest.cpp. Please let me know if you have other cases in mind. I also built various external binaries and clang bootstrap with DT verification and saw no issues.

llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
242

Changing the immediate dominator of a child of BB might invalidate the children iterator I think, so we need a vector. Updated to use to_vector, thanks for point out that utility, I wasn't aware of it.

llvm/lib/Transforms/Utils/LoopUnroll.cpp
697

updated, thanks!

777–780

Done, thanks!

kuhar added a comment.Jan 13 2023, 8:19 AM

Recalculating the DTs from scratch is much quicker than DTU (~18s), but still noticeable slower than manual updates (~2s).

It would be great if we could detect those cases and teach DTU to trigger this recalculation. Would 18s be an acceptable fix for your use case?

There are already DT verification assertions just before the begin of the manual updates and at the end. I can add some in-between as well if you think that's beneficial, but I am a bit worried about doing so too liberally even with expensive checks.

Ack.

llvm/unittests/Transforms/Utils/LocalTest.cpp
1092

FYI, I added a utility for writing similar tests in the past (CFGBuilder), not sure if we could link with it here and if it would simplify the test code

fhahn updated this revision to Diff 489405.Jan 15 2023, 3:57 PM

I took a step back and removed the need to add and use a new updateDominatorTreeUsingPredecessors helper. I discovered cases in general where we miss updates. While this was not an issue for the LoopUnroll test case, it seems dangerous to expose it.

The current patch switches to a hybrid approach: if there's a single exiting node, it should be sufficient to directly update the immediate dominators of the original exiting block and move them to the first actual exiting block of the unrolled loop. for all other cases, use the DTU.

After D141810, only MergeBlockIntoPredecessor needs udpating, and here the update should also be a straight-forward transfer of immediate dominators. The latest version achieves the same CTMark gains, while it should be much more limited in scope.

Recalculating the DTs from scratch is much quicker than DTU (~18s), but still noticeable slower than manual updates (~2s).

It would be great if we could detect those cases and teach DTU to trigger this recalculation. Would 18s be an acceptable fix for your use case?

It would be an improvement, but LLVM performs loop-unrolling very aggressively and I think it is very desirable to make sure the code to update the DT here is as fast as possible. The latest version of the patch is even slightly faster.

I also experimented with cut-offs for the DTU based on the number of nodes it needs to traverse in its DFS, which yielded good results for the slow case, but was negative on CTMark :(

fhahn added inline comments.Jan 15 2023, 3:59 PM
llvm/lib/Transforms/Utils/Local.cpp
3530

I undid the change in MergeBlockIntoPredecessor. It is definitely not needed there, as the successors of the original BB are transferred to the predecessor.

nikic added inline comments.Jan 16 2023, 1:43 AM
llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
243

Doesn't this need to check whether the old idom was BB? Couldn't it be higher up?

fhahn marked an inline comment as done.Jan 16 2023, 5:43 AM
fhahn added inline comments.
llvm/lib/Transforms/Utils/BasicBlockUtils.cpp
243

I might be missing something here, but BBNode->children() iterates over the nodes which have BB as idom.

kuhar accepted this revision.Jan 16 2023, 8:10 AM

This looks fine to me but I haven't looked carefully at the surrounding code. Ideally, I think this should albo be reviewed by someone familiar with the loop unrolling implementation.

llvm/lib/Transforms/Utils/LoopUnroll.cpp
771

nit: we can use structured bindings here: auto& [Something, ExitInfo] = *ExitInfos.begin();

This revision is now accepted and ready to land.Jan 16 2023, 8:10 AM
fhahn updated this revision to Diff 490110.Jan 18 2023, 4:28 AM
fhahn marked an inline comment as done.

Thanks for taking a look! Updated to use structured bindings. I am planning to land this tomorrow unless there are additional comments.

This revision was landed with ongoing or failed builds.Jan 19 2023, 10:12 AM
This revision was automatically updated to reflect the committed changes.

this causes opt -passes=loop-unroll-full to crash on

define void @foo() {
bb:
  br label %bb1
    
bb1:                                              ; preds = %bb1, %bb1, %bb
  switch i1 true, label %bb1 [
    i1 true, label %bb2
    i1 false, label %bb1
  ]
    
 bb2:                                              ; preds = %bb1
  ret void
}

so I've reverted this for now

aeubanks reopened this revision.Jan 19 2023, 5:05 PM
This revision is now accepted and ready to land.Jan 19 2023, 5:05 PM

this causes opt -passes=loop-unroll-full to crash on

define void @foo() {
bb:
  br label %bb1
    
bb1:                                              ; preds = %bb1, %bb1, %bb
  switch i1 true, label %bb1 [
    i1 true, label %bb2
    i1 false, label %bb1
  ]
    
 bb2:                                              ; preds = %bb1
  ret void
}

so I've reverted this for now

I also reduced a clang segfault to this commit while compiling nettle library for RISCV64, The test script and case are here https://uclibc.org/~kraj/sexp-conv-5be224.zip

fhahn added a comment.Jan 20 2023, 8:35 AM

Thanks for the revert + test case! The issue was that some Exiting blocks may not be added to ExitInfos if they have a non-branch terminator. I added an extra check for ExitInfo's size and recommitted the patch. A test case has been added in f92b35392ed8e4631