This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] Defer folding unconditional branches to LateSimplifyCFG if it can destroy canonical loop structure.
ClosedPublic

Authored by bmakam on Jul 14 2017, 5:13 AM.

Details

Summary

When simplifying unconditional branches from empty blocks, we pre-test if the
BB belongs to a set of loop headers and keep the block to prevent passes from
destroying canonical loop structure. However, the current algorithm fails if
the destination of the branch is a loop header. Especially when such a loop's
latch block is folded into loop header it results in additional backedges and
LoopSimplify turns it into a nested loop which prevent later optimizations
from being applied (e.g., loop unrolling and loop interleaving).

This patch augments the existing algorithm by further checking if the
destination of the branch belongs to a set of loop headers and defer
eliminating it if yes to LateSimplifyCFG.

Fixes PR33605: https://bugs.llvm.org/show_bug.cgi?id=33605

Diff Detail

Repository
rL LLVM

Event Timeline

bmakam created this revision.Jul 14 2017, 5:13 AM
mcrosier edited edge metadata.EditedJul 14 2017, 8:36 AM

Let me see if I can describe the problem and your approach to fixing the issue in my own words.

Currently, JumpThreading and SimplifyCFG avoid threading/merging "empty" loop headers as this would break the canonical form of the loop; the CFG edge being optimized is between the loop header and its successor. Your approach is to also avoid merging the incoming edges (i.e., back edges) to the loop header as well to avoid breaking the canonical form of the loop. Then later in late-SimplifyCFG and CodeGen prepare you more aggressively remove these empty blocks.

Sound about right?

lib/Transforms/Scalar/JumpThreading.cpp
239 ↗(On Diff #106623)

This comment needs some additional explaining after your change.

You're avoiding the case where we're skipping an empty block whose successor is a loop header, which IIUC when merged destroys the canonical form of the loop. You're then relying on CodeGenPrepare to eliminate these empty blocks.

bmakam updated this revision to Diff 106651.Jul 14 2017, 9:28 AM

Let me see if I can describe the problem and your approach to fixing the issue in my own words.

Currently, JumpThreading and SimplifyCFG avoid threading/merging "empty" loop headers as this would break the canonical form of the loop; the CFG edge being optimized is between the loop header and its successor. Your approach is to also avoid merging the incoming edges (i.e., back edges) to the loop header as well to avoid breaking the canonical form of the loop. Then later in late-SimplifyCFG and CodeGen prepare you more aggressively remove these empty blocks.

Sound about right?

Thanks Chad,
That's right.

bmakam marked an inline comment as done.Jul 14 2017, 9:29 AM
efriedma edited edge metadata.Jul 14 2017, 11:29 AM

I'm sort of worried this could have unexpected consequences; do you have performance numbers? (LLVM testsuite or SPEC)

lib/CodeGen/CodeGenPrepare.cpp
668 ↗(On Diff #106651)

Do you need to update the latch set here?

test/Transforms/LoopUnroll/pr33605.ll
1 ↗(On Diff #106651)

We usually prefer tests for transformation passes which just check the output of the pass itself, so they aren't so sensitive to changes in other passes.

bmakam updated this revision to Diff 106714.Jul 14 2017, 2:54 PM

Update latch set and modify a test case. Thanks, Eli.

gberry added a subscriber: gberry.Jul 14 2017, 3:01 PM
bmakam marked an inline comment as done.Jul 14 2017, 3:06 PM

I'm sort of worried this could have unexpected consequences; do you have performance numbers? (LLVM testsuite or SPEC)

I was targeting to unroll a hot loop in spec2017/gcc. In addition to unrolling the hot loop in spec2017/gcc which yielded 2% improvement, I observed a loop interleaved in povray which yielded 6-7% improvement.
Here are the full perf results for SPEC on Falkor with O3 config:

Benchmark                Diff (%)  
----------------------- ----------
spec2006/bzip2:ref       -2.18
spec2017/omnetpp:ref     -1.2
spec2006/perlbench:ref   -1.07
spec2000/equake:ref      -0.89
spec2000/art:ref         -0.79
spec2017/leela:ref       -0.67
spec2000/bzip2:ref       0.77
spec2006/namd:ref        0.84
spec2017/xz:ref          0.89
spec2000/gcc:ref         0.9
spec2006/mcf:ref         1.07
spec2017/deepsjeng:ref   1.62
spec2000/crafty:ref      1.62
spec2017/gcc:ref         2.18
spec2006/dealII:ref      3.73
spec2017/blender:ref     4.78
spec2006/povray:ref      6.25
spec2017/povray:ref      6.85
spec2000/gap:ref         7.15

Thanks Balaram for posting this patch, in general idea looks good to preserve the canonical form of the loops.

Any idea why these benchmark regressed:

spec2006/bzip2:ref -2.18
spec2017/omnetpp:ref -1.2
spec2006/perlbench:ref -1.07

Thanks Balaram for posting this patch, in general idea looks good to preserve the canonical form of the loops.

Any idea why these benchmark regressed:

spec2006/bzip2:ref -2.18
spec2017/omnetpp:ref -1.2
spec2006/perlbench:ref -1.07

Thanks Ashutosh,

I rerun these 3 benchmarks and compared with the tip of trunk today and the regression in spec2006/bzip2 and spec2006/perlbench were found to be dubious. spec2017/omnetpp is still regressed by same ~1%. I looked at all the perf counters and nothing was obvious. Perhaps an alignment issue?

bmakam updated this revision to Diff 106935.Jul 17 2017, 1:42 PM

Minor code clean up, NFCI.

Needs testcases for the jump threading and CGP changes.

Why do we need the empty block folding in both LateSimplifyCFG and CGP?

lib/CodeGen/CodeGenPrepare.cpp
668 ↗(On Diff #106651)

While you're here, do we also need to update the Preheaders set?

705 ↗(On Diff #106714)

Put isLatch first?

lib/Transforms/Scalar/JumpThreading.cpp
243 ↗(On Diff #106714)

BI->getSuccessor(0)?

lib/Transforms/Utils/SimplifyCFG.cpp
5659 ↗(On Diff #106714)

BI->getSuccessor(0)?

bmakam updated this revision to Diff 107144.Jul 18 2017, 11:35 AM

Update to address Eli's comments.

bmakam marked 4 inline comments as done.Jul 18 2017, 11:38 AM

Why do we need the empty block folding in both LateSimplifyCFG and CGP?

Empty folding in CGP occurs very late after LSR and is also not aggressive because it is not run iteratively. LateSimplifyCFG cannot catch empty case blocks after switch is lowered, so we need the empty block folding in both places.

LateSimplifyCFG cannot catch empty case blocks after switch is lowered, so we need the empty block folding in both places.

"After switch is lowered"? What transform are you talking about?

LateSimplifyCFG cannot catch empty case blocks after switch is lowered, so we need the empty block folding in both places.

"After switch is lowered"? What transform are you talking about?

I was talking about Switch-to-lookup table transform.

SwitchToLookupTable is itself part of LateSimplifyCFG; we should fold empty blocks afterwards, I think?

bmakam added a comment.EditedJul 18 2017, 12:27 PM

SwitchToLookupTable is itself part of LateSimplifyCFG; we should fold empty blocks afterwards, I think?

I was expecting the same but found that LateSimplifyCFG could not catch empty case blocks in spec2017/perlbench. Perhaps later transformations like unreachableblockelim or constanthoist could turn some cases into empty blocks?

Edit: I went back and checked that during LSR loopsimplify creates dedicated exit block which are now empty and need to be cleaned up in CGP.

Hmm...

I took a quick look at the pass pipeline (PassManagerBuilder::populateModulePassManager), and it turns out LateSimplifyCFG is false for the last simplifycfg run. That might be the source of your problem?

Hmm...

I took a quick look at the pass pipeline (PassManagerBuilder::populateModulePassManager), and it turns out LateSimplifyCFG is false for the last simplifycfg run. That might be the source of your problem?

Thanks, Eli. In case you missed my previous comment, the problem here is that during LSR, loopsimplify creates dedicated exit blocks which now need to be cleaned up in CGP.

Oh, LSR is generating the extra block? That makes sense... but it seems mostly unrelated to the rest of this patch. Does it have any impact on its own?

lib/CodeGen/CodeGenPrepare.cpp
671 ↗(On Diff #107144)

The "if" is redundant; erase() does nothing if the set doesn't contain BB.

I mean the CodeGenPrepare changes (which are deleting blocks generated by LSR) vs the other changes (which modify transforms that run before LSR).

bmakam updated this revision to Diff 107175.Jul 18 2017, 2:00 PM

Minor cleanup. Thanks, Eli.

bmakam marked an inline comment as done.Jul 18 2017, 2:05 PM

I mean the CodeGenPrepare changes (which are deleting blocks generated by LSR) vs the other changes (which modify transforms that run before LSR).

It seems to be unrelated to the rest of this patch. Though the problem in spec2017/perlbench surfaces only with this patch and regresses the benchmark by 3%. The CGP changes by itself was performance neutral on SPEC.
I can split this out as a separate change if you prefer.

Yes, please split.

lib/CodeGen/CodeGenPrepare.cpp
669 ↗(On Diff #107175)

You missed the other redundant if.

bmakam updated this revision to Diff 107181.Jul 18 2017, 2:26 PM

Split CGP changes into a separate follow on patch, per Eli's request.

This revision is now accepted and ready to land.Jul 18 2017, 2:33 PM

Thanks, Eli.

The CGP changes are here: D35584

This revision was automatically updated to reflect the committed changes.