This is an archive of the discontinued LLVM Phabricator instance.

[Passes] Run peeling as part of simple/full loop unrolling.
ClosedPublic

Authored by fhahn on Sep 29 2020, 2:18 AM.

Details

Summary

Loop peeling removes conditions from loop bodies that become invariant
after a small number of iterations. When triggered, this leads to fewer
compares and possibly PHIs in loop bodies, enabling further
optimizations. The current cost-model of loop peeling should be quite
conservative/safe, i.e. only peel if a condition in the loop becomes
known after peeling.

For example, see PR47671, where loop peeling enables vectorization by
removing a PHI the vectorizer does not understand. Granted, the
loop-vectorizer could also be taught about constant PHIs, but loop
peeling is likely to enable other optimizations as well.

This has an impact on quite a few benchmarks from
MultiSource/SPEC2000/SPEC2006 on X86 with -O3 -flto, for example

Same hash: 186 (filtered out)
Remaining: 51
Metric: loop-vectorize.LoopsVectorized

Program                                        base   patch  diff
 test-suite...ve-susan/automotive-susan.test     8.00   9.00 12.5%
 test-suite...nal/skidmarks10/skidmarks.test    35.00  31.00 -11.4%
 test-suite...lications/sqlite3/sqlite3.test    41.00  43.00  4.9%
 test-suite...s/ASC_Sequoia/AMGmk/AMGmk.test    25.00  26.00  4.0%
 test-suite...006/450.soplex/450.soplex.test    88.00  89.00  1.1%
 test-suite...TimberWolfMC/timberwolfmc.test   120.00 119.00 -0.8%
 test-suite.../CINT2006/403.gcc/403.gcc.test   215.00 216.00  0.5%
 test-suite...006/447.dealII/447.dealII.test   957.00 958.00  0.1%
 test-suite...ternal/HMMER/hmmcalibrate.test    75.00  75.00  0.0%

Same hash: 186 (filtered out)
Remaining: 51
Metric: loop-vectorize.LoopsAnalyzed

Program                                        base    patch   diff
 test-suite...ks/Prolangs-C/agrep/agrep.test   440.00  434.00  -1.4%
 test-suite...nal/skidmarks10/skidmarks.test   312.00  308.00  -1.3%
 test-suite...marks/7zip/7zip-benchmark.test   6399.00 6323.00 -1.2%
 test-suite...lications/minisat/minisat.test   134.00  135.00   0.7%
 test-suite...rks/FreeBench/pifft/pifft.test   295.00  297.00   0.7%
 test-suite...TimberWolfMC/timberwolfmc.test   1879.00 1869.00 -0.5%
 test-suite...pplications/treecc/treecc.test   689.00  691.00   0.3%
 test-suite...T2000/300.twolf/300.twolf.test   1593.00 1597.00  0.3%
 test-suite.../Benchmarks/Bullet/bullet.test   1394.00 1392.00 -0.1%
 test-suite...ications/JM/ldecod/ldecod.test   1431.00 1429.00 -0.1%
 test-suite...6/464.h264ref/464.h264ref.test   2229.00 2230.00  0.0%
 test-suite...lications/sqlite3/sqlite3.test   2590.00 2589.00 -0.0%
 test-suite...ications/JM/lencod/lencod.test   2732.00 2733.00  0.0%
 test-suite...006/453.povray/453.povray.test   3395.00 3394.00 -0.0%

Note the -11% regression in number of loops vectorized for skidmarks. I
suspect this corresponds to the fact that those loops are gone now (see
the reduction in number of loops analyzed by LV).

Diff Detail

Event Timeline

fhahn created this revision.Sep 29 2020, 2:18 AM
Herald added a project: Restricted Project. · View Herald Transcript
fhahn requested review of this revision.Sep 29 2020, 2:18 AM

The patch doesn't seem to apply for me:

llvm-project$ arc patch --nobranch D88471 --skip-dependencies
You have untracked files in this working copy.

  Working copy: /repositories/llvm-project/

  Untracked changes in working copy:
  (To ignore these 4 change(s), add them to ".git/info/exclude".)
    llvm/CMakeLists.txt.user.4.9-pre1
    llvm/CMakeLists.txt.user.old
    llvm/lib/Transforms/IPO/PassManagerBuilder.cpp.rej
    llvm/utils/compare-stats.py

    Ignore these 4 untracked file(s) and continue? [y/N] y

 INFO  Base commit is not in local repository; trying to fetch.


    This diff is against commit 6c1ce7e277624882c6139b978f90a489d638aff8, but
    the commit is nowhere in the working copy. Try to apply it against the
    current working copy state? (0be51c09131f92651e5bad9ca0e3fb8fdc37d84a)
    [Y/n] y

Checking patch llvm/test/Transforms/PhaseOrdering/X86/peel-before-lv-to-enable-vectorization.ll...
error: llvm/test/Transforms/PhaseOrdering/X86/peel-before-lv-to-enable-vectorization.ll: does not exist in index
Checking patch llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp...
Checking patch llvm/lib/Transforms/IPO/PassManagerBuilder.cpp...
error: while searching for:
  if (EnableLoopInterchange)
    MPM.add(createLoopInterchangePass()); // Interchange loops

  // Unroll small loops
  MPM.add(createSimpleLoopUnrollPass(OptLevel, DisableUnrollLoops,
                                     ForgetAllSCEVInLoopUnroll));
  addExtensionsToPM(EP_LoopOptimizerEnd, MPM);

error: patch failed: llvm/lib/Transforms/IPO/PassManagerBuilder.cpp:445
error: while searching for:
  if (EnableLoopInterchange)
    PM.add(createLoopInterchangePass());

  // Unroll small loops
  PM.add(createSimpleLoopUnrollPass(OptLevel, DisableUnrollLoops,
                                    ForgetAllSCEVInLoopUnroll));
  PM.add(createLoopVectorizePass(true, !LoopVectorize));

error: patch failed: llvm/lib/Transforms/IPO/PassManagerBuilder.cpp:1036
Checking patch llvm/include/llvm/Transforms/Scalar/LoopUnrollPass.h...
Checking patch llvm/include/llvm/Transforms/Scalar.h...
Hunk #1 succeeded at 184 (offset 6 lines).
Applied patch llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp cleanly.
Applying patch llvm/lib/Transforms/IPO/PassManagerBuilder.cpp with 2 rejects...
Rejected hunk #1.
Rejected hunk #2.
Applied patch llvm/include/llvm/Transforms/Scalar/LoopUnrollPass.h cleanly.
Applied patch llvm/include/llvm/Transforms/Scalar.h cleanly.

 Patch Failed! 
Usage Exception: Unable to apply patch!
fhahn updated this revision to Diff 295788.Oct 2 2020, 4:30 AM

The patch doesn't seem to apply for me:

Rebased, should apply now.

In terms of runtime impact, I measured the following impact on SPEC2006 on ARM64 with -O3 -flto, outside the noise:

test-suite...006/450.soplex/450.soplex.test   -2.4%
test-suite...6/464.h264ref/464.h264ref.test    1.3%
test-suite...T2006/473.astar/473.astar.test   -0.9%

Presumably the h264ref regression is related to wrong decisions in the cost modeling and probably warrants some further analysis.

The patch doesn't seem to apply for me:

Rebased, should apply now.

It does, thank you.

In terms of runtime impact, I measured the following impact on SPEC2006 on ARM64 with -O3 -flto, outside the noise:

test-suite...006/450.soplex/450.soplex.test   -2.4%
test-suite...6/464.h264ref/464.h264ref.test    1.3%
test-suite...T2006/473.astar/473.astar.test   -0.9%

Presumably the h264ref regression is related to wrong decisions in the cost modeling and probably warrants some further analysis.

lebedev.ri accepted this revision.Oct 2 2020, 6:18 AM

This looks good to me FWIW, on the code i have checked this on,
this is a win, causing small perf improvement and apparently no regressions.

Not sure if other reviewers have other comments.

This revision is now accepted and ready to land.Oct 2 2020, 6:18 AM

Are you gonna land this patch?

fhahn added a comment.Dec 16 2020, 7:31 AM

Are you gonna land this patch?

Yes I plan to land this soon. I'd just like to make sure to land this after addressing some perf regression we are seeing internally (unrelated to this patch), so it is easy to see whether this introduces any new ones.

Do you mean issues solved by D94232?

This revision was landed with ongoing or failed builds.Jan 26 2021, 5:53 AM
This revision was automatically updated to reflect the committed changes.

Hey, we've run into a case where this patch causes a dead loop to appear which doesn't subsequently get removed. It's not a huge deal (we're seeing some great speed-ups from this patch too!), but a niggle that might be nice to address. Reproducer:

opt -O3 -mtriple=aarch64 ~/tmp/unroll-into-dead-loop.ll -S -o - results in this silly little loop:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %index.next = add i64 %index, 2
  %0 = icmp eq i64 %index.next, %n.vec
  br i1 %0, label %middle.block, label %vector.body, !llvm.loop !1