This is an archive of the discontinued LLVM Phabricator instance.

[SCCP] Switch to widen at PHIs, stores and call edges.
ClosedPublic

Authored by fhahn on Apr 28 2020, 1:37 PM.

Details

Summary

Currently SCCP does not widen PHIs, stores or along call edges
(arguments/return values), but on operations that directly extend ranges
(like binary operators).

This means PHIs, stores and call edges are not pessimized by widening
currently, while binary operators are. The main reason for widening
operators initially was that opting-out for certain operations was
more straight-forward in the initial implementation (and it did not
matter too much, as range support initially was only implemented for a
very limited set of operations.

During the discussion in D78391, it was suggested to consider flipping
widening to PHIs, stores and along call edges. After adding support for
tracking the number of range extensions in ValueLattice, limiting the
number of range extensions per value is straight forward.

This patch introduces a MaxWidenSteps option to the MergeOptions,
limiting the number of range extensions per value. For PHIs, it seems
natural allow an extension for each (active) incoming value plus 1. For
the other cases, a arbitrary limit of 10 has been chosen initially. It would
potentially make sense to set it depending on the users of a
function/global, but that still needs investigating. This potentially
leads to more state-changes and longer compile-times.

The results look quite promising (MultiSource, SPEC):

Same hash: 179 (filtered out)
Remaining: 58
Metric: sccp.IPNumInstRemoved

Program base widen-phi diff
test-suite...ks/Prolangs-C/agrep/agrep.test 58.00 82.00 41.4%
test-suite...marks/SciMark2-C/scimark2.test 32.00 43.00 34.4%
test-suite...rks/FreeBench/mason/mason.test 6.00 8.00 33.3%
test-suite...langs-C/football/football.test 104.00 128.00 23.1%
test-suite...cations/hexxagon/hexxagon.test 36.00 42.00 16.7%
test-suite...CFP2000/177.mesa/177.mesa.test 214.00 249.00 16.4%
test-suite...ngs-C/assembler/assembler.test 14.00 16.00 14.3%
test-suite...arks/VersaBench/dbms/dbms.test 10.00 11.00 10.0%
test-suite...oxyApps-C++/miniFE/miniFE.test 43.00 47.00 9.3%
test-suite...ications/JM/ldecod/ldecod.test 179.00 195.00 8.9%
test-suite...CFP2006/433.milc/433.milc.test 249.00 265.00 6.4%
test-suite.../CINT2000/175.vpr/175.vpr.test 98.00 104.00 6.1%
test-suite...peg2/mpeg2dec/mpeg2decode.test 70.00 74.00 5.7%
test-suite...CFP2000/188.ammp/188.ammp.test 71.00 75.00 5.6%
test-suite...ce/Benchmarks/PAQ8p/paq8p.test 111.00 117.00 5.4%
test-suite...ce/Applications/Burg/burg.test 41.00 43.00 4.9%
test-suite...000/197.parser/197.parser.test 66.00 69.00 4.5%
test-suite...tions/lambda-0.1.3/lambda.test 23.00 24.00 4.3%
test-suite...urce/Applications/lua/lua.test 301.00 313.00 4.0%
test-suite...TimberWolfMC/timberwolfmc.test 76.00 79.00 3.9%
test-suite...lications/ClamAV/clamscan.test 991.00 1030.00 3.9%
test-suite...plications/d/make_dparser.test 53.00 55.00 3.8%
test-suite...fice-ispell/office-ispell.test 83.00 86.00 3.6%
test-suite...lications/obsequi/Obsequi.test 28.00 29.00 3.6%
test-suite.../Prolangs-C/bison/mybison.test 56.00 58.00 3.6%
test-suite.../CINT2000/254.gap/254.gap.test 170.00 176.00 3.5%
test-suite.../Applications/lemon/lemon.test 30.00 31.00 3.3%
test-suite.../CINT2000/176.gcc/176.gcc.test 1202.00 1240.00 3.2%
test-suite...pplications/treecc/treecc.test 79.00 81.00 2.5%
test-suite...chmarks/MallocBench/gs/gs.test 357.00 366.00 2.5%
test-suite...eeBench/analyzer/analyzer.test 103.00 105.00 1.9%
test-suite...T2006/445.gobmk/445.gobmk.test 1697.00 1724.00 1.6%
test-suite...006/453.povray/453.povray.test 1812.00 1839.00 1.5%
test-suite.../Benchmarks/Bullet/bullet.test 337.00 342.00 1.5%
test-suite.../CINT2000/252.eon/252.eon.test 426.00 432.00 1.4%
test-suite...T2000/300.twolf/300.twolf.test 214.00 217.00 1.4%
test-suite...pplications/oggenc/oggenc.test 244.00 247.00 1.2%
test-suite.../CINT2006/403.gcc/403.gcc.test 4008.00 4055.00 1.2%
test-suite...T2006/456.hmmer/456.hmmer.test 175.00 177.00 1.1%
test-suite...nal/skidmarks10/skidmarks.test 430.00 434.00 0.9%
test-suite.../Applications/sgefa/sgefa.test 115.00 116.00 0.9%
test-suite...006/447.dealII/447.dealII.test 1082.00 1091.00 0.8%
test-suite...6/482.sphinx3/482.sphinx3.test 141.00 142.00 0.7%
test-suite...ocBench/espresso/espresso.test 152.00 153.00 0.7%
test-suite...3.xalancbmk/483.xalancbmk.test 4003.00 4025.00 0.5%
test-suite...lications/sqlite3/sqlite3.test 548.00 551.00 0.5%
test-suite...marks/7zip/7zip-benchmark.test 5522.00 5551.00 0.5%
test-suite...nsumer-lame/consumer-lame.test 208.00 209.00 0.5%
test-suite...:: External/Povray/povray.test 1556.00 1563.00 0.4%
test-suite...000/186.crafty/186.crafty.test 298.00 299.00 0.3%
test-suite.../Applications/SPASS/SPASS.test 2019.00 2025.00 0.3%
test-suite...ications/JM/lencod/lencod.test 8427.00 8449.00 0.3%
test-suite...6/464.h264ref/464.h264ref.test 6797.00 6813.00 0.2%
test-suite...6/471.omnetpp/471.omnetpp.test 431.00 430.00 -0.2%
test-suite...006/450.soplex/450.soplex.test 446.00 447.00 0.2%
test-suite...0.perlbench/400.perlbench.test 1729.00 1727.00 -0.1%
test-suite...000/255.vortex/255.vortex.test 3815.00 3819.00 0.1%

Diff Detail

Event Timeline

fhahn created this revision.Apr 28 2020, 1:37 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 28 2020, 1:37 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
fhahn updated this revision to Diff 260953.Apr 29 2020, 10:15 AM

Rebased, add comment

The threshold of 10 for various widenings seems like a good initial value. When using 5, it changes as follows

Same hash: 230 (filtered out)
Remaining: 7
Metric: sccp.IPNumInstRemoved

Program widen-phi.10 widen-phi.5 diff
test-suite...ce/Benchmarks/PAQ8p/paq8p.test 117.00 112.00 -4.3%
test-suite...DOE-ProxyApps-C/CoMD/CoMD.test 117.00 116.00 -0.9%
test-suite...3.xalancbmk/483.xalancbmk.test 4015.00 4008.00 -0.2%
test-suite...T2006/445.gobmk/445.gobmk.test 1723.00 1721.00 -0.1%
test-suite.../CINT2000/176.gcc/176.gcc.test 1236.00 1235.00 -0.1%
test-suite.../CINT2006/403.gcc/403.gcc.test 4024.00 4022.00 -0.0%
test-suite...-typeset/consumer-typeset.test 3192.00 3193.00 0.0%

When setting to 20, it changes to

Same hash: 235 (filtered out)
Remaining: 2
Metric: sccp.IPNumInstRemoved

Program widen-phi.10 widen-phi.20 diff
test-suite...marks/SciMark2-C/scimark2.test 43.00 47.00 9.3%
test-suite...0.perlbench/400.perlbench.test 1727.00 1740.00 0.8%

In terms of compile-time, it seems to have a relatively small negative impact (worst is +0.20%) according to http://llvm-compile-time-tracker.com/compare.php?from=209ab6d8835cd88320ceb814893759cfbda91d15&to=6ab3496169adf3115d3476b1cdde30cc4d0614f1&stat=instructions

Comparing cycles, it looks mostly positive, with a few -2%+ improvements in compile-time

nikic added a comment.Apr 29 2020, 2:04 PM

In terms of compile-time, it seems to have a relatively small negative impact (worst is +0.20%) according to http://llvm-compile-time-tracker.com/compare.php?from=209ab6d8835cd88320ceb814893759cfbda91d15&to=6ab3496169adf3115d3476b1cdde30cc4d0614f1&stat=instructions

Comparing cycles, it looks mostly positive, with a few -2%+ improvements in compile-time

Pretty sure those cycles are some unlucky noise. From the instruction counts, the impact looks fine. Looking into the per-file breakdown, there is one case that may be worth double checking: libclamav_upx.c from the ClamAV benchmark has a 4-5% increase in the ThinLTO configuration (which only runs mid-level optimizer, so SCCP stands out more). Possibly that exposes some pathological case.

llvm/lib/Transforms/Scalar/SCCP.cpp
748

The lazy counting of active values might be problematic: It works fine if phi args become feasible from left to right, but not if it happens from right to left. In that case, upon hitting the newly active argument, we will always set MaxWidenSteps to 2, regardless of how many of the later arguments are already active.

750

The main problem with this heuristic is that it does not distinguish which incoming value contributes to the widening. Intuitively, what we want is to allow widening once per operand, but what could end up happening is that we have a phi node with 64 arguments, 63 of them are constant, and the last one is an addrec. In this case, we would end up executing the loop 64 times. Of course, it's something of a degenerate case...

I don't have an immediate suggestion on how to improve this at the same level of complexity though.

fhahn updated this revision to Diff 263034.May 9 2020, 1:30 PM
fhahn marked 4 inline comments as done.

In terms of compile-time, it seems to have a relatively small negative impact (worst is +0.20%) according to http://llvm-compile-time-tracker.com/compare.php?from=209ab6d8835cd88320ceb814893759cfbda91d15&to=6ab3496169adf3115d3476b1cdde30cc4d0614f1&stat=instructions

Comparing cycles, it looks mostly positive, with a few -2%+ improvements in compile-time

Pretty sure those cycles are some unlucky noise. From the instruction counts, the impact looks fine. Looking into the per-file breakdown, there is one case that may be worth double checking: libclamav_upx.c from the ClamAV benchmark has a 4-5% increase in the ThinLTO configuration (which only runs mid-level optimizer, so SCCP stands out more). Possibly that exposes some pathological case.

I had a look at ClamAV and in the file in question, there are a lot of function calls which take arguments depending on a few phis. That's the scenario with the biggest knock-on effect unfortunately. I play around a bit, but I don't think there's much we can really do about those cases.

I've also updated the code to handle a merging a bit differently, which hopefully makes things a bit better in a few de-generate cases. Numbers: http://llvm-compile-time-tracker.com/compare.php?from=ae920a81ffa3c7e3c14de131d0d55abd31bbff7d&to=c1d4b08c704638cd835728397ec957abd5bd89e9&stat=instructions

What do you think?

fhahn added a comment.May 13 2020, 3:57 PM

(just realized I never submitted the responses to the inline comments)

llvm/lib/Transforms/Scalar/SCCP.cpp
748

Yes I meant to operate on a fresh lattice value that's latter combined with the existing value. I've reworked the code to do the merging a bit differently.

750

yeah, I don't think there's too much we can do about that without extra tracking. I've updated the code to set the number of range extensions to the max of the number of range extensions and the number of active incoming values. In the case were first merging lots of equal constants, we do not allow multiple extensions for later extending ranges. I think that's a bit better, but there are still cases which are problematic (e.g. if the first value that becomes feasible is an add-rec).

Looks good from my side, but would be great if @efriedma can chime in as well.

One thing worth mentioning is that this change precludes D78376, at least for the loop case (I think for the non-loop case, it should automatically do the right thing as a result of this change). If widening is controlled at phis, that's what we'll set to overdefined. If we then limit one of the phi arguments to something better than overdefined based on the condition, we won't be able to make use of that anymore.

llvm/lib/Transforms/Scalar/SCCP.cpp
158

Commit these changes separately?

1136

I'd suggest moving the 10 into a constant, as the value is repeated a few times.

1339–1340

Probably needs same handling as below?

llvm/test/Transforms/SCCP/widening.ll
877

Some spurious check lines here.

fhahn updated this revision to Diff 264844.May 19 2020, 4:03 AM
fhahn marked 6 inline comments as done.

Address comments, add extra test cases for return value cycles.

fhahn added inline comments.May 19 2020, 4:21 AM
llvm/lib/Transforms/Scalar/SCCP.cpp
158

Yes, I've removed them from the diff here. They are not required, just avoid doing some extra work. I'll submit those separately.

1136

I've also introduced a function to return the MergeOptions() to shorten the whole thing a bit.

1339–1340

Yes. I've added a test case for both the struct/non-struct versions here.

llvm/test/Transforms/SCCP/widening.ll
877

Removed, thanks!

fhahn added a comment.May 26 2020, 9:40 AM

ping :)

@efriedma, do you have any thoughts/opinions on the different options?

efriedma accepted this revision.May 27 2020, 6:14 PM

I guess I'll start by stepping back to state the fundamental issue here: constant ranges introduce way to many intermediate states, so the algorithm doesn't converge in a reasonable amount of time. To deal with this, you have to skip some steps: force values to overdefined, or maybe a substantially wider constantrange. There are a few different ways you could do this:

  1. Some sort of global counter: after N steps, anything still in the worklist goes straight to overdefined.
  2. Detect specific cycles: for example, if you have a loop induction variable, compute the range based on the trip count of the loop.
  3. What you're doing here: add a counter to each join point. It's impossible to loop without a join point, so you'll catch any arithmetic loops.
  4. Add a counter to arithmetic operations.
  5. Probably some other approaches I haven't thought of.

All of these can potentially work together: you can take the intersection of multiple approaches.

My intuition is that detecting specific cycles is going to be a lot more effective than iteratively widening, in terms of finding optimization opportunities. But detecting cycles is hard to do reliably, so we probably want some iterative limit anyway.


I think the approach here does a good job of avoiding potential pathological cases, some of which might not really involve arithmetic. And I don't see any issues with the implementation.

LGTM

This revision is now accepted and ready to land.May 27 2020, 6:14 PM

I guess I'll start by stepping back to state the fundamental issue here: constant ranges introduce way to many intermediate states, so the algorithm doesn't converge in a reasonable amount of time. To deal with this, you have to skip some steps: force values to overdefined, or maybe a substantially wider constantrange. There are a few different ways you could do this:

  1. Some sort of global counter: after N steps, anything still in the worklist goes straight to overdefined.
  2. Detect specific cycles: for example, if you have a loop induction variable, compute the range based on the trip count of the loop.
  3. What you're doing here: add a counter to each join point. It's impossible to loop without a join point, so you'll catch any arithmetic loops.
  4. Add a counter to arithmetic operations.
  5. Probably some other approaches I haven't thought of.

All of these can potentially work together: you can take the intersection of multiple approaches.

My intuition is that detecting specific cycles is going to be a lot more effective than iteratively widening, in terms of finding optimization opportunities. But detecting cycles is hard to do reliably, so we probably want some iterative limit anyway.

Thanks for the great summary Eli! There are definitely plenty of additional avenues to explore.

On a related note I think the top-level comments in SCCP.cpp are now quite out of date. I'll try to update them to reflect the current state more accurately soonish.

This revision was automatically updated to reflect the committed changes.