Page MenuHomePhabricator

[SCEV] Query expanded immediate cost at minsize
ClosedPublic

Authored by samparker on Mar 19 2020, 9:17 AM.

Details

Summary

As code size is the only thing we care about at minsize, query the cost of materialising immediates when calculating the cost of a SCEV expansion. We also modify the CostKind to TCK_CodeSize for minsize, instead of RecipThroughput. This gives -0.1% geomean reduction of the llvm test suite at -Oz for both thumbv7a and aarch64.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

@samparker are you trying to mitigate perf impact for in-order CPU?
I wonder if D73501 simply is counter-productive then.
Though yes, vectorizers need to be taught that trick.

Thanks for taking a look.

This is rather pessimistic. If we really want to do this, we need to use TargetTransformInfo::getIntImmCostInst().

This sounds good and what I was also considering.

What cost does that model?

I was under the impression that throughput/codesize costs for immediates would be highly correlated, where a 'high cost' constant would introduce instruction(s) to generate it, increasing code size and reducing throughput. As @spatel said in D76124, the lines have become blurred but we should be modelling at least something here.

I wonder if D73501 simply is counter-productive then.

I would be lying if I said that the patch didn't cause a whole world of pain :) But I would like to try to resolve this, if we can, by modelling costs better. The SCEV changes may have just broken our we do unrolling for our little microcontrollers, so I'll be looking at ARM TTI too.

Do we model rthrougput for constants elsewhere?
I may be wrong, but i don't recall seeing any such modelling previously.
Let's take a step back here. How about we simply revert D73501?

Do we model rthrougput for constants elsewhere?

I guess that depends on who 'we' are because, again, different backends will be modelling different things. Why the focus on throughput now anyway? When this code was using getOperationCost, it would have been getting some performance/code size cost, which I expect most backends will be modelling for constants. And I'm not really sure why throughput is specifically important here, we're not concerning ourselves with the throughput of casts and compares, right?

Let's take a step back here. How about we simply revert D73501?

Sounds good to me, although I still need to prod around here to see if codegen risk being erratic in the future.

Do we model rthrougput for constants elsewhere?

I guess that depends on who 'we' are because, again, different backends will be modelling different things.

Sorry, i meant llvm transforms in general, not backends.

Why the focus on throughput now anyway? When this code was using getOperationCost,
it would have been getting some performance/code size cost,
which I expect most backends will be modelling for constants.
And I'm not really sure why throughput is specifically important here,
we're not concerning ourselves with the throughput of casts and compares, right?

We have 3 cost models - latency, size and rthroughput.
My aspirational goal in this scevexpander budget was: "how much more computations are we willing to do without it being too much of a burden?"
Size model isn't really applicable here - we *could* lower any sequence into a libcall, regardless of it's native instruction count.
Likewise i'm not sure we're really after latency here, which leaves us with rthroughput.
But we can't subtract oranges from cucumbers, so all cost modelling should be consistently using rthroughput cost model.
Thus i'm asking, what does getIntImmCost() model? rthroughput or size?

Let's take a step back here. How about we simply revert D73501?

Sounds good to me, although I still need to prod around here to see if codegen risk being erratic in the future.

I'm pretty sure there's only two APIs that were designed around throughput: 'getInstructionThroughput' and 'getArithmeticInstrCost' and maybe the latter should be named more explicitly. The code size calls are generally a bit more explicit, so there's 'getIntImmCodeSizeCost', which is not the one I've used.

"how much more computations are we willing to do without it being too much of a burden?"

And this also depends what we're considering as a 'burden', and another angle that I'll probably need to look at in the near future is the burden of code size... Either way, this code shouldn't assume that constants are free (in any sense of the term).

samparker updated this revision to Diff 252830.Mar 26 2020, 7:05 AM
samparker edited the summary of this revision. (Show Details)

Introduced a lambda to look at an expressions operand and, if it's a constant, query for cost using getIntImmCostInst.

lebedev.ri requested changes to this revision.Mar 26 2020, 7:53 AM

<...>
Thus i'm asking, what does getIntImmCost() model? rthroughput or size?

This revision now requires changes to proceed.Mar 26 2020, 7:53 AM

The API is there to figure out whether the constant will be folded into the given instruction, otherwise there will be some 'cost' to materialize it. Having to generate instruction(s) for the materialization is likely to increase code size, but more importantly, reduce throughput and increase latency - which is why it's sometimes beneficial to hoist expensive constants out of loops.

samparker updated this revision to Diff 256260.Apr 9 2020, 5:44 AM
samparker added reviewers: dmgreen, uweigand.

Rebased. Thanks for reverting rewriteLoopExitValues @lebedev.ri, but I'm still seeing value in this patch. I'm not seeing any changes when running the test suite on my X86 box, but for Arm's DSP suite this affects 42/166 benchmarks for Thumb1 and 30/166 for Thumb2. Out of those changes I'm seeing a 1.6% geomean improvement for both targets.

samparker updated this revision to Diff 285000.EditedAug 12 2020, 1:06 AM
samparker retitled this revision from [SCEV] Query for immediate cost in Expander to [SCEV] Query expanded immediate cost at minsize.
samparker edited the summary of this revision. (Show Details)

Now only performing the checks when optimising for minsize.
thumbv7a results:

Metric: size..text

Program                                                                                              master  scev-expander diff
                                    test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   47596   46528       -2.2%
                                   test-suite :: MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.test    6484    6340       -2.2%
                                        test-suite :: MultiSource/Benchmarks/VersaBench/bmm/bmm.test     924     908       -1.7%
                                 test-suite :: MultiSource/Benchmarks/Rodinia/backprop/backprop.test    2916    2868       -1.6%
                                          test-suite :: MultiSource/Benchmarks/McCat/05-eks/eks.test    4748    4696       -1.1%
                        test-suite :: MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url.test    3120    3088       -1.0%
                                        test-suite :: SingleSource/Benchmarks/Misc/himenobmtxpa.test    2408    2384       -1.0%
                          test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test   19920   19748       -0.9%
                          test-suite :: MultiSource/Applications/ALAC/encode/alacconvert-encode.test   19920   19748       -0.9%
                         test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test   32516   32236       -0.9%
                                  test-suite :: SingleSource/Benchmarks/Shootout/Shootout-lists.test     956     964        0.8%
                                             test-suite :: MultiSource/Benchmarks/Ptrdist/ft/ft.test    2920    2896       -0.8%
                 test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test   14600   14480       -0.8%
                                            test-suite :: SingleSource/Benchmarks/McGill/queens.test     976     968       -0.8%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test   31508   31252       -0.8%
                             test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test   68296   67752       -0.8%
                                           test-suite :: SingleSource/Benchmarks/Misc/whetstone.test    1604    1592       -0.7%
                                       test-suite :: MultiSource/Benchmarks/SciMark2-C/scimark2.test    4936    4904       -0.6%
                                    test-suite :: MultiSource/Benchmarks/Prolangs-C/agrep/agrep.test   23836   23684       -0.6%
                                  test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test   27552   27384       -0.6%
                                     test-suite :: SingleSource/Benchmarks/BenchmarkGame/puzzle.test    1320    1312       -0.6%
                                           test-suite :: SingleSource/Benchmarks/Stanford/Oscar.test    1348    1340       -0.6%
                                          test-suite :: SingleSource/Benchmarks/Stanford/Puzzle.test    1376    1368       -0.6%
                                         test-suite :: MultiSource/Applications/minisat/minisat.test    9980    9924       -0.6%
                                         test-suite :: SingleSource/Benchmarks/Misc/ReedSolomon.test    3008    3024        0.5%
                       test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   69828   69460       -0.5%
                                 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test   20116   20012       -0.5%
                                          test-suite :: SingleSource/Benchmarks/Misc-C++/bigfib.test    3332    3316       -0.5%
                                      test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test  168600  167800       -0.5%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   26104   25984       -0.5%
                                   test-suite :: MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk.test    3592    3576       -0.4%
                                           test-suite :: MultiSource/Applications/oggenc/oggenc.test   90092   89700       -0.4%
                          test-suite :: MultiSource/Benchmarks/VersaBench/beamformer/beamformer.test    2788    2776       -0.4%
                                              test-suite :: SingleSource/Benchmarks/McGill/misr.test    1864    1872        0.4%
                                          test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test    7540    7508       -0.4%
                                        test-suite :: MultiSource/Applications/JM/lencod/lencod.test  318860  317620       -0.4%
                           test-suite :: MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/CrystalMk.test    2176    2168       -0.4%
                                   test-suite :: MultiSource/Benchmarks/Fhourstones/fhourstones.test    4828    4812       -0.3%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   29164   29068       -0.3%
                                     test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test   24400   24320       -0.3%
                                            test-suite :: SingleSource/Benchmarks/Misc/oourafft.test    5344    5328       -0.3%
                                             test-suite :: MultiSource/Applications/spiff/spiff.test   12184   12152       -0.3%
                                             test-suite :: MultiSource/Applications/sgefa/sgefa.test    6096    6080       -0.3%
                                             test-suite :: MultiSource/Applications/SPASS/SPASS.test  201512  201000       -0.3%
                                        test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  128336  128016       -0.2%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/rsbench.test    9640    9616       -0.2%
                                         test-suite :: MultiSource/Applications/ClamAV/clamscan.test  246116  245556       -0.2%
                                                 test-suite :: MultiSource/Applications/lua/lua.test   61520   61392       -0.2%
                                               test-suite :: MultiSource/Benchmarks/PAQ8p/paq8p.test   40916   40832       -0.2%
               test-suite :: MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael.test    7796    7780       -0.2%
                                               test-suite :: SingleSource/Benchmarks/Misc/flops.test    3936    3928       -0.2%
                                         test-suite :: MultiSource/Applications/obsequi/Obsequi.test   17468   17436       -0.2%
                               test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   89716   89552       -0.2%
                                       test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  324352  323820       -0.2%
                                             test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  296700  296220       -0.2%
                                     test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test    2524    2520       -0.2%
                                             test-suite :: MultiSource/Benchmarks/Ptrdist/bc/bc.test   20760   20728       -0.2%
                                             test-suite :: MultiSource/Benchmarks/nbench/nbench.test   16872   16848       -0.1%
                            test-suite :: MultiSource/Benchmarks/Fhourstones-3.1/fhourstones3.1.test    2948    2952        0.1%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/XSBench/XSBench.test    6084    6076       -0.1%
                                          test-suite :: MultiSource/Applications/d/make_dparser.test   43812   43756       -0.1%
                                         test-suite :: MultiSource/Applications/SIBsim4/SIBsim4.test   20152   20176        0.1%
                                       test-suite :: MultiSource/Applications/hexxagon/hexxagon.test    7252    7244       -0.1%
                       test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC.test   14828   14812       -0.1%
                            test-suite :: MicroBenchmarks/LCALS/SubsetCLambdaLoops/lcalsCLambda.test   92020   91924       -0.1%
                                      test-suite :: MultiSource/Benchmarks/VersaBench/dbms/dbms.test    8228    8220       -0.1%
                                                   test-suite :: MultiSource/Benchmarks/sim/sim.test    8704    8696       -0.1%
                                         test-suite :: MultiSource/Applications/sqlite3/sqlite3.test  168288  168136       -0.1%
                                  test-suite :: MicroBenchmarks/LCALS/SubsetCRawLoops/lcalsCRaw.test   91924   91844       -0.1%
                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test  218764  218580       -0.1%
                                                    test-suite :: MicroBenchmarks/harris/harris.test   62584   62536       -0.1%
                                       test-suite :: MultiSource/Benchmarks/Ptrdist/yacr2/yacr2.test   11224   11216       -0.1%
                            test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test   90788   90724       -0.1%
                                   test-suite :: MultiSource/Benchmarks/MallocBench/cfrac/cfrac.test   12352   12344       -0.1%
                             test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test   62016   61976       -0.1%
                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG/HPCCG.test    7120    7116       -0.1%
               test-suite :: MicroBenchmarks/ImageProcessing/BilateralFiltering/BilateralFilter.test   61768   61736       -0.1%
        test-suite :: MicroBenchmarks/ImageProcessing/AnisotropicDiffusion/AnisotropicDiffusion.test   61800   61768       -0.1%
                                        test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test   62136   62104       -0.1%
                                    test-suite :: MicroBenchmarks/ImageProcessing/Dither/Dither.test   62440   62408       -0.1%
                                               test-suite :: MultiSource/Applications/siod/siod.test   47092   47068       -0.1%
                       test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   60088   60064       -0.0%

aarch64 results:

Metric: size..text

Program                                                                                              master  scev-expander diff
                                    test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   72100   68220       -5.4%
                                           test-suite :: SingleSource/Benchmarks/Misc/whetstone.test    2092    2004       -4.2%
                                 test-suite :: MultiSource/Benchmarks/Rodinia/backprop/backprop.test    2588    2500       -3.4%
                                   test-suite :: MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.test    8420    8196       -2.7%
                                          test-suite :: SingleSource/Benchmarks/Stanford/Puzzle.test    2212    2156       -2.5%
                                           test-suite :: SingleSource/Benchmarks/Stanford/Oscar.test    1652    1620       -1.9%
                        test-suite :: MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url.test    2948    2908       -1.4%
                         test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test   43972   43428       -1.2%
                                             test-suite :: MultiSource/Benchmarks/Ptrdist/ft/ft.test    3668    3628       -1.1%
                                          test-suite :: MultiSource/Benchmarks/McCat/05-eks/eks.test    5876    5812       -1.1%
                                 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test   22276   22068       -0.9%
                                           test-suite :: MultiSource/Benchmarks/Olden/em3d/em3d.test    2812    2836        0.9%
                                         test-suite :: MultiSource/Applications/SIBsim4/SIBsim4.test   26804   26580       -0.8%
                                     test-suite :: SingleSource/Benchmarks/BenchmarkGame/puzzle.test     980     972       -0.8%
                                          test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test    9132    9060       -0.8%
                          test-suite :: MultiSource/Applications/ALAC/encode/alacconvert-encode.test   26220   26020       -0.8%
                          test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test   26220   26020       -0.8%
                                        test-suite :: SingleSource/Benchmarks/Misc/himenobmtxpa.test    3156    3132       -0.8%
                                   test-suite :: MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk.test    3156    3132       -0.8%
                                       test-suite :: MultiSource/Benchmarks/SciMark2-C/scimark2.test    5460    5420       -0.7%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test   34852   34628       -0.6%
                             test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test   99124   98540       -0.6%
                                           test-suite :: MultiSource/Applications/oggenc/oggenc.test  107024  106456       -0.5%
                                            test-suite :: SingleSource/Benchmarks/McGill/queens.test    1516    1508       -0.5%
                                  test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test   36012   35836       -0.5%
                       test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   82264   81864       -0.5%
                                              test-suite :: SingleSource/Benchmarks/McGill/misr.test    1668    1660       -0.5%
                            test-suite :: MultiSource/Benchmarks/Fhourstones-3.1/fhourstones3.1.test    3372    3356       -0.5%
                                         test-suite :: MultiSource/Applications/minisat/minisat.test   13836   13772       -0.5%
                                         test-suite :: SingleSource/Benchmarks/Misc/ReedSolomon.test    3460    3444       -0.5%
                                   test-suite :: MultiSource/Benchmarks/Fhourstones/fhourstones.test    3996    3980       -0.4%
                                               test-suite :: SingleSource/Benchmarks/Misc/flops.test    4116    4100       -0.4%
                                     test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test   27396   27292       -0.4%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/rsbench.test    8852    8820       -0.4%
                                    test-suite :: MultiSource/Benchmarks/Prolangs-C/agrep/agrep.test   32168   32056       -0.3%
                                          test-suite :: SingleSource/Benchmarks/Misc-C++/bigfib.test    4892    4876       -0.3%
                 test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test   19708   19644       -0.3%
                                             test-suite :: MultiSource/Benchmarks/nbench/nbench.test   20916   20852       -0.3%
                                             test-suite :: MultiSource/Applications/spiff/spiff.test   16028   15980       -0.3%
                                      test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test  203964  203460       -0.2%
                               test-suite :: SingleSource/Benchmarks/Misc-C++/Large/sphereflake.test    3380    3372       -0.2%
                                             test-suite :: MultiSource/Applications/SPASS/SPASS.test  288136  287464       -0.2%
                                     test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test    3596    3588       -0.2%
                                        test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  161852  161492       -0.2%
                                                   test-suite :: MultiSource/Benchmarks/sim/sim.test   11044   11020       -0.2%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   41156   41068       -0.2%
                                        test-suite :: MultiSource/Applications/JM/lencod/lencod.test  383100  382340       -0.2%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   28852   28796       -0.2%
                                             test-suite :: MultiSource/Benchmarks/Ptrdist/bc/bc.test   27236   27188       -0.2%
               test-suite :: MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael.test    9132    9116       -0.2%
                                            test-suite :: SingleSource/Benchmarks/Misc/oourafft.test    5444    5436       -0.1%
                                   test-suite :: MultiSource/Benchmarks/MallocBench/cfrac/cfrac.test   16428   16404       -0.1%
                                       test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  471400  470712       -0.1%
                                      test-suite :: MultiSource/Benchmarks/VersaBench/dbms/dbms.test   11324   11308       -0.1%
                                             test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  378596  378116       -0.1%
                                                    test-suite :: MicroBenchmarks/harris/harris.test  104836  104708       -0.1%
                                             test-suite :: MultiSource/Applications/sgefa/sgefa.test    6812    6804       -0.1%
                                         test-suite :: MultiSource/Applications/sqlite3/sqlite3.test  244840  244568       -0.1%
                                         test-suite :: MultiSource/Applications/obsequi/Obsequi.test   21980   21956       -0.1%
                                       test-suite :: MultiSource/Benchmarks/Ptrdist/yacr2/yacr2.test   15124   15108       -0.1%
                               test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test  116900  116780       -0.1%
                                                 test-suite :: MultiSource/Applications/lua/lua.test   90508   90420       -0.1%
                                         test-suite :: MultiSource/Applications/ClamAV/clamscan.test  331984  331680       -0.1%
                                               test-suite :: MultiSource/Benchmarks/PAQ8p/paq8p.test   54364   54316       -0.1%
                                               test-suite :: MultiSource/Applications/siod/siod.test   64448   64392       -0.1%
                       test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   83348   83276       -0.1%
                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG/HPCCG.test    9548    9540       -0.1%
                             test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test   86428   86356       -0.1%
                                       test-suite :: MultiSource/Applications/hexxagon/hexxagon.test   10180   10172       -0.1%
                                  test-suite :: MicroBenchmarks/LCALS/SubsetCRawLoops/lcalsCRaw.test  142752  142656       -0.1%
                            test-suite :: MicroBenchmarks/LCALS/SubsetCLambdaLoops/lcalsCLambda.test  142792  142712       -0.1%
                                          test-suite :: MultiSource/Applications/d/make_dparser.test   62912   62880       -0.1%
                                  test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test  139296  139232       -0.0%
Herald added a project: Restricted Project. · View Herald TranscriptAug 12 2020, 1:06 AM

That are really nice code size savings!

As only the improvements are shown, just curious if there are no regressions? Thus, this is overall an improvement too?
And for extra confidence and as the numbers are easy to obtain, probably best to get numbers for x86 too?

Thanks!. That's the full set of results, there's a few little regressions for thumb and I think only one for aarch64... I'll get the X86 numbers now.

Similar story on X86 as well:

Metric: size..text

Program                                                                                              master  scev-expander diff 
                                           test-suite :: SingleSource/Benchmarks/Misc/whetstone.test    2437    2325       -4.6%
                                           test-suite :: SingleSource/Benchmarks/Stanford/Oscar.test    1637    1573       -3.9%
                                    test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   58885   56645       -3.8%
                                     test-suite :: SingleSource/Benchmarks/BenchmarkGame/puzzle.test     821     805       -1.9%
                                          test-suite :: MultiSource/Benchmarks/McCat/05-eks/eks.test    5845    5733       -1.9%
                                   test-suite :: MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.test    8597    8437       -1.9%
                        test-suite :: MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url.test    2661    2613       -1.8%
                                          test-suite :: SingleSource/Benchmarks/Stanford/Puzzle.test    1781    1749       -1.8%
                            test-suite :: MultiSource/Benchmarks/Fhourstones-3.1/fhourstones3.1.test    2917    2869       -1.6%
                                        test-suite :: SingleSource/Benchmarks/Misc/himenobmtxpa.test    3253    3205       -1.5%
                         test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test   47093   46469       -1.3%
                                 test-suite :: MultiSource/Benchmarks/Rodinia/backprop/backprop.test    2485    2453       -1.3%
                                            test-suite :: SingleSource/Benchmarks/McGill/queens.test    1301    1285       -1.2%
                 test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test   20485   20245       -1.2%
                                              test-suite :: SingleSource/Benchmarks/McGill/misr.test    1525    1509       -1.0%
                                          test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test    9221    9125       -1.0%
                                             test-suite :: MultiSource/Benchmarks/Ptrdist/ft/ft.test    3205    3173       -1.0%
                                   test-suite :: MultiSource/Benchmarks/Fhourstones/fhourstones.test    3365    3333       -1.0%
                       test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   88756   88148       -0.7%
                          test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test   28709   28517       -0.7%
                          test-suite :: MultiSource/Applications/ALAC/encode/alacconvert-encode.test   28709   28517       -0.7%
                           test-suite :: MultiSource/Benchmarks/ASC_Sequoia/CrystalMk/CrystalMk.test    2501    2517        0.6%
                                     test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test   33429   33221       -0.6%
                                       test-suite :: MultiSource/Benchmarks/SciMark2-C/scimark2.test    5445    5413       -0.6%
                                  test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test   34293   34101       -0.6%
                                               test-suite :: SingleSource/Benchmarks/Misc/flops.test    6117    6085       -0.5%
                                             test-suite :: MultiSource/Applications/spiff/spiff.test   15413   15333       -0.5%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test   37109   36917       -0.5%
                                         test-suite :: SingleSource/Benchmarks/Misc/ReedSolomon.test    3109    3125        0.5%
                                            test-suite :: SingleSource/Benchmarks/Misc/oourafft.test    6453    6421       -0.5%
                                    test-suite :: MultiSource/Benchmarks/Prolangs-C/agrep/agrep.test   29668   29524       -0.5%
                          test-suite :: MultiSource/Benchmarks/VersaBench/beamformer/beamformer.test    3461    3445       -0.5%
                                     test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test    3493    3477       -0.5%
                             test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test   95733   95349       -0.4%
                               test-suite :: SingleSource/Benchmarks/Misc-C++/Large/sphereflake.test    4181    4165       -0.4%
                                         test-suite :: MultiSource/Applications/minisat/minisat.test   13413   13365       -0.4%
                                        test-suite :: MultiSource/Applications/JM/lencod/lencod.test  417589  416293       -0.3%
                                      test-suite :: MultiSource/Benchmarks/VersaBench/dbms/dbms.test   10645   10613       -0.3%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/rsbench.test   11173   11205        0.3%
                                           test-suite :: MultiSource/Applications/oggenc/oggenc.test  115716  115396       -0.3%
                                        test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  171989  171525       -0.3%
                                                   test-suite :: MultiSource/Benchmarks/sim/sim.test   11957   11925       -0.3%
                                      test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test  221205  220629       -0.3%
                                             test-suite :: MultiSource/Benchmarks/Ptrdist/bc/bc.test   25061   24997       -0.3%
                                         test-suite :: MultiSource/Applications/obsequi/Obsequi.test   21141   21093       -0.2%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   43445   43349       -0.2%
                                       test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  444644  443812       -0.2%
                                         test-suite :: MultiSource/Applications/SIBsim4/SIBsim4.test   27477   27525        0.2%
                                                 test-suite :: MultiSource/Applications/lua/lua.test   83317   83173       -0.2%
               test-suite :: MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael.test   10389   10373       -0.2%
                                             test-suite :: MultiSource/Benchmarks/nbench/nbench.test   20981   20949       -0.2%
                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG/HPCCG.test   10821   10837        0.1%
                                 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test   23077   23045       -0.1%
                                             test-suite :: MultiSource/Applications/SPASS/SPASS.test  284754  284386       -0.1%
                                             test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  400821  400325       -0.1%
                                       test-suite :: MultiSource/Benchmarks/Ptrdist/yacr2/yacr2.test   14341   14325       -0.1%
                                         test-suite :: MultiSource/Applications/ClamAV/clamscan.test  352852  352484       -0.1%
                            test-suite :: MicroBenchmarks/LCALS/SubsetCLambdaLoops/lcalsCLambda.test  141716  141572       -0.1%
                                   test-suite :: MultiSource/Benchmarks/MallocBench/cfrac/cfrac.test   16037   16021       -0.1%
                                  test-suite :: MicroBenchmarks/LCALS/SubsetCRawLoops/lcalsCRaw.test  141636  141508       -0.1%
                                               test-suite :: MultiSource/Benchmarks/PAQ8p/paq8p.test   56085   56037       -0.1%
                                               test-suite :: MultiSource/Applications/siod/siod.test   60580   60532       -0.1%
                                         test-suite :: MultiSource/Applications/sqlite3/sqlite3.test  241315  241139       -0.1%
                                                    test-suite :: MicroBenchmarks/harris/harris.test   99381   99317       -0.1%
                            test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test  139956  139876       -0.1%
                                          test-suite :: MultiSource/Applications/d/make_dparser.test   61347   61315       -0.1%
                               test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test  138581  138645        0.0%
                                  test-suite :: MicroBenchmarks/LCALS/SubsetARawLoops/lcalsARaw.test  138932  138868       -0.0%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   36805   36789       -0.0%
                                        test-suite :: MicroBenchmarks/ImageProcessing/Blur/blur.test   98981   98949       -0.0%
                                    test-suite :: MicroBenchmarks/ImageProcessing/Dither/Dither.test   99285   99253       -0.0%
                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test  321508  321412       -0.0%
                                  test-suite :: MicroBenchmarks/LCALS/SubsetBRawLoops/lcalsBRaw.test  134756  134724       -0.0%
                             test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test   84661   84677        0.0%
        test-suite :: MicroBenchmarks/ImageProcessing/AnisotropicDiffusion/AnisotropicDiffusion.test   98453   98437       -0.0%
                      test-suite :: MicroBenchmarks/ImageProcessing/Interpolation/Interpolation.test   99397   99381       -0.0%
                                         test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test  102437  102421       -0.0%
                 test-suite :: MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test  391620  391572       -0.0%
                            test-suite :: MicroBenchmarks/LCALS/SubsetBLambdaLoops/lcalsBLambda.test  134788  134772       -0.0%
                                        test-suite :: MicroBenchmarks/MemFunctions/MemFunctions.test  177893  177877       -0.0%
                                     test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  296165  296149       -0.0%
                                            test-suite :: MultiSource/Applications/kimwitu++/kc.test  297476  297460       -0.0%
                                                                                  Geomean difference                       -0.1%

Nice one!

llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2408

This change here looks like an exact duplication of the change above (lines 2355 - 2362). Can this be in a helper?

samparker added inline comments.Aug 12 2020, 1:41 AM
llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2408

Yeah, that would be nicer.

lebedev.ri requested changes to this revision.Aug 12 2020, 1:50 AM

Apologies if i'm missing the point here.

llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2204

s/get/accountFor/

2214–2217

I'm not convinced this modelling is correct.
(which is why i didn't respond, but apparently i forgot to actually post that)

If we have SCEV x + y + 42, 42 will be modelled as-if it's at index 2,
but we should model this as (x + y) + 42, because there's usually some
form of an add that takes an immediate as second param.

2218–2221

Same

2222–2229

And again, if we have umin(x, y, 42), it's lowered as z = (x u< y) ? x : y ; z u< 42 ? z : 42,
so you can't possibly have third operand here.

2356–2357

enumerate(NAry->operands())

2409–2410

enumerate(NAry->operands())

This revision now requires changes to proceed.Aug 12 2020, 1:50 AM
samparker added inline comments.Aug 12 2020, 2:04 AM
llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2214–2217

Ah! Right, I'll try to fix that.

samparker updated this revision to Diff 285028.Aug 12 2020, 3:27 AM

Changed how operands are visited and costed, hopefully now translating the SCEV operand index correctly to an Instruction index.

SjoerdMeijer accepted this revision.Aug 14 2020, 12:53 AM

With @lebedev.ri last comment fixed, this now looks good to me. Please wait a day or so just in case there are more comments.

lebedev.ri requested changes to this revision.EditedAug 14 2020, 2:30 AM

I think this is going in the wrong direction.
I think the worklist needs to be changed, to be a struct { unsigned ParentOpcode; int OperandIdx; SCEV* S; },
then everything should suddenly become less convoluted/hand-wavey.

This revision now requires changes to proceed.Aug 14 2020, 2:30 AM

Yeah, that sounds better, I'll put together a separate patch. Are you happy with the operand index clamping though?

Yeah, that sounds better, I'll put together a separate patch.

Are you happy with the operand index clamping though?

Looks about right i guess.

llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2195

Can we still get a constant here?

2279–2282

We are not taxing constants for right-shifts.

samparker added inline comments.Aug 14 2020, 6:59 AM
llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2364

Now I notice that I wouldn't have been handling AddRec expressions... So should these operands be added to the worklist for both Add and Mul or would just Add be okay?

All this is (becoming?) so incredibly fragile...
Have you checked what happens if you simply make isHighCostExpansionHelper() always return true for -Oz? :)
This really needs refactoring/generalization.

llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2364

Given A + B*x, you'd want to model A as being at index 1,
and B as being part of multipler (again, at index 1.

And for higher orders A + B*x + C*x^2, again, B and C are part of multiply,
and it should be modelled as (B*x + C*x^2) + A.

So i think the generalization is that all nary operands except the first one are at index 1 of mul,
and the first nary operand is at index 1 of add.

This really needs refactoring/generalization.

I will try...

Have you checked what happens if you simply make isHighCostExpansionHelper() always return true for -Oz? :)

I thought I did, but actually it was just for rewriting loop exit values... I will run the numbers though! But I'm also interested in whether this can still be beneficial for execution speed too.

llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2364

Okay, thanks.

I just ran some numbers for when -Oz == HighCost and it's interesting... For Arm it's not good:

              master  minsize-high-cost-expand        diff                                                                                                                                        
count  310.000000     310.000000                310.000000                                                                                                                                        
mean   19710.077419   19662.567742              0.000666                                                                                                                                          
std    48882.246111   48773.870505              0.010779                                                                                                                                          
min    292.000000     292.000000               -0.042781                                                                                                                                          
25%    1090.000000    1084.000000              -0.002887                                                                                                                                          
50%    2776.000000    2784.000000               0.000000                                                                                                                                          
75%    10185.000000   10138.000000              0.002424                                                                                                                                          
max    324060.000000  323508.000000             0.058974
Geomean difference                                   0.1%

For AArch64 it's okay, but there are still plenty of large regressions:

              master  minsize-high-cost-expand        diff                                                                                                                                        
count  310.000000     310.000000                310.000000                                                                                                                                        
mean   26712.283871   26650.709677             -0.000930                                                                                                                                          
std    66147.455977   66033.469245              0.010996                                                                                                                                          
min    476.000000     476.000000               -0.050374                                                                                                                                          
25%    1294.000000    1304.000000              -0.003676                                                                                                                                          
50%    3168.000000    3168.000000               0.000000                                                                                                                                          
75%    11252.000000   11216.000000              0.001597                                                                                                                                          
max    471648.000000  470888.000000             0.051485
Geomean difference                                  -0.1%

But for X86, it's great:

              master  minsize-high-cost-expand        diff                                                                                                                                        
count  313.000000     313.000000                313.000000                                                                                                                                        
mean   27472.361022   27354.431310             -0.005084                                                                                                                                          
std    68350.278069   68101.639463              0.013028                                                                                                                                          
min    389.000000     389.000000               -0.065654                                                                                                                                          
25%    1157.000000    1157.000000              -0.007393                                                                                                                                          
50%    2933.000000    2949.000000               0.000000                                                                                                                                          
75%    11093.000000   11077.000000              0.000000                                                                                                                                          
max    444676.000000  442660.000000             0.039900
Geomean difference                                  -0.5%
lebedev.ri added inline comments.Aug 17 2020, 7:15 AM
llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2406

In D76434 you highlighted by SCEVNAry expressions can have more than two operands, which would expand to a chain of operations, and the existing costs for AddRecExprs tries to account for that. But this was missing for normal Add and Mul expressions. Have I misunderstood you?

Doesn't look missing to me?

samparker added inline comments.Aug 17 2020, 7:26 AM
llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2406

Ah, thanks! I'm getting lost amongst the patches.

samparker updated this revision to Diff 286253.Aug 18 2020, 4:49 AM

Rebased on top of D86050 and introduced a struct to map operands for the worklist, instead of just storing the opcode.

samparker updated this revision to Diff 290239.Sep 7 2020, 4:09 AM

Rebased now that the refactor patch is in.

This revision was not accepted when it landed; it landed in state Needs Review.Sep 10 2020, 12:23 AM
This revision was automatically updated to reflect the committed changes.

Let's discuss any further desired tweaks in a post-commit review.

@samparker why did you commit this?
I have not finished reviewing this, and i've requested changes to the previous revisions.

llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2187

This needs a better comment.
Min/Max Idx are pretty magical on the first look.

2189

Not llvm coding style in any case.
These should likely be Min/Max

2192–2193

SCEVOperand::OperandIdx is int

2195

Please do mark done comments as such.

2344

Consider constants to be free unless we are optforsize

2347–2348

These are only used in a single place

2352

This is inconsistent with every other return here.

samparker marked 5 inline comments as done.Sep 10 2020, 12:56 AM

We had gone through many revisions where I've addressed all of your (helpful) comments and I hadn't heard anything else for three weeks. I sincerely expected the remaining issues to be style changes which seem appropriate for a post-commit review.

llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2192–2193

But when we enumerate the SCEV operands, the index is size_t and the types need to be compatible for std::max and min.

lebedev.ri added inline comments.Sep 10 2020, 1:01 AM
llvm/lib/Transforms/Utils/ScalarEvolutionExpander.cpp
2192–2193
int MinIdx = std::max((int)SCEVOp.index(), CostOp.MinIdx);
int OpIdx = std::min(MinIdx, CostOp.MaxIdx);
samparker updated this revision to Diff 290902.Sep 10 2020, 1:07 AM

Addressed comments.