This is an archive of the discontinued LLVM Phabricator instance.

[ARM][CostModel] Select instruction costs.
ClosedPublic

Authored by samparker on Jun 18 2020, 6:29 AM.

Details

Summary

Modify the ARM getCmpSelInstrCost implementation for the code size costs of selects. Now consider the legalization cost and increase the cost of i1 because those values wouldn't live in a general purpose register. We also make selects +1 more expensive to account for the IT instruction.

Diff Detail

Event Timeline

samparker created this revision.Jun 18 2020, 6:29 AM

I think this should be reasonable. Can you explain where the cost of 2 is coming form? I would expect IT to be roughly free in a lot of places. ARM has conditional statements for pretty much everything. And armv8.1-m has csel :) (But, we don't actually use that at the moment..)

Do you have benchmark results? Not that you have to share them here, but do they look, um, reliably better?

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
513

Simplift ?

Can you explain where the cost of 2 is coming form?

Well the IT isn't guaranteed to be free, it still takes up code size and, of course, it doesn't even exist for T1. But in my testing, a few months ago, it wasn't worth differentiating between M-profile versions - but I will double check. The cmsis numbers for the M33 look good, it doesn't effect many kernels but they're all decent improvements.

Fixed typo and also predicated the code for an M-class subtarget.

Now only +1 for the IT block when we're querying for code size. But we now +1 for i1 values too to account for (some) cost of rematerialising those values.

samparker updated this revision to Diff 278751.Jul 17 2020, 6:21 AM

Rebased and cleaned up the logic a bit.

I think you could argue the cost of selects in many ways on ARM. A lot of them will be free (folded into another instruction), a lot will cost 1, some will cost 2 because of the IT or even higher on a T1 core. I think in the end, whatever looks best on benchmarks is probably the best way to go.

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
588

I'm not a huge fan of isMClass. That covers quite a range, and the top end (and even mid-end) of that range will chew through IT bocks/selects like they are not even present.

Maybe use some variant of isThumb would be better? I wouldn't expect the code below to change much between a mclass and others, unless I'm missing something?

samparker updated this revision to Diff 282559.Aug 3 2020, 3:04 AM

Yeah, makes sense, it's now using isThumb and I've cleaned up the predicate a bit too. The thumbv7 select costs tests have been updated accordingly too.

samparker updated this revision to Diff 284377.Aug 10 2020, 8:09 AM

Re-angling this patch to be just about code size, so the costing is the same as before but we only execute it when querying TCK_CodeSize. But we're also now overriding the user provided CostKind when we're optimising for minsize.

Metric: size..text

Program                                                                                              default select-cost diff
                                   test-suite :: MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.test    6484    6332     -2.3%
                                    test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   47596   46536     -2.2%
                                        test-suite :: MultiSource/Benchmarks/VersaBench/bmm/bmm.test     924     908     -1.7%
                                test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   15072   15268      1.3%
                         test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   15072   15264      1.3%
                                          test-suite :: MultiSource/Benchmarks/McCat/05-eks/eks.test    4748    4696     -1.1%
                        test-suite :: MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url.test    3120    3088     -1.0%
                         test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test   32516   32300     -0.7%
                                       test-suite :: SingleSource/Benchmarks/Misc-C++/Large/ray.test    2416    2400     -0.7%
                         test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    3808    3784     -0.6%
                                           test-suite :: SingleSource/Benchmarks/Stanford/Oscar.test    1348    1340     -0.6%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test   31508   31324     -0.6%
                                          test-suite :: SingleSource/Benchmarks/Stanford/Puzzle.test    1376    1368     -0.6%
                                  test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test   27552   27408     -0.5%
                                 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test   20116   20012     -0.5%
                                           test-suite :: SingleSource/Benchmarks/Misc/whetstone.test    1604    1596     -0.5%
                                   test-suite :: MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk.test    3592    3576     -0.4%
                                         test-suite :: MultiSource/Applications/ClamAV/clamscan.test  246116  245092     -0.4%
                             test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test   68296   68040     -0.4%
                                   test-suite :: MultiSource/Benchmarks/Fhourstones/fhourstones.test    4828    4812     -0.3%
                                        test-suite :: MultiSource/Applications/JM/lencod/lencod.test  318860  317812     -0.3%
                                         test-suite :: MultiSource/Applications/minisat/minisat.test    9980    9948     -0.3%
                       test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   69828   69612     -0.3%
                                           test-suite :: MultiSource/Applications/oggenc/oggenc.test   90092   89828     -0.3%
                                       test-suite :: MultiSource/Benchmarks/Ptrdist/yacr2/yacr2.test   11224   11192     -0.3%
                          test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test   19920   19864     -0.3%
                          test-suite :: MultiSource/Applications/ALAC/encode/alacconvert-encode.test   19920   19864     -0.3%
                                             test-suite :: MultiSource/Applications/spiff/spiff.test   12184   12152     -0.3%
                                             test-suite :: MultiSource/Applications/sgefa/sgefa.test    6096    6080     -0.3%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   29164   29092     -0.2%
                                             test-suite :: MultiSource/Benchmarks/Ptrdist/bc/bc.test   20760   20712     -0.2%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   26104   26048     -0.2%
                                     test-suite :: MultiSource/Applications/lambda-0.1.3/lambda.test   13180   13152     -0.2%
               test-suite :: MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael.test    7796    7780     -0.2%
                                                   test-suite :: MultiSource/Benchmarks/sim/sim.test    8704    8720      0.2%
                                       test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  324352  323764     -0.2%
                                             test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  296700  296192     -0.2%
                           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/rsbench.test    9640    9624     -0.2%
                                       test-suite :: MultiSource/Benchmarks/SciMark2-C/scimark2.test    4936    4928     -0.2%
                                         test-suite :: MultiSource/Applications/SIBsim4/SIBsim4.test   20152   20120     -0.2%
                                     test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test    2524    2520     -0.2%
                                                 test-suite :: MultiSource/Applications/lua/lua.test   61520   61432     -0.1%
                            test-suite :: MultiSource/Benchmarks/Fhourstones-3.1/fhourstones3.1.test    2948    2952      0.1%
                                             test-suite :: MultiSource/Applications/SPASS/SPASS.test  201448  201176     -0.1%
                                    test-suite :: MultiSource/Benchmarks/Prolangs-C/agrep/agrep.test   23836   23804     -0.1%
                                        test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  128336  128176     -0.1%
                                          test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test    7540    7532     -0.1%
                                             test-suite :: MultiSource/Benchmarks/nbench/nbench.test   16872   16856     -0.1%
                                         test-suite :: MultiSource/Applications/obsequi/Obsequi.test   17468   17452     -0.1%
                             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test  218764  218596     -0.1%
                                          test-suite :: MultiSource/Applications/d/make_dparser.test   43812   43780     -0.1%
                               test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   89716   89776      0.1%
                                   test-suite :: MultiSource/Benchmarks/MallocBench/cfrac/cfrac.test   12352   12344     -0.1%
                       test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC.test   14828   14820     -0.1%
                            test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test   90788   90740     -0.1%
                                         test-suite :: MultiSource/Applications/sqlite3/sqlite3.test  168288  168208     -0.0%
                                      test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test  168600  168520     -0.0%
               test-suite :: MicroBenchmarks/ImageProcessing/BilateralFiltering/BilateralFilter.test   61768   61752     -0.0%
                                  test-suite :: MicroBenchmarks/LCALS/SubsetCRawLoops/lcalsCRaw.test   91924   91908     -0.0%
                            test-suite :: MicroBenchmarks/LCALS/SubsetCLambdaLoops/lcalsCLambda.test   92020   92004     -0.0%
                                            test-suite :: MultiSource/Applications/kimwitu++/kc.test  242528  242492     -0.0%
                       test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   60088   60080     -0.0%
                             test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test   62016   62008     -0.0%
                 test-suite :: MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test  278220  278188     -0.0%
                                         test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test   78376   78368     -0.0%
                                                                                  Geomean difference                     -0.1%
samparker updated this revision to Diff 284603.Aug 11 2020, 1:17 AM

Moved the ValTy type check into the codesize logic.

samparker updated this revision to Diff 284628.Aug 11 2020, 3:13 AM

When I made changes to the TTI framework, I added CodeSize costing to loop unroll and unswitch, where the implicit cost used to be the 'size and latency'... this could be causing some trouble for code size costing of selects...

dmgreen added inline comments.Aug 11 2020, 3:54 AM
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
745

I feel like this should be done by the calling function, not at this level. We should try and be reliable and give the user what they ask for, at least not special case this single function for codesize.

I don't feel particularly strongly about it though, if you did feel like arguing the point.

P.S I agree TCK_SizeAndLatency sounds better than Codesize, for the transforms.

samparker added inline comments.Aug 11 2020, 4:08 AM
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
745

No, I agree, this was just the lazy way.

samparker updated this revision to Diff 285051.Aug 12 2020, 5:34 AM
samparker edited the summary of this revision. (Show Details)

Rebased and removed the CostKind override.

dmgreen accepted this revision.Aug 14 2020, 11:20 AM

OK thanks,

If you are happy with how this it doing, then I am happy too. LGTM

This revision is now accepted and ready to land.Aug 14 2020, 11:20 AM
This revision was automatically updated to reflect the committed changes.