Modify the ARM getCmpSelInstrCost implementation for the code size costs of selects. Now consider the legalization cost and increase the cost of i1 because those values wouldn't live in a general purpose register. We also make selects +1 more expensive to account for the IT instruction.
Details
Diff Detail
Event Timeline
I think this should be reasonable. Can you explain where the cost of 2 is coming form? I would expect IT to be roughly free in a lot of places. ARM has conditional statements for pretty much everything. And armv8.1-m has csel :) (But, we don't actually use that at the moment..)
Do you have benchmark results? Not that you have to share them here, but do they look, um, reliably better?
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp | ||
---|---|---|
513 | Simplift ? |
Can you explain where the cost of 2 is coming form?
Well the IT isn't guaranteed to be free, it still takes up code size and, of course, it doesn't even exist for T1. But in my testing, a few months ago, it wasn't worth differentiating between M-profile versions - but I will double check. The cmsis numbers for the M33 look good, it doesn't effect many kernels but they're all decent improvements.
Now only +1 for the IT block when we're querying for code size. But we now +1 for i1 values too to account for (some) cost of rematerialising those values.
I think you could argue the cost of selects in many ways on ARM. A lot of them will be free (folded into another instruction), a lot will cost 1, some will cost 2 because of the IT or even higher on a T1 core. I think in the end, whatever looks best on benchmarks is probably the best way to go.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp | ||
---|---|---|
588 | I'm not a huge fan of isMClass. That covers quite a range, and the top end (and even mid-end) of that range will chew through IT bocks/selects like they are not even present. Maybe use some variant of isThumb would be better? I wouldn't expect the code below to change much between a mclass and others, unless I'm missing something? |
Yeah, makes sense, it's now using isThumb and I've cleaned up the predicate a bit too. The thumbv7 select costs tests have been updated accordingly too.
Re-angling this patch to be just about code size, so the costing is the same as before but we only execute it when querying TCK_CodeSize. But we're also now overriding the user provided CostKind when we're optimising for minsize.
Metric: size..text Program default select-cost diff test-suite :: MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.test 6484 6332 -2.3% test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test 47596 46536 -2.2% test-suite :: MultiSource/Benchmarks/VersaBench/bmm/bmm.test 924 908 -1.7% test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test 15072 15268 1.3% test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test 15072 15264 1.3% test-suite :: MultiSource/Benchmarks/McCat/05-eks/eks.test 4748 4696 -1.1% test-suite :: MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url.test 3120 3088 -1.0% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test 32516 32300 -0.7% test-suite :: SingleSource/Benchmarks/Misc-C++/Large/ray.test 2416 2400 -0.7% test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test 3808 3784 -0.6% test-suite :: SingleSource/Benchmarks/Stanford/Oscar.test 1348 1340 -0.6% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test 31508 31324 -0.6% test-suite :: SingleSource/Benchmarks/Stanford/Puzzle.test 1376 1368 -0.6% test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test 27552 27408 -0.5% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test 20116 20012 -0.5% test-suite :: SingleSource/Benchmarks/Misc/whetstone.test 1604 1596 -0.5% test-suite :: MultiSource/Benchmarks/ASC_Sequoia/IRSmk/IRSmk.test 3592 3576 -0.4% test-suite :: MultiSource/Applications/ClamAV/clamscan.test 246116 245092 -0.4% test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test 68296 68040 -0.4% test-suite :: MultiSource/Benchmarks/Fhourstones/fhourstones.test 4828 4812 -0.3% test-suite :: MultiSource/Applications/JM/lencod/lencod.test 318860 317812 -0.3% test-suite :: MultiSource/Applications/minisat/minisat.test 9980 9948 -0.3% test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test 69828 69612 -0.3% test-suite :: MultiSource/Applications/oggenc/oggenc.test 90092 89828 -0.3% test-suite :: MultiSource/Benchmarks/Ptrdist/yacr2/yacr2.test 11224 11192 -0.3% test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test 19920 19864 -0.3% test-suite :: MultiSource/Applications/ALAC/encode/alacconvert-encode.test 19920 19864 -0.3% test-suite :: MultiSource/Applications/spiff/spiff.test 12184 12152 -0.3% test-suite :: MultiSource/Applications/sgefa/sgefa.test 6096 6080 -0.3% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test 29164 29092 -0.2% test-suite :: MultiSource/Benchmarks/Ptrdist/bc/bc.test 20760 20712 -0.2% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test 26104 26048 -0.2% test-suite :: MultiSource/Applications/lambda-0.1.3/lambda.test 13180 13152 -0.2% test-suite :: MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael.test 7796 7780 -0.2% test-suite :: MultiSource/Benchmarks/sim/sim.test 8704 8720 0.2% test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 324352 323764 -0.2% test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 296700 296192 -0.2% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/rsbench.test 9640 9624 -0.2% test-suite :: MultiSource/Benchmarks/SciMark2-C/scimark2.test 4936 4928 -0.2% test-suite :: MultiSource/Applications/SIBsim4/SIBsim4.test 20152 20120 -0.2% test-suite :: SingleSource/Benchmarks/CoyoteBench/fftbench.test 2524 2520 -0.2% test-suite :: MultiSource/Applications/lua/lua.test 61520 61432 -0.1% test-suite :: MultiSource/Benchmarks/Fhourstones-3.1/fhourstones3.1.test 2948 2952 0.1% test-suite :: MultiSource/Applications/SPASS/SPASS.test 201448 201176 -0.1% test-suite :: MultiSource/Benchmarks/Prolangs-C/agrep/agrep.test 23836 23804 -0.1% test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 128336 128176 -0.1% test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test 7540 7532 -0.1% test-suite :: MultiSource/Benchmarks/nbench/nbench.test 16872 16856 -0.1% test-suite :: MultiSource/Applications/obsequi/Obsequi.test 17468 17452 -0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test 218764 218596 -0.1% test-suite :: MultiSource/Applications/d/make_dparser.test 43812 43780 -0.1% test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test 89716 89776 0.1% test-suite :: MultiSource/Benchmarks/MallocBench/cfrac/cfrac.test 12352 12344 -0.1% test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC.test 14828 14820 -0.1% test-suite :: MicroBenchmarks/LCALS/SubsetALambdaLoops/lcalsALambda.test 90788 90740 -0.1% test-suite :: MultiSource/Applications/sqlite3/sqlite3.test 168288 168208 -0.0% test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test 168600 168520 -0.0% test-suite :: MicroBenchmarks/ImageProcessing/BilateralFiltering/BilateralFilter.test 61768 61752 -0.0% test-suite :: MicroBenchmarks/LCALS/SubsetCRawLoops/lcalsCRaw.test 91924 91908 -0.0% test-suite :: MicroBenchmarks/LCALS/SubsetCLambdaLoops/lcalsCLambda.test 92020 92004 -0.0% test-suite :: MultiSource/Applications/kimwitu++/kc.test 242528 242492 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test 60088 60080 -0.0% test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test 62016 62008 -0.0% test-suite :: MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test 278220 278188 -0.0% test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test 78376 78368 -0.0% Geomean difference -0.1%
When I made changes to the TTI framework, I added CodeSize costing to loop unroll and unswitch, where the implicit cost used to be the 'size and latency'... this could be causing some trouble for code size costing of selects...
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp | ||
---|---|---|
745 | I feel like this should be done by the calling function, not at this level. We should try and be reliable and give the user what they ask for, at least not special case this single function for codesize. I don't feel particularly strongly about it though, if you did feel like arguing the point. |
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp | ||
---|---|---|
745 | No, I agree, this was just the lazy way. |
Simplift ?