We can try to vectorize a number of stores smaller than MinVecRegSize
/ scalar_value_size, if the target allows it. This gives an extra
opportunity for vectorization.
Fixes PR54985.
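To make the threshold in the summary concrete, here is a minimal sketch of the arithmetic. The helper name minStoresForFullRegister and the 128-bit register width used in the note below are illustrative assumptions, not code or values taken from the patch.

```cpp
#include <cassert>

// Hypothetical helper, not part of LLVM: the number of consecutive scalar
// stores needed to fill one vector register of MinVecRegSizeBits bits.
// Groups of stores smaller than this count are the ones the summary
// describes as "less than MinVecRegSize / scalar_value_size".
constexpr unsigned minStoresForFullRegister(unsigned MinVecRegSizeBits,
                                            unsigned ScalarSizeBits) {
  return MinVecRegSizeBits / ScalarSizeBits;
}
```

Assuming a 128-bit minimum vector register, minStoresForFullRegister(128, 32) gives 4, so a group of only two adjacent i32 stores falls below the threshold and is the kind of case the patch attempts when the target allows it.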
Differential D124284
[SLP]Try partial store vectorization if supported by target.
Closed, Public. Authored by ABataev on Apr 22 2022, 11:17 AM.
Event Timeline
I'm going to do some extra testing next week for this patch (if the bugs with non-typed pointers are fixed).

Metric: SLP.NumVectorInstructions

Program  results  results0  diff
test-suite :: MultiSource/Benchmarks/Prolangs-C++/shapes/shapes.test  0.00  6.00  inf%
test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test  0.00  20.00  inf%
test-suite :: MultiSource/Benchmarks/Prolangs-C++/city/city.test  0.00  4.00  inf%
test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-smail/unix-smail.test  0.00  2.00  inf%
test-suite :: MultiSource/Benchmarks/BitBench/uuencode/uuencode.test  0.00  3.00  inf%
test-suite :: SingleSource/Benchmarks/Stanford/Towers.test  0.00  1.00  inf%
test-suite :: SingleSource/UnitTests/initp1.test  0.00  20.00  inf%
test-suite :: SingleSource/UnitTests/ms_struct-bitfield-init.test  0.00  1.00  inf%
test-suite :: MultiSource/Benchmarks/MiBench/network-dijkstra/network-dijkstra.test  0.00  2.00  inf%
test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test  0.00  4.00  inf%
test-suite :: MultiSource/Benchmarks/McCat/01-qbsort/qbsort.test  0.00  2.00  inf%
test-suite :: MultiSource/Benchmarks/McCat/12-IOtest/iotest.test  0.00  1.00  inf%
test-suite :: SingleSource/Benchmarks/Dhrystone/dry.test  0.00  3.00  inf%
test-suite :: MultiSource/Applications/sgefa/sgefa.test  0.00  1.00  inf%
test-suite :: MultiSource/Benchmarks/Rodinia/hotspot/hotspot.test  0.00  2.00  inf%
test-suite :: MultiSource/Benchmarks/Rodinia/pathfinder/pathfinder.test  0.00  2.00  inf%
test-suite :: MultiSource/Benchmarks/MallocBench/cfrac/cfrac.test  0.00  4.00  inf%
test-suite :: MultiSource/Benchmarks/Trimaran/enc-rc4/enc-rc4.test  0.00  1.00  inf%
test-suite :: MultiSource/Benchmarks/Rodinia/backprop/backprop.test  0.00  3.00  inf%
test-suite :: MultiSource/Applications/aha/aha.test  0.00  4.00  inf%
test-suite :: MultiSource/Applications/spiff/spiff.test  0.00  5.00  inf%
test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test  0.00  4.00  inf%
test-suite :: MultiSource/Applications/lambda-0.1.3/lambda.test  0.00  2.00  inf%
test-suite :: MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url.test  0.00  1.00  inf%
test-suite :: MultiSource/Applications/hexxagon/hexxagon.test  0.00  8.00  inf%
test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-tbl/unix-tbl.test  0.00  2.00  inf%
test-suite :: MultiSource/Benchmarks/Prolangs-C++/life/life.test  0.00  20.00  inf%
test-suite :: MultiSource/Benchmarks/Prolangs-C++/objects/objects.test  0.00  3.00  inf%
test-suite :: MultiSource/Benchmarks/MiBench/office-ispell/office-ispell.test  0.00  5.00  inf%
test-suite :: MultiSource/Benchmarks/Prolangs-C/assembler/assembler.test  0.00  2.00  inf%
test-suite :: MultiSource/Applications/siod/siod.test  2.00  209.00  10350.0%
test-suite :: MultiSource/Applications/lua/lua.test  1.00  46.00  4500.0%
test-suite :: MultiSource/Applications/sqlite3/sqlite3.test  21.00  438.00  1985.7%
test-suite :: SingleSource/Benchmarks/Stanford/Oscar.test  2.00  30.00  1400.0%
test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test  3.00  41.00  1266.7%
test-suite :: MultiSource/Benchmarks/Prolangs-C++/ocean/ocean.test  2.00  26.00  1200.0%
test-suite :: MultiSource/Benchmarks/VersaBench/beamformer/beamformer.test  32.00  378.00  1081.2%
test-suite :: MultiSource/Benchmarks/PAQ8p/paq8p.test  10.00  77.00  670.0%
test-suite :: MultiSource/Applications/d/make_dparser.test  2.00  15.00  650.0%
test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test  20.00  141.00  605.0%
test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test  20.00  141.00  605.0%
test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test  2.00  13.00  550.0%
test-suite :: MultiSource/Applications/ALAC/encode/alacconvert-encode.test  2.00  13.00  550.0%
test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test  5.00  32.00  540.0%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  100.00  507.00  407.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  100.00  507.00  407.0%
test-suite :: SingleSource/Benchmarks/McGill/exptree.test  1.00  5.00  400.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC.test  2.00  10.00  400.0%
test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test  11.00  44.00  300.0%
test-suite :: MultiSource/Benchmarks/Ptrdist/bc/bc.test  5.00  18.00  260.0%
test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test  131.00  469.00  258.0%
test-suite :: MultiSource/Benchmarks/Fhourstones/fhourstones.test  8.00  28.00  250.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  73.00  222.00  204.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/Pathfinder/PathFinder.test  2.00  6.00  200.0%
test-suite :: MultiSource/Applications/lemon/lemon.test  5.00  15.00  200.0%
test-suite :: MultiSource/Applications/SIBsim4/SIBsim4.test  12.00  35.00  191.7%
test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test  97.00  282.00  190.7%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  11194.00  31786.00  184.0%
test-suite :: MultiSource/Benchmarks/Prolangs-C/archie-client/archie.test  4.00  11.00  175.0%
test-suite :: MultiSource/Benchmarks/Fhourstones-3.1/fhourstones3.1.test  7.00  18.00  157.1%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test  47.00  120.00  155.3%
test-suite :: SingleSource/Benchmarks/Stanford/Puzzle.test  11.00  28.00  154.5%
test-suite :: MultiSource/Applications/hbd/hbd.test  41.00  104.00  153.7%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  698.00  1681.00  140.8%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  698.00  1681.00  140.8%
test-suite :: MultiSource/Applications/ClamAV/clamscan.test  85.00  195.00  129.4%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  396.00  907.00  129.0%
test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test  31.00  71.00  129.0%
test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test  45.00  101.00  124.4%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test  101.00  214.00  111.9%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1762.00  3598.00  104.2%
test-suite :: External/SPEC/CFP2006/450.soplex/450.soplex.test  64.00  130.00  103.1%
test-suite :: SingleSource/Benchmarks/Stanford/Perm.test  6.00  12.00  100.0%
test-suite :: SingleSource/Benchmarks/Dhrystone/fldry.test  1.00  2.00  100.0%
test-suite :: MultiSource/Applications/viterbi/viterbi.test  1.00  2.00  100.0%
test-suite :: MultiSource/Applications/obsequi/Obsequi.test  2.00  4.00  100.0%
test-suite :: External/SPEC/CINT2006/471.omnetpp/471.omnetpp.test  120.00  233.00  94.2%
test-suite :: External/SPEC/CFP2006/482.sphinx3/482.sphinx3.test  56.00  107.00  91.1%
test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test  522.00  976.00  87.0%
test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test  165.00  295.00  78.8%
test-suite :: MultiSource/Applications/SPASS/SPASS.test  176.00  307.00  74.4%
test-suite :: MultiSource/Benchmarks/Rodinia/srad/srad.test  3.00  5.00  66.7%
test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  6965.00  11467.00  64.6%
test-suite :: MultiSource/Benchmarks/Prolangs-C/football/football.test  45.00  73.00  62.2%
test-suite :: SingleSource/Benchmarks/Misc/richards_benchmark.test  10.00  16.00  60.0%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  681.00  1080.00  58.6%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test  1175.00  1814.00  54.4%
test-suite :: MultiSource/Benchmarks/Prolangs-C/agrep/agrep.test  24.00  37.00  54.2%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  980.00  1508.00  53.9%
test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test  32.00  49.00  53.1%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test  49.00  74.00  51.0%
test-suite :: External/SPEC/CINT2006/462.libquantum/462.libquantum.test  107.00  161.00  50.5%
test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  3638.00  5474.00  50.5%
test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  3638.00  5474.00  50.5%
test-suite :: SingleSource/Benchmarks/BenchmarkGame/fannkuch.test  2.00  3.00  50.0%
test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test  88.00  129.00  46.6%
test-suite :: External/SPEC/CINT2017speed/657.xz_s/657.xz_s.test  88.00  129.00  46.6%
test-suite :: MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.test  10.00  14.00  40.0%
test-suite :: MultiSource/Benchmarks/Olden/health/health.test  5.00  7.00  40.0%
test-suite :: MultiSource/Applications/kimwitu++/kc.test  58.00  81.00  39.7%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test  271.00  378.00  39.5%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  750.00  1031.00  37.5%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  750.00  1031.00  37.5%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test  75.00  103.00  37.3%
test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test  75.00  103.00  37.3%
test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test  273.00  372.00  36.3%
test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test  2078.00  2778.00  33.7%
test-suite :: MultiSource/Applications/treecc/treecc.test  12.00  16.00  33.3%
test-suite :: SingleSource/Benchmarks/McGill/misr.test  6.00  8.00  33.3%
test-suite :: MultiSource/Applications/minisat/minisat.test  3.00  4.00  33.3%
test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test  814.00  1084.00  33.2%
test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test  814.00  1084.00  33.2%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  584.00  777.00  33.0%
test-suite :: MultiSource/Applications/oggenc/oggenc.test  237.00  311.00  31.2%
test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test  65.00  85.00  30.8%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test  65.00  85.00  30.8%
test-suite :: MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode.test  53.00  67.00  26.4%
test-suite :: SingleSource/Benchmarks/Misc-C++/bigfib.test  4.00  5.00  25.0%
test-suite :: MultiSource/Benchmarks/nbench/nbench.test  218.00  271.00  24.3%
test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  3719.00  4588.00  23.4%
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  3719.00  4588.00  23.4%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  670.00  820.00  22.4%
test-suite :: MultiSource/Benchmarks/Ptrdist/anagram/anagram.test  32.00  39.00  21.9%
test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  4980.00  6059.00  21.7%
test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  4991.00  6071.00  21.6%
test-suite :: MultiSource/Benchmarks/SciMark2-C/scimark2.test  10.00  12.00  20.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test  490.00  586.00  19.6%
test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  15903.00  18607.00  17.0%
test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test  5982.00  6994.00  16.9%
test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test  494.00  553.00  11.9%
test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test  494.00  553.00  11.9%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/XSBench/XSBench.test  27.00  30.00  11.1%
test-suite :: MultiSource/Benchmarks/MiBench/security-sha/security-sha.test  18.00  20.00  11.1%
test-suite :: SingleSource/Benchmarks/Misc/ReedSolomon.test  25.00  27.00  8.0%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test  86.00  92.00  7.0%
test-suite :: SingleSource/Benchmarks/Misc-C++-EH/spirit.test  16.00  17.00  6.2%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test  236.00  247.00  4.7%
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  6030.00  6307.00  4.6%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test  285.00  298.00  4.6%
test-suite :: MultiSource/Benchmarks/FreeBench/distray/distray.test  89.00  93.00  4.5%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/rsbench.test  102.00  106.00  3.9%
test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test  3098.00  3198.00  3.2%
test-suite :: SingleSource/UnitTests/matrix-types-spec.test  31.00  32.00  3.2%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test  143.00  147.00  2.8%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG/HPCCG.test  49.00  50.00  2.0%
test-suite :: MultiSource/Benchmarks/McCat/08-main/main.test  59.00  60.00  1.7%
test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test  1023.00  1038.00  1.5%
test-suite :: MultiSource/Benchmarks/Prolangs-C/simulator/simulator.test  84.00  85.00  1.2%
test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test  1020.00  1029.00  0.9%
test-suite :: SingleSource/Benchmarks/SmallPT/smallpt.test  114.00  115.00  0.9%
test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test  1560.00  1564.00  0.3%

Statistics. All numbers are improvements.

My system is pretty busy, so the perf numbers are most probably not accurate.

Just an example:
test-suite :: MultiSource/Benchmarks/llubenchmark/llu.test  10.33  17.52  69.7%
The test is not affected at all. As for some long-running tests:
test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test  28.09  29.60  5.4%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  32.65  34.00  4.1%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  96.71  100.30  3.7%
test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test  49.65  50.82  2.4%
test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test  50.21  50.93  1.4%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  33.50  31.27  -6.6%
test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test  60.73  55.16  -9.2%
test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test  36.94  32.76  -11.3%
The lower the %, the better. Geomean is -100%, but as I said, I would not trust these numbers.

Increased priority, some numbers for long-running tests:
Metric: exec_time
test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test  26.98  27.49  1.9%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  94.46  95.73  1.3%
test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test  63.61  64.22  1.0%
Gains
test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test  234.34  231.17  -1.4%
test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test  27.07  26.70  -1.4%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test  59.34  57.95  -2.3%
test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test  45.51  44.42  -2.4%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  36.46  35.40  -2.9%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test  57.70  55.93  -3.1%
test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test  57.02  53.86  -5.5%
test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test  45.46  40.61  -10.7%
test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test  51.19  41.87  -18.2%
Geomean is still -100%, which means that the performance with the patch is still better than before.

dmgreen added inline comments.

Thanks for the update, this looks better to me. The perf results I have were looking OK too, except for one fp16 case where it was choosing to use v2f16 vectorization. It could be OK; it's essentially trading unrolling vs vectorizing, and one isn't obviously better or worse than the other. But small vectorization factors can be difficult at times.

I think the problem is that getOperationAction will get the data from these OpActions, which will all be initialized to 0 (=Legal), and targets do not usually override that for illegal types. So it can pick up "Legal" stores just because of the default initialization, where the target has never had to set them to anything else in the past. There are TruncStoreActions that should be set, though. What do you think of using something like this, based on whether there is a legal trunc store?

    unsigned getStoreMinimumVF(unsigned VF, Type *ScalarTy) const {
      auto &&IsSupportedByTarget = [this, ScalarTy](unsigned VF) {
        auto *SrcTy = FixedVectorType::get(ScalarTy, VF / 2);
        EVT VT = getTLI()->getValueType(DL, SrcTy);
        TargetLowering::LegalizeAction LA =
            getTLI()->getOperationAction(ISD::STORE, VT);
        if (getTLI()->isTypeLegal(VT) &&
            (LA == TargetLowering::Legal || LA == TargetLowering::Custom))
          return true;
        auto LT = getTLI()->getTypeLegalizationCost(DL, SrcTy);
        LA = getTLI()->getTruncStoreAction(LT.second, VT);
        return LA == TargetLowering::Legal || LA == TargetLowering::Custom;
      };
      while (VF > 2 && IsSupportedByTarget(VF))
        VF /= 2;
      return VF;
    }

Would that work for the cases you are interested in? The target can override it in any case, so at least it is controllable. But if that works for your use-cases it should hopefully match the target lowering a little better.
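
The control flow of the suggestion above can be tried out in isolation with a minimal standalone sketch. It models only the halving loop; the TargetLowering queries are replaced by a caller-supplied predicate, and the names storeMinimumVF and IsHalfSupported are hypothetical stand-ins rather than LLVM API.

```cpp
#include <cassert>
#include <functional>

// Standalone model of the loop in the suggestion above. IsHalfSupported is a
// hypothetical stand-in for the legality queries: given the current VF, it
// answers whether a vector store of VF / 2 elements is acceptable.
unsigned storeMinimumVF(unsigned VF,
                        const std::function<bool(unsigned)> &IsHalfSupported) {
  // Keep halving while the half-width store is still supported, but never
  // go below two elements.
  while (VF > 2 && IsHalfSupported(VF))
    VF /= 2;
  return VF;
}
```

With a predicate that always answers true the result bottoms out at 2, which is the "expensive way of saying return 2" concern raised later in the thread; with a predicate that always answers false the starting VF is returned unchanged.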

I tried something similar already; it won't work. Plus, trunc store is not the case we're looking at here, it is different. This function just tells the vectorizer that it might be worth trying this vector factor. The cost model should later report that it is not profitable. If something is not correct in the TTI, it should be fixed in TTI.

Oh OK, that's a shame. There may be something a little off with the f16 costmodel, it is not always perfect, but I don't see anything obvious from what it is printing. There are only v2f16 values, which can't go too wrong. That issue isn't too important though. My worry is that this is currently an expensive way of saying "return 2". I've no strong objection if you want to go with the current method, but perhaps the default should be more "correct" and we can override the targets that want something different/more aggressive? They can choose to spend the extra compile time on factors that might not be expected to be very profitable to other archs.

rebase? I'm not sure if rGc5e875f599c25c2ea5a5c3dc6396de17c0c80a45 will have changed due to this patch

I'm seeing some fairly big regressions with this patch, specifically on Rome (AMD) architecture. As far as performance improvements, I see a few on Skylake in the range of 3-6%. An example here is MicroBenchmarks/ImageProcessing/Blur, which ranges between 4-5% improvement. Overall, the regressions outnumber the gains in the testing I've done so far and would likely block our compiler release.

I'm trying to improve it. But if we have perf regressions, there is something wrong with the cost model.

Reworked initial implementation to be more conservative. Also, now it is able to handle trunc stores.

Performance testing still ongoing, should be completed by tomorrow.

I ran the same set of benchmarks again without issue this time (well, there was an issue, but it turned out that someone had changed the benchmark sources :) ). They might not be the most amazing SLP tests, but no remaining objections from me.

(The reason I suggested the truncate code the way I did was because by default the legalizing rule for smaller-than-legal power-of-2 types is to promote integers to larger sizes. So under MVE, where we only have 128-bit vectors, a v4i8 vector will be promoted to a v4i32. We would then need a v4i32->v4i8 truncstore for it to be legal. Which it does have! So it would allow some of the smaller-than-legal types to be vectorized, essentially treating the v4i8 operations as v4i32's. If we are going for more conservative it might make sense to ignore that though, and float types will always widen as opposed to promote by default.)

@asbirlea How are you specifying the SSE/AVX level for your benchmark runs - are you running with -march=native, the x86-64-v* levels, or something else?

I believe the runs are using -target-cpu k8 and -target-cpu haswell. The latest performance testing still shows one regression in a benchmark from singlesource, but there are many improvements that offset it in non-public benchmarks. So this latest diff is good to go from my side.

Please can you rebase? I added a number of PR tests yesterday and I'm curious how many improve.

Hi, thanks for the testing. What's the name of the regressed single source test?

Sure, will do later today

LGTM - naturally the test-suite regression needs further investigation but I think that can be performed post-commit

This revision is now accepted and ready to land. May 9 2022, 8:56 AM
This revision was landed with ongoing or failed builds. May 9 2022, 9:49 AM
Closed by commit rG9dc4ced204d1: [SLP]Try partial store vectorization if supported by target. (authored by ABataev).
This revision was automatically updated to reflect the committed changes.

Shootout/sieve for xfdo configuration on Rome looks regressed by 20%, and Shootout/fib2 for opt configuration also on Rome by 10%. There are others, but I hope this gives a rough idea.

vdmitrie added inline comments.
RKSimon mentioned this in D127604: [SLP][X86] Add 32-bit vector stores to help vectorization opportunities. Jun 12 2022, 10:31 AM
RKSimon added inline comments.

Revision Contents
Diff 428116
llvm/include/llvm/Analysis/TargetTransformInfo.h
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
llvm/include/llvm/CodeGen/BasicTTIImpl.h
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
llvm/lib/Analysis/TargetTransformInfo.cpp
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
llvm/test/Transforms/SLPVectorizer/X86/arith-add-load.ll
llvm/test/Transforms/SLPVectorizer/X86/arith-and-const-load.ll
llvm/test/Transforms/SLPVectorizer/X86/arith-mul-load.ll
llvm/test/Transforms/SLPVectorizer/X86/crash_7zip.ll
llvm/test/Transforms/SLPVectorizer/X86/crash_bullet.ll
llvm/test/Transforms/SLPVectorizer/X86/crash_bullet3.ll
llvm/test/Transforms/SLPVectorizer/X86/crash_sim4b1.ll
llvm/test/Transforms/SLPVectorizer/X86/fptosi-inseltpoison.ll
llvm/test/Transforms/SLPVectorizer/X86/fptosi.ll
llvm/test/Transforms/SLPVectorizer/X86/fptoui.ll
llvm/test/Transforms/SLPVectorizer/X86/hadd-inseltpoison.ll
llvm/test/Transforms/SLPVectorizer/X86/hadd.ll
llvm/test/Transforms/SLPVectorizer/X86/insert-after-bundle.ll
llvm/test/Transforms/SLPVectorizer/X86/memory-runtime-checks.ll
llvm/test/Transforms/SLPVectorizer/X86/no_alternate_divrem.ll
llvm/test/Transforms/SLPVectorizer/X86/odd_store.ll
llvm/test/Transforms/SLPVectorizer/X86/pr49933.ll
llvm/test/Transforms/SLPVectorizer/X86/remark_not_all_parts.ll
llvm/test/Transforms/SLPVectorizer/X86/reorder_phi.ll
llvm/test/Transforms/SLPVectorizer/X86/saxpy.ll
llvm/test/Transforms/SLPVectorizer/X86/schedule-bundle.ll
llvm/test/Transforms/SLPVectorizer/X86/simple-loop.ll
llvm/test/Transforms/SLPVectorizer/X86/sitofp-inseltpoison.ll
llvm/test/Transforms/SLPVectorizer/X86/sitofp.ll
llvm/test/Transforms/SLPVectorizer/X86/uitofp.ll
llvm/test/Transforms/SLPVectorizer/X86/vect_copyable_in_binops.ll
I don't think that exposing isLegalOrCustom to the midend is the right way to go - I feel it sets a bad precedent.
I don't think that "Custom" means enough to base mid-end optimizations on. It can mean anything from "this can be custom lowered to a single instruction", to "this can _sometimes_ be custom lowered to a single instruction, in specific situations, otherwise it will expand", to "this has to be custom expanded into 150 instructions". The variance between them is just too large.
It also creates a dependency between the mid-end and SDAG ISel lowering that isn't good to introduce - considering that there are other ISels like Global ISel, there might be a point in the future where SDAG is entirely unused in certain backends.
From what I can tell (correct me if I'm wrong), what you want to add for this specific patch is a way to override/ignore getMinVectorRegisterBitWidth for stores that the target can efficiently handle. But you don't just want to change getMinVectorRegisterBitWidth? Can we add a method for doing that? shouldOverrideMinStoreVectorRegisterBitwidth(Type *Ty). The default implementation can still be the same as the current BasicTTI::isLegalOrCustomInstruction method, but it allows the target to override it if desired, and doesn't expose LegalOrCustom to the midend. Which I think is better in the long run.
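
The shape of the proposed hook can be sketched as follows. This is a simplified, standalone illustration under stated assumptions: TypeSketch, BasicTTISketch, ConservativeTargetTTI, isLegalOrCustomStore and mayTryNarrowStore are hypothetical stand-ins, not the real LLVM classes, and the 32-bit policy in the override is made up purely for the example. Only the hook name shouldOverrideMinStoreVectorRegisterBitwidth comes from the comment above.

```cpp
#include <cassert>

// Hypothetical stand-in for llvm::Type, reduced to a scalar bit width.
struct TypeSketch {
  unsigned ScalarBits;
};

// Hypothetical stand-in for the base TTI implementation.
struct BasicTTISketch {
  virtual ~BasicTTISketch() = default;

  // The hook the mid-end would call. By default it defers to the
  // legal-or-custom query, as the comment suggests.
  virtual bool
  shouldOverrideMinStoreVectorRegisterBitwidth(const TypeSketch &Ty) const {
    return isLegalOrCustomStore(Ty);
  }

protected:
  // Stubbed legality query; it stays an implementation detail of the TTI
  // layer and is never called by the mid-end directly.
  virtual bool isLegalOrCustomStore(const TypeSketch &) const { return true; }
};

// A target that replaces the default with its own, stricter policy.
struct ConservativeTargetTTI : BasicTTISketch {
  bool shouldOverrideMinStoreVectorRegisterBitwidth(
      const TypeSketch &Ty) const override {
    // Made-up policy: only allow narrow vector stores of 32-bit or wider
    // scalars.
    return Ty.ScalarBits >= 32;
  }
};

// Mid-end side: depends only on the hook, not on Legal/Custom.
bool mayTryNarrowStore(const BasicTTISketch &TTI, const TypeSketch &Ty) {
  return TTI.shouldOverrideMinStoreVectorRegisterBitwidth(Ty);
}
```

The point of the design is visible in mayTryNarrowStore: the vectorizer asks one target-level question, and each target decides whether to answer it from SDAG legality tables or from something else entirely.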