We can try to vectorize number of stores less than MinVecRegSize
/ scalar_value_size, if it is allowed by target. Gives an extra
opportunity for the vectorization.
Fixes PR54985.
| Paths 
 |  Differential  D124284  
[SLP]Try partial store vectorization if supported by target. ClosedPublic Authored by ABataev on Apr 22 2022, 11:17 AM. 
Details Summary We can try to vectorize number of stores less than MinVecRegSize Fixes PR54985. 
Diff Detail 
 Event Timeline
 Comment Actions I’m going to do some extra testing next week for this patch (if the bugs with non-typed pointers are fixed). Comment Actions Metric: SLP.NumVectorInstructions
Program                                                                                  SLP.NumVectorInstructions
                                                                                         results                   results0 diff
                    test-suite :: MultiSource/Benchmarks/Prolangs-C++/shapes/shapes.test     0.00                      6.00     inf%
                      test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test     0.00                     20.00     inf%
                        test-suite :: MultiSource/Benchmarks/Prolangs-C++/city/city.test     0.00                      4.00     inf%
              test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-smail/unix-smail.test     0.00                      2.00     inf%
                    test-suite :: MultiSource/Benchmarks/BitBench/uuencode/uuencode.test     0.00                      3.00     inf%
                              test-suite :: SingleSource/Benchmarks/Stanford/Towers.test     0.00                      1.00     inf%
                                        test-suite :: SingleSource/UnitTests/initp1.test     0.00                     20.00     inf%
                       test-suite :: SingleSource/UnitTests/ms_struct-bitfield-init.test     0.00                      1.00     inf%
     test-suite :: MultiSource/Benchmarks/MiBench/network-dijkstra/network-dijkstra.test     0.00                      2.00     inf%
     test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test     0.00                      4.00     inf%
                        test-suite :: MultiSource/Benchmarks/McCat/01-qbsort/qbsort.test     0.00                      2.00     inf%
                        test-suite :: MultiSource/Benchmarks/McCat/12-IOtest/iotest.test     0.00                      1.00     inf%
                                test-suite :: SingleSource/Benchmarks/Dhrystone/dry.test     0.00                      3.00     inf%
                                 test-suite :: MultiSource/Applications/sgefa/sgefa.test     0.00                      1.00     inf%
                       test-suite :: MultiSource/Benchmarks/Rodinia/hotspot/hotspot.test     0.00                      2.00     inf%
                 test-suite :: MultiSource/Benchmarks/Rodinia/pathfinder/pathfinder.test     0.00                      2.00     inf%
                       test-suite :: MultiSource/Benchmarks/MallocBench/cfrac/cfrac.test     0.00                      4.00     inf%
                      test-suite :: MultiSource/Benchmarks/Trimaran/enc-rc4/enc-rc4.test     0.00                      1.00     inf%
                     test-suite :: MultiSource/Benchmarks/Rodinia/backprop/backprop.test     0.00                      3.00     inf%
                                     test-suite :: MultiSource/Applications/aha/aha.test     0.00                      4.00     inf%
                                 test-suite :: MultiSource/Applications/spiff/spiff.test     0.00                      5.00     inf%
      test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test     0.00                      4.00     inf%
                         test-suite :: MultiSource/Applications/lambda-0.1.3/lambda.test     0.00                      2.00     inf%
            test-suite :: MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url.test     0.00                      1.00     inf%
                           test-suite :: MultiSource/Applications/hexxagon/hexxagon.test     0.00                      8.00     inf%
                  test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-tbl/unix-tbl.test     0.00                      2.00     inf%
                        test-suite :: MultiSource/Benchmarks/Prolangs-C++/life/life.test     0.00                     20.00     inf%
                  test-suite :: MultiSource/Benchmarks/Prolangs-C++/objects/objects.test     0.00                      3.00     inf%
           test-suite :: MultiSource/Benchmarks/MiBench/office-ispell/office-ispell.test     0.00                      5.00     inf%
                test-suite :: MultiSource/Benchmarks/Prolangs-C/assembler/assembler.test     0.00                      2.00     inf%
                                   test-suite :: MultiSource/Applications/siod/siod.test     2.00                    209.00 10350.0%
                                     test-suite :: MultiSource/Applications/lua/lua.test     1.00                     46.00  4500.0%
                             test-suite :: MultiSource/Applications/sqlite3/sqlite3.test    21.00                    438.00  1985.7%
                               test-suite :: SingleSource/Benchmarks/Stanford/Oscar.test     2.00                     30.00  1400.0%
                 test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test     3.00                     41.00  1266.7%
                      test-suite :: MultiSource/Benchmarks/Prolangs-C++/ocean/ocean.test     2.00                     26.00  1200.0%
              test-suite :: MultiSource/Benchmarks/VersaBench/beamformer/beamformer.test    32.00                    378.00  1081.2%
                                   test-suite :: MultiSource/Benchmarks/PAQ8p/paq8p.test    10.00                     77.00   670.0%
                              test-suite :: MultiSource/Applications/d/make_dparser.test     2.00                     15.00   650.0%
                   test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test    20.00                    141.00   605.0%
                  test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test    20.00                    141.00   605.0%
              test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test     2.00                     13.00   550.0%
              test-suite :: MultiSource/Applications/ALAC/encode/alacconvert-encode.test     2.00                     13.00   550.0%
             test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test     5.00                     32.00   540.0%
           test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test   100.00                    507.00   407.0%
          test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test   100.00                    507.00   407.0%
                               test-suite :: SingleSource/Benchmarks/McGill/exptree.test     1.00                      5.00   400.0%
           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC.test     2.00                     10.00   400.0%
                              test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test    11.00                     44.00   300.0%
                                 test-suite :: MultiSource/Benchmarks/Ptrdist/bc/bc.test     5.00                     18.00   260.0%
                   test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   131.00                    469.00   258.0%
                       test-suite :: MultiSource/Benchmarks/Fhourstones/fhourstones.test     8.00                     28.00   250.0%
                   test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test    73.00                    222.00   204.1%
         test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/Pathfinder/PathFinder.test     2.00                      6.00   200.0%
                                 test-suite :: MultiSource/Applications/lemon/lemon.test     5.00                     15.00   200.0%
                             test-suite :: MultiSource/Applications/SIBsim4/SIBsim4.test    12.00                     35.00   191.7%
                           test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test    97.00                    282.00   190.7%
                test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 11194.00                  31786.00   184.0%
               test-suite :: MultiSource/Benchmarks/Prolangs-C/archie-client/archie.test     4.00                     11.00   175.0%
                test-suite :: MultiSource/Benchmarks/Fhourstones-3.1/fhourstones3.1.test     7.00                     18.00   157.1%
     test-suite :: MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test    47.00                    120.00   155.3%
                              test-suite :: SingleSource/Benchmarks/Stanford/Puzzle.test    11.00                     28.00   154.5%
                                     test-suite :: MultiSource/Applications/hbd/hbd.test    41.00                    104.00   153.7%
                      test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   698.00                   1681.00   140.8%
                       test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   698.00                   1681.00   140.8%
                             test-suite :: MultiSource/Applications/ClamAV/clamscan.test    85.00                    195.00   129.4%
                           test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test   396.00                    907.00   129.0%
                           test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test    31.00                     71.00   129.0%
                           test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test    45.00                    101.00   124.4%
                           test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   101.00                    214.00   111.9%
                         test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1762.00                   3598.00   104.2%
                          test-suite :: External/SPEC/CFP2006/450.soplex/450.soplex.test    64.00                    130.00   103.1%
                                test-suite :: SingleSource/Benchmarks/Stanford/Perm.test     6.00                     12.00   100.0%
                              test-suite :: SingleSource/Benchmarks/Dhrystone/fldry.test     1.00                      2.00   100.0%
                             test-suite :: MultiSource/Applications/viterbi/viterbi.test     1.00                      2.00   100.0%
                             test-suite :: MultiSource/Applications/obsequi/Obsequi.test     2.00                      4.00   100.0%
                       test-suite :: External/SPEC/CINT2006/471.omnetpp/471.omnetpp.test   120.00                    233.00    94.2%
                        test-suite :: External/SPEC/CFP2006/482.sphinx3/482.sphinx3.test    56.00                    107.00    91.1%
                               test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test   522.00                    976.00    87.0%
                             test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test   165.00                    295.00    78.8%
                                 test-suite :: MultiSource/Applications/SPASS/SPASS.test   176.00                    307.00    74.4%
                             test-suite :: MultiSource/Benchmarks/Rodinia/srad/srad.test     3.00                      5.00    66.7%
                                 test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  6965.00                  11467.00    64.6%
                  test-suite :: MultiSource/Benchmarks/Prolangs-C/football/football.test    45.00                     73.00    62.2%
                      test-suite :: SingleSource/Benchmarks/Misc/richards_benchmark.test    10.00                     16.00    60.0%
          test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   681.00                   1080.00    58.6%
                            test-suite :: MultiSource/Applications/JM/lencod/lencod.test  1175.00                   1814.00    54.4%
                        test-suite :: MultiSource/Benchmarks/Prolangs-C/agrep/agrep.test    24.00                     37.00    54.2%
                       test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   980.00                   1508.00    53.9%
                           test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test    32.00                     49.00    53.1%
               test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test    49.00                     74.00    51.0%
                 test-suite :: External/SPEC/CINT2006/462.libquantum/462.libquantum.test   107.00                    161.00    50.5%
                test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  3638.00                   5474.00    50.5%
               test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  3638.00                   5474.00    50.5%
                       test-suite :: SingleSource/Benchmarks/BenchmarkGame/fannkuch.test     2.00                      3.00    50.0%                                                                                               
                         test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test    88.00                    129.00    46.6%
                        test-suite :: External/SPEC/CINT2017speed/657.xz_s/657.xz_s.test    88.00                    129.00    46.6%
                       test-suite :: MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.test    10.00                     14.00    40.0%
                           test-suite :: MultiSource/Benchmarks/Olden/health/health.test     5.00                      7.00    40.0%
                                test-suite :: MultiSource/Applications/kimwitu++/kc.test    58.00                     81.00    39.7%
           test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   271.00                    378.00    39.5%
                    test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   750.00                   1031.00    37.5%
                     test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   750.00                   1031.00    37.5%
             test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    75.00                    103.00    37.3%
                    test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    75.00                    103.00    37.3%
                 test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test   273.00                    372.00    36.3%
                   test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test  2078.00                   2778.00    33.7%
                               test-suite :: MultiSource/Applications/treecc/treecc.test    12.00                     16.00    33.3%
                                  test-suite :: SingleSource/Benchmarks/McGill/misr.test     6.00                      8.00    33.3%
                             test-suite :: MultiSource/Applications/minisat/minisat.test     3.00                      4.00    33.3%
               test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   814.00                   1084.00    33.2%
              test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   814.00                   1084.00    33.2%
                            test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   584.00                    777.00    33.0%
                               test-suite :: MultiSource/Applications/oggenc/oggenc.test   237.00                    311.00    31.2%
           test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test    65.00                     85.00    30.8%
          test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    65.00                     85.00    30.8%
         test-suite :: MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode.test    53.00                     67.00    26.4%
                              test-suite :: SingleSource/Benchmarks/Misc-C++/bigfib.test     4.00                      5.00    25.0%
                                 test-suite :: MultiSource/Benchmarks/nbench/nbench.test   218.00                    271.00    24.3%
           test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  3719.00                   4588.00    23.4%
          test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  3719.00                   4588.00    23.4%
           test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   670.00                    820.00    22.4%
                       test-suite :: MultiSource/Benchmarks/Ptrdist/anagram/anagram.test    32.00                     39.00    21.9%
                  test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  4980.00                   6059.00    21.7%
                          test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  4991.00                   6071.00    21.6%
                           test-suite :: MultiSource/Benchmarks/SciMark2-C/scimark2.test    10.00                     12.00    20.0%
                 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test   490.00                    586.00    19.6%
                  test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 15903.00                  18607.00    17.0%
                          test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test  5982.00                   6994.00    16.9%
                       test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   494.00                    553.00    11.9%
                        test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   494.00                    553.00    11.9%
               test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/XSBench/XSBench.test    27.00                     30.00    11.1%
             test-suite :: MultiSource/Benchmarks/MiBench/security-sha/security-sha.test    18.00                     20.00    11.1%
                             test-suite :: SingleSource/Benchmarks/Misc/ReedSolomon.test    25.00                     27.00     8.0%
               test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    86.00                     92.00     7.0%
                           test-suite :: SingleSource/Benchmarks/Misc-C++-EH/spirit.test    16.00                     17.00     6.2%
             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test   236.00                    247.00     4.7%
                      test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  6030.00                   6307.00     4.6%
               test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   285.00                    298.00     4.6%
                     test-suite :: MultiSource/Benchmarks/FreeBench/distray/distray.test    89.00                     93.00     4.5%
               test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/rsbench.test   102.00                    106.00     3.9%
                              test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test  3098.00                   3198.00     3.2%
                             test-suite :: SingleSource/UnitTests/matrix-types-spec.test    31.00                     32.00     3.2%
                     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test   143.00                    147.00     2.8%
                 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG/HPCCG.test    49.00                     50.00     2.0%
                            test-suite :: MultiSource/Benchmarks/McCat/08-main/main.test    59.00                     60.00     1.7%
                          test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test  1023.00                   1038.00     1.5%
                test-suite :: MultiSource/Benchmarks/Prolangs-C/simulator/simulator.test    84.00                     85.00     1.2%
                              test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test  1020.00                   1029.00     0.9%
                              test-suite :: SingleSource/Benchmarks/SmallPT/smallpt.test   114.00                    115.00     0.9%
                         test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test  1560.00                   1564.00     0.3%Statistics. All numbers are improvements. Comment Actions 
 My system is pretty busy so the perf numbers would not be correct most probably. Comment Actions 
 Just an example: test-suite :: MultiSource/Benchmarks/llubenchmark/llu.test 10.33 17.52 69.7% The test is not affected at all. As to some long run tests: test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test 28.09 29.60 5.4% test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test 32.65 34.00 4.1% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 96.71 100.30 3.7% test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test 49.65 50.82 2.4% test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test 50.21 50.93 1.4% test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test 33.50 31.27 -6.6% test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test 60.73 55.16 -9.2% test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test 36.94 32.76 -11.3% The less %, the better. Geomean is -100% but just like I said I would not trust these numbers. Comment Actions Increased priority, some numbers for long run tests: Metric: exec_time test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test 26.98 27.49 1.9% test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 94.46 95.73 1.3% test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test 63.61 64.22 1.0% Gains test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test 234.34 231.17 -1.4% test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test 27.07 26.70 -1.4% test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test 59.34 57.95 -2.3% test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test 45.51 44.42 -2.4% test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 36.46 35.40 -2.9% test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test 57.70 55.93 -3.1% test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test 57.02 53.86 -5.5% test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test 45.46 40.61 -10.7% test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test 51.19 41.87 -18.2% Geomean is still -100%, which means that still the performance with the patch is better than before. 
 
 
 dmgreen added inline comments. 
 
 
 Comment Actions Thanks for the update, this looks better to me. The perf results I have were looking OK too, except for one fp16 case where it was choosing to use v2f16 vectorization. It could be OK, it's essentially trading unrolling vs vectorizing and one isn't obviously better or worse than the other. But small vectorization factors can be difficult at times. I think the problem is that getOperationAction will get the data from these OpActions, which will all be initialize to 0 (=Legal) and targets do not usually overrides that for illegal types: So it can pick up "Legal" stores just because the default initialization and the target has never had to set them to anything else in the past. There are TruncStoreActions that should be set though. What do you think of using something like this, based on whether there is a legal trunk store? unsigned getStoreMinimumVF(unsigned VF, Type *ScalarTy) const {
  auto &&IsSupportedByTarget = [this, ScalarTy](unsigned VF) {
    auto *SrcTy = FixedVectorType::get(ScalarTy, VF / 2);
    EVT VT = getTLI()->getValueType(DL, SrcTy);
    TargetLowering::LegalizeAction LA =
        getTLI()->getOperationAction(ISD::STORE, VT);
    if (getTLI()->isTypeLegal(VT) &&
        (LA == TargetLowering::Legal || LA == TargetLowering::Custom))
      return true;
    auto LT = getTLI()->getTypeLegalizationCost(DL, SrcTy);
    LA = getTLI()->getTruncStoreAction(LT.second, VT);
    return LA == TargetLowering::Legal || LA == TargetLowering::Custom;
  };
  while (VF > 2 && IsSupportedByTarget(VF))
    VF /= 2;
  return VF;
}Would that work for the cases you are interested in? The target can override it in any case, so at least it is controllable. But if that works for your use-cases it should hopefully match the target lowering a little better. Comment Actions 
 I tried something similar already, it won't work. Plus, trunc store is not the case we're looking at here, it is different. This function just says to vectorizer that it might be worth trying this vector factor. The cost model should later inform that it is not profitable. If something is not correct in the TTI, it should be fixed in TTI. Comment Actions 
 Oh OK, that's a shame. There may be something a little off with the f16 costmodel, it is not always perfect, but I don't see anything obvious from what it is printing. There are only v2f16 values, which can't go too wrong. That issue isn't too important though. My worry is that this is currently an expensive way of saying "return 2". I've no strong objection if you want to go with the current method, but perhaps the default should be more "correct" and we can override the targets that want something different/more aggressive? They can choose to spend the extra compile time on factors that might not be expected to be very profitable to other archs. Comment Actions rebase? I'm not sure if rGc5e875f599c25c2ea5a5c3dc6396de17c0c80a45 will have changed due to this patch Comment Actions I'm seeing some fairly big regressions with this patch, specifically on Rome (AMD) architecture.  As far as performance improvements, I see a few on Skylake in the range of 3-6%. An example here is MicroBenchmarks/ImageProcessing/Blur, which ranges between 4-5% improvement. Overall, the regressions outnumber the gains in the testing I've done so far and would likely block our compiler release. Comment Actions 
 
 I'm trying to improve it. But if we have perf regressions, there is something wrong with the cost model. Comment Actions Reworked initial implementation to be more conservative. Also, now it is able to handle trunc stores. Comment Actions 
 Performance testing still ongoing, should be completed by tomorrow. Comment Actions I ran the same set of benchmarks again without issue this time (well, there was an issue, but it turned out that someone had changed the benchmark sources :) ). They might not be the most amazing SLP tests, but no remaining objections from me. (The reason I suggested the truncate code the way I did was because by default the legalizing rules for smaller than legal power2 types is to promote integers to larger sizes. So under MVE where we only have 128bit vectors, a v4i8 vector will be promoted to a v4i32. We would then need a v4i32->v4i8 truncstore for it to be legal. Which it does have! So it would allow some of the smaller than legal types to be vectorized, essentially treating the v4i8 operations as v4i32's. If we are going for more conservative it might make sense to ignore that though, and float types will always widen as opposed to promote by default.) Comment Actions @asbirlea How are you specifying the SSE/AVX level for your benchmark runs - are you running with -march=native the x86-64-v* levels or something else? Comment Actions 
 I believe the runs are using -target-cpu k8 and -target-cpu haswell. The latest performance testing still shows one regression in a benchmark from singlesource, but there are many improvements that offset it in non-public benchmarks. So this latest diff is good to go from my side. Comment Actions Please can you rebase? I added a number of PR tests yesterday and I'm curious how many improve. Comment Actions 
 Hi, thanks for the testing. What's the name of the regressed single source test? Comment Actions 
 Sure, will do later today Comment Actions LGTM - naturally the test-suite regression needs further investigation but I think that can be performed post-commit 
 This revision is now accepted and ready to land.May 9 2022, 8:56 AM 
 
 This revision was landed with ongoing or failed builds.May 9 2022, 9:49 AM Closed by commit rG9dc4ced204d1: [SLP]Try partial store vectorization if supported by target. (authored by ABataev).  ·  Explain Why This revision was automatically updated to reflect the committed changes. Comment Actions 
 Shootout/sieve for xfdo configuration on Rome looks regressed by 20%, and Shootout/fib2 for opt configuration also on Rome by 10%. There are others, but I hope this gives a rough idea. vdmitrie added inline comments. 
 
 
 
 
 
 
 RKSimon mentioned this in D127604: [SLP][X86] Add 32-bit vector stores to help vectorization opportunities.Jun 12 2022, 10:31 AM RKSimon added inline comments. 
 
Revision Contents 
 
 
Diff 424590 llvm/include/llvm/Analysis/TargetTransformInfo.h
 llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
 llvm/include/llvm/CodeGen/BasicTTIImpl.h
 llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
 llvm/lib/Analysis/TargetTransformInfo.cpp
 llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
 
 llvm/test/Transforms/GVN/no_speculative_loads_with_asan.ll
 llvm/test/Transforms/NewGVN/no_speculative_loads_with_asan.ll
 llvm/test/Transforms/SLPVectorizer/AArch64/vectorizable-selects-uniform-cmps.ll
 llvm/test/Transforms/SLPVectorizer/X86/crash_7zip.ll
 llvm/test/Transforms/SLPVectorizer/X86/crash_bullet.ll
 llvm/test/Transforms/SLPVectorizer/X86/crash_bullet3.ll
 llvm/test/Transforms/SLPVectorizer/X86/crash_sim4b1.ll
 llvm/test/Transforms/SLPVectorizer/X86/fptosi-inseltpoison.ll
 llvm/test/Transforms/SLPVectorizer/X86/fptosi.ll
 llvm/test/Transforms/SLPVectorizer/X86/fptoui.ll
 llvm/test/Transforms/SLPVectorizer/X86/insert-after-bundle.ll
 llvm/test/Transforms/SLPVectorizer/X86/memory-runtime-checks.ll
 llvm/test/Transforms/SLPVectorizer/X86/no_alternate_divrem.ll
 llvm/test/Transforms/SLPVectorizer/X86/odd_store.ll
 llvm/test/Transforms/SLPVectorizer/X86/pr49933.ll
 llvm/test/Transforms/SLPVectorizer/X86/remark_not_all_parts.ll
 llvm/test/Transforms/SLPVectorizer/X86/reorder_phi.ll
 llvm/test/Transforms/SLPVectorizer/X86/saxpy.ll
 llvm/test/Transforms/SLPVectorizer/X86/schedule-bundle.ll
 llvm/test/Transforms/SLPVectorizer/X86/simple-loop.ll
 llvm/test/Transforms/SLPVectorizer/X86/sitofp-inseltpoison.ll
 llvm/test/Transforms/SLPVectorizer/X86/sitofp.ll
 llvm/test/Transforms/SLPVectorizer/X86/uitofp.ll
 llvm/test/Transforms/SLPVectorizer/X86/vect_copyable_in_binops.ll
 llvm/test/Transforms/SLPVectorizer/slp-max-reg-size.ll
 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
I don't think that exposing isLegalOrCustom to the midend is the right way to go - I feel it sets a bad precedent
I don't think that "Custom" means enough to base mid-end optimizations on. It can mean anything from "this can be custom lowered to a single instruction", to "this can _sometimes_ be custom lowered to a single instruction, in specific situations, otherwise it will expand", to "this has to be custom expanded into 150 instructions". The variance between them is just too large.
It also created a dependency between the mid-end and SDAG ISel lowering that isn't good to introduce - considering that there are other ISel's like Global ISel, there might be a point in the future where SDAG is entirely unused in certain backends.
From what I can tell (correct me if I'm wrong), what you want to add for this specific patch is a way to override/ignore getMinVectorRegisterBitWidth for stores that the target can efficiently handle. But you don't just want to change getMinVectorRegisterBitWidth? Can we add a method for doing that? shouldOverrideMinStoreVectorRegisterBitwidth(Type *Ty). The default implementation can still be the same as the current BasicTTI::isLegalOrCustomInstruction method, but it allows the target to override it if desired, and doesn't expose LegalOrCustom to the midend. Which I think is better in the long run.