This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Try partial store vectorization if supported by target.
ClosedPublic

Authored by ABataev on Apr 22 2022, 11:17 AM.

Details

Summary

We can try to vectorize number of stores less than MinVecRegSize
/ scalar_value_size, if it is allowed by target. Gives an extra
opportunity for the vectorization.

Fixes PR54985.

Diff Detail

Event Timeline

ABataev created this revision.Apr 22 2022, 11:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 22 2022, 11:17 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
ABataev requested review of this revision.Apr 22 2022, 11:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 22 2022, 11:17 AM
xbolva00 added inline comments.Apr 22 2022, 12:57 PM
llvm/test/Transforms/SLPVectorizer/X86/pr49933.ll
2–3

Remove please.

4
ABataev updated this revision to Diff 424590.Apr 22 2022, 1:17 PM

Address comment

I’m going to do some extra testing next week for this patch (if the bugs with non-typed pointers are fixed).

Metric: SLP.NumVectorInstructions

Program                                                                                  SLP.NumVectorInstructions
                                                                                         results                   results0 diff
                    test-suite :: MultiSource/Benchmarks/Prolangs-C++/shapes/shapes.test     0.00                      6.00     inf%
                      test-suite :: MultiSource/Benchmarks/Prolangs-C/bison/mybison.test     0.00                     20.00     inf%
                        test-suite :: MultiSource/Benchmarks/Prolangs-C++/city/city.test     0.00                      4.00     inf%
              test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-smail/unix-smail.test     0.00                      2.00     inf%
                    test-suite :: MultiSource/Benchmarks/BitBench/uuencode/uuencode.test     0.00                      3.00     inf%
                              test-suite :: SingleSource/Benchmarks/Stanford/Towers.test     0.00                      1.00     inf%
                                        test-suite :: SingleSource/UnitTests/initp1.test     0.00                     20.00     inf%
                       test-suite :: SingleSource/UnitTests/ms_struct-bitfield-init.test     0.00                      1.00     inf%
     test-suite :: MultiSource/Benchmarks/MiBench/network-dijkstra/network-dijkstra.test     0.00                      2.00     inf%
     test-suite :: MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan.test     0.00                      4.00     inf%
                        test-suite :: MultiSource/Benchmarks/McCat/01-qbsort/qbsort.test     0.00                      2.00     inf%
                        test-suite :: MultiSource/Benchmarks/McCat/12-IOtest/iotest.test     0.00                      1.00     inf%
                                test-suite :: SingleSource/Benchmarks/Dhrystone/dry.test     0.00                      3.00     inf%
                                 test-suite :: MultiSource/Applications/sgefa/sgefa.test     0.00                      1.00     inf%
                       test-suite :: MultiSource/Benchmarks/Rodinia/hotspot/hotspot.test     0.00                      2.00     inf%
                 test-suite :: MultiSource/Benchmarks/Rodinia/pathfinder/pathfinder.test     0.00                      2.00     inf%
                       test-suite :: MultiSource/Benchmarks/MallocBench/cfrac/cfrac.test     0.00                      4.00     inf%
                      test-suite :: MultiSource/Benchmarks/Trimaran/enc-rc4/enc-rc4.test     0.00                      1.00     inf%
                     test-suite :: MultiSource/Benchmarks/Rodinia/backprop/backprop.test     0.00                      3.00     inf%
                                     test-suite :: MultiSource/Applications/aha/aha.test     0.00                      4.00     inf%
                                 test-suite :: MultiSource/Applications/spiff/spiff.test     0.00                      5.00     inf%
      test-suite :: MultiSource/Benchmarks/TSVC/NodeSplitting-flt/NodeSplitting-flt.test     0.00                      4.00     inf%
                         test-suite :: MultiSource/Applications/lambda-0.1.3/lambda.test     0.00                      2.00     inf%
            test-suite :: MultiSource/Benchmarks/Trimaran/netbench-url/netbench-url.test     0.00                      1.00     inf%
                           test-suite :: MultiSource/Applications/hexxagon/hexxagon.test     0.00                      8.00     inf%
                  test-suite :: MultiSource/Benchmarks/Prolangs-C/unix-tbl/unix-tbl.test     0.00                      2.00     inf%
                        test-suite :: MultiSource/Benchmarks/Prolangs-C++/life/life.test     0.00                     20.00     inf%
                  test-suite :: MultiSource/Benchmarks/Prolangs-C++/objects/objects.test     0.00                      3.00     inf%
           test-suite :: MultiSource/Benchmarks/MiBench/office-ispell/office-ispell.test     0.00                      5.00     inf%
                test-suite :: MultiSource/Benchmarks/Prolangs-C/assembler/assembler.test     0.00                      2.00     inf%
                                   test-suite :: MultiSource/Applications/siod/siod.test     2.00                    209.00 10350.0%
                                     test-suite :: MultiSource/Applications/lua/lua.test     1.00                     46.00  4500.0%
                             test-suite :: MultiSource/Applications/sqlite3/sqlite3.test    21.00                    438.00  1985.7%
                               test-suite :: SingleSource/Benchmarks/Stanford/Oscar.test     2.00                     30.00  1400.0%
                 test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test     3.00                     41.00  1266.7%
                      test-suite :: MultiSource/Benchmarks/Prolangs-C++/ocean/ocean.test     2.00                     26.00  1200.0%
              test-suite :: MultiSource/Benchmarks/VersaBench/beamformer/beamformer.test    32.00                    378.00  1081.2%
                                   test-suite :: MultiSource/Benchmarks/PAQ8p/paq8p.test    10.00                     77.00   670.0%
                              test-suite :: MultiSource/Applications/d/make_dparser.test     2.00                     15.00   650.0%
                   test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test    20.00                    141.00   605.0%
                  test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test    20.00                    141.00   605.0%
              test-suite :: MultiSource/Applications/ALAC/decode/alacconvert-decode.test     2.00                     13.00   550.0%
              test-suite :: MultiSource/Applications/ALAC/encode/alacconvert-encode.test     2.00                     13.00   550.0%
             test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test     5.00                     32.00   540.0%
           test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test   100.00                    507.00   407.0%
          test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test   100.00                    507.00   407.0%
                               test-suite :: SingleSource/Benchmarks/McGill/exptree.test     1.00                      5.00   400.0%
           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/SimpleMOC/SimpleMOC.test     2.00                     10.00   400.0%
                              test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test    11.00                     44.00   300.0%
                                 test-suite :: MultiSource/Benchmarks/Ptrdist/bc/bc.test     5.00                     18.00   260.0%
                   test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test   131.00                    469.00   258.0%
                       test-suite :: MultiSource/Benchmarks/Fhourstones/fhourstones.test     8.00                     28.00   250.0%
                   test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test    73.00                    222.00   204.1%
         test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/Pathfinder/PathFinder.test     2.00                      6.00   200.0%
                                 test-suite :: MultiSource/Applications/lemon/lemon.test     5.00                     15.00   200.0%
                             test-suite :: MultiSource/Applications/SIBsim4/SIBsim4.test    12.00                     35.00   191.7%
                           test-suite :: External/SPEC/CINT2006/456.hmmer/456.hmmer.test    97.00                    282.00   190.7%
                test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 11194.00                  31786.00   184.0%
               test-suite :: MultiSource/Benchmarks/Prolangs-C/archie-client/archie.test     4.00                     11.00   175.0%
                test-suite :: MultiSource/Benchmarks/Fhourstones-3.1/fhourstones3.1.test     7.00                     18.00   157.1%
     test-suite :: MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test    47.00                    120.00   155.3%
                              test-suite :: SingleSource/Benchmarks/Stanford/Puzzle.test    11.00                     28.00   154.5%
                                     test-suite :: MultiSource/Applications/hbd/hbd.test    41.00                    104.00   153.7%
                      test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   698.00                   1681.00   140.8%
                       test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   698.00                   1681.00   140.8%
                             test-suite :: MultiSource/Applications/ClamAV/clamscan.test    85.00                    195.00   129.4%
                           test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test   396.00                    907.00   129.0%
                           test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test    31.00                     71.00   129.0%
                           test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test    45.00                    101.00   124.4%
                           test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test   101.00                    214.00   111.9%
                         test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test  1762.00                   3598.00   104.2%
                          test-suite :: External/SPEC/CFP2006/450.soplex/450.soplex.test    64.00                    130.00   103.1%
                                test-suite :: SingleSource/Benchmarks/Stanford/Perm.test     6.00                     12.00   100.0%
                              test-suite :: SingleSource/Benchmarks/Dhrystone/fldry.test     1.00                      2.00   100.0%
                             test-suite :: MultiSource/Applications/viterbi/viterbi.test     1.00                      2.00   100.0%
                             test-suite :: MultiSource/Applications/obsequi/Obsequi.test     2.00                      4.00   100.0%
                       test-suite :: External/SPEC/CINT2006/471.omnetpp/471.omnetpp.test   120.00                    233.00    94.2%
                        test-suite :: External/SPEC/CFP2006/482.sphinx3/482.sphinx3.test    56.00                    107.00    91.1%
                               test-suite :: External/SPEC/CINT2006/403.gcc/403.gcc.test   522.00                    976.00    87.0%
                             test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test   165.00                    295.00    78.8%
                                 test-suite :: MultiSource/Applications/SPASS/SPASS.test   176.00                    307.00    74.4%
                             test-suite :: MultiSource/Benchmarks/Rodinia/srad/srad.test     3.00                      5.00    66.7%
                                 test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  6965.00                  11467.00    64.6%
                  test-suite :: MultiSource/Benchmarks/Prolangs-C/football/football.test    45.00                     73.00    62.2%
                      test-suite :: SingleSource/Benchmarks/Misc/richards_benchmark.test    10.00                     16.00    60.0%
          test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   681.00                   1080.00    58.6%
                            test-suite :: MultiSource/Applications/JM/lencod/lencod.test  1175.00                   1814.00    54.4%
                        test-suite :: MultiSource/Benchmarks/Prolangs-C/agrep/agrep.test    24.00                     37.00    54.2%
                       test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   980.00                   1508.00    53.9%
                           test-suite :: External/SPEC/CINT2006/458.sjeng/458.sjeng.test    32.00                     49.00    53.1%
               test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniAMR/miniAMR.test    49.00                     74.00    51.0%
                 test-suite :: External/SPEC/CINT2006/462.libquantum/462.libquantum.test   107.00                    161.00    50.5%
                test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  3638.00                   5474.00    50.5%
               test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  3638.00                   5474.00    50.5%
                       test-suite :: SingleSource/Benchmarks/BenchmarkGame/fannkuch.test     2.00                      3.00    50.0%                                                                                               
                         test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test    88.00                    129.00    46.6%
                        test-suite :: External/SPEC/CINT2017speed/657.xz_s/657.xz_s.test    88.00                    129.00    46.6%
                       test-suite :: MultiSource/Benchmarks/ASC_Sequoia/AMGmk/AMGmk.test    10.00                     14.00    40.0%
                           test-suite :: MultiSource/Benchmarks/Olden/health/health.test     5.00                      7.00    40.0%
                                test-suite :: MultiSource/Applications/kimwitu++/kc.test    58.00                     81.00    39.7%
           test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test   271.00                    378.00    39.5%
                    test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   750.00                   1031.00    37.5%
                     test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   750.00                   1031.00    37.5%
             test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test    75.00                    103.00    37.3%
                    test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test    75.00                    103.00    37.3%
                 test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test   273.00                    372.00    36.3%
                   test-suite :: External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test  2078.00                   2778.00    33.7%
                               test-suite :: MultiSource/Applications/treecc/treecc.test    12.00                     16.00    33.3%
                                  test-suite :: SingleSource/Benchmarks/McGill/misr.test     6.00                      8.00    33.3%
                             test-suite :: MultiSource/Applications/minisat/minisat.test     3.00                      4.00    33.3%
               test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test   814.00                   1084.00    33.2%
              test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test   814.00                   1084.00    33.2%
                            test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   584.00                    777.00    33.0%
                               test-suite :: MultiSource/Applications/oggenc/oggenc.test   237.00                    311.00    31.2%
           test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test    65.00                     85.00    30.8%
          test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test    65.00                     85.00    30.8%
         test-suite :: MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode.test    53.00                     67.00    26.4%
                              test-suite :: SingleSource/Benchmarks/Misc-C++/bigfib.test     4.00                      5.00    25.0%
                                 test-suite :: MultiSource/Benchmarks/nbench/nbench.test   218.00                    271.00    24.3%
           test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test  3719.00                   4588.00    23.4%
          test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test  3719.00                   4588.00    23.4%
           test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   670.00                    820.00    22.4%
                       test-suite :: MultiSource/Benchmarks/Ptrdist/anagram/anagram.test    32.00                     39.00    21.9%
                  test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  4980.00                   6059.00    21.7%
                          test-suite :: External/SPEC/CFP2006/453.povray/453.povray.test  4991.00                   6071.00    21.6%
                           test-suite :: MultiSource/Benchmarks/SciMark2-C/scimark2.test    10.00                     12.00    20.0%
                 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR.test   490.00                    586.00    19.6%
                  test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 15903.00                  18607.00    17.0%
                          test-suite :: External/SPEC/CFP2006/447.dealII/447.dealII.test  5982.00                   6994.00    16.9%
                       test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   494.00                    553.00    11.9%
                        test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   494.00                    553.00    11.9%
               test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/XSBench/XSBench.test    27.00                     30.00    11.1%
             test-suite :: MultiSource/Benchmarks/MiBench/security-sha/security-sha.test    18.00                     20.00    11.1%
                             test-suite :: SingleSource/Benchmarks/Misc/ReedSolomon.test    25.00                     27.00     8.0%
               test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    86.00                     92.00     7.0%
                           test-suite :: SingleSource/Benchmarks/Misc-C++-EH/spirit.test    16.00                     17.00     6.2%
             test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/PENNANT/PENNANT.test   236.00                    247.00     4.7%
                      test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test  6030.00                   6307.00     4.6%
               test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/miniFE/miniFE.test   285.00                    298.00     4.6%
                     test-suite :: MultiSource/Benchmarks/FreeBench/distray/distray.test    89.00                     93.00     4.5%
               test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/RSBench/rsbench.test   102.00                    106.00     3.9%
                              test-suite :: External/SPEC/CFP2006/444.namd/444.namd.test  3098.00                   3198.00     3.2%
                             test-suite :: SingleSource/UnitTests/matrix-types-spec.test    31.00                     32.00     3.2%
                     test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test   143.00                    147.00     2.8%
                 test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C++/HPCCG/HPCCG.test    49.00                     50.00     2.0%
                            test-suite :: MultiSource/Benchmarks/McCat/08-main/main.test    59.00                     60.00     1.7%
                          test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test  1023.00                   1038.00     1.5%
                test-suite :: MultiSource/Benchmarks/Prolangs-C/simulator/simulator.test    84.00                     85.00     1.2%
                              test-suite :: External/SPEC/CFP2006/433.milc/433.milc.test  1020.00                   1029.00     0.9%
                              test-suite :: SingleSource/Benchmarks/SmallPT/smallpt.test   114.00                    115.00     0.9%
                         test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test  1560.00                   1564.00     0.3%

Statistics. All numbers are improvements.

xbolva00 added a comment.EditedApr 25 2022, 10:27 AM

Looks great. Could you also observe some runtime perf improvements?

Looks great. Could you also observe some runtime perf improvements?

My system is pretty busy so the perf numbers would not be correct most probably.

Looks great. Could you also observe some runtime perf improvements?

My system is pretty busy so the perf numbers would not be correct most probably.

Just an example:

test-suite :: MultiSource/Benchmarks/llubenchmark/llu.test  10.33     17.52     69.7%

The test is not affected at all.

As to some long run tests:

test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test  28.09     29.60      5.4%
test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  32.65     34.00      4.1%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  96.71    100.30      3.7%
test-suite :: External/SPEC/CINT2017speed/605.mcf_s/605.mcf_s.test  49.65     50.82      2.4%
test-suite :: External/SPEC/CINT2017rate/505.mcf_r/505.mcf_r.test  50.21     50.93      1.4%
test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  33.50     31.27     -6.6%
test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test  60.73     55.16     -9.2%
test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test  36.94     32.76    -11.3%

The less %, the better. Geomean is -100% but just like I said I would not trust these numbers.

Yeah, understood.

ABataev added a comment.EditedApr 25 2022, 12:56 PM

Increased priority, some numbers for long run tests:
Regressions

Metric: exec_time
test-suite :: External/SPEC/CINT2006/401.bzip2/401.bzip2.test  26.98     27.49      1.9%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test  94.46     95.73      1.3%
test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test  63.61     64.22      1.0%

Gains

test-suite :: External/SPEC/CFP2017speed/619.lbm_s/619.lbm_s.test 234.34    231.17     -1.4%
test-suite :: External/SPEC/CFP2017rate/519.lbm_r/519.lbm_r.test  27.07     26.70     -1.4%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test  59.34     57.95     -2.3%
test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test  45.51     44.42     -2.4%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  36.46     35.40     -2.9%
test-suite :: External/SPEC/CINT2006/445.gobmk/445.gobmk.test  57.70     55.93     -3.1%
test-suite :: External/SPEC/CINT2006/473.astar/473.astar.test  57.02     53.86     -5.5%
test-suite :: External/SPEC/CINT2017rate/520.omnetpp_r/520.omnetpp_r.test  45.46     40.61    -10.7%
test-suite :: External/SPEC/CINT2017speed/620.omnetpp_s/620.omnetpp_s.test  51.19     41.87    -18.2%

Geomean is still -100%, which means that still the performance with the patch is better than before.
Tested for O3 + LTO, generic CPU.

Very nice numbers!

Maybe @asbirlea or @fhahn want to run own internal tests?

Noted, I will run some testing this week.

RKSimon added inline comments.Apr 26 2022, 5:21 AM
llvm/test/Transforms/GVN/no_speculative_loads_with_asan.ll
55–56

its weird that none of tests actually CHECK-NOT for loads any more - given its the name of the test file :-|

ABataev added inline comments.Apr 26 2022, 5:23 AM
llvm/test/Transforms/GVN/no_speculative_loads_with_asan.ll
55–56

I can keep CHECK-NOT, if needed

RKSimon added inline comments.Apr 26 2022, 5:36 AM
llvm/test/Transforms/GVN/no_speculative_loads_with_asan.ll
55–56

Either that or we move to auto-generating the test file checks - I have no strong preference

dmgreen added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
1370

I don't think that exposing isLegalOrCustom to the midend is the right way to go - I feel it sets a bad precedent

I don't think that "Custom" means enough to base mid-end optimizations on. It can mean anything from "this can be custom lowered to a single instruction", to "this can _sometimes_ be custom lowered to a single instruction, in specific situations, otherwise it will expand", to "this has to be custom expanded into 150 instructions". The variance between them is just too large.

It also created a dependency between the mid-end and SDAG ISel lowering that isn't good to introduce - considering that there are other ISel's like Global ISel, there might be a point in the future where SDAG is entirely unused in certain backends.

From what I can tell (correct me if I'm wrong), what you want to add for this specific patch is a way to override/ignore getMinVectorRegisterBitWidth for stores that the target can efficiently handle. But you don't just want to change getMinVectorRegisterBitWidth? Can we add a method for doing that? shouldOverrideMinStoreVectorRegisterBitwidth(Type *Ty). The default implementation can still be the same as the current BasicTTI::isLegalOrCustomInstruction method, but it allows the target to override it if desired, and doesn't expose LegalOrCustom to the midend. Which I think is better in the long run.

dmgreen added inline comments.Apr 27 2022, 10:34 AM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
318

Should this be getTypeLegalizationCost or getValueType? Otherwise we are asking for the isOperationLegalOrCustom on a legal type (LT.second) below, which you would hope was always Legal and won't really tell you much about how legal a store to Ty is.

ABataev added inline comments.Apr 27 2022, 10:35 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1370

I'll try to invent something better.

vporpo added a subscriber: vporpo.Apr 27 2022, 2:32 PM
ABataev updated this revision to Diff 425774.Apr 28 2022, 7:31 AM

Address comments

ABataev updated this revision to Diff 425776.Apr 28 2022, 7:38 AM

Fix test checks

Thanks for the update, this looks better to me. The perf results I have were looking OK too, except for one fp16 case where it was choosing to use v2f16 vectorization. It could be OK, it's essentially trading unrolling vs vectorizing and one isn't obviously better or worse than the other. But small vectorization factors can be difficult at times.

I think the problem is that getOperationAction will get the data from these OpActions, which will all be initialize to 0 (=Legal) and targets do not usually overrides that for illegal types:
https://github.com/llvm/llvm-project/blob/a9d68a5524dea113cace5983697786599cbdce9a/llvm/lib/CodeGen/TargetLoweringBase.cpp#L733

So it can pick up "Legal" stores just because the default initialization and the target has never had to set them to anything else in the past. There are TruncStoreActions that should be set though. What do you think of using something like this, based on whether there is a legal trunk store?

unsigned getStoreMinimumVF(unsigned VF, Type *ScalarTy) const {
  auto &&IsSupportedByTarget = [this, ScalarTy](unsigned VF) {
    auto *SrcTy = FixedVectorType::get(ScalarTy, VF / 2);
    EVT VT = getTLI()->getValueType(DL, SrcTy);
    TargetLowering::LegalizeAction LA =
        getTLI()->getOperationAction(ISD::STORE, VT);
    if (getTLI()->isTypeLegal(VT) &&
        (LA == TargetLowering::Legal || LA == TargetLowering::Custom))
      return true;

    auto LT = getTLI()->getTypeLegalizationCost(DL, SrcTy);
    LA = getTLI()->getTruncStoreAction(LT.second, VT);
    return LA == TargetLowering::Legal || LA == TargetLowering::Custom;
  };
  while (VF > 2 && IsSupportedByTarget(VF))
    VF /= 2;
  return VF;
}

Would that work for the cases you are interested in? The target can override it in any case, so at least it is controllable. But if that works for your use-cases it should hopefully match the target lowering a little better.

Thanks for the update, this looks better to me. The perf results I have were looking OK too, except for one fp16 case where it was choosing to use v2f16 vectorization. It could be OK, it's essentially trading unrolling vs vectorizing and one isn't obviously better or worse than the other. But small vectorization factors can be difficult at times.

I think the problem is that getOperationAction will get the data from these OpActions, which will all be initialize to 0 (=Legal) and targets do not usually overrides that for illegal types:
https://github.com/llvm/llvm-project/blob/a9d68a5524dea113cace5983697786599cbdce9a/llvm/lib/CodeGen/TargetLoweringBase.cpp#L733

So it can pick up "Legal" stores just because the default initialization and the target has never had to set them to anything else in the past. There are TruncStoreActions that should be set though. What do you think of using something like this, based on whether there is a legal trunk store?

unsigned getStoreMinimumVF(unsigned VF, Type *ScalarTy) const {
  auto &&IsSupportedByTarget = [this, ScalarTy](unsigned VF) {
    auto *SrcTy = FixedVectorType::get(ScalarTy, VF / 2);
    EVT VT = getTLI()->getValueType(DL, SrcTy);
    TargetLowering::LegalizeAction LA =
        getTLI()->getOperationAction(ISD::STORE, VT);
    if (getTLI()->isTypeLegal(VT) &&
        (LA == TargetLowering::Legal || LA == TargetLowering::Custom))
      return true;

    auto LT = getTLI()->getTypeLegalizationCost(DL, SrcTy);
    LA = getTLI()->getTruncStoreAction(LT.second, VT);
    return LA == TargetLowering::Legal || LA == TargetLowering::Custom;
  };
  while (VF > 2 && IsSupportedByTarget(VF))
    VF /= 2;
  return VF;
}

Would that work for the cases you are interested in? The target can override it in any case, so at least it is controllable. But if that works for your use-cases it should hopefully match the target lowering a little better.

I tried something similar already, it won't work. Plus, trunc store is not the case we're looking at here, it is different. This function just says to vectorizer that it might be worth trying this vector factor. The cost model should later inform that it is not profitable. If something is not correct in the TTI, it should be fixed in TTI.

If something is not correct in the TTI, it should be fixed in TTI.

Strong +1

But that issue should not be a blocker IMHO.

I tried something similar already, it won't work. Plus, trunc store is not the case we're looking at here, it is different. This function just says to vectorizer that it might be worth trying this vector factor. The cost model should later inform that it is not profitable. If something is not correct in the TTI, it should be fixed in TTI.

Oh OK, that's a shame. There may be something a little off with the f16 costmodel, it is not always perfect, but I don't see anything obvious from what it is printing. There are only v2f16 values, which can't go too wrong.
The issue isn't that SLP vectorization is worse than scalar, it's that in that particular case runtime unrolling is better. The SLP that happens can get in the way of something more profitable, and no pass in llvm operates in a vacuum.

That issue isn't too important though. My worry is that this is currently an expensive way of saying "return 2". I've no strong objection if you want to go with the current method, but perhaps the default should be more "correct" and we can override the targets that want something different/more aggressive? They can choose to spend the extra compile time on factors that might not be expected to be very profitable to other archs.

rebase? I'm not sure if rGc5e875f599c25c2ea5a5c3dc6396de17c0c80a45 will have changed due to this patch

I'm seeing some fairly big regressions with this patch, specifically on Rome (AMD) architecture.
A couple of examples that are public in the test suite: SingleSource/Benchmarks/Shootout: for sieve I'm seeing a 20% performance regression in an opt build and an xfdo one, and for MicroBenchmarks/ImageProcessing/Dither 10% regression (opt, thinlto and xfdo).
I'm seeing also a couple on Skylake, opt build, in the range of 5-13 %, an example being eigen with 13% regression; this may be harder to track down as it's in a specific configuration but let me know if you want to reproduce this one.

As far as performance improvements, I see a few on Skylake in the range of 3-6%. An example here is MicroBenchmarks/ImageProcessing/Blur, which ranges between 4-5% improvement.

Overall, the regressions outnumber the gains in the testing I've done so far and would likely block our compiler release.

I tried something similar already, it won't work. Plus, trunc store is not the case we're looking at here, it is different. This function just says to vectorizer that it might be worth trying this vector factor. The cost model should later inform that it is not profitable. If something is not correct in the TTI, it should be fixed in TTI.

Oh OK, that's a shame. There may be something a little off with the f16 costmodel, it is not always perfect, but I don't see anything obvious from what it is printing. There are only v2f16 values, which can't go too wrong.
The issue isn't that SLP vectorization is worse than scalar, it's that in that particular case runtime unrolling is better. The SLP that happens can get in the way of something more profitable, and no pass in llvm operates in a vacuum.

That issue isn't too important though. My worry is that this is currently an expensive way of saying "return 2". I've no strong objection if you want to go with the current method, but perhaps the default should be more "correct" and we can override the targets that want something different/more aggressive? They can choose to spend the extra compile time on factors that might not be expected to be very profitable to other archs.

I'm seeing some fairly big regressions with this patch, specifically on Rome (AMD) architecture.
A couple of examples that are public in the test suite: SingleSource/Benchmarks/Shootout: for sieve I'm seeing a 20% performance regression in an opt build and an xfdo one, and for MicroBenchmarks/ImageProcessing/Dither 10% regression (opt, thinlto and xfdo).
I'm seeing also a couple on Skylake, opt build, in the range of 5-13 %, an example being eigen with 13% regression; this may be harder to track down as it's in a specific configuration but let me know if you want to reproduce this one.

As far as performance improvements, I see a few on Skylake in the range of 3-6%. An example here is MicroBenchmarks/ImageProcessing/Blur, which ranges between 4-5% improvement.

Overall, the regressions outnumber the gains in the testing I've done so far and would likely block our compiler release.

I'm trying to improve it. But if we have perf regressions, there is something wrong with the cost model.

ABataev updated this revision to Diff 427012.May 4 2022, 8:02 AM

Reworked initial implementation to be more conservative. Also, now it is able to handle trunc stores.

@asbirlea Do you have an update on the regressions you were seeing vs latest patch?

@asbirlea Do you have an update on the regressions you were seeing vs latest patch?

Performance testing still ongoing, should be completed by tomorrow.

I ran the same set of benchmarks again without issue this time (well, there was an issue, but it turned out that someone had changed the benchmark sources :) ). They might not be the most amazing SLP tests, but no remaining objections from me.

(The reason I suggested the truncate code the way I did was because by default the legalizing rules for smaller than legal power2 types is to promote integers to larger sizes. So under MVE where we only have 128bit vectors, a v4i8 vector will be promoted to a v4i32. We would then need a v4i32->v4i8 truncstore for it to be legal. Which it does have! So it would allow some of the smaller than legal types to be vectorized, essentially treating the v4i8 operations as v4i32's. If we are going for more conservative it might make sense to ignore that though, and float types will always widen as opposed to promote by default.)

@asbirlea How are you specifying the SSE/AVX level for your benchmark runs - are you running with -march=native the x86-64-v* levels or something else?

@asbirlea How are you specifying the SSE/AVX level for your benchmark runs - are you running with -march=native the x86-64-v* levels or something else?

I believe the runs are using -target-cpu k8 and -target-cpu haswell.

The latest performance testing still shows one regression in a benchmark from singlesource, but there are many improvements that offset it in non-public benchmarks. So this latest diff is good to go from my side.

LG. Thanks for many perf improvements.

@RKSimon ?

Please can you rebase? I added a number of PR tests yesterday and I'm curious how many improve.

ABataev added a comment.EditedMay 9 2022, 3:08 AM

@asbirlea How are you specifying the SSE/AVX level for your benchmark runs - are you running with -march=native the x86-64-v* levels or something else?

I believe the runs are using -target-cpu k8 and -target-cpu haswell.

The latest performance testing still shows one regression in a benchmark from singlesource, but there are many improvements that offset it in non-public benchmarks. So this latest diff is good to go from my side.

Hi, thanks for the testing. What's the name of the regressed single source test?

Please can you rebase? I added a number of PR tests yesterday and I'm curious how many improve.

Sure, will do later today

RKSimon accepted this revision.May 9 2022, 8:56 AM

LGTM - naturally the test-suite regression needs further investigation but I think that can be performed post-commit

llvm/test/Transforms/SLPVectorizer/X86/arith-mul-load.ll
14 ↗(On Diff #428098)

I'll investigate adding 32-bit vector load/store handling as well (it has the same costs as the codegen for 64-bit anyhow).

This revision is now accepted and ready to land.May 9 2022, 8:56 AM
ABataev added inline comments.May 9 2022, 9:03 AM
llvm/test/Transforms/SLPVectorizer/X86/arith-mul-load.ll
14 ↗(On Diff #428098)

TTI does not report that it supports 32 bit stores.

RKSimon added inline comments.May 9 2022, 9:22 AM
llvm/test/Transforms/SLPVectorizer/X86/arith-mul-load.ll
14 ↗(On Diff #428098)

We never bothered to add it - we mainly use the 64-bit vector load/store to handle f64-i64 handling on 32-bit targets

This revision was landed with ongoing or failed builds.May 9 2022, 9:49 AM
This revision was automatically updated to reflect the committed changes.

@asbirlea How are you specifying the SSE/AVX level for your benchmark runs - are you running with -march=native the x86-64-v* levels or something else?

I believe the runs are using -target-cpu k8 and -target-cpu haswell.

The latest performance testing still shows one regression in a benchmark from singlesource, but there are many improvements that offset it in non-public benchmarks. So this latest diff is good to go from my side.

Hi, thanks for the testing. What's the name of the regressed single source test?

Shootout/sieve for xfdo configuration on Rome looks regressed by 20%, and Shootout/fib2 for opt configuration also on Rome by 10%.
Dither has one -10% (floyd 128) and one +10% (floyd 512) on Rome opt and xfdo, but the same floyd ones plus a few others are +15 to +20% on Skylake opt, thinlto and xfdo and +5% to +10% on Haswell opt, thinlto and xfdo. Net win overall.
Eigen, complex benchmark also has many configs with net wins, e.g. +10% to +30% haswell xfdo and +10 to +25% rome opt.

There are others, but I hope this gives a rough idea.

vdmitrie added inline comments.
llvm/include/llvm/CodeGen/BasicTTIImpl.h
330

Hm, there is some discrepancy here.
Let's assume we entered the loop with VF = 8 and target supports 8 but not 4.

VF == 8 , VF >2 && IsSupportedByTarget(8) evaluates to true => VF = VF/2
VF == 4 , VF >2 && IsSupportedByTarget(4) evaluates to false

hence we return 4 which is actually not supported by target. Is that what the intent was?

ABataev added inline comments.May 9 2022, 7:01 PM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
330

IsSupportedByTarget performs checks for VF/2, not for VF, and only if it is supported, sets VF to VF/2.

vdmitrie added inline comments.May 9 2022, 7:04 PM
llvm/include/llvm/CodeGen/BasicTTIImpl.h
330

Ah, I overlooked that. Thanks.

xbolva00 added inline comments.May 13 2022, 5:26 AM
llvm/test/Transforms/SLPVectorizer/X86/arith-mul-load.ll
14 ↗(On Diff #428098)

Do you plan to add them?

RKSimon added inline comments.May 13 2022, 5:31 AM
llvm/test/Transforms/SLPVectorizer/X86/arith-mul-load.ll
14 ↗(On Diff #428098)

yes - got a few other blockers to deal with first though - that yak has to be shaved.......

anna added a subscriber: anna.May 17 2022, 8:35 AM
xbolva00 added inline comments.May 29 2022, 10:13 AM
llvm/test/Transforms/SLPVectorizer/X86/arith-mul-load.ll
14 ↗(On Diff #428098)

ok, thanks!

xbolva00 added inline comments.Jun 12 2022, 9:45 AM
llvm/test/Transforms/SLPVectorizer/X86/arith-mul-load.ll
14 ↗(On Diff #428098)

any updates?

void pr(char* __restrict a, char* __restrict r){
    for (int i = 0; i < 4; i++){
        r[i] += a[i];
    }
}

gcc emits nicely paddb.

RKSimon added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/arith-mul-load.ll
14 ↗(On Diff #428098)

https://reviews.llvm.org/D127604 - but I need someone to perf test the patch properly.