This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Introduce split shuffle vectorization mode.
Needs Review (Public)

Authored by ABataev on Dec 28 2021, 1:37 PM.

Details

Summary

Introduce a split shuffle node kind. If a node would otherwise be
gathered, the compiler tries to count the potentially vectorizable
scalar instructions in the whole node. If more than number / 2 of the
instructions are compatible, the whole node is split into 2 subnodes
(number / 2 elements each), which are vectorized separately and then
reshuffled into the resulting vector.
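
A minimal sketch of the decision described above, as a toy model rather than the patch's actual code. `Opcode`, `countCompatible`, `SplitNode` and `trySplit` are hypothetical names, "compatible" is approximated as "same opcode", and taking the lower and upper halves as the two subnodes is an assumption, since the summary does not say how lanes are assigned:

```cpp
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a scalar instruction: only the opcode matters here.
using Opcode = std::string;

// Size of the largest group of same-opcode scalars in the node.
// Assumption: "compatible" is modelled as "has the same opcode".
std::size_t countCompatible(const std::vector<Opcode> &Node) {
  std::map<Opcode, std::size_t> Freq;
  std::size_t Best = 0;
  for (const Opcode &Op : Node) {
    const std::size_t Count = ++Freq[Op];
    if (Count > Best)
      Best = Count;
  }
  return Best;
}

// Result of splitting a would-be gather node into two half-width subnodes.
struct SplitNode {
  std::vector<Opcode> Lo;  // first number / 2 scalars
  std::vector<Opcode> Hi;  // last number / 2 scalars
  std::vector<int> Mask;   // lane i of the result comes from lane Mask[i]
                           // of the concatenation (Lo, Hi)
};

// Split only if more than number / 2 scalars are compatible;
// otherwise the node stays a plain gather (std::nullopt).
std::optional<SplitNode> trySplit(const std::vector<Opcode> &Node) {
  const std::size_t N = Node.size();
  if (N < 2 || N % 2 != 0)
    return std::nullopt;
  if (countCompatible(Node) <= N / 2)
    return std::nullopt;

  SplitNode S;
  S.Lo.assign(Node.begin(), Node.begin() + N / 2);
  S.Hi.assign(Node.begin() + N / 2, Node.end());
  // With positional halves the final reshuffle is a plain concatenation.
  for (std::size_t I = 0; I < N; ++I)
    S.Mask.push_back(static_cast<int>(I));
  return S;
}
```

In this model, `{add, add, add, sub}` is split (3 > 4 / 2), while `{add, add, sub, mul}` stays a gather (2 is not greater than 4 / 2).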

Metric: SLP.NumVectorInstructions

Program results results0 diff

                  test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test    3.00    7.00  133.3%
                 test-suite :: MultiSource/Benchmarks/Prolangs-C/assembler/assembler.test    3.00    6.00  100.0%
                          test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test 1202.00 1627.00   35.4%
           test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  149.00  201.00   34.9%
            test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  149.00  201.00   34.9%
          test-suite :: MultiSource/Benchmarks/mediabench/mpeg2/mpeg2dec/mpeg2decode.test   37.00   46.00   24.3%
            test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test  256.00  282.00   10.2%
                     test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1624.00 1781.00    9.7%
                      test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1624.00 1781.00    9.7%
                         test-suite :: External/SPEC/CINT2017speed/657.xz_s/657.xz_s.test  115.00  126.00    9.6%
                          test-suite :: External/SPEC/CINT2017rate/557.xz_r/557.xz_r.test  115.00  126.00    9.6%
                 test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 8737.00 9558.00    9.4%
                             test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 1051.00 1143.00    8.8%
                           test-suite :: MultiSource/Benchmarks/VersaBench/dbms/dbms.test   25.00   27.00    8.0%
                              test-suite :: SingleSource/Benchmarks/Misc/ReedSolomon.test   14.00   15.00    7.1%
                              test-suite :: MultiSource/Applications/sqlite3/sqlite3.test   30.00   32.00    6.7%
                             test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1671.00 1772.00    6.0%
                 test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3565.00 3773.00    5.8%
                test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3565.00 3773.00    5.8%
                                  test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 4527.00 4772.00    5.4%
                   test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4240.00 4464.00    5.3%
                          test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1996.00 2100.00    5.2%
                            test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  421.00  440.00    4.5%
                              test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test  299.00  311.00    4.0%
                               test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test   35.00   36.00    2.9%
                              test-suite :: MultiSource/Applications/SIBsim4/SIBsim4.test   39.00   40.00    2.6%
                                      test-suite :: MultiSource/Applications/hbd/hbd.test   41.00   42.00    2.4%
                  test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test  260.00  265.00    1.9%
                   test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 9100.00 9220.00    1.3%
                       test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 4573.00 4598.00    0.5%
                    test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test  709.00  711.00    0.3%
           test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 5623.00 5638.00    0.3%
            test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 5623.00 5638.00    0.3%
            test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  383.00  384.00    0.3%

      test-suite :: MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test  624.00  623.00   -0.2%
                           test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test  571.00  570.00   -0.2%
                       test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  854.00  851.00   -0.4%
                        test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  854.00  851.00   -0.4%
                test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   65.00   64.00   -1.5%
           test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  783.00  767.00   -2.0%
                     test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test  235.00  230.00   -2.1%
              test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test  235.00  230.00   -2.1%
                   test-suite :: MultiSource/Benchmarks/Prolangs-C/football/football.test   38.00   35.00   -7.9%
                   test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test  207.00  188.00   -9.2%
                    test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test  207.00  188.00   -9.2%
                        test-suite :: MultiSource/Benchmarks/Ptrdist/anagram/anagram.test   20.00   18.00  -10.0%
            test-suite :: SingleSource/Benchmarks/Polybench/stencils/fdtd-2d/fdtd-2d.test   37.00   31.00  -16.2%
              test-suite :: MultiSource/Benchmarks/MiBench/security-sha/security-sha.test   13.00   10.00  -23.1%
    test-suite :: SingleSource/Benchmarks/Polybench/medley/floyd-warshall/floyd-warshall.test    8.00    6.00  -25.0%
        test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syrk/syrk.test    8.00    6.00  -25.0%
test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/cholesky/cholesky.test    8.00    6.00  -25.0%
        test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/gemm/gemm.test    8.00    6.00  -25.0%
      test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/syr2k/syr2k.test    8.00    6.00  -25.0%
  test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/doitgen/doitgen.test    8.00    6.00  -25.0%
test-suite :: SingleSource/Benchmarks/Polybench/stencils/jacobi-2d-imper/jacobi-2d-imper.test    8.00    6.00  -25.0%
            test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/solvers/lu/lu.test    8.00    6.00  -25.0%
          test-suite :: SingleSource/Benchmarks/Polybench/linear-algebra/kernels/mvt/mvt.test   16.00   12.00  -25.0%
test-suite :: SingleSource/Benchmarks/Polybench/stencils/jacobi-1d-imper/jacobi-1d-imper.test    8.00    6.00  -25.0%
            test-suite :: SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d.test    8.00    6.00  -25.0%
                                          test-suite :: MultiSource/Applications/aha/aha.test     NaN    4.00     NaN
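
The diff column in the table above is consistent with the relative change of the second value column against the first (for example, 3.00 to 7.00 gives 133.3%, and 8.00 to 6.00 gives -25.0%). A minimal sketch of that arithmetic follows; `relativeDiffPercent` is a hypothetical helper, not part of the test-suite tooling, and returning NaN for a missing or zero first value is an assumption modelled on the aha.test row:

```cpp
#include <cmath>
#include <limits>

// Relative change of New against Base, in percent.
// Assumption: a missing (NaN) or zero Base yields NaN, as in the aha.test row.
double relativeDiffPercent(double Base, double New) {
  if (std::isnan(Base) || std::isnan(New) || Base == 0.0)
    return std::numeric_limits<double>::quiet_NaN();
  return (New - Base) / Base * 100.0;
}
```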

MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test -
actually more vector instructions, fewer shuffles.
MultiSource/Benchmarks/mafft/pairlocalalign.test - fewer shuffles.
External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test - more vectorized code,
fewer shuffles.
External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test - same.
MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test - more
vectorized code, some inserts/shuffles were optimized.
MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test - more
vectorized code, but some previously vectorized code is no longer
vectorized; needs D116312.
MultiSource/Benchmarks/mediabench/gsm/toast/toast.test - fewer shuffles,
more vector code.
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test - same.
MultiSource/Benchmarks/Prolangs-C/football/football.test - needs
non-power-of-2 vectorization to improve further, but otherwise pretty much
the same.
External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test - more vector
instructions, fewer shuffles.
External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test - more vector
instructions, fewer shuffles.
MultiSource/Benchmarks/Ptrdist/anagram/anagram.test - fewer shuffles.
All others - same.

Diff Detail

Unit Tests: Failed