This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Improve vectorization of cmp instructions sequences.
Closed, Public

Authored by ABataev on Nov 30 2021, 6:38 AM.

Details

Summary

A final attempt to vectorize bundles of compatible cmp instructions after
all other instructions have been processed.
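Not part of the original summary, just a hand-written sketch of the kind of sequence this targets (an assumed shape, not taken from the patch's tests): two icmps with the same predicate and operand type form a compatible bundle that SLP may rewrite as a single <2 x i32> icmp plus extracts, if the cost model finds it profitable.

  define i1 @compatible_cmps(i32 %a0, i32 %b0, i32 %a1, i32 %b1) {
    ; Same predicate (slt) and same operand type on both compares: a compatible bundle.
    %c0 = icmp slt i32 %a0, %b0
    %c1 = icmp slt i32 %a1, %b1
    %r  = and i1 %c0, %c1
    ret i1 %r
  }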

Metric: SLP.NumVectorInstructions

Program results results0 diff

   test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    1.00    5.00  400.0%
                         test-suite :: MultiSource/Benchmarks/PAQ8p/paq8p.test    8.00   11.00   37.5%
               test-suite :: MultiSource/Benchmarks/Olden/voronoi/voronoi.test   20.00   26.00   30.0%
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 1344.00 1648.00   22.6%
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 1344.00 1648.00   22.6%
                         test-suite :: MultiSource/Benchmarks/Olden/bh/bh.test  102.00  124.00   21.6%
           test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/CoMD/CoMD.test  118.00  133.00   12.7%
     test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test 3233.00 3554.00    9.9%
      test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test 3233.00 3554.00    9.9%
                   test-suite :: MultiSource/Benchmarks/Olden/power/power.test   64.00   70.00    9.4%
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 7879.00 8604.00    9.2%
      test-suite :: MultiSource/Benchmarks/Prolangs-C/simulator/simulator.test   50.00   54.00    8.0%
                   test-suite :: MultiSource/Applications/sqlite3/sqlite3.test   27.00   29.00    7.4%
        test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 8345.00 8955.00    7.3%
test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test  694.00  738.00    6.3%
                   test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test  361.00  382.00    5.8%
                 test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test  409.00  430.00    5.1%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  140.00  147.00    5.0%
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  140.00  147.00    5.0%
        test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test 4013.00 4206.00    4.8%
                  test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  966.00 1011.00    4.7%
                      test-suite :: SingleSource/Benchmarks/Misc/oourafft.test   65.00   68.00    4.6%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test 4219.00 4381.00    3.8%
               test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1911.00 1973.00    3.2%
 test-suite :: External/SPEC/CINT2017rate/531.deepsjeng_r/531.deepsjeng_r.test   62.00   64.00    3.2%
test-suite :: External/SPEC/CINT2017speed/631.deepsjeng_s/631.deepsjeng_s.test   62.00   64.00    3.2%
            test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test  852.00  877.00    2.9%
             test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test  852.00  877.00    2.9%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test 1624.00 1668.00    2.7%
                     test-suite :: MultiSource/Benchmarks/McCat/18-imp/imp.test   39.00   40.00    2.6%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset.test  613.00  624.00    1.8%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  378.00  383.00    1.3%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg.test  293.00  295.00    0.7%
       test-suite :: MultiSource/Benchmarks/mediabench/jpeg/jpeg-6a/cjpeg.test  297.00  299.00    0.7%
 test-suite :: External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 5522.00 5534.00    0.2%
test-suite :: External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 5522.00 5534.00    0.2%

Diff Detail

Event Timeline

ABataev created this revision.Nov 30 2021, 6:38 AM
ABataev requested review of this revision.Nov 30 2021, 6:38 AM
Herald added a project: Restricted Project.Nov 30 2021, 6:38 AM
RKSimon requested changes to this revision.Nov 30 2021, 8:32 AM
RKSimon added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9194–9195

Please can you do the NFC move of this down to its new location and then rebase so the diffs are more evident?

This revision now requires changes to proceed.Nov 30 2021, 8:32 AM
ABataev added inline comments.Nov 30 2021, 8:33 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9194–9195

Ok

RKSimon added inline comments.Nov 30 2021, 10:42 AM
llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll
111

Any idea what happened here?

ABataev added inline comments.Nov 30 2021, 11:00 AM
llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll
111

The pair of icmp slt instructions gets vectorized because the cost model decided it is profitable. We can't handle these selects as a reduction because the icmp instructions have different predicates.
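To make that concrete, here is a hand-written sketch of the situation as described (this is not the actual IR at line 111 of reduction-logical.ll): the two slt compares form a profitable pair, but the third compare uses a different predicate, so the whole select chain is not treated as one reduction.

  define i1 @mixed_predicates(i32 %a, i32 %b, i32 %c, i32 %d, i32 %e, i32 %f) {
    ; %c0/%c1 share the slt predicate and can be vectorized as a pair.
    %c0 = icmp slt i32 %a, %b
    %c1 = icmp slt i32 %c, %d
    ; %c2 uses a different predicate, so it cannot join the same bundle.
    %c2 = icmp ult i32 %e, %f
    ; Logical-and chain expressed as selects, in the style of reduction-logical.ll.
    %s0 = select i1 %c0, i1 %c1, i1 false
    %s1 = select i1 %s0, i1 %c2, i1 false
    ret i1 %s1
  }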

RKSimon added inline comments.Nov 30 2021, 2:46 PM
llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll
111

But the sgt didn't get swapped, and from the TODO at line 9298 I take it we can't use an altopcode/alternate predicate to handle the ult?

ABataev added inline comments.Nov 30 2021, 3:03 PM
llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll
111

Sorry, I don't quite understand your question. Why is sgt not vectorized with ult? They are not compatible. Why are sgt, ult and the 2 slts not vectorized together? They have 3 different predicates and are not profitable to vectorize (they end up with 2 gather nodes). The TODO is about extra analysis of the cmp operands, to select only cmps with same/alternate operands only (!) or constants only (!).
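A hand-written illustration of the operand issue (one reading of the TODO, not code from the patch): the first pair compares against constants, so one vector operand is just a constant vector, while the second pair compares four unrelated scalars, so both vector operands would have to be gathered with insertelement sequences, i.e. the 2 gather nodes mentioned above.

  define i1 @operand_kinds(i32 %x, i32 %y, i32 %a, i32 %b, i32 %c, i32 %d) {
    ; Constant right-hand sides: a vectorized form only needs <2 x i32> <i32 3, i32 7>.
    %c0 = icmp slt i32 %x, 3
    %c1 = icmp slt i32 %y, 7
    ; Four unrelated scalar operands: both vector operands must be gathered.
    %c2 = icmp slt i32 %a, %b
    %c3 = icmp slt i32 %c, %d
    %r0 = and i1 %c0, %c1
    %r1 = and i1 %c2, %c3
    %r  = and i1 %r0, %r1
    ret i1 %r
  }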

ABataev added inline comments.Nov 30 2021, 3:06 PM
llvm/test/Transforms/SLPVectorizer/X86/reduction-logical.ll
111

PS. The not-compatible case also ends up with 2 gather nodes.

RKSimon accepted this revision.Dec 1 2021, 2:41 AM

Thanks for the clarification, LGTM

This revision is now accepted and ready to land.Dec 1 2021, 2:41 AM
This revision was landed with ongoing or failed builds.Dec 1 2021, 7:27 AM
This revision was automatically updated to reflect the committed changes.
zhuhan0 added a subscriber: zhuhan0.Jan 4 2022, 2:10 PM

I think this change gives a performance win of about 6% on SPEC CPU 2017 508.namd_r. Run using --size train, showing running time.

With this change  Without this change     Diff
          52.439               55.842  -6.094%
zhuhan0 added a comment. Edited Jan 6 2022, 2:27 PM

Hi, we observed a ~9% increase in SLP.NumVectorInstructions on SPEC's 508.namd_r with this change, using llvm-test-suite. I noticed the number is not reported here. Curious, did you also see the same result? We tested on an Intel Skylake.

> Hi, we observed a ~9% increase in SLP.NumVectorInstructions on SPEC's 508.namd_r with this change, using llvm-test-suite. I noticed the number is not reported here. Curious, did you also see the same result? We tested on an Intel Skylake.

I tried with -march=native on Skylake, but did not see such results.

> I tried with -march=native on Skylake, but did not see such results.

I see. -march=native is the difference. I can reproduce the no-change result on namd with that flag. However, the number of vector instructions decreased significantly compared with removing that flag, which I found counter-intuitive.

With -march=native:

                           With this patch  Without this patch
SLP.NumVectorInstructions             4446                4446

Without -march=native:

                           With this patch  Without this patch
SLP.NumVectorInstructions             5939                5473

> I see. -march=native is the difference. I can reproduce the no-change result on namd with that flag. However, the number of vector instructions decreased significantly compared with removing that flag, which I found counter-intuitive.

Sorry, the number of vector instructions decreased or increased?

What I meant is that the number of vector instructions decreased after adding -march=native, compared with not adding it, with or without this patch. As for this patch, the number increased significantly in our test, which did not use -march=native. See the tables above for the numbers.

Ah, ok, got it. Actually, it is normal. Without -march=native you compile for a generic CPU, while -march=native compiles for Skylake with AVX support. That may cause different vectorization (differences in the cost model, different vector register sizes, etc.).
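To illustrate the point, a hand-written sketch (the feature list below is an assumption and abbreviated; clang -march=native emits a much longer one on a Skylake host): the SLP cost model reads the target from the function attributes, so the same IR is costed against Skylake/AVX2 here but against a generic x86-64 subtarget when the attributes are absent, and individual vectorization decisions (and thus SLP.NumVectorInstructions) can differ.

  define i1 @two_cmps(i32 %a, i32 %b, i32 %c, i32 %d) #0 {
    %c0 = icmp slt i32 %a, %b
    %c1 = icmp slt i32 %c, %d
    %r  = and i1 %c0, %c1
    ret i1 %r
  }
  ; Roughly what -march=native implies on a Skylake host (abbreviated, assumed list).
  attributes #0 = { "target-cpu"="skylake" "target-features"="+sse4.2,+avx,+avx2" }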