This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Improve reductions vectorization.
ClosedPublic

Authored by ABataev on Oct 11 2021, 1:07 PM.

Details

Summary

The pattern matching and vectgorization for reductions was not very
effective. Some of of the possible reduction values were marked as
external arguments, SLP could not find some reduction patterns because
of too early attempt to vectorize pair of binops arguments, the cost of
consts reductions was not correct. Patch addresses these issues and
improves the analysis/cost estimation and vectorization of the
reductions.

The most significant changes in SLP.NumVectorInstructions:

Metric: SLP.NumVectorInstructions [140/14396]

Program results results0 diff

         test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   920.00  3548.00 285.7%
          test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test    66.00   122.00  84.8%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   100.00   128.00  28.0%

test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 664.00 810.00 22.0%

               test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test   592.00   687.00  16.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   402.00   426.00   6.0%
                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test  1665.00  1745.00   4.8%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test   135.00   139.00   3.0%

test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 135.00 139.00 3.0%

              test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test   388.00   397.00   2.3%
               test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   895.00   914.00   2.1%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   240.00   244.00   1.7%
       test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   240.00   244.00   1.7%
         test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   820.00   832.00   1.5%
          test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   820.00   832.00   1.5%
   test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 14804.00 14914.00   0.7%
                    test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  8125.00  8183.00   0.7%
       test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  1330.00  1338.00   0.6%
        test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  1330.00  1338.00   0.6%
     test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  9832.00  9880.00   0.5%
     test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  5267.00  5291.00   0.5%
   test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  4018.00  4024.00   0.1%
  test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  4018.00  4024.00   0.1%
          test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   426.00   424.00  -0.5%
           test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   426.00   424.00  -0.5%
      test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test   201.00   192.00  -4.5%
     test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test   201.00   192.00  -4.5%

644.nab_s and 544.nab_r - reduced number of shuffles but increased number
of useful vectorized instructions.

641.leela_s and 541.leela_r - the function
@_ZN9FastBoard25get_pattern3_augment_specEiib is not inlined anymore
but its body gets vectorized successfully. Before, the function was
inlined twice and vectorized just after inlining, currently it is not
required. The vector code looks pretty similar, just like as it was before.

Diff Detail

Event Timeline

ABataev created this revision.Oct 11 2021, 1:07 PM
ABataev requested review of this revision.Oct 11 2021, 1:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 11 2021, 1:07 PM

Some minor comments, but its a heavy patch to review tbh...

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3883

Can this 2 iteration for-loop be simplified/split ?

5040

Can you transfer some of this explanation up to AreVectorizableGathers - its not very obvious what is being checked there.

5482

precommit this?

8458

vectorization

8472

Split this to stop clang-format heroics?

8480

Split this to stop clang-format heroics?

8485

Split this to stop clang-format heroics?

8610

Can you use another variable name to avoid Wshadow?

8805

for-range loop?

8808

for-range loop?

Some minor comments, but its a heavy patch to review tbh...

I understand. I'll try to split it.
The main idea behind this patch is to improve the reduction vectorization process. Currently, it SLPVectorizer gathers 3 kinds of instructions:
reduction operations (those with the root RdxKind kind), reduction values (the very first non-RdxKind instruction with the same opcodes) and extra args (instructions with different parents, non-RdxKind or non-reduced value opcode, etc.). At first, it complicates the reduction analysis (some of the potential reduction operands may transform into extra args, because their operands are also extra args). Also, it throws away some potential beneficial reductions, like constant, reductions with the repeated values, reductions with same/alternate opcodes.
Patch simplifies the reduction analysis process (we just do simple BFS in the operand order), gathers all potential reduced values (without checking for reduced value opcode) and extra args (without any extra transformations, we can detect such args immediately). Then it sorts potential reduction values by their value/instruction opcodes (same and/or alternate ones too) and then it tries to generate the reduction for all these potentially reduced values/instructions.
Also, it changes the order of reductions/args vectorization attempts. At first, we need to find the reductions and only if there are no reductions, try to vectorize args of the binops.
Also, it tries to generate the final scalar code for the non-reduced/extra args in the most optimal way, to avoid some extra dependency between the last scalar instructions to allow the CPU to schedule more instructions to be executed independently.
That's the first patch in the series. I have another one, which should add support for reduction operations with many uses, it may help to vectorize something like this:

bool Res = false;
for (int i =0; i < 15; ++i) {
  bool Cmp = a[i] < a[i+1];
  int min = Cmp ? a[i] : a[i+1]
  Res |= Cmp;
}

and similar patterns I saw in real user code.

ABataev updated this revision to Diff 382689.Oct 27 2021, 9:06 AM

Address comments

vporpo added a subscriber: vporpo.Nov 11 2021, 7:57 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 17 2022, 8:32 AM
Herald added a subscriber: kosarev. · View Herald Transcript
ABataev updated this revision to Diff 430080.May 17 2022, 8:39 AM

Use range-based for

RKSimon accepted this revision.May 18 2022, 3:35 AM

LGTM

llvm/test/Transforms/SLPVectorizer/X86/bool-mask.ll
3 ↗(On Diff #430080)

You might be able to add a common SSE check prefix?

This revision is now accepted and ready to land.May 18 2022, 3:35 AM
This revision was landed with ongoing or failed builds.May 18 2022, 1:24 PM
This revision was automatically updated to reflect the committed changes.
ABataev marked an inline comment as done.May 18 2022, 1:25 PM
anna added a subscriber: anna.May 31 2022, 11:44 AM

@ABataev, we are getting an assertion failure downstream in setInsertPointAfterBundle which is bisected to this patch and reproducible with it.

I have created a PR here with my analysis along with an upstream reproducer: https://github.com/llvm/llvm-project/issues/55796
Here's the (potential) fix for review: https://reviews.llvm.org/D126713.

If this is not the correct fix, can we please revert this patch and fix the issue?

@ABataev, we are getting an assertion failure downstream in setInsertPointAfterBundle which is bisected to this patch and reproducible with it.

I have created a PR here with my analysis along with an upstream reproducer: https://github.com/llvm/llvm-project/issues/55796
Here's the (potential) fix for review: https://reviews.llvm.org/D126713.

If this is not the correct fix, can we please revert this patch and fix the issue?

I saw it. Was waiting for a patch from you. If you don't have much time, I can prepare a small fix for this issue.