Page MenuHomePhabricator

[SLP]Improve reductions vectorization.
Needs ReviewPublic

Authored by ABataev on Oct 11 2021, 1:07 PM.

Details

Summary

The pattern matching and vectgorization for reductions was not very
effective. Some of of the possible reduction values were marked as
external arguments, SLP could not find some reduction patterns because
of too early attempt to vectorize pair of binops arguments, the cost of
consts reductions was not correct. Patch addresses these issues and
improves the analysis/cost estimation and vectorization of the
reductions.

The most significant changes in SLP.NumVectorInstructions:

Metric: SLP.NumVectorInstructions [140/14396]

Program results results0 diff

         test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test   920.00  3548.00 285.7%
          test-suite :: SingleSource/Benchmarks/BenchmarkGame/n-body.test    66.00   122.00  84.8%
test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test   100.00   128.00  28.0%

test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test 664.00 810.00 22.0%

               test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test   592.00   687.00  16.0%
test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test   402.00   426.00   6.0%
                 test-suite :: MultiSource/Applications/JM/lencod/lencod.test  1665.00  1745.00   4.8%
test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test   135.00   139.00   3.0%

test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 135.00 139.00 3.0%

              test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test   388.00   397.00   2.3%
               test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   895.00   914.00   2.1%
test-suite :: MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm.test   240.00   244.00   1.7%
       test-suite :: MultiSource/Benchmarks/mediabench/gsm/toast/toast.test   240.00   244.00   1.7%
         test-suite :: External/SPEC/CINT2017speed/602.gcc_s/602.gcc_s.test   820.00   832.00   1.5%
          test-suite :: External/SPEC/CINT2017rate/502.gcc_r/502.gcc_r.test   820.00   832.00   1.5%
   test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 14804.00 14914.00   0.7%
                    test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  8125.00  8183.00   0.7%
       test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  1330.00  1338.00   0.6%
        test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  1330.00  1338.00   0.6%
     test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  9832.00  9880.00   0.5%
     test-suite :: External/SPEC/CFP2017rate/511.povray_r/511.povray_r.test  5267.00  5291.00   0.5%
   test-suite :: External/SPEC/CFP2017rate/538.imagick_r/538.imagick_r.test  4018.00  4024.00   0.1%
  test-suite :: External/SPEC/CFP2017speed/638.imagick_s/638.imagick_s.test  4018.00  4024.00   0.1%
          test-suite :: External/SPEC/CFP2017speed/644.nab_s/644.nab_s.test   426.00   424.00  -0.5%
           test-suite :: External/SPEC/CFP2017rate/544.nab_r/544.nab_r.test   426.00   424.00  -0.5%
      test-suite :: External/SPEC/CINT2017rate/541.leela_r/541.leela_r.test   201.00   192.00  -4.5%
     test-suite :: External/SPEC/CINT2017speed/641.leela_s/641.leela_s.test   201.00   192.00  -4.5%

644.nab_s and 544.nab_r - reduced number of shuffles but increased number
of useful vectorized instructions.

641.leela_s and 541.leela_r - the function
@_ZN9FastBoard25get_pattern3_augment_specEiib is not inlined anymore
but its body gets vectorized successfully. Before, the function was
inlined twice and vectorized just after inlining, currently it is not
required. The vector code looks pretty similar, just like as it was before.

Diff Detail

Unit TestsFailed

TimeTest
40 msx64 debian > Polly.CodeGen::aliasing_different_pointer_types.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -polly-process-unprofitable -polly-remarks-minimal -polly-use-llvm-names -polly-import-jscop-dir=/var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen -polly-codegen-verify -polly-codegen -S < /var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen/aliasing_different_pointer_types.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen/aliasing_different_pointer_types.ll
40 msx64 debian > Polly.CodeGen::aliasing_parametric_simple_1.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -polly-process-unprofitable -polly-remarks-minimal -polly-use-llvm-names -polly-import-jscop-dir=/var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen -polly-codegen-verify -polly-codegen -S < /var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen/aliasing_parametric_simple_1.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen/aliasing_parametric_simple_1.ll
30 msx64 debian > Polly.CodeGen::aliasing_parametric_simple_2.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -polly-process-unprofitable -polly-remarks-minimal -polly-use-llvm-names -polly-import-jscop-dir=/var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen -polly-codegen-verify -polly-codegen -S < /var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen/aliasing_parametric_simple_2.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen/aliasing_parametric_simple_2.ll
80 msx64 debian > Polly.CodeGen::exprModDiv.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -polly-process-unprofitable -polly-remarks-minimal -polly-use-llvm-names -polly-import-jscop-dir=/var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen -polly-codegen-verify -polly-import-jscop -polly-codegen -S < /var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen/exprModDiv.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen/exprModDiv.ll
90 msx64 debian > Polly.CodeGen::invariant_load_base_pointer_conditional_2.ll
Script: -- : 'RUN: at line 1'; /var/lib/buildkite-agent/builds/llvm-project/build/bin/opt -polly-process-unprofitable -polly-remarks-minimal -polly-use-llvm-names -polly-import-jscop-dir=/var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen -polly-codegen-verify -analyze -polly-scops -polly-invariant-load-hoisting=true < /var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen/invariant_load_base_pointer_conditional_2.ll | /var/lib/buildkite-agent/builds/llvm-project/build/bin/FileCheck /var/lib/buildkite-agent/builds/llvm-project/polly/test/CodeGen/invariant_load_base_pointer_conditional_2.ll
View Full Test Results (21 Failed)

Event Timeline

ABataev created this revision.Oct 11 2021, 1:07 PM
ABataev requested review of this revision.Oct 11 2021, 1:07 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 11 2021, 1:07 PM

Some minor comments, but its a heavy patch to review tbh...

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3886–3888

Can this 2 iteration for-loop be simplified/split ?

5045

Can you transfer some of this explanation up to AreVectorizableGathers - its not very obvious what is being checked there.

5487

precommit this?

8463

vectorization

8477

Split this to stop clang-format heroics?

8485

Split this to stop clang-format heroics?

8490

Split this to stop clang-format heroics?

8606

Can you use another variable name to avoid Wshadow?

8802

for-range loop?

8805

for-range loop?

Some minor comments, but its a heavy patch to review tbh...

I understand. I'll try to split it.
The main idea behind this patch is to improve the reduction vectorization process. Currently, it SLPVectorizer gathers 3 kinds of instructions:
reduction operations (those with the root RdxKind kind), reduction values (the very first non-RdxKind instruction with the same opcodes) and extra args (instructions with different parents, non-RdxKind or non-reduced value opcode, etc.). At first, it complicates the reduction analysis (some of the potential reduction operands may transform into extra args, because their operands are also extra args). Also, it throws away some potential beneficial reductions, like constant, reductions with the repeated values, reductions with same/alternate opcodes.
Patch simplifies the reduction analysis process (we just do simple BFS in the operand order), gathers all potential reduced values (without checking for reduced value opcode) and extra args (without any extra transformations, we can detect such args immediately). Then it sorts potential reduction values by their value/instruction opcodes (same and/or alternate ones too) and then it tries to generate the reduction for all these potentially reduced values/instructions.
Also, it changes the order of reductions/args vectorization attempts. At first, we need to find the reductions and only if there are no reductions, try to vectorize args of the binops.
Also, it tries to generate the final scalar code for the non-reduced/extra args in the most optimal way, to avoid some extra dependency between the last scalar instructions to allow the CPU to schedule more instructions to be executed independently.
That's the first patch in the series. I have another one, which should add support for reduction operations with many uses, it may help to vectorize something like this:

bool Res = false;
for (int i =0; i < 15; ++i) {
  bool Cmp = a[i] < a[i+1];
  int min = Cmp ? a[i] : a[i+1]
  Res |= Cmp;
}

and similar patterns I saw in real user code.

ABataev updated this revision to Diff 382689.Oct 27 2021, 9:06 AM

Address comments

vporpo added a subscriber: vporpo.Thu, Nov 11, 7:57 PM