This is an archive of the discontinued LLVM Phabricator instance.

[SLP]Improve reductions analysis and emission, part 1.
Closed, Public

Authored by ABataev on Nov 18 2021, 9:29 AM.

Details

Summary

Currently the SLP vectorizer walks through the instructions and selects
3 main classes of values:

  1. reduction operations - instructions with the same reduction opcode (add, mul, min/max, etc.), which build the reduction,
  2. reduced values - instructions that share an opcode which differs from the reduction opcode,
  3. extra arguments - all other values: instructions from a basic block other than that of the root node, and instructions with too many or too few uses.

This scheme is not very efficient. It excludes some instructions and all
non-instruction values from the reductions (constants, profitable
gathers), and too many possibly reduced values are marked as extra arguments.
This patch improves the process by introducing a slightly extended analysis
stage. During this stage, we still try to select 3 classes of
values: 1) reduction operations - same as before, 2) possibly reduced
values - all instructions from the current block and all non-instructions
that may build a vectorization tree, 3) extra arguments - instructions from
different basic blocks. Additionally, the possibly reduced values are
sorted to build the scalar sequences that are most likely to be
vectorized, e.g. loads are grouped by the
distance between them, constants are grouped together, cmp instructions
are sorted by their compare types and predicates, extractelement
instructions are sorted by the vector operand, etc. These groups
are then reordered by length, so that the longest group comes first in the
list of possibly reduced values.

The vectorization process tries to emit reductions for all of these
groups. These reductions, the remaining non-vectorized possibly reduced
values, and the extra arguments are then combined into the final expression,
just as before.
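
The grouping and ordering step described above can be illustrated with a small standalone sketch. This is not the patch's code: the Candidate struct, its integer Key, and groupCandidates are hypothetical names, and the real analysis derives its keys from the instructions themselves (load distance, compare predicate, vector operand, and so on). The sketch only shows the shape of the idea, assuming each value already has a key: bucket values by key, then put the longest bucket first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a scalar value that may take part in a reduction.
struct Candidate {
  std::string Name; // e.g. "load a[0]"
  int Key;          // hypothetical grouping key: same kind of value, same key
};

// Buckets the candidates by key (keeping their original order inside each
// bucket), then orders the buckets so that the longest one comes first.
std::vector<std::vector<Candidate>>
groupCandidates(const std::vector<Candidate> &Values) {
  std::map<int, std::vector<Candidate>> ByKey;
  for (const Candidate &C : Values)
    ByKey[C.Key].push_back(C);

  std::vector<std::vector<Candidate>> Groups;
  for (auto &Entry : ByKey)
    Groups.push_back(std::move(Entry.second));

  // stable_sort keeps the key order among groups of equal length.
  std::stable_sort(Groups.begin(), Groups.end(),
                   [](const std::vector<Candidate> &A,
                      const std::vector<Candidate> &B) {
                     return A.size() > B.size();
                   });
  assert(Groups.empty() == Values.empty() && "groups mirror the input");
  return Groups;
}
```

With three loads, two constants and one compare as input, the load group would come first, so a vectorizer walking the list tries the most promising sequence before the shorter ones.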

Event Timeline

ABataev created this revision. Nov 18 2021, 9:29 AM
ABataev requested review of this revision. Nov 18 2021, 9:29 AM
Herald added a project: Restricted Project. Nov 18 2021, 9:29 AM
RKSimon edited the summary of this revision. Nov 22 2021, 3:20 AM

A few minors

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9279

This looks weird - as if we're reassigning different values to ExtraArgs[TreeN] - since we know Args.size() < 2 - maybe something like this:

if (!Args.empty())
  ExtraArgs[TreeN] = Args[0];

9430

for-range loop?

9591

Can we use a for-range loop here?

ABataev added inline comments. Jan 3 2022, 8:48 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
9279

Sure, will rewrite it.

9430

Ok

9591

Ok, will check the loops.

ABataev updated this revision to Diff 397675. Jan 5 2022, 12:26 PM

Address comments

ABataev updated this revision to Diff 405776. Feb 3 2022, 1:49 PM

Rebase + use generateKeySubkey from D116343

ABataev updated this revision to Diff 413885. Mar 8 2022, 11:20 AM

Rebase, ping

Herald added a project: Restricted Project. Mar 8 2022, 11:20 AM
RKSimon accepted this revision. Mar 14 2022, 6:28 AM

LGTM

This revision is now accepted and ready to land. Mar 14 2022, 6:28 AM
This revision was landed with ongoing or failed builds. Apr 12 2022, 6:08 PM
This revision was automatically updated to reflect the committed changes.
nikic added a subscriber: nikic. Apr 13 2022, 12:54 AM

It looks like this change has non-trivial compile-time impact in some cases. One large regression by 34% is on me_distortion.c from lencod at O3 (without codegen changes, so this is not due to better vectorization).

Hi, thanks for the info. I have an idea how to improve this and reduce the compile time; a fix will be ready soon.

Hi @ABataev

This causes the following failure: https://github.com/llvm/llvm-project/issues/54976

Please revert asap or fix.


Here is a crash reproducer after this patch.


Fixed in b0f0313febe755eeb7bacd62cf5862d9812f1690

alexfh added a subscriber: alexfh. May 19 2022, 8:13 PM

I've noticed that this patch (7ea03f0b4e4ec5d91d48ba2976f5adc299089ffd) significantly increases compile time for a certain type of code (functions with a large number of branches at the top level). In extreme cases (generated code with really large functions) this leads to tens of minutes spent in the SLP vectorizer:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 1895.1082 seconds (1895.2216 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  1752.0457 ( 93.4%)   0.2965 (  1.5%)  1752.3421 ( 92.5%)  1752.4505 ( 92.5%)  SLPVectorizerPass
  37.7381 (  2.0%)   5.4889 ( 28.6%)  43.2270 (  2.3%)  43.2324 (  2.3%)  ModuleInlinerWrapperPass
  37.0725 (  2.0%)   5.2396 ( 27.3%)  42.3121 (  2.2%)  42.3175 (  2.2%)  DevirtSCCRepeatedPass
  16.8654 (  0.9%)   0.2731 (  1.4%)  17.1384 (  0.9%)  17.1387 (  0.9%)  GVNPass

Analyzing data from perf record -g, I can also see related code, specifically llvm::slpvectorizer::BoUpSLP::buildTree_rec:

-   99.23%     0.00%  clang  clang                [.] main
   - main
      - 99.23% ExecuteCC1Tool
         - 99.23% cc1_main
            - 99.23% clang::ExecuteCompilerInvocation
               - 99.23% clang::CompilerInstance::ExecuteAction
                  - 99.21% clang::FrontendAction::Execute
                     - 99.21% clang::ParseAST
                        - 98.53% clang::BackendConsumer::HandleTranslationUnit
                           - 98.48% clang::EmitBackendOutput
                              - 96.34% (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline
                                 - 96.34% llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run
                                    - 94.47% llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run
                                       - 94.47% llvm::ModuleToFunctionPassAdaptor::run
                                          - 94.45% llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
                                             - 94.45% llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run
                                                - 94.21% llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
                                                   - llvm::SLPVectorizerPass::run
                                                      - 94.20% llvm::SLPVectorizerPass::runImpl
                                                         - 93.68% llvm::SLPVectorizerPass::vectorizeSimpleInstructions
                                                            - 92.96% llvm::SLPVectorizerPass::vectorizeRootInstruction
                                                               - 88.63% llvm::slpvectorizer::BoUpSLP::buildTree
                                                                    87.88% llvm::slpvectorizer::BoUpSLP::buildTree_rec
                                                                    0.73% llvm::slpvectorizer::BoUpSLP::buildTree_rec
                                                                 3.30% llvm::hashing::detail::hash_state::mix
                                                              0.61% llvm::SmallPtrSetImplBase::find_imp
                                    + 1.76% llvm::detail::PassModel<llvm::Module, llvm::ModuleInlinerWrapperPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run
                              + 2.14% llvm::legacy::PassManagerImpl::run
                        + 0.65% clang::Parser::ParseTopLevelDecl

I'm trying to come up with an isolated test case now.

Hi, the reproducer will definitely help to figure out the problem, thanks. I tried to avoid some extra checks and added limits, probably missed some.

No reduced test case, but one observation is that most samples in the slow version are around this (preexisting code):

// The reduction nodes (stored in UserIgnoreList) also should stay scalar.
for (Value *V : VL) {
  if (is_contained(UserIgnoreList, V)) {
 0.00 │ 7aa:   testq   %rdx, %rdx
      │        leaq    -536(%rbp), %r15
      │      ; llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4296
      │      ;   for (Value *V : VL) {
      │      ↓ je      80a
      │        movq    -80(%rbp), %rax
      │        movq    2088(%rax), %rbx
      │        movq    2096(%rax), %rcx
      │        leaq    (%rbx,%rcx,8), %r10
      │        leaq    (,%rcx,8), %r8
      │      ↓ jmp     7e8
      │      ; llvm-project/llvm/include/llvm/ADT/STLExtras.h:1671
      │      ;   return std::find(adl_begin(Range), adl_end(Range), Element) != adl_end(Range);
      │ 7d6:   cmpq    %r10, %rax
      │      ; llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4297
      │      ;     if (is_contained(UserIgnoreList, V)) {
      │      ↓ jne     a98
      │      ; llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4296
      │      ;   for (Value *V : VL) {
 0.12 │ 7df:   addq    $8, %rsi
 0.01 │        cmpq    %r9, %rsi
      │      ↓ je      80a
      │ 7e8:   movq    %rbx, %rax
      │        testq   %rcx, %rcx
      │      ; include/c++/v1/__algorithm/find.h:24
      │      ;   for (; __first != __last; ++__first)
      │      ↑ je      7d6
 0.10 │        movq    (%rsi), %rdi
      │        movq    %r8, %rdx
      │        movq    %rbx, %rax
      │      ; include/c++/v1/__algorithm/find.h:25
      │      ;     if (*__first == __value_)
99.02 │ 7f9:   cmpq    %rdi, (%rax)
      │      ↑ je      7d6
      │      ; include/c++/v1/__algorithm/find.h:24
      │      ;   for (; __first != __last; ++__first)
      │        addq    $8, %rax
      │        addq    $-8, %rdx
 0.03 │      ↑ jne     7f9
 0.28 │      ↑ jmp     7df
      │      ; llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4308
      │      ;   auto *VL0 = cast<Instruction>(S.OpValue);
      │ 80a:   movq    -136(%rbp), %r14
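
To illustrate why this loop can dominate the profile: the annotation shows is_contained expanding to std::find, so every element of VL triggers a linear scan of UserIgnoreList, and the cost grows with the product of the two sizes. The standalone sketch below contrasts that with a hashed lookup. It is only one possible mitigation, not necessarily what was done upstream; ValueId and all three function names are hypothetical, and plain integers stand in for Value pointers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

using ValueId = int; // hypothetical stand-in for llvm::Value *

// Linear scan per element, as std::find does:
// O(|VL| * |IgnoreList|) comparisons in the worst case.
std::size_t countIgnoredLinear(const std::vector<ValueId> &VL,
                               const std::vector<ValueId> &IgnoreList) {
  std::size_t N = 0;
  for (ValueId V : VL)
    if (std::find(IgnoreList.begin(), IgnoreList.end(), V) != IgnoreList.end())
      ++N;
  return N;
}

// Hashed lookup per element: O(|VL|) expected once the set is built.
std::size_t countIgnoredHashed(const std::vector<ValueId> &VL,
                               const std::unordered_set<ValueId> &IgnoreSet) {
  std::size_t N = 0;
  for (ValueId V : VL)
    if (IgnoreSet.count(V) != 0)
      ++N;
  return N;
}

// Builds the set once and cross-checks the two lookups in debug builds.
std::size_t countIgnored(const std::vector<ValueId> &VL,
                         const std::vector<ValueId> &IgnoreList) {
  std::unordered_set<ValueId> IgnoreSet(IgnoreList.begin(), IgnoreList.end());
  std::size_t N = countIgnoredHashed(VL, IgnoreSet);
  assert(N == countIgnoredLinear(VL, IgnoreList) && "lookups must agree");
  return N;
}
```

The set only pays off if it is built once and reused across many queries; rebuilding it on every call, as countIgnored does for the sake of the cross-check, would give back most of the gain.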

No reduced test case, but one observation is that most samples in the slow version are around this (preexisting code):

// The reduction nodes (stored in UserIgnoreList) also should stay scalar.
for (Value *V : VL) {
  if (is_contained(UserIgnoreList, V)) {
 0.00 │ 7aa:   testq   %rdx, %rdx
      │        leaq    -536(%rbp), %r15
      │      ; llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4296
      │      ;   for (Value *V : VL) {
      │      ↓ je      80a
      │        movq    -80(%rbp), %rax
      │        movq    2088(%rax), %rbx
      │        movq    2096(%rax), %rcx
      │        leaq    (%rbx,%rcx,8), %r10
      │        leaq    (,%rcx,8), %r8
      │      ↓ jmp     7e8
      │      ; llvm-project/llvm/include/llvm/ADT/STLExtras.h:1671
      │      ;   return std::find(adl_begin(Range), adl_end(Range), Element) != adl_end(Range);
      │ 7d6:   cmpq    %r10, %rax
      │      ; llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4297
      │      ;     if (is_contained(UserIgnoreList, V)) {
      │      ↓ jne     a98
      │      ; llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4296
      │      ;   for (Value *V : VL) {
 0.12 │ 7df:   addq    $8, %rsi
 0.01 │        cmpq    %r9, %rsi
      │      ↓ je      80a
      │ 7e8:   movq    %rbx, %rax
      │        testq   %rcx, %rcx
      │      ; include/c++/v1/__algorithm/find.h:24
      │      ;   for (; __first != __last; ++__first)
      │      ↑ je      7d6
 0.10 │        movq    (%rsi), %rdi
      │        movq    %r8, %rdx
      │        movq    %rbx, %rax
      │      ; include/c++/v1/__algorithm/find.h:25
      │      ;     if (*__first == __value_)
99.02 │ 7f9:   cmpq    %rdi, (%rax)
      │      ↑ je      7d6
      │      ; include/c++/v1/__algorithm/find.h:24
      │      ;   for (; __first != __last; ++__first)
      │        addq    $8, %rax
      │        addq    $-8, %rdx
 0.03 │      ↑ jne     7f9
 0.28 │      ↑ jmp     7df
      │      ; llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4308
      │      ;   auto *VL0 = cast<Instruction>(S.OpValue);
      │ 80a:   movq    -136(%rbp), %r14
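
The annotated profile attributes about 99% of the samples to the comparison inside std::find, that is, to the linear scan that is_contained performs over UserIgnoreList for every V in VL. The standalone sketch below shows that access pattern next to one common remedy, a hashed set built once before the loop. It uses only the standard library. The names countIgnoredLinear and countIgnoredHashed, and the use of const int * in place of Value *, are illustrative assumptions; this is not necessarily what any commit mentioned in this thread does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for llvm::Value*; only pointer identity matters here.
using ValuePtr = const int *;

// Mirrors the shape of the hot loop: one linear std::find per element of VL,
// so the cost grows as |VL| * |IgnoreList|.
std::size_t countIgnoredLinear(const std::vector<ValuePtr> &VL,
                               const std::vector<ValuePtr> &IgnoreList) {
  std::size_t Count = 0;
  for (ValuePtr V : VL)
    if (std::find(IgnoreList.begin(), IgnoreList.end(), V) != IgnoreList.end())
      ++Count;
  return Count;
}

// Same result, but the ignore list is hashed once up front, so each
// membership query is O(1) on average instead of O(|IgnoreList|).
std::size_t countIgnoredHashed(const std::vector<ValuePtr> &VL,
                               const std::vector<ValuePtr> &IgnoreList) {
  const std::unordered_set<ValuePtr> Ignored(IgnoreList.begin(),
                                             IgnoreList.end());
  std::size_t Count = 0;
  for (ValuePtr V : VL)
    if (Ignored.count(V) != 0)
      ++Count;
  return Count;
}
```

Both functions return the same count; the difference is only in how the cost scales when the ignore list is large, which is the situation the 99.02% line suggests.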

Aha, I committed 4e271fc49517362a9333371fb1ab7e865d4c1b0e earlier today, which should improve it. Try to update the compiler.

Thanks! This patch makes the SLP vectorizer pass much faster on the problematic input, but it doesn't completely compensate for the slowdown introduced here. This is what -ftime-report looks like after 4e271fc49517362a9333371fb1ab7e865d4c1b0e:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 308.9466 seconds (308.9519 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  132.7263 ( 45.9%)   0.0001 (  0.0%)  132.7264 ( 43.0%)  132.7351 ( 43.0%)  SLPVectorizerPass
  48.0866 ( 16.6%)   5.7199 ( 29.3%)  53.8065 ( 17.4%)  53.8123 ( 17.4%)  ModuleInlinerWrapperPass
  47.3402 ( 16.4%)   5.4604 ( 27.9%)  52.8007 ( 17.1%)  52.8060 ( 17.1%)  DevirtSCCRepeatedPass
  22.2568 (  7.7%)   0.3222 (  1.6%)  22.5790 (  7.3%)  22.5785 (  7.3%)  GVNPass
  11.3194 (  3.9%)   1.0507 (  5.4%)  12.3701 (  4.0%)  12.3520 (  4.0%)  InstCombinePass
   4.8834 (  1.7%)   1.2548 (  6.4%)   6.1382 (  2.0%)   6.1350 (  2.0%)  InlinerPass

And this is how it looked at 38d0df557706940af5d7110bdf662590449f8a60 (the closest commit before 7ea03f0b4e4ec5d91d48ba2976f5adc299089ffd where I could compile clang):

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 181.4693 seconds (181.4723 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  48.1675 ( 29.7%)   5.6593 ( 29.3%)  53.8269 ( 29.7%)  53.8332 ( 29.7%)  ModuleInlinerWrapperPass
  47.4270 ( 29.2%)   5.3983 ( 28.0%)  52.8253 ( 29.1%)  52.8315 ( 29.1%)  DevirtSCCRepeatedPass
  22.1909 ( 13.7%)   0.2916 (  1.5%)  22.4824 ( 12.4%)  22.4825 ( 12.4%)  GVNPass
  11.1989 (  6.9%)   1.0292 (  5.3%)  12.2281 (  6.7%)  12.2204 (  6.7%)  InstCombinePass
   4.9943 (  3.1%)   1.1928 (  6.2%)   6.1871 (  3.4%)   6.1842 (  3.4%)  InlinerPass
   5.2197 (  3.2%)   0.0000 (  0.0%)   5.2198 (  2.9%)   5.2201 (  2.9%)  SLPVectorizerPass

Note the 5s -> 130s jump in time spent in SLPVectorizerPass. I'll grab the profile for the updated binary, but it will take some time. And yes, still trying to reduce the test case.
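
For reference, the size of the jump can be recomputed directly from the User+System columns of the two reports. The sketch below does just that arithmetic; the constant and function names are made up for illustration, and the numbers are copied from the tables in this comment.

```cpp
#include <cassert>
#include <cmath>

// User+System seconds copied from the two -ftime-report tables.
constexpr double SLPBefore = 5.2198;     // at 38d0df557706
constexpr double SLPAfter = 132.7264;    // after 4e271fc49517
constexpr double TotalBefore = 181.4693; // Total Execution Time, old
constexpr double TotalAfter = 308.9466;  // Total Execution Time, new

// How many times slower SLPVectorizerPass became (roughly 25x).
constexpr double slpSlowdownFactor() { return SLPAfter / SLPBefore; }

// Fraction of the total compile-time increase that is explained by
// SLPVectorizerPass alone (very close to 1.0 for these numbers).
constexpr double slpShareOfRegression() {
  return (SLPAfter - SLPBefore) / (TotalAfter - TotalBefore);
}
```

In other words, the other passes are essentially unchanged between the two builds, and the roughly 127 extra seconds come from SLPVectorizerPass.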

The perf profile should help, thanks

Looking at this I wonder whether SmallPtrSet is not that small? :)

-   86.47%     0.00%  clang    clang               [.] cc1_main
   - cc1_main
      - 86.47% clang::ExecuteCompilerInvocation
         - 86.47% clang::CompilerInstance::ExecuteAction
            - 86.47% clang::FrontendAction::Execute
               - 86.47% clang::ParseAST
                  - 80.62% clang::BackendConsumer::HandleTranslationUnit
                     - 80.17% clang::EmitBackendOutput
                        - 63.55% (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline
                           - 63.55% llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run
                              - 46.20% llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run
                                 - 46.17% llvm::ModuleToFunctionPassAdaptor::run
                                    - 45.90% llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
                                       - 45.89% llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run
                                          - 43.23% llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
                                               llvm::SLPVectorizerPass::run
                                             - llvm::SLPVectorizerPass::runImpl
                                                - 43.20% llvm::SLPVectorizerPass::vectorizeChainsInBlock
                                                   - 42.84% llvm::SLPVectorizerPass::vectorizeSimpleInstructions
                                                      - 41.02% llvm::SLPVectorizerPass::vectorizeRootInstruction
                                                         - 40.93% (anonymous namespace)::HorizontalReduction::tryToReduce
                                                            - 18.19% llvm::slpvectorizer::BoUpSLP::buildTree
                                                                 9.89% llvm::SmallPtrSetImplBase::insert_imp_big
                                                               - 6.24% llvm::slpvectorizer::BoUpSLP::buildTree_rec
                                                                  - 3.23% llvm::slpvectorizer::BoUpSLP::buildTree_rec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&)::$_32::operator()
                                                                     + 1.61% llvm::DenseMapBase<llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >, llvm::V
                                                                    0.85% getSameOpcode
                                                                  - 0.78% llvm::slpvectorizer::BoUpSLP::newTreeEntry
                                                                       0.72% llvm::SmallPtrSetImplBase::insert_imp_big
                                                                 0.98% llvm::SmallPtrSetImplBase::FindBucketFor
                                                              8.57% llvm::SmallPtrSetImplBase::FindBucketFor
                                                            + 4.71% llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseM
                                                              3.23% (anonymous namespace)::HorizontalReduction::tryToReduce(llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*)::{lambda(bool)#1}::operator()
                                                              0.70% memset
                                                      + 1.82% tryToVectorizeSequence<llvm::Value>
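
On the "not that small" remark: insert_imp_big and FindBucketFor in the profile are the code paths SmallPtrSet uses once it has outgrown its inline storage and switched to hashed buckets, so seeing them this high suggests the sets here hold many more elements than their inline capacity. The toy class below illustrates that small-to-big switch. It is a simplified sketch, not LLVM's implementation: TinySet and its members are hypothetical names, and the fallback is std::unordered_set rather than LLVM's own hash table.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Hypothetical toy set: linear scan over N inline slots while small,
// then a one-way switch to a heap-allocated hash set.
template <typename T, std::size_t N> class TinySet {
  std::array<T, N> Inline{};
  std::size_t NumInline = 0;
  std::unordered_set<T> Big; // used only after the set outgrows Inline
  bool IsSmall = true;

public:
  bool isSmall() const { return IsSmall; }
  std::size_t size() const { return IsSmall ? NumInline : Big.size(); }

  bool contains(const T &V) const {
    if (IsSmall)
      return std::find(Inline.begin(), Inline.begin() + NumInline, V) !=
             Inline.begin() + NumInline;
    return Big.count(V) != 0;
  }

  // Returns true if V was newly inserted.
  bool insert(const T &V) {
    if (contains(V))
      return false;
    if (IsSmall && NumInline < N) {
      Inline[NumInline++] = V; // cheap path: no allocation, no hashing
      return true;
    }
    if (IsSmall) { // inline storage is full: migrate everything to the hash set
      Big.insert(Inline.begin(), Inline.end());
      IsSmall = false;
    }
    return Big.insert(V).second;
  }
};
```

The inline capacity is a compile-time guess; when the real element count is far above it, every insert and lookup pays the hashing cost, which matches the shape of the profile.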

Looking at this I wonder whether SmallPtrSet is not that small? :)

-   86.47%     0.00%  clang    clang               [.] cc1_main                                                                                                                                                                                     ▒
   - cc1_main                                                                                                                                                                                                                                       ▒
      - 86.47% clang::ExecuteCompilerInvocation                                                                                                                                                                                                     ▒
         - 86.47% clang::CompilerInstance::ExecuteAction                                                                                                                                                                                            ▒
            - 86.47% clang::FrontendAction::Execute
               - 86.47% clang::ParseAST
                  - 80.62% clang::BackendConsumer::HandleTranslationUnit
                     - 80.17% clang::EmitBackendOutput
                        - 63.55% (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline
                           - 63.55% llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run
                              - 46.20% llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run
                                 - 46.17% llvm::ModuleToFunctionPassAdaptor::run
                                    - 45.90% llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
                                       - 45.89% llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run
                                          - 43.23% llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
                                               llvm::SLPVectorizerPass::run
                                             - llvm::SLPVectorizerPass::runImpl
                                                - 43.20% llvm::SLPVectorizerPass::vectorizeChainsInBlock
                                                   - 42.84% llvm::SLPVectorizerPass::vectorizeSimpleInstructions
                                                      - 41.02% llvm::SLPVectorizerPass::vectorizeRootInstruction
                                                         - 40.93% (anonymous namespace)::HorizontalReduction::tryToReduce
                                                            - 18.19% llvm::slpvectorizer::BoUpSLP::buildTree
                                                                 9.89% llvm::SmallPtrSetImplBase::insert_imp_big
                                                               - 6.24% llvm::slpvectorizer::BoUpSLP::buildTree_rec
                                                                  - 3.23% llvm::slpvectorizer::BoUpSLP::buildTree_rec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&)::$_32::operator()
                                                                     + 1.61% llvm::DenseMapBase<llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >, llvm::V
                                                                    0.85% getSameOpcode
                                                                  - 0.78% llvm::slpvectorizer::BoUpSLP::newTreeEntry
                                                                       0.72% llvm::SmallPtrSetImplBase::insert_imp_big
                                                                 0.98% llvm::SmallPtrSetImplBase::FindBucketFor
                                                              8.57% llvm::SmallPtrSetImplBase::FindBucketFor
                                                            + 4.71% llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseM
                                                              3.23% (anonymous namespace)::HorizontalReduction::tryToReduce(llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*)::{lambda(bool)#1}::operator()
                                                              0.70% memset
                                                      + 1.82% tryToVectorizeSequence<llvm::Value>

OK, thanks, I'll improve it on Monday. We can avoid creating the SmallPtrSet multiple times, and I'll check for other possible optimizations too.

Aha, I committed 4e271fc49517362a9333371fb1ab7e865d4c1b0e earlier today, which should improve it. Try to update the compiler.

Thanks! This patch makes the SLP vectorizer pass much faster on the problematic input, but it doesn't completely compensate for the slowdown introduced here. This is what -ftime-report looks like after 4e271fc49517362a9333371fb1ab7e865d4c1b0e:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 308.9466 seconds (308.9519 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  132.7263 ( 45.9%)   0.0001 (  0.0%)  132.7264 ( 43.0%)  132.7351 ( 43.0%)  SLPVectorizerPass
  48.0866 ( 16.6%)   5.7199 ( 29.3%)  53.8065 ( 17.4%)  53.8123 ( 17.4%)  ModuleInlinerWrapperPass
  47.3402 ( 16.4%)   5.4604 ( 27.9%)  52.8007 ( 17.1%)  52.8060 ( 17.1%)  DevirtSCCRepeatedPass
  22.2568 (  7.7%)   0.3222 (  1.6%)  22.5790 (  7.3%)  22.5785 (  7.3%)  GVNPass
  11.3194 (  3.9%)   1.0507 (  5.4%)  12.3701 (  4.0%)  12.3520 (  4.0%)  InstCombinePass
   4.8834 (  1.7%)   1.2548 (  6.4%)   6.1382 (  2.0%)   6.1350 (  2.0%)  InlinerPass

And this is how it looked at 38d0df557706940af5d7110bdf662590449f8a60 (the closest commit before 7ea03f0b4e4ec5d91d48ba2976f5adc299089ffd where I could compile clang):

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 181.4693 seconds (181.4723 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  48.1675 ( 29.7%)   5.6593 ( 29.3%)  53.8269 ( 29.7%)  53.8332 ( 29.7%)  ModuleInlinerWrapperPass
  47.4270 ( 29.2%)   5.3983 ( 28.0%)  52.8253 ( 29.1%)  52.8315 ( 29.1%)  DevirtSCCRepeatedPass
  22.1909 ( 13.7%)   0.2916 (  1.5%)  22.4824 ( 12.4%)  22.4825 ( 12.4%)  GVNPass
  11.1989 (  6.9%)   1.0292 (  5.3%)  12.2281 (  6.7%)  12.2204 (  6.7%)  InstCombinePass
   4.9943 (  3.1%)   1.1928 (  6.2%)   6.1871 (  3.4%)   6.1842 (  3.4%)  InlinerPass
   5.2197 (  3.2%)   0.0000 (  0.0%)   5.2198 (  2.9%)   5.2201 (  2.9%)  SLPVectorizerPass

Note the 5s -> 130s jump in time spent in SLPVectorizerPass. I'll grab the profile for the updated binary, but it will take some time. And yes, still trying to reduce the test case.

The perf profile should help, thanks

Looking at this I wonder whether SmallPtrSet is not that small? :)

-   86.47%     0.00%  clang    clang               [.] cc1_main
   - cc1_main
      - 86.47% clang::ExecuteCompilerInvocation
         - 86.47% clang::CompilerInstance::ExecuteAction
            - 86.47% clang::FrontendAction::Execute
               - 86.47% clang::ParseAST
                  - 80.62% clang::BackendConsumer::HandleTranslationUnit
                     - 80.17% clang::EmitBackendOutput
                        - 63.55% (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline
                           - 63.55% llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run
                              - 46.20% llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run
                                 - 46.17% llvm::ModuleToFunctionPassAdaptor::run
                                    - 45.90% llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
                                       - 45.89% llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run
                                          - 43.23% llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
                                               llvm::SLPVectorizerPass::run
                                             - llvm::SLPVectorizerPass::runImpl
                                                - 43.20% llvm::SLPVectorizerPass::vectorizeChainsInBlock
                                                   - 42.84% llvm::SLPVectorizerPass::vectorizeSimpleInstructions
                                                      - 41.02% llvm::SLPVectorizerPass::vectorizeRootInstruction
                                                         - 40.93% (anonymous namespace)::HorizontalReduction::tryToReduce
                                                            - 18.19% llvm::slpvectorizer::BoUpSLP::buildTree
                                                                 9.89% llvm::SmallPtrSetImplBase::insert_imp_big
                                                               - 6.24% llvm::slpvectorizer::BoUpSLP::buildTree_rec
                                                                  - 3.23% llvm::slpvectorizer::BoUpSLP::buildTree_rec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&)::$_32::operator()
                                                                     + 1.61% llvm::DenseMapBase<llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >, llvm::V
                                                                    0.85% getSameOpcode
                                                                  - 0.78% llvm::slpvectorizer::BoUpSLP::newTreeEntry
                                                                       0.72% llvm::SmallPtrSetImplBase::insert_imp_big
                                                                 0.98% llvm::SmallPtrSetImplBase::FindBucketFor
                                                              8.57% llvm::SmallPtrSetImplBase::FindBucketFor
                                                            + 4.71% llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseM
                                                              3.23% (anonymous namespace)::HorizontalReduction::tryToReduce(llvm::slpvectorizer::BoUpSLP&, llvm::TargetTransformInfo*)::{lambda(bool)#1}::operator()
                                                              0.70% memset
                                                      + 1.82% tryToVectorizeSequence<llvm::Value>

Significant time seems to be spent in this loop as well:

for (unsigned Cnt = 0; Cnt < NumReducedVals; ++Cnt) {
  if (Cnt >= Pos && Cnt < Pos + ReduxWidth)
    continue;
  unsigned NumOps = VectorizedVals.lookup(Candidates[Cnt]) +
                    std::count(VL.begin(), VL.end(), Candidates[Cnt]);
  if (NumOps != ReducedValsToOps.find(Candidates[Cnt])->second.size())
    LocalExternallyUsedValues[Candidates[Cnt]];
}
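
For context on why this loop can get expensive: std::count rescans all of VL for every candidate, so the work grows with NumReducedVals * VL.size(). Below is a minimal standalone sketch of one common way to avoid that, by building a histogram of VL once and doing lookups afterwards. It is only an illustration and not necessarily what the follow-up commit does; ValuePtr, countNaive and countHashed are hypothetical names, and plain pointers stand in for llvm::Value*.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for llvm::Value*; only pointer identity matters here.
using ValuePtr = const void *;

// Baseline: rescan VL for every candidate, like std::count in the quoted loop.
// Cost is proportional to Candidates.size() * VL.size().
std::vector<std::size_t> countNaive(const std::vector<ValuePtr> &Candidates,
                                    const std::vector<ValuePtr> &VL) {
  std::vector<std::size_t> Out;
  Out.reserve(Candidates.size());
  for (ValuePtr C : Candidates)
    Out.push_back(
        static_cast<std::size_t>(std::count(VL.begin(), VL.end(), C)));
  return Out;
}

// Sketch of an alternative: count every element of VL once, then answer each
// candidate with a hash lookup. Expected cost is Candidates.size() + VL.size().
std::vector<std::size_t> countHashed(const std::vector<ValuePtr> &Candidates,
                                     const std::vector<ValuePtr> &VL) {
  std::unordered_map<ValuePtr, std::size_t> Histogram;
  for (ValuePtr V : VL)
    ++Histogram[V];

  std::vector<std::size_t> Out;
  Out.reserve(Candidates.size());
  for (ValuePtr C : Candidates) {
    auto It = Histogram.find(C);
    Out.push_back(It == Histogram.end() ? 0 : It->second);
  }
  assert(Out.size() == Candidates.size());
  return Out;
}
```

Both functions return the same counts; only the amount of rescanning differs. In-tree code would presumably use a DenseMap rather than std::unordered_map.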

Will fix it, thanks!

And finally a reduced test case:

$ cat q.cc
struct S {
  template<int N>
  bool f() const;
};
int f(const S& s) {
  int d = 0;
  if (s.f<1>()) ++d;
  // 4998 lines skipped
  if (s.f<5000>()) ++d;
  if (d == 0) {
    return s.f<-1>();
  }
  return 0;
}
$ ./clang-11004 --target=x86_64--linux-gnu -O1 -fslp-vectorize -c -xc++ q.cc -o q.o -ftime-report
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 8.7398 seconds (8.8367 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   7.9827 ( 91.6%)   0.0081 ( 29.3%)   7.9908 ( 91.4%)   8.0861 ( 91.5%)  SLPVectorizerPass
   0.1950 (  2.2%)   0.0002 (  0.6%)   0.1952 (  2.2%)   0.1962 (  2.2%)  InstCombinePass
...
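
Since the listing above elides 4998 lines, here is a small sketch of a generator that produces the full q.cc text. It assumes the skipped lines simply repeat the visible pattern for N = 2..4999, which the listing suggests but does not show; makeRepro is a hypothetical helper name.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns the source text of the reduced test case with calls s.f<1>() .. s.f<n>().
// Assumption: the elided lines follow the same "if (s.f<N>()) ++d;" pattern.
std::string makeRepro(int n) {
  assert(n >= 1 && "need at least one call");
  std::ostringstream os;
  os << "struct S {\n"
        "  template<int N>\n"
        "  bool f() const;\n"
        "};\n"
        "int f(const S& s) {\n"
        "  int d = 0;\n";
  // One conditional increment per template instantiation.
  for (int i = 1; i <= n; ++i)
    os << "  if (s.f<" << i << ">()) ++d;\n";
  os << "  if (d == 0) {\n"
        "    return s.f<-1>();\n"
        "  }\n"
        "  return 0;\n"
        "}\n";
  return os.str();
}
```

Writing makeRepro(5000) to q.cc (for example through std::ofstream) and compiling it with the command line quoted above should give an input of the same shape as the reduced test case.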

Numbers after 319a722f6fca365c8f71f457eac60bc3909988ee:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 0.9849 seconds (0.9873 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
   0.2366 ( 24.3%)   0.0009 (  8.9%)   0.2375 ( 24.1%)   0.2381 ( 24.1%)  SLPVectorizerPass
   0.2188 ( 22.4%)   0.0000 (  0.1%)   0.2189 ( 22.2%)   0.2193 ( 22.2%)  InstCombinePass
   0.1803 ( 18.5%)   0.0001 (  0.9%)   0.1804 ( 18.3%)   0.1807 ( 18.3%)  ModuleInlinerWrapperPass

Thank you Alexey! This patch significantly improves the situation for the reduced test case. It also cuts the time on the original problematic file by about 2x compared to 4e271fc49517362a9333371fb1ab7e865d4c1b0e:

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 256.7450 seconds (256.7466 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  79.6665 ( 33.7%)   0.0000 (  0.0%)  79.6665 ( 31.0%)  79.6728 ( 31.0%)  SLPVectorizerPass
  48.2242 ( 20.4%)   5.8213 ( 29.1%)  54.0455 ( 21.1%)  54.0537 ( 21.1%)  ModuleInlinerWrapperPass
  47.4894 ( 20.1%)   5.5481 ( 27.7%)  53.0375 ( 20.7%)  53.0450 ( 20.7%)  DevirtSCCRepeatedPass
  22.3364 (  9.4%)   0.2979 (  1.5%)  22.6343 (  8.8%)  22.6339 (  8.8%)  GVNPass
  11.2305 (  4.7%)   1.0957 (  5.5%)  12.3262 (  4.8%)  12.3197 (  4.8%)  InstCombinePass
   4.9344 (  2.1%)   1.2685 (  6.3%)   6.2029 (  2.4%)   6.2003 (  2.4%)  InlinerPass

The SLPVectorizerPass time is still much higher than it was before D114171 (about 5s before, about 79s now):

Total Execution Time: 181.4693 seconds (181.4723 wall clock)

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
48.1675 ( 29.7%)   5.6593 ( 29.3%)  53.8269 ( 29.7%)  53.8332 ( 29.7%)  ModuleInlinerWrapperPass
47.4270 ( 29.2%)   5.3983 ( 28.0%)  52.8253 ( 29.1%)  52.8315 ( 29.1%)  DevirtSCCRepeatedPass
22.1909 ( 13.7%)   0.2916 (  1.5%)  22.4824 ( 12.4%)  22.4825 ( 12.4%)  GVNPass
11.1989 (  6.9%)   1.0292 (  5.3%)  12.2281 (  6.7%)  12.2204 (  6.7%)  InstCombinePass
 4.9943 (  3.1%)   1.1928 (  6.2%)   6.1871 (  3.4%)   6.1842 (  3.4%)  InlinerPass
 5.2197 (  3.2%)   0.0000 (  0.0%)   5.2198 (  2.9%)   5.2201 (  2.9%)  SLPVectorizerPass

I'm looking at what else in the original file makes SLPVectorizer slow.

Current profile:

- 73.25% clang::BackendConsumer::HandleTranslationUnit
   - 72.69% clang::EmitBackendOutput
      - 52.79% (anonymous namespace)::EmitAssemblyHelper::RunOptimizationPipeline
         - 52.79% llvm::PassManager<llvm::Module, llvm::AnalysisManager<llvm::Module>>::run
            - 31.80% llvm::detail::PassModel<llvm::Module, llvm::ModuleToFunctionPassAdaptor, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Module>>::run
               - 31.77% llvm::ModuleToFunctionPassAdaptor::run
                  - 31.44% llvm::detail::PassModel<llvm::Function, llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
                     - 31.43% llvm::PassManager<llvm::Function, llvm::AnalysisManager<llvm::Function>>::run
                        - 28.23% llvm::detail::PassModel<llvm::Function, llvm::SLPVectorizerPass, llvm::PreservedAnalyses, llvm::AnalysisManager<llvm::Function>>::run
                             llvm::SLPVectorizerPass::run
                           - llvm::SLPVectorizerPass::runImpl
                              - 28.19% llvm::SLPVectorizerPass::vectorizeChainsInBlock
                                 - 27.76% llvm::SLPVectorizerPass::vectorizeSimpleInstructions
                                    - 25.69% llvm::SLPVectorizerPass::vectorizeRootInstruction
                                       - 25.50% (anonymous namespace)::HorizontalReduction::tryToReduce
                                          - 5.27% llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Va
                                             - 3.12% llvm::DenseMapBase<llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >, llvm::Value*, unsigned int,
                                                  2.71% llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >::grow
                                               0.63% std::__u::vector<std::__u::pair<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u> >, std::__u::allocator<std::__u::pair<llvm::Value*, llvm::SmallVector<llvm::Instruction*, 2u> > > >:
                                          - 3.11% llvm::slpvectorizer::BoUpSLP::buildTree_rec
                                             - 1.66% llvm::slpvectorizer::BoUpSLP::buildTree_rec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&)::$_32::operator()
                                                - 0.85% llvm::DenseMapBase<llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >, llvm::Value*, unsigned in
                                                     0.69% llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*, void>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >::grow
                                          - 2.21% llvm::SmallPtrSetImpl<llvm::Value*>::insert
                                             - 1.98% llvm::SmallPtrSetImplBase::insert_imp_big
                                                  1.10% llvm::SmallPtrSetImplBase::Grow
                                          - 1.60% llvm::SmallPtrSetImplBase::insert_imp_big
                                               0.77% llvm::SmallPtrSetImplBase::Grow
                                    - 2.06% tryToVectorizeSequence<llvm::Value>
                                       - 2.04% llvm::SLPVectorizerPass::tryToVectorizeList
                                          - 1.49% llvm::slpvectorizer::BoUpSLP::buildTree_rec
                                               0.66% llvm::slpvectorizer::BoUpSLP::buildTree_rec(llvm::ArrayRef<llvm::Value*>, unsigned int, llvm::slpvectorizer::BoUpSLP::EdgeInfo const&)::$_32::operator()

And the hottest part of HorizontalReduction::tryToReduce:

     │      ; /proc/self/cwd/third_party/llvm/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:11138
     │      ;     switch (RdxKind) {
0.00 │        movl    480(%rax), %edi
     │        movl    $3134, %eax
     │        btl     %edi, %eax
     │      ↓ jae     26ba
     │      ; /proc/self/cwd/third_party/llvm/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:11146
     │      ;       unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RdxKind);
     │      → callq   llvm::RecurrenceDescriptor::getOpcode
     │        movl    %eax, %r15d
     │      ; /proc/self/cwd/third_party/llvm/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:11147
     │      ;       if (!AllConsts)
     │        testb   %r13b, %r13b
     │      ↓ je      273b
     │        xorl    %r13d, %r13d
     │        movl    $0, -104(%rbp)
     │        movq    -184(%rbp), %rbx
     │      ↓ jmp     276a
     │        nop
0.34 │25e0:   movq    %rcx, %rsi
0.03 │25e3:   movq    %rsi, %rcx
     │      ; /proc/self/cwd/third_party/llvm/llvm-project/llvm/include/llvm/ADT/DenseMap.h:1256
     │      ;     return LHS.Ptr == RHS.Ptr;
0.18 │        cmpq    %r8, %rsi
     │      ; /proc/self/cwd/third_party/llvm/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:10932
     │      ;         for (Value *U : IgnoreList)
     │      ↑ je      2515
0.20 │25ef:   movq    (%rcx), %r9
     │      ; /proc/self/cwd/third_party/llvm/llvm-project/llvm/include/llvm/IR/Value.h:533
     │      ;     return SubclassID;
9.77 │        movzbl  16(%r9), %edx
     │        cmpl    $27, %edx
     │      ; /proc/self/cwd/third_party/llvm/llvm-project/llvm/include/llvm/IR/Operator.h:296
     │      ;     if (auto *I = dyn_cast<Instruction>(V))
0.26 │      ↓ jae     2610
     │      ; /proc/self/cwd/third_party/llvm/llvm-project/llvm/include/llvm/IR/Constants.h:1355
     │      ;     return V->getValueID() == ConstantExprVal;
     │        cmpb    $5, %dl
     │      ; /proc/self/cwd/third_party/llvm/llvm-project/llvm/include/llvm/IR/Operator.h:298
     │      ;     else if (auto *CE = dyn_cast<ConstantExpr>(V))
     │      ↓ jne     2631
     │      ; /proc/self/cwd/third_party/llvm/llvm-project/llvm/include/llvm/IR/Value.h:852
     │      ;   unsigned short getSubclassDataFromValue() const { return SubclassData; }
     │        movzwl  18(%r9), %edx
     │      ; /proc/self/cwd/third_party/llvm/llvm-project/llvm/include/llvm/IR/Operator.h:303
     │      ;     switch (Opcode) {
     │        cmpl    $57, %edx
     │      ↓ jbe     2618
     │      ↓ jmp     2631
     │        nop
     │      ; /proc/self/cwd/third_party/llvm/llvm-project/llvm/include/llvm/IR/Instruction.h:157
     │      ;   unsigned getOpcode() const { return getValueID() - InstructionVal; }
0.01 │2610:   addl    $-27, %edx
     │      ; /proc/self/cwd/third_party/llvm/llvm-project/llvm/include/llvm/IR/Operator.h:303
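
To illustrate why a scan like `for (Value *U : IgnoreList)` can dominate the profile: if the whole list is walked again on every reduction attempt, the number of pointer loads grows as attempts times list length. The following is a standalone sketch, not LLVM code. Node, ScanStats, scanPerAttempt and scanOnce are hypothetical names, and it assumes the list does not change between attempts, which may not hold in tryToReduce.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for a value whose kind must be loaded through a pointer.
// Not an LLVM type.
struct Node {
  bool interesting;
};

struct ScanStats {
  std::size_t matches; // elements that passed the per-element check
  std::size_t loads;   // pointer dereferences performed in total
};

// Walks the whole list once per attempt: attempts * list.size() loads.
ScanStats scanPerAttempt(const std::vector<const Node *> &ignoreList,
                         std::size_t attempts) {
  ScanStats s{0, 0};
  for (std::size_t a = 0; a < attempts; ++a) {
    s.matches = 0; // the result is recomputed from scratch each time
    for (const Node *n : ignoreList) {
      ++s.loads;
      if (n->interesting)
        ++s.matches;
    }
  }
  return s;
}

// Walks the list a single time; every attempt would reuse the result.
// Only valid under the assumption that the list is unchanged in between.
ScanStats scanOnce(const std::vector<const Node *> &ignoreList) {
  ScanStats s{0, 0};
  for (const Node *n : ignoreList) {
    ++s.loads;
    if (n->interesting)
      ++s.matches;
  }
  return s;
}
```

Both functions return the same match count for any positive number of attempts; only the number of loads differs, which is the cost the 9.77% line in the annotated output appears to reflect.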

Hi @ABataev, this patch is still causing a miscompile, see https://github.com/llvm/llvm-project/issues/55688 for details.