This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Try to match reductions before trying to vectorize a vector build sequence.
ClosedPublic

Authored by vdmitrie on Aug 24 2022, 12:19 PM.

Details

Summary

This patch changes order of searching for reductions vs other vectorization possibilities.
The idea is if we do not match a reduction it won't be harmful for further attempts to
find vectorizable operations on a vector build sequences. But doing it in the opposite
order we have good chance to ruin opportunity to match a reduction later.
We also don't want to try vectorizing binary operations too early as 2-way vectorization
may effectively prohibit wider ones leading to producing less effective code.

Diff Detail

Event Timeline

vdmitrie created this revision.Aug 24 2022, 12:19 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 24 2022, 12:19 PM
vdmitrie requested review of this revision.Aug 24 2022, 12:19 PM
ABataev added inline comments.Aug 24 2022, 12:27 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11691–11694

Can you put this and corresponding changes to the separate NFC patch?

11968–11969

Formatting

11981–11985

Outline this code to the separate function? Same code is used in 2 places.

vdmitrie updated this revision to Diff 455358.Aug 24 2022, 1:47 PM

Address review comments.

vdmitrie marked 2 inline comments as done.Aug 24 2022, 1:49 PM
vdmitrie added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11691–11694
vdmitrie updated this revision to Diff 455393.Aug 24 2022, 3:09 PM

update due to nfc split

ABataev added inline comments.Aug 25 2022, 7:18 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812

Why did you decide to remove these functions and embed their code instead?

vdmitrie added inline comments.Aug 25 2022, 9:38 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812

Both methods are only called (and make sense) in context of vector build sequence and that is already ensured by skipping them if findBuildAggregate did not detect a sequence in the loop where they used. If we keep the call inside the methods it will be redundant. If remove that call from both of the methods they would become non self sufficient , i.e. would rely on assumption that they are called within certain context. But assumptions are dangerous things.
As all that context is contained within just a small loop I decided that for better clarity it is be better to submerge these methods into the loop rather than keep them with assumptions (or add an ugly assertions to show that assumption).

ABataev added inline comments.Aug 25 2022, 9:52 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11948–11982

I think you're doing it too early, need to do it after the vectorizeHorReduction, which should start with I, otherwise we may miss single insertelement instruction which has reduction.

vdmitrie added inline comments.Aug 25 2022, 10:18 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11948–11982

You probably meant to make another call to vectorizeHorReduction if we did not match a vector build. I'll try to reproduce the situation you described in a test case.

ABataev added inline comments.Aug 25 2022, 10:25 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11948–11982

I mean for horizontal reductions you don't need to perform this, you can do it after the horizontal reduction matching. Plus continue is too early, if findBuildAggregate fails.

vdmitrie added inline comments.Aug 25 2022, 10:30 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11948–11982

The whole point of the patch was to make sure to visit every vector build operand early. In order to do that we need the call. But I also see what you mean.

vdmitrie updated this revision to Diff 455684.Aug 25 2022, 12:14 PM

Address comments + added coverage test case.

ABataev added inline comments.Aug 25 2022, 12:24 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812–11815

Why do you want still call this after findBuildAggregate?

llvm/test/Transforms/SLPVectorizer/X86/redux-feed-buildvector.ll
117

Please, precommit the test

vdmitrie added inline comments.Aug 25 2022, 12:31 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812–11815

In order to match hor reduction on build vector operands. Can you offer other approach to do the same?

llvm/test/Transforms/SLPVectorizer/X86/redux-feed-buildvector.ll
117

The vectorizer does not change its behavior on this test with the patch . I can commit it separately if you want but that won't be a pre-commit.

ABataev added inline comments.Aug 25 2022, 12:38 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812–11815

But vectorizeRootInstruction already does this, why do you want to do it twice, here and in vectorizeRootInstruction? It increases compile time.
Can we just call vectorizeRootInstruction(nullptr, I, BB, R, TTI); and vectorizeBuildVector after? And remove this loop and final OpsChanged |= tryToVectorize(PostponedInsts, R); call?

vdmitrie added inline comments.Aug 25 2022, 1:05 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812–11815

But vectorizeRootInstruction already does this,

Nope. This is exactly what does not happen after your commit https://reviews.llvm.org/rG7ea03f0b4e4e
Before the commit we traversed through operands of insertelements when collected instructions for work stack and were able to locate all reductions at once.

Can we just call vectorizeRootInstruction(nullptr, I, BB, R, TTI); and vectorizeBuildVector after?

Nope. vectorizeRootInstruction may too early create 2-way vectorizations. We only want to call it after trying on with wider VFs with tryToVectorizeList.

ABataev added inline comments.Aug 25 2022, 1:28 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812–11815

Then you don't need to call vectorizeRootInstruction. Also, why do you need to find operands of the buildvector sequence? Just call vectorizeHorReduction for each insert instruction in the loop and in the second loop call findBuildAggregate and all other stuff for buildvector and postponed insts.

vdmitrie added inline comments.Aug 25 2022, 2:54 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812–11815

Just call vectorizeHorReduction for each insert ...

I don't like this approach. I'd rather enable vectorizeHorReduction to traverse through insertelement operands. BTW you did not explain why you suppressed that.

ABataev added inline comments.Aug 25 2022, 2:57 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812–11815

To save compile time, to avoid analysis of the same instructions several times.
Why you don't like it?

vdmitrie added inline comments.Aug 25 2022, 3:24 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812–11815

To save compile time, to avoid analysis of the same instructions several times.
Why you don't like it?

You are probably talking about https://reviews.llvm.org/D114171 when saying "compile time". But I'm about the different change. Note that the patch that was last reviewed in https://reviews.llvm.org/D114171
does not match what was actually committed (with the reference to differential revision). I'm specifically concerned about this piece of code:

// Try to vectorize operands.
// Continue analysis for the instruction from the same basic block only to
// save compile time.
if (++Level < RecursionMaxDepth)
  for (auto *Op : Inst->operand_values())
    if (VisitedInstrs.insert(Op).second)
      if (auto *I = dyn_cast<Instruction>(Op))
        // Do not try to vectorize CmpInst operands,  this is done
        // separately.
        if (!isa<PHINode>(I) && !isa<CmpInst>(I) && !R.isDeleted(I) &&
            I->getParent() == BB)
          Stack.emplace(I, Level);

where you changed to avoid traversing through insertelement operands:

if (!isa<PHINode, CmpInst, InsertElementInst, InsertValueInst>(I) &&

I'm not convinced that this specific change worth extra compile time.

I can agree that sticking to buildvector is not the best approach, but not traversing the operands early does not save compile time. It only leads to postponing the action for another visit but with much lower chances for vectorizing it the right way.

ABataev added inline comments.Aug 25 2022, 3:37 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812–11815

It was changed after several comments regarding compile time.
There are 2 problems with the original code.

  1. Compile time in case of unsuccessful attempts. We repeat analysis of same instructions at least twice, if not more.
  2. It does not preseve the analysis order for postponable instructions, some of them gets analyzed too early. It breaks internal logic, makes it harder for the perf abd bug analysis. And may lead to a too early vectorization.

The problem that insertelement is also the operand and it maybe a part of the another build vector sequence. In case of too early vectorization attempt we may end up with 2 x vectorization of the operand of the insertelement ibstruction instead of possible wider buildvector sequence.

vdmitrie added inline comments.Aug 25 2022, 4:01 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812–11815

In case of too early vectorization attempt we may end up with 2 x vectorization of the operand of the insertelement ibstruction instead of possible wider buildvector sequence.

Now vectorizeHorRedcution is decoupled from 2-way vectorization. we just need to call vectorizeHorRedcution instead of vectorizeRootInstruction in order to avoid unwanted 2-way vectors early.

ABataev added inline comments.Aug 25 2022, 4:36 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812–11815

And it is good. But then there is a call for vectorizeRootInstruction, which just repeats almost same stuff again. If you want to keep current implementation, I suggest to remove the call for vectorizeRootInstruction and call it if findBuildAggregate fails only, like

if (!findBuildAggregate(I, TTI, BuildVectorOpds, BuildVectorInsts))
    return vectorizeRootInstruction(...);
vdmitrie added inline comments.Aug 25 2022, 4:55 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812–11815

Yep. I was thinking about that too.
I also tried to add an extra loop over instructions just targeting reductions in vectorizeSimpleInstructions:

 // pass1 - try to vectorize reductions only
 for (auto *I : reverse(Instructions)) {
   if (R.isDeleted(I))
     continue;
   if (isa<CmpInst>(I)) {
     PostponedCmps.push_back(I);
    continue;
   }
   OpsChanged |= vectorizeHorReduction(nullptr, I, BB, R, TTI, PostponedInsts);
 }
 // pass2 - try to match and vectorize a buildvector sequence.
 for (auto *I : reverse(Instructions)) {
  if (R.isDeleted(I) || isa<CmpInst>(I))
     continue;
  if (auto *LastInsertValue = dyn_cast<InsertValueInst>(I)) {
    OpsChanged |= vectorizeInsertValueInst(LastInsertValue, BB, R);
  } else if (auto *LastInsertElem = dyn_cast<InsertElementInst>(I)) {
    OpsChanged |= vectorizeInsertElementInst(LastInsertElem, BB, R);
   }
}
 // Now try to vectorize postponed instructions.
 OpsChanged |= tryToVectorize(PostponedInsts, R);

Here we don't stick to vector build for reductions and have no interface changes.
What's your opinion in this approach?

vdmitrie updated this revision to Diff 455766.Aug 25 2022, 5:12 PM
vdmitrie retitled this revision from [SLP] Try to match reductions first in a vector build sequence. to [SLP] Try to match reductions before trying to vectorize a vector build sequence..
vdmitrie edited the summary of this revision. (Show Details)

Yep, that's what I proposed earlier. Can you run perf testing with for these changes?

Yep, that's what I proposed earlier. Can you run perf testing with for these changes?

It's running. There are a couple of things to note. I'm only able to run perf testing on a downstream compiler with quite limited number of benchmarks.
Thus evaluation of performance impact will be based on that and in theory might differ for llvm-project.
And there will be no performance numbers - just can say whether looses/gains are expected.

The performance run did not reveal any regression while there were couple of gains (13% and 18% - two tests in different benchmarks, one of them was targeted). On cpu2017 the numbers are nearly flat.
Tested with benchmarks SPEC CPU2017, Coremark Pro, SPEC HPC 2021, and a few more.

ABataev accepted this revision.Aug 29 2022, 10:02 AM

LG, thanks!

This revision is now accepted and ready to land.Aug 29 2022, 10:02 AM