This patch changes order of searching for reductions vs other vectorization possibilities.
The idea is if we do not match a reduction it won't be harmful for further attempts to
find vectorizable operations on a vector build sequences. But doing it in the opposite
order we have good chance to ruin opportunity to match a reduction later.
We also don't want to try vectorizing binary operations too early as 2-way vectorization
may effectively prohibit wider ones leading to producing less effective code.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11691–11694 | Can you put this and corresponding changes to the separate NFC patch? | |
11968–11969 | Formatting | |
11981–11985 | Outline this code to the separate function? Same code is used in 2 places. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11691–11694 | Split into https://reviews.llvm.org/D132603 |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812 | Why did you decide to remove these functions and embed their code instead? |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812 | Both methods are only called (and make sense) in context of vector build sequence and that is already ensured by skipping them if findBuildAggregate did not detect a sequence in the loop where they used. If we keep the call inside the methods it will be redundant. If remove that call from both of the methods they would become non self sufficient , i.e. would rely on assumption that they are called within certain context. But assumptions are dangerous things. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11948–11952 | I think you're doing it too early, need to do it after the vectorizeHorReduction, which should start with I, otherwise we may miss single insertelement instruction which has reduction. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11948–11952 | You probably meant to make another call to vectorizeHorReduction if we did not match a vector build. I'll try to reproduce the situation you described in a test case. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11948–11952 | I mean for horizontal reductions you don't need to perform this, you can do it after the horizontal reduction matching. Plus continue is too early, if findBuildAggregate fails. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11948–11952 | The whole point of the patch was to make sure to visit every vector build operand early. In order to do that we need the call. But I also see what you mean. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812–11815 | Why do you want still call this after findBuildAggregate? | |
llvm/test/Transforms/SLPVectorizer/X86/redux-feed-buildvector.ll | ||
117 | Please, precommit the test |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812–11815 | In order to match hor reduction on build vector operands. Can you offer other approach to do the same? | |
llvm/test/Transforms/SLPVectorizer/X86/redux-feed-buildvector.ll | ||
117 | The vectorizer does not change its behavior on this test with the patch . I can commit it separately if you want but that won't be a pre-commit. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812–11815 | But vectorizeRootInstruction already does this, why do you want to do it twice, here and in vectorizeRootInstruction? It increases compile time. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812–11815 |
Nope. This is exactly what does not happen after your commit https://reviews.llvm.org/rG7ea03f0b4e4e
Nope. vectorizeRootInstruction may too early create 2-way vectorizations. We only want to call it after trying on with wider VFs with tryToVectorizeList. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812–11815 | Then you don't need to call vectorizeRootInstruction. Also, why do you need to find operands of the buildvector sequence? Just call vectorizeHorReduction for each insert instruction in the loop and in the second loop call findBuildAggregate and all other stuff for buildvector and postponed insts. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812–11815 |
I don't like this approach. I'd rather enable vectorizeHorReduction to traverse through insertelement operands. BTW you did not explain why you suppressed that. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812–11815 | To save compile time, to avoid analysis of the same instructions several times. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812–11815 |
You are probably talking about https://reviews.llvm.org/D114171 when saying "compile time". But I'm about the different change. Note that the patch that was last reviewed in https://reviews.llvm.org/D114171 // Try to vectorize operands. // Continue analysis for the instruction from the same basic block only to // save compile time. if (++Level < RecursionMaxDepth) for (auto *Op : Inst->operand_values()) if (VisitedInstrs.insert(Op).second) if (auto *I = dyn_cast<Instruction>(Op)) // Do not try to vectorize CmpInst operands, this is done // separately. if (!isa<PHINode>(I) && !isa<CmpInst>(I) && !R.isDeleted(I) && I->getParent() == BB) Stack.emplace(I, Level); where you changed to avoid traversing through insertelement operands: if (!isa<PHINode, CmpInst, InsertElementInst, InsertValueInst>(I) && I'm not convinced that this specific change worth extra compile time. I can agree that sticking to buildvector is not the best approach, but not traversing the operands early does not save compile time. It only leads to postponing the action for another visit but with much lower chances for vectorizing it the right way. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812–11815 | It was changed after several comments regarding compile time.
The problem that insertelement is also the operand and it maybe a part of the another build vector sequence. In case of too early vectorization attempt we may end up with 2 x vectorization of the operand of the insertelement ibstruction instead of possible wider buildvector sequence. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812–11815 |
Now vectorizeHorRedcution is decoupled from 2-way vectorization. we just need to call vectorizeHorRedcution instead of vectorizeRootInstruction in order to avoid unwanted 2-way vectors early. |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812–11815 | And it is good. But then there is a call for vectorizeRootInstruction, which just repeats almost same stuff again. If you want to keep current implementation, I suggest to remove the call for vectorizeRootInstruction and call it if findBuildAggregate fails only, like if (!findBuildAggregate(I, TTI, BuildVectorOpds, BuildVectorInsts)) return vectorizeRootInstruction(...); |
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp | ||
---|---|---|
11812–11815 | Yep. I was thinking about that too. // pass1 - try to vectorize reductions only for (auto *I : reverse(Instructions)) { if (R.isDeleted(I)) continue; if (isa<CmpInst>(I)) { PostponedCmps.push_back(I); continue; } OpsChanged |= vectorizeHorReduction(nullptr, I, BB, R, TTI, PostponedInsts); } // pass2 - try to match and vectorize a buildvector sequence. for (auto *I : reverse(Instructions)) { if (R.isDeleted(I) || isa<CmpInst>(I)) continue; if (auto *LastInsertValue = dyn_cast<InsertValueInst>(I)) { OpsChanged |= vectorizeInsertValueInst(LastInsertValue, BB, R); } else if (auto *LastInsertElem = dyn_cast<InsertElementInst>(I)) { OpsChanged |= vectorizeInsertElementInst(LastInsertElem, BB, R); } } // Now try to vectorize postponed instructions. OpsChanged |= tryToVectorize(PostponedInsts, R); Here we don't stick to vector build for reductions and have no interface changes. |
Yep, that's what I proposed earlier. Can you run perf testing with for these changes?
It's running. There are a couple of things to note. I'm only able to run perf testing on a downstream compiler with quite limited number of benchmarks.
Thus evaluation of performance impact will be based on that and in theory might differ for llvm-project.
And there will be no performance numbers - just can say whether looses/gains are expected.
The performance run did not reveal any regression while there were couple of gains (13% and 18% - two tests in different benchmarks, one of them was targeted). On cpu2017 the numbers are nearly flat.
Tested with benchmarks SPEC CPU2017, Coremark Pro, SPEC HPC 2021, and a few more.
Can you put this and corresponding changes to the separate NFC patch?