This is an archive of the discontinued LLVM Phabricator instance.


Differential D132590

[SLP] Try to match reductions before trying to vectorize a vector build sequence.
ClosedPublic

Authored by vdmitrie on Aug 24 2022, 12:19 PM.


Details

Reviewers

ABataev
Vasilis
RKSimon

Commits

rG329b972d416a: [SLP] Try to match reductions before trying to vectorize a vector build…

Summary

This patch changes the order in which we search for reductions versus other vectorization opportunities.
The idea is that failing to match a reduction does no harm to later attempts to
find vectorizable operations in a vector build sequence, whereas doing it in the opposite
order has a good chance of ruining the opportunity to match a reduction later.
We also don't want to try vectorizing binary operations too early, because 2-way vectorization
may effectively prohibit wider ones, leading to less efficient code.
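
The ordering argument can be illustrated with a toy model that is entirely separate from the SLP vectorizer's real data structures. In this sketch every scalar is just a slot that a transformation may consume. `Block`, `matchReduction`, `pairUp` and `reductionMatched` are hypothetical names invented for the illustration, not LLVM functions, and the greedy pairing is only an assumption about how a 2-way attempt could interfere.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: each scalar operation is a slot that is either still free or
// already consumed by an earlier transformation.
struct Block {
  std::vector<bool> Consumed;
  explicit Block(std::size_t N) : Consumed(N, false) {}
};

// Hypothetical stand-in for reduction matching: it succeeds only if every
// scalar is still free and then consumes all of them. A failed attempt
// leaves the block untouched.
bool matchReduction(Block &B) {
  for (bool C : B.Consumed)
    if (C)
      return false;
  B.Consumed.assign(B.Consumed.size(), true);
  return true;
}

// Hypothetical stand-in for 2-way vectorization: greedily consumes adjacent
// free pairs and returns how many pairs were formed.
unsigned pairUp(Block &B) {
  unsigned Pairs = 0;
  for (std::size_t I = 0; I + 1 < B.Consumed.size(); I += 2) {
    if (!B.Consumed[I] && !B.Consumed[I + 1]) {
      B.Consumed[I] = true;
      B.Consumed[I + 1] = true;
      ++Pairs;
    }
  }
  return Pairs;
}

// Runs both steps in the requested order on N scalars and reports whether
// the wide reduction was matched.
bool reductionMatched(bool ReductionFirst, std::size_t N) {
  assert(N > 0 && "expected at least one scalar");
  Block B(N);
  if (ReductionFirst) {
    bool Matched = matchReduction(B);
    pairUp(B);
    return Matched;
  }
  pairUp(B);
  return matchReduction(B);
}
```

In this model `reductionMatched(true, 8)` succeeds while `reductionMatched(false, 8)` fails, because the pairing step has already consumed the scalars the reduction needed. A failed reduction attempt, by contrast, leaves everything available for later steps.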

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

vdmitrie created this revision. Aug 24 2022, 12:19 PM

Herald added a project: Restricted Project. Aug 24 2022, 12:19 PM

Herald added subscribers: vporpo, hiraditya, kristof.beyls.

vdmitrie requested review of this revision. Aug 24 2022, 12:19 PM

Herald added a subscriber: llvm-commits. Aug 24 2022, 12:19 PM

ABataev added inline comments. Aug 24 2022, 12:27 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11691–11694	Can you put this and the corresponding changes into a separate NFC patch?
12007–12008	Formatting
12020–12024	Outline this code into a separate function? The same code is used in 2 places.

Address review comments.

Harbormaster completed remote builds in B183220: Diff 455358. Aug 24 2022, 1:48 PM

vdmitrie marked 2 inline comments as done. Aug 24 2022, 1:49 PM

vdmitrie added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11691–11694	Split into https://reviews.llvm.org/D132603

update due to nfc split

Harbormaster completed remote builds in B183243: Diff 455393. Aug 24 2022, 3:10 PM

update

Harbormaster completed remote builds in B183258: Diff 455418. Aug 24 2022, 4:17 PM

ABataev added inline comments. Aug 25 2022, 7:18 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812	Why did you decide to remove these functions and embed their code instead?

vdmitrie added inline comments. Aug 25 2022, 9:38 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11812	Both methods are only called (and only make sense) in the context of a vector build sequence, and that is already ensured by skipping them if findBuildAggregate did not detect a sequence in the loop where they are used. If we keep the call inside the methods, it is redundant. If we remove that call from both methods, they are no longer self-sufficient, i.e. they rely on the assumption that they are called within a certain context. But assumptions are dangerous things. As all that context is contained within just a small loop, I decided that for clarity it is better to fold these methods into the loop rather than keep them with assumptions (or add ugly assertions to document those assumptions).

ABataev added inline comments. Aug 25 2022, 9:52 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11987–11988	I think you're doing it too early; it needs to be done after vectorizeHorReduction, which should start with I. Otherwise we may miss a single insertelement instruction that has a reduction.

vdmitrie added inline comments. Aug 25 2022, 10:18 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11987–11988	You probably meant to make another call to vectorizeHorReduction if we did not match a vector build. I'll try to reproduce the situation you described in a test case.

ABataev added inline comments. Aug 25 2022, 10:25 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11987–11988	I mean for horizontal reductions you don't need to perform this, you can do it after the horizontal reduction matching. Plus `continue` is too early, if `findBuildAggregate` fails.

vdmitrie added inline comments. Aug 25 2022, 10:30 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11987–11988	The whole point of the patch was to make sure to visit every vector build operand early. In order to do that we need the call. But I also see what you mean.

Address comments + added coverage test case.

ABataev added inline comments. Aug 25 2022, 12:24 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11820–11823	Why do you still want to call this after `findBuildAggregate`?
llvm/test/Transforms/SLPVectorizer/X86/redux-feed-buildvector.ll
117	Please precommit the test.

vdmitrie added inline comments. Aug 25 2022, 12:31 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11820–11823	In order to match horizontal reductions on build vector operands. Can you offer another approach to do the same?
llvm/test/Transforms/SLPVectorizer/X86/redux-feed-buildvector.ll
117	The vectorizer does not change its behavior on this test with the patch. I can commit it separately if you want, but that won't be a pre-commit.

ABataev added inline comments. Aug 25 2022, 12:38 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11820–11823	But vectorizeRootInstruction already does this, why do you want to do it twice, here and in vectorizeRootInstruction? It increases compile time. Can we just call vectorizeRootInstruction(nullptr, I, BB, R, TTI); and vectorizeBuildVector after? And remove this loop and the final OpsChanged |= tryToVectorize(PostponedInsts, R); call?

vdmitrie added inline comments. Aug 25 2022, 1:05 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11820–11823	> But vectorizeRootInstruction already does this,

Nope. This is exactly what does not happen after your commit https://reviews.llvm.org/rG7ea03f0b4e4e. Before that commit we traversed through the operands of insertelements when collecting instructions for the work stack and were able to locate all reductions at once.

> Can we just call vectorizeRootInstruction(nullptr, I, BB, R, TTI); and vectorizeBuildVector after?

Nope. vectorizeRootInstruction may create 2-way vectorizations too early. We only want to call it after trying wider VFs with tryToVectorizeList.

ABataev added inline comments. Aug 25 2022, 1:28 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11820–11823	Then you don't need to call vectorizeRootInstruction. Also, why do you need to find operands of the buildvector sequence? Just call vectorizeHorReduction for each insert instruction in the loop and in the second loop call findBuildAggregate and all other stuff for buildvector and postponed insts.

Harbormaster completed remote builds in B183449: Diff 455684. Aug 25 2022, 1:28 PM

vdmitrie added inline comments. Aug 25 2022, 2:54 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11820–11823	> Just call vectorizeHorReduction for each insert ...

I don't like this approach. I'd rather enable vectorizeHorReduction to traverse through insertelement operands. BTW you did not explain why you suppressed that.

ABataev added inline comments. Aug 25 2022, 2:57 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11820–11823	To save compile time, to avoid analyzing the same instructions several times. Why don't you like it?

vdmitrie added inline comments. Aug 25 2022, 3:24 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11820–11823	> To save compile time, to avoid analysis of the same instructions several times.
> Why you don't like it?

You are probably talking about https://reviews.llvm.org/D114171 when saying "compile time", but I'm talking about a different change. Note that the patch that was last reviewed in https://reviews.llvm.org/D114171 does not match what was actually committed (with the reference to the differential revision). I'm specifically concerned about this piece of code:

 // Try to vectorize operands.
 // Continue analysis for the instruction from the same basic block only to
 // save compile time.
 if (++Level < RecursionMaxDepth)
   for (auto *Op : Inst->operand_values())
     if (VisitedInstrs.insert(Op).second)
       if (auto *I = dyn_cast<Instruction>(Op))
         // Do not try to vectorize CmpInst operands, this is done
         // separately.
         if (!isa<PHINode>(I) && !isa<CmpInst>(I) && !R.isDeleted(I) &&
             I->getParent() == BB)
           Stack.emplace(I, Level);

which you changed to avoid traversing through insertelement operands:

 if (!isa<PHINode, CmpInst, InsertElementInst, InsertValueInst>(I) &&

I'm not convinced that this specific change pays off in compile time. I can agree that sticking to buildvector is not the best approach, but not traversing the operands early does not save compile time. It only postpones the work to another visit, with much lower chances of vectorizing it the right way.
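
The traversal under discussion can be sketched outside LLVM as a depth-limited worklist walk over operands. This is a minimal model, not the code in SLPVectorizer.cpp: `Node`, `Kind`, `countVisited` and the `SkipInserts` flag are hypothetical names, and the model only counts which nodes a single walk reaches. It says nothing about actual compile time.

```cpp
#include <cassert>
#include <set>
#include <stack>
#include <utility>
#include <vector>

// Hypothetical node kinds standing in for LLVM instruction classes.
enum class Kind { BinOp, Phi, Cmp, Insert };

struct Node {
  Kind K;
  std::vector<Node *> Operands;
};

// Walks Root and its operands with an explicit stack, up to MaxDepth levels.
// Each operand is considered once. Phi and Cmp operands are never queued.
// When SkipInserts is true, Insert operands are not queued either, so nothing
// beneath them is reached in this walk. Returns the number of nodes visited.
unsigned countVisited(Node *Root, unsigned MaxDepth, bool SkipInserts) {
  assert(Root && "expected a root node");
  std::set<Node *> Visited;
  Visited.insert(Root);
  std::stack<std::pair<Node *, unsigned>> Stack;
  Stack.emplace(Root, 0u);
  unsigned Count = 0;
  while (!Stack.empty()) {
    Node *N = Stack.top().first;
    unsigned Level = Stack.top().second + 1;
    Stack.pop();
    ++Count;
    if (Level >= MaxDepth)
      continue;
    for (Node *Op : N->Operands) {
      if (!Visited.insert(Op).second)
        continue; // Already considered.
      if (Op->K == Kind::Phi || Op->K == Kind::Cmp)
        continue;
      if (SkipInserts && Op->K == Kind::Insert)
        continue;
      Stack.emplace(Op, Level);
    }
  }
  return Count;
}
```

For a chain where a binary operation uses an insert whose operand is another binary operation, the walk reaches all three nodes with `SkipInserts` off and only the root with it on. The inner operation is then left for a later visit, which is the postponement described above.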

ABataev added inline comments. Aug 25 2022, 3:37 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11820–11823	It was changed after several comments regarding compile time. There are 2 problems with the original code.

1. Compile time in case of unsuccessful attempts. We repeat the analysis of the same instructions at least twice, if not more.
2. It does not preserve the analysis order for postponable instructions; some of them get analyzed too early. That breaks the internal logic, makes perf and bug analysis harder, and may lead to vectorizing too early.

The problem is that an insertelement is also an operand, and it may be part of another buildvector sequence. With a too-early vectorization attempt we may end up with a 2-way vectorization of the insertelement instruction's operand instead of a possible wider buildvector sequence.

vdmitrie added inline comments. Aug 25 2022, 4:01 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11820–11823	> In case of too early vectorization attempt we may end up with 2 x vectorization of the operand of the insertelement instruction instead of possible wider buildvector sequence.

vectorizeHorReduction is now decoupled from 2-way vectorization. We just need to call vectorizeHorReduction instead of vectorizeRootInstruction in order to avoid unwanted early 2-way vectors.

ABataev added inline comments. Aug 25 2022, 4:36 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11820–11823	And it is good. But then there is a call to vectorizeRootInstruction, which just repeats almost the same stuff again. If you want to keep the current implementation, I suggest removing the call to vectorizeRootInstruction and calling it only if findBuildAggregate fails, like

 if (!findBuildAggregate(I, TTI, BuildVectorOpds, BuildVectorInsts))
   return vectorizeRootInstruction(...);

vdmitrie added inline comments. Aug 25 2022, 4:55 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

11820–11823

Yep. I was thinking about that too.
I also tried to add an extra loop over instructions just targeting reductions in vectorizeSimpleInstructions:

 // pass1 - try to vectorize reductions only
 for (auto *I : reverse(Instructions)) {
   if (R.isDeleted(I))
     continue;
   if (isa<CmpInst>(I)) {
     PostponedCmps.push_back(I);
     continue;
   }
   OpsChanged |= vectorizeHorReduction(nullptr, I, BB, R, TTI, PostponedInsts);
 }
 // pass2 - try to match and vectorize a buildvector sequence.
 for (auto *I : reverse(Instructions)) {
   if (R.isDeleted(I) || isa<CmpInst>(I))
     continue;
   if (auto *LastInsertValue = dyn_cast<InsertValueInst>(I)) {
     OpsChanged |= vectorizeInsertValueInst(LastInsertValue, BB, R);
   } else if (auto *LastInsertElem = dyn_cast<InsertElementInst>(I)) {
     OpsChanged |= vectorizeInsertElementInst(LastInsertElem, BB, R);
   }
 }
 // Now try to vectorize postponed instructions.
 OpsChanged |= tryToVectorize(PostponedInsts, R);

Here we don't stick to vector build for reductions and have no interface changes.
What's your opinion on this approach?
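
The control flow of the two-pass proposal can be checked in isolation with a small stand-alone model. It does not call any LLVM API: `Inst`, `Kind` and `planTwoPasses` are hypothetical, and the function merely records which step would be attempted on which instruction, assuming the same reverse iteration order as the snippet above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical instruction kinds relevant to the two passes.
enum class Kind { Cmp, InsertValue, InsertElement, Other };

struct Inst {
  std::string Name;
  Kind K;
  bool Deleted;
};

// Returns a log of the steps in the order they would be attempted.
std::vector<std::string> planTwoPasses(const std::vector<Inst> &Insts) {
  std::vector<std::string> Log;
  // Pass 1: reductions only. Compares are set aside for later.
  for (auto It = Insts.rbegin(); It != Insts.rend(); ++It) {
    if (It->Deleted)
      continue;
    if (It->K == Kind::Cmp) {
      Log.push_back("postpone-cmp " + It->Name);
      continue;
    }
    Log.push_back("reduction " + It->Name);
  }
  // Pass 2: buildvector matching, only for insert instructions.
  for (auto It = Insts.rbegin(); It != Insts.rend(); ++It) {
    if (It->Deleted || It->K == Kind::Cmp)
      continue;
    if (It->K == Kind::InsertValue || It->K == Kind::InsertElement)
      Log.push_back("buildvector " + It->Name);
  }
  // Finally, the binary operations postponed by pass 1 are attempted.
  Log.push_back("postponed-binops");
  assert(!Log.empty() && "the postponed step is always recorded");
  return Log;
}
```

In this model every reduction attempt in the block is logged before the first buildvector attempt, and the postponed binary operations come last, which is the ordering the patch summary argues for.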

vdmitrie updated this revision to Diff 455766. Aug 25 2022, 5:12 PM

vdmitrie retitled this revision from [SLP] Try to match reductions first in a vector build sequence. to [SLP] Try to match reductions before trying to vectorize a vector build sequence..

vdmitrie edited the summary of this revision.

Harbormaster completed remote builds in B183497: Diff 455766. Aug 25 2022, 6:07 PM

Yep, that's what I proposed earlier. Can you run perf testing for these changes?

Herald added a subscriber: pcwang-thead. Aug 26 2022, 3:51 AM

In D132590#3751406, @ABataev wrote:

Yep, that's what I proposed earlier. Can you run perf testing for these changes?

It's running. There are a couple of things to note. I'm only able to run perf testing on a downstream compiler with a quite limited number of benchmarks.
Thus the evaluation of performance impact will be based on that and in theory might differ for llvm-project.
And there will be no performance numbers; I can only say whether losses or gains are expected.

The performance run did not reveal any regressions, while there were a couple of gains (13% and 18%, two tests in different benchmarks, one of which was targeted). On cpu2017 the numbers are nearly flat.
Tested with SPEC CPU2017, Coremark Pro, SPEC HPC 2021, and a few more benchmarks.

LG, thanks!

This revision is now accepted and ready to land. Aug 29 2022, 10:02 AM

Closed by commit rG329b972d416a: [SLP] Try to match reductions before trying to vectorize a vector build… (authored by vdmitrie). Aug 29 2022, 1:42 PM

This revision was automatically updated to reflect the committed changes.

vdmitrie added a commit: rG329b972d416a: [SLP] Try to match reductions before trying to vectorize a vector build….

Revision Contents

Path	Size
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h	10 lines
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	71 lines
llvm/test/Transforms/SLPVectorizer/X86/redux-feed-buildvector.ll	184 lines

Diff 455684

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h

Show First 20 Lines • Show All 130 Lines • ▼ Show 20 Lines	private:

/// Make an attempt to vectorize reduction and then try to vectorize		/// Make an attempt to vectorize reduction and then try to vectorize
/// postponed binary operations.		/// postponed binary operations.
/// \returns true on any successfull vectorization.		/// \returns true on any successfull vectorization.
bool vectorizeRootInstruction(PHINode *P, Value *V, BasicBlock *BB,		bool vectorizeRootInstruction(PHINode *P, Value *V, BasicBlock *BB,
slpvectorizer::BoUpSLP &R,		slpvectorizer::BoUpSLP &R,
TargetTransformInfo *TTI);		TargetTransformInfo *TTI);

/// Try to vectorize trees that start at insertvalue instructions.		/// Tries to match and vectorize a vector build sequance.
bool vectorizeInsertValueInst(InsertValueInst *IVI, BasicBlock *BB,		bool vectorizeBuildVector(Instruction *I, BasicBlock *BB,
slpvectorizer::BoUpSLP &R);

/// Try to vectorize trees that start at insertelement instructions.
bool vectorizeInsertElementInst(InsertElementInst *IEI, BasicBlock *BB,
slpvectorizer::BoUpSLP &R);		slpvectorizer::BoUpSLP &R);

/// Tries to vectorize constructs started from CmpInst, InsertValueInst or		/// Tries to vectorize constructs started from CmpInst, InsertValueInst or
/// InsertElementInst instructions.		/// InsertElementInst instructions.
bool vectorizeSimpleInstructions(InstSetVector &Instructions, BasicBlock *BB,		bool vectorizeSimpleInstructions(InstSetVector &Instructions, BasicBlock *BB,
slpvectorizer::BoUpSLP &R,		slpvectorizer::BoUpSLP &R,
bool AtTerminator);		bool AtTerminator);

/// Scan the basic block and look for patterns that are likely to start		/// Scan the basic block and look for patterns that are likely to start
Show All 18 Lines

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp


Show First 20 Lines • Show All 11,553 Lines • ▼ Show 20 Lines	static Optional<unsigned> getAggregateSize(Instruction *InsertInst) {
} while (true);		} while (true);
}		}

static void findBuildAggregate_rec(Instruction *LastInsertInst,		static void findBuildAggregate_rec(Instruction *LastInsertInst,
TargetTransformInfo *TTI,		TargetTransformInfo *TTI,
SmallVectorImpl<Value *> &BuildVectorOpds,		SmallVectorImpl<Value *> &BuildVectorOpds,
SmallVectorImpl<Value *> &InsertElts,		SmallVectorImpl<Value *> &InsertElts,
unsigned OperandOffset) {		unsigned OperandOffset) {
		assert((isa<InsertElementInst, InsertValueInst>(LastInsertInst)) &&
		"Expected insertelement or insertvalue instruction!");
do {		do {
Value *InsertedOperand = LastInsertInst->getOperand(1);		Value *InsertedOperand = LastInsertInst->getOperand(1);
Optional<unsigned> OperandIndex =		Optional<unsigned> OperandIndex =
getInsertIndex(LastInsertInst, OperandOffset);		getInsertIndex(LastInsertInst, OperandOffset);
if (!OperandIndex)		if (!OperandIndex)
return;		return;
if (isa<InsertElementInst, InsertValueInst>(InsertedOperand)) {		if (isa<InsertElementInst, InsertValueInst>(InsertedOperand)) {
findBuildAggregate_rec(cast<Instruction>(InsertedOperand), TTI,		findBuildAggregate_rec(cast<Instruction>(InsertedOperand), TTI,
Show All 22 Lines
///		///
/// Assume LastInsertInst is of InsertElementInst or InsertValueInst type.		/// Assume LastInsertInst is of InsertElementInst or InsertValueInst type.
///		///
/// \return true if it matches.		/// \return true if it matches.
static bool findBuildAggregate(Instruction *LastInsertInst,		static bool findBuildAggregate(Instruction *LastInsertInst,
TargetTransformInfo *TTI,		TargetTransformInfo *TTI,
SmallVectorImpl<Value *> &BuildVectorOpds,		SmallVectorImpl<Value *> &BuildVectorOpds,
SmallVectorImpl<Value *> &InsertElts) {		SmallVectorImpl<Value *> &InsertElts) {

assert((isa<InsertElementInst>(LastInsertInst) ||
isa<InsertValueInst>(LastInsertInst)) &&
"Expected insertelement or insertvalue instruction!");

assert((BuildVectorOpds.empty() && InsertElts.empty()) &&		assert((BuildVectorOpds.empty() && InsertElts.empty()) &&
"Expected empty result vectors!");		"Expected empty result vectors!");
		if (!isa<InsertElementInst, InsertValueInst>(LastInsertInst))
		return false;

Optional<unsigned> AggregateSize = getAggregateSize(LastInsertInst);		Optional<unsigned> AggregateSize = getAggregateSize(LastInsertInst);
if (!AggregateSize)		if (!AggregateSize)
return false;		return false;
BuildVectorOpds.resize(*AggregateSize);		BuildVectorOpds.resize(*AggregateSize);
InsertElts.resize(*AggregateSize);		InsertElts.resize(*AggregateSize);

findBuildAggregate_rec(LastInsertInst, TTI, BuildVectorOpds, InsertElts, 0);		findBuildAggregate_rec(LastInsertInst, TTI, BuildVectorOpds, InsertElts, 0);
▲ Show 20 Lines • Show All 69 Lines • ▼ Show 20 Lines	if (match(I, m_Intrinsic<Intrinsic::smin>(m_Value(V0), m_Value(V1))))
return true;		return true;
if (match(I, m_Intrinsic<Intrinsic::umax>(m_Value(V0), m_Value(V1))))		if (match(I, m_Intrinsic<Intrinsic::umax>(m_Value(V0), m_Value(V1))))
return true;		return true;
if (match(I, m_Intrinsic<Intrinsic::umin>(m_Value(V0), m_Value(V1))))		if (match(I, m_Intrinsic<Intrinsic::umin>(m_Value(V0), m_Value(V1))))
return true;		return true;
return false;		return false;
}		}

bool SLPVectorizerPass::vectorizeHorReduction(		bool SLPVectorizerPass::vectorizeHorReduction(
PHINode *P, Value *V, BasicBlock *BB, BoUpSLP &R, TargetTransformInfo *TTI,		PHINode *P, Value *V, BasicBlock *BB, BoUpSLP &R, TargetTransformInfo *TTI,
SmallVectorImpl<WeakTrackingVH> &PostponedInsts) {		SmallVectorImpl<WeakTrackingVH> &PostponedInsts) {
if (!ShouldVectorizeHor)		if (!ShouldVectorizeHor)
return false;		return false;

auto *Root = dyn_cast_or_null<Instruction>(V);		auto *Root = dyn_cast_or_null<Instruction>(V);
if (!Root)		if (!Root)
return false;		return false;

if (!isa<BinaryOperator>(Root))		if (!isa<BinaryOperator>(Root))
P = nullptr;		P = nullptr;
▲ Show 20 Lines • Show All 100 Lines • ▼ Show 20 Lines	bool SLPVectorizerPass::tryToVectorize(ArrayRef<WeakTrackingVH> Insts,
BoUpSLP &R) {		BoUpSLP &R) {
bool Res = false;		bool Res = false;
for (Value *V : Insts)		for (Value *V : Insts)
if (auto *Inst = dyn_cast<Instruction>(V); Inst && !R.isDeleted(Inst))		if (auto *Inst = dyn_cast<Instruction>(V); Inst && !R.isDeleted(Inst))
Res |= tryToVectorize(Inst, R);		Res |= tryToVectorize(Inst, R);
return Res;		return Res;
}		}

bool SLPVectorizerPass::vectorizeInsertValueInst(InsertValueInst *IVI,		bool SLPVectorizerPass::vectorizeBuildVector(Instruction *I,
BasicBlock *BB, BoUpSLP &R) {		BasicBlock *BB, BoUpSLP &R) {
const DataLayout &DL = BB->getModule()->getDataLayout();
if (!R.canMapToVector(IVI->getType(), DL))
return false;

SmallVector<Value *, 16> BuildVectorOpds;		SmallVector<Value *, 16> BuildVectorOpds;
SmallVector<Value *, 16> BuildVectorInsts;		SmallVector<Value *, 16> BuildVectorInsts;
if (!findBuildAggregate(IVI, TTI, BuildVectorOpds, BuildVectorInsts))		if (!findBuildAggregate(I, TTI, BuildVectorOpds, BuildVectorInsts))
return false;		return false;

LLVM_DEBUG(dbgs() << "SLP: array mappable to vector: " << *IVI << "\n");		bool OpsChanged = false;
// Aggregate value is unlikely to be processed in vector register.		// Try to find reductions in buildvector sequences.
return tryToVectorizeList(BuildVectorOpds, R);		SmallVector<WeakTrackingVH> PostponedInsts;
}		for (Value *Op : BuildVectorOpds)
		OpsChanged |=
		vectorizeHorReduction(nullptr, Op, BB, R, TTI, PostponedInsts);

bool SLPVectorizerPass::vectorizeInsertElementInst(InsertElementInst *IEI,
BasicBlock *BB, BoUpSLP &R) {
SmallVector<Value *, 16> BuildVectorInsts;
SmallVector<Value *, 16> BuildVectorOpds;
SmallVector<int> Mask;		SmallVector<int> Mask;
if (!findBuildAggregate(IEI, TTI, BuildVectorOpds, BuildVectorInsts) ||		if (isa<InsertValueInst>(I)) {
(llvm::all_of(		const DataLayout &DL = BB->getModule()->getDataLayout();
BuildVectorOpds,		if (R.canMapToVector(I->getType(), DL)) {
[](Value *V) { return isa<ExtractElementInst, UndefValue>(V); }) &&
isFixedVectorShuffle(BuildVectorOpds, Mask)))		LLVM_DEBUG(dbgs() << "SLP: array mappable to vector: " << *I << "\n");
return false;		// Aggregate value is unlikely to be processed in vector register.
		OpsChanged |= tryToVectorizeList(BuildVectorOpds, R);
		}
		} else if (isa<InsertElementInst>(I) &&
		!(all_of(BuildVectorOpds,
		[](Value *V) {
		return isa<ExtractElementInst, UndefValue>(V);
		}) &&
		isFixedVectorShuffle(BuildVectorOpds, Mask))) {

LLVM_DEBUG(dbgs() << "SLP: array mappable to vector: " << *IEI << "\n");		LLVM_DEBUG(dbgs() << "SLP: array mappable to vector: " << *I << "\n");
return tryToVectorizeList(BuildVectorInsts, R);		OpsChanged |= tryToVectorizeList(BuildVectorInsts, R);
		}
		// Try to vectorize postponed binops where reductions were not found.
		OpsChanged |= tryToVectorize(PostponedInsts, R);
		return OpsChanged;
}		}

template <typename T>		template <typename T>
static bool		static bool
tryToVectorizeSequence(SmallVectorImpl<T *> &Incoming,		tryToVectorizeSequence(SmallVectorImpl<T *> &Incoming,
function_ref<unsigned(T *)> Limit,		function_ref<unsigned(T *)> Limit,
function_ref<bool(T *, T *)> Comparator,		function_ref<bool(T *, T *)> Comparator,
function_ref<bool(T *, T *)> AreCompatible,		function_ref<bool(T *, T *)> AreCompatible,
... (119 unchanged lines not shown) ...
bool SLPVectorizerPass::vectorizeSimpleInstructions(InstSetVector &Instructions,		bool SLPVectorizerPass::vectorizeSimpleInstructions(InstSetVector &Instructions,
BasicBlock *BB, BoUpSLP &R,		BasicBlock *BB, BoUpSLP &R,
bool AtTerminator) {		bool AtTerminator) {
bool OpsChanged = false;		bool OpsChanged = false;
SmallVector<Instruction *, 4> PostponedCmps;		SmallVector<Instruction *, 4> PostponedCmps;
for (auto *I : reverse(Instructions)) {		for (auto *I : reverse(Instructions)) {
if (R.isDeleted(I))		if (R.isDeleted(I))
continue;		continue;
if (auto *LastInsertValue = dyn_cast<InsertValueInst>(I)) {		if (isa<CmpInst>(I)) {
OpsChanged |= vectorizeInsertValueInst(LastInsertValue, BB, R);
} else if (auto *LastInsertElem = dyn_cast<InsertElementInst>(I)) {
OpsChanged |= vectorizeInsertElementInst(LastInsertElem, BB, R);
} else if (isa<CmpInst>(I)) {
PostponedCmps.push_back(I);		PostponedCmps.push_back(I);
continue;		continue;
}		}
// Try to find reductions in buildvector sequnces.		OpsChanged |= vectorizeBuildVector(I, BB, R);
OpsChanged |= vectorizeRootInstruction(nullptr, I, BB, R, TTI);		OpsChanged |= vectorizeRootInstruction(nullptr, I, BB, R, TTI);
}		}
		ABataev: I think you're doing it too early; this needs to happen after vectorizeHorReduction, which should start with I. Otherwise we may miss a single insertelement instruction that has a reduction.
		vdmitrie (Author): You probably meant to make another call to vectorizeHorReduction if we did not match a vector build. I'll try to reproduce the situation you described in a test case.
		ABataev: I mean that for horizontal reductions you don't need to perform this; you can do it after the horizontal reduction matching. Plus, the `continue` is too early if `findBuildAggregate` fails.
		vdmitrie (Author): The whole point of the patch was to make sure we visit every vector build operand early. To do that we need the call. But I also see what you mean.
if (AtTerminator) {		if (AtTerminator) {
// Try to find reductions first.		// Try to find reductions first.
for (Instruction *I : PostponedCmps) {		for (Instruction *I : PostponedCmps) {
if (R.isDeleted(I))		if (R.isDeleted(I))
continue;		continue;
for (Value *Op : I->operands())		for (Value *Op : I->operands())
OpsChanged |= vectorizeRootInstruction(nullptr, Op, BB, R, TTI);		OpsChanged |= vectorizeRootInstruction(nullptr, Op, BB, R, TTI);
}		}
// Try to vectorize operands as vector bundles.		// Try to vectorize operands as vector bundles.
for (Instruction *I : PostponedCmps) {		for (Instruction *I : PostponedCmps) {
if (R.isDeleted(I))		if (R.isDeleted(I))
continue;		continue;
OpsChanged |= tryToVectorize(I, R);		OpsChanged |= tryToVectorize(I, R);
}		}
// Try to vectorize list of compares.		// Try to vectorize list of compares.
// Sort by type, compare predicate, etc.		// Sort by type, compare predicate, etc.
auto &&CompareSorter = [&R](Value *V, Value *V2) {		auto &&CompareSorter = [&R](Value *V, Value *V2) {
return compareCmp<false>(V, V2,		return compareCmp<false>(V, V2,
[&R](Instruction *I) { return R.isDeleted(I); });		[&R](Instruction *I) { return R.isDeleted(I); });
};		};
		ABataev: Formatting

auto &&AreCompatibleCompares = [&R](Value *V1, Value *V2) {		auto &&AreCompatibleCompares = [&R](Value *V1, Value *V2) {
if (V1 == V2)		if (V1 == V2)
return true;		return true;
return compareCmp<true>(V1, V2,		return compareCmp<true>(V1, V2,
[&R](Instruction *I) { return R.isDeleted(I); });		[&R](Instruction *I) { return R.isDeleted(I); });
};		};
auto Limit = [&R](Value *V) {		auto Limit = [&R](Value *V) {
unsigned EltSize = R.getVectorElementSize(V);		unsigned EltSize = R.getVectorElementSize(V);
return std::max(2U, R.getMaxVecRegSize() / EltSize);		return std::max(2U, R.getMaxVecRegSize() / EltSize);
};		};

SmallVector<Value *> Vals(PostponedCmps.begin(), PostponedCmps.end());		SmallVector<Value *> Vals(PostponedCmps.begin(), PostponedCmps.end());
OpsChanged |= tryToVectorizeSequence<Value>(		OpsChanged |= tryToVectorizeSequence<Value>(
Vals, Limit, CompareSorter, AreCompatibleCompares,		Vals, Limit, CompareSorter, AreCompatibleCompares,
[this, &R](ArrayRef<Value *> Candidates, bool LimitForRegisterSize) {		[this, &R](ArrayRef<Value *> Candidates, bool LimitForRegisterSize) {
		ABataev: Outline this code into a separate function? The same code is used in 2 places.
// Exclude possible reductions from other blocks.		// Exclude possible reductions from other blocks.
bool ArePossiblyReducedInOtherBlock =		bool ArePossiblyReducedInOtherBlock =
any_of(Candidates, [](Value *V) {		any_of(Candidates, [](Value *V) {
return any_of(V->users(), [V](User *U) {		return any_of(V->users(), [V](User *U) {
return isa<SelectInst>(U) &&		return isa<SelectInst>(U) &&
cast<SelectInst>(U)->getParent() !=		cast<SelectInst>(U)->getParent() !=
cast<Instruction>(V)->getParent();		cast<Instruction>(V)->getParent();
});		});
... (475 further lines not shown) ...
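The `Limit` lambda above caps the size of a candidate group at the number of elements that fit in the widest vector register, with a floor of 2. A minimal stand-alone sketch of that arithmetic follows. `maxGroupSize` and its parameter names are hypothetical, and the register widths used in the note below are assumed values for illustration, not values queried from LLVM.

```cpp
#include <algorithm>

// Mirrors std::max(2U, R.getMaxVecRegSize() / EltSize) from the Limit lambda.
// Both sizes are in bits; EltSizeBits must be non-zero.
unsigned maxGroupSize(unsigned MaxVecRegSizeBits, unsigned EltSizeBits) {
  return std::max(2U, MaxVecRegSizeBits / EltSizeBits);
}
```

For 64-bit elements and an assumed 512-bit register this yields 8; for an assumed 64-bit register the division gives 1 and the floor of 2 applies.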

llvm/test/Transforms/SLPVectorizer/X86/redux-feed-buildvector.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -mtriple=x86_64 -slp-vectorizer -S -mcpu=skylake-avx512 | FileCheck %s		; RUN: opt < %s -mtriple=x86_64 -slp-vectorizer -S -mcpu=skylake-avx512 | FileCheck %s

; The test represents the case with multiple vectorization possibilities		; The test represents the case with multiple vectorization possibilities
; but the most effective way to vectorize it is to match both 8-way reductions		; but the most effective way to vectorize it is to match both 8-way reductions
; feeding the insertelement vector build sequence.		; feeding the insertelement vector build sequence.

declare void @llvm.masked.scatter.v2f64.v2p0f64(<2 x double>, <2 x double*>, i32 immarg, <2 x i1>)		declare void @llvm.masked.scatter.v2f64.v2p0f64(<2 x double>, <2 x double*>, i32 immarg, <2 x i1>)

define void @test(double* nocapture readonly %arg, double* nocapture readonly %arg1, double* nocapture %arg2) {		define void @rdx_feeds_buildvector(double* nocapture readonly %arg, double* nocapture readonly %arg1, double* nocapture %arg2) {
; CHECK-LABEL: @test(		; CHECK-LABEL: @rdx_feeds_buildvector(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[GEP1_0:%.*]] = getelementptr inbounds double, double* [[ARG:%.*]], i64 1		; CHECK-NEXT: [[TMP0:%.*]] = insertelement <8 x double*> poison, double* [[ARG:%.*]], i32 0
; CHECK-NEXT: [[LD1_0:%.*]] = load double, double* [[GEP1_0]], align 8		; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x double*> [[TMP0]], <8 x double*> poison, <8 x i32> zeroinitializer
		; CHECK-NEXT: [[TMP1:%.*]] = getelementptr double, <8 x double*> [[SHUFFLE]], <8 x i64> <i64 1, i64 3, i64 5, i64 7, i64 9, i64 11, i64 13, i64 15>
; CHECK-NEXT: [[GEP2_0:%.*]] = getelementptr inbounds double, double* [[ARG1:%.*]], i64 16		; CHECK-NEXT: [[GEP2_0:%.*]] = getelementptr inbounds double, double* [[ARG1:%.*]], i64 16
; CHECK-NEXT: [[GEP1_1:%.*]] = getelementptr inbounds double, double* [[ARG]], i64 3		; CHECK-NEXT: [[TMP2:%.*]] = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> [[TMP1]], i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
; CHECK-NEXT: [[LD1_1:%.*]] = load double, double* [[GEP1_1]], align 8		; CHECK-NEXT: [[TMP3:%.*]] = bitcast double* [[ARG1]] to <8 x double>*
; CHECK-NEXT: [[GEP0_1:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 1		; CHECK-NEXT: [[TMP4:%.*]] = load <8 x double>, <8 x double>* [[TMP3]], align 8
; CHECK-NEXT: [[GEP2_1:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 17		; CHECK-NEXT: [[TMP5:%.*]] = fmul fast <8 x double> [[TMP4]], [[TMP2]]
; CHECK-NEXT: [[GEP1_2:%.*]] = getelementptr inbounds double, double* [[ARG]], i64 5		; CHECK-NEXT: [[TMP6:%.*]] = call fast double @llvm.vector.reduce.fadd.v8f64(double -0.000000e+00, <8 x double> [[TMP5]])
; CHECK-NEXT: [[LD1_2:%.*]] = load double, double* [[GEP1_2]], align 8		; CHECK-NEXT: [[TMP7:%.*]] = bitcast double* [[GEP2_0]] to <8 x double>*
; CHECK-NEXT: [[GEP0_2:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 2		; CHECK-NEXT: [[TMP8:%.*]] = load <8 x double>, <8 x double>* [[TMP7]], align 8
; CHECK-NEXT: [[GEP2_2:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 18		; CHECK-NEXT: [[TMP9:%.*]] = fmul fast <8 x double> [[TMP8]], [[TMP2]]
; CHECK-NEXT: [[GEP1_3:%.*]] = getelementptr inbounds double, double* [[ARG]], i64 7		; CHECK-NEXT: [[TMP10:%.*]] = call fast double @llvm.vector.reduce.fadd.v8f64(double -0.000000e+00, <8 x double> [[TMP9]])
; CHECK-NEXT: [[LD1_3:%.*]] = load double, double* [[GEP1_3]], align 8		; CHECK-NEXT: [[I142:%.*]] = insertelement <2 x double> poison, double [[TMP6]], i64 0
; CHECK-NEXT: [[GEP0_3:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 3		; CHECK-NEXT: [[I143:%.*]] = insertelement <2 x double> [[I142]], double [[TMP10]], i64 1
; CHECK-NEXT: [[GEP2_3:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 19
; CHECK-NEXT: [[GEP1_4:%.*]] = getelementptr inbounds double, double* [[ARG]], i64 9
; CHECK-NEXT: [[LD1_4:%.*]] = load double, double* [[GEP1_4]], align 8
; CHECK-NEXT: [[GEP0_4:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 4
; CHECK-NEXT: [[GEP2_4:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 20
; CHECK-NEXT: [[GEP1_5:%.*]] = getelementptr inbounds double, double* [[ARG]], i64 11
; CHECK-NEXT: [[LD1_5:%.*]] = load double, double* [[GEP1_5]], align 8
; CHECK-NEXT: [[GEP0_5:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 5
; CHECK-NEXT: [[GEP2_5:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 21
; CHECK-NEXT: [[GEP1_6:%.*]] = getelementptr inbounds double, double* [[ARG]], i64 13
; CHECK-NEXT: [[LD1_6:%.*]] = load double, double* [[GEP1_6]], align 8
; CHECK-NEXT: [[GEP0_6:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 6
; CHECK-NEXT: [[GEP2_6:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 22
; CHECK-NEXT: [[GEP1_7:%.*]] = getelementptr inbounds double, double* [[ARG]], i64 15
; CHECK-NEXT: [[LD1_7:%.*]] = load double, double* [[GEP1_7]], align 8
; CHECK-NEXT: [[GEP0_7:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 7
; CHECK-NEXT: [[GEP2_7:%.*]] = getelementptr inbounds double, double* [[ARG1]], i64 23
; CHECK-NEXT: [[LD0_0:%.*]] = load double, double* [[ARG1]], align 8
; CHECK-NEXT: [[LD2_0:%.*]] = load double, double* [[GEP2_0]], align 8
; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> poison, double [[LD0_0]], i32 0
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double> [[TMP0]], double [[LD2_0]], i32 1
; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> poison, double [[LD1_0]], i32 0
; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[LD1_0]], i32 1
; CHECK-NEXT: [[TMP4:%.*]] = fmul fast <2 x double> [[TMP1]], [[TMP3]]
; CHECK-NEXT: [[LD0_1:%.*]] = load double, double* [[GEP0_1]], align 8
; CHECK-NEXT: [[LD2_1:%.*]] = load double, double* [[GEP2_1]], align 8
; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> poison, double [[LD0_1]], i32 0
; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[LD2_1]], i32 1
; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> poison, double [[LD1_1]], i32 0
; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP7]], double [[LD1_1]], i32 1
; CHECK-NEXT: [[TMP9:%.*]] = fmul fast <2 x double> [[TMP6]], [[TMP8]]
; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP4]], [[TMP9]]
; CHECK-NEXT: [[LD0_2:%.*]] = load double, double* [[GEP0_2]], align 8
; CHECK-NEXT: [[LD2_2:%.*]] = load double, double* [[GEP2_2]], align 8
; CHECK-NEXT: [[TMP11:%.*]] = insertelement <2 x double> poison, double [[LD0_2]], i32 0
; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x double> [[TMP11]], double [[LD2_2]], i32 1
; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x double> poison, double [[LD1_2]], i32 0
; CHECK-NEXT: [[TMP14:%.*]] = insertelement <2 x double> [[TMP13]], double [[LD1_2]], i32 1
; CHECK-NEXT: [[TMP15:%.*]] = fmul fast <2 x double> [[TMP12]], [[TMP14]]
; CHECK-NEXT: [[TMP16:%.*]] = fadd fast <2 x double> [[TMP10]], [[TMP15]]
; CHECK-NEXT: [[LD0_3:%.*]] = load double, double* [[GEP0_3]], align 8
; CHECK-NEXT: [[LD2_3:%.*]] = load double, double* [[GEP2_3]], align 8
; CHECK-NEXT: [[TMP17:%.*]] = insertelement <2 x double> poison, double [[LD0_3]], i32 0
; CHECK-NEXT: [[TMP18:%.*]] = insertelement <2 x double> [[TMP17]], double [[LD2_3]], i32 1
; CHECK-NEXT: [[TMP19:%.*]] = insertelement <2 x double> poison, double [[LD1_3]], i32 0
; CHECK-NEXT: [[TMP20:%.*]] = insertelement <2 x double> [[TMP19]], double [[LD1_3]], i32 1
; CHECK-NEXT: [[TMP21:%.*]] = fmul fast <2 x double> [[TMP18]], [[TMP20]]
; CHECK-NEXT: [[TMP22:%.*]] = fadd fast <2 x double> [[TMP16]], [[TMP21]]
; CHECK-NEXT: [[LD0_4:%.*]] = load double, double* [[GEP0_4]], align 8
; CHECK-NEXT: [[LD2_4:%.*]] = load double, double* [[GEP2_4]], align 8
; CHECK-NEXT: [[TMP23:%.*]] = insertelement <2 x double> poison, double [[LD0_4]], i32 0
; CHECK-NEXT: [[TMP24:%.*]] = insertelement <2 x double> [[TMP23]], double [[LD2_4]], i32 1
; CHECK-NEXT: [[TMP25:%.*]] = insertelement <2 x double> poison, double [[LD1_4]], i32 0
; CHECK-NEXT: [[TMP26:%.*]] = insertelement <2 x double> [[TMP25]], double [[LD1_4]], i32 1
; CHECK-NEXT: [[TMP27:%.*]] = fmul fast <2 x double> [[TMP24]], [[TMP26]]
; CHECK-NEXT: [[TMP28:%.*]] = fadd fast <2 x double> [[TMP22]], [[TMP27]]
; CHECK-NEXT: [[LD0_5:%.*]] = load double, double* [[GEP0_5]], align 8
; CHECK-NEXT: [[LD2_5:%.*]] = load double, double* [[GEP2_5]], align 8
; CHECK-NEXT: [[TMP29:%.*]] = insertelement <2 x double> poison, double [[LD0_5]], i32 0
; CHECK-NEXT: [[TMP30:%.*]] = insertelement <2 x double> [[TMP29]], double [[LD2_5]], i32 1
; CHECK-NEXT: [[TMP31:%.*]] = insertelement <2 x double> poison, double [[LD1_5]], i32 0
; CHECK-NEXT: [[TMP32:%.*]] = insertelement <2 x double> [[TMP31]], double [[LD1_5]], i32 1
; CHECK-NEXT: [[TMP33:%.*]] = fmul fast <2 x double> [[TMP30]], [[TMP32]]
; CHECK-NEXT: [[TMP34:%.*]] = fadd fast <2 x double> [[TMP28]], [[TMP33]]
; CHECK-NEXT: [[LD0_6:%.*]] = load double, double* [[GEP0_6]], align 8
; CHECK-NEXT: [[LD2_6:%.*]] = load double, double* [[GEP2_6]], align 8
; CHECK-NEXT: [[TMP35:%.*]] = insertelement <2 x double> poison, double [[LD0_6]], i32 0
; CHECK-NEXT: [[TMP36:%.*]] = insertelement <2 x double> [[TMP35]], double [[LD2_6]], i32 1
; CHECK-NEXT: [[TMP37:%.*]] = insertelement <2 x double> poison, double [[LD1_6]], i32 0
; CHECK-NEXT: [[TMP38:%.*]] = insertelement <2 x double> [[TMP37]], double [[LD1_6]], i32 1
; CHECK-NEXT: [[TMP39:%.*]] = fmul fast <2 x double> [[TMP36]], [[TMP38]]
; CHECK-NEXT: [[TMP40:%.*]] = fadd fast <2 x double> [[TMP34]], [[TMP39]]
; CHECK-NEXT: [[LD0_7:%.*]] = load double, double* [[GEP0_7]], align 8
; CHECK-NEXT: [[LD2_7:%.*]] = load double, double* [[GEP2_7]], align 8
; CHECK-NEXT: [[TMP41:%.*]] = insertelement <2 x double> poison, double [[LD0_7]], i32 0
; CHECK-NEXT: [[TMP42:%.*]] = insertelement <2 x double> [[TMP41]], double [[LD2_7]], i32 1
; CHECK-NEXT: [[TMP43:%.*]] = insertelement <2 x double> poison, double [[LD1_7]], i32 0
; CHECK-NEXT: [[TMP44:%.*]] = insertelement <2 x double> [[TMP43]], double [[LD1_7]], i32 1
; CHECK-NEXT: [[TMP45:%.*]] = fmul fast <2 x double> [[TMP42]], [[TMP44]]
; CHECK-NEXT: [[TMP46:%.*]] = fadd fast <2 x double> [[TMP40]], [[TMP45]]
; CHECK-NEXT: [[P:%.*]] = getelementptr inbounds double, double* [[ARG2:%.*]], <2 x i64> <i64 0, i64 16>		; CHECK-NEXT: [[P:%.*]] = getelementptr inbounds double, double* [[ARG2:%.*]], <2 x i64> <i64 0, i64 16>
; CHECK-NEXT: call void @llvm.masked.scatter.v2f64.v2p0f64(<2 x double> [[TMP46]], <2 x double*> [[P]], i32 8, <2 x i1> <i1 true, i1 true>)		; CHECK-NEXT: call void @llvm.masked.scatter.v2f64.v2p0f64(<2 x double> [[I143]], <2 x double*> [[P]], i32 8, <2 x i1> <i1 true, i1 true>)
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%gep1.0 = getelementptr inbounds double, double* %arg, i64 1		%gep1.0 = getelementptr inbounds double, double* %arg, i64 1
%ld1.0 = load double, double* %gep1.0, align 8		%ld1.0 = load double, double* %gep1.0, align 8
%ld0.0 = load double, double* %arg1, align 8		%ld0.0 = load double, double* %arg1, align 8
%mul1.0 = fmul fast double %ld0.0, %ld1.0		%mul1.0 = fmul fast double %ld0.0, %ld1.0
%gep2.0 = getelementptr inbounds double, double* %arg1, i64 16		%gep2.0 = getelementptr inbounds double, double* %arg1, i64 16
... (70 unchanged lines not shown) ...
%mul2.7 = fmul fast double %ld2.7, %ld1.7		%mul2.7 = fmul fast double %ld2.7, %ld1.7
%rdx2 = fadd fast double %rdx2.5, %mul2.7		%rdx2 = fadd fast double %rdx2.5, %mul2.7
%i142 = insertelement <2 x double> poison, double %rdx1, i64 0		%i142 = insertelement <2 x double> poison, double %rdx1, i64 0
%i143 = insertelement <2 x double> %i142, double %rdx2, i64 1		%i143 = insertelement <2 x double> %i142, double %rdx2, i64 1
%p = getelementptr inbounds double, double* %arg2, <2 x i64> <i64 0, i64 16>		%p = getelementptr inbounds double, double* %arg2, <2 x i64> <i64 0, i64 16>
call void @llvm.masked.scatter.v2f64.v2p0f64(<2 x double> %i143, <2 x double*> %p, i32 8, <2 x i1> <i1 true, i1 true>)		call void @llvm.masked.scatter.v2f64.v2p0f64(<2 x double> %i143, <2 x double*> %p, i32 8, <2 x i1> <i1 true, i1 true>)
ret void		ret void
}		}

		; In this test reduction feeds single insertelement instruction
		ABataev: Please precommit the test.
		vdmitrie (Author): The vectorizer does not change its behavior on this test with the patch. I can commit it separately if you want, but that won't be a pre-commit.
		define void @rdx_feeds_single_insert(<2 x double> %v, double* nocapture readonly %arg, double* nocapture readonly %arg1, double* nocapture %arg2) {
		; CHECK-LABEL: @rdx_feeds_single_insert(
		; CHECK-NEXT: entry:
		; CHECK-NEXT: [[TMP0:%.*]] = insertelement <8 x double*> poison, double* [[ARG:%.*]], i32 0
		; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <8 x double*> [[TMP0]], <8 x double*> poison, <8 x i32> zeroinitializer
		; CHECK-NEXT: [[TMP1:%.*]] = getelementptr double, <8 x double*> [[SHUFFLE]], <8 x i64> <i64 1, i64 3, i64 5, i64 7, i64 9, i64 11, i64 13, i64 15>
		; CHECK-NEXT: [[TMP2:%.*]] = call <8 x double> @llvm.masked.gather.v8f64.v8p0f64(<8 x double*> [[TMP1]], i32 8, <8 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <8 x double> undef)
		; CHECK-NEXT: [[TMP3:%.*]] = bitcast double* [[ARG1:%.*]] to <8 x double>*
		; CHECK-NEXT: [[TMP4:%.*]] = load <8 x double>, <8 x double>* [[TMP3]], align 8
		; CHECK-NEXT: [[TMP5:%.*]] = fmul fast <8 x double> [[TMP4]], [[TMP2]]
		; CHECK-NEXT: [[TMP6:%.*]] = call fast double @llvm.vector.reduce.fadd.v8f64(double -0.000000e+00, <8 x double> [[TMP5]])
		; CHECK-NEXT: [[I:%.*]] = insertelement <2 x double> [[V:%.*]], double [[TMP6]], i64 1
		; CHECK-NEXT: [[P:%.*]] = getelementptr inbounds double, double* [[ARG2:%.*]], <2 x i64> <i64 0, i64 16>
		; CHECK-NEXT: call void @llvm.masked.scatter.v2f64.v2p0f64(<2 x double> [[I]], <2 x double*> [[P]], i32 8, <2 x i1> <i1 true, i1 true>)
		; CHECK-NEXT: ret void
		;
		entry:
		%gep1.0 = getelementptr inbounds double, double* %arg, i64 1
		%ld1.0 = load double, double* %gep1.0, align 8
		%ld0.0 = load double, double* %arg1, align 8
		%mul1.0 = fmul fast double %ld0.0, %ld1.0
		%gep1.1 = getelementptr inbounds double, double* %arg, i64 3
		%ld1.1 = load double, double* %gep1.1, align 8
		%gep0.1 = getelementptr inbounds double, double* %arg1, i64 1
		%ld0.1 = load double, double* %gep0.1, align 8
		%mul1.1 = fmul fast double %ld0.1, %ld1.1
		%rdx1.0 = fadd fast double %mul1.0, %mul1.1
		%gep1.2 = getelementptr inbounds double, double* %arg, i64 5
		%ld1.2 = load double, double* %gep1.2, align 8
		%gep0.2 = getelementptr inbounds double, double* %arg1, i64 2
		%ld0.2 = load double, double* %gep0.2, align 8
		%mul1.2 = fmul fast double %ld0.2, %ld1.2
		%rdx1.1 = fadd fast double %rdx1.0, %mul1.2
		%gep1.3 = getelementptr inbounds double, double* %arg, i64 7
		%ld1.3 = load double, double* %gep1.3, align 8
		%gep0.3 = getelementptr inbounds double, double* %arg1, i64 3
		%ld0.3 = load double, double* %gep0.3, align 8
		%mul1.3 = fmul fast double %ld0.3, %ld1.3
		%rdx1.2 = fadd fast double %rdx1.1, %mul1.3
		%gep1.4 = getelementptr inbounds double, double* %arg, i64 9
		%ld1.4 = load double, double* %gep1.4, align 8
		%gep0.4 = getelementptr inbounds double, double* %arg1, i64 4
		%ld0.4 = load double, double* %gep0.4, align 8
		%mul1.4 = fmul fast double %ld0.4, %ld1.4
		%rdx1.3 = fadd fast double %rdx1.2, %mul1.4
		%gep1.5 = getelementptr inbounds double, double* %arg, i64 11
		%ld1.5 = load double, double* %gep1.5, align 8
		%gep0.5 = getelementptr inbounds double, double* %arg1, i64 5
		%ld0.5 = load double, double* %gep0.5, align 8
		%mul1.5 = fmul fast double %ld0.5, %ld1.5
		%rdx1.4 = fadd fast double %rdx1.3, %mul1.5
		%gep1.6 = getelementptr inbounds double, double* %arg, i64 13
		%ld1.6 = load double, double* %gep1.6, align 8
		%gep0.6 = getelementptr inbounds double, double* %arg1, i64 6
		%ld0.6 = load double, double* %gep0.6, align 8
		%mul1.6 = fmul fast double %ld0.6, %ld1.6
		%rdx1.5 = fadd fast double %rdx1.4, %mul1.6
		%gep1.7 = getelementptr inbounds double, double* %arg, i64 15
		%ld1.7 = load double, double* %gep1.7, align 8
		%gep0.7 = getelementptr inbounds double, double* %arg1, i64 7
		%ld0.7 = load double, double* %gep0.7, align 8
		%mul1.7 = fmul fast double %ld0.7, %ld1.7
		%rdx1 = fadd fast double %rdx1.5, %mul1.7
		%i = insertelement <2 x double> %v, double %rdx1, i64 1
		%p = getelementptr inbounds double, double* %arg2, <2 x i64> <i64 0, i64 16>
		call void @llvm.masked.scatter.v2f64.v2p0f64(<2 x double> %i, <2 x double*> %p, i32 8, <2 x i1> <i1 true, i1 true>)
		ret void
		}
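
For readers who find the IR hard to follow, the first test function, `@rdx_feeds_buildvector`, computes roughly the following at source level. This is a hand-written C++ approximation reconstructed from the visible IR and CHECK lines, not the code the test was derived from. The function name and signature are assumptions, and the `fast` flags on the IR operations (which permit reassociating the sums) have no direct equivalent here.

```cpp
// Hand-written approximation of @rdx_feeds_buildvector; names are assumptions.
void rdxFeedsBuildVector(const double *Arg, const double *Arg1, double *Arg2) {
  double Rdx1 = 0.0, Rdx2 = 0.0;
  for (int I = 0; I < 8; ++I) {
    double Ld1 = Arg[2 * I + 1];  // %ld1.N: odd elements 1, 3, ..., 15
    Rdx1 += Arg1[I] * Ld1;        // %mul1.N, summed into %rdx1
    Rdx2 += Arg1[16 + I] * Ld1;   // %mul2.N, summed into %rdx2
  }
  // The two sums are the elements of the <2 x double> build vector that the
  // masked scatter stores at offsets 0 and 16 of %arg2.
  Arg2[0] = Rdx1;
  Arg2[16] = Rdx2;
}
```

Each sum is an 8-element dot product that shares the strided loads from `Arg`, which is why matching both as 8-way reductions is preferable to pairing the two chains element by element into `<2 x double>` operations.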


Diff 455684

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

llvm/test/Transforms/SLPVectorizer/X86/redux-feed-buildvector.ll

[SLP] Try to match reductions before trying to vectorize a vector build sequence.
ClosedPublic