This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Be more aggressive about reduction width selection.
Closed · Public

Authored by chatur01 on Oct 27 2015, 6:52 AM.

Details

Reviewers

nadav
jmolloy

Commits

rGab3215fa115d: [SLP] Be more aggressive about reduction width selection.
rL251428: [SLP] Be more aggressive about reduction width selection.

Summary

This change could be way off-piste; I'm looking for any feedback on whether it's an acceptable approach.

It never seems to be a problem to gobble up as many reduction values as can be found, and then to attempt to reduce the resulting tree. Some of the workloads I'm looking at have been aggressively unrolled by hand, and by selecting reduction widths that are not constrained by a vector register size, it becomes possible to vectorize them profitably. My test case shows such an unrolling, which SLP did not vectorize on either ARM or X86 before this patch, but does vectorize with it.
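For illustration, here is a rough C++ sketch of the kind of hand-unrolled reduction that the test case's IR resembles: a sum of absolute differences over eight bytes per row. The parameter names follow the IR in the test (`blk1`, `blk2`, `lx`, `h`, `lim`); the function name `sadUnrolled8` and the exact source shape are assumptions, not the original workload.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical source for the unrolled pattern: eight absolute differences
// are accumulated into `s` on every row, forming one long add chain that
// the SLP vectorizer can treat as a horizontal reduction.
int sadUnrolled8(const uint8_t *blk1, const uint8_t *blk2, int lx, int h,
                 int lim) {
  assert(blk1 && blk2 && "input blocks must be non-null");
  int s = 0;
  for (int j = 0; j < h; ++j) {
    s += std::abs(blk1[0] - blk2[0]);
    s += std::abs(blk1[1] - blk2[1]);
    s += std::abs(blk1[2] - blk2[2]);
    s += std::abs(blk1[3] - blk2[3]);
    s += std::abs(blk1[4] - blk2[4]);
    s += std::abs(blk1[5] - blk2[5]);
    s += std::abs(blk1[6] - blk2[6]);
    s += std::abs(blk1[7] - blk2[7]);
    if (s >= lim) // early exit once the running sum reaches the limit
      break;
    blk1 += lx; // advance both blocks by one row
    blk2 += lx;
  }
  return s;
}
```

Each row contributes eight reduction values, which is more than fits in a width-4 reduction.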

I measure no significant compile time impact of this change when combined with D13949 and D14063. There are also no significant performance regressions on ARM/AArch64 in SPEC or LNT.

The more principled approach I thought of was to generate several candidate trees and use the cost model to pick the cheapest one. That seemed like quite a big design change (the algorithms seem very much one-shot), and would likely be a costly thing for compile time. This seemed to do the job at very little cost, but I'm worried I've misunderstood something!
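The width selection the patch performs can be sketched in isolation as below. This is a standalone approximation, not the code in the diff: `powerOf2Floor` is a local stand-in for LLVM's `PowerOf2Floor`, and `selectReduxWidth` is a hypothetical helper name wrapping the `std::max` expression used in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Local stand-in for llvm::PowerOf2Floor: the largest power of two that is
// less than or equal to n, or 0 when n is 0.
static uint64_t powerOf2Floor(uint64_t n) {
  if (n == 0)
    return 0;
  uint64_t p = 1;
  while (p <= n / 2) // p * 2 <= n, so doubling cannot overflow
    p *= 2;
  return p;
}

// Hypothetical helper mirroring the patch's expression: round the number of
// reduction values down to a power of two, but never go below a width of 4.
static unsigned selectReduxWidth(uint64_t numReductionValues) {
  uint64_t w = std::max<uint64_t>(4, powerOf2Floor(numReductionValues));
  assert((w & (w - 1)) == 0 && "width must be a power of two");
  return static_cast<unsigned>(w);
}
```

With the eight reduction values in the test case this gives a width of 8, which lines up with the `<8 x i8>` loads the test checks for. Fewer than four values still yields a width of 4, as before.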

Event Timeline

• chatur01 updated this revision to Diff 38531. Oct 27 2015, 6:52 AM

• chatur01 retitled this revision from to [SLP] Be more aggressive about reduction width selection..

• chatur01 updated this object.

• chatur01 added reviewers: nadav, jmolloy.

• chatur01 set the repository for this revision to rL LLVM.

• chatur01 added a subscriber: llvm-commits.

Herald added a subscriber: aemerson. Oct 27 2015, 6:52 AM

mssimpso added a subscriber: mssimpso. Oct 27 2015, 7:00 AM

Closed by commit rL251428: [SLP] Be more aggressive about reduction width selection. (authored by chatur01). Oct 27 2015, 11:01 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Transforms/Vectorize/SLPVectorizer.cpp	47 lines
test/Transforms/SLPVectorizer/AArch64/horizontal.ll	123 lines

Diff 38531

lib/Transforms/Vectorize/SLPVectorizer.cpp

[... 3,658 lines not shown ...]
class HorizontalReduction {

BinaryOperator *ReductionRoot;		BinaryOperator *ReductionRoot;
PHINode *ReductionPHI;		PHINode *ReductionPHI;

/// The opcode of the reduction.		/// The opcode of the reduction.
unsigned ReductionOpcode;		unsigned ReductionOpcode;
/// The opcode of the values we perform a reduction on.		/// The opcode of the values we perform a reduction on.
unsigned ReducedValueOpcode;		unsigned ReducedValueOpcode;
/// The width of one full horizontal reduction operation.
unsigned ReduxWidth;
/// Should we model this reduction as a pairwise reduction tree or a tree that		/// Should we model this reduction as a pairwise reduction tree or a tree that
/// splits the vector in halves and adds those halves.		/// splits the vector in halves and adds those halves.
bool IsPairwiseReduction;		bool IsPairwiseReduction;

public:		public:
		/// The width of one full horizontal reduction operation.
		unsigned ReduxWidth;

HorizontalReduction()		HorizontalReduction()
: ReductionRoot(nullptr), ReductionPHI(nullptr), ReductionOpcode(0),		: ReductionRoot(nullptr), ReductionPHI(nullptr), ReductionOpcode(0),
ReducedValueOpcode(0), ReduxWidth(0), IsPairwiseReduction(false) {}		ReducedValueOpcode(0), IsPairwiseReduction(false), ReduxWidth(0) {}

/// \brief Try to find a reduction tree.		/// \brief Try to find a reduction tree.
bool matchAssociativeReduction(PHINode *Phi, BinaryOperator *B) {		bool matchAssociativeReduction(PHINode *Phi, BinaryOperator *B) {
assert((!Phi ||		assert((!Phi ||
std::find(Phi->op_begin(), Phi->op_end(), B) != Phi->op_end()) &&		std::find(Phi->op_begin(), Phi->op_end(), B) != Phi->op_end()) &&
"Thi phi needs to use the binary operator");		"Thi phi needs to use the binary operator");

// We could have a initial reductions that is not an add.		// We could have a initial reductions that is not an add.
[... 140 lines not shown ...]
if (VectorizedTree) {
ReductionRoot->setOperand(0, VectorizedTree);		ReductionRoot->setOperand(0, VectorizedTree);
ReductionRoot->setOperand(1, ReductionPHI);		ReductionRoot->setOperand(1, ReductionPHI);
} else		} else
ReductionRoot->replaceAllUsesWith(VectorizedTree);		ReductionRoot->replaceAllUsesWith(VectorizedTree);
}		}
return VectorizedTree != nullptr;		return VectorizedTree != nullptr;
}		}

private:		unsigned numReductionValues() const {
		return ReducedVals.size();
		}

		private:
/// \brief Calculate the cost of a reduction.		/// \brief Calculate the cost of a reduction.
int getReductionCost(TargetTransformInfo *TTI, Value *FirstReducedVal) {		int getReductionCost(TargetTransformInfo *TTI, Value *FirstReducedVal) {
Type *ScalarTy = FirstReducedVal->getType();		Type *ScalarTy = FirstReducedVal->getType();
Type *VecTy = VectorType::get(ScalarTy, ReduxWidth);		Type *VecTy = VectorType::get(ScalarTy, ReduxWidth);

int PairwiseRdxCost = TTI->getReductionCost(ReductionOpcode, VecTy, true);		int PairwiseRdxCost = TTI->getReductionCost(ReductionOpcode, VecTy, true);
int SplittingRdxCost = TTI->getReductionCost(ReductionOpcode, VecTy, false);		int SplittingRdxCost = TTI->getReductionCost(ReductionOpcode, VecTy, false);

[... 130 lines not shown ...]
if (P->getIncomingBlock(0) == BBLatch) {
Rdx = P->getIncomingValue(0);		Rdx = P->getIncomingValue(0);
} else if (P->getIncomingBlock(1) == BBLatch) {		} else if (P->getIncomingBlock(1) == BBLatch) {
Rdx = P->getIncomingValue(1);		Rdx = P->getIncomingValue(1);
}		}

return Rdx;		return Rdx;
}		}

		/// \brief Attempt to reduce a horizontal reduction.
		/// If it is legal to match a horizontal reduction feeding
		/// the phi node P with reduction operators BI, then check if it
		/// can be done.
		/// \returns true if a horizontal reduction was matched and reduced.
		/// \returns false if a horizontal reduction was not matched.
		static bool canMatchHorizontalReduction(PHINode *P, BinaryOperator *BI,
		BoUpSLP &R, TargetTransformInfo *TTI) {
		if (!ShouldVectorizeHor)
		return false;

		HorizontalReduction HorRdx;
		if (!HorRdx.matchAssociativeReduction(P, BI))
		return false;

		// If there is a sufficient number of reduction values, reduce
		// to a nearby power-of-2. Can safely generate oversized
		// vectors and rely on the backend to split them to legal sizes.
		HorRdx.ReduxWidth =
		std::max((uint64_t)4, PowerOf2Floor(HorRdx.numReductionValues()));

		return HorRdx.tryToReduce(R, TTI);
		}

bool SLPVectorizer::vectorizeChainsInBlock(BasicBlock *BB, BoUpSLP &R) {		bool SLPVectorizer::vectorizeChainsInBlock(BasicBlock *BB, BoUpSLP &R) {
bool Changed = false;		bool Changed = false;
SmallVector<Value *, 4> Incoming;		SmallVector<Value *, 4> Incoming;
SmallSet<Value *, 16> VisitedInstrs;		SmallSet<Value *, 16> VisitedInstrs;

bool HaveVectorizedPhiNodes = true;		bool HaveVectorizedPhiNodes = true;
while (HaveVectorizedPhiNodes) {		while (HaveVectorizedPhiNodes) {
HaveVectorizedPhiNodes = false;		HaveVectorizedPhiNodes = false;
[... 60 lines not shown ...]
if (PHINode *P = dyn_cast<PHINode>(it)) {
Value *Rdx = getReductionValue(P, BB, LI);		Value *Rdx = getReductionValue(P, BB, LI);

// Check if this is a Binary Operator.		// Check if this is a Binary Operator.
BinaryOperator *BI = dyn_cast_or_null<BinaryOperator>(Rdx);		BinaryOperator *BI = dyn_cast_or_null<BinaryOperator>(Rdx);
if (!BI)		if (!BI)
continue;		continue;

// Try to match and vectorize a horizontal reduction.		// Try to match and vectorize a horizontal reduction.
HorizontalReduction HorRdx;		if (canMatchHorizontalReduction(P, BI, R, TTI)) {
if (ShouldVectorizeHor && HorRdx.matchAssociativeReduction(P, BI) &&
HorRdx.tryToReduce(R, TTI)) {
Changed = true;		Changed = true;
it = BB->begin();		it = BB->begin();
e = BB->end();		e = BB->end();
continue;		continue;
}		}

Value *Inst = BI->getOperand(0);		Value *Inst = BI->getOperand(0);
if (Inst == P)		if (Inst == P)
Inst = BI->getOperand(1);		Inst = BI->getOperand(1);

if (tryToVectorize(dyn_cast<BinaryOperator>(Inst), R)) {		if (tryToVectorize(dyn_cast<BinaryOperator>(Inst), R)) {
// We would like to start over since some instructions are deleted		// We would like to start over since some instructions are deleted
// and the iterator may become invalid value.		// and the iterator may become invalid value.
Changed = true;		Changed = true;
it = BB->begin();		it = BB->begin();
e = BB->end();		e = BB->end();
continue;		continue;
}		}

continue;		continue;
}		}

// Try to vectorize horizontal reductions feeding into a store.
if (ShouldStartVectorizeHorAtStore)		if (ShouldStartVectorizeHorAtStore)
if (StoreInst *SI = dyn_cast<StoreInst>(it))		if (StoreInst *SI = dyn_cast<StoreInst>(it))
if (BinaryOperator *BinOp =		if (BinaryOperator *BinOp =
dyn_cast<BinaryOperator>(SI->getValueOperand())) {		dyn_cast<BinaryOperator>(SI->getValueOperand())) {
HorizontalReduction HorRdx;		if (canMatchHorizontalReduction(nullptr, BinOp, R, TTI) ||
if (((HorRdx.matchAssociativeReduction(nullptr, BinOp) &&		tryToVectorize(BinOp, R)) {
HorRdx.tryToReduce(R, TTI)) ||
tryToVectorize(BinOp, R))) {
Changed = true;		Changed = true;
it = BB->begin();		it = BB->begin();
e = BB->end();		e = BB->end();
continue;		continue;
}		}
}		}

// Try to vectorize horizontal reductions feeding into a return.		// Try to vectorize horizontal reductions feeding into a return.
[... 102 lines not shown ...]

test/Transforms/SLPVectorizer/AArch64/horizontal.ll

[... 139 lines not shown ...]

	for.end.loopexit: ; preds = %for.body, %if.end			for.end.loopexit: ; preds = %for.body, %if.end
	br label %for.end			br label %for.end

	for.end: ; preds = %for.end.loopexit, %entry			for.end: ; preds = %for.end.loopexit, %entry
	%s.1 = phi i32 [ 0, %entry ], [ %add13, %for.end.loopexit ]			%s.1 = phi i32 [ 0, %entry ], [ %add13, %for.end.loopexit ]
	ret i32 %s.1			ret i32 %s.1
	}			}

				; CHECK: test_unrolled_select
				; CHECK: load <8 x i8>
				; CHECK: load <8 x i8>
				; CHECK: select <8 x i1>
				define i32 @test_unrolled_select(i8* noalias nocapture readonly %blk1, i8* noalias nocapture readonly %blk2, i32 %lx, i32 %h, i32 %lim) #0 {
				entry:
				%cmp.43 = icmp sgt i32 %h, 0
				br i1 %cmp.43, label %for.body.lr.ph, label %for.end

				for.body.lr.ph: ; preds = %entry
				%idx.ext = sext i32 %lx to i64
				br label %for.body

				for.body: ; preds = %for.body.lr.ph, %if.end.86
				%s.047 = phi i32 [ 0, %for.body.lr.ph ], [ %add82, %if.end.86 ]
				%j.046 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %if.end.86 ]
				%p2.045 = phi i8* [ %blk2, %for.body.lr.ph ], [ %add.ptr88, %if.end.86 ]
				%p1.044 = phi i8* [ %blk1, %for.body.lr.ph ], [ %add.ptr, %if.end.86 ]
				%0 = load i8, i8* %p1.044, align 1
				%conv = zext i8 %0 to i32
				%1 = load i8, i8* %p2.045, align 1
				%conv2 = zext i8 %1 to i32
				%sub = sub nsw i32 %conv, %conv2
				%cmp3 = icmp slt i32 %sub, 0
				%sub5 = sub nsw i32 0, %sub
				%sub5.sub = select i1 %cmp3, i32 %sub5, i32 %sub
				%add = add nsw i32 %sub5.sub, %s.047
				%arrayidx6 = getelementptr inbounds i8, i8* %p1.044, i64 1
				%2 = load i8, i8* %arrayidx6, align 1
				%conv7 = zext i8 %2 to i32
				%arrayidx8 = getelementptr inbounds i8, i8* %p2.045, i64 1
				%3 = load i8, i8* %arrayidx8, align 1
				%conv9 = zext i8 %3 to i32
				%sub10 = sub nsw i32 %conv7, %conv9
				%cmp11 = icmp slt i32 %sub10, 0
				%sub14 = sub nsw i32 0, %sub10
				%v.1 = select i1 %cmp11, i32 %sub14, i32 %sub10
				%add16 = add nsw i32 %add, %v.1
				%arrayidx17 = getelementptr inbounds i8, i8* %p1.044, i64 2
				%4 = load i8, i8* %arrayidx17, align 1
				%conv18 = zext i8 %4 to i32
				%arrayidx19 = getelementptr inbounds i8, i8* %p2.045, i64 2
				%5 = load i8, i8* %arrayidx19, align 1
				%conv20 = zext i8 %5 to i32
				%sub21 = sub nsw i32 %conv18, %conv20
				%cmp22 = icmp slt i32 %sub21, 0
				%sub25 = sub nsw i32 0, %sub21
				%sub25.sub21 = select i1 %cmp22, i32 %sub25, i32 %sub21
				%add27 = add nsw i32 %add16, %sub25.sub21
				%arrayidx28 = getelementptr inbounds i8, i8* %p1.044, i64 3
				%6 = load i8, i8* %arrayidx28, align 1
				%conv29 = zext i8 %6 to i32
				%arrayidx30 = getelementptr inbounds i8, i8* %p2.045, i64 3
				%7 = load i8, i8* %arrayidx30, align 1
				%conv31 = zext i8 %7 to i32
				%sub32 = sub nsw i32 %conv29, %conv31
				%cmp33 = icmp slt i32 %sub32, 0
				%sub36 = sub nsw i32 0, %sub32
				%v.3 = select i1 %cmp33, i32 %sub36, i32 %sub32
				%add38 = add nsw i32 %add27, %v.3
				%arrayidx39 = getelementptr inbounds i8, i8* %p1.044, i64 4
				%8 = load i8, i8* %arrayidx39, align 1
				%conv40 = zext i8 %8 to i32
				%arrayidx41 = getelementptr inbounds i8, i8* %p2.045, i64 4
				%9 = load i8, i8* %arrayidx41, align 1
				%conv42 = zext i8 %9 to i32
				%sub43 = sub nsw i32 %conv40, %conv42
				%cmp44 = icmp slt i32 %sub43, 0
				%sub47 = sub nsw i32 0, %sub43
				%sub47.sub43 = select i1 %cmp44, i32 %sub47, i32 %sub43
				%add49 = add nsw i32 %add38, %sub47.sub43
				%arrayidx50 = getelementptr inbounds i8, i8* %p1.044, i64 5
				%10 = load i8, i8* %arrayidx50, align 1
				%conv51 = zext i8 %10 to i32
				%arrayidx52 = getelementptr inbounds i8, i8* %p2.045, i64 5
				%11 = load i8, i8* %arrayidx52, align 1
				%conv53 = zext i8 %11 to i32
				%sub54 = sub nsw i32 %conv51, %conv53
				%cmp55 = icmp slt i32 %sub54, 0
				%sub58 = sub nsw i32 0, %sub54
				%v.5 = select i1 %cmp55, i32 %sub58, i32 %sub54
				%add60 = add nsw i32 %add49, %v.5
				%arrayidx61 = getelementptr inbounds i8, i8* %p1.044, i64 6
				%12 = load i8, i8* %arrayidx61, align 1
				%conv62 = zext i8 %12 to i32
				%arrayidx63 = getelementptr inbounds i8, i8* %p2.045, i64 6
				%13 = load i8, i8* %arrayidx63, align 1
				%conv64 = zext i8 %13 to i32
				%sub65 = sub nsw i32 %conv62, %conv64
				%cmp66 = icmp slt i32 %sub65, 0
				%sub69 = sub nsw i32 0, %sub65
				%sub69.sub65 = select i1 %cmp66, i32 %sub69, i32 %sub65
				%add71 = add nsw i32 %add60, %sub69.sub65
				%arrayidx72 = getelementptr inbounds i8, i8* %p1.044, i64 7
				%14 = load i8, i8* %arrayidx72, align 1
				%conv73 = zext i8 %14 to i32
				%arrayidx74 = getelementptr inbounds i8, i8* %p2.045, i64 7
				%15 = load i8, i8* %arrayidx74, align 1
				%conv75 = zext i8 %15 to i32
				%sub76 = sub nsw i32 %conv73, %conv75
				%cmp77 = icmp slt i32 %sub76, 0
				%sub80 = sub nsw i32 0, %sub76
				%v.7 = select i1 %cmp77, i32 %sub80, i32 %sub76
				%add82 = add nsw i32 %add71, %v.7
				%cmp83 = icmp slt i32 %add82, %lim
				br i1 %cmp83, label %if.end.86, label %for.end.loopexit

				if.end.86: ; preds = %for.body
				%add.ptr = getelementptr inbounds i8, i8* %p1.044, i64 %idx.ext
				%add.ptr88 = getelementptr inbounds i8, i8* %p2.045, i64 %idx.ext
				%inc = add nuw nsw i32 %j.046, 1
				%cmp = icmp slt i32 %inc, %h
				br i1 %cmp, label %for.body, label %for.end.loopexit

				for.end.loopexit: ; preds = %for.body, %if.end.86
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %entry
				%s.1 = phi i32 [ 0, %entry ], [ %add82, %for.end.loopexit ]
				ret i32 %s.1
				}