Details

Reviewers

mzolotukhin
mkuper
hfinkel

Commits

rG62af7252f17c: [SLP] Fixed cost model for horizontal reduction.
rL288398: [SLP] Fixed cost model for horizontal reduction.

Summary

Currently, when the cost of the scalar operations is evaluated, the vector type is used instead of the scalar type. This patch fixes that and also fixes the evaluation of the vector operations cost.

Diff Detail

Repository: rL LLVM

Event Timeline

ABataev updated this revision to Diff 76836. Nov 3 2016, 4:27 AM

ABataev retitled this revision to [SLP] Fixed cost model for horizontal reduction.

ABataev updated this object.

ABataev added reviewers: hfinkel, mkuper, mzolotukhin.

ABataev added subscribers: llvm-commits, spatel, RKSimon.

Thanks for catching this Alexey, computing the scalar reduction cost with the vector type does look very odd.

include/llvm/CodeGen/BasicTTIImpl.h
928 ↗	(On Diff #76836)	The computation is getting pretty hairy. Could you please add comments explaining what exactly it computes? I understand you're trying to model the reduction steps performed on illegal vectors and the steps performed on legal vectors separately, but I'm not sure this computes precisely what we want.
932 ↗	(On Diff #76836)	Maybe save static_cast<T *>(this) in a temporary, for readability's sake, now that you're adding more uses?
934 ↗	(On Diff #76836)	I counts the number of illegal reduction steps, right? Maybe give it a more meaningful name? It's not a regular loop counter where it's obvious what it counts up to.

ABataev marked 3 inline comments as done. Nov 9 2016, 5:59 AM

ABataev added inline comments.

include/llvm/CodeGen/BasicTTIImpl.h
928 ↗	(On Diff #76836)	Added several lines of comments trying to explain the new code.
934 ↗	(On Diff #76836)	Changed to while loop.

Updated after Michael's comments

mkuper added inline comments. Nov 13 2016, 11:56 AM

include/llvm/CodeGen/BasicTTIImpl.h
966 ↗	(On Diff #77338)	ThisT -> ConcreteTTI, or something conveying the same information? (That's what this actually is, right?)
983 ↗	(On Diff #77338)	ont eh -> on the
988 ↗	(On Diff #77338)	Why the +1? If the type is legal, LongVectorsCount is 0, Ty doesn't change, so this changes the behavior from multiplying by NumReduxLevel to NumReduxLevel + 1. Was the previous version wrong in that respect?
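For reference, a splitting (non-pairwise) reduction over a power-of-two number of elements takes exactly log2(n) shuffle-plus-op levels, which is the count the question above is about. The following is a minimal standalone sketch, not LLVM code; `splittingReduce` and `ReduceResult` are hypothetical names used only for illustration, and the input is assumed to be non-empty with a power-of-two size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type, not part of LLVM.
struct ReduceResult {
  int Sum;
  unsigned Levels;
};

// Adds the upper half of the live elements onto the lower half until one
// element is left, counting one level per shuffle + add pair.
// Assumes Vals is non-empty and Vals.size() is a power of two.
ReduceResult splittingReduce(std::vector<int> Vals) {
  unsigned Levels = 0;
  for (std::size_t Live = Vals.size(); Live > 1; Live /= 2) {
    std::size_t Half = Live / 2;
    for (std::size_t I = 0; I < Half; ++I)
      Vals[I] += Vals[I + Half];
    ++Levels;
  }
  return {Vals[0], Levels};
}
```

For eight elements the live width goes 8, 4, 2, 1, so three levels are counted and no extra level is needed when the type is already legal.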

Fixed after comments from Michael

So, now, none of the costs on the tests actually change?
Does that mean that the changes in costs in the previous versions came from the +1?

Can you add a test that demonstrates the cost change? (Preferably in a way that shows what happened - e.g. commit a test with the "bad" cost, and then have a diff with the good one).

Fixed some issues in previous implementation. More correct cost calculation

In D26277#598814, @mkuper wrote:

So, now, none of the costs on the tests actually change?
Does that mean that the changes in costs in the previous versions came from the +1?

Can you add a test that demonstrates the cost change? (Preferably in a way that shows what happened - e.g. commit a test with the "bad" cost, and then have a diff with the good one).

Changed the code a little bit. I checked it: the calculated cost is very close to the real situation, but it seems to me that the BoUpSLP tree cost is too optimistic. I will look at it later.

In D26277#599995, @ABataev wrote:

In D26277#598814, @mkuper wrote:

So, now, none of the costs on the tests actually change?
Does that mean that the changes in costs in the previous versions came from the +1?

Can you add a test that demonstrates the cost change? (Preferably in a way that shows what happened - e.g. commit a test with the "bad" cost, and then have a diff with the good one).

Changed the code a little bit. I checked it: the calculated cost is very close to the real situation, but it seems to me that the BoUpSLP tree cost is too optimistic. I will look at it later.

I'm sorry, but I'm still confused.
The original rationale for this patch was that the vector cost model is too optimistic, but the only test change seems to show the cost model becoming even more optimistic.

Michael, I added several test cases in r287801. I checked the tests after my changes: no changes were found; the result is still the same as before this patch with fixes.

Updated to the latest version

The point I'm trying to make is that there's still no adequate test coverage for this patch.

Could you please split this into two patches:

1. A patch that contains line 4286, and a test where the behavior actually changes due to that line.

2. A patch that contains getReductionCost() and a test where the behavior changes due to those changes. I assume (a) the change in reduction.ll you have in this patch is due to this part, and (b) 11 is a better cost here than 17, but it seems like <8 x i32> should be legal on AVX2, and illegal on SSE/AVX, so I wonder why we get the same cost for both cases.

Would a smaller fix achieve the same result? If so, can you add a test that demonstrates where the logic you've added for illegal types matters?

In D26277#604897, @mkuper wrote:

The point I'm trying to make is that there's still no adequate test coverage for this patch.

Could you please split this into two patches:

1. A patch that contains line 4286, and a test where the behavior actually changes due to that line.

2. A patch that contains getReductionCost() and a test where the behavior changes due to those changes. I assume (a) the change in reduction.ll you have in this patch is due to this part, and (b) 11 is a better cost here than 17, but it seems like <8 x i32> should be legal on AVX2, and illegal on SSE/AVX, so I wonder why we get the same cost for both cases.

Would a smaller fix achieve the same result? If so, can you add a test that demonstrates where the logic you've added for illegal types matters?

Michael, I don't think this is a good idea. Actually, all of these changes are required for the cost model fix. I removed all optimizations from the patch, so it is now a pure bug fix.
Yes, the patch may reduce the cost of vector operations, but it also reduces the cost of scalar operations because ScalarTy is used instead of VectorTy.

Currently, this kind of reduction is also allowed. The vector cost of this operation is 17, but the scalar cost is 16 (because VectorTy is used). With this patch, the vector cost is 11 and the scalar cost is 7, so the difference is even better than in the original code.
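The figures above follow from the two formulas for the scalar cost and from the delta that SLP adds to the tree cost. As a rough sketch: the per-instruction costs used in the usage note (2 for one vector add, 1 for one scalar add) are inferred from the totals quoted for this SSE2 case with a reduction width of 8, not read from the cost tables, and the function names are hypothetical.

```cpp
#include <cassert>

// Old formula: ReduxWidth ops, each priced with the vector type.
int oldScalarReduxCost(int ReduxWidth, int VecOpCost) {
  return ReduxWidth * VecOpCost;
}

// New formula: ReduxWidth - 1 ops, each priced with the scalar type.
int newScalarReduxCost(int ReduxWidth, int ScalarOpCost) {
  return (ReduxWidth - 1) * ScalarOpCost;
}

// The value added to the SLP tree cost: vector cost minus scalar cost.
int reductionCostDelta(int VecReduxCost, int ScalarReduxCost) {
  return VecReduxCost - ScalarReduxCost;
}
```

With these assumed inputs, `oldScalarReduxCost(8, 2)` is 16 and `newScalarReduxCost(8, 1)` is 7, so the reported delta moves from 17 - 16 = 1 to 11 - 7 = 4.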

Ok, now I see, in this case it doesn't make sense to separate the two.
(Sorry for the response time, I was on vacation)

LGTM, except that I would still like to see a test for the scalar cost as part of the final patch you commit. Having the test to begin with (showing 17 -> 11 and 16 -> 7 together) would have saved us some miscommunication.

This revision is now accepted and ready to land. Nov 30 2016, 10:09 AM

Closed by commit rL288398: [SLP] Fixed cost model for horizontal reduction. (authored by ABataev). Dec 1 2016, 10:52 AM

This revision was automatically updated to reflect the committed changes.

Diff 79950

llvm/trunk/include/llvm/CodeGen/BasicTTIImpl.h

@@ ... @@ unsigned getNumberOfParts(Type *Tp) {
     std::pair<unsigned, MVT> LT = getTLI()->getTypeLegalizationCost(DL, Tp);
     return LT.first;
   }

   unsigned getAddressComputationCost(Type *Ty, bool IsComplex) { return 0; }

   unsigned getReductionCost(unsigned Opcode, Type *Ty, bool IsPairwise) {
     assert(Ty->isVectorTy() && "Expect a vector type");
+    Type *ScalarTy = Ty->getVectorElementType();
     unsigned NumVecElts = Ty->getVectorNumElements();
     unsigned NumReduxLevels = Log2_32(NumVecElts);
-    unsigned ArithCost =
-        NumReduxLevels *
-        static_cast<T *>(this)->getArithmeticInstrCost(Opcode, Ty);
+    // Try to calculate arithmetic and shuffle op costs for reduction operations.
+    // We're assuming that reduction operations are performed in the following way:
+    // 1. Non-pairwise reduction
+    // %val1 = shufflevector<n x t> %val, <n x t> %undef,
+    // <n x i32> <i32 n/2, i32 n/2 + 1, ..., i32 n, i32 undef, ..., i32 undef>
+    //            \----------------v-------------/  \----------v------------/
+    //                            n/2 elements               n/2 elements
+    // %red1 = op <n x t> %val, <n x t> val1
+    // After this operation we have a vector %red1 where only the first n/2
+    // elements are meaningful, the second n/2 elements are undefined and can be
+    // dropped. All other operations are actually working with the vector of
+    // length n/2, not n, though the real vector length is still n.
+    // %val2 = shufflevector<n x t> %red1, <n x t> %undef,
+    // <n x i32> <i32 n/4, i32 n/4 + 1, ..., i32 n/2, i32 undef, ..., i32 undef>
+    //            \----------------v-------------/  \----------v------------/
+    //                            n/4 elements               3*n/4 elements
+    // %red2 = op <n x t> %red1, <n x t> val2 - working with the vector of
+    // length n/2, the resulting vector has length n/4 etc.
+    // 2. Pairwise reduction:
+    // Everything is the same except for an additional shuffle operation which
+    // is used to produce operands for pairwise kind of reductions.
+    // %val1 = shufflevector<n x t> %val, <n x t> %undef,
+    // <n x i32> <i32 0, i32 2, ..., i32 n-2, i32 undef, ..., i32 undef>
+    //            \-------------v----------/  \----------v------------/
+    //                   n/2 elements               n/2 elements
+    // %val2 = shufflevector<n x t> %val, <n x t> %undef,
+    // <n x i32> <i32 1, i32 3, ..., i32 n-1, i32 undef, ..., i32 undef>
+    //            \-------------v----------/  \----------v------------/
+    //                   n/2 elements               n/2 elements
+    // %red1 = op <n x t> %val1, <n x t> val2
+    // Again, the operation is performed on <n x t> vector, but the resulting
+    // vector %red1 is <n/2 x t> vector.
+    //
+    // The cost model should take into account that the actual length of the
+    // vector is reduced on each iteration.
+    unsigned ArithCost = 0;
+    unsigned ShuffleCost = 0;
+    auto *ConcreteTTI = static_cast<T *>(this);
+    std::pair<unsigned, MVT> LT =
+        ConcreteTTI->getTLI()->getTypeLegalizationCost(DL, Ty);
+    unsigned LongVectorCount = 0;
+    unsigned MVTLen =
+        LT.second.isVector() ? LT.second.getVectorNumElements() : 1;
+    while (NumVecElts > MVTLen) {
+      NumVecElts /= 2;
       // Assume the pairwise shuffles add a cost.
-    unsigned ShuffleCost =
-        NumReduxLevels * (IsPairwise + 1) *
-        static_cast<T *>(this)
-            ->getShuffleCost(TTI::SK_ExtractSubvector, Ty, NumVecElts / 2, Ty);
+      ShuffleCost += (IsPairwise + 1) *
+                     ConcreteTTI->getShuffleCost(TTI::SK_ExtractSubvector, Ty,
+                                                 NumVecElts, Ty);
+      ArithCost += ConcreteTTI->getArithmeticInstrCost(Opcode, Ty);
+      Ty = VectorType::get(ScalarTy, NumVecElts);
+      ++LongVectorCount;
+    }
+    // The minimal length of the vector is limited by the real length of vector
+    // operations performed on the current platform. That's why several final
+    // reduction operations are performed on the vectors with the same
+    // architecture-dependent length.
+    ShuffleCost += (NumReduxLevels - LongVectorCount) * (IsPairwise + 1) *
+                   ConcreteTTI->getShuffleCost(TTI::SK_ExtractSubvector, Ty,
+                                               NumVecElts, Ty);
+    ArithCost += (NumReduxLevels - LongVectorCount) *
+                 ConcreteTTI->getArithmeticInstrCost(Opcode, Ty);
     return ShuffleCost + ArithCost + getScalarizationOverhead(Ty, false, true);
   }

   unsigned getVectorSplitCost() { return 1; }

   /// @}
 };
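To see how the patched loop behaves for legal and illegal vector widths, the structure of the new getReductionCost() can be replayed against a toy cost model. This is only a sketch under stated assumptions: `ToyCostModel`, `log2Floor`, and `toyReductionCost` are hypothetical names, the costs are looked up by element count instead of by LLVM type, the legal width is assumed to be at least 1, and the final getScalarizationOverhead() term is left out.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for the target hooks; costs are looked up by the
// number of elements in the vector type being operated on.
struct ToyCostModel {
  unsigned LegalVecElts;                         // plays the role of MVTLen
  std::function<unsigned(unsigned)> ArithCost;   // cost of one vector op
  std::function<unsigned(unsigned)> ShuffleCost; // cost of one shuffle
};

unsigned log2Floor(unsigned N) {
  unsigned L = 0;
  while (N > 1) {
    N /= 2;
    ++L;
  }
  return L;
}

// Mirrors the structure of the patched loop, minus the scalarization
// overhead. Assumes M.LegalVecElts >= 1.
unsigned toyReductionCost(const ToyCostModel &M, unsigned NumVecElts,
                          bool IsPairwise) {
  unsigned NumReduxLevels = log2Floor(NumVecElts);
  unsigned ArithCost = 0, ShuffleCost = 0, LongVectorCount = 0;
  unsigned CurElts = NumVecElts; // element count of the current "Ty"
  // Levels performed on vectors wider than the legal width: each one is
  // priced at the current (wide) width, then the width is halved.
  while (NumVecElts > M.LegalVecElts) {
    NumVecElts /= 2;
    ShuffleCost += (IsPairwise + 1) * M.ShuffleCost(CurElts);
    ArithCost += M.ArithCost(CurElts);
    CurElts = NumVecElts;
    ++LongVectorCount;
  }
  // Remaining levels all run at the same legal width.
  unsigned Remaining = NumReduxLevels - LongVectorCount;
  ShuffleCost += Remaining * (IsPairwise + 1) * M.ShuffleCost(CurElts);
  ArithCost += Remaining * M.ArithCost(CurElts);
  return ShuffleCost + ArithCost;
}
```

With a made-up model where the legal width is 4 and any wider op or shuffle costs 2 instead of 1, an 8-element splitting reduction comes out at 8 and a pairwise one at 12; with a legal width of 8 and unit costs the loop is never entered and the splitting reduction costs 6. These numbers are illustrative only and are not the X86 costs checked in the tests.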

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp

@@ ... @@ int getReductionCost(TargetTransformInfo *TTI, Value *FirstReducedVal) {

     int PairwiseRdxCost = TTI->getReductionCost(ReductionOpcode, VecTy, true);
     int SplittingRdxCost = TTI->getReductionCost(ReductionOpcode, VecTy, false);

     IsPairwiseReduction = PairwiseRdxCost < SplittingRdxCost;
     int VecReduxCost = IsPairwiseReduction ? PairwiseRdxCost : SplittingRdxCost;

     int ScalarReduxCost =
-        ReduxWidth * TTI->getArithmeticInstrCost(ReductionOpcode, VecTy);
+        (ReduxWidth - 1) *
+        TTI->getArithmeticInstrCost(ReductionOpcode, ScalarTy);

     DEBUG(dbgs() << "SLP: Adding cost " << VecReduxCost - ScalarReduxCost
                  << " for reduction that starts with " << *FirstReducedVal
                  << " (It is a "
                  << (IsPairwiseReduction ? "pairwise" : "splitting")
                  << " reduction)\n");

     return VecReduxCost - ScalarReduxCost;
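The switch from ReduxWidth to ReduxWidth - 1 in the scalar cost matches the number of binary operations a scalar reduction actually performs: folding N values takes N - 1 operations. A minimal sketch of that count, using a hypothetical helper that is not part of LLVM and assumes a non-empty input:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type, not part of LLVM.
struct ScalarReduce {
  int Sum;
  unsigned NumOps;
};

// Sums Vals left to right and reports how many binary additions were
// needed. Assumes Vals is non-empty.
ScalarReduce scalarReduce(const std::vector<int> &Vals) {
  ScalarReduce R{Vals[0], 0};
  for (std::size_t I = 1; I < Vals.size(); ++I) {
    R.Sum += Vals[I];
    ++R.NumOps;
  }
  return R;
}
```

For the eight-element case used in the tests this counts seven additions, which is where the scalar cost of 7 comes from when one scalar add is priced at 1.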

llvm/trunk/test/Analysis/CostModel/X86/reduction.ll

@@ ... @@
          <8 x i32> <i32 2 , i32 3, i32 undef, i32 undef,
                     i32 undef, i32 undef, i32 undef, i32 undef>
   %bin.rdx.2 = add <8 x i32> %bin.rdx, %rdx.shuf.2
   %rdx.shuf.3 = shufflevector <8 x i32> %bin.rdx.2, <8 x i32> undef,
          <8 x i32> <i32 1 , i32 undef, i32 undef, i32 undef,
                     i32 undef, i32 undef, i32 undef, i32 undef>
   %bin.rdx.3 = add <8 x i32> %bin.rdx.2, %rdx.shuf.3

 ; CHECK-LABEL: reduction_cost_int
-; CHECK: cost of 17 {{.*}} extractelement
+; CHECK: cost of 11 {{.*}} extractelement
 ; AVX-LABEL: reduction_cost_int
 ; AVX: cost of 5 {{.*}} extractelement

   %r = extractelement <8 x i32> %bin.rdx.3, i32 0
   ret i32 %r
 }

 define fastcc float @pairwise_hadd(<4 x float> %rdx, float %f1) {

llvm/trunk/test/Transforms/SLPVectorizer/X86/reduction_unrolled.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -slp-vectorizer -slp-vectorize-hor -S -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 -debug < %s 2>&1 | FileCheck %s
 ; RUN: opt -slp-vectorizer -slp-vectorize-hor -S -mtriple=x86_64-unknown-linux-gnu -mcpu=core2 -debug < %s 2>&1 | FileCheck --check-prefix=SSE2 %s
 ; REQUIRES: asserts

 ; int test(unsigned int *p) {
 ;   int sum = 0;
 ;   for (int i = 0; i < 8; i++)
 ;     sum += p[i];
 ;   return sum;
 ; }

-; Vector cost is 5, Scalar cost is 32
-; CHECK: Adding cost -27 for reduction that starts with %7 = load i32, i32* %arrayidx.7, align 4 (It is a splitting reduction)
-; Vector cost is 17, Scalar cost is 16
-; SSE2: Adding cost 1 for reduction that starts with %7 = load i32, i32* %arrayidx.7, align 4 (It is a splitting reduction)
+; Vector cost is 5, Scalar cost is 7
+; CHECK: Adding cost -2 for reduction that starts with %7 = load i32, i32* %arrayidx.7, align 4 (It is a splitting reduction)
+; Vector cost is 11, Scalar cost is 7
+; SSE2: Adding cost 4 for reduction that starts with %7 = load i32, i32* %arrayidx.7, align 4 (It is a splitting reduction)
 define i32 @test(i32* nocapture readonly %p) {
 ; CHECK-LABEL: @test(
 ; CHECK: [[BC:%.*]] = bitcast i32* %p to <8 x i32>*
 ; CHECK-NEXT: [[LD:%.*]] = load <8 x i32>, <8 x i32>* [[BC]], align 4
 ; CHECK: [[RDX_SHUF:%.*]] = shufflevector <8 x i32> [[LD]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
 ; CHECK-NEXT: [[BIN_RDX:%.*]] = add <8 x i32> [[LD]], [[RDX_SHUF]]
 ; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <8 x i32> [[BIN_RDX]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
 ; CHECK-NEXT: [[BIN_RDX2:%.*]] = add <8 x i32> [[BIN_RDX]], [[RDX_SHUF1]]

This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Fixed cost model for horizontal reduction.
ClosedPublic
