This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

- lib/Transforms/Vectorize/SLPVectorizer.cpp
- test/Transforms/SLPVectorizer/X86/PR35777.ll
- test/Transforms/SLPVectorizer/X86/sign-extend.ll

Differential D41948

[SLP] Fix vectorization for tree with trunc to minimum required bit width.
ClosedPublic

Authored by ABataev on Jan 11 2018, 6:46 AM.


Details

Reviewers

RKSimon
spatel
mkuper
hfinkel
mssimpso

Commits

rGfa80c47c6a67: [SLP] Fix vectorization for tree with trunc to minimum required bit width.
rL322946: [SLP] Fix vectorization for tree with trunc to minimum required bit width.

Summary

If the vectorized tree is truncated to the minimum required bit width and
the vector type of a cast operation after the truncation is the same
as the vector type of the cast operands, count the cost of the vector cast
operation as 0, because this cast will later be removed.
Also, if the root operations of the vectorization tree are integer casts, do not consider them as candidates for truncation. Truncating them only creates extra vector/scalar operations of the same kind, which InstCombine removes anyway.
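The cost rule described in the summary can be sketched outside of LLVM as follows. This is a minimal toy model, not the patch itself: `ToyVecTy`, `ToyCastCost` and `toyCastCostDelta` are hypothetical names, the flat cost of 1 per cast is an assumption standing in for the TTI query, and `MinBWs` is modelled as a plain map keyed by value name.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for a vector type (NOT llvm::VectorType):
// element bit width times number of lanes.
struct ToyVecTy {
  unsigned ElemBits;
  unsigned Lanes;
  bool operator==(const ToyVecTy &O) const {
    return ElemBits == O.ElemBits && Lanes == O.Lanes;
  }
  bool operator!=(const ToyVecTy &O) const { return !(*this == O); }
};

// Assumption: every cast, scalar or vector, costs 1. The real code asks TTI.
constexpr int ToyCastCost = 1;

// Mirrors the shape of the patched cast case: the vector cast is counted as
// free when the value is recorded in MinBWs (it will be demoted) and the
// source and destination vector types coincide after demotion.
inline int toyCastCostDelta(const std::map<std::string, unsigned> &MinBWs,
                            const std::string &VL0, ToyVecTy VecTy,
                            ToyVecTy SrcVecTy, int NumScalars) {
  int ScalarCost = NumScalars * ToyCastCost;
  int VecCost = 0;
  if (!MinBWs.count(VL0) || VecTy != SrcVecTy)
    VecCost = ToyCastCost;
  return VecCost - ScalarCost;
}
```

With four scalar casts, a demoted value whose types match yields a delta of -4, while any other case yields -3 under the assumed unit cost.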

Diff Detail

Build Status

Buildable 13717
Build 13717: arc lint + arc unit

Event Timeline

ABataev created this revision. Jan 11 2018, 6:46 AM

Thanks for cleaning this up.

lib/Transforms/Vectorize/SLPVectorizer.cpp
2061–2064	Makes sense.
4023–4025	What do you mean by "top" here? Do you just mean that the initial Root should not be a cast? If so, would it be easier to check that here, rather than passing the flag in collectValuesToDemote?

ABataev added inline comments. Jan 15 2018, 6:38 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4023–4025	Yes, it is about the initial roots. The main problem here is that these roots must still be analyzed, though not considered as candidates to demote. Meanwhile, I have a question for you. Why do you try to generate the sequence `cast(extractelement(<resulting_vector>))` instead of `extractelement(cast(<resulting_vector>))`? The first produces `2 * n` instructions, while the second approach produces `n + 1` instructions.
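The instruction counts in the question above can be written out as a small sketch. The function names are hypothetical, and `N` is assumed to be the number of scalar lanes extracted from the resulting vector.

```cpp
#include <cassert>

// cast(extractelement(<vec>)) per lane: N extractelement + N scalar casts.
constexpr unsigned castAfterExtract(unsigned N) { return 2 * N; }

// extractelement(cast(<vec>)): one vector cast + N extractelement.
constexpr unsigned extractAfterCast(unsigned N) { return N + 1; }

// For a 2-lane vector this is 4 instructions versus 3.
static_assert(castAfterExtract(2) == 4, "2 extracts + 2 casts");
static_assert(extractAfterCast(2) == 3, "1 vector cast + 2 extracts");
```

The two forms cost the same only for `N == 1`. For any wider vector the single vector cast is cheaper by this count.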

Update after review

Harbormaster completed remote builds in B13991: Diff 130484. Jan 18 2018, 1:23 PM

mssimpso added inline comments. Jan 18 2018, 1:36 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4023–4025	I honestly don't recall what the reason was. It probably had something to do with the way InstCombine wanted to rewrite the expression, given the trunc/ext sequence. I think what we should really try to do is make the type-truncating code in InstCombine (EvaluateInDifferentType) a utility that we can use within the vectorizers. The trunc/ext trick is very fragile. We use it here in SLP, and in two separate places in LV to type-shrink reductions and other instructions. I'll finish taking a look at the rest of the patch shortly.
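The arithmetic fact that type shrinking relies on can be checked with a small standalone sketch: for an operation such as add, extending the operands, computing in the wide type and truncating the result gives the same low bits as computing directly in the narrow type. The helper names below are hypothetical, and the sketch assumes a two's-complement target, where out-of-range conversions to `int8_t` wrap (guaranteed since C++20, implementation-defined before).

```cpp
#include <cassert>
#include <cstdint>

// Wide form: sext both operands to i32, add, then trunc back to i8.
inline std::int8_t addWideThenTrunc(std::int8_t A, std::int8_t B) {
  std::int32_t Wide = std::int32_t{A} + std::int32_t{B};
  return static_cast<std::int8_t>(Wide); // keeps only the low 8 bits
}

// Narrow form: add directly in 8 bits with wrap-around.
inline std::int8_t addNarrow(std::int8_t A, std::int8_t B) {
  auto Sum = static_cast<std::uint8_t>(static_cast<std::uint8_t>(A) +
                                       static_cast<std::uint8_t>(B));
  return static_cast<std::int8_t>(Sum);
}

// Exhaustively compare both forms over all 256 * 256 operand pairs.
inline bool allPairsAgree() {
  for (int A = -128; A <= 127; ++A)
    for (int B = -128; B <= 127; ++B)
      if (addWideThenTrunc(static_cast<std::int8_t>(A),
                           static_cast<std::int8_t>(B)) !=
          addNarrow(static_cast<std::int8_t>(A), static_cast<std::int8_t>(B)))
        return false;
  return true;
}
```

The same low-bits argument holds for sub, mul, and, or and xor, which is consistent with the opcodes that `collectValuesToDemote` accepts in the diff.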

LGTM.

This revision is now accepted and ready to land. Jan 18 2018, 1:43 PM

Closed by commit rL322946: [SLP] Fix vectorization for tree with trunc to minimum required bit width. (authored by ABataev). Jan 19 2018, 6:43 AM

This revision was automatically updated to reflect the committed changes.

haicheng mentioned this in D44565: [SLP] Add a check before skipping inserting a trunc after z|sext vectorization tree root. Mar 16 2018, 7:18 AM

Revision Contents

Path                                                 Size
lib/Transforms/Vectorize/SLPVectorizer.cpp           45 lines
test/Transforms/SLPVectorizer/X86/PR35777.ll         11 lines
test/Transforms/SLPVectorizer/X86/sign-extend.ll     21 lines

Diff 129443

lib/Transforms/Vectorize/SLPVectorizer.cpp

[2,052 lines not shown]	switch (ShuffleOrOp) {
case Instruction::BitCast: {		case Instruction::BitCast: {
Type *SrcTy = VL0->getOperand(0)->getType();		Type *SrcTy = VL0->getOperand(0)->getType();

// Calculate the cost of this instruction.		// Calculate the cost of this instruction.
int ScalarCost = VL.size() * TTI->getCastInstrCost(VL0->getOpcode(),		int ScalarCost = VL.size() * TTI->getCastInstrCost(VL0->getOpcode(),
VL0->getType(), SrcTy, VL0);		VL0->getType(), SrcTy, VL0);

VectorType *SrcVecTy = VectorType::get(SrcTy, VL.size());		VectorType *SrcVecTy = VectorType::get(SrcTy, VL.size());
int VecCost = TTI->getCastInstrCost(VL0->getOpcode(), VecTy, SrcVecTy, VL0);		int VecCost = 0;
		// Check if the values are candidates to demote.
		if (!MinBWs.count(VL0) || VecTy != SrcVecTy)
		VecCost = TTI->getCastInstrCost(VL0->getOpcode(), VecTy, SrcVecTy, VL0);
return VecCost - ScalarCost;		return VecCost - ScalarCost;
}		}
case Instruction::FCmp:		case Instruction::FCmp:
case Instruction::ICmp:		case Instruction::ICmp:
case Instruction::Select: {		case Instruction::Select: {
// Calculate the cost of this instruction.		// Calculate the cost of this instruction.
VectorType *MaskTy = VectorType::get(Builder.getInt1Ty(), VL.size());		VectorType *MaskTy = VectorType::get(Builder.getInt1Ty(), VL.size());
int ScalarCost = VecTy->getNumElements() *		int ScalarCost = VecTy->getNumElements() *
[1,824 lines not shown]	unsigned BoUpSLP::getVectorElementSize(Value *V) {
return MaxWidth;		return MaxWidth;
}		}

// Determine if a value V in a vectorizable expression Expr can be demoted to a		// Determine if a value V in a vectorizable expression Expr can be demoted to a
// smaller type with a truncation. We collect the values that will be demoted		// smaller type with a truncation. We collect the values that will be demoted
// in ToDemote and additional roots that require investigating in Roots.		// in ToDemote and additional roots that require investigating in Roots.
static bool collectValuesToDemote(Value *V, SmallPtrSetImpl<Value *> &Expr,		static bool collectValuesToDemote(Value *V, SmallPtrSetImpl<Value *> &Expr,
SmallVectorImpl<Value *> &ToDemote,		SmallVectorImpl<Value *> &ToDemote,
SmallVectorImpl<Value *> &Roots) {		SmallVectorImpl<Value *> &Roots,
		bool IncludeCastsToDemote) {
// We can always demote constants.		// We can always demote constants.
if (isa<Constant>(V)) {		if (isa<Constant>(V)) {
		if (IncludeCastsToDemote)
ToDemote.push_back(V);		ToDemote.push_back(V);
return true;		return true;
}		}

// If the value is not an instruction in the expression with only one use, it		// If the value is not an instruction in the expression with only one use, it
// cannot be demoted.		// cannot be demoted.
auto *I = dyn_cast<Instruction>(V);		auto *I = dyn_cast<Instruction>(V);
if (!I || !I->hasOneUse() || !Expr.count(I))		if (!I || !I->hasOneUse() || !Expr.count(I))
return false;		return false;

switch (I->getOpcode()) {		switch (I->getOpcode()) {

// We can always demote truncations and extensions. Since truncations can		// We can always demote truncations and extensions. Since truncations can
// seed additional demotion, we save the truncated value.		// seed additional demotion, we save the truncated value.
case Instruction::Trunc:		case Instruction::Trunc:
Roots.push_back(I->getOperand(0));		Roots.push_back(I->getOperand(0));
break;		LLVM_FALLTHROUGH;
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
		if (!IncludeCastsToDemote)
		return true;
break;		break;

// We can demote certain binary operations if we can demote both of their		// We can demote certain binary operations if we can demote both of their
// operands.		// operands.
case Instruction::Add:		case Instruction::Add:
case Instruction::Sub:		case Instruction::Sub:
case Instruction::Mul:		case Instruction::Mul:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor:		case Instruction::Xor:
if (!collectValuesToDemote(I->getOperand(0), Expr, ToDemote, Roots) ||		if (!collectValuesToDemote(I->getOperand(0), Expr, ToDemote, Roots,
!collectValuesToDemote(I->getOperand(1), Expr, ToDemote, Roots))		/*IncludeCastsToDemote=*/true) ||
		!collectValuesToDemote(I->getOperand(1), Expr, ToDemote, Roots,
		/*IncludeCastsToDemote=*/true))
return false;		return false;
break;		break;

// We can demote selects if we can demote their true and false values.		// We can demote selects if we can demote their true and false values.
case Instruction::Select: {		case Instruction::Select: {
SelectInst *SI = cast<SelectInst>(I);		SelectInst *SI = cast<SelectInst>(I);
if (!collectValuesToDemote(SI->getTrueValue(), Expr, ToDemote, Roots) ||		if (!collectValuesToDemote(SI->getTrueValue(), Expr, ToDemote, Roots,
!collectValuesToDemote(SI->getFalseValue(), Expr, ToDemote, Roots))		/*IncludeCastsToDemote=*/true) ||
		!collectValuesToDemote(SI->getFalseValue(), Expr, ToDemote, Roots,
		/*IncludeCastsToDemote=*/true))
return false;		return false;
break;		break;
}		}

// We can demote phis if we can demote all their incoming operands. Note that		// We can demote phis if we can demote all their incoming operands. Note that
// we don't need to worry about cycles since we ensure single use above.		// we don't need to worry about cycles since we ensure single use above.
case Instruction::PHI: {		case Instruction::PHI: {
PHINode *PN = cast<PHINode>(I);		PHINode *PN = cast<PHINode>(I);
for (Value *IncValue : PN->incoming_values())		for (Value *IncValue : PN->incoming_values())
if (!collectValuesToDemote(IncValue, Expr, ToDemote, Roots))		if (!collectValuesToDemote(IncValue, Expr, ToDemote, Roots,
		/*IncludeCastsToDemote=*/true))
return false;		return false;
break;		break;
}		}

// Otherwise, conservatively give up.		// Otherwise, conservatively give up.
default:		default:
return false;		return false;
}		}
[40 lines not shown]	for (auto *Root : TreeRoot)
if (!Root->hasOneUse() || Expr.count(*Root->user_begin()))		if (!Root->hasOneUse() || Expr.count(*Root->user_begin()))
return;		return;

// Conservatively determine if we can actually truncate the roots of the		// Conservatively determine if we can actually truncate the roots of the
// expression. Collect the values that can be demoted in ToDemote and		// expression. Collect the values that can be demoted in ToDemote and
// additional roots that require investigating in Roots.		// additional roots that require investigating in Roots.
SmallVector<Value *, 32> ToDemote;		SmallVector<Value *, 32> ToDemote;
SmallVector<Value *, 4> Roots;		SmallVector<Value *, 4> Roots;
for (auto *Root : TreeRoot)		for (auto *Root : TreeRoot) {
if (!collectValuesToDemote(Root, Expr, ToDemote, Roots))		// Do not include top zext/sext/trunc operations to those to be demoted, it
		// produces noise cast<vect>, trunc <vect>, exctract <vect>, cast <extract>
		// sequence.
		if (!collectValuesToDemote(Root, Expr, ToDemote, Roots,
		/*IncludeCastsToDemote=*/false))
return;		return;
		}

// The maximum bit width required to represent all the values that can be		// The maximum bit width required to represent all the values that can be
// demoted without loss of precision. It would be safe to truncate the roots		// demoted without loss of precision. It would be safe to truncate the roots
// of the expression to this width.		// of the expression to this width.
auto MaxBitWidth = 8u;		auto MaxBitWidth = 8u;

// We first check if all the bits of the roots are demanded. If they're not,		// We first check if all the bits of the roots are demanded. If they're not,
// we can truncate the roots to this narrower type.		// we can truncate the roots to this narrower type.
[61 lines not shown]	void BoUpSLP::computeMinimumValueSizes() {
// If the maximum bit width we compute is less than the with of the roots'		// If the maximum bit width we compute is less than the with of the roots'
// type, we can proceed with the narrowing. Otherwise, do nothing.		// type, we can proceed with the narrowing. Otherwise, do nothing.
if (MaxBitWidth >= TreeRootIT->getBitWidth())		if (MaxBitWidth >= TreeRootIT->getBitWidth())
return;		return;

// If we can truncate the root, we must collect additional values that might		// If we can truncate the root, we must collect additional values that might
// be demoted as a result. That is, those seeded by truncations we will		// be demoted as a result. That is, those seeded by truncations we will
// modify.		// modify.
while (!Roots.empty())		while (!Roots.empty()) {
collectValuesToDemote(Roots.pop_back_val(), Expr, ToDemote, Roots);		collectValuesToDemote(Roots.pop_back_val(), Expr, ToDemote, Roots,
		/*IncludeCastsToDemote=*/true);
		}

// Finally, map the values we can demote to the maximum bit with we computed.		// Finally, map the values we can demote to the maximum bit with we computed.
for (auto *Scalar : ToDemote)		for (auto *Scalar : ToDemote)
MinBWs[Scalar] = std::make_pair(MaxBitWidth, !IsKnownPositive);		MinBWs[Scalar] = std::make_pair(MaxBitWidth, !IsKnownPositive);
}		}

namespace {		namespace {

[1,888 lines not shown]

test/Transforms/SLPVectorizer/X86/PR35777.ll

	[10 lines not shown]
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[ARG:%.*]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[ARG:%.*]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[ARG]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[ARG]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x double> [[TMP3]], [[TMP1]]			; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x double> [[TMP3]], [[TMP1]]
	; CHECK-NEXT: [[TMP5:%.*]] = fadd <2 x double> [[TMP0]], [[TMP4]]			; CHECK-NEXT: [[TMP5:%.*]] = fadd <2 x double> [[TMP0]], [[TMP4]]
	; CHECK-NEXT: [[TMP6:%.*]] = load <2 x double>, <2 x double>* bitcast (double* getelementptr inbounds ([6 x double], [6 x double]* @global, i64 0, i64 4) to <2 x double>*), align 16			; CHECK-NEXT: [[TMP6:%.*]] = load <2 x double>, <2 x double>* bitcast (double* getelementptr inbounds ([6 x double], [6 x double]* @global, i64 0, i64 4) to <2 x double>*), align 16
	; CHECK-NEXT: [[TMP7:%.*]] = fadd <2 x double> [[TMP6]], [[TMP5]]			; CHECK-NEXT: [[TMP7:%.*]] = fadd <2 x double> [[TMP6]], [[TMP5]]
	; CHECK-NEXT: [[TMP8:%.*]] = fptosi <2 x double> [[TMP7]] to <2 x i32>			; CHECK-NEXT: [[TMP8:%.*]] = fptosi <2 x double> [[TMP7]] to <2 x i32>
	; CHECK-NEXT: [[TMP9:%.*]] = sext <2 x i32> [[TMP8]] to <2 x i64>			; CHECK-NEXT: [[TMP9:%.*]] = sext <2 x i32> [[TMP8]] to <2 x i64>
	; CHECK-NEXT: [[TMP10:%.*]] = trunc <2 x i64> [[TMP9]] to <2 x i32>			; CHECK-NEXT: [[TMP10:%.*]] = extractelement <2 x i64> [[TMP9]], i32 0
	; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i32> [[TMP10]], i32 0			; CHECK-NEXT: [[TMP16:%.*]] = insertvalue { i64, i64 } undef, i64 [[TMP10]], 0
	; CHECK-NEXT: [[TMP12:%.*]] = sext i32 [[TMP11]] to i64			; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i64> [[TMP9]], i32 1
	; CHECK-NEXT: [[TMP16:%.*]] = insertvalue { i64, i64 } undef, i64 [[TMP12]], 0			; CHECK-NEXT: [[TMP17:%.*]] = insertvalue { i64, i64 } [[TMP16]], i64 [[TMP11]], 1
	; CHECK-NEXT: [[TMP13:%.*]] = extractelement <2 x i32> [[TMP10]], i32 1
	; CHECK-NEXT: [[TMP14:%.*]] = sext i32 [[TMP13]] to i64
	; CHECK-NEXT: [[TMP17:%.*]] = insertvalue { i64, i64 } [[TMP16]], i64 [[TMP14]], 1
	; CHECK-NEXT: ret { i64, i64 } [[TMP17]]			; CHECK-NEXT: ret { i64, i64 } [[TMP17]]
	;			;
	bb:			bb:
	%tmp = load double, double* getelementptr inbounds ([6 x double], [6 x double]* @global, i64 0, i64 0), align 16			%tmp = load double, double* getelementptr inbounds ([6 x double], [6 x double]* @global, i64 0, i64 0), align 16
	%tmp1 = load double, double* getelementptr inbounds ([6 x double], [6 x double]* @global, i64 0, i64 2), align 16			%tmp1 = load double, double* getelementptr inbounds ([6 x double], [6 x double]* @global, i64 0, i64 2), align 16
	%tmp2 = fmul double %tmp1, %arg			%tmp2 = fmul double %tmp1, %arg
	%tmp3 = fadd double %tmp, %tmp2			%tmp3 = fadd double %tmp, %tmp2
	%tmp4 = load double, double* getelementptr inbounds ([6 x double], [6 x double]* @global, i64 0, i64 4), align 16			%tmp4 = load double, double* getelementptr inbounds ([6 x double], [6 x double]* @global, i64 0, i64 4), align 16
	[15 lines not shown]

test/Transforms/SLPVectorizer/X86/sign-extend.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -slp-vectorizer < %s -S -o - -mtriple=x86_64-apple-macosx10.10.0 -mcpu=core2 | FileCheck %s			; RUN: opt -slp-vectorizer < %s -S -o - -mtriple=x86_64-apple-macosx10.10.0 -mcpu=core2 | FileCheck %s

	define <4 x i32> @sign_extend_v_v(<4 x i16> %lhs) {			define <4 x i32> @sign_extend_v_v(<4 x i16> %lhs) {
	; CHECK-LABEL: @sign_extend_v_v(			; CHECK-LABEL: @sign_extend_v_v(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[VECEXT:%.*]] = extractelement <4 x i16> [[LHS:%.*]], i32 0			; CHECK-NEXT: [[TMP0:%.*]] = sext <4 x i16> [[LHS:%.*]] to <4 x i32>
	; CHECK-NEXT: [[CONV:%.*]] = sext i16 [[VECEXT]] to i32			; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x i32> [[TMP0]], i32 0
	; CHECK-NEXT: [[VECINIT:%.*]] = insertelement <4 x i32> undef, i32 [[CONV]], i32 0			; CHECK-NEXT: [[VECINIT:%.*]] = insertelement <4 x i32> undef, i32 [[TMP1]], i32 0
	; CHECK-NEXT: [[VECEXT1:%.*]] = extractelement <4 x i16> [[LHS]], i32 1			; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
	; CHECK-NEXT: [[CONV2:%.*]] = sext i16 [[VECEXT1]] to i32			; CHECK-NEXT: [[VECINIT3:%.*]] = insertelement <4 x i32> [[VECINIT]], i32 [[TMP2]], i32 1
	; CHECK-NEXT: [[VECINIT3:%.*]] = insertelement <4 x i32> [[VECINIT]], i32 [[CONV2]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[TMP0]], i32 2
	; CHECK-NEXT: [[VECEXT4:%.*]] = extractelement <4 x i16> [[LHS]], i32 2			; CHECK-NEXT: [[VECINIT6:%.*]] = insertelement <4 x i32> [[VECINIT3]], i32 [[TMP3]], i32 2
	; CHECK-NEXT: [[CONV5:%.*]] = sext i16 [[VECEXT4]] to i32			; CHECK-NEXT: [[TMP4:%.*]] = extractelement <4 x i32> [[TMP0]], i32 3
	; CHECK-NEXT: [[VECINIT6:%.*]] = insertelement <4 x i32> [[VECINIT3]], i32 [[CONV5]], i32 2			; CHECK-NEXT: [[VECINIT9:%.*]] = insertelement <4 x i32> [[VECINIT6]], i32 [[TMP4]], i32 3
	; CHECK-NEXT: [[VECEXT7:%.*]] = extractelement <4 x i16> [[LHS]], i32 3
	; CHECK-NEXT: [[CONV8:%.*]] = sext i16 [[VECEXT7]] to i32
	; CHECK-NEXT: [[VECINIT9:%.*]] = insertelement <4 x i32> [[VECINIT6]], i32 [[CONV8]], i32 3
	; CHECK-NEXT: ret <4 x i32> [[VECINIT9]]			; CHECK-NEXT: ret <4 x i32> [[VECINIT9]]
	;			;
	entry:			entry:
	%vecext = extractelement <4 x i16> %lhs, i32 0			%vecext = extractelement <4 x i16> %lhs, i32 0
	%conv = sext i16 %vecext to i32			%conv = sext i16 %vecext to i32
	%vecinit = insertelement <4 x i32> undef, i32 %conv, i32 0			%vecinit = insertelement <4 x i32> undef, i32 %conv, i32 0
	%vecext1 = extractelement <4 x i16> %lhs, i32 1			%vecext1 = extractelement <4 x i16> %lhs, i32 1
	%conv2 = sext i16 %vecext1 to i32			%conv2 = sext i16 %vecext1 to i32
	[9 lines not shown]