This is an archive of the discontinued LLVM Phabricator instance.

Differential D41948

[SLP] Fix vectorization for tree with trunc to minimum required bit width.
Closed · Public

Authored by ABataev on Jan 11 2018, 6:46 AM.

Details

Reviewers

RKSimon
spatel
mkuper
hfinkel
mssimpso

Commits

rGfa80c47c6a67: [SLP] Fix vectorization for tree with trunc to minimum required bit width.
rL322946: [SLP] Fix vectorization for tree with trunc to minimum required bit width.

Summary

If the vectorized tree is truncated to the minimum required bit width and
the vector type of the cast operation after the truncation is the same
as the vector type of the cast operands, count the cost of the vector cast
operation as 0, because this cast will be removed later.
Also, if the root operations of the vectorization tree are integer cast operations, do not consider them as candidates for truncation. Doing so would only create extra vector/scalar operations of the same kind, which InstCombine would then remove.
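
The cost rule from the first paragraph can be sketched outside of LLVM as a small model. This is only an illustration: `CastCostInputs` and `castCostDelta` are hypothetical names, and the plain integer costs stand in for what the patch obtains from `TTI->getCastInstrCost`.

```cpp
// Hypothetical stand-ins for the values the real cost model queries from TTI.
struct CastCostInputs {
  int NumLanes;        // number of scalars in the bundle (VL.size() in the patch)
  int ScalarCastCost;  // assumed cost of one scalar cast
  int VectorCastCost;  // assumed cost of the equivalent vector cast
  bool InMinBWs;       // models MinBWs.count(VL0) != 0
  bool SameVectorType; // models VecTy == SrcVecTy
};

// Returns VecCost - ScalarCost; a negative value favors vectorization.
int castCostDelta(const CastCostInputs &In) {
  int ScalarCost = In.NumLanes * In.ScalarCastCost;
  // The vector cast is free only when the value is going to be demoted and
  // the cast would not change the vector type, so it is expected to fold away.
  int VecCost = 0;
  if (!In.InMinBWs || !In.SameVectorType)
    VecCost = In.VectorCastCost;
  return VecCost - ScalarCost;
}
```

With four lanes, a scalar cast cost of 1 and a vector cast cost of 2 (made-up numbers), the delta is -4 when the cast is expected to fold away and -2 otherwise.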

Diff Detail

Repository: rL LLVM

Event Timeline

ABataev created this revision. Jan 11 2018, 6:46 AM

Thanks for cleaning this up.

lib/Transforms/Vectorize/SLPVectorizer.cpp
2061–2064 ↗	(On Diff #129443)	Makes sense.
4023–4025 ↗	(On Diff #129443)	What do you mean by "top" here? Do you just mean that the initial Root should not be a cast? If so, would it be easier to check that here, rather than passing the flag in collectValuesToDemote?

ABataev added inline comments. Jan 15 2018, 6:38 AM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4023–4025 ↗	(On Diff #129443)	Yes, it is about the initial roots. The main problem here is that these roots must still be analyzed, though they are not considered candidates to demote. Meanwhile, I have a question for you. Why do you try to generate the sequence `cast(extractelement(<resulting_vector>))` instead of `extractelement(cast(<resulting_vector>))`? The first produces `2 * n` instructions, while the second approach produces `n + 1` instructions.
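
The instruction counts in the question above can be checked with a small sketch. The helper names and the textual pseudo-instructions are made up for illustration; they are not LLVM API and not real IR syntax.

```cpp
#include <string>
#include <vector>

// Per-lane form: extract each lane from the narrow vector v, then cast it.
std::vector<std::string> castOfExtracts(unsigned N) {
  std::vector<std::string> Seq;
  for (unsigned I = 0; I < N; ++I) {
    std::string Idx = std::to_string(I);
    Seq.push_back("e" + Idx + " = extractelement v, " + Idx);
    Seq.push_back("c" + Idx + " = cast e" + Idx);
  }
  return Seq; // 2 * N instructions
}

// Vector form: cast the whole vector once, then extract each lane.
std::vector<std::string> extractsOfCast(unsigned N) {
  std::vector<std::string> Seq;
  Seq.push_back("w = cast v");
  for (unsigned I = 0; I < N; ++I) {
    std::string Idx = std::to_string(I);
    Seq.push_back("c" + Idx + " = extractelement w, " + Idx);
  }
  return Seq; // N + 1 instructions
}
```

For four lanes this gives 8 versus 5 instructions, which matches the shape of the updated sign-extend.ll checks in this revision: one vector `sext` followed by four `extractelement` instructions.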

Update after review

Harbormaster completed remote builds in B13991: Diff 130484. Jan 18 2018, 1:23 PM

mssimpso added inline comments. Jan 18 2018, 1:36 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
4023–4025 ↗	(On Diff #129443)	I honestly don't recall what the reason was. It probably had something to do with the way InstCombine wanted to rewrite the expression, given the trunc/ext sequence. I think what we should really try to do is make the type truncating code in InstCombine (EvaluateInDifferentType) a utility that we can use within the vectorizers. The trunc/ext trick is very fragile. We use it here in SLP, and in two separate places in LV to type-shrink reductions and other instructions. I'll finish taking a look at the rest of the patch shortly.

LGTM.

This revision is now accepted and ready to land. Jan 18 2018, 1:43 PM

Closed by commit rL322946: [SLP] Fix vectorization for tree with trunc to minimum required bit width. (authored by ABataev). Jan 19 2018, 6:43 AM

This revision was automatically updated to reflect the committed changes.

haicheng mentioned this in D44565: [SLP] Add a check before skipping inserting a trunc after z|sext vectorization tree root. Mar 16 2018, 7:18 AM

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp	22 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/PR35777.ll	11 lines
llvm/trunk/test/Transforms/SLPVectorizer/X86/sign-extend.ll	21 lines

Diff 130611

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp

@@ 2,059 lines hidden @@ switch (ShuffleOrOp) {
   case Instruction::BitCast: {
     Type *SrcTy = VL0->getOperand(0)->getType();

     // Calculate the cost of this instruction.
     int ScalarCost = VL.size() * TTI->getCastInstrCost(VL0->getOpcode(),
                                                        VL0->getType(), SrcTy, VL0);

     VectorType *SrcVecTy = VectorType::get(SrcTy, VL.size());
-    int VecCost = TTI->getCastInstrCost(VL0->getOpcode(), VecTy, SrcVecTy, VL0);
+    int VecCost = 0;
+    // Check if the values are candidates to demote.
+    if (!MinBWs.count(VL0) || VecTy != SrcVecTy)
+      VecCost = TTI->getCastInstrCost(VL0->getOpcode(), VecTy, SrcVecTy, VL0);
     return VecCost - ScalarCost;
   }
   case Instruction::FCmp:
   case Instruction::ICmp:
   case Instruction::Select: {
     // Calculate the cost of this instruction.
     VectorType *MaskTy = VectorType::get(Builder.getInt1Ty(), VL.size());
     int ScalarCost = VecTy->getNumElements() *
@@ 1,932 lines hidden @@ for (auto *Root : TreeRoot)
     if (!Root->hasOneUse() || Expr.count(*Root->user_begin()))
       return;

   // Conservatively determine if we can actually truncate the roots of the
   // expression. Collect the values that can be demoted in ToDemote and
   // additional roots that require investigating in Roots.
   SmallVector<Value *, 32> ToDemote;
   SmallVector<Value *, 4> Roots;
-  for (auto *Root : TreeRoot)
+  for (auto *Root : TreeRoot) {
+    // Do not include top zext/sext/trunc operations to those to be demoted, it
+    // produces noise cast<vect>, trunc <vect>, exctract <vect>, cast <extract>
+    // sequence.
+    if (isa<Constant>(Root))
+      continue;
+    auto *I = dyn_cast<Instruction>(Root);
+    if (!I || !I->hasOneUse() || !Expr.count(I))
+      return;
+    if (isa<ZExtInst>(I) || isa<SExtInst>(I))
+      continue;
+    if (auto *TI = dyn_cast<TruncInst>(I)) {
+      Roots.push_back(TI->getOperand(0));
+      continue;
+    }
     if (!collectValuesToDemote(Root, Expr, ToDemote, Roots))
       return;
+  }

   // The maximum bit width required to represent all the values that can be
   // demoted without loss of precision. It would be safe to truncate the roots
   // of the expression to this width.
   auto MaxBitWidth = 8u;

   // We first check if all the bits of the roots are demanded. If they're not,
   // we can truncate the roots to this narrower type.
@@ 1,967 lines hidden @@
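
The root filtering added to the loop over `TreeRoot` can be modeled without LLVM types. This is a simplified sketch: `RootKind`, `RootPlan` and `classifyRoots` are hypothetical names, and the model leaves out the `hasOneUse()` and `Expr.count()` checks as well as the early return taken when `collectValuesToDemote` fails.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical classification of a tree root; the patch uses isa<>/dyn_cast<>.
enum class RootKind { Constant, ZExt, SExt, Trunc, Other };

struct RootPlan {
  std::vector<std::size_t> DemoteCandidates; // roots passed on to the demotion analysis
  std::vector<std::size_t> ExtraRoots;       // trunc roots whose operand is analyzed later
};

RootPlan classifyRoots(const std::vector<RootKind> &Roots) {
  RootPlan Plan;
  for (std::size_t I = 0; I < Roots.size(); ++I) {
    switch (Roots[I]) {
    case RootKind::Constant:
    case RootKind::ZExt:
    case RootKind::SExt:
      break; // skipped: never a demotion candidate
    case RootKind::Trunc:
      Plan.ExtraRoots.push_back(I); // the patch pushes the trunc operand to Roots
      break;
    case RootKind::Other:
      Plan.DemoteCandidates.push_back(I);
      break;
    }
  }
  return Plan;
}
```

In this model a root that is itself a `zext` or `sext` is neither demoted nor queued, which is how the patch avoids the cast, trunc, extract, cast sequence described in the new comment.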

llvm/trunk/test/Transforms/SLPVectorizer/X86/PR35777.ll

@@ 10 lines hidden @@
 ; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[ARG:%.*]], i32 0
 ; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[ARG]], i32 1
 ; CHECK-NEXT: [[TMP4:%.*]] = fmul <2 x double> [[TMP3]], [[TMP1]]
 ; CHECK-NEXT: [[TMP5:%.*]] = fadd <2 x double> [[TMP0]], [[TMP4]]
 ; CHECK-NEXT: [[TMP6:%.*]] = load <2 x double>, <2 x double>* bitcast (double* getelementptr inbounds ([6 x double], [6 x double]* @global, i64 0, i64 4) to <2 x double>*), align 16
 ; CHECK-NEXT: [[TMP7:%.*]] = fadd <2 x double> [[TMP6]], [[TMP5]]
 ; CHECK-NEXT: [[TMP8:%.*]] = fptosi <2 x double> [[TMP7]] to <2 x i32>
 ; CHECK-NEXT: [[TMP9:%.*]] = sext <2 x i32> [[TMP8]] to <2 x i64>
-; CHECK-NEXT: [[TMP10:%.*]] = trunc <2 x i64> [[TMP9]] to <2 x i32>
-; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i32> [[TMP10]], i32 0
-; CHECK-NEXT: [[TMP12:%.*]] = sext i32 [[TMP11]] to i64
-; CHECK-NEXT: [[TMP16:%.*]] = insertvalue { i64, i64 } undef, i64 [[TMP12]], 0
-; CHECK-NEXT: [[TMP13:%.*]] = extractelement <2 x i32> [[TMP10]], i32 1
-; CHECK-NEXT: [[TMP14:%.*]] = sext i32 [[TMP13]] to i64
-; CHECK-NEXT: [[TMP17:%.*]] = insertvalue { i64, i64 } [[TMP16]], i64 [[TMP14]], 1
+; CHECK-NEXT: [[TMP10:%.*]] = extractelement <2 x i64> [[TMP9]], i32 0
+; CHECK-NEXT: [[TMP16:%.*]] = insertvalue { i64, i64 } undef, i64 [[TMP10]], 0
+; CHECK-NEXT: [[TMP11:%.*]] = extractelement <2 x i64> [[TMP9]], i32 1
+; CHECK-NEXT: [[TMP17:%.*]] = insertvalue { i64, i64 } [[TMP16]], i64 [[TMP11]], 1
 ; CHECK-NEXT: ret { i64, i64 } [[TMP17]]
 ;
 bb:
   %tmp = load double, double* getelementptr inbounds ([6 x double], [6 x double]* @global, i64 0, i64 0), align 16
   %tmp1 = load double, double* getelementptr inbounds ([6 x double], [6 x double]* @global, i64 0, i64 2), align 16
   %tmp2 = fmul double %tmp1, %arg
   %tmp3 = fadd double %tmp, %tmp2
   %tmp4 = load double, double* getelementptr inbounds ([6 x double], [6 x double]* @global, i64 0, i64 4), align 16
@@ 15 lines hidden @@

llvm/trunk/test/Transforms/SLPVectorizer/X86/sign-extend.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -slp-vectorizer < %s -S -o - -mtriple=x86_64-apple-macosx10.10.0 -mcpu=core2 | FileCheck %s

 define <4 x i32> @sign_extend_v_v(<4 x i16> %lhs) {
 ; CHECK-LABEL: @sign_extend_v_v(
 ; CHECK-NEXT: entry:
-; CHECK-NEXT: [[VECEXT:%.*]] = extractelement <4 x i16> [[LHS:%.*]], i32 0
-; CHECK-NEXT: [[CONV:%.*]] = sext i16 [[VECEXT]] to i32
-; CHECK-NEXT: [[VECINIT:%.*]] = insertelement <4 x i32> undef, i32 [[CONV]], i32 0
-; CHECK-NEXT: [[VECEXT1:%.*]] = extractelement <4 x i16> [[LHS]], i32 1
-; CHECK-NEXT: [[CONV2:%.*]] = sext i16 [[VECEXT1]] to i32
-; CHECK-NEXT: [[VECINIT3:%.*]] = insertelement <4 x i32> [[VECINIT]], i32 [[CONV2]], i32 1
-; CHECK-NEXT: [[VECEXT4:%.*]] = extractelement <4 x i16> [[LHS]], i32 2
-; CHECK-NEXT: [[CONV5:%.*]] = sext i16 [[VECEXT4]] to i32
-; CHECK-NEXT: [[VECINIT6:%.*]] = insertelement <4 x i32> [[VECINIT3]], i32 [[CONV5]], i32 2
-; CHECK-NEXT: [[VECEXT7:%.*]] = extractelement <4 x i16> [[LHS]], i32 3
-; CHECK-NEXT: [[CONV8:%.*]] = sext i16 [[VECEXT7]] to i32
-; CHECK-NEXT: [[VECINIT9:%.*]] = insertelement <4 x i32> [[VECINIT6]], i32 [[CONV8]], i32 3
+; CHECK-NEXT: [[TMP0:%.*]] = sext <4 x i16> [[LHS:%.*]] to <4 x i32>
+; CHECK-NEXT: [[TMP1:%.*]] = extractelement <4 x i32> [[TMP0]], i32 0
+; CHECK-NEXT: [[VECINIT:%.*]] = insertelement <4 x i32> undef, i32 [[TMP1]], i32 0
+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
+; CHECK-NEXT: [[VECINIT3:%.*]] = insertelement <4 x i32> [[VECINIT]], i32 [[TMP2]], i32 1
+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[TMP0]], i32 2
+; CHECK-NEXT: [[VECINIT6:%.*]] = insertelement <4 x i32> [[VECINIT3]], i32 [[TMP3]], i32 2
+; CHECK-NEXT: [[TMP4:%.*]] = extractelement <4 x i32> [[TMP0]], i32 3
+; CHECK-NEXT: [[VECINIT9:%.*]] = insertelement <4 x i32> [[VECINIT6]], i32 [[TMP4]], i32 3
 ; CHECK-NEXT: ret <4 x i32> [[VECINIT9]]
 ;
 entry:
   %vecext = extractelement <4 x i16> %lhs, i32 0
   %conv = sext i16 %vecext to i32
   %vecinit = insertelement <4 x i32> undef, i32 %conv, i32 0
   %vecext1 = extractelement <4 x i16> %lhs, i32 1
   %conv2 = sext i16 %vecext1 to i32
@@ 9 lines hidden @@