This is an archive of the discontinued LLVM Phabricator instance.

Differential D50840

[InstCombine] Extend collectShuffleElements to support extract/zext/insert patterns
Abandoned, Public

Authored by joey on Aug 16 2018, 6:47 AM.

Details

Reviewers

spatel
lebedev.ri
ABataev
RKSimon

Summary

collectShuffleElements already handles combining the following into a single shufflevector:

%elt0 = extractelement <8 x i16> %in, i32 3
%elt1 = extractelement <8 x i16> %in, i32 1
%elt2 = extractelement <8 x i16> %in2, i32 0
%elt3 = extractelement <8 x i16> %in, i32 2

%vec.0 = insertelement <4 x i16> undef, i16 %elt0, i32 0
%vec.1 = insertelement <4 x i16> %vec.0, i16 %elt1, i32 1
%vec.2 = insertelement <4 x i16> %vec.1, i16 %elt2, i32 2
%vec.3 = insertelement <4 x i16> %vec.2, i16 %elt3, i32 3

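This patch extends it to handle the following pattern, where each extracted element is extended before being inserted, by turning it into a shufflevector followed by a single vector ext (zext or sext).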

%elt0e = extractelement <8 x i16> %in, i32 3
%elt1e = extractelement <8 x i16> %in, i32 1
%elt2e = extractelement <8 x i16> %in, i32 0
%elt3e = extractelement <8 x i16> %in, i32 3

%elt0 = zext i16 %elt0e to i32
%elt1 = zext i16 %elt1e to i32
%elt2 = zext i16 %elt2e to i32
%elt3 = zext i16 %elt3e to i32

%vec.0 = insertelement <4 x i32> undef, i32 %elt0, i32 0
%vec.1 = insertelement <4 x i32> %vec.0, i32 %elt1, i32 1
%vec.2 = insertelement <4 x i32> %vec.1, i32 %elt2, i32 2
%vec.3 = insertelement <4 x i32> %vec.2, i32 %elt3, i32 3
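The equivalence the summary relies on can be checked with a small standalone model. This is only a sketch in plain C++, not LLVM code: the type aliases and the function names `extractZextInsert` and `shuffleThenZext` are hypothetical, and vectors are modeled as `std::array` lanes. It shows that extending each extracted lane and inserting it gives the same result as one shuffle followed by one vector-wide zero extension.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical lane models of <8 x i16>, <4 x i16> and <4 x i32>.
using V8i16 = std::array<std::uint16_t, 8>;
using V4i16 = std::array<std::uint16_t, 4>;
using V4i32 = std::array<std::uint32_t, 4>;
using Mask4 = std::array<std::size_t, 4>;

// Scalar form: extractelement, zext, insertelement, one lane at a time.
V4i32 extractZextInsert(const V8i16 &in, const Mask4 &mask) {
  V4i32 out{};
  for (std::size_t lane = 0; lane < out.size(); ++lane) {
    assert(mask[lane] < in.size() && "mask index out of range");
    std::uint16_t elt = in[mask[lane]];           // extractelement
    out[lane] = static_cast<std::uint32_t>(elt);  // zext, then insertelement
  }
  return out;
}

// Vector form: one shufflevector followed by one vector zext.
V4i32 shuffleThenZext(const V8i16 &in, const Mask4 &mask) {
  V4i16 shuffled{};
  for (std::size_t lane = 0; lane < shuffled.size(); ++lane) {
    assert(mask[lane] < in.size() && "mask index out of range");
    shuffled[lane] = in[mask[lane]];              // shufflevector
  }
  V4i32 out{};
  for (std::size_t lane = 0; lane < out.size(); ++lane)
    out[lane] = static_cast<std::uint32_t>(shuffled[lane]);  // vector zext
  return out;
}
```

With the mask `{3, 1, 0, 3}` from the example above, both functions select lanes 3, 1, 0 and 3 of the input and widen them to 32 bits.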

Event Timeline

joey created this revision. Aug 16 2018, 6:47 AM

lebedev.ri added reviewers: spatel, lebedev.ri. Aug 16 2018, 6:51 AM

1. Why is this limited to extensions? Why can't this also be done for trunc? Or more generally, why don't we want to do this if the same operation is applied for all the elements?
2. Do the vectorizer passes handle this? (Especially in light of the last question of 1.)

lib/Transforms/InstCombine/InstCombineVectorOps.cpp
Line 477: Why `ZI`? Is this limited to `zext`? `CI` perhaps?
Line 801: Same.
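The generalization asked about in question 1 can be illustrated outside LLVM. The sketch below is a plain C++ model, not anything from the patch: `mapThenInsert`, `shuffleThenMap`, `truncToI8` and `sextToI32` are hypothetical names. It assumes the same pure per-lane operation is applied to every extracted element, in which case applying it before or after the shuffle gives the same lanes, whether the operation is a trunc, a sext, or something else.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar form: apply `op` to each extracted element, then insert it.
template <typename Out, typename In, std::size_t N, std::size_t M, typename Op>
std::array<Out, M> mapThenInsert(const std::array<In, N> &in,
                                 const std::array<std::size_t, M> &mask,
                                 Op op) {
  std::array<Out, M> out{};
  for (std::size_t lane = 0; lane < M; ++lane) {
    assert(mask[lane] < N && "mask index out of range");
    out[lane] = op(in[mask[lane]]);
  }
  return out;
}

// Vector form: shuffle first, then apply `op` once across all lanes.
template <typename Out, typename In, std::size_t N, std::size_t M, typename Op>
std::array<Out, M> shuffleThenMap(const std::array<In, N> &in,
                                  const std::array<std::size_t, M> &mask,
                                  Op op) {
  std::array<In, M> shuffled{};
  for (std::size_t lane = 0; lane < M; ++lane) {
    assert(mask[lane] < N && "mask index out of range");
    shuffled[lane] = in[mask[lane]];
  }
  std::array<Out, M> out{};
  for (std::size_t lane = 0; lane < M; ++lane)
    out[lane] = op(shuffled[lane]);
  return out;
}

// Models `trunc i16 to i8`: keeps the low 8 bits.
inline std::uint8_t truncToI8(std::uint16_t x) {
  return static_cast<std::uint8_t>(x);
}

// Models `sext i16 to i32`: widening a signed value preserves its sign.
inline std::int32_t sextToI32(std::int16_t x) {
  return static_cast<std::int32_t>(x);
}
```

The model says nothing about profitability or about where in the pipeline such a fold belongs, which is the actual point of contention in this review.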

In D50840#1202436, @lebedev.ri wrote:

1. Why is this limited to extensions? Why can't this also be done for trunc? Or more generally, why don't we want to do this if the same operation is applied for all the elements?
2. Do the vectorizer passes handle this? (Especially in light of the last question of 1.)

No real reason, just the test case I wrote the code for. I can do the same for trunc as well. I'm not sure if it makes sense for the other CastInst operators though.
I tried the test case with opt -slp-vectorizer and it doesn't catch it. Anything else I can try?

lib/Transforms/InstCombine/InstCombineVectorOps.cpp
Line 477: ZI just because I originally wrote the patch for ZExtInst. I can change that to CI. It works for `zext` and `sext` currently.

In D50840#1202449, @joey wrote:

No real reason, just the test case I wrote the code for. I can do the same for trunc as well. I'm not sure if it makes sense for the other CastInst operators though.

I tried the test case with opt -slp-vectorizer and it doesn't catch it. Anything else I can try?

I see, CC'ing @ABataev / @RKSimon / @mssimpso
I really wonder whether this is one of those cases where we shouldn't do it in InstCombine, even if it can be done there, but elsewhere.
It isn't obvious [to me] why we would stop at these 3 casts rather than keep piling more stuff on here.

It could be the kind of thing we should do in SLP. @ABataev, what do you think?

RKSimon added a reviewer: RKSimon. Aug 16 2018, 7:44 AM

In D50840#1202486, @RKSimon wrote:

It could be the kind of thing we should do in SLP. @ABataev, what do you think?

Yes, this looks like an opportunity for the SLP Vectorizer.

Here is the (reduced) motivating example:

__kernel void foo(__global uchar4 *p1, __global ushort2 *p2)
{
    uchar4 t0 = p1[0];
    uchar4 t1 = p1[1];
    
    ushort2 t00 = (ushort2)((ushort)t0.x, (ushort)t0.y);
    ushort2 t10 = (ushort2)((ushort)t1.x, (ushort)t1.y);
    
    *p2 += (t00 * t10);
}

I haven't worked with the SLPVectorizer before, so I would need some guidance in making the change there. Or someone could take over the change, if that's easier.

I found that if I apply the following patch:

diff --git a/lib/Transforms/Vectorize/SLPVectorizer.cpp b/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 32df6d58157..76103732adc 100644
--- a/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/lib/Transforms/Vectorize/SLPVectorizer.cpp 
@@ -2431,8 +2431,10 @@ bool BoUpSLP::isFullyVectorizableTinyTree() {
     return true;
 
   // Gathering cost would be too much for tiny trees.
+/*
   if (VectorizableTree[0].NeedToGather || VectorizableTree[1].NeedToGather)
     return false;
+*/

Using the test:

define <4 x i32> @test3(<8 x i16> %in, <8 x i16> %in2) {
  %elt0e = extractelement <8 x i16> %in, i32 3
  %elt1e = extractelement <8 x i16> %in, i32 1
  %elt2e = extractelement <8 x i16> %in, i32 0
  %elt3e = extractelement <8 x i16> %in, i32 3
  %elt0 = zext i16 %elt0e to i32
  %elt1 = zext i16 %elt1e to i32
  %elt2 = zext i16 %elt2e to i32
  %elt3 = zext i16 %elt3e to i32
  %vec.0 = insertelement <4 x i32> undef, i32 %elt0, i32 0
  %vec.1 = insertelement <4 x i32> %vec.0, i32 %elt1, i32 1
  %vec.2 = insertelement <4 x i32> %vec.1, i32 %elt2, i32 2
  %vec.3 = insertelement <4 x i32> %vec.2, i32 %elt3, i32 3
  ret <4 x i32> %vec.3
}

The SLPVectorizer produces:

define <4 x i32> @test3(<8 x i16> %in, <8 x i16> %in2) {
  %elt0e = extractelement <8 x i16> %in, i32 3
  %elt1e = extractelement <8 x i16> %in, i32 1
  %elt2e = extractelement <8 x i16> %in, i32 0
  %1 = insertelement <4 x i16> undef, i16 %elt0e, i32 0
  %2 = insertelement <4 x i16> %1, i16 %elt1e, i32 1
  %3 = insertelement <4 x i16> %2, i16 %elt2e, i32 2
  %4 = insertelement <4 x i16> %3, i16 %elt0e, i32 3
  %5 = zext <4 x i16> %4 to <4 x i32>
  %6 = extractelement <4 x i32> %5, i32 0
  %vec.0 = insertelement <4 x i32> undef, i32 %6, i32 0
  %7 = extractelement <4 x i32> %5, i32 1
  %vec.1 = insertelement <4 x i32> %vec.0, i32 %7, i32 1
  %8 = extractelement <4 x i32> %5, i32 2
  %vec.2 = insertelement <4 x i32> %vec.1, i32 %8, i32 2
  %9 = extractelement <4 x i32> %5, i32 3
  %vec.3 = insertelement <4 x i32> %vec.2, i32 %9, i32 3
  ret <4 x i32> %vec.3
}

Then InstCombine can clean that up into:

define <4 x i32> @test3(<8 x i16> %in, <8 x i16> %in2) {
  %1 = shufflevector <8 x i16> %in, <8 x i16> undef, <4 x i32> <i32 3, i32 1, i32 0, i32 3>
  %2 = zext <4 x i16> %1 to <4 x i32>
  ret <4 x i32> %2
}

So it looks like the SLPVectorizer can already do this, with some tweaks.

In D50840#1202564, @joey wrote:
I haven't worked with the SLPVectorizer before, so I would need some guidance in making the change there. Or someone could take over the change, if that's easier.

I found that if I apply the following patch:
diff --git a/lib/Transforms/Vectorize/SLPVectorizer.cpp b/lib/Transforms/Vectorize/SLPVectorizer.cpp
index 32df6d58157..76103732adc 100644
--- a/lib/Transforms/Vectorize/SLPVectorizer.cpp
+++ b/lib/Transforms/Vectorize/SLPVectorizer.cpp 
@@ -2431,8 +2431,10 @@ bool BoUpSLP::isFullyVectorizableTinyTree() {
     return true;
 
   // Gathering cost would be too much for tiny trees.
+/*
   if (VectorizableTree[0].NeedToGather || VectorizableTree[1].NeedToGather)
     return false;
+*/

It seems that the buildvector instructions are not counted in the VectorizableTree, while their cost is properly included as a parameter to tryToVectorizeList.

As discussed, this is most likely something for a proper vectorizer.
Removing from review queue.

This revision now requires changes to proceed. Aug 29 2018, 2:01 AM

Some guidance on how to fix this in the SLPVectorizer would be helpful. Or if it's small enough that someone else can fix it, that's fine with me too.

joey abandoned this revision. Nov 22 2018, 3:52 AM

Filed a bug report, so we don't forget this: https://bugs.llvm.org/show_bug.cgi?id=39768

Revision Contents

Path                                                     Size
lib/Transforms/InstCombine/InstCombineVectorOps.cpp      29 lines
test/Transforms/InstCombine/insert-extract-shuffle.ll    48 lines

Diff 161019

lib/Transforms/InstCombine/InstCombineVectorOps.cpp

[... 445 lines not shown ...]
 ///
 /// Note: we intentionally don't try to fold earlier shuffles since they have
 /// often been chosen carefully to be efficiently implementable on the target.
 using ShuffleOps = std::pair<Value *, Value *>;

 static ShuffleOps collectShuffleElements(Value *V,
                                          SmallVectorImpl<Constant *> &Mask,
                                          Value *PermittedRHS,
+                                         Optional<Instruction::CastOps> ExtOpc,
                                          InstCombiner &IC) {
   assert(V->getType()->isVectorTy() && "Invalid shuffle!");
   unsigned NumElts = V->getType()->getVectorNumElements();

   if (isa<UndefValue>(V)) {
     Mask.assign(NumElts, UndefValue::get(Type::getInt32Ty(V->getContext())));
     return std::make_pair(
         PermittedRHS ? UndefValue::get(PermittedRHS->getType()) : V, nullptr);
   }

   if (isa<ConstantAggregateZero>(V)) {
     Mask.assign(NumElts, ConstantInt::get(Type::getInt32Ty(V->getContext()),0));
     return std::make_pair(V, nullptr);
   }

   if (InsertElementInst *IEI = dyn_cast<InsertElementInst>(V)) {
     // If this is an insert of an extract from some other vector, include it.
     Value *VecOp = IEI->getOperand(0);
     Value *ScalarOp = IEI->getOperand(1);
     Value *IdxOp = IEI->getOperand(2);

+    if(ExtOpc)
+      if (CastInst *ZI = dyn_cast<CastInst>(ScalarOp))
+        if (ZI->getOpcode() == *ExtOpc)
+          ScalarOp = ZI->getOperand(0);
+
     if (ExtractElementInst *EI = dyn_cast<ExtractElementInst>(ScalarOp)) {
       if (isa<ConstantInt>(EI->getOperand(1)) && isa<ConstantInt>(IdxOp)) {
         unsigned ExtractedIdx =
           cast<ConstantInt>(EI->getOperand(1))->getZExtValue();
         unsigned InsertedIdx = cast<ConstantInt>(IdxOp)->getZExtValue();

         // Either the extracted from or inserted into vector must be RHSVec,
         // otherwise we'd end up with a shuffle of three inputs.
         if (EI->getOperand(0) == PermittedRHS || PermittedRHS == nullptr) {
           Value *RHS = EI->getOperand(0);
-          ShuffleOps LR = collectShuffleElements(VecOp, Mask, RHS, IC);
+          ShuffleOps LR = collectShuffleElements(VecOp, Mask, RHS, ExtOpc, IC);
           assert(LR.second == nullptr || LR.second == RHS);

           if (LR.first->getType() != RHS->getType()) {
             // Although we are giving up for now, see if we can create extracts
             // that match the inserts for another round of combining.
             replaceExtractElements(IEI, EI, IC);

             // We tried our best, but we can't find anything compatible with RHS
[... 292 lines not shown; context: Instruction *InstCombiner::visitInsertElementInst(InsertElementInst &IE) { ...]
   if (auto *V = SimplifyInsertElementInst(
           VecOp, ScalarOp, IdxOp, SQ.getWithInstruction(&IE)))
     return replaceInstUsesWith(IE, V);

   // Inserting an undef or into an undefined place, remove this.
   if (isa<UndefValue>(ScalarOp) || isa<UndefValue>(IdxOp))
     replaceInstUsesWith(IE, VecOp);

+  Optional<Instruction::CastOps> ExtOpc = None;
+  if (CastInst *ZI = dyn_cast<CastInst>(ScalarOp)) {
+    if (ZI->getOpcode() == Instruction::ZExt ||
+        ZI->getOpcode() == Instruction::SExt) {
+      ScalarOp = ZI->getOperand(0);
+      ExtOpc = ZI->getOpcode();
+    }
+  }
+
   // If the inserted element was extracted from some other vector, and if the
   // indexes are constant, try to turn this into a shufflevector operation.
   if (ExtractElementInst *EI = dyn_cast<ExtractElementInst>(ScalarOp)) {
     if (isa<ConstantInt>(EI->getOperand(1)) && isa<ConstantInt>(IdxOp)) {
       unsigned NumInsertVectorElts = IE.getType()->getNumElements();
       unsigned NumExtractVectorElts =
           EI->getOperand(0)->getType()->getVectorNumElements();
       unsigned ExtractedIdx =
[... 10 lines not shown ...]
       // back into the same place, just use the input vector.
       if (EI->getOperand(0) == VecOp && ExtractedIdx == InsertedIdx)
         return replaceInstUsesWith(IE, VecOp);

       // If this insertelement isn't used by some other insertelement, turn it
       // (and any insertelements it points to), into one big shuffle.
       if (!IE.hasOneUse() || !isa<InsertElementInst>(IE.user_back())) {
         SmallVector<Constant*, 16> Mask;
-        ShuffleOps LR = collectShuffleElements(&IE, Mask, nullptr, *this);
+        ShuffleOps LR = collectShuffleElements(&IE, Mask, nullptr, ExtOpc, *this);

         // The proposed shuffle may be trivial, in which case we shouldn't
         // perform the combine.
         if (LR.first != &IE && LR.second != &IE) {
           // We now have a shuffle of LHS, RHS, Mask.
           if (LR.second == nullptr)
             LR.second = UndefValue::get(LR.first->getType());
-          return new ShuffleVectorInst(LR.first, LR.second,
+          Instruction *SVI = new ShuffleVectorInst(LR.first, LR.second,
                                        ConstantVector::get(Mask));
+
+          if (ExtOpc) {
+            SVI->insertBefore(&IE);
+            SVI = CastInst::Create(*ExtOpc, SVI, IE.getType());
+          }
+
+          return SVI;
         }
       }
     }
   }

   unsigned VWidth = VecOp->getType()->getVectorNumElements();
   APInt UndefElts(VWidth, 0);
   APInt AllOnesEltMask(APInt::getAllOnesValue(VWidth));
[... 871 lines not shown ...]

test/Transforms/InstCombine/insert-extract-shuffle.ll

[... 277 lines not shown ...]
 entry:
   %a = extractelement <2 x i32> %x, i32 1
   %b = insertelement <4 x i32> zeroinitializer, i32 %a, i64 3
   %c = add i32 %y, 3
   %d = extractelement <2 x i32> %x, i32 %c
   %e = icmp eq i32 %d, 0
   %ret = select i1 %e, <4 x i32> %b, <4 x i32> zeroinitializer
   ret <4 x i32> %ret
 }

+define <4 x i32> @test3(<8 x i16> %in) {
+; CHECK-LABEL: @test3(
+; CHECK-NEXT: [[VEC_3:%.*]] = shufflevector <8 x i16> %in, <8 x i16> undef, <4 x i32> <i32 3, i32 1, i32 0, i32 3>
+; CHECK-NEXT: [[ZEXT:%.*]] = zext <4 x i16> [[VEC_3]] to <4 x i32>
+; CHECK-NEXT: ret <4 x i32> [[ZEXT]]
+;
+  %elt0e = extractelement <8 x i16> %in, i32 3
+  %elt1e = extractelement <8 x i16> %in, i32 1
+  %elt2e = extractelement <8 x i16> %in, i32 0
+  %elt3e = extractelement <8 x i16> %in, i32 3
+
+  %elt0 = zext i16 %elt0e to i32
+  %elt1 = zext i16 %elt1e to i32
+  %elt2 = zext i16 %elt2e to i32
+  %elt3 = zext i16 %elt3e to i32
+
+  %vec.0 = insertelement <4 x i32> undef, i32 %elt0, i32 0
+  %vec.1 = insertelement <4 x i32> %vec.0, i32 %elt1, i32 1
+  %vec.2 = insertelement <4 x i32> %vec.1, i32 %elt2, i32 2
+  %vec.3 = insertelement <4 x i32> %vec.2, i32 %elt3, i32 3
+
+  ret <4 x i32> %vec.3
+}
+
+define <4 x i32> @test4(<8 x i16> %in) {
+; CHECK-LABEL: @test4(
+; CHECK-NEXT: [[VEC_3:%.*]] = shufflevector <8 x i16> %in, <8 x i16> undef, <4 x i32> <i32 3, i32 1, i32 0, i32 3>
+; CHECK-NEXT: [[ZEXT:%.*]] = sext <4 x i16> [[VEC_3]] to <4 x i32>
+; CHECK-NEXT: ret <4 x i32> [[ZEXT]]
+;
+  %elt0e = extractelement <8 x i16> %in, i32 3
+  %elt1e = extractelement <8 x i16> %in, i32 1
+  %elt2e = extractelement <8 x i16> %in, i32 0
+  %elt3e = extractelement <8 x i16> %in, i32 3
+
+  %elt0 = sext i16 %elt0e to i32
+  %elt1 = sext i16 %elt1e to i32
+  %elt2 = sext i16 %elt2e to i32
+  %elt3 = sext i16 %elt3e to i32
+
+  %vec.0 = insertelement <4 x i32> undef, i32 %elt0, i32 0
+  %vec.1 = insertelement <4 x i32> %vec.0, i32 %elt1, i32 1
+  %vec.2 = insertelement <4 x i32> %vec.1, i32 %elt2, i32 2
+  %vec.3 = insertelement <4 x i32> %vec.2, i32 %elt3, i32 3
+
+  ret <4 x i32> %vec.3
+}
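
For readers who want the control flow of the patch without the LLVM types, here is a simplified standalone model of the mask collection with cast look-through. It is a sketch under stated assumptions, not the patch itself: `CastOp`, `InsertLink`, `ShufflePlan` and `planShuffle` are hypothetical names, all extracts are assumed to read from one source vector with constant indices, and a chain whose casts do not all match the outermost insert is simply rejected, whereas the real code falls back to other paths.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

enum class CastOp { ZExt, SExt };

// One insertelement in the chain. The inserted scalar is modeled as
// cast(extractelement src, extractLane), where the cast may be absent.
struct InsertLink {
  std::size_t insertLane;
  std::size_t extractLane;
  std::optional<CastOp> cast;
};

// Result of a successful collection: the shuffle mask (-1 models an undef
// mask element) and the single cast to apply to the shuffled vector, if any.
struct ShufflePlan {
  std::vector<int> mask;
  std::optional<CastOp> ext;
};

// `chain` is ordered from the innermost insert (into undef) to the outermost.
std::optional<ShufflePlan> planShuffle(const std::vector<InsertLink> &chain,
                                       std::size_t numLanes) {
  if (chain.empty())
    return std::nullopt;

  ShufflePlan plan;
  plan.mask.assign(numLanes, -1);
  // As in the patch, the cast on the outermost insert decides the opcode.
  plan.ext = chain.back().cast;

  for (const InsertLink &link : chain) {
    // Look through the cast only when it matches the chosen opcode.
    if (link.cast != plan.ext)
      return std::nullopt;
    if (link.insertLane >= numLanes)
      return std::nullopt;
    // Later inserts overwrite earlier ones that target the same lane.
    plan.mask[link.insertLane] = static_cast<int>(link.extractLane);
  }
  return plan;
}
```

For the chain in `@test3` this yields the mask `<3, 1, 0, 3>` together with a `zext`, which corresponds to the `shufflevector` plus `zext` pair in the CHECK lines.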