This is part of the missing IR-level folding noted in D52912.
This should be ok as a canonicalization because the new shuffle mask can't be any more complicated than the existing shuffle mask. If there's some target where the shorter vector shuffle is not legal, it should just end up expanding to something like the pair of shuffles that we're starting with here.
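To illustrate, here's a minimal sketch of the fold (hypothetical IR, not taken from the patch or its tests; function names are invented): a wide shuffle followed by an identity-with-extract shuffle becomes a single narrower shuffle whose mask is a truncation of the original.

```
; Before: shuffle two <4 x i8> vectors, then extract the low 2 elements.
define <2 x i8> @narrow_shuffle(<4 x i8> %x, <4 x i8> %y) {
  %wide = shufflevector <4 x i8> %x, <4 x i8> %y, <4 x i32> <i32 7, i32 0, i32 6, i32 1>
  %narrow = shufflevector <4 x i8> %wide, <4 x i8> undef, <2 x i32> <i32 0, i32 1>
  ret <2 x i8> %narrow
}

; After: one shuffle with the first two mask elements of the wide shuffle;
; the mask is no more complicated than what we started with.
define <2 x i8> @narrow_shuffle_folded(<4 x i8> %x, <4 x i8> %y) {
  %narrow = shufflevector <4 x i8> %x, <4 x i8> %y, <2 x i32> <i32 7, i32 0>
  ret <2 x i8> %narrow
}
```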
lib/Transforms/InstCombine/InstCombineVectorOps.cpp | ||
---|---|---|
1496 | (On Diff #168860) | I'm being paranoid here. What is the expected outcome of this transform? |
lib/Transforms/InstCombine/InstCombineVectorOps.cpp | ||
---|---|---|
1496 | (On Diff #168860) | The new mask is the same size as the extracting shuffle (Shuf). The extracting shuffle's size must be smaller than the first shuffle's size; this is guaranteed by 'isIdentityWithExtract', but I'll add an assert to make that clearer. |
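To make that size relationship concrete, here is a hypothetical IR fragment (names invented for illustration): the extracting shuffle produces 4 elements from an 8-element shuffle, so the replacement mask has 4 elements, strictly fewer than the first shuffle's 8.

```
define <4 x i8> @extract_low_half(<8 x i8> %x, <8 x i8> %y) {
  ; The first shuffle produces 8 elements.
  %wide = shufflevector <8 x i8> %x, <8 x i8> %y, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>
  ; The extracting shuffle is an identity with extract: it keeps the first
  ; 4 elements, so the folded shuffle's mask would be <0, 8, 1, 9>.
  %half = shufflevector <8 x i8> %wide, <8 x i8> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x i8> %half
}
```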
Patch updated:
Added an assert to (hopefully) make this transform clearer.
Side note: while trying to edit this patch, I realized that I had accidentally already committed it to trunk with rL344082. I reverted that in rL344178.
Sorry about that... on the plus side, there were no bot complaints or bug reports while it was in trunk. :)
lib/Transforms/InstCombine/InstCombineVectorOps.cpp | ||
---|---|---|
1493 | (On Diff #169088) | I think this comment is too concise. |
lib/Transforms/InstCombine/InstCombineVectorOps.cpp | ||
---|---|---|
1493 | (On Diff #169088) | Good observations! I'll clarify these points in a revised code comment. |
test/Transforms/InstCombine/vec_shuffle.ll | ||
---|---|---|
173 | (On Diff #169088) | Yes - forgot to update the comment here and below. |
I found a case where this combine causes a codegen regression on AArch64. In the example below, %s0 puts data into a 128-bit vector, and %s1 and %s2 extract the lower and upper halves. Without folding %s0 and %s1, we can generate a single AArch64 tbl instruction for %s0 and a mov instruction for %s2. With the fold in this patch, we generate 3 additional instructions: an additional tbl for %s2 and 2 instructions for loading its mask.
So on AArch64, the combine produces worse code in the case where we can generate a single tbl instruction for the top-level shuffle and extracting the lower and upper halves is cheap. Do you have an idea how to best address the issue?
```
define <8 x i16> @test(<16 x i8> %s) {
entry:
  %0 = sub <16 x i8> <i8 undef, i8 undef, i8 undef, i8 -1, i8 undef, i8 undef, i8 undef, i8 -1, i8 undef, i8 undef, i8 undef, i8 -1, i8 undef, i8 undef, i8 undef, i8 -1>, %s
  %s0 = shufflevector <16 x i8> %0, <16 x i8> undef, <16 x i32> <i32 3, i32 3, i32 3, i32 3, i32 7, i32 7, i32 7, i32 7, i32 11, i32 11, i32 11, i32 11, i32 15, i32 15, i32 15, i32 15>
  %s1 = shufflevector <16 x i8> %s0, <16 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %s2 = shufflevector <16 x i8> %s0, <16 x i8> undef, <8 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %a = call <8 x i16> @fn(<8 x i8> %s1, <8 x i8> %s2)
  ret <8 x i16> %a
}

declare <8 x i16> @fn(<8 x i8>, <8 x i8>)
```
That case is specifically interesting because when it's split, each half is lowered to a different shuffle, and both shuffles are expensive.
Obviously you could try to find that specific pattern in SelectionDAG on AArch64, but I'm not sure it makes sense to expect the backend to try; instcombine probably shouldn't be creating difficult-to-solve optimization puzzles.
But can you be sure there won't ever be such a shuffle via some other means?
I don't know that much about AArch64, but this really sounds like something to be dealt with in the backend.
The goal of this patch was to make things easier for the backend, but of course I didn't anticipate this case. How about limiting the transform like this:
```
Index: InstCombine/InstCombineVectorOps.cpp
===================================================================
--- InstCombine/InstCombineVectorOps.cpp	(revision 353169)
+++ InstCombine/InstCombineVectorOps.cpp	(working copy)
@@ -1498,6 +1498,11 @@
   if (!match(Op0, m_ShuffleVector(m_Value(X), m_Value(Y), m_Constant(Mask))))
     return nullptr;
 
+  // Be extra conservative with shuffle transforms. If we can't kill the 1st
+  // shuffle, then combining may result in worse codegen.
+  if (!Op0->hasOneUse())
+    return nullptr;
+
   // We are extracting a subvector from a shuffle. Remove excess elements from
   // the 1st shuffle mask to eliminate the extract.
   //
```
That would solve the issue, thanks. Without knowing more about the various backends, I think it makes sense to keep this transform quite conservative. If it is helpful for some backends, it might make sense to do it at a later stage in the pipeline. Merging shuffles back together in the backend might be possible in some cases, but it would be better if the backend does not have to fight instcombine here.
Yes, that's our default argument for IR canonicalization: if the source code was already in this form to begin with, then there must be some missing optimization already out there.
But we have a special restriction for shufflevector - we assume that masks are only ever created if they are easy for the target to digest. And the corollary is that we don't create all-new shuffle masks in target-independent passes. That means we're going to miss some optimizations in IR that would've fallen out from intermediate transforms, but I think we have to live with that. As discussed in the earlier comments, this patch is right at the edge of the mask restriction.