This is an archive of the discontinued LLVM Phabricator instance.

Differential D111901

[VectorCombine] fold shuffle-of-binops with common operand
ClosedPublic

Authored by spatel on Oct 15 2021, 10:34 AM.


Details

Reviewers

fhahn
lebedev.ri
RKSimon

Commits

rG66d22b4da4af: [VectorCombine] fold shuffle-of-binops with common operand

Summary

shuf (bo X, Y), (bo X, W) --> bo (shuf X), (shuf Y, W)
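
To make the identity concrete, here is a small standalone model of it, written for illustration only. It uses plain arrays rather than LLVM types, fixes the vector width at 4, and ignores poison/undef mask lanes; the names `Vec4`, `Mask4`, `shuffle2`, `before` and `after` are hypothetical and do not appear in the patch. `shl` is used as the binop because it is not commutative, so the common operand has to sit in the same position in both binops.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Toy model of a 4-lane integer vector; not an LLVM type.
using Vec4 = std::array<std::uint32_t, 4>;
using Mask4 = std::array<int, 4>;

// Models shufflevector: indices 0..3 pick from A, 4..7 pick from B.
// Poison/undef mask elements are not modeled here.
Vec4 shuffle2(const Vec4 &A, const Vec4 &B, const Mask4 &M) {
  Vec4 R{};
  for (std::size_t I = 0; I < 4; ++I) {
    assert(M[I] >= 0 && M[I] < 8 && "mask index out of range");
    R[I] = M[I] < 4 ? A[M[I]] : B[M[I] - 4];
  }
  return R;
}

// Lane-wise shift left; the shift amount is masked to keep the model defined.
Vec4 shl(const Vec4 &A, const Vec4 &B) {
  Vec4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = A[I] << (B[I] & 31u);
  return R;
}

// Original form: shuf (shl X, Y), (shl X, W)
Vec4 before(const Vec4 &X, const Vec4 &Y, const Vec4 &W, const Mask4 &M) {
  return shuffle2(shl(X, Y), shl(X, W), M);
}

// Folded form: shl (shuf X), (shuf Y, W)
Vec4 after(const Vec4 &X, const Vec4 &Y, const Vec4 &W, const Mask4 &M) {
  // Shuffling X with itself under the two-source mask selects the same lanes
  // as a single-source shuffle of X with the corresponding unary mask.
  return shl(shuffle2(X, X, M), shuffle2(Y, W, M));
}
```

For any in-range mask, `before` and `after` return the same lanes in this model, because each result lane reads the same lane of X either way and only the other operand depends on which binop the mask selects.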

This is motivated by an example in D111800 (although that patch would avoid the problem for that particular example).

The pattern is shown in reduced form with:
https://llvm.org/PR52178
https://alive2.llvm.org/ce/z/d8zB4D

There is no difference on the PhaseOrdering test from D111800 because the aarch64 cost model says that the shuffle cost is 3 while the fadd cost is 2. That seems wrong for a simple v4f32 shuffle, but that should be another patch if correct.
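
The cost comparison behind that remark can be sketched as below. This is a simplified stand-in, not the patch code: the patch compares `InstructionCost` values returned by `TargetTransformInfo`, whereas the hypothetical `shouldFold` helper here takes plain integers.

```cpp
#include <cassert>

// Hypothetical stand-in for the cost gate; plain ints replace InstructionCost.
bool shouldFold(int ShufCost, int BinopCost) {
  assert(ShufCost >= 0 && BinopCost >= 0 && "costs are modeled as non-negative");
  // Mirrors "if (ShufCost > BinopCost) return false;" so ties are accepted.
  return ShufCost <= BinopCost;
}
```

With the aarch64 numbers quoted above (shuffle cost 3, fadd cost 2), `shouldFold(3, 2)` is false, which matches the unchanged PhaseOrdering test. A tie such as `shouldFold(2, 2)` is true.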

Diff Detail

Event Timeline

spatel created this revision. Oct 15 2021, 10:34 AM

Herald added subscribers: hiraditya, kristof.beyls, mcrosier. Oct 15 2021, 10:34 AM

spatel requested review of this revision. Oct 15 2021, 10:34 AM

Herald added a project: Restricted Project. Oct 15 2021, 10:34 AM

spatel edited the summary of this revision. Oct 15 2021, 10:35 AM

spatel added a parent revision: D111891: [Analysis] add utility function for unary shuffle mask creation.

Harbormaster completed remote builds in B129095: Diff 380028. Oct 15 2021, 10:42 AM

spatel mentioned this in D111800: [VectorCombine] Add option to only run scalarization transforms. Oct 15 2021, 10:56 AM

spatel mentioned this in rG2a3cc4d46184: [Analysis] add utility function for unary shuffle mask creation. Oct 18 2021, 6:01 AM

RKSimon added inline comments. Oct 18 2021, 8:00 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
1084	Are we OK with accepting the fold if ShufCost == BinopCost?

spatel marked an inline comment as done. Oct 18 2021, 8:44 AM

spatel added inline comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp

1084

This follows the lead of other vector combines and instcombine in general: if we can rearrange code without incurring cost, then it might unlock further transforms, so we try it.

Direct motivation is seen in the example from D111800 - in that case, we get shuffle-of-shuffle as the 1st instructions in the function and that can be reduced by the backend (x86 at least).

In a minimal case where there's no further optimization, we end up with something that is probably neutral. For example, here's the 'and' v2i64 test on x86 and aarch64:

define <2 x i64> @and_and_shuf_v2i64_yy_swap(<2 x i64> %x, <2 x i64> %y, <2 x i64> %z) {
  %b0 = and <2 x i64> %x, %y
  %b1 = and <2 x i64> %y, %z
  %r = shufflevector <2 x i64> %b0, <2 x i64> %b1, <2 x i32> <i32 3, i32 0>
  ret <2 x i64> %r
}

define <2 x i64> @shuf_shuf_and_v2i64_yy_swap(<2 x i64> %x, <2 x i64> %y, <2 x i64> %z) {
  %a0 = shufflevector <2 x i64> %y, <2 x i64> poison, <2 x i32> <i32 1, i32 0>
  %a1 = shufflevector <2 x i64> %x, <2 x i64> %z, <2 x i32> <i32 3, i32 0>
  %r = and <2 x i64> %a0, %a1
  ret <2 x i64> %r
}

x86 before:
	andps	%xmm1, %xmm0
	andps	%xmm1, %xmm2
	shufps	$78, %xmm0, %xmm2               ## xmm2 = xmm2[2,3],xmm0[0,1]
x86 after:
	pshufd	$78, %xmm1, %xmm1               ## xmm1 = xmm1[2,3,0,1]
	shufps	$78, %xmm0, %xmm2               ## xmm2 = xmm2[2,3],xmm0[0,1]
	pand	%xmm2, %xmm1

aarch64 before:
	and	v2.16b, v1.16b, v2.16b
	and	v0.16b, v0.16b, v1.16b
	ext	v0.16b, v2.16b, v0.16b, #8
aarch64 after:
	ext	v1.16b, v1.16b, v1.16b, #8
	ext	v0.16b, v2.16b, v0.16b, #8
	and	v0.16b, v1.16b, v0.16b
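
In the rewritten IR above, the two-source mask <3, 0> becomes the single-source mask <1, 0> for the common operand %y. A rough standalone sketch of that remapping follows. It is an approximation for illustration, not the LLVM utility from D111891: the name `makeUnaryMask` is hypothetical, and -1 is assumed to mark an undefined lane.

```cpp
#include <cassert>
#include <vector>

// Sketch of deriving a single-source mask from a two-source mask.
// Assumption: -1 marks an undefined/poison lane and is passed through as-is.
std::vector<int> makeUnaryMask(const std::vector<int> &Mask, int NumElts) {
  assert(NumElts > 0 && "need at least one element per source vector");
  std::vector<int> Unary;
  Unary.reserve(Mask.size());
  for (int Elt : Mask) {
    assert(Elt >= -1 && Elt < 2 * NumElts && "mask index out of range");
    // Indices that selected from the second source are remapped onto the
    // same lane of the first source; everything else is unchanged.
    Unary.push_back(Elt >= NumElts ? Elt - NumElts : Elt);
  }
  return Unary;
}
```

Under these assumptions, `makeUnaryMask({3, 0}, 2)` gives {1, 0}, and `makeUnaryMask({1, 3, 5, 7}, 4)` gives {1, 3, 1, 3}, which matches the single-source masks in the updated CHECK lines of the tests.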

Yes that makes sense - LGTM.

This revision is now accepted and ready to land. Oct 18 2021, 8:51 AM

This revision was landed with ongoing or failed builds. Oct 21 2021, 9:38 AM

Closed by commit rG66d22b4da4af: [VectorCombine] fold shuffle-of-binops with common operand (authored by spatel).

This revision was automatically updated to reflect the committed changes.

spatel marked an inline comment as done.

spatel added a commit: rG66d22b4da4af: [VectorCombine] fold shuffle-of-binops with common operand.

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/VectorCombine.cpp	56 lines
llvm/test/Transforms/VectorCombine/X86/shuffle.ll	54 lines

Diff 380028

llvm/lib/Transforms/Vectorize/VectorCombine.cpp

@@ ... @@ private:
   void foldExtExtBinop(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
                        Instruction &I);
   bool foldExtractExtract(Instruction &I);
   bool foldBitcastShuf(Instruction &I);
   bool scalarizeBinopOrCmp(Instruction &I);
   bool foldExtractedCmps(Instruction &I);
   bool foldSingleElementStore(Instruction &I);
   bool scalarizeLoadExtract(Instruction &I);
+  bool foldShuffleOfBinops(Instruction &I);

   void replaceValue(Value &Old, Value &New) {
     Old.replaceAllUsesWith(&New);
     New.takeName(&Old);
     if (auto *NewI = dyn_cast<Instruction>(&New)) {
       Worklist.pushUsersToWorkList(*NewI);
       Worklist.pushValue(NewI);
     }
@@ ... @@ for (User *U : LI->users()) {
     NewLoad->setAlignment(ScalarOpAlignment);

     replaceValue(EI, NewLoad);
   }

   return true;
 }

+/// Try to convert "shuffle (binop), (binop)" with a shared binop operand into
+/// "binop (shuffle), (shuffle)".
+bool VectorCombine::foldShuffleOfBinops(Instruction &I) {
+  auto *VecTy = dyn_cast<FixedVectorType>(I.getType());
+  if (!VecTy)
+    return false;
+
+  BinaryOperator *B0, *B1;
+  ArrayRef<int> Mask;
+  if (!match(&I, m_Shuffle(m_OneUse(m_BinOp(B0)), m_OneUse(m_BinOp(B1)),
+                           m_Mask(Mask))) ||
+      B0->getOpcode() != B1->getOpcode() || B0->getType() != VecTy)
+    return false;
+
+  // Try to replace a binop with a shuffle if the shuffle is not costly.
+  // The new shuffle will choose from a single, common operand, so it may be
+  // cheaper than the existing two-operand shuffle.
+  SmallVector<int> UnaryMask = createUnaryMask(Mask, Mask.size());
+  Instruction::BinaryOps Opcode = B0->getOpcode();
+  InstructionCost BinopCost = TTI.getArithmeticInstrCost(Opcode, VecTy);
+  InstructionCost ShufCost = TTI.getShuffleCost(
+      TargetTransformInfo::SK_PermuteSingleSrc, VecTy, UnaryMask);
+  if (ShufCost > BinopCost)
+    return false;
+
+  // If we have something like "add X, Y" and "add Z, X", swap ops to match.
+  Value *X = B0->getOperand(0), *Y = B0->getOperand(1);
+  Value *Z = B1->getOperand(0), *W = B1->getOperand(1);
+  if (BinaryOperator::isCommutative(Opcode) && X != Z && Y != W)
+    std::swap(X, Y);
+
+  Value *Shuf0, *Shuf1;
+  if (X == Z) {
+    // shuf (bo X, Y), (bo X, W) --> bo (shuf X), (shuf Y, W)
+    Shuf0 = Builder.CreateShuffleVector(X, UnaryMask);
+    Shuf1 = Builder.CreateShuffleVector(Y, W, Mask);
+  } else if (Y == W) {
+    // shuf (bo X, Y), (bo Z, Y) --> bo (shuf X, Z), (shuf Y)
+    Shuf0 = Builder.CreateShuffleVector(X, Z, Mask);
+    Shuf1 = Builder.CreateShuffleVector(Y, UnaryMask);
+  } else {
+    return false;
+  }
+
+  Value *NewBO = Builder.CreateBinOp(Opcode, Shuf0, Shuf1);
+  // Intersect flags from the old binops.
+  if (auto *NewInst = dyn_cast<Instruction>(NewBO)) {
+    NewInst->copyIRFlags(B0);
+    NewInst->andIRFlags(B1);
+  }
+  replaceValue(I, *NewBO);
+  return true;
+}
+
 /// This is the entry point for all transforms. Pass manager differences are
 /// handled in the callers of this function.
 bool VectorCombine::run() {
   if (DisableVectorCombine)
     return false;

   // Don't attempt vectorization if the target does not support vectors.
   if (!TTI.getNumberOfRegisters(TTI.getRegisterClassForType(/*Vector*/ true)))
     return false;

   bool MadeChange = false;
   auto FoldInst = [this, &MadeChange](Instruction &I) {
     Builder.SetInsertPoint(&I);
     MadeChange |= vectorizeLoadInsert(I);
     MadeChange |= foldExtractExtract(I);
     MadeChange |= foldBitcastShuf(I);
     MadeChange |= scalarizeBinopOrCmp(I);
     MadeChange |= foldExtractedCmps(I);
     MadeChange |= scalarizeLoadExtract(I);
     MadeChange |= foldSingleElementStore(I);
+    MadeChange |= foldShuffleOfBinops(I);
   };
   for (BasicBlock &BB : F) {
     // Ignore unreachable basic blocks.
     if (!DT.isReachableFromEntry(&BB))
       continue;
     // Use early increment range so that we can erase instructions in loop.
     for (Instruction &I : make_early_inc_range(BB)) {
       if (I.isDebugOrPseudoInst())

llvm/test/Transforms/VectorCombine/X86/shuffle.ll

@@ ... @@ ;
   %bc1 = bitcast <4 x i32> %permil to <8 x i16>
   %add = shl <8 x i16> %bc1, <i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1, i16 1>
   %bc2 = bitcast <8 x i16> %add to <4 x i32>
   %permil1 = shufflevector <4 x i32> %bc2, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
   %bc3 = bitcast <4 x i32> %permil1 to <2 x i64>
   ret <2 x i64> %bc3
 }

+; Shuffle is much cheaper than fdiv. FMF are intersected.
+
 define <4 x float> @shuf_fdiv_v4f32_yy(<4 x float> %x, <4 x float> %y, <4 x float> %z) {
 ; CHECK-LABEL: @shuf_fdiv_v4f32_yy(
-; CHECK-NEXT: [[B0:%.*]] = fdiv fast <4 x float> [[X:%.*]], [[Y:%.*]]
-; CHECK-NEXT: [[B1:%.*]] = fdiv arcp <4 x float> [[Z:%.*]], [[Y]]
-; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[B0]], <4 x float> [[B1]], <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x float> [[X:%.*]], <4 x float> [[Z:%.*]], <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x float> [[Y:%.*]], <4 x float> poison, <4 x i32> <i32 1, i32 3, i32 1, i32 3>
+; CHECK-NEXT: [[R:%.*]] = fdiv arcp <4 x float> [[TMP1]], [[TMP2]]
 ; CHECK-NEXT: ret <4 x float> [[R]]
 ;
   %b0 = fdiv fast <4 x float> %x, %y
   %b1 = fdiv arcp <4 x float> %z, %y
   %r = shufflevector <4 x float> %b0, <4 x float> %b1, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
   ret <4 x float> %r
 }

+; Common operand is op0 of the binops.
+
 define <4 x i32> @shuf_add_v4i32_xx(<4 x i32> %x, <4 x i32> %y, <4 x i32> %z) {
 ; CHECK-LABEL: @shuf_add_v4i32_xx(
-; CHECK-NEXT: [[B0:%.*]] = add <4 x i32> [[X:%.*]], [[Y:%.*]]
-; CHECK-NEXT: [[B1:%.*]] = add <4 x i32> [[X]], [[Z:%.*]]
-; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x i32> [[B0]], <4 x i32> [[B1]], <4 x i32> <i32 undef, i32 undef, i32 6, i32 0>
+; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[X:%.*]], <4 x i32> poison, <4 x i32> <i32 undef, i32 undef, i32 2, i32 0>
+; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[Y:%.*]], <4 x i32> [[Z:%.*]], <4 x i32> <i32 undef, i32 undef, i32 6, i32 0>
+; CHECK-NEXT: [[R:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]
 ; CHECK-NEXT: ret <4 x i32> [[R]]
 ;
   %b0 = add <4 x i32> %x, %y
   %b1 = add <4 x i32> %x, %z
   %r = shufflevector <4 x i32> %b0, <4 x i32> %b1, <4 x i32> <i32 poison, i32 poison, i32 6, i32 0>
   ret <4 x i32> %r
 }

+; For commutative instructions, common operand may be swapped.
+
 define <4 x float> @shuf_fmul_v4f32_xx_swap(<4 x float> %x, <4 x float> %y, <4 x float> %z) {
 ; CHECK-LABEL: @shuf_fmul_v4f32_xx_swap(
-; CHECK-NEXT: [[B0:%.*]] = fmul <4 x float> [[X:%.*]], [[Y:%.*]]
-; CHECK-NEXT: [[B1:%.*]] = fmul <4 x float> [[Z:%.*]], [[X]]
-; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[B0]], <4 x float> [[B1]], <4 x i32> <i32 0, i32 3, i32 4, i32 7>
+; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x float> [[Y:%.*]], <4 x float> [[Z:%.*]], <4 x i32> <i32 0, i32 3, i32 4, i32 7>
+; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x float> [[X:%.*]], <4 x float> poison, <4 x i32> <i32 0, i32 3, i32 0, i32 3>
+; CHECK-NEXT: [[R:%.*]] = fmul <4 x float> [[TMP1]], [[TMP2]]
 ; CHECK-NEXT: ret <4 x float> [[R]]
 ;
   %b0 = fmul <4 x float> %x, %y
   %b1 = fmul <4 x float> %z, %x
   %r = shufflevector <4 x float> %b0, <4 x float> %b1, <4 x i32> <i32 0, i32 3, i32 4, i32 7>
   ret <4 x float> %r
 }

+; For commutative instructions, common operand may be swapped.
+
 define <2 x i64> @shuf_and_v2i64_yy_swap(<2 x i64> %x, <2 x i64> %y, <2 x i64> %z) {
 ; CHECK-LABEL: @shuf_and_v2i64_yy_swap(
-; CHECK-NEXT: [[B0:%.*]] = and <2 x i64> [[X:%.*]], [[Y:%.*]]
-; CHECK-NEXT: [[B1:%.*]] = and <2 x i64> [[Y]], [[Z:%.*]]
-; CHECK-NEXT: [[R:%.*]] = shufflevector <2 x i64> [[B0]], <2 x i64> [[B1]], <2 x i32> <i32 3, i32 0>
+; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i64> [[Y:%.*]], <2 x i64> poison, <2 x i32> <i32 1, i32 0>
+; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i64> [[X:%.*]], <2 x i64> [[Z:%.*]], <2 x i32> <i32 3, i32 0>
+; CHECK-NEXT: [[R:%.*]] = and <2 x i64> [[TMP1]], [[TMP2]]
 ; CHECK-NEXT: ret <2 x i64> [[R]]
 ;
   %b0 = and <2 x i64> %x, %y
   %b1 = and <2 x i64> %y, %z
   %r = shufflevector <2 x i64> %b0, <2 x i64> %b1, <2 x i32> <i32 3, i32 0>
   ret <2 x i64> %r
 }

+; non-commutative binop, but common op0
+
 define <4 x i32> @shuf_shl_v4i32_xx(<4 x i32> %x, <4 x i32> %y, <4 x i32> %z) {
 ; CHECK-LABEL: @shuf_shl_v4i32_xx(
-; CHECK-NEXT: [[B0:%.*]] = shl <4 x i32> [[X:%.*]], [[Y:%.*]]
-; CHECK-NEXT: [[B1:%.*]] = shl <4 x i32> [[X]], [[Z:%.*]]
-; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x i32> [[B0]], <4 x i32> [[B1]], <4 x i32> <i32 3, i32 1, i32 1, i32 6>
+; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[X:%.*]], <4 x i32> poison, <4 x i32> <i32 3, i32 1, i32 1, i32 2>
+; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[Y:%.*]], <4 x i32> [[Z:%.*]], <4 x i32> <i32 3, i32 1, i32 1, i32 6>
+; CHECK-NEXT: [[R:%.*]] = shl <4 x i32> [[TMP1]], [[TMP2]]
 ; CHECK-NEXT: ret <4 x i32> [[R]]
 ;
   %b0 = shl <4 x i32> %x, %y
   %b1 = shl <4 x i32> %x, %z
   %r = shufflevector <4 x i32> %b0, <4 x i32> %b1, <4 x i32> <i32 3, i32 1, i32 1, i32 6>
   ret <4 x i32> %r
 }

+; negative test - common operand, but not commutable
+
 define <4 x i32> @shuf_shl_v4i32_xx_swap(<4 x i32> %x, <4 x i32> %y, <4 x i32> %z) {
 ; CHECK-LABEL: @shuf_shl_v4i32_xx_swap(
 ; CHECK-NEXT: [[B0:%.*]] = shl <4 x i32> [[X:%.*]], [[Y:%.*]]
 ; CHECK-NEXT: [[B1:%.*]] = shl <4 x i32> [[Z:%.*]], [[X]]
 ; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x i32> [[B0]], <4 x i32> [[B1]], <4 x i32> <i32 3, i32 2, i32 2, i32 5>
 ; CHECK-NEXT: ret <4 x i32> [[R]]
 ;
   %b0 = shl <4 x i32> %x, %y
   %b1 = shl <4 x i32> %z, %x
   %r = shufflevector <4 x i32> %b0, <4 x i32> %b1, <4 x i32> <i32 3, i32 2, i32 2, i32 5>
   ret <4 x i32> %r
 }

+; negative test - mismatched opcodes
+
 define <2 x i64> @shuf_sub_add_v2i64_yy(<2 x i64> %x, <2 x i64> %y, <2 x i64> %z) {
 ; CHECK-LABEL: @shuf_sub_add_v2i64_yy(
 ; CHECK-NEXT: [[B0:%.*]] = sub <2 x i64> [[X:%.*]], [[Y:%.*]]
 ; CHECK-NEXT: [[B1:%.*]] = add <2 x i64> [[Z:%.*]], [[Y]]
 ; CHECK-NEXT: [[R:%.*]] = shufflevector <2 x i64> [[B0]], <2 x i64> [[B1]], <2 x i32> <i32 3, i32 0>
 ; CHECK-NEXT: ret <2 x i64> [[R]]
 ;
   %b0 = sub <2 x i64> %x, %y
   %b1 = add <2 x i64> %z, %y
   %r = shufflevector <2 x i64> %b0, <2 x i64> %b1, <2 x i32> <i32 3, i32 0>
   ret <2 x i64> %r
 }

+; negative test - type change via shuffle
+
 define <8 x float> @shuf_fmul_v4f32_xx_type(<4 x float> %x, <4 x float> %y, <4 x float> %z) {
 ; CHECK-LABEL: @shuf_fmul_v4f32_xx_type(
 ; CHECK-NEXT: [[B0:%.*]] = fmul <4 x float> [[X:%.*]], [[Y:%.*]]
 ; CHECK-NEXT: [[B1:%.*]] = fmul <4 x float> [[Z:%.*]], [[X]]
 ; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[B0]], <4 x float> [[B1]], <8 x i32> <i32 0, i32 3, i32 4, i32 7, i32 0, i32 1, i32 1, i32 6>
 ; CHECK-NEXT: ret <8 x float> [[R]]
 ;
   %b0 = fmul <4 x float> %x, %y
   %b1 = fmul <4 x float> %z, %x
   %r = shufflevector <4 x float> %b0, <4 x float> %b1, <8 x i32> <i32 0, i32 3, i32 4, i32 7, i32 0, i32 1, i32 1, i32 6>
   ret <8 x float> %r
 }

+; negative test - uses
+
 define <4 x i32> @shuf_lshr_v4i32_yy_use1(<4 x i32> %x, <4 x i32> %y, <4 x i32> %z) {
 ; CHECK-LABEL: @shuf_lshr_v4i32_yy_use1(
 ; CHECK-NEXT: [[B0:%.*]] = lshr <4 x i32> [[X:%.*]], [[Y:%.*]]
 ; CHECK-NEXT: call void @use(<4 x i32> [[B0]])
 ; CHECK-NEXT: [[B1:%.*]] = lshr <4 x i32> [[Z:%.*]], [[Y]]
 ; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x i32> [[B0]], <4 x i32> [[B1]], <4 x i32> <i32 0, i32 2, i32 4, i32 6>
 ; CHECK-NEXT: ret <4 x i32> [[R]]
 ;
   %b0 = lshr <4 x i32> %x, %y
   call void @use(<4 x i32> %b0)
   %b1 = lshr <4 x i32> %z, %y
   %r = shufflevector <4 x i32> %b0, <4 x i32> %b1, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
   ret <4 x i32> %r
 }

+; negative test - uses
+
 define <4 x i32> @shuf_mul_v4i32_yy_use2(<4 x i32> %x, <4 x i32> %y, <4 x i32> %z) {
 ; CHECK-LABEL: @shuf_mul_v4i32_yy_use2(
 ; CHECK-NEXT: [[B0:%.*]] = mul <4 x i32> [[X:%.*]], [[Y:%.*]]
 ; CHECK-NEXT: [[B1:%.*]] = mul <4 x i32> [[Z:%.*]], [[Y]]
 ; CHECK-NEXT: call void @use(<4 x i32> [[B1]])
 ; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x i32> [[B0]], <4 x i32> [[B1]], <4 x i32> <i32 1, i32 3, i32 5, i32 7>
 ; CHECK-NEXT: ret <4 x i32> [[R]]
 ;
   %b0 = mul <4 x i32> %x, %y
   %b1 = mul <4 x i32> %z, %y
   call void @use(<4 x i32> %b1)
   %r = shufflevector <4 x i32> %b0, <4 x i32> %b1, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
   ret <4 x i32> %r
 }

+; negative test - must have matching operand
+
 define <4 x float> @shuf_fadd_v4f32_no_common_op(<4 x float> %x, <4 x float> %y, <4 x float> %z, <4 x float> %w) {
 ; CHECK-LABEL: @shuf_fadd_v4f32_no_common_op(
 ; CHECK-NEXT: [[B0:%.*]] = fadd <4 x float> [[X:%.*]], [[Y:%.*]]
 ; CHECK-NEXT: [[B1:%.*]] = fadd <4 x float> [[Z:%.*]], [[W:%.*]]
 ; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[B0]], <4 x float> [[B1]], <4 x i32> <i32 1, i32 3, i32 5, i32 7>
 ; CHECK-NEXT: ret <4 x float> [[R]]
 ;
   %b0 = fadd <4 x float> %x, %y
   %b1 = fadd <4 x float> %z, %w
   %r = shufflevector <4 x float> %b0, <4 x float> %b1, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
   ret <4 x float> %r
 }

+; negative test - binops may be relatively cheap
+
 define <16 x i16> @shuf_and_v16i16_yy_expensive_shuf(<16 x i16> %x, <16 x i16> %y, <16 x i16> %z) {
 ; CHECK-LABEL: @shuf_and_v16i16_yy_expensive_shuf(
 ; CHECK-NEXT: [[B0:%.*]] = and <16 x i16> [[X:%.*]], [[Y:%.*]]
 ; CHECK-NEXT: [[B1:%.*]] = and <16 x i16> [[Y]], [[Z:%.*]]
 ; CHECK-NEXT: [[R:%.*]] = shufflevector <16 x i16> [[B0]], <16 x i16> [[B1]], <16 x i32> <i32 15, i32 22, i32 25, i32 13, i32 28, i32 0, i32 undef, i32 3, i32 0, i32 30, i32 3, i32 7, i32 9, i32 19, i32 2, i32 22>
 ; CHECK-NEXT: ret <16 x i16> [[R]]
 ;
   %b0 = and <16 x i16> %x, %y
   %b1 = and <16 x i16> %y, %z
   %r = shufflevector <16 x i16> %b0, <16 x i16> %b1, <16 x i32> <i32 15, i32 22, i32 25, i32 13, i32 28, i32 0, i32 poison, i32 3, i32 0, i32 30, i32 3, i32 7, i32 9, i32 19, i32 2, i32 22>
   ret <16 x i16> %r
 }
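
The fdiv test above pairs a `fast` binop with an `arcp` binop and expects only `arcp` on the folded instruction. One way to picture that intersection is as a bitwise AND over flag sets, as in this sketch. The constants below are hypothetical bit assignments chosen for illustration and do not reflect LLVM's internal FastMathFlags encoding. The only fact relied on is that `fast` implies all of the individual fast-math flags.

```cpp
#include <cassert>

// Illustrative bit assignments only; not LLVM's actual encoding.
constexpr unsigned Reassoc = 1u << 0;
constexpr unsigned NNan = 1u << 1;
constexpr unsigned NInf = 1u << 2;
constexpr unsigned NSZ = 1u << 3;
constexpr unsigned Arcp = 1u << 4;
constexpr unsigned Contract = 1u << 5;
constexpr unsigned Afn = 1u << 6;
// "fast" is shorthand for every individual flag being set.
constexpr unsigned Fast = Reassoc | NNan | NInf | NSZ | Arcp | Contract | Afn;

// Models copying the flags of the first binop and then AND-ing in the flags
// of the second: a flag survives only if both original binops carried it.
unsigned intersectFlags(unsigned F0, unsigned F1) {
  unsigned Result = F0; // analogous to copying flags from the first binop
  Result &= F1;         // analogous to intersecting with the second binop
  return Result;
}
```

In this model `intersectFlags(Fast, Arcp)` equals `Arcp`, in line with the `fdiv arcp` CHECK line. Keeping only the shared flags is the conservative choice, since each lane of the new binop may stand in for a lane of either original binop.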
