This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Improved (v)insertps shuffle matching
ClosedPublic

Authored by RKSimon on Jan 8 2015, 10:17 AM.

Details

Reviewers

spatel
qcolombet
chandlerc
andreadb

Commits

rG94a4cc027ab1: [X86][SSE] Improved (v)insertps shuffle matching
rL225589: [X86][SSE] Improved (v)insertps shuffle matching

Summary

In the current code we only attempt to match against insertps if we have exactly one element from the second input vector, irrespective of how much of the shuffle result is zeroable.

This patch checks to see if there is a single non-zeroable element from either input that requires insertion. It also supports matching of cases where only one of the inputs needs to be referenced.

We also split (v)insertps shuffle matching off into a new lowerVectorShuffleAsInsertPS function.
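
As a rough illustration of the matching rule described above, the following standalone sketch mirrors the logic outside of LLVM. The names `InsertPSMatch` and `matchInsertPS`, and the use of `std::array` in place of `SDValue`, `ArrayRef` and `SmallBitVector`, are assumptions made for this example and are not part of the patch.

```cpp
#include <array>
#include <cassert>

// Hypothetical result type for this sketch (not an LLVM type).
struct InsertPSMatch {
  unsigned Imm = 0;           // INSERTPS immediate: src << 6 | dst << 4 | zero mask.
  bool SrcIsV1 = false;       // True if the inserted element comes from V1.
  bool V1UsedInPlace = false; // False if the first operand could be undef.
};

// Mask follows the shufflevector convention: 0-3 select V1, 4-7 select V2.
// Zeroable[i] is true when result lane i may be zeroed (known zero or undef).
// Returns true and fills Out if a single INSERTPS can perform the shuffle.
bool matchInsertPS(const std::array<int, 4> &Mask,
                   const std::array<bool, 4> &Zeroable, InsertPSMatch &Out) {
  unsigned ZMask = 0;
  int DstIndex = -1;
  bool V1UsedInPlace = false;

  for (int i = 0; i < 4; ++i) {
    // Zeroable lanes are handled by the immediate's zero mask.
    if (Zeroable[i]) {
      ZMask |= 1u << i;
      continue;
    }
    assert(Mask[i] >= 0 && Mask[i] < 8 && "non-zeroable lane needs a source");
    // A V1 element that is already in the right lane needs no work.
    if (Mask[i] == i) {
      V1UsedInPlace = true;
      continue;
    }
    // Only a single non-zeroable element can be inserted.
    if (DstIndex != -1)
      return false;
    DstIndex = i;
  }

  // Nothing to insert, so INSERTPS is not the right tool.
  if (DstIndex == -1)
    return false;

  // The source index is relative to the operand it is taken from.
  bool SrcIsV1 = Mask[DstIndex] < 4;
  unsigned SrcIndex = SrcIsV1 ? Mask[DstIndex] : Mask[DstIndex] - 4;

  Out.Imm = SrcIndex << 6 | unsigned(DstIndex) << 4 | ZMask;
  Out.SrcIsV1 = SrcIsV1;
  Out.V1UsedInPlace = V1UsedInPlace;
  return true;
}
```

For the mask `<0, 4, 2, 2>` with a zero second input (the second shuffle in `test19` of combine-or.ll), lane 1 is zeroable and lane 3 takes V1 element 2, so this sketch yields the immediate 0xB2.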

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 17902. Jan 8 2015, 10:17 AM

RKSimon retitled this revision from to [X86][SSE] Improved (v)insertps shuffle matching.

RKSimon updated this object.

RKSimon edited the test plan for this revision.

RKSimon added reviewers: chandlerc, andreadb, spatel.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: Unknown Object (MLST).

Hi Simon,

LGTM with more comments to ease future modifications.
See the inlined comments.

Any performance numbers related to this?

Thanks,
-Quentin

lib/Target/X86/X86ISelLowering.cpp
8178 ↗	(On Diff #17902)	We can use a bool for that and I would also change the name to something like IsV1Used.
8205 ↗	(On Diff #17902)	Add a comment saying that if we get here, V2 is not used but one element of V1 is moved. Therefore, V1 will be used both as the in-place element (first operand) and as the source of insertion (second operand). I think that would make it clearer why V2 gets V1’s information and ease future reading/modifications.
8209 ↗	(On Diff #17902)	Add a comment saying that unlike LLVM, the index is from the start of the current operand, not the start of the concatenated vector. I.e., indexes in {V1[4] V2[4]} targeting V2 are equal to original index - numberOfElts(V1).
8212 ↗	(On Diff #17902)	Add a comment saying that V1 is not used, which means it is zeroable.
8214 ↗	(On Diff #17902)	No curly brackets.
8536 ↗	(On Diff #17902)	We can remove this block with the previous one to get rid of one if.
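
To make the index convention in the 8209 comment concrete, here is a minimal sketch. The helper name `operandRelativeIndex` and the constant `NumElts` are hypothetical; the patch computes the same value inline as `Mask[V2DstIndex] - 4`.

```cpp
#include <cassert>

// Number of elements per input vector for a v4f32 shuffle.
constexpr int NumElts = 4;

// Hypothetical helper: convert a shufflevector mask index, which counts over
// the concatenation {V1[4], V2[4]}, into an index relative to the operand it
// selects from, which is what the INSERTPS immediate expects.
inline int operandRelativeIndex(int MaskIndex) {
  assert(MaskIndex >= 0 && MaskIndex < 2 * NumElts && "index out of range");
  return MaskIndex < NumElts ? MaskIndex : MaskIndex - NumElts;
}
```

Under this convention mask index 6 refers to element 2 of V2, while mask index 2 refers to element 2 of V1.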

This revision is now accepted and ready to land. Jan 8 2015, 4:07 PM

qcolombet added inline comments. Jan 8 2015, 4:10 PM

lib/Target/X86/X86ISelLowering.cpp
8536 ↗	(On Diff #17902)	s/remove/merge/

Additional nits from Quentin's comments. Please try to work on some of the coding style issues.

lib/Target/X86/X86ISelLowering.cpp
8182–8197 ↗	(On Diff #17902)	I've made this comment in several review threads. Please use continue rather than long if-else chains in loops.
8190–8191 ↗	(On Diff #17902)	Asymmetric braces are really weird.
8222 ↗	(On Diff #17902)	Indent is wrong here. Again, please use clang-format.
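
The `continue` style requested above follows the "use early exits and continue" guidance in the LLVM Coding Standards. A small standalone sketch of the pattern, where `countInsertions` is a hypothetical function written for this example rather than code from the patch:

```cpp
#include <cassert>
#include <vector>

// Hypothetical example: count the lanes that would need an explicit
// insertion, using early `continue`s instead of a nested if-else chain.
int countInsertions(const std::vector<int> &Mask,
                    const std::vector<bool> &Zeroable) {
  assert(Mask.size() == Zeroable.size() && "one zeroable flag per lane");
  int NumInsertions = 0;
  for (int i = 0, e = static_cast<int>(Mask.size()); i != e; ++i) {
    // Lane can simply be zeroed; nothing more to decide.
    if (Zeroable[i])
      continue;
    // Element is already in its final position.
    if (Mask[i] == i)
      continue;
    // Everything else needs to be moved into place.
    ++NumInsertions;
  }
  return NumInsertions;
}
```

Each case is handled and dismissed at the top of the loop body, so the remaining code stays at a single indentation level.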

Closed by commit rL225589: [X86][SSE] Improved (v)insertps shuffle matching (authored by RKSimon). Jan 10 2015, 11:47 AM

This revision was automatically updated to reflect the committed changes.

Thanks for the feedback guys - apologies for the code style problems, they should be fixed now.

Regarding performance - it's tricky to give specific numbers as insertps gets matched against a wide variety of masks, but if we assume that an insertps instruction replaces an xorps (zero) and 2 dependent shufps, on a brief test I'm seeing a 35% boost on Core2Duo.
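
To show why a single insertps can stand in for a zeroing xorps plus shuffles, here is a scalar model of the instruction's register form. This is only a sketch of the documented immediate layout (bits 7:6 source lane, bits 5:4 destination lane, bits 3:0 zero mask); `V4F32` and `emulateInsertPS` are names invented for this example, and it says nothing about the timing of the real instruction.

```cpp
#include <array>
#include <cassert>

using V4F32 = std::array<float, 4>;

// Scalar model of INSERTPS with a register source operand:
//   Imm[7:6] selects the lane of B to read,
//   Imm[5:4] selects the lane of A to overwrite,
//   Imm[3:0] is a mask of result lanes forced to zero afterwards.
V4F32 emulateInsertPS(V4F32 A, const V4F32 &B, unsigned Imm) {
  assert(Imm <= 0xFFu && "immediate must fit in 8 bits");
  unsigned Src = (Imm >> 6) & 3;
  unsigned Dst = (Imm >> 4) & 3;
  unsigned ZMask = Imm & 0xF;

  A[Dst] = B[Src];
  for (unsigned i = 0; i < 4; ++i)
    if (ZMask & (1u << i))
      A[i] = 0.0f;
  return A;
}
```

With both operands set to the same vector and the immediate 0x9C, the model produces `x[0], x[2], 0, 0`, which is the `xmm0 = xmm0[0,2],zero,zero` pattern that appears in the updated tests.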

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	124 lines
llvm/trunk/test/CodeGen/X86/combine-or.ll	10 lines
llvm/trunk/test/CodeGen/X86/masked_memop.ll	10 lines
llvm/trunk/test/CodeGen/X86/vector-shuffle-combining.ll	28 lines

Diff 17974

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

[...]	if (V.getOpcode() == ISD::BUILD_VECTOR ||
// We can't broadcast from a vector register w/o AVX2, and we can only		// We can't broadcast from a vector register w/o AVX2, and we can only
// broadcast from the zero-element of a vector register.		// broadcast from the zero-element of a vector register.
return SDValue();		return SDValue();
}		}

return DAG.getNode(X86ISD::VBROADCAST, DL, VT, V);		return DAG.getNode(X86ISD::VBROADCAST, DL, VT, V);
}		}

		// Check for whether we can use INSERTPS to perform the shuffle. We only use
		// INSERTPS when the V1 elements are already in the correct locations
		// because otherwise we can just always use two SHUFPS instructions which
		// are much smaller to encode than a SHUFPS and an INSERTPS. We can also
		// perform INSERTPS if a single V1 element is out of place and all V2
		// elements are zeroable.
		static SDValue lowerVectorShuffleAsInsertPS(SDValue Op, SDValue V1, SDValue V2,
		ArrayRef<int> Mask,
		SelectionDAG &DAG) {
		assert(Op.getSimpleValueType() == MVT::v4f32 && "Bad shuffle type!");
		assert(V1.getSimpleValueType() == MVT::v4f32 && "Bad operand type!");
		assert(V2.getSimpleValueType() == MVT::v4f32 && "Bad operand type!");
		assert(Mask.size() == 4 && "Unexpected mask size for v4 shuffle!");

		SmallBitVector Zeroable = computeZeroableShuffleElements(Mask, V1, V2);

		unsigned ZMask = 0;
		int V1DstIndex = -1;
		int V2DstIndex = -1;
		bool V1UsedInPlace = false;

		for (int i = 0; i < 4; i++) {
		// Synthesize a zero mask from the zeroable elements (includes undefs).
		if (Zeroable[i]) {
		ZMask |= 1 << i;
		continue;
		}

		// Flag if we use any V1 inputs in place.
		if (i == Mask[i]) {
		V1UsedInPlace = true;
		continue;
		}

		// We can only insert a single non-zeroable element.
		if (V1DstIndex != -1 || V2DstIndex != -1)
		return SDValue();

		if (Mask[i] < 4) {
		// V1 input out of place for insertion.
		V1DstIndex = i;
		} else {
		// V2 input for insertion.
		V2DstIndex = i;
		}
		}

		// Don't bother if we have no (non-zeroable) element for insertion.
		if (V1DstIndex == -1 && V2DstIndex == -1)
		return SDValue();

		// Determine element insertion src/dst indices. The src index is from the
		// start of the inserted vector, not the start of the concatenated vector.
		unsigned V2SrcIndex = 0;
		if (V1DstIndex != -1) {
		// If we have a V1 input out of place, we use V1 as the V2 element insertion
		// and don't use the original V2 at all.
		V2SrcIndex = Mask[V1DstIndex];
		V2DstIndex = V1DstIndex;
		V2 = V1;
		} else {
		V2SrcIndex = Mask[V2DstIndex] - 4;
		}

		// If no V1 inputs are used in place, then the result is created only from
		// the zero mask and the V2 insertion - so remove V1 dependency.
		if (!V1UsedInPlace)
		V1 = DAG.getUNDEF(MVT::v4f32);

		unsigned InsertPSMask = V2SrcIndex << 6 | V2DstIndex << 4 | ZMask;
		assert((InsertPSMask & ~0xFFu) == 0 && "Invalid mask!");

		// Insert the V2 element into the desired position.
		SDLoc DL(Op);
		return DAG.getNode(X86ISD::INSERTPS, DL, MVT::v4f32, V1, V2,
		DAG.getConstant(InsertPSMask, MVT::i8));
		}

/// \brief Handle lowering of 2-lane 64-bit floating point shuffles.		/// \brief Handle lowering of 2-lane 64-bit floating point shuffles.
///		///
/// This is the basis function for the 2-lane 64-bit shuffles as we have full		/// This is the basis function for the 2-lane 64-bit shuffles as we have full
/// support for floating point shuffles but not integer shuffles. These		/// support for floating point shuffles but not integer shuffles. These
/// instructions will incur a domain crossing penalty on some chips though so		/// instructions will incur a domain crossing penalty on some chips though so
/// it is better to avoid lowering through this for integer vectors where		/// it is better to avoid lowering through this for integer vectors where
/// possible.		/// possible.
static SDValue lowerV2F64VectorShuffle(SDValue Op, SDValue V1, SDValue V2,		static SDValue lowerV2F64VectorShuffle(SDValue Op, SDValue V1, SDValue V2,
[...]	static SDValue lowerV4F32VectorShuffle(SDValue Op, SDValue V1, SDValue V2,
// we defer to if both this and BLENDPS fail to match, so restrict this to		// we defer to if both this and BLENDPS fail to match, so restrict this to
// when the V2 input is targeting element 0 of the mask -- that is the fast		// when the V2 input is targeting element 0 of the mask -- that is the fast
// case here.		// case here.
if (NumV2Elements == 1 && Mask[0] >= 4)		if (NumV2Elements == 1 && Mask[0] >= 4)
if (SDValue V = lowerVectorShuffleAsElementInsertion(MVT::v4f32, DL, V1, V2,		if (SDValue V = lowerVectorShuffleAsElementInsertion(MVT::v4f32, DL, V1, V2,
Mask, Subtarget, DAG))		Mask, Subtarget, DAG))
return V;		return V;

if (Subtarget->hasSSE41())		if (Subtarget->hasSSE41()) {
if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v4f32, V1, V2, Mask,		if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v4f32, V1, V2, Mask,
Subtarget, DAG))		Subtarget, DAG))
return Blend;		return Blend;

// Check for whether we can use INSERTPS to perform the blend. We only use		// Use INSERTPS if we can complete the shuffle efficiently.
// INSERTPS when the V1 elements are already in the correct locations		if (SDValue V = lowerVectorShuffleAsInsertPS(Op, V1, V2, Mask, DAG))
// because otherwise we can just always use two SHUFPS instructions which		return V;
// are much smaller to encode than a SHUFPS and an INSERTPS.
if (NumV2Elements == 1 && Subtarget->hasSSE41()) {
int V2Index =
std::find_if(Mask.begin(), Mask.end(), [](int M) { return M >= 4; }) -
Mask.begin();

// When using INSERTPS we can zero any lane of the destination. Collect
// the zero inputs into a mask and drop them from the lanes of V1 which
// actually need to be present as inputs to the INSERTPS.
SmallBitVector Zeroable = computeZeroableShuffleElements(Mask, V1, V2);

// Synthesize a shuffle mask for the non-zero and non-v2 inputs.
bool InsertNeedsShuffle = false;
unsigned ZMask = 0;
for (int i = 0; i < 4; ++i)
if (i != V2Index) {
if (Zeroable[i]) {
ZMask |= 1 << i;
} else if (Mask[i] != i) {
InsertNeedsShuffle = true;
break;
}
}

// We don't want to use INSERTPS or other insertion techniques if it will
// require shuffling anyways.
if (!InsertNeedsShuffle) {
// If all of V1 is zeroable, replace it with undef.
if ((ZMask | 1 << V2Index) == 0xF)
V1 = DAG.getUNDEF(MVT::v4f32);

unsigned InsertPSMask = (Mask[V2Index] - 4) << 6 | V2Index << 4 | ZMask;
assert((InsertPSMask & ~0xFFu) == 0 && "Invalid mask!");

// Insert the V2 element into the desired position.
return DAG.getNode(X86ISD::INSERTPS, DL, MVT::v4f32, V1, V2,
DAG.getConstant(InsertPSMask, MVT::i8));
}
}		}

// Otherwise fall back to a SHUFPS lowering strategy.		// Otherwise fall back to a SHUFPS lowering strategy.
return lowerVectorShuffleWithSHUFPS(DL, MVT::v4f32, Mask, V1, V2, DAG);		return lowerVectorShuffleWithSHUFPS(DL, MVT::v4f32, Mask, V1, V2, DAG);
}		}

/// \brief Lower 4-lane i32 vector shuffles.		/// \brief Lower 4-lane i32 vector shuffles.
///		///
[...]

llvm/trunk/test/CodeGen/X86/combine-or.ll

[...]	; CHECK-NEXT: retq
ret <4 x i32> %or		ret <4 x i32> %or
}		}


define <4 x i32> @test19(<4 x i32> %a, <4 x i32> %b) {		define <4 x i32> @test19(<4 x i32> %a, <4 x i32> %b) {
; CHECK-LABEL: test19:		; CHECK-LABEL: test19:
; CHECK: # BB#0:		; CHECK: # BB#0:
; CHECK-NEXT: xorps %xmm2, %xmm2		; CHECK-NEXT: xorps %xmm2, %xmm2
; CHECK-NEXT: xorps %xmm3, %xmm3		; CHECK-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,0],xmm0[0,3]
; CHECK-NEXT: shufps {{.*#+}} xmm3 = xmm3[0,0],xmm0[0,3]		; CHECK-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,2,1,3]
; CHECK-NEXT: shufps {{.*#+}} xmm3 = xmm3[0,2,1,3]		; CHECK-NEXT: insertps {{.*#+}} xmm1 = xmm1[0],zero,xmm1[2,2]
; CHECK-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,0],xmm1[0,0]		; CHECK-NEXT: orps %xmm1, %xmm2
; CHECK-NEXT: shufps {{.*#+}} xmm2 = xmm2[2,0],xmm1[2,2]
; CHECK-NEXT: orps %xmm3, %xmm2
; CHECK-NEXT: movaps %xmm2, %xmm0		; CHECK-NEXT: movaps %xmm2, %xmm0
; CHECK-NEXT: retq		; CHECK-NEXT: retq
%shuf1 = shufflevector <4 x i32> %a, <4 x i32> zeroinitializer, <4 x i32><i32 4, i32 0, i32 4, i32 3>		%shuf1 = shufflevector <4 x i32> %a, <4 x i32> zeroinitializer, <4 x i32><i32 4, i32 0, i32 4, i32 3>
%shuf2 = shufflevector <4 x i32> %b, <4 x i32> zeroinitializer, <4 x i32><i32 0, i32 4, i32 2, i32 2>		%shuf2 = shufflevector <4 x i32> %b, <4 x i32> zeroinitializer, <4 x i32><i32 0, i32 4, i32 2, i32 2>
%or = or <4 x i32> %shuf1, %shuf2		%or = or <4 x i32> %shuf1, %shuf2
ret <4 x i32> %or		ret <4 x i32> %or
}		}

[...]

llvm/trunk/test/CodeGen/X86/masked_memop.ll

	[...]
	}			}

	; AVX512-LABEL: test5			; AVX512-LABEL: test5
	; AVX512: vmovupd (%rdi), %zmm1 {%k1}			; AVX512: vmovupd (%rdi), %zmm1 {%k1}

	; AVX2-LABEL: test5			; AVX2-LABEL: test5
	; AVX2: vmaskmovpd			; AVX2: vmaskmovpd
	; AVX2: vblendvpd			; AVX2: vblendvpd
	; AVX2: vmaskmovpd			; AVX2: vmaskmovpd
	; AVX2: vblendvpd			; AVX2: vblendvpd
	define <8 x double> @test5(<8 x i32> %trigger, <8 x double>* %addr, <8 x double> %dst) {			define <8 x double> @test5(<8 x i32> %trigger, <8 x double>* %addr, <8 x double> %dst) {
	%mask = icmp eq <8 x i32> %trigger, zeroinitializer			%mask = icmp eq <8 x i32> %trigger, zeroinitializer
	%res = call <8 x double> @llvm.masked.load.v8f64(<8 x double>* %addr, i32 4, <8 x i1>%mask, <8 x double>%dst)			%res = call <8 x double> @llvm.masked.load.v8f64(<8 x double>* %addr, i32 4, <8 x i1>%mask, <8 x double>%dst)
	ret <8 x double> %res			ret <8 x double> %res
	}			}

	; AVX2-LABEL: test6			; AVX2-LABEL: test6
	[...]

	define void @test13(<16 x i32> %trigger, <16 x float>* %addr, <16 x float> %val) {			define void @test13(<16 x i32> %trigger, <16 x float>* %addr, <16 x float> %val) {
	%mask = icmp eq <16 x i32> %trigger, zeroinitializer			%mask = icmp eq <16 x i32> %trigger, zeroinitializer
	call void @llvm.masked.store.v16f32(<16 x float>%val, <16 x float>* %addr, i32 4, <16 x i1>%mask)			call void @llvm.masked.store.v16f32(<16 x float>%val, <16 x float>* %addr, i32 4, <16 x i1>%mask)
	ret void			ret void
	}			}

	; AVX2-LABEL: test14			; AVX2-LABEL: test14
	; AVX2: vshufps $-24			; AVX2: vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
	; AVX2: vmaskmovps			; AVX2: vmaskmovps
	define void @test14(<2 x i32> %trigger, <2 x float>* %addr, <2 x float> %val) {			define void @test14(<2 x i32> %trigger, <2 x float>* %addr, <2 x float> %val) {
	%mask = icmp eq <2 x i32> %trigger, zeroinitializer			%mask = icmp eq <2 x i32> %trigger, zeroinitializer
	call void @llvm.masked.store.v2f32(<2 x float>%val, <2 x float>* %addr, i32 4, <2 x i1>%mask)			call void @llvm.masked.store.v2f32(<2 x float>%val, <2 x float>* %addr, i32 4, <2 x i1>%mask)
	ret void			ret void
	}			}

	; AVX2-LABEL: test15			; AVX2-LABEL: test15
	[...]
	; AVX2-NOT: blend			; AVX2-NOT: blend
	define <2 x float> @test18(<2 x i32> %trigger, <2 x float>* %addr) {			define <2 x float> @test18(<2 x i32> %trigger, <2 x float>* %addr) {
	%mask = icmp eq <2 x i32> %trigger, zeroinitializer			%mask = icmp eq <2 x i32> %trigger, zeroinitializer
	%res = call <2 x float> @llvm.masked.load.v2f32(<2 x float>* %addr, i32 4, <2 x i1>%mask, <2 x float>undef)			%res = call <2 x float> @llvm.masked.load.v2f32(<2 x float>* %addr, i32 4, <2 x i1>%mask, <2 x float>undef)
	ret <2 x float> %res			ret <2 x float> %res
	}			}


	declare <16 x i32> @llvm.masked.load.v16i32(<16 x i32>*, i32, <16 x i1>, <16 x i32>)			declare <16 x i32> @llvm.masked.load.v16i32(<16 x i32>*, i32, <16 x i1>, <16 x i32>)
	declare <4 x i32> @llvm.masked.load.v4i32(<4 x i32>*, i32, <4 x i1>, <4 x i32>)			declare <4 x i32> @llvm.masked.load.v4i32(<4 x i32>*, i32, <4 x i1>, <4 x i32>)
	declare <2 x i32> @llvm.masked.load.v2i32(<2 x i32>*, i32, <2 x i1>, <2 x i32>)			declare <2 x i32> @llvm.masked.load.v2i32(<2 x i32>*, i32, <2 x i1>, <2 x i32>)
	declare void @llvm.masked.store.v16i32(<16 x i32>, <16 x i32>*, i32, <16 x i1>)			declare void @llvm.masked.store.v16i32(<16 x i32>, <16 x i32>*, i32, <16 x i1>)
	declare void @llvm.masked.store.v8i32(<8 x i32>, <8 x i32>*, i32, <8 x i1>)			declare void @llvm.masked.store.v8i32(<8 x i32>, <8 x i32>*, i32, <8 x i1>)
	declare void @llvm.masked.store.v4i32(<4 x i32>, <4 x i32>*, i32, <4 x i1>)			declare void @llvm.masked.store.v4i32(<4 x i32>, <4 x i32>*, i32, <4 x i1>)
	declare void @llvm.masked.store.v2f32(<2 x float>, <2 x float>*, i32, <2 x i1>)			declare void @llvm.masked.store.v2f32(<2 x float>, <2 x float>*, i32, <2 x i1>)
	declare void @llvm.masked.store.v2i32(<2 x i32>, <2 x i32>*, i32, <2 x i1>)			declare void @llvm.masked.store.v2i32(<2 x i32>, <2 x i32>*, i32, <2 x i1>)
	declare void @llvm.masked.store.v16f32(<16 x float>, <16 x float>*, i32, <16 x i1>)			declare void @llvm.masked.store.v16f32(<16 x float>, <16 x float>*, i32, <16 x i1>)
	declare void @llvm.masked.store.v16f32p(<16 x float>, <16 x float>*, i32, <16 x i1>)			declare void @llvm.masked.store.v16f32p(<16 x float>, <16 x float>*, i32, <16 x i1>)
	declare <16 x float> @llvm.masked.load.v16f32(<16 x float>*, i32, <16 x i1>, <16 x float>)			declare <16 x float> @llvm.masked.load.v16f32(<16 x float>*, i32, <16 x i1>, <16 x float>)
	declare <8 x float> @llvm.masked.load.v8f32(<8 x float>*, i32, <8 x i1>, <8 x float>)			declare <8 x float> @llvm.masked.load.v8f32(<8 x float>*, i32, <8 x i1>, <8 x float>)
	declare <4 x float> @llvm.masked.load.v4f32(<4 x float>*, i32, <4 x i1>, <4 x float>)			declare <4 x float> @llvm.masked.load.v4f32(<4 x float>*, i32, <4 x i1>, <4 x float>)
	declare <2 x float> @llvm.masked.load.v2f32(<2 x float>*, i32, <2 x i1>, <2 x float>)			declare <2 x float> @llvm.masked.load.v2f32(<2 x float>*, i32, <2 x i1>, <2 x float>)
	declare <8 x double> @llvm.masked.load.v8f64(<8 x double>*, i32, <8 x i1>, <8 x double>)			declare <8 x double> @llvm.masked.load.v8f64(<8 x double>*, i32, <8 x i1>, <8 x double>)
	declare <4 x double> @llvm.masked.load.v4f64(<4 x double>*, i32, <4 x i1>, <4 x double>)			declare <4 x double> @llvm.masked.load.v4f64(<4 x double>*, i32, <4 x i1>, <4 x double>)
	declare <2 x double> @llvm.masked.load.v2f64(<2 x double>*, i32, <2 x i1>, <2 x double>)			declare <2 x double> @llvm.masked.load.v2f64(<2 x double>*, i32, <2 x i1>, <2 x double>)
	declare void @llvm.masked.store.v8f64(<8 x double>, <8 x double>*, i32, <8 x i1>)			declare void @llvm.masked.store.v8f64(<8 x double>, <8 x double>*, i32, <8 x i1>)
	declare void @llvm.masked.store.v2f64(<2 x double>, <2 x double>*, i32, <2 x i1>)			declare void @llvm.masked.store.v2f64(<2 x double>, <2 x double>*, i32, <2 x i1>)
	declare void @llvm.masked.store.v2i64(<2 x i64>, <2 x i64>*, i32, <2 x i1>)			declare void @llvm.masked.store.v2i64(<2 x i64>, <2 x i64>*, i32, <2 x i1>)

llvm/trunk/test/CodeGen/X86/vector-shuffle-combining.ll

	[...]
	; AVX-NEXT: retq			; AVX-NEXT: retq
	%shuf1 = shufflevector <4 x i32> %a, <4 x i32> %c, <4 x i32><i32 0, i32 2, i32 5, i32 7>			%shuf1 = shufflevector <4 x i32> %a, <4 x i32> %c, <4 x i32><i32 0, i32 2, i32 5, i32 7>
	%shuf2 = shufflevector <4 x i32> %b, <4 x i32> %c, <4 x i32><i32 0, i32 2, i32 5, i32 7>			%shuf2 = shufflevector <4 x i32> %b, <4 x i32> %c, <4 x i32><i32 0, i32 2, i32 5, i32 7>
	%or = or <4 x i32> %shuf1, %shuf2			%or = or <4 x i32> %shuf1, %shuf2
	ret <4 x i32> %or			ret <4 x i32> %or
	}			}

	define <4 x i32> @combine_bitwise_ops_test3c(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c) {			define <4 x i32> @combine_bitwise_ops_test3c(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c) {
	; SSE-LABEL: combine_bitwise_ops_test3c:			; SSE2-LABEL: combine_bitwise_ops_test3c:
	; SSE: # BB#0:			; SSE2: # BB#0:
	; SSE-NEXT: xorps %xmm1, %xmm0			; SSE2-NEXT: xorps %xmm1, %xmm0
	; SSE-NEXT: xorps %xmm1, %xmm1			; SSE2-NEXT: xorps %xmm1, %xmm1
	; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[1,3]			; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[1,3]
	; SSE-NEXT: retq			; SSE2-NEXT: retq
				;
				; SSSE3-LABEL: combine_bitwise_ops_test3c:
				; SSSE3: # BB#0:
				; SSSE3-NEXT: xorps %xmm1, %xmm0
				; SSSE3-NEXT: xorps %xmm1, %xmm1
				; SSSE3-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[1,3]
				; SSSE3-NEXT: retq
				;
				; SSE41-LABEL: combine_bitwise_ops_test3c:
				; SSE41: # BB#0:
				; SSE41-NEXT: xorps %xmm1, %xmm0
				; SSE41-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
				; SSE41-NEXT: retq
	;			;
	; AVX-LABEL: combine_bitwise_ops_test3c:			; AVX-LABEL: combine_bitwise_ops_test3c:
	; AVX: # BB#0:			; AVX: # BB#0:
	; AVX-NEXT: vxorps %xmm1, %xmm0, %xmm0			; AVX-NEXT: vxorps %xmm1, %xmm0, %xmm0
	; AVX-NEXT: vxorps %xmm1, %xmm1, %xmm1			; AVX-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,2],zero,zero
	; AVX-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[1,3]
	; AVX-NEXT: retq			; AVX-NEXT: retq
	%shuf1 = shufflevector <4 x i32> %a, <4 x i32> %c, <4 x i32><i32 0, i32 2, i32 5, i32 7>			%shuf1 = shufflevector <4 x i32> %a, <4 x i32> %c, <4 x i32><i32 0, i32 2, i32 5, i32 7>
	%shuf2 = shufflevector <4 x i32> %b, <4 x i32> %c, <4 x i32><i32 0, i32 2, i32 5, i32 7>			%shuf2 = shufflevector <4 x i32> %b, <4 x i32> %c, <4 x i32><i32 0, i32 2, i32 5, i32 7>
	%xor = xor <4 x i32> %shuf1, %shuf2			%xor = xor <4 x i32> %shuf1, %shuf2
	ret <4 x i32> %xor			ret <4 x i32> %xor
	}			}

	define <4 x i32> @combine_bitwise_ops_test4c(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c) {			define <4 x i32> @combine_bitwise_ops_test4c(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c) {
	[...]