This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Generalised unpckl/unpckh shuffle matching
Closed, Public

Authored by RKSimon on Feb 11 2015, 8:54 AM.

Details

Reviewers

spatel
qcolombet
chandlerc
andreadb

Commits

rG1d89a02abbaa: [X86][SSE] Generalised unpckl/unpckh shuffle matching
rL229571: [X86][SSE] Generalised unpckl/unpckh shuffle matching

Summary

The existing unpck instruction lowering was based on matching explicit shuffle patterns, and missed many alternative shuffle masks (notably commuted masks and duplicate inputs).

This patch adds lowerVectorUnpack() which can be used to thoroughly match any unpckl/unpckh pattern.
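
For reference, the unpckl/unpckh masks being matched follow a regular pattern: within each 128-bit lane, the low (or high) halves of the two inputs are interleaved. The sketch below generates those masks. `makeUnpackMask` and its parameters are hypothetical names chosen for illustration; they are not taken from the patch or from LLVM.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of the patch): build the two-input shuffle
// mask performed by unpckl (Lo = true) or unpckh (Lo = false).
// Indices 0..NumElts-1 select from V1, NumElts..2*NumElts-1 select from V2.
// EltsPerLane is the number of elements held by one 128-bit lane.
std::vector<int> makeUnpackMask(int NumElts, int EltsPerLane, bool Lo) {
  assert(EltsPerLane > 0 && EltsPerLane % 2 == 0 &&
         NumElts % EltsPerLane == 0 && "malformed vector shape");
  std::vector<int> Mask;
  const int Half = EltsPerLane / 2;
  for (int LaneStart = 0; LaneStart < NumElts; LaneStart += EltsPerLane) {
    // unpckl reads the low half of each lane, unpckh the high half.
    const int Base = LaneStart + (Lo ? 0 : Half);
    for (int i = 0; i < Half; ++i) {
      Mask.push_back(Base + i);           // element from V1
      Mask.push_back(Base + i + NumElts); // matching element from V2
    }
  }
  return Mask;
}
```

Under these assumptions, `makeUnpackMask(4, 4, true)` yields {0, 4, 1, 5} (128-bit, 32-bit elements), `makeUnpackMask(4, 2, true)` yields {0, 4, 2, 6} (256-bit, 64-bit elements), and `makeUnpackMask(8, 4, false)` yields {2, 10, 3, 11, 6, 14, 7, 15} (256-bit, 32-bit elements), which are the literal masks used in the diff.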

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 19764. Feb 11 2015, 8:54 AM

RKSimon retitled this revision to [X86][SSE] Generalised unpckl/unpckh shuffle matching.

RKSimon updated this object.

RKSimon edited the test plan for this revision.

RKSimon added reviewers: chandlerc, qcolombet, andreadb, spatel.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: Unknown Object (MLST).

Moving this discussion back to phabricator.

Chandler - I can look at adding explicit (commuted) shuffle patterns if you prefer. The added test case 'shuffle_v4i32_40u1' was the start of all of this: canonicalization seemed to be forcing a commute that prevented the lowering to punpckldq, and the presence of the undefined lane seems to cause it.

The zmm test cases improved because of the poor handling of these shuffles so far.

Reduced the scope of this to just deal with the commuted versions of the 64-bit and 32-bit element shuffles (pre-AVX512) by adding the missing matching patterns.

The core of the problem is that the lowerVectorShuffle canonicalization can't (and shouldn't) be second guessing the importance of undefined lanes and should just sort by the numbers of defined 1st/2nd input lanes. This is a problem for explicit shuffle matching using both inputs, which is just the unpck instructions these days.

Hi Simon,

LGTM, but Chandler may have a different opinion on how to canonicalize this.

Thanks,
Quentin

lib/Target/X86/X86ISelLowering.cpp
10875	Shouldn’t the last number be 5?
10975	Ditto.

Thanks Quentin, fixed bad shuffle mask and tightened tests. Added extra unpckh test.

Thanks Simon.

Let us move forward. We can address Chandler’s concerns if any later on.

This revision is now accepted and ready to land. Feb 17 2015, 11:17 AM

Closed by commit rL229571: [X86][SSE] Generalised unpckl/unpckh shuffle matching (authored by RKSimon). Feb 17 2015, 2:26 PM

This revision was automatically updated to reflect the committed changes.

I really disagree with the direction of this commit. I'm sorry I've not replied more promptly, but I would like us to go back and look at *why* we cannot canonicalize this problem away, and if we cannot, making isShuffleMaskEquivalent do the commuting itself. I have an idea of how to implement the latter, but I really don't understand what is breaking canonicalizing yet.

I'm probably just being dense (sorry), I do believe there may be a fundamental problem here, but I'm not seeing it yet. undef lanes *never* impact what is canonical in either direction. Are there some opposing canonicalization rules that we have to pick between, and does that force some cases to be canonicalized the wrong way?

I agree that adding multiple shuffle pattern matches for the same instruction is a waste.

If we take the shuffle <-1, 0, 5, 1> as an example (2 V1, 1 V2) - lowerVectorShuffle won't commute this as there are already more V1 inputs than V2. It won't commute to ensure V1 elements are earlier in the shuffle than V2 either, and it is this that is necessary to ensure that the unpcks correctly match.
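
The counting rule described here can be modelled in a few lines. This is a simplified sketch of the behaviour as stated in this thread, not the actual lowerVectorShuffle code; `wouldCommuteInputs` is a hypothetical name, and tie-breaking for equal counts is deliberately left out.

```cpp
#include <cassert>
#include <vector>

// Simplified model (an assumption based on the discussion, not LLVM's code):
// the inputs are commuted only when more defined lanes read from V2 than
// from V1. Undef lanes (-1) do not count towards either input.
bool wouldCommuteInputs(const std::vector<int> &Mask) {
  const int NumElts = static_cast<int>(Mask.size());
  int NumV1 = 0, NumV2 = 0;
  for (int M : Mask) {
    if (M < 0)
      continue; // undef lane
    assert(M < 2 * NumElts && "mask index out of range");
    if (M < NumElts)
      ++NumV1;
    else
      ++NumV2;
  }
  return NumV2 > NumV1;
}
```

With this model, both <-1, 0, 5, 1> and <4, 0, -1, 1> have two V1 lanes and one V2 lane, so neither is commuted, and neither lines up with the uncommuted unpckl pattern <0, 4, 1, 5>.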

Altering isShuffleMaskEquivalent to support commutation looks relatively straightforward (as would supporting duplicate inputs - but that is probably unnecessary) - would you like me to prepare a patch?
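
A minimal sketch of what a commutation-aware equivalence check might look like is given below. All names (`matchesIgnoringUndef`, `commuteMask`, `matchWithCommute`, `Match`) are hypothetical, and the sketch works on plain index vectors; it does not model anything else the real helper may take into account, such as the two inputs being the same value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True if every defined lane of Mask equals the same lane of Expected.
// Undef lanes (-1) in Mask match anything.
bool matchesIgnoringUndef(const std::vector<int> &Mask,
                          const std::vector<int> &Expected) {
  assert(Mask.size() == Expected.size() && "mask size mismatch");
  for (std::size_t i = 0; i < Mask.size(); ++i)
    if (Mask[i] >= 0 && Mask[i] != Expected[i])
      return false;
  return true;
}

// Swap the roles of V1 and V2 in a two-input mask.
std::vector<int> commuteMask(std::vector<int> Mask) {
  const int NumElts = static_cast<int>(Mask.size());
  for (int &M : Mask)
    if (M >= 0)
      M = (M < NumElts) ? M + NumElts : M - NumElts;
  return Mask;
}

enum class Match { None, Direct, Commuted };

// Try the expected pattern as written, then with its inputs swapped.
// On Match::Commuted the caller would emit the node with (V2, V1).
Match matchWithCommute(const std::vector<int> &Mask,
                       const std::vector<int> &Expected) {
  if (matchesIgnoringUndef(Mask, Expected))
    return Match::Direct;
  if (matchesIgnoringUndef(Mask, commuteMask(Expected)))
    return Match::Commuted;
  return Match::None;
}
```

For the example above, `matchWithCommute({-1, 0, 5, 1}, {0, 4, 1, 5})` reports a commuted match, so a single unpckl pattern would cover both operand orders without a second explicit mask.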

Revision Contents

Path                                                           Size
lib/Target/X86/X86ISelLowering.cpp (revision 229431)           24 lines
test/CodeGen/X86/vector-shuffle-128-v4.ll (revision 229431)    32 lines
test/CodeGen/X86/vector-shuffle-256-v4.ll (revision 229431)    9 lines
test/CodeGen/X86/vector-shuffle-256-v8.ll (revision 229431)    24 lines

Diff 20046

lib/Target/X86/X86ISelLowering.cpp

[First 8,861 lines not shown; context: if (!isSingleSHUFPSMask(Mask))]
         return BlendPerm;
   }

   // Use dedicated unpack instructions for masks that match their pattern.
   if (isShuffleEquivalent(V1, V2, Mask, 0, 4, 1, 5))
     return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v4f32, V1, V2);
   if (isShuffleEquivalent(V1, V2, Mask, 2, 6, 3, 7))
     return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v4f32, V1, V2);
+  if (isShuffleEquivalent(V1, V2, Mask, 4, 0, 5, 1))
+    return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v4f32, V2, V1);
+  if (isShuffleEquivalent(V1, V2, Mask, 6, 2, 7, 3))
+    return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v4f32, V2, V1);

   // Otherwise fall back to a SHUFPS lowering strategy.
   return lowerVectorShuffleWithSHUFPS(DL, MVT::v4f32, Mask, V1, V2, DAG);
 }

 /// \brief Lower 4-lane i32 vector shuffles.
 ///
 /// We try to handle these with integer-domain shuffles where we can, but for
[69 lines not shown; context: if (SDValue Masked =]
           lowerVectorShuffleAsBitMask(DL, MVT::v4i32, V1, V2, Mask, DAG))
     return Masked;

   // Use dedicated unpack instructions for masks that match their pattern.
   if (isShuffleEquivalent(V1, V2, Mask, 0, 4, 1, 5))
     return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v4i32, V1, V2);
   if (isShuffleEquivalent(V1, V2, Mask, 2, 6, 3, 7))
     return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v4i32, V1, V2);
+  if (isShuffleEquivalent(V1, V2, Mask, 4, 0, 5, 1))
+    return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v4i32, V2, V1);
+  if (isShuffleEquivalent(V1, V2, Mask, 6, 2, 7, 3))
+    return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v4i32, V2, V1);

   // Try to use byte rotation instructions.
   // Its more profitable for pre-SSSE3 to use shuffles/unpacks.
   if (Subtarget->hasSSSE3())
     if (SDValue Rotate = lowerVectorShuffleAsByteRotate(
             DL, MVT::v4i32, V1, V2, Mask, Subtarget, DAG))
       return Rotate;

[1,709 lines not shown; context: static SDValue lowerV4F64VectorShuffle(SDValue Op, SDValue V1, SDValue V2,]
   }

   // X86 has dedicated unpack instructions that can handle specific blend
   // operations: UNPCKH and UNPCKL.
   if (isShuffleEquivalent(V1, V2, Mask, 0, 4, 2, 6))
     return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v4f64, V1, V2);
   if (isShuffleEquivalent(V1, V2, Mask, 1, 5, 3, 7))
     return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v4f64, V1, V2);
+  if (isShuffleEquivalent(V1, V2, Mask, 4, 0, 6, 2))
+    return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v4f64, V2, V1);
+  if (isShuffleEquivalent(V1, V2, Mask, 5, 1, 7, 3))
+    return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v4f64, V2, V1);

   // If we have a single input to the zero element, insert that into V1 if we
   // can do so cheaply.
   int NumV2Elements =
       std::count_if(Mask.begin(), Mask.end(), [](int M) { return M >= 4; });
   if (NumV2Elements == 1 && Mask[0] >= 4)
     if (SDValue Insertion = lowerVectorShuffleAsElementInsertion(
             MVT::v4f64, DL, V1, V2, Mask, Subtarget, DAG))
[102 lines not shown; context: if (SDValue Shift = lowerVectorShuffleAsByteShift(]
           DL, MVT::v4i64, V1, V2, Mask, DAG))
     return Shift;

   // Use dedicated unpack instructions for masks that match their pattern.
   if (isShuffleEquivalent(V1, V2, Mask, 0, 4, 2, 6))
     return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v4i64, V1, V2);
   if (isShuffleEquivalent(V1, V2, Mask, 1, 5, 3, 7))
     return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v4i64, V1, V2);
+  if (isShuffleEquivalent(V1, V2, Mask, 4, 0, 6, 2))
+    return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v4i64, V2, V1);
+  if (isShuffleEquivalent(V1, V2, Mask, 5, 1, 7, 3))
+    return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v4i64, V2, V1);

   // Try to simplify this by merging 128-bit lanes to enable a lane-based
   // shuffle. However, if we have AVX2 and either inputs are already in place,
   // we will be able to shuffle even across lanes the other input in a single
   // instruction so skip this pattern.
   if (!(Subtarget->hasAVX2() && (isShuffleMaskInputInPlace(0, Mask) ||
                                  isShuffleMaskInputInPlace(1, Mask))))
     if (SDValue Result = lowerVectorShuffleByMerging128BitLanes(
[45 lines not shown; context: if (isSingleInputShuffleMask(Mask))]
       return DAG.getNode(X86ISD::VPERMILPI, DL, MVT::v8f32, V1,
                          getV4X86ShuffleImm8ForMask(RepeatedMask, DAG));

     // Use dedicated unpack instructions for masks that match their pattern.
     if (isShuffleEquivalent(V1, V2, Mask, 0, 8, 1, 9, 4, 12, 5, 13))
       return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v8f32, V1, V2);
     if (isShuffleEquivalent(V1, V2, Mask, 2, 10, 3, 11, 6, 14, 7, 15))
       return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v8f32, V1, V2);
+    if (isShuffleEquivalent(V1, V2, Mask, 8, 0, 9, 1, 12, 4, 13, 5))
qcolombet (inline comment): Shouldn’t the last number be 5?
+      return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v8f32, V2, V1);
+    if (isShuffleEquivalent(V1, V2, Mask, 10, 2, 11, 3, 14, 6, 15, 7))
+      return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v8f32, V2, V1);

     // Otherwise, fall back to a SHUFPS sequence. Here it is important that we
     // have already handled any direct blends. We also need to squash the
     // repeated mask into a simulated v4f32 mask.
     for (int i = 0; i < 4; ++i)
       if (RepeatedMask[i] >= 8)
         RepeatedMask[i] -= 4;
     return lowerVectorShuffleWithSHUFPS(DL, MVT::v8f32, RepeatedMask, V1, V2, DAG);
[80 lines not shown; context: if (isSingleInputShuffleMask(Mask))]
       return DAG.getNode(X86ISD::PSHUFD, DL, MVT::v8i32, V1,
                          getV4X86ShuffleImm8ForMask(RepeatedMask, DAG));

     // Use dedicated unpack instructions for masks that match their pattern.
     if (isShuffleEquivalent(V1, V2, Mask, 0, 8, 1, 9, 4, 12, 5, 13))
       return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v8i32, V1, V2);
     if (isShuffleEquivalent(V1, V2, Mask, 2, 10, 3, 11, 6, 14, 7, 15))
       return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v8i32, V1, V2);
+    if (isShuffleEquivalent(V1, V2, Mask, 8, 0, 9, 1, 12, 4, 13, 5))
qcolombet (inline comment): Ditto.
+      return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v8i32, V2, V1);
+    if (isShuffleEquivalent(V1, V2, Mask, 10, 2, 11, 3, 14, 6, 15, 7))
+      return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v8i32, V2, V1);
   }

   // Try to use bit shift instructions.
   if (SDValue Shift = lowerVectorShuffleAsBitShift(
           DL, MVT::v8i32, V1, V2, Mask, DAG))
     return Shift;

   // Try to use byte shift instructions.
[Remaining 16,342 lines not shown]

test/CodeGen/X86/vector-shuffle-128-v4.ll

[First 949 lines not shown]
 ; AVX-NEXT: retq
   %shuffle = shufflevector <4 x float> %a, <4 x float> zeroinitializer, <4 x i32> <i32 0, i32 4, i32 4, i32 3>
   ret <4 x float> %shuffle
 }

 define <4 x float> @shuffle_v4f32_u051(<4 x float> %a, <4 x float> %b) {
 ; SSE-LABEL: shuffle_v4f32_u051:
 ; SSE: # BB#0:
-; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[1,0],xmm0[1,0]
-; SSE-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,0],xmm1[0,2]
+; SSE-NEXT: unpcklps {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
+; SSE-NEXT: movaps %xmm1, %xmm0
 ; SSE-NEXT: retq
 ;
 ; AVX-LABEL: shuffle_v4f32_u051:
 ; AVX: # BB#0:
-; AVX-NEXT: vshufps {{.*#+}} xmm1 = xmm1[1,0],xmm0[1,0]
-; AVX-NEXT: vshufps {{.*#+}} xmm0 = xmm0[0,0],xmm1[0,2]
+; AVX-NEXT: vunpcklps {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
 ; AVX-NEXT: retq
   %shuffle = shufflevector <4 x float> %a, <4 x float> %b, <4 x i32> <i32 undef, i32 0, i32 5, i32 1>
   ret <4 x float> %shuffle
 }

 define <4 x i32> @shuffle_v4i32_4zzz(<4 x i32> %a) {
 ; SSE2-LABEL: shuffle_v4i32_4zzz:
 ; SSE2: # BB#0:
[328 lines not shown]
 ; AVX-NEXT: retq
   %shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
   ret <4 x i32> %shuffle
 }

 define <4 x i32> @shuffle_v4i32_40u1(<4 x i32> %a, <4 x i32> %b) {
 ; SSE2-LABEL: shuffle_v4i32_40u1:
 ; SSE2: # BB#0:
-; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,0],xmm0[0,0]
-; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,2],xmm0[2,1]
-; SSE2-NEXT: movaps %xmm1, %xmm0
+; SSE2-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
+; SSE2-NEXT: movdqa %xmm1, %xmm0
 ; SSE2-NEXT: retq
 ;
 ; SSE3-LABEL: shuffle_v4i32_40u1:
 ; SSE3: # BB#0:
-; SSE3-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,0],xmm0[0,0]
-; SSE3-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,2],xmm0[2,1]
-; SSE3-NEXT: movaps %xmm1, %xmm0
+; SSE3-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
+; SSE3-NEXT: movdqa %xmm1, %xmm0
 ; SSE3-NEXT: retq
 ;
 ; SSSE3-LABEL: shuffle_v4i32_40u1:
 ; SSSE3: # BB#0:
-; SSSE3-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,0],xmm0[0,0]
-; SSSE3-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,2],xmm0[2,1]
-; SSSE3-NEXT: movaps %xmm1, %xmm0
+; SSSE3-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
+; SSSE3-NEXT: movdqa %xmm1, %xmm0
 ; SSSE3-NEXT: retq
 ;
 ; SSE41-LABEL: shuffle_v4i32_40u1:
 ; SSE41: # BB#0:
-; SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
-; SSE41-NEXT: pblendw {{.*#+}} xmm0 = xmm1[0,1],xmm0[2,3,4,5,6,7]
+; SSE41-NEXT: punpckldq {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
+; SSE41-NEXT: movdqa %xmm1, %xmm0
 ; SSE41-NEXT: retq
 ;
 ; AVX1-LABEL: shuffle_v4i32_40u1:
 ; AVX1: # BB#0:
-; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
-; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm1[0,1],xmm0[2,3,4,5,6,7]
+; AVX1-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: shuffle_v4i32_40u1:
 ; AVX2: # BB#0:
-; AVX2-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
-; AVX2-NEXT: vpblendd {{.*#+}} xmm0 = xmm1[0],xmm0[1,2,3]
+; AVX2-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
 ; AVX2-NEXT: retq
   %shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 4, i32 0, i32 undef, i32 1>
   ret <4 x i32> %shuffle
 }

 define <4 x i32> @shuffle_v4i32_3456(<4 x i32> %a, <4 x i32> %b) {
 ; SSE2-LABEL: shuffle_v4i32_3456:
 ; SSE2: # BB#0:
[Remaining 568 lines not shown]

test/CodeGen/X86/vector-shuffle-256-v4.ll

[First 354 lines not shown]
 ; AVX2-NEXT: retq
   %shuffle = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
   ret <4 x double> %shuffle
 }

 define <4 x double> @shuffle_v4f64_u062(<4 x double> %a, <4 x double> %b) {
 ; ALL-LABEL: shuffle_v4f64_u062:
 ; ALL: # BB#0:
-; ALL-NEXT: vshufpd {{.*#+}} ymm0 = ymm1[0],ymm0[0],ymm1[2],ymm0[2]
+; ALL-NEXT: vunpcklpd {{.*#+}} ymm0 = ymm1[0],ymm0[0],ymm1[2],ymm0[2]
 ; ALL-NEXT: retq
   %shuffle = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 undef, i32 0, i32 6, i32 2>
   ret <4 x double> %shuffle
 }

 define <4 x i64> @shuffle_v4i64_0000(<4 x i64> %a, <4 x i64> %b) {
 ; AVX1-LABEL: shuffle_v4i64_0000:
 ; AVX1: # BB#0:
[396 lines not shown; context: ; AVX2-NEXT: retq]
   %shuffle = shufflevector <4 x i64> zeroinitializer, <4 x i64> %a, <4 x i32> <i32 0, i32 4, i32 0, i32 6>
   ret <4 x i64> %shuffle
 }

 define <4 x i64> @shuffle_v4i64_5zuz(<4 x i64> %a) {
 ; AVX1-LABEL: shuffle_v4i64_5zuz:
 ; AVX1: # BB#0:
 ; AVX1-NEXT: vxorpd %ymm1, %ymm1, %ymm1
-; AVX1-NEXT: vshufpd {{.*#+}} ymm0 = ymm0[1],ymm1[1],ymm0[2],ymm1[3]
+; AVX1-NEXT: vunpckhpd {{.*#+}} ymm0 = ymm0[1],ymm1[1],ymm0[3],ymm1[3]
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: shuffle_v4i64_5zuz:
 ; AVX2: # BB#0:
 ; AVX2-NEXT: vpsrldq {{.*#+}} ymm0 = ymm0[8,9,10,11,12,13,14,15],zero,zero,zero,zero,zero,zero,zero,zero,ymm0[24,25,26,27,28,29,30,31],zero,zero,zero,zero,zero,zero,zero,zero
 ; AVX2-NEXT: retq
   %shuffle = shufflevector <4 x i64> zeroinitializer, <4 x i64> %a, <4 x i32> <i32 5, i32 0, i32 undef, i32 0>
   ret <4 x i64> %shuffle
 }

 define <4 x i64> @shuffle_v4i64_40u2(<4 x i64> %a, <4 x i64> %b) {
 ; AVX1-LABEL: shuffle_v4i64_40u2:
 ; AVX1: # BB#0:
-; AVX1-NEXT: vshufpd {{.*#+}} ymm0 = ymm1[0],ymm0[0],ymm1[2],ymm0[2]
+; AVX1-NEXT: vunpcklpd {{.*#+}} ymm0 = ymm1[0],ymm0[0],ymm1[2],ymm0[2]
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: shuffle_v4i64_40u2:
 ; AVX2: # BB#0:
-; AVX2-NEXT: vpshufd {{.*#+}} ymm0 = ymm0[0,1,0,1,4,5,4,5]
-; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm1[0,1],ymm0[2,3,4,5,6,7]
+; AVX2-NEXT: vpunpcklqdq {{.*#+}} ymm0 = ymm1[0],ymm0[0],ymm1[2],ymm0[2]
 ; AVX2-NEXT: retq
   %shuffle = shufflevector <4 x i64> %a, <4 x i64> %b, <4 x i32> <i32 4, i32 0, i32 undef, i32 2>
   ret <4 x i64> %shuffle
 }

 define <4 x i64> @stress_test1(<4 x i64> %a, <4 x i64> %b) {
 ; ALL-LABEL: stress_test1:
 ; ALL: retq
[Remaining 122 lines not shown]

test/CodeGen/X86/vector-shuffle-256-v8.ll

[First 809 lines not shown]
 ; ALL: # BB#0:
 ; ALL-NEXT: vblendpd {{.*#+}} ymm0 = ymm1[0,1],ymm0[2,3]
 ; ALL-NEXT: vpermilps {{.*#+}} ymm0 = ymm0[3,2,1,0,7,6,5,4]
 ; ALL-NEXT: retq
   %shuffle = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 11, i32 10, i32 9, i32 8, i32 7, i32 6, i32 5, i32 4>
   ret <8 x float> %shuffle
 }

-define <8 x float> @shuffle_v8f32_80u1b4uu(<8 x float> %a, <8 x float> %b) {
-; ALL-LABEL: shuffle_v8f32_80u1b4uu:
+define <8 x float> @shuffle_v8f32_80u1c4u5(<8 x float> %a, <8 x float> %b) {
+; ALL-LABEL: shuffle_v8f32_80u1c4u5:
 ; ALL: # BB#0:
-; ALL-NEXT: vshufps {{.*#+}} ymm1 = ymm1[0,0],ymm0[0,0],ymm1[4,4],ymm0[4,4]
-; ALL-NEXT: vshufps {{.*#+}} ymm0 = ymm1[0,2],ymm0[2,1],ymm1[4,6],ymm0[6,5]
+; ALL-NEXT: vunpcklps {{.*#+}} ymm0 = ymm1[0],ymm0[0],ymm1[1],ymm0[1],ymm1[4],ymm0[4],ymm1[5],ymm0[5]
 ; ALL-NEXT: retq
-  %shuffle = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 8, i32 0, i32 undef, i32 1, i32 12, i32 4, i32 undef, i32 undef>
+  %shuffle = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 8, i32 0, i32 undef, i32 1, i32 12, i32 4, i32 undef, i32 5>
+  ret <8 x float> %shuffle
+}
+
+define <8 x float> @shuffle_v8f32_a2u3e6f7(<8 x float> %a, <8 x float> %b) {
+; ALL-LABEL: shuffle_v8f32_a2u3e6f7:
+; ALL: # BB#0:
+; ALL-NEXT: vunpckhps {{.*#+}} ymm0 = ymm1[2],ymm0[2],ymm1[3],ymm0[3],ymm1[6],ymm0[6],ymm1[7],ymm0[7]
+; ALL-NEXT: retq
+  %shuffle = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 10, i32 2, i32 undef, i32 3, i32 14, i32 6, i32 15, i32 7>
   ret <8 x float> %shuffle
 }

 define <8 x i32> @shuffle_v8i32_00000000(<8 x i32> %a, <8 x i32> %b) {
 ; AVX1-LABEL: shuffle_v8i32_00000000:
 ; AVX1: # BB#0:
 ; AVX1-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[0,0,0,0]
 ; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
[1,044 lines not shown]
 ; AVX2-NEXT: retq
   %shuffle = shufflevector <8 x i32> zeroinitializer, <8 x i32> %a, <8 x i32> <i32 9, i32 undef, i32 11, i32 0, i32 13, i32 14, i32 15, i32 0>
   ret <8 x i32> %shuffle
 }

 define <8 x i32> @shuffle_v8i32_80u1b4uu(<8 x i32> %a, <8 x i32> %b) {
 ; AVX1-LABEL: shuffle_v8i32_80u1b4uu:
 ; AVX1: # BB#0:
-; AVX1-NEXT: vshufps {{.*#+}} ymm1 = ymm1[0,0],ymm0[0,0],ymm1[4,4],ymm0[4,4]
-; AVX1-NEXT: vshufps {{.*#+}} ymm0 = ymm1[0,2],ymm0[2,1],ymm1[4,6],ymm0[6,5]
+; AVX1-NEXT: vunpcklps {{.*#+}} ymm0 = ymm1[0],ymm0[0],ymm1[1],ymm0[1],ymm1[4],ymm0[4],ymm1[5],ymm0[5]
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: shuffle_v8i32_80u1b4uu:
 ; AVX2: # BB#0:
-; AVX2-NEXT: vpshufd {{.*#+}} ymm0 = ymm0[0,0,2,1,4,4,6,5]
-; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm1[0],ymm0[1,2,3],ymm1[4],ymm0[5,6,7]
+; AVX2-NEXT: vpunpckldq {{.*#+}} ymm0 = ymm1[0],ymm0[0],ymm1[1],ymm0[1],ymm1[4],ymm0[4],ymm1[5],ymm0[5]
 ; AVX2-NEXT: retq
   %shuffle = shufflevector <8 x i32> %a, <8 x i32> %b, <8 x i32> <i32 8, i32 0, i32 undef, i32 1, i32 12, i32 4, i32 undef, i32 undef>
   ret <8 x i32> %shuffle
 }

 define <8 x float> @splat_mem_v8f32_2(float* %p) {
 ; ALL-LABEL: splat_mem_v8f32_2:
 ; ALL: # BB#0:
[Remaining 188 lines not shown]