This is an archive of the discontinued LLVM Phabricator instance.

try to lowerVectorShuffleAsElementInsertion() for all 256-bit vector sub-types [X86, AVX]
Closed Public

Authored by spatel on Mar 14 2015, 8:02 AM.

Details

Reviewers

chandlerc
andreadb
craig.topper

Commits

rG2ae994388138: [X86, AVX] try to lowerVectorShuffleAsElementInsertion() for all 256-bit vector…
rL233704: [X86, AVX] try to lowerVectorShuffleAsElementInsertion() for all 256-bit vector…

Summary

I suggested this change in D7898: hoist the lowerVectorShuffleAsElementInsertion() call into lower256BitVectorShuffle().

It improves the v4i64 case although not optimally. This AVX codegen:

vmovq {{.*#+}} xmm0 = mem[0],zero
vxorpd %ymm1, %ymm1, %ymm1
vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3]

Becomes:

vmovsd {{.*#+}} xmm0 = mem[0],zero

Unfortunately, this doesn't completely solve PR22685. There are still at least 2 problems under here:

We're not handling v32i8 / v16i16.
We're not getting the FP / int domains right for instruction selection.

But since this patch alone appears to do no harm, reduces code duplication, and helps v4i64, I'm submitting this patch ahead of fixing the above.
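
The precondition that the patch hoists into lower256BitVectorShuffle() can be sketched outside of LLVM. This is a minimal illustration, not the LLVM implementation: the helper name isElementInsertionCandidate is hypothetical, and it assumes the usual two-input mask convention in which indices below the element count select from V1, indices at or above it select from V2, and -1 means undef.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-alone helper (not an LLVM API). Returns true when
// exactly one mask element comes from V2 and that element lands in lane 0,
// which is the condition the patch tests before it tries
// lowerVectorShuffleAsElementInsertion().
// Assumed mask convention: -1 = undef, [0, NumElts) selects from V1,
// [NumElts, 2 * NumElts) selects from V2.
bool isElementInsertionCandidate(const std::vector<int> &Mask) {
  assert(!Mask.empty() && "Expected a non-empty shuffle mask");
  const int NumElts = static_cast<int>(Mask.size());
  const auto NumV2Elements = std::count_if(
      Mask.begin(), Mask.end(), [NumElts](int M) { return M >= NumElts; });
  return NumV2Elements == 1 && Mask[0] >= NumElts;
}
```

Because the element count is taken from the mask rather than hard-coded as 4 or 8, the same check covers every 256-bit sub-type, which is the point of moving it out of the per-type lowering routines.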

Event Timeline

spatel updated this revision to Diff 21985. Mar 14 2015, 8:02 AM

spatel retitled this revision to "try to lowerVectorShuffleAsElementInsertion() for all 256-bit vector sub-types [X86, AVX]".

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: craig.topper, chandlerc, andreadb.

spatel added a subscriber: Unknown Object (MLST).

Herald added a subscriber: aemerson. Mar 14 2015, 8:02 AM

Ping.

Ping * 2.

Hi Sanjay,

test/CodeGen/X86/vector-shuffle-256-v4.ll
830	So, this is what you meant when you said that we don't get the correct fp/int domain. In X86InstrSSE.td we have patterns like this:

def : Pat<(v4i64 (X86vzmovl (insert_subvector undef,
                 (v2i64 (scalar_to_vector (loadi64 addr:$src))), (iPTR 0)))),
          (SUBREG_TO_REG (i32 0), (VMOVSDrm addr:$src), sub_xmm)>;

Do you plan to send a follow-up patch to fix the tablegen patterns so that VMOVQI2PQIrm is used instead of VMOVSDrm for the integer domain? If that's the case, then it makes sense to commit this patch first and fix the fp/int domain issue in a separate patch.
test/CodeGen/X86/vector-shuffle-256-v8.ll
134–137	This has nothing to do with your patch; however, I am surprised that we get this long sequence of instructions on AVX2 instead of just a single 'vmovaps' plus 'vpermd'. Here, %ymm1 is used to store the 'vpermd' permute mask. That mask is known at compile time (it is the vector <7,0,0,0,0,0,0,0>), so we could just load it from the constant pool instead of computing it at runtime. I think we could replace this entire sequence with a load from the constant pool followed by a 'vpermd'.
963–967	Same here.
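
To make the constant-mask observation concrete, here is a small scalar model of the permute, written as a sketch rather than as anything taken from LLVM. The names Ymm, kPermuteMask, and permuteLanes are made up for illustration; the only hardware behavior assumed is that each result lane is selected by the low three bits of the corresponding index lane.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Eight 32-bit lanes standing in for a 256-bit register (illustrative type).
using Ymm = std::array<std::uint32_t, 8>;

// The permute indices for the shuffle <7,0,0,0,0,0,0,0>. They are fixed at
// compile time, so nothing about them needs to be computed at runtime.
constexpr Ymm kPermuteMask = {7, 0, 0, 0, 0, 0, 0, 0};

// Scalar model of a cross-lane dword permute: lane I of the result is the
// source lane selected by the low three bits of index lane I.
Ymm permuteLanes(const Ymm &Src, const Ymm &Idx) {
  Ymm Result{};
  for (std::size_t I = 0; I < Result.size(); ++I)
    Result[I] = Src[Idx[I] & 7u];
  return Result;
}
```

With a source of {10, 11, ..., 17}, permuteLanes(Src, kPermuteMask) puts 17 in lane 0 and 10 in every other lane, matching the shuffle_v8i32_70000000 mask.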

spatel added inline comments. Mar 30 2015, 10:22 AM

test/CodeGen/X86/vector-shuffle-256-v4.ll
830	Hi Andrea - That's correct. I saw a couple of places where we didn't have the right tablegen patterns. And I had a patch for it somewhere...but I'm not finding it now. But it was just simple replacements to substitute the right type like what you have noted here.
test/CodeGen/X86/vector-shuffle-256-v8.ll
134–137	Interesting - it's not entirely unrelated because the permute mask itself could be viewed as a zero-extended vector, right? I've filed this as: https://llvm.org/bugs/show_bug.cgi?id=23073

andreadb added inline comments. Mar 30 2015, 10:41 AM

test/CodeGen/X86/vector-shuffle-256-v8.ll
134–137	Right,

movl $7, %eax
vmovd %eax, %xmm1
vxorps %ymm2, %ymm2, %ymm2
vblendps {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3,4,5,6,7]

is basically equivalent to:

movl $7, %eax
vmovd %eax, %xmm1

Bits [VLMAX-1:32] would be implicitly zeroed.
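
The equivalence claimed in this comment can be checked with a scalar model. This is only a sketch under the assumption stated in the comment itself, namely that the VEX-encoded vmovd zeroes every bit above bit 31 of the destination; the names Ymm, vmovdFromGpr, blendLane0, and blendIsRedundant are invented for the example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Eight 32-bit lanes standing in for a 256-bit register (illustrative type).
using Ymm = std::array<std::uint32_t, 8>;

// Models vmovd from a general-purpose register under the assumption above:
// lane 0 receives the value and all upper lanes are zeroed.
Ymm vmovdFromGpr(std::uint32_t Value) {
  Ymm Result{}; // value-initialized, so every lane starts at zero
  Result[0] = Value;
  return Result;
}

// Models a blend that keeps lane 0 of A and takes lanes 1..7 from B.
Ymm blendLane0(const Ymm &A, const Ymm &B) {
  Ymm Result = B;
  Result[0] = A[0];
  return Result;
}

// True when blending the vmovd result with an all-zero register changes
// nothing, i.e. when the xor + blend pair is redundant.
bool blendIsRedundant(std::uint32_t Value) {
  const Ymm Moved = vmovdFromGpr(Value);
  const Ymm Zero{};
  return blendLane0(Moved, Zero) == Moved;
}
```

In this model blendIsRedundant returns true for any input value, which is the reason the two-instruction sequence is sufficient.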

andreadb accepted this revision. Mar 31 2015, 8:04 AM

andreadb edited edge metadata.

This revision is now accepted and ready to land. Mar 31 2015, 8:04 AM

Closed by commit rL233704: [X86, AVX] try to lowerVectorShuffleAsElementInsertion() for all 256-bit vector… (authored by spatel). Mar 31 2015, 9:35 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                         Size
lib/Target/X86/X86ISelLowering.cpp           30 lines
test/CodeGen/X86/2012-1-10-buildvector.ll    13 lines
test/CodeGen/X86/vector-shuffle-256-v4.ll    12 lines
test/CodeGen/X86/vector-shuffle-256-v8.ll    8 lines

Diff 21985

lib/Target/X86/X86ISelLowering.cpp

[... 9,203 lines not shown ...]
   if (isShuffleEquivalent(V1, V2, Mask, {0, 4, 2, 6}))
     return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v4f64, V1, V2);
   if (isShuffleEquivalent(V1, V2, Mask, {1, 5, 3, 7}))
     return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v4f64, V1, V2);
   if (isShuffleEquivalent(V1, V2, Mask, {4, 0, 6, 2}))
     return DAG.getNode(X86ISD::UNPCKL, DL, MVT::v4f64, V2, V1);
   if (isShuffleEquivalent(V1, V2, Mask, {5, 1, 7, 3}))
     return DAG.getNode(X86ISD::UNPCKH, DL, MVT::v4f64, V2, V1);

-  // If we have a single input to the zero element, insert that into V1 if we
-  // can do so cheaply.
-  int NumV2Elements =
-      std::count_if(Mask.begin(), Mask.end(), [](int M) { return M >= 4; });
-  if (NumV2Elements == 1 && Mask[0] >= 4)
-    if (SDValue Insertion = lowerVectorShuffleAsElementInsertion(
-            DL, MVT::v4f64, V1, V2, Mask, Subtarget, DAG))
-      return Insertion;
-
   if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v4f64, V1, V2, Mask,
                                                 Subtarget, DAG))
     return Blend;

   // Check if the blend happens to exactly fit that of SHUFPD.
   if ((Mask[0] == -1 || Mask[0] < 2) &&
       (Mask[1] == -1 || (Mask[1] >= 4 && Mask[1] < 6)) &&
       (Mask[2] == -1 || (Mask[2] >= 2 && Mask[2] < 4)) &&
[... 126 lines not shown ...]
 static SDValue lowerV8F32VectorShuffle(SDValue Op, SDValue V1, SDValue V2,
                                        SelectionDAG &DAG) {
   SDLoc DL(Op);
   assert(V1.getSimpleValueType() == MVT::v8f32 && "Bad operand type!");
   assert(V2.getSimpleValueType() == MVT::v8f32 && "Bad operand type!");
   ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(Op);
   ArrayRef<int> Mask = SVOp->getMask();
   assert(Mask.size() == 8 && "Unexpected mask size for v8 shuffle!");

-  // If we have a single input to the zero element, insert that into V1 if we
-  // can do so cheaply.
-  int NumV2Elements =
-      std::count_if(Mask.begin(), Mask.end(), [](int M) { return M >= 8; });
-  if (NumV2Elements == 1 && Mask[0] >= 8)
-    if (SDValue Insertion = lowerVectorShuffleAsElementInsertion(
-            DL, MVT::v8f32, V1, V2, Mask, Subtarget, DAG))
-      return Insertion;
-
   if (SDValue Blend = lowerVectorShuffleAsBlend(DL, MVT::v8f32, V1, V2, Mask,
                                                 Subtarget, DAG))
     return Blend;

   // Check for being able to broadcast a single element.
   if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(DL, MVT::v8f32, V1,
                                                         Mask, Subtarget, DAG))
     return Broadcast;
[... 354 lines not shown ...]
 /// together based on the available instructions.
 static SDValue lower256BitVectorShuffle(SDValue Op, SDValue V1, SDValue V2,
                                         MVT VT, const X86Subtarget *Subtarget,
                                         SelectionDAG &DAG) {
   SDLoc DL(Op);
   ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(Op);
   ArrayRef<int> Mask = SVOp->getMask();

+  // If we have a single input to the zero element, insert that into V1 if we
+  // can do so cheaply.
+  int NumElts = VT.getVectorNumElements();
+  int NumV2Elements = std::count_if(Mask.begin(), Mask.end(), [NumElts](int M) {
+    return M >= NumElts;
+  });
+
+  if (NumV2Elements == 1 && Mask[0] >= NumElts)
+    if (SDValue Insertion = lowerVectorShuffleAsElementInsertion(
+            DL, VT, V1, V2, Mask, Subtarget, DAG))
+      return Insertion;
+
   // There is a really nice hard cut-over between AVX1 and AVX2 that means we can
   // check for those subtargets here and avoid much of the subtarget querying in
   // the per-vector-type lowering routines. With AVX1 we have essentially zero
   // ability to manipulate a 256-bit vector with integer types. Since we'll use
   // floating point types there eventually, just immediately cast everything to
   // a float and operate entirely in that domain.
   if (VT.isInteger() && !Subtarget->hasAVX2()) {
     int ElementBits = VT.getScalarSizeInBits();
[... 14,786 lines not shown ...]

test/CodeGen/X86/2012-1-10-buildvector.ll

 ; RUN: llc < %s -march=x86 -mcpu=corei7-avx -mattr=+avx -mtriple=i686-pc-win32 | FileCheck %s

 target datalayout = "e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-f80:128:128-v64:64:64-v128:128:128-a0:0:64-f80:32:32-n8:16:32-S32"
 target triple = "i686-pc-win32"

-;CHECK-LABEL: bad_cast:
+; CHECK-LABEL: bad_cast:
 define void @bad_cast() {
 entry:
   %vext.i = shufflevector <2 x i64> undef, <2 x i64> undef, <3 x i32> <i32 0, i32 1, i32 undef>
   %vecinit8.i = shufflevector <3 x i64> zeroinitializer, <3 x i64> %vext.i, <3 x i32> <i32 0, i32 3, i32 4>
   store <3 x i64> %vecinit8.i, <3 x i64>* undef, align 32
-;CHECK: ret
+; CHECK: ret
   ret void
 }


-;CHECK-LABEL: bad_insert:
+; CHECK-LABEL: bad_insert:
 define void @bad_insert(i32 %t) {
 entry:
-;CHECK: vxorps %ymm1, %ymm1, %ymm1
-;CHECK-NEXT: vblendps {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3,4,5,6,7]
+; CHECK: vmovss {{.*#+}} xmm0 = mem[0],zero,zero,zero
+; CHECK-NEXT: vmovaps %ymm0
+; CHECK: ret

   %v2 = insertelement <8 x i32> zeroinitializer, i32 %t, i32 0
   store <8 x i32> %v2, <8 x i32> addrspace(1)* undef, align 32
-;CHECK: ret
   ret void
 }

test/CodeGen/X86/vector-shuffle-256-v4.ll

[... 807 lines not shown ...]
 ; ALL: retq

   ret <4 x i64> %f
 }

 define <4 x i64> @insert_reg_and_zero_v4i64(i64 %a) {
 ; AVX1-LABEL: insert_reg_and_zero_v4i64:
 ; AVX1: # BB#0:
 ; AVX1-NEXT: vmovq %rdi, %xmm0
-; AVX1-NEXT: vxorpd %ymm1, %ymm1, %ymm1
-; AVX1-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3]
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: insert_reg_and_zero_v4i64:
 ; AVX2: # BB#0:
 ; AVX2-NEXT: vmovq %rdi, %xmm0
-; AVX2-NEXT: vpxor %ymm1, %ymm1, %ymm1
-; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1],ymm1[2,3,4,5,6,7]
 ; AVX2-NEXT: retq
   %v = insertelement <4 x i64> undef, i64 %a, i64 0
   %shuffle = shufflevector <4 x i64> %v, <4 x i64> zeroinitializer, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
   ret <4 x i64> %shuffle
 }

 define <4 x i64> @insert_mem_and_zero_v4i64(i64* %ptr) {
 ; AVX1-LABEL: insert_mem_and_zero_v4i64:
 ; AVX1: # BB#0:
-; AVX1-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
-; AVX1-NEXT: vxorpd %ymm1, %ymm1, %ymm1
-; AVX1-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3]
+; AVX1-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: insert_mem_and_zero_v4i64:
 ; AVX2: # BB#0:
-; AVX2-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero
-; AVX2-NEXT: vpxor %ymm1, %ymm1, %ymm1
-; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1],ymm1[2,3,4,5,6,7]
+; AVX2-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero
 ; AVX2-NEXT: retq
   %a = load i64, i64* %ptr
   %v = insertelement <4 x i64> undef, i64 %a, i64 0
   %shuffle = shufflevector <4 x i64> %v, <4 x i64> zeroinitializer, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
   ret <4 x i64> %shuffle
 }

 define <4 x double> @insert_reg_and_zero_v4f64(double %a) {
[... 92 lines not shown ...]

test/CodeGen/X86/vector-shuffle-256-v8.ll

[... 125 lines not shown ...]
 ; AVX1: # BB#0:
 ; AVX1-NEXT: vperm2f128 {{.*#+}} ymm1 = ymm0[2,3,0,1]
 ; AVX1-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3]
 ; AVX1-NEXT: vpermilps {{.*#+}} ymm0 = ymm0[3,0,0,0,4,4,4,4]
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: shuffle_v8f32_70000000:
 ; AVX2: # BB#0:
 ; AVX2-NEXT: movl $7, %eax
 ; AVX2-NEXT: vmovd %eax, %xmm1
-; AVX2-NEXT: vpxor %ymm2, %ymm2, %ymm2
-; AVX2-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3,4,5,6,7]
+; AVX2-NEXT: vxorps %ymm2, %ymm2, %ymm2
+; AVX2-NEXT: vblendps {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3,4,5,6,7]
 ; AVX2-NEXT: vpermps %ymm0, %ymm1, %ymm0
 ; AVX2-NEXT: retq
   %shuffle = shufflevector <8 x float> %a, <8 x float> %b, <8 x i32> <i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
   ret <8 x float> %shuffle
 }

 define <8 x float> @shuffle_v8f32_01014545(<8 x float> %a, <8 x float> %b) {
 ; ALL-LABEL: shuffle_v8f32_01014545:
[... 809 lines not shown ...]
 ; AVX1: # BB#0:
 ; AVX1-NEXT: vperm2f128 {{.*#+}} ymm1 = ymm0[2,3,0,1]
 ; AVX1-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3]
 ; AVX1-NEXT: vpermilps {{.*#+}} ymm0 = ymm0[3,0,0,0,4,4,4,4]
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: shuffle_v8i32_70000000:
 ; AVX2: # BB#0:
 ; AVX2-NEXT: movl $7, %eax
 ; AVX2-NEXT: vmovd %eax, %xmm1
-; AVX2-NEXT: vpxor %ymm2, %ymm2, %ymm2
-; AVX2-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3,4,5,6,7]
+; AVX2-NEXT: vxorps %ymm2, %ymm2, %ymm2
+; AVX2-NEXT: vblendps {{.*#+}} ymm1 = ymm1[0],ymm2[1,2,3,4,5,6,7]
 ; AVX2-NEXT: vpermd %ymm0, %ymm1, %ymm0
 ; AVX2-NEXT: retq
   %shuffle = shufflevector <8 x i32> %a, <8 x i32> %b, <8 x i32> <i32 7, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
   ret <8 x i32> %shuffle
 }

 define <8 x i32> @shuffle_v8i32_01014545(<8 x i32> %a, <8 x i32> %b) {
 ; AVX1-LABEL: shuffle_v8i32_01014545:
 ; AVX1: # BB#0:
[... 1,117 lines not shown ...]