This is an archive of the discontinued LLVM Phabricator instance.

[X86] Don't create VBROADCAST nodes with 256-bit or 512-bit input types
Closed, Public

Authored by craig.topper on Jan 15 2017, 11:50 AM.

Details

Reviewers

RKSimon
zvi
delena

Commits

rGfbc7805e252e: [X86] Don't create VBROADCAST nodes with 256-bit or 512-bit input types
rL295155: [X86] Don't create VBROADCAST nodes with 256-bit or 512-bit input types

Summary

We don't seem to have great rules on what a valid VBROADCAST node looks like. And as a consequence we end up with a lot of patterns to try to catch everything. We have patterns with scalar inputs, 128-bit vector inputs, 256-bit vector inputs, and 512-bit vector inputs.

As you can see from the things improved here we are currently missing patterns for 128-bit loads being extended to 256-bit before the vbroadcast.
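To make the affected cases concrete: the widened-broadcast.ll tests touched by this patch load 128 bits and repeat a low prefix of the loaded elements across a 256-bit result. The standalone sketch below is not LLVM code; `splatPrefixBits` is a hypothetical helper, introduced only for illustration, that takes a shuffle mask and an element width and reports how wide the repeated prefix is, which is the width a single broadcast would have to cover.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of LLVM. Returns the width in bits of the
// shortest power-of-two prefix <0, 1, ..., Len-1> that the mask repeats end
// to end, or 0 if the mask is not such a repetition.
unsigned splatPrefixBits(const std::vector<int> &Mask, unsigned EltBits) {
  for (std::size_t Len = 1; Len < Mask.size(); Len *= 2) {
    bool Repeats = true;
    for (std::size_t I = 0; I < Mask.size() && Repeats; ++I)
      Repeats = Mask[I] == static_cast<int>(I % Len);
    if (Repeats)
      return static_cast<unsigned>(Len) * EltBits;
  }
  return 0;
}
```

For the `<0, 1, 0, 1, 0, 1, 0, 1>` mask on 32-bit elements in `load_splat_8f32_4f32_01010101`, this reports a 64-bit prefix, which lines up with the `vbroadcastsd` in the updated AVX2 and AVX512 checks.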

I'd like to propose that VBROADCAST should always take a 128-bit vector type as input. As a first step towards that this patch adds an EXTRACT_SUBVECTOR in front of VBROADCAST when the input is 256 or 512-bits. In the future I would like to add scalar_to_vector around all the scalar operations. And maybe we should consider adding a VBROADCAST+load node to avoid separating loads from the broadcasting operation when the load itself isn't foldable.
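The proposed rule can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyVec`, `extractLow128`, and `toyBroadcast` are hypothetical names that do not exist in LLVM, and the model ignores scalar inputs and value types entirely. It shows the one idea in this patch: narrow a 256-bit or 512-bit source to its low 128 bits before broadcasting element 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for a vector value; not an LLVM type.
struct ToyVec {
  unsigned EltBits;                 // width of one lane in bits
  std::vector<std::uint64_t> Lanes; // lane values, lane 0 first

  unsigned sizeInBits() const {
    return EltBits * static_cast<unsigned>(Lanes.size());
  }
};

// Keep only the low 128 bits, playing the role of an EXTRACT_SUBVECTOR at
// index 0.
ToyVec extractLow128(const ToyVec &V) {
  std::size_t NumLanes = 128 / V.EltBits;
  assert(NumLanes <= V.Lanes.size() && "source narrower than 128 bits");
  return ToyVec{V.EltBits,
                std::vector<std::uint64_t>(
                    V.Lanes.begin(),
                    V.Lanes.begin() + static_cast<std::ptrdiff_t>(NumLanes))};
}

// Broadcast lane 0 to NumElts lanes. Wide sources are narrowed first, so the
// broadcast step itself only ever sees a 128-bit input.
ToyVec toyBroadcast(ToyVec Src, std::size_t NumElts) {
  if (Src.sizeInBits() > 128)
    Src = extractLow128(Src);
  assert(Src.sizeInBits() == 128 && "toy model only handles vector inputs");
  return ToyVec{Src.EltBits,
                std::vector<std::uint64_t>(NumElts, Src.Lanes[0])};
}
```

Because lane 0 always lives in the low 128 bits, the narrowing step cannot change the broadcast result; it only reduces the number of input shapes the broadcast has to accept.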

This requires an additional change in target shuffle combining to look for the extract subvector and look through it to find the original operand. I'm sure this change isn't perfect but was enough to fix a few test failures that were being caused.
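As a rough sketch of that look-through, the toy below mirrors the shape of the check added to the VBROADCAST case in the shuffle decoding code. The types and the function `decodableBroadcastSource` are hypothetical and compare only sizes, whereas the real code compares value types on SDValue operands.

```cpp
#include <cassert>

// Toy DAG node; these names are illustrative and not LLVM's.
enum class ToyOpcode { Load, ExtractSubvector, Broadcast };

struct ToyNode {
  ToyOpcode Opc;
  unsigned SizeInBits;    // size of the value this node produces
  const ToyNode *Operand; // single input, or nullptr for a leaf
  unsigned ExtractIndex;  // only meaningful for ExtractSubvector
};

// Returns the node a shuffle combiner could treat as the broadcast's input,
// or nullptr if the broadcast cannot be decoded in this model.
const ToyNode *decodableBroadcastSource(const ToyNode &Bcast) {
  const ToyNode *N0 = Bcast.Operand;
  if (!N0)
    return nullptr;
  // Broadcast of index 0 of an extract from a full-width value: look through
  // the extract and use the pre-extracted value.
  if (N0->Opc == ToyOpcode::ExtractSubvector && N0->Operand &&
      N0->Operand->SizeInBits == Bcast.SizeInBits && N0->ExtractIndex == 0)
    return N0->Operand;
  // Otherwise only same-sized inputs are decodable.
  if (N0->SizeInBits == Bcast.SizeInBits)
    return N0;
  return nullptr;
}
```

In this model a 256-bit broadcast fed by an extract of the low 128 bits of a 256-bit value resolves to that 256-bit value, while an extract at a non-zero index or a plain 128-bit input is rejected.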

Another interesting thing I noticed is that the changes in masked_gather_scatter.ll show cases where we don't remove a useless insert into element 1 before broadcasting element 0.
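The reason the insert is useless can be shown with a small standalone check. This is a sketch with made-up helper names (`insertLane`, `broadcastLane0`, `insertIsDeadForBroadcast`), not anything from LLVM: a broadcast of element 0 reads only lane 0, so a write to lane 1 cannot affect its result.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec4 = std::array<std::uint64_t, 4>;
using Vec8 = std::array<std::uint64_t, 8>;

// Return a copy of V with lane Idx replaced by X (a toy insertelement).
Vec4 insertLane(Vec4 V, std::size_t Idx, std::uint64_t X) {
  V.at(Idx) = X;
  return V;
}

// Splat lane 0 of V across a wider result (a toy broadcast of element 0).
Vec8 broadcastLane0(const Vec4 &V) {
  Vec8 R;
  R.fill(V[0]);
  return R;
}

// True when, for these particular values, inserting X into lane Idx leaves
// the broadcast result unchanged.
bool insertIsDeadForBroadcast(const Vec4 &V, std::size_t Idx,
                              std::uint64_t X) {
  return broadcastLane0(insertLane(V, Idx, X)) == broadcastLane0(V);
}
```

An insert into lane 1 is always dead here, while an insert of a different value into lane 0 is not, which is the distinction a combine would need to make before dropping the `vpinsrq`/`vpinsrd` seen in test14.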

Diff Detail

Repository: rL LLVM

Event Timeline

craig.topper updated this revision to Diff 84497. Jan 15 2017, 11:50 AM

craig.topper retitled this revision from to [X86] Don't create VBROADCAST nodes with 256-bit or 512-bit input types.

craig.topper updated this object.

craig.topper added reviewers: delena, zvi, RKSimon.

craig.topper added a subscriber: llvm-commits.

I think we've spoken in the past of splitting broadcast (and subvector broadcast) into memory and register opcode variants.

lib/Target/X86/X86ISelLowering.cpp
5488 ↗	(On Diff #84497)	I've got the horrid feeling this will cause problems for combineX86ShufflesRecursively which decodes shuffles but at the moment asserts if the inputs are not the same size as the shuffle result itself.
test/CodeGen/X86/widened-broadcast.ll
128 ↗	(On Diff #84497)	Any idea why AVX1 fails?

What became of previous conversations about splitting the opcodes?

lib/Target/X86/X86ISelLowering.cpp
5488 ↗	(On Diff #84497)	Does the fact that I put a node with the matching type into the Ops vector not help?
test/CodeGen/X86/widened-broadcast.ll
128 ↗	(On Diff #84497)	Probably something to do with the lowering as broadcast only supporting integer types with AVX2. But that's just a guess.

zvi added inline comments. Jan 16 2017, 2:27 PM

lib/Target/X86/X86ISelLowering.cpp
5479 ↗	(On Diff #84497)	EXTRACT_SUBVECTOR's Index operand must be a constant, so you don't need a dyn_cast<>.
5484 ↗	(On Diff #84497)	You can drop the outer braces.
5487 ↗	(On Diff #84497)	Comment needs to be updated.
9677 ↗	(On Diff #84497)	Can you please add here a comment explaining why?

RKSimon added inline comments. Jan 19 2017, 6:57 AM

lib/Target/X86/X86ISelLowering.cpp
5488 ↗	(On Diff #84497)	My mistake - yes that should work.

In D28747#646820, @craig.topper wrote:

What became of previous conversations about splitting the opcodes?

I think it came up after D22460 when we found some regressions. Even if we don't go the route of fully splitting broadcast/subvbroadcast into reg and mem intrinsics I'm going to look into adding broadcast support into EltsFromConsecutiveLoads soon.

What is the status of this patch?

Address previous review comments.

Herald added a subscriber: igorb. Feb 13 2017, 9:06 PM

RKSimon added inline comments. Feb 14 2017, 5:55 AM

test/CodeGen/X86/masked_gather_scatter.ll
1719 ↗	(On Diff #88299)	Current codegen should be committed to trunk if you want to show a delta, otherwise possibly just keep it as it is with ALL-NOT check?

craig.topper added inline comments. Feb 14 2017, 10:05 PM

test/CodeGen/X86/masked_gather_scatter.ll
1719 ↗	(On Diff #88299)	I just naively reran the update script when the previous patch failed to apply. I'll change it back.

Reverted portions of the scatter gather test case that the update script unnecessarily changed.

This patch LGTM. If you won't improve this patch to handle the AVX1 cases, please create a bug.

This revision is now accepted and ready to land. Feb 14 2017, 10:52 PM

Is using an fp broadcast for an integer operation for AVX1 a good idea? Is there a stack cross penalty for that on Sandy Bridge?

Closed by commit rL295155: [X86] Don't create VBROADCAST nodes with 256-bit or 512-bit input types (authored by ctopper). Feb 14 2017, 11:10 PM

This revision was automatically updated to reflect the committed changes.

In D28747#677154, @craig.topper wrote:

Is using an fp broadcast for an integer operation for AVX1 a good idea? Is there a stack cross penalty for that on Sandy Bridge?

For 256-bit cases, using broadcastss/broadcastsd is a definite win - plus nearly all the 256-bit AVX1 operations are in the fp-domain (including how we lower v8i32/v4i64 shuffles). For 128-bit cases it's less clear.

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	20 lines
llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll	12 lines
llvm/trunk/test/CodeGen/X86/vector-shuffle-avx512.ll	2 lines
llvm/trunk/test/CodeGen/X86/widened-broadcast.ll	42 lines

Diff 88492

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

Show First 20 Lines • Show All 5,388 Lines • ▼ Show 20 Lines	case X86ISD::PSHUFLW:
DecodePSHUFLWMask(VT, cast<ConstantSDNode>(ImmN)->getZExtValue(), Mask);		DecodePSHUFLWMask(VT, cast<ConstantSDNode>(ImmN)->getZExtValue(), Mask);
IsUnary = true;		IsUnary = true;
break;		break;
case X86ISD::VZEXT_MOVL:		case X86ISD::VZEXT_MOVL:
DecodeZeroMoveLowMask(VT, Mask);		DecodeZeroMoveLowMask(VT, Mask);
IsUnary = true;		IsUnary = true;
break;		break;
case X86ISD::VBROADCAST: {		case X86ISD::VBROADCAST: {
// We only decode broadcasts of same-sized vectors at the moment.		SDValue N0 = N->getOperand(0);
if (N->getOperand(0).getValueType() == VT) {		// See if we're broadcasting from index 0 of an EXTRACT_SUBVECTOR. If so,
		// add the pre-extracted value to the Ops vector.
		if (N0.getOpcode() == ISD::EXTRACT_SUBVECTOR &&
		N0.getOperand(0).getValueType() == VT &&
		N0.getConstantOperandVal(1) == 0)
		Ops.push_back(N0.getOperand(0));

		// We only decode broadcasts of same-sized vectors, unless the broadcast
		// came from an extract from the original width. If we found one, we
		// pushed it the Ops vector above.
		if (N0.getValueType() == VT || !Ops.empty()) {
DecodeVectorBroadcast(VT, Mask);		DecodeVectorBroadcast(VT, Mask);
IsUnary = true;		IsUnary = true;
break;		break;
}		}
return false;		return false;
}		}
case X86ISD::VPERMILPV: {		case X86ISD::VPERMILPV: {
IsUnary = true;		IsUnary = true;
▲ Show 20 Lines • Show All 4,317 Lines • ▼ Show 20 Lines	static SDValue lowerVectorShuffleAsBroadcast(const SDLoc &DL, MVT VT,

// 32-bit targets need to load i64 as a f64 and then bitcast the result.		// 32-bit targets need to load i64 as a f64 and then bitcast the result.
if (!Subtarget.is64Bit() && SrcVT == MVT::i64) {		if (!Subtarget.is64Bit() && SrcVT == MVT::i64) {
V = DAG.getBitcast(MVT::f64, V);		V = DAG.getBitcast(MVT::f64, V);
unsigned NumBroadcastElts = BroadcastVT.getVectorNumElements();		unsigned NumBroadcastElts = BroadcastVT.getVectorNumElements();
BroadcastVT = MVT::getVectorVT(MVT::f64, NumBroadcastElts);		BroadcastVT = MVT::getVectorVT(MVT::f64, NumBroadcastElts);
}		}

		// We only support broadcasting from 128-bit vectors to minimize the
		// number of patterns we need to deal with in isel. So extract down to
		// 128-bits.
		if (SrcVT.getSizeInBits() > 128)
		V = extract128BitVector(V, 0, DAG, DL);

return DAG.getBitcast(VT, DAG.getNode(Opcode, DL, BroadcastVT, V));		return DAG.getBitcast(VT, DAG.getNode(Opcode, DL, BroadcastVT, V));
}		}

// Check for whether we can use INSERTPS to perform the shuffle. We only use		// Check for whether we can use INSERTPS to perform the shuffle. We only use
// INSERTPS when the V1 elements are already in the correct locations		// INSERTPS when the V1 elements are already in the correct locations
// because otherwise we can just always use two SHUFPS instructions which		// because otherwise we can just always use two SHUFPS instructions which
// are much smaller to encode than a SHUFPS and an INSERTPS. We can also		// are much smaller to encode than a SHUFPS and an INSERTPS. We can also
// perform INSERTPS if a single V1 element is out of place and all V2		// perform INSERTPS if a single V1 element is out of place and all V2
▲ Show 20 Lines • Show All 25,554 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/X86/masked_gather_scatter.ll

Show First 20 Lines • Show All 708 Lines • ▼ Show 20 Lines	; SKX_32-NEXT: retl
%res = call <16 x float> @llvm.masked.gather.v16f32(<16 x float*> %gep.random, i32 4, <16 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <16 x float> undef)		%res = call <16 x float> @llvm.masked.gather.v16f32(<16 x float*> %gep.random, i32 4, <16 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <16 x float> undef)
ret <16 x float>%res		ret <16 x float>%res
}		}

; The base pointer is not splat, can't find unform base		; The base pointer is not splat, can't find unform base
define <16 x float> @test14(float* %base, i32 %ind, <16 x float*> %vec) {		define <16 x float> @test14(float* %base, i32 %ind, <16 x float*> %vec) {
; KNL_64-LABEL: test14:		; KNL_64-LABEL: test14:
; KNL_64: # BB#0:		; KNL_64: # BB#0:
; KNL_64-NEXT: vpinsrq $1, %rdi, %xmm0, %xmm1		; KNL_64-NEXT: vpinsrq $1, %rdi, %xmm0, %xmm0
; KNL_64-NEXT: vinserti32x4 $0, %xmm1, %zmm0, %zmm0
; KNL_64-NEXT: vpbroadcastq %xmm0, %zmm0		; KNL_64-NEXT: vpbroadcastq %xmm0, %zmm0
; KNL_64-NEXT: vmovd %esi, %xmm1		; KNL_64-NEXT: vmovd %esi, %xmm1
; KNL_64-NEXT: vpbroadcastd %xmm1, %ymm1		; KNL_64-NEXT: vpbroadcastd %xmm1, %ymm1
; KNL_64-NEXT: vpmovsxdq %ymm1, %zmm1		; KNL_64-NEXT: vpmovsxdq %ymm1, %zmm1
; KNL_64-NEXT: vpsllq $2, %zmm1, %zmm1		; KNL_64-NEXT: vpsllq $2, %zmm1, %zmm1
; KNL_64-NEXT: vpaddq %zmm1, %zmm0, %zmm0		; KNL_64-NEXT: vpaddq %zmm1, %zmm0, %zmm0
; KNL_64-NEXT: kxnorw %k0, %k0, %k1		; KNL_64-NEXT: kxnorw %k0, %k0, %k1
; KNL_64-NEXT: kshiftrw $8, %k1, %k2		; KNL_64-NEXT: kshiftrw $8, %k1, %k2
; KNL_64-NEXT: vgatherqps (,%zmm0), %ymm1 {%k2}		; KNL_64-NEXT: vgatherqps (,%zmm0), %ymm1 {%k2}
; KNL_64-NEXT: vgatherqps (,%zmm0), %ymm2 {%k1}		; KNL_64-NEXT: vgatherqps (,%zmm0), %ymm2 {%k1}
; KNL_64-NEXT: vinsertf64x4 $1, %ymm1, %zmm2, %zmm0		; KNL_64-NEXT: vinsertf64x4 $1, %ymm1, %zmm2, %zmm0
; KNL_64-NEXT: retq		; KNL_64-NEXT: retq
;		;
; KNL_32-LABEL: test14:		; KNL_32-LABEL: test14:
; KNL_32: # BB#0:		; KNL_32: # BB#0:
; KNL_32-NEXT: vpinsrd $1, {{[0-9]+}}(%esp), %xmm0, %xmm1		; KNL_32-NEXT: vpinsrd $1, {{[0-9]+}}(%esp), %xmm0, %xmm0
; KNL_32-NEXT: vinserti32x4 $0, %xmm1, %zmm0, %zmm0
; KNL_32-NEXT: vpbroadcastd %xmm0, %zmm0		; KNL_32-NEXT: vpbroadcastd %xmm0, %zmm0
; KNL_32-NEXT: vpslld $2, {{[0-9]+}}(%esp){1to16}, %zmm1		; KNL_32-NEXT: vpslld $2, {{[0-9]+}}(%esp){1to16}, %zmm1
; KNL_32-NEXT: vpaddd %zmm1, %zmm0, %zmm1		; KNL_32-NEXT: vpaddd %zmm1, %zmm0, %zmm1
; KNL_32-NEXT: kxnorw %k0, %k0, %k1		; KNL_32-NEXT: kxnorw %k0, %k0, %k1
; KNL_32-NEXT: vgatherdps (,%zmm1), %zmm0 {%k1}		; KNL_32-NEXT: vgatherdps (,%zmm1), %zmm0 {%k1}
; KNL_32-NEXT: retl		; KNL_32-NEXT: retl
;		;
; SKX-LABEL: test14:		; SKX-LABEL: test14:
; SKX: # BB#0:		; SKX: # BB#0:
; SKX-NEXT: vpinsrq $1, %rdi, %xmm0, %xmm1		; SKX-NEXT: vpinsrq $1, %rdi, %xmm0, %xmm0
; SKX-NEXT: vinserti64x2 $0, %xmm1, %zmm0, %zmm0
; SKX-NEXT: vpbroadcastq %xmm0, %zmm0		; SKX-NEXT: vpbroadcastq %xmm0, %zmm0
; SKX-NEXT: vpbroadcastd %esi, %ymm1		; SKX-NEXT: vpbroadcastd %esi, %ymm1
; SKX-NEXT: vpmovsxdq %ymm1, %zmm1		; SKX-NEXT: vpmovsxdq %ymm1, %zmm1
; SKX-NEXT: vpsllq $2, %zmm1, %zmm1		; SKX-NEXT: vpsllq $2, %zmm1, %zmm1
; SKX-NEXT: vpaddq %zmm1, %zmm0, %zmm0		; SKX-NEXT: vpaddq %zmm1, %zmm0, %zmm0
; SKX-NEXT: kxnorw %k0, %k0, %k1		; SKX-NEXT: kxnorw %k0, %k0, %k1
; SKX-NEXT: kshiftrw $8, %k1, %k2		; SKX-NEXT: kshiftrw $8, %k1, %k2
; SKX-NEXT: vgatherqps (,%zmm0), %ymm1 {%k2}		; SKX-NEXT: vgatherqps (,%zmm0), %ymm1 {%k2}
; SKX-NEXT: vgatherqps (,%zmm0), %ymm2 {%k1}		; SKX-NEXT: vgatherqps (,%zmm0), %ymm2 {%k1}
; SKX-NEXT: vinsertf32x8 $1, %ymm1, %zmm2, %zmm0		; SKX-NEXT: vinsertf32x8 $1, %ymm1, %zmm2, %zmm0
; SKX-NEXT: retq		; SKX-NEXT: retq
;		;
; SKX_32-LABEL: test14:		; SKX_32-LABEL: test14:
; SKX_32: # BB#0:		; SKX_32: # BB#0:
; SKX_32-NEXT: vpinsrd $1, {{[0-9]+}}(%esp), %xmm0, %xmm1		; SKX_32-NEXT: vpinsrd $1, {{[0-9]+}}(%esp), %xmm0, %xmm0
; SKX_32-NEXT: vinserti32x4 $0, %xmm1, %zmm0, %zmm0
; SKX_32-NEXT: vpbroadcastd %xmm0, %zmm0		; SKX_32-NEXT: vpbroadcastd %xmm0, %zmm0
; SKX_32-NEXT: vpslld $2, {{[0-9]+}}(%esp){1to16}, %zmm1		; SKX_32-NEXT: vpslld $2, {{[0-9]+}}(%esp){1to16}, %zmm1
; SKX_32-NEXT: vpaddd %zmm1, %zmm0, %zmm1		; SKX_32-NEXT: vpaddd %zmm1, %zmm0, %zmm1
; SKX_32-NEXT: kxnorw %k0, %k0, %k1		; SKX_32-NEXT: kxnorw %k0, %k0, %k1
; SKX_32-NEXT: vgatherdps (,%zmm1), %zmm0 {%k1}		; SKX_32-NEXT: vgatherdps (,%zmm1), %zmm0 {%k1}
; SKX_32-NEXT: retl		; SKX_32-NEXT: retl

%broadcast.splatinsert = insertelement <16 x float*> %vec, float* %base, i32 1		%broadcast.splatinsert = insertelement <16 x float*> %vec, float* %base, i32 1
▲ Show 20 Lines • Show All 1,334 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/X86/vector-shuffle-avx512.ll

	Show First 20 Lines • Show All 120 Lines • ▼ Show 20 Lines
	; SKX64-NEXT: # kill: %XMM0<def> %XMM0<kill> %YMM0<def>			; SKX64-NEXT: # kill: %XMM0<def> %XMM0<kill> %YMM0<def>
	; SKX64-NEXT: movb $-127, %al			; SKX64-NEXT: movb $-127, %al
	; SKX64-NEXT: kmovb %eax, %k1			; SKX64-NEXT: kmovb %eax, %k1
	; SKX64-NEXT: vpexpandd %ymm0, %ymm0 {%k1} {z}			; SKX64-NEXT: vpexpandd %ymm0, %ymm0 {%k1} {z}
	; SKX64-NEXT: retq			; SKX64-NEXT: retq
	;			;
	; KNL64-LABEL: expand3:			; KNL64-LABEL: expand3:
	; KNL64: # BB#0:			; KNL64: # BB#0:
	; KNL64-NEXT: # kill: %XMM0<def> %XMM0<kill> %YMM0<def>
	; KNL64-NEXT: vpbroadcastq %xmm0, %ymm0			; KNL64-NEXT: vpbroadcastq %xmm0, %ymm0
	; KNL64-NEXT: vpxor %ymm1, %ymm1, %ymm1			; KNL64-NEXT: vpxor %ymm1, %ymm1, %ymm1
	; KNL64-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3,4,5,6],ymm0[7]			; KNL64-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3,4,5,6],ymm0[7]
	; KNL64-NEXT: retq			; KNL64-NEXT: retq
	;			;
	; SKX32-LABEL: expand3:			; SKX32-LABEL: expand3:
	; SKX32: # BB#0:			; SKX32: # BB#0:
	; SKX32-NEXT: # kill: %XMM0<def> %XMM0<kill> %YMM0<def>			; SKX32-NEXT: # kill: %XMM0<def> %XMM0<kill> %YMM0<def>
	; SKX32-NEXT: movb $-127, %al			; SKX32-NEXT: movb $-127, %al
	; SKX32-NEXT: kmovb %eax, %k1			; SKX32-NEXT: kmovb %eax, %k1
	; SKX32-NEXT: vpexpandd %ymm0, %ymm0 {%k1} {z}			; SKX32-NEXT: vpexpandd %ymm0, %ymm0 {%k1} {z}
	; SKX32-NEXT: retl			; SKX32-NEXT: retl
	;			;
	; KNL32-LABEL: expand3:			; KNL32-LABEL: expand3:
	; KNL32: # BB#0:			; KNL32: # BB#0:
	; KNL32-NEXT: # kill: %XMM0<def> %XMM0<kill> %YMM0<def>
	; KNL32-NEXT: vpbroadcastq %xmm0, %ymm0			; KNL32-NEXT: vpbroadcastq %xmm0, %ymm0
	; KNL32-NEXT: vpxor %ymm1, %ymm1, %ymm1			; KNL32-NEXT: vpxor %ymm1, %ymm1, %ymm1
	; KNL32-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3,4,5,6],ymm0[7]			; KNL32-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3,4,5,6],ymm0[7]
	; KNL32-NEXT: retl			; KNL32-NEXT: retl
	%res = shufflevector <4 x i32> zeroinitializer, <4 x i32> %a, <8 x i32> <i32 4, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0,i32 5>			%res = shufflevector <4 x i32> zeroinitializer, <4 x i32> %a, <8 x i32> <i32 4, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0,i32 5>
	ret <8 x i32> %res			ret <8 x i32> %res
	}			}

	▲ Show 20 Lines • Show All 723 Lines • Show Last 20 Lines

llvm/trunk/test/CodeGen/X86/widened-broadcast.ll

	Show First 20 Lines • Show All 45 Lines • ▼ Show 20 Lines
	; AVX1-LABEL: load_splat_8f32_4f32_01010101:			; AVX1-LABEL: load_splat_8f32_4f32_01010101:
	; AVX1: # BB#0: # %entry			; AVX1: # BB#0: # %entry
	; AVX1-NEXT: vmovddup {{.*#+}} xmm0 = mem[0,0]			; AVX1-NEXT: vmovddup {{.*#+}} xmm0 = mem[0,0]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: load_splat_8f32_4f32_01010101:			; AVX2-LABEL: load_splat_8f32_4f32_01010101:
	; AVX2: # BB#0: # %entry			; AVX2: # BB#0: # %entry
	; AVX2-NEXT: vmovaps (%rdi), %xmm0			; AVX2-NEXT: vbroadcastsd (%rdi), %ymm0
	; AVX2-NEXT: vbroadcastsd %xmm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: load_splat_8f32_4f32_01010101:			; AVX512-LABEL: load_splat_8f32_4f32_01010101:
	; AVX512: # BB#0: # %entry			; AVX512: # BB#0: # %entry
	; AVX512-NEXT: vmovaps (%rdi), %xmm0			; AVX512-NEXT: vbroadcastsd (%rdi), %ymm0
	; AVX512-NEXT: vbroadcastsd %xmm0, %ymm0
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	entry:			entry:
	%ld = load <4 x float>, <4 x float>* %ptr			%ld = load <4 x float>, <4 x float>* %ptr
	%ret = shufflevector <4 x float> %ld, <4 x float> undef, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>			%ret = shufflevector <4 x float> %ld, <4 x float> undef, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>
	ret <8 x float> %ret			ret <8 x float> %ret
	}			}

	define <8 x float> @load_splat_8f32_8f32_01010101(<8 x float>* %ptr) nounwind uwtable readnone ssp {			define <8 x float> @load_splat_8f32_8f32_01010101(<8 x float>* %ptr) nounwind uwtable readnone ssp {
	▲ Show 20 Lines • Show All 56 Lines • ▼ Show 20 Lines
	; AVX1-LABEL: load_splat_8i32_4i32_01010101:			; AVX1-LABEL: load_splat_8i32_4i32_01010101:
	; AVX1: # BB#0: # %entry			; AVX1: # BB#0: # %entry
	; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = mem[0,1,0,1]			; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = mem[0,1,0,1]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: load_splat_8i32_4i32_01010101:			; AVX2-LABEL: load_splat_8i32_4i32_01010101:
	; AVX2: # BB#0: # %entry			; AVX2: # BB#0: # %entry
	; AVX2-NEXT: vmovaps (%rdi), %xmm0			; AVX2-NEXT: vbroadcastsd (%rdi), %ymm0
	; AVX2-NEXT: vbroadcastsd %xmm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: load_splat_8i32_4i32_01010101:			; AVX512-LABEL: load_splat_8i32_4i32_01010101:
	; AVX512: # BB#0: # %entry			; AVX512: # BB#0: # %entry
	; AVX512-NEXT: vmovaps (%rdi), %xmm0			; AVX512-NEXT: vbroadcastsd (%rdi), %ymm0
	; AVX512-NEXT: vbroadcastsd %xmm0, %ymm0
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	entry:			entry:
	%ld = load <4 x i32>, <4 x i32>* %ptr			%ld = load <4 x i32>, <4 x i32>* %ptr
	%ret = shufflevector <4 x i32> %ld, <4 x i32> undef, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>			%ret = shufflevector <4 x i32> %ld, <4 x i32> undef, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>
	ret <8 x i32> %ret			ret <8 x i32> %ret
	}			}

	define <8 x i32> @load_splat_8i32_8i32_01010101(<8 x i32>* %ptr) nounwind uwtable readnone ssp {			define <8 x i32> @load_splat_8i32_8i32_01010101(<8 x i32>* %ptr) nounwind uwtable readnone ssp {
	▲ Show 20 Lines • Show All 87 Lines • ▼ Show 20 Lines
	; AVX1-LABEL: load_splat_16i16_8i16_0101010101010101:			; AVX1-LABEL: load_splat_16i16_8i16_0101010101010101:
	; AVX1: # BB#0: # %entry			; AVX1: # BB#0: # %entry
	; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = mem[0,0,0,0]			; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = mem[0,0,0,0]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: load_splat_16i16_8i16_0101010101010101:			; AVX2-LABEL: load_splat_16i16_8i16_0101010101010101:
	; AVX2: # BB#0: # %entry			; AVX2: # BB#0: # %entry
	; AVX2-NEXT: vmovaps (%rdi), %xmm0			; AVX2-NEXT: vbroadcastss (%rdi), %ymm0
	; AVX2-NEXT: vbroadcastss %xmm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: load_splat_16i16_8i16_0101010101010101:			; AVX512-LABEL: load_splat_16i16_8i16_0101010101010101:
	; AVX512: # BB#0: # %entry			; AVX512: # BB#0: # %entry
	; AVX512-NEXT: vmovaps (%rdi), %xmm0			; AVX512-NEXT: vbroadcastss (%rdi), %ymm0
	; AVX512-NEXT: vbroadcastss %xmm0, %ymm0
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	entry:			entry:
	%ld = load <8 x i16>, <8 x i16>* %ptr			%ld = load <8 x i16>, <8 x i16>* %ptr
	%ret = shufflevector <8 x i16> %ld, <8 x i16> undef, <16 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>			%ret = shufflevector <8 x i16> %ld, <8 x i16> undef, <16 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>
	ret <16 x i16> %ret			ret <16 x i16> %ret
	}			}

	define <16 x i16> @load_splat_16i16_8i16_0123012301230123(<8 x i16>* %ptr) nounwind uwtable readnone ssp {			define <16 x i16> @load_splat_16i16_8i16_0123012301230123(<8 x i16>* %ptr) nounwind uwtable readnone ssp {
	; SSE-LABEL: load_splat_16i16_8i16_0123012301230123:			; SSE-LABEL: load_splat_16i16_8i16_0123012301230123:
	; SSE: # BB#0: # %entry			; SSE: # BB#0: # %entry
	; SSE-NEXT: pshufd {{.*#+}} xmm0 = mem[0,1,0,1]			; SSE-NEXT: pshufd {{.*#+}} xmm0 = mem[0,1,0,1]
	; SSE-NEXT: movdqa %xmm0, %xmm1			; SSE-NEXT: movdqa %xmm0, %xmm1
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX1-LABEL: load_splat_16i16_8i16_0123012301230123:			; AVX1-LABEL: load_splat_16i16_8i16_0123012301230123:
	; AVX1: # BB#0: # %entry			; AVX1: # BB#0: # %entry
	; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = mem[0,1,0,1]			; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = mem[0,1,0,1]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: load_splat_16i16_8i16_0123012301230123:			; AVX2-LABEL: load_splat_16i16_8i16_0123012301230123:
	; AVX2: # BB#0: # %entry			; AVX2: # BB#0: # %entry
	; AVX2-NEXT: vmovaps (%rdi), %xmm0			; AVX2-NEXT: vbroadcastsd (%rdi), %ymm0
	; AVX2-NEXT: vbroadcastsd %xmm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: load_splat_16i16_8i16_0123012301230123:			; AVX512-LABEL: load_splat_16i16_8i16_0123012301230123:
	; AVX512: # BB#0: # %entry			; AVX512: # BB#0: # %entry
	; AVX512-NEXT: vmovaps (%rdi), %xmm0			; AVX512-NEXT: vbroadcastsd (%rdi), %ymm0
	; AVX512-NEXT: vbroadcastsd %xmm0, %ymm0
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	entry:			entry:
	%ld = load <8 x i16>, <8 x i16>* %ptr			%ld = load <8 x i16>, <8 x i16>* %ptr
	%ret = shufflevector <8 x i16> %ld, <8 x i16> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3,i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>			%ret = shufflevector <8 x i16> %ld, <8 x i16> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3,i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
	ret <16 x i16> %ret			ret <16 x i16> %ret
	}			}

	define <16 x i16> @load_splat_16i16_16i16_0101010101010101(<16 x i16>* %ptr) nounwind uwtable readnone ssp {			define <16 x i16> @load_splat_16i16_16i16_0101010101010101(<16 x i16>* %ptr) nounwind uwtable readnone ssp {
	▲ Show 20 Lines • Show All 146 Lines • ▼ Show 20 Lines
	; AVX1: # BB#0: # %entry			; AVX1: # BB#0: # %entry
	; AVX1-NEXT: vpshuflw {{.*#+}} xmm0 = mem[0,0,0,0,4,5,6,7]			; AVX1-NEXT: vpshuflw {{.*#+}} xmm0 = mem[0,0,0,0,4,5,6,7]
	; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]			; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: load_splat_32i8_16i8_01010101010101010101010101010101:			; AVX2-LABEL: load_splat_32i8_16i8_01010101010101010101010101010101:
	; AVX2: # BB#0: # %entry			; AVX2: # BB#0: # %entry
	; AVX2-NEXT: vmovdqa (%rdi), %xmm0			; AVX2-NEXT: vpbroadcastw (%rdi), %ymm0
	; AVX2-NEXT: vpbroadcastw %xmm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: load_splat_32i8_16i8_01010101010101010101010101010101:			; AVX512-LABEL: load_splat_32i8_16i8_01010101010101010101010101010101:
	; AVX512: # BB#0: # %entry			; AVX512: # BB#0: # %entry
	; AVX512-NEXT: vmovdqa (%rdi), %xmm0			; AVX512-NEXT: vpbroadcastw (%rdi), %ymm0
	; AVX512-NEXT: vpbroadcastw %xmm0, %ymm0
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	entry:			entry:
	%ld = load <16 x i8>, <16 x i8>* %ptr			%ld = load <16 x i8>, <16 x i8>* %ptr
	%ret = shufflevector <16 x i8> %ld, <16 x i8> undef, <32 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>			%ret = shufflevector <16 x i8> %ld, <16 x i8> undef, <32 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>
	ret <32 x i8> %ret			ret <32 x i8> %ret
	}			}

	define <32 x i8> @load_splat_32i8_16i8_01230123012301230123012301230123(<16 x i8>* %ptr) nounwind uwtable readnone ssp {			define <32 x i8> @load_splat_32i8_16i8_01230123012301230123012301230123(<16 x i8>* %ptr) nounwind uwtable readnone ssp {
	; SSE-LABEL: load_splat_32i8_16i8_01230123012301230123012301230123:			; SSE-LABEL: load_splat_32i8_16i8_01230123012301230123012301230123:
	; SSE: # BB#0: # %entry			; SSE: # BB#0: # %entry
	; SSE-NEXT: pshufd {{.*#+}} xmm0 = mem[0,0,0,0]			; SSE-NEXT: pshufd {{.*#+}} xmm0 = mem[0,0,0,0]
	; SSE-NEXT: movdqa %xmm0, %xmm1			; SSE-NEXT: movdqa %xmm0, %xmm1
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX1-LABEL: load_splat_32i8_16i8_01230123012301230123012301230123:			; AVX1-LABEL: load_splat_32i8_16i8_01230123012301230123012301230123:
	; AVX1: # BB#0: # %entry			; AVX1: # BB#0: # %entry
	; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = mem[0,0,0,0]			; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = mem[0,0,0,0]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: load_splat_32i8_16i8_01230123012301230123012301230123:			; AVX2-LABEL: load_splat_32i8_16i8_01230123012301230123012301230123:
	; AVX2: # BB#0: # %entry			; AVX2: # BB#0: # %entry
	; AVX2-NEXT: vmovaps (%rdi), %xmm0			; AVX2-NEXT: vbroadcastss (%rdi), %ymm0
	; AVX2-NEXT: vbroadcastss %xmm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: load_splat_32i8_16i8_01230123012301230123012301230123:			; AVX512-LABEL: load_splat_32i8_16i8_01230123012301230123012301230123:
	; AVX512: # BB#0: # %entry			; AVX512: # BB#0: # %entry
	; AVX512-NEXT: vmovaps (%rdi), %xmm0			; AVX512-NEXT: vbroadcastss (%rdi), %ymm0
	; AVX512-NEXT: vbroadcastss %xmm0, %ymm0
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	entry:			entry:
	%ld = load <16 x i8>, <16 x i8>* %ptr			%ld = load <16 x i8>, <16 x i8>* %ptr
	%ret = shufflevector <16 x i8> %ld, <16 x i8> undef, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>			%ret = shufflevector <16 x i8> %ld, <16 x i8> undef, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3, i32 0, i32 1, i32 2, i32 3>
	ret <32 x i8> %ret			ret <32 x i8> %ret
	}			}

	define <32 x i8> @load_splat_32i8_16i8_01234567012345670123456701234567(<16 x i8>* %ptr) nounwind uwtable readnone ssp {			define <32 x i8> @load_splat_32i8_16i8_01234567012345670123456701234567(<16 x i8>* %ptr) nounwind uwtable readnone ssp {
	; SSE-LABEL: load_splat_32i8_16i8_01234567012345670123456701234567:			; SSE-LABEL: load_splat_32i8_16i8_01234567012345670123456701234567:
	; SSE: # BB#0: # %entry			; SSE: # BB#0: # %entry
	; SSE-NEXT: pshufd {{.*#+}} xmm0 = mem[0,1,0,1]			; SSE-NEXT: pshufd {{.*#+}} xmm0 = mem[0,1,0,1]
	; SSE-NEXT: movdqa %xmm0, %xmm1			; SSE-NEXT: movdqa %xmm0, %xmm1
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX1-LABEL: load_splat_32i8_16i8_01234567012345670123456701234567:			; AVX1-LABEL: load_splat_32i8_16i8_01234567012345670123456701234567:
	; AVX1: # BB#0: # %entry			; AVX1: # BB#0: # %entry
	; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = mem[0,1,0,1]			; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = mem[0,1,0,1]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: load_splat_32i8_16i8_01234567012345670123456701234567:			; AVX2-LABEL: load_splat_32i8_16i8_01234567012345670123456701234567:
	; AVX2: # BB#0: # %entry			; AVX2: # BB#0: # %entry
	; AVX2-NEXT: vmovaps (%rdi), %xmm0			; AVX2-NEXT: vbroadcastsd (%rdi), %ymm0
	; AVX2-NEXT: vbroadcastsd %xmm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: load_splat_32i8_16i8_01234567012345670123456701234567:			; AVX512-LABEL: load_splat_32i8_16i8_01234567012345670123456701234567:
	; AVX512: # BB#0: # %entry			; AVX512: # BB#0: # %entry
	; AVX512-NEXT: vmovaps (%rdi), %xmm0			; AVX512-NEXT: vbroadcastsd (%rdi), %ymm0
	; AVX512-NEXT: vbroadcastsd %xmm0, %ymm0
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	entry:			entry:
	%ld = load <16 x i8>, <16 x i8>* %ptr			%ld = load <16 x i8>, <16 x i8>* %ptr
	%ret = shufflevector <16 x i8> %ld, <16 x i8> undef, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>			%ret = shufflevector <16 x i8> %ld, <16 x i8> undef, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
	ret <32 x i8> %ret			ret <32 x i8> %ret
	}			}

	define <32 x i8> @load_splat_32i8_32i8_01010101010101010101010101010101(<32 x i8>* %ptr) nounwind uwtable readnone ssp {			define <32 x i8> @load_splat_32i8_32i8_01010101010101010101010101010101(<32 x i8>* %ptr) nounwind uwtable readnone ssp {
	▲ Show 20 Lines • Show All 75 Lines • Show Last 20 Lines