This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner, x86] convert insertelement of bitcasted vector into shuffle
ClosedPublic

Authored by spatel on Sep 28 2017, 3:54 PM.

Details

Reviewers

efriedma
igorb
zvi
craig.topper
RKSimon

Commits

rG34fd5eaaf04c: [DAGCombiner] convert insertelement of bitcasted vector into shuffle
rL315460: [DAGCombiner] convert insertelement of bitcasted vector into shuffle

Summary

This is a generalization of the IR fold in D38316 to handle insertion into a non-undef vector. If this looks ok, then we should probably just abandon that one.

I had to add a target hook to avoid AVX512 horror with vXi1 shuffles. I think ARM/AArch64 will want to enable this too based on the earlier discussion, but I'm not sure if that would be limited to certain types or just set it to true for everything.

There may be room for improvement in the shuffle lowering here, but I think that would be follow-up work.

Diff Detail

Repository: rL LLVM

Event Timeline

spatel created this revision. Sep 28 2017, 3:54 PM

Herald added subscribers: kristof.beyls, mcrosier, aemerson. Sep 28 2017, 3:54 PM

fhahn added a subscriber: fhahn. Sep 29 2017, 1:23 AM

RKSimon added inline comments. Sep 29 2017, 2:52 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13787 ↗	(On Diff #117069)	We need to test for isShuffleMaskLegal here and I think it will remove the need for shouldConvertInsSubVecToShuffle

spatel added inline comments. Sep 29 2017, 7:35 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13787 ↗	(On Diff #117069)	Aha...I thought we must have something here already. I was trying something similar to that in an earlier draft, but it's tricky:
1. If we use the actual shuffle type (VT), the subvector that we're trying to avoid messing with will already be cast from vXi1 to a potentially legal mask type. AVX512 disaster will occur for existing test cases.
2. If we use the subvector type (SubVecVT), we will fail to get any of the cases shown here because the subvector type isn't legal (e.g., v2i32). This is similar to why I couldn't use insert_subvector nodes - they require legal types.
3. If we hack it to check for a hybrid fake type (the number of elements in the result + the element type of the subvector), we won't get the 512-bit cases for the AVX2 targets. This probably doesn't matter much in practice (shouldn't have larger than legal vectors in the first place?).
So it's not quite what we're looking for, but I can post a draft of option #3 if that's a reasonable alternative to a new hook. Or let me know if you see another way out.

spatel added inline comments. Sep 29 2017, 7:59 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13787 ↗	(On Diff #117069)	Note: option #3 isn't as crazy as I made it sound, that 'fake type' is ConcatVT in the current patch.
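
To make the candidate types in this inline thread concrete, here is a minimal sketch (not LLVM code) that computes the hybrid type for the shapes discussed in this review. `VecType` and `hybridShuffleType` are hypothetical names introduced for illustration, and the sketch assumes the hybrid type is the subvector's element type repeated until it fills the destination vector's width.

```cpp
#include <cassert>
#include <string>

// Toy stand-in for a fixed-width vector type such as v8i64.
// This is NOT llvm::EVT; it only tracks element count and element width.
struct VecType {
  unsigned NumElts;
  unsigned EltBits;
  unsigned sizeInBits() const { return NumElts * EltBits; }
  std::string str() const {
    return "v" + std::to_string(NumElts) + "i" + std::to_string(EltBits);
  }
};

// Hypothetical helper: keep the subvector's element width, but use enough
// elements to cover the full width of the destination vector.
VecType hybridShuffleType(VecType DestVT, VecType SubVecVT) {
  unsigned ExtendRatio = DestVT.sizeInBits() / SubVecVT.sizeInBits();
  return {ExtendRatio * SubVecVT.NumElts, SubVecVT.EltBits};
}
```

Under these assumptions, a v8i64 destination with a v2i32 subvector gives v16i32, which is a 512-bit type. That is consistent with the remark that AVX2 targets would not get the 512-bit cases.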

I like the idea of #3 (making sure we comment it thoroughly!)

In D38388#884472, @RKSimon wrote:

I like the idea of #3 (making sure we comment it thoroughly!)

We can make it less hacky by performing the shuffle in the pre-bitcast type. Then, we're not lying about the shuffle. We'll have to bitcast the big vector and then bitcast the result though...and the mask will be less beautiful (it'll be more like what I was trying to create in D38316). But that should still catch the cases we want without adding another hook. Let me draft that and see what breaks... :)
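
One rough way to check that the bitcast, shuffle, bitcast form matches the original insert is to model both on plain arrays. This is only a sketch under stated assumptions: `std::memcpy` stands in for a bitcast (an in-memory reinterpretation), zeros stand in for the undef padding, and the type aliases and function names are made up for illustration. It uses a v4i32 destination with a v2i16 subvector.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

using V4i32 = std::array<uint32_t, 4>;
using V2i16 = std::array<uint16_t, 2>;
using V8i16 = std::array<uint16_t, 8>;

// Reference form: bitcast X to a 32-bit scalar, then insert it at Idx.
V4i32 insertBitcastElt(V4i32 V, V2i16 X, unsigned Idx) {
  uint32_t Scalar;
  std::memcpy(&Scalar, X.data(), sizeof(Scalar));
  V[Idx] = Scalar;
  return V;
}

// Shuffle form: do the work in the subvector's element type (i16).
V4i32 insertViaShuffle(V4i32 V, V2i16 X, unsigned Idx) {
  constexpr unsigned NumSrcElts = 2;  // elements in X
  constexpr unsigned NumMaskVals = 8; // elements in the i16 view of V

  // Bitcast the wide vector to the shuffle type.
  V8i16 VCast;
  std::memcpy(VCast.data(), V.data(), sizeof(VCast));

  // Pad X to full width; the zeros model undef lanes and are never selected.
  V8i16 Padded{};
  Padded[0] = X[0];
  Padded[1] = X[1];

  // Mask values below NumMaskVals pick from VCast, the rest pick from Padded.
  V8i16 Shuf{};
  for (unsigned i = 0; i != NumMaskVals; ++i) {
    unsigned M = (i / NumSrcElts == Idx) ? (i % NumSrcElts) + NumMaskVals : i;
    Shuf[i] = M < NumMaskVals ? VCast[M] : Padded[M - NumMaskVals];
  }

  // Bitcast the shuffle result back to the destination type.
  V4i32 Result;
  std::memcpy(Result.data(), Shuf.data(), sizeof(Result));
  return Result;
}
```

Because both functions reinterpret memory the same way, the two forms agree for every insertion index in this model. It says nothing about whether a given target can lower the resulting shuffle cheaply.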

Patch updated:

Use the existing isShuffleMaskLegal() TLI hook to determine when this transform is possible.
To make that query legitimate, perform the shuffle in the type of the padded subvector.
To make the proper shuffle op, bitcast the big vector and bitcast the shuffle back to the big vector type.

This gives us the same wins for x86 as the earlier rev only if the vector type is legal. I.e., we no longer optimize the 512-bit vectors for an AVX2 target. That's ok because we don't care about optimizing codegen for oversized illegal vector types?

I think there's also an improvement for AArch64 now that we're using a hook that is enabled for other targets. We had:

ins	v1.d[1], v0.d[0]
mov		v0.16b, v1.16b

and now:

zip1	v0.2d, v1.2d, v0.2d
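
The AArch64 change can be sanity-checked with a small model of the shuffle mask. The sketch below mirrors the mask loop from the patch in standalone form. `buildInsertMask`, `shuffle2`, and `zip1` are illustrative names, and it assumes the `<1 x double>` operand reaches the combiner as a one-element subvector (NumSrcElts = 1, ExtendRatio = 2, insert index 1).

```cpp
#include <array>
#include <cassert>
#include <vector>

// Standalone copy of the mask construction: lanes of the destination keep
// their own index, lanes at the insert position pick from operand 1.
std::vector<int> buildInsertMask(unsigned NumSrcElts, unsigned ExtendRatio,
                                 unsigned InsIndex) {
  unsigned NumMaskVals = ExtendRatio * NumSrcElts;
  std::vector<int> Mask(NumMaskVals);
  for (unsigned i = 0; i != NumMaskVals; ++i)
    Mask[i] = (i / NumSrcElts == InsIndex)
                  ? static_cast<int>((i % NumSrcElts) + NumMaskVals)
                  : static_cast<int>(i);
  return Mask;
}

using V2f64 = std::array<double, 2>;

// Two-operand shuffle: mask values 0..1 select Op0, 2..3 select Op1.
V2f64 shuffle2(const V2f64 &Op0, const V2f64 &Op1,
               const std::vector<int> &Mask) {
  V2f64 R{};
  for (unsigned i = 0; i != 2; ++i)
    R[i] = Mask[i] < 2 ? Op0[Mask[i]] : Op1[Mask[i] - 2];
  return R;
}

// Model of zip1 on .2d lanes: lane 0 of the first source, then lane 0 of
// the second source.
V2f64 zip1(const V2f64 &Vn, const V2f64 &Vm) { return {Vn[0], Vm[0]}; }
```

With those parameters the mask is {0, 2}, which selects lane 0 of each operand, so a single zip1 can replace the ins plus mov pair. The same helper reproduces the {0,1,2,3,8,9,6,7} example from the comment in the patch when called with (2, 4, 2).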

Herald added a subscriber: javed.absar. Sep 29 2017, 10:59 AM

RKSimon added inline comments. Oct 1 2017, 5:43 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
13774 ↗	(On Diff #117180)	How bad does it get if we allow any shuffle prior to legalization?
13787 ↗	(On Diff #117069)	Don't the interim generated nodes need adding to the worklist?

spatel marked an inline comment as done. Oct 1 2017, 9:29 AM

spatel added inline comments.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

13774 ↗	(On Diff #117180)	I'm not sure about other targets because no existing tests change, but given that this isn't a clear win, I think we need some kind of hook to opt out.

For AVX512, it gets Bad - as in hundreds of extra instructions.

It looks like we'd need to add a concat_vectors fold to prevent this pattern:

t76: v128i1 = concat_vectors t13, undef:v16i1, undef:v16i1, undef:v16i1, undef:v16i1, undef:v16i1, undef:v16i1, undef:v16i1
t77: v8i16 = bitcast t76
t78: v8i16 = vector_shuffle<0,8,2,3,4,5,6,7> t80, t77

From becoming:

  ...t218: i8 = extract_vector_elt t13, Constant:i64<12>
  t217: i8 = extract_vector_elt t13, Constant:i64<13>
  t216: i8 = extract_vector_elt t13, Constant:i64<14>
  t215: i8 = extract_vector_elt t13, Constant:i64<15>
t210: v32i8 = BUILD_VECTOR t230, t229, t228, t227, t226, t225, t224, t223, t222, t221, t220, t219, t218, t217, t216, t215, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8, undef:i8

Patch updated:

Add intermediate nodes to worklist.
This shifted the balance to have the helper function be a class member rather than a static, so some cosmetic diffs associated with that.

Have the legalization requirements affected whether we can abandon D38316?

In D38388#885413, @RKSimon wrote:

Have the legalization requirements affected whether we can abandon D38316?

It's a gray area (to me at least). This patch is enough to solve the motivating bug:
https://bugs.llvm.org/show_bug.cgi?id=34716
...and it captures the more general case of inserting into a non-undef vector too, but you're right - it no longer handles the 512-bit vector example with a potentially pre-AVX512 target that Eli may have been suggesting in the other review.

So the considerations are (and I don't have a strong opinion either way):

The IR canonicalization of D38316 can enable an IR improvement (fewer instructions in the motivating bug), so it still has some value. Is that value enough to justify the cost?
Do we want to add a different target-independent IR-level fold to kill the extra insertelement instruction for a non-undef insertion? That one seems more complicated, so we'd at least want to know how that pattern was created.

Ping.

LGTM

This revision is now accepted and ready to land. Oct 10 2017, 5:59 AM

Closed by commit rL315460: [DAGCombiner] convert insertelement of bitcasted vector into shuffle (authored by spatel). Oct 11 2017, 7:12 AM

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in D38316: [InstCombine] replace bitcast to scalar + insertelement with widening shuffle + vector bitcast. Oct 13 2017, 8:59 AM

spatel mentioned this in D40209: [DAGCombiner] eliminate shuffle of insert element. Nov 18 2017, 9:13 AM

spatel mentioned this in rL320050: [DAGCombiner] eliminate shuffle of insert element. Dec 7 2017, 7:18 AM

Revision Contents

Path	Size
llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp	65 lines
llvm/trunk/test/CodeGen/AArch64/arm64-neon-copy.ll	2 lines
llvm/trunk/test/CodeGen/X86/insertelement-shuffle.ll	61 lines

Diff 118610

llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

[409 lines not shown]	bool isSetCCEquivalent(SDValue N, SDValue &LHS, SDValue &RHS,
SDValue &CC) const;		SDValue &CC) const;
bool isOneUseSetCC(SDValue N) const;		bool isOneUseSetCC(SDValue N) const;

SDValue SimplifyNodeWithTwoResults(SDNode *N, unsigned LoOp,		SDValue SimplifyNodeWithTwoResults(SDNode *N, unsigned LoOp,
unsigned HiOp);		unsigned HiOp);
SDValue CombineConsecutiveLoads(SDNode *N, EVT VT);		SDValue CombineConsecutiveLoads(SDNode *N, EVT VT);
SDValue CombineExtLoad(SDNode *N);		SDValue CombineExtLoad(SDNode *N);
SDValue combineRepeatedFPDivisors(SDNode *N);		SDValue combineRepeatedFPDivisors(SDNode *N);
		SDValue combineInsertEltToShuffle(SDNode *N, unsigned InsIndex);
SDValue ConstantFoldBITCASTofBUILD_VECTOR(SDNode *, EVT);		SDValue ConstantFoldBITCASTofBUILD_VECTOR(SDNode *, EVT);
SDValue BuildSDIV(SDNode *N);		SDValue BuildSDIV(SDNode *N);
SDValue BuildSDIVPow2(SDNode *N);		SDValue BuildSDIVPow2(SDNode *N);
SDValue BuildUDIV(SDNode *N);		SDValue BuildUDIV(SDNode *N);
SDValue BuildLogBase2(SDValue Op, const SDLoc &DL);		SDValue BuildLogBase2(SDValue Op, const SDLoc &DL);
SDValue BuildReciprocalEstimate(SDValue Op, SDNodeFlags Flags);		SDValue BuildReciprocalEstimate(SDValue Op, SDNodeFlags Flags);
SDValue buildRsqrtEstimate(SDValue Op, SDNodeFlags Flags);		SDValue buildRsqrtEstimate(SDValue Op, SDNodeFlags Flags);
SDValue buildSqrtEstimate(SDValue Op, SDNodeFlags Flags);		SDValue buildSqrtEstimate(SDValue Op, SDNodeFlags Flags);
[13,316 lines not shown]	SDValue DAGCombiner::splitMergedValStore(StoreSDNode *ST) {
// Higher value store.		// Higher value store.
SDValue St1 =		SDValue St1 =
DAG.getStore(St0, DL, Hi, Ptr,		DAG.getStore(St0, DL, Hi, Ptr,
ST->getPointerInfo().getWithOffset(HalfValBitSize / 8),		ST->getPointerInfo().getWithOffset(HalfValBitSize / 8),
Alignment / 2, MMOFlags, AAInfo);		Alignment / 2, MMOFlags, AAInfo);
return St1;		return St1;
}		}

		/// Convert a disguised subvector insertion into a shuffle:
		/// insert_vector_elt V, (bitcast X from vector type), IdxC -->
		/// bitcast(shuffle (bitcast V), (extended X), Mask)
		/// Note: We do not use an insert_subvector node because that requires a legal
		/// subvector type.
		SDValue DAGCombiner::combineInsertEltToShuffle(SDNode *N, unsigned InsIndex) {
		SDValue InsertVal = N->getOperand(1);
		if (InsertVal.getOpcode() != ISD::BITCAST || !InsertVal.hasOneUse() ||
		!InsertVal.getOperand(0).getValueType().isVector())
		return SDValue();

		SDValue SubVec = InsertVal.getOperand(0);
		SDValue DestVec = N->getOperand(0);
		EVT SubVecVT = SubVec.getValueType();
		EVT VT = DestVec.getValueType();
		unsigned NumSrcElts = SubVecVT.getVectorNumElements();
		unsigned ExtendRatio = VT.getSizeInBits() / SubVecVT.getSizeInBits();
		unsigned NumMaskVals = ExtendRatio * NumSrcElts;

		// Step 1: Create a shuffle mask that implements this insert operation. The
		// vector that we are inserting into will be operand 0 of the shuffle, so
		// those elements are just 'i'. The inserted subvector is in the first
		// positions of operand 1 of the shuffle. Example:
		// insert v4i32 V, (v2i16 X), 2 --> shuffle v8i16 V', X', {0,1,2,3,8,9,6,7}
		SmallVector<int, 16> Mask(NumMaskVals);
		for (unsigned i = 0; i != NumMaskVals; ++i) {
		if (i / NumSrcElts == InsIndex)
		Mask[i] = (i % NumSrcElts) + NumMaskVals;
		else
		Mask[i] = i;
		}

		// Bail out if the target can not handle the shuffle we want to create.
		EVT SubVecEltVT = SubVecVT.getVectorElementType();
		EVT ShufVT = EVT::getVectorVT(*DAG.getContext(), SubVecEltVT, NumMaskVals);
		if (!TLI.isShuffleMaskLegal(Mask, ShufVT))
		return SDValue();

		// Step 2: Create a wide vector from the inserted source vector by appending
		// undefined elements. This is the same size as our destination vector.
		SDLoc DL(N);
		SmallVector<SDValue, 8> ConcatOps(ExtendRatio, DAG.getUNDEF(SubVecVT));
		ConcatOps[0] = SubVec;
		SDValue PaddedSubV = DAG.getNode(ISD::CONCAT_VECTORS, DL, ShufVT, ConcatOps);

		// Step 3: Shuffle in the padded subvector.
		SDValue DestVecBC = DAG.getBitcast(ShufVT, DestVec);
		SDValue Shuf = DAG.getVectorShuffle(ShufVT, DL, DestVecBC, PaddedSubV, Mask);
		AddToWorklist(PaddedSubV.getNode());
		AddToWorklist(DestVecBC.getNode());
		AddToWorklist(Shuf.getNode());
		return DAG.getBitcast(VT, Shuf);
		}

SDValue DAGCombiner::visitINSERT_VECTOR_ELT(SDNode *N) {		SDValue DAGCombiner::visitINSERT_VECTOR_ELT(SDNode *N) {
SDValue InVec = N->getOperand(0);		SDValue InVec = N->getOperand(0);
SDValue InVal = N->getOperand(1);		SDValue InVal = N->getOperand(1);
SDValue EltNo = N->getOperand(2);		SDValue EltNo = N->getOperand(2);
SDLoc DL(N);		SDLoc DL(N);

// If the inserted element is an UNDEF, just use the input vector.		// If the inserted element is an UNDEF, just use the input vector.
if (InVal.isUndef())		if (InVal.isUndef())
return InVec;		return InVec;

EVT VT = InVec.getValueType();		EVT VT = InVec.getValueType();

// Remove redundant insertions:		// Remove redundant insertions:
// (insert_vector_elt x (extract_vector_elt x idx) idx) -> x		// (insert_vector_elt x (extract_vector_elt x idx) idx) -> x
if (InVal.getOpcode() == ISD::EXTRACT_VECTOR_ELT &&		if (InVal.getOpcode() == ISD::EXTRACT_VECTOR_ELT &&
InVec == InVal.getOperand(0) && EltNo == InVal.getOperand(1))		InVec == InVal.getOperand(0) && EltNo == InVal.getOperand(1))
return InVec;		return InVec;

// Check that we know which element is being inserted		// We must know which element is being inserted for folds below here.
if (!isa<ConstantSDNode>(EltNo))		auto *IndexC = dyn_cast<ConstantSDNode>(EltNo);
		if (!IndexC)
return SDValue();		return SDValue();
unsigned Elt = cast<ConstantSDNode>(EltNo)->getZExtValue();		unsigned Elt = IndexC->getZExtValue();

		if (SDValue Shuf = combineInsertEltToShuffle(N, Elt))
		return Shuf;

// Canonicalize insert_vector_elt dag nodes.		// Canonicalize insert_vector_elt dag nodes.
// Example:		// Example:
// (insert_vector_elt (insert_vector_elt A, Idx0), Idx1)		// (insert_vector_elt (insert_vector_elt A, Idx0), Idx1)
// -> (insert_vector_elt (insert_vector_elt A, Idx1), Idx0)		// -> (insert_vector_elt (insert_vector_elt A, Idx1), Idx0)
//		//
// Do this only if the child insert_vector node has one use; also		// Do this only if the child insert_vector node has one use; also
// do this only if indices are both constants and Idx1 < Idx0.		// do this only if indices are both constants and Idx1 < Idx0.
[3,735 lines not shown]

llvm/trunk/test/CodeGen/AArch64/arm64-neon-copy.ll

	[134 lines not shown]
	; CHECK: ins {{v[0-9]+}}.s[1], {{v[0-9]+}}.s[1]			; CHECK: ins {{v[0-9]+}}.s[1], {{v[0-9]+}}.s[1]
	%tmp3 = extractelement <2 x float> %tmp1, i32 1			%tmp3 = extractelement <2 x float> %tmp1, i32 1
	%tmp4 = insertelement <4 x float> %tmp2, float %tmp3, i32 1			%tmp4 = insertelement <4 x float> %tmp2, float %tmp3, i32 1
	ret <4 x float> %tmp4			ret <4 x float> %tmp4
	}			}

	define <2 x double> @ins1f2(<1 x double> %tmp1, <2 x double> %tmp2) {			define <2 x double> @ins1f2(<1 x double> %tmp1, <2 x double> %tmp2) {
	; CHECK-LABEL: ins1f2:			; CHECK-LABEL: ins1f2:
	; CHECK: ins {{v[0-9]+}}.d[1], {{v[0-9]+}}.d[0]			; CHECK: zip1 {{v[0-9]+}}.2d, {{v[0-9]+}}.2d
	%tmp3 = extractelement <1 x double> %tmp1, i32 0			%tmp3 = extractelement <1 x double> %tmp1, i32 0
	%tmp4 = insertelement <2 x double> %tmp2, double %tmp3, i32 1			%tmp4 = insertelement <2 x double> %tmp2, double %tmp3, i32 1
	ret <2 x double> %tmp4			ret <2 x double> %tmp4
	}			}

	define <8 x i8> @ins16b8(<16 x i8> %tmp1, <8 x i8> %tmp2) {			define <8 x i8> @ins16b8(<16 x i8> %tmp1, <8 x i8> %tmp2) {
	; CHECK-LABEL: ins16b8:			; CHECK-LABEL: ins16b8:
	; CHECK: ins {{v[0-9]+}}.b[7], {{v[0-9]+}}.b[2]			; CHECK: ins {{v[0-9]+}}.b[7], {{v[0-9]+}}.b[2]
	[1,333 lines not shown]

llvm/trunk/test/CodeGen/X86/insertelement-shuffle.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=avx2 | FileCheck %s --check-prefix=X32_AVX256			; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=avx2 | FileCheck %s --check-prefix=X32_AVX256
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx2 | FileCheck %s --check-prefix=X64_AVX256			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx2 | FileCheck %s --check-prefix=X64_AVX256
	; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=avx512f | FileCheck %s --check-prefix=X32_AVX512			; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=avx512f | FileCheck %s --check-prefix=X32_AVX512
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512f | FileCheck %s --check-prefix=X64_AVX512			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512f | FileCheck %s --check-prefix=X64_AVX512

	define <8 x float> @insert_subvector_256(i16 %x0, i16 %x1, <8 x float> %v) nounwind {			define <8 x float> @insert_subvector_256(i16 %x0, i16 %x1, <8 x float> %v) nounwind {
	; X32_AVX256-LABEL: insert_subvector_256:			; X32_AVX256-LABEL: insert_subvector_256:
	; X32_AVX256: # BB#0:			; X32_AVX256: # BB#0:
	; X32_AVX256-NEXT: pushl %eax
	; X32_AVX256-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero			; X32_AVX256-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
	; X32_AVX256-NEXT: vpinsrw $1, {{[0-9]+}}(%esp), %xmm1, %xmm1			; X32_AVX256-NEXT: vpinsrw $1, {{[0-9]+}}(%esp), %xmm1, %xmm1
	; X32_AVX256-NEXT: vmovd %xmm1, (%esp)			; X32_AVX256-NEXT: vpbroadcastd %xmm1, %xmm1
	; X32_AVX256-NEXT: vinsertps {{.*#+}} xmm1 = xmm0[0],mem[0],xmm0[2,3]			; X32_AVX256-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1],ymm0[2,3,4,5,6,7]
	; X32_AVX256-NEXT: vblendps {{.*#+}} ymm0 = ymm1[0,1,2,3],ymm0[4,5,6,7]
	; X32_AVX256-NEXT: popl %eax
	; X32_AVX256-NEXT: retl			; X32_AVX256-NEXT: retl
	;			;
	; X64_AVX256-LABEL: insert_subvector_256:			; X64_AVX256-LABEL: insert_subvector_256:
	; X64_AVX256: # BB#0:			; X64_AVX256: # BB#0:
	; X64_AVX256-NEXT: vmovd %edi, %xmm1			; X64_AVX256-NEXT: vmovd %edi, %xmm1
	; X64_AVX256-NEXT: vpinsrw $1, %esi, %xmm1, %xmm1			; X64_AVX256-NEXT: vpinsrw $1, %esi, %xmm1, %xmm1
	; X64_AVX256-NEXT: vmovd %xmm1, -{{[0-9]+}}(%rsp)			; X64_AVX256-NEXT: vpbroadcastd %xmm1, %xmm1
	; X64_AVX256-NEXT: vinsertps {{.*#+}} xmm1 = xmm0[0],mem[0],xmm0[2,3]			; X64_AVX256-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1],ymm0[2,3,4,5,6,7]
	; X64_AVX256-NEXT: vblendps {{.*#+}} ymm0 = ymm1[0,1,2,3],ymm0[4,5,6,7]
	; X64_AVX256-NEXT: retq			; X64_AVX256-NEXT: retq
	;			;
	; X32_AVX512-LABEL: insert_subvector_256:			; X32_AVX512-LABEL: insert_subvector_256:
	; X32_AVX512: # BB#0:			; X32_AVX512: # BB#0:
	; X32_AVX512-NEXT: pushl %eax
	; X32_AVX512-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero			; X32_AVX512-NEXT: vmovd {{.*#+}} xmm1 = mem[0],zero,zero,zero
	; X32_AVX512-NEXT: vpinsrw $1, {{[0-9]+}}(%esp), %xmm1, %xmm1			; X32_AVX512-NEXT: vpinsrw $1, {{[0-9]+}}(%esp), %xmm1, %xmm1
	; X32_AVX512-NEXT: vmovd %xmm1, (%esp)			; X32_AVX512-NEXT: vpbroadcastd %xmm1, %xmm1
	; X32_AVX512-NEXT: vinsertps {{.*#+}} xmm1 = xmm0[0],mem[0],xmm0[2,3]			; X32_AVX512-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1],ymm0[2,3,4,5,6,7]
	; X32_AVX512-NEXT: vblendps {{.*#+}} ymm0 = ymm1[0,1,2,3],ymm0[4,5,6,7]
	; X32_AVX512-NEXT: popl %eax
	; X32_AVX512-NEXT: retl			; X32_AVX512-NEXT: retl
	;			;
	; X64_AVX512-LABEL: insert_subvector_256:			; X64_AVX512-LABEL: insert_subvector_256:
	; X64_AVX512: # BB#0:			; X64_AVX512: # BB#0:
	; X64_AVX512-NEXT: vmovd %edi, %xmm1			; X64_AVX512-NEXT: vmovd %edi, %xmm1
	; X64_AVX512-NEXT: vpinsrw $1, %esi, %xmm1, %xmm1			; X64_AVX512-NEXT: vpinsrw $1, %esi, %xmm1, %xmm1
	; X64_AVX512-NEXT: vmovd %xmm1, -{{[0-9]+}}(%rsp)			; X64_AVX512-NEXT: vpbroadcastd %xmm1, %xmm1
	; X64_AVX512-NEXT: vinsertps {{.*#+}} xmm1 = xmm0[0],mem[0],xmm0[2,3]			; X64_AVX512-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1],ymm0[2,3,4,5,6,7]
	; X64_AVX512-NEXT: vblendps {{.*#+}} ymm0 = ymm1[0,1,2,3],ymm0[4,5,6,7]
	; X64_AVX512-NEXT: retq			; X64_AVX512-NEXT: retq
	%ins1 = insertelement <2 x i16> undef, i16 %x0, i32 0			%ins1 = insertelement <2 x i16> undef, i16 %x0, i32 0
	%ins2 = insertelement <2 x i16> %ins1, i16 %x1, i32 1			%ins2 = insertelement <2 x i16> %ins1, i16 %x1, i32 1
	%bc = bitcast <2 x i16> %ins2 to float			%bc = bitcast <2 x i16> %ins2 to float
	%ins3 = insertelement <8 x float> %v, float %bc, i32 1			%ins3 = insertelement <8 x float> %v, float %bc, i32 1
	ret <8 x float> %ins3			ret <8 x float> %ins3
	}			}

	[21 lines not shown]
	; X64_AVX256-NEXT: vmovq %xmm2, %rax			; X64_AVX256-NEXT: vmovq %xmm2, %rax
	; X64_AVX256-NEXT: vextracti128 $1, %ymm0, %xmm2			; X64_AVX256-NEXT: vextracti128 $1, %ymm0, %xmm2
	; X64_AVX256-NEXT: vpinsrq $0, %rax, %xmm2, %xmm2			; X64_AVX256-NEXT: vpinsrq $0, %rax, %xmm2, %xmm2
	; X64_AVX256-NEXT: vinserti128 $1, %xmm2, %ymm0, %ymm0			; X64_AVX256-NEXT: vinserti128 $1, %xmm2, %ymm0, %ymm0
	; X64_AVX256-NEXT: retq			; X64_AVX256-NEXT: retq
	;			;
	; X32_AVX512-LABEL: insert_subvector_512:			; X32_AVX512-LABEL: insert_subvector_512:
	; X32_AVX512: # BB#0:			; X32_AVX512: # BB#0:
	; X32_AVX512-NEXT: pushl %ebp			; X32_AVX512-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
	; X32_AVX512-NEXT: movl %esp, %ebp			; X32_AVX512-NEXT: vmovdqa64 {{.*#+}} zmm2 = [0,0,1,0,8,0,3,0,4,0,5,0,6,0,7,0]
	; X32_AVX512-NEXT: andl $-8, %esp			; X32_AVX512-NEXT: vpermt2q %zmm1, %zmm2, %zmm0
	; X32_AVX512-NEXT: subl $8, %esp
	; X32_AVX512-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero
	; X32_AVX512-NEXT: vmovlps %xmm1, (%esp)
	; X32_AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1
	; X32_AVX512-NEXT: vpinsrd $0, (%esp), %xmm1, %xmm1
	; X32_AVX512-NEXT: vpinsrd $1, {{[0-9]+}}(%esp), %xmm1, %xmm1
	; X32_AVX512-NEXT: vinserti32x4 $1, %xmm1, %zmm0, %zmm0
	; X32_AVX512-NEXT: movl %ebp, %esp
	; X32_AVX512-NEXT: popl %ebp
	; X32_AVX512-NEXT: retl			; X32_AVX512-NEXT: retl
	;			;
	; X64_AVX512-LABEL: insert_subvector_512:			; X64_AVX512-LABEL: insert_subvector_512:
	; X64_AVX512: # BB#0:			; X64_AVX512: # BB#0:
	; X64_AVX512-NEXT: vmovd %edi, %xmm1			; X64_AVX512-NEXT: vmovd %edi, %xmm1
	; X64_AVX512-NEXT: vpinsrd $1, %esi, %xmm1, %xmm1			; X64_AVX512-NEXT: vpinsrd $1, %esi, %xmm1, %xmm1
	; X64_AVX512-NEXT: vmovq %xmm1, %rax			; X64_AVX512-NEXT: vmovdqa64 {{.*#+}} zmm2 = [0,1,8,3,4,5,6,7]
	; X64_AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1			; X64_AVX512-NEXT: vpermt2q %zmm1, %zmm2, %zmm0
	; X64_AVX512-NEXT: vpinsrq $0, %rax, %xmm1, %xmm1
	; X64_AVX512-NEXT: vinserti32x4 $1, %xmm1, %zmm0, %zmm0
	; X64_AVX512-NEXT: retq			; X64_AVX512-NEXT: retq
	%ins1 = insertelement <2 x i32> undef, i32 %x0, i32 0			%ins1 = insertelement <2 x i32> undef, i32 %x0, i32 0
	%ins2 = insertelement <2 x i32> %ins1, i32 %x1, i32 1			%ins2 = insertelement <2 x i32> %ins1, i32 %x1, i32 1
	%bc = bitcast <2 x i32> %ins2 to i64			%bc = bitcast <2 x i32> %ins2 to i64
	%ins3 = insertelement <8 x i64> %v, i64 %bc, i32 2			%ins3 = insertelement <8 x i64> %v, i64 %bc, i32 2
	ret <8 x i64> %ins3			ret <8 x i64> %ins3
	}			}

	[26 lines not shown]
	; X64_AVX256-NEXT: vmovd %edi, %xmm0			; X64_AVX256-NEXT: vmovd %edi, %xmm0
	; X64_AVX256-NEXT: vpinsrd $1, %esi, %xmm0, %xmm0			; X64_AVX256-NEXT: vpinsrd $1, %esi, %xmm0, %xmm0
	; X64_AVX256-NEXT: vpbroadcastq %xmm0, %ymm0			; X64_AVX256-NEXT: vpbroadcastq %xmm0, %ymm0
	; X64_AVX256-NEXT: vmovdqa %ymm0, %ymm1			; X64_AVX256-NEXT: vmovdqa %ymm0, %ymm1
	; X64_AVX256-NEXT: retq			; X64_AVX256-NEXT: retq
	;			;
	; X32_AVX512-LABEL: insert_subvector_into_undef:			; X32_AVX512-LABEL: insert_subvector_into_undef:
	; X32_AVX512: # BB#0:			; X32_AVX512: # BB#0:
	; X32_AVX512-NEXT: pushl %ebp
	; X32_AVX512-NEXT: movl %esp, %ebp
	; X32_AVX512-NEXT: andl $-8, %esp
	; X32_AVX512-NEXT: subl $8, %esp
	; X32_AVX512-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero			; X32_AVX512-NEXT: vmovsd {{.*#+}} xmm0 = mem[0],zero
	; X32_AVX512-NEXT: vmovlps %xmm0, (%esp)			; X32_AVX512-NEXT: vbroadcastsd %xmm0, %zmm0
	; X32_AVX512-NEXT: movl (%esp), %eax
	; X32_AVX512-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X32_AVX512-NEXT: vmovd %eax, %xmm0
	; X32_AVX512-NEXT: vpinsrd $1, %ecx, %xmm0, %xmm0
	; X32_AVX512-NEXT: vpinsrd $2, %eax, %xmm0, %xmm0
	; X32_AVX512-NEXT: vpinsrd $3, %ecx, %xmm0, %xmm0
	; X32_AVX512-NEXT: vinserti128 $1, %xmm0, %ymm0, %ymm0
	; X32_AVX512-NEXT: vinserti64x4 $1, %ymm0, %zmm0, %zmm0
	; X32_AVX512-NEXT: movl %ebp, %esp
	; X32_AVX512-NEXT: popl %ebp
	; X32_AVX512-NEXT: retl			; X32_AVX512-NEXT: retl
	;			;
	; X64_AVX512-LABEL: insert_subvector_into_undef:			; X64_AVX512-LABEL: insert_subvector_into_undef:
	; X64_AVX512: # BB#0:			; X64_AVX512: # BB#0:
	; X64_AVX512-NEXT: vmovd %edi, %xmm0			; X64_AVX512-NEXT: vmovd %edi, %xmm0
	; X64_AVX512-NEXT: vpinsrd $1, %esi, %xmm0, %xmm0			; X64_AVX512-NEXT: vpinsrd $1, %esi, %xmm0, %xmm0
	; X64_AVX512-NEXT: vpbroadcastq %xmm0, %zmm0			; X64_AVX512-NEXT: vpbroadcastq %xmm0, %zmm0
	; X64_AVX512-NEXT: retq			; X64_AVX512-NEXT: retq
	%ins1 = insertelement <2 x i32> undef, i32 %x0, i32 0			%ins1 = insertelement <2 x i32> undef, i32 %x0, i32 0
	%ins2 = insertelement <2 x i32> %ins1, i32 %x1, i32 1			%ins2 = insertelement <2 x i32> %ins1, i32 %x1, i32 1
	%bc = bitcast <2 x i32> %ins2 to i64			%bc = bitcast <2 x i32> %ins2 to i64
	%ins3 = insertelement <8 x i64> undef, i64 %bc, i32 0			%ins3 = insertelement <8 x i64> undef, i64 %bc, i32 0
	%splat = shufflevector <8 x i64> %ins3, <8 x i64> undef, <8 x i32> zeroinitializer			%splat = shufflevector <8 x i64> %ins3, <8 x i64> undef, <8 x i32> zeroinitializer
	ret <8 x i64> %splat			ret <8 x i64> %splat
	}			}