This is an archive of the discontinued LLVM Phabricator instance.

Lower certain build_vectors to insertps instructions
ClosedPublic

Authored by filcab on Apr 26 2014, 10:54 PM.

Details

Reviewers

filcab
nadav
delena
craig.topper

Commits

rG095d9d573a62: Lower certain build_vectors to insertps instructions
rL208271: Lower certain build_vectors to insertps instructions

Summary

Vectors built with zeros and elements in the same order as another
(source) vector are optimized to be built using a single insertps
instruction.

Further optimizations are possible, described in TODO comments.
I will be implementing at least some of them in the near future.

Added some tests for different cases where this optimization triggers.

Diff Detail

Event Timeline

filcab updated this revision to Diff 8868. Apr 26 2014, 10:54 PM

filcab retitled this revision from to Lower certain build_vectors to insertps instructions.

filcab updated this object.

filcab edited the test plan for this revision.

filcab added reviewers: nadav, delena, craig.topper.

filcab added a subscriber: Unknown Object (MLST).

I added some comments, please see inside.

Elena

Hi Elena,

Did you forget to add the comments? I didn't get them nor see them in phabricator.

Thank you,
Filipe

delena added inline comments. Apr 26 2014, 11:49 PM

lib/Target/X86/X86ISelLowering.cpp
5418	CorrectIdx here is boolean - 1 or 0. Further you do ++.
5432	Looks like you are looking for a splat vector. But if you want to use INSERTPS, your build-vector should include only one non-zero element.
6204	What happens for 8x32 and 16x32 vectors here?
test/CodeGen/X86/sse41.ll
331	Could you please explain what code you expect to see here? Is it only one insertps instruction? Usually we see such an extract-insert chain for a matrix transpose. But in this case the elements are extracted from different vectors.

Fixed a bug pointed out by Elena: only optimize v4x32 vectors, not bigger ones.

Let me know if you want these changed.

Thanks,
Filipe

lib/Target/X86/X86ISelLowering.cpp
5418	CorrectIdx is an unsigned int. It gets initialized to 0 or 1 according to that test.
5432	Not necessarily a splat vector. Right now I'm only doing the optimization for a build_vector of elements from one single vector. But for an INSERTPS we can have up to 3 (non-zero) elements from one vector (if their destination index is the same as the source index) and one (non-zero) element from another. We can also set any position to zero. This test is here to bail early and serve as a start for further optimizations. But I think if I optimized for every case, this diff would be too big. Optimizing for every case will also take time, since the IR for this can vary wildly (extractelement + insertelement + shufflevector, etc).
test/CodeGen/X86/sse41.ll
331	Eventually all these shuf_???? should be reduced to a single insertps instruction. Right now my patch doesn't accomplish this, since we need to match several other cases (and also match on lowershufflevector). I can match the exact code that should be emitted, if you prefer. I can also match more code in the functions that aren't yet fully reduced to an insertps instruction. Let me know if you want me to do any of these. I figured splitting this optimization would be easier to review and accept and it still improves our code generation gradually.

Let's assume that you need to insert 2 or 3 elements that are extracted from vector X.
But you create an INSERTPS node that sets one element to UNDEF.

return DAG.getNode(X86ISD::INSERTPS, dl, VT, V, DAG.getUNDEF(VT),
                   InsertpsMask);

Elena

I'm not following. Those still work. I have examples in the tests and they succeed.

I just ran them through llc, along with additional examples (two slight modifications that shuffle the zeros around, as described in the function names). Here is llc's output, including the comments for insertps, which show what we want:

_shuf_XYZ0:                             ## @shuf_XYZ0
  .cfi_startproc
## BB#0:                                ## %entry
  insertps  $8, %xmm0, %xmm0 ## xmm0 = xmm0[0,1,2],zero
  retq
  .cfi_endproc

  .globl    _shuf_0YZW
  .align    4, 0x90
_shuf_0YZW:                             ## @shuf_0YZW
  .cfi_startproc
## BB#0:                                ## %entry
  insertps  $81, %xmm0, %xmm0 ## xmm0 = zero,xmm0[1,2,3]
  retq
  .cfi_endproc

  .globl    _shuf_XY00
  .align    4, 0x90
_shuf_XY00:                             ## @shuf_XY00
  .cfi_startproc
## BB#0:                                ## %entry
  insertps  $12, %xmm0, %xmm0 ## xmm0 = xmm0[0,1],zero,zero
  retq
  .cfi_endproc

  .globl    _shuf_X0Z0
  .align    4, 0x90
_shuf_X0Z0:                             ## @shuf_X0Z0
  .cfi_startproc
## BB#0:                                ## %entry
  insertps  $10, %xmm0, %xmm0 ## xmm0 = xmm0[0],zero,xmm0[2],zero
  retq
  .cfi_endproc

If I misunderstood, please provide additional details.

Thanks,
Filipe
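
The immediates in the llc output above can be checked by hand. The sketch below is a standalone model of the register form of insertps, not LLVM code: bits 7:6 of the immediate select the source element, bits 5:4 select the destination element, and bits 3:0 name the lanes to zero. The names Vec4, InsertpsImm, decodeInsertpsImm, encodeInsertpsImm and simulateInsertps are made up for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec4 = std::array<float, 4>;

// Fields of the insertps imm8 operand (illustrative struct, not an LLVM type).
struct InsertpsImm {
  unsigned CountS; // bits 7:6, element read from the source operand
  unsigned CountD; // bits 5:4, element overwritten in the destination
  unsigned ZMask;  // bits 3:0, destination lanes cleared afterwards
};

InsertpsImm decodeInsertpsImm(std::uint8_t Imm) {
  return {(Imm >> 6) & 3u, (Imm >> 4) & 3u, Imm & 0xfu};
}

std::uint8_t encodeInsertpsImm(unsigned CountS, unsigned CountD,
                               unsigned ZMask) {
  assert(CountS < 4 && CountD < 4 && ZMask < 16 && "field out of range");
  return static_cast<std::uint8_t>(CountS << 6 | CountD << 4 | ZMask);
}

// Register-source form: copy Src[CountS] into Dst[CountD], then zero every
// lane whose bit is set in ZMask.
Vec4 simulateInsertps(Vec4 Dst, const Vec4 &Src, std::uint8_t Imm) {
  InsertpsImm F = decodeInsertpsImm(Imm);
  Dst[F.CountD] = Src[F.CountS];
  for (unsigned I = 0; I < 4; ++I)
    if (F.ZMask & (1u << I))
      Dst[I] = 0.0f;
  return Dst;
}
```

With X = (1, 2, 3, 4) passed as both operands, the immediates 8, 81, 12 and 10 give (1, 2, 3, 0), (0, 2, 3, 4), (1, 2, 0, 0) and (1, 0, 3, 0) in this model, which agrees with the lane comments llc printed.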

You see the right code because it is a small test and %xmm0 is used.
Try to run this test:

define <4 x float> @test(<4 x float> %x) {
  %vecext = extractelement <4 x float> %x, i32 0
  %vecinit = insertelement <4 x float> undef, float %vecext, i32 0
  %vecext1 = extractelement <4 x float> %x, i32 1
  %vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 1
  %vecext3 = extractelement <4 x float> %x, i32 2
  %vecinit4 = insertelement <4 x float> %vecinit2, float %vecext3, i32 2
  %vecinit5 = insertelement <4 x float> %vecinit4, float 0.0, i32 3

  %mask = fcmp olt <4 x float> %vecinit5, %x

  %res = select <4 x i1> %mask, <4 x float> %x, <4 x float> %vecinit5
  ret <4 x float> %res
}

Elena

Fix the undef bug pointed out by Elena.

Thanks for the detailed report, I finally understood the bug's cause :-)

Filipe
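
One way to see the undef problem is to model insertps with a second operand whose contents are arbitrary. The sketch below is a standalone model, not LLVM code, and the function names and lane values are invented for illustration. It assumes the operand order used in the quoted getNode call: the first vector is the one being modified, the second supplies the inserted element. Under that assumption the lane selected by countD is overwritten from the second operand, so it keeps its value only when both operands are the same vector.

```cpp
#include <array>
#include <cassert>

using Vec4 = std::array<int, 4>;

// Model of register-form insertps: Dst[CountD] = Src[CountS], then every lane
// whose bit is set in ZMask is cleared.
Vec4 insertps(Vec4 Dst, const Vec4 &Src, unsigned CountS, unsigned CountD,
              unsigned ZMask) {
  assert(CountS < 4 && CountD < 4 && ZMask < 16 && "field out of range");
  Dst[CountD] = Src[CountS];
  for (unsigned I = 0; I < 4; ++I)
    if (ZMask & (1u << I))
      Dst[I] = 0;
  return Dst;
}

// True when inserting from Other gives the same result as inserting V into V.
bool sameAsSelfInsert(const Vec4 &V, const Vec4 &Other, unsigned CountS,
                      unsigned CountD, unsigned ZMask) {
  return insertps(V, V, CountS, CountD, ZMask) ==
         insertps(V, Other, CountS, CountD, ZMask);
}
```

For V = (1, 2, 3, 4), an arbitrary second operand (90, 91, 92, 93) and the XYZ0 fields (countS = 0, countD = 0, zmask = 8), the model returns (90, 2, 3, 0) instead of (1, 2, 3, 0). That is consistent with Elena's remark that the small tests only looked right because %xmm0 ended up being used for both operands.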

delena added inline comments. Apr 27 2014, 2:06 AM

lib/Target/X86/X86ISelLowering.cpp
5423	If it is a zeroNode, something has to take care of it. I don't understand how your tests with zeroes work.
5447	You can't insert V into V. If you want to "copy" 3 elements and insert 1, you should write (INSERTPS, dl, VT, V, scalar_to_vector(elt), index). If you want to copy 2 elements and insert 2, you can't use INSERTPS at all.
test/CodeGen/X86/sse41.ll
353	What code is generated here?

filcab added inline comments. Apr 27 2014, 11:42 AM

lib/Target/X86/X86ISelLowering.cpp
5423	This is testing if this element of the build_vector is a zeroNode. If it is, we're still ok to do the optimization, since we can insert 0 wherever we want. What I can do is simply not check for zero or undef. It won't change CorrectIdx nor change the comparison of CorrectIdx and NumNonZero.
5447	For now this optimization is only dealing with inserting 0 in vectors. For inserting 0 in the vectors, it is acceptable to insert V into V (with countD == countS == the index of one of the elements that won't be turned into 0). In the future, it will have to be changed to insert a V0 into a V1 (or vice-versa), with a special case for when we're moving an element inside V0. e.g: (x,y,z,w) -> (x,z,z,w) or (x,y,z,w) -> (x,0,0,x), etc. We're only using insertps iff NumNonZero (which was counted in Lowerbuildvector and is the number of non-zero+non-undef elements) is equal to CorrectIdx (which is the number of elements from V that are inserted in the new vector with the same index they had in V). Since we know this, we can use insertps with V as both vector arguments.
test/CodeGen/X86/sse41.ll
353	_shuf_XY00:                             ## @shuf_XY00
  .cfi_startproc
## BB#0:                                ## %entry
  insertps  $12, %xmm0, %xmm0 ## encoding: [0x66,0x0f,0x3a,0x21,0xc0,0x0c]
                              ## xmm0 = xmm0[0,1],zero,zero
  retq                        ## encoding: [0xc3]
  .cfi_endproc
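
The NumNonZero == CorrectIdx test described in the 5423 and 5447 replies can be modelled outside of the SelectionDAG. The sketch below is not the patch itself: Lane and zeroingInsertpsImm are hypothetical names, every Extract lane is assumed to read from the same source vector V, and NonZeros is assumed to have one bit set per lane that is neither zero nor undef.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for one build_vector operand.
struct Lane {
  enum Kind { Undef, Zero, Extract } K;
  unsigned SrcIdx; // for Extract: index read from the source vector V
};

// Returns the insertps immediate, or -1 when some element would have to
// change index (the case this first revision bails out on).
int zeroingInsertpsImm(const std::array<Lane, 4> &Lanes) {
  unsigned NonZeros = 0, NumNonZero = 0, CorrectIdx = 0, FirstNonZeroIdx = 0;
  for (unsigned I = 0; I < 4; ++I) {
    if (Lanes[I].K != Lane::Extract)
      continue; // zero and undef lanes never block the transform
    if (NumNonZero == 0)
      FirstNonZeroIdx = I;
    NonZeros |= 1u << I;
    ++NumNonZero;
    if (Lanes[I].SrcIdx == I)
      ++CorrectIdx; // element keeps the index it had in V
  }
  assert(NumNonZero > 0 && "caller guarantees a non-zero element");
  if (NumNonZero != CorrectIdx)
    return -1;
  // countS == countD == FirstNonZeroIdx, so that lane is copied onto itself.
  // Undef lanes land in the zero mask too, which is fine for an undef lane.
  return static_cast<int>(FirstNonZeroIdx << 6 | FirstNonZeroIdx << 4 |
                          (~NonZeros & 0xfu));
}
```

For the XYZ0, 0YZW and X0Z0 patterns this model yields 8, 81 and 10, the same immediates llc printed earlier in the thread. A pattern such as (x, z, 0, 0) returns -1 because lane 1 does not keep its index.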

Perform the optimization when we're moving a single element to a different
place in the vector, while zeroing out some other elements.

(sorry about the last revision. I misused arc diff)

Vectors built with zeros and elements in the same order as another
(source) vector are optimized to be built using a single insertps
instruction.
Also optimize when we move one element in a vector to a different place
in that vector while zeroing out some of the other elements.

Further optimizations are possible, described in TODO comments.
I will be implementing at least some of them in the near future.

Added some tests for different cases where this optimization triggers.

delena added inline comments. Apr 28 2014, 11:23 PM

lib/Target/X86/X86ISelLowering.cpp
5461	Let's assume that the first element of the BUILD_VECTOR is "undef" and the first element of EXTRACT does not go to any place. FirstNonZeroIdx = 2, for example. The INSERTPS instruction you use copies the first element to another place, because this instruction can manipulate the first element only.

filcab added inline comments. Apr 29 2014, 1:07 AM

lib/Target/X86/X86ISelLowering.cpp
5461	Hi Elena, I'm not understanding what you mean, sorry. I tried making examples from your comment, but they seem to work and do the correct modification. When we find a non-zero element that has the correct index, we copy it to the same place, IF there's no element that has to change index (we use that index for countS and countD in the insertps mask). If there is an element that changes index, then we use its old and new indices for the insertps mask. Here are a couple of examples:
ll file: https://gist.github.com/filcab/11392909
llc -debug output: https://gist.github.com/filcab/11392943
Thanks, Filipe
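
The rule in the reply above (copy in place unless exactly one element changes index, in which case use its old and new indices) can be sketched as a small standalone function. This is a model of the rule as described in the comment, not the code from the updated revision, which is not shown on this page. The name insertpsImmWithOneMove and the -1 encoding for zero/undef lanes are assumptions made for the example.

```cpp
#include <array>
#include <cassert>

// SrcIdx[I] is the index in V that feeds result lane I, or -1 when the lane
// is zero or undef. Returns the insertps immediate for inserting V into V,
// or -1 if this model finds no single insertps that builds the vector.
int insertpsImmWithOneMove(const std::array<int, 4> &SrcIdx) {
  unsigned ZMask = 0;
  int InPlace = -1, MovedFrom = -1, MovedTo = -1;
  for (int I = 0; I < 4; ++I) {
    assert(SrcIdx[I] >= -1 && SrcIdx[I] < 4 && "bad lane description");
    if (SrcIdx[I] < 0) {
      ZMask |= 1u << I; // zero (or undef) lane goes into the zero mask
    } else if (SrcIdx[I] == I) {
      if (InPlace < 0)
        InPlace = I; // first element that keeps its index
    } else if (MovedTo < 0) {
      MovedFrom = SrcIdx[I]; // old index becomes countS
      MovedTo = I;           // new index becomes countD
    } else {
      return -1; // insertps can relocate at most one element
    }
  }
  if (MovedTo < 0 && InPlace < 0)
    return -1; // nothing but zero/undef lanes
  unsigned CountS = MovedTo >= 0 ? MovedFrom : InPlace;
  unsigned CountD = MovedTo >= 0 ? MovedTo : InPlace;
  return static_cast<int>(CountS << 6 | CountD << 4 | ZMask);
}
```

Under this model (x,y,z,w) -> (x,0,0,x) gives countS = 0, countD = 3 and zmask = 0b0110, and (x,y,z,w) -> (x,z,z,w) gives countS = 2, countD = 1 and an empty zero mask. A vector that needs two elements relocated is rejected.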

Please ignore my last message. It's my mistake. I don't see any other problem in the code itself. I just think that such a bunch of extracts and inserts could be converted to one shuffle.

Elena

Yes, some of these kinds of functions do get turned into shufflevectors instead of build_vectors. I will address these in subsequent patches.

Is that ok?

Filipe

Ping. I'm not sure if your last message was an LGTM, Elena.

LGTM

filcab accepted this revision. May 7 2014, 5:53 PM

filcab added a reviewer: filcab.

This revision is now accepted and ready to land. May 7 2014, 5:53 PM

filcab closed this revision. May 7 2014, 5:53 PM

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp	60 lines
test/CodeGen/X86/sse41.ll	149 lines

Diff 8868

lib/Target/X86/X86ISelLowering.cpp

[5,390 lines not shown]	if (isNonZero) {
          MVT::v8i16, V, Op.getOperand(i),
          DAG.getIntPtrConstant(i));
    }
  }

  return V;
}

		/// LowerBuildVectorv4x32 - Custom lower build_vector of v4i32 or v4f32.
		static SDValue LowerBuildVectorv4x32(SDValue Op, unsigned NumElems,
		                                     unsigned NonZeros, unsigned NumNonZero,
		                                     unsigned NumZero, SelectionDAG &DAG,
		                                     const X86Subtarget *Subtarget,
		                                     const TargetLowering &TLI) {
		  // We know there's at least one non-zero element
		  unsigned FirstNonZeroIdx = 0;
		  SDValue FirstNonZero = Op->getOperand(FirstNonZeroIdx);
		  while (FirstNonZero.getOpcode() == ISD::UNDEF ||
		         X86::isZeroNode(FirstNonZero)) {
		    ++FirstNonZeroIdx;
		    FirstNonZero = Op->getOperand(FirstNonZeroIdx);
		  }

		  if (FirstNonZero.getOpcode() != ISD::EXTRACT_VECTOR_ELT)
		    return SDValue();

		  SDValue V = FirstNonZero.getOperand(0);
		  unsigned CorrectIdx = cast<ConstantSDNode>(FirstNonZero.getOperand(1))
		                            ->getZExtValue() == FirstNonZeroIdx;

		  for (unsigned Idx = FirstNonZeroIdx + 1; Idx < NumElems; ++Idx) {
		    SDValue Elem = Op.getOperand(Idx);
		    if (Elem.getOpcode() == ISD::UNDEF || X86::isZeroNode(Elem))
		      continue;

		    // TODO: What else can be here? Deal with it.
		    if (Elem.getOpcode() != ISD::EXTRACT_VECTOR_ELT)
		      return SDValue();

		    // TODO: Some optimizations are still possible here
		    // ex: Getting one element from a vector, and the rest from another.
		    if (Elem.getOperand(0) != V)
		      return SDValue();

		    if (cast<ConstantSDNode>(Elem.getOperand(1))->getZExtValue() == Idx)
		      ++CorrectIdx;
		  }

		  if (NumNonZero != CorrectIdx)
		    return SDValue();

		  // We're copying a vector and setting some values to 0
		  SDLoc dl(Op);
		  EVT VT = Op.getSimpleValueType();
		  SDValue InsertpsMask = DAG.getIntPtrConstant(
		      FirstNonZeroIdx << 6 | FirstNonZeroIdx << 4 | (~NonZeros & 0xf));
		  return DAG.getNode(X86ISD::INSERTPS, dl, VT, V, DAG.getUNDEF(VT),
		                     InsertpsMask);
		}

/// getVShift - Return a vector logical shift node.
///
static SDValue getVShift(bool isLeft, EVT VT, SDValue SrcOp,
                         unsigned NumBits, SelectionDAG &DAG,
                         const TargetLowering &TLI, SDLoc dl) {
  assert(VT.is128BitVector() && "Unknown type for VShift");
  EVT ShVT = MVT::v2i64;
  unsigned Opc = isLeft ? X86ISD::VSHLDQ : X86ISD::VSRLDQ;
  SrcOp = DAG.getNode(ISD::BITCAST, dl, ShVT, SrcOp);
  return DAG.getNode(ISD::BITCAST, dl, VT,
                     DAG.getNode(Opc, dl, ShVT, SrcOp,
                                 DAG.getConstant(NumBits,
                                     TLI.getScalarShiftAmountTy(SrcOp.getValueType()))));
}

static SDValue
LowerAsSplatVectorLoad(SDValue SrcOp, MVT VT, SDLoc dl, SelectionDAG &DAG) {

  // Check if the scalar load can be widened into a vector load. And if
[724 lines not shown]	X86TargetLowering::LowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG) const {
  }

  if (EVTBits == 16 && NumElems == 8) {
    SDValue V = LowerBuildVectorv8i16(Op, NonZeros,NumNonZero,NumZero, DAG,
                                      Subtarget, *this);
    if (V.getNode()) return V;
  }

		  // If element VT is == 32 bits, try to generate an INSERTPS
		  if (EVTBits == 32) {
		    SDValue V = LowerBuildVectorv4x32(Op, NumElems, NonZeros, NumNonZero,
		                                      NumZero, DAG, Subtarget, *this);
		    if (V.getNode())
		      return V;
		  }

  // If element VT is == 32 bits, turn it into a number of shuffles.
  SmallVector<SDValue, 8> V(NumElems);
  if (NumElems == 4 && NumZero > 0) {
    for (unsigned i = 0; i < 4; ++i) {
      bool isZero = !(NonZeros & (1 << i));
      if (isZero)
        V[i] = getZeroVector(VT, Subtarget, DAG, dl);
      else
[14,670 lines not shown]

test/CodeGen/X86/sse41.ll

[314 lines not shown]
; CHECK-NOT: shufps
; CHECK: insertps $32,
; CHECK: ret
  %1 = load i32* %b, align 4
  %2 = insertelement <4 x i32> undef, i32 %1, i32 0
  %result = shufflevector <4 x i32> %a, <4 x i32> %2, <4 x i32> <i32 0, i32 1, i32 4, i32 3>
  ret <4 x i32> %result
}

				;;;;;;; Shuffles optimizable with a single insertps instruction
				define <4 x float> @shuf_XYZ0(<4 x float> %x, <4 x float> %a) {
				; CHECK-LABEL: shuf_XYZ0:
				; CHECK-NOT: pextrd
				; CHECK-NOT: punpckldq
				; CHECK: insertps $8
				; CHECK: ret
				%vecext = extractelement <4 x float> %x, i32 0
				%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
				%vecext1 = extractelement <4 x float> %x, i32 1
				%vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 1
				%vecext3 = extractelement <4 x float> %x, i32 2
				%vecinit4 = insertelement <4 x float> %vecinit2, float %vecext3, i32 2
				%vecinit5 = insertelement <4 x float> %vecinit4, float 0.0, i32 3
				ret <4 x float> %vecinit5
				}

				define <4 x float> @shuf_XY00(<4 x float> %x, <4 x float> %a) {
				; CHECK-LABEL: shuf_XY00:
				; CHECK-NOT: pextrd
				; CHECK-NOT: punpckldq
				; CHECK: insertps $12
				; CHECK: ret
				%vecext = extractelement <4 x float> %x, i32 0
				%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
				%vecext1 = extractelement <4 x float> %x, i32 1
				%vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 1
				%vecinit3 = insertelement <4 x float> %vecinit2, float 0.0, i32 2
				%vecinit4 = insertelement <4 x float> %vecinit3, float 0.0, i32 3
				ret <4 x float> %vecinit4
				}

				define <4 x float> @shuf_X00A(<4 x float> %x, <4 x float> %a) {
				; CHECK-LABEL: shuf_X00A:
				; CHECK-NOT: movaps
				; CHECK-NOT: shufps
				; CHECK: insertps $48
				; CHECK: ret
				%vecext = extractelement <4 x float> %x, i32 0
				%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
				%vecinit1 = insertelement <4 x float> %vecinit, float 0.0, i32 1
				%vecinit2 = insertelement <4 x float> %vecinit1, float 0.0, i32 2
				%vecinit4 = shufflevector <4 x float> %vecinit2, <4 x float> %a, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
				ret <4 x float> %vecinit4
				}

				define <4 x float> @shuf_X00X(<4 x float> %x, <4 x float> %a) {
				; CHECK-LABEL: shuf_X00X:
				; CHECK-NOT: movaps
				; CHECK-NOT: shufps
				; CHECK: insertps $48
				; CHECK: ret
				%vecext = extractelement <4 x float> %x, i32 0
				%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
				%vecinit1 = insertelement <4 x float> %vecinit, float 0.0, i32 1
				%vecinit2 = insertelement <4 x float> %vecinit1, float 0.0, i32 2
				%vecinit4 = shufflevector <4 x float> %vecinit2, <4 x float> %x, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
				ret <4 x float> %vecinit4
				}

				define <4 x float> @shuf_X0YC(<4 x float> %x, <4 x float> %a) {
				; CHECK-LABEL: shuf_X0YC:
				; CHECK: shufps
				; CHECK-NOT: movhlps
				; CHECK-NOT: shufps
				; CHECK: insertps $176
				; CHECK: ret
				%vecext = extractelement <4 x float> %x, i32 0
				%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
				%vecinit1 = insertelement <4 x float> %vecinit, float 0.0, i32 1
				%vecinit3 = shufflevector <4 x float> %vecinit1, <4 x float> %x, <4 x i32> <i32 0, i32 1, i32 5, i32 undef>
				%vecinit5 = shufflevector <4 x float> %vecinit3, <4 x float> %a, <4 x i32> <i32 0, i32 1, i32 2, i32 6>
				ret <4 x float> %vecinit5
				}

				define <4 x i32> @i32_shuf_XYZ0(<4 x i32> %x, <4 x i32> %a) {
				; CHECK-LABEL: i32_shuf_XYZ0:
				; CHECK-NOT: pextrd
				; CHECK-NOT: punpckldq
				; CHECK: insertps $8
				; CHECK: ret
				%vecext = extractelement <4 x i32> %x, i32 0
				%vecinit = insertelement <4 x i32> undef, i32 %vecext, i32 0
				%vecext1 = extractelement <4 x i32> %x, i32 1
				%vecinit2 = insertelement <4 x i32> %vecinit, i32 %vecext1, i32 1
				%vecext3 = extractelement <4 x i32> %x, i32 2
				%vecinit4 = insertelement <4 x i32> %vecinit2, i32 %vecext3, i32 2
				%vecinit5 = insertelement <4 x i32> %vecinit4, i32 0, i32 3
				ret <4 x i32> %vecinit5
				}

				define <4 x i32> @i32_shuf_XY00(<4 x i32> %x, <4 x i32> %a) {
				; CHECK-LABEL: i32_shuf_XY00:
				; CHECK-NOT: pextrd
				; CHECK-NOT: punpckldq
				; CHECK: insertps $12
				; CHECK: ret
				%vecext = extractelement <4 x i32> %x, i32 0
				%vecinit = insertelement <4 x i32> undef, i32 %vecext, i32 0
				%vecext1 = extractelement <4 x i32> %x, i32 1
				%vecinit2 = insertelement <4 x i32> %vecinit, i32 %vecext1, i32 1
				%vecinit3 = insertelement <4 x i32> %vecinit2, i32 0, i32 2
				%vecinit4 = insertelement <4 x i32> %vecinit3, i32 0, i32 3
				ret <4 x i32> %vecinit4
				}

				define <4 x i32> @i32_shuf_X00A(<4 x i32> %x, <4 x i32> %a) {
				; CHECK-LABEL: i32_shuf_X00A:
				; CHECK-NOT: movaps
				; CHECK-NOT: shufps
				; CHECK: insertps $48
				; CHECK: ret
				%vecext = extractelement <4 x i32> %x, i32 0
				%vecinit = insertelement <4 x i32> undef, i32 %vecext, i32 0
				%vecinit1 = insertelement <4 x i32> %vecinit, i32 0, i32 1
				%vecinit2 = insertelement <4 x i32> %vecinit1, i32 0, i32 2
				%vecinit4 = shufflevector <4 x i32> %vecinit2, <4 x i32> %a, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
				ret <4 x i32> %vecinit4
				}

				define <4 x i32> @i32_shuf_X00X(<4 x i32> %x, <4 x i32> %a) {
				; CHECK-LABEL: i32_shuf_X00X:
				; CHECK-NOT: movaps
				; CHECK-NOT: shufps
				; CHECK: insertps $48
				; CHECK: ret
				%vecext = extractelement <4 x i32> %x, i32 0
				%vecinit = insertelement <4 x i32> undef, i32 %vecext, i32 0
				%vecinit1 = insertelement <4 x i32> %vecinit, i32 0, i32 1
				%vecinit2 = insertelement <4 x i32> %vecinit1, i32 0, i32 2
				%vecinit4 = shufflevector <4 x i32> %vecinit2, <4 x i32> %x, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
				ret <4 x i32> %vecinit4
				}

				define <4 x i32> @i32_shuf_X0YC(<4 x i32> %x, <4 x i32> %a) {
				; CHECK-LABEL: i32_shuf_X0YC:
				; CHECK: shufps
				; CHECK-NOT: movhlps
				; CHECK-NOT: shufps
				; CHECK: insertps $176
				; CHECK: ret
				%vecext = extractelement <4 x i32> %x, i32 0
				%vecinit = insertelement <4 x i32> undef, i32 %vecext, i32 0
				%vecinit1 = insertelement <4 x i32> %vecinit, i32 0, i32 1
				%vecinit3 = shufflevector <4 x i32> %vecinit1, <4 x i32> %x, <4 x i32> <i32 0, i32 1, i32 5, i32 undef>
				%vecinit5 = shufflevector <4 x i32> %vecinit3, <4 x i32> %a, <4 x i32> <i32 0, i32 1, i32 2, i32 6>
				ret <4 x i32> %vecinit5
				}