This is an archive of the discontinued LLVM Phabricator instance.

[X86] Improved lowering of v4x32 build_vector dag nodes.
ClosedPublic

Authored by andreadb on Nov 18 2014, 11:32 AM.

Details

Reviewers

qcolombet
grosbach
nadav
delena

Commits

rG1b657bfcc807: [X86] Improved lowering of v4x32 build_vector dag nodes.
rL222375: [X86] Improved lowering of v4x32 build_vector dag nodes.

Summary

Hi Quentin, Nadav (and all),

This patch improves the lowering of v4f32 and v4i32 build_vector dag nodes to blend/insertps.
In particular, this patch improves the function 'LowerBuildVectorv4x32', which works under the following preconditions:

the input build_vector is not a build_vector of all zeros;
the input build_vector has at least one non-zero element.

This patch improves the previous behavior as follows:

1) A build_vector that performs a blend with a zero vector is converted to a shuffle.
2) We now identify more opportunities to lower a build_vector into an insertps with zero masking.

About 1), this is to let the shuffle legalizer expand the dag node in an optimal way. In particular, this helps improve the codegen in cases where an insertps is selected instead of a movq or a blend (see the differences in tests sse41.ll and sse2.ll).

About 2), we now get much better codegen in all the new test cases added in sse41.ll.
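
The blend-with-zero case can be pictured with a small standalone model. This is only a sketch, not the LLVM code: `Lane` and `blendWithZeroMask` are hypothetical names, source vectors are represented by plain integer ids instead of SDValues, and C++17 is assumed for std::optional.

```cpp
#include <array>
#include <cassert>
#include <optional>

// Hypothetical stand-in for one build_vector operand (not an LLVM type).
struct Lane {
  bool Zeroable;  // operand is undef or a zero constant
  int Source;     // id of the vector it is extracted from (if not zeroable)
  unsigned Index; // constant extract index (if not zeroable)
};

// Returns a 4-entry shuffle mask when the lanes describe a blend of a single
// source vector with zero, i.e. every non-zeroable lane I reads element I of
// the same source. Entries 0..3 select from the source (left-hand side),
// entries 4..7 select from the zero vector (right-hand side).
// Note: an all-zeroable input yields the mask {4,5,6,7} in this model; the
// real function is never called on such a node.
std::optional<std::array<int, 4>>
blendWithZeroMask(const std::array<Lane, 4> &Lanes) {
  std::array<int, 4> Mask{};
  int Source = -1; // source id of the first non-zeroable lane
  for (unsigned I = 0; I < 4; ++I) {
    if (Lanes[I].Zeroable) {
      Mask[I] = static_cast<int>(I) + 4;
      continue;
    }
    assert(Lanes[I].Source >= 0 && "source ids are non-negative in this model");
    if (Source < 0)
      Source = Lanes[I].Source;
    if (Lanes[I].Source != Source || Lanes[I].Index != I)
      return std::nullopt; // not a plain blend with zero
    Mask[I] = static_cast<int>(I);
  }
  return Mask;
}
```

In this model, lanes (x[0], x[1], x[2], zero) produce the mask {0, 1, 2, 7}, while a build_vector that mixes two source vectors produces no mask and would have to be handled by the insertps path instead.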

For example:
;;
define <4 x float> @insertps_7(<4 x float> %A, <4 x float> %B) #0 {
entry:

%vecext = extractelement <4 x float> %A, i32 0
%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
%vecinit1 = insertelement <4 x float> %vecinit, float 0.000000e+00, i32 1
%vecext2 = extractelement <4 x float> %B, i32 1
%vecinit3 = insertelement <4 x float> %vecinit1, float %vecext2, i32 2
%vecinit4 = insertelement <4 x float> %vecinit3, float 0.000000e+00, i32 3
ret <4 x float> %vecinit4

}
;;

Before this patch, the backend generated the following assembly:

shufps $-27, %xmm1, %xmm1
xorps %xmm2, %xmm2
blendps $14, %xmm2, %xmm0
blendps $14, %xmm2, %xmm1
unpcklpd %xmm1, %xmm0
retq

With this patch, the backend correctly lowers the build_vector to insertps:

insertps $170, %xmm1, %xmm0 # xmm0 = xmm0[0],zero,xmm1[1],zero
retq
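
For readers unfamiliar with the insertps immediate, the following is a minimal scalar sketch of its register-to-register form. The immediate layout (bits 7:6 source lane, bits 5:4 destination lane, bits 3:0 zero mask) is the one the patch builds with `EltMaskIdx << 6 | EltIdx << 4 | ZMask`. The names `encodeInsertPS`, `insertps` and `V4` are made up for this illustration and are not LLVM APIs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V4 = std::array<float, 4>;

// Packs an INSERTPS immediate from its three fields.
inline std::uint8_t encodeInsertPS(unsigned SrcLane, unsigned DstLane,
                                   unsigned ZMask) {
  assert(SrcLane < 4 && DstLane < 4 && ZMask < 16 && "field out of range");
  return static_cast<std::uint8_t>((SrcLane << 6) | (DstLane << 4) | ZMask);
}

// Scalar emulation of register-register INSERTPS: copy one lane of Src into
// one lane of Dst, then clear every lane whose zero-mask bit is set.
inline V4 insertps(V4 Dst, const V4 &Src, std::uint8_t Imm) {
  Dst[(Imm >> 4) & 3] = Src[(Imm >> 6) & 3];
  for (unsigned I = 0; I < 4; ++I)
    if (Imm & (1u << I))
      Dst[I] = 0.0f;
  return Dst;
}
```

With this layout, `encodeInsertPS(1, 2, 0xA)` (source lane 1 into destination lane 2, lanes 1 and 3 zeroed) evaluates to 0x6A, and `encodeInsertPS(2, 2, 0xA)` evaluates to 0xAA (170).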

Please let me know if ok to submit.
Thanks,
Andrea

Diff Detail

Repository: rL LLVM

Event Timeline

andreadb updated this revision to Diff 16345. Nov 18 2014, 11:32 AM

andreadb retitled this revision to [X86] Improved lowering of v4x32 build_vector dag nodes.

andreadb updated this object.

andreadb edited the test plan for this revision.

andreadb added reviewers: qcolombet, nadav, grosbach, delena.

andreadb added a subscriber: Unknown Object (MLST).

andreadb added inline comments. Nov 18 2014, 12:59 PM

lib/Target/X86/X86ISelLowering.cpp
5752–5754 ↗	(On Diff #16345)	I just realized that this check could be improved. The assertion should check that the number of non-zero elements is strictly greater than 1 (and not >= 1). It cannot be 1 because build_vector nodes of i32 or f32 elements that have only one non-zero element are expanded before this function is called. I will correct it before sending.

Uploaded a new version of the patch.
This time the precondition on the number of non-zero elements of the input build_vector is checked correctly.
I also added a missing check on the value type of the vectors used as input to extract_vector_elt dag nodes.

Hi Andrea,

LGTM.

Thanks,
-Quentin

lib/Target/X86/X86ISelLowering.cpp
5784 ↗	(On Diff #16378)	Just add a comment that the zero vector will be on the RHS, which is why it is EltIdx + 4.
5789 ↗	(On Diff #16378)	Add a comment that by construction Elt is an EXTRACT_VECTOR_ELT with constant index.
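
The `EltIdx + 4` convention comes from how a two-input, 4-wide shuffle mask is read: entries 0..3 pick from the left-hand operand and entries 4..7 pick from the right-hand operand. A minimal sketch of that rule follows, using the hypothetical names `Vec4i` and `applyShuffle` (undef mask entries are deliberately not modeled):

```cpp
#include <array>
#include <cassert>

using Vec4i = std::array<int, 4>;

// Applies a two-input shuffle mask: entries 0..3 read from LHS,
// entries 4..7 read from RHS (element Mask[I] - 4).
inline Vec4i applyShuffle(const Vec4i &LHS, const Vec4i &RHS,
                          const std::array<int, 4> &Mask) {
  Vec4i Out{};
  for (unsigned I = 0; I < 4; ++I) {
    assert(Mask[I] >= 0 && Mask[I] < 8 && "undef entries are not modeled");
    Out[I] = Mask[I] < 4 ? LHS[Mask[I]] : RHS[Mask[I] - 4];
  }
  return Out;
}
```

With an all-zero right-hand operand, the mask {0, 1, 2, 7} keeps the first three lanes of the left-hand operand and takes lane 3 from the zero vector.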

This revision is now accepted and ready to land. Nov 19 2014, 10:49 AM

Closed by commit rL222375 (authored by adibiagio).

Thanks Quentin!

I added the extra comments in the code.
Committed revision 222375.

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	152 lines
llvm/trunk/test/CodeGen/X86/sse2.ll	11 lines
llvm/trunk/test/CodeGen/X86/sse41.ll	150 lines

Diff 16394

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

if (isNonZero) {
DAG.getIntPtrConstant(i));		DAG.getIntPtrConstant(i));
}		}
}		}

return V;		return V;
}		}

/// LowerBuildVectorv4x32 - Custom lower build_vector of v4i32 or v4f32.		/// LowerBuildVectorv4x32 - Custom lower build_vector of v4i32 or v4f32.
static SDValue LowerBuildVectorv4x32(SDValue Op, unsigned NumElems,		static SDValue LowerBuildVectorv4x32(SDValue Op, SelectionDAG &DAG,
unsigned NonZeros, unsigned NumNonZero,
unsigned NumZero, SelectionDAG &DAG,
const X86Subtarget *Subtarget,		const X86Subtarget *Subtarget,
const TargetLowering &TLI) {		const TargetLowering &TLI) {
// We know there's at least one non-zero element		// Find all zeroable elements.
unsigned FirstNonZeroIdx = 0;		bool Zeroable[4];
SDValue FirstNonZero = Op->getOperand(FirstNonZeroIdx);		for (int i=0; i < 4; ++i) {
while (FirstNonZero.getOpcode() == ISD::UNDEF ||		SDValue Elt = Op->getOperand(i);
X86::isZeroNode(FirstNonZero)) {		Zeroable[i] = (Elt.getOpcode() == ISD::UNDEF || X86::isZeroNode(Elt));
++FirstNonZeroIdx;
FirstNonZero = Op->getOperand(FirstNonZeroIdx);
}		}
		assert(std::count_if(&Zeroable[0], &Zeroable[4],
if (FirstNonZero.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||		[](bool M) { return !M; }) > 1 &&
!isa<ConstantSDNode>(FirstNonZero.getOperand(1)))		"We expect at least two non-zero elements!");

		// We only know how to deal with build_vector nodes where elements are either
		// zeroable or extract_vector_elt with constant index.
		SDValue FirstNonZero;
		for (int i=0; i < 4; ++i) {
		if (Zeroable[i])
		continue;
		SDValue Elt = Op->getOperand(i);
		if (Elt.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
		!isa<ConstantSDNode>(Elt.getOperand(1)))
return SDValue();		return SDValue();
		// Make sure that this node is extracting from a 128-bit vector.
SDValue V = FirstNonZero.getOperand(0);		MVT VT = Elt.getOperand(0).getSimpleValueType();
MVT VVT = V.getSimpleValueType();		if (!VT.is128BitVector())
if (!Subtarget->hasSSE41() || (VVT != MVT::v4f32 && VVT != MVT::v4i32))
return SDValue();		return SDValue();
		if (!FirstNonZero.getNode())
		FirstNonZero = Elt;
		}

unsigned FirstNonZeroDst =		assert(FirstNonZero.getNode() && "Unexpected build vector of all zeros!");
cast<ConstantSDNode>(FirstNonZero.getOperand(1))->getZExtValue();		SDValue V1 = FirstNonZero.getOperand(0);
unsigned CorrectIdx = FirstNonZeroDst == FirstNonZeroIdx;		MVT VT = V1.getSimpleValueType();
unsigned IncorrectIdx = CorrectIdx ? -1U : FirstNonZeroIdx;
unsigned IncorrectDst = CorrectIdx ? -1U : FirstNonZeroDst;

for (unsigned Idx = FirstNonZeroIdx + 1; Idx < NumElems; ++Idx) {		// See if this build_vector can be lowered as a blend with zero.
SDValue Elem = Op.getOperand(Idx);		SDValue Elt;
if (Elem.getOpcode() == ISD::UNDEF || X86::isZeroNode(Elem))		unsigned EltMaskIdx, EltIdx;
		int Mask[4];
		for (EltIdx = 0; EltIdx < 4; ++EltIdx) {
		if (Zeroable[EltIdx]) {
		// The zero vector will be on the right hand side.
		Mask[EltIdx] = EltIdx+4;
continue;		continue;
		}

// TODO: What else can be here? Deal with it.		Elt = Op->getOperand(EltIdx);
if (Elem.getOpcode() != ISD::EXTRACT_VECTOR_ELT)		// By construction, Elt is a EXTRACT_VECTOR_ELT with constant index.
return SDValue();		EltMaskIdx = cast<ConstantSDNode>(Elt.getOperand(1))->getZExtValue();
		if (Elt.getOperand(0) != V1 || EltMaskIdx != EltIdx)
		break;
		Mask[EltIdx] = EltIdx;
		}

// TODO: Some optimizations are still possible here		if (EltIdx == 4) {
// ex: Getting one element from a vector, and the rest from another.		// Let the shuffle legalizer deal with blend operations.
if (Elem.getOperand(0) != V)		SDValue VZero = getZeroVector(VT, Subtarget, DAG, SDLoc(Op));
return SDValue();		if (V1.getSimpleValueType() != VT)
		V1 = DAG.getNode(ISD::BITCAST, SDLoc(V1), VT, V1);
		return DAG.getVectorShuffle(VT, SDLoc(V1), V1, VZero, &Mask[0]);
		}

unsigned Dst = cast<ConstantSDNode>(Elem.getOperand(1))->getZExtValue();		// See if we can lower this build_vector to a INSERTPS.
if (Dst == Idx)		if (!Subtarget->hasSSE41())
++CorrectIdx;
else if (IncorrectIdx == -1U) {
IncorrectIdx = Idx;
IncorrectDst = Dst;
} else
// There was already one element with an incorrect index.
// We can't optimize this case to an insertps.
return SDValue();		return SDValue();
}

if (NumNonZero == CorrectIdx || NumNonZero == CorrectIdx + 1) {		SDValue V2 = Elt.getOperand(0);
SDLoc dl(Op);		if (Elt == FirstNonZero)
EVT VT = Op.getSimpleValueType();		V1 = SDValue();
unsigned ElementMoveMask = 0;
if (IncorrectIdx == -1U)
ElementMoveMask = FirstNonZeroIdx << 6 | FirstNonZeroIdx << 4;
else
ElementMoveMask = IncorrectDst << 6 | IncorrectIdx << 4;

SDValue InsertpsMask =		bool CanFold = true;
DAG.getIntPtrConstant(ElementMoveMask | (~NonZeros & 0xf));		for (unsigned i = EltIdx + 1; i < 4 && CanFold; ++i) {
return DAG.getNode(X86ISD::INSERTPS, dl, VT, V, V, InsertpsMask);		if (Zeroable[i])
		continue;

		SDValue Current = Op->getOperand(i);
		SDValue SrcVector = Current->getOperand(0);
		if (!V1.getNode())
		V1 = SrcVector;
		CanFold = SrcVector == V1 &&
		cast<ConstantSDNode>(Current.getOperand(1))->getZExtValue() == i;
}		}

		if (!CanFold)
return SDValue();		return SDValue();

		assert(V1.getNode() && "Expected at least two non-zero elements!");
		if (V1.getSimpleValueType() != MVT::v4f32)
		V1 = DAG.getNode(ISD::BITCAST, SDLoc(V1), MVT::v4f32, V1);
		if (V2.getSimpleValueType() != MVT::v4f32)
		V2 = DAG.getNode(ISD::BITCAST, SDLoc(V2), MVT::v4f32, V2);

		// Ok, we can emit an INSERTPS instruction.
		unsigned ZMask = 0;
		for (int i = 0; i < 4; ++i)
		if (Zeroable[i])
		ZMask |= 1 << i;

		unsigned InsertPSMask = EltMaskIdx << 6 | EltIdx << 4 | ZMask;
		assert((InsertPSMask & ~0xFFu) == 0 && "Invalid mask!");
		SDValue Result = DAG.getNode(X86ISD::INSERTPS, SDLoc(Op), MVT::v4f32, V1, V2,
		DAG.getIntPtrConstant(InsertPSMask));
		return DAG.getNode(ISD::BITCAST, SDLoc(Op), VT, Result);
}		}

/// getVShift - Return a vector logical shift node.		/// getVShift - Return a vector logical shift node.
///		///
static SDValue getVShift(bool isLeft, EVT VT, SDValue SrcOp,		static SDValue getVShift(bool isLeft, EVT VT, SDValue SrcOp,
unsigned NumBits, SelectionDAG &DAG,		unsigned NumBits, SelectionDAG &DAG,
const TargetLowering &TLI, SDLoc dl) {		const TargetLowering &TLI, SDLoc dl) {
assert(VT.is128BitVector() && "Unknown type for VShift");		assert(VT.is128BitVector() && "Unknown type for VShift");
X86TargetLowering::LowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG) const {
if (EVTBits == 16 && NumElems == 8) {		if (EVTBits == 16 && NumElems == 8) {
SDValue V = LowerBuildVectorv8i16(Op, NonZeros,NumNonZero,NumZero, DAG,		SDValue V = LowerBuildVectorv8i16(Op, NonZeros,NumNonZero,NumZero, DAG,
Subtarget, *this);		Subtarget, *this);
if (V.getNode()) return V;		if (V.getNode()) return V;
}		}

// If element VT is == 32 bits and has 4 elems, try to generate an INSERTPS		// If element VT is == 32 bits and has 4 elems, try to generate an INSERTPS
if (EVTBits == 32 && NumElems == 4) {		if (EVTBits == 32 && NumElems == 4) {
SDValue V = LowerBuildVectorv4x32(Op, NumElems, NonZeros, NumNonZero,		SDValue V = LowerBuildVectorv4x32(Op, DAG, Subtarget, *this);
NumZero, DAG, Subtarget, *this);
if (V.getNode())		if (V.getNode())
return V;		return V;
}		}

// If element VT is == 32 bits, turn it into a number of shuffles.		// If element VT is == 32 bits, turn it into a number of shuffles.
SmallVector<SDValue, 8> V(NumElems);		SmallVector<SDValue, 8> V(NumElems);
if (NumElems == 4 && NumZero > 0) {		if (NumElems == 4 && NumZero > 0) {
for (unsigned i = 0; i < 4; ++i) {		for (unsigned i = 0; i < 4; ++i) {

llvm/trunk/test/CodeGen/X86/sse2.ll

	; CHECK-NEXT: retl			; CHECK-NEXT: retl
	%1 = shufflevector <2 x i64> %i, <2 x i64> <i64 0, i64 undef>, <2 x i32> <i32 0, i32 2>			%1 = shufflevector <2 x i64> %i, <2 x i64> <i64 0, i64 undef>, <2 x i32> <i32 0, i32 2>
	ret <2 x i64> %1			ret <2 x i64> %1
	}			}

	define <4 x i32> @PR19721(<4 x i32> %i) {			define <4 x i32> @PR19721(<4 x i32> %i) {
	; CHECK-LABEL: PR19721:			; CHECK-LABEL: PR19721:
	; CHECK: ## BB#0:			; CHECK: ## BB#0:
	; CHECK-NEXT: pshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; CHECK-NEXT: xorps %xmm1, %xmm1
	; CHECK-NEXT: movd %xmm1, %eax
	; CHECK-NEXT: pshufd {{.*#+}} xmm1 = xmm0[3,1,2,3]
	; CHECK-NEXT: movd %xmm1, %ecx
	; CHECK-NEXT: pshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]
	; CHECK-NEXT: pxor %xmm0, %xmm0
	; CHECK-NEXT: movss %xmm1, %xmm0			; CHECK-NEXT: movss %xmm1, %xmm0
	; CHECK-NEXT: movd %ecx, %xmm1
	; CHECK-NEXT: movd %eax, %xmm2
	; CHECK-NEXT: punpckldq {{.*#+}} xmm2 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
	; CHECK-NEXT: shufps {{.*#+}} xmm0 = xmm0[1,0],xmm2[0,1]
	; CHECK-NEXT: retl			; CHECK-NEXT: retl
	%bc = bitcast <4 x i32> %i to i128			%bc = bitcast <4 x i32> %i to i128
	%insert = and i128 %bc, -4294967296			%insert = and i128 %bc, -4294967296
	%bc2 = bitcast i128 %insert to <4 x i32>			%bc2 = bitcast i128 %insert to <4 x i32>
	ret <4 x i32> %bc2			ret <4 x i32> %bc2
	}			}

	define <4 x i32> @test_mul(<4 x i32> %x, <4 x i32> %y) {			define <4 x i32> @test_mul(<4 x i32> %x, <4 x i32> %y) {

llvm/trunk/test/CodeGen/X86/sse41.ll

; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0],xmm0[3]		; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0],xmm0[3]
; X64-NEXT: retq		; X64-NEXT: retq
%1 = load i32* %b, align 4		%1 = load i32* %b, align 4
%2 = insertelement <4 x i32> undef, i32 %1, i32 0		%2 = insertelement <4 x i32> undef, i32 %1, i32 0
%result = shufflevector <4 x i32> %a, <4 x i32> %2, <4 x i32> <i32 0, i32 1, i32 4, i32 3>		%result = shufflevector <4 x i32> %a, <4 x i32> %2, <4 x i32> <i32 0, i32 1, i32 4, i32 3>
ret <4 x i32> %result		ret <4 x i32> %result
}		}

;;;;;; Shuffles optimizable with a single insertps instruction		;;;;;; Shuffles optimizable with a single insertps or blend instruction
define <4 x float> @shuf_XYZ0(<4 x float> %x, <4 x float> %a) {		define <4 x float> @shuf_XYZ0(<4 x float> %x, <4 x float> %a) {
; X32-LABEL: shuf_XYZ0:		; X32-LABEL: shuf_XYZ0:
; X32: ## BB#0:		; X32: ## BB#0:
; X32-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1,2],zero		; X32-NEXT: xorps %xmm1, %xmm1
		; X32-NEXT: blendps {{.*#+}} xmm0 = xmm0[0,1,2],xmm1[3]
; X32-NEXT: retl		; X32-NEXT: retl
;		;
; X64-LABEL: shuf_XYZ0:		; X64-LABEL: shuf_XYZ0:
; X64: ## BB#0:		; X64: ## BB#0:
; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1,2],zero		; X64-NEXT: xorps %xmm1, %xmm1
		; X64-NEXT: blendps {{.*#+}} xmm0 = xmm0[0,1,2],xmm1[3]
; X64-NEXT: retq		; X64-NEXT: retq
%vecext = extractelement <4 x float> %x, i32 0		%vecext = extractelement <4 x float> %x, i32 0
%vecinit = insertelement <4 x float> undef, float %vecext, i32 0		%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
%vecext1 = extractelement <4 x float> %x, i32 1		%vecext1 = extractelement <4 x float> %x, i32 1
%vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 1		%vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 1
%vecext3 = extractelement <4 x float> %x, i32 2		%vecext3 = extractelement <4 x float> %x, i32 2
%vecinit4 = insertelement <4 x float> %vecinit2, float %vecext3, i32 2		%vecinit4 = insertelement <4 x float> %vecinit2, float %vecext3, i32 2
%vecinit5 = insertelement <4 x float> %vecinit4, float 0.0, i32 3		%vecinit5 = insertelement <4 x float> %vecinit4, float 0.0, i32 3
ret <4 x float> %vecinit5		ret <4 x float> %vecinit5
}		}

define <4 x float> @shuf_XY00(<4 x float> %x, <4 x float> %a) {		define <4 x float> @shuf_XY00(<4 x float> %x, <4 x float> %a) {
; X32-LABEL: shuf_XY00:		; X32-LABEL: shuf_XY00:
; X32: ## BB#0:		; X32: ## BB#0:
; X32-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1],zero,zero		; X32-NEXT: movq %xmm0, %xmm0
; X32-NEXT: retl		; X32-NEXT: retl
;		;
; X64-LABEL: shuf_XY00:		; X64-LABEL: shuf_XY00:
; X64: ## BB#0:		; X64: ## BB#0:
; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1],zero,zero		; X64-NEXT: movq %xmm0, %xmm0
; X64-NEXT: retq		; X64-NEXT: retq
%vecext = extractelement <4 x float> %x, i32 0		%vecext = extractelement <4 x float> %x, i32 0
%vecinit = insertelement <4 x float> undef, float %vecext, i32 0		%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
%vecext1 = extractelement <4 x float> %x, i32 1		%vecext1 = extractelement <4 x float> %x, i32 1
%vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 1		%vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 1
%vecinit3 = insertelement <4 x float> %vecinit2, float 0.0, i32 2		%vecinit3 = insertelement <4 x float> %vecinit2, float 0.0, i32 2
%vecinit4 = insertelement <4 x float> %vecinit3, float 0.0, i32 3		%vecinit4 = insertelement <4 x float> %vecinit3, float 0.0, i32 3
ret <4 x float> %vecinit4		ret <4 x float> %vecinit4
; X64-NEXT: retq
%vecinit3 = shufflevector <4 x float> %vecinit1, <4 x float> %x, <4 x i32> <i32 0, i32 1, i32 5, i32 undef>		%vecinit3 = shufflevector <4 x float> %vecinit1, <4 x float> %x, <4 x i32> <i32 0, i32 1, i32 5, i32 undef>
%vecinit5 = shufflevector <4 x float> %vecinit3, <4 x float> %a, <4 x i32> <i32 0, i32 1, i32 2, i32 6>		%vecinit5 = shufflevector <4 x float> %vecinit3, <4 x float> %a, <4 x i32> <i32 0, i32 1, i32 2, i32 6>
ret <4 x float> %vecinit5		ret <4 x float> %vecinit5
}		}

define <4 x i32> @i32_shuf_XYZ0(<4 x i32> %x, <4 x i32> %a) {		define <4 x i32> @i32_shuf_XYZ0(<4 x i32> %x, <4 x i32> %a) {
; X32-LABEL: i32_shuf_XYZ0:		; X32-LABEL: i32_shuf_XYZ0:
; X32: ## BB#0:		; X32: ## BB#0:
; X32-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1,2],zero		; X32-NEXT: pxor %xmm1, %xmm1
		; X32-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,5],xmm1[6,7]
; X32-NEXT: retl		; X32-NEXT: retl
;		;
; X64-LABEL: i32_shuf_XYZ0:		; X64-LABEL: i32_shuf_XYZ0:
; X64: ## BB#0:		; X64: ## BB#0:
; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1,2],zero		; X64-NEXT: pxor %xmm1, %xmm1
		; X64-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0,1,2,3,4,5],xmm1[6,7]
; X64-NEXT: retq		; X64-NEXT: retq
%vecext = extractelement <4 x i32> %x, i32 0		%vecext = extractelement <4 x i32> %x, i32 0
%vecinit = insertelement <4 x i32> undef, i32 %vecext, i32 0		%vecinit = insertelement <4 x i32> undef, i32 %vecext, i32 0
%vecext1 = extractelement <4 x i32> %x, i32 1		%vecext1 = extractelement <4 x i32> %x, i32 1
%vecinit2 = insertelement <4 x i32> %vecinit, i32 %vecext1, i32 1		%vecinit2 = insertelement <4 x i32> %vecinit, i32 %vecext1, i32 1
%vecext3 = extractelement <4 x i32> %x, i32 2		%vecext3 = extractelement <4 x i32> %x, i32 2
%vecinit4 = insertelement <4 x i32> %vecinit2, i32 %vecext3, i32 2		%vecinit4 = insertelement <4 x i32> %vecinit2, i32 %vecext3, i32 2
%vecinit5 = insertelement <4 x i32> %vecinit4, i32 0, i32 3		%vecinit5 = insertelement <4 x i32> %vecinit4, i32 0, i32 3
ret <4 x i32> %vecinit5		ret <4 x i32> %vecinit5
}		}

define <4 x i32> @i32_shuf_XY00(<4 x i32> %x, <4 x i32> %a) {		define <4 x i32> @i32_shuf_XY00(<4 x i32> %x, <4 x i32> %a) {
; X32-LABEL: i32_shuf_XY00:		; X32-LABEL: i32_shuf_XY00:
; X32: ## BB#0:		; X32: ## BB#0:
; X32-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1],zero,zero		; X32-NEXT: movq %xmm0, %xmm0
; X32-NEXT: retl		; X32-NEXT: retl
;		;
; X64-LABEL: i32_shuf_XY00:		; X64-LABEL: i32_shuf_XY00:
; X64: ## BB#0:		; X64: ## BB#0:
; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0,1],zero,zero		; X64-NEXT: movq %xmm0, %xmm0
; X64-NEXT: retq		; X64-NEXT: retq
%vecext = extractelement <4 x i32> %x, i32 0		%vecext = extractelement <4 x i32> %x, i32 0
%vecinit = insertelement <4 x i32> undef, i32 %vecext, i32 0		%vecinit = insertelement <4 x i32> undef, i32 %vecext, i32 0
%vecext1 = extractelement <4 x i32> %x, i32 1		%vecext1 = extractelement <4 x i32> %x, i32 1
%vecinit2 = insertelement <4 x i32> %vecinit, i32 %vecext1, i32 1		%vecinit2 = insertelement <4 x i32> %vecinit, i32 %vecext1, i32 1
%vecinit3 = insertelement <4 x i32> %vecinit2, i32 0, i32 2		%vecinit3 = insertelement <4 x i32> %vecinit2, i32 0, i32 2
%vecinit4 = insertelement <4 x i32> %vecinit3, i32 0, i32 3		%vecinit4 = insertelement <4 x i32> %vecinit3, i32 0, i32 3
ret <4 x i32> %vecinit4		ret <4 x i32> %vecinit4
; X64-NEXT: retq
%vecinit5 = shufflevector <4 x i32> %vecinit3, <4 x i32> %a, <4 x i32> <i32 0, i32 1, i32 2, i32 6>		%vecinit5 = shufflevector <4 x i32> %vecinit3, <4 x i32> %a, <4 x i32> <i32 0, i32 1, i32 2, i32 6>
ret <4 x i32> %vecinit5		ret <4 x i32> %vecinit5
}		}

;; Test for a bug in the first implementation of LowerBuildVectorv4x32		;; Test for a bug in the first implementation of LowerBuildVectorv4x32
define < 4 x float> @test_insertps_no_undef(<4 x float> %x) {		define < 4 x float> @test_insertps_no_undef(<4 x float> %x) {
; X32-LABEL: test_insertps_no_undef:		; X32-LABEL: test_insertps_no_undef:
; X32: ## BB#0:		; X32: ## BB#0:
; X32-NEXT: movaps %xmm0, %xmm1		; X32-NEXT: xorps %xmm1, %xmm1
; X32-NEXT: insertps {{.*#+}} xmm1 = xmm1[0,1,2],zero		; X32-NEXT: blendps {{.*#+}} xmm1 = xmm0[0,1,2],xmm1[3]
; X32-NEXT: maxps %xmm1, %xmm0		; X32-NEXT: maxps %xmm1, %xmm0
; X32-NEXT: retl		; X32-NEXT: retl
;		;
; X64-LABEL: test_insertps_no_undef:		; X64-LABEL: test_insertps_no_undef:
; X64: ## BB#0:		; X64: ## BB#0:
; X64-NEXT: movaps %xmm0, %xmm1		; X64-NEXT: xorps %xmm1, %xmm1
; X64-NEXT: insertps {{.*#+}} xmm1 = xmm1[0,1,2],zero		; X64-NEXT: blendps {{.*#+}} xmm1 = xmm0[0,1,2],xmm1[3]
; X64-NEXT: maxps %xmm1, %xmm0		; X64-NEXT: maxps %xmm1, %xmm0
; X64-NEXT: retq		; X64-NEXT: retq
%vecext = extractelement <4 x float> %x, i32 0		%vecext = extractelement <4 x float> %x, i32 0
%vecinit = insertelement <4 x float> undef, float %vecext, i32 0		%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
%vecext1 = extractelement <4 x float> %x, i32 1		%vecext1 = extractelement <4 x float> %x, i32 1
%vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 1		%vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 1
%vecext3 = extractelement <4 x float> %x, i32 2		%vecext3 = extractelement <4 x float> %x, i32 2
%vecinit4 = insertelement <4 x float> %vecinit2, float %vecext3, i32 2		%vecinit4 = insertelement <4 x float> %vecinit2, float %vecext3, i32 2
; X64-NEXT: retq
%gather_load = shufflevector <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, <8 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>		%gather_load = shufflevector <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, <8 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
%shuffle109 = shufflevector <4 x i32> <i32 4, i32 5, i32 6, i32 7>, <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> ; 4 5 6 7		%shuffle109 = shufflevector <4 x i32> <i32 4, i32 5, i32 6, i32 7>, <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> ; 4 5 6 7
%shuffle116 = shufflevector <8 x i32> %gather_load, <8 x i32> undef, <4 x i32> <i32 3, i32 undef, i32 undef, i32 undef> ; 3 x x x		%shuffle116 = shufflevector <8 x i32> %gather_load, <8 x i32> undef, <4 x i32> <i32 3, i32 undef, i32 undef, i32 undef> ; 3 x x x
%shuffle117 = shufflevector <4 x i32> %shuffle109, <4 x i32> %shuffle116, <4 x i32> <i32 4, i32 3, i32 undef, i32 undef> ; 3 7 x x		%shuffle117 = shufflevector <4 x i32> %shuffle109, <4 x i32> %shuffle116, <4 x i32> <i32 4, i32 3, i32 undef, i32 undef> ; 3 7 x x
%ptrcast = bitcast i32* %RET to <4 x i32>*		%ptrcast = bitcast i32* %RET to <4 x i32>*
store <4 x i32> %shuffle117, <4 x i32>* %ptrcast, align 4		store <4 x i32> %shuffle117, <4 x i32>* %ptrcast, align 4
ret void		ret void
}		}

		define <4 x float> @insertps_4(<4 x float> %A, <4 x float> %B) {
		; X32-LABEL: insertps_4:
		; X32: ## BB#0:
		; X32-NEXT: insertps {{.*#+}} xmm0 = xmm0[0],zero,xmm1[2],zero
		; X32-NEXT: retl
		;
		; X64-LABEL: insertps_4:
		; X64: ## BB#0:
		; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0],zero,xmm1[2],zero
		; X64-NEXT: retq
		entry:
		%vecext = extractelement <4 x float> %A, i32 0
		%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
		%vecinit1 = insertelement <4 x float> %vecinit, float 0.000000e+00, i32 1
		%vecext2 = extractelement <4 x float> %B, i32 2
		%vecinit3 = insertelement <4 x float> %vecinit1, float %vecext2, i32 2
		%vecinit4 = insertelement <4 x float> %vecinit3, float 0.000000e+00, i32 3
		ret <4 x float> %vecinit4
		}

		define <4 x float> @insertps_5(<4 x float> %A, <4 x float> %B) {
		; X32-LABEL: insertps_5:
		; X32: ## BB#0:
		; X32-NEXT: insertps {{.*#+}} xmm0 = xmm0[0],xmm1[1],zero,zero
		; X32-NEXT: retl
		;
		; X64-LABEL: insertps_5:
		; X64: ## BB#0:
		; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0],xmm1[1],zero,zero
		; X64-NEXT: retq
		entry:
		%vecext = extractelement <4 x float> %A, i32 0
		%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
		%vecext1 = extractelement <4 x float> %B, i32 1
		%vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 1
		%vecinit3 = insertelement <4 x float> %vecinit2, float 0.000000e+00, i32 2
		%vecinit4 = insertelement <4 x float> %vecinit3, float 0.000000e+00, i32 3
		ret <4 x float> %vecinit4
		}

		define <4 x float> @insertps_6(<4 x float> %A, <4 x float> %B) {
		; X32-LABEL: insertps_6:
		; X32: ## BB#0:
		; X32-NEXT: insertps {{.*#+}} xmm0 = zero,xmm0[1],xmm1[2],zero
		; X32-NEXT: retl
		;
		; X64-LABEL: insertps_6:
		; X64: ## BB#0:
		; X64-NEXT: insertps {{.*#+}} xmm0 = zero,xmm0[1],xmm1[2],zero
		; X64-NEXT: retq
		entry:
		%vecext = extractelement <4 x float> %A, i32 1
		%vecinit = insertelement <4 x float> <float 0.000000e+00, float undef, float undef, float undef>, float %vecext, i32 1
		%vecext1 = extractelement <4 x float> %B, i32 2
		%vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 2
		%vecinit3 = insertelement <4 x float> %vecinit2, float 0.000000e+00, i32 3
		ret <4 x float> %vecinit3
		}

		define <4 x float> @insertps_7(<4 x float> %A, <4 x float> %B) {
		; X32-LABEL: insertps_7:
		; X32: ## BB#0:
		; X32-NEXT: insertps {{.*#+}} xmm0 = xmm0[0],zero,xmm1[1],zero
		; X32-NEXT: retl
		;
		; X64-LABEL: insertps_7:
		; X64: ## BB#0:
		; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0],zero,xmm1[1],zero
		; X64-NEXT: retq
		entry:
		%vecext = extractelement <4 x float> %A, i32 0
		%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
		%vecinit1 = insertelement <4 x float> %vecinit, float 0.000000e+00, i32 1
		%vecext2 = extractelement <4 x float> %B, i32 1
		%vecinit3 = insertelement <4 x float> %vecinit1, float %vecext2, i32 2
		%vecinit4 = insertelement <4 x float> %vecinit3, float 0.000000e+00, i32 3
		ret <4 x float> %vecinit4
		}

		define <4 x float> @insertps_8(<4 x float> %A, <4 x float> %B) {
		; X32-LABEL: insertps_8:
		; X32: ## BB#0:
		; X32-NEXT: insertps {{.*#+}} xmm0 = xmm0[0],xmm1[0],zero,zero
		; X32-NEXT: retl
		;
		; X64-LABEL: insertps_8:
		; X64: ## BB#0:
		; X64-NEXT: insertps {{.*#+}} xmm0 = xmm0[0],xmm1[0],zero,zero
		; X64-NEXT: retq
		entry:
		%vecext = extractelement <4 x float> %A, i32 0
		%vecinit = insertelement <4 x float> undef, float %vecext, i32 0
		%vecext1 = extractelement <4 x float> %B, i32 0
		%vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 1
		%vecinit3 = insertelement <4 x float> %vecinit2, float 0.000000e+00, i32 2
		%vecinit4 = insertelement <4 x float> %vecinit3, float 0.000000e+00, i32 3
		ret <4 x float> %vecinit4
		}

		define <4 x float> @insertps_9(<4 x float> %A, <4 x float> %B) {
		; X32-LABEL: insertps_9:
		; X32: ## BB#0:
		; X32-NEXT: insertps {{.*#+}} xmm1 = zero,xmm0[0],xmm1[2],zero
		; X32-NEXT: movaps %xmm1, %xmm0
		; X32-NEXT: retl
		;
		; X64-LABEL: insertps_9:
		; X64: ## BB#0:
		; X64-NEXT: insertps {{.*#+}} xmm1 = zero,xmm0[0],xmm1[2],zero
		; X64-NEXT: movaps %xmm1, %xmm0
		; X64-NEXT: retq
		entry:
		%vecext = extractelement <4 x float> %A, i32 0
		%vecinit = insertelement <4 x float> <float 0.000000e+00, float undef, float undef, float undef>, float %vecext, i32 1
		%vecext1 = extractelement <4 x float> %B, i32 2
		%vecinit2 = insertelement <4 x float> %vecinit, float %vecext1, i32 2
		%vecinit3 = insertelement <4 x float> %vecinit2, float 0.000000e+00, i32 3
		ret <4 x float> %vecinit3
		}