This is an archive of the discontinued LLVM Phabricator instance.

[X86] transform insertps to blendps when possible for better performance
Abandoned (Public)

Authored by spatel on Feb 24 2015, 1:41 PM.

Details

Summary

This patch adds a target-specific combine to transform insertps nodes into blendi nodes. We just have to check whether a translation of the immediate mask is possible.
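As a rough illustration (not the patch code itself), the immediate translation amounts to checking that the insertps immediate moves a lane onto the same lane with no zeroing, in which case the blendps mask is a one-hot bit for that lane:

#include <cstdint>

// Hedged sketch of the immediate check; names are illustrative.
// INSERTPS imm8 layout: bits [7:6] = source lane, bits [5:4] = dest lane,
// bits [3:0] = zero mask. A BLENDPS mask bit i selects lane i from the
// second source, so the rewrite only works lane-to-same-lane, no zeroing.
static bool canRewriteInsertPSAsBlend(uint8_t InsertPSImm, uint8_t &BlendMask) {
  unsigned SrcLane = (InsertPSImm >> 6) & 0x3;
  unsigned DstLane = (InsertPSImm >> 4) & 0x3;
  unsigned ZMask = InsertPSImm & 0xF;
  if (SrcLane != DstLane || ZMask != 0)
    return false;
  BlendMask = uint8_t(1u << DstLane);
  return true;
}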

Insertps has lower potential throughput than blendps on all x86 chips that I have surveyed. For example, on Haswell we can execute blendps on 3 different ports, but insertps is limited to 1. On SandyBridge, Piledriver, and Bulldozer, it's 2 vs. 1.

Doing this transform also reduces the number of patterns we have to match when optimizing scalar SSE code.

Diff Detail

Event Timeline

spatel updated this revision to Diff 20619.Feb 24 2015, 1:41 PM
spatel retitled this revision to [X86] transform insertps to blendps when possible for better performance.
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: mkuper, chandlerc, RKSimon.
spatel updated this revision to Diff 20621.Feb 24 2015, 1:45 PM

Updated patch: the previous diff had the patterns in X86InstrSSE.td commented out rather than deleted.

I wrote a test program that executes 100 of each inst type to confirm that we do achieve double the throughput on SandyBridge using blendps:
blendps : 5406907580 cycles for 150000000 iterations (36.05 cycles/iter).
insertps: 10869956010 cycles for 150000000 iterations (72.47 cycles/iter).
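The test program itself isn't attached; a minimal sketch of that style of throughput test (illustrative names and counts; compile with -msse4.1) is:

#include <cstdint>
#include <cstdio>
#include <x86intrin.h> // __rdtsc, _mm_blend_ps (SSE4.1)

int main() {
  __m128 b = _mm_set1_ps(2.0f);
  __m128 a0 = _mm_setzero_ps(), a1 = a0, a2 = a0, a3 = a0;
  uint64_t Start = __rdtsc();
  for (long i = 0; i != 150000000; ++i) {
    // Four independent dependency chains so throughput, not latency,
    // limits the loop; swap in _mm_insert_ps(aN, b, 0) to compare.
    a0 = _mm_blend_ps(a0, b, 1);
    a1 = _mm_blend_ps(a1, b, 1);
    a2 = _mm_blend_ps(a2, b, 1);
    a3 = _mm_blend_ps(a3, b, 1);
    // Empty asm keeps the compiler from folding the chains away.
    __asm__ volatile("" : "+x"(a0), "+x"(a1), "+x"(a2), "+x"(a3));
  }
  printf("%llu cycles\n", (unsigned long long)(__rdtsc() - Start));
  return 0;
}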

I also found that on AMD Jaguar (btver2), blendps has half the latency of insertps, so it's an even bigger win there.

ab added a subscriber: ab.Feb 26 2015, 4:05 PM
chandlerc edited edge metadata.Feb 27 2015, 2:01 PM

I'm not really sold on doing this as a target combine. Is there some reason we can't just produce the desired insertps or blendps when lowering? This doesn't seem likely to come up only after doing some other shuffle lowering, but maybe I'm not seeing why.

lib/Target/X86/X86ISelLowering.cpp
22964

Please use an explicit type. It's short and removes questions.

22966

This use of auto hurts readability IMO.

23004–23007

I think we need to fix this always, and I think we should just handle it here in the "may fold load" case.

The primary reason to use insertps is to fold a scalar load into the shuffle. Switching to blendps is a huge mistake there because now we have to do a scalar -> vector conversion first. In many cases, we may end up emitting insertps+blendps or movss+blendps, which doesn't seem like the right lowering to me.

23009–23031

Can you use something like getTargetShuffleMask to handle all of this?
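A rough sketch of what that might look like (hedged: this assumes getTargetShuffleMask, the static helper in X86ISelLowering.cpp, can decode an INSERTPS node into a generic mask):

// Recover a generic shuffle mask from the target node instead of
// hand-decoding the INSERTPS immediate, then check for a blend.
SmallVector<int, 4> Mask;
bool IsUnary;
if (!getTargetShuffleMask(N.getNode(), VT, Mask, IsUnary))
  return SDValue();
// blendps is possible iff every Mask[i] is either i or i + 4.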

spatel added a comment.Mar 1 2015, 9:24 AM

@chandlerc wrote:

I'm not really sold on doing this as a target combine. Is there some reason we can't just produce the desired insertps or blendps when lowering? This doesn't seem likely to come up only after doing some other shuffle lowering, but maybe I'm not seeing why.

Let me answer this question first before going to wrangle up some data on the other question:
I made this a target combine because I don't know how else to handle this case given our current intrinsic lowering:

define <4 x float> @blendps(<4 x float> %x, <4 x float> %y) {
  %0 = tail call <4 x float> @llvm.x86.sse41.insertps(<4 x float> %x, <4 x float> %y, i8 0)
  ret <4 x float> %0
}

declare <4 x float> @llvm.x86.sse41.insertps(<4 x float>, <4 x float>, i8)

This is the IR produced when a programmer uses SSE intrinsics in C source. It directly becomes an INSERTPS node via:

X86_INTRINSIC_DATA(sse41_insertps,    INTR_TYPE_3OP, X86ISD::INSERTPS, 0)
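For reference, this is roughly the C source that produces that IR (illustrative function name; compile with -msse4.1):

#include <smmintrin.h>

__m128 blendps(__m128 x, __m128 y) {
  // _mm_insert_ps with immediate 0 copies element 0 of y into element 0 of x.
  return _mm_insert_ps(x, y, 0);
}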
spatel added inline comments.Mar 2 2015, 2:52 PM
lib/Target/X86/X86ISelLowering.cpp
23004–23007

I think I understand the concern now. You're worried about the cases where we're loading into one of the higher lanes. That could easily require a bonus shuffle instruction if converted to a blendps. Let me resubmit the patch to only handle the single case of the low 32-bit lane because that should just be a movss from memory.

I was focused on the low element case, i.e., which of these exact sequences is better:

vmovss   C0(%rip), %xmm1   <--- load into low lane; no shuffling before blendps
vblendps $1, %xmm1, %xmm0, %xmm0

vs.

vinsertps $0, C0(%rip), %xmm0, %xmm0

Sequences of movss+blendps have better throughput on SandyBridge and Haswell than insertps.

Here's what IACA shows for the load cases:
SandyBridge: 2x throughput
Haswell: 2x throughput (we're limited by the loads here; blendps has 3x throughput on its own)

I wrote a test program to confirm the load case performance on SandyBridge:
blendps : 5381572012 cycles for 150000000 iterations (35.88 cycles/iter).
insertps: 10387753446 cycles for 150000000 iterations (69.25 cycles/iter).

This is for a string of 100 independent shuffle ops like:

vmovss          ones(%rip), %xmm0
vblendps        $1, %xmm0, %xmm1, %xmm1
vmovss          ones(%rip), %xmm0
vblendps        $1, %xmm0, %xmm2, %xmm2
...

On AMD Jaguar (btver2), independent strings of the load versions of blendps and insertps perform equivalently. For a string of dependent load+shuffle ops, I see the same 2x perf win for blendps due to its lower latency.

insertps is an extra-special wart in the SSE instruction set: it can't be extended to longer vectors (AVX, AVX512), so it will never get extra transistors thrown its way to improve its performance relative to better-defined vector instructions. We should be careful about generating insertps; some day it may end up microcoded.

spatel updated this revision to Diff 21055.Mar 2 2015, 3:47 PM
spatel edited edge metadata.

Updated patch to:

  1. Only handle the low-element to low-element insertps case (immediate == 0).
  2. Add a test case to confirm that we only transform the low-element insertps case.
  3. Remove uses of 'auto'.

Sorry for the delays...

spatel added a comment.Mar 4 2015, 7:55 AM

@chandlerc wrote:

I think we just need to change the SSE intrinsics to use generic shuffle IR rather than intrinsics. We shouldn't be worrying about re-combining the LLVM instruction intrinsics in the backend to speed things up. We should insist that code use generic IR as input if they want this kind of combining.

That's the 2nd suggestion I've gotten to rework the intrinsics in a week, so I guess I can't ignore that angle any longer. :)

So yes, let's put this patch on hold and see what happens when we start sending more generic shuffles down the line.
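For context, "more generic shuffles" here means the headers would emit a shufflevector instead of the target intrinsic. A hedged C sketch of that for the example above, using clang's __builtin_shufflevector (illustrative function name):

#include <xmmintrin.h>

static __inline__ __m128 blend_low(__m128 x, __m128 y) {
  // Lane 0 from y, lanes 1-3 from x: the generic-IR equivalent of
  // insertps with immediate 0. Indices 4-7 refer to the second operand (x).
  return __builtin_shufflevector(y, x, 0, 5, 6, 7);
}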

spatel abandoned this revision.Mar 13 2015, 2:55 PM

Abandoning:
Attempting to address the FIXME in D8332.
Patching the C header files to generate more generic shuffles is also underway.