This is an archive of the discontinued LLVM Phabricator instance.

[x86] eliminate unnecessary shuffling/moves with unary scalar math ops (PR21507)
Closed Public

Authored by spatel on May 5 2015, 2:01 PM.

Details

Reviewers

RKSimon
delena
mkuper

Commits

rGa9f6d3505d04: [x86] eliminate unnecessary shuffling/moves with unary scalar math ops (PR21507)
rL236740: [x86] eliminate unnecessary shuffling/moves with unary scalar math ops (PR21507)

Summary

This patch attempts to finish the job that was abandoned in D6958 following the refactoring in http://reviews.llvm.org/rL230221:

1. Uncomment the intrinsic def for the AVX r_Int instruction.
2. Add missing r_Int entries to the load folding tables; there are already tests that check these in "test/CodeGen/X86/fold-load-unops.ll", so I haven't added any more in this patch.
3. Add patterns to solve PR21507 ( https://llvm.org/bugs/show_bug.cgi?id=21507 ).

So instead of this:

movaps	%xmm0, %xmm1
rcpss	%xmm1, %xmm1
movss	%xmm1, %xmm0

We should now get:

rcpss	%xmm0, %xmm0

And instead of this:

vsqrtss	%xmm0, %xmm0, %xmm1
vblendps	$1, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm1[0],xmm0[1,2,3]

We should now get:

vsqrtss	%xmm0, %xmm0, %xmm0
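
As a rough illustration of why the shorter sequences are equivalent, here is a minimal C++ model of the lane behavior. It is only a sketch: `V4`, `scalar_unary`, `movss`, `old_sequence` and `new_sequence` are hypothetical names invented for this example, and an exact `1.0f / v` stands in for the hardware reciprocal approximation.

```cpp
#include <array>

// Four float lanes standing in for one XMM register (hypothetical model type).
using V4 = std::array<float, 4>;

// Model of an SSE scalar unary instruction such as `rcpss src, dst`:
// lane 0 of dst becomes op(src[0]); lanes 1..3 of dst are left unchanged.
template <typename Op>
V4 scalar_unary(V4 dst, const V4 &src, Op op) {
  dst[0] = op(src[0]);
  return dst;
}

// Model of register-form `movss src, dst`: lane 0 from src, lanes 1..3 from dst.
V4 movss(V4 dst, const V4 &src) {
  dst[0] = src[0];
  return dst;
}

// The three-instruction sequence from the summary.
template <typename Op>
V4 old_sequence(const V4 &x, Op op) {
  V4 tmp = x;                        // movaps %xmm0, %xmm1
  tmp = scalar_unary(tmp, tmp, op);  // rcpss  %xmm1, %xmm1
  return movss(x, tmp);              // movss  %xmm1, %xmm0
}

// The single instruction the patch aims for.
template <typename Op>
V4 new_sequence(const V4 &x, Op op) {
  return scalar_unary(x, x, op);     // rcpss  %xmm0, %xmm0
}
```

In this model both functions return the same lanes for any input, because the scalar instruction already leaves lanes 1..3 of its destination untouched, so the copy and the merge add nothing.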

Diff Detail

Event Timeline

spatel updated this revision to Diff 24978. May 5 2015, 2:01 PM

spatel retitled this revision from to [x86] eliminate unnecessary shuffling/moves with unary scalar math ops (PR21507).

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: delena, mkuper, RKSimon.

spatel added a subscriber: Unknown Object (MLST).

Thanks Sanjay, in D9095 I added some tests to sse-scalar-fp-arith.ll for llvm.sqrt.f32 / llvm.sqrt.f64 - these don't appear to be optimized by this patch - is this something that could be easily added? Feel free to transfer them to sse-scalar-fp-arith-unary.ll.

lib/Target/X86/X86InstrInfo.cpp
529	excess whitespace before the comma

In D9504#166604, @RKSimon wrote:

Thanks Sanjay, in D9095 I added some tests to sse-scalar-fp-arith.ll for llvm.sqrt.f32 / llvm.sqrt.f64 - these don't appear to be optimized by this patch - is this something that could be easily added? Feel free to transfer them to sse-scalar-fp-arith-unary.ll.

Ah, the sqrt IR intrinsics as opposed to the sqrt SSE intrinsics. I didn't consider them specifically, but I did look at patterns that would match the X86frcp and X86frsqrt SDNodes + blend/move. It seemed to me that the conditions needed to produce that pattern were so far-fetched (-ffast-math + reciprocals enabled + NR refinement turned off + a scalar op in the middle of vector code?) that it wasn't worth the effort.

I'm probably not being imaginative enough, but here's my thinking: in order to get a scalar sqrt *IR* intrinsic from/to vector operands from C source, a coder would have to be explicitly using SSE intrinsics and then throw a libm sqrt() call into the mix. If the coder used _mm_sqrt_ss(), we wouldn't see a sqrt IR intrinsic. If the code was auto-vectorized, we also wouldn't see a scalar sqrt intrinsic; it would be a vector intrinsic, and then we wouldn't see this insert/extract pattern where we're just operating on the scalar lane?

That said, if this is a common enough occurrence, then what I'd hope to do is just add more defm lines instead of duplicating the multiclass of patterns to match SDNodes rather than Intrinsics, eg:

defm : scalar_unary_math_patterns<fsqrt, "SQRTSD", X86Movsd, v2f64, UseSSE2>;

...but I'm not sure how to do that in tablegen. Any suggestions?

In D9504#166800, @spatel wrote:
That said, if this is a common enough occurrence, then what I'd hope to do is just add more defm lines instead of duplicating the multiclass of patterns to match SDNodes rather than Intrinsics, eg:
defm : scalar_unary_math_patterns<fsqrt, "SQRTSD", X86Movsd, v2f64, UseSSE2>;
...but I'm not sure how to do that in tablegen. Any suggestions?

I think I have a hack-around: just make the param a string and then cast it, but after a little more thought, it's not enough. If we do want to optimize the scalar op case, we need a whole set of different patterns to match (as we do for the binops earlier in this file). That is, we're looking for something like this:

    0x7fcbf987f8c0: f64 = extract_vector_elt 0x7fcbf987f530, 0x7fcbf987f790 [ORD=2]
  0x7fcbf987f9f0: f64 = fsqrt 0x7fcbf987f8c0 [ORD=3]
0x7fcbf987fb20: v2f64 = insert_vector_elt 0x7fcbf987f530, 0x7fcbf987f9f0, 0x7fcbf987f790 [ORD=4]
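
For illustration, generic vector code of roughly the following shape is one plausible source of that extract / fsqrt / insert sequence. This is a sketch, not taken from the review: it relies on the GCC/Clang `vector_size` extension, `sqrt_lane0` is a hypothetical name, and whether the backend sees exactly this DAG likely depends on the compiler version and on flags such as `-fno-math-errno`.

```cpp
#include <cmath>

// GCC/Clang vector extension: two doubles packed in one 128-bit value.
typedef double v2f64 __attribute__((vector_size(16)));

// Replace lane 0 with its square root and leave lane 1 untouched.
// Conceptually: extract element 0, take the scalar sqrt, insert it back.
v2f64 sqrt_lane0(v2f64 v) {
  v[0] = std::sqrt(v[0]);
  return v;
}
```

No SSE intrinsic appears here, which is why the intrinsic-based patterns in this patch would not be expected to match it.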

Patch updated:
Simon and I discussed supporting the cases in sse-scalar-fp-arith.ll and how that IR can occur via generic vector code that relies on compiler builtins. We'll need more patterns to handle those; I don't see a quick fix.

1. Added a TODO comment regarding additional patterns.
2. Fixed extra space in memory folding table entry.
3. Changed test file to specify a triple rather than a cpu.
4. Modified check lines in the test file for generic x86-64 matches.

LGTM

This revision is now accepted and ready to land. May 6 2015, 1:44 PM

andreadb added a subscriber: andreadb. May 7 2015, 3:13 AM

andreadb added inline comments.

lib/Target/X86/X86InstrSSE.td
3417–3420	Hi Sanjay, do we still need this pattern? I may be wrong, but your new pattern (at line 3413) should make this one dead.

spatel added inline comments. May 7 2015, 8:42 AM

lib/Target/X86/X86InstrSSE.td
3417–3420	Thanks, Andrea. Yes, I'm not seeing how this pattern will do anything now, and there's no difference on any regression tests. I'll remove it.

Closed by commit rL236740: [x86] eliminate unnecessary shuffling/moves with unary scalar math ops (PR21507) (authored by spatel). May 7 2015, 8:52 AM

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in rL236743: Use intrinsic pattern to make a simpler match. May 7 2015, 9:55 AM

Revision Contents

Path	Size
lib/Target/X86/X86InstrInfo.cpp	6 lines
lib/Target/X86/X86InstrSSE.td	59 lines
test/CodeGen/X86/sse-scalar-fp-arith-unary.ll	73 lines

Diff 25064

lib/Target/X86/X86InstrInfo.cpp

@@ 520 lines not shown @@ static const X86MemoryFoldTableEntry MemoryFoldTable1[] = {
   { X86::PMOVZXDQrr, X86::PMOVZXDQrm, TB_ALIGN_16 },
   { X86::PMOVZXWDrr, X86::PMOVZXWDrm, TB_ALIGN_16 },
   { X86::PMOVZXWQrr, X86::PMOVZXWQrm, TB_ALIGN_16 },
   { X86::PSHUFDri, X86::PSHUFDmi, TB_ALIGN_16 },
   { X86::PSHUFHWri, X86::PSHUFHWmi, TB_ALIGN_16 },
   { X86::PSHUFLWri, X86::PSHUFLWmi, TB_ALIGN_16 },
   { X86::PTESTrr, X86::PTESTrm, TB_ALIGN_16 },
   { X86::RCPPSr, X86::RCPPSm, TB_ALIGN_16 },
+  { X86::RCPSSr, X86::RCPSSm, 0 },
+  { X86::RCPSSr_Int, X86::RCPSSm_Int, 0 },
   { X86::ROUNDPDr, X86::ROUNDPDm, TB_ALIGN_16 },
   { X86::ROUNDPSr, X86::ROUNDPSm, TB_ALIGN_16 },
   { X86::RSQRTPSr, X86::RSQRTPSm, TB_ALIGN_16 },
   { X86::RSQRTSSr, X86::RSQRTSSm, 0 },
   { X86::RSQRTSSr_Int, X86::RSQRTSSm_Int, 0 },
   { X86::SQRTPDr, X86::SQRTPDm, TB_ALIGN_16 },
   { X86::SQRTPSr, X86::SQRTPSm, TB_ALIGN_16 },
   { X86::SQRTSDr, X86::SQRTSDm, 0 },
@@ 697 lines not shown @@ static const X86MemoryFoldTableEntry MemoryFoldTable2[] = {
   { X86::Int_VCVTSI2SDrr, X86::Int_VCVTSI2SDrm, 0 },
   { X86::VCVTSI2SS64rr, X86::VCVTSI2SS64rm, 0 },
   { X86::Int_VCVTSI2SS64rr, X86::Int_VCVTSI2SS64rm, 0 },
   { X86::VCVTSI2SSrr, X86::VCVTSI2SSrm, 0 },
   { X86::Int_VCVTSI2SSrr, X86::Int_VCVTSI2SSrm, 0 },
   { X86::VCVTSS2SDrr, X86::VCVTSS2SDrm, 0 },
   { X86::Int_VCVTSS2SDrr, X86::Int_VCVTSS2SDrm, 0 },
   { X86::VRCPSSr, X86::VRCPSSm, 0 },
+  { X86::VRCPSSr_Int, X86::VRCPSSm_Int, 0 },
   { X86::VRSQRTSSr, X86::VRSQRTSSm, 0 },
+  { X86::VRSQRTSSr_Int, X86::VRSQRTSSm_Int, 0 },
   { X86::VSQRTSDr, X86::VSQRTSDm, 0 },
+  { X86::VSQRTSDr_Int, X86::VSQRTSDm_Int, 0 },
   { X86::VSQRTSSr, X86::VSQRTSSm, 0 },
+  { X86::VSQRTSSr_Int, X86::VSQRTSSm_Int, 0 },
   { X86::VADDPDrr, X86::VADDPDrm, 0 },
   { X86::VADDPSrr, X86::VADDPSrm, 0 },
   { X86::VADDSDrr, X86::VADDSDrm, 0 },
   { X86::VADDSDrr_Int, X86::VADDSDrm_Int, 0 },
   { X86::VADDSSrr, X86::VADDSSrm, 0 },
   { X86::VADDSSrr_Int, X86::VADDSSrm_Int, 0 },
   { X86::VADDSUBPDrr, X86::VADDSUBPDrm, 0 },
   { X86::VADDSUBPSrr, X86::VADDSUBPSrm, 0 },
@@ 5,144 lines not shown @@
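
To make the role of these table rows concrete, here is a deliberately simplified C++ sketch of a register-form to memory-form lookup. It is not LLVM's implementation: `FoldEntry`, `kAlign16`, `UnaryFoldTable` and `lookupFold` are hypothetical names, opcodes are plain strings rather than enum values, and the flag value is a placeholder rather than the real `TB_ALIGN_16` encoding.

```cpp
#include <cstring>

// Hypothetical, simplified stand-in for one row of a memory folding table.
struct FoldEntry {
  const char *RegOp;  // register-form opcode name
  const char *MemOp;  // memory-form opcode name
  unsigned Flags;     // alignment requirement; 0 means none
};

// Placeholder flag value for this sketch only; not LLVM's actual encoding.
constexpr unsigned kAlign16 = 16;

// A few rows mirroring the shape of the entries in the diff above.
static const FoldEntry UnaryFoldTable[] = {
    {"RCPPSr", "RCPPSm", kAlign16},
    {"RCPSSr", "RCPSSm", 0},
    {"RCPSSr_Int", "RCPSSm_Int", 0},
    {"RSQRTSSr", "RSQRTSSm", 0},
    {"RSQRTSSr_Int", "RSQRTSSm_Int", 0},
};

// Returns the matching row, or nullptr if the opcode has no foldable form.
const FoldEntry *lookupFold(const char *RegOp) {
  for (const FoldEntry &E : UnaryFoldTable)
    if (std::strcmp(E.RegOp, RegOp) == 0)
      return &E;
  return nullptr;
}
```

In this model an opcode without a row cannot have a load folded into it, which is one way to read the old "todo" comment in X86InstrSSE.td that tied enabling the r_Int forms to their being added to X86InstrInfo.cpp.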

lib/Target/X86/X86InstrSSE.td

@@ 3,363 lines not shown @@ def : Pat<(vt (OpNode mem_cpat:$src)),
         (vt (IMPLICIT_DEF)), mem_cpat:$src)), RC))>;
     // These are unary operations, but they are modeled as having 2 source operands
     // because the high elements of the destination are unchanged in SSE.
     def : Pat<(Intr VR128:$src),
         (!cast<Instruction>(NAME#Suffix##r_Int) VR128:$src, VR128:$src)>;
     def : Pat<(Intr (load addr:$src)),
         (vt (COPY_TO_REGCLASS(!cast<Instruction>(NAME#Suffix##m)
         addr:$src), VR128))>;
     def : Pat<(Intr mem_cpat:$src),
         (!cast<Instruction>(NAME#Suffix##m_Int)
         (vt (IMPLICIT_DEF)), mem_cpat:$src)>;
   }
 }

 multiclass avx_fp_unop_s<bits<8> opc, string OpcodeStr, RegisterClass RC,
     ValueType vt, ValueType ScalarVT,
     X86MemOperand x86memop, Operand vec_memop,
     ComplexPattern mem_cpat,
     Intrinsic Intr, SDNode OpNode, Domain d,
     OpndItins itins, Predicate target, string Suffix> {
   let hasSideEffects = 0 in {
   def r : I<opc, MRMSrcReg, (outs RC:$dst), (ins RC:$src1, RC:$src2),
       !strconcat(OpcodeStr, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
       [], itins.rr, d>, Sched<[itins.Sched]>;
   let mayLoad = 1 in
   def m : I<opc, MRMSrcMem, (outs RC:$dst), (ins RC:$src1, x86memop:$src2),
       !strconcat(OpcodeStr, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
       [], itins.rm, d>, Sched<[itins.Sched.Folded, ReadAfterLd]>;
   let isCodeGenOnly = 1 in {
-  // todo: uncomment when all r_Int forms will be added to X86InstrInfo.cpp
-  //def r_Int : I<opc, MRMSrcReg, (outs VR128:$dst),
-  // (ins VR128:$src1, VR128:$src2),
-  // !strconcat(OpcodeStr, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
-  // []>, Sched<[itins.Sched.Folded]>;
+  def r_Int : I<opc, MRMSrcReg, (outs VR128:$dst),
+      (ins VR128:$src1, VR128:$src2),
+      !strconcat(OpcodeStr, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
+      []>, Sched<[itins.Sched.Folded]>;
   let mayLoad = 1 in
   def m_Int : I<opc, MRMSrcMem, (outs VR128:$dst),
       (ins VR128:$src1, vec_memop:$src2),
       !strconcat(OpcodeStr, "\t{$src2, $src1, $dst|$dst, $src1, $src2}"),
       []>, Sched<[itins.Sched.Folded, ReadAfterLd]>;
   }
   }

   let Predicates = [target] in {
   def : Pat<(OpNode RC:$src), (!cast<Instruction>("V"#NAME#Suffix##r)
       (ScalarVT (IMPLICIT_DEF)), RC:$src)>;

   def : Pat<(vt (OpNode mem_cpat:$src)),
       (!cast<Instruction>("V"#NAME#Suffix##m_Int) (vt (IMPLICIT_DEF)),
       mem_cpat:$src)>;

-  // todo: use r_Int form when it will be ready
-  //def : Pat<(Intr VR128:$src), (!cast<Instruction>("V"#NAME#Suffix##r_Int)
-  // (VT (IMPLICIT_DEF)), VR128:$src)>;
+  def : Pat<(Intr VR128:$src),
+      (!cast<Instruction>("V"#NAME#Suffix##r_Int)
+      (vt (IMPLICIT_DEF)), VR128:$src)>;

   def : Pat<(Intr VR128:$src),
       (vt (COPY_TO_REGCLASS(
       !cast<Instruction>("V"#NAME#Suffix##r) (ScalarVT (IMPLICIT_DEF)),
       (ScalarVT (COPY_TO_REGCLASS VR128:$src, RC))), VR128))>;

   def : Pat<(Intr mem_cpat:$src),
       (!cast<Instruction>("V"#NAME#Suffix##m_Int)
       (vt (IMPLICIT_DEF)), mem_cpat:$src)>;
   }
   let Predicates = [target, OptForSize] in
   def : Pat<(ScalarVT (OpNode (load addr:$src))),
       (!cast<Instruction>("V"#NAME#Suffix##m) (ScalarVT (IMPLICIT_DEF)),
       addr:$src)>;
@@ 106 lines not shown @@
 // in order to obtain suitable precision.
 defm RSQRT : sse1_fp_unop_s<0x52, "rsqrt", X86frsqrt, SSE_RSQRTSS>,
     sse1_fp_unop_p<0x52, "rsqrt", X86frsqrt, SSE_RSQRTPS>;
 defm RCP : sse1_fp_unop_s<0x53, "rcp", X86frcp, SSE_RCPS>,
     sse1_fp_unop_p<0x53, "rcp", X86frcp, SSE_RCPP>;

 // There is no f64 version of the reciprocal approximation instructions.

+// TODO: We should add scalar op patterns for these just like we have for
+// the binops above. If the binop and unop patterns could all be unified
+// that would be even better.

+multiclass scalar_unary_math_patterns<Intrinsic Intr, string OpcPrefix,
+    SDNode Move, ValueType VT,
+    Predicate BasePredicate> {
+  let Predicates = [BasePredicate] in {
+    def : Pat<(VT (Move VT:$dst, (Intr VT:$src))),
+        (!cast<I>(OpcPrefix#r_Int) VT:$dst, VT:$src)>;
+  }

+  // With SSE 4.1, blendi is preferred to movs*, so match that too.
+  let Predicates = [UseSSE41] in {
+    def : Pat<(VT (X86Blendi VT:$dst, (Intr VT:$src), (i8 1))),
+        (!cast<I>(OpcPrefix#r_Int) VT:$dst, VT:$src)>;
+  }

+  // Repeat for AVX versions of the instructions.
+  let Predicates = [HasAVX] in {
+    def : Pat<(VT (Move VT:$dst, (Intr VT:$src))),
+        (!cast<I>("V"#OpcPrefix#r_Int) VT:$dst, VT:$src)>;

+    def : Pat<(VT (X86Blendi VT:$dst, (Intr VT:$src), (i8 1))),
+        (!cast<I>("V"#OpcPrefix#r_Int) VT:$dst, VT:$src)>;
+  }
+}

+defm : scalar_unary_math_patterns<int_x86_sse_rcp_ss, "RCPSS", X86Movss,
+    v4f32, UseSSE1>;
+defm : scalar_unary_math_patterns<int_x86_sse_rsqrt_ss, "RSQRTSS", X86Movss,
+    v4f32, UseSSE1>;
+defm : scalar_unary_math_patterns<int_x86_sse_sqrt_ss, "SQRTSS", X86Movss,
+    v4f32, UseSSE1>;
+defm : scalar_unary_math_patterns<int_x86_sse2_sqrt_sd, "SQRTSD", X86Movsd,
+    v2f64, UseSSE2>;


 //===----------------------------------------------------------------------===//
 // SSE 1 & 2 - Non-temporal stores
 //===----------------------------------------------------------------------===//

 let AddedComplexity = 400 in { // Prefer non-temporal versions
 let SchedRW = [WriteStore] in {
 let Predicates = [HasAVX, NoVLX] in {
 def VMOVNTPSmr : VPSI<0x2B, MRMDestMem, (outs),
@@ 5,298 lines not shown @@
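
The scalar_unary_math_patterns multiclass matches both a movss/movsd-style node and a blend with immediate 1. The following C++ sketch models why the two forms select the same lanes and why either can collapse into a single two-operand r_Int instruction. All names here (`Lanes`, `move_lane0`, `blend_imm`, `intrinsic_model`, `r_int_model`, `sqrt_lane`) are hypothetical and exist only for this illustration.

```cpp
#include <array>
#include <cmath>

// Four float lanes standing in for a v4f32 value (hypothetical model type).
using Lanes = std::array<float, 4>;

// Model of the Move node (movss): lane 0 from src, lanes 1..3 from dst.
Lanes move_lane0(Lanes dst, const Lanes &src) {
  dst[0] = src[0];
  return dst;
}

// Model of a blend with immediate: bit i set selects lane i of b, else of a.
Lanes blend_imm(const Lanes &a, const Lanes &b, unsigned imm) {
  Lanes r{};
  for (unsigned i = 0; i < 4; ++i)
    r[i] = ((imm >> i) & 1u) ? b[i] : a[i];
  return r;
}

// Model of the one-operand intrinsic: op on lane 0, lanes 1..3 passed through.
Lanes intrinsic_model(const Lanes &src, float (*op)(float)) {
  Lanes r = src;
  r[0] = op(src[0]);
  return r;
}

// Model of the two-operand r_Int instruction: dst lane 0 = op(src lane 0),
// dst lanes 1..3 unchanged.
Lanes r_int_model(Lanes dst, const Lanes &src, float (*op)(float)) {
  dst[0] = op(src[0]);
  return dst;
}

// Scalar operation used in the example.
float sqrt_lane(float v) { return std::sqrt(v); }
```

For any `dst` and `src`, `move_lane0(dst, intrinsic_model(src, op))` and `blend_imm(dst, intrinsic_model(src, op), 1)` both equal `r_int_model(dst, src, op)` in this model, which is the rewrite each `Pat` expresses.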

test/CodeGen/X86/sse-scalar-fp-arith-unary.ll

+; RUN: llc -mtriple=x86_64-unknown-unknown -mattr=sse2 < %s | FileCheck --check-prefix=SSE %s
+; RUN: llc -mtriple=x86_64-unknown-unknown -mattr=sse4.1 < %s | FileCheck --check-prefix=SSE %s
+; RUN: llc -mtriple=x86_64-unknown-unknown -mattr=avx < %s | FileCheck --check-prefix=AVX %s

+; PR21507 - https://llvm.org/bugs/show_bug.cgi?id=21507
+; Each function should be a single math op; no extra moves.


+define <4 x float> @recip(<4 x float> %x) {
+; SSE-LABEL: recip:
+; SSE: # BB#0:
+; SSE-NEXT: rcpss %xmm0, %xmm0
+; SSE-NEXT: retq
+;
+; AVX-LABEL: recip:
+; AVX: # BB#0:
+; AVX-NEXT: vrcpss %xmm0, %xmm0, %xmm0
+; AVX-NEXT: retq
+  %y = tail call <4 x float> @llvm.x86.sse.rcp.ss(<4 x float> %x)
+  %shuf = shufflevector <4 x float> %y, <4 x float> %x, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
+  ret <4 x float> %shuf
+}

+define <4 x float> @recip_square_root(<4 x float> %x) {
+; SSE-LABEL: recip_square_root:
+; SSE: # BB#0:
+; SSE-NEXT: rsqrtss %xmm0, %xmm0
+; SSE-NEXT: retq
+;
+; AVX-LABEL: recip_square_root:
+; AVX: # BB#0:
+; AVX-NEXT: vrsqrtss %xmm0, %xmm0, %xmm0
+; AVX-NEXT: retq
+  %y = tail call <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float> %x)
+  %shuf = shufflevector <4 x float> %y, <4 x float> %x, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
+  ret <4 x float> %shuf
+}

+define <4 x float> @square_root(<4 x float> %x) {
+; SSE-LABEL: square_root:
+; SSE: # BB#0:
+; SSE-NEXT: sqrtss %xmm0, %xmm0
+; SSE-NEXT: retq
+;
+; AVX-LABEL: square_root:
+; AVX: # BB#0:
+; AVX-NEXT: vsqrtss %xmm0, %xmm0, %xmm0
+; AVX-NEXT: retq
+  %y = tail call <4 x float> @llvm.x86.sse.sqrt.ss(<4 x float> %x)
+  %shuf = shufflevector <4 x float> %y, <4 x float> %x, <4 x i32> <i32 0, i32 5, i32 6, i32 7>
+  ret <4 x float> %shuf
+}

+define <2 x double> @square_root_double(<2 x double> %x) {
+; SSE-LABEL: square_root_double:
+; SSE: # BB#0:
+; SSE-NEXT: sqrtsd %xmm0, %xmm0
+; SSE-NEXT: retq
+;
+; AVX-LABEL: square_root_double:
+; AVX: # BB#0:
+; AVX-NEXT: vsqrtsd %xmm0, %xmm0, %xmm0
+; AVX-NEXT: retq
+  %y = tail call <2 x double> @llvm.x86.sse2.sqrt.sd(<2 x double> %x)
+  %shuf = shufflevector <2 x double> %y, <2 x double> %x, <2 x i32> <i32 0, i32 3>
+  ret <2 x double> %shuf
+}

+declare <4 x float> @llvm.x86.sse.rcp.ss(<4 x float>)
+declare <4 x float> @llvm.x86.sse.rsqrt.ss(<4 x float>)
+declare <4 x float> @llvm.x86.sse.sqrt.ss(<4 x float>)
+declare <2 x double> @llvm.x86.sse2.sqrt.sd(<2 x double>)