
[X86] Avoid folding scalar loads into unary sse intrinsics
ClosedPublic

Authored by mkuper on Dec 23 2015, 3:46 AM.

Details

Summary

Not folding these cases tends to avoid partial register updates:

sqrtss (%eax), %xmm0

has a partial update of %xmm0, while

movss (%eax), %xmm0
sqrtss %xmm0, %xmm0

has a clobber of the high lanes immediately before the partial update, avoiding a potential stall.

Given this, we only want to fold when optimizing for size.
This is consistent with the patterns we already have for the fp/int converts, and with the logic in X86InstrInfo::foldMemoryOperandImpl().

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper updated this revision to Diff 43519.Dec 23 2015, 3:46 AM
mkuper retitled this revision from to [X86] Avoid folding scalar loads into unary sse intrinsics.
mkuper updated this object.
mkuper added reviewers: RKSimon, spatel, andreadb.
mkuper added subscribers: llvm-commits, DavidKreitzer.
spatel edited edge metadata.Dec 29 2015, 10:10 AM

This is consistent with the patterns we already have for the fp/int converts...

We still need to fix converts?

#include <xmmintrin.h>
__m128 foo(__m128 x, int *y) { return _mm_cvtsi32_ss(x, *y); }

$ ./clang -O1 ss2si.c -S -o -

cvtsi2ssl  (%rdi), %xmm1  <--- false dependency on xmm1?
movss      %xmm1, %xmm0
lib/Target/X86/X86InstrSSE.td
3392 (On Diff #43519)

80-cols.

3433 (On Diff #43519)

80-cols.

This is consistent with the patterns we already have for the fp/int converts...

We still need to fix converts?

#include <xmmintrin.h>
__m128 foo(__m128 x, int *y) { return _mm_cvtsi32_ss(x, *y); }

$ ./clang -O1 ss2si.c -S -o -

cvtsi2ssl  (%rdi), %xmm1  <--- false dependency on xmm1?
movss      %xmm1, %xmm0

Right, I was talking about this:

def CVTSD2SSrm  : I<0x5A, MRMSrcMem, (outs FR32:$dst), (ins f64mem:$src),
                      "cvtsd2ss\t{$src, $dst|$dst, $src}",
                      [(set FR32:$dst, (fround (loadf64 addr:$src)))],
                      IIC_SSE_CVT_Scalar_RM>,
                      XD,
                  Requires<[UseSSE2, OptForSize]>, Sched<[WriteCvtF2FLd]>;

But this is actually the non-intrinsic pattern.

lib/Target/X86/X86InstrSSE.td
3392 (On Diff #43519)

The TDs don't enforce 80-cols consistently, and I never remember whether they should. Thanks. :-)

spatel accepted this revision.Dec 30 2015, 8:39 AM
spatel edited edge metadata.
def CVTSD2SSrm  : I<0x5A, MRMSrcMem, (outs FR32:$dst), (ins f64mem:$src),
                    "cvtsd2ss\t{$src, $dst|$dst, $src}",
                    [(set FR32:$dst, (fround (loadf64 addr:$src)))],
                    IIC_SSE_CVT_Scalar_RM>,
                    XD,
                Requires<[UseSSE2, OptForSize]>, Sched<[WriteCvtF2FLd]>;

Ah, I managed to miss that one.
How about adding some 'FIXME' notes and/or changing the other defs since we're currently inconsistent about this? LGTM otherwise.

float f1(int *x) { return *x; } 
double f2(int *x) { return *x; }
float f3(long long *x) { return *x; }
double f4(long long *x) { return *x; }
float f5(double *x) { return *x; }
double f6(float *x) { return *x; }

$ ./clang -O1 ss2si.c -S -o - |grep cvt
cvtsi2ssl	(%rdi), %xmm0
cvtsi2sdl	(%rdi), %xmm0
cvtsi2ssq	(%rdi), %xmm0
cvtsi2sdq	(%rdi), %xmm0
cvtsd2ss	%xmm0, %xmm0
cvtss2sd	%xmm0, %xmm0

Regarding handling this via ExeDepsFix - it's not clear to me that its current solution:

xorps %xmm0, %xmm0
cvtsi2ssl (%rdi), %xmm0

would be better than unfolding the load. I think the xorps instruction saves a byte in all cases, but it may be micro-arch-dependent whether that's actually cheaper?

lib/Target/X86/X86InstrSSE.td
3392 (On Diff #43519)

It seems like we mostly try to follow the law, but I would fully support a new rule for these files.

80-cols causes a lot of extra suffering trying to make sense of this code that's already hard to read. :)

This revision is now accepted and ready to land.Dec 30 2015, 8:39 AM

Thanks, Sanjay.

How about adding some 'FIXME' notes and/or changing the other defs since we're currently inconsistent about this? LGTM otherwise.

float f1(int *x) { return *x; } 
double f2(int *x) { return *x; }
float f3(long long *x) { return *x; }
double f4(long long *x) { return *x; }
float f5(double *x) { return *x; }
double f6(float *x) { return *x; }

I'll add FIXMEs.

Regarding handling this via ExeDepsFix - it's not clear to me that its current solution:

xorps %xmm0, %xmm0
cvtsi2ssl (%rdi), %xmm0

would be better than unfolding the load. I think the xorps instruction saves a byte in all cases, but it may be micro-arch-dependent whether that's actually cheaper?

I think it generally is better (the xorps idiom should be recognized by any modern Intel CPU, at least). But David is the real authority on this.

This revision was automatically updated to reflect the committed changes.