This is an archive of the discontinued LLVM Phabricator instance.

[X86] Enable reciprocal estimates for v16f32 vectors by using VRCP14PS/VRSQRT14PS
ClosedPublic

Authored by craig.topper on May 5 2018, 4:16 PM.

Details

Commits

rGcb2abc797780: [X86] Enable reciprocal estimates for v16f32 vectors by using…
rL331606: [X86] Enable reciprocal estimates for v16f32 vectors by using…

Summary

The legacy VRCPPS/VRSQRTPS instructions aren't available in 512-bit versions, but the newer, increased-precision VRCP14PS/VRSQRT14PS are. So we can use those to implement v16f32 reciprocal estimates.

For KNL CPUs we can probably use VRCP28PS/VRSQRT28PS and avoid the NR step altogether, but I leave that for a future patch.
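
The estimate-plus-refinement idea behind this patch can be sketched in scalar C++. This is an illustration only, not LLVM code: `approx_recip` is a hypothetical stand-in that emulates a hardware estimate by perturbing the exact reciprocal (real hardware uses a lookup table), and `refine_recip` and `recip_rel_error` are names invented here. The refinement is the usual Newton-Raphson (NR) step x1 = x0 + x0 * (1 - a * x0), written with two fused multiply-adds in the same shape as the vfnmadd213ps/vfmadd132ps pair in the updated tests.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for a hardware estimate such as VRCP14PS: returns 1/a
// with a relative error of roughly 2^-15, i.e. inside a 2^-14 bound.
// Assumption: the perturbation only imitates the error size, not the hardware.
float approx_recip(float a) {
    assert(a != 0.0f);
    return (1.0f / a) * (1.0f + std::ldexp(1.0f, -15));
}

// One Newton-Raphson step for the reciprocal: x1 = x0 + x0 * (1 - a * x0).
float refine_recip(float a, float x0) {
    float e = std::fma(-a, x0, 1.0f);  // e = 1 - a * x0   (like vfnmadd213ps)
    return std::fma(e, x0, x0);        // x1 = x0 + e * x0 (like vfmadd132ps)
}

// Relative error of x as an approximation of 1/a, evaluated in double.
double recip_rel_error(float a, float x) {
    return std::fabs(static_cast<double>(a) * x - 1.0);
}
```

Calling `refine_recip(a, approx_recip(a))` roughly squares the relative error of the estimate, which is why a single step is enough to get from about 14 bits to a full single-precision result.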

Event Timeline

craig.topper created this revision. May 5 2018, 4:16 PM

Herald added a subscriber: mehdi_amini. May 5 2018, 4:16 PM

spatel added inline comments.

LGTM - see inline for possible improvements.

lib/Target/X86/X86ISelLowering.cpp
17813–17817	Potential enhancements for follow-up patches:
	1. Use the new scalar estimate (VRSQRT14SS) if we have the required AVX-ness.
	2. Use VRSQRT14SD for an f64.
	3. Use VRSQRT14PD for vectors of f64.
	4. Repeat all of the above for VRCP14xx.
test/CodeGen/X86/recip-fastmath.ll
1226	Not sure where the timing is defined (cc @RKSimon), but that vdivps timing can't be right. Agner has it at 32:20. Might want to verify the new instruction sequence timings too.

This revision is now accepted and ready to land. May 6 2018, 8:47 AM

Closed by commit rL331606: [X86] Enable reciprocal estimates for v16f32 vectors by using… (authored by ctopper). May 6 2018, 10:52 AM

This revision was automatically updated to reflect the committed changes.

LuoYuanke added a subscriber: LuoYuanke. Oct 15 2022, 4:59 PM

LuoYuanke added inline comments.

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
17823 ↗	(On Diff #145403)	@craig.topper, for v4f32 and v8f32, if avx512f is available, do we prefer RSQRT14 or FRSQRT?

Herald added projects: Restricted Project, Restricted Project. Oct 15 2022, 4:59 PM

Herald added subscribers: StephenFan, pengfei.

craig.topper added inline comments. Oct 15 2022, 6:03 PM

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
17823 ↗	(On Diff #145403)	FRSQRT is a shorter encoding but the result would probably be more accurate with RSQRT14. Not sure what’s best.
test/CodeGen/X86/recip-fastmath.ll
1226	KNL is using the Haswell scheduler model I think. And last I looked all the divide instructions were using InstRWs for each instruction. Since Haswell doesn't have VDIVPSZrr it probably just got some garbage default.

LuoYuanke added inline comments. Oct 15 2022, 6:10 PM

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
17823 ↗	(On Diff #145403)	Got it. Thanks, Craig.
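
The accuracy trade-off mentioned above can be made concrete with the standard error analysis of one Newton-Raphson step for the reciprocal square root. The bounds used here are assumptions based on Intel's documented limits (relative error at most 1.5 · 2^-12 for the legacy estimate behind FRSQRT, and less than 2^-14 for RSQRT14); ε denotes the relative error of the initial estimate y0, and floating-point rounding is ignored.

```latex
\begin{aligned}
y_0 &= \frac{1+\varepsilon}{\sqrt{a}}, \\
y_1 &= y_0\left(\tfrac{3}{2} - \tfrac{1}{2}\,a\,y_0^{2}\right)
     = \frac{(1+\varepsilon)\left(1 - \varepsilon - \tfrac{1}{2}\varepsilon^{2}\right)}{\sqrt{a}}
     = \frac{1 - \tfrac{3}{2}\varepsilon^{2} - \tfrac{1}{2}\varepsilon^{3}}{\sqrt{a}}, \\
|\varepsilon| \le 1.5\cdot 2^{-12} &\;\Rightarrow\; \tfrac{3}{2}\varepsilon^{2} \le 3.375\cdot 2^{-24} \approx 2^{-22.2}, \\
|\varepsilon| < 2^{-14} &\;\Rightarrow\; \tfrac{3}{2}\varepsilon^{2} < 1.5\cdot 2^{-28} \approx 2^{-27.4}.
\end{aligned}
```

Under these assumptions, one step starting from the legacy estimate can still leave an error larger than one float ulp (2^-23), while one step starting from RSQRT14 leaves an error well below it, so the result is limited mainly by rounding. Whether that outweighs the shorter encoding is the open question in the comment.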

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp	16 lines
test/CodeGen/X86/recip-fastmath.ll	28 lines
test/CodeGen/X86/recip-fastmath2.ll	56 lines
test/CodeGen/X86/sqrt-fastmath.ll	8 lines

Diff 145388

lib/Target/X86/X86ISelLowering.cpp

	[17,797 lines not shown]
	SDValue X86TargetLowering::getSqrtEstimate(SDValue Op,			SDValue X86TargetLowering::getSqrtEstimate(SDValue Op,
	SelectionDAG &DAG, int Enabled,			SelectionDAG &DAG, int Enabled,
	int &RefinementSteps,			int &RefinementSteps,
	bool &UseOneConstNR,			bool &UseOneConstNR,
	bool Reciprocal) const {			bool Reciprocal) const {
	EVT VT = Op.getValueType();			EVT VT = Op.getValueType();

	// SSE1 has rsqrtss and rsqrtps. AVX adds a 256-bit variant for rsqrtps.			// SSE1 has rsqrtss and rsqrtps. AVX adds a 256-bit variant for rsqrtps.
	// TODO: Add support for AVX512 (v16f32).
	// It is likely not profitable to do this for f64 because a double-precision			// It is likely not profitable to do this for f64 because a double-precision
	// rsqrt estimate with refinement on x86 prior to FMA requires at least 16			// rsqrt estimate with refinement on x86 prior to FMA requires at least 16
	// instructions: convert to single, rsqrtss, convert back to double, refine			// instructions: convert to single, rsqrtss, convert back to double, refine
	// (3 steps = at least 13 insts). If an 'rsqrtsd' variant was added to the ISA			// (3 steps = at least 13 insts). If an 'rsqrtsd' variant was added to the ISA
	// along with FMA, this could be a throughput win.			// along with FMA, this could be a throughput win.
	// TODO: SQRT requires SSE2 to prevent the introduction of an illegal v4i32			// TODO: SQRT requires SSE2 to prevent the introduction of an illegal v4i32
	// after legalize types.			// after legalize types.
	if ((VT == MVT::f32 && Subtarget.hasSSE1()) ||			if ((VT == MVT::f32 && Subtarget.hasSSE1()) ||
	(VT == MVT::v4f32 && Subtarget.hasSSE1() && Reciprocal) ||			(VT == MVT::v4f32 && Subtarget.hasSSE1() && Reciprocal) ||
	(VT == MVT::v4f32 && Subtarget.hasSSE2() && !Reciprocal) ||			(VT == MVT::v4f32 && Subtarget.hasSSE2() && !Reciprocal) ||
	(VT == MVT::v8f32 && Subtarget.hasAVX())) {			(VT == MVT::v8f32 && Subtarget.hasAVX()) ||
				(VT == MVT::v16f32 && Subtarget.useAVX512Regs())) {
				spatel (inline comment): Potential enhancements for follow-up patches: 1. Use the new scalar estimate (VRSQRT14SS) if we have the required AVX-ness. 2. Use VRSQRT14SD for an f64. 3. Use VRSQRT14PD for vectors of f64. 4. Repeat all of the above for VRCP14xx.
	if (RefinementSteps == ReciprocalEstimate::Unspecified)			if (RefinementSteps == ReciprocalEstimate::Unspecified)
	RefinementSteps = 1;			RefinementSteps = 1;

	UseOneConstNR = false;			UseOneConstNR = false;
	return DAG.getNode(X86ISD::FRSQRT, SDLoc(Op), VT, Op);			// There is no FSQRT for 512-bits, but there is RSQRT14.
				unsigned Opcode = VT == MVT::v16f32 ? X86ISD::RSQRT14 : X86ISD::FRSQRT;
				return DAG.getNode(Opcode, SDLoc(Op), VT, Op);
	}			}
	return SDValue();			return SDValue();
	}			}

	/// The minimum architected relative accuracy is 2^-12. We need one			/// The minimum architected relative accuracy is 2^-12. We need one
	/// Newton-Raphson step to have a good float result (24 bits of precision).			/// Newton-Raphson step to have a good float result (24 bits of precision).
	SDValue X86TargetLowering::getRecipEstimate(SDValue Op, SelectionDAG &DAG,			SDValue X86TargetLowering::getRecipEstimate(SDValue Op, SelectionDAG &DAG,
	int Enabled,			int Enabled,
	int &RefinementSteps) const {			int &RefinementSteps) const {
	EVT VT = Op.getValueType();			EVT VT = Op.getValueType();

	// SSE1 has rcpss and rcpps. AVX adds a 256-bit variant for rcpps.			// SSE1 has rcpss and rcpps. AVX adds a 256-bit variant for rcpps.
	// TODO: Add support for AVX512 (v16f32).
	// It is likely not profitable to do this for f64 because a double-precision			// It is likely not profitable to do this for f64 because a double-precision
	// reciprocal estimate with refinement on x86 prior to FMA requires			// reciprocal estimate with refinement on x86 prior to FMA requires
	// 15 instructions: convert to single, rcpss, convert back to double, refine			// 15 instructions: convert to single, rcpss, convert back to double, refine
	// (3 steps = 12 insts). If an 'rcpsd' variant was added to the ISA			// (3 steps = 12 insts). If an 'rcpsd' variant was added to the ISA
	// along with FMA, this could be a throughput win.			// along with FMA, this could be a throughput win.

	if ((VT == MVT::f32 && Subtarget.hasSSE1()) ||			if ((VT == MVT::f32 && Subtarget.hasSSE1()) ||
	(VT == MVT::v4f32 && Subtarget.hasSSE1()) ||			(VT == MVT::v4f32 && Subtarget.hasSSE1()) ||
	(VT == MVT::v8f32 && Subtarget.hasAVX())) {			(VT == MVT::v8f32 && Subtarget.hasAVX()) ||
				(VT == MVT::v16f32 && Subtarget.useAVX512Regs())) {
	// Enable estimate codegen with 1 refinement step for vector division.			// Enable estimate codegen with 1 refinement step for vector division.
	// Scalar division estimates are disabled because they break too much			// Scalar division estimates are disabled because they break too much
	// real-world code. These defaults are intended to match GCC behavior.			// real-world code. These defaults are intended to match GCC behavior.
	if (VT == MVT::f32 && Enabled == ReciprocalEstimate::Unspecified)			if (VT == MVT::f32 && Enabled == ReciprocalEstimate::Unspecified)
	return SDValue();			return SDValue();

	if (RefinementSteps == ReciprocalEstimate::Unspecified)			if (RefinementSteps == ReciprocalEstimate::Unspecified)
	RefinementSteps = 1;			RefinementSteps = 1;

	return DAG.getNode(X86ISD::FRCP, SDLoc(Op), VT, Op);			// There is no FSQRT for 512-bits, but there is RSQRT14.
				unsigned Opcode = VT == MVT::v16f32 ? X86ISD::RCP14 : X86ISD::FRCP;
				return DAG.getNode(Opcode, SDLoc(Op), VT, Op);
	}			}
	return SDValue();			return SDValue();
	}			}

	/// If we have at least two divisions that use the same divisor, convert to			/// If we have at least two divisions that use the same divisor, convert to
	/// multiplication by a reciprocal. This may need to be adjusted for a given			/// multiplication by a reciprocal. This may need to be adjusted for a given
	/// CPU if a division's cost is not at least twice the cost of a multiplication.			/// CPU if a division's cost is not at least twice the cost of a multiplication.
	/// This is because we still need one division to calculate the reciprocal and			/// This is because we still need one division to calculate the reciprocal and
	[21,895 lines not shown]

test/CodeGen/X86/recip-fastmath.ll

	[1,018 lines not shown]
	; HASWELL-NO-FMA-NEXT: vmulps %ymm2, %ymm1, %ymm1			; HASWELL-NO-FMA-NEXT: vmulps %ymm2, %ymm1, %ymm1
	; HASWELL-NO-FMA-NEXT: vsubps %ymm1, %ymm3, %ymm1			; HASWELL-NO-FMA-NEXT: vsubps %ymm1, %ymm3, %ymm1
	; HASWELL-NO-FMA-NEXT: vmulps %ymm1, %ymm2, %ymm1			; HASWELL-NO-FMA-NEXT: vmulps %ymm1, %ymm2, %ymm1
	; HASWELL-NO-FMA-NEXT: vaddps %ymm1, %ymm2, %ymm1			; HASWELL-NO-FMA-NEXT: vaddps %ymm1, %ymm2, %ymm1
	; HASWELL-NO-FMA-NEXT: retq			; HASWELL-NO-FMA-NEXT: retq
	;			;
	; KNL-LABEL: v16f32_one_step:			; KNL-LABEL: v16f32_one_step:
	; KNL: # %bb.0:			; KNL: # %bb.0:
	; KNL-NEXT: vbroadcastss {{.*#+}} zmm1 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] sched: [10:1.00]			; KNL-NEXT: vrcp14ps %zmm0, %zmm1 # sched: [5:1.00]
	; KNL-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [12:1.00]			; KNL-NEXT: vfnmadd213ps {{.*#+}} zmm0 = -(zmm1 * zmm0) + mem sched: [12:0.50]
				; KNL-NEXT: vfmadd132ps {{.*#+}} zmm0 = (zmm0 * zmm1) + zmm1 sched: [5:0.50]
	; KNL-NEXT: retq # sched: [7:1.00]			; KNL-NEXT: retq # sched: [7:1.00]
	;			;
	; SKX-LABEL: v16f32_one_step:			; SKX-LABEL: v16f32_one_step:
	; SKX: # %bb.0:			; SKX: # %bb.0:
	; SKX-NEXT: vbroadcastss {{.*#+}} zmm1 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] sched: [8:0.50]			; SKX-NEXT: vrcp14ps %zmm0, %zmm1 # sched: [9:2.00]
	; SKX-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [18:10.00]			; SKX-NEXT: vfnmadd213ps {{.*#+}} zmm0 = -(zmm1 * zmm0) + mem sched: [11:0.50]
				; SKX-NEXT: vfmadd132ps {{.*#+}} zmm0 = (zmm0 * zmm1) + zmm1 sched: [4:0.33]
	; SKX-NEXT: retq # sched: [7:1.00]			; SKX-NEXT: retq # sched: [7:1.00]
	%div = fdiv fast <16 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %x			%div = fdiv fast <16 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %x
	ret <16 x float> %div			ret <16 x float> %div
	}			}

	define <16 x float> @v16f32_two_step(<16 x float> %x) #2 {			define <16 x float> @v16f32_two_step(<16 x float> %x) #2 {
	; SSE-LABEL: v16f32_two_step:			; SSE-LABEL: v16f32_two_step:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	[174 lines not shown]
	; HASWELL-NO-FMA-NEXT: vmulps %ymm2, %ymm1, %ymm1			; HASWELL-NO-FMA-NEXT: vmulps %ymm2, %ymm1, %ymm1
	; HASWELL-NO-FMA-NEXT: vsubps %ymm1, %ymm4, %ymm1			; HASWELL-NO-FMA-NEXT: vsubps %ymm1, %ymm4, %ymm1
	; HASWELL-NO-FMA-NEXT: vmulps %ymm1, %ymm2, %ymm1			; HASWELL-NO-FMA-NEXT: vmulps %ymm1, %ymm2, %ymm1
	; HASWELL-NO-FMA-NEXT: vaddps %ymm1, %ymm2, %ymm1			; HASWELL-NO-FMA-NEXT: vaddps %ymm1, %ymm2, %ymm1
	; HASWELL-NO-FMA-NEXT: retq			; HASWELL-NO-FMA-NEXT: retq
	;			;
	; KNL-LABEL: v16f32_two_step:			; KNL-LABEL: v16f32_two_step:
	; KNL: # %bb.0:			; KNL: # %bb.0:
	; KNL-NEXT: vbroadcastss {{.*#+}} zmm1 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] sched: [10:1.00]			; KNL-NEXT: vrcp14ps %zmm0, %zmm1 # sched: [5:1.00]
	; KNL-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [12:1.00]			; KNL-NEXT: vbroadcastss {{.*#+}} zmm2 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] sched: [10:1.00]
	spatel (inline comment): Not sure where the timing is defined (cc @RKSimon), but that vdivps timing can't be right. Agner has it at 32:20. Might want to verify the new instruction sequence timings too.
	craig.topper (author, inline comment): KNL is using the Haswell scheduler model I think. And last I looked all the divide instructions were using InstRWs for each instruction. Since Haswell doesn't have VDIVPSZrr it probably just got some garbage default.
				; KNL-NEXT: vmovaps %zmm1, %zmm3 # sched: [1:1.00]
				; KNL-NEXT: vfnmadd213ps {{.*#+}} zmm3 = -(zmm0 * zmm3) + zmm2 sched: [5:0.50]
				; KNL-NEXT: vfmadd132ps {{.*#+}} zmm3 = (zmm3 * zmm1) + zmm1 sched: [5:0.50]
				; KNL-NEXT: vfnmadd213ps {{.*#+}} zmm0 = -(zmm3 * zmm0) + zmm2 sched: [5:0.50]
				; KNL-NEXT: vfmadd132ps {{.*#+}} zmm0 = (zmm0 * zmm3) + zmm3 sched: [5:0.50]
	; KNL-NEXT: retq # sched: [7:1.00]			; KNL-NEXT: retq # sched: [7:1.00]
	;			;
	; SKX-LABEL: v16f32_two_step:			; SKX-LABEL: v16f32_two_step:
	; SKX: # %bb.0:			; SKX: # %bb.0:
	; SKX-NEXT: vbroadcastss {{.*#+}} zmm1 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] sched: [8:0.50]			; SKX-NEXT: vrcp14ps %zmm0, %zmm1 # sched: [9:2.00]
	; SKX-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [18:10.00]			; SKX-NEXT: vbroadcastss {{.*#+}} zmm2 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] sched: [8:0.50]
				; SKX-NEXT: vmovaps %zmm1, %zmm3 # sched: [1:0.33]
				; SKX-NEXT: vfnmadd213ps {{.*#+}} zmm3 = -(zmm0 * zmm3) + zmm2 sched: [4:0.33]
				; SKX-NEXT: vfmadd132ps {{.*#+}} zmm3 = (zmm3 * zmm1) + zmm1 sched: [4:0.33]
				; SKX-NEXT: vfnmadd213ps {{.*#+}} zmm0 = -(zmm3 * zmm0) + zmm2 sched: [4:0.33]
				; SKX-NEXT: vfmadd132ps {{.*#+}} zmm0 = (zmm0 * zmm3) + zmm3 sched: [4:0.33]
	; SKX-NEXT: retq # sched: [7:1.00]			; SKX-NEXT: retq # sched: [7:1.00]
	%div = fdiv fast <16 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %x			%div = fdiv fast <16 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %x
	ret <16 x float> %div			ret <16 x float> %div
	}			}

	attributes #0 = { "unsafe-fp-math"="true" "reciprocal-estimates"="!divf,!vec-divf" }			attributes #0 = { "unsafe-fp-math"="true" "reciprocal-estimates"="!divf,!vec-divf" }
	attributes #1 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf,vec-divf" }			attributes #1 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf,vec-divf" }
	attributes #2 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf:2,vec-divf:2" }			attributes #2 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf:2,vec-divf:2" }

test/CodeGen/X86/recip-fastmath2.ll

	[1,317 lines not shown]
	; HASWELL-NO-FMA-NEXT: vmulps %ymm0, %ymm2, %ymm0 # sched: [5:0.50]			; HASWELL-NO-FMA-NEXT: vmulps %ymm0, %ymm2, %ymm0 # sched: [5:0.50]
	; HASWELL-NO-FMA-NEXT: vaddps %ymm0, %ymm2, %ymm0 # sched: [3:1.00]			; HASWELL-NO-FMA-NEXT: vaddps %ymm0, %ymm2, %ymm0 # sched: [3:1.00]
	; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [12:0.50]			; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [12:0.50]
	; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm1, %ymm1 # sched: [12:0.50]			; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm1, %ymm1 # sched: [12:0.50]
	; HASWELL-NO-FMA-NEXT: retq # sched: [7:1.00]			; HASWELL-NO-FMA-NEXT: retq # sched: [7:1.00]
	;			;
	; KNL-LABEL: v16f32_one_step2:			; KNL-LABEL: v16f32_one_step2:
	; KNL: # %bb.0:			; KNL: # %bb.0:
	; KNL-NEXT: vmovaps {{.*#+}} zmm1 = [1.000000e+00,2.000000e+00,3.000000e+00,4.000000e+00,5.000000e+00,6.000000e+00,7.000000e+00,8.000000e+00,9.000000e+00,1.000000e+01,1.100000e+01,1.200000e+01,1.300000e+01,1.400000e+01,1.500000e+01,1.600000e+01] sched: [5:0.50]			; KNL-NEXT: vrcp14ps %zmm0, %zmm1 # sched: [5:1.00]
	; KNL-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [12:1.00]			; KNL-NEXT: vfnmadd213ps {{.*#+}} zmm0 = -(zmm1 * zmm0) + mem sched: [12:0.50]
				; KNL-NEXT: vfmadd132ps {{.*#+}} zmm0 = (zmm0 * zmm1) + zmm1 sched: [5:0.50]
				; KNL-NEXT: vmulps {{.*}}(%rip), %zmm0, %zmm0 # sched: [12:0.50]
	; KNL-NEXT: retq # sched: [7:1.00]			; KNL-NEXT: retq # sched: [7:1.00]
	;			;
	; SKX-LABEL: v16f32_one_step2:			; SKX-LABEL: v16f32_one_step2:
	; SKX: # %bb.0:			; SKX: # %bb.0:
	; SKX-NEXT: vmovaps {{.*#+}} zmm1 = [1.000000e+00,2.000000e+00,3.000000e+00,4.000000e+00,5.000000e+00,6.000000e+00,7.000000e+00,8.000000e+00,9.000000e+00,1.000000e+01,1.100000e+01,1.200000e+01,1.300000e+01,1.400000e+01,1.500000e+01,1.600000e+01] sched: [8:0.50]			; SKX-NEXT: vrcp14ps %zmm0, %zmm1 # sched: [9:2.00]
	; SKX-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [18:10.00]			; SKX-NEXT: vfnmadd213ps {{.*#+}} zmm0 = -(zmm1 * zmm0) + mem sched: [11:0.50]
				; SKX-NEXT: vfmadd132ps {{.*#+}} zmm0 = (zmm0 * zmm1) + zmm1 sched: [4:0.33]
				; SKX-NEXT: vmulps {{.*}}(%rip), %zmm0, %zmm0 # sched: [11:0.50]
	; SKX-NEXT: retq # sched: [7:1.00]			; SKX-NEXT: retq # sched: [7:1.00]
	%div = fdiv fast <16 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0>, %x			%div = fdiv fast <16 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0>, %x
	ret <16 x float> %div			ret <16 x float> %div
	}			}

	define <16 x float> @v16f32_one_step_2_divs(<16 x float> %x) #1 {			define <16 x float> @v16f32_one_step_2_divs(<16 x float> %x) #1 {
	; SSE-LABEL: v16f32_one_step_2_divs:			; SSE-LABEL: v16f32_one_step_2_divs:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	[138 lines not shown]
	; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm1, %ymm2 # sched: [12:0.50]			; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm1, %ymm2 # sched: [12:0.50]
	; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm3 # sched: [12:0.50]			; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm3 # sched: [12:0.50]
	; HASWELL-NO-FMA-NEXT: vmulps %ymm0, %ymm3, %ymm0 # sched: [5:0.50]			; HASWELL-NO-FMA-NEXT: vmulps %ymm0, %ymm3, %ymm0 # sched: [5:0.50]
	; HASWELL-NO-FMA-NEXT: vmulps %ymm1, %ymm2, %ymm1 # sched: [5:0.50]			; HASWELL-NO-FMA-NEXT: vmulps %ymm1, %ymm2, %ymm1 # sched: [5:0.50]
	; HASWELL-NO-FMA-NEXT: retq # sched: [7:1.00]			; HASWELL-NO-FMA-NEXT: retq # sched: [7:1.00]
	;			;
	; KNL-LABEL: v16f32_one_step_2_divs:			; KNL-LABEL: v16f32_one_step_2_divs:
	; KNL: # %bb.0:			; KNL: # %bb.0:
	; KNL-NEXT: vbroadcastss {{.*#+}} zmm1 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] sched: [10:1.00]			; KNL-NEXT: vrcp14ps %zmm0, %zmm1 # sched: [5:1.00]
	; KNL-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [12:1.00]			; KNL-NEXT: vfnmadd213ps {{.*#+}} zmm0 = -(zmm1 * zmm0) + mem sched: [12:0.50]
				; KNL-NEXT: vfmadd132ps {{.*#+}} zmm0 = (zmm0 * zmm1) + zmm1 sched: [5:0.50]
	; KNL-NEXT: vmulps {{.*}}(%rip), %zmm0, %zmm1 # sched: [12:0.50]			; KNL-NEXT: vmulps {{.*}}(%rip), %zmm0, %zmm1 # sched: [12:0.50]
	; KNL-NEXT: vmulps %zmm0, %zmm1, %zmm0 # sched: [5:0.50]			; KNL-NEXT: vmulps %zmm0, %zmm1, %zmm0 # sched: [5:0.50]
	; KNL-NEXT: retq # sched: [7:1.00]			; KNL-NEXT: retq # sched: [7:1.00]
	;			;
	; SKX-LABEL: v16f32_one_step_2_divs:			; SKX-LABEL: v16f32_one_step_2_divs:
	; SKX: # %bb.0:			; SKX: # %bb.0:
	; SKX-NEXT: vbroadcastss {{.*#+}} zmm1 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] sched: [8:0.50]			; SKX-NEXT: vrcp14ps %zmm0, %zmm1 # sched: [9:2.00]
	; SKX-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [18:10.00]			; SKX-NEXT: vfnmadd213ps {{.*#+}} zmm0 = -(zmm1 * zmm0) + mem sched: [11:0.50]
				; SKX-NEXT: vfmadd132ps {{.*#+}} zmm0 = (zmm0 * zmm1) + zmm1 sched: [4:0.33]
	; SKX-NEXT: vmulps {{.*}}(%rip), %zmm0, %zmm1 # sched: [11:0.50]			; SKX-NEXT: vmulps {{.*}}(%rip), %zmm0, %zmm1 # sched: [11:0.50]
	; SKX-NEXT: vmulps %zmm0, %zmm1, %zmm0 # sched: [4:0.33]			; SKX-NEXT: vmulps %zmm0, %zmm1, %zmm0 # sched: [4:0.33]
	; SKX-NEXT: retq # sched: [7:1.00]			; SKX-NEXT: retq # sched: [7:1.00]
	%div = fdiv fast <16 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0>, %x			%div = fdiv fast <16 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0>, %x
	%div2 = fdiv fast <16 x float> %div, %x			%div2 = fdiv fast <16 x float> %div, %x
	ret <16 x float> %div2			ret <16 x float> %div2
	}			}

	[192 lines not shown]
	; HASWELL-NO-FMA-NEXT: vmulps %ymm0, %ymm2, %ymm0 # sched: [5:0.50]			; HASWELL-NO-FMA-NEXT: vmulps %ymm0, %ymm2, %ymm0 # sched: [5:0.50]
	; HASWELL-NO-FMA-NEXT: vaddps %ymm0, %ymm2, %ymm0 # sched: [3:1.00]			; HASWELL-NO-FMA-NEXT: vaddps %ymm0, %ymm2, %ymm0 # sched: [3:1.00]
	; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [12:0.50]			; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [12:0.50]
	; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm1, %ymm1 # sched: [12:0.50]			; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm1, %ymm1 # sched: [12:0.50]
	; HASWELL-NO-FMA-NEXT: retq # sched: [7:1.00]			; HASWELL-NO-FMA-NEXT: retq # sched: [7:1.00]
	;			;
	; KNL-LABEL: v16f32_two_step2:			; KNL-LABEL: v16f32_two_step2:
	; KNL: # %bb.0:			; KNL: # %bb.0:
	; KNL-NEXT: vmovaps {{.*#+}} zmm1 = [1.000000e+00,2.000000e+00,3.000000e+00,4.000000e+00,5.000000e+00,6.000000e+00,7.000000e+00,8.000000e+00,9.000000e+00,1.000000e+01,1.100000e+01,1.200000e+01,1.300000e+01,1.400000e+01,1.500000e+01,1.600000e+01] sched: [5:0.50]			; KNL-NEXT: vrcp14ps %zmm0, %zmm1 # sched: [5:1.00]
	; KNL-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [12:1.00]			; KNL-NEXT: vbroadcastss {{.*#+}} zmm2 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] sched: [10:1.00]
				; KNL-NEXT: vmovaps %zmm1, %zmm3 # sched: [1:1.00]
				; KNL-NEXT: vfnmadd213ps {{.*#+}} zmm3 = -(zmm0 * zmm3) + zmm2 sched: [5:0.50]
				; KNL-NEXT: vfmadd132ps {{.*#+}} zmm3 = (zmm3 * zmm1) + zmm1 sched: [5:0.50]
				; KNL-NEXT: vfnmadd213ps {{.*#+}} zmm0 = -(zmm3 * zmm0) + zmm2 sched: [5:0.50]
				; KNL-NEXT: vfmadd132ps {{.*#+}} zmm0 = (zmm0 * zmm3) + zmm3 sched: [5:0.50]
				; KNL-NEXT: vmulps {{.*}}(%rip), %zmm0, %zmm0 # sched: [12:0.50]
	; KNL-NEXT: retq # sched: [7:1.00]			; KNL-NEXT: retq # sched: [7:1.00]
	;			;
	; SKX-LABEL: v16f32_two_step2:			; SKX-LABEL: v16f32_two_step2:
	; SKX: # %bb.0:			; SKX: # %bb.0:
	; SKX-NEXT: vmovaps {{.*#+}} zmm1 = [1.000000e+00,2.000000e+00,3.000000e+00,4.000000e+00,5.000000e+00,6.000000e+00,7.000000e+00,8.000000e+00,9.000000e+00,1.000000e+01,1.100000e+01,1.200000e+01,1.300000e+01,1.400000e+01,1.500000e+01,1.600000e+01] sched: [8:0.50]			; SKX-NEXT: vrcp14ps %zmm0, %zmm1 # sched: [9:2.00]
	; SKX-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [18:10.00]			; SKX-NEXT: vbroadcastss {{.*#+}} zmm2 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] sched: [8:0.50]
				; SKX-NEXT: vmovaps %zmm1, %zmm3 # sched: [1:0.33]
				; SKX-NEXT: vfnmadd213ps {{.*#+}} zmm3 = -(zmm0 * zmm3) + zmm2 sched: [4:0.33]
				; SKX-NEXT: vfmadd132ps {{.*#+}} zmm3 = (zmm3 * zmm1) + zmm1 sched: [4:0.33]
				; SKX-NEXT: vfnmadd213ps {{.*#+}} zmm0 = -(zmm3 * zmm0) + zmm2 sched: [4:0.33]
				; SKX-NEXT: vfmadd132ps {{.*#+}} zmm0 = (zmm0 * zmm3) + zmm3 sched: [4:0.33]
				; SKX-NEXT: vmulps {{.*}}(%rip), %zmm0, %zmm0 # sched: [11:0.50]
	; SKX-NEXT: retq # sched: [7:1.00]			; SKX-NEXT: retq # sched: [7:1.00]
	%div = fdiv fast <16 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0>, %x			%div = fdiv fast <16 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0>, %x
	ret <16 x float> %div			ret <16 x float> %div
	}			}

	define <16 x float> @v16f32_no_step(<16 x float> %x) #3 {			define <16 x float> @v16f32_no_step(<16 x float> %x) #3 {
	; SSE-LABEL: v16f32_no_step:			; SSE-LABEL: v16f32_no_step:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	[36 lines not shown]
	; HASWELL-NO-FMA-LABEL: v16f32_no_step:			; HASWELL-NO-FMA-LABEL: v16f32_no_step:
	; HASWELL-NO-FMA: # %bb.0:			; HASWELL-NO-FMA: # %bb.0:
	; HASWELL-NO-FMA-NEXT: vrcpps %ymm0, %ymm0 # sched: [11:2.00]			; HASWELL-NO-FMA-NEXT: vrcpps %ymm0, %ymm0 # sched: [11:2.00]
	; HASWELL-NO-FMA-NEXT: vrcpps %ymm1, %ymm1 # sched: [11:2.00]			; HASWELL-NO-FMA-NEXT: vrcpps %ymm1, %ymm1 # sched: [11:2.00]
	; HASWELL-NO-FMA-NEXT: retq # sched: [7:1.00]			; HASWELL-NO-FMA-NEXT: retq # sched: [7:1.00]
	;			;
	; KNL-LABEL: v16f32_no_step:			; KNL-LABEL: v16f32_no_step:
	; KNL: # %bb.0:			; KNL: # %bb.0:
	; KNL-NEXT: vbroadcastss {{.*#+}} zmm1 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] sched: [10:1.00]			; KNL-NEXT: vrcp14ps %zmm0, %zmm0 # sched: [5:1.00]
	; KNL-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [12:1.00]
	; KNL-NEXT: retq # sched: [7:1.00]			; KNL-NEXT: retq # sched: [7:1.00]
	;			;
	; SKX-LABEL: v16f32_no_step:			; SKX-LABEL: v16f32_no_step:
	; SKX: # %bb.0:			; SKX: # %bb.0:
	; SKX-NEXT: vbroadcastss {{.*#+}} zmm1 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1] sched: [8:0.50]			; SKX-NEXT: vrcp14ps %zmm0, %zmm0 # sched: [9:2.00]
	; SKX-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [18:10.00]
	; SKX-NEXT: retq # sched: [7:1.00]			; SKX-NEXT: retq # sched: [7:1.00]
	%div = fdiv fast <16 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %x			%div = fdiv fast <16 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %x
	ret <16 x float> %div			ret <16 x float> %div
	}			}

	define <16 x float> @v16f32_no_step2(<16 x float> %x) #3 {			define <16 x float> @v16f32_no_step2(<16 x float> %x) #3 {
	; SSE-LABEL: v16f32_no_step2:			; SSE-LABEL: v16f32_no_step2:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	[52 lines not shown]
	; HASWELL-NO-FMA-NEXT: vrcpps %ymm1, %ymm1 # sched: [11:2.00]			; HASWELL-NO-FMA-NEXT: vrcpps %ymm1, %ymm1 # sched: [11:2.00]
	; HASWELL-NO-FMA-NEXT: vrcpps %ymm0, %ymm0 # sched: [11:2.00]			; HASWELL-NO-FMA-NEXT: vrcpps %ymm0, %ymm0 # sched: [11:2.00]
	; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [12:0.50]			; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm0, %ymm0 # sched: [12:0.50]
	; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm1, %ymm1 # sched: [12:0.50]			; HASWELL-NO-FMA-NEXT: vmulps {{.*}}(%rip), %ymm1, %ymm1 # sched: [12:0.50]
	; HASWELL-NO-FMA-NEXT: retq # sched: [7:1.00]			; HASWELL-NO-FMA-NEXT: retq # sched: [7:1.00]
	;			;
	; KNL-LABEL: v16f32_no_step2:			; KNL-LABEL: v16f32_no_step2:
	; KNL: # %bb.0:			; KNL: # %bb.0:
	; KNL-NEXT: vmovaps {{.*#+}} zmm1 = [1.000000e+00,2.000000e+00,3.000000e+00,4.000000e+00,5.000000e+00,6.000000e+00,7.000000e+00,8.000000e+00,9.000000e+00,1.000000e+01,1.100000e+01,1.200000e+01,1.300000e+01,1.400000e+01,1.500000e+01,1.600000e+01] sched: [5:0.50]			; KNL-NEXT: vrcp14ps %zmm0, %zmm0 # sched: [5:1.00]
	; KNL-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [12:1.00]			; KNL-NEXT: vmulps {{.*}}(%rip), %zmm0, %zmm0 # sched: [12:0.50]
	; KNL-NEXT: retq # sched: [7:1.00]			; KNL-NEXT: retq # sched: [7:1.00]
	;			;
	; SKX-LABEL: v16f32_no_step2:			; SKX-LABEL: v16f32_no_step2:
	; SKX: # %bb.0:			; SKX: # %bb.0:
	; SKX-NEXT: vmovaps {{.*#+}} zmm1 = [1.000000e+00,2.000000e+00,3.000000e+00,4.000000e+00,5.000000e+00,6.000000e+00,7.000000e+00,8.000000e+00,9.000000e+00,1.000000e+01,1.100000e+01,1.200000e+01,1.300000e+01,1.400000e+01,1.500000e+01,1.600000e+01] sched: [8:0.50]			; SKX-NEXT: vrcp14ps %zmm0, %zmm0 # sched: [9:2.00]
	; SKX-NEXT: vdivps %zmm0, %zmm1, %zmm0 # sched: [18:10.00]			; SKX-NEXT: vmulps {{.*}}(%rip), %zmm0, %zmm0 # sched: [11:0.50]
	; SKX-NEXT: retq # sched: [7:1.00]			; SKX-NEXT: retq # sched: [7:1.00]
	%div = fdiv fast <16 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0>, %x			%div = fdiv fast <16 x float> <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0>, %x
	ret <16 x float> %div			ret <16 x float> %div
	}			}

	attributes #0 = { "unsafe-fp-math"="true" "reciprocal-estimates"="!divf,!vec-divf" }			attributes #0 = { "unsafe-fp-math"="true" "reciprocal-estimates"="!divf,!vec-divf" }
	attributes #1 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf,vec-divf" }			attributes #1 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf,vec-divf" }
	attributes #2 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf:2,vec-divf:2" }			attributes #2 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf:2,vec-divf:2" }
	attributes #3 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf:0,vec-divf:0" }			attributes #3 = { "unsafe-fp-math"="true" "reciprocal-estimates"="divf:0,vec-divf:0" }
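
The "reciprocal-estimates" strings in the attribute groups above control which operations use an estimate and how many refinement steps follow it: a leading `!` disables an operation, and a `:N` suffix requests N steps. A minimal C++17 sketch of reading such a string is given below. It is not LLVM's actual parser, which handles further cases that are ignored here; `EstimateSetting` and `lookup_estimate` are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical result type: whether the estimate is enabled for an operation
// and how many refinement steps were requested (empty means unspecified).
struct EstimateSetting {
    bool enabled = false;
    std::optional<int> steps;
};

// Looks up `op` (e.g. "vec-divf") in a comma-separated attribute string such
// as "divf:2,vec-divf:2" or "!divf,!vec-divf". Returns std::nullopt when the
// operation is not mentioned. Illustrative sketch only.
std::optional<EstimateSetting> lookup_estimate(const std::string &attr,
                                               const std::string &op) {
    assert(!op.empty());
    std::istringstream in(attr);
    std::string entry;
    while (std::getline(in, entry, ',')) {
        EstimateSetting s;
        s.enabled = true;
        // A leading '!' disables the estimate for this operation.
        if (!entry.empty() && entry[0] == '!') {
            s.enabled = false;
            entry.erase(0, 1);
        }
        // An optional ":N" suffix gives the number of refinement steps.
        std::string name = entry;
        const std::size_t colon = entry.find(':');
        if (colon != std::string::npos) {
            name = entry.substr(0, colon);
            s.steps = std::stoi(entry.substr(colon + 1));
        }
        if (name == op)
            return s;
    }
    return std::nullopt;
}
```

Read this way, attribute group #3 ("divf:0,vec-divf:0") asks for the bare estimate with no refinement, which matches the single vrcp14ps in v16f32_no_step, while #2 asks for two steps, matching the two FMA pairs in v16f32_two_step2.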

test/CodeGen/X86/sqrt-fastmath.ll

	[509 lines not shown]
	; AVX1-NEXT: vmulps %ymm5, %ymm1, %ymm1			; AVX1-NEXT: vmulps %ymm5, %ymm1, %ymm1
	; AVX1-NEXT: vaddps %ymm4, %ymm1, %ymm1			; AVX1-NEXT: vaddps %ymm4, %ymm1, %ymm1
	; AVX1-NEXT: vmulps %ymm1, %ymm3, %ymm1			; AVX1-NEXT: vmulps %ymm1, %ymm3, %ymm1
	; AVX1-NEXT: vmulps %ymm1, %ymm2, %ymm1			; AVX1-NEXT: vmulps %ymm1, %ymm2, %ymm1
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX512-LABEL: v16f32_estimate:			; AVX512-LABEL: v16f32_estimate:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vsqrtps %zmm0, %zmm0			; AVX512-NEXT: vrsqrt14ps %zmm0, %zmm1
	; AVX512-NEXT: vbroadcastss {{.*#+}} zmm1 = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]			; AVX512-NEXT: vmulps %zmm1, %zmm0, %zmm0
	; AVX512-NEXT: vdivps %zmm0, %zmm1, %zmm0			; AVX512-NEXT: vfmadd213ps {{.*#+}} zmm0 = (zmm1 * zmm0) + mem
				; AVX512-NEXT: vmulps {{.*}}(%rip){1to16}, %zmm1, %zmm1
				; AVX512-NEXT: vmulps %zmm0, %zmm1, %zmm0
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	%sqrt = tail call <16 x float> @llvm.sqrt.v16f32(<16 x float> %x)			%sqrt = tail call <16 x float> @llvm.sqrt.v16f32(<16 x float> %x)
	%div = fdiv fast <16 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %sqrt			%div = fdiv fast <16 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %sqrt
	ret <16 x float> %div			ret <16 x float> %div
	}			}


	attributes #0 = { "unsafe-fp-math"="true" "reciprocal-estimates"="!sqrtf,!vec-sqrtf,!divf,!vec-divf" }			attributes #0 = { "unsafe-fp-math"="true" "reciprocal-estimates"="!sqrtf,!vec-sqrtf,!divf,!vec-divf" }
	attributes #1 = { "unsafe-fp-math"="true" "reciprocal-estimates"="sqrt,vec-sqrt" }			attributes #1 = { "unsafe-fp-math"="true" "reciprocal-estimates"="sqrt,vec-sqrt" }
	attributes #2 = { nounwind readnone }			attributes #2 = { nounwind readnone }
	attributes #3 = { "unsafe-fp-math"="true" "reciprocal-estimates"="sqrt,vec-sqrt" "denormal-fp-math"="ieee" }			attributes #3 = { "unsafe-fp-math"="true" "reciprocal-estimates"="sqrt,vec-sqrt" "denormal-fp-math"="ieee" }
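
The new AVX512 sequence for v16f32_estimate can be read as one Newton-Raphson step for the reciprocal square root, in the form y1 = (-0.5 * y0) * (a * y0 * y0 + (-3.0)). The scalar C++ sketch below follows that reading. The constants -3.0 and -0.5 are assumptions, since the test only shows memory operands, and `approx_rsqrt` is a hypothetical stand-in that imitates the error size of a hardware estimate rather than its actual behavior.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for VRSQRT14PS: 1/sqrt(a) with a relative error of
// roughly 2^-15, i.e. inside a 2^-14 bound. Not how the hardware computes it.
float approx_rsqrt(float a) {
    assert(a > 0.0f);
    return (1.0f / std::sqrt(a)) * (1.0f + std::ldexp(1.0f, -15));
}

// One Newton-Raphson step: y1 = (-0.5 * y0) * (a * y0 * y0 + (-3.0)).
// Each line corresponds to one instruction of the checked sequence.
float refine_rsqrt(float a, float y0) {
    float t = a * y0;            // vmulps:      t = a * y0
    t = std::fma(y0, t, -3.0f);  // vfmadd213ps: t = y0 * t + (-3.0), assumed constant
    float h = y0 * -0.5f;        // vmulps:      h = y0 * (-0.5), assumed constant
    return h * t;                // vmulps:      y1 = h * t
}

// Relative error of y as an approximation of 1/sqrt(a), evaluated in double.
double rsqrt_rel_error(float a, float y) {
    return std::fabs(y * std::sqrt(static_cast<double>(a)) - 1.0);
}
```

With a = 2, `refine_rsqrt(2.0f, approx_rsqrt(2.0f))` brings the error from about 2^-15 down to the level of single-precision rounding, which is consistent with the default of one refinement step in getSqrtEstimate.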