This is an archive of the discontinued LLVM Phabricator instance.

Differential D5370

SSE reciprocal square root instruction latencies
Closed, Public

Authored by RKSimon on Sep 16 2014, 10:46 AM.

Details

Reviewers

spatel
qcolombet
andreadb
atrick
rob.lougher

Commits

rG196e873cdc42: [X86][SchedModel] SSE reciprocal square root instruction latencies.

Summary

The SSE rsqrt instruction is a fast reciprocal square root estimate (typically <5 cycles) but is currently grouped in the same IIC_SSE_SQRT* scheduling class as the accurate (but very slow) SSE sqrt instruction (often >20 cycles). For code that uses rsqrt (possibly with Newton-Raphson iterations), this poor scheduling hurts performance.
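
As a rough illustration of the rsqrt-plus-refinement pattern mentioned above, the following Python sketch applies one Newton-Raphson step to a coarse reciprocal square root estimate. The hardware instruction cannot be called from Python, so `coarse_rsqrt` is a hypothetical stand-in that rounds the exact value to about 12 mantissa bits. The function names and the 12-bit figure are assumptions for illustration and are not part of the patch.

```python
import math


def coarse_rsqrt(x: float, bits: int = 12) -> float:
    """Hypothetical stand-in for a hardware rsqrt estimate.

    Computes 1/sqrt(x) exactly, then keeps only `bits` mantissa bits
    to mimic a low-precision estimate (assumption, not the real instruction).
    """
    exact = 1.0 / math.sqrt(x)
    mantissa, exponent = math.frexp(exact)  # exact == mantissa * 2**exponent
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)


def refine_rsqrt(x: float, estimate: float) -> float:
    """One Newton-Raphson step for f(y) = 1/y**2 - x."""
    return estimate * (1.5 - 0.5 * x * estimate * estimate)


def relative_error(x: float, approx: float) -> float:
    """Relative error of `approx` against the exact 1/sqrt(x)."""
    exact = 1.0 / math.sqrt(x)
    return abs(approx - exact) / exact


if __name__ == "__main__":
    x = 2.0
    estimate = coarse_rsqrt(x)
    refined = refine_rsqrt(x, estimate)
    print("estimate error:", relative_error(x, estimate))
    print("refined error: ", relative_error(x, refined))
```

Each refinement step adds a short dependent chain of multiplies and a subtract after the estimate, so the latency the scheduler assumes for rsqrt sits directly on the critical path of such code.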

This patch splits off the rsqrt instruction from the sqrt instruction scheduling classes and creates new IIC_SSE_RSQRT* classes with latency values based on Agner's tables. The latencies/pipelines for supported x86 targets end up being the same as the rcp(ss,ps) instruction but I've kept them separate.
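
The claim that the new rsqrt entries match the rcp entries can be checked against the numbers in the diff. The sketch below copies the `WriteFRcp`, `WriteFRsqrt`, and `WriteFSqrt` latencies that are visible in the scheduler model hunks. The dictionary and helper names are hypothetical, and treating the old modeled rsqrt latency as equal to `WriteFSqrt` is an assumption based on the summary's statement that rsqrt shared the sqrt scheduling class.

```python
from typing import Optional

# Latencies (cycles) copied from the WriteRes entries visible in this diff.
# BtVer2 has no WriteFSqrt value here because that latency is not shown
# in the available diff context.
MODEL_LATENCY = {
    "Haswell": {"WriteFRcp": 5, "WriteFRsqrt": 5, "WriteFSqrt": 15},
    "SandyBridge": {"WriteFRcp": 5, "WriteFRsqrt": 5, "WriteFSqrt": 15},
    "SLM": {"WriteFRcp": 5, "WriteFRsqrt": 5, "WriteFSqrt": 15},
    "BtVer2": {"WriteFRcp": 2, "WriteFRsqrt": 2},
}


def rsqrt_matches_rcp(model: str) -> bool:
    """True when the new rsqrt latency equals the rcp latency for `model`."""
    entry = MODEL_LATENCY[model]
    return entry["WriteFRsqrt"] == entry["WriteFRcp"]


def modeled_latency_drop(model: str) -> Optional[int]:
    """Cycles by which the modeled rsqrt latency shrinks.

    Assumes rsqrt was previously modeled with the WriteFSqrt latency.
    Returns None when the sqrt latency is not visible in the diff.
    """
    entry = MODEL_LATENCY[model]
    if "WriteFSqrt" not in entry:
        return None
    return entry["WriteFSqrt"] - entry["WriteFRsqrt"]


if __name__ == "__main__":
    for name in MODEL_LATENCY:
        print(name, rsqrt_matches_rcp(name), modeled_latency_drop(name))
```

The Atom model is left out because it uses per-instruction itineraries (`InstrItinData`) rather than `WriteRes` pairs, although the same pattern holds there: the new RSQRT itinerary stages mirror the RCP ones.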

There is a proposal for a fast-math optimization to use rsqrt plus Newton-Raphson refinement (http://llvm.org/bugs/show_bug.cgi?id=20900), which would benefit from this as well.

Note - for the Haswell scheduler I've updated the base model but not altered any of the exceptions/overrides.

Event Timeline

RKSimon updated this revision to Diff 13758. Sep 16 2014, 10:46 AM

RKSimon retitled this revision to "SSE reciprocal square root instruction latencies".

RKSimon updated this object.

RKSimon edited the test plan for this revision.

RKSimon added reviewers: spatel, andreadb, rob.lougher, qcolombet.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: Unknown Object (MLST).

Andy - I understand you're swamped, but it's been recommended that I add you to look at my rsqrt scheduling patch.

PING

Hi Simon,
Sorry for the late reply.

The patch looks good to me.
The changes to instruction latencies and the new instruction itineraries look OK to me (I can see how the latencies are based on Agner's tables). However, I think it is better to get the final approval from somebody more familiar with the Intel scheduling models. For example, the change to X86ScheduleAtom.td should probably be reviewed by others.

As a side note: I have run some benchmarks using the compiler with and without your patch. Unfortunately, I haven't seen any particular difference in the codegen. It turns out that most of the benchmarks I tried don't have a good mix of sqrt/rsqrt. Also, as you said, under fast-math we lack a rule for converting sqrt+div to rsqrt+mul (PR20900). I am interested to see how this patch will improve things once PR20900 is fixed.

Thanks,
-Andrea

LGTM.

This revision is now accepted and ready to land. Sep 25 2014, 10:31 AM

Committed at revision 218517.
http://llvm.org/viewvc/llvm-project?view=revision&revision=218517

Revision Contents

Path                                      Base revision   Size
lib/Target/X86/X86InstrSSE.td             217886          16 lines
lib/Target/X86/X86SchedHaswell.td         217886          1 line
lib/Target/X86/X86SchedSandyBridge.td     217886          1 line
lib/Target/X86/X86Schedule.td             217886          18 lines
lib/Target/X86/X86ScheduleAtom.td         217886          5 lines
lib/Target/X86/X86ScheduleBtVer2.td       217886          12 lines
lib/Target/X86/X86ScheduleSLM.td          217886          1 line

Diff 13758

lib/Target/X86/X86InstrSSE.td

Context not available.
 >;
 }

+let Sched = WriteFRsqrt in {
+def SSE_RSQRTPS : OpndItins<
+IIC_SSE_RSQRTPS_RR, IIC_SSE_RSQRTPS_RM
+>;
+
+def SSE_RSQRTSS : OpndItins<
+IIC_SSE_RSQRTSS_RR, IIC_SSE_RSQRTSS_RM
+>;
+}
+
 let Sched = WriteFRcp in {
 def SSE_RCPP : OpndItins<
 IIC_SSE_RCPP_RR, IIC_SSE_RCPP_RM
Context not available.

 // Reciprocal approximations. Note that these typically require refinement
 // in order to obtain suitable precision.
-defm RSQRT : sse1_fp_unop_rw<0x52, "rsqrt", X86frsqrt, SSE_SQRTSS>,
-sse1_fp_unop_p<0x52, "rsqrt", X86frsqrt, SSE_SQRTPS>,
+defm RSQRT : sse1_fp_unop_rw<0x52, "rsqrt", X86frsqrt, SSE_RSQRTSS>,
+sse1_fp_unop_p<0x52, "rsqrt", X86frsqrt, SSE_RSQRTPS>,
 sse1_fp_unop_p_int<0x52, "rsqrt", int_x86_sse_rsqrt_ps,
-int_x86_avx_rsqrt_ps_256, SSE_SQRTPS>;
+int_x86_avx_rsqrt_ps_256, SSE_RSQRTPS>;
 defm RCP : sse1_fp_unop_rw<0x53, "rcp", X86frcp, SSE_RCPS>,
 sse1_fp_unop_p<0x53, "rcp", X86frcp, SSE_RCPP>,
 sse1_fp_unop_p_int<0x53, "rcp", int_x86_sse_rcp_ps,
Context not available.

lib/Target/X86/X86SchedHaswell.td

Context not available.
 defm : HWWriteResPair<WriteFMul, HWPort0, 5>;
 defm : HWWriteResPair<WriteFDiv, HWPort0, 12>; // 10-14 cycles.
 defm : HWWriteResPair<WriteFRcp, HWPort0, 5>;
+defm : HWWriteResPair<WriteFRsqrt, HWPort0, 5>;
 defm : HWWriteResPair<WriteFSqrt, HWPort0, 15>;
 defm : HWWriteResPair<WriteCvtF2I, HWPort1, 3>;
 defm : HWWriteResPair<WriteCvtI2F, HWPort1, 4>;
Context not available.

lib/Target/X86/X86SchedSandyBridge.td

Context not available.
 defm : SBWriteResPair<WriteFMul, SBPort0, 5>;
 defm : SBWriteResPair<WriteFDiv, SBPort0, 12>; // 10-14 cycles.
 defm : SBWriteResPair<WriteFRcp, SBPort0, 5>;
+defm : SBWriteResPair<WriteFRsqrt, SBPort0, 5>;
 defm : SBWriteResPair<WriteFSqrt, SBPort0, 15>;
 defm : SBWriteResPair<WriteCvtF2I, SBPort1, 3>;
 defm : SBWriteResPair<WriteCvtI2F, SBPort1, 4>;
Context not available.

lib/Target/X86/X86Schedule.td

Context not available.
 defm WriteJump : X86SchedWritePair;

 // Floating point. This covers both scalar and vector operations.
 defm WriteFAdd : X86SchedWritePair; // Floating point add/sub/compare.
 defm WriteFMul : X86SchedWritePair; // Floating point multiplication.
 defm WriteFDiv : X86SchedWritePair; // Floating point division.
 defm WriteFSqrt : X86SchedWritePair; // Floating point square root.
-defm WriteFRcp : X86SchedWritePair; // Floating point reciprocal.
+defm WriteFRcp : X86SchedWritePair; // Floating point reciprocal estimate.
+defm WriteFRsqrt : X86SchedWritePair; // Floating point reciprocal square root estimate.
 defm WriteFMA : X86SchedWritePair; // Fused Multiply Add.
 defm WriteFShuffle : X86SchedWritePair; // Floating point vector shuffles.
 defm WriteFBlend : X86SchedWritePair; // Floating point vector blends.
 defm WriteFVarBlend : X86SchedWritePair; // Fp vector variable blends.
Context not available.
 def IIC_SSE_SQRTSD_RR : InstrItinClass;
 def IIC_SSE_SQRTSD_RM : InstrItinClass;

+def IIC_SSE_RSQRTPS_RR : InstrItinClass;
+def IIC_SSE_RSQRTPS_RM : InstrItinClass;
+def IIC_SSE_RSQRTSS_RR : InstrItinClass;
+def IIC_SSE_RSQRTSS_RM : InstrItinClass;
+
 def IIC_SSE_RCPP_RR : InstrItinClass;
 def IIC_SSE_RCPP_RM : InstrItinClass;
 def IIC_SSE_RCPS_RR : InstrItinClass;
Context not available.

lib/Target/X86/X86ScheduleAtom.td

Context not available.
 InstrItinData<IIC_SSE_SQRTSD_RR, [InstrStage<62, [Port0, Port1]>] >,
 InstrItinData<IIC_SSE_SQRTSD_RM, [InstrStage<62, [Port0, Port1]>] >,

+InstrItinData<IIC_SSE_RSQRTPS_RR, [InstrStage<9, [Port0, Port1]>] >,
+InstrItinData<IIC_SSE_RSQRTPS_RM, [InstrStage<10, [Port0, Port1]>] >,
+InstrItinData<IIC_SSE_RSQRTSS_RR, [InstrStage<4, [Port0]>] >,
+InstrItinData<IIC_SSE_RSQRTSS_RM, [InstrStage<4, [Port0]>] >,
+
 InstrItinData<IIC_SSE_RCPP_RR, [InstrStage<9, [Port0, Port1]>] >,
 InstrItinData<IIC_SSE_RCPP_RM, [InstrStage<10, [Port0, Port1]>] >,
 InstrItinData<IIC_SSE_RCPS_RR, [InstrStage<4, [Port0]>] >,
Context not available.

lib/Target/X86/X86ScheduleBtVer2.td

Context not available.
 // FIXME: should we bother splitting JFPU pipe + unit stages for fast instructions?
 // FIXME: Double precision latencies
 // FIXME: SS vs PS latencies
-// FIXME: RSQRT latencies
 // FIXME: ymm latencies
 ////////////////////////////////////////////////////////////////////////////////

 defm : JWriteResFpuPair<WriteFAdd, JFPU0, 3>;
 defm : JWriteResFpuPair<WriteFMul, JFPU1, 2>;
 defm : JWriteResFpuPair<WriteFRcp, JFPU1, 2>;
+defm : JWriteResFpuPair<WriteFRsqrt, JFPU1, 2>;
 defm : JWriteResFpuPair<WriteFShuffle, JFPU01, 1>;
 defm : JWriteResFpuPair<WriteFBlend, JFPU01, 1>;
 defm : JWriteResFpuPair<WriteFShuffle256, JFPU01, 1>;

 def : WriteRes<WriteFSqrt, [JFPU1, JLAGU, JFPM]> {
Context not available.

lib/Target/X86/X86ScheduleSLM.td

Context not available.
 // Scalar and vector floating point.
 defm : SMWriteResPair<WriteFAdd, FPC_RSV1, 3>;
 defm : SMWriteResPair<WriteFRcp, FPC_RSV0, 5>;
+defm : SMWriteResPair<WriteFRsqrt, FPC_RSV0, 5>;
 defm : SMWriteResPair<WriteFSqrt, FPC_RSV0, 15>;
 defm : SMWriteResPair<WriteCvtF2I, FPC_RSV01, 4>;
 defm : SMWriteResPair<WriteCvtI2F, FPC_RSV01, 4>;
Context not available.
