This is an archive of the discontinued LLVM Phabricator instance.

Use rcpss/rcpps (X86) to speed up reciprocal calcs (PR21385)
ClosedPublic

Authored by spatel on Nov 7 2014, 12:06 PM.

Details

Reviewers

nadav
andreadb
hfinkel

Commits

rGe2e589288fca: Use rcpss/rcpps (X86) to speed up reciprocal calcs (PR21385).
rL221706: Use rcpss/rcpps (X86) to speed up reciprocal calcs (PR21385).

Summary

This is a first step for generating SSE rcp instructions for reciprocal calcs when fast-math allows it. This is very similar to the rsqrt optimization enabled in D5658 ( http://reviews.llvm.org/rL220570 ).

For now, be conservative and only enable this for AMD btver2, where performance improves significantly in both latency and throughput.

We may never enable this codegen for Intel Core* chips because the divider circuits are just too fast. On SandyBridge, divss can be as fast as 10 cycles versus the 21-cycle critical path for the rcp + mul + sub + mul + add estimate.
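
The rcp + mul + sub + mul + add chain above can be sketched in plain C++. This is only an illustration, not code from the patch: `coarseReciprocal` is a hypothetical stand-in for rcpss that injects a relative error of 2^-12, and `refineReciprocal` and `relativeError` are likewise made-up helper names.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for rcpss: returns 1/x with a relative error of
// about 2^-12, so the sketch does not depend on x86 intrinsics.
inline float coarseReciprocal(float x) {
  assert(x != 0.0f && "reciprocal of zero is not handled in this sketch");
  return (1.0f / x) * (1.0f + std::ldexp(1.0f, -12));
}

// One Newton-Raphson step in the same mul, sub, mul, add order as the
// dependency chain described in the summary.
inline float refineReciprocal(float x, float est) {
  float t = x * est;  // mul
  t = 1.0f - t;       // sub
  t = est * t;        // mul
  return est + t;     // add
}

// Relative error of an approximation to 1/x, measured in double precision.
inline double relativeError(float x, float approx) {
  double exact = 1.0 / static_cast<double>(x);
  return std::fabs((static_cast<double>(approx) - exact) / exact);
}
```

Every operation depends on the one before it, which is why the summary compares the whole chain's latency, not its instruction count, against a single divss.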

Follow-on patches may allow configuration of the number of Newton-Raphson refinement steps, add AVX512 support, and enable the optimization for more chips.
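
For reference, the refinement step is the standard Newton-Raphson iteration for a reciprocal. The symbols below ($a$ for the divisor, $x_n$ for the current estimate, $\varepsilon_n$ for its relative error) are notation introduced here, not taken from the patch, and the derivation ignores floating-point rounding.

```latex
\begin{aligned}
x_{n+1} &= x_n + x_n\,(1 - a\,x_n) \\
x_n &= \frac{1 + \varepsilon_n}{a}
  \;\Longrightarrow\; 1 - a\,x_n = -\varepsilon_n \\
x_{n+1} &= \frac{(1 + \varepsilon_n)(1 - \varepsilon_n)}{a}
  = \frac{1 - \varepsilon_n^{2}}{a}
  \;\Longrightarrow\; \varepsilon_{n+1} = -\varepsilon_n^{2}
\end{aligned}
```

Each step therefore squares the relative error. Taking the 2^-12 figure from the comment on getRecipEstimate, one step gives roughly 2^-24, which is why a single refinement is treated as sufficient for float.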

More background here: http://llvm.org/bugs/show_bug.cgi?id=21385

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 15935. Nov 7 2014, 12:06 PM

spatel retitled this revision to Use rcpss/rcpps (X86) to speed up reciprocal calcs (PR21385).

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: hfinkel, andreadb, nadav.

spatel added a subscriber: Unknown Object (MLST).

This is very similar to the rsqrt optimization enabled in D5658 ( http://reviews.llvm.org/rL220570 ).

Yes, indeed it seems that way (when you commit, make sure you mention the commit revision corresponding to D5658 in the commit message).

LGTM.

This revision is now accepted and ready to land. Nov 11 2014, 12:11 AM

Closed by commit rL221706 (authored by @spatel).

Thanks, Hal. Committed with r221706.

Revision Contents

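Path                                             Size
llvm/trunk/lib/Target/X86/X86.td                 4 lines
llvm/trunk/lib/Target/X86/X86ISelLowering.h      4 lines
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp    31 lines
llvm/trunk/lib/Target/X86/X86Subtarget.h         6 lines
llvm/trunk/test/CodeGen/X86/recip-fastmath.ll    72 lines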

Diff 16056

llvm/trunk/lib/Target/X86/X86.td

[...]
 def FeatureLEAUsesAG : SubtargetFeature<"lea-uses-ag", "LEAUsesAG", "true",
     "LEA instruction needs inputs at AG stage">;
 def FeatureSlowLEA : SubtargetFeature<"slow-lea", "SlowLEA", "true",
     "LEA instruction with certain arguments is slow">;
 def FeatureSlowIncDec : SubtargetFeature<"slow-incdec", "SlowIncDec", "true",
     "INC and DEC instructions are slower than ADD and SUB">;
 def FeatureUseSqrtEst : SubtargetFeature<"use-sqrt-est", "UseSqrtEst", "true",
     "Use RSQRT* to optimize square root calculations">;
+def FeatureUseRecipEst : SubtargetFeature<"use-recip-est", "UseReciprocalEst",
+    "true", "Use RCP* to optimize division calculations">;

 //===----------------------------------------------------------------------===//
 // X86 processors supported.
 //===----------------------------------------------------------------------===//

 include "X86Schedule.td"

 def ProcIntelAtom : SubtargetFeature<"atom", "X86ProcFamily", "IntelAtom",
[...]
 def : Proc<"btver1", [FeatureSSSE3, FeatureSSE4A, FeatureCMPXCHG16B,
                       FeatureSlowSHLD]>;

 // Jaguar
 def : ProcessorModel<"btver2", BtVer2Model,
                      [FeatureAVX, FeatureSSE4A, FeatureCMPXCHG16B,
                       FeaturePRFCHW, FeatureAES, FeaturePCLMUL,
                       FeatureBMI, FeatureF16C, FeatureMOVBE,
                       FeatureLZCNT, FeaturePOPCNT, FeatureSlowSHLD,
-                      FeatureUseSqrtEst]>;
+                      FeatureUseSqrtEst, FeatureUseRecipEst]>;

 // Bulldozer
 def : Proc<"bdver1", [FeatureXOP, FeatureFMA4, FeatureCMPXCHG16B,
                       FeatureAES, FeaturePRFCHW, FeaturePCLMUL,
                       FeatureAVX, FeatureSSE4A, FeatureLZCNT,
                       FeaturePOPCNT, FeatureSlowSHLD]>;
 // Piledriver
 def : Proc<"bdver2", [FeatureXOP, FeatureFMA4, FeatureCMPXCHG16B,
[...]

llvm/trunk/lib/Target/X86/X86ISelLowering.h

[...]
 private:

   /// Convert a comparison if required by the subtarget.
   SDValue ConvertCmpIfNecessary(SDValue Cmp, SelectionDAG &DAG) const;

   /// Use rsqrt* to speed up sqrt calculations.
   SDValue getRsqrtEstimate(SDValue Operand, DAGCombinerInfo &DCI,
                            unsigned &RefinementSteps,
                            bool &UseOneConstNR) const override;
+
+  /// Use rcp* to speed up fdiv calculations.
+  SDValue getRecipEstimate(SDValue Operand, DAGCombinerInfo &DCI,
+                           unsigned &RefinementSteps) const override;
 };

 namespace X86 {
   FastISel *createFastISel(FunctionLoweringInfo &funcInfo,
                            const TargetLibraryInfo *libInfo);
 }
 }

 #endif // X86ISELLOWERING_H

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

[...]
   if ((Subtarget->hasSSE1() && (VT == MVT::f32 || VT == MVT::v4f32)) ||
       (Subtarget->hasAVX() && VT == MVT::v8f32)) {
     RefinementSteps = 1;
     UseOneConstNR = false;
     return DCI.DAG.getNode(X86ISD::FRSQRT, SDLoc(Op), VT, Op);
   }
   return SDValue();
 }
+
+/// The minimum architected relative accuracy is 2^-12. We need one
+/// Newton-Raphson step to have a good float result (24 bits of precision).
+SDValue X86TargetLowering::getRecipEstimate(SDValue Op,
+                                            DAGCombinerInfo &DCI,
+                                            unsigned &RefinementSteps) const {
+  // FIXME: We should use instruction latency models to calculate the cost of
+  // each potential sequence, but this is very hard to do reliably because
+  // at least Intel's Core* chips have variable timing based on the number of
+  // significant digits in the divisor.
+  if (!Subtarget->useReciprocalEst())
+    return SDValue();
+
+  EVT VT = Op.getValueType();
+
+  // SSE1 has rcpss and rcpps. AVX adds a 256-bit variant for rcpps.
+  // TODO: Add support for AVX512 (v16f32).
+  // It is likely not profitable to do this for f64 because a double-precision
+  // reciprocal estimate with refinement on x86 prior to FMA requires
+  // 15 instructions: convert to single, rcpss, convert back to double, refine
+  // (3 steps = 12 insts). If an 'rcpsd' variant was added to the ISA
+  // along with FMA, this could be a throughput win.
+  if ((Subtarget->hasSSE1() && (VT == MVT::f32 || VT == MVT::v4f32)) ||
+      (Subtarget->hasAVX() && VT == MVT::v8f32)) {
+    // TODO: Expose this as a user-configurable parameter to allow for
+    // speed vs. accuracy flexibility.
+    RefinementSteps = 1;
+    return DCI.DAG.getNode(X86ISD::FRCP, SDLoc(Op), VT, Op);
+  }
+  return SDValue();
+}

 static bool isAllOnes(SDValue V) {
   ConstantSDNode *C = dyn_cast<ConstantSDNode>(V);
   return C && C->isAllOnesValue();
 }

 /// LowerToBT - Result of 'and' is compared against zero. Turn it into a BT node
 /// if it's possible.
 SDValue X86TargetLowering::LowerToBT(SDValue And, ISD::CondCode CC,
[...]
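
The step and instruction counts quoted in the getRecipEstimate comments can be reproduced with a small calculation. The sketch below assumes an idealized model in which each Newton-Raphson step doubles the number of correct bits and rounding is ignored; `refinementStepsNeeded` and `estimateSequenceLength` are hypothetical helpers written for this illustration, not part of LLVM.

```cpp
#include <cassert>

// Hypothetical helper: number of Newton-Raphson steps needed to grow an
// estimate with `estimateBits` correct bits to at least `targetBits`,
// assuming every step doubles the correct bits (rounding ignored).
inline unsigned refinementStepsNeeded(unsigned estimateBits,
                                      unsigned targetBits) {
  assert(estimateBits > 0 && "an estimate with no correct bits never converges");
  unsigned steps = 0;
  for (unsigned bits = estimateBits; bits < targetBits; bits *= 2)
    ++steps;
  return steps;
}

// Hypothetical helper: total instructions for an estimate sequence, given
// `setupInsts` instructions before refinement (the estimate itself plus any
// conversions) and four instructions (mul, sub, mul, add) per step.
inline unsigned estimateSequenceLength(unsigned setupInsts, unsigned steps) {
  return setupInsts + 4 * steps;
}
```

Under this model a 12-bit estimate needs one step for the 24-bit float significand and three steps for the 53-bit double significand. With one setup instruction for float (rcpss) and three for double (convert, rcpss, convert back), the totals are 5 and 15 instructions, matching the figures in the comments.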

llvm/trunk/lib/Target/X86/X86Subtarget.h

[...]
 protected:
   /// SlowIncDec - True if INC and DEC instructions are slow when writing to flags
   bool SlowIncDec;

   /// Use the RSQRT* instructions to optimize square root calculations.
   /// For this to be profitable, the cost of FSQRT and FDIV must be
   /// substantially higher than normal FP ops like FADD and FMUL.
   bool UseSqrtEst;
+
+  /// Use the RCP* instructions to optimize FP division calculations.
+  /// For this to be profitable, the cost of FDIV must be
+  /// substantially higher than normal FP ops like FADD and FMUL.
+  bool UseReciprocalEst;

   /// Processor has AVX-512 PreFetch Instructions
   bool HasPFI;

   /// Processor has AVX-512 Exponential and Reciprocal Instructions
   bool HasERI;

   /// Processor has AVX-512 Conflict Detection Instructions
   bool HasCDI;
[...]
 public:
   bool useLeaForSP() const { return UseLeaForSP; }
   bool hasSlowDivide() const { return HasSlowDivide; }
   bool padShortFunctions() const { return PadShortFunctions; }
   bool callRegIndirect() const { return CallRegIndirect; }
   bool LEAusesAG() const { return LEAUsesAG; }
   bool slowLEA() const { return SlowLEA; }
   bool slowIncDec() const { return SlowIncDec; }
   bool useSqrtEst() const { return UseSqrtEst; }
+  bool useReciprocalEst() const { return UseReciprocalEst; }
   bool hasCDI() const { return HasCDI; }
   bool hasPFI() const { return HasPFI; }
   bool hasERI() const { return HasERI; }
   bool hasDQI() const { return HasDQI; }
   bool hasBWI() const { return HasBWI; }
   bool hasVLX() const { return HasVLX; }

   bool isAtom() const { return X86ProcFamily == IntelAtom; }
[...]

llvm/trunk/test/CodeGen/X86/recip-fastmath.ll

+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=core2 | FileCheck %s
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=btver2 | FileCheck %s --check-prefix=BTVER2
+
+; If the target's divss/divps instructions are substantially
+; slower than rcpss/rcpps with a Newton-Raphson refinement,
+; we should generate the estimate sequence.
+
+; See PR21385 ( http://llvm.org/bugs/show_bug.cgi?id=21385 )
+; for details about the accuracy, speed, and implementation
+; differences of x86 reciprocal estimates.
+
+define float @reciprocal_estimate(float %x) #0 {
+  %div = fdiv fast float 1.0, %x
+  ret float %div
+
+; CHECK-LABEL: reciprocal_estimate:
+; CHECK: movss
+; CHECK-NEXT: divss
+; CHECK-NEXT: movaps
+; CHECK-NEXT: retq
+
+; BTVER2-LABEL: reciprocal_estimate:
+; BTVER2: vrcpss
+; BTVER2-NEXT: vmulss
+; BTVER2-NEXT: vsubss
+; BTVER2-NEXT: vmulss
+; BTVER2-NEXT: vaddss
+; BTVER2-NEXT: retq
+}
+
+define <4 x float> @reciprocal_estimate_v4f32(<4 x float> %x) #0 {
+  %div = fdiv fast <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, %x
+  ret <4 x float> %div
+
+; CHECK-LABEL: reciprocal_estimate_v4f32:
+; CHECK: movaps
+; CHECK-NEXT: divps
+; CHECK-NEXT: movaps
+; CHECK-NEXT: retq
+
+; BTVER2-LABEL: reciprocal_estimate_v4f32:
+; BTVER2: vrcpps
+; BTVER2-NEXT: vmulps
+; BTVER2-NEXT: vsubps
+; BTVER2-NEXT: vmulps
+; BTVER2-NEXT: vaddps
+; BTVER2-NEXT: retq
+}
+
+define <8 x float> @reciprocal_estimate_v8f32(<8 x float> %x) #0 {
+  %div = fdiv fast <8 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %x
+  ret <8 x float> %div
+
+; CHECK-LABEL: reciprocal_estimate_v8f32:
+; CHECK: movaps
+; CHECK: movaps
+; CHECK-NEXT: divps
+; CHECK-NEXT: divps
+; CHECK-NEXT: movaps
+; CHECK-NEXT: movaps
+; CHECK-NEXT: retq
+
+; BTVER2-LABEL: reciprocal_estimate_v8f32:
+; BTVER2: vrcpps
+; BTVER2-NEXT: vmulps
+; BTVER2-NEXT: vsubps
+; BTVER2-NEXT: vmulps
+; BTVER2-NEXT: vaddps
+; BTVER2-NEXT: retq
+}
+
+attributes #0 = { "unsafe-fp-math"="true" }