This is an archive of the discontinued LLVM Phabricator instance.

Differential D21379

[X86] Heuristic to selectively build Newton-Raphson SQRT estimation
Closed · Public

Authored by n.bozhenov on Jun 15 2016, 7:38 AM.

Details

Reviewers

spatel
nadav
tstellarAMD
andreadb
bogner
hfinkel

Commits

rGf679530ba180: [X86] Heuristic to selectively build Newton-Raphson SQRT estimation
rL277725: [X86] Heuristic to selectively build Newton-Raphson SQRT estimation

Summary

On modern Intel processors, the hardware SQRT instruction is in many cases
faster than RSQRT followed by Newton-Raphson (NR) refinement. This patch
introduces a simple heuristic to choose between the hardware SQRT instruction
and the Newton-Raphson software estimate.

The patch treats scalars and vectors differently: for scalars the compiler
should optimize for latency, while for vectors it should optimize for
throughput.

In effect, the patch disables scalar NR for big cores and disables NR completely
for Skylake. First, scalar SQRT has shorter latency than the NR sequence on big
cores. Second, vector SQRT has been greatly improved in Skylake and has better
throughput than NR.
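
To illustrate the rule described in the summary, here is a minimal, self-contained C++ sketch. It is not the patch's code: `SqrtTuning`, the three presets, and `keepHardwareSqrt` are hypothetical names chosen for this example, loosely modeled on the `HasFastScalarFSQRT` and `HasFastVectorFSQRT` subtarget flags that appear in the diff.

```cpp
#include <cassert>

// Hypothetical stand-in for the two subtarget tuning flags added by the patch.
struct SqrtTuning {
  bool FastScalarFSQRT; // scalar SQRT beats RSQRT + NR on latency
  bool FastVectorFSQRT; // vector SQRT beats RSQRT + NR on throughput
};

// Illustrative presets following the summary: big cores keep scalar SQRT,
// Skylake keeps both scalar and vector SQRT, everything else uses NR.
const SqrtTuning BigCore{true, false};
const SqrtTuning Skylake{true, true};
const SqrtTuning Generic{false, false};

// Returns true when the hardware SQRT instruction should be kept, and false
// when SQRT(x) should be rebuilt as x * RSQRT(x) with NR refinement.
// Scalars (one element) are judged on latency, vectors on throughput.
inline bool keepHardwareSqrt(const SqrtTuning &T, unsigned NumElts) {
  assert(NumElts >= 1 && "a value has at least one element");
  return NumElts > 1 ? T.FastVectorFSQRT : T.FastScalarFSQRT;
}
```

With these presets, a scalar square root on a big core keeps the hardware instruction, while a 4-element vector square root on the same core is still expanded to the NR sequence.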

Event Timeline

n.bozhenov updated this revision to Diff 60832. Jun 15 2016, 7:38 AM

n.bozhenov retitled this revision to [X86] Heuristic to selectively build Newton-Raphson SQRT estimation.

n.bozhenov updated this object.

n.bozhenov added reviewers: bogner, hfinkel, andreadb, spatel, nadav.

n.bozhenov added subscribers: zansari, DavidKreitzer, zinovy.nis, llvm-commits.

Herald added a reviewer: tstellarAMD. Jun 15 2016, 7:38 AM

Herald added a subscriber: arsenm.

Below are some figures to justify the change.

Latency/throughput data from Architecture Optimization Manual:

|      |  IVB |  HSW  |   BDW | SKL  |
|------+------+-------+-------+------|
| x32  | 14/7 |  13/7 |  13/4 | 13/3 |
| x128 | 14/7 |  13/7 |  13/7 | 13/3 |
| x256 |      | 19/13 | 19/13 | 12/6 |

Experimental Newton-Raphson efficiency for latency-bound code:

|      |  IVB |  HSW |  BDW |  SKL |
|------+------+------+------+------|
| x32  | -41% | -40% | -21% | -40% |
| x128 | -32% | -32% | -17% | -35% |

Experimental Newton-Raphson efficiency for throughput-bound code:

|      |  IVB |  HSW |  BDW |  SKL |
|------+------+------+------+------|
| x32  | +18% | +21% | -17% | -40% |
| x128 | +10% | +14% | +28% | -50% |
| x256 |      | +68% | +85% |  +3% |

RKSimon added a subscriber: RKSimon. Jun 15 2016, 9:18 AM

The latency/throughput data are for the SQRT instruction, of course.

Shouldn't HSW show a latency improvement over IVB from using FMA?
How many N-R steps are included in your measurements?
Do the measurements include the change from D21127?

When we enabled the estimate generation code ( https://llvm.org/bugs/show_bug.cgi?id=21385#c32 ), we knew it had higher latency for SNB/IVB/HSW, but we reasoned that most real-world FP code would care more about throughput. This patch proposes to change that behavior for those targets (i.e., favor latency at the expense of throughput). Do you have any benchmark numbers (test-suite, SPEC, etc.) for those CPUs that show a difference?

For the test file, please add RUNs that include the new attributes themselves rather than specifying a CPU. That way we'll have coverage for the expected behavior independently of any individual CPU.

I have no objection to the AMDGPU changes.

An updated version of the patch has been uploaded. After more careful
benchmarking and analysis, I found a performance problem in a corner case where
both SQRT(x) and RSQRT(x) are required. In that case the compiler may emit a
plain SQRTSS instruction to calculate SQRT(x) and an RSQRTSS followed by
refinement to calculate RSQRT(x). So I've added a check to
X86TargetLowering::isFsqrtCheap to avoid building both SQRT and RSQRT
instructions for the same input value.
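
The corner case above can be sketched in a few lines of standalone C++. This is only an illustration under stated assumptions: `MiniDAG` and `isFsqrtCheapSketch` are hypothetical, and values are identified by name for simplicity. The actual patch queries the SelectionDAG for an existing X86ISD::FRSQRT node via `DAG.getNodeIfExists`, as shown in the X86ISelLowering.cpp diff.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical miniature of a DAG: it only records which values already feed
// an RSQRT node. The real code inspects SelectionDAG nodes instead.
struct MiniDAG {
  std::set<std::string> RsqrtOperands;

  bool hasRsqrtOf(const std::string &Value) const {
    return RsqrtOperands.count(Value) != 0;
  }
};

// Returns true when SQRT(Operand) should stay a hardware SQRT instruction.
// If RSQRT(Operand) is already being computed, reuse it (return false) so
// that SQRT and RSQRT are never both emitted for the same input.
inline bool isFsqrtCheapSketch(const MiniDAG &DAG, const std::string &Operand,
                               bool HardwareSqrtIsFast) {
  assert(!Operand.empty() && "operand name must be non-empty");
  if (DAG.hasRsqrtOf(Operand))
    return false;
  return HardwareSqrtIsFast;
}
```

The order of the checks matters: the "RSQRT already exists" test comes first, so it overrides the subtarget preference even on a CPU where hardware SQRT is otherwise considered fast.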

Great questions, Sanjay!

Shouldn't HSW show a latency improvement over IVB from using FMA?

FMA doesn't affect the results much. In most cases the difference between FMA
code and non-FMA code is not significant. The only case significantly affected
by FMA is scalar SQRT on Haswell, where NR gets 15% higher throughput with FMA.
Here is the table for throughput-bound FMA code:

|      |  HSW |  BDW |  SKL |
|------+------+------+------|
| x32  | +38% | -12% | -26% |
| x128 | +12% | +32% | -30% |
| x256 | +69% | +84% |  +6% |

And the updated table for latency-bound FMA code:

|      |  HSW |  BDW |  SKL |
|------+------+------+------|
| x32  | -32% | -20% | -25% |
| x128 | -34% | -28% | -25% |
| x256 | -21% |  +6% |  -2% |

How many N-R steps are included in your measurements?

I benchmarked the default configuration, which uses one refinement step.
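
For readers unfamiliar with the refinement step being measured, the following standalone C++ sketch shows the textbook Newton-Raphson iteration for 1/sqrt(x) and how sqrt(x) is rebuilt as x * rsqrt(x). It is an illustration, not the code LLVM generates: the function names are hypothetical, the hardware estimate is simulated by perturbing the exact value with a chosen relative error, and special inputs such as zero are ignored.

```cpp
#include <cassert>
#include <cmath>

// One Newton-Raphson step for y ~ 1/sqrt(x):
//   y1 = y0 * (1.5 - 0.5 * x * y0 * y0)
// A relative error e in y0 becomes roughly 1.5 * e^2 in y1.
inline double refineRsqrt(double X, double Y0) {
  assert(X > 0.0 && "this sketch only handles positive inputs");
  return Y0 * (1.5 - 0.5 * X * Y0 * Y0);
}

// sqrt(x) rebuilt as x * rsqrt(x), with a configurable number of NR steps.
inline double sqrtViaRsqrt(double X, double Estimate, unsigned Steps = 1) {
  double Y = Estimate;
  for (unsigned I = 0; I < Steps; ++I)
    Y = refineRsqrt(X, Y);
  return X * Y;
}

// Simulated hardware estimate: the exact 1/sqrt(x) scaled by (1 + RelErr).
inline double simulatedEstimate(double X, double RelErr) {
  return (1.0 + RelErr) / std::sqrt(X);
}
```

Starting from an estimate with relative error 2^-12 (the minimum architected accuracy quoted in the X86ISelLowering.cpp comment), one step brings the relative error down to roughly 1.5 * 2^-24, which is why a single step is the default for single precision.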

Do the measurements include the change from D21127?

Yes.

When we enabled the estimate generation code, we knew it had higher latency for SNB/IVB/HSW, but we reasoned that most real-world FP code would care more about throughput.

The idea behind this patch is that throughput-bound code is likely to be
vectorized, so for vectorized code we should care about throughput. If the code
is scalar, that probably means it has some kind of dependency, and we should
care more about reducing latency.
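
The difference between the two regimes can be made concrete with a deliberately simplified cost model, sketched below in C++. The model and the names `InstrCost`, `chainCycles`, and `independentCycles` are assumptions made for this example. It also assumes that the second number in the latency/throughput table earlier in the thread is a reciprocal throughput in cycles.

```cpp
#include <cassert>

// Simplified per-instruction cost: cycles until the result is ready, and
// cycles between issuing two independent instances (reciprocal throughput).
struct InstrCost {
  unsigned Latency;
  unsigned RecipThroughput;
};

// Dependent chain: each operation waits for the previous result, so the
// total is bounded by latency.
inline unsigned long long chainCycles(InstrCost C, unsigned N) {
  return static_cast<unsigned long long>(N) * C.Latency;
}

// Independent operations: they overlap in the pipeline, so after the first
// result the total grows with the reciprocal throughput only.
inline unsigned long long independentCycles(InstrCost C, unsigned N) {
  assert(N >= 1 && "need at least one operation");
  return C.Latency +
         static_cast<unsigned long long>(N - 1) * C.RecipThroughput;
}
```

Using the Skylake x32 entry (13/3) from the table, 100 dependent square roots cost 1300 cycles in this model, while 100 independent ones cost 13 + 99 * 3 = 310 cycles. Which of the two numbers matters therefore depends on whether the surrounding code forms a dependency chain.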

You mentioned PR21385, but that PR mostly deals with reciprocal square roots,
which are NOT affected by this patch. SQRTSS+DIVSS is far too heavy a
combination even for Skylake. So this patch is only about using SQRTSS/SQRTPS
instructions to calculate non-reciprocal roots.

Do you have any benchmark numbers (test-suite, SPEC, etc) for those CPUs that shows a difference?

We see large performance improvements on some of our internal benchmarks, which
were our motivating examples. As for SPEC, the updated version of the patch
doesn't affect any hot code in SPEC2000 or SPEC2006. Nor does it affect hot code
in the testsuite/Bullet benchmark (which is very sqrt-intensive).

This patch also affects more than performance: it improves accuracy in the
affected cases. In general, we should prefer more precise code even with
fast-math unless we expect a significant performance improvement.

For the test file, please add RUNs that include the new attributes themselves rather than specifying a CPU.

Done. I've added a few more RUNs to the test to check the attributes themselves.

I'd obviously also like to think that any throughput-sensitive code is vectorized (or at least that the vectorizer has unrolled it to hide latency if helpful). In general, this low-latency/low-throughput vs. high-latency/high-throughput choice is exactly the kind of situation that the MachineCombiner is supposed to handle. It might be difficult to use here, however, because we want to insert NR before instruction selection (to take advantage of DAGCombine simplifications and complex ISel patterns), and replacing the estimate/NR sequence in the MachineCombiner might be tricky (especially since the user can select the number of iterations, so matching only the simple case of one iteration would be undesirable). As a result, unfortunately, phase-ordering constraints might prevent using the principled solution here. The heuristic here does not seem unreasonable; I just wish we did not have to use it.

In general, however, the rationale needs to be much better documented in the code. I don't see any comments in the patch itself about the latency/throughput tradeoff.

Hal, you are right. The patch lacked comments explaining the tradeoff.
The new version of the patch adds the missing comment to the X86.td file.

ping

Hal's comment about the MachineCombiner reminded me of the excellent list of FMA loop examples provided by @v_klochkov in:
https://reviews.llvm.org/D18751.

Summary: the latency vs. throughput trade-off is tricky even if we defer the decision to a later pass.

LGTM, but Hal can confirm if the added comments are sufficient.

Yes, thanks!

This revision is now accepted and ready to land. Aug 2 2016, 9:31 AM

Closed by commit rL277725: [X86] Heuristic to selectively build Newton-Raphson SQRT estimation (authored by n.bozhenov). Aug 4 2016, 5:55 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

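| Path                                     | Size     |
|------------------------------------------+----------|
| include/llvm/Target/TargetLowering.h     | 14 lines |
| lib/CodeGen/SelectionDAG/DAGCombiner.cpp | 8 lines  |
| lib/CodeGen/TargetLoweringBase.cpp       | 1 line   |
| lib/Target/AMDGPU/AMDGPUISelLowering.h   | 3 lines  |
| lib/Target/AMDGPU/AMDGPUISelLowering.cpp | 2 lines  |
| lib/Target/X86/X86.td                    | 12 lines |
| lib/Target/X86/X86ISelLowering.h         | 3 lines  |
| lib/Target/X86/X86ISelLowering.cpp       | 13 lines |
| lib/Target/X86/X86Subtarget.h            | 10 lines |
| lib/Target/X86/X86Subtarget.cpp          | 2 lines  |
| test/CodeGen/X86/sqrt-fastmath-tune.ll   | 57 lines |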

Diff 65344

include/llvm/Target/TargetLowering.h

@@ ... @@ public:
   /// Return true if integer divide is usually cheaper than a sequence of
   /// several shifts, adds, and multiplies for this target.
   /// The definition of "cheaper" may depend on whether we're optimizing
   /// for speed or for size.
   virtual bool isIntDivCheap(EVT VT, AttributeSet Attr) const {
     return false;
   }

-  /// Return true if sqrt(x) is as cheap or cheaper than 1 / rsqrt(x)
-  bool isFsqrtCheap() const {
-    return FsqrtIsCheap;
+  /// Return true if SQRT(X) shouldn't be replaced with X*RSQRT(X).
+  virtual bool isFsqrtCheap(SDValue X, SelectionDAG &DAG) const {
+    // Default behavior is to replace SQRT(X) with X*RSQRT(X).
+    return false;
   }

   /// Returns true if target has indicated at least one type should be bypassed.
   bool isSlowDivBypassed() const { return !BypassSlowDivWidths.empty(); }

   /// Returns map of slow types for division or remainder with corresponding
   /// fast types
   const DenseMap<unsigned int, unsigned int> &getBypassSlowDivWidths() const {
@@ ... @@ void setHasExtractBitsInsn(bool hasExtractInsn = true) {
     HasExtractBitsInsn = hasExtractInsn;
   }

   /// Tells the code generator not to expand logic operations on comparison
   /// predicates into separate sequences that increase the amount of flow
   /// control.
   void setJumpIsExpensive(bool isExpensive = true);

-  /// Tells the code generator that fsqrt is cheap, and should not be replaced
-  /// with an alternative sequence of instructions.
-  void setFsqrtIsCheap(bool isCheap = true) { FsqrtIsCheap = isCheap; }
-
   /// Tells the code generator that this target supports floating point
   /// exceptions and cares about preserving floating point exception behavior.
   void setHasFloatingPointExceptions(bool FPExceptions = true) {
     HasFloatingPointExceptions = FPExceptions;
   }

   /// Tells the code generator which bitwidths to bypass.
   void addBypassSlowDiv(unsigned int SlowBitWidth, unsigned int FastBitWidth) {
@@ ... @@ private:
   bool HasMultipleConditionRegisters;

   /// Tells the code generator that the target has BitExtract instructions.
   /// The code generator will aggressively sink "shift"s into the blocks of
   /// their users if the users will generate "and" instructions which can be
   /// combined with "shift" to BitExtract instructions.
   bool HasExtractBitsInsn;

-  // Don't expand fsqrt with an approximation based on the inverse sqrt.
-  bool FsqrtIsCheap;
-
   /// Tells the code generator to bypass slow divide or remainder
   /// instructions. For example, BypassSlowDivWidths[32,8] tells the code
   /// generator to bypass 32-bit integer div/rem with an 8-bit unsigned integer
   /// div/rem when the operands are positive and less than 256.
   DenseMap <unsigned int, unsigned int> BypassSlowDivWidths;

   /// Tells the code generator that it shouldn't generate extra flow control
   /// instructions and should attempt to combine flow control instructions via

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

@@ ... @@ SDValue DAGCombiner::visitFREM(SDNode *N) {
   if (N0CFP && N1CFP)
     return DAG.getNode(ISD::FREM, SDLoc(N), VT, N0, N1,
                        &cast<BinaryWithFlagsSDNode>(N)->Flags);

   return SDValue();
 }

 SDValue DAGCombiner::visitFSQRT(SDNode *N) {
-  if (!DAG.getTarget().Options.UnsafeFPMath || TLI.isFsqrtCheap())
+  if (!DAG.getTarget().Options.UnsafeFPMath)
+    return SDValue();
+
+  SDValue N0 = N->getOperand(0);
+  if (TLI.isFsqrtCheap(N0, DAG))
     return SDValue();

   // TODO: FSQRT nodes should have flags that propagate to the created nodes.
   // For now, create a Flags object for use with all unsafe math transforms.
   SDNodeFlags Flags;
   Flags.setUnsafeAlgebra(true);
-  return buildSqrtEstimate(N->getOperand(0), &Flags);
+  return buildSqrtEstimate(N0, &Flags);
 }

 /// copysign(x, fp_extend(y)) -> copysign(x, y)
 /// copysign(x, fp_round(y)) -> copysign(x, y)
 static inline bool CanCombineFCOPYSIGN_EXTEND_ROUND(SDNode *N) {
   SDValue N1 = N->getOperand(1);
   if ((N1.getOpcode() == ISD::FP_EXTEND ||
        N1.getOpcode() == ISD::FP_ROUND)) {

lib/CodeGen/TargetLoweringBase.cpp

@@ ... @@ TargetLoweringBase::TargetLoweringBase(const TargetMachine &tm) : TM(tm) {
   MaxStoresPerMemset = MaxStoresPerMemcpy = MaxStoresPerMemmove = 8;
   MaxStoresPerMemsetOptSize = MaxStoresPerMemcpyOptSize
     = MaxStoresPerMemmoveOptSize = 4;
   UseUnderscoreSetJmp = false;
   UseUnderscoreLongJmp = false;
   SelectIsExpensive = false;
   HasMultipleConditionRegisters = false;
   HasExtractBitsInsn = false;
-  FsqrtIsCheap = false;
   JumpIsExpensive = JumpIsExpensiveOverride;
   PredictableSelectIsExpensive = false;
   MaskAndBranchFoldingIsLegal = false;
   EnableExtLdPromotion = false;
   HasFloatingPointExceptions = true;
   StackPointerRegisterToSaveRestore = 0;
   BooleanContents = UndefinedBooleanContent;
   BooleanFloatContents = UndefinedBooleanContent;

lib/Target/AMDGPU/AMDGPUISelLowering.h

@@ ... @@ SDValue CombineFMinMaxLegacy(SDLoc DL,
                                SDValue RHS,
                                SDValue True,
                                SDValue False,
                                SDValue CC,
                                DAGCombinerInfo &DCI) const;

   const char* getTargetNodeName(unsigned Opcode) const override;

+  bool isFsqrtCheap(SDValue Operand, SelectionDAG &DAG) const override {
+    return true;
+  }
   SDValue getRsqrtEstimate(SDValue Operand,
                            DAGCombinerInfo &DCI,
                            unsigned &RefinementSteps,
                            bool &UseOneConstNR) const override;
   SDValue getRecipEstimate(SDValue Operand,
                            DAGCombinerInfo &DCI,
                            unsigned &RefinementSteps) const override;


lib/Target/AMDGPU/AMDGPUISelLowering.cpp

@@ ... @@ AMDGPUTargetLowering::AMDGPUTargetLowering(TargetMachine &TM,
   // SI at least has hardware support for floating point exceptions, but no way
   // of using or handling them is implemented. They are also optional in OpenCL
   // (Section 7.3)
   setHasFloatingPointExceptions(Subtarget->hasFPExceptions());

   setSelectIsExpensive(false);
   PredictableSelectIsExpensive = false;

-  setFsqrtIsCheap(true);
-
   // We want to find all load dependencies for long chains of stores to enable
   // merging into very wide vectors. The problem is with vectors with > 4
   // elements. MergeConsecutiveStores will attempt to merge these because x8/x16
   // vectors are a legal type, even though we have to split the loads
   // usually. When we can more precisely specify load legality per address
   // space, we should be able to make FindBetterChain/MergeConsecutiveStores
   // smarter so that they can figure out what to do in 2 iterations without all
   // N > 4 stores on the same chain.

lib/Target/X86/X86.td

@@ ... @@
 def FeatureSoftFloat
     : SubtargetFeature<"soft-float", "UseSoftFloat", "true",
                        "Use software floating point features.">;
 // On at least some AMD processors, there is no performance hazard to writing
 // only the lower parts of a YMM register without clearing the upper part.
 def FeatureFastPartialYMMWrite
     : SubtargetFeature<"fast-partial-ymm-write", "HasFastPartialYMMWrite",
                        "true", "Partial writes to YMM registers are fast">;
+def FeatureFastScalarFSQRT
+    : SubtargetFeature<"fast-scalar-fsqrt", "HasFastScalarFSQRT",
+                       "true", "Scalar SQRT is fast (disable Newton-Raphson)">;
+def FeatureFastVectorFSQRT
+    : SubtargetFeature<"fast-vector-fsqrt", "HasFastVectorFSQRT",
+                       "true", "Vector SQRT is fast (disable Newton-Raphson)">;

 //===----------------------------------------------------------------------===//
 // X86 processors supported.
 //===----------------------------------------------------------------------===//

 include "X86Schedule.td"

 def ProcIntelAtom : SubtargetFeature<"atom", "X86ProcFamily", "IntelAtom",
@@ ... @@ def SNBFeatures : ProcessorFeatures<[], [
   FeatureAVX,
   FeatureFXSR,
   FeatureCMPXCHG16B,
   FeaturePOPCNT,
   FeatureAES,
   FeaturePCLMUL,
   FeatureXSAVE,
   FeatureXSAVEOPT,
-  FeatureLAHFSAHF
+  FeatureLAHFSAHF,
+  FeatureFastScalarFSQRT
 ]>;

 class SandyBridgeProc<string Name> : ProcModel<Name, SandyBridgeModel,
                                                SNBFeatures.Value, [
   FeatureSlowBTMem,
   FeatureSlowUAMem32
 ]>;
 def : SandyBridgeProc<"sandybridge">;
@@ ... @@ class BroadwellProc<string Name> : ProcModel<Name, HaswellModel,
                                              BDWFeatures.Value, []>;
 def : BroadwellProc<"broadwell">;

 def SKLFeatures : ProcessorFeatures<BDWFeatures.Value, [
   FeatureMPX,
   FeatureXSAVEC,
   FeatureXSAVES,
   FeatureSGX,
-  FeatureCLFLUSHOPT
+  FeatureCLFLUSHOPT,
+  FeatureFastVectorFSQRT
 ]>;

 // FIXME: define SKL model
 class SkylakeClientProc<string Name> : ProcModel<Name, HaswellModel,
                                                  SKLFeatures.Value, []>;
 def : SkylakeClientProc<"skylake">;

 // FIXME: define KNL model

lib/Target/X86/X86ISelLowering.h

@@ ... @@ private:
     /// Emit nodes that will be selected as "cmp Op0,Op1", or something
     /// equivalent, for use with the given x86 condition code.
     SDValue EmitCmp(SDValue Op0, SDValue Op1, unsigned X86CC, SDLoc dl,
                     SelectionDAG &DAG) const;

     /// Convert a comparison if required by the subtarget.
     SDValue ConvertCmpIfNecessary(SDValue Cmp, SelectionDAG &DAG) const;

+    /// Check if replacement of SQRT with RSQRT should be disabled.
+    bool isFsqrtCheap(SDValue Operand, SelectionDAG &DAG) const override;
+
     /// Use rsqrt* to speed up sqrt calculations.
     SDValue getRsqrtEstimate(SDValue Operand, DAGCombinerInfo &DCI,
                              unsigned &RefinementSteps,
                              bool &UseOneConstNR) const override;

     /// Use rcp* to speed up fdiv calculations.
     SDValue getRecipEstimate(SDValue Operand, DAGCombinerInfo &DCI,
                              unsigned &RefinementSteps) const override;

lib/Target/X86/X86ISelLowering.cpp

@@ ... @@ SDValue Srl = DAG.getNode(ISD::SRL, dl, MVT::i16, FNStSW,
                             DAG.getConstant(8, dl, MVT::i8));
   SDValue TruncSrl = DAG.getNode(ISD::TRUNCATE, dl, MVT::i8, Srl);

   // Some 64-bit targets lack SAHF support, but they do support FCOMI.
   assert(Subtarget.hasLAHFSAHF() && "Target doesn't support SAHF or FCOMI?");
   return DAG.getNode(X86ISD::SAHF, dl, MVT::i32, TruncSrl);
 }

+/// Check if replacement of SQRT with RSQRT should be disabled.
+bool X86TargetLowering::isFsqrtCheap(SDValue Op, SelectionDAG &DAG) const {
+  EVT VT = Op.getValueType();
+
+  // We never want to use both SQRT and RSQRT instructions for the same input.
+  if (DAG.getNodeIfExists(X86ISD::FRSQRT, DAG.getVTList(VT), Op))
+    return false;
+
+  if (VT.isVector())
+    return Subtarget.hasFastVectorFSQRT();
+  return Subtarget.hasFastScalarFSQRT();
+}
+
 /// The minimum architected relative accuracy is 2^-12. We need one
 /// Newton-Raphson step to have a good float result (24 bits of precision).
 SDValue X86TargetLowering::getRsqrtEstimate(SDValue Op,
                                             DAGCombinerInfo &DCI,
                                             unsigned &RefinementSteps,
                                             bool &UseOneConstNR) const {
   EVT VT = Op.getValueType();
   const char *RecipOp;

lib/Target/X86/X86Subtarget.h

Show First 20 Lines • Show All 189 Lines • ▼ Show 20 Lines	protected:
/// True if the LEA instruction should be used for adjusting
/// the stack pointer. This is an optimization for Intel Atom processors.
bool UseLeaForSP;

/// True if there is no performance penalty to writing only the lower parts
/// of a YMM register without clearing the upper part.
bool HasFastPartialYMMWrite;

		/// True if hardware SQRTSS instruction is at least as fast (latency) as
		/// RSQRTSS followed by a Newton-Raphson iteration.
		bool HasFastScalarFSQRT;

		/// True if hardware SQRTPS/VSQRTPS instructions are at least as fast
		/// (throughput) as RSQRTPS/VRSQRTPS followed by a Newton-Raphson iteration.
		bool HasFastVectorFSQRT;

/// True if 8-bit divisions are significantly faster than
/// 32-bit divisions and should be used when possible.
bool HasSlowDivide32;

/// True if 16-bit divides are significantly faster than
/// 64-bit divisions and should be used when possible.
bool HasSlowDivide64;

public:
bool isBTMemSlow() const { return IsBTMemSlow; }
bool isSHLDSlow() const { return IsSHLDSlow; }
bool isUnalignedMem16Slow() const { return IsUAMem16Slow; }
bool isUnalignedMem32Slow() const { return IsUAMem32Slow; }
bool hasSSEUnalignedMem() const { return HasSSEUnalignedMem; }
bool hasCmpxchg16b() const { return HasCmpxchg16b; }
bool useLeaForSP() const { return UseLeaForSP; }
bool hasFastPartialYMMWrite() const { return HasFastPartialYMMWrite; }
		bool hasFastScalarFSQRT() const { return HasFastScalarFSQRT; }
		bool hasFastVectorFSQRT() const { return HasFastVectorFSQRT; }
bool hasSlowDivide32() const { return HasSlowDivide32; }
bool hasSlowDivide64() const { return HasSlowDivide64; }
bool padShortFunctions() const { return PadShortFunctions; }
bool callRegIndirect() const { return CallRegIndirect; }
bool LEAusesAG() const { return LEAUsesAG; }
bool slowLEA() const { return SlowLEA; }
bool slowIncDec() const { return SlowIncDec; }
bool hasCDI() const { return HasCDI; }

lib/Target/X86/X86Subtarget.cpp

void X86Subtarget::initializeEnvironment() {
IsBTMemSlow = false;
IsSHLDSlow = false;
IsUAMem16Slow = false;
IsUAMem32Slow = false;
HasSSEUnalignedMem = false;
HasCmpxchg16b = false;
UseLeaForSP = false;
HasFastPartialYMMWrite = false;
		HasFastScalarFSQRT = false;
		HasFastVectorFSQRT = false;
HasSlowDivide32 = false;
HasSlowDivide64 = false;
PadShortFunctions = false;
CallRegIndirect = false;
LEAUsesAG = false;
SlowLEA = false;
SlowIncDec = false;
stackAlignment = 4;
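
To see how the two new subtarget flags drive the decision in isFsqrtCheap, here is a small standalone model of that logic. It is a sketch only: SubtargetModel and isFsqrtCheapModel are hypothetical names, the IsVector parameter stands in for VT.isVector(), and RsqrtAlreadyUsed stands in for the DAG.getNodeIfExists(X86ISD::FRSQRT, ...) lookup. None of this touches real LLVM types.

```cpp
// Hypothetical stand-in for the two tuning flags added to X86Subtarget.
// Both default to false, matching initializeEnvironment() in the diff.
struct SubtargetModel {
  bool HasFastScalarFSQRT = false;
  bool HasFastVectorFSQRT = false;
};

// Returns true when the hardware SQRT instruction should be kept instead of
// the RSQRT + Newton-Raphson expansion.
inline bool isFsqrtCheapModel(const SubtargetModel &ST, bool IsVector,
                              bool RsqrtAlreadyUsed) {
  // Never mix SQRT and RSQRT for the same input: if an RSQRT of this operand
  // already exists, keep using the estimate.
  if (RsqrtAlreadyUsed)
    return false;
  // Vectors consult the vector flag, scalars the scalar flag.
  return IsVector ? ST.HasFastVectorFSQRT : ST.HasFastScalarFSQRT;
}
```

Because both flags start out false, a CPU that does not opt in keeps the existing estimate-based lowering for both scalars and vectors.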

test/CodeGen/X86/sqrt-fastmath-tune.ll

This file was added.

				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -O2 -mcpu=nehalem | FileCheck %s --check-prefix=SCALAR-EST --check-prefix=VECTOR-EST
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -O2 -mcpu=sandybridge | FileCheck %s --check-prefix=SCALAR-ACC --check-prefix=VECTOR-EST
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -O2 -mcpu=broadwell | FileCheck %s --check-prefix=SCALAR-ACC --check-prefix=VECTOR-EST
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -O2 -mcpu=skylake | FileCheck %s --check-prefix=SCALAR-ACC --check-prefix=VECTOR-ACC

				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -O2 -mattr=+fast-scalar-fsqrt,-fast-vector-fsqrt | FileCheck %s --check-prefix=SCALAR-ACC --check-prefix=VECTOR-EST
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -O2 -mattr=-fast-scalar-fsqrt,+fast-vector-fsqrt | FileCheck %s --check-prefix=SCALAR-EST --check-prefix=VECTOR-ACC

				declare float @llvm.sqrt.f32(float) #0
				declare <4 x float> @llvm.sqrt.v4f32(<4 x float>) #0
				declare <8 x float> @llvm.sqrt.v8f32(<8 x float>) #0

				define float @foo_x1(float %f) #0 {
				; SCALAR-EST-LABEL: foo_x1:
				; SCALAR-EST: # BB#0:
				; SCALAR-EST-NEXT: rsqrtss %xmm0
				; SCALAR-EST: retq
				;
				; SCALAR-ACC-LABEL: foo_x1:
				; SCALAR-ACC: # BB#0:
				; SCALAR-ACC-NEXT: {{^ *v?sqrtss %xmm0}}
				; SCALAR-ACC-NEXT: retq
				%call = tail call float @llvm.sqrt.f32(float %f) #1
				ret float %call
				}

				define <4 x float> @foo_x4(<4 x float> %f) #0 {
				; VECTOR-EST-LABEL: foo_x4:
				; VECTOR-EST: # BB#0:
				; VECTOR-EST-NEXT: rsqrtps %xmm0
				; VECTOR-EST: retq
				;
				; VECTOR-ACC-LABEL: foo_x4:
				; VECTOR-ACC: # BB#0:
				; VECTOR-ACC-NEXT: {{^ *v?sqrtps %xmm0}}
				; VECTOR-ACC-NEXT: retq
				%call = tail call <4 x float> @llvm.sqrt.v4f32(<4 x float> %f) #1
				ret <4 x float> %call
				}

				define <8 x float> @foo_x8(<8 x float> %f) #0 {
				; VECTOR-EST-LABEL: foo_x8:
				; VECTOR-EST: # BB#0:
				; VECTOR-EST-NEXT: rsqrtps
				; VECTOR-EST: retq
				;
				; VECTOR-ACC-LABEL: foo_x8:
				; VECTOR-ACC: # BB#0:
				; VECTOR-ACC-NEXT: {{^ *v?sqrtps %[xy]mm0}}
				; VECTOR-ACC-NOT: rsqrt
				; VECTOR-ACC: retq
				%call = tail call <8 x float> @llvm.sqrt.v8f32(<8 x float> %f) #1
				ret <8 x float> %call
				}

				attributes #0 = { "unsafe-fp-math"="true" }
				attributes #1 = { nounwind readnone }
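
The four -mcpu RUN lines above encode which lowering the test expects per CPU. As a reading aid, the same expectations can be written as a small lookup table. This is an illustrative sketch: CpuExpectation, kExpectations, and lookupExpectation are hypothetical names, and the table covers only the CPUs named in the RUN lines, so any other name returns nullptr rather than a guess.

```cpp
#include <cstring>

// One row per -mcpu RUN line in the test.
// ScalarAcc == true means the SCALAR-ACC prefix (sqrtss) is expected,
// false means SCALAR-EST (rsqrtss). VectorAcc is the analogue for sqrtps.
struct CpuExpectation {
  const char *Cpu;
  bool ScalarAcc;
  bool VectorAcc;
};

static const CpuExpectation kExpectations[] = {
    {"nehalem", false, false},
    {"sandybridge", true, false},
    {"broadwell", true, false},
    {"skylake", true, true},
};

// Returns the row for Cpu, or nullptr if the test has no RUN line for it.
inline const CpuExpectation *lookupExpectation(const char *Cpu) {
  for (const CpuExpectation &E : kExpectations)
    if (std::strcmp(E.Cpu, Cpu) == 0)
      return &E;
  return nullptr;
}
```

The two -mattr RUN lines then check the flags independently of any CPU: +fast-scalar-fsqrt alone selects SCALAR-ACC with VECTOR-EST, and +fast-vector-fsqrt alone selects SCALAR-EST with VECTOR-ACC.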