This is an archive of the discontinued LLVM Phabricator instance.

[x86] Implement combineRepeatedFPDivisors
ClosedPublic

Authored by spatel on Apr 9 2015, 4:36 PM.

Details

Reviewers

qcolombet
RKSimon
chandlerc
andreadb

Commits

rG7024b8121a9e: [x86] Implement combineRepeatedFPDivisors
rL235012: [x86] Implement combineRepeatedFPDivisors

Summary

This is a trivial patch, but I want to make sure that I'm not being too aggressive for any existing chips.

I've set the transform bar at 2 divisions because the fastest x86 FP divider circuit that I know of is in SandyBridge / Haswell, at a 10-cycle latency (best case) relative to a 5-cycle multiplier. That's the worst case for this transform: one division plus two multiplies takes the same 20 cycles as two divisions, so there is no latency win. But multiplies are pipelined while divisions are not, so there's still a big throughput win, which we would expect to show up in typical FP code.
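As a rough sanity check of the latency argument, here is a minimal C++ sketch. It is not part of the patch: the function names are made up for illustration, the two constants are the cycle counts quoted above, and the model assumes a fully serial dependency chain while ignoring throughput entirely.

```cpp
// Latencies quoted in the summary for SandyBridge / Haswell (best case).
constexpr unsigned DivLatency = 10;
constexpr unsigned MulLatency = 5;

// N dependent divisions by the same divisor, executed back to back.
constexpr unsigned divChainLatency(unsigned NumDivs) {
  return NumDivs * DivLatency;
}

// One division to form the reciprocal, then N dependent multiplies.
constexpr unsigned recipChainLatency(unsigned NumDivs) {
  return DivLatency + NumDivs * MulLatency;
}

// Hypothetical helper: true when the rewritten chain is no slower
// than the original chain under this simplified latency-only model.
constexpr bool recipIsNoSlower(unsigned NumDivs) {
  return recipChainLatency(NumDivs) <= divChainLatency(NumDivs);
}
```

Under this model a single division gets slower after the rewrite (15 vs. 10 cycles), two divisions break even (20 vs. 20), and three divisions win (25 vs. 30), which is consistent with setting the bar at 2.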

These are the sequences I'm comparing:

divss   %xmm2, %xmm0
mulss   %xmm1, %xmm0
divss   %xmm2, %xmm0

Becomes:

movss   LCPI0_0(%rip), %xmm3    ## xmm3 = mem[0],zero,zero,zero
divss   %xmm2, %xmm3
mulss   %xmm3, %xmm0
mulss   %xmm1, %xmm0
mulss   %xmm3, %xmm0

[Ignore for the moment that we don't optimize the chain of 3 multiplies into 2 independent fmuls followed by 1 dependent fmul...this is the DAG version of: https://llvm.org/bugs/show_bug.cgi?id=21768 ...if we fix that, then the transform becomes even more profitable on all targets.]
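To make the rewrite concrete, here is a small C++ sketch of the arithmetic performed by the two sequences above, plus the reassociated form mentioned in the bracketed note. It is not part of the patch; the function names are hypothetical, and it assumes the constant loaded by the movss is 1.0.

```cpp
// Original form: divss, mulss, divss, i.e. ((x / z) * y) / z.
constexpr float divTwice(float x, float y, float z) {
  return x / z * y / z;
}

// Chain of three dependent multiplies by a precomputed reciprocal r,
// matching the mulss/mulss/mulss sequence: ((x * r) * y) * r.
constexpr float mulChain(float x, float y, float r) {
  return x * r * y * r;
}

// Rewritten form: one division to form r = 1/z, then the multiply chain.
constexpr float mulByRecip(float x, float y, float z) {
  return mulChain(x, y, 1.0f / z);
}

// Reassociated form from the note: two independent multiplies
// followed by one dependent multiply.
constexpr float mulReassociated(float x, float y, float z) {
  return (x * (1.0f / z)) * (y * (1.0f / z));
}
```

For a power-of-two divisor such as z = 2 the reciprocal is exact and all three forms agree bit for bit. For other divisors, 1/z is rounded, so the results can differ in the last bits, which is why the transform is only legal under 'arcp' / unsafe FP math.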

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 23542. Apr 9 2015, 4:36 PM

spatel retitled this revision from to [x86] Implement combineRepeatedFPDivisors.

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: chandlerc, RKSimon, andreadb, qcolombet.

spatel added a subscriber: Unknown Object (MLST).

Hi Sanjay,

Looks good to me.

Thanks,
-Quentin

This revision is now accepted and ready to land. Apr 13 2015, 11:51 AM

Closed by commit rL235012: [x86] Implement combineRepeatedFPDivisors (authored by spatel). Apr 15 2015, 8:26 AM

This revision was automatically updated to reflect the committed changes.

Thanks, Quentin - checked in at r235012.

Revision Contents

Path                                             Size
llvm/trunk/lib/Target/X86/X86ISelLowering.h      3 lines
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp    10 lines
llvm/trunk/test/CodeGen/X86/fdiv-combine.ll      31 lines

Diff 23777

llvm/trunk/lib/Target/X86/X86ISelLowering.h

@@ private:
    /// Use rsqrt* to speed up sqrt calculations.
    SDValue getRsqrtEstimate(SDValue Operand, DAGCombinerInfo &DCI,
                             unsigned &RefinementSteps,
                             bool &UseOneConstNR) const override;

    /// Use rcp* to speed up fdiv calculations.
    SDValue getRecipEstimate(SDValue Operand, DAGCombinerInfo &DCI,
                             unsigned &RefinementSteps) const override;
+
+   /// Reassociate floating point divisions into multiply by reciprocal.
+   bool combineRepeatedFPDivisors(unsigned NumUsers) const override;
  };

  namespace X86 {
    FastISel *createFastISel(FunctionLoweringInfo &funcInfo,
                             const TargetLibraryInfo *libInfo);
  }
  }

  #endif // X86ISELLOWERING_H

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

@@ SDValue X86TargetLowering::getRecipEstimate(SDValue Op,
    if ((Subtarget->hasSSE1() && (VT == MVT::f32 || VT == MVT::v4f32)) ||
        (Subtarget->hasAVX() && VT == MVT::v8f32)) {
      RefinementSteps = ReciprocalEstimateRefinementSteps;
      return DCI.DAG.getNode(X86ISD::FRCP, SDLoc(Op), VT, Op);
    }
    return SDValue();
  }

+
+ /// If we have at least two divisions that use the same divisor, convert to
+ /// multiplication by a reciprocal. This may need to be adjusted for a given
+ /// CPU if a division's cost is not at least twice the cost of a multiplication.
+ /// This is because we still need one division to calculate the reciprocal and
+ /// then we need two multiplies by that reciprocal as replacements for the
+ /// original divisions.
+ bool X86TargetLowering::combineRepeatedFPDivisors(unsigned NumUsers) const {
+   return NumUsers > 1;
+ }

  static bool isAllOnes(SDValue V) {
    ConstantSDNode *C = dyn_cast<ConstantSDNode>(V);
    return C && C->isAllOnesValue();
  }

  /// LowerToBT - Result of 'and' is compared against zero. Turn it into a BT node
  /// if it's possible.
  SDValue X86TargetLowering::LowerToBT(SDValue And, ISD::CondCode CC,
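
The comment on the new hook notes that the threshold may need adjusting per CPU. A minimal standalone sketch of that idea follows. It is not LLVM code: the DivisorCombinePolicy type and both policy objects are hypothetical, and only the member name and the "more than one user" rule for x86 are taken from the patch.

```cpp
// Hypothetical stand-in for a per-target decision; not the LLVM class.
struct DivisorCombinePolicy {
  // Smallest number of divisions sharing a divisor that triggers the rewrite.
  unsigned MinUsers;

  // Mirrors the shape of the hook in the patch: given the number of
  // divisions that use the same divisor, decide whether to rewrite them
  // as multiplications by a reciprocal.
  constexpr bool combineRepeatedFPDivisors(unsigned NumUsers) const {
    return NumUsers >= MinUsers;
  }
};

// Same decision as the x86 hook in this patch (NumUsers > 1).
constexpr DivisorCombinePolicy X86Policy = {2};

// Assumed example of a target whose divider is less than twice as
// expensive as its multiplier, so it waits for a third user.
constexpr DivisorCombinePolicy CautiousPolicy = {3};
```

Expressing the bar as a single number keeps the cost reasoning in one place: a target only has to state how many shared-divisor divisions it takes before one division plus N multiplies beats N divisions.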

llvm/trunk/test/CodeGen/X86/fdiv-combine.ll

+ ; RUN: llc < %s -mtriple=x86_64-unknown-unknown | FileCheck %s
+
+ ; Anything more than one division using a single divisor operand
+ ; should be converted into a reciprocal and multiplication.
+
+ define float @div1_arcp(float %x, float %y, float %z) #0 {
+ ; CHECK-LABEL: div1_arcp:
+ ; CHECK: # BB#0:
+ ; CHECK-NEXT: divss %xmm1, %xmm0
+ ; CHECK-NEXT: retq
+   %div1 = fdiv arcp float %x, %y
+   ret float %div1
+ }
+
+ define float @div2_arcp(float %x, float %y, float %z) #0 {
+ ; CHECK-LABEL: div2_arcp:
+ ; CHECK: # BB#0:
+ ; CHECK-NEXT: movss {{.*#+}} xmm3 = mem[0],zero,zero,zero
+ ; CHECK-NEXT: divss %xmm2, %xmm3
+ ; CHECK-NEXT: mulss %xmm3, %xmm0
+ ; CHECK-NEXT: mulss %xmm1, %xmm0
+ ; CHECK-NEXT: mulss %xmm3, %xmm0
+ ; CHECK-NEXT: retq
+   %div1 = fdiv arcp float %x, %z
+   %mul = fmul arcp float %div1, %y
+   %div2 = fdiv arcp float %mul, %z
+   ret float %div2
+ }
+
+ ; FIXME: If the backend understands 'arcp', then this attribute is unnecessary.
+ attributes #0 = { "unsafe-fp-math"="true" }