This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] try repeated fdiv divisor transform before building estimate
ClosedPublic

Authored by spatel on Apr 25 2019, 1:54 PM.

Details

Reviewers

RKSimon
craig.topper
nemanjai

Commits

rG19728261785d: [DAGCombiner] try repeated fdiv divisor transform before building estimate (2nd…
rL359793: [DAGCombiner] try repeated fdiv divisor transform before building estimate (2nd…
rGfb9a5307a94e: [DAGCombiner] try repeated fdiv divisor transform before building estimate
rL359398: [DAGCombiner] try repeated fdiv divisor transform before building estimate

Summary

This was originally part of D61028, but it's an independent diff.

If we do the repeated divisor reciprocal transform before producing an estimate sequence, then we have an opportunity to use scalar fdiv. On x86, the trade-off is 1 divss vs. 5 vector FP ops in the default estimate sequence. On recent chips (Skylake, Ryzen), full-precision division has a throughput of only 3 cycles, so it is probably the better default for performance, and it avoids the problems caused by x86's inaccurate estimates.

The last 2 tests show that users still have the option to override the defaults by using the function attributes for reciprocal estimates, but we can potentially make those faster by converting vector ops (including ymm ops) to scalar math.
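
As a rough illustration of the trade-off described above, the sketch below models both strategies in plain C++ rather than in SelectionDAG nodes. The helper names (`divideEachLane`, `multiplyByReciprocal`, `refineReciprocal`) are hypothetical and not part of LLVM. `refineReciprocal` assumes a single Newton-Raphson step, which is the shape of the rcpps/mulps/subps/mulps/addps sequence in the test check lines.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Baseline: every lane is divided by the same divisor y (N divisions).
template <std::size_t N>
std::array<float, N> divideEachLane(const std::array<float, N> &x, float y) {
  std::array<float, N> r{};
  for (std::size_t i = 0; i < N; ++i)
    r[i] = x[i] / y;
  return r;
}

// Repeated-divisor form: one scalar division (the single divss in the
// summary), then one multiply per lane by the shared reciprocal.
template <std::size_t N>
std::array<float, N> multiplyByReciprocal(const std::array<float, N> &x,
                                          float y) {
  const float recip = 1.0f / y;
  std::array<float, N> r{};
  for (std::size_t i = 0; i < N; ++i)
    r[i] = x[i] * recip;
  return r;
}

// One Newton-Raphson refinement of a reciprocal estimate `est` of 1/y:
//   est' = est + est * (1 - y * est)
// This is a mul, a sub, a mul and an add on top of the initial estimate.
inline float refineReciprocal(float y, float est) {
  return est + est * (1.0f - y * est);
}
```

For a power-of-two divisor the two forms agree exactly, for example `assert(multiplyByReciprocal(x, 4.0f) == divideEachLane(x, 4.0f));` with `std::array<float, 4> x{1.0f, 2.0f, 3.0f, 4.0f}`. For other divisors the results may differ in the last bit, which is why the transform is only applied under arcp or unsafe math.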

Diff Detail

Repository: rL LLVM

Event Timeline

spatel created this revision. · Apr 25 2019, 1:54 PM

Herald added a project: Restricted Project. · Apr 25 2019, 1:54 PM

Herald added subscribers: hiraditya, mcrosier.

LGTM - thanks for splitting this off from D61028

This revision is now accepted and ready to land. · Apr 26 2019, 12:40 AM

Closed by commit rL359398: [DAGCombiner] try repeated fdiv divisor transform before building estimate (authored by spatel). · Apr 28 2019, 5:21 AM

This revision was automatically updated to reflect the committed changes.

Reopening - I reverted this at rL359695 because it can cause an infinite loop from opposing combines.
There's a proposal to solve that in D61384, so I'll merge that here and try again.

This revision is now accepted and ready to land. · May 1 2019, 9:44 AM

spatel planned changes to this revision. · May 1 2019, 9:44 AM

spatel mentioned this in rL359709: [PowerPC] add test that could infinite loop with reordered transforms; NFC. · May 1 2019, 10:32 AM

spatel mentioned this in rG9f6861449457: [PowerPC] add test that could infinite loop with reordered transforms; NFC.

Patch updated:
Add a check for a vector splat of "1.0" when bailing out of the repeated divisor transform. I suspect this bug may be present independently of the original change here (moving the order of the transforms), but I didn't try to create another test case to prove that.

A reduction of the infinite looping test from D61384 was added at rL359709. I don't think we need to add a specific helper to find a splat of "1.0" through a constant pool load as is proposed in D61384 - a simpler use of the existing isConstOrConstSplatFP() avoids the problem.
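
To make the bail-out change concrete, here is a toy model in standalone C++, not LLVM code. `ToyLane`, `ToyOperand`, `isScalarOne` and `isOneOrSplatOfOne` are hypothetical names. The first check only recognizes a scalar constant 1.0, loosely analogous to the old `dyn_cast<ConstantFPSDNode>` test; the second also accepts a vector whose lanes are all the constant 1.0, which is the case the updated check is meant to catch. Undef lanes, which the real helper can tolerate via its `AllowUndefs` argument, are not modeled here.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for one lane of a DAG operand.
struct ToyLane {
  bool isConstant; // false models a value unknown at compile time
  double value;    // only meaningful when isConstant is true
};

// Hypothetical stand-in for an operand: one lane models a scalar.
struct ToyOperand {
  std::vector<ToyLane> lanes;
};

static bool isConstantOne(const ToyLane &lane) {
  return lane.isConstant && lane.value == 1.0;
}

// Scalar-only check: a vector splat of 1.0 is NOT recognized.
bool isScalarOne(const ToyOperand &op) {
  return op.lanes.size() == 1 && isConstantOne(op.lanes[0]);
}

// Splat-aware check: accepts a scalar 1.0 or a vector of all-1.0 lanes.
bool isOneOrSplatOfOne(const ToyOperand &op) {
  return !op.lanes.empty() &&
         std::all_of(op.lanes.begin(), op.lanes.end(), isConstantOne);
}
```

With a four-lane operand of constant 1.0 values, `isScalarOne` returns false while `isOneOrSplatOfOne` returns true, so only the splat-aware check would treat a vector `1.0 / y` as an already-formed reciprocal and skip it.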

This revision is now accepted and ready to land. · May 1 2019, 11:14 AM

spatel mentioned this in D61384: [DAGCombine] Fix for regression introduced in rL359398. · May 1 2019, 12:46 PM

Interestingly enough, I tried this (or some variant of this) and it didn't work for me. Clearly I had done something wrong when trying it. In any case, I've run this against the code we were originally spinning on and it's all good. LGTM and thanks for fixing so quickly.

Closed by commit rL359793: [DAGCombiner] try repeated fdiv divisor transform before building estimate (2nd… (authored by spatel). · May 2 2019, 8:04 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp (8 lines)
llvm/trunk/test/CodeGen/X86/fdiv-combine-vec.ll (66 lines)

Diff 197789

llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

(11,909 lines not shown)	SDValue DAGCombiner::combineRepeatedFPDivisors(SDNode *N) {
// that only minsize should restrict this.		// that only minsize should restrict this.
bool UnsafeMath = DAG.getTarget().Options.UnsafeFPMath;		bool UnsafeMath = DAG.getTarget().Options.UnsafeFPMath;
const SDNodeFlags Flags = N->getFlags();		const SDNodeFlags Flags = N->getFlags();
if (!UnsafeMath && !Flags.hasAllowReciprocal())		if (!UnsafeMath && !Flags.hasAllowReciprocal())
return SDValue();		return SDValue();

// Skip if current node is a reciprocal.		// Skip if current node is a reciprocal.
SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
ConstantFPSDNode *N0CFP = dyn_cast<ConstantFPSDNode>(N0);		ConstantFPSDNode *N0CFP = isConstOrConstSplatFP(N0, /* AllowUndefs */ true);
if (N0CFP && N0CFP->isExactlyValue(1.0))		if (N0CFP && N0CFP->isExactlyValue(1.0))
return SDValue();		return SDValue();

// Exit early if the target does not want this transform or if there can't		// Exit early if the target does not want this transform or if there can't
// possibly be enough uses of the divisor to make the transform worthwhile.		// possibly be enough uses of the divisor to make the transform worthwhile.
SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);
unsigned MinUses = TLI.combineRepeatedFPDivisors();		unsigned MinUses = TLI.combineRepeatedFPDivisors();

(61 lines not shown)	SDValue DAGCombiner::visitFDIV(SDNode *N) {

// fold (fdiv c1, c2) -> c1/c2		// fold (fdiv c1, c2) -> c1/c2
if (N0CFP && N1CFP)		if (N0CFP && N1CFP)
return DAG.getNode(ISD::FDIV, SDLoc(N), VT, N0, N1, Flags);		return DAG.getNode(ISD::FDIV, SDLoc(N), VT, N0, N1, Flags);

if (SDValue NewSel = foldBinOpIntoSelect(N))		if (SDValue NewSel = foldBinOpIntoSelect(N))
return NewSel;		return NewSel;

		if (SDValue V = combineRepeatedFPDivisors(N))
		return V;

if (Options.UnsafeFPMath || Flags.hasAllowReciprocal()) {		if (Options.UnsafeFPMath || Flags.hasAllowReciprocal()) {
// fold (fdiv X, c2) -> fmul X, 1/c2 if losing precision is acceptable.		// fold (fdiv X, c2) -> fmul X, 1/c2 if losing precision is acceptable.
if (N1CFP) {		if (N1CFP) {
// Compute the reciprocal 1.0 / c2.		// Compute the reciprocal 1.0 / c2.
const APFloat &N1APF = N1CFP->getValueAPF();		const APFloat &N1APF = N1CFP->getValueAPF();
APFloat Recip(N1APF.getSemantics(), 1); // 1.0		APFloat Recip(N1APF.getSemantics(), 1); // 1.0
APFloat::opStatus st = Recip.divide(N1APF, APFloat::rmNearestTiesToEven);		APFloat::opStatus st = Recip.divide(N1APF, APFloat::rmNearestTiesToEven);
// Only do the transform if the reciprocal is a legal fp immediate that		// Only do the transform if the reciprocal is a legal fp immediate that
(73 lines not shown)	if (char RHSNeg = isNegatibleForFree(N1, LegalOperations, TLI, &Options,
GetNegatedExpression(N0, DAG, LegalOperations,		GetNegatedExpression(N0, DAG, LegalOperations,
ForCodeSize),		ForCodeSize),
GetNegatedExpression(N1, DAG, LegalOperations,		GetNegatedExpression(N1, DAG, LegalOperations,
ForCodeSize),		ForCodeSize),
Flags);		Flags);
}		}
}		}

if (SDValue CombineRepeatedDivisors = combineRepeatedFPDivisors(N))
return CombineRepeatedDivisors;

return SDValue();		return SDValue();
}		}

SDValue DAGCombiner::visitFREM(SDNode *N) {		SDValue DAGCombiner::visitFREM(SDNode *N) {
SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);
ConstantFPSDNode *N0CFP = dyn_cast<ConstantFPSDNode>(N0);		ConstantFPSDNode *N0CFP = dyn_cast<ConstantFPSDNode>(N0);
ConstantFPSDNode *N1CFP = dyn_cast<ConstantFPSDNode>(N1);		ConstantFPSDNode *N1CFP = dyn_cast<ConstantFPSDNode>(N1);
(7,998 lines not shown)

llvm/trunk/test/CodeGen/X86/fdiv-combine-vec.ll

(45 lines not shown)	; AVX-NEXT: retq
%splaty = shufflevector <4 x double> %vy, <4 x double> undef, <4 x i32> zeroinitializer		%splaty = shufflevector <4 x double> %vy, <4 x double> undef, <4 x i32> zeroinitializer
%r = fdiv arcp <4 x double> %x, %splaty		%r = fdiv arcp <4 x double> %x, %splaty
ret <4 x double> %r		ret <4 x double> %r
}		}

define <4 x float> @splat_fdiv_v4f32(<4 x float> %x, float %y) {		define <4 x float> @splat_fdiv_v4f32(<4 x float> %x, float %y) {
; SSE-LABEL: splat_fdiv_v4f32:		; SSE-LABEL: splat_fdiv_v4f32:
; SSE: # %bb.0:		; SSE: # %bb.0:
; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,0,0,0]		; SSE-NEXT: movss {{.*#+}} xmm2 = mem[0],zero,zero,zero
; SSE-NEXT: rcpps %xmm1, %xmm2		; SSE-NEXT: divss %xmm1, %xmm2
; SSE-NEXT: mulps %xmm2, %xmm1		; SSE-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,0,0,0]
; SSE-NEXT: movaps {{.*#+}} xmm3 = [1.0E+0,1.0E+0,1.0E+0,1.0E+0]		; SSE-NEXT: mulps %xmm2, %xmm0
; SSE-NEXT: subps %xmm1, %xmm3
; SSE-NEXT: mulps %xmm2, %xmm3
; SSE-NEXT: addps %xmm2, %xmm3
; SSE-NEXT: mulps %xmm3, %xmm0
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: splat_fdiv_v4f32:		; AVX-LABEL: splat_fdiv_v4f32:
; AVX: # %bb.0:		; AVX: # %bb.0:
		; AVX-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero
		; AVX-NEXT: vdivss %xmm1, %xmm2, %xmm1
; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[0,0,0,0]		; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[0,0,0,0]
; AVX-NEXT: vrcpps %xmm1, %xmm2
; AVX-NEXT: vmulps %xmm2, %xmm1, %xmm1
; AVX-NEXT: vmovaps {{.*#+}} xmm3 = [1.0E+0,1.0E+0,1.0E+0,1.0E+0]
; AVX-NEXT: vsubps %xmm1, %xmm3, %xmm1
; AVX-NEXT: vmulps %xmm1, %xmm2, %xmm1
; AVX-NEXT: vaddps %xmm1, %xmm2, %xmm1
; AVX-NEXT: vmulps %xmm1, %xmm0, %xmm0		; AVX-NEXT: vmulps %xmm1, %xmm0, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%vy = insertelement <4 x float> undef, float %y, i32 0		%vy = insertelement <4 x float> undef, float %y, i32 0
%splaty = shufflevector <4 x float> %vy, <4 x float> undef, <4 x i32> zeroinitializer		%splaty = shufflevector <4 x float> %vy, <4 x float> undef, <4 x i32> zeroinitializer
%r = fdiv arcp reassoc <4 x float> %x, %splaty		%r = fdiv arcp reassoc <4 x float> %x, %splaty
ret <4 x float> %r		ret <4 x float> %r
}		}

define <8 x float> @splat_fdiv_v8f32(<8 x float> %x, float %y) {		define <8 x float> @splat_fdiv_v8f32(<8 x float> %x, float %y) {
; SSE-LABEL: splat_fdiv_v8f32:		; SSE-LABEL: splat_fdiv_v8f32:
; SSE: # %bb.0:		; SSE: # %bb.0:
; SSE-NEXT: movss {{.*#+}} xmm3 = mem[0],zero,zero,zero		; SSE-NEXT: movss {{.*#+}} xmm3 = mem[0],zero,zero,zero
; SSE-NEXT: divss %xmm2, %xmm3		; SSE-NEXT: divss %xmm2, %xmm3
; SSE-NEXT: shufps {{.*#+}} xmm3 = xmm3[0,0,0,0]		; SSE-NEXT: shufps {{.*#+}} xmm3 = xmm3[0,0,0,0]
; SSE-NEXT: mulps %xmm3, %xmm0		; SSE-NEXT: mulps %xmm3, %xmm0
; SSE-NEXT: mulps %xmm3, %xmm1		; SSE-NEXT: mulps %xmm3, %xmm1
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: splat_fdiv_v8f32:		; AVX-LABEL: splat_fdiv_v8f32:
; AVX: # %bb.0:		; AVX: # %bb.0:
		; AVX-NEXT: vmovss {{.*#+}} xmm2 = mem[0],zero,zero,zero
		; AVX-NEXT: vdivss %xmm1, %xmm2, %xmm1
; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[0,0,0,0]		; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[0,0,0,0]
; AVX-NEXT: vinsertf128 $1, %xmm1, %ymm1, %ymm1		; AVX-NEXT: vinsertf128 $1, %xmm1, %ymm1, %ymm1
; AVX-NEXT: vrcpps %ymm1, %ymm2
; AVX-NEXT: vmulps %ymm2, %ymm1, %ymm1
; AVX-NEXT: vmovaps {{.*#+}} ymm3 = [1.0E+0,1.0E+0,1.0E+0,1.0E+0,1.0E+0,1.0E+0,1.0E+0,1.0E+0]
; AVX-NEXT: vsubps %ymm1, %ymm3, %ymm1
; AVX-NEXT: vmulps %ymm1, %ymm2, %ymm1
; AVX-NEXT: vaddps %ymm1, %ymm2, %ymm1
; AVX-NEXT: vmulps %ymm1, %ymm0, %ymm0		; AVX-NEXT: vmulps %ymm1, %ymm0, %ymm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%vy = insertelement <8 x float> undef, float %y, i32 0		%vy = insertelement <8 x float> undef, float %y, i32 0
%splaty = shufflevector <8 x float> %vy, <8 x float> undef, <8 x i32> zeroinitializer		%splaty = shufflevector <8 x float> %vy, <8 x float> undef, <8 x i32> zeroinitializer
%r = fdiv fast <8 x float> %x, %splaty		%r = fdiv fast <8 x float> %x, %splaty
ret <8 x float> %r		ret <8 x float> %r
}		}

define <4 x float> @splat_fdiv_v4f32_estimate(<4 x float> %x, float %y) #0 {		define <4 x float> @splat_fdiv_v4f32_estimate(<4 x float> %x, float %y) #0 {
; SSE-LABEL: splat_fdiv_v4f32_estimate:		; SSE-LABEL: splat_fdiv_v4f32_estimate:
; SSE: # %bb.0:		; SSE: # %bb.0:
; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,0,0,0]		; SSE-NEXT: rcpss %xmm1, %xmm2
; SSE-NEXT: rcpps %xmm1, %xmm2		; SSE-NEXT: mulss %xmm2, %xmm1
; SSE-NEXT: mulps %xmm2, %xmm1		; SSE-NEXT: movss {{.*#+}} xmm3 = mem[0],zero,zero,zero
; SSE-NEXT: movaps {{.*#+}} xmm3 = [1.0E+0,1.0E+0,1.0E+0,1.0E+0]		; SSE-NEXT: subss %xmm1, %xmm3
; SSE-NEXT: subps %xmm1, %xmm3		; SSE-NEXT: mulss %xmm2, %xmm3
; SSE-NEXT: mulps %xmm2, %xmm3		; SSE-NEXT: addss %xmm2, %xmm3
; SSE-NEXT: addps %xmm2, %xmm3		; SSE-NEXT: shufps {{.*#+}} xmm3 = xmm3[0,0,0,0]
; SSE-NEXT: mulps %xmm3, %xmm0		; SSE-NEXT: mulps %xmm3, %xmm0
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: splat_fdiv_v4f32_estimate:		; AVX-LABEL: splat_fdiv_v4f32_estimate:
; AVX: # %bb.0:		; AVX: # %bb.0:
		; AVX-NEXT: vrcpss %xmm1, %xmm1, %xmm2
		; AVX-NEXT: vmulss %xmm2, %xmm1, %xmm1
		; AVX-NEXT: vmovss {{.*#+}} xmm3 = mem[0],zero,zero,zero
		; AVX-NEXT: vsubss %xmm1, %xmm3, %xmm1
		; AVX-NEXT: vmulss %xmm1, %xmm2, %xmm1
		; AVX-NEXT: vaddss %xmm1, %xmm2, %xmm1
; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[0,0,0,0]		; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[0,0,0,0]
; AVX-NEXT: vrcpps %xmm1, %xmm2
; AVX-NEXT: vmulps %xmm2, %xmm1, %xmm1
; AVX-NEXT: vmovaps {{.*#+}} xmm3 = [1.0E+0,1.0E+0,1.0E+0,1.0E+0]
; AVX-NEXT: vsubps %xmm1, %xmm3, %xmm1
; AVX-NEXT: vmulps %xmm1, %xmm2, %xmm1
; AVX-NEXT: vaddps %xmm1, %xmm2, %xmm1
; AVX-NEXT: vmulps %xmm1, %xmm0, %xmm0		; AVX-NEXT: vmulps %xmm1, %xmm0, %xmm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%vy = insertelement <4 x float> undef, float %y, i32 0		%vy = insertelement <4 x float> undef, float %y, i32 0
%splaty = shufflevector <4 x float> %vy, <4 x float> undef, <4 x i32> zeroinitializer		%splaty = shufflevector <4 x float> %vy, <4 x float> undef, <4 x i32> zeroinitializer
%r = fdiv arcp reassoc <4 x float> %x, %splaty		%r = fdiv arcp reassoc <4 x float> %x, %splaty
ret <4 x float> %r		ret <4 x float> %r
}		}

define <8 x float> @splat_fdiv_v8f32_estimate(<8 x float> %x, float %y) #0 {		define <8 x float> @splat_fdiv_v8f32_estimate(<8 x float> %x, float %y) #0 {
; SSE-LABEL: splat_fdiv_v8f32_estimate:		; SSE-LABEL: splat_fdiv_v8f32_estimate:
; SSE: # %bb.0:		; SSE: # %bb.0:
; SSE-NEXT: rcpss %xmm2, %xmm3		; SSE-NEXT: rcpss %xmm2, %xmm3
; SSE-NEXT: mulss %xmm3, %xmm2		; SSE-NEXT: mulss %xmm3, %xmm2
; SSE-NEXT: movss {{.*#+}} xmm4 = mem[0],zero,zero,zero		; SSE-NEXT: movss {{.*#+}} xmm4 = mem[0],zero,zero,zero
; SSE-NEXT: subss %xmm2, %xmm4		; SSE-NEXT: subss %xmm2, %xmm4
; SSE-NEXT: mulss %xmm3, %xmm4		; SSE-NEXT: mulss %xmm3, %xmm4
; SSE-NEXT: addss %xmm3, %xmm4		; SSE-NEXT: addss %xmm3, %xmm4
; SSE-NEXT: shufps {{.*#+}} xmm4 = xmm4[0,0,0,0]		; SSE-NEXT: shufps {{.*#+}} xmm4 = xmm4[0,0,0,0]
; SSE-NEXT: mulps %xmm4, %xmm0		; SSE-NEXT: mulps %xmm4, %xmm0
; SSE-NEXT: mulps %xmm4, %xmm1		; SSE-NEXT: mulps %xmm4, %xmm1
; SSE-NEXT: retq		; SSE-NEXT: retq
;		;
; AVX-LABEL: splat_fdiv_v8f32_estimate:		; AVX-LABEL: splat_fdiv_v8f32_estimate:
; AVX: # %bb.0:		; AVX: # %bb.0:
		; AVX-NEXT: vrcpss %xmm1, %xmm1, %xmm2
		; AVX-NEXT: vmulss %xmm2, %xmm1, %xmm1
		; AVX-NEXT: vmovss {{.*#+}} xmm3 = mem[0],zero,zero,zero
		; AVX-NEXT: vsubss %xmm1, %xmm3, %xmm1
		; AVX-NEXT: vmulss %xmm1, %xmm2, %xmm1
		; AVX-NEXT: vaddss %xmm1, %xmm2, %xmm1
; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[0,0,0,0]		; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[0,0,0,0]
; AVX-NEXT: vinsertf128 $1, %xmm1, %ymm1, %ymm1		; AVX-NEXT: vinsertf128 $1, %xmm1, %ymm1, %ymm1
; AVX-NEXT: vrcpps %ymm1, %ymm2
; AVX-NEXT: vmulps %ymm2, %ymm1, %ymm1
; AVX-NEXT: vmovaps {{.*#+}} ymm3 = [1.0E+0,1.0E+0,1.0E+0,1.0E+0,1.0E+0,1.0E+0,1.0E+0,1.0E+0]
; AVX-NEXT: vsubps %ymm1, %ymm3, %ymm1
; AVX-NEXT: vmulps %ymm1, %ymm2, %ymm1
; AVX-NEXT: vaddps %ymm1, %ymm2, %ymm1
; AVX-NEXT: vmulps %ymm1, %ymm0, %ymm0		; AVX-NEXT: vmulps %ymm1, %ymm0, %ymm0
; AVX-NEXT: retq		; AVX-NEXT: retq
%vy = insertelement <8 x float> undef, float %y, i32 0		%vy = insertelement <8 x float> undef, float %y, i32 0
%splaty = shufflevector <8 x float> %vy, <8 x float> undef, <8 x i32> zeroinitializer		%splaty = shufflevector <8 x float> %vy, <8 x float> undef, <8 x i32> zeroinitializer
%r = fdiv fast <8 x float> %x, %splaty		%r = fdiv fast <8 x float> %x, %splaty
ret <8 x float> %r		ret <8 x float> %r
}		}

attributes #0 = { "reciprocal-estimates"="divf,vec-divf" }		attributes #0 = { "reciprocal-estimates"="divf,vec-divf" }