This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] scale repeated FP divisor by splat factor
ClosedPublic

Authored by spatel on Apr 23 2019, 10:14 AM.


Details

Reviewers

RKSimon
craig.topper
andreadb

Commits

rG6f41bf948b5f: [DAGCombiner] scale repeated FP divisor by splat factor
rL359147: [DAGCombiner] scale repeated FP divisor by splat factor

Summary

If we have a vector FP division with a splatted divisor, we can use the existing transform that converts 'x/y' into 'x * (1.0/y)' to allow more conversions. This can then potentially be converted into a scalar FP division by existing combines (rL358984) as seen in the tests here.

That can be a potentially big perf difference if scalar fdiv has better timing (including avoiding possible frequency throttling for vector ops).

There's another diff here in the ordering of the transforms - I'm proposing to move the repeated divisor transform ahead of the reciprocal estimate transform because that seems more likely to produce the best results. For default x86, we don't turn fdiv f32 into an estimate because the estimate accuracy is too poor for most code. That's probably the right perf choice for current and future CPUs since divss throughput is down to the 3-4 cycle range (Skylake/Ryzen).
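The splat case described in this summary can be sketched outside of LLVM. The following is a minimal illustration in plain C++, not the DAGCombiner code: `Vec4`, `divBySplat`, and `mulBySplatReciprocal` are hypothetical names, and a `std::array` stands in for a vector register.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a 4-lane FP vector register.
using Vec4 = std::array<double, 4>;

// Baseline: one division per lane, as a vector fdiv by a splat would do.
Vec4 divBySplat(const Vec4 &x, double y) {
  Vec4 r{};
  for (std::size_t i = 0; i < x.size(); ++i)
    r[i] = x[i] / y;
  return r;
}

// After the transform: a single scalar division forms the reciprocal, which
// is then broadcast and multiplied into every lane. This is only legal when
// reciprocal math is allowed (arcp/fast), because x * (1.0 / y) can round
// differently from x / y.
Vec4 mulBySplatReciprocal(const Vec4 &x, double y) {
  assert(y != 0.0 && "this sketch assumes a non-zero divisor");
  const double recip = 1.0 / y; // the only division
  Vec4 r{};
  for (std::size_t i = 0; i < x.size(); ++i)
    r[i] = x[i] * recip;
  return r;
}
```

For a divisor such as 2.0, whose reciprocal is exactly representable, both functions return identical lanes. For other divisors the results may differ in the last bit, which is why the fold is gated on the reciprocal fast-math flag.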

Diff Detail

Event Timeline

spatel created this revision. Apr 23 2019, 10:14 AM

Herald added a project: Restricted Project. Apr 23 2019, 10:14 AM

Herald added subscribers: hiraditya, mcrosier.

RKSimon added inline comments. Apr 24 2019, 7:52 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
11987	Does this move need to be a separate patch?

spatel marked an inline comment as done. Apr 24 2019, 8:36 AM

spatel added inline comments.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
11987	Yes, let me change that back. I noticed that potential diff while looking at the other part of this patch, so I thought it would be good to see the 2 changes together, but it should stand independently assuming it makes sense.

Patch updated:
Remove reordering of reciprocal estimate and repeated divisor transforms. We still see a diff for the SSE target with an illegal type because we don't try to generate the reciprocal estimate sequence until the types are legal.
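For readers unfamiliar with the "reciprocal estimate sequence": in the old SSE check lines for splat_fdiv_v8f32 in this diff, it appears as rcpps followed by mul/sub/mul/add, which corresponds to one Newton-Raphson refinement step. A minimal scalar sketch of that arithmetic follows; the function names are hypothetical, and the crude starting estimate in the usage note stands in for the hardware estimate.

```cpp
#include <cassert>
#include <cmath>

// One Newton-Raphson step for 1/d, starting from an estimate `est`:
//   est' = est + est * (1 - d * est)
// This mirrors the mulps/subps/mulps/addps chain in the old SSE output.
float refineReciprocal(float d, float est) {
  assert(d != 0.0f && "this sketch assumes a non-zero divisor");
  return est + est * (1.0f - d * est);
}

// Absolute error of an estimate against the true division (helper for
// illustration only).
float reciprocalError(float d, float est) {
  return std::fabs(1.0f / d - est);
}
```

Starting from 0.3f as an estimate of 1/3, one step moves the value to roughly 0.33, so the error shrinks by about a factor of ten. The real sequence starts from a much better hardware estimate but still does not match a true division bit for bit.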

RKSimon added inline comments. Apr 24 2019, 9:55 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
11909	We probably want to not use this when optsize is enabled - add a TODO?

Patch updated:
Add a TODO comment about optimizing for size. That's an existing concern even for scalar code. But there may not be a clear answer for optsize because, for example, we may be able to hoist fdiv out of a loop by doing this transform.

LGTM

This revision is now accepted and ready to land. Apr 24 2019, 12:02 PM

Closed by commit rL359147: [DAGCombiner] scale repeated FP divisor by splat factor (authored by spatel). Apr 24 2019, 3:28 PM

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in D61149: [DAGCombiner] try repeated fdiv divisor transform before building estimate. Apr 25 2019, 1:54 PM

spatel mentioned this in rL359398: [DAGCombiner] try repeated fdiv divisor transform before building estimate. Apr 28 2019, 5:21 AM

spatel mentioned this in rGfb9a5307a94e: [DAGCombiner] try repeated fdiv divisor transform before building estimate.

spatel mentioned this in rL359793: [DAGCombiner] try repeated fdiv divisor transform before building estimate (2nd…. May 2 2019, 8:04 AM

spatel mentioned this in rG19728261785d: [DAGCombiner] try repeated fdiv divisor transform before building estimate (2nd….

Revision Contents

Path	Size
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp	16 lines
llvm/test/CodeGen/X86/fdiv-combine-vec.ll	28 lines

Diff 196481

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp


... 11,895 lines not shown ...

 // Combine multiple FDIVs with the same divisor into multiple FMULs by the
 // reciprocal.
 // E.g., (a / D; b / D;) -> (recip = 1.0 / D; a * recip; b * recip)
 // Notice that this is not always beneficial. One reason is different targets
 // may have different costs for FDIV and FMUL, so sometimes the cost of two
 // FDIVs may be lower than the cost of one FDIV and two FMULs. Another reason
 // is the critical path is increased from "one FDIV" to "one FDIV + one FMUL".
 SDValue DAGCombiner::combineRepeatedFPDivisors(SDNode *N) {
+  // TODO: Limit this transform based on optsize/minsize - it always creates at
+  // least 1 extra instruction. But the perf win may be substantial enough
+  // that only minsize should restrict this.
   bool UnsafeMath = DAG.getTarget().Options.UnsafeFPMath;
   const SDNodeFlags Flags = N->getFlags();
   if (!UnsafeMath && !Flags.hasAllowReciprocal())
     return SDValue();

   // Skip if current node is a reciprocal.
   SDValue N0 = N->getOperand(0);
   ConstantFPSDNode *N0CFP = dyn_cast<ConstantFPSDNode>(N0);
   if (N0CFP && N0CFP->isExactlyValue(1.0))
     return SDValue();

   // Exit early if the target does not want this transform or if there can't
   // possibly be enough uses of the divisor to make the transform worthwhile.
   SDValue N1 = N->getOperand(1);
   unsigned MinUses = TLI.combineRepeatedFPDivisors();
-  if (!MinUses || N1->use_size() < MinUses)
+  // For splat vectors, scale the number of uses by the splat factor. If we can
+  // convert the division into a scalar op, that will likely be much faster.
+  unsigned NumElts = 1;
+  EVT VT = N->getValueType(0);
+  if (VT.isVector() && DAG.isSplatValue(N1))
+    NumElts = VT.getVectorNumElements();
+
+  if (!MinUses || (N1->use_size() * NumElts) < MinUses)
     return SDValue();

   // Find all FDIV users of the same divisor.
   // Use a set because duplicates may be present in the user list.
   SetVector<SDNode *> Users;
   for (auto *U : N1->uses()) {
     if (U->getOpcode() == ISD::FDIV && U->getOperand(1) == N1) {
       // This division is eligible for optimization only if global unsafe math
       // is enabled or if this division allows reciprocal formation.
       if (UnsafeMath || U->getFlags().hasAllowReciprocal())
         Users.insert(U);
     }
   }

   // Now that we have the actual number of divisor uses, make sure it meets
   // the minimum threshold specified by the target.
-  if (Users.size() < MinUses)
+  if ((Users.size() * NumElts) < MinUses)
     return SDValue();

-  EVT VT = N->getValueType(0);
   SDLoc DL(N);
   SDValue FPOne = DAG.getConstantFP(1.0, DL, VT);
   SDValue Reciprocal = DAG.getNode(ISD::FDIV, DL, VT, FPOne, N1, Flags);

   // Dividend / Divisor -> Dividend * Reciprocal
   for (auto *U : Users) {
     SDValue Dividend = U->getOperand(0);
     if (Dividend != FPOne) {
... 21 lines not shown (context: SDValue DAGCombiner::visitFDIV(SDNode *N) {) ...

   // fold vector ops
   if (VT.isVector())
     if (SDValue FoldedVOp = SimplifyVBinOp(N))
       return FoldedVOp;

   // fold (fdiv c1, c2) -> c1/c2
   if (N0CFP && N1CFP)
     return DAG.getNode(ISD::FDIV, SDLoc(N), VT, N0, N1, Flags);

   if (SDValue NewSel = foldBinOpIntoSelect(N))
     return NewSel;

   if (Options.UnsafeFPMath || Flags.hasAllowReciprocal()) {
     // fold (fdiv X, c2) -> fmul X, 1/c2 if losing precision is acceptable.
     if (N1CFP) {
       // Compute the reciprocal 1.0 / c2.

... 8,093 lines not shown ...
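The early-exit check that this patch changes can be modeled in isolation. The sketch below is a simplified, hypothetical restatement of the `MinUses` / `NumElts` logic in plain C++ (no SelectionDAG types); `enoughDivisorUses` is not an LLVM function, and the threshold of 2 used in the usage note is an assumption for illustration.

```cpp
#include <cassert>

// Hypothetical model of the threshold test in combineRepeatedFPDivisors:
//   numUses   - number of uses of the divisor (N1->use_size() in the patch)
//   minUses   - target threshold; 0 means the target opts out
//   isSplat   - whether the divisor is a splat of one scalar value
//   numElts   - vector element count (1 for scalars)
bool enoughDivisorUses(unsigned numUses, unsigned minUses, bool isSplat,
                       unsigned numElts) {
  assert(numElts >= 1 && "a value has at least one element");
  if (minUses == 0)
    return false; // target does not want the transform
  // Each lane of a splat divisor counts as a use of the same scalar, so the
  // use count is scaled by the element count.
  const unsigned scale = isSplat ? numElts : 1;
  return numUses * scale >= minUses;
}
```

With a threshold of 2, a single scalar fdiv does not qualify, but a single `<2 x double>` fdiv by a splat does, because its two lanes count as two uses of the same divisor. That matches the splat_fdiv_v2f64 test in this revision, where one vector fdiv becomes a scalar divide plus a vector multiply.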

llvm/test/CodeGen/X86/fdiv-combine-vec.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=x86_64-- -mattr=sse2 | FileCheck %s --check-prefix=SSE
 ; RUN: llc < %s -mtriple=x86_64-- -mattr=avx | FileCheck %s --check-prefix=AVX

 define <2 x double> @splat_fdiv_v2f64(<2 x double> %x, double %y) {
 ; SSE-LABEL: splat_fdiv_v2f64:
 ; SSE: # %bb.0:
-; SSE-NEXT: unpcklpd {{.*#+}} xmm1 = xmm1[0,0]
-; SSE-NEXT: divpd %xmm1, %xmm0
+; SSE-NEXT: movsd {{.*#+}} xmm2 = mem[0],zero
+; SSE-NEXT: divsd %xmm1, %xmm2
+; SSE-NEXT: unpcklpd {{.*#+}} xmm2 = xmm2[0,0]
+; SSE-NEXT: mulpd %xmm2, %xmm0
 ; SSE-NEXT: retq
 ;
 ; AVX-LABEL: splat_fdiv_v2f64:
 ; AVX: # %bb.0:
+; AVX-NEXT: vmovsd {{.*#+}} xmm2 = mem[0],zero
+; AVX-NEXT: vdivsd %xmm1, %xmm2, %xmm1
 ; AVX-NEXT: vmovddup {{.*#+}} xmm1 = xmm1[0,0]
-; AVX-NEXT: vdivpd %xmm1, %xmm0, %xmm0
+; AVX-NEXT: vmulpd %xmm1, %xmm0, %xmm0
 ; AVX-NEXT: retq
   %vy = insertelement <2 x double> undef, double %y, i32 0
   %splaty = shufflevector <2 x double> %vy, <2 x double> undef, <2 x i32> zeroinitializer
   %r = fdiv fast <2 x double> %x, %splaty
   ret <2 x double> %r
 }

 define <4 x double> @splat_fdiv_v4f64(<4 x double> %x, double %y) {
 ; SSE-LABEL: splat_fdiv_v4f64:
 ; SSE: # %bb.0:
 ; SSE-NEXT: movsd {{.*#+}} xmm3 = mem[0],zero
 ; SSE-NEXT: divsd %xmm2, %xmm3
 ; SSE-NEXT: unpcklpd {{.*#+}} xmm3 = xmm3[0,0]
 ; SSE-NEXT: mulpd %xmm3, %xmm0
 ; SSE-NEXT: mulpd %xmm3, %xmm1
 ; SSE-NEXT: retq
 ;
 ; AVX-LABEL: splat_fdiv_v4f64:
 ; AVX: # %bb.0:
+; AVX-NEXT: vmovsd {{.*#+}} xmm2 = mem[0],zero
+; AVX-NEXT: vdivsd %xmm1, %xmm2, %xmm1
 ; AVX-NEXT: vmovddup {{.*#+}} xmm1 = xmm1[0,0]
 ; AVX-NEXT: vinsertf128 $1, %xmm1, %ymm1, %ymm1
-; AVX-NEXT: vdivpd %ymm1, %ymm0, %ymm0
+; AVX-NEXT: vmulpd %ymm1, %ymm0, %ymm0
 ; AVX-NEXT: retq
   %vy = insertelement <4 x double> undef, double %y, i32 0
   %splaty = shufflevector <4 x double> %vy, <4 x double> undef, <4 x i32> zeroinitializer
   %r = fdiv arcp <4 x double> %x, %splaty
   ret <4 x double> %r
 }

 define <4 x float> @splat_fdiv_v4f32(<4 x float> %x, float %y) {
... 24 lines not shown ...
   %splaty = shufflevector <4 x float> %vy, <4 x float> undef, <4 x i32> zeroinitializer
   %r = fdiv arcp reassoc <4 x float> %x, %splaty
   ret <4 x float> %r
 }

 define <8 x float> @splat_fdiv_v8f32(<8 x float> %x, float %y) {
 ; SSE-LABEL: splat_fdiv_v8f32:
 ; SSE: # %bb.0:
-; SSE-NEXT: shufps {{.*#+}} xmm2 = xmm2[0,0,0,0]
-; SSE-NEXT: rcpps %xmm2, %xmm3
-; SSE-NEXT: mulps %xmm3, %xmm2
-; SSE-NEXT: movaps {{.*#+}} xmm4 = [1.0E+0,1.0E+0,1.0E+0,1.0E+0]
-; SSE-NEXT: subps %xmm2, %xmm4
-; SSE-NEXT: mulps %xmm3, %xmm4
-; SSE-NEXT: addps %xmm3, %xmm4
-; SSE-NEXT: mulps %xmm4, %xmm0
-; SSE-NEXT: mulps %xmm4, %xmm1
+; SSE-NEXT: movss {{.*#+}} xmm3 = mem[0],zero,zero,zero
+; SSE-NEXT: divss %xmm2, %xmm3
+; SSE-NEXT: shufps {{.*#+}} xmm3 = xmm3[0,0,0,0]
+; SSE-NEXT: mulps %xmm3, %xmm0
+; SSE-NEXT: mulps %xmm3, %xmm1
 ; SSE-NEXT: retq
 ;
 ; AVX-LABEL: splat_fdiv_v8f32:
 ; AVX: # %bb.0:
 ; AVX-NEXT: vpermilps {{.*#+}} xmm1 = xmm1[0,0,0,0]
 ; AVX-NEXT: vinsertf128 $1, %xmm1, %ymm1, %ymm1
 ; AVX-NEXT: vrcpps %ymm1, %ymm2
 ; AVX-NEXT: vmulps %ymm2, %ymm1, %ymm1
... 11 lines not shown ...