This is an archive of the discontinued LLVM Phabricator instance.

Combine fmul vector FP constants when unsafe math is allowed
ClosedPublic

Authored by spatel on Sep 8 2014, 6:09 PM.


Details

Reviewers

andreadb
arsenm
mcrosier

Commits

rG7bd228a82ec6: Combine fmul vector FP constants when unsafe math is allowed.
rL217599: Combine fmul vector FP constants when unsafe math is allowed.

Summary

This is an extension of the change made with r215820:
http://llvm.org/viewvc/llvm-project?view=revision&revision=215820

That patch allowed combining of splatted vector FP constants that are multiplied.

This patch allows combining non-uniform vector FP constants too by relaxing the check on the type of vector. I've also tried to canonicalize a vector fmul in the same way that we already do for scalars - if only one operand of the fmul is a constant, make it operand 1. Otherwise, we miss potential folds.
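
As a rough illustration of the fold described above, here is a minimal standalone C++ sketch. It is not LLVM code: the `Vec4` type and the helper names `mul`, `unfolded`, and `folded` are hypothetical and do not appear in the patch. It only models the arithmetic of rewriting (x * c1) * c2 as x * (c1 * c2) for non-splat constant vectors.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a <4 x float> value.
using Vec4 = std::array<float, 4>;

// Lane-wise multiply, modelling a vector fmul.
Vec4 mul(const Vec4& a, const Vec4& b) {
    Vec4 r{};
    for (std::size_t i = 0; i < r.size(); ++i)
        r[i] = a[i] * b[i];
    return r;
}

// Unfolded form: two dependent multiplies, (x * c1) * c2.
Vec4 unfolded(const Vec4& x, const Vec4& c1, const Vec4& c2) {
    return mul(mul(x, c1), c2);
}

// Folded form: c1 * c2 is computed once (the combiner would do this at
// compile time), leaving a single multiply that involves x.
Vec4 folded(const Vec4& x, const Vec4& c1, const Vec4& c2) {
    const Vec4 premultiplied = mul(c1, c2);
    return mul(x, premultiplied);
}
```

With c1 = <1, 2, 3, 4> and c2 = <5, 6, 7, 8>, `mul(c1, c2)` yields <5, 12, 21, 32>, which are the constant-pool values the new test cases check for. Floating-point multiplication is not associative in general, so the two forms can differ in rounding for other inputs; that is why the combine is limited to unsafe math.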

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 13432. Sep 8 2014, 6:09 PM

spatel retitled this revision from to Combine fmul vector FP constants when unsafe math is allowed.

spatel updated this object.

spatel edited the test plan for this revision.

spatel added a reviewer: arsenm.

spatel added a subscriber: Unknown Object (MLST).

Hi Sanjay,

Thanks for the patch.
I have a couple of comments (see below):

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6812–6814 (On Diff #13432): This is OK. However, it might be better to check that N0 is a build vector of all constants before commuting the operands of the FMUL. That avoids triggering the canonicalization for nodes where N0 is a build vector but not all of its elements are constants or undef.

6835–6846 (On Diff #13432): I think we probably don't need the 'hasOneUse()' constraint here. In the worst case, we still have two fmul instructions, because the new (fmul c1, c2) would always be constant folded. Also, if you remove the 'hasOneUse()' constraint, every time the DAG combiner triggers your new rule, the number of uses of the 'fmul x, c1' decreases by one. This would have a positive effect on the following code example:

    define <4 x float> @foo(<4 x float> %A, <4 x float> %B) {
      %1 = fmul <4 x float> %A, <float 3.0, float 4.0, float 5.0, float 6.0>
      %2 = fmul <4 x float> %1, <float 1.0, float 2.0, float 3.0, float 4.0>
      %3 = fmul <4 x float> %1, <float 2.0, float 3.0, float 4.0, float 5.0>
      %4 = fadd <4 x float> %2, %3
      ret <4 x float> %4
    }

Without the 'hasOneUse()' check, we only get two (v)mulps. With the 'hasOneUse()' check, we instead get three mulps.
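
The multiply counts in the comment above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not the DAGCombiner implementation: `MulNode` and `countLiveMuls` are hypothetical names, each node multiplies a single operand by a scalar constant (one lane of the vector example), and an operand index of -1 stands for the function argument.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Toy model, not LLVM code: node i computes nodes[i].operand * nodes[i].constant.
// An operand of -1 stands for the function argument.
struct MulNode {
    int operand;
    float constant;
};

// Applies (fmul (fmul x, c1), c2) -> (fmul x, c1*c2) to every root, then
// returns how many multiply nodes are still reachable from the roots.
int countLiveMuls(std::vector<MulNode> nodes, const std::vector<int>& roots,
                  bool requireOneUse) {
    // Count how many nodes use each multiply as their operand.
    std::vector<int> uses(nodes.size(), 0);
    for (const MulNode& n : nodes)
        if (n.operand >= 0)
            ++uses[n.operand];

    for (int r : roots) {
        int inner = nodes[r].operand;
        if (inner < 0)
            continue;  // operand is the argument, nothing to fold
        if (requireOneUse && uses[inner] != 1)
            continue;  // models the hasOneUse() restriction
        // Fold the two constants and bypass the inner multiply.
        nodes[r] = {nodes[inner].operand,
                    nodes[inner].constant * nodes[r].constant};
        --uses[inner];
    }

    // Walk operand chains from the roots to find the multiplies still needed.
    std::set<int> live;
    for (int r : roots)
        for (int i = r; i >= 0 && live.insert(i).second; i = nodes[i].operand) {
        }
    return static_cast<int>(live.size());
}
```

For one inner multiply feeding two outer multiplies, as in the @foo example, the model leaves three live multiplies when the one-use restriction is kept and two when it is dropped, because the shared inner multiply becomes dead once both users have been rewritten.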

andreadb added a reviewer: andreadb. Sep 9 2014, 10:13 AM

Thanks very much for the prompt review, Andrea!

I agree with both of your comments. Please see the updated patch which:

Confirms that a build vector is a constant before canonicalizing.
Removes the hasOneUse() restriction on the fold.
Adds a minimal test case for the multiple use scenario; it has the same number of fmuls, but we can confirm that the optimization occurs by checking the constant pool values.
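
The first item in the list above (only commute when the left operand is a build vector whose elements are all constants or undef, and the right operand is not a build vector) can be sketched as a standalone toy model. The `Lane`, `Operand`, `isConstantBuildVector`, and `canonicalizeConstantToRHS` names are hypothetical and are not part of LLVM; the real patch uses `BuildVectorSDNode::isConstant()` on SelectionDAG nodes.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy model, not LLVM code: what each element of a build vector holds.
enum class Lane { Constant, Undef, Variable };

struct Operand {
    bool isBuildVector;       // false models any other kind of node
    std::vector<Lane> lanes;  // only meaningful for build vectors
};

// True only if every element is a constant or undef.
bool isConstantBuildVector(const Operand& op) {
    if (!op.isBuildVector)
        return false;
    for (Lane lane : op.lanes)
        if (lane == Lane::Variable)
            return false;
    return true;
}

// Moves a constant build vector from the left to the right operand.
// Returns true if the operands were swapped.
bool canonicalizeConstantToRHS(Operand& lhs, Operand& rhs) {
    if (isConstantBuildVector(lhs) && !rhs.isBuildVector) {
        std::swap(lhs, rhs);
        return true;
    }
    return false;
}
```

A build vector with even one variable element is left in place, so the canonicalization cannot fire on nodes that later constant folds would be unable to use.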

spatel added a reviewer: mcrosier. Sep 9 2014, 3:12 PM

LGTM. Thanks!

This revision is now accepted and ready to land. Sep 9 2014, 3:59 PM

I'm now wondering if this patch is necessary. I'd rather not bloat up DAGCombiner any more if there's no reason to.

The constant reassociations in all 3 of the test cases that I created here are already done by -instcombine if 'fast' is specified on each fmul inst. And in http://reviews.llvm.org/D5222, I've said that 'fast' will be specified for any function produced by clang with -ffast-math.

Matt, what was the motivation for r215820? Is there some scenario you were seeing where we can't do these folds in -instcombine?

Closed by commit rL217599 (authored by @spatel).

Thanks all - checked in at r217599.

Revision Contents

Path                                                  Size
llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp   28 lines
llvm/trunk/test/CodeGen/X86/fmul-combines.ll          48 lines

Diff 13590

llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

(6,814 earlier lines not shown; the hunk below is inside SDValue DAGCombiner::visitFMUL(SDNode *N))

   SDValue N1 = N->getOperand(1);
   ConstantFPSDNode *N0CFP = isConstOrConstSplatFP(N0);
   ConstantFPSDNode *N1CFP = isConstOrConstSplatFP(N1);
   EVT VT = N->getValueType(0);
   const TargetOptions &Options = DAG.getTarget().Options;

   // fold vector ops
   if (VT.isVector()) {
+    // This just handles C1 * C2 for vectors. Other vector folds are below.
     SDValue FoldedVOp = SimplifyVBinOp(N);
-    if (FoldedVOp.getNode()) return FoldedVOp;
+    if (FoldedVOp.getNode())
+      return FoldedVOp;
+    // Canonicalize vector constant to RHS.
+    if (N0.getOpcode() == ISD::BUILD_VECTOR &&
+        N1.getOpcode() != ISD::BUILD_VECTOR)
+      if (auto *BV0 = dyn_cast<BuildVectorSDNode>(N0))
+        if (BV0->isConstant())
+          return DAG.getNode(N->getOpcode(), SDLoc(N), VT, N1, N0);
   }

   // fold (fmul c1, c2) -> c1*c2
   if (N0CFP && N1CFP)
     return DAG.getNode(ISD::FMUL, SDLoc(N), VT, N0, N1);

   // canonicalize constant to RHS
   if (N0CFP && !N1CFP)
     return DAG.getNode(ISD::FMUL, SDLoc(N), VT, N1, N0);

   // fold (fmul A, 1.0) -> A
   if (N1CFP && N1CFP->isExactlyValue(1.0))
     return N0;

   if (Options.UnsafeFPMath) {
     // fold (fmul A, 0) -> 0
     if (N1CFP && N1CFP->getValueAPF().isZero())
       return N1;

     // fold (fmul (fmul x, c1), c2) -> (fmul x, (fmul c1, c2))
-    if (N1CFP && N0.getOpcode() == ISD::FMUL &&
-        N0.getNode()->hasOneUse() && isConstOrConstSplatFP(N0.getOperand(1))) {
-      SDLoc SL(N);
-      SDValue MulConsts = DAG.getNode(ISD::FMUL, SL, VT, N0.getOperand(1), N1);
-      return DAG.getNode(ISD::FMUL, SL, VT, N0.getOperand(0), MulConsts);
-    }
+    if (N0.getOpcode() == ISD::FMUL) {
+      // Fold scalars or any vector constants (not just splats).
+      // This fold is done in general by InstCombine, but extra fmul insts
+      // may have been generated during lowering.
+      SDValue N01 = N0.getOperand(1);
+      auto *BV1 = dyn_cast<BuildVectorSDNode>(N1);
+      auto *BV01 = dyn_cast<BuildVectorSDNode>(N01);
+      if ((N1CFP && isConstOrConstSplatFP(N01)) ||
+          (BV1 && BV01 && BV1->isConstant() && BV01->isConstant())) {
+        SDLoc SL(N);
+        SDValue MulConsts = DAG.getNode(ISD::FMUL, SL, VT, N01, N1);
+        return DAG.getNode(ISD::FMUL, SL, VT, N0.getOperand(0), MulConsts);
+      }
+    }

     // fold (fmul (fadd x, x), c) -> (fmul x, (fmul 2.0, c))
     // Undo the fmul 2.0, x -> fadd x, x transformation, since if it occurs
     // during an early run of DAGCombiner can prevent folding with fmuls
     // inserted during lowering.
     if (N0.getOpcode() == ISD::FADD && N0.getOperand(0) == N0.getOperand(1)) {
       SDLoc SL(N);
       const SDValue Two = DAG.getConstantFP(2.0, VT);

(5,131 later lines not shown)

llvm/trunk/test/CodeGen/X86/fmul-combines.ll

(49 earlier lines not shown)

 ; CHECK-NOT: mulps
 ; CHECK-NEXT: ret
 define <4 x float> @fmul_c3_c4_v4f32(<4 x float> %x) #0 {
   %y = fmul <4 x float> %x, <float 3.0, float 3.0, float 3.0, float 3.0>
   %z = fmul <4 x float> %y, <float 4.0, float 4.0, float 4.0, float 4.0>
   ret <4 x float> %z
 }

+; We should be able to pre-multiply the two constant vectors.
+; CHECK: ## float 5.000000e+00
+; CHECK: ## float 1.200000e+01
+; CHECK: ## float 2.100000e+01
+; CHECK: ## float 3.200000e+01
+; CHECK-LABEL: fmul_v4f32_two_consts_no_splat:
+; CHECK: mulps
+; CHECK-NOT: mulps
+; CHECK-NEXT: ret
+define <4 x float> @fmul_v4f32_two_consts_no_splat(<4 x float> %x) #0 {
+  %y = fmul <4 x float> %x, <float 1.0, float 2.0, float 3.0, float 4.0>
+  %z = fmul <4 x float> %y, <float 5.0, float 6.0, float 7.0, float 8.0>
+  ret <4 x float> %z
+}
+
+; Same as above, but reverse operands to make sure non-canonical form is also handled.
+; CHECK: ## float 5.000000e+00
+; CHECK: ## float 1.200000e+01
+; CHECK: ## float 2.100000e+01
+; CHECK: ## float 3.200000e+01
+; CHECK-LABEL: fmul_v4f32_two_consts_no_splat_non_canonical:
+; CHECK: mulps
+; CHECK-NOT: mulps
+; CHECK-NEXT: ret
+define <4 x float> @fmul_v4f32_two_consts_no_splat_non_canonical(<4 x float> %x) #0 {
+  %y = fmul <4 x float> <float 1.0, float 2.0, float 3.0, float 4.0>, %x
+  %z = fmul <4 x float> <float 5.0, float 6.0, float 7.0, float 8.0>, %y
+  ret <4 x float> %z
+}
+
+; More than one use of a constant multiply should not inhibit the optimization.
+; Instead of a chain of 2 dependent mults, this test will have 2 independent mults.
+; CHECK: ## float 5.000000e+00
+; CHECK: ## float 1.200000e+01
+; CHECK: ## float 2.100000e+01
+; CHECK: ## float 3.200000e+01
+; CHECK-LABEL: fmul_v4f32_two_consts_no_splat_multiple_use:
+; CHECK: mulps
+; CHECK: mulps
+; CHECK: addps
+; CHECK: ret
+define <4 x float> @fmul_v4f32_two_consts_no_splat_multiple_use(<4 x float> %x) #0 {
+  %y = fmul <4 x float> %x, <float 1.0, float 2.0, float 3.0, float 4.0>
+  %z = fmul <4 x float> %y, <float 5.0, float 6.0, float 7.0, float 8.0>
+  %a = fadd <4 x float> %y, %z
+  ret <4 x float> %a
+}
+
 ; CHECK-LABEL: fmul_c2_c4_f32:
 ; CHECK-NOT: addss
 ; CHECK: mulss
 ; CHECK-NOT: mulss
 ; CHECK-NEXT: ret
 define float @fmul_c2_c4_f32(float %x) #0 {
   %y = fmul float %x, 2.0
   %z = fmul float %y, 4.0

(34 later lines not shown)