This is an archive of the discontinued LLVM Phabricator instance.

expose ILP for associative operations in the DAG
AbandonedPublic

Authored by spatel on May 14 2015, 12:18 PM.


Details

Reviewers

qcolombet
jroelofs
jmolloy
echristo
mehdi_amini
resistor

Summary

In the aftermath of the reversion of http://reviews.llvm.org/rL236031 (D9232), nobody said this idea was outright crazy...so I'm trying again. :)

This time, I've added a bunch of safety measures:

Target-dependent and opt-in per target; currently only x86
A map to count and limit the number of times we try this transform
Only attempt after type legalization
x86 FADD limit artificially low until we're sure there's no fallout

I ended up with target hooks (shouldReassociate / didReassociate) similar to what Jon suggested in a reply mail to r236031.

There are a bunch of TODO items to make this better, but I'm purposely starting as small as possible.

I initially thought that the tracking map should live in X86TargetLowering with the hooks, but that doesn't seem in tune with the rest of the class (it's all const AFAICT), and the target outlives the Combiner or DAG, so the map would need to be reset before each combine if it lived there?
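
One way to picture the arrangement described above: because the hooks receive the map by reference, they can stay const while the combiner owns the counts, so every new combiner starts from zero without an explicit reset. The following is a minimal standalone sketch under that reading, not the patch itself. `TargetHooks`, `LimitedTarget`, `Combiner`, and `tryReassociate` are hypothetical names, and the key uses a plain unsigned where the patch uses MVT.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Hypothetical stand-ins for the patch's typedefs: (opcode, value type).
using AssocType = std::pair<unsigned, unsigned>;
using AssocMap = std::map<AssocType, unsigned>;

// Const hooks: all mutable state lives in the caller-owned map.
struct TargetHooks {
  virtual ~TargetHooks() = default;
  virtual bool shouldReassociate(AssocType, AssocMap &) const { return false; }
  virtual void didReassociate(AssocType, AssocMap &) const {}
};

// A target that opts in, with an assumed limit of one transform per key.
struct LimitedTarget : TargetHooks {
  unsigned Limit = 1;
  bool shouldReassociate(AssocType T, AssocMap &M) const override {
    auto It = M.find(T);
    return It == M.end() || It->second < Limit;
  }
  void didReassociate(AssocType T, AssocMap &M) const override {
    assert(M[T] < Limit && "called without a successful shouldReassociate");
    ++M[T];
  }
};

// The combiner owns the map, so a fresh combiner implies fresh counts.
struct Combiner {
  const TargetHooks &TLI;
  AssocMap ReassociateMap;
  explicit Combiner(const TargetHooks &T) : TLI(T) {}
  bool tryReassociate(AssocType T) {
    if (!TLI.shouldReassociate(T, ReassociateMap))
      return false;
    TLI.didReassociate(T, ReassociateMap);
    return true;
  }
};
```

With this shape, a second `Combiner` built on the same target object gets its own budget, which is the behavior a map stored inside the long-lived target could only provide by being reset before each combine.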

Diff Detail

Event Timeline

spatel updated this revision to Diff 25794. May 14 2015, 12:18 PM

spatel retitled this revision to expose ILP for associative operations in the DAG.

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: mehdi_amini, resistor, jroelofs, echristo, qcolombet.

spatel added a subscriber: Unknown Object (MLST).

spatel mentioned this in D9363: Remove non-necessary canonicalization from the DAG. May 14 2015, 12:32 PM

Could you help me try to understand what purpose this serves in terms of complementing the already-existing reassociate pass? That is, what benefit comes from doing it in the DAG in addition to the existing reassociate IR pass?

In D9780#173061, @escha wrote:

Could you help me try to understand what purpose this serves in terms of complementing the already-existing reassociate pass? That is, what benefit comes from doing it in the DAG in addition to the existing reassociate IR pass?

Sure - this came up in the earlier review thread:

http://reviews.llvm.org/D8941 provides a simple motivating example for doing this in the DAG: the multiplies that we want to reassociate don't exist until the DAGCombine that transforms the divisions has run.
More important: this is not the kind of canonicalizing transform that is suitable for IR. It increases register pressure, and some targets (like some GPUs, as I learned from the previous reversion) simply won't benefit from this optimization because there is no ILP exposed to the programmer.
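
To make the trade-off concrete, here is a small toy model of the rewrite, assuming a bare expression tree rather than real SDNodes. It mirrors the pattern in the patch, (op (op (op z, w), y), x) -> (op (op z, w), (op x, y)), but ignores use counts and the target hooks. `Node`, `leaf`, `add`, `depth`, and `reassociate` are hypothetical names used only for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical toy expression node; not an LLVM SDNode.
struct Node;
using NodePtr = std::shared_ptr<Node>;
struct Node {
  bool IsAdd; // false means a leaf value
  NodePtr L, R;
};

NodePtr leaf() { return std::make_shared<Node>(Node{false, nullptr, nullptr}); }

NodePtr add(NodePtr A, NodePtr B) {
  assert(A && B && "add needs two operands");
  return std::make_shared<Node>(Node{true, std::move(A), std::move(B)});
}

// Length of the longest dependency chain, counted in add operations.
unsigned depth(const NodePtr &N) {
  if (!N->IsAdd)
    return 0;
  return 1 + std::max(depth(N->L), depth(N->R));
}

// (add (add (add z, w), y), x) -> (add (add z, w), (add x, y)).
// Returns the input unchanged when the pattern does not match.
NodePtr reassociate(const NodePtr &N) {
  if (!N->IsAdd)
    return N;
  NodePtr N0 = N->L, N1 = N->R;
  if (N1->IsAdd && !N0->IsAdd)
    std::swap(N0, N1); // commute so the chain is on the left
  if (!N0->IsAdd || N1->IsAdd)
    return N;
  NodePtr N00 = N0->L, N01 = N0->R;
  if (N01->IsAdd && !N00->IsAdd)
    std::swap(N00, N01); // same commutation one level down
  if (!N00->IsAdd)
    return N;
  return add(N00, add(N1, N01));
}
```

For a chain of three adds, the longest dependency path drops from three operations to two, because the new inner add no longer waits on (z + w). The cost is that both partial sums must be live at the same time, which is where the extra register pressure comes from.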

On Thu, May 14, 2015 at 2:24 PM, James Molloy <james@jamesmolloy.co.uk> wrote:

It sounds like the best time to do this if at all possible is in MachineInst form,
where we have MachineTraceMetrics and accurate register pressure information.

Is there a reason why you're not doing it there? I know it's more awkward, but
it really seems we need accurate register pressure if we're not going to make it
cripplingly conservative.

I only have marginal reasons for trying this here rather than with MachineInsts: I really wanted to make this easy for any target to opt-in, nobody pushed me away from a DAG transform, and I've never attempted a machine-level transform. If the consensus is that it's better done later when we have more register knowledge, I'll certainly give it a try.

I found my way to the "MachineCombiner" pass, and it seems like this optimization would fit right in there.

But I just thought of something scary: to do this reassociation on FP ops, we would have to extend fast-math-flags to MachineInstrs.
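
The reason FP reassociation has to be gated on fast-math permission is that floating-point addition is not associative, so regrouping can change the result. A minimal sketch, assuming IEEE 754 single precision with round-to-nearest and no fast-math compiler flags; `serialSum`, `pairedSum`, and `example` are hypothetical names, and the inputs are chosen only to make the rounding visible:

```cpp
#include <cassert>

// Three dependent adds: ((a + b) + c) + d.
// volatile forces each intermediate to be rounded to float.
float serialSum(float a, float b, float c, float d) {
  volatile float t0 = a + b;
  volatile float t1 = t0 + c;
  return t1 + d;
}

// Reassociated grouping from the patch's pattern: (a + b) + (d + c).
float pairedSum(float a, float b, float c, float d) {
  volatile float lhs = a + b;
  volatile float rhs = d + c;
  return lhs + rhs;
}

void example() {
  // Near 1e8 adjacent floats are 8 apart, so adding 1.0f is lost to rounding.
  // Serial: (1e8 + 1) rounds to 1e8, minus 1e8 gives 0, plus 1 gives 1.
  assert(serialSum(1e8f, 1.0f, -1e8f, 1.0f) == 1.0f);
  // Paired: (1 - 1e8) rounds to -1e8, so the total is 1e8 - 1e8 = 0.
  assert(pairedSum(1e8f, 1.0f, -1e8f, 1.0f) == 0.0f);
}
```

Both functions compute the same mathematical sum, yet they disagree on these inputs, so a later pass can only regroup such operations if it can still see that the source allowed it.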

Note: I'm still working on getting FMF into the DAG (D8900), but it could take a while to work through the bugs...

Unless anyone sees a reason to keep this patch alive, I'll abandon this DAG combine.

spatel mentioned this in D10321: [x86] add a reassociation optimization to increase ILP via the MachineCombiner pass. Jun 8 2015, 2:49 PM

spatel mentioned this in rL239486: [x86] Add a reassociation optimization to increase ILP via the MachineCombiner…. Jun 10 2015, 1:36 PM

Abandoning. Re-implemented using the MachineCombiner pass (D10321).

Revision Contents

Path                                        Size
include/llvm/Target/TargetLowering.h        25 lines
lib/CodeGen/SelectionDAG/DAGCombiner.cpp    61 lines
lib/Target/X86/X86ISelLowering.h            4 lines
lib/Target/X86/X86ISelLowering.cpp          36 lines
test/CodeGen/X86/fp-fast.ll                 67 lines

Diff 25794

include/llvm/Target/TargetLowering.h

...
 public:
   /// Similar to isShuffleMaskLegal. This is used by Targets can use this to
   /// indicate if there is a suitable VECTOR_SHUFFLE that can be used to replace
   /// a VAND with a constant pool entry.
   virtual bool isVectorClearMaskLegal(const SmallVectorImpl<int> &/*Mask*/,
                                       EVT /*VT*/) const {
     return false;
   }

+  /// This pair is an opcode and value type.
+  typedef std::pair<unsigned, MVT> AssocType;
+
+  /// This map type may be used by a target to count reassociations.
+  typedef std::map<AssocType, unsigned> AssocMap;
+
+  /// Return true if a reassociation optimization that exposes more
+  /// instruction-level-parallelism should be attempted for the specified
+  /// opcode and type. Targets may override this function based on the
+  /// amount of software-visible parallelism that is possible balanced against
+  /// the number of registers that are needed to support that number of
+  /// independent instructions.
+  /// \param Type The opcode and MVT of the instruction to reassociate.
+  /// \param Map A map for keeping track of reassociated operations.
+  virtual bool shouldReassociate(AssocType Type, AssocMap &Map) const {
+    return false;
+  }
+
+  /// This hook is called if a reassociation of the specified type was
+  /// completed. Targets may override this function to keep track of
+  /// the number of times a reassociation of this type has occurred.
+  /// \param Type The opcode and MVT of the instruction that was reassociated.
+  /// \param Map A map for keeping track of reassociated operations.
+  virtual void didReassociate(AssocType Type, AssocMap &Map) const {}

   /// Return how this operation should be treated: either it is legal, needs to
   /// be promoted to a larger size, needs to be expanded to some other code
   /// sequence, or the target has a custom expander for it.
   LegalizeAction getOperationAction(unsigned Op, EVT VT) const {
     if (VT.isExtended()) return Expand;
     // If a target-specific SDNode requires legalization, require the target
     // to provide custom legalization for it.
     if (Op > array_lengthof(OpActions[0])) return Custom;
...

lib/CodeGen/SelectionDAG/DAGCombiner.cpp


...
   private:

     /// \brief Try to transform a truncation where C is a constant:
     ///     (trunc (and X, C)) -> (and (trunc X), (trunc C))
     ///
     /// \p N needs to be a truncation and its first operand an AND. Other
     /// requirements are checked by the function (e.g. that trunc is
     /// single-use) and if missed an empty SDValue is returned.
     SDValue distributeTruncateThroughAnd(SDNode *N);

+    /// Given a dependent sequence of associative math/logic operations,
+    /// reassociate the operands to increase the instruction-level-parallelism.
+    SDValue reassociateForILP(SDNode *);
+    TargetLoweringBase::AssocMap ReassociateMap;
+
   public:
     DAGCombiner(SelectionDAG &D, AliasAnalysis &A, CodeGenOpt::Level OL)
         : DAG(D), TLI(D.getTargetLoweringInfo()), Level(BeforeLegalizeTypes),
           OptLevel(OL), LegalOperations(false), LegalTypes(false), AA(A) {
       auto *F = DAG.getMachineFunction().getFunction();
       ForCodeSize = F->hasFnAttribute(Attribute::OptimizeForSize) ||
                     F->hasFnAttribute(Attribute::MinSize);
     }
...
   if (UnsafeFPMath && LookThroughFPExt) {
...
         }
       }
     }
   }

   return SDValue();
 }

+SDValue DAGCombiner::reassociateForILP(SDNode *N) {
+  assert(N->getNumOperands() == 2 && "Invalid node for binop reassociation");
+  SDValue N0 = N->getOperand(0);
+  SDValue N1 = N->getOperand(1);
+  EVT VT = N->getValueType(0);
+  SDLoc DL(N);
+  unsigned Op = N->getOpcode();
+
+  // Only try this after type legalization. Reassociating earlier could have
+  // unknown consequences on register usage. Also, a new combiner is created
+  // post-type-legalization, so the map that is tracking this transform does
+  // not live long enough to support earlier passes.
+  if (!LegalTypes)
+    return SDValue();
+
+  TargetLoweringBase::AssocType Type = std::make_pair(Op, VT.getSimpleVT());
+
+  // Tell the target how many times we have already completed a reassociation
+  // of this type and ask the target if it's ok to try another.
+  if (!TLI.shouldReassociate(Type, ReassociateMap))
+    return SDValue();
+
+  // Swap chains of this operation to LHS to avoid duplicating predicate logic
+  // for each commutation of the pattern.
+  if (SelectionDAG::isCommutativeBinOp(Op) && N1.getOpcode() == Op &&
+      N1.hasOneUse() && N0.getOpcode() != Op)
+    std::swap(N1, N0);
+
+  // Convert a chain of 3 dependent operations into 2 independent operations
+  // and 1 dependent operation.
+  if (N0.getOpcode() == Op && N0.hasOneUse() && N1.getOpcode() != Op) {
+    SDValue N00 = N0.getOperand(0);
+    SDValue N01 = N0.getOperand(1);
+
+    // Swap second set of operands if needed.
+    if (SelectionDAG::isCommutativeBinOp(Op) && N01.getOpcode() == Op &&
+        N01.hasOneUse() && N00.getOpcode() != Op)
+      std::swap(N01, N00);
+
+    // (op N0: (op N00: (op z, w), N01: y), N1: x) ->
+    // (op N00: (op z, w), (op N1: x, N01: y))
+    // TODO: The innermost op - (op z, w) - does not need to match the others.
+    if (N00.getOpcode() == Op) {
+      // Tell the target that a reassociation of this type was completed.
+      TLI.didReassociate(Type, ReassociateMap);
+      SDValue NewOp = DAG.getNode(Op, DL, VT, N1, N01);
+      return DAG.getNode(Op, DL, VT, N00, NewOp);
+    }
+  }
+  return SDValue();
+}
+
 SDValue DAGCombiner::visitFADD(SDNode *N) {
   SDValue N0 = N->getOperand(0);
   SDValue N1 = N->getOperand(1);
   ConstantFPSDNode *N0CFP = dyn_cast<ConstantFPSDNode>(N0);
   ConstantFPSDNode *N1CFP = dyn_cast<ConstantFPSDNode>(N1);
   EVT VT = N->getValueType(0);
   SDLoc DL(N);
   const TargetOptions &Options = DAG.getTarget().Options;
...
     if (TLI.isOperationLegalOrCustom(ISD::FMUL, VT) && !N0CFP && !N1CFP) {
...
           N0.getOpcode() == ISD::FADD && N1.getOpcode() == ISD::FADD &&
           N0.getOperand(0) == N0.getOperand(1) &&
           N1.getOperand(0) == N1.getOperand(1) &&
           N0.getOperand(0) == N1.getOperand(0)) {
         return DAG.getNode(ISD::FMUL, DL, VT,
                            N0.getOperand(0), DAG.getConstantFP(4.0, DL, VT));
       }
     }

+    if (SDValue Reassociated = reassociateForILP(N))
+      return Reassociated;
+
   } // enable-unsafe-fp-math

   // FADD -> FMA combines:
   SDValue Fused = visitFADDForFMACombine(N);
   if (Fused) {
     AddToWorklist(Fused.getNode());
     return Fused;
   }
...

lib/Target/X86/X86ISelLowering.h

...
     SDValue getRsqrtEstimate(SDValue Operand, DAGCombinerInfo &DCI,
                              bool &UseOneConstNR) const override;

     /// Use rcp* to speed up fdiv calculations.
     SDValue getRecipEstimate(SDValue Operand, DAGCombinerInfo &DCI,
                              unsigned &RefinementSteps) const override;

     /// Reassociate floating point divisions into multiply by reciprocal.
     bool combineRepeatedFPDivisors(unsigned NumUsers) const override;
+
+    /// Increase instruction-level-parallelism for associative operations.
+    bool shouldReassociate(AssocType Type, AssocMap &Map) const override;
+    void didReassociate(AssocType Type, AssocMap &Map) const override;
   };

   namespace X86 {
     FastISel *createFastISel(FunctionLoweringInfo &funcInfo,
                              const TargetLibraryInfo *libInfo);
   }
 }

 #endif // X86ISELLOWERING_H

lib/Target/X86/X86ISelLowering.cpp


...
 /// CPU if a division's cost is not at least twice the cost of a multiplication.
 /// This is because we still need one division to calculate the reciprocal and
 /// then we need two multiplies by that reciprocal as replacements for the
 /// original divisions.
 bool X86TargetLowering::combineRepeatedFPDivisors(unsigned NumUsers) const {
   return NumUsers > 1;
 }

+bool X86TargetLowering::shouldReassociate(AssocType Type, AssocMap &Map) const {
+  // These are the opcodes that we allow to be reassociated and the number
+  // of times each opcode may be reassociated. The limit is a conservative
+  // heuristic based on the number of architected general-purpose or FP/vector
+  // registers. We must avoid spilling for the optimization to have a benefit.
+  static const std::map<unsigned, unsigned> ReassociationLimits = {
+    { ISD::FADD, 1 }
+  };
+
+  // TODO: Add the rest of the associative opcodes.
+  // TODO: Programmatically base the limit on number of registers.
+  // TODO: Limits should be based on MVT or register class. For example,
+  // a vector integer add does not use the same registers as a scalar
+  // integer add.
+
+  // If the input opcode type is not a reassociation candidate, we're done.
+  const auto &LimitIter = ReassociationLimits.find(Type.first);
+  if (LimitIter == ReassociationLimits.end())
+    return false;
+
+  // If we hit the reassociation limit for this type of instruction, we're done.
+  unsigned ReassociationMax = LimitIter->second;
+  if (Map[Type] >= ReassociationMax)
+    return false;
+
+  return true;
+}
+
+void X86TargetLowering::didReassociate(AssocType Type, AssocMap &Map) const {
+  // TODO: The map should be updated based on register type rather than
+  // instruction type. For example, any floating point reassociation
+  // should affect all other floating point operations because they all
+  // require registers from the same register class.
+  Map[Type]++;
+}
+
 static bool isAllOnes(SDValue V) {
   ConstantSDNode *C = dyn_cast<ConstantSDNode>(V);
   return C && C->isAllOnesValue();
 }

 /// LowerToBT - Result of 'and' is compared against zero. Turn it into a BT node
 /// if it's possible.
 SDValue X86TargetLowering::LowerToBT(SDValue And, ISD::CondCode CC,
...

test/CodeGen/X86/fp-fast.ll

...
 ; CHECK: # BB#0:
 ; CHECK-NEXT: vxorps %xmm0, %xmm0, %xmm0
 ; CHECK-NEXT: retq
   %t1 = fsub float -0.0, %a
   %t2 = fadd float %a, %t1
   ret float %t2
 }
+
+; Verify that the first two adds are independent; the destination registers
+; are used as source registers for the third add.
+
+define float @reassociate_adds1(float %x0, float %x1, float %x2, float %x3) {
+; CHECK-LABEL: reassociate_adds1:
+; CHECK: # BB#0:
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: vaddss %xmm2, %xmm3, %xmm1
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: retq
+  %t0 = fadd float %x0, %x1
+  %t1 = fadd float %t0, %x2
+  %t2 = fadd float %t1, %x3
+  ret float %t2
+}
+
+define float @reassociate_adds2(float %x0, float %x1, float %x2, float %x3) {
+; CHECK-LABEL: reassociate_adds2:
+; CHECK: # BB#0:
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: vaddss %xmm2, %xmm3, %xmm1
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: retq
+  %t0 = fadd float %x0, %x1
+  %t1 = fadd float %x2, %t0
+  %t2 = fadd float %t1, %x3
+  ret float %t2
+}
+
+define float @reassociate_adds3(float %x0, float %x1, float %x2, float %x3) {
+; CHECK-LABEL: reassociate_adds3:
+; CHECK: # BB#0:
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: vaddss %xmm2, %xmm3, %xmm1
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: retq
+  %t0 = fadd float %x0, %x1
+  %t1 = fadd float %t0, %x2
+  %t2 = fadd float %x3, %t1
+  ret float %t2
+}
+
+; FIXME: We should be able to do more reassociations for this test case, but the limit is set
+; conservatively low to avoid spilling disasters.
+
+define float @reassociate_adds4(float %x0, float %x1, float %x2, float %x3, float %x4, float %x5, float %x6, float %x7) {
+; CHECK-LABEL: reassociate_adds4:
+; CHECK: # BB#0:
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: vaddss %xmm2, %xmm0, %xmm0
+; CHECK-NEXT: vaddss %xmm3, %xmm0, %xmm0
+; CHECK-NEXT: vaddss %xmm4, %xmm0, %xmm0
+; CHECK-NEXT: vaddss %xmm5, %xmm0, %xmm0
+; CHECK-NEXT: vaddss %xmm6, %xmm7, %xmm1
+; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
+; CHECK-NEXT: retq
+  %t0 = fadd float %x0, %x1
+  %t1 = fadd float %t0, %x2
+  %t2 = fadd float %t1, %x3
+  %t3 = fadd float %t2, %x4
+  %t4 = fadd float %t3, %x5
+  %t5 = fadd float %t4, %x6
+  %t6 = fadd float %t5, %x7
+  ret float %t6
+}