This is an archive of the discontinued LLVM Phabricator instance.

[x86] generalize reassociation optimization in machine combiner to 2 instructions
ClosedPublic

Authored by spatel on Jun 15 2015, 2:05 PM.

Details

Reviewers

qcolombet
Gerolf
mehdi_amini

Commits

rGe79b43a01f90: [x86] generalize reassociation optimization in machine combiner to 2…
rL240361: [x86] generalize reassociation optimization in machine combiner to 2…

Summary

Currently (D10321, http://reviews.llvm.org/rL239486), we can use the machine combiner pass to reassociate the following sequence to reduce the critical path:

A = ? op ?
B = A op X
C = B op Y
-->
A = ? op ?
B = X op Y
C = A op B

'op' is currently limited to x86 AVX scalar FP adds (with fast-math on), but in theory, it could be any associative math/logic op (see TODO in code comment).
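To make the critical-path claim concrete, here is a minimal sketch that computes when C becomes available before and after the reassociation. It assumes every 'op' has the same latency and that each input's ready cycle is known; the function names and the latency values used with them are illustrative and are not part of LLVM.

```cpp
#include <algorithm>
#include <cassert>

// Cycle at which C is available for the original chain:
//   B = A op X
//   C = B op Y
// ReadyA/ReadyX/ReadyY are the cycles at which each input becomes available.
unsigned chainFinish(unsigned ReadyA, unsigned ReadyX, unsigned ReadyY,
                     unsigned OpLatency) {
  unsigned ReadyB = std::max(ReadyA, ReadyX) + OpLatency;
  return std::max(ReadyB, ReadyY) + OpLatency;
}

// Cycle at which C is available after reassociation:
//   B = X op Y
//   C = A op B
unsigned reassocFinish(unsigned ReadyA, unsigned ReadyX, unsigned ReadyY,
                       unsigned OpLatency) {
  unsigned ReadyB = std::max(ReadyX, ReadyY) + OpLatency;
  return std::max(ReadyA, ReadyB) + OpLatency;
}
```

With a hypothetical 3-cycle op and A itself produced by one such op (ready at cycle 3), the original chain finishes at cycle 9 and the reassociated one at cycle 6. If A is ready at cycle 0, both finish at cycle 6, so the transform only pays off when A is late relative to X and Y.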

This patch generalizes the pattern match to ignore the instruction that defines 'A'. So instead of a sequence of 3 adds, we now only need to find 2 dependent adds and decide if it's worth reassociating them.

This generalization has a compile-time cost because we can now match more instruction sequences and we rely more heavily on the machine combiner to discard sequences where reassociation doesn't improve the critical path.

For example, in the new test case:

A = M div N
B = A add X
C = B add Y

We'll match 2 reassociation patterns, but this transform doesn't reduce the critical path:

A = M div N
B = A add Y
C = B add X

We need the combiner to reject that pattern but select this:

A = M div N
B = X add Y
C = B add A
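The three sequences above can be compared with a small sketch. The latencies are hypothetical (10 cycles for div, 3 for add) and M, N, X and Y are assumed to be available at cycle 0; real numbers would come from the target's scheduling model. None of the names below exist in LLVM.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical latencies in cycles, chosen only for illustration.
const unsigned DivLatency = 10;
const unsigned AddLatency = 3;

// Cycle at which "C = (P add Q) add R" completes, given the cycle at which
// each of the three inputs becomes available.
unsigned finishCycle(unsigned ReadyP, unsigned ReadyQ, unsigned ReadyR) {
  unsigned ReadyB = std::max(ReadyP, ReadyQ) + AddLatency;
  return std::max(ReadyB, ReadyR) + AddLatency;
}

// A = M div N is ready at cycle DivLatency; X and Y are ready at cycle 0.
unsigned originalSeq() { return finishCycle(DivLatency, 0, 0); }     // (A add X) add Y
unsigned commutedOnlySeq() { return finishCycle(DivLatency, 0, 0); } // (A add Y) add X
unsigned reassociatedSeq() { return finishCycle(0, 0, DivLatency); } // (X add Y) add A
```

Under these assumptions the original and the first candidate both finish at cycle 16, while the second candidate finishes at cycle 13, which is why only the second one is worth keeping.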

On Mehdi's (hopefully degenerate for x86) test case from the r236031 post-commit thread, the compile-time increases from ~0.2 sec to 5.0 sec because the combiner completes 3963 reassociations. Using test-suite's benchmarking subset, however, the only test where this completes more than 4 times is linpack; there it reassociates 14 times (used to be 11). But I don't see any compile-time difference from doing that extra optimization work.

Diff Detail

Repository: rL LLVM

Event Timeline

spatel updated this revision to Diff 27712. Jun 15 2015, 2:05 PM

spatel retitled this revision to [x86] generalize reassociation optimization in machine combiner to 2 instructions.

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: Gerolf, mehdi_amini, qcolombet.

spatel added a subscriber: Unknown Object (MLST).

Gerolf added inline comments. Jun 15 2015, 9:24 PM

lib/CodeGen/MachineCombiner.cpp
214 ↗	(On Diff #27712)	remove outright
218 ↗	(On Diff #27712)	This is no longer true. The equation could now be '<'.
lib/Target/X86/X86InstrInfo.cpp
6292 ↗	(On Diff #27712)	This function does more than the name suggests. Also, I don't find it intuitive that it records two patterns.

spatel added inline comments. Jun 16 2015, 9:23 AM

lib/CodeGen/MachineCombiner.cpp
214 ↗	(On Diff #27712)	Fixed.
218 ↗	(On Diff #27712)	I read that statement as <= is still the minimum requirement, but let's see if I can make that clearer. I've added another explanatory statement after the formula to explain the role of the new parameter (NewCodeHasLessInsts). Let me know if you see a better way to word this. Thanks!
lib/Target/X86/X86InstrInfo.cpp
6292 ↗	(On Diff #27712)	Looking at the AArch64 implementation, I thought it also could record 2 patterns per call. I agree that we should make this more obvious. Suggestions: (1) getPatterns(), (2) getMachineCombinerPatterns(), (3) getMachineCombinerPatternsForRootInst(). I opted for #2 in this revision of the patch. Since this change is just a naming difference but affects more files, we could make it a follow-on patch?
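The '<=' versus '<' question discussed for MachineCombiner.cpp can be restated as a standalone sketch. It mirrors the comparison in the patch, but the `PathCost` struct and the free function are hypothetical stand-ins for values the real pass reads from the machine trace.

```cpp
#include <cassert>

// Hypothetical container for the per-instruction numbers that the real pass
// obtains from MachineTraceMetrics and the scheduling model.
struct PathCost {
  unsigned Depth;
  unsigned Latency;
  unsigned Slack;
};

// An equal-length critical path is only accepted when the new sequence also
// has fewer instructions; otherwise the new path must be strictly shorter.
bool improvesCriticalPath(const PathCost &NewRoot, const PathCost &Root,
                          bool NewCodeHasLessInsts) {
  unsigned NewCycleCount = NewRoot.Depth + NewRoot.Latency;
  unsigned OldCycleCount = Root.Depth + Root.Latency + Root.Slack;
  if (NewCodeHasLessInsts)
    return NewCycleCount <= OldCycleCount;
  return NewCycleCount < OldCycleCount;
}
```

For reassociation the instruction count does not change, so the strict '<' applies and a candidate that merely ties the old sequence is discarded.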

Patch updated based on Gerolf's feedback. See previous inline comments and replies.

The code looks pretty good, but I'd like to understand better why the new code investigates more patterns. Also, the compile time increase to 5s looks huge. It is probably correct that it must be an outlier, however, is there anything that can be done to protect from a compile-time spike? On the other hand, the extra compile-time could be a good compile-time/performance trade-off. So one possibility I can think of is to check the rate of success. For example: investigated N association patterns, never found a better code sequence (or perhaps some % threshold instead), so let's not waste more time on association patterns in this function. What do you think?
Thanks for clarifying the AARCH64 and MachineCombiner code!
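One way to picture the success-rate idea is the sketch below. It is not part of the patch or of LLVM; the class name, its parameters and the percentage threshold are all hypothetical, and a real implementation would have to decide where the counters live (per function, per block) and how they are tuned.

```cpp
#include <cassert>

// Hypothetical per-function budget: after a minimum number of attempts, stop
// trying reassociation patterns if too few of them were accepted.
class ReassocBudget {
  unsigned Tried = 0;
  unsigned Succeeded = 0;
  unsigned MinTrials;
  unsigned MinSuccessPercent;

public:
  ReassocBudget(unsigned Trials, unsigned SuccessPercent)
      : MinTrials(Trials), MinSuccessPercent(SuccessPercent) {}

  // Record the outcome of one investigated pattern.
  void record(bool Improved) {
    ++Tried;
    if (Improved)
      ++Succeeded;
  }

  // True while too few patterns have been tried to judge, or while the
  // observed success rate is at or above the threshold.
  bool shouldKeepTrying() const {
    if (Tried < MinTrials)
      return true;
    return Succeeded * 100 >= MinSuccessPercent * Tried;
  }
};
```

A caller would check `shouldKeepTrying()` before asking the target for patterns and call `record()` after the combiner accepts or rejects each candidate.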

lib/Target/X86/X86InstrInfo.cpp
6284 ↗	(On Diff #27764)	There is a bit of code duplication you can avoid eg. by overloading hasVirtual...() and wrapping the code starting at MRI in a function. Then you would get something like if (hasVirtualRegDefsInBasicBlock(Op1,Op2, MBB) && Sibling=findAssocSibling(Op1,Op2,MBB, Commute) && hasVirtualRegDefsInBasicBlock(Sibling, MBB)) return true; return false;
6316 ↗	(On Diff #27764)	Allowing more than one pattern was part of the original design. What confused/confuses me is that in your old code you checked if operands had to be commuted in Root and Prev. But now the code only checks Root and potentially investigates two code sequences instead of one. Isn't that more expensive? And given that the order of the operands in Prev is not checked now, should there be a change in reassociateOps() addressing that?

Phab is slow/down, so sending email to list...

spatel mentioned this in rL240192: name change: hasPattern() -> getMachineCombinerPatterns() ; NFC. Jun 19 2015, 4:26 PM

In D10460#190819, @Gerolf wrote:

The code looks pretty good, but I'd like to understand better why the new code investigates more patterns. Also, the compile time increase to 5s looks huge. It is probably correct that it must be an outlier, however, is there anything that can be done to protect from a compile-time spike? On the other hand, the extra compile-time could be a good compile-time/performance trade-off. So one possibility I can think of is to check the rate of success. For example: investigated N association patterns, never found a better code sequence (or perhaps some % threshold instead), so let's not waste more time on association patterns in this function. What do you think?

It's certainly possible that we'll cause a compile-time spike with this patch (or even the existing code), but I would prefer to leave the safety harness as a follow-on patch pending some evidence that the case actually exists in the real world. Limiting this patch without that evidence seems like a premature compile-time optimization to me. The extra compile time should always be linear in the number of instructions, so it shouldn't explode too far on us.

lib/Target/X86/X86InstrInfo.cpp
6284 ↗	(On Diff #27764)	Good point. I took a slightly different approach to reduce even further!
6316 ↗	(On Diff #27764)	reassociateOps() doesn't need any changes because the earlier patch assumed this change was coming; we made it (even the comments) assume the more general pattern could happen.
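For readers following the two-patterns question, the candidate selection in the updated patch can be summarized with this sketch. The enum is a hypothetical stand-in for the MC_REASSOC_* values in the diff, and the function is illustrative rather than the code that was committed.

```cpp
#include <cassert>
#include <vector>

// Stand-ins for MC_REASSOC_AX_BY, MC_REASSOC_AX_YB, MC_REASSOC_XA_BY and
// MC_REASSOC_XA_YB from the diff.
enum class ReassocPattern { AX_BY, AX_YB, XA_BY, XA_YB };

// Only Root's operand order is checked up front. Both operand orders of Prev
// are offered as candidates, and the machine combiner's cost check decides
// which one, if any, is kept.
std::vector<ReassocPattern> candidatePatterns(bool CommuteRoot) {
  if (CommuteRoot)
    return {ReassocPattern::AX_YB, ReassocPattern::XA_YB};
  return {ReassocPattern::AX_BY, ReassocPattern::XA_BY};
}
```

This is where the extra compile time comes from: each matched Root now yields two candidates for the combiner to evaluate instead of one.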

Patch updated based on Gerolf's feedback.

Also, I checked in the name change: hasPattern() -> getMachineCombinerPatterns().
http://reviews.llvm.org/rL240192

...because that's independent and NFC, so this patch is reduced to just the MachineCombiner and x86 files again.

LGTM, but for compile time please add a FIXME before commit. What more evidence does it need? "On Mehdi's (hopefully degenerate for x86) test case from the r236031 post-commit thread, the compile-time increases from ~0.2 sec to 5.0 sec". However, in the current form the patch should have negligible compile-time impact in general.

For the record: the test didn’t come from an X86 test, it is a simplified version of a real-world GPU shader.

—
Mehdi

Closed by commit rL240361: [x86] generalize reassociation optimization in machine combiner to 2… (authored by spatel). Jun 22 2015, 5:44 PM

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in D10975: [x86] extend machine combiner reassociation optimization to SSE scalar adds. Jul 6 2015, 3:14 PM

spatel mentioned this in rL241515: [x86] extend machine combiner reassociation optimization to SSE scalar adds. Jul 6 2015, 3:36 PM

Revision Contents

Path	Size
llvm/trunk/lib/CodeGen/MachineCombiner.cpp	49 lines
llvm/trunk/lib/Target/X86/X86InstrInfo.cpp	164 lines
llvm/trunk/test/CodeGen/X86/fp-fast.ll	78 lines
llvm/trunk/test/CodeGen/X86/machine-combiner.ll	99 lines

Diff 28182

llvm/trunk/lib/CodeGen/MachineCombiner.cpp

private:
bool combineInstructions(MachineBasicBlock *);		bool combineInstructions(MachineBasicBlock *);
MachineInstr *getOperandDef(const MachineOperand &MO);		MachineInstr *getOperandDef(const MachineOperand &MO);
unsigned getDepth(SmallVectorImpl<MachineInstr *> &InsInstrs,		unsigned getDepth(SmallVectorImpl<MachineInstr *> &InsInstrs,
DenseMap<unsigned, unsigned> &InstrIdxForVirtReg,		DenseMap<unsigned, unsigned> &InstrIdxForVirtReg,
MachineTraceMetrics::Trace BlockTrace);		MachineTraceMetrics::Trace BlockTrace);
unsigned getLatency(MachineInstr Root, MachineInstr NewRoot,		unsigned getLatency(MachineInstr Root, MachineInstr NewRoot,
MachineTraceMetrics::Trace BlockTrace);		MachineTraceMetrics::Trace BlockTrace);
bool		bool
preservesCriticalPathLen(MachineBasicBlock MBB, MachineInstr Root,		improvesCriticalPathLen(MachineBasicBlock MBB, MachineInstr Root,
MachineTraceMetrics::Trace BlockTrace,		MachineTraceMetrics::Trace BlockTrace,
SmallVectorImpl<MachineInstr *> &InsInstrs,		SmallVectorImpl<MachineInstr *> &InsInstrs,
DenseMap<unsigned, unsigned> &InstrIdxForVirtReg);		DenseMap<unsigned, unsigned> &InstrIdxForVirtReg,
		bool NewCodeHasLessInsts);
bool preservesResourceLen(MachineBasicBlock *MBB,		bool preservesResourceLen(MachineBasicBlock *MBB,
MachineTraceMetrics::Trace BlockTrace,		MachineTraceMetrics::Trace BlockTrace,
SmallVectorImpl<MachineInstr *> &InsInstrs,		SmallVectorImpl<MachineInstr *> &InsInstrs,
SmallVectorImpl<MachineInstr *> &DelInstrs);		SmallVectorImpl<MachineInstr *> &DelInstrs);
void instr2instrSC(SmallVectorImpl<MachineInstr *> &Instrs,		void instr2instrSC(SmallVectorImpl<MachineInstr *> &Instrs,
SmallVectorImpl<const MCSchedClassDesc *> &InstrsSC);		SmallVectorImpl<const MCSchedClassDesc *> &InstrsSC);
};		};
} // namespace		} // namespace
for (const MachineOperand &MO : NewRoot->operands()) {
} else {		} else {
LatencyOp = TSchedModel.computeInstrLatency(NewRoot->getOpcode());		LatencyOp = TSchedModel.computeInstrLatency(NewRoot->getOpcode());
}		}
NewRootLatency = std::max(NewRootLatency, LatencyOp);		NewRootLatency = std::max(NewRootLatency, LatencyOp);
}		}
return NewRootLatency;		return NewRootLatency;
}		}

/// True when the new instruction sequence does not		/// True when the new instruction sequence does not lengthen the critical path
/// lengthen the critical path. The DAGCombine code sequence ends in MI		/// and the new sequence has less instructions or the new sequence improves the
/// (Machine Instruction) Root. The new code sequence ends in MI NewRoot. A		/// critical path.
/// necessary condition for the new sequence to replace the old sequence is that		/// The DAGCombine code sequence ends in MI (Machine Instruction) Root.
/// it cannot lengthen the critical path. This is decided by the formula		/// The new code sequence ends in MI NewRoot. A necessary condition for the new
		/// sequence to replace the old sequence is that it cannot lengthen the critical
		/// path. This is decided by the formula:
/// (NewRootDepth + NewRootLatency) <= (RootDepth + RootLatency + RootSlack)).		/// (NewRootDepth + NewRootLatency) <= (RootDepth + RootLatency + RootSlack)).
/// The slack is the number of cycles Root can be delayed before the critical		/// If the new sequence has an equal length critical path but does not reduce
/// patch becomes longer.		/// the number of instructions (NewCodeHasLessInsts is false), then it is not
bool MachineCombiner::preservesCriticalPathLen(		/// considered an improvement. The slack is the number of cycles Root can be
		/// delayed before the critical patch becomes longer.
		bool MachineCombiner::improvesCriticalPathLen(
MachineBasicBlock MBB, MachineInstr Root,		MachineBasicBlock MBB, MachineInstr Root,
MachineTraceMetrics::Trace BlockTrace,		MachineTraceMetrics::Trace BlockTrace,
SmallVectorImpl<MachineInstr *> &InsInstrs,		SmallVectorImpl<MachineInstr *> &InsInstrs,
DenseMap<unsigned, unsigned> &InstrIdxForVirtReg) {		DenseMap<unsigned, unsigned> &InstrIdxForVirtReg,
		bool NewCodeHasLessInsts) {

assert(TSchedModel.hasInstrSchedModel() && "Missing machine model\n");		assert(TSchedModel.hasInstrSchedModel() && "Missing machine model\n");
// NewRoot is the last instruction in the \p InsInstrs vector.		// NewRoot is the last instruction in the \p InsInstrs vector.
// Get depth and latency of NewRoot.		// Get depth and latency of NewRoot.
unsigned NewRootIdx = InsInstrs.size() - 1;		unsigned NewRootIdx = InsInstrs.size() - 1;
MachineInstr *NewRoot = InsInstrs[NewRootIdx];		MachineInstr *NewRoot = InsInstrs[NewRootIdx];
unsigned NewRootDepth = getDepth(InsInstrs, InstrIdxForVirtReg, BlockTrace);		unsigned NewRootDepth = getDepth(InsInstrs, InstrIdxForVirtReg, BlockTrace);
unsigned NewRootLatency = getLatency(Root, NewRoot, BlockTrace);		unsigned NewRootLatency = getLatency(Root, NewRoot, BlockTrace);

// Get depth, latency and slack of Root.		// Get depth, latency and slack of Root.
unsigned RootDepth = BlockTrace.getInstrCycles(Root).Depth;		unsigned RootDepth = BlockTrace.getInstrCycles(Root).Depth;
unsigned RootLatency = TSchedModel.computeInstrLatency(Root);		unsigned RootLatency = TSchedModel.computeInstrLatency(Root);
unsigned RootSlack = BlockTrace.getInstrSlack(Root);		unsigned RootSlack = BlockTrace.getInstrSlack(Root);

DEBUG(dbgs() << "DEPENDENCE DATA FOR " << Root << "\n";		DEBUG(dbgs() << "DEPENDENCE DATA FOR " << Root << "\n";
dbgs() << " NewRootDepth: " << NewRootDepth		dbgs() << " NewRootDepth: " << NewRootDepth
<< " NewRootLatency: " << NewRootLatency << "\n";		<< " NewRootLatency: " << NewRootLatency << "\n";
dbgs() << " RootDepth: " << RootDepth << " RootLatency: " << RootLatency		dbgs() << " RootDepth: " << RootDepth << " RootLatency: " << RootLatency
<< " RootSlack: " << RootSlack << "\n";		<< " RootSlack: " << RootSlack << "\n";
dbgs() << " NewRootDepth + NewRootLatency "		dbgs() << " NewRootDepth + NewRootLatency "
<< NewRootDepth + NewRootLatency << "\n";		<< NewRootDepth + NewRootLatency << "\n";
dbgs() << " RootDepth + RootLatency + RootSlack "		dbgs() << " RootDepth + RootLatency + RootSlack "
<< RootDepth + RootLatency + RootSlack << "\n";);		<< RootDepth + RootLatency + RootSlack << "\n";);

/// True when the new sequence does not lengthen the critical path.		unsigned NewCycleCount = NewRootDepth + NewRootLatency;
return ((NewRootDepth + NewRootLatency) <=		unsigned OldCycleCount = RootDepth + RootLatency + RootSlack;
(RootDepth + RootLatency + RootSlack));
		if (NewCodeHasLessInsts)
		return NewCycleCount <= OldCycleCount;
		else
		return NewCycleCount < OldCycleCount;
}		}

/// helper routine to convert instructions into SC		/// helper routine to convert instructions into SC
void MachineCombiner::instr2instrSC(		void MachineCombiner::instr2instrSC(
SmallVectorImpl<MachineInstr *> &Instrs,		SmallVectorImpl<MachineInstr *> &Instrs,
SmallVectorImpl<const MCSchedClassDesc *> &InstrsSC) {		SmallVectorImpl<const MCSchedClassDesc *> &InstrsSC) {
for (auto *InstrPtr : Instrs) {		for (auto *InstrPtr : Instrs) {
unsigned Opc = InstrPtr->getOpcode();		unsigned Opc = InstrPtr->getOpcode();
if (TII->getMachineCombinerPatterns(MI, Patterns)) {
SmallVector<MachineInstr *, 16> DelInstrs;		SmallVector<MachineInstr *, 16> DelInstrs;
DenseMap<unsigned, unsigned> InstrIdxForVirtReg;		DenseMap<unsigned, unsigned> InstrIdxForVirtReg;
if (!MinInstr)		if (!MinInstr)
MinInstr = Traces->getEnsemble(MachineTraceMetrics::TS_MinInstrCount);		MinInstr = Traces->getEnsemble(MachineTraceMetrics::TS_MinInstrCount);
MachineTraceMetrics::Trace BlockTrace = MinInstr->getTrace(MBB);		MachineTraceMetrics::Trace BlockTrace = MinInstr->getTrace(MBB);
Traces->verifyAnalysis();		Traces->verifyAnalysis();
TII->genAlternativeCodeSequence(MI, P, InsInstrs, DelInstrs,		TII->genAlternativeCodeSequence(MI, P, InsInstrs, DelInstrs,
InstrIdxForVirtReg);		InstrIdxForVirtReg);
		unsigned NewInstCount = InsInstrs.size();
		unsigned OldInstCount = DelInstrs.size();
// Found pattern, but did not generate alternative sequence.		// Found pattern, but did not generate alternative sequence.
// This can happen e.g. when an immediate could not be materialized		// This can happen e.g. when an immediate could not be materialized
// in a single instruction.		// in a single instruction.
if (!InsInstrs.size())		if (!NewInstCount)
continue;		continue;
// Substitute when we optimize for codesize and the new sequence has		// Substitute when we optimize for codesize and the new sequence has
// fewer instructions OR		// fewer instructions OR
// the new sequence neither lengthens the critical path nor increases		// the new sequence neither lengthens the critical path nor increases
// resource pressure.		// resource pressure.
if (doSubstitute(InsInstrs.size(), DelInstrs.size()) ||		if (doSubstitute(NewInstCount, OldInstCount) ||
(preservesCriticalPathLen(MBB, &MI, BlockTrace, InsInstrs,		(improvesCriticalPathLen(MBB, &MI, BlockTrace, InsInstrs,
InstrIdxForVirtReg) &&		InstrIdxForVirtReg,
		NewInstCount < OldInstCount) &&
preservesResourceLen(MBB, BlockTrace, InsInstrs, DelInstrs))) {		preservesResourceLen(MBB, BlockTrace, InsInstrs, DelInstrs))) {
for (auto *InstrPtr : InsInstrs)		for (auto *InstrPtr : InsInstrs)
MBB->insert((MachineBasicBlock::iterator) &MI, InstrPtr);		MBB->insert((MachineBasicBlock::iterator) &MI, InstrPtr);
for (auto *InstrPtr : DelInstrs)		for (auto *InstrPtr : DelInstrs)
InstrPtr->eraseFromParentAndMarkDBGValuesForRemoval();		InstrPtr->eraseFromParentAndMarkDBGValuesForRemoval();

Changed = true;		Changed = true;
++NumInstCombined;		++NumInstCombined;

llvm/trunk/lib/Target/X86/X86InstrInfo.cpp

bool X86InstrInfo::		bool X86InstrInfo::
hasHighOperandLatency(const TargetSchedModel &SchedModel,		hasHighOperandLatency(const TargetSchedModel &SchedModel,
const MachineRegisterInfo *MRI,		const MachineRegisterInfo *MRI,
const MachineInstr *DefMI, unsigned DefIdx,		const MachineInstr *DefMI, unsigned DefIdx,
const MachineInstr *UseMI, unsigned UseIdx) const {		const MachineInstr *UseMI, unsigned UseIdx) const {
return isHighLatencyDef(DefMI->getOpcode());		return isHighLatencyDef(DefMI->getOpcode());
}		}

/// If the input instruction is part of a chain of dependent ops that are		static bool hasVirtualRegDefsInBasicBlock(const MachineInstr &Inst,
/// suitable for reassociation, return the earlier instruction in the sequence		const MachineBasicBlock *MBB) {
/// that defines its first operand, otherwise return a nullptr.		assert(Inst.getNumOperands() == 3 && "Reassociation needs binary operators");
/// If the instruction's operands must be commuted to be considered a		const MachineOperand &Op1 = Inst.getOperand(1);
/// reassociation candidate, Commuted will be set to true.		const MachineOperand &Op2 = Inst.getOperand(2);
static MachineInstr *isReassocCandidate(const MachineInstr &Inst,
unsigned AssocOpcode,
bool checkPrevOneUse,
bool &Commuted) {
if (Inst.getOpcode() != AssocOpcode)
return nullptr;

MachineOperand Op1 = Inst.getOperand(1);
MachineOperand Op2 = Inst.getOperand(2);

const MachineBasicBlock *MBB = Inst.getParent();
const MachineRegisterInfo &MRI = MBB->getParent()->getRegInfo();		const MachineRegisterInfo &MRI = MBB->getParent()->getRegInfo();

// We need virtual register definitions.		// We need virtual register definitions.
MachineInstr *MI1 = nullptr;		MachineInstr *MI1 = nullptr;
MachineInstr *MI2 = nullptr;		MachineInstr *MI2 = nullptr;
if (Op1.isReg() && TargetRegisterInfo::isVirtualRegister(Op1.getReg()))		if (Op1.isReg() && TargetRegisterInfo::isVirtualRegister(Op1.getReg()))
MI1 = MRI.getUniqueVRegDef(Op1.getReg());		MI1 = MRI.getUniqueVRegDef(Op1.getReg());
if (Op2.isReg() && TargetRegisterInfo::isVirtualRegister(Op2.getReg()))		if (Op2.isReg() && TargetRegisterInfo::isVirtualRegister(Op2.getReg()))
MI2 = MRI.getUniqueVRegDef(Op2.getReg());		MI2 = MRI.getUniqueVRegDef(Op2.getReg());

// And they need to be in the trace (otherwise, they won't have a depth).		// And they need to be in the trace (otherwise, they won't have a depth).
if (!MI1 || !MI2 || MI1->getParent() != MBB || MI2->getParent() != MBB)		if (MI1 && MI2 && MI1->getParent() == MBB && MI2->getParent() == MBB)
return nullptr;		return true;

Commuted = false;		return false;
if (MI1->getOpcode() != AssocOpcode && MI2->getOpcode() == AssocOpcode) {
std::swap(MI1, MI2);
Commuted = true;
}		}

// Avoid reassociating operands when it won't provide any benefit. If both		static bool hasReassocSibling(const MachineInstr &Inst, bool &Commuted) {
// operands are produced by instructions of this type, we may already		const MachineBasicBlock *MBB = Inst.getParent();
// have the optimal sequence.		const MachineRegisterInfo &MRI = MBB->getParent()->getRegInfo();
if (MI2->getOpcode() == AssocOpcode)		MachineInstr *MI1 = MRI.getUniqueVRegDef(Inst.getOperand(1).getReg());
return nullptr;		MachineInstr *MI2 = MRI.getUniqueVRegDef(Inst.getOperand(2).getReg());
		unsigned AssocOpcode = Inst.getOpcode();
// The instruction must only be used by the other instruction that we
// reassociate with.		// If only one operand has the same opcode and it's the second source operand,
if (checkPrevOneUse && !MRI.hasOneNonDBGUse(MI1->getOperand(0).getReg()))		// the operands must be commuted.
return nullptr;		Commuted = MI1->getOpcode() != AssocOpcode && MI2->getOpcode() == AssocOpcode;
		if (Commuted)
		std::swap(MI1, MI2);

// We must match a simple chain of dependent ops.		// 1. The previous instruction must be the same type as Inst.
// TODO: This check is not necessary for the earliest instruction in the		// 2. The previous instruction must have virtual register definitions for its
// sequence. Instead of a sequence of 3 dependent instructions with the same		// operands in the same basic block as Inst.
// opcode, we only need to find a sequence of 2 dependent instructions with		// 3. The previous instruction's result must only be used by Inst.
// the same opcode plus 1 other instruction that adds to the height of the		if (MI1->getOpcode() == AssocOpcode &&
// trace.		hasVirtualRegDefsInBasicBlock(*MI1, MBB) &&
if (MI1->getOpcode() != AssocOpcode)		MRI.hasOneNonDBGUse(MI1->getOperand(0).getReg()))
return nullptr;		return true;

return MI1;		return false;
}		}

/// Select a pattern based on how the operands of each associative operation		/// Return true if the input instruction is part of a chain of dependent ops
/// need to be commuted.		/// that are suitable for reassociation, otherwise return false.
static MachineCombinerPattern::MC_PATTERN getPattern(bool CommutePrev,		/// If the instruction's operands must be commuted to have a previous
bool CommuteRoot) {		/// instruction of the same type define the first source operand, Commuted will
if (CommutePrev) {		/// be set to true.
if (CommuteRoot)		static bool isReassocCandidate(const MachineInstr &Inst, unsigned AssocOpcode,
return MachineCombinerPattern::MC_REASSOC_XA_YB;		bool &Commuted) {
return MachineCombinerPattern::MC_REASSOC_XA_BY;		// 1. The instruction must have the correct type.
} else {		// 2. The instruction must have virtual register definitions for its
if (CommuteRoot)		// operands in the same basic block.
return MachineCombinerPattern::MC_REASSOC_AX_YB;		// 3. The instruction must have a reassociatable sibling.
return MachineCombinerPattern::MC_REASSOC_AX_BY;		if (Inst.getOpcode() == AssocOpcode &&
}		hasVirtualRegDefsInBasicBlock(Inst, Inst.getParent()) &&
		hasReassocSibling(Inst, Commuted))
		return true;

		return false;
}		}

		// FIXME: This has the potential to be expensive (compile time) while not
		// improving the code at all. Some ways to limit the overhead:
		// 1. Track successful transforms; bail out if hit rate gets too low.
		// 2. Only enable at -O3 or some other non-default optimization level.
		// 3. Pre-screen pattern candidates here: if an operand of the previous
		// instruction is known to not increase the critical path, then don't match
		// that pattern.
bool X86InstrInfo::getMachineCombinerPatterns(MachineInstr &Root,		bool X86InstrInfo::getMachineCombinerPatterns(MachineInstr &Root,
SmallVectorImpl<MachineCombinerPattern::MC_PATTERN> &Patterns) const {		SmallVectorImpl<MachineCombinerPattern::MC_PATTERN> &Patterns) const {
if (!Root.getParent()->getParent()->getTarget().Options.UnsafeFPMath)		if (!Root.getParent()->getParent()->getTarget().Options.UnsafeFPMath)
return false;		return false;

		// TODO: There is nothing x86-specific here except the instruction type.
		// This logic could be hoisted into the machine combiner pass itself.

		// Look for this reassociation pattern:
		// B = A op X (Prev)
		// C = B op Y (Root)

// TODO: There are many more associative instruction types to match:		// TODO: There are many more associative instruction types to match:
// 1. Other forms of scalar FP add (non-AVX)		// 1. Other forms of scalar FP add (non-AVX)
// 2. Other data types (double, integer, vectors)		// 2. Other data types (double, integer, vectors)
// 3. Other math / logic operations (mul, and, or)		// 3. Other math / logic operations (mul, and, or)
unsigned AssocOpcode = X86::VADDSSrr;		unsigned AssocOpcode = X86::VADDSSrr;

// TODO: There is nothing x86-specific here except the instruction type.		bool Commute = false;
// This logic could be hoisted into the machine combiner pass itself.		if (isReassocCandidate(Root, AssocOpcode, Commute)) {
bool CommuteRoot;
if (MachineInstr *Prev = isReassocCandidate(Root, AssocOpcode, true,
CommuteRoot)) {
bool CommutePrev;
if (isReassocCandidate(*Prev, AssocOpcode, false, CommutePrev)) {
// We found a sequence of instructions that may be suitable for a		// We found a sequence of instructions that may be suitable for a
// reassociation of operands to increase ILP.		// reassociation of operands to increase ILP. Specify each commutation
Patterns.push_back(getPattern(CommutePrev, CommuteRoot));		// possibility for the Prev instruction in the sequence and let the
return true;		// machine combiner decide if changing the operands is worthwhile.
		if (Commute) {
		Patterns.push_back(MachineCombinerPattern::MC_REASSOC_AX_YB);
		Patterns.push_back(MachineCombinerPattern::MC_REASSOC_XA_YB);
		} else {
		Patterns.push_back(MachineCombinerPattern::MC_REASSOC_AX_BY);
		Patterns.push_back(MachineCombinerPattern::MC_REASSOC_XA_BY);
}		}
		return true;
}		}

return false;		return false;
}		}

/// Attempt the following reassociation to reduce critical path length:		/// Attempt the following reassociation to reduce critical path length:
/// B = A op X (Prev)		/// B = A op X (Prev)
/// C = B op Y (Root)		/// C = B op Y (Root)
/// ===>		/// ===>
/// B = X op Y		/// B = X op Y
void X86InstrInfo::genAlternativeCodeSequence(
MachineCombinerPattern::MC_PATTERN Pattern,		MachineCombinerPattern::MC_PATTERN Pattern,
SmallVectorImpl<MachineInstr *> &InsInstrs,		SmallVectorImpl<MachineInstr *> &InsInstrs,
SmallVectorImpl<MachineInstr *> &DelInstrs,		SmallVectorImpl<MachineInstr *> &DelInstrs,
DenseMap<unsigned, unsigned> &InstIdxForVirtReg) const {		DenseMap<unsigned, unsigned> &InstIdxForVirtReg) const {
MachineRegisterInfo &MRI = Root.getParent()->getParent()->getRegInfo();		MachineRegisterInfo &MRI = Root.getParent()->getParent()->getRegInfo();

// Select the previous instruction in the sequence based on the input pattern.		// Select the previous instruction in the sequence based on the input pattern.
MachineInstr *Prev = nullptr;		MachineInstr *Prev = nullptr;
if (Pattern == MachineCombinerPattern::MC_REASSOC_AX_BY ||		switch (Pattern) {
Pattern == MachineCombinerPattern::MC_REASSOC_XA_BY)		case MachineCombinerPattern::MC_REASSOC_AX_BY:
		case MachineCombinerPattern::MC_REASSOC_XA_BY:
Prev = MRI.getUniqueVRegDef(Root.getOperand(1).getReg());		Prev = MRI.getUniqueVRegDef(Root.getOperand(1).getReg());
else if (Pattern == MachineCombinerPattern::MC_REASSOC_AX_YB ||		break;
Pattern == MachineCombinerPattern::MC_REASSOC_XA_YB)		case MachineCombinerPattern::MC_REASSOC_AX_YB:
		case MachineCombinerPattern::MC_REASSOC_XA_YB:
Prev = MRI.getUniqueVRegDef(Root.getOperand(2).getReg());		Prev = MRI.getUniqueVRegDef(Root.getOperand(2).getReg());
else		}
llvm_unreachable("Unknown pattern for machine combiner");		assert(Prev && "Unknown pattern for machine combiner");

reassociateOps(Root, *Prev, Pattern, InsInstrs, DelInstrs, InstIdxForVirtReg);		reassociateOps(Root, *Prev, Pattern, InsInstrs, DelInstrs, InstIdxForVirtReg);
return;		return;
}		}

namespace {		namespace {
/// Create Global Base Reg pass. This initializes the PIC		/// Create Global Base Reg pass. This initializes the PIC
/// global base register for x86-32.		/// global base register for x86-32.

llvm/trunk/test/CodeGen/X86/fp-fast.ll

	; CHECK: # BB#0:			; CHECK: # BB#0:
	; CHECK-NEXT: vxorps %xmm0, %xmm0, %xmm0			; CHECK-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; CHECK-NEXT: retq			; CHECK-NEXT: retq
	%t1 = fsub float -0.0, %a			%t1 = fsub float -0.0, %a
	%t2 = fadd float %a, %t1			%t2 = fadd float %a, %t1
	ret float %t2			ret float %t2
	}			}

	; Verify that the first two adds are independent regardless of how the inputs are
	; commuted. The destination registers are used as source registers for the third add.

	define float @reassociate_adds1(float %x0, float %x1, float %x2, float %x3) {
	; CHECK-LABEL: reassociate_adds1:
	; CHECK: # BB#0:
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: retq
	%t0 = fadd float %x0, %x1
	%t1 = fadd float %t0, %x2
	%t2 = fadd float %t1, %x3
	ret float %t2
	}

	define float @reassociate_adds2(float %x0, float %x1, float %x2, float %x3) {
	; CHECK-LABEL: reassociate_adds2:
	; CHECK: # BB#0:
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: retq
	%t0 = fadd float %x0, %x1
	%t1 = fadd float %x2, %t0
	%t2 = fadd float %t1, %x3
	ret float %t2
	}

	define float @reassociate_adds3(float %x0, float %x1, float %x2, float %x3) {
	; CHECK-LABEL: reassociate_adds3:
	; CHECK: # BB#0:
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: retq
	%t0 = fadd float %x0, %x1
	%t1 = fadd float %t0, %x2
	%t2 = fadd float %x3, %t1
	ret float %t2
	}

	define float @reassociate_adds4(float %x0, float %x1, float %x2, float %x3) {
	; CHECK-LABEL: reassociate_adds4:
	; CHECK: # BB#0:
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: retq
	%t0 = fadd float %x0, %x1
	%t1 = fadd float %x2, %t0
	%t2 = fadd float %x3, %t1
	ret float %t2
	}

	; Verify that we reassociate some of these ops. The optimal balanced tree of adds is not
	; produced because that would cost more compile time.

	define float @reassociate_adds5(float %x0, float %x1, float %x2, float %x3, float %x4, float %x5, float %x6, float %x7) {
	; CHECK-LABEL: reassociate_adds5:
	; CHECK: # BB#0:
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm5, %xmm4, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm7, %xmm6, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: retq
	%t0 = fadd float %x0, %x1
	%t1 = fadd float %t0, %x2
	%t2 = fadd float %t1, %x3
	%t3 = fadd float %t2, %x4
	%t4 = fadd float %t3, %x5
	%t5 = fadd float %t4, %x6
	%t6 = fadd float %t5, %x7
	ret float %t6
	}

llvm/trunk/test/CodeGen/X86/machine-combiner.ll

				; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=avx -enable-unsafe-fp-math < %s | FileCheck %s

				; Verify that the first two adds are independent regardless of how the inputs are
				; commuted. The destination registers are used as source registers for the third add.

				define float @reassociate_adds1(float %x0, float %x1, float %x2, float %x3) {
				; CHECK-LABEL: reassociate_adds1:
				; CHECK: # BB#0:
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%t0 = fadd float %x0, %x1
				%t1 = fadd float %t0, %x2
				%t2 = fadd float %t1, %x3
				ret float %t2
				}

				define float @reassociate_adds2(float %x0, float %x1, float %x2, float %x3) {
				; CHECK-LABEL: reassociate_adds2:
				; CHECK: # BB#0:
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%t0 = fadd float %x0, %x1
				%t1 = fadd float %x2, %t0
				%t2 = fadd float %t1, %x3
				ret float %t2
				}

				define float @reassociate_adds3(float %x0, float %x1, float %x2, float %x3) {
				; CHECK-LABEL: reassociate_adds3:
				; CHECK: # BB#0:
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%t0 = fadd float %x0, %x1
				%t1 = fadd float %t0, %x2
				%t2 = fadd float %x3, %t1
				ret float %t2
				}

				define float @reassociate_adds4(float %x0, float %x1, float %x2, float %x3) {
				; CHECK-LABEL: reassociate_adds4:
				; CHECK: # BB#0:
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%t0 = fadd float %x0, %x1
				%t1 = fadd float %x2, %t0
				%t2 = fadd float %x3, %t1
				ret float %t2
				}

				; Verify that we reassociate some of these ops. The optimal balanced tree of adds is not
				; produced because that would cost more compile time.

				define float @reassociate_adds5(float %x0, float %x1, float %x2, float %x3, float %x4, float %x5, float %x6, float %x7) {
				; CHECK-LABEL: reassociate_adds5:
				; CHECK: # BB#0:
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm5, %xmm4, %xmm1
				; CHECK-NEXT: vaddss %xmm6, %xmm1, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm7, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%t0 = fadd float %x0, %x1
				%t1 = fadd float %t0, %x2
				%t2 = fadd float %t1, %x3
				%t3 = fadd float %t2, %x4
				%t4 = fadd float %t3, %x5
				%t5 = fadd float %t4, %x6
				%t6 = fadd float %t5, %x7
				ret float %t6
				}
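
Reviewer-style note on reassociate_adds5: the comment says the optimal balanced tree is not produced. A minimal sketch of what that means, assuming every add has the same latency and counting only dependent adds on the longest path. The helper name `levels` and the tuple encoding of the expression trees are illustrative assumptions, not anything taken from the patch.

```python
def levels(expr):
    """Count dependent adds on the longest path; strings are function arguments."""
    if isinstance(expr, str):
        return 0
    lhs, rhs = expr
    return 1 + max(levels(lhs), levels(rhs))

x = [f"x{i}" for i in range(8)]

# Source order in the IR: ((((((x0+x1)+x2)+x3)+x4)+x5)+x6)+x7
serial = x[0]
for arg in x[1:]:
    serial = (serial, arg)

# Shape implied by the CHECK lines of reassociate_adds5 in machine-combiner.ll.
checked = ((((x[0], x[1]), (x[2], x[3])), ((x[4], x[5]), x[6])), x[7])

# Fully balanced tree, which the test comment says is not produced.
balanced = (((x[0], x[1]), (x[2], x[3])), ((x[4], x[5]), (x[6], x[7])))

print(levels(serial), levels(checked), levels(balanced))  # 7 4 3
```

Under this unit-latency assumption the checked sequence cuts the chain from 7 dependent adds to 4, one more than the 3 of a fully balanced tree.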

				; Verify that we only need two associative operations to reassociate the operands.
				; Also, we should reassociate such that the result of the high-latency division
				; is used by the final 'add' rather than reassociating the %x3 operand with the
				; division. The latter reassociation would not shorten the critical path.

				define float @reassociate_adds6(float %x0, float %x1, float %x2, float %x3) {
				; CHECK-LABEL: reassociate_adds6:
				; CHECK: # BB#0:
				; CHECK-NEXT: vdivss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%t0 = fdiv float %x0, %x1
				%t1 = fadd float %x2, %t0
				%t2 = fadd float %x3, %t1
				ret float %t2
				}
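
Reviewer-style note on reassociate_adds6: the three orderings discussed in the summary can be compared with a small critical-path model. This is only a sketch. The latencies below are made-up placeholder values (the real ones come from the target's scheduling model), and `LATENCY` and `depth` are hypothetical names used for illustration.

```python
# Hypothetical latencies in cycles; not taken from any real scheduling model.
LATENCY = {"div": 10, "add": 3}

def depth(expr):
    """Cycle at which expr's value is ready; strings are inputs ready at cycle 0."""
    if isinstance(expr, str):
        return 0
    op, lhs, rhs = expr
    return max(depth(lhs), depth(rhs)) + LATENCY[op]

div = ("div", "x0", "x1")

# As written in the IR: %t1 = %x2 + %t0, %t2 = %x3 + %t1
original = ("add", "x3", ("add", "x2", div))

# Pattern the combiner should reject: the division result still feeds both adds.
swapped = ("add", "x2", ("add", "x3", div))

# Pattern the combiner should select: %x2 + %x3 runs in parallel with the division.
selected = ("add", div, ("add", "x2", "x3"))

print(depth(original), depth(swapped), depth(selected))  # 16 16 13
```

With these assumed numbers the rejected pattern leaves the critical path at div + 2 adds, while the selected one shortens it to div + 1 add, which matches the vdivss / vaddss / vaddss sequence the CHECK lines expect.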