This is an archive of the discontinued LLVM Phabricator instance.

[x86] generalize reassociation optimization in machine combiner to 2 instructions
ClosedPublic

Authored by spatel on Jun 15 2015, 2:05 PM.

Details

Reviewers

qcolombet
Gerolf
mehdi_amini

Commits

rGe79b43a01f90: [x86] generalize reassociation optimization in machine combiner to 2…
rL240361: [x86] generalize reassociation optimization in machine combiner to 2…

Summary

Currently (D10321, http://reviews.llvm.org/rL239486), we can use the machine combiner pass to reassociate the following sequence to reduce the critical path:

A = ? op ?
B = A op X
C = B op Y
-->
A = ? op ?
B = X op Y
C = A op B

'op' is currently limited to x86 AVX scalar FP adds (with fast-math on), but in theory, it could be any associative math/logic op (see TODO in code comment).
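As context for the fast-math requirement: FP addition is not associative in general, so reassociating the operands can change the result for some inputs. A minimal sketch with illustrative inputs follows; the function names are made up for this example and are not part of the patch.

```cpp
#include <cassert>

// Sum three floats with the two possible association orders.
float sumLeft(float a, float b, float c) { return (a + b) + c; }
float sumRight(float a, float b, float c) { return a + (b + c); }

// Illustrative inputs, chosen so that rounding makes the two orders disagree.
void checkAssociationOrderMatters() {
  const float a = 1e20f, b = -1e20f, c = 1.0f;
  assert(sumLeft(a, b, c) == 1.0f);   // a + b is exactly 0, then 0 + 1
  assert(sumRight(a, b, c) == 0.0f);  // b + c rounds back to b, then a + b
}
```

Without fast-math the compiler has to preserve the written order, which is why the pattern match is gated on that option.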

This patch generalizes the pattern match to ignore the instruction that defines 'A'. So instead of a sequence of 3 adds, we now only need to find 2 dependent adds and decide if it's worth reassociating them.

This generalization has a compile-time cost because we can now match more instruction sequences and we rely more heavily on the machine combiner to discard sequences where reassociation doesn't improve the critical path.

For example, in the new test case:

A = M div N
B = A add X
C = B add Y

We'll match 2 reassociation patterns, but this transform doesn't reduce the critical path:

A = M div N
B = A add Y
C = B add X

We need the combiner to reject that pattern but select this:

A = M div N
B = X add Y
C = B add A
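To make the three sequences above concrete, here is a rough model of their critical paths. The latencies (14 cycles for div, 3 for add) are assumed purely for illustration and are not taken from any x86 scheduling model. finish() and worthTaking() are hypothetical helpers; worthTaking() mirrors the strict '<' comparison the patch applies when the instruction count does not decrease.

```cpp
// Hypothetical latencies in cycles; real values come from the target's
// scheduling model.
constexpr unsigned DivLatency = 14;
constexpr unsigned AddLatency = 3;

// Cycle at which a two-input op finishes, given when each input is ready.
constexpr unsigned finish(unsigned readyLHS, unsigned readyRHS,
                          unsigned latency) {
  return (readyLHS > readyRHS ? readyLHS : readyRHS) + latency;
}

// M, N, X and Y are assumed to be available at cycle 0.
constexpr unsigned A = finish(0, 0, DivLatency); // A = M div N

// Original:      B = A add X; C = B add Y
constexpr unsigned Original = finish(finish(A, 0, AddLatency), 0, AddLatency);
// Commuted only: B = A add Y; C = B add X
constexpr unsigned Commuted = finish(finish(A, 0, AddLatency), 0, AddLatency);
// Reassociated:  B = X add Y; C = B add A
constexpr unsigned Reassociated =
    finish(finish(0, 0, AddLatency), A, AddLatency);

// With an equal instruction count, a candidate is only worth taking if it
// is strictly shorter than the sequence it replaces.
constexpr bool worthTaking(unsigned newCycles, unsigned oldCycles) {
  return newCycles < oldCycles;
}

static_assert(!worthTaking(Commuted, Original), "commuted form is rejected");
static_assert(worthTaking(Reassociated, Original), "reassociated form wins");
```

Under these assumed numbers the add of X and Y overlaps with the div, so only the last sequence shortens the path.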

On Mehdi's (hopefully degenerate for x86) test case from the r236031 post-commit thread, the compile-time increases from ~0.2 sec to 5.0 sec because the combiner completes 3963 reassociations. Using test-suite's benchmarking subset, however, the only test where this completes more than 4 times is linpack; there it reassociates 14 times (used to be 11). But I don't see any compile-time difference from doing that extra optimization work.

Diff Detail

Event Timeline

spatel updated this revision to Diff 27712. Jun 15 2015, 2:05 PM

spatel retitled this revision to [x86] generalize reassociation optimization in machine combiner to 2 instructions.

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: Gerolf, mehdi_amini, qcolombet.

spatel added a subscriber: Unknown Object (MLST).

Gerolf added inline comments. Jun 15 2015, 9:24 PM

lib/CodeGen/MachineCombiner.cpp
214	remove outright
218	This is no longer true. The equation could now be '<'.
lib/Target/X86/X86InstrInfo.cpp
6292	This function does more than the name suggests. Also, I don't find it intuitive that it records two patterns.

spatel added inline comments. Jun 16 2015, 9:23 AM

lib/CodeGen/MachineCombiner.cpp
214	Fixed.
218	I read that statement as <= is still the minimum requirement, but let's see if I can make that clearer. I've added another explanatory statement after the formula to explain the role of the new parameter (NewCodeHasLessInsts). Let me know if you see a better way to word this. Thanks!
lib/Target/X86/X86InstrInfo.cpp
6292	Looking at the AArch64 implementation, I thought it also could record 2 patterns per call. I agree that we should make this more obvious. Suggestions:
	1. getPatterns()
	2. getMachineCombinerPatterns()
	3. getMachineCombinerPatternsForRootInst()
	I opted for #2 in this revision of the patch. Since this change is just a naming difference but affects more files, we could make it a follow-on patch?

Patch updated based on Gerolf's feedback. See previous inline comments and replies.

The code looks pretty good, but I'd like to understand better why the new code investigates more patterns. Also, the compile time increase to 5s looks huge. It is probably correct that it must be an outlier; however, is there anything that can be done to protect from a compile-time spike? On the other hand, the extra compile-time could be a good compile-time/performance trade-off. So one possibility I can think of is to check the rate of success. For example: investigated N association patterns, never found a better code sequence (or perhaps some % threshold instead), so let's not waste more time on association patterns in this function. What do you think?
Thanks for clarifying the AARCH64 and MachineCombiner code!

lib/Target/X86/X86InstrInfo.cpp
6284	There is a bit of code duplication you can avoid, e.g. by overloading hasVirtual...() and wrapping the code starting at MRI in a function. Then you would get something like: if (hasVirtualRegDefsInBasicBlock(Op1, Op2, MBB) && Sibling = findAssocSibling(Op1, Op2, MBB, Commute) && hasVirtualRegDefsInBasicBlock(Sibling, MBB)) return true; return false;
6316	Allowing more than one pattern was part of the original design. What confused/confuses me is that in your old code you checked if operands had to be commuted in Root and Prev. But now the code only checks Root and potentially investigates two code sequences instead of one. Isn't that more expensive? And given that the order of the operands in Prev is not checked now, should there be a change in reassociateOps() addressing that?
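A sketch of the success-rate cutoff suggested in the comment above. Everything here is hypothetical: ReassocBudget, its thresholds, and where it would hook into the combiner are assumptions for illustration, not code from this patch or from LLVM.

```cpp
#include <cassert>

// Hypothetical per-function tracker for reassociation attempts.
class ReassocBudget {
public:
  ReassocBudget(unsigned minAttempts, unsigned minSuccessPercent)
      : MinAttempts(minAttempts), MinSuccessPercent(minSuccessPercent) {}

  // Record the outcome of evaluating one candidate pattern.
  void record(bool improved) {
    ++Attempts;
    if (improved)
      ++Successes;
  }

  // Keep trying until enough attempts have been made to judge the success
  // rate, then stop once the rate falls below the threshold.
  bool shouldKeepTrying() const {
    if (Attempts < MinAttempts)
      return true;
    return Successes * 100 >= MinSuccessPercent * Attempts;
  }

private:
  unsigned MinAttempts;
  unsigned MinSuccessPercent;
  unsigned Attempts = 0;
  unsigned Successes = 0;
};

// Example: four failed attempts in a row exhaust a (4 attempts, 25%) budget.
void exampleUsage() {
  ReassocBudget Budget(4, 25);
  for (int I = 0; I < 4; ++I) {
    assert(Budget.shouldKeepTrying());
    Budget.record(false);
  }
  assert(!Budget.shouldKeepTrying());
}
```

Whether such a cutoff is worth having is exactly the open question in this thread; the sketch only shows that the bookkeeping itself would be cheap.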

Phab is slow/down, so sending email to list...

spatel mentioned this in rL240192: name change: hasPattern() -> getMachineCombinerPatterns() ; NFC. Jun 19 2015, 4:26 PM

In D10460#190819, @Gerolf wrote:

The code looks pretty good, but I'd like to understand better why the new code investigates more patterns. Also, the compile time increase to 5s looks huge. It is probably correct that it must be an outlier; however, is there anything that can be done to protect from a compile-time spike? On the other hand, the extra compile-time could be a good compile-time/performance trade-off. So one possibility I can think of is to check the rate of success. For example: investigated N association patterns, never found a better code sequence (or perhaps some % threshold instead), so let's not waste more time on association patterns in this function. What do you think?

It's certainly possible that we'll cause a compile-time spike with this patch (or even the existing code), but I would prefer to leave the safety harness as a follow-on patch pending some evidence that the case actually exists in the real world. Limiting this patch without that evidence seems like a premature compile-time optimization to me. The extra compile time should always be linear in the number of instructions, so it shouldn't explode too far on us.

lib/Target/X86/X86InstrInfo.cpp
6284	Good point. I took a slightly different approach to reduce even further!
6316	reassociateOps() doesn't need any changes because the earlier patch assumed this change was coming; we made it (even the comments) assume the more general pattern could happen.

Patch updated based on Gerolf's feedback.

Also, I checked in the name change: hasPattern() -> getMachineCombinerPatterns().
http://reviews.llvm.org/rL240192

...because that's independent and NFC, so this patch is reduced to just the MachineCombiner and x86 files again.

LGTM, but for compile time please add a FIXME before commit. What more evidence does it need? "On Mehdi's (hopefully degenerate for x86) test case from the r236031 post-commit thread, the compile-time increases from ~0.2 sec to 5.0 sec". However, in the current form the patch should have negligible compile-time impact in general.

For the record: the test didn’t come from an X86 test, it is a simplified version of a real-world GPU shader.

—
Mehdi

Closed by commit rL240361: [x86] generalize reassociation optimization in machine combiner to 2… (authored by spatel). Jun 22 2015, 5:44 PM

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in D10975: [x86] extend machine combiner reassociation optimization to SSE scalar adds. Jul 6 2015, 3:14 PM

spatel mentioned this in rL241515: [x86] extend machine combiner reassociation optimization to SSE scalar adds. Jul 6 2015, 3:36 PM

Revision Contents

Path	Size
lib/CodeGen/MachineCombiner.cpp	43 lines
lib/Target/X86/X86InstrInfo.cpp	144 lines
test/CodeGen/X86/fp-fast.ll	78 lines
test/CodeGen/X86/machine-combiner.ll	99 lines

Diff 27712

lib/CodeGen/MachineCombiner.cpp

... (61 lines hidden) ...	private:
bool combineInstructions(MachineBasicBlock *);		bool combineInstructions(MachineBasicBlock *);
MachineInstr *getOperandDef(const MachineOperand &MO);		MachineInstr *getOperandDef(const MachineOperand &MO);
unsigned getDepth(SmallVectorImpl<MachineInstr *> &InsInstrs,		unsigned getDepth(SmallVectorImpl<MachineInstr *> &InsInstrs,
DenseMap<unsigned, unsigned> &InstrIdxForVirtReg,		DenseMap<unsigned, unsigned> &InstrIdxForVirtReg,
MachineTraceMetrics::Trace BlockTrace);		MachineTraceMetrics::Trace BlockTrace);
unsigned getLatency(MachineInstr *Root, MachineInstr *NewRoot,		unsigned getLatency(MachineInstr *Root, MachineInstr *NewRoot,
MachineTraceMetrics::Trace BlockTrace);		MachineTraceMetrics::Trace BlockTrace);
bool		bool
preservesCriticalPathLen(MachineBasicBlock *MBB, MachineInstr *Root,		improvesCriticalPathLen(MachineBasicBlock *MBB, MachineInstr *Root,
MachineTraceMetrics::Trace BlockTrace,		MachineTraceMetrics::Trace BlockTrace,
SmallVectorImpl<MachineInstr *> &InsInstrs,		SmallVectorImpl<MachineInstr *> &InsInstrs,
DenseMap<unsigned, unsigned> &InstrIdxForVirtReg);		DenseMap<unsigned, unsigned> &InstrIdxForVirtReg,
		bool NewCodeHasLessInsts);
bool preservesResourceLen(MachineBasicBlock *MBB,		bool preservesResourceLen(MachineBasicBlock *MBB,
MachineTraceMetrics::Trace BlockTrace,		MachineTraceMetrics::Trace BlockTrace,
SmallVectorImpl<MachineInstr *> &InsInstrs,		SmallVectorImpl<MachineInstr *> &InsInstrs,
SmallVectorImpl<MachineInstr *> &DelInstrs);		SmallVectorImpl<MachineInstr *> &DelInstrs);
void instr2instrSC(SmallVectorImpl<MachineInstr *> &Instrs,		void instr2instrSC(SmallVectorImpl<MachineInstr *> &Instrs,
SmallVectorImpl<const MCSchedClassDesc *> &InstrsSC);		SmallVectorImpl<const MCSchedClassDesc *> &InstrsSC);
};		};
}		}
... (121 lines hidden) ...	for (const MachineOperand &MO : NewRoot->operands()) {
} else {		} else {
LatencyOp = TSchedModel.computeInstrLatency(NewRoot->getOpcode());		LatencyOp = TSchedModel.computeInstrLatency(NewRoot->getOpcode());
}		}
NewRootLatency = std::max(NewRootLatency, LatencyOp);		NewRootLatency = std::max(NewRootLatency, LatencyOp);
}		}
return NewRootLatency;		return NewRootLatency;
}		}

/// True when the new instruction sequence does not		/// True when the new instruction sequence does not lengthen the critical path
/// lengthen the critical path. The DAGCombine code sequence ends in MI		/// and the new sequence has less instructions or the new sequence improves the
/// (Machine Instruction) Root. The new code sequence ends in MI NewRoot. A		/// critical path outright.
		Gerolf (inline comment): remove outright
		spatel (inline reply): Fixed.
/// necessary condition for the new sequence to replace the old sequence is that		/// The DAGCombine code sequence ends in MI (Machine Instruction) Root.
/// it cannot lengthen the critical path. This is decided by the formula		/// The new code sequence ends in MI NewRoot. A necessary condition for the new
		/// sequence to replace the old sequence is that it cannot lengthen the critical
		/// path. This is decided by the formula:
		Gerolf (inline comment): This is no longer true. The equation could now be '<'.
		spatel (inline reply): I read that statement as <= is still the minimum requirement, but let's see if I can make that clearer. I've added another explanatory statement after the formula to explain the role of the new parameter (NewCodeHasLessInsts). Let me know if you see a better way to word this. Thanks!
/// (NewRootDepth + NewRootLatency) <= (RootDepth + RootLatency + RootSlack)).		/// (NewRootDepth + NewRootLatency) <= (RootDepth + RootLatency + RootSlack)).
/// The slack is the number of cycles Root can be delayed before the critical		/// The slack is the number of cycles Root can be delayed before the critical
/// patch becomes longer.		/// patch becomes longer.
bool MachineCombiner::preservesCriticalPathLen(		bool MachineCombiner::improvesCriticalPathLen(
MachineBasicBlock *MBB, MachineInstr *Root,		MachineBasicBlock *MBB, MachineInstr *Root,
MachineTraceMetrics::Trace BlockTrace,		MachineTraceMetrics::Trace BlockTrace,
SmallVectorImpl<MachineInstr *> &InsInstrs,		SmallVectorImpl<MachineInstr *> &InsInstrs,
DenseMap<unsigned, unsigned> &InstrIdxForVirtReg) {		DenseMap<unsigned, unsigned> &InstrIdxForVirtReg,
		bool NewCodeHasLessInsts) {

assert(TSchedModel.hasInstrSchedModel() && "Missing machine model\n");		assert(TSchedModel.hasInstrSchedModel() && "Missing machine model\n");
// NewRoot is the last instruction in the \p InsInstrs vector.		// NewRoot is the last instruction in the \p InsInstrs vector.
// Get depth and latency of NewRoot.		// Get depth and latency of NewRoot.
unsigned NewRootIdx = InsInstrs.size() - 1;		unsigned NewRootIdx = InsInstrs.size() - 1;
MachineInstr *NewRoot = InsInstrs[NewRootIdx];		MachineInstr *NewRoot = InsInstrs[NewRootIdx];
unsigned NewRootDepth = getDepth(InsInstrs, InstrIdxForVirtReg, BlockTrace);		unsigned NewRootDepth = getDepth(InsInstrs, InstrIdxForVirtReg, BlockTrace);
unsigned NewRootLatency = getLatency(Root, NewRoot, BlockTrace);		unsigned NewRootLatency = getLatency(Root, NewRoot, BlockTrace);

// Get depth, latency and slack of Root.		// Get depth, latency and slack of Root.
unsigned RootDepth = BlockTrace.getInstrCycles(Root).Depth;		unsigned RootDepth = BlockTrace.getInstrCycles(Root).Depth;
unsigned RootLatency = TSchedModel.computeInstrLatency(Root);		unsigned RootLatency = TSchedModel.computeInstrLatency(Root);
unsigned RootSlack = BlockTrace.getInstrSlack(Root);		unsigned RootSlack = BlockTrace.getInstrSlack(Root);

DEBUG(dbgs() << "DEPENDENCE DATA FOR " << Root << "\n";		DEBUG(dbgs() << "DEPENDENCE DATA FOR " << Root << "\n";
dbgs() << " NewRootDepth: " << NewRootDepth		dbgs() << " NewRootDepth: " << NewRootDepth
<< " NewRootLatency: " << NewRootLatency << "\n";		<< " NewRootLatency: " << NewRootLatency << "\n";
dbgs() << " RootDepth: " << RootDepth << " RootLatency: " << RootLatency		dbgs() << " RootDepth: " << RootDepth << " RootLatency: " << RootLatency
<< " RootSlack: " << RootSlack << "\n";		<< " RootSlack: " << RootSlack << "\n";
dbgs() << " NewRootDepth + NewRootLatency "		dbgs() << " NewRootDepth + NewRootLatency "
<< NewRootDepth + NewRootLatency << "\n";		<< NewRootDepth + NewRootLatency << "\n";
dbgs() << " RootDepth + RootLatency + RootSlack "		dbgs() << " RootDepth + RootLatency + RootSlack "
<< RootDepth + RootLatency + RootSlack << "\n";);		<< RootDepth + RootLatency + RootSlack << "\n";);

/// True when the new sequence does not lengthen the critical path.		unsigned NewCycleCount = NewRootDepth + NewRootLatency;
return ((NewRootDepth + NewRootLatency) <=		unsigned OldCycleCount = RootDepth + RootLatency + RootSlack;
(RootDepth + RootLatency + RootSlack));
		if (NewCodeHasLessInsts)
		return NewCycleCount <= OldCycleCount;
		else
		return NewCycleCount < OldCycleCount;
}		}

/// helper routine to convert instructions into SC		/// helper routine to convert instructions into SC
void MachineCombiner::instr2instrSC(		void MachineCombiner::instr2instrSC(
SmallVectorImpl<MachineInstr *> &Instrs,		SmallVectorImpl<MachineInstr *> &Instrs,
SmallVectorImpl<const MCSchedClassDesc *> &InstrsSC) {		SmallVectorImpl<const MCSchedClassDesc *> &InstrsSC) {
for (auto *InstrPtr : Instrs) {		for (auto *InstrPtr : Instrs) {
unsigned Opc = InstrPtr->getOpcode();		unsigned Opc = InstrPtr->getOpcode();
... (95 lines hidden) ...	if (TII->hasPattern(MI, Pattern)) {
SmallVector<MachineInstr *, 16> DelInstrs;		SmallVector<MachineInstr *, 16> DelInstrs;
DenseMap<unsigned, unsigned> InstrIdxForVirtReg;		DenseMap<unsigned, unsigned> InstrIdxForVirtReg;
if (!MinInstr)		if (!MinInstr)
MinInstr = Traces->getEnsemble(MachineTraceMetrics::TS_MinInstrCount);		MinInstr = Traces->getEnsemble(MachineTraceMetrics::TS_MinInstrCount);
MachineTraceMetrics::Trace BlockTrace = MinInstr->getTrace(MBB);		MachineTraceMetrics::Trace BlockTrace = MinInstr->getTrace(MBB);
Traces->verifyAnalysis();		Traces->verifyAnalysis();
TII->genAlternativeCodeSequence(MI, P, InsInstrs, DelInstrs,		TII->genAlternativeCodeSequence(MI, P, InsInstrs, DelInstrs,
InstrIdxForVirtReg);		InstrIdxForVirtReg);
		unsigned NewInstCount = InsInstrs.size();
		unsigned OldInstCount = DelInstrs.size();
// Found pattern, but did not generate alternative sequence.		// Found pattern, but did not generate alternative sequence.
// This can happen e.g. when an immediate could not be materialized		// This can happen e.g. when an immediate could not be materialized
// in a single instruction.		// in a single instruction.
if (!InsInstrs.size())		if (!NewInstCount)
continue;		continue;
// Substitute when we optimize for codesize and the new sequence has		// Substitute when we optimize for codesize and the new sequence has
// fewer instructions OR		// fewer instructions OR
// the new sequence neither lengthens the critical path nor increases		// the new sequence neither lengthens the critical path nor increases
// resource pressure.		// resource pressure.
if (doSubstitute(InsInstrs.size(), DelInstrs.size()) ||		if (doSubstitute(NewInstCount, OldInstCount) ||
(preservesCriticalPathLen(MBB, &MI, BlockTrace, InsInstrs,		(improvesCriticalPathLen(MBB, &MI, BlockTrace, InsInstrs,
InstrIdxForVirtReg) &&		InstrIdxForVirtReg,
		NewInstCount < OldInstCount) &&
preservesResourceLen(MBB, BlockTrace, InsInstrs, DelInstrs))) {		preservesResourceLen(MBB, BlockTrace, InsInstrs, DelInstrs))) {
for (auto *InstrPtr : InsInstrs)		for (auto *InstrPtr : InsInstrs)
MBB->insert((MachineBasicBlock::iterator) &MI, InstrPtr);		MBB->insert((MachineBasicBlock::iterator) &MI, InstrPtr);
for (auto *InstrPtr : DelInstrs)		for (auto *InstrPtr : DelInstrs)
InstrPtr->eraseFromParentAndMarkDBGValuesForRemoval();		InstrPtr->eraseFromParentAndMarkDBGValuesForRemoval();

Changed = true;		Changed = true;
++NumInstCombined;		++NumInstCombined;
... (46 lines hidden) ...

lib/Target/X86/X86InstrInfo.cpp

... (6,218 lines hidden) ...
bool X86InstrInfo::		bool X86InstrInfo::
hasHighOperandLatency(const TargetSchedModel &SchedModel,		hasHighOperandLatency(const TargetSchedModel &SchedModel,
const MachineRegisterInfo *MRI,		const MachineRegisterInfo *MRI,
const MachineInstr *DefMI, unsigned DefIdx,		const MachineInstr *DefMI, unsigned DefIdx,
const MachineInstr *UseMI, unsigned UseIdx) const {		const MachineInstr *UseMI, unsigned UseIdx) const {
return isHighLatencyDef(DefMI->getOpcode());		return isHighLatencyDef(DefMI->getOpcode());
}		}

/// If the input instruction is part of a chain of dependent ops that are		static bool hasVirtualRegDefsInBasicBlock(MachineOperand Op1,
/// suitable for reassociation, return the earlier instruction in the sequence		MachineOperand Op2,
/// that defines its first operand, otherwise return a nullptr.		const MachineBasicBlock *MBB) {
/// If the instruction's operands must be commuted to be considered a
/// reassociation candidate, Commuted will be set to true.
static MachineInstr *isReassocCandidate(const MachineInstr &Inst,
unsigned AssocOpcode,
bool checkPrevOneUse,
bool &Commuted) {
if (Inst.getOpcode() != AssocOpcode)
return nullptr;

MachineOperand Op1 = Inst.getOperand(1);
MachineOperand Op2 = Inst.getOperand(2);

const MachineBasicBlock *MBB = Inst.getParent();
const MachineRegisterInfo &MRI = MBB->getParent()->getRegInfo();		const MachineRegisterInfo &MRI = MBB->getParent()->getRegInfo();

// We need virtual register definitions.		// We need virtual register definitions.
MachineInstr *MI1 = nullptr;		MachineInstr *MI1 = nullptr;
MachineInstr *MI2 = nullptr;		MachineInstr *MI2 = nullptr;
if (Op1.isReg() && TargetRegisterInfo::isVirtualRegister(Op1.getReg()))		if (Op1.isReg() && TargetRegisterInfo::isVirtualRegister(Op1.getReg()))
MI1 = MRI.getUniqueVRegDef(Op1.getReg());		MI1 = MRI.getUniqueVRegDef(Op1.getReg());
if (Op2.isReg() && TargetRegisterInfo::isVirtualRegister(Op2.getReg()))		if (Op2.isReg() && TargetRegisterInfo::isVirtualRegister(Op2.getReg()))
MI2 = MRI.getUniqueVRegDef(Op2.getReg());		MI2 = MRI.getUniqueVRegDef(Op2.getReg());

// And they need to be in the trace (otherwise, they won't have a depth).		// And they need to be in the trace (otherwise, they won't have a depth).
if (!MI1 || !MI2 || MI1->getParent() != MBB || MI2->getParent() != MBB)		if (!MI1 || !MI2 || MI1->getParent() != MBB || MI2->getParent() != MBB)
return nullptr;		return false;

		return true;
		}

		/// Return true if the input instruction is part of a chain of dependent ops
		/// that are suitable for reassociation, otherwise return false.
		/// If the instruction's operands must be commuted to have a previous
		/// instruction of the same type define the first source operand, Commuted will
		/// be set to true.
		static bool isReassocCandidate(const MachineInstr &Inst, unsigned AssocOpcode,
		bool &Commuted) {
		if (Inst.getOpcode() != AssocOpcode)
		return false;

		assert(Inst.getNumOperands() == 3 &&
		"Must be a binary operator for reassociation");

		const MachineBasicBlock *MBB = Inst.getParent();
		MachineOperand Op1 = Inst.getOperand(1);
		MachineOperand Op2 = Inst.getOperand(2);
		if (!hasVirtualRegDefsInBasicBlock(Op1, Op2, MBB))
		return false;

		const MachineRegisterInfo &MRI = MBB->getParent()->getRegInfo();
		MachineInstr *MI1 = MRI.getUniqueVRegDef(Op1.getReg());
		MachineInstr *MI2 = MRI.getUniqueVRegDef(Op2.getReg());

Commuted = false;		Commuted = false;
if (MI1->getOpcode() != AssocOpcode && MI2->getOpcode() == AssocOpcode) {		if (MI1->getOpcode() != AssocOpcode && MI2->getOpcode() == AssocOpcode) {
std::swap(MI1, MI2);		std::swap(MI1, MI2);
Commuted = true;		Commuted = true;
}		}

// Avoid reassociating operands when it won't provide any benefit. If both		// We need a previous instruction of the same type to reassociate.
// operands are produced by instructions of this type, we may already		if (MI1->getOpcode() != AssocOpcode)
// have the optimal sequence.		return false;
if (MI2->getOpcode() == AssocOpcode)
return nullptr;

// The instruction must only be used by the other instruction that we		// The previous instruction must only be used by the instruction that we
// reassociate with.		// reassociate with.
if (checkPrevOneUse && !MRI.hasOneNonDBGUse(MI1->getOperand(0).getReg()))		if (!MRI.hasOneNonDBGUse(MI1->getOperand(0).getReg()))
return nullptr;		return false;

// We must match a simple chain of dependent ops.
// TODO: This check is not necessary for the earliest instruction in the
// sequence. Instead of a sequence of 3 dependent instructions with the same
// opcode, we only need to find a sequence of 2 dependent instructions with
// the same opcode plus 1 other instruction that adds to the height of the
// trace.
if (MI1->getOpcode() != AssocOpcode)
return nullptr;

return MI1;		MachineOperand Op11 = MI1->getOperand(1);
		Gerolf (inline comment): There is a bit of code duplication you can avoid, e.g. by overloading hasVirtual...() and wrapping the code starting at MRI in a function. Then you would get something like: if (hasVirtualRegDefsInBasicBlock(Op1, Op2, MBB) && Sibling = findAssocSibling(Op1, Op2, MBB, Commute) && hasVirtualRegDefsInBasicBlock(Sibling, MBB)) return true; return false;
		spatel (inline reply): Good point. I took a slightly different approach to reduce even further!
}		MachineOperand Op12 = MI1->getOperand(2);
		if (!hasVirtualRegDefsInBasicBlock(Op11, Op12, MBB))
		return false;

/// Select a pattern based on how the operands of each associative operation		return true;
/// need to be commuted.
static MachineCombinerPattern::MC_PATTERN getPattern(bool CommutePrev,
bool CommuteRoot) {
if (CommutePrev) {
if (CommuteRoot)
return MachineCombinerPattern::MC_REASSOC_XA_YB;
return MachineCombinerPattern::MC_REASSOC_XA_BY;
} else {
if (CommuteRoot)
return MachineCombinerPattern::MC_REASSOC_AX_YB;
return MachineCombinerPattern::MC_REASSOC_AX_BY;
}
}		}

bool X86InstrInfo::hasPattern(MachineInstr &Root,		bool X86InstrInfo::hasPattern(MachineInstr &Root,
		Gerolf (inline comment): This function does more than the name suggests. Also, I don't find it intuitive that it records two patterns.
		spatel (inline reply): Looking at the AArch64 implementation, I thought it also could record 2 patterns per call. I agree that we should make this more obvious. Suggestions: (1) getPatterns(), (2) getMachineCombinerPatterns(), (3) getMachineCombinerPatternsForRootInst(). I opted for #2 in this revision of the patch. Since this change is just a naming difference but affects more files, we could make it a follow-on patch?
SmallVectorImpl<MachineCombinerPattern::MC_PATTERN> &Pattern) const {		SmallVectorImpl<MachineCombinerPattern::MC_PATTERN> &Pattern) const {
if (!Root.getParent()->getParent()->getTarget().Options.UnsafeFPMath)		if (!Root.getParent()->getParent()->getTarget().Options.UnsafeFPMath)
return false;		return false;

		// TODO: There is nothing x86-specific here except the instruction type.
		// This logic could be hoisted into the machine combiner pass itself.

		// Look for this reassociation pattern:
		// B = A op X (Prev)
		// C = B op Y (Root)

// TODO: There are many more associative instruction types to match:		// TODO: There are many more associative instruction types to match:
// 1. Other forms of scalar FP add (non-AVX)		// 1. Other forms of scalar FP add (non-AVX)
// 2. Other data types (double, integer, vectors)		// 2. Other data types (double, integer, vectors)
// 3. Other math / logic operations (mul, and, or)		// 3. Other math / logic operations (mul, and, or)
unsigned AssocOpcode = X86::VADDSSrr;		unsigned AssocOpcode = X86::VADDSSrr;

// TODO: There is nothing x86-specific here except the instruction type.		bool Commute = false;
// This logic could be hoisted into the machine combiner pass itself.		if (isReassocCandidate(Root, AssocOpcode, Commute)) {
bool CommuteRoot;
if (MachineInstr *Prev = isReassocCandidate(Root, AssocOpcode, true,
CommuteRoot)) {
bool CommutePrev;
if (isReassocCandidate(*Prev, AssocOpcode, false, CommutePrev)) {
// We found a sequence of instructions that may be suitable for a		// We found a sequence of instructions that may be suitable for a
// reassociation of operands to increase ILP.		// reassociation of operands to increase ILP. Specify each commutation
Pattern.push_back(getPattern(CommutePrev, CommuteRoot));		// possibility for the Prev instruction in the sequence and let the
return true;		// machine combiner decide if changing the operands is worthwhile.
		if (Commute) {
		Gerolf (inline comment): Allowing more than one pattern was part of the original design. What confused/confuses me is that in your old code you checked if operands had to be commuted in Root and Prev. But now the code only checks Root and potentially investigates two code sequences instead of one. Isn't that more expensive? And given that the order of the operands in Prev is not checked now, should there be a change in reassociateOps() addressing that?
		spatel (inline reply): reassociateOps() doesn't need any changes because the earlier patch assumed this change was coming; we made it (even the comments) assume the more general pattern could happen.
		Pattern.push_back(MachineCombinerPattern::MC_REASSOC_AX_YB);
		Pattern.push_back(MachineCombinerPattern::MC_REASSOC_XA_YB);
		} else {
		Pattern.push_back(MachineCombinerPattern::MC_REASSOC_AX_BY);
		Pattern.push_back(MachineCombinerPattern::MC_REASSOC_XA_BY);
}		}
		return true;
}		}

return false;		return false;
}		}

/// Attempt the following reassociation to reduce critical path length:		/// Attempt the following reassociation to reduce critical path length:
/// B = A op X (Prev)		/// B = A op X (Prev)
/// C = B op Y (Root)		/// C = B op Y (Root)
... (78 lines hidden) ...	void X86InstrInfo::genAlternativeCodeSequence(
MachineCombinerPattern::MC_PATTERN Pattern,		MachineCombinerPattern::MC_PATTERN Pattern,
SmallVectorImpl<MachineInstr *> &InsInstrs,		SmallVectorImpl<MachineInstr *> &InsInstrs,
SmallVectorImpl<MachineInstr *> &DelInstrs,		SmallVectorImpl<MachineInstr *> &DelInstrs,
DenseMap<unsigned, unsigned> &InstIdxForVirtReg) const {		DenseMap<unsigned, unsigned> &InstIdxForVirtReg) const {
MachineRegisterInfo &MRI = Root.getParent()->getParent()->getRegInfo();		MachineRegisterInfo &MRI = Root.getParent()->getParent()->getRegInfo();

// Select the previous instruction in the sequence based on the input pattern.		// Select the previous instruction in the sequence based on the input pattern.
MachineInstr *Prev = nullptr;		MachineInstr *Prev = nullptr;
if (Pattern == MachineCombinerPattern::MC_REASSOC_AX_BY ||		switch (Pattern) {
Pattern == MachineCombinerPattern::MC_REASSOC_XA_BY)		case MachineCombinerPattern::MC_REASSOC_AX_BY:
		case MachineCombinerPattern::MC_REASSOC_XA_BY:
Prev = MRI.getUniqueVRegDef(Root.getOperand(1).getReg());		Prev = MRI.getUniqueVRegDef(Root.getOperand(1).getReg());
else if (Pattern == MachineCombinerPattern::MC_REASSOC_AX_YB ||		break;
Pattern == MachineCombinerPattern::MC_REASSOC_XA_YB)		case MachineCombinerPattern::MC_REASSOC_AX_YB:
		case MachineCombinerPattern::MC_REASSOC_XA_YB:
Prev = MRI.getUniqueVRegDef(Root.getOperand(2).getReg());		Prev = MRI.getUniqueVRegDef(Root.getOperand(2).getReg());
else		}
llvm_unreachable("Unknown pattern for machine combiner");		assert(Prev && "Unknown pattern for machine combiner");

reassociateOps(Root, *Prev, Pattern, InsInstrs, DelInstrs, InstIdxForVirtReg);		reassociateOps(Root, *Prev, Pattern, InsInstrs, DelInstrs, InstIdxForVirtReg);
return;		return;
}		}

namespace {		namespace {
/// Create Global Base Reg pass. This initializes the PIC		/// Create Global Base Reg pass. This initializes the PIC
/// global base register for x86-32.		/// global base register for x86-32.
... (180 lines hidden) ...

test/CodeGen/X86/fp-fast.ll

	... (108 lines hidden) ...
	; CHECK: # BB#0:			; CHECK: # BB#0:
	; CHECK-NEXT: vxorps %xmm0, %xmm0, %xmm0			; CHECK-NEXT: vxorps %xmm0, %xmm0, %xmm0
	; CHECK-NEXT: retq			; CHECK-NEXT: retq
	%t1 = fsub float -0.0, %a			%t1 = fsub float -0.0, %a
	%t2 = fadd float %a, %t1			%t2 = fadd float %a, %t1
	ret float %t2			ret float %t2
	}			}

	; Verify that the first two adds are independent regardless of how the inputs are
	; commuted. The destination registers are used as source registers for the third add.

	define float @reassociate_adds1(float %x0, float %x1, float %x2, float %x3) {
	; CHECK-LABEL: reassociate_adds1:
	; CHECK: # BB#0:
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: retq
	%t0 = fadd float %x0, %x1
	%t1 = fadd float %t0, %x2
	%t2 = fadd float %t1, %x3
	ret float %t2
	}

	define float @reassociate_adds2(float %x0, float %x1, float %x2, float %x3) {
	; CHECK-LABEL: reassociate_adds2:
	; CHECK: # BB#0:
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: retq
	%t0 = fadd float %x0, %x1
	%t1 = fadd float %x2, %t0
	%t2 = fadd float %t1, %x3
	ret float %t2
	}

	define float @reassociate_adds3(float %x0, float %x1, float %x2, float %x3) {
	; CHECK-LABEL: reassociate_adds3:
	; CHECK: # BB#0:
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: retq
	%t0 = fadd float %x0, %x1
	%t1 = fadd float %t0, %x2
	%t2 = fadd float %x3, %t1
	ret float %t2
	}

	define float @reassociate_adds4(float %x0, float %x1, float %x2, float %x3) {
	; CHECK-LABEL: reassociate_adds4:
	; CHECK: # BB#0:
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: retq
	%t0 = fadd float %x0, %x1
	%t1 = fadd float %x2, %t0
	%t2 = fadd float %x3, %t1
	ret float %t2
	}

	; Verify that we reassociate some of these ops. The optimal balanced tree of adds is not
	; produced because that would cost more compile time.

	define float @reassociate_adds5(float %x0, float %x1, float %x2, float %x3, float %x4, float %x5, float %x6, float %x7) {
	; CHECK-LABEL: reassociate_adds5:
	; CHECK: # BB#0:
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm5, %xmm4, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: vaddss %xmm7, %xmm6, %xmm1
	; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
	; CHECK-NEXT: retq
	%t0 = fadd float %x0, %x1
	%t1 = fadd float %t0, %x2
	%t2 = fadd float %t1, %x3
	%t3 = fadd float %t2, %x4
	%t4 = fadd float %t3, %x5
	%t5 = fadd float %t4, %x6
	%t6 = fadd float %t5, %x7
	ret float %t6
	}
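The comment on reassociate_adds5 says the optimal balanced tree is not produced. As a rough illustration (not part of the patch), the sketch below counts dependency depth in adds for three shapes of an 8-input sum: the serial chain written in the IR, the accumulate-a-pair shape that the CHECK lines above describe, and a fully balanced tree. The helper names addDepth, serialDepth, pairwiseDepth and balancedDepth are hypothetical, and every add is assumed to cost one unit.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Depth (counted in adds) of a result whose operands have the given depths.
// Function arguments have depth 0. Unit cost per add is an assumption.
static int addDepth(int Lhs, int Rhs) { return std::max(Lhs, Rhs) + 1; }

// Serial chain as written in the IR: (((x0 + x1) + x2) + x3) + ...
static int serialDepth(int NumInputs) {
  int Acc = 0;
  for (int I = 1; I < NumInputs; ++I)
    Acc = addDepth(Acc, 0);
  return Acc;
}

// Shape of the CHECK lines: start with x0 + x1, then repeatedly add the
// sum of the next two inputs to the accumulator. Assumes an even count.
static int pairwiseDepth(int NumInputs) {
  int Acc = addDepth(0, 0);
  for (int I = 2; I + 1 < NumInputs; I += 2)
    Acc = addDepth(Acc, addDepth(0, 0));
  return Acc;
}

// Fully balanced tree: combine neighbours level by level.
// Assumes NumInputs >= 1.
static int balancedDepth(int NumInputs) {
  std::vector<int> Level(static_cast<std::size_t>(NumInputs), 0);
  while (Level.size() > 1) {
    std::vector<int> Next;
    for (std::size_t I = 0; I + 1 < Level.size(); I += 2)
      Next.push_back(addDepth(Level[I], Level[I + 1]));
    if (Level.size() % 2 != 0)
      Next.push_back(Level.back()); // odd element carries over unchanged
    Level = Next;
  }
  return Level[0];
}
```

Under these assumptions, serialDepth(8) is 7, pairwiseDepth(8) is 4 and balancedDepth(8) is 3, so the checked output is well ahead of the serial chain but one add deeper than a balanced tree.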

test/CodeGen/X86/machine-combiner.ll

				; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=avx -enable-unsafe-fp-math < %s | FileCheck %s

				; Verify that the first two adds are independent regardless of how the inputs are
				; commuted. The destination registers are used as source registers for the third add.

				define float @reassociate_adds1(float %x0, float %x1, float %x2, float %x3) {
				; CHECK-LABEL: reassociate_adds1:
				; CHECK: # BB#0:
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%t0 = fadd float %x0, %x1
				%t1 = fadd float %t0, %x2
				%t2 = fadd float %t1, %x3
				ret float %t2
				}

				define float @reassociate_adds2(float %x0, float %x1, float %x2, float %x3) {
				; CHECK-LABEL: reassociate_adds2:
				; CHECK: # BB#0:
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%t0 = fadd float %x0, %x1
				%t1 = fadd float %x2, %t0
				%t2 = fadd float %t1, %x3
				ret float %t2
				}

				define float @reassociate_adds3(float %x0, float %x1, float %x2, float %x3) {
				; CHECK-LABEL: reassociate_adds3:
				; CHECK: # BB#0:
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%t0 = fadd float %x0, %x1
				%t1 = fadd float %t0, %x2
				%t2 = fadd float %x3, %t1
				ret float %t2
				}

				define float @reassociate_adds4(float %x0, float %x1, float %x2, float %x3) {
				; CHECK-LABEL: reassociate_adds4:
				; CHECK: # BB#0:
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%t0 = fadd float %x0, %x1
				%t1 = fadd float %x2, %t0
				%t2 = fadd float %x3, %t1
				ret float %t2
				}

				; Verify that we reassociate some of these ops. The optimal balanced tree of adds is not
				; produced because that would cost more compile time.

				define float @reassociate_adds5(float %x0, float %x1, float %x2, float %x3, float %x4, float %x5, float %x6, float %x7) {
				; CHECK-LABEL: reassociate_adds5:
				; CHECK: # BB#0:
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm5, %xmm4, %xmm1
				; CHECK-NEXT: vaddss %xmm6, %xmm1, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm7, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%t0 = fadd float %x0, %x1
				%t1 = fadd float %t0, %x2
				%t2 = fadd float %t1, %x3
				%t3 = fadd float %t2, %x4
				%t4 = fadd float %t3, %x5
				%t5 = fadd float %t4, %x6
				%t6 = fadd float %t5, %x7
				ret float %t6
				}

				; Verify that we only need two associative operations to reassociate the operands.
				; Also, we should reassociate such that the result of the high latency division
				; is used by the final 'add' rather than reassociating the %x3 operand with the
				; division. The latter reassociation would not improve anything.

				define float @reassociate_adds6(float %x0, float %x1, float %x2, float %x3) {
				; CHECK-LABEL: reassociate_adds6:
				; CHECK: # BB#0:
				; CHECK-NEXT: vdivss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: vaddss %xmm3, %xmm2, %xmm1
				; CHECK-NEXT: vaddss %xmm1, %xmm0, %xmm0
				; CHECK-NEXT: retq
				%t0 = fdiv float %x0, %x1
				%t1 = fadd float %x2, %t0
				%t2 = fadd float %x3, %t1
				ret float %t2
				}
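
To make the reasoning behind reassociate_adds6 concrete, here is a small sketch (not part of the patch) that compares the critical path of the three orderings discussed in the summary. The latencies are made-up placeholders, not values from any x86 scheduling model, and the helper names addDone, originalDepth, swappedDepth and reassociatedDepth are hypothetical.

```cpp
#include <algorithm>

// Hypothetical latencies in cycles, chosen only for illustration.
static const int DivLatency = 13;
static const int AddLatency = 3;

// Cycle at which an add completes, given when each input becomes ready.
static int addDone(int LhsReady, int RhsReady) {
  return std::max(LhsReady, RhsReady) + AddLatency;
}

// Original sequence: A = M div N; B = A add X; C = B add Y
static int originalDepth() {
  int A = DivLatency;    // M and N are ready at cycle 0
  int B = addDone(A, 0); // X is ready at cycle 0
  return addDone(B, 0);  // Y is ready at cycle 0
}

// Matched but unprofitable: A = M div N; B = A add Y; C = B add X
static int swappedDepth() {
  int A = DivLatency;
  int B = addDone(A, 0);
  return addDone(B, 0);
}

// Profitable: A = M div N; B = X add Y; C = B add A
static int reassociatedDepth() {
  int A = DivLatency;
  int B = addDone(0, 0); // independent of the divide
  return addDone(B, A);
}
```

With these placeholder numbers the original and swapped orderings both finish at cycle 19, while the reassociated ordering finishes at cycle 16 because the X add Y result is already available when the divide completes. That is the distinction the combiner has to draw when it rejects one matched pattern and selects the other.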