This is an archive of the discontinued LLVM Phabricator instance.


Differential D41766

[MachineCombiner] Add check for optimal pattern order.
ClosedPublic

Authored by fhahn on Jan 5 2018, 8:41 AM.

Details

Reviewers

Gerolf
spatel
mssimpso

Commits

rGc68428b5dc47: [MachineCombiner] Add check for optimal pattern order.
rL323873: [MachineCombiner] Add check for optimal pattern order.

Summary

In D41587, @mssimpso discovered that the order of some patterns for
AArch64 was sub-optimal. I thought a bit about how we could avoid that
case in the future. I do not think there is a need for evaluating all
patterns for now. But this patch adds an extra (expensive) check, that
evaluates the latencies of all patterns, and ensures that the latency
saved decreases for subsequent patterns.

This catches the sub-optimal order fixed in D41587, but I am not
entirely happy with the check, as it only applies to sub-optimal
patterns seen while building with EXPENSIVE_CHECKS on. It did not
discover any other sub-optimal pattern ordering.

Do you think that additional check would be useful?
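The ordering check described in the summary can be illustrated in isolation. The following is a minimal standalone sketch, not the code from the patch: `PatternCost`, `firstMisorderedPattern`, and `verifyOrder` are hypothetical names, and the latencies are assumed to be precomputed rather than taken from a scheduling model.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical per-pattern cost record (not an LLVM type).
struct PatternCost {
  unsigned RootLatency;    // summed latency of the instructions to delete
  unsigned NewRootLatency; // summed latency of the instructions to insert
};

// Returns the index of the first pattern that saves more latency than the
// pattern before it, or Costs.size() if the savings never increase.
std::size_t firstMisorderedPattern(const std::vector<PatternCost> &Costs) {
  long PrevDiff = std::numeric_limits<long>::max();
  for (std::size_t I = 0; I < Costs.size(); ++I) {
    long CurDiff = static_cast<long>(Costs[I].RootLatency) -
                   static_cast<long>(Costs[I].NewRootLatency);
    if (CurDiff > PrevDiff)
      return I;
    PrevDiff = CurDiff;
  }
  return Costs.size();
}

// Assert-style wrapper, mirroring how such a check would be used in a build
// with assertions enabled.
void verifyOrder(const std::vector<PatternCost> &Costs) {
  assert(firstMisorderedPattern(Costs) == Costs.size() &&
         "Current pattern is better than previous pattern.");
  (void)Costs; // avoids an unused-parameter warning when NDEBUG is defined
}
```

Equal savings for consecutive patterns are accepted; only a strictly larger saving later in the list is flagged.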

Diff Detail

Repository: rL LLVM

Event Timeline

fhahn created this revision. Jan 5 2018, 8:41 AM

Herald added subscribers: kristof.beyls, javed.absar, aemerson. Jan 5 2018, 8:41 AM

Hi Florian,

It's hard to say how useful this patch will be, but I wouldn't mind having it under EXPENSIVE_CHECKS. Have you looked at targets other than Arm/AArch64?

test/CodeGen/AArch64/aarch64-combine-fmul-fsub.mir
29–30 ↗	(On Diff #128752)	These tests don't need changing because of this patch, right? But you can go ahead and commit the test changes separately if you need to.

TBH I was hoping for a more complete approach to testing this, but I thought I'd share this relatively straightforward check.

I've run the LLVM test suite with this patch (without the EXPENSIVE_CHECKS guard) on AArch64 with -O3 -ffast-math and various mcpu options. I also ran the patch on a single X86 machine with -mcpu=native. Beyond that, I did not run this patch on any other targets (except all LLVM unit tests with this change).

test/CodeGen/AArch64/aarch64-combine-fmul-fsub.mir
29–30 ↗	(On Diff #128752)	Unfortunately they do. Some patterns, at least on AArch64, add additional virtual registers when the pattern is instantiated. Unused virtual registers should be dropped by later passes, but they change the register numbers in this test case.

In D41766#972219, @fhahn wrote:

TBH I was hoping for a more complete approach to testing this, but I thought I'd share this relatively straightforward check.

I'm fine with that. We can always improve it later.

I've run the LLVM test suite with this patch (without the EXPENSIVE_CHECKS guard) on AArch64 with -O3 -ffast-math and various mcpu options. I also ran the patch on a single X86 machine with -mcpu=native. Beyond that, I did not run this patch on any other targets (except all LLVM unit tests with this change).

OK, sounds good. I'm just trying to avoid the case where some in-tree target is intentionally adding the patterns in a way that doesn't increase expected latency. Though if that's the case, they might want to know.

This seems useful enough to me to commit.

test/CodeGen/AArch64/aarch64-combine-fmul-fsub.mir
29–30 ↗	(On Diff #128752)	Ahh, that makes sense. Thanks!

This revision is now accepted and ready to land. Jan 11 2018, 12:56 PM

Thanks Matthew! I'll hold off on committing this until next week, in case @Gerolf has some additional thoughts.

I'll commit this in the next couple of days. @Gerolf do you have any additional thoughts/ideas?

I like the spirit of the idea. What made you look into this? Some more questions and suggestions below, but LGTM as is.

Cheers
Gerolf

lib/CodeGen/MachineCombiner.cpp
485 ↗	(On Diff #128752)	Could that go into a separate function? You might want to control it by an internal flag rather than a compile-time define.
506 ↗	(On Diff #128752)	You might want to add a dumping capability that prints the patterns in sorted order. Did you encounter scenarios where different orderings of patterns (other than the default) gave improvements in some cases, but not others? If these cases can be tied to some characteristic of the program, it might be possible to pass this information on to getMachineCombinerPatterns() and return a different order.

Thanks Gerolf! The main reason for adding this is that @mssimpso pointed out in D41587 that some of the recently added patterns were not optimally ordered for at least Falkor and I wanted to give us a better chance at detecting sub-optimal orderings automatically.

lib/CodeGen/MachineCombiner.cpp
485 ↗	(On Diff #128752)	Done! The option is enabled with expensive checks. I've also enabled the option in a couple of machine-combiner tests.
506 ↗	(On Diff #128752)	I could add dumping of the sorted patterns, but I think it would be better to do it in a separate commit. I did not find any cases where a different order gave improvements. It would still be worth running this verification on larger benchmarks (e.g. SPEC2017), which should be easy with the new option.

Closed by commit rL323873: [MachineCombiner] Add check for optimal pattern order. (authored by fhahn). Jan 31 2018, 5:56 AM

This revision was automatically updated to reflect the committed changes.

Hi @fhahn
We are testing AArch64 builds with EXPENSIVE_CHECK=ON, and SPEC 2017 and the LLVM test-suite are failing with the assert in this patch.

assert(CurrentLatencyDiff <= PrevLatencyDiff &&
           "Current pattern is better than previous pattern.");

In addition to the above benchmarks, there are a couple of other internal benchmarks failing too. Here is a small reproducer we got with creduce.

long a, b, c, d;
void f() {
  while (!0) {
    long e = c * b - d * a;
    if (e < 7)
      break;
  }
}

Command-line: clang  -O2 reduced.c

I have a high-level question about this patch. I understand that the verify function aims to check that patterns appear in a certain order, but insertion is not guaranteed to happen in that order.

Does it make sense to have this verification in the first place?

FWIW, the following patch would fix the issue but it is not a scalable approach:

host:~/build_release$ git diff ../llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp b/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
index d49b82290ff3..ed1ccb7b7484 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
@@ -5143,8 +5143,8 @@ static bool getMaddPatterns(MachineInstr &Root,
     setFound(AArch64::MADDWrrr, 2, AArch64::WZR, MCP::MULSUBW_OP2);
     break;
   case AArch64::SUBXrr:
-    setFound(AArch64::MADDXrrr, 1, AArch64::XZR, MCP::MULSUBX_OP1);
     setFound(AArch64::MADDXrrr, 2, AArch64::XZR, MCP::MULSUBX_OP2);
+    setFound(AArch64::MADDXrrr, 1, AArch64::XZR, MCP::MULSUBX_OP1);
     break;
   case AArch64::ADDWri:
     setFound(AArch64::MADDWrrr, 1, AArch64::WZR, MCP::MULADDWI_OP1);

I can file a GitHub issue if you prefer.

Herald added projects: Restricted Project, Restricted Project. Jul 3 2023, 11:56 PM

Herald added subscribers: StephenFan, pengfei.

I have a high-level question about this patch. I understand that the verify function aims to check that patterns appear in a certain order, but insertion is not guaranteed to happen in that order.

Does it make sense to have this verification in the first place?

The machine combiner applies the patterns in order, so the goal is to ensure that the most profitable patterns are processed first. Did you encounter cases where it wasn't possible to fix the insertion order?
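To illustrate why the order matters when patterns are tried one after another, here is a small standalone sketch of first-fit selection. It is a simplification under stated assumptions, not the pass itself: `Candidate`, `pickFirstProfitable`, and `orderingExample` are hypothetical names, and profitability is reduced to a single boolean.

```cpp
#include <cassert>
#include <vector>

// Hypothetical candidate record (not an LLVM type).
struct Candidate {
  int Id;            // illustrative pattern identifier
  long LatencySaved; // old latency minus new latency
  bool Profitable;   // whether the profitability checks accept it
};

// First-fit selection: walk the candidates in the order given and return the
// Id of the first profitable one, or -1 if none qualifies.
int pickFirstProfitable(const std::vector<Candidate> &Ordered) {
  for (const Candidate &C : Ordered)
    if (C.Profitable)
      return C.Id;
  return -1;
}

// With the weaker pattern listed first, first-fit never reaches the better one.
void orderingExample() {
  std::vector<Candidate> WeakerFirst = {{1, 2, true}, {2, 4, true}};
  std::vector<Candidate> BetterFirst = {{2, 4, true}, {1, 2, true}};
  assert(pickFirstProfitable(WeakerFirst) == 1);
  assert(pickFirstProfitable(BetterFirst) == 2);
}
```

Both lists contain the same two candidates; only the order decides which one is applied.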

In D41766#4474557, @fhahn wrote:

I have a high-level question about this patch. I understand that the verify function aims to check that patterns appear in a certain order, but insertion is not guaranteed to happen in that order.

Does it make sense to have this verification in the first place?

The machine combiner applies the patterns in order, so the goal is to ensure that the most profitable patterns are processed first. Did you encounter cases where it wasn't possible to fix the insertion order?

Not so far, but I have one case where we had to track the problem down meticulously to fix the order, and that is time-consuming each time. It is not possible to make sure that we always do it. EXPENSIVE_CHECK builds are not very well tested, as I couldn't find a build bot for the AArch64 platform (I might be wrong if one was added recently).

If the goal is to make sure we apply patterns in a certain order, then we should sort them based on latency. Verifying without enforcing the insertion order is risky, and we have more than just tests failing. Moreover, at the insertion point we have no way of knowing whether a pattern is being inserted in the required order.

Please let me know if I am missing something here.
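The sorting alternative raised in this comment could look roughly like the following. This is only a sketch of the suggestion, not something the patch implements: `ScoredPattern`, `orderBySaving`, and `sortingExample` are hypothetical names, and the latency saving of each pattern is assumed to be known up front.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical pairing of a pattern identifier with its estimated saving
// (not an LLVM type).
struct ScoredPattern {
  int Id;
  long LatencySaved; // old latency minus new latency
};

// Returns the pattern Ids ordered by decreasing latency saving. A stable sort
// keeps the target's original order among patterns with equal savings.
std::vector<int> orderBySaving(std::vector<ScoredPattern> Scored) {
  std::stable_sort(Scored.begin(), Scored.end(),
                   [](const ScoredPattern &A, const ScoredPattern &B) {
                     return A.LatencySaved > B.LatencySaved;
                   });
  std::vector<int> Ids;
  Ids.reserve(Scored.size());
  for (const ScoredPattern &S : Scored)
    Ids.push_back(S.Id);
  return Ids;
}

// Pattern 2 saves the most and moves to the front; 1 and 3 keep their order.
void sortingExample() {
  std::vector<int> Ids = orderBySaving({{1, 2}, {2, 4}, {3, 2}});
  assert((Ids == std::vector<int>{2, 1, 3}));
  (void)Ids; // avoids an unused-variable warning when NDEBUG is defined
}
```

Sorting removes the dependence on insertion order, at the cost of generating and scoring every pattern, which is the expense the original check confines to EXPENSIVE_CHECKS builds.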

Ping to @fhahn!

Revision Contents

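Path	Size
llvm/trunk/lib/CodeGen/MachineCombiner.cpp	98 lines
llvm/trunk/test/CodeGen/AArch64/aarch64-combine-fmul-fsub.mir	20 lines
llvm/trunk/test/CodeGen/AArch64/machine-combiner.ll	2 lines
llvm/trunk/test/CodeGen/AArch64/machine-combiner.mir	2 lines
llvm/trunk/test/CodeGen/X86/machine-combiner-int-vec.ll	4 lines
llvm/trunk/test/CodeGen/X86/machine-combiner-int.ll	4 lines
llvm/trunk/test/CodeGen/X86/machine-combiner.ll	4 lines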

Diff 132156

llvm/trunk/lib/CodeGen/MachineCombiner.cpp

[33 lines not shown]

STATISTIC(NumInstCombined, "Number of machineinst combined");		STATISTIC(NumInstCombined, "Number of machineinst combined");

static cl::opt<unsigned>		static cl::opt<unsigned>
inc_threshold("machine-combiner-inc-threshold", cl::Hidden,		inc_threshold("machine-combiner-inc-threshold", cl::Hidden,
cl::desc("Incremental depth computation will be used for basic "		cl::desc("Incremental depth computation will be used for basic "
"blocks with more instructions."), cl::init(500));		"blocks with more instructions."), cl::init(500));

		#ifdef EXPENSIVE_CHECKS
		static cl::opt<bool> VerifyPatternOrder(
		"machine-combiner-verify-pattern-order", cl::Hidden,
		cl::desc(
		"Verify that the generated patterns are ordered by increasing latency"),
		cl::init(true));
		#else
		static cl::opt<bool> VerifyPatternOrder(
		"machine-combiner-verify-pattern-order", cl::Hidden,
		cl::desc(
		"Verify that the generated patterns are ordered by increasing latency"),
		cl::init(false));
		#endif

namespace {		namespace {
class MachineCombiner : public MachineFunctionPass {		class MachineCombiner : public MachineFunctionPass {
const TargetInstrInfo *TII;		const TargetInstrInfo *TII;
const TargetRegisterInfo *TRI;		const TargetRegisterInfo *TRI;
MCSchedModel SchedModel;		MCSchedModel SchedModel;
MachineRegisterInfo *MRI;		MachineRegisterInfo *MRI;
MachineLoopInfo *MLI; // Current MachineLoopInfo		MachineLoopInfo *MLI; // Current MachineLoopInfo
MachineTraceMetrics *Traces;		MachineTraceMetrics *Traces;
[30 lines not shown]	improvesCriticalPathLen(MachineBasicBlock *MBB, MachineInstr *Root,
DenseMap<unsigned, unsigned> &InstrIdxForVirtReg,		DenseMap<unsigned, unsigned> &InstrIdxForVirtReg,
MachineCombinerPattern Pattern, bool SlackIsAccurate);		MachineCombinerPattern Pattern, bool SlackIsAccurate);
bool preservesResourceLen(MachineBasicBlock *MBB,		bool preservesResourceLen(MachineBasicBlock *MBB,
MachineTraceMetrics::Trace BlockTrace,		MachineTraceMetrics::Trace BlockTrace,
SmallVectorImpl<MachineInstr *> &InsInstrs,		SmallVectorImpl<MachineInstr *> &InsInstrs,
SmallVectorImpl<MachineInstr *> &DelInstrs);		SmallVectorImpl<MachineInstr *> &DelInstrs);
void instr2instrSC(SmallVectorImpl<MachineInstr *> &Instrs,		void instr2instrSC(SmallVectorImpl<MachineInstr *> &Instrs,
SmallVectorImpl<const MCSchedClassDesc *> &InstrsSC);		SmallVectorImpl<const MCSchedClassDesc *> &InstrsSC);
		std::pair<unsigned, unsigned>
		getLatenciesForInstrSequences(MachineInstr &MI,
		SmallVectorImpl<MachineInstr *> &InsInstrs,
		SmallVectorImpl<MachineInstr *> &DelInstrs,
		MachineTraceMetrics::Trace BlockTrace);

		void verifyPatternOrder(MachineBasicBlock *MBB, MachineInstr &Root,
		SmallVector<MachineCombinerPattern, 16> &Patterns);
};		};
}		}

char MachineCombiner::ID = 0;		char MachineCombiner::ID = 0;
char &llvm::MachineCombinerID = MachineCombiner::ID;		char &llvm::MachineCombinerID = MachineCombiner::ID;

INITIALIZE_PASS_BEGIN(MachineCombiner, DEBUG_TYPE,		INITIALIZE_PASS_BEGIN(MachineCombiner, DEBUG_TYPE,
"Machine InstCombiner", false, false)		"Machine InstCombiner", false, false)
[141 lines not shown]	static CombinerObjective getCombinerObjective(MachineCombinerPattern P) {
case MachineCombinerPattern::REASSOC_XA_BY:		case MachineCombinerPattern::REASSOC_XA_BY:
case MachineCombinerPattern::REASSOC_XA_YB:		case MachineCombinerPattern::REASSOC_XA_YB:
return CombinerObjective::MustReduceDepth;		return CombinerObjective::MustReduceDepth;
default:		default:
return CombinerObjective::Default;		return CombinerObjective::Default;
}		}
}		}

		/// Estimate the latency of the new and original instruction sequence by summing
		/// up the latencies of the inserted and deleted instructions. This assumes
		/// that the inserted and deleted instructions are dependent instruction chains,
		/// which might not hold in all cases.
		std::pair<unsigned, unsigned> MachineCombiner::getLatenciesForInstrSequences(
		MachineInstr &MI, SmallVectorImpl<MachineInstr *> &InsInstrs,
		SmallVectorImpl<MachineInstr *> &DelInstrs,
		MachineTraceMetrics::Trace BlockTrace) {
		assert(!InsInstrs.empty() && "Only support sequences that insert instrs.");
		unsigned NewRootLatency = 0;
		// NewRoot is the last instruction in the \p InsInstrs vector.
		MachineInstr *NewRoot = InsInstrs.back();
		for (unsigned i = 0; i < InsInstrs.size() - 1; i++)
		NewRootLatency += TSchedModel.computeInstrLatency(InsInstrs[i]);
		NewRootLatency += getLatency(&MI, NewRoot, BlockTrace);

		unsigned RootLatency = 0;
		for (auto I : DelInstrs)
		RootLatency += TSchedModel.computeInstrLatency(I);

		return {NewRootLatency, RootLatency};
		}

/// The DAGCombine code sequence ends in MI (Machine Instruction) Root.		/// The DAGCombine code sequence ends in MI (Machine Instruction) Root.
/// The new code sequence ends in MI NewRoot. A necessary condition for the new		/// The new code sequence ends in MI NewRoot. A necessary condition for the new
/// sequence to replace the old sequence is that it cannot lengthen the critical		/// sequence to replace the old sequence is that it cannot lengthen the critical
/// path. The definition of "improve" may be restricted by specifying that the		/// path. The definition of "improve" may be restricted by specifying that the
/// new path improves the data dependency chain (MustReduceDepth).		/// new path improves the data dependency chain (MustReduceDepth).
bool MachineCombiner::improvesCriticalPathLen(		bool MachineCombiner::improvesCriticalPathLen(
MachineBasicBlock *MBB, MachineInstr *Root,		MachineBasicBlock *MBB, MachineInstr *Root,
MachineTraceMetrics::Trace BlockTrace,		MachineTraceMetrics::Trace BlockTrace,
SmallVectorImpl<MachineInstr *> &InsInstrs,		SmallVectorImpl<MachineInstr *> &InsInstrs,
SmallVectorImpl<MachineInstr *> &DelInstrs,		SmallVectorImpl<MachineInstr *> &DelInstrs,
DenseMap<unsigned, unsigned> &InstrIdxForVirtReg,		DenseMap<unsigned, unsigned> &InstrIdxForVirtReg,
MachineCombinerPattern Pattern,		MachineCombinerPattern Pattern,
bool SlackIsAccurate) {		bool SlackIsAccurate) {
assert(TSchedModel.hasInstrSchedModelOrItineraries() &&		assert(TSchedModel.hasInstrSchedModelOrItineraries() &&
"Missing machine model\n");		"Missing machine model\n");
// NewRoot is the last instruction in the \p InsInstrs vector.
unsigned NewRootIdx = InsInstrs.size() - 1;
MachineInstr *NewRoot = InsInstrs[NewRootIdx];

// Get depth and latency of NewRoot and Root.		// Get depth and latency of NewRoot and Root.
unsigned NewRootDepth = getDepth(InsInstrs, InstrIdxForVirtReg, BlockTrace);		unsigned NewRootDepth = getDepth(InsInstrs, InstrIdxForVirtReg, BlockTrace);
unsigned RootDepth = BlockTrace.getInstrCycles(*Root).Depth;		unsigned RootDepth = BlockTrace.getInstrCycles(*Root).Depth;

DEBUG(dbgs() << "DEPENDENCE DATA FOR " << *Root << "\n";		DEBUG(dbgs() << "DEPENDENCE DATA FOR " << *Root << "\n";
dbgs() << " NewRootDepth: " << NewRootDepth << "\n";		dbgs() << " NewRootDepth: " << NewRootDepth << "\n";
dbgs() << " RootDepth: " << RootDepth << "\n");		dbgs() << " RootDepth: " << RootDepth << "\n");

// For a transform such as reassociation, the cost equation is		// For a transform such as reassociation, the cost equation is
// conservatively calculated so that we must improve the depth (data		// conservatively calculated so that we must improve the depth (data
// dependency cycles) in the critical path to proceed with the transform.		// dependency cycles) in the critical path to proceed with the transform.
// Being conservative also protects against inaccuracies in the underlying		// Being conservative also protects against inaccuracies in the underlying
// machine trace metrics and CPU models.		// machine trace metrics and CPU models.
if (getCombinerObjective(Pattern) == CombinerObjective::MustReduceDepth)		if (getCombinerObjective(Pattern) == CombinerObjective::MustReduceDepth)
return NewRootDepth < RootDepth;		return NewRootDepth < RootDepth;

// A more flexible cost calculation for the critical path includes the slack		// A more flexible cost calculation for the critical path includes the slack
// of the original code sequence. This may allow the transform to proceed		// of the original code sequence. This may allow the transform to proceed
// even if the instruction depths (data dependency cycles) become worse.		// even if the instruction depths (data dependency cycles) become worse.

// Account for the latency of the inserted and deleted instructions by		// Account for the latency of the inserted and deleted instructions by
// adding up their latencies. This assumes that the inserted and deleted		unsigned NewRootLatency, RootLatency;
// instructions are dependent instruction chains, which might not hold		std::tie(NewRootLatency, RootLatency) =
// in all cases.		getLatenciesForInstrSequences(*Root, InsInstrs, DelInstrs, BlockTrace);
unsigned NewRootLatency = 0;
for (unsigned i = 0; i < InsInstrs.size() - 1; i++)
NewRootLatency += TSchedModel.computeInstrLatency(InsInstrs[i]);
NewRootLatency += getLatency(Root, NewRoot, BlockTrace);

unsigned RootLatency = 0;
for (auto I : DelInstrs)
RootLatency += TSchedModel.computeInstrLatency(I);

unsigned RootSlack = BlockTrace.getInstrSlack(*Root);		unsigned RootSlack = BlockTrace.getInstrSlack(*Root);
unsigned NewCycleCount = NewRootDepth + NewRootLatency;		unsigned NewCycleCount = NewRootDepth + NewRootLatency;
unsigned OldCycleCount = RootDepth + RootLatency +		unsigned OldCycleCount = RootDepth + RootLatency +
(SlackIsAccurate ? RootSlack : 0);		(SlackIsAccurate ? RootSlack : 0);
DEBUG(dbgs() << " NewRootLatency: " << NewRootLatency << "\n";		DEBUG(dbgs() << " NewRootLatency: " << NewRootLatency << "\n";
dbgs() << " RootLatency: " << RootLatency << "\n";		dbgs() << " RootLatency: " << RootLatency << "\n";
dbgs() << " RootSlack: " << RootSlack << " SlackIsAccurate="		dbgs() << " RootSlack: " << RootSlack << " SlackIsAccurate="
[100 lines not shown]	if (IncrementalUpdate)
for (auto *InstrPtr : InsInstrs)		for (auto *InstrPtr : InsInstrs)
MinInstr->updateDepth(MBB, *InstrPtr, RegUnits);		MinInstr->updateDepth(MBB, *InstrPtr, RegUnits);
else		else
MinInstr->invalidate(MBB);		MinInstr->invalidate(MBB);

NumInstCombined++;		NumInstCombined++;
}		}

		// Check that the difference between original and new latency is decreasing for
		// later patterns. This helps to discover sub-optimal pattern orderings.
		void MachineCombiner::verifyPatternOrder(
		MachineBasicBlock *MBB, MachineInstr &Root,
		SmallVector<MachineCombinerPattern, 16> &Patterns) {
		long PrevLatencyDiff = std::numeric_limits<long>::max();
		for (auto P : Patterns) {
		SmallVector<MachineInstr *, 16> InsInstrs;
		SmallVector<MachineInstr *, 16> DelInstrs;
		DenseMap<unsigned, unsigned> InstrIdxForVirtReg;
		TII->genAlternativeCodeSequence(Root, P, InsInstrs, DelInstrs,
		InstrIdxForVirtReg);
		// Found pattern, but did not generate alternative sequence.
		// This can happen e.g. when an immediate could not be materialized
		// in a single instruction.
		if (InsInstrs.empty() || !TSchedModel.hasInstrSchedModelOrItineraries())
		continue;

		unsigned NewRootLatency, RootLatency;
		std::tie(NewRootLatency, RootLatency) = getLatenciesForInstrSequences(
		Root, InsInstrs, DelInstrs, MinInstr->getTrace(MBB));
		long CurrentLatencyDiff = ((long)RootLatency) - ((long)NewRootLatency);
		assert(CurrentLatencyDiff <= PrevLatencyDiff &&
		"Current pattern is better than previous pattern.");
		PrevLatencyDiff = CurrentLatencyDiff;
		}
		}

/// Substitute a slow code sequence with a faster one by		/// Substitute a slow code sequence with a faster one by
/// evaluating instruction combining pattern.		/// evaluating instruction combining pattern.
/// The prototype of such a pattern is MUl + ADD -> MADD. Performs instruction		/// The prototype of such a pattern is MUl + ADD -> MADD. Performs instruction
/// combining based on machine trace metrics. Only combine a sequence of		/// combining based on machine trace metrics. Only combine a sequence of
/// instructions when this neither lengthens the critical path nor increases		/// instructions when this neither lengthens the critical path nor increases
/// resource pressure. When optimizing for codesize always combine when the new		/// resource pressure. When optimizing for codesize always combine when the new
/// sequence is shorter.		/// sequence is shorter.
bool MachineCombiner::combineInstructions(MachineBasicBlock *MBB) {		bool MachineCombiner::combineInstructions(MachineBasicBlock *MBB) {
[34 lines not shown]	while (BlockIter != MBB->end()) {
// pattern. Then for each pattern the new code sequence in form of MI is		// pattern. Then for each pattern the new code sequence in form of MI is
// generated and evaluated. When the efficiency criteria (don't lengthen		// generated and evaluated. When the efficiency criteria (don't lengthen
// critical path, don't use more resources) is met the new sequence gets		// critical path, don't use more resources) is met the new sequence gets
// hooked up into the basic block before the old sequence is removed.		// hooked up into the basic block before the old sequence is removed.
//		//
// The algorithm does not try to evaluate all patterns and pick the best.		// The algorithm does not try to evaluate all patterns and pick the best.
// This is only an artificial restriction though. In practice there is		// This is only an artificial restriction though. In practice there is
// mostly one pattern, and getMachineCombinerPatterns() can order patterns		// mostly one pattern, and getMachineCombinerPatterns() can order patterns
// based on an internal cost heuristic.		// based on an internal cost heuristic. If
		// machine-combiner-verify-pattern-order is enabled, all patterns are
		// checked to ensure later patterns do not provide better latency savings.

if (!TII->getMachineCombinerPatterns(MI, Patterns))		if (!TII->getMachineCombinerPatterns(MI, Patterns))
continue;		continue;

		if (VerifyPatternOrder)
		verifyPatternOrder(MBB, MI, Patterns);

for (auto P : Patterns) {		for (auto P : Patterns) {
SmallVector<MachineInstr *, 16> InsInstrs;		SmallVector<MachineInstr *, 16> InsInstrs;
SmallVector<MachineInstr *, 16> DelInstrs;		SmallVector<MachineInstr *, 16> DelInstrs;
DenseMap<unsigned, unsigned> InstrIdxForVirtReg;		DenseMap<unsigned, unsigned> InstrIdxForVirtReg;
TII->genAlternativeCodeSequence(MI, P, InsInstrs, DelInstrs,		TII->genAlternativeCodeSequence(MI, P, InsInstrs, DelInstrs,
InstrIdxForVirtReg);		InstrIdxForVirtReg);
unsigned NewInstCount = InsInstrs.size();		unsigned NewInstCount = InsInstrs.size();
unsigned OldInstCount = DelInstrs.size();		unsigned OldInstCount = DelInstrs.size();
[92 lines not shown]

llvm/trunk/test/CodeGen/AArch64/aarch64-combine-fmul-fsub.mir

	# RUN: llc -run-pass=machine-combiner -o - -mtriple=aarch64-unknown-linux -mcpu=cortex-a57 -enable-unsafe-fp-math %s | FileCheck --check-prefixes=UNPROFITABLE,ALL %s			# RUN: llc -run-pass=machine-combiner -o - -mtriple=aarch64-unknown-linux -mcpu=cortex-a57 -enable-unsafe-fp-math -machine-combiner-verify-pattern-order=true %s | FileCheck --check-prefixes=UNPROFITABLE,ALL %s
	# RUN: llc -run-pass=machine-combiner -o - -mtriple=aarch64-unknown-linux -mcpu=falkor -enable-unsafe-fp-math %s | FileCheck --check-prefixes=PROFITABLE,ALL %s			# RUN: llc -run-pass=machine-combiner -o - -mtriple=aarch64-unknown-linux -mcpu=falkor -enable-unsafe-fp-math %s -machine-combiner-verify-pattern-order=true | FileCheck --check-prefixes=PROFITABLE,ALL %s
	# RUN: llc -run-pass=machine-combiner -o - -mtriple=aarch64-unknown-linux -mcpu=exynos-m1 -enable-unsafe-fp-math %s | FileCheck --check-prefixes=PROFITABLE,ALL %s			# RUN: llc -run-pass=machine-combiner -o - -mtriple=aarch64-unknown-linux -mcpu=exynos-m1 -enable-unsafe-fp-math -machine-combiner-verify-pattern-order=true %s | FileCheck --check-prefixes=PROFITABLE,ALL %s
	# RUN: llc -run-pass=machine-combiner -o - -mtriple=aarch64-unknown-linux -mcpu=thunderx2t99 -enable-unsafe-fp-math %s | FileCheck --check-prefixes=PROFITABLE,ALL %s			# RUN: llc -run-pass=machine-combiner -o - -mtriple=aarch64-unknown-linux -mcpu=thunderx2t99 -enable-unsafe-fp-math -machine-combiner-verify-pattern-order=true %s | FileCheck --check-prefixes=PROFITABLE,ALL %s
	#			#
	name: f1_2s			name: f1_2s
	registers:			registers:
	- { id: 0, class: fpr64 }			- { id: 0, class: fpr64 }
	- { id: 1, class: fpr64 }			- { id: 1, class: fpr64 }
	- { id: 2, class: fpr64 }			- { id: 2, class: fpr64 }
	- { id: 3, class: fpr64 }			- { id: 3, class: fpr64 }
	- { id: 4, class: fpr64 }			- { id: 4, class: fpr64 }
	body: |			body: |
	bb.0.entry:			bb.0.entry:
	%2:fpr64 = COPY %d2			%2:fpr64 = COPY %d2
	%1:fpr64 = COPY %d1			%1:fpr64 = COPY %d1
	%0:fpr64 = COPY %d0			%0:fpr64 = COPY %d0
	%3:fpr64 = FMULv2f32 %0, %1			%3:fpr64 = FMULv2f32 %0, %1
	%4:fpr64 = FSUBv2f32 killed %3, %2			%4:fpr64 = FSUBv2f32 killed %3, %2
	%d0 = COPY %4			%d0 = COPY %4
	RET_ReallyLR implicit %d0			RET_ReallyLR implicit %d0

	...			...
	# UNPROFITABLE-LABEL: name: f1_2s			# UNPROFITABLE-LABEL: name: f1_2s
	# UNPROFITABLE: %3:fpr64 = FMULv2f32 %0, %1			# UNPROFITABLE: %3:fpr64 = FMULv2f32 %0, %1
	# UNPROFITABLE-NEXT: FSUBv2f32 killed %3, %2			# UNPROFITABLE-NEXT: FSUBv2f32 killed %3, %2
	#			#
	# PROFITABLE-LABEL: name: f1_2s			# PROFITABLE-LABEL: name: f1_2s
	# PROFITABLE: %5:fpr64 = FNEGv2f32 %2			# PROFITABLE: [[R1:%[0-9]+]]:fpr64 = FNEGv2f32 %2
	# PROFITABLE-NEXT: FMLAv2f32 killed %5, %0, %1			# PROFITABLE-NEXT: FMLAv2f32 killed [[R1]], %0, %1
	---			---
	name: f1_4s			name: f1_4s
	registers:			registers:
	- { id: 0, class: fpr128 }			- { id: 0, class: fpr128 }
	- { id: 1, class: fpr128 }			- { id: 1, class: fpr128 }
	- { id: 2, class: fpr128 }			- { id: 2, class: fpr128 }
	- { id: 3, class: fpr128 }			- { id: 3, class: fpr128 }
	- { id: 4, class: fpr128 }			- { id: 4, class: fpr128 }
	body: |			body: |
	bb.0.entry:			bb.0.entry:
	%2:fpr128 = COPY %q2			%2:fpr128 = COPY %q2
	%1:fpr128 = COPY %q1			%1:fpr128 = COPY %q1
	%0:fpr128 = COPY %q0			%0:fpr128 = COPY %q0
	%3:fpr128 = FMULv4f32 %0, %1			%3:fpr128 = FMULv4f32 %0, %1
	%4:fpr128 = FSUBv4f32 killed %3, %2			%4:fpr128 = FSUBv4f32 killed %3, %2
	%q0 = COPY %4			%q0 = COPY %4
	RET_ReallyLR implicit %q0			RET_ReallyLR implicit %q0

	...			...
	# UNPROFITABLE-LABEL: name: f1_4s			# UNPROFITABLE-LABEL: name: f1_4s
	# UNPROFITABLE: %3:fpr128 = FMULv4f32 %0, %1			# UNPROFITABLE: %3:fpr128 = FMULv4f32 %0, %1
	# UNPROFITABLE-NEXT: FSUBv4f32 killed %3, %2			# UNPROFITABLE-NEXT: FSUBv4f32 killed %3, %2
	#			#
	# PROFITABLE-LABEL: name: f1_4s			# PROFITABLE-LABEL: name: f1_4s
	# PROFITABLE: %5:fpr128 = FNEGv4f32 %2			# PROFITABLE: [[R1:%[0-9]+]]:fpr128 = FNEGv4f32 %2
	# PROFITABLE-NEXT: FMLAv4f32 killed %5, %0, %1			# PROFITABLE-NEXT: FMLAv4f32 killed [[R1]], %0, %1
	---			---
	name: f1_2d			name: f1_2d
	registers:			registers:
	- { id: 0, class: fpr128 }			- { id: 0, class: fpr128 }
	- { id: 1, class: fpr128 }			- { id: 1, class: fpr128 }
	- { id: 2, class: fpr128 }			- { id: 2, class: fpr128 }
	- { id: 3, class: fpr128 }			- { id: 3, class: fpr128 }
	- { id: 4, class: fpr128 }			- { id: 4, class: fpr128 }
	body: |			body: |
	bb.0.entry:			bb.0.entry:
	%2:fpr128 = COPY %q2			%2:fpr128 = COPY %q2
	%1:fpr128 = COPY %q1			%1:fpr128 = COPY %q1
	%0:fpr128 = COPY %q0			%0:fpr128 = COPY %q0
	%3:fpr128 = FMULv2f64 %0, %1			%3:fpr128 = FMULv2f64 %0, %1
	%4:fpr128 = FSUBv2f64 killed %3, %2			%4:fpr128 = FSUBv2f64 killed %3, %2
	%q0 = COPY %4			%q0 = COPY %4
	RET_ReallyLR implicit %q0			RET_ReallyLR implicit %q0

	...			...
	# UNPROFITABLE-LABEL: name: f1_2d			# UNPROFITABLE-LABEL: name: f1_2d
	# UNPROFITABLE: %3:fpr128 = FMULv2f64 %0, %1			# UNPROFITABLE: %3:fpr128 = FMULv2f64 %0, %1
	# UNPROFITABLE-NEXT: FSUBv2f64 killed %3, %2			# UNPROFITABLE-NEXT: FSUBv2f64 killed %3, %2
	#			#
	# PROFITABLE-LABEL: name: f1_2d			# PROFITABLE-LABEL: name: f1_2d
	# PROFITABLE: %5:fpr128 = FNEGv2f64 %2			# PROFITABLE: [[R1:%[0-9]+]]:fpr128 = FNEGv2f64 %2
	# PROFITABLE-NEXT: FMLAv2f64 killed %5, %0, %1			# PROFITABLE-NEXT: FMLAv2f64 killed [[R1]], %0, %1
	---			---
	name: f1_both_fmul_2s			name: f1_both_fmul_2s
	registers:			registers:
	- { id: 0, class: fpr64 }			- { id: 0, class: fpr64 }
	- { id: 1, class: fpr64 }			- { id: 1, class: fpr64 }
	- { id: 2, class: fpr64 }			- { id: 2, class: fpr64 }
	- { id: 3, class: fpr64 }			- { id: 3, class: fpr64 }
	- { id: 4, class: fpr64 }			- { id: 4, class: fpr64 }
	[71 lines not shown]

llvm/trunk/test/CodeGen/AArch64/machine-combiner.ll

	; RUN: llc -mtriple=aarch64-gnu-linux -mcpu=cortex-a57 -enable-unsafe-fp-math -disable-post-ra < %s | FileCheck %s			; RUN: llc -mtriple=aarch64-gnu-linux -mcpu=cortex-a57 -enable-unsafe-fp-math -disable-post-ra < %s | FileCheck %s

	; Incremental updates of the instruction depths should be enough for this test			; Incremental updates of the instruction depths should be enough for this test
	; case.			; case.
	; RUN: llc -mtriple=aarch64-gnu-linux -mcpu=cortex-a57 -enable-unsafe-fp-math \			; RUN: llc -mtriple=aarch64-gnu-linux -mcpu=cortex-a57 -enable-unsafe-fp-math \
	; RUN: -disable-post-ra -machine-combiner-inc-threshold=0 < %s | FileCheck %s			; RUN: -disable-post-ra -machine-combiner-inc-threshold=0 -machine-combiner-verify-pattern-order=true < %s | FileCheck %s

	; Verify that the first two adds are independent regardless of how the inputs are			; Verify that the first two adds are independent regardless of how the inputs are
	; commuted. The destination registers are used as source registers for the third add.			; commuted. The destination registers are used as source registers for the third add.

	define float @reassociate_adds1(float %x0, float %x1, float %x2, float %x3) {			define float @reassociate_adds1(float %x0, float %x1, float %x2, float %x3) {
	; CHECK-LABEL: reassociate_adds1:			; CHECK-LABEL: reassociate_adds1:
	; CHECK: fadd s0, s0, s1			; CHECK: fadd s0, s0, s1
	; CHECK-NEXT: fadd s1, s2, s3			; CHECK-NEXT: fadd s1, s2, s3
	[249 lines not shown]

llvm/trunk/test/CodeGen/AArch64/machine-combiner.mir

	# RUN: llc -mtriple=aarch64-none-linux-gnu -mcpu=cortex-a57 -enable-unsafe-fp-math \			# RUN: llc -mtriple=aarch64-none-linux-gnu -mcpu=cortex-a57 -enable-unsafe-fp-math \
	# RUN: -run-pass machine-combiner -machine-combiner-inc-threshold=0 \			# RUN: -run-pass machine-combiner -machine-combiner-inc-threshold=0 \
	# RUN: -verify-machineinstrs -o - %s | FileCheck %s			# RUN: -machine-combiner-verify-pattern-order=true -verify-machineinstrs -o - %s | FileCheck %s
	---			---
	# Test incremental depth updates succeed when triggered after the removal of			# Test incremental depth updates succeed when triggered after the removal of
	# the first instruction in a basic block.			# the first instruction in a basic block.

	# CHECK-LABEL: name: inc_update_iterator_test			# CHECK-LABEL: name: inc_update_iterator_test
	name: inc_update_iterator_test			name: inc_update_iterator_test
	registers:			registers:
	- { id: 0, class: fpr64 }			- { id: 0, class: fpr64 }

llvm/trunk/test/CodeGen/X86/machine-combiner-int-vec.ll

	; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=sse2 < %s | FileCheck %s --check-prefix=SSE			; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=sse2 -machine-combiner-verify-pattern-order=true < %s | FileCheck %s --check-prefix=SSE
	; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=avx2 < %s | FileCheck %s --check-prefix=AVX			; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=avx2 -machine-combiner-verify-pattern-order=true < %s | FileCheck %s --check-prefix=AVX

	; Verify that 128-bit vector logical ops are reassociated.			; Verify that 128-bit vector logical ops are reassociated.

	define <4 x i32> @reassociate_and_v4i32(<4 x i32> %x0, <4 x i32> %x1, <4 x i32> %x2, <4 x i32> %x3) {			define <4 x i32> @reassociate_and_v4i32(<4 x i32> %x0, <4 x i32> %x1, <4 x i32> %x2, <4 x i32> %x3) {
	; SSE-LABEL: reassociate_and_v4i32:			; SSE-LABEL: reassociate_and_v4i32:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: paddd %xmm1, %xmm0			; SSE-NEXT: paddd %xmm1, %xmm0
	; SSE-NEXT: pand %xmm3, %xmm2			; SSE-NEXT: pand %xmm3, %xmm2

llvm/trunk/test/CodeGen/X86/machine-combiner-int.ll

	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 | FileCheck %s			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -machine-combiner-verify-pattern-order=true | FileCheck %s
	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -stop-after machine-combiner -o - | FileCheck %s --check-prefix=DEAD			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -stop-after machine-combiner -machine-combiner-verify-pattern-order=true -o - | FileCheck %s --check-prefix=DEAD

	; Verify that integer multiplies are reassociated. The first multiply in			; Verify that integer multiplies are reassociated. The first multiply in
	; each test should be independent of the result of the preceding add (lea).			; each test should be independent of the result of the preceding add (lea).

	; TODO: This test does not actually test i16 machine instruction reassociation			; TODO: This test does not actually test i16 machine instruction reassociation
	; because the operands are being promoted to i32 types.			; because the operands are being promoted to i32 types.

	define i16 @reassociate_muls_i16(i16 %x0, i16 %x1, i16 %x2, i16 %x3) {			define i16 @reassociate_muls_i16(i16 %x0, i16 %x1, i16 %x2, i16 %x3) {

llvm/trunk/test/CodeGen/X86/machine-combiner.ll

	; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=sse -enable-unsafe-fp-math < %s | FileCheck %s --check-prefix=SSE			; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=sse -enable-unsafe-fp-math -machine-combiner-verify-pattern-order=true < %s | FileCheck %s --check-prefix=SSE
	; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=avx -enable-unsafe-fp-math < %s | FileCheck %s --check-prefix=AVX			; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=avx -enable-unsafe-fp-math -machine-combiner-verify-pattern-order=true < %s | FileCheck %s --check-prefix=AVX

	; Incremental updates of the instruction depths should be enough for this test			; Incremental updates of the instruction depths should be enough for this test
	; case.			; case.
	; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=sse -enable-unsafe-fp-math -machine-combiner-inc-threshold=0 < %s | FileCheck %s --check-prefix=SSE			; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=sse -enable-unsafe-fp-math -machine-combiner-inc-threshold=0 < %s | FileCheck %s --check-prefix=SSE
	; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=avx -enable-unsafe-fp-math -machine-combiner-inc-threshold=0 < %s | FileCheck %s --check-prefix=AVX			; RUN: llc -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mattr=avx -enable-unsafe-fp-math -machine-combiner-inc-threshold=0 < %s | FileCheck %s --check-prefix=AVX

	; Verify that the first two adds are independent regardless of how the inputs are			; Verify that the first two adds are independent regardless of how the inputs are
	; commuted. The destination registers are used as source registers for the third add.			; commuted. The destination registers are used as source registers for the third add.

Revision Contents

Diff 132156
