This is an archive of the discontinued LLVM Phabricator instance.

MachineScheduler: Enable macro fusion in post-RA scheduler
AbandonedPublic

Authored by MatzeB on Sep 22 2016, 6:22 PM.

Details

Reviewers

• tstellarAMD
jonpa
atrick
kparzysz

Summary

The post-RA scheduler should respect macro fusion opportunities.

This also changes the strategy to enforce adjacent scheduling: While we
previously added a weak clustering edge between the fusing nodes we now
add strong artificial ordering edges towards all other nodes to enforce
the order. This was found necessary to avoid cases in which the cost
heuristic for the weak edges did not have any effect because some nodes
were still in the pending queue and not even considered by
tryCandidate().
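
To make the edge strategy concrete, here is a minimal self-contained sketch on a toy DAG. It is not LLVM code: `ToyDAG`, `addToyFusionEdges` and `scheduleTopDown` are hypothetical names, the exit node (the branch) is left implicit, and the scheduler is a plain topological walk rather than the MachineScheduler. It only illustrates the idea that making every other successor-free node a predecessor of the fusing node forces that node to be picked last, directly in front of the exit node.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy DAG (hypothetical, not an LLVM type): Succs[i] lists the nodes that
// must be scheduled after node i. The exit node (the branch) is implicit and
// is assumed to be scheduled after every node listed here.
struct ToyDAG {
  std::vector<std::vector<std::size_t>> Succs;
};

// Make every node without successors (other than Fuse) a predecessor of Fuse.
// Fuse itself must have no successors, loosely mirroring the
// canScheduleAdjacent() check in the patch.
void addToyFusionEdges(ToyDAG &DAG, std::size_t Fuse) {
  assert(Fuse < DAG.Succs.size() && DAG.Succs[Fuse].empty());
  for (std::size_t I = 0; I < DAG.Succs.size(); ++I)
    if (I != Fuse && DAG.Succs[I].empty())
      DAG.Succs[I].push_back(Fuse);
}

// Top-down list scheduling that always picks the lowest-numbered ready node.
std::vector<std::size_t> scheduleTopDown(const ToyDAG &DAG) {
  std::size_t N = DAG.Succs.size();
  std::vector<std::size_t> PredsLeft(N, 0), Order;
  std::vector<bool> Done(N, false);
  for (const auto &S : DAG.Succs)
    for (std::size_t To : S)
      ++PredsLeft[To];
  while (Order.size() < N) {
    bool Picked = false;
    for (std::size_t I = 0; I < N && !Picked; ++I) {
      if (Done[I] || PredsLeft[I] != 0)
        continue; // already scheduled, or still waiting for a predecessor
      Done[I] = true;
      Order.push_back(I);
      for (std::size_t To : DAG.Succs[I])
        --PredsLeft[To];
      Picked = true;
    }
    assert(Picked && "cycle in toy DAG");
  }
  return Order;
}
```

With four nodes shaped like the test case in this revision (0 = IMPLICIT_DEF, 1 = load, 2 = compare, 3 = add, with edges 0->1, 0->2, 1->3), the toy scheduler places the compare third before the extra edges are added and last after `addToyFusionEdges(DAG, 2)`.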

Diff Detail

Repository: rL LLVM

Event Timeline

MatzeB updated this revision to Diff 72232. Sep 22 2016, 6:22 PM

MatzeB retitled this revision to MachineScheduler: Enable macro fusion in post-RA scheduler.

MatzeB updated this object.

MatzeB added reviewers: atrick, • tstellarAMD, jonpa.

MatzeB set the repository for this revision to rL LLVM.

MatzeB added a subscriber: llvm-commits.

Herald added subscribers: wdng, mcrosier, MatzeB. Sep 22 2016, 6:22 PM

MatzeB added a reviewer: kparzysz. Sep 23 2016, 1:24 PM

This looks fairly straightforward... LGTM.

lib/CodeGen/MachineScheduler.cpp
2870	delay -> Delay

This revision is now accepted and ready to land. Sep 26 2016, 5:16 AM

It turned out the approach taken previously was not enough: currently, nodes predicted to stall end up in the pending queue and are not even considered in tryCandidate() for the usual heuristics. However, possible stalls should not get in the way of the macrofusion heuristic, so instead of adjusting the picking heuristic this patch adds artificial scheduling edges to the roots to enforce adjacent scheduling.
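
The pending-queue problem can be illustrated with a toy picker. This is a simplified sketch, not the MachineScheduler code: `ToyNode`, `pickNode` and the `Priority` field are hypothetical, and the real comparison in tryCandidate() is far richer. Nodes whose ready cycle lies in the future are treated as pending and are skipped before any comparison happens, so a low priority on the fusing compare cannot defer it while the other nodes are still pending.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical toy node, not an LLVM type.
struct ToyNode {
  unsigned ReadyCycle; // first cycle at which the node issues without a stall
  int Priority;        // made-up heuristic score; higher is picked first
};

// Pick among nodes that are ready at CurrCycle. Nodes that would stall count
// as pending and are skipped before the heuristic comparison ever runs.
// Returns the index of the picked node, or -1 if nothing is ready.
int pickNode(const std::vector<ToyNode> &Nodes, unsigned CurrCycle) {
  int Best = -1;
  for (std::size_t I = 0; I < Nodes.size(); ++I) {
    if (Nodes[I].ReadyCycle > CurrCycle)
      continue; // pending: not even a candidate
    if (Best < 0 || Nodes[I].Priority > Nodes[Best].Priority)
      Best = static_cast<int>(I);
  }
  return Best;
}
```

In this toy model, a compare with a very low priority is still picked at cycle 0 if the only other node becomes ready at cycle 2; the heuristic's preference only takes effect once both nodes are candidates.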

PS: In some internal discussions we decided that a nice long-term solution would be to merge the fusing nodes in the preparation step instead (by creating an instruction bundle and merging the ScheduleDAG nodes). However, merging nodes after creating the scheduling DAG turns out to be tricky: the existing code expects a fixed number of SUnits, we would need to update or recompute the topological ordering, etc. So to get this specific problem under control, I found adding edges to root nodes a robust and simpler solution for now.

Marking a DAG node as "dead" for the purpose of scheduling should be an easy thing to do, relative to supporting instruction bundles. But adding the extra DAG edges is also a fine solution, just not quite as direct.

Updated the patch so that addFusionEdges() no longer walks over all edges in the graph (just all nodes and the predecessor edges of ExitSU now).

MatzeB added a child revision: D25140: ScheduleDAGInstrs: Add condjump deps in addSchedBarrierDeps(). Oct 3 2016, 5:30 PM

ping

Doesn't this force macro fusion for all targets/subtargets? If we wanted to do that, we wouldn't need the cluster edge and scheduler heuristic anymore. Shouldn't there be a TII->forceMacroFusion() option?

In D24855#572492, @atrick wrote:

Doesn't this force macro fusion for all targets/subtargets? If we wanted to do that, we wouldn't need the cluster edge and scheduler heuristic anymore. Shouldn't there be a TII->forceMacroFusion() option?

If the target doesn't implement TII::shouldScheduleAdjacent() then no fusion will happen. Of course, this commit forces fusion to be respected even in the presence of possible stalls in the scheduling model. This is a switch of priorities and indeed no longer allows you to insert the macrofusion check at an arbitrary place in the heuristic. (The scheduling model/stalls aren't part of that heuristic either, so I couldn't just move the cluster edge heuristic to an earlier place in tryCandidate(); that is why I went this route.)

Do you think a TII->forceMacroFusion() hook is necessary? Currently the only targets implementing shouldScheduleAdjacent() are X86 and AArch64 and they should both prefer fusion over reported stalls, making the cluster/weak solution untested/dead code.

On x86 it's meant to be a heuristic. If you think all subtargets should instead force macro-fusion before scheduling (and have benchmarks to prove it) then we should delete the code that implements the heuristic.

In D24855#572495, @atrick wrote:

On x86 it's meant to be a heuristic. If you think all subtargets should instead force macro-fusion before scheduling (and have benchmarks to prove it) then we should delete the code that implements the heuristic.

My experience has been that this mostly matters/changes the outcome when scheduling top-down, which currently happens only in the PostRAScheduler. The combination of PostRAScheduling and MacroOpFusion both being enabled currently seems to occur only in the BtVer2/Jaguar scheduling model, for which I have no hardware to test. So I have no good way of benchmarking this (but I also see no indication why this would ever be good as a heuristic).

Anyway, to keep the possibility of weak edges, I'd make shouldScheduleAdjacent() return an enum value to indicate whether weak or hard edges should be used. I'll update that in the next few days.
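
As a rough illustration of what such an enum-returning hook could look like, here is a self-contained sketch. Everything in it is hypothetical: `FusionKind`, `ToyEdge`, `toyShouldScheduleAdjacent` and `edgeFor` do not appear in the patch, opcodes are plain strings, and the opcode rules are arbitrary toy rules rather than real target behavior.

```cpp
#include <string>

// Hypothetical: how strongly a pair of instructions should be kept adjacent.
enum class FusionKind { None, Weak, Hard };

// Hypothetical: the kind of DAG edge a mutation would add for each answer.
enum class ToyEdge { None, Cluster, Artificial };

// Stand-in for a target hook. The rules below are made up for illustration.
FusionKind toyShouldScheduleAdjacent(const std::string &First,
                                     const std::string &Second) {
  if (Second != "Bcc")
    return FusionKind::None;
  if (First == "SUBS")
    return FusionKind::Hard; // enforce adjacency even across predicted stalls
  if (First == "ADDS")
    return FusionKind::Weak; // leave the decision to the picking heuristic
  return FusionKind::None;
}

// Map the hook's answer to the edge kind a DAG mutation might add.
ToyEdge edgeFor(FusionKind Kind) {
  switch (Kind) {
  case FusionKind::Weak:
    return ToyEdge::Cluster;
  case FusionKind::Hard:
    return ToyEdge::Artificial;
  case FusionKind::None:
    break;
  }
  return ToyEdge::None;
}
```

Under this sketch, a target that wants the old behavior would answer `Weak` and get a cluster edge, while a target that prefers fusion over reported stalls would answer `Hard` and get the artificial ordering edges.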

This is out of date. Nowadays targets can decide themselves whether they want post-RA fusion by overriding TargetPassConfig::createPostMachineScheduler() and adding a DAG mutation there.

Revision Contents

Path	Size
lib/CodeGen/MachineScheduler.cpp	70 lines
test/CodeGen/AArch64/postmisched-fusion.mir	23 lines

Diff 73369

lib/CodeGen/MachineScheduler.cpp

[... 1,527 lines not shown ...]
for (const MachineOperand &MO : MI.uses()) {		for (const MachineOperand &MO : MI.uses()) {

unsigned Reg = MO.getReg();		unsigned Reg = MO.getReg();
if (Other.modifiesRegister(Reg, &TRI))		if (Other.modifiesRegister(Reg, &TRI))
return true;		return true;
}		}
return false;		return false;
}		}

		/// Check dependencies in \p DAG whether \p Node0 can be scheduled immediately
		/// before \p Node1.
		static bool canScheduleAdjacent(ScheduleDAGInstrs &DAG, const SUnit &Node0,
		                                const SUnit &Node1) {
		  // This is only a barebones implementation right now, limited to
		  // Node1==ExitSU. (This could be extended by employing
		  // ScheduleDAGMI.Topo.isReachable() queries on Node0 successors in the future)
		  assert(&Node1 == &DAG.ExitSU && "Only implemented for ExitSU node");
		  for (const SDep &Succ : Node0.Succs) {
		    if (Succ.getSUnit() != &Node1)
		      return false;
		  }
		  return true;
		}

		/// Add artificial edges to force adjacent scheduling of \p Node0 and \p Node1.
		static void addFusionEdges(ScheduleDAGMI &DAG, SUnit &Node0, SUnit &Node1) {
		  assert(&Node1 == &DAG.ExitSU &&
		         "addFusionEdges() only implemented for Node1 == ExitSU");
		  // This is simpler than the general case: We only need an artificial edge from
		  // roots and predecessors of ExitSU to Node0.
		  for (SDep &PredDep : DAG.ExitSU.Preds) {
		    SUnit &SU = *PredDep.getSUnit();
		    if (&SU == &Node0)
		      continue;

		    bool NeedEdge = true;
		    for (const SDep &SuccDep : SU.Succs) {
		      if (SuccDep.isWeak())
		        continue;
		      const SUnit &Succ = *SuccDep.getSUnit();
		      if (&Succ != &DAG.ExitSU) {
		        NeedEdge = false;
		        break;
		      }
		    }
		    if (NeedEdge)
		      DAG.addEdge(&Node0, SDep(&SU, SDep::Artificial));
		  }

		  for (SUnit &SU : DAG.SUnits) {
		    if (SU.NumSuccsLeft == 0 && &SU != &Node0)
		      DAG.addEdge(&Node0, SDep(&SU, SDep::Artificial));
		  }
		}

/// \brief Callback from DAG postProcessing to create cluster edges to encourage		/// \brief Callback from DAG postProcessing to create cluster edges to encourage
/// fused operations.		/// fused operations.
void MacroFusion::apply(ScheduleDAGInstrs *DAGInstrs) {		void MacroFusion::apply(ScheduleDAGInstrs *DAGInstrs) {
ScheduleDAGMI *DAG = static_cast<ScheduleDAGMI*>(DAGInstrs);		ScheduleDAGMI *DAG = static_cast<ScheduleDAGMI*>(DAGInstrs);

// For now, assume targets can only fuse with the branch.		// For now, assume targets can only fuse with the branch.
SUnit &ExitSU = DAG->ExitSU;		SUnit &ExitSU = DAG->ExitSU;
MachineInstr *Branch = ExitSU.getInstr();		MachineInstr *Branch = ExitSU.getInstr();
if (!Branch)		if (!Branch)
return;		return;

for (SUnit &SU : DAG->SUnits) {		for (SUnit &SU : DAG->SUnits) {
// SUnits with successors can't be scheduled in front of the ExitSU.		// SUnits with successors can't be scheduled in front of the ExitSU.
if (!SU.Succs.empty())		if (!SU.Succs.empty())
continue;		continue;
// We only care if the node writes to a register that the branch reads.		// We only care if the node writes to a register that the branch reads.
MachineInstr *Pred = SU.getInstr();		MachineInstr *Pred = SU.getInstr();
if (!HasDataDep(TRI, *Branch, *Pred))		if (!HasDataDep(TRI, *Branch, *Pred))
continue;		continue;

if (!TII.shouldScheduleAdjacent(*Pred, *Branch))		if (!TII.shouldScheduleAdjacent(*Pred, *Branch))
continue;		continue;

// Create a single weak edge from SU to ExitSU. The only effect is to cause		if (!canScheduleAdjacent(*DAG, SU, ExitSU))
// bottom-up scheduling to heavily prioritize the clustered SU. There is no		continue;
// need to copy predecessor edges from ExitSU to SU, since top-down
// scheduling cannot prioritize ExitSU anyway. To defer top-down scheduling
// of SU, we could create an artificial edge from the deepest root, but it
// hasn't been needed yet.
bool Success = DAG->addEdge(&ExitSU, SDep(&SU, SDep::Cluster));
(void)Success;
assert(Success && "No DAG nodes should be reachable from ExitSU");

DEBUG(dbgs() << "Macro Fuse SU(" << SU.NodeNum << ")\n");		DEBUG(dbgs() << "Macro Fuse SU(" << SU.NodeNum << ") and SU("
		<< ExitSU.NodeNum << ")\n");
		addFusionEdges(*DAG, SU, ExitSU);
break;		break;
}		}
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// CopyConstrain - DAG post-processing to encourage copy elimination.		// CopyConstrain - DAG post-processing to encourage copy elimination.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

[... 1,243 lines not shown ...]
if (tryGreater(biasPhysRegCopy(TryCand.SU, TryCand.AtTop),		if (tryGreater(biasPhysRegCopy(TryCand.SU, TryCand.AtTop),
TryCand, Cand, PhysRegCopy))		TryCand, Cand, PhysRegCopy))
return;		return;

// Avoid exceeding the target's limit.		// Avoid exceeding the target's limit.
if (DAG->isTrackingPressure() && tryPressure(TryCand.RPDelta.Excess,		if (DAG->isTrackingPressure() && tryPressure(TryCand.RPDelta.Excess,
Cand.RPDelta.Excess,		Cand.RPDelta.Excess,
TryCand, Cand, RegExcess, TRI,		TryCand, Cand, RegExcess, TRI,
DAG->MF))		DAG->MF))
return;		return;

// Avoid increasing the max critical pressure in the scheduled region.		// Avoid increasing the max critical pressure in the scheduled region.
if (DAG->isTrackingPressure() && tryPressure(TryCand.RPDelta.CriticalMax,		if (DAG->isTrackingPressure() && tryPressure(TryCand.RPDelta.CriticalMax,
Cand.RPDelta.CriticalMax,		Cand.RPDelta.CriticalMax,
TryCand, Cand, RegCritical, TRI,		TryCand, Cand, RegCritical, TRI,
DAG->MF))		DAG->MF))
return;		return;

[... 424 lines not shown ...]
/// scheduled/remaining flags in the DAG nodes.		/// scheduled/remaining flags in the DAG nodes.
void PostGenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {		void PostGenericScheduler::schedNode(SUnit *SU, bool IsTopNode) {
SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());		SU->TopReadyCycle = std::max(SU->TopReadyCycle, Top.getCurrCycle());
Top.bumpNode(SU);		Top.bumpNode(SU);
}		}

/// Create a generic scheduler with no vreg liveness or DAG mutation passes.		/// Create a generic scheduler with no vreg liveness or DAG mutation passes.
static ScheduleDAGInstrs *createGenericSchedPostRA(MachineSchedContext *C) {		static ScheduleDAGInstrs *createGenericSchedPostRA(MachineSchedContext *C) {
return new ScheduleDAGMI(C, make_unique<PostGenericScheduler>(C), /*IsPostRA=*/true);		ScheduleDAGMI *DAG =
		new ScheduleDAGMI(C, make_unique<PostGenericScheduler>(C),
		/*IsPostRA=*/true);
		if (EnableMacroFusion)
		DAG->addMutation(createMacroFusionDAGMutation(DAG->TII, DAG->TRI));
		return DAG;
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// ILP Scheduler. Currently for experimental analysis of heuristics.		// ILP Scheduler. Currently for experimental analysis of heuristics.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

namespace {		namespace {
/// \brief Order nodes by the ILP metric.		/// \brief Order nodes by the ILP metric.
[... 283 lines not shown ...]

test/CodeGen/AArch64/postmisched-fusion.mir

This file was added.

# RUN: llc -o - %s -mtriple=aarch64-- -mcpu=cyclone -enable-post-misched -run-pass=postmisched | FileCheck %s
# Test that the post machine scheduler respects macro op fusion.
--- |
  define void @func0() { ret void }
...
---
# CHECK-LABEL: name: func0
# CHECK: %xzr = SUBSXri{{.*}}implicit-def %nzcv
# CHECK-NEXT: Bcc {{.*}}implicit killed %nzcv
name: func0
body: |
  bb.0:
    successors: %bb.1, %bb.2
    %x8 = IMPLICIT_DEF
    %x9 = LDRXui %x8, 0 :: (load 8)
    dead %xzr = SUBSXri %x8, 0, 0, implicit-def %nzcv
    %x10 = ADDXri %x9, 13, 0
    Bcc 1, %bb.1, implicit killed %nzcv
    B %bb.2

  bb.1:
  bb.2:
...