This is an archive of the discontinued LLVM Phabricator instance.

For context: I'm experimenting with turning post-ra scheduling on for SNB onwards, and I see some promising improvements. All the regressions I see are macro-fused instructions being moved apart, which this fixes.

Nice patch Clement!

I always wondered why on x86 we only enabled that mutator in the pre-ra scheduler.
In the past, I remember I did some quick experiments with enabling that mutator in the post-RA scheduler. I must admit that I wasn't particularly lucky with the experiments (i.e. I couldn't find significant/promising improvements). But then - again - those were just quick experiments, and I didn't try it on many codebases.

If you think you can share some numbers then that would be great.

On a slightly different topic: I often wondered whether - at some point - we should start using the PostMIScheduler for X86.

Did you experiment with it?
The API is basically the same; mutators are run on the DAG as part of a dag-postprocessing step. You can try to enable that mutator on that pass too.
To enable PostMIScheduler however you have to substitutePass(&PostRASchedulerID, &PostMachineSchedulerID) in the X86PassConfig (similar to how AArch64 does it).
It would be interesting to see which one gives us the best codegen in your experiments (just curious).
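
To make the mutator mechanism concrete, here is a minimal standalone sketch of a DAG post-processing step. It is a toy model, not LLVM code: `ToyDAG`, `ToyMutation`, `ToyFusionMutation` and `ToyScheduler` are hypothetical names, and only the shape of `addMutation`/`postprocessDAG` is meant to mirror what the schedulers do. The fusion rule (cluster the last `cmp`/`test` with a trailing `jcc`) is a deliberate simplification.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a scheduling DAG (hypothetical, not an LLVM class).
struct ToyDAG {
  std::vector<std::string> Nodes;                // instruction names in program order
  std::vector<std::pair<int, int>> ClusterEdges; // (pred index, succ index)
};

// Toy stand-in for a DAG mutation: one post-processing step.
struct ToyMutation {
  virtual ~ToyMutation() = default;
  virtual void apply(ToyDAG &DAG) = 0;
};

// Adds a cluster edge between a trailing "jcc" and the closest preceding
// "cmp" or "test", so a scheduler could try to keep the pair adjacent.
struct ToyFusionMutation : ToyMutation {
  void apply(ToyDAG &DAG) override {
    if (DAG.Nodes.empty() || DAG.Nodes.back() != "jcc")
      return;
    int Branch = static_cast<int>(DAG.Nodes.size()) - 1;
    for (int I = Branch - 1; I >= 0; --I) {
      if (DAG.Nodes[I] == "cmp" || DAG.Nodes[I] == "test") {
        DAG.ClusterEdges.emplace_back(I, Branch);
        return;
      }
    }
  }
};

// Owns an ordered list of mutations and runs them after the DAG is built.
class ToyScheduler {
  std::vector<std::unique_ptr<ToyMutation>> Mutations;

public:
  // Null mutations are ignored, so callers may pass a disabled mutation.
  void addMutation(std::unique_ptr<ToyMutation> M) {
    if (M)
      Mutations.push_back(std::move(M));
  }

  void postprocessDAG(ToyDAG &DAG) {
    for (auto &M : Mutations)
      M->apply(DAG);
  }
};
```

Because the mutation only sees the DAG, the same object can in principle be registered with either scheduler; which scheduler then honours the cluster edge is a separate question.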

In D59688#1439297, @andreadb wrote:

Nice patch Clement!

I always wondered why on x86 we only enabled that mutator in the pre-ra scheduler.
In the past, I remember I did some quick experiments with enabling that mutator in the post-RA scheduler. I must admit that I wasn't particularly lucky with the experiments (i.e. I couldn't find significant/promising improvements). But then - again - those were just quick experiments, and I didn't try it on many codebases. If you think you can share some numbers then that would be great.

Thanks Andrea.

Yes, that's essentially what the comment in X86.td says:

"This generally gives a nice performance increase on silvermont, with largely neutral behavior on other contemporary large core processors."

However, that was before the round of scheduling information fixes that Simon & I made based on llvm-exegesis. I wanted to give it another try after that, and from my first experiments it seems that it indeed makes sense to look at it again.
What I have done for now is run our (internal, sorry) main macrobenchmark with post-ra enabled. With the base code I see a consistent regression of 0.5% to 1% depending on metrics. With this patch I see a consistent improvement of 0.5% to 2%.

On a slightly different topic: I often wondered whether - at some point - we should start using the PostMIScheduler for X86.
Did you experiment with it?

I'm looking at all the options right now. But I want to make sure that we're comparing apples to apples, and that's why I'm fixing this.

When I'm done experimenting with different possibilities I'll write a summary of the results.

The API is basically the same; mutators are run on the DAG as part of a dag-postprocessing step. You can try to enable that mutator on that pass too.

Yup, I have another patch coming for this :)

RKSimon added reviewers: arsenm, MatzeB. Mar 22 2019, 6:17 AM

Herald added a subscriber: wdng. Mar 22 2019, 6:17 AM

arsenm added a reviewer: vpykhtin. Mar 22 2019, 6:24 AM

In D59688#1439350, @courbet wrote:

In D59688#1439297, @andreadb wrote:

Nice patch Clement!

I always wondered why on x86 we only enabled that mutator in the pre-ra scheduler.
In the past, I remember I did some quick experiments with enabling that mutator in the post-RA scheduler. I must admit that I wasn't particularly lucky with the experiments (i.e. I couldn't find significant/promising improvements). But then - again - those were just quick experiments, and I didn't try it on many codebases. If you think you can share some numbers then that would be great.

Thanks Andrea.

Yes, that's essentially what the comment in X86.td says:

"This generally gives a nice performance increase on silvermont, with largely neutral behavior on other contemporary large core processors."

However, that was before the round of scheduling information fixes that Simon & I made based on llvm-exegesis. I wanted to give it another try after that, and from my first experiments it seems that it indeed makes sense to look at it again.
What I have done for now is run our (internal, sorry) main macrobenchmark with post-ra enabled. With the base code I see a consistent regression of 0.5% to 1% depending on metrics. With this patch I see a consistent improvement of 0.5% to 2%.

Thanks. That's good to know.
I plan to run some experiments today using your patch.

But in general, I am happy with this patch.

On a slightly different topic: I often wondered whether - at some point - we should start using the PostMIScheduler for X86.
Did you experiment with it?

I'm looking at all the options right now. But I want to make sure that we're comparing apples to apples, and that's why I'm fixing this.

When I'm done experimenting with different possibilities I'll write a summary of the results.

The API is basically the same; mutators are run on the DAG as part of a dag-postprocessing step. You can try to enable that mutator on that pass too.

Yup, I have another patch coming for this :)

Cool. :-)

-Andrea

I plan to run some experiments today using your patch.

That's great, thanks.

In D59688#1439408, @courbet wrote:

I plan to run some experiments today using your patch.

That's great, thanks.

Sorry, I was overoptimistic about my other workload. I don't think I'll get a chance to get any perf numbers anytime soon.

That being said, I tried your patch on a few small examples on some different targets, and results seem good.
For example, before your patch I saw cases where the test/cmp was not emitted before the conditional branch. Your patch seems to fix that "issue" in most cases.

My only concern is that the macro-fusion mutator might be a bit too aggressive for AMD processors.
X86MacroFusion assumes that branch fusion can happen with ADD/SUB/INC/DEC too. That is okay for Intel processors, but not necessarily for AMD processors where branch fusion (as far as I remember) is limited to CMP/TEST opcodes only.
Since your patch enables that mutator for targets with FeatureMacroFusion, it would be nice to get some feedback from somebody with access to an AMD target where macro fusion is enabled (Bobcat/Jaguar doesn't do branch fusion). Perhaps @lebedev.ri can run some quick tests on BdVer2?
I don't think this is a blocking issue, but in the future we should revisit the logic in X86MacroFusion.
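
The distinction raised above can be sketched as a small standalone predicate. This is an illustration under stated assumptions, not the logic in X86MacroFusion: the `Opcode` and `FusionSupport` enums and `mayFuseWithBranch` are hypothetical names, and the split (CMP/TEST only versus CMP/TEST plus ADD/SUB/INC/DEC) simply encodes the recollection in this comment. Real hardware has further restrictions (operand forms, condition codes) that are not modelled here.

```cpp
#include <cassert>

// Hypothetical opcode set, just large enough for the example.
enum class Opcode { CMP, TEST, ADD, SUB, INC, DEC, MOV };

// Hypothetical capability levels for fusing an instruction with a
// following conditional branch.
enum class FusionSupport {
  None,             // no fusion at all
  BranchFusionOnly, // assumed: only CMP/TEST fuse with the branch
  MacroFusion       // assumed: CMP/TEST and ADD/SUB/INC/DEC fuse
};

inline bool isCmpOrTest(Opcode Op) {
  return Op == Opcode::CMP || Op == Opcode::TEST;
}

inline bool isFlagSettingArith(Opcode Op) {
  return Op == Opcode::ADD || Op == Opcode::SUB || Op == Opcode::INC ||
         Op == Opcode::DEC;
}

// Returns true if First may fuse with an immediately following conditional
// branch on a target with the given capability level.
inline bool mayFuseWithBranch(FusionSupport Support, Opcode First) {
  switch (Support) {
  case FusionSupport::None:
    return false;
  case FusionSupport::BranchFusionOnly:
    return isCmpOrTest(First);
  case FusionSupport::MacroFusion:
    return isCmpOrTest(First) || isFlagSettingArith(First);
  }
  return false;
}
```

With a predicate shaped like this, a mutator that is too aggressive for a given target only costs scheduling freedom: it clusters pairs that the hardware will not actually fuse.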

In D59688#1441464, @andreadb wrote:

In D59688#1439408, @courbet wrote:

I plan to run some experiments today using your patch.

That's great, thanks.

Sorry, I was overoptimistic about my other workload. I don't think I'll get a chance to get any perf numbers anytime soon.

That being said, I tried your patch on a few small examples on some different targets, and results seem good.
For example, before your patch I saw cases where the test/cmp was not emitted before the conditional branch. Your patch seems to fix that "issue" in most cases.

My only concern is that the macro-fusion mutator might be a bit too aggressive for AMD processors.
X86MacroFusion assumes that branch fusion can happen with ADD/SUB/INC/DEC too. That is okay for Intel processors, but not necessarily for AMD processors where branch fusion (as far as I remember) is limited to CMP/TEST opcodes only.

That is consistent with what is stated in Agner's microarchitecture guide and in the AMD SOG for Piledriver.

Since your patch enables that mutator for targets with FeatureMacroFusion, it would be nice to get some feedback from somebody with access to an AMD target where macro fusion is enabled (Bobcat/Jaguar doesn't do branch fusion). Perhaps @lebedev.ri can run some quick tests on BdVer2?

It will, as usual, depend on whether this happens to affect the hotpath or not.
I just ran my rawspeed benchmark, and I'm not observing any notable non-noise perf changes.

I don't think this is a blocking issue, but in the future we should revisit the logic in X86MacroFusion.

While there, @andreadb, can you reply on https://reviews.llvm.org/D46662#1293043 ?

In D59688#1442004, @lebedev.ri wrote:

In D59688#1441464, @andreadb wrote:

In D59688#1439408, @courbet wrote:

I plan to run some experiments today using your patch.

That's great, thanks.

Sorry, I was overoptimistic about my other workload. I don't think I'll get a chance to get any perf numbers anytime soon.

That being said, I tried your patch on a few small examples on some different targets, and results seem good.
For example, before your patch I saw cases where the test/cmp was not emitted before the conditional branch. Your patch seems to fix that "issue" in most cases.

My only concern is that the macro-fusion mutator might be a bit too aggressive for AMD processors.
X86MacroFusion assumes that branch fusion can happen with ADD/SUB/INC/DEC too. That is okay for Intel processors, but not necessarily for AMD processors where branch fusion (as far as I remember) is limited to CMP/TEST opcodes only.

That is consistent with what is stated in Agner's microarchitecture guide and in the AMD SOG for Piledriver.

Since your patch enables that mutator for targets with FeatureMacroFusion, it would be nice to get some feedback from somebody with access to an AMD target where macro fusion is enabled (Bobcat/Jaguar doesn't do branch fusion). Perhaps @lebedev.ri can run some quick tests on BdVer2?

It will, as usual, depend on whether this happens to affect the hotpath or not.
I just ran my rawspeed benchmark, and I'm not observing any notable non-noise perf changes.

Thanks. That matches what I also saw in the past when I tested it.

I don't think this is a blocking issue, but in the future we should revisit the logic in X86MacroFusion.

While there, @andreadb, can you reply on https://reviews.llvm.org/D46662#1293043 ?

I replied to that code review (the two small benchmarks were available from one of Xur’s older posts).

courbet mentioned this in D59872: [X86MacroFusion] Handle branch fusion (AMD CPUs). Mar 27 2019, 4:44 AM

courbet mentioned this in rL357171: [X86MacroFusion] Handle branch fusion (AMD CPUs). Mar 28 2019, 7:12 AM

courbet mentioned this in rG699dc025a625: [X86MacroFusion] Handle branch fusion (AMD CPUs).

I don't think this is a blocking issue, but in the future we should revisit the logic in X86MacroFusion.

For the record, this was dealt with in D59872.

Thanks Clement.

LGTM

This revision is now accepted and ready to land. Mar 28 2019, 8:12 AM

This could use test coverage, I guess?

This is actually an NFC as the post-ra scheduler is not on by default. The next patch that enables scheduling will show the actual changes.

In D59688#1446142, @courbet wrote:

This could use test coverage, I guess?

This is actually an NFC as the post-ra scheduler is not on by default. The next patch that enables scheduling will show the actual changes.

What about CPUs that specify let PostRAScheduler = 1; ?

Ah yes, sorry I lost track of this. Interestingly there are no tests that fail currently. I'll try to come up with some that do.

For now I'll submit the refactoring part of this change (D59689).

In D59688#1446261, @courbet wrote:

What about CPUs that specify let PostRAScheduler = 1; ?

Ah yes, sorry I lost track of this. Interestingly there are no tests that fail currently.

I'll try to come up with some that do.

For now I'll submit the refactoring part of this change (D59689).

Thanks!

courbet mentioned this in rL357381: [X86MacroFusion][NFC] Add more tests. Apr 1 2019, 6:18 AM

courbet mentioned this in rGd9f6ee1c3cc6: [X86MacroFusion][NFC] Add more tests.

Add tests.

This revision is now accepted and ready to land. Apr 1 2019, 6:37 AM

Harbormaster completed remote builds in B29895: Diff 193074. Apr 1 2019, 6:39 AM

Please upload the correct diff; this one seems to be relative to the previous patch.

Never mind, I see that it was committed in D59689.
Looks good.

Closed by commit rL357384: [X86] Make post-ra scheduling macrofusion-aware. (authored by courbet). Apr 1 2019, 6:47 AM

This revision was automatically updated to reflect the committed changes.

lebedev.ri mentioned this in D60185: [X86] Make the post machine scheduler macrofusion-aware. Apr 3 2019, 3:08 AM

Revision Contents

Path	Size
llvm/include/llvm/CodeGen/MachineScheduler.h	17 lines
llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h	16 lines
llvm/lib/CodeGen/MachineScheduler.cpp	29 lines
llvm/lib/CodeGen/MacroFusion.cpp	12 lines
llvm/lib/CodeGen/ScheduleDAGInstrs.cpp	21 lines
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp	4 lines
llvm/lib/Target/X86/X86Subtarget.h	3 lines
llvm/lib/Target/X86/X86Subtarget.cpp	6 lines

Diff 191845

llvm/include/llvm/CodeGen/MachineScheduler.h

Show First 20 Lines • Show All 258 Lines • ▼ Show 20 Lines
/// without much extra book-keeping. This is the common functionality between		/// without much extra book-keeping. This is the common functionality between
/// PreRA and PostRA MachineScheduler.		/// PreRA and PostRA MachineScheduler.
class ScheduleDAGMI : public ScheduleDAGInstrs {		class ScheduleDAGMI : public ScheduleDAGInstrs {
protected:		protected:
AliasAnalysis *AA;		AliasAnalysis *AA;
LiveIntervals *LIS;		LiveIntervals *LIS;
std::unique_ptr<MachineSchedStrategy> SchedImpl;		std::unique_ptr<MachineSchedStrategy> SchedImpl;

/// Topo - A topological ordering for SUnits which permits fast IsReachable
/// and similar queries.
ScheduleDAGTopologicalSort Topo;

/// Ordered list of DAG postprocessing steps.		/// Ordered list of DAG postprocessing steps.
std::vector<std::unique_ptr<ScheduleDAGMutation>> Mutations;		std::vector<std::unique_ptr<ScheduleDAGMutation>> Mutations;

/// The top of the unscheduled zone.		/// The top of the unscheduled zone.
MachineBasicBlock::iterator CurrentTop;		MachineBasicBlock::iterator CurrentTop;

/// The bottom of the unscheduled zone.		/// The bottom of the unscheduled zone.
MachineBasicBlock::iterator CurrentBottom;		MachineBasicBlock::iterator CurrentBottom;

/// Record the next node in a scheduled cluster.		/// Record the next node in a scheduled cluster.
const SUnit *NextClusterPred = nullptr;		const SUnit *NextClusterPred = nullptr;
const SUnit *NextClusterSucc = nullptr;		const SUnit *NextClusterSucc = nullptr;

#ifndef NDEBUG		#ifndef NDEBUG
/// The number of instructions scheduled so far. Used to cut off the		/// The number of instructions scheduled so far. Used to cut off the
/// scheduler at the point determined by misched-cutoff.		/// scheduler at the point determined by misched-cutoff.
unsigned NumInstrsScheduled = 0;		unsigned NumInstrsScheduled = 0;
#endif		#endif

public:		public:
ScheduleDAGMI(MachineSchedContext *C, std::unique_ptr<MachineSchedStrategy> S,		ScheduleDAGMI(MachineSchedContext *C, std::unique_ptr<MachineSchedStrategy> S,
bool RemoveKillFlags)		bool RemoveKillFlags)
: ScheduleDAGInstrs(*C->MF, C->MLI, RemoveKillFlags), AA(C->AA),		: ScheduleDAGInstrs(*C->MF, C->MLI, RemoveKillFlags), AA(C->AA),
LIS(C->LIS), SchedImpl(std::move(S)), Topo(SUnits, &ExitSU) {}		LIS(C->LIS), SchedImpl(std::move(S)) {}

// Provide a vtable anchor		// Provide a vtable anchor
~ScheduleDAGMI() override;		~ScheduleDAGMI() override;

/// If this method returns true, handling of the scheduling regions		/// If this method returns true, handling of the scheduling regions
/// themselves (in case of a scheduling boundary in MBB) will be done		/// themselves (in case of a scheduling boundary in MBB) will be done
/// beginning with the topmost region of MBB.		/// beginning with the topmost region of MBB.
bool doMBBSchedRegionsTopDown() const override {		bool doMBBSchedRegionsTopDown() const override {
Show All 11 Lines	public:
/// building and before MachineSchedStrategy initialization.		/// building and before MachineSchedStrategy initialization.
///		///
/// ScheduleDAGMI takes ownership of the Mutation object.		/// ScheduleDAGMI takes ownership of the Mutation object.
void addMutation(std::unique_ptr<ScheduleDAGMutation> Mutation) {		void addMutation(std::unique_ptr<ScheduleDAGMutation> Mutation) {
if (Mutation)		if (Mutation)
Mutations.push_back(std::move(Mutation));		Mutations.push_back(std::move(Mutation));
}		}

/// True if an edge can be added from PredSU to SuccSU without creating
/// a cycle.
bool canAddEdge(SUnit *SuccSU, SUnit *PredSU);

/// Add a DAG edge to the given SU with the given predecessor
/// dependence data.
///
/// \returns true if the edge may be added without creating a cycle OR if an
/// equivalent edge already existed (false indicates failure).
bool addEdge(SUnit *SuccSU, const SDep &PredDep);

MachineBasicBlock::iterator top() const { return CurrentTop; }		MachineBasicBlock::iterator top() const { return CurrentTop; }
MachineBasicBlock::iterator bottom() const { return CurrentBottom; }		MachineBasicBlock::iterator bottom() const { return CurrentBottom; }

/// Implement the ScheduleDAGInstrs interface for handling the next scheduling		/// Implement the ScheduleDAGInstrs interface for handling the next scheduling
/// region. This covers all instructions in a block, while schedule() may only		/// region. This covers all instructions in a block, while schedule() may only
/// cover a subset.		/// cover a subset.
void enterRegion(MachineBasicBlock *bb,		void enterRegion(MachineBasicBlock *bb,
MachineBasicBlock::iterator begin,		MachineBasicBlock::iterator begin,
▲ Show 20 Lines • Show All 748 Lines • Show Last 20 Lines
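
The `canAddEdge`/`addEdge` contract documented in this header (an edge is refused when it would close a cycle) can be illustrated with a standalone toy graph. This is a sketch, not the LLVM implementation: `ToyGraph` is a hypothetical class, reachability is recomputed with a plain depth-first search instead of an incrementally maintained topological order, and the special handling of the exit node is left out.

```cpp
#include <cassert>
#include <vector>

// Toy dependency graph over nodes 0..N-1 (hypothetical, not an LLVM class).
class ToyGraph {
  std::vector<std::vector<int>> Succs; // Succs[N] = successors of node N

public:
  explicit ToyGraph(int NumNodes) : Succs(NumNodes) {}

  // True if To can be reached from From by following successor edges.
  // A node is considered reachable from itself.
  bool isReachable(int From, int To) const {
    std::vector<bool> Seen(Succs.size(), false);
    std::vector<int> Work{From};
    while (!Work.empty()) {
      int N = Work.back();
      Work.pop_back();
      if (N == To)
        return true;
      if (Seen[N])
        continue;
      Seen[N] = true;
      for (int S : Succs[N])
        Work.push_back(S);
    }
    return false;
  }

  // An edge Pred -> Succ closes a cycle exactly when Pred is already
  // reachable from Succ.
  bool canAddEdge(int Succ, int Pred) const {
    return !isReachable(Succ, Pred);
  }

  // Adds Pred -> Succ unless that would create a cycle.
  // Returns false on refusal, true otherwise.
  bool addEdge(int Succ, int Pred) {
    if (!canAddEdge(Succ, Pred))
      return false;
    Succs[Pred].push_back(Succ);
    return true;
  }
};
```

For example, after adding 0 -> 1 and 1 -> 2, a request for 2 -> 0 is refused while a redundant 0 -> 2 is accepted. A mutation that adds cluster edges needs this check wherever it runs, which is why the diff moves both methods and `Topo` down into the shared base class.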

llvm/include/llvm/CodeGen/ScheduleDAGInstrs.h

Show First 20 Lines • Show All 228 Lines • ▼ Show 20 Lines	protected:
/// Adds barrier chain edges from all SUs in map with higher NodeNums than		/// Adds barrier chain edges from all SUs in map with higher NodeNums than
/// this new BarrierChain, and remove them from map. It is assumed		/// this new BarrierChain, and remove them from map. It is assumed
/// BarrierChain has been set before calling this.		/// BarrierChain has been set before calling this.
void insertBarrierChain(Value2SUsMap &map);		void insertBarrierChain(Value2SUsMap &map);

/// For an unanalyzable memory access, this Value is used in maps.		/// For an unanalyzable memory access, this Value is used in maps.
UndefValue *UnknownValue;		UndefValue *UnknownValue;


		/// Topo - A topological ordering for SUnits which permits fast IsReachable
		/// and similar queries.
		ScheduleDAGTopologicalSort Topo;

using DbgValueVector =		using DbgValueVector =
std::vector<std::pair<MachineInstr *, MachineInstr *>>;		std::vector<std::pair<MachineInstr *, MachineInstr *>>;
/// Remember instruction that precedes DBG_VALUE.		/// Remember instruction that precedes DBG_VALUE.
/// These are generated by buildSchedGraph but persist so they can be		/// These are generated by buildSchedGraph but persist so they can be
/// referenced when emitting the final schedule.		/// referenced when emitting the final schedule.
DbgValueVector DbgValues;		DbgValueVector DbgValues;
MachineInstr *FirstDbgValue = nullptr;		MachineInstr *FirstDbgValue = nullptr;

▲ Show 20 Lines • Show All 88 Lines • ▼ Show 20 Lines	public:
std::string getGraphNodeLabel(const SUnit *SU) const override;		std::string getGraphNodeLabel(const SUnit *SU) const override;

/// Returns a label for the region of code covered by the DAG.		/// Returns a label for the region of code covered by the DAG.
std::string getDAGName() const override;		std::string getDAGName() const override;

/// Fixes register kill flags that scheduling has made invalid.		/// Fixes register kill flags that scheduling has made invalid.
void fixupKills(MachineBasicBlock &MBB);		void fixupKills(MachineBasicBlock &MBB);

		/// True if an edge can be added from PredSU to SuccSU without creating
		/// a cycle.
		bool canAddEdge(SUnit *SuccSU, SUnit *PredSU);

		/// Add a DAG edge to the given SU with the given predecessor
		/// dependence data.
		///
		/// \returns true if the edge may be added without creating a cycle OR if an
		/// equivalent edge already existed (false indicates failure).
		bool addEdge(SUnit *SuccSU, const SDep &PredDep);

protected:		protected:
void initSUnits();		void initSUnits();
void addPhysRegDataDeps(SUnit *SU, unsigned OperIdx);		void addPhysRegDataDeps(SUnit *SU, unsigned OperIdx);
void addPhysRegDeps(SUnit *SU, unsigned OperIdx);		void addPhysRegDeps(SUnit *SU, unsigned OperIdx);
void addVRegDefDeps(SUnit *SU, unsigned OperIdx);		void addVRegDefDeps(SUnit *SU, unsigned OperIdx);
void addVRegUseDeps(SUnit *SU, unsigned OperIdx);		void addVRegUseDeps(SUnit *SU, unsigned OperIdx);

/// Initializes register live-range state for updating kills.		/// Initializes register live-range state for updating kills.
Show All 36 Lines

llvm/lib/CodeGen/MachineScheduler.cpp

Show First 20 Lines • Show All 598 Lines • ▼ Show 20 Lines
// ScheduleDAGMI - Basic machine instruction scheduling. This is		// ScheduleDAGMI - Basic machine instruction scheduling. This is
// independent of PreRA/PostRA scheduling and involves no extra book-keeping for		// independent of PreRA/PostRA scheduling and involves no extra book-keeping for
// virtual registers.		// virtual registers.
// ===----------------------------------------------------------------------===/		// ===----------------------------------------------------------------------===/

// Provide a vtable anchor.		// Provide a vtable anchor.
ScheduleDAGMI::~ScheduleDAGMI() = default;		ScheduleDAGMI::~ScheduleDAGMI() = default;

bool ScheduleDAGMI::canAddEdge(SUnit *SuccSU, SUnit *PredSU) {
return SuccSU == &ExitSU || !Topo.IsReachable(PredSU, SuccSU);
}

bool ScheduleDAGMI::addEdge(SUnit *SuccSU, const SDep &PredDep) {
if (SuccSU != &ExitSU) {
// Do not use WillCreateCycle, it assumes SD scheduling.
// If Pred is reachable from Succ, then the edge creates a cycle.
if (Topo.IsReachable(PredDep.getSUnit(), SuccSU))
return false;
Topo.AddPred(SuccSU, PredDep.getSUnit());
}
SuccSU->addPred(PredDep, /*Required=*/!PredDep.isArtificial());
// Return true regardless of whether a new edge needed to be inserted.
return true;
}

/// ReleaseSucc - Decrement the NumPredsLeft count of a successor. When		/// ReleaseSucc - Decrement the NumPredsLeft count of a successor. When
/// NumPredsLeft reaches zero, release the successor node.		/// NumPredsLeft reaches zero, release the successor node.
///		///
/// FIXME: Adjust SuccSU height based on MinLatency.		/// FIXME: Adjust SuccSU height based on MinLatency.
void ScheduleDAGMI::releaseSucc(SUnit *SU, SDep *SuccEdge) {		void ScheduleDAGMI::releaseSucc(SUnit *SU, SDep *SuccEdge) {
SUnit *SuccSU = SuccEdge->getSUnit();		SUnit *SuccSU = SuccEdge->getSUnit();

if (SuccEdge->isWeak()) {		if (SuccEdge->isWeak()) {
▲ Show 20 Lines • Show All 124 Lines • ▼ Show 20 Lines
/// scheduling and potentially other custom schedulers.		/// scheduling and potentially other custom schedulers.
void ScheduleDAGMI::schedule() {		void ScheduleDAGMI::schedule() {
LLVM_DEBUG(dbgs() << "ScheduleDAGMI::schedule starting\n");		LLVM_DEBUG(dbgs() << "ScheduleDAGMI::schedule starting\n");
LLVM_DEBUG(SchedImpl->dumpPolicy());		LLVM_DEBUG(SchedImpl->dumpPolicy());

// Build the DAG.		// Build the DAG.
buildSchedGraph(AA);		buildSchedGraph(AA);

Topo.InitDAGTopologicalSorting();

postprocessDAG();		postprocessDAG();

SmallVector<SUnit*, 8> TopRoots, BotRoots;		SmallVector<SUnit*, 8> TopRoots, BotRoots;
findRootsAndBiasEdges(TopRoots, BotRoots);		findRootsAndBiasEdges(TopRoots, BotRoots);

LLVM_DEBUG(dump());		LLVM_DEBUG(dump());
if (PrintDAGs) dump();		if (PrintDAGs) dump();
if (ViewMISchedDAGs) viewGraph();		if (ViewMISchedDAGs) viewGraph();
▲ Show 20 Lines • Show All 432 Lines • ▼ Show 20 Lines
/// scheduling algorithm. However, if a scheduler further subclasses		/// scheduling algorithm. However, if a scheduler further subclasses
/// ScheduleDAGMILive then it will want to override this virtual method in order		/// ScheduleDAGMILive then it will want to override this virtual method in order
/// to update any specialized state.		/// to update any specialized state.
void ScheduleDAGMILive::schedule() {		void ScheduleDAGMILive::schedule() {
LLVM_DEBUG(dbgs() << "ScheduleDAGMILive::schedule starting\n");		LLVM_DEBUG(dbgs() << "ScheduleDAGMILive::schedule starting\n");
LLVM_DEBUG(SchedImpl->dumpPolicy());		LLVM_DEBUG(SchedImpl->dumpPolicy());
buildDAGWithRegPressure();		buildDAGWithRegPressure();

Topo.InitDAGTopologicalSorting();

postprocessDAG();		postprocessDAG();

SmallVector<SUnit*, 8> TopRoots, BotRoots;		SmallVector<SUnit*, 8> TopRoots, BotRoots;
findRootsAndBiasEdges(TopRoots, BotRoots);		findRootsAndBiasEdges(TopRoots, BotRoots);

// Initialize the strategy before modifying the DAG.		// Initialize the strategy before modifying the DAG.
// This may initialize a DFSResult to be used for queue priority.		// This may initialize a DFSResult to be used for queue priority.
SchedImpl->initialize(this);		SchedImpl->initialize(this);
▲ Show 20 Lines • Show All 303 Lines • ▼ Show 20 Lines
public:		public:
BaseMemOpClusterMutation(const TargetInstrInfo *tii,		BaseMemOpClusterMutation(const TargetInstrInfo *tii,
const TargetRegisterInfo *tri, bool IsLoad)		const TargetRegisterInfo *tri, bool IsLoad)
: TII(tii), TRI(tri), IsLoad(IsLoad) {}		: TII(tii), TRI(tri), IsLoad(IsLoad) {}

void apply(ScheduleDAGInstrs *DAGInstrs) override;		void apply(ScheduleDAGInstrs *DAGInstrs) override;

protected:		protected:
void clusterNeighboringMemOps(ArrayRef<SUnit *> MemOps, ScheduleDAGMI *DAG);		void clusterNeighboringMemOps(ArrayRef<SUnit *> MemOps, ScheduleDAGInstrs *DAG);
};		};

class StoreClusterMutation : public BaseMemOpClusterMutation {		class StoreClusterMutation : public BaseMemOpClusterMutation {
public:		public:
StoreClusterMutation(const TargetInstrInfo *tii,		StoreClusterMutation(const TargetInstrInfo *tii,
const TargetRegisterInfo *tri)		const TargetRegisterInfo *tri)
: BaseMemOpClusterMutation(tii, tri, false) {}		: BaseMemOpClusterMutation(tii, tri, false) {}
};		};
Show All 20 Lines	createStoreClusterDAGMutation(const TargetInstrInfo *TII,
const TargetRegisterInfo *TRI) {		const TargetRegisterInfo *TRI) {
return EnableMemOpCluster ? llvm::make_unique<StoreClusterMutation>(TII, TRI)		return EnableMemOpCluster ? llvm::make_unique<StoreClusterMutation>(TII, TRI)
: nullptr;		: nullptr;
}		}

} // end namespace llvm		} // end namespace llvm

void BaseMemOpClusterMutation::clusterNeighboringMemOps(		void BaseMemOpClusterMutation::clusterNeighboringMemOps(
ArrayRef<SUnit *> MemOps, ScheduleDAGMI *DAG) {		ArrayRef<SUnit *> MemOps, ScheduleDAGInstrs *DAG) {
SmallVector<MemOpInfo, 32> MemOpRecords;		SmallVector<MemOpInfo, 32> MemOpRecords;
for (SUnit *SU : MemOps) {		for (SUnit *SU : MemOps) {
MachineOperand *BaseOp;		MachineOperand *BaseOp;
int64_t Offset;		int64_t Offset;
if (TII->getMemOperandWithOffset(*SU->getInstr(), BaseOp, Offset, TRI))		if (TII->getMemOperandWithOffset(*SU->getInstr(), BaseOp, Offset, TRI))
MemOpRecords.push_back(MemOpInfo(SU, BaseOp, Offset));		MemOpRecords.push_back(MemOpInfo(SU, BaseOp, Offset));
}		}
if (MemOpRecords.size() < 2)		if (MemOpRecords.size() < 2)
Show All 23 Lines	if (TII->shouldClusterMemOps(*MemOpRecords[Idx].BaseOp,
}		}
++ClusterLength;		++ClusterLength;
} else		} else
ClusterLength = 1;		ClusterLength = 1;
}		}
}		}

/// Callback from DAG postProcessing to create cluster edges for loads.		/// Callback from DAG postProcessing to create cluster edges for loads.
void BaseMemOpClusterMutation::apply(ScheduleDAGInstrs *DAGInstrs) {		void BaseMemOpClusterMutation::apply(ScheduleDAGInstrs *DAG) {
ScheduleDAGMI *DAG = static_cast<ScheduleDAGMI *>(DAGInstrs);

// Map DAG NodeNum to store chain ID.		// Map DAG NodeNum to store chain ID.
DenseMap<unsigned, unsigned> StoreChainIDs;		DenseMap<unsigned, unsigned> StoreChainIDs;
// Map each store chain to a set of dependent MemOps.		// Map each store chain to a set of dependent MemOps.
SmallVector<SmallVector<SUnit*,4>, 32> StoreChainDependents;		SmallVector<SmallVector<SUnit*,4>, 32> StoreChainDependents;
for (SUnit &SU : DAG->SUnits) {		for (SUnit &SU : DAG->SUnits) {
if ((IsLoad && !SU.getInstr()->mayLoad()) ||		if ((IsLoad && !SU.getInstr()->mayLoad()) ||
(!IsLoad && !SU.getInstr()->mayStore()))		(!IsLoad && !SU.getInstr()->mayStore()))
continue;		continue;
▲ Show 20 Lines • Show All 2,106 Lines • Show Last 20 Lines

llvm/lib/CodeGen/MacroFusion.cpp

	Show All 30 Lines

	static cl::opt<bool> EnableMacroFusion("misched-fusion", cl::Hidden,			static cl::opt<bool> EnableMacroFusion("misched-fusion", cl::Hidden,
	cl::desc("Enable scheduling for macro fusion."), cl::init(true));			cl::desc("Enable scheduling for macro fusion."), cl::init(true));

	static bool isHazard(const SDep &Dep) {			static bool isHazard(const SDep &Dep) {
	return Dep.getKind() == SDep::Anti || Dep.getKind() == SDep::Output;			return Dep.getKind() == SDep::Anti || Dep.getKind() == SDep::Output;
	}			}

	static bool fuseInstructionPair(ScheduleDAGMI &DAG, SUnit &FirstSU,			static bool fuseInstructionPair(ScheduleDAGInstrs &DAG, SUnit &FirstSU,
	SUnit &SecondSU) {			SUnit &SecondSU) {
	// Check that neither instr is already paired with another along the edge			// Check that neither instr is already paired with another along the edge
	// between them.			// between them.
	for (SDep &SI : FirstSU.Succs)			for (SDep &SI : FirstSU.Succs)
	if (SI.isCluster())			if (SI.isCluster())
	return false;			return false;

	for (SDep &SI : SecondSU.Preds)			for (SDep &SI : SecondSU.Preds)
	if (SI.isCluster())			if (SI.isCluster())
	return false;			return false;
	// Though the reachability checks above could be made more generic,			// Though the reachability checks above could be made more generic,
	// perhaps as part of ScheduleDAGMI::addEdge(), since such edges are valid,			// perhaps as part of ScheduleDAGInstrs::addEdge(), since such edges are valid,
	// the extra computation cost makes it less interesting in general cases.			// the extra computation cost makes it less interesting in general cases.

	// Create a single weak edge between the adjacent instrs. The only effect is			// Create a single weak edge between the adjacent instrs. The only effect is
	// to cause bottom-up scheduling to heavily prioritize the clustered instrs.			// to cause bottom-up scheduling to heavily prioritize the clustered instrs.
	if (!DAG.addEdge(&SecondSU, SDep(&FirstSU, SDep::Cluster)))			if (!DAG.addEdge(&SecondSU, SDep(&FirstSU, SDep::Cluster)))
	return false;			return false;

	// Adjust the latency between both instrs.			// Adjust the latency between both instrs.
	▲ Show 20 Lines • Show All 52 Lines • ▼ Show 20 Lines

	namespace {			namespace {

	/// Post-process the DAG to create cluster edges between instrs that may			/// Post-process the DAG to create cluster edges between instrs that may
	/// be fused by the processor into a single operation.			/// be fused by the processor into a single operation.
	class MacroFusion : public ScheduleDAGMutation {			class MacroFusion : public ScheduleDAGMutation {
	ShouldSchedulePredTy shouldScheduleAdjacent;			ShouldSchedulePredTy shouldScheduleAdjacent;
	bool FuseBlock;			bool FuseBlock;
	bool scheduleAdjacentImpl(ScheduleDAGMI &DAG, SUnit &AnchorSU);			bool scheduleAdjacentImpl(ScheduleDAGInstrs &DAG, SUnit &AnchorSU);

	public:			public:
	MacroFusion(ShouldSchedulePredTy shouldScheduleAdjacent, bool FuseBlock)			MacroFusion(ShouldSchedulePredTy shouldScheduleAdjacent, bool FuseBlock)
	: shouldScheduleAdjacent(shouldScheduleAdjacent), FuseBlock(FuseBlock) {}			: shouldScheduleAdjacent(shouldScheduleAdjacent), FuseBlock(FuseBlock) {}

	void apply(ScheduleDAGInstrs *DAGInstrs) override;			void apply(ScheduleDAGInstrs *DAGInstrs) override;
	};			};

	} // end anonymous namespace			} // end anonymous namespace

	void MacroFusion::apply(ScheduleDAGInstrs *DAGInstrs) {			void MacroFusion::apply(ScheduleDAGInstrs *DAG) {
	ScheduleDAGMI *DAG = static_cast<ScheduleDAGMI *>(DAGInstrs);

	if (FuseBlock)			if (FuseBlock)
	// For each of the SUnits in the scheduling block, try to fuse the instr in			// For each of the SUnits in the scheduling block, try to fuse the instr in
	// it with one in its predecessors.			// it with one in its predecessors.
	for (SUnit &ISU : DAG->SUnits)			for (SUnit &ISU : DAG->SUnits)
	scheduleAdjacentImpl(*DAG, ISU);			scheduleAdjacentImpl(*DAG, ISU);

	if (DAG->ExitSU.getInstr())			if (DAG->ExitSU.getInstr())
	// Try to fuse the instr in the ExitSU with one in its predecessors.			// Try to fuse the instr in the ExitSU with one in its predecessors.
	scheduleAdjacentImpl(*DAG, DAG->ExitSU);			scheduleAdjacentImpl(*DAG, DAG->ExitSU);
	}			}

	/// Implement the fusion of instr pairs in the scheduling DAG,			/// Implement the fusion of instr pairs in the scheduling DAG,
	/// anchored at the instr in AnchorSU..			/// anchored at the instr in AnchorSU..
	bool MacroFusion::scheduleAdjacentImpl(ScheduleDAGMI &DAG, SUnit &AnchorSU) {			bool MacroFusion::scheduleAdjacentImpl(ScheduleDAGInstrs &DAG, SUnit &AnchorSU) {
	const MachineInstr &AnchorMI = *AnchorSU.getInstr();			const MachineInstr &AnchorMI = *AnchorSU.getInstr();
	const TargetInstrInfo &TII = *DAG.TII;			const TargetInstrInfo &TII = *DAG.TII;
	const TargetSubtargetInfo &ST = DAG.MF.getSubtarget();			const TargetSubtargetInfo &ST = DAG.MF.getSubtarget();

	// Check if the anchor instr may be fused.			// Check if the anchor instr may be fused.
	if (!shouldScheduleAdjacent(TII, ST, nullptr, AnchorMI))			if (!shouldScheduleAdjacent(TII, ST, nullptr, AnchorMI))
	return false;			return false;

	...

llvm/lib/CodeGen/ScheduleDAGInstrs.cpp

...
}		}

ScheduleDAGInstrs::ScheduleDAGInstrs(MachineFunction &mf,		ScheduleDAGInstrs::ScheduleDAGInstrs(MachineFunction &mf,
const MachineLoopInfo *mli,		const MachineLoopInfo *mli,
bool RemoveKillFlags)		bool RemoveKillFlags)
: ScheduleDAG(mf), MLI(mli), MFI(mf.getFrameInfo()),		: ScheduleDAG(mf), MLI(mli), MFI(mf.getFrameInfo()),
RemoveKillFlags(RemoveKillFlags),		RemoveKillFlags(RemoveKillFlags),
UnknownValue(UndefValue::get(		UnknownValue(UndefValue::get(
Type::getVoidTy(mf.getFunction().getContext()))) {		Type::getVoidTy(mf.getFunction().getContext()))), Topo(SUnits, &ExitSU) {
DbgValues.clear();		DbgValues.clear();

const TargetSubtargetInfo &ST = mf.getSubtarget();		const TargetSubtargetInfo &ST = mf.getSubtarget();
SchedModel.init(&ST);		SchedModel.init(&ST);
}		}

/// If this machine instr has memory reference information and it can be		/// If this machine instr has memory reference information and it can be
/// tracked to a normal reference to a known object, return the Value		/// tracked to a normal reference to a known object, return the Value
...	void ScheduleDAGInstrs::buildSchedGraph(AliasAnalysis *AA,

if (DbgMI)		if (DbgMI)
FirstDbgValue = DbgMI;		FirstDbgValue = DbgMI;

Defs.clear();		Defs.clear();
Uses.clear();		Uses.clear();
CurrentVRegDefs.clear();		CurrentVRegDefs.clear();
CurrentVRegUses.clear();		CurrentVRegUses.clear();

		Topo.InitDAGTopologicalSorting();
}		}

raw_ostream &llvm::operator<<(raw_ostream &OS, const PseudoSourceValue* PSV) {		raw_ostream &llvm::operator<<(raw_ostream &OS, const PseudoSourceValue* PSV) {
PSV->printCustom(OS);		PSV->printCustom(OS);
return OS;		return OS;
}		}

void ScheduleDAGInstrs::Value2SUsMap::dump() {		void ScheduleDAGInstrs::Value2SUsMap::dump() {
...
}		}

/// Return the basic block label. It is not necessarilly unique because a block		/// Return the basic block label. It is not necessarilly unique because a block
/// contains multiple scheduling regions. But it is fine for visualization.		/// contains multiple scheduling regions. But it is fine for visualization.
std::string ScheduleDAGInstrs::getDAGName() const {		std::string ScheduleDAGInstrs::getDAGName() const {
return "dag." + BB->getFullName();		return "dag." + BB->getFullName();
}		}

		bool ScheduleDAGInstrs::canAddEdge(SUnit *SuccSU, SUnit *PredSU) {
		return SuccSU == &ExitSU || !Topo.IsReachable(PredSU, SuccSU);
		}

		bool ScheduleDAGInstrs::addEdge(SUnit *SuccSU, const SDep &PredDep) {
		if (SuccSU != &ExitSU) {
		// Do not use WillCreateCycle, it assumes SD scheduling.
		// If Pred is reachable from Succ, then the edge creates a cycle.
		if (Topo.IsReachable(PredDep.getSUnit(), SuccSU))
		return false;
		Topo.AddPred(SuccSU, PredDep.getSUnit());
		}
		SuccSU->addPred(PredDep, /*Required=*/!PredDep.isArtificial());
		// Return true regardless of whether a new edge needed to be inserted.
		return true;
		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// SchedDFSResult Implementation		// SchedDFSResult Implementation
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

namespace llvm {		namespace llvm {

/// Internal state used to compute SchedDFSResult.		/// Internal state used to compute SchedDFSResult.
class SchedDFSImpl {		class SchedDFSImpl {
...

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp

	...
	}			}

	namespace {			namespace {
	struct MemOpClusterMutation : ScheduleDAGMutation {			struct MemOpClusterMutation : ScheduleDAGMutation {
	const SIInstrInfo *TII;			const SIInstrInfo *TII;

	MemOpClusterMutation(const SIInstrInfo *tii) : TII(tii) {}			MemOpClusterMutation(const SIInstrInfo *tii) : TII(tii) {}

	void apply(ScheduleDAGInstrs *DAGInstrs) override {			void apply(ScheduleDAGInstrs *DAG) override {
	ScheduleDAGMI *DAG = static_cast<ScheduleDAGMI *>(DAGInstrs);

	SUnit *SUa = nullptr;			SUnit *SUa = nullptr;
	// Search for two consequent memory operations and link them			// Search for two consequent memory operations and link them
	// to prevent scheduler from moving them apart.			// to prevent scheduler from moving them apart.
	// In DAG pre-process SUnits are in the original order of			// In DAG pre-process SUnits are in the original order of
	// the instructions before scheduling.			// the instructions before scheduling.
	for (SUnit &SU : DAG->SUnits) {			for (SUnit &SU : DAG->SUnits) {
	MachineInstr &MI2 = *SU.getInstr();			MachineInstr &MI2 = *SU.getInstr();
	if (!MI2.mayLoad() && !MI2.mayStore()) {			if (!MI2.mayLoad() && !MI2.mayStore()) {
	...

llvm/lib/Target/X86/X86Subtarget.h

...	bool enableIndirectBrExpand() const override {
return useRetpolineIndirectBranches();		return useRetpolineIndirectBranches();
}		}

/// Enable the MachineScheduler pass for all X86 subtargets.		/// Enable the MachineScheduler pass for all X86 subtargets.
bool enableMachineScheduler() const override { return true; }		bool enableMachineScheduler() const override { return true; }

bool enableEarlyIfConversion() const override;		bool enableEarlyIfConversion() const override;

		void getPostRAMutations(
		std::vector<std::unique_ptr<ScheduleDAGMutation>> &Mutations) const;

AntiDepBreakMode getAntiDepBreakMode() const override {		AntiDepBreakMode getAntiDepBreakMode() const override {
return TargetSubtargetInfo::ANTIDEP_CRITICAL;		return TargetSubtargetInfo::ANTIDEP_CRITICAL;
}		}

bool enableAdvancedRASplitCost() const override { return true; }		bool enableAdvancedRASplitCost() const override { return true; }
};		};

} // end namespace llvm		} // end namespace llvm

#endif // LLVM_LIB_TARGET_X86_X86SUBTARGET_H		#endif // LLVM_LIB_TARGET_X86_X86SUBTARGET_H

llvm/lib/Target/X86/X86Subtarget.cpp

	//===-- X86Subtarget.cpp - X86 Subtarget Information ----------------------===//			//===-- X86Subtarget.cpp - X86 Subtarget Information ----------------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file implements the X86 specific subclass of TargetSubtargetInfo.			// This file implements the X86 specific subclass of TargetSubtargetInfo.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "X86.h"			#include "X86.h"

	#include "X86CallLowering.h"			#include "X86CallLowering.h"
	#include "X86LegalizerInfo.h"			#include "X86LegalizerInfo.h"
				#include "X86MacroFusion.h"
	#include "X86RegisterBankInfo.h"			#include "X86RegisterBankInfo.h"
	#include "X86Subtarget.h"			#include "X86Subtarget.h"
	#include "MCTargetDesc/X86BaseInfo.h"			#include "MCTargetDesc/X86BaseInfo.h"
	#include "X86TargetMachine.h"			#include "X86TargetMachine.h"
	#include "llvm/ADT/Triple.h"			#include "llvm/ADT/Triple.h"
	#include "llvm/CodeGen/GlobalISel/CallLowering.h"			#include "llvm/CodeGen/GlobalISel/CallLowering.h"
	#include "llvm/CodeGen/GlobalISel/InstructionSelect.h"			#include "llvm/CodeGen/GlobalISel/InstructionSelect.h"
	#include "llvm/IR/Attributes.h"			#include "llvm/IR/Attributes.h"
	...

	const RegisterBankInfo *X86Subtarget::getRegBankInfo() const {			const RegisterBankInfo *X86Subtarget::getRegBankInfo() const {
	return RegBankInfo.get();			return RegBankInfo.get();
	}			}

	bool X86Subtarget::enableEarlyIfConversion() const {			bool X86Subtarget::enableEarlyIfConversion() const {
	return hasCMov() && X86EarlyIfConv;			return hasCMov() && X86EarlyIfConv;
	}			}

				void X86Subtarget::getPostRAMutations(
				std::vector<std::unique_ptr<ScheduleDAGMutation>> &Mutations) const {
				Mutations.push_back(createX86MacroFusionDAGMutation());
				}
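
For readers following the `ScheduleDAGInstrs::addEdge` change above: the guard it applies is "refuse the edge if the predecessor is already reachable from the successor, since that would close a cycle". Below is a minimal standalone sketch of that idea, not the LLVM implementation. `TinyDAG`, its index-based nodes, and the plain depth-first search are all hypothetical stand-ins; the patch itself relies on the incrementally maintained `Topo` ordering instead, and the `ExitSU` special case is left out here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a scheduling DAG: nodes are indices and
// Succs[N] lists the successors of node N.
struct TinyDAG {
  std::vector<std::vector<std::size_t>> Succs;

  explicit TinyDAG(std::size_t NumNodes) : Succs(NumNodes) {}

  // Returns true if To can be reached from From by following successor
  // edges. A node is considered reachable from itself.
  bool isReachable(std::size_t From, std::size_t To) const {
    std::vector<bool> Visited(Succs.size(), false);
    std::vector<std::size_t> Worklist{From};
    while (!Worklist.empty()) {
      std::size_t N = Worklist.back();
      Worklist.pop_back();
      if (N == To)
        return true;
      if (Visited[N])
        continue;
      Visited[N] = true;
      for (std::size_t S : Succs[N])
        Worklist.push_back(S);
    }
    return false;
  }

  // Adds the edge Pred -> Succ unless Pred is already reachable from Succ,
  // in which case the new edge would create a cycle and is rejected.
  bool addEdge(std::size_t Succ, std::size_t Pred) {
    if (isReachable(Succ, Pred))
      return false;
    Succs[Pred].push_back(Succ);
    return true;
  }
};
```

With three nodes and edges 0 -> 1 and 1 -> 2 already present, `addEdge(0, 2)` (that is, 2 -> 0) is rejected, while `addEdge(2, 0)` (a redundant forward edge 0 -> 2) is accepted.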


[X86] Make post-ra scheduling macrofusion-aware. Closed, Public
