This is an archive of the discontinued LLVM Phabricator instance.

Differential D35228

[TableGen] Add support for instruction clusters
Needs Review, Public

Authored by evandro on Jul 10 2017, 4:29 PM.

Details

Reviewers

javed.absar
t.p.northover
joelkevinjones
fhahn

Summary

This change allows the description of instruction clusters by using the new ReadCluster and SchedReadCluster classes, analogously to the ReadAdvance and SchedReadAdvance classes, respectively.

In a subsequent patch, the instruction scheduler is modified to use this information.

The motivation is to allow a target maintainer to specify in the machine model which instruction pairs should be clustered together. Clustering is thus done as part of instruction scheduling, instead of relying on a scheduling mutation and adding code to the backend to match the instructions.

Diff Detail

Event Timeline

evandro created this revision. Jul 10 2017, 4:29 PM

evandro added a child revision: D35229: [CodeGen] Add support for instruction clusters. Jul 10 2017, 4:32 PM

Can you give an example of a typical use? Is this just a different way to express macrofusion opportunities, or is it more or less powerful?

javed.absar added inline comments. Jul 11 2017, 1:52 AM

llvm/include/llvm/MC/MCSchedule.h
93	Shouldn't this function be amended as well to equate Cluster?
llvm/include/llvm/Target/TargetSchedule.td
312	Perhaps the order of the parameters should be - cycles, cluster, writes = []
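The first inline comment above can be illustrated with a minimal sketch. `ReadAdvanceEntrySketch` is a hypothetical stand-in for `MCReadAdvanceEntry`, not the code in this revision; it only shows what an equality operator that also compares `Cluster` might look like. This presumably matters wherever entries are compared, for instance if identical table rows are merged.

```cpp
#include <cassert>

// Hypothetical stand-in for llvm::MCReadAdvanceEntry as extended by this
// revision; the name and layout are assumptions for illustration only.
struct ReadAdvanceEntrySketch {
  unsigned UseIdx;
  unsigned WriteResourceID;
  int Cycles;
  bool Cluster;

  // Compare every field, including Cluster, so that two entries that differ
  // only in their clustering request are not treated as equal.
  bool operator==(const ReadAdvanceEntrySketch &Other) const {
    return UseIdx == Other.UseIdx &&
           WriteResourceID == Other.WriteResourceID &&
           Cycles == Other.Cycles && Cluster == Other.Cluster;
  }
};
```

With this operator, two entries with the same `UseIdx`, `WriteResourceID` and `Cycles` but different `Cluster` values compare unequal.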

In D35228#804631, @MatzeB wrote:

Can you give an example of a typical use? Is this just a different way to express macrofusion opportunities, or is it more or less powerful?

I agree it would be helpful to see an example, e.g. how this could be used to cluster AArch64 AES instructions.

If this is intended as a replacement for the MacroFusion pass, could the more complicated constraints for fusing MOVK/MOVZ on AArch64 be expressed (lib/Target/AArch64/AArch64MacroFusion.cpp from line 130)?

llvm/include/llvm/MC/MCSchedule.h
91	I assume you added the cluster bit here because the whole machinery for handling "ReadAdvance" can easily be extended to add cluster dependencies? It feels that after adding the cluster bit, most related function/class names are slightly misleading, e.g. `getReadAdvanceCycles` now gets not only the number of cycles but also the cluster bit.
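One way to address the naming concern above is for the lookup to return both values under a name that covers both. The following is a sketch only: `EntrySketch`, `ReadAdvanceInfo` and `getReadAdvanceInfo` are hypothetical names that do not appear in this revision, and a `std::vector` stands in for the generated table.

```cpp
#include <cassert>
#include <vector>

// Hypothetical mirror of the extended MCReadAdvanceEntry.
struct EntrySketch {
  unsigned UseIdx;
  unsigned WriteResourceID; // 0 matches any write resource
  int Cycles;
  bool Cluster;
};

// Hypothetical result type naming both pieces of information.
struct ReadAdvanceInfo {
  int Cycles;
  bool Cluster;
};

// Hypothetical alternative to getReadAdvanceCycles(). Entries are assumed to
// be sorted by UseIdx, as the comment in MCSchedule.h describes.
inline ReadAdvanceInfo getReadAdvanceInfo(const std::vector<EntrySketch> &Table,
                                          unsigned UseIdx,
                                          unsigned WriteResID) {
  for (const EntrySketch &E : Table) {
    if (E.UseIdx < UseIdx)
      continue;
    if (E.UseIdx > UseIdx)
      break;
    // The first matching entry wins, mirroring the loop in MCSubtargetInfo.h.
    if (E.WriteResourceID == 0 || E.WriteResourceID == WriteResID)
      return ReadAdvanceInfo{E.Cycles, E.Cluster};
  }
  return ReadAdvanceInfo{0, false}; // No entry: no advance, no clustering.
}
```

A caller would read `Info.Cycles` and `Info.Cluster` from the returned value instead of passing an optional `bool *` out-parameter.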

evandro added a child revision: D35260: [AArch64] Move AES instruction fusion support. Jul 11 2017, 8:48 AM

I posted the preliminary patch D35260 to illustrate how this change, along with D35229, can be used to simplify instruction fusion.

Thanks @evandro, I think the example is really helpful for judging the impact of the set of changes.

I am not sure I see clear advantages of the new approach over the macro fusion DAG mutation, though. It seems to me that the new approach spreads the implementation and definitions over multiple files, whereas the DAG mutation is basically 2 files (the generic pass and the target-specific implementation of shouldScheduleAdjacent) and it's relatively easy to see exactly what's going on, even though overall it probably requires slightly more code than the new approach.

As can be seen from the example, the fusion does not spread definitions over multiple files. In this proposed approach, fusion is specified in the machine model. Moreover, it folds fusion into machine scheduling and decreases its run-time cost, all while yielding the same result as before.

In D35228#807205, @evandro wrote:

As can be seen from the example, the fusion does not spread definitions over multiple files. In this proposed approach, fusion is specified in the machine model. Moreover, it folds fusion into machine scheduling and decreases its run-time cost, all while yielding the same result as before.

I agree, it does not spread the information out over multiple files. However, the fact that the clusters are just specified by numbers makes the implementation less self-explanatory (i.e. it is really not clear what the numbers mean in the model files). I suggest that you add separate tablegen classes that represent the clusters so that you can refer to the clusters by name in the scheduling definitions.

Address feedback by @hfinkel.

In D35228#807888, @hfinkel wrote:

I agree, it does not spread the information out over multiple files. However, the fact that the clusters are just specified by numbers makes the implementation less self-explanatory (i.e. it is really not clear what the numbers mean in the model files). I suggest that you add separate tablegen classes that represent the clusters so that you can refer to the clusters by name in the scheduling definitions.

Agreed, putting the fusion information in the scheduling model makes sense conceptually, as it's a scheduling problem.

By spreading out, I meant the changes to the machine scheduler related files (but that's mostly just passing through the cluster info) and that we need to add fusion entries to all relevant machine models, whereas before there was a single file describing which instructions should be fused.

For AArch64, there might be 2 potential problems:

Currently machine models are shared by different CPUs. What if one CPU supports fusing more instructions than the other? I'm not sure if that would be a practical issue at the moment, but it may be worth considering.
For fusing some instructions, more complicated constraints are used, e.g. CBZ and some arithmetic instructions are only fused if no shifted register is used [1] or MOVZWi should only be fused with MOVKWi, if the immediate used with MOVKWi == 16 [2]. Could such constraints be expressed in the machine model?

[1] https://github.com/llvm-mirror/llvm/blob/master/lib/Target/AArch64/AArch64MacroFusion.cpp#L70
[2] https://github.com/llvm-mirror/llvm/blob/master/lib/Target/AArch64/AArch64MacroFusion.cpp#L140

tschuett added a subscriber: tschuett. Jul 15 2017, 10:45 AM

In D35228#810596, @fhahn wrote:

In D35228#807888, @hfinkel wrote:

I agree, it does not spread the information out over multiple files. However, the fact that the clusters are just specified by numbers makes the implementation less self-explanatory (i.e. it is really not clear what the numbers mean in the model files). I suggest that you add separate tablegen classes that represent the clusters so that you can refer to the clusters by name in the scheduling definitions.

Agreed, putting the fusion information in the scheduling model makes sense conceptually, as it's a scheduling problem.

By spreading out, I meant the changes to the machine scheduler related files (but that's mostly just passing through the cluster info) and that we need to add fusion entries to all relevant machine models, whereas before there was a single file describing which instructions should be fused.

For AArch64, there might be 2 potential problems:

Currently machine models are shared by different CPUs. What if one CPU supports fusing more instructions than the other? I'm not sure if that would be a practical issue at the moment, but it may be worth considering.

For fusing some instructions, more complicated constraints are used, e.g. CBZ and some arithmetic instructions are only fused if no shifted register is used [1] or MOVZWi should only be fused with MOVKWi, if the immediate used with MOVKWi == 16 [2]. Could such constraints be expressed in the machine model?

Maybe the TableGen model could support optional C++ predicates (much like we do for isel). We could also have a target callback that can override the td-file clustering for special cases.

[1] https://github.com/llvm-mirror/llvm/blob/master/lib/Target/AArch64/AArch64MacroFusion.cpp#L70
[2] https://github.com/llvm-mirror/llvm/blob/master/lib/Target/AArch64/AArch64MacroFusion.cpp#L140
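The idea of an optional predicate that refines the clustering given in the machine model can be sketched as follows. Everything here is hypothetical: `InstrSketch`, `ClusterPredicate`, `shouldCluster` and `movzMovkPredicate` are illustrative names, and a real implementation would inspect `MachineInstr` operands rather than this simplified record.

```cpp
#include <cassert>
#include <functional>

// Hypothetical, simplified instruction record for illustration.
enum class Opcode { MOVZWi, MOVKWi, Other };

struct InstrSketch {
  Opcode Op;
  unsigned Imm; // the immediate checked by the constraint
};

// Hypothetical signature for an optional predicate attached to a cluster
// entry in the machine model.
using ClusterPredicate =
    std::function<bool(const InstrSketch &First, const InstrSketch &Second)>;

// The cluster bit from the model is the default; an optional predicate may
// veto clustering for special cases.
inline bool shouldCluster(bool ModelClusterBit, const ClusterPredicate &Pred,
                          const InstrSketch &First,
                          const InstrSketch &Second) {
  if (!ModelClusterBit)
    return false;
  return !Pred || Pred(First, Second);
}

// Example predicate for the MOVZWi/MOVKWi constraint mentioned above:
// cluster only when the MOVKWi immediate equals 16.
inline bool movzMovkPredicate(const InstrSketch &First,
                              const InstrSketch &Second) {
  return First.Op == Opcode::MOVZWi && Second.Op == Opcode::MOVKWi &&
         Second.Imm == 16;
}
```

An empty predicate leaves the decision from the model untouched, so simple fusion pairs would need no extra code.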

RKSimon added a subscriber: RKSimon. Jul 22 2017, 10:51 AM

fhahn resigned from this revision. Dec 15 2017, 12:35 PM

evandro mentioned this in D42392: [AArch64] Add new target feature to fuse conditional select. Jan 26 2018, 10:05 AM

evandro added a subscriber: evandro. Mar 7 2018, 9:02 AM

Revision Contents

Path                                          Size
llvm/include/llvm/MC/MCSchedule.h             6 lines
llvm/include/llvm/MC/MCSubtargetInfo.h        4 lines
llvm/include/llvm/Target/TargetSchedule.td    19 lines
llvm/utils/TableGen/SubtargetEmitter.cpp      18 lines

Diff 106527

llvm/include/llvm/MC/MCSchedule.h

@@ 72 lines not shown @@ struct MCWriteLatencyEntry {
   bool operator==(const MCWriteLatencyEntry &Other) const {
     return Cycles == Other.Cycles && WriteResourceID == Other.WriteResourceID;
   }
 };

 /// Specify the number of cycles allowed after instruction issue before a
 /// particular use operand reads its registers. This effectively reduces the
 /// write's latency. Here we allow negative cycles for corner cases where
-/// latency increases. This rule only applies when the entry's WriteResource
-/// matches the write's WriteResource.
+/// latency increases. Optionally, the read instruction may be clustered with
+/// the write instruction. This rule only applies when the entry's
+/// WriteResource matches the write's WriteResource.
 ///
 /// MCReadAdvanceEntries are sorted first by operand index (UseIdx), then by
 /// WriteResourceIdx.
 struct MCReadAdvanceEntry {
   unsigned UseIdx;
   unsigned WriteResourceID;
   int Cycles;
+  bool Cluster;

   bool operator==(const MCReadAdvanceEntry &Other) const {
     return UseIdx == Other.UseIdx && WriteResourceID == Other.WriteResourceID
       && Cycles == Other.Cycles;
   }
 };

 /// Summarize the scheduling resources required for an instruction of a
 /// particular scheduling class.
 ///
@@ 134 lines not shown @@

llvm/include/llvm/MC/MCSubtargetInfo.h

@@ 139 lines not shown @@
   const MCWriteLatencyEntry *getWriteLatencyEntry(const MCSchedClassDesc *SC,
                                                   unsigned DefIdx) const {
     assert(DefIdx < SC->NumWriteLatencyEntries &&
            "MachineModel does not specify a WriteResource for DefIdx");

     return &WriteLatencyTable[SC->WriteLatencyIdx + DefIdx];
   }

   int getReadAdvanceCycles(const MCSchedClassDesc *SC, unsigned UseIdx,
-                           unsigned WriteResID) const {
+                           unsigned WriteResID, bool *Cluster = nullptr) const {
     // TODO: The number of read advance entries in a class can be significant
     // (~50). Consider compressing the WriteID into a dense ID of those that are
     // used by ReadAdvance and representing them as a bitset.
     for (const MCReadAdvanceEntry *I = &ReadAdvanceTable[SC->ReadAdvanceIdx],
            *E = I + SC->NumReadAdvanceEntries; I != E; ++I) {
       if (I->UseIdx < UseIdx)
         continue;
       if (I->UseIdx > UseIdx)
         break;
       // Find the first WriteResIdx match, which has the highest cycle count.
       if (!I->WriteResourceID || I->WriteResourceID == WriteResID) {
+        if (Cluster != nullptr)
+          *Cluster = I->Cluster;
         return I->Cycles;
       }
     }
     return 0;
   }

   /// getInstrItineraryForCPU - Get scheduling itinerary of a CPU.
   ///
@@ 24 lines not shown @@

llvm/include/llvm/Target/TargetSchedule.td

@@ 303 lines not shown @@
 // type at the same time. This class is unaware of its SchedModel so
 // must be referenced by InstRW or ItinRW.
 class SchedWriteRes<list<ProcResourceKind> resources> : SchedWrite,
   ProcWriteResources<resources>;

 // Define values common to ReadAdvance and SchedReadAdvance.
 //
 // SchedModel ties these resources to a processor.
 class ProcReadAdvance<int cycles, list<SchedWrite> writes = []> {
   int Cycles = cycles;
   list<SchedWrite> ValidWrites = writes;
+  bit Cluster = 0;
   // Allow a processor to mark some scheduling classes as unsupported
   // for stronger verification.
   bit Unsupported = 0;
   SchedMachineModel SchedModel = ?;
 }

 // A processor may define a ReadAdvance associated with a SchedRead
 // to reduce latency of a prior write by N cycles. A negative advance
 // effectively increases latency, which may be used for cross-domain
 // stalls.
 //
 // A ReadAdvance may be associated with a list of SchedWrites
 // to implement pipeline bypass. The Writes list may be empty to
 // indicate operands that are always read this number of Cycles later
 // than a normal register read, allowing the read's parent instruction
 // to issue earlier relative to the writer.
 class ReadAdvance<SchedRead read, int cycles, list<SchedWrite> writes = []>
   : ProcReadAdvance<cycles, writes> {
   SchedRead ReadType = read;
 }

+// A processor may define a similar pipeline bypass that also requires that the
+// reader and writer instructions be clustered together and scheduled back to
+// back.
+class ReadCluster<SchedRead read, int cycles, list<SchedWrite> writes = []>
+  : ReadAdvance<read, cycles, writes> {
+  let Cluster = 1;
+}
+
 // Directly associate a new SchedRead type with a delay and optional
 // pipeline bypass. For use with InstRW or ItinRW.
-class SchedReadAdvance<int cycles, list<SchedWrite> writes = []> : SchedRead,
-  ProcReadAdvance<cycles, writes>;
+class SchedReadAdvance<int cycles, list<SchedWrite> writes = []>
+  : SchedRead, ProcReadAdvance<cycles, writes>;

+// Likewise, with clustered instructions.
+class SchedReadCluster<int cycles, list<SchedWrite> writes = []>
+  : SchedReadAdvance<cycles, writes> {
+  let Cluster = 1;
+}
+
 // Define SchedRead defaults. Reads seldom need special treatment.
 def ReadDefault : SchedRead;
 def NoReadAdvance : SchedReadAdvance<0>;

 // Define shared code that will be in the same scope as all
 // SchedPredicates. Available variables are:
 // (const MachineInstr *MI, const TargetSchedModel *SchedModel)
@@ 92 lines not shown @@

llvm/utils/TableGen/SubtargetEmitter.cpp

@@ 974 lines not shown @@ for (unsigned UseIdx = 0, EndIdx = Reads.size();
           }
         }
         std::sort(WriteIDs.begin(), WriteIDs.end());
         for(unsigned W : WriteIDs) {
           MCReadAdvanceEntry RAEntry;
           RAEntry.UseIdx = UseIdx;
           RAEntry.WriteResourceID = W;
           RAEntry.Cycles = ReadAdvance->getValueAsInt("Cycles");
+          RAEntry.Cluster = ReadAdvance->getValueAsBit("Cluster");
           ReadAdvanceEntries.push_back(RAEntry);
         }
       }
       if (SCDesc.NumMicroOps == MCSchedClassDesc::InvalidNumMicroOps) {
         WriteProcResources.clear();
         WriteLatencies.clear();
         ReadAdvanceEntries.clear();
       }
@@ 86 lines not shown @@
     OS << " {" << format("%2d", WLEntry.Cycles) << ", "
        << format("%2d", WLEntry.WriteResourceID) << "}";
     if (WLIdx + 1 < WLEnd)
       OS << ',';
     OS << " // #" << WLIdx << " " << SchedTables.WriterNames[WLIdx] << '\n';
   }
   OS << "}; // " << Target << "WriteLatencyTable\n";

   // Emit global ReadAdvanceTable.
-  OS << "\n// {UseIdx, WriteResourceID, Cycles}\n"
+  OS << "\n// {UseIdx, WriteResourceID, Cycles, Cluster}\n"
      << "extern const llvm::MCReadAdvanceEntry "
      << Target << "ReadAdvanceTable[] = {\n"
-     << " {0, 0, 0}, // Invalid\n";
+     << " {0, 0, 0, 0}, // Invalid\n";
   for (unsigned RAIdx = 1, RAEnd = SchedTables.ReadAdvanceEntries.size();
        RAIdx != RAEnd; ++RAIdx) {
     MCReadAdvanceEntry &RAEntry = SchedTables.ReadAdvanceEntries[RAIdx];
-    OS << " {" << RAEntry.UseIdx << ", "
-       << format("%2d", RAEntry.WriteResourceID) << ", "
-       << format("%2d", RAEntry.Cycles) << "}";
-    if (RAIdx + 1 < RAEnd)
-      OS << ',';
-    OS << " // #" << RAIdx << '\n';
+    OS << " {"
+       << RAEntry.UseIdx << ", "
+       << format("%3d", RAEntry.WriteResourceID) << ", "
+       << format("%2d", RAEntry.Cycles) << ", "
+       << RAEntry.Cluster << "}"
+       << (RAIdx + 1 < RAEnd ? ',' : ' ')
+       << " // #" << RAIdx << '\n';
   }
   OS << "}; // " << Target << "ReadAdvanceTable\n";

   // Emit a SchedClass table for each processor.
   for (CodeGenSchedModels::ProcIter PI = SchedModels.procModelBegin(),
          PE = SchedModels.procModelEnd(); PI != PE; ++PI) {
     if (!PI->hasInstrSchedModel())
       continue;
@@ 428 lines not shown @@
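To make the new row layout of the emitted ReadAdvanceTable concrete, here is a standalone sketch that formats one row in the `{UseIdx, WriteResourceID, Cycles, Cluster}` order used by the patched emitter. `RowSketch` and `formatRow` are hypothetical names, and `std::snprintf` stands in for LLVM's `raw_ostream` and `format`, so the exact spacing is an assumption based on the format strings in the diff.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical mirror of one ReadAdvanceTable row.
struct RowSketch {
  unsigned UseIdx;
  unsigned WriteResourceID;
  int Cycles;
  bool Cluster;
};

// Format a row as "{UseIdx, WriteResourceID, Cycles, Cluster}" followed by a
// comma (or a space for the last row) and the row index as a comment.
inline std::string formatRow(const RowSketch &R, unsigned Idx, bool IsLast) {
  char Buf[96];
  std::snprintf(Buf, sizeof(Buf), " {%u, %3u, %2d, %d}%c // #%u\n", R.UseIdx,
                R.WriteResourceID, R.Cycles, R.Cluster ? 1 : 0,
                IsLast ? ' ' : ',', Idx);
  return std::string(Buf);
}
```

Emitting a space instead of a comma after the last row keeps the trailing comments aligned, which appears to be the intent of the conditional expression in the patched emitter.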