llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
1984–1988 (On Diff #83716): Have you actually tested this properly? I am pretty sure this function is only called for terminator instructions. I am currently working on bigger rewrites of the macrofusion code to allow fusion inside basic blocks; I don't think that is possible today.

@MatzeB, please see the parent patch D28489.

MatzeB added inline comments. Jan 9 2017, 3:19 PM

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
1984–1988 (On Diff #83716): Just saw that you linked a dependent patch about that.

flyingforyou added a subscriber: flyingforyou. Jan 9 2017, 5:44 PM

kristof.beyls added a subscriber: kristof.beyls. Jan 10 2017, 3:19 AM

mcrosier added a subscriber: mcrosier. Jan 10 2017, 7:21 AM

The MacroFusion pass is currently being added before the RA runs. However, since the AArch64ExpandPseudo pass is run after the RA (in AArch64PassConfig::addPreSched2()), I wonder if it'd make more sense to run the MISched after the RA as well, and not before as it is now. Thoughts?

evandro added a subscriber: hiraditya. Jan 11 2017, 1:35 PM

In D28491#643256, @evandro wrote:

The MacroFusion pass is currently being added before the RA runs. However, since the AArch64ExpandPseudo pass is run after the RA (in AArch64PassConfig::addPreSched2()), I wonder if it'd make more sense to run the MISched after the RA as well, and not before as it is now. Thoughts?

There are a number of benefits when running the scheduler before register allocation (for example we can still reduce register pressure). We already have the PostMachineScheduler for scheduling again after regalloc (it's based on the same MISched framework but added considerably later in the pipeline; see also TargetSubtargetInfo::enablePostRAScheduler()).

In D28491#643259, @MatzeB wrote:

There are a number of benefits when running the scheduler before register allocation (for example we can still reduce register pressure). We already have the PostMachineScheduler for scheduling again after regalloc (it's based on the same MISched framework but added considerably later in the pipeline; see also TargetSubtargetInfo::enablePostRAScheduler()).

I'm asking this because, looking further at other pairs of instrs that A57 fuses, such as ADRP/ADD, they only appear in the instr stream after pseudo expansion.

In D28491#643283, @evandro wrote:

I'm asking this because, looking further at other pairs of instrs that A57 fuses, such as ADRP/ADD, they only appear in the instr stream after pseudo expansion.

Well if there is no reason to ever break the instructions apart, then using a Pseudo instruction and expanding that later may be the easier solution, is that the case for the AES instructions?

In D28491#643290, @MatzeB wrote:

Well if there is no reason to ever break the instructions apart, then using a Pseudo instruction and expanding that later may be the easier solution, is that the case for the AES instructions?

No, since they are pretty opaque. But the pseudo MOVaddr is expanded into the pair ADRP/ADD only after the RA. On A57, it's important to schedule them back to back, e.g., by running the MISched after the RA instead of before.

Or rather, I wonder why pseudo expansion is happening this late, when they are very simple instrs in AArch64. Methinks that expanding them sooner would expose them to more optimizations, yes?

evandro updated this revision to Diff 84372. Jan 13 2017, 1:59 PM

evandro edited edge metadata.

evandro updated this revision to Diff 86106. Jan 27 2017, 12:57 PM

evandro edited the summary of this revision.

Looks good overall.

llvm/lib/Target/AArch64/AArch64.td
183 (On Diff #86106): The features seem to be sorted alphabetically (same with the Exynos entry).
llvm/test/CodeGen/AArch64/misched-fusion.ll
13–19 (On Diff #86106): Why is this test affected here? I see no AES instructions.

evandro marked an inline comment as done. Jan 30 2017, 8:26 AM

evandro added inline comments.

llvm/test/CodeGen/AArch64/misched-fusion.ll
13–19 (On Diff #86106): Indeed, this test shouldn't be part of this patch.

evandro marked an inline comment as done. Jan 30 2017, 8:29 AM

evandro updated this revision to Diff 86294. Jan 30 2017, 8:44 AM

LGTM

This revision is now accepted and ready to land. Jan 30 2017, 10:38 AM

Final patch after approval.

Closed by commit rL293738: [AArch64] Add new subtarget feature to fuse AES crypto operations (authored by evandro). Jan 31 2017, 7:06 PM

This revision was automatically updated to reflect the committed changes.

Diff 86557

llvm/trunk/lib/Target/AArch64/AArch64.td

 def FeatureArithmeticBccFusion : SubtargetFeature<
     "arith-bcc-fusion", "HasArithmeticBccFusion", "true",
     "CPU fuses arithmetic+bcc operations">;

 def FeatureArithmeticCbzFusion : SubtargetFeature<
     "arith-cbz-fusion", "HasArithmeticCbzFusion", "true",
     "CPU fuses arithmetic + cbz/cbnz operations">;

+def FeatureFuseAES : SubtargetFeature<
+    "fuse-aes", "HasFuseAES", "true",
+    "CPU fuses AES crypto operations">;
+
 def FeatureDisableLatencySchedHeuristic : SubtargetFeature<
     "disable-latency-sched-heuristic", "DisableLatencySchedHeuristic", "true",
     "Disable latency scheduling heuristic">;

 def FeatureUseRSqrt : SubtargetFeature<
     "use-reciprocal-square-root", "UseRSqrt", "true",
     "Use the reciprocal square root approximation">;


 def ProcA57 : SubtargetFeature<"a57", "ARMProcFamily", "CortexA57",
     "Cortex-A57 ARM processors", [
     FeatureBalanceFPOps,
     FeatureCRC,
     FeatureCrypto,
     FeatureCustomCheapAsMoveHandling,
     FeatureFPARMv8,
+    FeatureFuseAES,
     FeatureNEON,
     FeaturePerfMon,
     FeaturePostRAScheduler,
     FeaturePredictableSelectIsExpensive
     ]>;

 def ProcA72 : SubtargetFeature<"a72", "ARMProcFamily", "CortexA72",
     "Cortex-A72 ARM processors", [

 def ProcExynosM1 : SubtargetFeature<"exynosm1", "ARMProcFamily", "ExynosM1",
     "Samsung Exynos-M1 processors",
     [FeatureSlowPaired128,
      FeatureCRC,
      FeatureCrypto,
      FeatureCustomCheapAsMoveHandling,
      FeatureFPARMv8,
+     FeatureFuseAES,
      FeatureNEON,
      FeaturePerfMon,
      FeaturePostRAScheduler,
      FeatureSlowMisaligned128Store,
      FeatureUseRSqrt,
      FeatureZCZeroing]>;

 def ProcExynosM2 : SubtargetFeature<"exynosm2", "ARMProcFamily", "ExynosM1",

llvm/trunk/lib/Target/AArch64/AArch64MacroFusion.cpp

   if (SecondOpcode == AArch64::CBNZW || SecondOpcode == AArch64::CBNZX ||
 ...
     case AArch64::BICWrs:
     case AArch64::BICXrs:
       // Shift value can be 0 making these behave like the "rr" variant...
       return !TII.hasShiftedReg(*First);
     case AArch64::INSTRUCTION_LIST_END:
       return true;
     }

+  if (ST.hasFuseAES())
+    // Fuse AES crypto operations.
+    switch(FirstOpcode) {
+    // AES encode.
+    case AArch64::AESErr:
+      return SecondOpcode == AArch64::AESMCrr ||
+             SecondOpcode == AArch64::INSTRUCTION_LIST_END;
+    // AES decode.
+    case AArch64::AESDrr:
+      return SecondOpcode == AArch64::AESIMCrr ||
+             SecondOpcode == AArch64::INSTRUCTION_LIST_END;
+    }
+
   return false;
 }

 /// \brief Implement the fusion of instruction pairs in the scheduling
 /// \param DAG, anchored at the instruction in \param ASU. \param Preds
 /// indicates if its dependencies in \param APreds are predecessors instead of
 /// successors.
 static bool scheduleAdjacentImpl(ScheduleDAGMI *DAG, SUnit *ASU,

llvm/trunk/lib/Target/AArch64/AArch64SchedA57.td

 def : InstRW<[A57Write_1cyc_1I], (instrs EXTRWrri)>;
 def : InstRW<[A57Write_3cyc_1I_1M], (instrs EXTRXrri)>;
 def : InstRW<[A57Write_2cyc_1M], (instregex "BFM")>;


 // Cryptography Extensions
 // -----------------------------------------------------------------------------

-def : InstRW<[A57Write_3cyc_1W], (instregex "^AES")>;
+def A57ReadAES : SchedReadAdvance<3, [A57Write_3cyc_1W]>;
+def : InstRW<[A57Write_3cyc_1W], (instregex "^AES[DE]")>;
+def : InstRW<[A57Write_3cyc_1W, A57ReadAES], (instregex "^AESI?MC")>;
 def : InstRW<[A57Write_6cyc_2V], (instregex "^SHA1SU0")>;
 def : InstRW<[A57Write_3cyc_1W], (instregex "^SHA1(H|SU1)")>;
 def : InstRW<[A57Write_6cyc_2W], (instregex "^SHA1[CMP]")>;
 def : InstRW<[A57Write_3cyc_1W], (instregex "^SHA256SU0")>;
 def : InstRW<[A57Write_6cyc_2W], (instregex "^SHA256(H|H2|SU1)")>;
 def : InstRW<[A57Write_3cyc_1W], (instregex "^CRC32")>;



llvm/trunk/lib/Target/AArch64/AArch64SchedM1.td


 // ASIMD load instructions.

 // ASIMD store instructions.

 // Cryptography instructions.
 def M1WriteAES : SchedWriteRes<[M1UnitNCRYPT]> { let Latency = 1; }
 def M1ReadAES : SchedReadAdvance<1, [M1WriteAES]>;
-def : InstRW<[M1WriteAES, M1ReadAES], (instregex "^AES")>;
+def : InstRW<[M1WriteAES], (instregex "^AES[DE]")>;
+def : InstRW<[M1WriteAES, M1ReadAES], (instregex "^AESI?MC")>;

 def : InstRW<[M1WriteNCRYPT1], (instregex "^PMUL")>;
 def : InstRW<[M1WriteNCRYPT1], (instregex "^SHA1(H|SU)")>;
 def : InstRW<[M1WriteNCRYPT5], (instregex "^SHA1[CMP]")>;
 def : InstRW<[M1WriteNCRYPT1], (instregex "^SHA256SU0")>;
 def : InstRW<[M1WriteNCRYPT5], (instregex "^SHA256(H|SU1)")>;

 // CRC instructions.
 def : InstRW<[M1WriteC2], (instregex "^CRC32")>;

 } // SchedModel = ExynosM1Model

llvm/trunk/lib/Target/AArch64/AArch64Subtarget.h

 protected:
 ...
   bool BalanceFPOps = false;
   bool CustomAsCheapAsMove = false;
   bool UsePostRAScheduler = false;
   bool Misaligned128StoreIsSlow = false;
   bool Paired128IsSlow = false;
   bool UseAlternateSExtLoadCVTF32Pattern = false;
   bool HasArithmeticBccFusion = false;
   bool HasArithmeticCbzFusion = false;
+  bool HasFuseAES = false;
   bool DisableLatencySchedHeuristic = false;
   bool UseRSqrt = false;
   uint8_t MaxInterleaveFactor = 2;
   uint8_t VectorInsertExtractBaseCost = 3;
   uint16_t CacheLineSize = 0;
   uint16_t PrefetchDistance = 0;
   uint16_t MinPrefetchStride = 1;
   unsigned MaxPrefetchIterationsAhead = UINT_MAX;
 public:
 ...
   bool hasCustomCheapAsMoveHandling() const { return CustomAsCheapAsMove; }
   bool isMisaligned128StoreSlow() const { return Misaligned128StoreIsSlow; }
   bool isPaired128Slow() const { return Paired128IsSlow; }
   bool useAlternateSExtLoadCVTF32Pattern() const {
     return UseAlternateSExtLoadCVTF32Pattern;
   }
   bool hasArithmeticBccFusion() const { return HasArithmeticBccFusion; }
   bool hasArithmeticCbzFusion() const { return HasArithmeticCbzFusion; }
+  bool hasFuseAES() const { return HasFuseAES; }
   bool useRSqrt() const { return UseRSqrt; }
   unsigned getMaxInterleaveFactor() const { return MaxInterleaveFactor; }
   unsigned getVectorInsertExtractBaseCost() const {
     return VectorInsertExtractBaseCost;
   }
   unsigned getCacheLineSize() const { return CacheLineSize; }
   unsigned getPrefetchDistance() const { return PrefetchDistance; }
   unsigned getMinPrefetchStride() const { return MinPrefetchStride; }

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add new subtarget feature to fuse AES crypto operations
Closed, Public
