This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Load Balancing for AES instructions on Cortex-A57
Abandoned, Public

Authored by zzheng on Nov 6 2014, 12:57 PM.

Details

Reviewers

t.p.northover
jmolloy
Jiangning
apazos

Summary

On Cortex-A57, an AES chain performs best when the same accumulation register is kept through the entire chain. This patch uses the A57 FP load balancing pass to ensure that we emit such an AES instruction sequence.
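
To make the property in the summary concrete, here is a minimal standalone sketch of a check that a chain keeps one accumulation register throughout. It is not code from the patch: `ToyAESInstr` and `keepsSingleAccumulator` are hypothetical names, and the model assumes each AES instruction has exactly one destination and one accumulator operand.

```cpp
#include <vector>

// Hypothetical, simplified model of one AES instruction: it writes `dest`
// and reads `accum`, the register carrying the running AES state.
struct ToyAESInstr {
  unsigned dest;
  unsigned accum;
};

// Returns true when every instruction in the chain reads and writes the
// same register, i.e. the accumulation register never changes.
bool keepsSingleAccumulator(const std::vector<ToyAESInstr> &chain) {
  if (chain.empty())
    return true;
  const unsigned reg = chain.front().dest;
  for (const ToyAESInstr &mi : chain) {
    if (mi.dest != reg || mi.accum != reg)
      return false;
  }
  return true;
}
```

A chain such as aese/aesmc/aesd/aesimc all operating on v0 passes this check, while a chain whose aesmc writes a different register than it reads does not.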

Event Timeline

zzheng updated this revision to Diff 15889. Nov 6 2014, 12:57 PM

zzheng retitled this revision to [AArch64] Load Balancing for AES instructions on Cortex-A57.

zzheng updated this object.

zzheng edited the test plan for this revision.

zzheng added reviewers: jmolloy, t.p.northover, apazos, mcrosier.

zzheng added a subscriber: Unknown Object (MLST).

Herald added a subscriber: aemerson. Nov 6 2014, 12:57 PM

aadg added a subscriber: aadg. Nov 6 2014, 1:20 PM

mcrosier added a reviewer: Jiangning. Nov 6 2014, 1:49 PM

The code looks ok, but I'm not sure about the architecture decisions. I'd rather Tim or James had a look.

cheers,
--renato

lib/Target/AArch64/AArch64A57FPLoadBalancing.cpp
Line 566: You could have moved isAES up and made Change one line.
test/CodeGen/AArch64/aes-load-balancing.ll
Line 15: Can you use CHECK variables here? Like: ;CHECK: aese v[[accum:[0-9]+]].16b, ...

Patch revised to address Renato's comments.

Thanks Zheng! Let's see what Tim or James have to say.

cheers,
--renato

Hi Zhao,

There are some subtleties here that I don't think you're handling correctly. The major one is tied operands, which are not handled right and are the reason I removed handling of MLAv2f32 from this pass.

Consider a chain where the end instruction is the same as the kill instruction (which means the end dest register cannot be changed). The fixup code currently will look at the last instruction and see something like:

%D0<def,tied0> = AES... %D0<kill,tied0>, %D1...

It will look at the uses first, totally ignore the tied constraint and change %D0 to be whatever the chain register is. It will then not look at the def, because it knows it is illegal to change the def. We then have a broken instruction.

When you add proper handling for this (along with a decent amount of tests, because this is incredibly fiddly to get right), we can re-add support for MLAv2f32. Which would be nice :)
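
The failure mode described above can be reproduced in a small standalone model. This is only a sketch under stated assumptions: `ToyTiedInstr` and both rewrite functions are hypothetical and do not use the real `MachineInstr` API. A real fix would also have to make sure the accumulated value is actually available in the chosen register, which this model ignores.

```cpp
#include <cassert>
#include <map>

// Hypothetical toy instruction: `dest` is tied to `accum`, so both must
// name the same register for the instruction to be valid.
struct ToyTiedInstr {
  unsigned dest;
  unsigned accum; // tied to dest
  unsigned other; // ordinary, untied source
};

bool tiedConstraintHolds(const ToyTiedInstr &mi) {
  return mi.dest == mi.accum;
}

// Mimics the problematic fixup: uses are renamed through the substitution
// map, but the def is left alone because it must not be changed.
void rewriteUsesOnly(ToyTiedInstr &mi,
                     const std::map<unsigned, unsigned> &substs) {
  auto it = substs.find(mi.accum);
  if (it != substs.end())
    mi.accum = it->second;
  it = substs.find(mi.other);
  if (it != substs.end())
    mi.other = it->second;
}

// Tied-aware variant: a tied use is renamed only together with its def,
// and is skipped entirely when the def is fixed.
void rewriteUsesRespectingTies(ToyTiedInstr &mi,
                               const std::map<unsigned, unsigned> &substs,
                               bool defIsFixed) {
  auto it = substs.find(mi.accum);
  if (it != substs.end() && !defIsFixed) {
    mi.accum = it->second;
    mi.dest = it->second;
  }
  it = substs.find(mi.other);
  if (it != substs.end())
    mi.other = it->second;
  assert(tiedConstraintHolds(mi) && "tied operands must stay in sync");
}
```

With a substitution of register 0 to register 2, `rewriteUsesOnly` leaves the instruction with `dest == 0` and `accum == 2`, which is the broken instruction from the comment; the second variant keeps the pair in sync and asserts that invariant.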

Secondly I think your algorithm is going the wrong way. It is chaining the instructions going forward through the instruction stream instead of backwards. Consider, it's possible that we can't change the dest operand of the last instruction in the chain. As the operand is tied, the accumulator has to be the same as the dest. So choosing the chaining register as the dest of the last instruction seems the right thing to do. Currently your algorithm could get to the last instruction and not be able to change it. If you implement this algorithm, you actually don't need to handle tied operands specially (although I still would like to see proper assertions in place that they're not corrupted), because you never get to the situation where you can change an accum but not a dest.
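
A rough standalone sketch of the backward approach suggested above: take the dest of the last instruction as the chain register and propagate it toward the start of the chain. `ToyChainInstr` and `colorChainBackward` are hypothetical names, every instruction is assumed to have a tied accumulator, and the sketch does not check whether the register is actually free across the chain, which a real implementation would have to do.

```cpp
#include <cassert>
#include <vector>

// Hypothetical toy instruction with a dest tied to its accumulator.
struct ToyChainInstr {
  unsigned dest;
  unsigned accum; // tied to dest
};

// Colors a chain from its last instruction back to its first. The register
// already assigned to the last dest becomes the chain register, so the last
// def never needs to be rewritten. Returns the chosen register.
unsigned colorChainBackward(std::vector<ToyChainInstr> &chain) {
  assert(!chain.empty() && "cannot color an empty chain");
  const unsigned chainReg = chain.back().dest;
  for (auto it = chain.rbegin(); it != chain.rend(); ++it) {
    // Dest and accum are always updated together, so the tied constraint
    // cannot be violated at any step.
    it->accum = chainReg;
    it->dest = chainReg;
  }
  return chainReg;
}
```

Because the chain register is fixed by the last instruction before anything is rewritten, the situation where an accumulator can be changed but its dest cannot does not arise in this model.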

Thirdly, are AES instructions statically scheduled by the core? My belief is "no", which is fine, but they should not contribute to the computed static parity. If "yes", then they should probably be sorted to appear in the ready queue before FMADDs.
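
As an illustration of the parity point only, the sketch below models parity as the count of even-colored operations minus the count of odd-colored ones, and leaves AES operations out unless they are assumed to be statically scheduled. That definition of parity, and the names `ToyOp` and `computeParity`, are assumptions made for this example rather than a description of the pass.

```cpp
#include <vector>

enum class ToyKind { FMA, AES };
enum class ToyColor { Even, Odd };

// Hypothetical operation: its kind and the register color it was given.
struct ToyOp {
  ToyKind kind;
  ToyColor color;
};

// Positive result: more even-colored ops counted; negative: more odd ones.
// AES ops are skipped unless they are assumed to be statically scheduled.
int computeParity(const std::vector<ToyOp> &ops,
                  bool aesIsStaticallyScheduled) {
  int parity = 0;
  for (const ToyOp &op : ops) {
    if (op.kind == ToyKind::AES && !aesIsStaticallyScheduled)
      continue;
    parity += (op.color == ToyColor::Even) ? 1 : -1;
  }
  return parity;
}
```

Under the "no" answer, two even-colored AES operations between one even and one odd FMA leave the parity balanced at zero; under the "yes" answer the same sequence is skewed toward even by two.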

Cheers,

James

This revision now requires changes to proceed. Nov 11 2014, 2:40 AM

James,

Thanks for the feedback. To make sure I understand correctly: the way we build chains (mul-mla or aes) is forward, which is fine.

But it's better to color the chains backward, starting from G->getLast(), to handle the tied dest register. This would require us to iterate from the last instruction to the first of a chain in ColorChain() and make appropriate changes to scavengeRegister().

Is this the plan you envisioned to add back support for MLAv2f32?

Thanks,
Zhaoshi

mcrosier resigned from this revision. Aug 5 2015, 11:17 AM

mcrosier removed a reviewer: mcrosier.

zzheng abandoned this revision. Mar 16 2017, 11:10 AM

Revision Contents

Path                                               Size
lib/Target/AArch64/AArch64A57FPLoadBalancing.cpp   81 lines
test/CodeGen/AArch64/aes-load-balancing.ll         28 lines

Diff 16004

lib/Target/AArch64/AArch64A57FPLoadBalancing.cpp

Context not available.
 //===----------------------------------------------------------------------===//
 // Helper functions

+// Is the instruction an AESE or AESD?
+static bool isAESEnDe(MachineInstr *MI) {
+  return (MI->getOpcode() == AArch64::AESErr ||
+          MI->getOpcode() == AArch64::AESDrr);
+}
+
+// Is the instruction an AESMC or AESIMC?
+static bool isAESMix(MachineInstr *MI) {
+  return (MI->getOpcode() == AArch64::AESMCrr ||
+          MI->getOpcode() == AArch64::AESIMCrr);
+}
+
 // Is the instruction a type of multiply on 64-bit (or 32-bit) FPRs?
 static bool isMul(MachineInstr *MI) {
   switch (MI->getOpcode()) {
Context not available.
     if (&*I != G->getKill()) {
       MachineOperand &MO = I->getOperand(0);

-      bool Change = TransformAll || getColor(MO.getReg()) != C;
+      bool isAES = isAESEnDe(I) || isAESMix(I);
+      bool Change = TransformAll || getColor(MO.getReg()) != C || isAES;
       if (G->requiresFixup() && &*I == G->getLast())
         Change = false;

       if (Change) {
+        if (isAES && I->getOperand(1).isKill())
+          // Keep the same accumulation register for AES chains
+          Reg = I->getOperand(1).getReg();
+
         Substs[MO.getReg()] = Reg;
         MO.setReg(Reg);
         MRI->setPhysRegUsed(Reg);
Context not available.
     ActiveChains[DestReg] = G.get();
     AllChains.insert(std::move(G));

+  } else if (isAESEnDe(MI)) {
+    // AESE and AESD are executed by FMA functional units and the Dest register
+    // is the Accum register, treat them as MLAs.
+    unsigned DestReg = MI->getOperand(0).getReg();
+    unsigned SrcReg = MI->getOperand(2).getReg();
+
+    if (DestReg != SrcReg)
+      maybeKillChain(MI->getOperand(2), Idx, ActiveChains);
+
+    if (ActiveChains.find(DestReg) != ActiveChains.end()) {
+      DEBUG(dbgs() << "Chain found for AESE/AESD dest register "
+                   << TRI->getName(DestReg) << " in MI " << *MI);
+
+      // DestReg is the AccumReg, so no need to check if it's killed.
+      DEBUG(dbgs() << "Instruction was successfully added to chain.\n");
+      ActiveChains[DestReg]->add(MI, Idx, getColor(DestReg));
+      return;
+    }
+
+    // Create a new chain for DestReg
+    maybeKillChain(MI->getOperand(0), Idx, ActiveChains);
+    DEBUG(dbgs() << "Creating new chain for AESE/AESD dest register "
+                 << TRI->getName(DestReg) << " at " << *MI);
+    auto G = llvm::make_unique<Chain>(MI, Idx, getColor(DestReg));
+    ActiveChains[DestReg] = G.get();
+    AllChains.insert(std::move(G));
+
+  } else if (isAESMix(MI)) {
+    // AESMC and AESIMC
+    unsigned DestReg = MI->getOperand(0).getReg();
+    unsigned SrcReg = MI->getOperand(1).getReg();
+
+    if (DestReg != SrcReg)
+      maybeKillChain(MI->getOperand(0), Idx, ActiveChains);
+
+    if (ActiveChains.find(SrcReg) != ActiveChains.end()) {
+      DEBUG(dbgs() << "Chain found for AESMC/AESIMC src register "
+                   << TRI->getName(SrcReg) << " in MI " << *MI);
+
+      DEBUG(dbgs() << "Instruction was successfully added to chain.\n");
+      ActiveChains[SrcReg]->add(MI, Idx, getColor(SrcReg));
+      // Handle cases where the destination is not the same as the accumulator.
+      if (DestReg != SrcReg) {
+        DEBUG(dbgs() << "Transfer chain onwership from "
+                     << TRI->getName(SrcReg) << " to "
+                     << TRI->getName(DestReg) << "\n");
+        ActiveChains[DestReg] = ActiveChains[SrcReg];
+        ActiveChains.erase(SrcReg);
+      }
+      return;
+    }
+
+    // Create a new chain for SrcReg
+    maybeKillChain(MI->getOperand(0), Idx, ActiveChains);
+    DEBUG(dbgs() << "Creating new chain for AESMC/AEIMC dest register "
+                 << TRI->getName(DestReg) << " at " << *MI);
+    auto G = llvm::make_unique<Chain>(MI, Idx, getColor(DestReg));
+    ActiveChains[DestReg] = G.get();
+    AllChains.insert(std::move(G));
+
   } else {

-    // Non-MUL or MLA instruction. Invalidate any chain in the uses or defs
+    // Not MUL, MLA or AES instruction. Invalidate any chain in the uses or defs
     // lists.
     for (auto &I : MI->uses())
       maybeKillChain(I, Idx, ActiveChains);
Context not available.

test/CodeGen/AArch64/aes-load-balancing.ll

This file was added.

; RUN: llc < %s -mcpu=cortex-a57 | FileCheck %s
; RUN: llc < %s -mcpu=cortex-a53 | FileCheck %s

target triple = "aarch64--linux-gnu"

declare <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8>, <16 x i8>)
declare <16 x i8> @llvm.aarch64.crypto.aesd(<16 x i8>, <16 x i8>)
declare <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8>)
declare <16 x i8> @llvm.aarch64.crypto.aesimc(<16 x i8>)

; Check that we use the same accumulation register for mixed AES instructions.
define i32 @aes_load_balancing(<16 x i8>* %x, <16 x i8>* %y, <16 x i8>* %z) {
;CHECK-LABEL: aes_load_balancing:
;CHECK: aese v[[accum:[0-9]]].16b, v{{[0-9]}}.16b
;CHECK: aesmc v[[accum]].16b, v[[accum]].16b
;CHECK: aesd v[[accum]].16b, v{{[0-9]}}.16b
;CHECK: aesimc v[[accum]].16b, v[[accum]].16b
entry:
  %0 = load <16 x i8>* %x, align 16
  %1 = load <16 x i8>* %y, align 16
  %2 = load <16 x i8>* %z, align 16
  %3 = tail call <16 x i8> @llvm.aarch64.crypto.aese(<16 x i8> %0, <16 x i8> %1)
  %4 = tail call <16 x i8> @llvm.aarch64.crypto.aesmc(<16 x i8> %3)
  %5 = tail call <16 x i8> @llvm.aarch64.crypto.aesd(<16 x i8> %4, <16 x i8> %2)
  %6 = tail call <16 x i8> @llvm.aarch64.crypto.aesimc(<16 x i8> %5)
  store <16 x i8> %6, <16 x i8>* %x, align 16
  ret i32 0
}