This is an archive of the discontinued LLVM Phabricator instance.


Differential D18572

[AArch64] Relax branches by fusing compare with conditional branch when we can infer that source register is zero/non-zero.
Abandoned Public

Authored by bmakam on Mar 29 2016, 1:41 PM.


Details

Reviewers

t.p.northover
llvm-commits
mcrosier

Summary

For example:
        tbnz    x8, #63, .LBB0_5
        cmp     x8, #1
        b.lt    .LBB0_2
becomes:
        tbnz    x8, #63, .LBB0_5
        cbz     x8, .LBB0_2
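The reasoning behind the summary can be checked with a small standalone sketch. It is not LLVM code: `Cond`, `Fused`, `fuse`, `takenOriginal`, `takenFused` and `agreesWhenNonNegative` are hypothetical names introduced only for illustration. The sketch assumes the preceding `tbnz x, #63` fell through, so the value is known to be non-negative, and compares the outcome of `cmp x, #1` followed by `b.lt`/`b.ge` with the outcome of `cbz`/`cbnz`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the two condition codes the patch accepts and
// the two fused branch kinds it emits. These are not LLVM types.
enum class Cond { LT, GE };
enum class Fused { CBZ, CBNZ };

// Mirrors the patch's choice: "cmp x, #1; b.lt" -> cbz, "cmp x, #1; b.ge" -> cbnz.
constexpr Fused fuse(Cond C) { return C == Cond::LT ? Fused::CBZ : Fused::CBNZ; }

// Would the original "cmp x, #1; b.<cond>" branch be taken?
constexpr bool takenOriginal(int64_t X, Cond C) {
  return C == Cond::LT ? X < 1 : X >= 1;
}

// Would the fused compare-and-branch be taken?
constexpr bool takenFused(int64_t X, Fused F) {
  return F == Fused::CBZ ? X == 0 : X != 0;
}

// The rewrite is only claimed to hold when the sign bit is known clear,
// i.e. when the preceding "tbnz x, #63" was not taken.
inline bool agreesWhenNonNegative(int64_t X) {
  assert(X >= 0 && "precondition: tbnz x, #63 was not taken");
  return takenOriginal(X, Cond::LT) == takenFused(X, fuse(Cond::LT)) &&
         takenOriginal(X, Cond::GE) == takenFused(X, fuse(Cond::GE));
}
```

For a negative value the two forms disagree (`-1 < 1` holds but `-1 == 0` does not), which is why the fusion depends on knowing the sign bit.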


Event Timeline

bmakam updated this revision to Diff 51975. Mar 29 2016, 1:41 PM

bmakam retitled this revision from to [AArch64] Relax branches by fusing compare with conditional branch when we can infer that source register is zero/non-zero..

bmakam updated this object.

bmakam added reviewers: t.p.northover, mcrosier, llvm-commits.

Herald added subscribers: mcrosier, rengolin, aemerson. Mar 29 2016, 1:41 PM

t.p.northover added inline comments. Mar 29 2016, 4:25 PM

lib/Target/AArch64/AArch64BranchRelaxation.cpp
483	What about "cmp reg, #0"/"b.le ..."?
491	Operand 0 is a destination, isn't it? Usually XZR/WZR in a true compare. Actually, I think we want to make sure it's <dead> if we're going to be removing this definition.
510	This seems wrong on 2 levels. First, if there are multiple predecessors then the fact that one of them ends in a TBZ says nothing about the contents of SrcReg2 coming from any others. But even if that wasn't the case we really wouldn't want the logic to rely on the order MBB's predecessors happen to be returned in.
568	Related to two other comments about this loop: this seems like an immediate continue condition. In general, it looks likely that once you prune out the extra logic `CompleteNZCVUsers` will be unnecessary.
580	break? It doesn't seem like you could ever really do more in this BB.
583–585	Isn't this also an immediate break condition? We've hit something that defines NZCV but isn't a compare. Even if there's a "cmp reg, #1" above we mustn't optimize it.
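The concern raised at line 510 can be illustrated with a toy model rather than the real machine CFG. In this sketch, `Edge` and `signBitKnownClearOnEntry` are hypothetical names, and each incoming edge simply records whether it guarantees that the register's sign bit is clear (for example because it is the fall-through of a sign-bit test). The fact may be used in a block only if every incoming edge guarantees it, so the answer does not depend on the order in which predecessors are visited.

```cpp
#include <vector>

// Hypothetical toy model of one incoming CFG edge. It records only whether
// taking this edge guarantees that the register's sign bit is clear.
struct Edge {
  bool SignBitKnownClear;
};

// The fact holds on entry to a block only if the block has at least one
// incoming edge and every incoming edge guarantees it. A single predecessor
// ending in a sign-bit test says nothing about values arriving from others.
inline bool signBitKnownClearOnEntry(const std::vector<Edge> &Incoming) {
  if (Incoming.empty())
    return false;
  for (const Edge &E : Incoming)
    if (!E.SignBitKnownClear)
      return false;
  return true;
}
```

A dominator-based check, as discussed later in the thread, is another way to establish the same fact. This sketch only shows why looking at one predecessor is not enough.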

flyingforyou added a subscriber: flyingforyou. Mar 29 2016, 6:26 PM

Thanks for the comments Tim. Please see my replies inline.

lib/Target/AArch64/AArch64BranchRelaxation.cpp
483	This could be turned into a TBNZ but isn't AArch64ConditionOptimizer a better place to handle this? Although this is very similar we only fuse a compare with branch in this patch i.e. (a>0 && a<1) -> a == 0 or (a>0 && a>= 1) -> a != 0 and so it depends on branchfolding to shape the CFG.
491	You are correct. I agree, will update in my next patch.
510	I see your point now. I actually wanted to get the immediate dominator but the dom tree is not available in this pass without reconstructing. I will refactor this.
568	I will fix this.

Hi Balaram,

I'm not sure I understand why this is being done so late in the pipeline. It seems to me like we could catch these cases much earlier.

This also seems related to the patch http://reviews.llvm.org/D7708, which I believe would allow this case to be caught much earlier in instsimplify (perhaps requiring a bit of work in instsimplify as well). This code was recently removed for lack of evidence of its benefit, but it might be worthwhile to evaluate it in the context of this particular optimization.

Geoff, there are 2 reasons for doing this so late. First, for the benchmark I looked at, mcf, these patterns appear only very late, after branch folding. My initial implementation in AArch64ConditionOptimizer did not catch all the interesting cases. Second, this is a type of branch relaxation optimization because the branch displacement of cbz is better than a conditional branch IMHO.

bmakam added a comment. Mar 31 2016, 9:26 AM

This comment was removed by bmakam.

Second, this is a type of branch relaxation optimization because the branch displacement of cbz is better than a conditional branch IMHO.

What gave you that idea? They both seem to allow imm19*4.
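To make the "imm19*4" remark concrete, the reachable byte range of a signed, 4-byte-scaled branch immediate can be computed as follows. This is a standalone sketch. `Range`, `branchRange` and `inRange` are hypothetical helpers, not functions from the pass. The 14-bit case is included only for comparison with the test-bit branches (tbz/tbnz).

```cpp
#include <cstdint>

// Hypothetical helper type: inclusive byte range reachable by a branch.
struct Range {
  int64_t Min;
  int64_t Max;
};

// Range of a signed immediate of Bits bits that is scaled by 4 (one
// instruction word). Bits = 19 models b.cond and cbz/cbnz; Bits = 14
// models tbz/tbnz.
constexpr Range branchRange(unsigned Bits) {
  return Range{-(int64_t(1) << (Bits - 1)) * 4,
               ((int64_t(1) << (Bits - 1)) - 1) * 4};
}

// True if a byte offset is word-aligned and encodable in Bits bits.
constexpr bool inRange(int64_t Offset, unsigned Bits) {
  return Offset % 4 == 0 && Offset >= branchRange(Bits).Min &&
         Offset <= branchRange(Bits).Max;
}
```

With 19 bits this gives roughly plus or minus 1 MiB for both b.cond and cbz, so the fused form gains no extra reach.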

lib/Target/AArch64/AArch64BranchRelaxation.cpp
483	The case I'm talking about is analogous though: "a > 0 && a <= 0 -> a != 0".

bmakam added inline comments. Mar 31 2016, 10:48 AM

lib/Target/AArch64/AArch64BranchRelaxation.cpp
510	I am having difficulty in using the MachineDominatorTree in this pass even after reconstructing. When I call the DomTree->getNode API it crashes. The last user of MachineDomTree is MachineBlockPlacement pass. Is the DomTree invalid after MachineBlockPlacement? If this is the only way to get the immediate dominator then I am thinking of moving just this logic to an earlier pass. The interesting cases are found only after branchfolding so I am thinking this should be done in PreSched2 stage. Is it reasonable to do it in a separate pass in PreSched2?

I think I agree with Geoff here. This pass may be the most convenient place to add the peep-hole optimization, but even the IR doesn't look particularly obscure (well, apart from being massively and horribly undefined now that I look at it: the conditions are only related at all because of a quirk in the register allocator!).

If you change the operands of the icmps to be a function parameter instead of undefs I think it makes for a better example/test.

If you were able to do this at the IR level, I believe the relevant transformation would be to convert the second icmp (the sgt 0 one) into an icmp ne 0
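The suggested IR-level rewrite can be sketched outside LLVM as a predicate canonicalization. Everything here is hypothetical and for illustration only: `Pred`, `ICmp`, `canonicalize` and `eval` are not LLVM APIs, and the flag `KnownNonNegative` stands for the assumption that a dominating condition has already ruled out negative values.

```cpp
#include <cstdint>

// Hypothetical miniature of an integer compare against a constant.
enum class Pred { SGT, NE };

struct ICmp {
  Pred P;
  int64_t RHS;
};

// If the operand is known non-negative, "sgt x, 0" means the same as
// "ne x, 0", so prefer the equality form. Otherwise leave the compare as is.
constexpr ICmp canonicalize(ICmp C, bool KnownNonNegative) {
  return (KnownNonNegative && C.P == Pred::SGT && C.RHS == 0)
             ? ICmp{Pred::NE, 0}
             : C;
}

// Evaluate the compare for a concrete value of x.
constexpr bool eval(ICmp C, int64_t X) {
  return C.P == Pred::SGT ? X > C.RHS : X != C.RHS;
}
```

Once the compare is an equality test against zero, a backend can select a compare-and-branch-on-zero instruction without a late peephole.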

In D18572#388370, @t.p.northover wrote:

Second, this is a type of branch relaxation optimization because the branch displacement of cbz is better than a conditional branch IMHO.

What gave you that idea? They both seem to allow imm19*4.

This was based on my observation on our hardware where I found ~4% performance improvement when we changed the following assembly

tbnz             x15, #63, L13
cmp              x8, #1
b.lt             L12
cmp              w16, #2         
b.ne             L12
b                L14

into:

tbnz             x15, #63, L13
cbz              x15, L12
cmp              w16, #2         
b.ne             L12
b                L14

I might be wrong in attributing this as a result of branch displacement.

lib/Target/AArch64/AArch64BranchRelaxation.cpp
483	Ah I see, this is a valid case too. I am sorry I misread this as b.lt and was thinking a < 0 should be turned into TBNZ.

In D18572#388414, @gberry wrote:

If you change the operands of the icmps to be a function parameter instead of undefs I think it makes for a better example/test.

If you were able to do this at the IR level, I believe the relevant transformation would be to convert the second icmp (the sgt 0 one) into an icmp ne 0

Thanks guys, you convinced me that this should be done at the IR level. I am thinking of doing it in SimplifyCmpInst and hope it catches all the interesting cases that I am looking for.

Changed to http://reviews.llvm.org/D18841

bmakam mentioned this in D18841: [InstCombine] Canonicalize icmp instructions based on dominating conditions. Apr 28 2016, 2:50 PM

Revision Contents

Path                                             Size
lib/Target/AArch64/AArch64BranchRelaxation.cpp   117 lines
test/CodeGen/AArch64/branch-relax-fuse-cbz.ll    35 lines

Diff 51975

lib/Target/AArch64/AArch64BranchRelaxation.cpp

[79 lines not shown; enclosing context: unsigned postOffset(unsigned LogAlign = 0) const {]
       return (PO + Align - 1) / Align * Align;
     }
   };

   SmallVector<BasicBlockInfo, 16> BlockInfo;

   MachineFunction *MF;
   const AArch64InstrInfo *TII;
+  const TargetRegisterInfo *TRI;

   bool relaxBranchInstructions();
   void scanFunction();
   MachineBasicBlock *splitBlockBeforeInstr(MachineInstr *MI);
   void adjustBlockOffsets(MachineBasicBlock &MBB);
   bool isBlockInRange(MachineInstr *MI, MachineBasicBlock *BB, unsigned Disp);
   bool fixupConditionalBranch(MachineInstr *MI);
+  bool fuseCompareAndBranch(MachineInstr *Compare,
+                            SmallVectorImpl<MachineInstr *> &NZCVUsers);
   void computeBlockSize(const MachineBasicBlock &MBB);
   unsigned getInstrOffset(MachineInstr *MI) const;
   void dumpBBs();
   void verify();

 public:
   static char ID;
   AArch64BranchRelaxation() : MachineFunctionPass(ID) {
[47 lines not shown; enclosing context: static bool BBHasFallthrough(MachineBasicBlock *MBB) {]

   for (MachineBasicBlock *S : MBB->successors())
     if (S == &*NextBB)
       return true;

   return false;
 }

+static bool isNZCVLiveOut(MachineBasicBlock &MBB) {
+  for (auto *SI : MBB.successors())
+    if (SI->isLiveIn(AArch64::NZCV))
+      return true;
+  return false;
+}
+
 /// scanFunction - Do the initial scan of the function, building up
 /// information about each block.
 void AArch64BranchRelaxation::scanFunction() {
   BlockInfo.clear();
   BlockInfo.resize(MF->getNumBlockIDs());

   // First thing, compute the size of all basic blocks, and see if the function
   // has any inline assembly in it. If so, we have to be conservative about
[288 lines not shown; enclosing context: bool AArch64BranchRelaxation::fixupConditionalBranch(MachineInstr *MI) {]
   BlockInfo[MI->getParent()->getNumber()].Size -= TII->GetInstSizeInBytes(MI);
   MI->eraseFromParent();

   // Finally, keep the block offsets up to date.
   adjustBlockOffsets(*MBB);
   return true;
 }

+bool AArch64BranchRelaxation::fuseCompareAndBranch(
+    MachineInstr *Compare, SmallVectorImpl<MachineInstr *> &NZCVUsers) {
+
+  if (NZCVUsers.size() != 1)
+    return false;
+
+  MachineInstr *Branch = NZCVUsers[0];
+  if (Branch->getOpcode() != AArch64::Bcc)
+    return false;
+
+  if (!Compare->getOperand(1).isReg() || !Compare->getOperand(2).isImm() ||
+      Compare->getOperand(2).getImm() != 1)
      [Inline comments on line 483: see Event Timeline]
+    return false;
+
+  AArch64CC::CondCode CC = (AArch64CC::CondCode)Branch->getOperand(0).getImm();
+  if (CC != AArch64CC::LT && CC != AArch64CC::GE)
+    return false;
+
+  MachineBasicBlock &MBB = *Compare->getParent();
+  unsigned SrcReg = Compare->getOperand(0).getReg();
      [Inline comments on line 491: see Event Timeline]
+  unsigned SrcReg2 = Compare->getOperand(1).getReg();
+  if (!MBB.isLiveIn(SrcReg2))
+    return false;
+
+  MachineBasicBlock::iterator MBBI = MBB.begin(), MBBC = Compare, MBBE = Branch;
+  for (; MBBI != MBBC; ++MBBI)
+    if (MBBI->modifiesRegister(SrcReg2, TRI))
+      return false;
+  for (++MBBI; MBBI != MBBE; ++MBBI)
+    if (MBBI->modifiesRegister(SrcReg, TRI) ||
+        MBBI->modifiesRegister(SrcReg2, TRI))
+      return false;
+
+  MachineInstr *LastUse;
+  for (auto *PBB : MBB.predecessors()) {
+    for (auto &PI : *PBB) {
+      if (PI.getNumOperands() > 0 && PI.getOperand(0).isReg() &&
+          PI.getOperand(0).isUse() && PI.getOperand(0).getReg() == SrcReg2)
+        LastUse = &PI;
      [Inline comments on line 510: see Event Timeline]
+    }
+  }
+
+  bool Positive = false;
+  bool Is64Bit = false;
+  switch (LastUse->getOpcode()) {
+  default:
+    break;
+  case AArch64::TBNZX:
+    Positive = (LastUse->getOperand(1).getImm() == 63 &&
+                LastUse->getOperand(2).getMBB() != &MBB);
+    Is64Bit = true;
+    break;
+  case AArch64::TBZX:
+    Positive = (LastUse->getOperand(1).getImm() == 63 &&
+                LastUse->getOperand(2).getMBB() == &MBB);
+    Is64Bit = true;
+    break;
+  case AArch64::TBNZW:
+    Positive = (LastUse->getOperand(1).getImm() == 31 &&
+                LastUse->getOperand(2).getMBB() != &MBB);
+    break;
+  case AArch64::TBZW:
+    Positive = (LastUse->getOperand(1).getImm() == 31 &&
+                LastUse->getOperand(2).getMBB() == &MBB);
+    break;
+  }
+  if (!Positive)
+    return false;
+
+  unsigned FusedOpcode =
+      Is64Bit ? (CC == AArch64CC::LT ? AArch64::CBZX : AArch64::CBNZX)
+              : (CC == AArch64CC::LT ? AArch64::CBZW : AArch64::CBNZW);
+
+  BuildMI(MBB, Branch, Branch->getDebugLoc(), TII->get(FusedOpcode))
+      .addReg(SrcReg2)
+      .addOperand(Branch->getOperand(1))
+      .addReg(AArch64::NZCV, RegState::ImplicitDefine);
+
+  Branch->eraseFromParent();
+  MBB.updateTerminator();
+  return true;
+}
 bool AArch64BranchRelaxation::relaxBranchInstructions() {
   bool Changed = false;
   // Relaxing branches involves creating new basic blocks, so re-eval
   // end() for termination.
   for (auto &MBB : *MF) {
     MachineInstr *MI = MBB.getFirstTerminator();
     if (isConditionalBranch(MI->getOpcode()) &&
         !isBlockInRange(MI, getDestBlock(MI),
                         getBranchDisplacementBits(MI->getOpcode()))) {
       fixupConditionalBranch(MI);
       ++NumRelaxed;
       Changed = true;
     }
+
+    bool CompleteNZCVUsers = !isNZCVLiveOut(MBB);
      [Inline comments on line 568: see Event Timeline]
+    SmallVector<MachineInstr *, 4> NZCVUsers;
+    MachineBasicBlock::iterator MBBI = MBB.end();
+    while (MBBI != MBB.begin()) {
+      MachineInstr *MI = --MBBI;
+      if (CompleteNZCVUsers && MI->isCompare() &&
+          fuseCompareAndBranch(MI, NZCVUsers)) {
+        ++NumRelaxed;
+        ++MBBI;
+        MI->eraseFromParent();
+        Changed = true;
+        NZCVUsers.clear();
+        continue;
      [Inline comment on line 580: see Event Timeline]
+      }
+
+      if (MI->definesRegister(AArch64::NZCV)) {
+        NZCVUsers.clear();
+        CompleteNZCVUsers = true;
      [Inline comment on lines 583–585: see Event Timeline]
+      }
+
+      if (MI->readsRegister(AArch64::NZCV) && CompleteNZCVUsers)
+        NZCVUsers.push_back(MI);
+    }
   }
   return Changed;
 }

 bool AArch64BranchRelaxation::runOnMachineFunction(MachineFunction &mf) {
   MF = &mf;

   // If the pass is disabled, just bail early.
   if (!BranchRelaxation)
     return false;

   DEBUG(dbgs() << "*** AArch64BranchRelaxation ***\n");

   TII = (const AArch64InstrInfo *)MF->getSubtarget().getInstrInfo();
+  TRI = MF->getSubtarget().getRegisterInfo();

   // Renumber all of the machine basic blocks in the function, guaranteeing that
   // the numbers agree with the position of the block in the function.
   MF->RenumberBlocks();

   // Do the initial scan of the function, building up information about the
   // sizes of each block.
   scanFunction();
[23 lines not shown]

test/CodeGen/AArch64/branch-relax-fuse-cbz.ll

This file was added.

; RUN: llc -mtriple=aarch64-linux--gnu -o - %s | FileCheck %s
%struct.arc = type { i64, %struct.node*, %struct.node*, i32, %struct.arc*, %struct.arc*, i64, i64 }
%struct.node = type { i64, i32, %struct.node*, %struct.node*, %struct.node*, %struct.node*, %struct.arc*, %struct.arc*, %struct.arc*, %struct.arc*, i64, i64, i32, i32 }
%struct.basket = type { %struct.arc*, i64, i64 }

; Function Attrs: nounwind
define void @primal_bea_mpp() {
; CHECK-LABEL: primal_bea_mpp:
; CHECK: tbnz [[REG:x[0-9]+]], #63, .LBB0_5
; CHECK: cbz [[REG]], .[[TRUE:LBB[0-9]+_[0-9]+]]
; CHECK-NOT: cmp [[REG]], #1
; CHECK-NOT: b.lt .[[TRUE]]

entry:
  br label %for.body5

for.body5:                                        ; preds = %if.then16, %lor.lhs.false, %land.lhs.true, %entry
  %0 = load %struct.arc*, %struct.arc** undef, align 8
  %cmp10 = icmp slt i64 undef, 0
  br i1 %cmp10, label %land.lhs.true, label %lor.lhs.false

land.lhs.true:                                    ; preds = %for.body5
  br i1 undef, label %if.then16, label %for.body5

lor.lhs.false:                                    ; preds = %for.body5
  %cmp12 = icmp sgt i64 undef, 0
  %cmp12.not = xor i1 %cmp12, true
  %brmerge = or i1 %cmp12.not, false
  br i1 %brmerge, label %for.body5, label %if.then16

if.then16:                                        ; preds = %lor.lhs.false, %land.lhs.true
  %a19 = getelementptr inbounds %struct.basket, %struct.basket* undef, i64 0, i32 0
  store %struct.arc* %0, %struct.arc** %a19, align 8
  br label %for.body5
}