This is an archive of the discontinued LLVM Phabricator instance.

Differential D20030

[AArch64] Add option to disable speculation of triangle whose tail is the only latch block
Abandoned, Public

Authored by bmakam on May 6 2016, 2:07 PM.

Details

Reviewers

rengolin
t.p.northover
jmolloy

Summary

This patch adds an option to disable speculation of a triangle when its
tail is the only latch block of this loop. At this time, the option
-aarch64-ccmp-disable-triangle-latch is disabled by default. I'm hoping for feedback
from others on the profitability on other targets.

When the tail of the triangle is the only latch block of the loop, we end up inserting
a ccmp inside the critical path of the loop. If the speculated code is cold, we execute
the cold code on every loop iteration. If the speculated code were hot, the branch
predictor would take that direction anyway.

This affects the chances of forming a ld/st pair, because the loads could now
end up in different blocks. However, when tested on Kryo, performance was slightly
better on the spec2006 CINT/CFP benchmarks, with no regressions above the noise range.
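
The trade-off described in the summary can be sketched with a toy cost model. This is only an illustration: the struct, the function names, and every number fed into it are hypothetical, and it treats all cycles as if they sat on the loop's critical path, which real cores do not.

```cpp
#include <cassert>

// Hypothetical per-iteration cost model. Every field is an illustrative
// assumption, not a measurement from this review.
struct LoopCosts {
  double headCycles;        // work in the head block, always executed
  double cmpBlockCycles;    // work in the conditionally executed compare block
  double cmpBlockProb;      // probability the compare block runs when branching
  double mispredictRate;    // fraction of iterations with a mispredicted branch
  double mispredictPenalty; // cycles lost per misprediction
};

// Branching form: the compare block only costs cycles when it is reached,
// but a mispredicted branch adds a penalty.
double branchCost(const LoopCosts &c) {
  return c.headCycles + c.cmpBlockProb * c.cmpBlockCycles +
         c.mispredictRate * c.mispredictPenalty;
}

// Speculated (ccmp) form: the compare block is merged into the head, so its
// cost is paid on every iteration and there is no branch to mispredict.
double speculatedCost(const LoopCosts &c) {
  return c.headCycles + c.cmpBlockCycles;
}

// True when the speculated form is cheaper under this toy model.
bool speculationPays(const LoopCosts &c) {
  return speculatedCost(c) < branchCost(c);
}
```

Under this model a cold, well-predicted compare block (low `cmpBlockProb`, low `mispredictRate`) makes the speculated form more expensive, which is the case the option is meant to avoid.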

Event Timeline

bmakam updated this revision to Diff 56459. May 6 2016, 2:07 PM

bmakam retitled this revision to [AArch64] Add option to disable speculation of triangle whose tail is the only latch block.

bmakam updated this object.

bmakam added reviewers: mcrosier, jmolloy, t.p.northover, llvm-commits.

Herald added subscribers: mcrosier, rengolin, aemerson. May 6 2016, 2:07 PM

bmakam mentioned this in D17729: [TargetInstrInfo] Add TargetInstrInfo interface isProfitableToBranch. May 6 2016, 2:08 PM

gentle ping.

Hi Balaram,

This seems like a good thing to do overall, not just for Kryo, or when the option is chosen.

It would be good to know how it performs in vanilla AArch64 cores (A53, A57) so we could enable them by default.

cheers,
--renato

In D20030#430743, @rengolin wrote:

> Hi Balaram,
>
> This seems like a good thing to do overall, not just for Kryo, or when the option is chosen.

I agree and would advocate enabling this by default after some additional testing.

> It would be good to know how it performs in vanilla AArch64 cores (A53, A57) so we could enable them by default.

We should be able to get numbers for at least A57, right?

> cheers,
> --renato

In D20030#430780, @mcrosier wrote:

> In D20030#430743, @rengolin wrote:
>
>> Hi Balaram,
>>
>> This seems like a good thing to do overall, not just for Kryo, or when the option is chosen.
>
> I agree and would advocate enabling this by default after some additional testing.
>
>> It would be good to know how it performs in vanilla AArch64 cores (A53, A57) so we could enable them by default.
>
> We should be able to get numbers for at least A57, right?

Thanks Renato and Chad,

Yes, I will test it on A57 and report the results, but I need others' help in testing other AArch64 targets as I do not have access to them.

>> cheers,
>> --renato

Hi Renato,

It seems like this patch is unfavourable to A57. In our internal tests we found that in spec2006/mcf this patch generates the following code difference:

sub x15, x15, x17                                                             ==  sub x15, x15, x17
ldr x17, [x14,#16]                                                            ==  ldr x17, [x14,#16]
ldr x17, [x17]                                                                ==  ldr x17, [x17]
add x15, x17, x15                                                             ==  add x15, x17, x15
tbnz x15, #63, L13                                                            ==  tbnz x15, #63, L13
                                                                              >>  cbz x15, L12
cmp x15, #0x0                                                                 <<
                                                                              >>  cmp w16, #0x2
ccmp w16, #0x2, #0x0, ne                                                      <<
b.eq L14                                                                      ==  b.eq L14
b L12                                                                         ==  b L12

The performance depends on the cost of the cbz here. On Kryo we see a 3% gain with this patch, whereas on A57 we see a 10% regression. This branch seems to be mostly not taken, so when we place the cbz out of the critical path as shown below

tbnz    x15, #63, L13
cmp     w16, #0x2
b.ne    L12
cbnz    x15, L14
b       L12

the performance on A57 improves by 4%. I am not sure but it seems like the cost of cbz on A57 is higher than on Kryo.
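
The three sequences discussed above differ only in where the zero test on x15 sits, not in which label is reached. The sketch below checks that by modelling each sequence as a C++ function over the two register values. The enum, the function names, and the sample inputs are hypothetical; the `ccmp` line relies on the A64 rule that when the condition (`ne`) fails, the flags are set to the immediate `#0`, so the following `b.eq` is not taken.

```cpp
#include <cassert>
#include <cstdint>

enum class Target { L12, L13, L14 };

// tbnz x15,#63,L13; cmp x15,#0; ccmp w16,#2,#0,ne; b.eq L14; b L12
Target ccmpForm(int64_t x15, int32_t w16) {
  if (x15 < 0) return Target::L13;           // tbnz tests the sign bit
  // ccmp: compare w16 with 2 only if x15 != 0, otherwise Z is cleared.
  bool zFlag = (x15 != 0) ? (w16 == 2) : false;
  return zFlag ? Target::L14 : Target::L12;  // b.eq L14; b L12
}

// tbnz x15,#63,L13; cbz x15,L12; cmp w16,#2; b.eq L14; b L12
Target cbzFirstForm(int64_t x15, int32_t w16) {
  if (x15 < 0) return Target::L13;
  if (x15 == 0) return Target::L12;          // cbz sits ahead of the compare
  return (w16 == 2) ? Target::L14 : Target::L12;
}

// tbnz x15,#63,L13; cmp w16,#2; b.ne L12; cbnz x15,L14; b L12
Target cbnzLastForm(int64_t x15, int32_t w16) {
  if (x15 < 0) return Target::L13;
  if (w16 != 2) return Target::L12;          // compare first
  return (x15 != 0) ? Target::L14 : Target::L12;  // zero test moved last
}

// Compare all three forms over a small set of boundary values.
bool allFormsAgree() {
  const int64_t xs[] = {INT64_MIN, -1, 0, 1, INT64_MAX};
  const int32_t ws[] = {-1, 0, 1, 2, 3};
  for (int64_t x : xs)
    for (int32_t w : ws) {
      Target expected = ccmpForm(x, w);
      if (expected != cbzFirstForm(x, w) || expected != cbnzLastForm(x, w))
        return false;
    }
  return true;
}
```

Since the destinations match, any performance difference between the forms comes from how each core schedules and predicts the branches, not from the logic.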

In D20030#437974, @bmakam wrote:

> The performance depends on the cost of the cbz here. On Kryo we see a 3% gain with this patch, whereas on A57 we see a 10% regression. This branch seems to be mostly not taken, so when we place the cbz out of the critical path as shown below (...)
> the performance on A57 improves by 4%. I am not sure but it seems like the cost of cbz on A57 is higher than on Kryo.

This is interesting... I wasn't expecting that much of a difference, but sub-arch issues can cause massive changes.

Now, about the CBZ, it may be it, or it may be a class of instructions that are faster on Kryo, or it could be a red herring.

Regardless, I don't think we should limit this optimization with a FeatureFastCBZ flag, because that would be extrapolating without data. I really think that enabling this on all A64 will cause more problems than it solves, but limiting it to a flag or to Kryo only would be weird.

Since this doesn't seem to have a big impact, even for Kryo, I'm reluctant...

I'd welcome Tim's and James' point of view on this.

cheers,
--renato

In D20030#430851, @bmakam wrote:

> In D20030#430780, @mcrosier wrote:
>
>> In D20030#430743, @rengolin wrote:
>>
>>> Hi Balaram,
>>>
>>> This seems like a good thing to do overall, not just for Kryo, or when the option is chosen.
>>
>> I agree and would advocate enabling this by default after some additional testing.
>>
>>> It would be good to know how it performs in vanilla AArch64 cores (A53, A57) so we could enable them by default.
>>
>> We should be able to get numbers for at least A57, right?
>
> Thanks Renato and Chad,
>
> Yes, I will test it on A57 and report the results, but I need others' help in testing other AArch64 targets as I do not have access to them.
>
>>> cheers,
>>> --renato

I ran this patch against r270609, on both Cortex-A53 and Cortex-A57, across spec2000, spec2006, test-suite, and a few other benchmark suites.
The only clear performance differences I saw were:

On Cortex-A53:
SingleSource/Benchmarks/Misc/mandel-2 7.7% speedup

On Cortex-A57:
spec.cpu2006.ref.429_mcf 2.9% slowdown
SingleSource/Benchmarks/Misc/mandel-2 2.9% slowdown

Overall, my measurements seem to indicate this gives a significant performance impact only in very few cases.
I'd be happy for this to be enabled by default for all cores, even though the data indicates this only has a minor impact on performance overall.

bmakam added a parent revision: D21299: [Codegen Prepare] Swap commutative binops before splitting branch condition. Jun 13 2016, 9:45 AM

Thanks for testing, Kristof. I have pushed D21299, which will fix the regressions and help enable this change by default for all subtargets. Once D21299 lands, I expect to see an 8% improvement in spec2006/mcf with this patch.

With this being a flag that is disabled by default and generic to all targets, it now looks good to me.

Thanks Balaram, Kristof!

This revision is now accepted and ready to land. Jun 15 2016, 10:51 AM

Thanks for the review Renato.

bmakam mentioned this in D21299: [Codegen Prepare] Swap commutative binops before splitting branch condition. Jun 16 2016, 8:24 AM

This hasn't gone in yet, just FYI.

In D20030#467849, @rengolin wrote:

> This hasn't gone in yet, just FYI.

Thanks Renato, I was hoping to get this in after D21299.

mcrosier resigned from this revision. Jul 26 2016, 10:38 AM

mcrosier removed a reviewer: mcrosier.

mcrosier removed a subscriber: mcrosier.

bmakam removed a parent revision: D21299: [Codegen Prepare] Swap commutative binops before splitting branch condition. Aug 18 2016, 3:23 PM

In D20030#467849, @rengolin wrote:

> This hasn't gone in yet, just FYI.

Renato, do you still think this should be landed, now that D21299 is abandoned? Perhaps someone can use it for experimenting, but I am OK with it not going in either.

No, better to keep this local in your tree and put it up again if it's needed.

bmakam abandoned this revision. Aug 19 2016, 4:09 AM

Revision Contents

Path                                                 Size
lib/Target/AArch64/AArch64ConditionalCompares.cpp    18 lines
test/CodeGen/AArch64/aarch64-ccmp-heuristics.ll      59 lines

Diff 56459

lib/Target/AArch64/AArch64ConditionalCompares.cpp

[45 lines not shown]
 static cl::opt<unsigned> BlockInstrLimit(
     "aarch64-ccmp-limit", cl::init(30), cl::Hidden,
     cl::desc("Maximum number of instructions per speculated block."));

 // Stress testing mode - disable heuristics.
 static cl::opt<bool> Stress("aarch64-stress-ccmp", cl::Hidden,
                             cl::desc("Turn all knobs to 11"));

+// disable speculation of triangle when its tail is the only latch block
+// of this loop.
+static cl::opt<bool> DisableTriangleLatch(
+    "aarch64-ccmp-disable-triangle-latch", cl::init(false), cl::Hidden,
+    cl::desc("Disable when the tail block is a loop latch."));

 STATISTIC(NumConsidered, "Number of ccmps considered");
 STATISTIC(NumPhiRejs, "Number of ccmps rejected (PHI)");
 STATISTIC(NumPhysRejs, "Number of ccmps rejected (Physregs)");
 STATISTIC(NumPhi2Rejs, "Number of ccmps rejected (PHI2)");
 STATISTIC(NumHeadBranchRejs, "Number of ccmps rejected (Head branch)");
 STATISTIC(NumCmpBranchRejs, "Number of ccmps rejected (CmpBB branch)");
 STATISTIC(NumCmpTermRejs, "Number of ccmps rejected (CmpBB is cbz...)");
 STATISTIC(NumImmRangeRejs, "Number of ccmps rejected (Imm out of range)");
[800 lines not shown; the following is inside bool AArch64ConditionalCompares::shouldConvert() {]

   // Heuristic: The speculatively executed instructions must all be able to
   // merge into the Head block. The Head critical path should dominate the
   // resource cost of the speculated instructions.
   if (ResDepth > HeadDepth) {
     DEBUG(dbgs() << "Too many instructions to speculate.\n");
     return false;
   }

+  // Heuristic: If the tail is the only latch block for this loop then the
+  // compare conversion delays the loop backedge because we now execute ccmp
+  // instruction inside the critical path of the loop.
+  if (DisableTriangleLatch && Loops)
+    if (MachineLoop *ML = Loops->getLoopFor(CmpConv.Head))
+      if (MachineBasicBlock *LatchBB = ML->getLoopLatch())
+        if (LatchBB == CmpConv.Tail) {
+          DEBUG(dbgs() << "Won't speculate when tail block is a loop latch.\n");
+          return false;
+        }

   return true;
 }

 bool AArch64ConditionalCompares::tryConvert(MachineBasicBlock *MBB) {
   bool Changed = false;
   while (CmpConv.canConvert(MBB) && shouldConvert()) {
     invalidateTraces();
     SmallVector<MachineBasicBlock *, 4> RemovedBlocks;
[38 lines not shown]
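
As a rough, standalone illustration of the check added to shouldConvert(), the sketch below models blocks as non-negative integer ids. `ToyLoop`, `PredMap`, `uniqueLatch`, and `rejectTriangle` are hypothetical stand-ins, not LLVM types; `uniqueLatch` only approximates `MachineLoop::getLoopLatch()`, which returns null when the loop does not have exactly one latch.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Toy stand-in for a loop: a header block plus the set of blocks in the loop.
// Block ids are assumed to be non-negative; -1 is used as "no block".
struct ToyLoop {
  int header;
  std::set<int> blocks;
};

// Maps each block id to the ids of its predecessors.
using PredMap = std::map<int, std::vector<int>>;

// Returns the single in-loop predecessor of the header (the latch),
// or -1 if there is none or more than one.
int uniqueLatch(const ToyLoop &loop, const PredMap &preds) {
  auto it = preds.find(loop.header);
  if (it == preds.end()) return -1;
  int latch = -1;
  for (int pred : it->second) {
    if (loop.blocks.count(pred) == 0) continue;   // edge from outside the loop
    if (latch != -1 && latch != pred) return -1;  // second latch found
    latch = pred;
  }
  return latch;
}

// Mirrors the shape of the new heuristic: refuse the conversion only when
// the option is on, the head is in a loop, and the tail is the sole latch.
bool rejectTriangle(bool disableTriangleLatch, const ToyLoop *loop,
                    const PredMap &preds, int tail) {
  if (!disableTriangleLatch || loop == nullptr) return false;
  int latch = uniqueLatch(*loop, preds);
  return latch != -1 && latch == tail;
}
```

For a triangle with head 0, compare block 1, and tail 2, where block 2 branches back to 0, `rejectTriangle(true, &loop, preds, 2)` reports that the conversion should be skipped; with a second latch, or with the option off, it does not.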

test/CodeGen/AArch64/aarch64-ccmp-heuristics.ll

This file was added.

; RUN: llc < %s -mcpu=kryo -mtriple=aarch64--linux-gnu -verify-machineinstrs -aarch64-ccmp -aarch64-ccmp-disable-triangle-latch | FileCheck %s

%struct.arc = type { i64, %struct.node*, %struct.node*, i32, %struct.arc*, %struct.arc*, i64, i64 }
%struct.node = type { i64, i32, %struct.node*, %struct.node*, %struct.node*, %struct.node*, %struct.arc*, %struct.arc*, %struct.arc*, %struct.arc*, i64, i64, i32, i32 }
%struct.basket = type { %struct.arc*, i64, i64 }

; CHECK: foo
; CHECK: %if.then34
; CHECK: cmp x{{[0-9]+}}, #1
; CHECK-NEXT: b.ge
; CHECK: %if.then34.if.else.exit
; CHECK: cmp w{{[0-9]+}}, #2
; CHECK-NEXT: b.ne
; Function Attrs: nounwind
define void @foo() #0 {
entry:
  br label %for.body

for.body: ; preds = %for.inc, %entry
  %arc = phi %struct.arc* [ %add.ptr60, %for.inc ], [ undef, %entry ]
  %ident32 = getelementptr inbounds %struct.arc, %struct.arc* %arc, i64 0, i32 3
  %ident32.load = load i32, i32* %ident32, align 8
  %cmp33 = icmp sgt i32 %ident32.load, 0
  br i1 %cmp33, label %if.then34, label %for.inc

if.then34: ; preds = %for.body
  %cost35 = getelementptr inbounds %struct.arc, %struct.arc* %arc, i64 0, i32 0
  %0 = load i64, i64* %cost35, align 8
  %tail36 = getelementptr inbounds %struct.arc, %struct.arc* %arc, i64 0, i32 1
  %1 = load %struct.node*, %struct.node** %tail36, align 8
  %potential37 = getelementptr inbounds %struct.node, %struct.node* %1, i64 0, i32 0
  %2 = load i64, i64* %potential37, align 8
  %sub38 = sub nsw i64 %0, %2
  %head39 = getelementptr inbounds %struct.arc, %struct.arc* %arc, i64 0, i32 2
  %3 = load %struct.node*, %struct.node** %head39, align 8
  %potential40 = getelementptr inbounds %struct.node, %struct.node* %3, i64 0, i32 0
  %4 = load i64, i64* %potential40, align 8
  %add41 = add nsw i64 %4, %sub38
  %cmp.i = icmp sgt i64 %add41, 0
  br i1 %cmp.i, label %land.lhs.true.i, label %if.then34.if.else.exit

land.lhs.true.i: ; preds = %if.then34
  %cmp1.i = icmp eq i32 %ident32.load, 1
  br i1 %cmp1.i, label %if.then43, label %for.inc

if.then34.if.else.exit: ; preds = %if.then34
  %cmp2.i = icmp sgt i64 %add41, 0
  %cmp4.i = icmp eq i32 %ident32.load, 2
  %cmp4.i. = and i1 %cmp4.i, %cmp2.i
  br i1 %cmp4.i., label %if.then43, label %for.inc

if.then43: ; preds = %if.then34
  %abs_cost56 = getelementptr inbounds %struct.basket, %struct.basket* undef, i64 0, i32 2
  br label %for.inc

for.inc: ; preds = %if.then43, %if.then34, %for.body
  %add.ptr60 = getelementptr inbounds %struct.arc, %struct.arc* %arc, i64 undef
  br label %for.body
}
