This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Add option to disable speculation of triangle whose tail is the only latch block
AbandonedPublic

Authored by bmakam on May 6 2016, 2:07 PM.

Details

Summary

This patch adds an option to disable speculation of a triangle when its
tail is the only latch block of this loop. At this time, the option
-aarch64-ccmp-disable-triangle-latch is disabled by default. I'm hoping for feedback
from others on the profitability on other targets.

When the tail of triangle is the only latch block of this loop, we end up inserting ccmp
inside the critical path of the loop. If the speculated code is cold we execute
the cold code for all the loop iterations. If the speculated code were hot the branch
predictor would anyway take that direction.

This impacts the chances of forming a ld/st pair because now the loads could possibly
end up in different blocks. However, when tested on Kryo the performance was slightly
better on spec2006 CINT/CFP benchmarks and no regressions above noise range.

Diff Detail

Event Timeline

bmakam updated this revision to Diff 56459.May 6 2016, 2:07 PM
bmakam retitled this revision from to [AArch64] Add option to disable speculation of triangle whose tail is the only latch block.
bmakam updated this object.
bmakam added a subscriber: llvm-commits.

gentle ping.

Hi Balaram,

This seems like a good thing to do overall, not just for Kryo, or when the option is chosen.

It would be good to know how it performs in vanilla AArch64 cores (A53, A57) so we could enable them by default.

cheers,
--renato

mcrosier edited edge metadata.May 16 2016, 5:37 AM

Hi Balaram,

This seems like a good thing to do overall, not just for Kryo, or when the option is chosen.

I agree and would advocate enabling this by default after some additional testing.

It would be good to know how it performs in vanilla AArch64 cores (A53, A57) so we could enable them by default.

We should be able to get numbers for at least A57, right?

cheers,
--renato

Hi Balaram,

This seems like a good thing to do overall, not just for Kryo, or when the option is chosen.

I agree and would advocate enabling this by default after some additional testing.

It would be good to know how it performs in vanilla AArch64 cores (A53, A57) so we could enable them by default.

We should be able to get numbers for at least A57, right?

Thanks Renato and Chad,

Yes I will test it on A57 and report the results, but I need others help in testing other AArch64 targets as I do not have access to them.

cheers,
--renato

bmakam added a comment.EditedMay 24 2016, 10:27 AM

Hi Renato,

It seems like this patch is unfavourable to A57. In our internal tests we found that in spec2006/mcf this patch generates the following code difference:

sub x15, x15, x17                                                             ==  sub x15, x15, x17
ldr x17, [x14,#16]                                                            ==  ldr x17, [x14,#16]
ldr x17, [x17]                                                                ==  ldr x17, [x17]
add x15, x17, x15                                                             ==  add x15, x17, x15
tbnz x15, #63, L13                                                            ==  tbnz x15, #63, L13
                                                                              >>  cbz x15, L12
cmp x15, #0x0                                                                 <<
                                                                              >>  cmp w16, #0x2
ccmp w16, #0x2, #0x0, ne                                                      <<
b.eq L14                                                                      ==  b.eq L14
b L12                                                                         ==  b L12

The performance depends on the cost of the cbz here. On Kryo we see 3% gain with this patch whereas on A57 we see 10% regression. This branch seems to be mostly not taken and so when we place the cbz out of the critical path as show below

tbnz    x15, #63, L13
cmp     w16, #0x2
b.ne    L12
cbnz    x15, L14
b       L12

the performance on A57 improves by 4%. I am not sure but it seems like the cost of cbz on A57 is higher than on Kryo.

The performance depends on the cost of the cbz here. On Kryo we see 3% gain with this patch whereas on A57 we see 10% regression. This branch seems to be mostly not taken and so when we place the cbz out of the critical path as show below (...)
the performance on A57 improves by 4%. I am not sure but it seems like the cost of cbz on A57 is higher than on Kryo.

This is interesting... I wasn't expecting that much of a difference, but sub-arch issues can cause massive changes.

Now, about the CBZ, it may be it, or it may be a class of instructions that are faster on Kryo, or it could be a red herring.

Regardless, I don't think we should limit this optimization with a FeatureFastCBZ flag, because that would be extrapolating without data. I really think that enabling this on all A64 will bring more problems than solve, but limiting on a flag or for Kryo only would be weird.

Since this doesn't seem to have a big impact, even for Kryo, I'm reluctant...

I'd welcome Tim's and James' point of view on this.

cheers,
--renato

Hi Balaram,

This seems like a good thing to do overall, not just for Kryo, or when the option is chosen.

I agree and would advocate enabling this by default after some additional testing.

It would be good to know how it performs in vanilla AArch64 cores (A53, A57) so we could enable them by default.

We should be able to get numbers for at least A57, right?

Thanks Renato and Chad,

Yes I will test it on A57 and report the results, but I need others help in testing other AArch64 targets as I do not have access to them.

cheers,
--renato

I ran this patch against r270609, on both Cortex-A53 and Cortex-A57, across spec2000, spec2006, test-suite, and a few other benchmark suites.
The only clear performance differences I saw were:

On Cortex-A53:
SingleSource/Benchmarks/Misc/mandel-2 7.7% speedup

On Cortex-A57:
spec.cpu2006.ref.429_mcf 2.9% slowdown
SingleSource/Benchmarks/Misc/mandel-2 2.9% slowdown

Overall, my measurements seem to indicate this gives a significant performance impact only in very few cases.
I'd be happy for this to be enabled by default for all cores, even though the data indicates this only has a minor impact on performance overall.

Thanks for testing Kristof. I have pushed D21299 that will fix the regressions and will help enabling this change by default for all the subtargets. Once D21299 lands, I expect to see 8% improvement in spec2006/mcf with this patch.

rengolin accepted this revision.Jun 15 2016, 10:51 AM
rengolin added a reviewer: rengolin.

With this being a flag that is disabled by default and generic to all targets, it now looks good to me.

Thanks Balaram, Kristof!

This revision is now accepted and ready to land.Jun 15 2016, 10:51 AM

Thanks for the review Renato.

This hasn't gone in yet, just FYI.

bmakam added a comment.Jul 7 2016, 8:46 AM

This hasn't gone in yet, just FYI.

Thanks Renato, I was hoping to get this in after D21299.

mcrosier resigned from this revision.Jul 26 2016, 10:38 AM
mcrosier removed a reviewer: mcrosier.
mcrosier removed a subscriber: mcrosier.

This hasn't gone in yet, just FYI.

Renato, do you still think this should be landed, now that D21299 is abandoned? Perhaps someone can use it for experimenting, but I am ok with it not going in too.

No, better to keep this local in your tree and put it up again if it's needed.

bmakam abandoned this revision.Aug 19 2016, 4:09 AM