This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Canonicalize commutative operands based on LSLFast
AbandonedPublic

Authored by haicheng on Apr 10 2016, 11:52 PM.

Details

Summary

LSLFast: a subtarget feature indicating that a logical shift left of up to 3 places is fast.

On Kryo, if a commutative instruction has an LSL on both operands and the LSL can be folded into the instruction's shifted-register operand (e.g., add x0, x1, x2, lsl #3), then we should canonicalize the operands so that the operand with the smaller shift amount is the one that gets folded.

For example, rather than

lsl x1, x1, #1
add x0, x1, x2, lsl #4

we should prefer

lsl x2, x2, #4
add x0, x2, x1, lsl #1

as this saves a cycle on the add instruction.

Now Add/And/Xor/Or support this.
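
A minimal sketch of how such a canonicalization might be expressed as a target DAG combine, assuming the LSLFast capability is available as a simple boolean; the helper name and structure below are illustrative only, not necessarily how the actual patch is organized:

#include "llvm/CodeGen/SelectionDAG.h"
#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// Illustrative sketch (not the actual patch): for a commutative node
// (ADD/AND/OR/XOR) whose operands are both constant left shifts, swap the
// operands so the smaller shift sits in the position that gets folded into
// the instruction's shifted-register form. HasLSLFast is an assumed flag for
// cores where a short folded LSL is free.
static SDValue canonicalizeShiftedOperands(SDNode *N, SelectionDAG &DAG,
                                           bool HasLSLFast) {
  if (!HasLSLFast)
    return SDValue();

  SDValue LHS = N->getOperand(0);
  SDValue RHS = N->getOperand(1);

  // Both operands must be SHL by an immediate amount.
  if (LHS.getOpcode() != ISD::SHL || RHS.getOpcode() != ISD::SHL)
    return SDValue();
  auto *LAmt = dyn_cast<ConstantSDNode>(LHS.getOperand(1));
  auto *RAmt = dyn_cast<ConstantSDNode>(RHS.getOperand(1));
  if (!LAmt || !RAmt)
    return SDValue();

  // The second operand is the one assumed to be matched as the
  // shifted-register form, so move the smaller shift there
  // (e.g. prefer folding lsl #1 over lsl #4, as in the example above).
  if (LAmt->getZExtValue() < RAmt->getZExtValue())
    return DAG.getNode(N->getOpcode(), SDLoc(N), N->getValueType(0), RHS, LHS);

  return SDValue(); // already in the preferred order
}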

Diff Detail

Repository
rL LLVM

Event Timeline

haicheng updated this revision to Diff 53195.Apr 10 2016, 11:52 PM
haicheng retitled this revision from to [AArch64] Canonicalize commutative operands based on LSLFast.
haicheng updated this object.
haicheng set the repository for this revision to rL LLVM.
haicheng added a subscriber: llvm-commits.

I think this optimization will affect all ARM Architectures. Is this optimization also good for cortex-a57?

Could you let me know which benchmarks have this pattern, please?

I think this optimization will affect all ARM Architectures. Is this optimization also good for cortex-a57?

Or A53, or A72, or A35...

Without further performance numbers, it's hard to say. If it doesn't have a positive effect on a vanilla AArch64 design, it should be behind a feature flag.

I think this optimization will affect all ARM Architectures. Is this optimization also good for cortex-a57?

Could you let me know which benchmarks have this pattern, please?

When compiling spec2006 targeting cortex-a57, this pattern occurs once in spec2006/gcc, twice in spec2006/h264ref, and once in spec2006/perlbench; the rest of spec2006 remains unchanged. As for performance after the change: spec2006/gcc -0.05%, spec2006/h264ref +0.12%, spec2006/perlbench +0.01%, averaged over 5 iterations on a cortex-a57 device.

Kindly ping.

t.p.northover edited edge metadata.Apr 18 2016, 3:22 PM

If it doesn't have a positive effect on a vanilla AArch64 design, it should be behind a feature flag.

I'm instantly suspicious of code behind any kind of CPU check. I'd say enable it generally unless there's good reason not to. It avoids code complication and overly-specific optimizations.

Tim.

mcrosier edited edge metadata.Apr 19 2016, 2:07 PM

In D18949#404513, @t.p.northover wrote:
I'm instantly suspicious of code behind any kind of CPU check. I'd say enable it generally unless there's good reason not to. It avoids code complication and overly-specific optimizations.

I tend to agree with Tim. I'd prefer to minimize the use of CPU checks.

Given the limited gains (all appear to be well within noise), I'm inclined to abandon this patch.

Chad

rengolin requested changes to this revision.Apr 19 2016, 2:24 PM
rengolin added a reviewer: rengolin.

Given the limited gains (all appear to be well within noise), I'm inclined to abandon this patch.

Agreed.

This revision now requires changes to proceed.Apr 19 2016, 2:24 PM
haicheng abandoned this revision.Apr 20 2016, 7:09 AM