This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Lower sdiv x, pow2 using add + select + shift.
ClosedPublic

Authored by mcrosier on Jul 9 2014, 9:45 AM.

Details

Summary

This patch lower sdiv x, pow2 using a series of add + select + shift.

The target-independent DAGcombiner will generate:
asr w1, X, #31 w1 = splat sign bit.
add X, X, w1, lsr #28 X = X + 0 or pow2-1
asr w0, X, asr #4 w0 = X/pow2

However, the add + shifts is expensive, so generate:
add w0, X, 15 w0 = X + pow2-1
cmp X, wzr X - 0
csel X, w0, X, lt X = (X < 0) ? X + pow2-1 : X;
asr w0, X, asr 4 w0 = X/pow2

This resulted in a speedup for
eembc/consumer_suite/mp3playerfixeddata* +9-15%
eembc/office_suite/ditherv2data* +8-9%

No regressions above noise were seen.

Chad

Diff Detail

Event Timeline

mcrosier updated this revision to Diff 11202.Jul 9 2014, 9:45 AM
mcrosier retitled this revision from to [AArch64] Lower sdiv x, pow2 using add + select + shift..
mcrosier updated this object.
mcrosier edited the test plan for this revision. (Show Details)
mcrosier set the repository for this revision to rL LLVM.

add llvm-commits to this review so it's on-list?

dblaikie added a subscriber: Unknown Object (MLST).Jul 9 2014, 10:03 AM

(adding the commits list)

jmolloy edited edge metadata.Jul 9 2014, 10:20 AM

Hi Chad,

This is interesting. I suppose this is relying on the fact that X is >= 0 much more often than it is negative?

I can't think of why this sequence would be faster otherwise - the csel is resolvable to nothing as soon as X is known (if X >= 0).

Extending this, it seems not improbable that a sequence involving a branch instead of a select would be even faster on OoO cores as it would allow the branch to resolve as soon as X is known:

add w0, X, 15
cmp X, wzr
b.lt 2f
1:
... continue basic block
... end basic block

2:
mov X, w0
b 1b

Have you tried generating such a sequence? What core did you measure the speedup on - A53, A57 or another?

Cheers,

James

In D4438#6, @dblaikie wrote:

add llvm-commits to this review so it's on-list?

Thanks, David!

James,

In D4438#10, @jmolloy wrote:

This is interesting. I suppose this is relying on the fact that X is >= 0 much more often than it is negative?

I can't think of why this sequence would be faster otherwise - the csel is resolvable to nothing as soon as X is known (if X >= 0).

The specific cases in EEMBC do not remove the csel, so I don't think that is the case. My understanding is that the add with shift is rather expensive (at least on A53). I don't know if this would be an enhancement on A57 or other processors, but I hope so.

Extending this, it seems not improbable that a sequence involving a branch instead of a select would be even faster on OoO cores as it would allow the branch to resolve as soon as X is known:

add w0, X, 15
cmp X, wzr
b.lt 2f
1:
... continue basic block
... end basic block

2:
mov X, w0
b 1b

That seems reasonable, but I don't have any way of testing this.

Have you tried generating such a sequence? What core did you measure the speedup on - A53, A57 or another?

I have not tried such a sequence. This was measured on an A53 device, which is the only device I have available.

If anyone could check this on A57/Cyclone I would greatly appreciate it.

BTW, gcc performs the same transformation.

Cheers,

James

Thanks, James.

grosbach edited edge metadata.Jul 9 2014, 11:48 AM

Are there improvements other than on EEMBC? We don't have access to that suite.

In D4438#14, @grosbach wrote:

Are there improvements other than on EEMBC? We don't have access to that suite.

I saw not regressions or improvements in SPEC2K. Unfortunately, I can't run SPEC2K6 on our devices. Our infrastructure also doesn't support the llvm/test-suite.

I can create a synthetic benchmark, if that helps.

mcrosier edited edge metadata.Jul 9 2014, 1:14 PM
mcrosier removed rL LLVM as the repository for this revision.

James,

From: mankeyrabbit@gmail.com [mailto:mankeyrabbit@gmail.com] On Behalf Of James Molloy
Sent: Wednesday, July 09, 2014 3:25 PM
To: reviews+D4438+public+a452d711668fdd91@reviews.llvm.org
Cc: mcrosier@codeaurora.org; Tim Northover; Chandler Carruth; Jiangning Liu; James Molloy; Jim Grosbach; LLVM Commits; silviu.baranga@gmail.com
Subject: Re: [PATCH] [AArch64] Lower sdiv x, pow2 using add + select + shift.

I can create a synthetic benchmark, if that helps.

That would help a lot, and would save me from doing the exact same thing to test it on C-A57 :)

Ok, I’ll try to put something together shortly.

Chad

Hi Chad,

I’ve taken a look at the performance of that code sequence, and can confirm that it is no worse in all situations than the current sequence. In some situations it causes a ~5% performance uplift on A53, and in some cases a ~20% performance uplift in A57 (on a microbenchmark running this sequence in a loop).

So the code sequence itself looks good to me, but I haven’t yet looked at the implementation.

Cheers,

James

From: Chad Rosier [mailto:mcrosier@codeaurora.org]
Sent: 09 July 2014 21:22
To: 'James Molloy'; reviews+D4438+public+a452d711668fdd91@reviews.llvm.org
Cc: 'Tim Northover'; 'Chandler Carruth'; 'Jiangning Liu'; James Molloy; 'Jim Grosbach'; 'LLVM Commits'; silviu.baranga@gmail.com
Subject: RE: [PATCH] [AArch64] Lower sdiv x, pow2 using add + select + shift.

James,

From: mankeyrabbit@gmail.com [mailto:mankeyrabbit@gmail.com] On Behalf Of James Molloy
Sent: Wednesday, July 09, 2014 3:25 PM
To: reviews+D4438+public+a452d711668fdd91@reviews.llvm.org
Cc: mcrosier@codeaurora.org; Tim Northover; Chandler Carruth; Jiangning Liu; James Molloy; Jim Grosbach; LLVM Commits; silviu.baranga@gmail.com
Subject: Re: [PATCH] [AArch64] Lower sdiv x, pow2 using add + select + shift.

I can create a synthetic benchmark, if that helps.

That would help a lot, and would save me from doing the exact same thing to test it on C-A57 :)

Ok, I’ll try to put something together shortly.

Chad

James,

It seems to me that the branch mispredict cost for the case where the values of X are random would outweigh the benefits of this transformation for your alternative code sequence, even on OoO cores.

I don't think it would entirely ok to make that assumption here (X >= 0 predictable).

This point obviously doesn't matter for the csel solution.

Thanks,
Silviu

mcrosier updated this revision to Diff 11716.Jul 21 2014, 1:01 PM

New revision with Tim's suggested code changes.

Tim,
I gave this a little more thought and I don't think we want to move the BuildSDIV call above the pow2 block. How would you feel about adding a BuildSDivPow2 function? The problem is that the pow2 block will never be hit unless I add some ugly logic in both the target-independent and target-dependent (i.e., AArch64) implementations of BuildSDIV to bail when we should perform the pow2 combine. If I had a BuildSDivPow2 function I would just put it inside the pow2 block before the target-independent implementation. Please let me know what you think.

Chad

mcrosier updated this revision to Diff 11774.Jul 22 2014, 12:58 PM

Revised version with proposed BuildSDIVPow2 API. Please have a look.

Chad

t.p.northover edited edge metadata.Jul 23 2014, 2:01 AM

Hi Chad,

I think that's an entirely reasonable approach to. This looks fine to me.

Cheers.

Tim.

t.p.northover accepted this revision.Jul 23 2014, 2:01 AM
t.p.northover edited edge metadata.
This revision is now accepted and ready to land.Jul 23 2014, 2:01 AM
mcrosier closed this revision.Jul 23 2014, 8:54 AM

Thanks, Tim. This has been committed in r213758.