This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Optimize SDIV with pow2 constant divisor
Abandoned · Public

Authored by bcl5980 on Mar 31 2022, 8:18 AM.

Details

Summary

Use SRA/SRL instead of CMP/CSEL when folding SDIV with a power-of-2 constant divisor.

Fix https://github.com/llvm/llvm-project/issues/54649
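
For reference, here is a minimal C sketch (illustrative only, not taken from the diff; the helper names are made up for the sketch) of the two lowerings being compared, for a 32-bit x / 64. The AArch64 instructions in the comments are the sequences discussed in the thread below, and the code assumes an arithmetic right shift for signed int, as AArch64 compilers provide.

  #include <assert.h>

  /* Current lowering: bias with a compare/select, then shift. */
  static int sdiv64_csel(int x) {
    int biased = x + 63;            /* add  w8, w0, #63                  */
    int n = (x < 0) ? biased : x;   /* cmp  w0, #0 ; csel w8, w8, w0, lt */
    return n >> 6;                  /* asr  w0, w8, #6                   */
  }

  /* Proposed lowering: derive the same bias from the sign bit with shifts. */
  static int sdiv64_shift(int x) {
    unsigned sign = (unsigned)(x >> 31);  /* asr  w8, w0, #31 (a smaller shift also works; see the sketch further down) */
    int biased = x + (int)(sign >> 26);   /* add  w8, w0, w8, lsr #26 */
    return biased >> 6;                   /* asr  w0, w8, #6          */
  }

  int main(void) {
    for (int x = -100000; x <= 100000; ++x)
      assert(sdiv64_csel(x) == x / 64 && sdiv64_shift(x) == x / 64);
    return 0;
  }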

Diff Detail

Event Timeline

bcl5980 created this revision. Mar 31 2022, 8:18 AM
Herald added a project: Restricted Project. Mar 31 2022, 8:18 AM
bcl5980 requested review of this revision. Mar 31 2022, 8:18 AM
bcl5980 updated this revision to Diff 419484. Mar 31 2022, 8:42 AM

add fast isel

craig.topper added a subscriber: craig.topper. Edited Mar 31 2022, 9:08 AM

Does this differ in an important way from what you get if you remove BuildSDIVPow2? I think you'll get the sign bit shifted all the way to the right followed by an lsr, while this patch will shift the sign bit only as far as it needs to.

> Does this differ in an important way from what you get if you remove BuildSDIVPow2? I think you'll get the sign bit shifted all the way to the right followed by an lsr, while this patch will shift the sign bit only as far as it needs to.

Removing BuildSDIVPow2 also works for the scalar cases, but tests failed on the SVE cases.
I can also shift by 31/63 every time, but I think the current code is also correct.
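
(An illustrative C sketch of the shift-amount point above, with made-up helper names: for a divide by 2^k, an arithmetic shift by only k-1 already makes the top k bits copies of the sign bit, so the following logical shift by 32-k extracts the same bias, 2^k - 1 for negative inputs and 0 otherwise, as the full 31-bit shift would. Assumes an arithmetic right shift for signed int.)

  #include <assert.h>

  static unsigned bias_full_shift(int x, int k) {
    return (unsigned)(x >> 31) >> (32 - k);      /* asr #31 ; lsr #(32-k) */
  }

  static unsigned bias_min_shift(int x, int k) {
    return (unsigned)(x >> (k - 1)) >> (32 - k); /* asr #(k-1) ; lsr #(32-k) */
  }

  int main(void) {
    for (int k = 1; k <= 31; ++k)
      for (int x = -100000; x <= 100000; ++x)
        assert(bias_full_shift(x, k) == bias_min_shift(x, k));
    return 0;
  }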

>> Does this differ in an important way from what you get if you remove BuildSDIVPow2? I think you'll get the sign bit shifted all the way to the right followed by an lsr, while this patch will shift the sign bit only as far as it needs to.
>
> Removing BuildSDIVPow2 also works for the scalar cases, but tests failed on the SVE cases.
> I can also shift by 31/63 every time, but I think the current code is also correct.

Is that because of the SVE special case on line 13536? What if you left that code and returned SDValue() after it?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13536

Should this be checking that the VT is a fixed vector before calling Subtarget->useSVEForFixedLengthVectors()? Without that, it looks like even scalar types are blocked when Subtarget->useSVEForFixedLengthVectors() is true.

bcl5980 updated this revision to Diff 419505. Mar 31 2022, 10:08 AM

update per @craig.topper's suggestion: use the general BuildSDIVPow2

>>> Does this differ in an important way from what you get if you remove BuildSDIVPow2? I think you'll get the sign bit shifted all the way to the right followed by an lsr, while this patch will shift the sign bit only as far as it needs to.
>>
>> Removing BuildSDIVPow2 also works for the scalar cases, but tests failed on the SVE cases.
>> I can also shift by 31/63 every time, but I think the current code is also correct.
>
> Is that because of the SVE special case on line 13536? What if you left that code and returned SDValue() after it?

Yeah, you are right. Returning after line 13536 is correct.

bcl5980 updated this revision to Diff 419506. Mar 31 2022, 10:10 AM
bcl5980 marked an inline comment as done.

update correct version

If I'm following correctly, this is essentially reverting D4438. It was implemented that way at the time to avoid the complex add w8, w0, w8, lsr #26 operation. Reverting that isn't obviously an improvement; I don't think ARM microarchitecture has significantly changed in this respect since the last time it was measured.

> If I'm following correctly, this is essentially reverting D4438. It was implemented that way at the time to avoid the complex add w8, w0, w8, lsr #26 operation. Reverting that isn't obviously an improvement; I don't think ARM microarchitecture has significantly changed in this respect since the last time it was measured.

Yeah I was wondering what made this better. It's smaller at least, which is good. It should probably be used at minsize if it isn't already. But this kind of shows that it's not necessarily better, if the add and cmp can be executed in parallel, and the add+shift is 2 cycles: https://godbolt.org/z/3zvhxPbsP

bcl5980 added a comment. Edited Mar 31 2022, 7:39 PM

>> If I'm following correctly, this is essentially reverting D4438. It was implemented that way at the time to avoid the complex add w8, w0, w8, lsr #26 operation. Reverting that isn't obviously an improvement; I don't think ARM microarchitecture has significantly changed in this respect since the last time it was measured.
>
> Yeah I was wondering what made this better. It's smaller at least, which is good. It should probably be used at minsize if it isn't already. But this kind of shows that it's not necessarily better, if the add and cmp can be executed in parallel, and the add+shift is 2 cycles: https://godbolt.org/z/3zvhxPbsP

I think AArch64's current BuildSDIVPow2 produces two AArch64ISD nodes, which means fewer opportunities to combine.

Add with shift can be implemented in 3 ways:

  1. Split into two micro-ops, one add and one shift; this is always slower than an independent add+cmp.
  2. An extra pipeline stage to do the shift; generally this gets better IPC than an independent add+cmp, but the worst case will be slower.
  3. A direct three-input combinational logic circuit; this is always better than an independent add+cmp.

Can someone tell me which way mainstream AArch64 processors implement it?
I know some Arm processors use case 2.

And if we confirm that add-with-shift is really slower, how about GCC's implementation for mod (2^N)?

negs    w1, w0
and     w0, w0, 15
and     w1, w1, 15
csneg   w0, w0, w1, mi
ret

current clang

add     w8, w0, #15
cmp     w0, #0
csel    w8, w8, w0, lt
and     w8, w8, #0xfffffff0
sub     w0, w0, w8
ret
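
(For reference, an illustrative C sketch of the identities behind the two sequences above for x % 16; the function names are made up for the sketch, the comments map each statement to the instructions quoted above, and the equivalence is checked against the compiler's own %.)

  #include <assert.h>

  /* GCC-style: handle the sign with and/csneg, no explicit divide. */
  static int srem16_gcc(int x) {
    int neg = -x;                   /* negs  w1, w0         */
    int a = x & 15;                 /* and   w0, w0, 15     */
    int b = neg & 15;               /* and   w1, w1, 15     */
    return (x > 0) ? a : -b;        /* csneg w0, w0, w1, mi */
  }

  /* Current clang: compute x - (x / 16) * 16 via the biased-divide lowering. */
  static int srem16_clang(int x) {
    int biased = x + 15;            /* add  w8, w0, #15                  */
    int n = (x < 0) ? biased : x;   /* cmp  w0, #0 ; csel w8, w8, w0, lt */
    n &= ~15;                       /* and  w8, w8, #0xfffffff0          */
    return x - n;                   /* sub  w0, w0, w8                   */
  }

  int main(void) {
    for (int x = -100000; x <= 100000; ++x)
      assert(srem16_gcc(x) == x % 16 && srem16_clang(x) == x % 16);
    return 0;
  }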
david-arm added a comment. Edited Apr 1 2022, 1:08 AM

Is it also possible that in some contexts we may want to avoid setting flags where possible? i.e. for loops with control flow? The architecture only has one flags register, but has many GPRs.

And one other point:
Case @dont_fold_srem_i16_smax saves 6 instructions with 3 extra add+shifts.
Case @dont_fold_srem_power_of_two saves 9 instructions with 1 extra add+shift.
So maybe we can use the general path for the vector case at least?

> And one other point:
> Case @dont_fold_srem_i16_smax saves 6 instructions with 3 extra add+shifts.
> Case @dont_fold_srem_power_of_two saves 9 instructions with 1 extra add+shift.
> So maybe we can use the general path for the vector case at least?

Where is the savings actually coming from here? I don't think it's related to it being a vector; we're just unrolling it into scalar ops.

> Is it also possible that in some contexts we may want to avoid setting flags where possible? i.e. for loops with control flow? The architecture only has one flags register, but has many GPRs.

Doesn't seem likely to me? From what I've seen, flags don't normally have a long live range in practice.

> I think AArch64's current BuildSDIVPow2 produces two AArch64ISD nodes, which means fewer opportunities to combine.

True; are there specific opportunities that matter?

> Add with shift can be implemented in 3 ways:
>
>   1. Split into two micro-ops, one add and one shift; this is always slower than an independent add+cmp.
>   2. An extra pipeline stage to do the shift; generally this gets better IPC than an independent add+cmp, but the worst case will be slower.
>   3. A direct three-input combinational logic circuit; this is always better than an independent add+cmp.
>
> Can someone tell me which way mainstream AArch64 processors implement it?
> I know some Arm processors use case 2.

You can download the software optimization guides for most Cortex cores from ARM. Generally, add-with-shift is one micro-op, but it has two cycles of latency, where basic arithmetic has one. So the current sequence saves a cycle of latency in most situations.
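
(A rough worked comparison of the critical paths under those assumptions: simple ALU ops and csel at one cycle, an add with a shifted operand at two cycles, and the add and cmp issuing in parallel. These are back-of-the-envelope numbers, not figures from any particular optimization guide.)

  #include <stdio.h>

  int main(void) {
    /* Current sequence for x / 64: add ; cmp ; csel ; asr */
    int current  = 1 /* add || cmp */ + 1 /* csel */ + 1 /* asr */;
    /* Proposed sequence: asr ; add ..., lsr #26 ; asr */
    int proposed = 1 /* asr */ + 2 /* add with shifted operand */ + 1 /* asr */;
    printf("critical path: current = %d cycles, proposed = %d cycles\n",
           current, proposed);      /* prints 3 vs 4 */
    return 0;
  }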

> And if we confirm that add-with-shift is really slower, how about GCC's implementation for mod (2^N)?

That looks like an improvement, sure.

>> And one other point:
>> Case @dont_fold_srem_i16_smax saves 6 instructions with 3 extra add+shifts.
>> Case @dont_fold_srem_power_of_two saves 9 instructions with 1 extra add+shift.
>> So maybe we can use the general path for the vector case at least?

> Where is the savings actually coming from here? I don't think it's related to it being a vector; we're just unrolling it into scalar ops.

It looks like the two cases use some extra fmov instructions with the v1 register in the original version. And they share some SRAs with the non-power-of-2 srem cases.

> True; are there specific opportunities that matter?

It's hard to say. There is a lot of combine code in DAGCombiner::visitADD and DAGCombiner::visitSRA. And the SRA can also be shared in the vector case.

> That looks like an improvement, sure.

I will use that approach to fix the issue later.

bcl5980 abandoned this revision. Apr 1 2022, 11:45 PM