SRA/SRL instead of CMP/CSEL when folding SDIV by a power-of-2 constant divisor.
Does this differ in an important way from what you get if you remove BuildSDIVPow2? I think you'd get the sign bit shifted all the way to the right followed by an lsr, while this patch shifts the sign bit only as far as it needs to.
Removing BuildSDIVPow2 also works for the scalar cases, but the tests failed on the SVE cases.
I could also shift by 31/63 every time, but I think the current code is also correct.
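For reference, a minimal C++ model (not the actual DAG lowering; function names and the tested k values are made up here) of the two 32-bit sequences being compared: the shift-based expansion (asr of the sign bit followed by an lsr) and the cmp/csel form this patch replaces.

```cpp
// Illustrative model only; relies on arithmetic right shift of negative
// values, which mainstream compilers provide (guaranteed since C++20).
#include <cassert>
#include <cstdint>

// Shift-based expansion: asr #31, lsr #(32-k), add, asr #k.
int32_t sdiv_pow2_shift(int32_t x, unsigned k) {
  int32_t sign = x >> 31;                      // asr #31: 0 or -1
  uint32_t bias = (uint32_t)sign >> (32 - k);  // lsr: 2^k-1 if x is negative
  return (x + (int32_t)bias) >> k;             // add, then asr #k
}

// cmp/csel form: add #(2^k-1), cmp, csel, asr #k.
int32_t sdiv_pow2_csel(int32_t x, unsigned k) {
  int32_t biased = x + ((1 << k) - 1);         // add #(2^k-1)
  int32_t t = (x < 0) ? biased : x;            // cmp #0 + csel ..., lt
  return t >> k;                               // asr #k
}

int main() {
  for (int32_t x : {-1000, -64, -1, 0, 1, 63, 1000})
    for (unsigned k : {1u, 4u, 6u})
      assert(sdiv_pow2_shift(x, k) == x / (1 << k) &&
             sdiv_pow2_csel(x, k) == x / (1 << k));
}
```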
Is that because of the SVE special case on line 13536? What if you left that code and returned SDValue() after it?
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp:13536
Should this be checking that the VT is a fixed vector before calling Subtarget->useSVEForFixedLengthVectors()? Without that, it looks like even scalars are blocked when Subtarget->useSVEForFixedLengthVectors() is true.
If I'm following correctly, this is essentially reverting D4438. It was implemented that way at the time to avoid the more complex add w8, w0, w8, lsr #26 operation. Reverting that isn't obviously an improvement; I don't think ARM microarchitecture has changed significantly in this respect since the last time it was measured.
Yeah, I was wondering what made this better. It's smaller at least, which is good, and it should probably be used at minsize if it isn't already. But this kind of shows that it's not necessarily better, if the add and cmp can be executed in parallel and the add+shift takes 2 cycles: https://godbolt.org/z/3zvhxPbsP
I think AArch64's current BuildSDIVPow2 produces two AArch64ISD nodes, which means fewer opportunities to combine.
Add with shift can be implemented in three ways:
- Split into two micro-ops, one add and one shift; this is always slower than an independent add+cmp.
- One pipeline stage to do the shift; generally this gets better IPC than an independent add+cmp, but the worst case will be slower than an independent add+cmp.
- A direct three-input combinational logic circuit; this is always better than an independent add+cmp.

Can someone tell me which of these ways mainstream AArch64 processors use? I know some Arm processors use case 2.
And if we confirm add-with-shift is really slower, how about GCC's implementation for mod (2^N)?

GCC:
  negs  w1, w0
  and   w0, w0, 15
  and   w1, w1, 15
  csneg w0, w0, w1, mi
  ret

Current Clang:
  add  w8, w0, #15
  cmp  w0, #0
  csel w8, w8, w0, lt
  and  w8, w8, #0xfffffff0
  sub  w0, w0, w8
  ret
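Judging from the #15 and #0xfffffff0 masks, both sequences presumably come from something like x % 16 on a signed 32-bit value (an assumption; the original source isn't shown here). A minimal C++ model of the two sequences, with made-up function names:

```cpp
// Illustrative models of the two srem-by-16 sequences above; C truncating
// semantics (x % 16) are assumed.
#include <cassert>
#include <cstdint>

int32_t srem16_gcc_style(int32_t x) {     // negs / and / and / csneg
  uint32_t neg = 0u - (uint32_t)x;        // negs w1, w0 (wraps for INT32_MIN)
  int32_t pos_rem = x & 15;               // and w0, w0, 15
  int32_t neg_rem = (int32_t)(neg & 15);  // and w1, w1, 15
  return x > 0 ? pos_rem : -neg_rem;      // csneg w0, w0, w1, mi
}

int32_t srem16_clang_style(int32_t x) {   // add / cmp / csel / and / sub
  int32_t biased = x + 15;                // add w8, w0, #15
  int32_t t = (x < 0) ? biased : x;       // cmp w0, #0 + csel w8, w8, w0, lt
  t &= ~15;                               // and w8, w8, #0xfffffff0
  return x - t;                           // sub w0, w0, w8
}

int main() {
  for (int32_t x : {-33, -17, -16, -1, 0, 1, 15, 16, 17})
    assert(srem16_gcc_style(x) == x % 16 && srem16_clang_style(x) == x % 16);
}
```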
Is it also possible that in some contexts we may want to avoid setting flags where possible, e.g. in loops with control flow? The architecture only has one flags register, but many GPRs.
And one other point is:
Case @dont_fold_srem_i16_smax saves 6 instructions at the cost of 3 extra add+shift operations.
Case @dont_fold_srem_power_of_two saves 9 instructions at the cost of 1 extra add+shift.
So maybe we can use the general path for the vector case at least?
Where is the savings actually coming from here? I don't think it's related to it being a vector; we're just unrolling it into scalar ops.
Doesn't seem likely to me? From what I've seen, flags don't normally have a long live range in practice.
True; are there specific opportunities that matter?
> Add with shift can be implemented in three ways:
> - Split into two micro-ops, one add and one shift; this is always slower than an independent add+cmp.
> - One pipeline stage to do the shift; generally this gets better IPC than an independent add+cmp, but the worst case will be slower than an independent add+cmp.
> - A direct three-input combinational logic circuit; this is always better than an independent add+cmp.
> Can someone tell me which of these ways mainstream AArch64 processors use? I know some Arm processors use case 2.
You can download the software optimization guides for most Cortex cores from ARM. Generally, add-with-shift is one micro-op, but it has two cycles of latency, where basic arithmetic has one. So the current sequence saves a cycle of latency in most situations.
> And if we confirm add-with-shift is really slower, how about GCC's implementation for mod (2^N)?
That looks like an improvement, sure.
It looks like those two cases use some extra fmovs involving the v1 register in the original version, and they share some SRAs with the non-power-of-2 srem case.
> True; are there specific opportunities that matter?
It's hard to say. There is a lot of combine code in DAGCombiner::visitADD and DAGCombiner::visitSRA, and the SRA can also be shared in the vector case.
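As one illustration of the kind of sharing I mean (a sketch, not the actual DAG combine): when an sdiv and an srem by the same power of two appear together, the shift-based form lets the remainder reuse the quotient with just a shift and a subtract, instead of a second cmp/csel chain. The function name and fixed divisor below are made up for this example.

```cpp
// Illustrative sketch only: combined sdiv/srem by 16, sharing the asr-based
// quotient. Relies on arithmetic right shift of negative values.
#include <cassert>
#include <cstdint>

void sdivrem16(int32_t x, int32_t &q, int32_t &r) {
  int32_t sign = x >> 31;               // asr #31 (shared work)
  uint32_t bias = (uint32_t)sign >> 28; // lsr #28: 15 if x < 0, else 0
  q = (x + (int32_t)bias) >> 4;         // add + asr #4 -> quotient
  r = x - (int32_t)((uint32_t)q << 4);  // lsl + sub reuses q -> remainder
}

int main() {
  for (int32_t x : {-33, -32, -1, 0, 1, 31, 33}) {
    int32_t q, r;
    sdivrem16(x, q, r);
    assert(q == x / 16 && r == x % 16);
  }
}
```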
> That looks like an improvement, sure.
I will use this approach to fix the issue later.