This is an archive of the discontinued LLVM Phabricator instance.

Add more efficient bitwise vector reductions on AArch64
ClosedPublic

Authored by Sp00ph on Apr 12 2023, 6:22 PM.

Details

Summary

Improves the codegen for VECREDUCE_{AND,OR,XOR} operations on AArch64. Currently, these are fully scalarized, except if the vector is a <N x i1>. This patch improves the codegen down to O(log(N)) where N is the length of the vector for vectors whose elements are not i1, by repeatedly applying the bitwise operations to the two halves of the vector until only one element is left, which contains the final result. <N x i1> bitwise reductions are handled using VECREDUCE_{UMAX,UMIN,ADD} instead.

I had to update quite a few codegen tests with these changes, with a general downward trend in instruction count. Since the vector reductions already have tests, I haven't added any new tests myself.

This is my first patch submitted to LLVM, so please tell me if I did anything wrong or if I should change anything.

Diff Detail

Event Timeline

Sp00ph created this revision.Apr 12 2023, 6:22 PM
Sp00ph requested review of this revision.Apr 12 2023, 6:22 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 12 2023, 6:22 PM
Sp00ph updated this revision to Diff 513038.Apr 12 2023, 7:03 PM

Split mask vectors that don't fit in a single register in half. This reverts some slight regressions in the codegen of vectors that are larger than a register.

It looks like the updated patch has only included changes from the first, not against base. Can you try and squash the two into a single commit?

Sp00ph updated this revision to Diff 513167.Apr 13 2023, 4:12 AM

Squash commits into one patch.

The patch as it is right now has decreased the codegen quality on some of the tests with non-power-of-2 vector reductions. I have a fix for that (padding the vectors to the next larger power of 2), but I'm not sure if that still fits in the scope of this patch or if I should put that in a follow up patch, especially since you currently can't really trust LLVM to generate good code with non-power-of-2 sized vectors anyways.

Is it worth splitting this into i1 and non-i1 parts? There are quite a few changes in the test.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13188

If SVE is available then the orv/eorv/etc should be preferred.

13216

special -> Special

13218

split -> Split

13222

Formatting - the line is a bot long here.

13248

Would it be possible for vector <= 64bits to use the 64bit type sizes? It won't matter in a lot of cases but some cpu's have a higher throughput for 64bit vectors.

efriedma added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13248

For <=64 bits, don't you want to switch to integer registers? orr x0, x0, x0, lsr #32 etc. is generally going to be faster than dup+orr.

Is it worth splitting this into i1 and non-i1 parts? There are quite a few changes in the test.

I'm pretty sure removing the i1 optimization would lead to even more test cases being changed, as codegen currently already does the i1 optimization (presumably in some platform independent lowering of REDUCE_{AND,OR}. I introduced that i1 optimization mainly to keep the i1 vector reductions from regressing.

Sp00ph updated this revision to Diff 513813.Apr 14 2023, 5:12 PM

Changes the codegen to split the vectors until they're at most 64 bits large and then do the remaining work in an integer register. This reduces instruction count in many cases, as codegen is able to combine the shift and bitwise operation into one instruction. Also revert to using andv/orv/eorv if SVE is available.

Sp00ph marked 6 inline comments as done.Apr 14 2023, 5:13 PM
Sp00ph added inline comments.Apr 15 2023, 4:36 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13229

Using either zext or sext here adds a few extra instructions in the generated code. Is it guaranteed that any-extending an i1 vector results in a vector whose elements are all either 0 or -1? It seems reasonable because afaik mask vector elements on AArch64 are always either 0 or -1, but it could also introduce some subtle incorrectness if there is some case where any-extending an i1 vector does not result in such a mask vector.

Matt added a subscriber: Matt.Apr 15 2023, 10:33 AM
efriedma added inline comments.Apr 17 2023, 9:29 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13229

No, no guarantee here. I mean, there are restrictions related to boolean operands certain specific operations (like the condition of a VSELECT), but there isn't any restriction that applies to arithmetic operations. An easy way to get a vector with arbitrary data in the high bits is truncating from nxi8 to nxi1.

You could generate a different sequence if the operand is known to be sign-extended (ComputeNumSignBits).

Sp00ph added inline comments.Apr 17 2023, 3:53 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13229

ComputeNumSignBits doesn't seem to work properly on <N x i1> function arguments. So e.g. an <8 x i1> gets lowered to an <8 x i8> during function argument lowering, and calling ComputeNumSignBits on that returns a 1 (even though <N x i1> in function arguments seems to always be all zeros or all ones; either that or the current codegen is already incorrect). If I instead sign extend the vector in the i1 branch it adds 2 redundant instructions to all the codegen tests that take a <N x i1> as a function argument. Tests that e.g. reduce a <N x i1> obtained from a setcc don't get those extra instructions because there's a setcc + sext combine I believe. I guess this could be fixed by somehow convincing ComputeNumSignBits that a <N x i1> function argument that got lowered to a <N x iM> does in fact have M sign bits?

dmgreen added inline comments.Apr 18 2023, 3:16 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13229

I believe there is no requirement that arguments are all-ones. For example https://godbolt.org/z/MYdEh1fET. There is a signext attribute that can be applied to scalars, but not vectors.

Sp00ph added inline comments.Apr 18 2023, 4:46 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13229

In that case it looks like the codegen for the boolean vector reductions is already wrong without this patch. For example this: https://llvm.godbolt.org/z/YjE8n7q7s causes calls to bad to return a 0, when I believe they should return a 1 instead, because it does umax then truncate instead of truncate then umax, and those don't commute in the general case.

Sp00ph updated this revision to Diff 514790.Apr 18 2023, 5:19 PM

Use sign_ext before umax/umin. Using any_ext instead could lead to miscompilations similar to this: https://github.com/llvm/llvm-project/issues/62211 . Also, choose the best bit width to extend to for short i1 vectors. This leads to better codegen as it doesn't need to e.g. truncate setcc results as part of the reduction.

Sp00ph marked 2 inline comments as done.Apr 18 2023, 5:20 PM

Can you rebase over D148672, now that that is in?

llvm/test/CodeGen/AArch64/reduce-and.ll
252–260

I'm surprised this passes vectors in gpr registers. It would be quite different for values vector regs.

Sp00ph updated this revision to Diff 518505.May 1 2023, 11:33 AM

Rebase to the main branch as requested.

Thanks. Can you rebase again after 41549b535097?

I think the remaining regressions are some of the illegal types (which will be different if the value was already in a vector), some cases with Not where the sign bits are not calculated from a constant buildvector properly, and one case where I think it hits the Depth limit. Some of those are a bit of a shame but understandable, and we can fix the buildvector if we have a useful testcase for it. I think we can get this in and address that after.

Sp00ph updated this revision to Diff 518820.May 2 2023, 12:02 PM

Rebase again

dmgreen accepted this revision.May 3 2023, 7:07 AM

Thanks. LGTM

Do you have commit access, or should I submit this one on your behalf?

This revision is now accepted and ready to land.May 3 2023, 7:07 AM
Sp00ph added a comment.May 3 2023, 7:08 AM

I don't have commit access, please commit it as "Sp00ph <markuseverling@gmail.com>". Thank you!

This revision was landed with ongoing or failed builds.May 3 2023, 7:56 AM
This revision was automatically updated to reflect the committed changes.