This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Better reductions
ClosedPublic

Authored by dmgreen on Jun 8 2020, 7:41 AM.

Details

Summary

MVE has native reductions for integer add and min/max. The others need to be expanded to a series of extract's and scalar operators to reduce the vector into a single scalar. The default codegen for that expands the reduction into a series on in-order operations.

This modifies that to something more suitable for MVE. The basic idea is to use vector operations until there are 4 remaining items then switch to pairwise operations. For example a v8f16 fadd reduction would become:
Y = VREV X
Z = ADD(X, Y)
z0 = Z[0] + Z[1]
z1 = Z[2] + Z[3]
return z0 + z1

The awkwardness (there is always some) comes in from something like a v4f16, which is first legalized by adding identity values to the extra lanes of the reduction, and which can then not be optimized away through the vrev; fadd combo, the inserts remain. I've made sure they custom lower so that we can produce the pairwise additions before the extra values are added.

Diff Detail

Event Timeline

dmgreen created this revision.Jun 8 2020, 7:41 AM
samparker added inline comments.Jun 9 2020, 1:47 AM
llvm/lib/Target/ARM/ARMISelLowering.cpp
9550

So, why 4? Is this beat and/or register pressure related? If these is beat related, shouldn't the subtarget be controlling this?

dmgreen marked an inline comment as done.Jun 9 2020, 6:36 AM
dmgreen added inline comments.
llvm/lib/Target/ARM/ARMISelLowering.cpp
9550

The options are going down to 2 or 4 really. 4 seemed best on the test I ran it on, especially for float. There you get to the point where you can pull out of each lane independently, which is important for fp16, not needing any vmovx's.

For integer it's probably closer. 2 will be less instructions, but there wasn't a lot in the performance. Some sizes/operators were quicker, some were slower by a cycle or 2. They are likely to be much rarer than float. We could go down to 2 with a vrev64, but like you said that would cross a beats boundary.

I'd prefer not to add a subtarget hook until we actually find that we need it.

samparker accepted this revision.Jun 29 2020, 2:18 AM

Sorry, forgot about this. Please add a quick comment about the importance of 4 lanes for FP16 operations.

This revision is now accepted and ready to land.Jun 29 2020, 2:18 AM
This revision was automatically updated to reflect the committed changes.