This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] Find and remove shuffles from commutative reductions
ClosedPublic

Authored by dmgreen on Apr 11 2022, 3:53 AM.

Details

Summary

Given a shuffle feeding a reduction, the lane ordering of the shuffle will not alter the result. This is also true if there are a number of operations between the reduction and the shuffle, providing they only operate lane-wise. This patch searches for cases like that in Vector Combine, allowing us to check the cost of the shuffle vs an in-order identity shuffle and replace the order of possible. This only handles a single shuffle at the moment to keep things simple, and is able to ignore splats that produce results where every result is the same.

This is a more powerful version of a combine that already happens in instrcombine, capable of optimizing more cases by looking through more instructions and being able to cost the shuffle.

Diff Detail

Event Timeline

dmgreen created this revision.Apr 11 2022, 3:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 3:53 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
dmgreen requested review of this revision.Apr 11 2022, 3:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 3:53 AM
RKSimon added inline comments.Apr 11 2022, 4:03 AM
llvm/test/Transforms/VectorCombine/AArch64/vecreduce-shuffle.ll
20

What about we start with a much simpler combine in InstSimplify/Combine that just removes a permute it is feeds a commutative reduction?

dmgreen added inline comments.Apr 11 2022, 4:12 AM
llvm/test/Transforms/VectorCombine/AArch64/vecreduce-shuffle.ll
20

We already have that combine in https://github.com/llvm/llvm-project/blob/431e93f4f56e5b839bf1f746d65139ccf3ca2232/llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp#L2601.

This is trying to add a more complex version for vector combine that can handle more patterns, and potentially remove more shuffles that are cheaper than the original (not just single-input identity masks).

This test I wanted to leave in to make sure there was vector-combine testing for it, even if the simpler combine exists elsewhere too.

Ping any comments?

samtebbs added inline comments.Apr 26 2022, 8:08 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
1162

Does this also need to check if the binary operation is commutative? It may be a good idea to add a test with a non-commutative reduction and a qualifying shuffle, to make sure this function's behaviour isn't broken in the future.

dmgreen added inline comments.Apr 27 2022, 6:59 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
1162

I don't think it should matter if the binary operator is commutative. There is a test with a lshr in reduceshuffle_twoin_ext_v16i32, for example. These binary operators we are only looking through, back to the shuffle. We don't modify them directly, just the order that the vector-lanes operate. Which is safe because they are only used by the reduction, and only use splats and the shuffle we transform.

LGTM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
1162

Yes you're right. Only the reduction intrinsic needs to be commutative which you have checked for in the switch statement above.

samtebbs accepted this revision.Apr 27 2022, 8:25 AM
This revision is now accepted and ready to land.Apr 27 2022, 8:25 AM
spatel added inline comments.Apr 27 2022, 8:34 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
1162

Could add a near duplicate of that test except the shuffled value is operand 1 of the shift (to prove that we can find the shuffle through either operand of a binop).

1213–1219

If I'm seeing it correctly, this logic ensures we won't try to re-shuffle a mask with duplicate elements like:
<i32 1, i32 2, i32 1, i32 3>

Is that intentional? Either way, it would be good to have a test like that.

dmgreen added inline comments.Apr 28 2022, 5:47 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
1213–1219

I think it was, yes. But mostly just because it simplified things and it didn't sound as profitable (as-in the best order to aim for is an identity mask and having extra lanes in there can mess that up). There is a test for it in reduceshuffle_twoin_repeat_v4i32. I was aiming for something that removed the shuffle - either because it can be removed or it becomes a concat or something simple like that.

It might be profitable in more cases though, and so long as we are cost modelling it, sorting the mask indices should be OK. I can change it to that, as it has the added benefit that is simplifies the code a little.

dmgreen updated this revision to Diff 425754.Apr 28 2022, 5:48 AM

Update to sort the indices.

spatel accepted this revision.Apr 28 2022, 8:08 AM

LGTM - I think the patch title would be better as something like "[VectorCombine] try to reduce shuffle cost for commutative reduction operands"

I tried testing with x86 and opened: https://github.com/llvm/llvm-project/issues/55170

There may be some generalization where we recognize other operations that don't care about element order.
For example, if we're testing if all elements are equal to zero:
https://alive2.llvm.org/ce/z/DFk-aC

We don't appear to optimize even that easy unary shuffle case in any pass currently.

LGTM - I think the patch title would be better as something like "[VectorCombine] try to reduce shuffle cost for commutative reduction operands"

I tried testing with x86 and opened: https://github.com/llvm/llvm-project/issues/55170

There may be some generalization where we recognize other operations that don't care about element order.
For example, if we're testing if all elements are equal to zero:
https://alive2.llvm.org/ce/z/DFk-aC

We don't appear to optimize even that easy unary shuffle case in any pass currently.

Thanks. There is D100486 that helps with X86 shuffles, but I'm not sure it applies to these cases quite yet.

This revision was landed with ongoing or failed builds.Apr 28 2022, 11:46 AM
This revision was automatically updated to reflect the committed changes.