
[SelectionDAG][AArch64] Legalize VECREDUCE

Authored by nikic on Feb 10 2019, 7:44 AM.




Implement basic legalizations (PromoteIntRes, PromoteIntOp, ExpandIntRes, ScalarizeVecOp, WidenVecOp) for VECREDUCE opcodes. More legalizations are still missing (especially float legalizations), but there's no way to test them right now, so I'm not adding them.

This also includes a few more changes to make this work somewhat reasonably:

  • Add support for expanding VECREDUCE in SDAG. Usually experimental.vector.reduce is expanded prior to codegen, but if the target does have native vector reduce, it may of course still be necessary to expand due to legalization issues. Given the intended purpose, this uses a trivial linear expansion.
  • Allow the result type of integer VECREDUCE to be larger than the vector element type. For example we need to be able to reduce a v8i8 into an (nominally) i32 result type on AArch64.
  • Use the vector operand type rather than the scalar result type to determine the action, so we can control exactly which vector types are supported. Also default VECREDUCE to Expand.
  • On AArch64 (the only target currently using VECREDUCE), explicitly specify for which vector types the reductions are supported.

I believe this fixes all the VECREDUCE legalization/expansion related assertions on AArch64.

Event Timeline

nikic created this revision. Feb 10 2019, 7:44 AM

Thanks for this patch @nikic! Please find some comments inline.

3972 ↗(On Diff #186151)

Although the code in SelectionDAGLegalize::LegalizeOp() will fall through and get into ConvertNodeToLibcall(), shouldn't this conceptually be done in SelectionDAGLegalize::ExpandNode()? It is not creating libcalls, but rather expanding the operation into scalar operations on the vector's elements.

1538 ↗(On Diff #186151)

This promotes the result type of the operation as well as the operand. As far as I understand it, the result type is legal at this point. If bits(EltVT) > bits(VT), should this promote the operation to work on a bigger type, and finally truncate the result?

3245 ↗(On Diff #186151)

Do we need to expand the entire operation here if only the result needs expansion (and not the input vector itself)?

4264 ↗(On Diff #186151)

Do we want to expand the operation here, or just widen the input operand (inserted with zeros as you suggest in the comment)?
If a target just wants the operation to be widened, but does have support for reductions on the widened vectors, we probably prefer that over the default lowering that expandVecReduce provides.

5640 ↗(On Diff #186151)

Since all the VECREDUCE_<OP> are non-strict, can we split the vector into a low and a high half, and do the reduction on both halves? This would eventually boil down to using ScalarizeVecOp_VECREDUCE, which transforms <1 x i32> => i32, and would allow more parallelised execution of the reduction. Is this maybe something you already tried?

5674 ↗(On Diff #186151)

nit: unnecessary curly braces

670 ↗(On Diff #186151)

This patch has AArch64 tests for VECREDUCE_ADD, VECREDUCE_FMAX and VECREDUCE_UMAX, which all have custom lowering, but it actually adds lots of generic support for VECREDUCE operations that don't have custom lowering. I think we'll also want to add some tests for those cases.

nikic updated this revision to Diff 188982. Mar 1 2019, 3:10 PM
nikic marked 11 inline comments as done.

Address review. Also perform expansion during vec op legalization.

nikic added inline comments. Mar 1 2019, 3:12 PM
3972 ↗(On Diff #186151)

Thanks for catching this, I didn't intend to put it in here. I scrolled too far while trying to find the end of the switch :)

1538 ↗(On Diff #186151)

Right, this was also promoting the result type (if necessary), but missed the truncate. I've added it now.

3245 ↗(On Diff #186151)

I think so. Expanding the result without expanding the operand would need something like a hi/lo reduce, which can compute the hi/lo half of the reduction result.

4264 ↗(On Diff #186151)

I agree that this is the better approach in general and have implemented it now. It's slightly more complicated than just padding with zeros, as the neutral element is different for the different reductions.

Doing this gives some pretty bad code for cases like v1i8, where we end up inserting 7 zero elements before performing the reduction. To avoid this, I've added a DAGCombine to convert a VECREDUCE of one element into an EXTRACT_VECTOR_ELT, but generating good code for cases like v9i8 would need more work.

5640 ↗(On Diff #186151)

This mostly already happens as part of the SplitVecOp legalization. Mostly, because it only goes down to a legal type. I've used a naive implementation here, because this expansion code is intended only for legalization purposes, not to produce particularly good code for unsupported general reductions. If a target doesn't support a reduction, it should request it to be expanded prior to SDAG construction, which will produce a shuffle reduction if possible. AArch64 currently always opts out of expansion, even though it does not support all reductions -- I plan to change that in a separate patch.

670 ↗(On Diff #186151)

I've added two more tests for AND and FADD. As mentioned in another comment, I plan to make these go through the pre-SDAG expansion code in the future though.

What about expanding the reductions into shuffle vector sequences? If we add support for that, such that the SDAG constructed here matches the shuffle-vector sequence produced by the IR expansion, then we pave the way for a move to using the intrinsics for all targets as the canonical form. So what we'd do is:

  1. Add the expansion to shuffle vector sequences (instead of a naive implementation)
  2. Move targets to use the intrinsic representation unconditionally. This means we don't need the useReductionIntrinsic TTI hook any more. Targets' TargetLowering would need to specify which reduction kinds to expand using the new SDAG expansion code.
  3. ...and as a result we can kill the ExpandReductions pass and finally move these intrinsics from experimental to fully supported and preferred representations.

I'm not saying this all needs to be done by you, but I think it's worth tackling 1) here.

nikic updated this revision to Diff 189991. Mar 9 2019, 12:07 PM

If possible, use shuffle reduction in vecreduce expansion.

nikic added a comment. Mar 9 2019, 12:14 PM

I've extended the expansion code to use a tree reduction for pow2 vectors as long as we have the necessary operations. I think the most relevant example is the v16f32 reduction for fadd:

define float @test_v16f32(<16 x float> %a) nounwind {
; CHECK-LABEL: test_v16f32:
; CHECK:       // %bb.0:
; CHECK-NEXT:    fadd v1.4s, v1.4s, v3.4s
; CHECK-NEXT:    fadd v0.4s, v0.4s, v2.4s
; CHECK-NEXT:    fadd v0.4s, v0.4s, v1.4s
; CHECK-NEXT:    ext v1.16b, v0.16b, v0.16b, #8
; CHECK-NEXT:    fadd v0.2s, v0.2s, v1.2s
; CHECK-NEXT:    faddp s0, v0.2s
; CHECK-NEXT:    ret
  %b = call fast nnan float @llvm.experimental.vector.reduce.fadd.f32.v16f32(float 0.0, <16 x float> %a)
  ret float %b

Not being overly familiar with AArch64, this looks optimal to me. Results for "weird" vector sizes are sometimes quite bad though (on the other hand, the current pre-SDAG expansion code doesn't support them at all).

sdesmalen accepted this revision. Mar 11 2019, 4:24 AM

Thanks for making these changes, I think the patch looks good now!

4264 ↗(On Diff #186151)

Thanks for this!

3918 ↗(On Diff #189991)

nit: simplify change -> simply change?

4307 ↗(On Diff #189991)

I wanted to suggest doing a build_vector of all NeutralElems, and then an INSERT_SUBVECTOR, so you can benefit from cheap 'splat immediate' instructions, but unfortunately you then run into missing pieces of legalisation of INSERT_SUBVECTOR.

This revision is now accepted and ready to land. Mar 11 2019, 4:24 AM
This revision was automatically updated to reflect the committed changes.
nikic marked 2 inline comments as done.
nikic added inline comments. Mar 11 2019, 1:21 PM
4307 ↗(On Diff #189991)

I initially wanted to use INSERT_SUBVECTOR here, but then found out that it requires the insertion index to be a multiple of the subvector length, which is not terribly useful for legalization purposes :(

sdesmalen added inline comments. Mar 11 2019, 2:26 PM
4307 ↗(On Diff #189991)

Wouldn't the insertion index always be 0? (i.e. inserting the subvector into the lower elements (starting at index 0) of a wide all-NeutralElems vector).

nikic marked an inline comment as done. Mar 11 2019, 2:43 PM
nikic added inline comments.
4307 ↗(On Diff #189991)

Uh yes, sorry. I was focused on inserting the NeutralElems into the top of the wide vector, while doing it the other way around makes a lot more sense and only needs a zero index. I'm assuming the missing legalization you're referring to is WidenVecOp?

Possible alternatives here could be a VSELECT or SHUFFLE_VECTOR, but not sure if these will get sensible lowerings.

aemerson added inline comments. Mar 11 2019, 7:08 PM
4307 ↗(On Diff #189991)

Nit: NeutralElem would be more accurately named as something like IdentityValue.

And for widening, what about using CONCAT_VECTORS?

5673 ↗(On Diff #189991)

To be honest this isn't quite what I meant. I meant generating the same tree of shufflevector instructions used for tree reductions. That way, if we switch targets to use the intrinsics and mark them all as "Expand" by default, then the existing pattern matching code would continue to work.

E.g. SelectionDAG::matchBinOpReduction should match the code generated by this.

sdesmalen added inline comments. Mar 12 2019, 8:34 AM
5673 ↗(On Diff #189991)

AFAICT this code is doing the same as what llvm::getShuffleReduction() does (which is called by the ExpandReductions pass), so functionally the expansion should not be much different from what we had before the patch.
I would actually think the code generated here should be easier for all targets to support than a true tree reduction: an element-wise vector operation is always available where pair-wise operations aren't, so the code resulting from this expansion should work reasonably well (log-n) without additional lowering/matching effort.