Page MenuHomePhabricator

[RFC][Scalable] Add scalable shuffle intrinsic to extract evens from a pair of vectors
Needs ReviewPublic

Authored by cameron.mcinally on Mon, Jan 11, 12:51 PM.

Details

Summary

Here's a proposal for a scalable shuffle intrinsic that extracts the even elements from a pair of vectors. It is the first in the set that was originally discussed in the llvm-dev thread here:

https://lists.llvm.org/pipermail/llvm-dev/2020-January/138762.html

The following are some design decisions I made that could use discussion:

  1. I chose to extract the even elements from a pair of vectors (full vector result), rather than a single vector (1/2 width vector result). This is in line with existing fixed shuffle vectors. And can be extended to accept an undef argument if needed. The motivation behind this decision was that we'd want the result vector to be a full vector for performance reasons. It would also map well to SVE's LD2 and UZP1.
  1. How do we feel about the intrinsic name: llvm.experimental.vector.extract.evens(...)?
  1. How do we feel about the ISDNode name: ISD::EXTRACT_EVENS_VECTOR? It could be argued that this set of nodes should have SCALABLE in their names, unless we plan to also allow fixed width arguments as well. Currently the fixed width intrinsics are canonicalized to the existing shuffle vector implementation, so they never reach this ISDNode.
  1. Has anyone thought through all the legalizations that are valid on scalable vectors? Promote and Split are obviously valid. Scalarize is obviously invalid. How about Widen? Widen conflicts with the existing unpacked scalable vectors, so it's not clear if it's possible to do.

Diff Detail

Event Timeline

cameron.mcinally requested review of this revision.Mon, Jan 11, 12:51 PM
Herald added a project: Restricted Project. · View Herald TranscriptMon, Jan 11, 12:51 PM

Thanks for creating this patch!

I chose to extract the even elements from a pair of vectors (full vector result), rather than a single vector (1/2 width vector result). This is in line with existing fixed shuffle vectors. And can be extended to accept an undef argument if needed. The motivation behind this decision was that we'd want the result vector to be a full vector for performance reasons. It would also map well to SVE's LD2 and UZP1.

Are you also planning to add intrinsics for interleaving?

How do we feel about the intrinsic name: llvm.experimental.vector.extract.evens(...)?

I quite like the odd/even terminology, but would prefer to drop the s, as in: "extracting the even elements".

How do we feel about the ISDNode name: ISD::EXTRACT_EVENS_VECTOR? It could be argued that this set of nodes should have SCALABLE in their names, unless we plan to also allow fixed width arguments as well. Currently the fixed width intrinsics are canonicalized to the existing shuffle vector implementation, so they never reach this ISDNode.

I would favour not to add SCALABLE to that name, because there is nothing limiting these nodes from being used for fixed-width vectors.

Has anyone thought through all the legalizations that are valid on scalable vectors? Promote and Split are obviously valid. Scalarize is obviously invalid. How about Widen? Widen conflicts with the existing unpacked scalable vectors, so it's not clear if it's possible to do.

You're right, I don't think widening is always possible with the current definition. It's not really about how the vectors are laid out in the registers, taking the even elements of <vscale x 3 x i32> means that for each 3 x i32 part you'd need to alternate between selecting the even, odd, even, odd, ... elements. That problem goes away if the intrinsic has the requirement that (at least for scalable vectors) the minimum number of elements needs to be a multiple of 2.

llvm/docs/LangRef.rst
16201

Is it worth adding the clarification that it extracts the even elements from a concatenated vector %vec1:%vec2?

llvm/include/llvm/CodeGen/ISDOpcodes.h
543

Probably the same as above, clarify that it returns a vector containing all even elements of VEC1:VEC2.

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6997

nit: SmallVector chooses a default N since D92522 and https://llvm.org/docs/ProgrammersManual.html recommends leaving out N unless there is a well-motivated choice.

llvm/lib/IR/Verifier.cpp
5171–5178

Can these two Assert's be merged into one?

Hi all, this is just a thought and I hope I'm not confusing things further (!), but we could also have something like:

llvm.experimental.vector.deinterleave.even/odd
llvm.experimental.vector.interleave.lo/hi

since we are actually performing deinterleaving operations in this patch and I assume we'll want the matching interleaving ops at some point too? If you wanted to reduce the number of intrinsics, ISD opcodes you could also have the even/odd as a third flag, i.e.

llvm.experimental.vector.deinterleave(<>,<>, i1)

although I'm happy with separate intrinsics/opcodes too!

For what it's worth if we stick with something like llvm.experimental.vector.extract.even(s) I agree with Sander and would prefer to drop the 's' at the end.

Thanks for creating this patch!

I chose to extract the even elements from a pair of vectors (full vector result), rather than a single vector (1/2 width vector result). This is in line with existing fixed shuffle vectors. And can be extended to accept an undef argument if needed. The motivation behind this decision was that we'd want the result vector to be a full vector for performance reasons. It would also map well to SVE's LD2 and UZP1.

Are you also planning to add intrinsics for interleaving?

I am, plus some others for Complex vectorization. I just wanted to work out the kinks with this first example.

How do we feel about the intrinsic name: llvm.experimental.vector.extract.evens(...)?

I quite like the odd/even terminology, but would prefer to drop the s, as in: "extracting the even elements".

I like that too. I also like David's suggestion about deinterleave.even. I don't really have a strong opinion on either though. Does anyone feel strongly either way?

How do we feel about the ISDNode name: ISD::EXTRACT_EVENS_VECTOR? It could be argued that this set of nodes should have SCALABLE in their names, unless we plan to also allow fixed width arguments as well. Currently the fixed width intrinsics are canonicalized to the existing shuffle vector implementation, so they never reach this ISDNode.

I would favour not to add SCALABLE to that name, because there is nothing limiting these nodes from being used for fixed-width vectors.

Ok, that's fair. Right now I have an assert in ISelLowering to ensure only scalable types. That could really be removed though, since the UZP1 lowering would also work for fixed types. It might take a little work to clean up, but I don't foresee any problems.

cameron.mcinally marked 4 inline comments as done.

Address some of @sdesmalen's comments, but deferring name changes...

I like that too. I also like David's suggestion about deinterleave.even. I don't really have a strong opinion on either though. Does anyone feel strongly either way?

I quite like Dave's suggestion for deinterleave.even/odd and interleave.lo/hi, because now they are related (antonyms) and the names are still intuitive.

How do we feel about the ISDNode name: ISD::EXTRACT_EVENS_VECTOR? It could be argued that this set of nodes should have SCALABLE in their names, unless we plan to also allow fixed width arguments as well. Currently the fixed width intrinsics are canonicalized to the existing shuffle vector implementation, so they never reach this ISDNode.

I would favour not to add SCALABLE to that name, because there is nothing limiting these nodes from being used for fixed-width vectors.

Ok, that's fair. Right now I have an assert in ISelLowering to ensure only scalable types. That could really be removed though, since the UZP1 lowering would also work for fixed types. It might take a little work to clean up, but I don't foresee any problems.

It's probably fine either with/without the assert, since there is currently no way to test the fixed-width case.

Updated to @david-arm's suggested naming scheme...

sdesmalen added inline comments.Wed, Jan 13, 9:26 AM
llvm/docs/LangRef.rst
16206

AIUI, for scalable vectors the minimum number of elements must be a power of two in order to be able to legalize this operation without widening (not just an even number as I suggested earlier). Can you add that as a restriction?

llvm/lib/IR/Verifier.cpp
5176

Can you also check the constraint that the minimum number of elements must be a power of two for scalable vectors?

cameron.mcinally marked 2 inline comments as done.

Add known minimum number of elements restrictions...

llvm/lib/IR/Verifier.cpp
5176

I'm still getting up to speed with ElementCount, so I'm not sure that his is the best way to use it. Any experts?

A bit of a flyby review as I'm still on holidays but to my mind many of the restrictions being proposed for the new intrinsic seem purely down to the design decision of splitting the input vector across two operands. I understand this is how the underlying instructions work for SVE but that does not seem like a good enough reason to compromise the IR.

So my first questions are whether the IR and ISD interfaces need to match and from an IR point of view what is the expected usage? Is having two input operands going to result in the common case of having to "split" the result of a large load. I ask because I recall this being how InterleavedAccess worked with LLVM (i.e. one big load which a set of shuffles to extra the lanes).

My second question is what are the code generation advantages of having multiple operands against the negatives. We know type legalisation is a negative but I'm guessing the advantage is it allows a simpler mapping to the underlying SVE instructions. The question is whether this is worth the cost.

By only having a single input vector I believe the current proposed type restrictions disappear as widening becomes quite easy. The downside is that some of this type legalisation becomes more complex but this feels worth it if that means less compromises. From an SVE point of view it seems pretty easier to rely on common type legalisation until you get to the point where the input vector is twice the size of the legal type at which point we custom lower to the relevant AArch64 specific node, which mirrors how we handle things like ZERO_EXTEND today.

My final question relates to future usages and how the intrinsic's idiom scales. Taking the above InterleavedAccess example, there is a requirement to have a stride other than two, for example pixel data will want three or four. One route is to add an intrinsic for each option but I'm wondering if there's any appetite for a single generic intrinsic of the form:

<A x Elt> llvm.experimental.vector.extract.elements(<B x Elt> %invec, i32 index, i32 stride)

Where index and stride are required to be constant immediate values with "stride > 0" and "0 <= index < stride".

If it helps we could also initially restrict the range of stride as this is something that can be easily changed with improved code generation abilities. By this I mean with your current patch we can restrict it to being <=2 and still have distinct ISD nodes for these supported variants if that results in the better implementation.

A bit of a flyby review as I'm still on holidays but to my mind many of the restrictions being proposed for the new intrinsic seem purely down to the design decision of splitting the input vector across two operands. I understand this is how the underlying instructions work for SVE but that does not seem like a good enough reason to compromise the IR.

So my first questions are whether the IR and ISD interfaces need to match and from an IR point of view what is the expected usage?

My main IR use case is Complex vectorization. The vector Complex lowerings require vectors of just the reals and/or imags for the intermediate steps.

And also the trivial case of a stride 2 loop.

Is having two input operands going to result in the common case of having to "split" the result of a large load. I ask because I recall this being how InterleavedAccess worked with LLVM (i.e. one big load which a set of shuffles to extra the lanes).

Yeah, I could see where a large load would need to be split. That doesn't seem like too much of a headache though. We're going to do the two loads either way.

The two operand intrinsics are my preferred choice since when we vectorize loops, we want to keep full vectors. We don't want to run the loop 2x times on 1/2 full vectors, or pay the vector concatenation cost in the loop. This does map pretty well to SVE. We either do an LD2 if the operands are from memory and throw away one result, or a UZP if they're in register. Not sure how this would map to RISCV.

If we have one operand intrinsics, we'd need two UZPs for the lo and hi halves, and then a splice. I suppose ISel could combine those two patterns into a two operand UZP though. Unless someone has a better lowering?

The two operand intrinsic could also be extended to accept one undef operand. So there is some flexibility there to get the same one operand intrinsic result.

My second question is what are the code generation advantages of having multiple operands against the negatives. We know type legalisation is a negative but I'm guessing the advantage is it allows a simpler mapping to the underlying SVE instructions. The question is whether this is worth the cost.

By only having a single input vector I believe the current proposed type restrictions disappear as widening becomes quite easy. The downside is that some of this type legalisation becomes more complex but this feels worth it if that means less compromises. From an SVE point of view it seems pretty easier to rely on common type legalisation until you get to the point where the input vector is twice the size of the legal type at which point we custom lower to the relevant AArch64 specific node, which mirrors how we handle things like ZERO_EXTEND today.

I don't have a strong sense for what the trade off are. Maybe you can elaborate once you're back from vacation.

My final question relates to future usages and how the intrinsic's idiom scales. Taking the above InterleavedAccess example, there is a requirement to have a stride other than two, for example pixel data will want three or four. One route is to add an intrinsic for each option but I'm wondering if there's any appetite for a single generic intrinsic of the form:

<A x Elt> llvm.experimental.vector.extract.elements(<B x Elt> %invec, i32 index, i32 stride)

Where index and stride are required to be constant immediate values with "stride > 0" and "0 <= index < stride".

If it helps we could also initially restrict the range of stride as this is something that can be easily changed with improved code generation abilities. By this I mean with your current patch we can restrict it to being <=2 and still have distinct ISD nodes for these supported variants if that results in the better implementation.

I like this idea a lot. Essentially a step vector shuffle. You could even roll splats into it with a 0 stride. Implementing it sounds pretty challenging though. Especially for an index >=2. Maybe I'm missing an easy solution, but that sounds like a lot of work to generalize.

Having said that, I wonder if we should revisit the idea of allowing shuffle vectors to accept step vector masks?

Matt added a subscriber: Matt.Tue, Jan 19, 9:14 AM

Having said that, I wonder if we should revisit the idea of allowing shuffle vectors to accept step vector masks?

At today's sync-up meeting, this met strong resistance. @ctetreau argued that we don't want to allow a stepvector constant as the shuffle vector mask operand, since that could lead to slippery slope of constant initializer strings. Chris, correct my paraphrase if you'd like...

<A x Elt> llvm.experimental.vector.extract.elements(<B x Elt> %invec, i32 index, i32 stride)

Where index and stride are required to be constant immediate values with "stride > 0" and "0 <= index < stride".

Thinking about this some more, this wouldn't be too bad to implement. I'm okay going this route, unless anyone feels strongly against it...