
[RFC][Scalable] Add scalable shuffle intrinsic to extract evens from a pair of vectors
Needs Review · Public

Authored by cameron.mcinally on Jan 11 2021, 12:51 PM.

Details

Summary

Here's a proposal for a scalable shuffle intrinsic that extracts the even elements from a pair of vectors. It is the first in the set that was originally discussed in the llvm-dev thread here:

https://lists.llvm.org/pipermail/llvm-dev/2020-January/138762.html

The following are some design decisions I made that could use discussion:

  1. I chose to extract the even elements from a pair of vectors (full vector result), rather than a single vector (1/2 width vector result). This is in line with existing fixed shuffle vectors. And can be extended to accept an undef argument if needed. The motivation behind this decision was that we'd want the result vector to be a full vector for performance reasons. It would also map well to SVE's LD2 and UZP1.
  2. How do we feel about the intrinsic name: llvm.experimental.vector.extract.evens(...)?
  3. How do we feel about the ISDNode name: ISD::EXTRACT_EVENS_VECTOR? It could be argued that this set of nodes should have SCALABLE in their names, unless we plan to also allow fixed width arguments as well. Currently the fixed width intrinsics are canonicalized to the existing shuffle vector implementation, so they never reach this ISDNode.
  4. Has anyone thought through all the legalizations that are valid on scalable vectors? Promote and Split are obviously valid. Scalarize is obviously invalid. How about Widen? Widen conflicts with the existing unpacked scalable vectors, so it's not clear if it's possible to do.

Diff Detail

Event Timeline

cameron.mcinally requested review of this revision. · Jan 11 2021, 12:51 PM
Herald added a project: Restricted Project. · Jan 11 2021, 12:51 PM

Thanks for creating this patch!

I chose to extract the even elements from a pair of vectors (full vector result), rather than a single vector (1/2 width vector result). This is in line with existing fixed shuffle vectors. And can be extended to accept an undef argument if needed. The motivation behind this decision was that we'd want the result vector to be a full vector for performance reasons. It would also map well to SVE's LD2 and UZP1.

Are you also planning to add intrinsics for interleaving?

How do we feel about the intrinsic name: llvm.experimental.vector.extract.evens(...)?

I quite like the odd/even terminology, but would prefer to drop the s, as in: "extracting the even elements".

How do we feel about the ISDNode name: ISD::EXTRACT_EVENS_VECTOR? It could be argued that this set of nodes should have SCALABLE in their names, unless we plan to also allow fixed width arguments as well. Currently the fixed width intrinsics are canonicalized to the existing shuffle vector implementation, so they never reach this ISDNode.

I would favour not to add SCALABLE to that name, because there is nothing limiting these nodes from being used for fixed-width vectors.

Has anyone thought through all the legalizations that are valid on scalable vectors? Promote and Split are obviously valid. Scalarize is obviously invalid. How about Widen? Widen conflicts with the existing unpacked scalable vectors, so it's not clear if it's possible to do.

You're right, I don't think widening is always possible with the current definition. It's not really about how the vectors are laid out in the registers: taking the even elements of <vscale x 3 x i32> means that for each 3 x i32 part you'd need to alternate between selecting the even and odd elements. That problem goes away if the intrinsic has the requirement that (at least for scalable vectors) the minimum number of elements needs to be a multiple of 2.

llvm/docs/LangRef.rst
16201

Is it worth adding the clarification that it extracts the even elements from a concatenated vector %vec1:%vec2?

llvm/include/llvm/CodeGen/ISDOpcodes.h
543

Probably the same as above, clarify that it returns a vector containing all even elements of VEC1:VEC2.

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
6997

nit: SmallVector chooses a default N since D92522 and https://llvm.org/docs/ProgrammersManual.html recommends leaving out N unless there is a well-motivated choice.

llvm/lib/IR/Verifier.cpp
5171–5178

Can these two Asserts be merged into one?

Hi all, this is just a thought and I hope I'm not confusing things further (!), but we could also have something like:

llvm.experimental.vector.deinterleave.even/odd
llvm.experimental.vector.interleave.lo/hi

since we are actually performing deinterleaving operations in this patch and I assume we'll want the matching interleaving ops at some point too? If you wanted to reduce the number of intrinsics, ISD opcodes you could also have the even/odd as a third flag, i.e.

llvm.experimental.vector.deinterleave(<>,<>, i1)

although I'm happy with separate intrinsics/opcodes too!

For what it's worth if we stick with something like llvm.experimental.vector.extract.even(s) I agree with Sander and would prefer to drop the 's' at the end.

Thanks for creating this patch!

I chose to extract the even elements from a pair of vectors (full vector result), rather than a single vector (1/2 width vector result). This is in line with existing fixed shuffle vectors. And can be extended to accept an undef argument if needed. The motivation behind this decision was that we'd want the result vector to be a full vector for performance reasons. It would also map well to SVE's LD2 and UZP1.

Are you also planning to add intrinsics for interleaving?

I am, plus some others for Complex vectorization. I just wanted to work out the kinks with this first example.

How do we feel about the intrinsic name: llvm.experimental.vector.extract.evens(...)?

I quite like the odd/even terminology, but would prefer to drop the s, as in: "extracting the even elements".

I like that too. I also like David's suggestion about deinterleave.even. I don't really have a strong opinion on either though. Does anyone feel strongly either way?

How do we feel about the ISDNode name: ISD::EXTRACT_EVENS_VECTOR? It could be argued that this set of nodes should have SCALABLE in their names, unless we plan to also allow fixed width arguments as well. Currently the fixed width intrinsics are canonicalized to the existing shuffle vector implementation, so they never reach this ISDNode.

I would favour not to add SCALABLE to that name, because there is nothing limiting these nodes from being used for fixed-width vectors.

Ok, that's fair. Right now I have an assert in ISelLowering to ensure only scalable types. That could really be removed though, since the UZP1 lowering would also work for fixed types. It might take a little work to clean up, but I don't foresee any problems.

cameron.mcinally marked 4 inline comments as done.

Address some of @sdesmalen's comments, but deferring name changes...

I like that too. I also like David's suggestion about deinterleave.even. I don't really have a strong opinion on either though. Does anyone feel strongly either way?

I quite like Dave's suggestion for deinterleave.even/odd and interleave.lo/hi, because now they are related (antonyms) and the names are still intuitive.

How do we feel about the ISDNode name: ISD::EXTRACT_EVENS_VECTOR? It could be argued that this set of nodes should have SCALABLE in their names, unless we plan to also allow fixed width arguments as well. Currently the fixed width intrinsics are canonicalized to the existing shuffle vector implementation, so they never reach this ISDNode.

I would favour not to add SCALABLE to that name, because there is nothing limiting these nodes from being used for fixed-width vectors.

Ok, that's fair. Right now I have an assert in ISelLowering to ensure only scalable types. That could really be removed though, since the UZP1 lowering would also work for fixed types. It might take a little work to clean up, but I don't foresee any problems.

It's probably fine either with/without the assert, since there is currently no way to test the fixed-width case.

Updated to @david-arm's suggested naming scheme...

sdesmalen added inline comments. · Jan 13 2021, 9:26 AM
llvm/docs/LangRef.rst
16206

AIUI, for scalable vectors the minimum number of elements must be a power of two in order to be able to legalize this operation without widening (not just an even number as I suggested earlier). Can you add that as a restriction?

llvm/lib/IR/Verifier.cpp
5176

Can you also check the constraint that the minimum number of elements must be a power of two for scalable vectors?

cameron.mcinally marked 2 inline comments as done.

Add known minimum number of elements restrictions...

llvm/lib/IR/Verifier.cpp
5176

I'm still getting up to speed with ElementCount, so I'm not sure that this is the best way to use it. Any experts?

A bit of a flyby review as I'm still on holidays but to my mind many of the restrictions being proposed for the new intrinsic seem purely down to the design decision of splitting the input vector across two operands. I understand this is how the underlying instructions work for SVE but that does not seem like a good enough reason to compromise the IR.

So my first questions are whether the IR and ISD interfaces need to match, and from an IR point of view, what is the expected usage? Is having two input operands going to result in the common case of having to "split" the result of a large load? I ask because I recall this being how InterleavedAccess worked with LLVM (i.e. one big load with a set of shuffles to extract the lanes).

My second question is: what are the code generation advantages of having multiple operands, weighed against the negatives? We know type legalisation is a negative, but I'm guessing the advantage is that it allows a simpler mapping to the underlying SVE instructions. The question is whether this is worth the cost.

By only having a single input vector I believe the current proposed type restrictions disappear, as widening becomes quite easy. The downside is that some of this type legalisation becomes more complex, but that feels worth it if it means fewer compromises. From an SVE point of view it seems pretty easy to rely on common type legalisation until you get to the point where the input vector is twice the size of the legal type, at which point we custom lower to the relevant AArch64-specific node, which mirrors how we handle things like ZERO_EXTEND today.

My final question relates to future usages and how the intrinsic's idiom scales. Taking the above InterleavedAccess example, there is a requirement to have a stride other than two, for example pixel data will want three or four. One route is to add an intrinsic for each option but I'm wondering if there's any appetite for a single generic intrinsic of the form:

<A x Elt> llvm.experimental.vector.extract.elements(<B x Elt> %invec, i32 index, i32 stride)

Where index and stride are required to be constant immediate values with "stride > 0" and "0 <= index < stride".

If it helps we could also initially restrict the range of stride as this is something that can be easily changed with improved code generation abilities. By this I mean with your current patch we can restrict it to being <=2 and still have distinct ISD nodes for these supported variants if that results in the better implementation.

A bit of a flyby review as I'm still on holidays but to my mind many of the restrictions being proposed for the new intrinsic seem purely down to the design decision of splitting the input vector across two operands. I understand this is how the underlying instructions work for SVE but that does not seem like a good enough reason to compromise the IR.

So my first questions are whether the IR and ISD interfaces need to match, and from an IR point of view, what is the expected usage?

My main IR use case is Complex vectorization. The vector Complex lowerings require vectors of just the reals and/or imags for the intermediate steps.

And also the trivial case of a stride 2 loop.

Is having two input operands going to result in the common case of having to "split" the result of a large load? I ask because I recall this being how InterleavedAccess worked with LLVM (i.e. one big load with a set of shuffles to extract the lanes).

Yeah, I could see where a large load would need to be split. That doesn't seem like too much of a headache though. We're going to do the two loads either way.

The two-operand intrinsics are my preferred choice, since when we vectorize loops we want to keep full vectors. We don't want to run the loop for 2x the iterations on half-full vectors, or pay the vector concatenation cost in the loop. This does map pretty well to SVE: we either do an LD2 if the operands are from memory and throw away one result, or a UZP if they're in registers. Not sure how this would map to RISCV.

If we have one operand intrinsics, we'd need two UZPs for the lo and hi halves, and then a splice. I suppose ISel could combine those two patterns into a two operand UZP though. Unless someone has a better lowering?

The two operand intrinsic could also be extended to accept one undef operand. So there is some flexibility there to get the same one operand intrinsic result.

My second question is: what are the code generation advantages of having multiple operands, weighed against the negatives? We know type legalisation is a negative, but I'm guessing the advantage is that it allows a simpler mapping to the underlying SVE instructions. The question is whether this is worth the cost.

By only having a single input vector I believe the current proposed type restrictions disappear, as widening becomes quite easy. The downside is that some of this type legalisation becomes more complex, but that feels worth it if it means fewer compromises. From an SVE point of view it seems pretty easy to rely on common type legalisation until you get to the point where the input vector is twice the size of the legal type, at which point we custom lower to the relevant AArch64-specific node, which mirrors how we handle things like ZERO_EXTEND today.

I don't have a strong sense for what the trade-offs are. Maybe you can elaborate once you're back from vacation.

My final question relates to future usages and how the intrinsic's idiom scales. Taking the above InterleavedAccess example, there is a requirement to have a stride other than two, for example pixel data will want three or four. One route is to add an intrinsic for each option but I'm wondering if there's any appetite for a single generic intrinsic of the form:

<A x Elt> llvm.experimental.vector.extract.elements(<B x Elt> %invec, i32 index, i32 stride)

Where index and stride are required to be constant immediate values with "stride > 0" and "0 <= index < stride".

If it helps we could also initially restrict the range of stride as this is something that can be easily changed with improved code generation abilities. By this I mean with your current patch we can restrict it to being <=2 and still have distinct ISD nodes for these supported variants if that results in the better implementation.

I like this idea a lot. Essentially a step vector shuffle. You could even roll splats into it with a 0 stride. Implementing it sounds pretty challenging though. Especially for an index >=2. Maybe I'm missing an easy solution, but that sounds like a lot of work to generalize.

Having said that, I wonder if we should revisit the idea of allowing shuffle vectors to accept step vector masks?

Matt added a subscriber: Matt. · Jan 19 2021, 9:14 AM

Having said that, I wonder if we should revisit the idea of allowing shuffle vectors to accept step vector masks?

At today's sync-up meeting, this met strong resistance. @ctetreau argued that we don't want to allow a stepvector constant as the shuffle vector mask operand, since that could lead to a slippery slope of constant initializer strings. Chris, correct my paraphrase if you'd like...

<A x Elt> llvm.experimental.vector.extract.elements(<B x Elt> %invec, i32 index, i32 stride)

Where index and stride are required to be constant immediate values with "stride > 0" and "0 <= index < stride".

Thinking about this some more, this wouldn't be too bad to implement. I'm okay going this route, unless anyone feels strongly against it...

In D94444#2497697, @paulwalker-arm wrote:
<A x Elt> llvm.experimental.vector.extract.elements(<B x Elt> %invec, i32 index, i32 stride)

Sorry for the slow reply. I'm just getting back to looking at this and now notice it is a unary shuffle. I'd like to see this as a binary shuffle. E.g.:

void foo(double res[16], double x[16], std::complex<double> vec[16]) {
  for (int i = 0; i < 16; i++)
    res[i] = x[i] + vec[i].real();
  return;
}

In the general vectorization case, we want to keep the vectors as full as possible on each iteration. I think the Complex part of the loop body should look like:

%lo = load %vec, 0
%hi = load %vec, 64
%reals = extract_elements(%lo, %hi, 0, 2)

And not splicing together two 1/2 width vectors:

%lo = load %vec, 0
%reals_lo = extract_elements(%lo, 0, 2)
%hi = load %vec, 64
%reals_hi = extract_elements(%hi, 0, 2)
%reals = concat(%reals_lo, %reals_hi)

And also not having 2x the loop trips on 1/2 width vectors:

%ld = load %vec, 0
%reals = extract_elements(%ld, 0, 2)

I'm hand-waving over some other obvious optimizations, but I think this illustrates the unary shuffle problem pretty well. Thoughts?

For now I'll just cover the IR side of things as the ISD node discussion raises different points and there's nothing to say they need to match.

If you take your code snippet (although I changed the loop trip count to 1024 to allow vectorisation) and look at the IR emitted by LoopVectorize, you'll see what I was referring to in my previous comment. You end up with the following snippet within vector.body:

%wide.vec = load <4 x double>, <4 x double>* %11, align 8, !tbaa !6
%wide.vec23 = load <4 x double>, <4 x double>* %13, align 8, !tbaa !6
%strided.vec = shufflevector <4 x double> %wide.vec, <4 x double> poison, <2 x i32> <i32 0, i32 2>
%strided.vec24 = shufflevector <4 x double> %wide.vec23, <4 x double> poison, <2 x i32> <i32 0, i32 2>

So today the loop is vectorised using vectors that are as full as possible; in this case the loop was also unrolled, hence the pair of loads and shuffles. Here LoopVectorize simply creates a double-length load and a matching shuffle to extract the even lanes. If the loop operated on the imaginary parts then there would also be a shuffle to extract the odd lanes. There's no concatenation or splicing involved, and the "large" load is trivial to code generate. For AArch64 there is also the InterleavedAccess pass that knows how to convert this logic to an aarch64.ld2 intrinsic call. This is something we'll want for SVE as well, although with the shufflevector replaced by an intrinsic it'll be simpler for SVE to detect, as InterleavedAccess is a tad complicated.

This is why I believe at the IR level we should have an intrinsic that mirrors this type of shuffle, and thus one that takes a single vector and extracts elements based on a simple pattern (i.e. odd or even...). Doing so means it'll be a drop-in replacement for the existing shufflevector usage, which is the goal. Note that if complex was changed to a three-element structure, then LoopVectorize would do the expected thing in creating a triple-wide load and shuffles to extract every third element starting at index 0, 1, or 2 based on the field in question.

Ok, I see where you are coming from now. LoopVectorize is keeping the shuffle result full by widening the load+shuffle to double width. LV's double-wide choice seems like a weird one, but I suppose if that sequence is codegen'd correctly, then it will work out.

It will be interesting to see how this codegens when the input lives in registers. But again I suppose that ISelLowering could straighten it out if needed.

In D94444#2497697, @paulwalker-arm wrote:
<A x Elt> llvm.experimental.vector.extract.elements(<B x Elt> %invec, i32 index, i32 stride)

Working through this now. @paulwalker-arm, have you given any thought to what happens to A below?

<vscale x A x double> llvm.experimental.vector.extract.elements(<vscale x 2 x double> %invec, i32 0, i32 4)

What should A be here? It should really be 0.5, but my understanding is we're not doing fractional VLs. I suppose that we could restrict stride to be >= the known VL.

I also wonder what A should be with odd strides, like 3. Any thoughts on that?

paulwalker-arm added a comment (edited). · Jan 26 2021, 3:22 AM

Considering:

<A x Elt> llvm.experimental.vector.extract.elements(<B x Elt> %invec, i32 index, i32 stride)

Suitable restrictions are that "A = B/stride" and that "B % stride == 0". When combined with the original restriction that "0 <= index < stride", I think this nicely joins up our use case of having an intrinsic that effectively extracts field "index" from a vector of structs[1] where each struct has "stride" fields.

[1] invec is not actually a vector of structs at the LLVM level, but logically this is what the vector represents.

There is the option to have no restrictions and allow a use case like:

<2 x Elt> llvm.experimental.vector.extract.elements(<vscale x 4 x Elt> %invec, i32 7, i32 13)

but I honestly think that's needlessly looking for trouble. Besides, this extreme use case has the same interface and so once a restricted variant is available it would be easy enough for others to soften restrictions if they see a genuine reason to do so. The important part is to ensure the intrinsic's return and parameter types are good enough to allow these future use cases, which I believe we've achieved in this instance.

[NOT READY FOR REVIEW]

Today is my last day at Cray/HPE, so taking a mid-development snapshot to be finished later.

This isn't working out to be as general as I'd like, though...

Any use of this intrinsic with a vector result half-width or smaller trips up Legalization/CodeGen. E.g. <vscale x 1 x X> or <vscale x 2 x f32>. I'm not sure if there are reasonable fixes for these problems or not yet.