This is an archive of the discontinued LLVM Phabricator instance.

[SVE] Fix incorrect codegen when inserting vector elements into widened scalable vectors
Abandoned · Public

Authored by DylanFleming-arm on Jun 29 2021, 7:31 AM.

Event Timeline

DylanFleming-arm requested review of this revision. Jun 29 2021, 7:31 AM
Herald added a project: Restricted Project. Jun 29 2021, 7:31 AM

For non-scalable vectors, widening means we pad the end with undef elements (i.e. INSERT_SUBVECTOR into an undef vector). Do you want it to mean something different for scalable vectors? Is this documented somewhere?

The difference is that scalable vectors are unpacked. For <vscale x 1 x i64> <=> <v0 | v1 | ... | vn-1>, widening to <vscale x 2 x i64> gives <v0, _ | v1, _ | ... | vn-1, _>, which means the index at which to insert an element needs to be multiplied by 2. For fixed-width vectors, we'd indeed pad the number of elements in the vector to go from <2 x i32> <=> <v0, v1> to <4 x i32> <=> <v0, v1, _, _>, so the index stays the same. I'm not really sure where this should be documented, as it's mostly a characteristic of scalable vectors, but I guess an extra comment describing it here wouldn't hurt. Note that the widening only works for vscale x 1 because it is a power of 2; I'm not sure if it would even be possible to widen <vscale x 3 x ...> to <vscale x 4 x ...>.
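As a toy illustration of the index adjustment described above (the helper name and the lane model are invented for this sketch, not taken from the patch):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: when an unpacked scalable vector
// <vscale x MinElts x T> is widened to <vscale x WidenedElts x T>, each
// significant element occupies the first lane of a group of
// WidenedElts/MinElts lanes, so an INSERT_VECTOR_ELT at index Idx must
// target lane Idx * (WidenedElts / MinElts) of the widened vector.
uint64_t widenedInsertLane(uint64_t Idx, uint64_t MinElts,
                           uint64_t WidenedElts) {
  assert(MinElts && (MinElts & (MinElts - 1)) == 0 &&
         "only power-of-two element counts widen this way");
  assert(WidenedElts % MinElts == 0 && "widening factor must be integral");
  return Idx * (WidenedElts / MinElts);
}
```

For <vscale x 1 x i64> widened to <vscale x 2 x i64>, inserting at index i targets lane 2*i, matching the <v0, _ | v1, _ | ...> picture above.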

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
4009

Can you add an assert that MinElems is a power of 2?
And also that the widened vector is scalable as well?

4018

nit: redundant whitespace

4022

nit: maybe handle this case first, so you don't need the indentation for the scalable case.

Added asserts to check scalable vectors are valid
Reordered control flow to be cleaner

david-arm added inline comments. Jun 30 2021, 8:47 AM
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
4022

For the type here (MVT::i64) I wonder if we should be using the type returned by this:

TLI->getVectorIdxTy(getDataLayout())

I discovered this by looking at what this function does:

SDValue SelectionDAG::getVectorIdxConstant(uint64_t Val, const SDLoc &DL,
                                           bool isTarget) {
  return getConstant(Val, DL, TLI->getVectorIdxTy(getDataLayout()), isTarget);
}
4023

nit: I think you can drop the extra braces here.

4025

Same comment as above about the type

llvm/test/CodeGen/AArch64/sve-insert-element.ll
414

Do we need to add a test for inserting into a <vscale x 1 x i1> type here?

This sounds very SVE-specific.

I'm not sure I understand why we can't pack the elements tightly. It might be a little more verbose to widen certain SVE operations, but not impossible, and injecting SVE-specific behavior into target-independent code is messy.

Removed unneeded parentheses
Changed index type to be variable based on the target system

This was a design decision that was made quite early on. It's not really specific to SVE, and various code in SelectionDAG already builds on this decision.
It seems the design was never really documented anywhere; I think it would still be a good idea for us to do that (suggestions on where to put it are welcome).

For <vscale x 4 x i16> (assuming a native vector width of vscale x 128 bits for this example):

<h0, _, h1, _, h2, _, h3, _> (vscale = 1)
<h0, _, h1, _, h2, _, h3, _ | h4, _, h5, _, h6, _, h7, _> (vscale = 2)
<h0, _, h1, _, h2, _, h3, _ | h4, _, h5, _, h6, _, h7, _ | h8, _, h9, _, h10, _, h11, _> (vscale = 3)
<h0, _, h1, _, h2, _, h3, _ | h4, _, h5, _, h6, _, h7, _ | h8, _, h9, _, h10, _, h11, _ | h12, _, h13, _, h14, _, h15, _> (vscale = 4)

If the elements would be packed, then this would be laid out as:

<h0, h1, h2, h3, _, _, _, _> (vscale = 1)
<h0, h1, h2, h3, h4, h5, h6, h7 | _, _, _, _, _, _, _, _> (vscale = 2)
<h0, h1, h2, h3, h4, h5, h6, h7 | h8, h9, h10, h11, _, _, _, _ | _, _, _, _, _, _, _, _> (vscale = 3)
<h0, h1, h2, h3, h4, h5, h6, h7 | h8, h9, h10, h11, h12, h13, h14, h15 | _, _, _, _, _, _, _, _ | _, _, _, _, _, _, _, _> (vscale = 4)

There is no efficient way to halve the vector from <vscale x 4 x i16> to <vscale x 2 x i16> now, because the split point depends on the runtime value of vscale and may be anywhere in the vector (and its index may no longer be a power of two). For the unpacked format, common unpack/zip operations can be used. Concatenating vectors would be even more awkward, whereas for the unpacked format a 'concat' is just a matter of zipping together the even elements from both vectors.
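A small sketch of why the unpacked layout keeps splitting cheap; this is a toy lane model (function names invented here, not LLVM code) of where element K of a <vscale x 4 x i16> lives among the 8*vscale i16 lanes of a (vscale x 128)-bit register:

```cpp
#include <cassert>

// Unpacked: one element per i32-sized container, so element K sits at
// lane 2*K. Packed: elements fill the low lanes, element K at lane K.
unsigned laneUnpacked(unsigned K) { return 2 * K; }
unsigned lanePacked(unsigned K) { return K; }

// Splitting <vscale x 4 x i16> into two <vscale x 2 x i16> halves means
// finding the lane where element 2*vscale starts. Unpacked, that is lane
// 4*vscale, i.e. always the midpoint of the 8*vscale lanes, so a fixed
// high-half/low-half operation works for any vscale. Packed, it is lane
// 2*vscale, an interior point that moves with the runtime vscale.
unsigned splitLaneUnpacked(unsigned VScale) { return laneUnpacked(2 * VScale); }
unsigned splitLanePacked(unsigned VScale) { return lanePacked(2 * VScale); }
```

The unpacked split lane always equals half the total lane count, regardless of vscale, which is what makes simple unpack/zip shuffles sufficient.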

For <vscale x 4 x i16> (assuming a native vector width of vscale x 128 bits for this example):

There are two independent questions to consider: one, what layout we're using, and two, how we achieve that layout.

For <vscale x 4 x i16>, we use TypePromoteInteger in legalization, to convert it to <vscale x 4 x i32>. TypePromoteInteger has the same meaning whether or not a vector is scalable.

Say we have <vscale x 1 x i64>. You're saying we need to ensure this has a layout where we alternate between legal and illegal elements. Possible ways to achieve that layout:

  1. Make <vscale x 1 x i64> a legal type, and use some combination of operation legalization/isel patterns to ensure the alternation.
  2. Use TypePromoteInteger to <vscale x 1 x i128>. But then we need to figure out how to legalize <vscale x 1 x i128>, and we'd need a different solution for floating-point types.
  3. Define TypeWidenVector to append undef elements at the end, but convert between the two layouts where the difference is externally visible (calling convention code).
  4. Define TypeWidenVector to do some sort of interleaving. I guess widening <vscale x N x i64> -> <vscale x M x i64>, we insert M-N undef elements every N elements?
  5. Come up with a new LegalizeTypeAction that represents what we want to do here.

This patch is proposing (4), I think? Have we made any other changes that imply (4)? Have the other possibilities been discussed anywhere?
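The interleaving rule described in option (4) can be sketched as follows; this is a toy model with -1 standing in for undef lanes, and the function is invented for illustration:

```cpp
#include <cassert>
#include <vector>

// Option (4) as a lane transform: widening <vscale x N x T> to
// <vscale x M x T> inserts M-N undef lanes (modelled as -1) after every
// group of N significant elements. Src holds the significant elements
// for one concrete value of vscale.
std::vector<int> widenInterleaved(const std::vector<int> &Src, unsigned N,
                                  unsigned M) {
  assert(M > N && Src.size() % N == 0);
  std::vector<int> Out;
  for (std::size_t I = 0; I < Src.size(); I += N) {
    Out.insert(Out.end(), Src.begin() + I, Src.begin() + I + N); // N elements
    Out.insert(Out.end(), M - N, -1);                            // M-N undef lanes
  }
  return Out;
}
```

For N=1, M=2 this reproduces the <v0, _ | v1, _ | ...> layout from the earlier comment.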

For <vscale x 4 x i16> (assuming a native vector width of vscale x 128 bits for this example):

alternate between legal and illegal elements.

I meant alternate between significant values and padding elements here.

Matt added a subscriber: Matt. Jul 2 2021, 8:42 AM

For <vscale x 4 x i16> (assuming a native vector width of vscale x 128 bits for this example):

There are two independent questions to consider: one, what layout we're using, and two, how we achieve that layout.

To address the first point, what layout we're using, for scalable vectors my understanding is that LLVM could assume three different layouts:

  • FMT1: <e0, _, e1, _ | e2, _, e3, _ | ... > (elements unpacked in #vscale containers)
  • FMT2: <e0, e1, e2, e3 | _, _, _, _ | ... > (elements packed in lower #elts / (sizeof(container)/sizeof(elt)) containers)
  • FMT3: <e0, e1, _, _ | e2, e3, _, _ | ... > (elements packed in lower part of each container)

We need to decide on a single layout throughout SelectionDAG, so that generic code can implement things like widening, as needed for e.g. INSERT_VECTOR_ELT (implemented in this patch).
From the start we worked on the premise and requirement that LLVM would consider scalable vectors to always use an unpacked layout (FMT1).

  • The two most fundamental operations in legalization, concatenating and splitting, have to be implemented efficiently.
    • Common shuffle instructions (de-interleave/unpack) can be used to efficiently implement the unpacked (FMT1) case.
    • For FMT2, the index at which to split/concatenate can be anywhere within the vector, which is a lot more complicated than splitting/concatenating at the middle of the vector.
    • I guess FMT3 is the most nonsensical layout, since no load/store instructions would ever load data into such a layout. Doing things like concatenating two vectors or inserting/extracting elements would get very complicated.
  • There are no legacy procedure call standards that dictate how partial scalable vector operands passed in/out of functions should be laid out, so we have freedom to choose a layout that is profitable without having to worry about conversions.
    • Scalable vector architectures are likely to always have full predication, so having all elements in a vector lined up makes working with mixed-sized elements a lot cheaper (i.e. an add of zext(<vscale x 2 x i8>) to <vscale x 2 x i64> doesn't require shuffling to line up the elements first, and the extends may come for free).
    • When implementing fixed-width vectors with SVE we implement a packed format (FMT2) to match the PCS for Neon. Because the number of lanes is a constant, we can relatively easily map those types onto scalable vectors using a predicate to control only the active lanes (i.e. using ptrue vl<N> where N is a known compile-time constant). For scalable vectors, we can't cheaply generate such a predicate, since we'd have to compute it using the runtime value of vscale divided by the ratio of sizeof(VL1)/sizeof(VL2). For fixed-width vectors we accept the extra cost of the extends/truncates in user code in favour of transforming between packed/unpacked formats, although this could still be reconsidered in the future if that turns out more profitable.

For <vscale x 4 x i16>, we use TypePromoteInteger in legalization, to convert it to <vscale x 4 x i32>. TypePromoteInteger has the same meaning whether or not a vector is scalable.

Say we have <vscale x 1 x i64>. You're saying we need to ensure this has a layout where we alternate between legal and illegal elements. Possible ways to achieve that layout:

  1. Make <vscale x 1 x i64> a legal type, and use some combination of operation legalization/isel patterns to ensure the alternation.
  2. Use TypePromoteInteger to <vscale x 1 x i128>. But then we need to figure out how to legalize <vscale x 1 x i128>, and we'd need a different solution for floating-point types.
  3. Define TypeWidenVector to append undef elements at the end, but convert between the two layouts where the difference is externally visible (calling convention code).
  4. Define TypeWidenVector to do some sort of interleaving. I guess widening <vscale x N x i64> -> <vscale x M x i64>, we insert M-N undef elements every N elements?
  5. Come up with a new LegalizeTypeAction that represents what we want to do here.

This patch is proposing (4), I think? Have we made any other changes that imply (4)? Have the other possibilities been discussed anywhere?

Given the above reasoning that LLVM implements FMT1, this would discard option 3 and would make option 5 unnecessary since widening has only a single meaning.
Option (2) is not possible for SVE, because in order to promote the type, <vscale x 1 x i128> must be a legal type. There are also no instructions to promote <vscale x 1 x float> to <vscale x 1 x f128>.

That leaves options (1) and (4) as the only feasible options. Given that SelectionDAG already implements widening (and already just works for most cases), we figured we'd use that instead of adding this support manually and adding lots of new patterns. This also strengthens the generic support for scalable vectors in ISel.

I'm not sure in how many places in generic SelectionDAG code the layout concretely matters, since most operations are implemented with generic ISD nodes where most of that level of detail is abstracted away and is left to the target. This just happens to be one of those cases where the layout actively matters.

FMT1: <e0, _, e1, _ | e2, _, e3, _ | ... > (elements unpacked in #vscale containers)

Could you clarify how you expect FMT1 to work for <vscale x 3 x i32>? Or is it only meant to apply to types where the element count is a power of two?

I agree FMT3 is pretty unlikely to be useful in practice.

Given the above reasoning that LLVM implements FMT1, this would discard option 3 and would make option 5 unnecessary since widening has only a single meaning.

Each target has control over how it prefers to convert a vector with an illegal type to a legal type, by overriding getPreferredVectorAction(). There isn't any global "LLVM" layout; there's just what each particular target requests for each type. LLVM's target-independent code then has to take that request, and figure out how to turn it into actual code.

So we could easily end up with a situation where, for some combination of targets/vectors, we want FMT1, and for some other combination of targets/vectors, we want FMT2. In that case, we probably want TypeWidenVector to mean FMT2. That's where options 3 and 5 come into play. For option 5, we can share most of the legalization code; most operations would be legalized the same way with either layout.

Another benefit of option 5 is that it separates the choice of layout from the distinction between fixed/scalable vectors. They aren't really fundamentally tied together; a target could choose to use FMT1 for a fixed-width vector. (At the moment, I can't think of any target that would want to do that, but I'm not that familiar with all the various vector instruction sets out there.)

Hi @efriedma, I had to give this a moment to sink in, but you are right about the widening not having to assume anything about the target's vector format, I had my wires crossed here.

The widening operation done here is purely performed on a conceptual vector. Widening <vscale x 1 x i64> to <vscale x 2 x i64> by appending undef elements at the end of the vector, is actually the operation that's performed by INSERT_SUBVECTOR (and vice-versa for EXTRACT_SUBVECTOR), which we then implement efficiently with UZP1 and UUNPKLO if the (min) element count is a power of two.

What confused me was that there were no such shuffles in the code which made me think we had to care about the unpacked format. But that makes sense, because <vscale x 1 x i64> %arg coming into the function has already been widened by the caller. There is no need to extract a subvector, and then insert the subvector again into an UNDEF to get a widened vector, because the widened value/operand is already available.

Based on that, I no longer see the need for the kind of widening as proposed here, so this patch can probably be closed/abandoned.
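The UZP1/UUNPKLO conversion mentioned above can be sketched with a toy lane model (ints with -1 as don't-care lanes; these helpers model only the lane movement of the SVE instructions, not their element-type-changing semantics):

```cpp
#include <cassert>
#include <vector>

// uzp1: even-numbered lanes of the concatenation of A and B. With B
// undef, this packs an unpacked vector's significant lanes into the low
// half, i.e. the EXTRACT_SUBVECTOR direction.
std::vector<int> uzp1(const std::vector<int> &A, const std::vector<int> &B) {
  std::vector<int> R;
  for (std::size_t I = 0; I < A.size(); I += 2) R.push_back(A[I]);
  for (std::size_t I = 0; I < B.size(); I += 2) R.push_back(B[I]);
  return R;
}

// uunpklo (lane movement only): spreads the low-half lanes across even
// positions, i.e. the INSERT_SUBVECTOR-into-undef direction, producing
// the unpacked layout.
std::vector<int> uunpklo(const std::vector<int> &A) {
  std::vector<int> R(A.size(), -1);
  for (std::size_t I = 0; I < A.size() / 2; ++I) R[2 * I] = A[I];
  return R;
}
```

Round-tripping <v0, _, v1, _> through uzp1 (with an undef second operand) and uunpklo returns the original unpacked layout, which is why flipping between the two views is cheap when the minimum element count is a power of two.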

Could you clarify how you expect FMT1 to work for <vscale x 3 x i32>? Or is it only meant to apply to types where the element count is a power of two?

This would indeed only apply to types where the element-count is a power of two (otherwise it would imply FMT3). I'm not sure what the best approach would be for non-power-of-two element counts or how to implement this. At least the loop vectorizer should never pick such a VF, so I'd hope we don't need to worry about this.

Currently, all the target-independent code is assuming FMT2. This includes lowering for INSERT_ELEMENT, INSERT_SUBVECTOR, EXTRACT_SUBVECTOR, etc. And I don't think we have any target-specific code that triggers for <vscale x 1 x i64>. (Due to a bug/quirk in the type legalization code, we don't end up using our code for AArch64TargetLowering::LowerINSERT_SUBVECTOR. I'll look at this.)

Like you mention, it's pretty efficient to flip between the formats if we need to.

(Due to a bug/quirk in the type legalization code, we don't end up using our code for AArch64TargetLowering::LowerINSERT_SUBVECTOR. I'll look at this.)

Oh, this was just me misreading the AArch64 code. We don't setOperationAction for nxv1i64. Which is probably a good thing, since the code would crash, anyway.

DylanFleming-arm abandoned this revision. Jul 19 2021, 2:04 AM