
RFC: Explicit Vector Length Intrinsics and Attributes
Needs Review · Public

Authored by simoll on Oct 23 2018, 2:45 PM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

This is a proposal to add vector intrinsics and function attributes to LLVM IR to better support predicated vector code, including targets with a dynamic vector length (RISC-V V, NEC SX-Aurora).
The attributes are designed to simplify automatic vectorization and optimization of predicated data flow. Non-predicating SIMD architectures should benefit from these changes as well through a common legalization scheme (e.g. lowering of fdiv in predicated contexts).

This is a follow-up on my tech talk at last week's LLVM DevMtg, "Stories from RV..." (https://llvm.org/devmtg/2018-10/talk-abstracts.html#talk22), and the subsequent discussions at the round table.

Rationale

LLVM IR does not support predicated execution as a first-order concept. Instead, there is a growing body of intrinsics (llvm.masked.*) and workarounds (select for arithmetic, VectorABI for general function calls) which encode or at least emulate predication in their respective contexts. The discussions and patches for LLVM-SVE show that there is a need to accommodate architectures with a Dynamic Vector Length (RISC-V V extension, NEC SX-Aurora TSUBASA).

This RFC provides a coherent set of intrinsics and attributes that enable predication through bit masks and EVL in LLVM IR.

Proposed changes

Intrinsics

We propose to add a new set of intrinsics under the "llvm.evl.*" prefix. After the change, it will include the following operations:

  • all standard binary (Add, FAdd, Sub, FSub, Mul, FMul, UDiv, SDiv, FDiv, URem, SRem, FRem)
  • shift and bitwise operators (Shl, LShr, AShr, And, Or, Xor)
  • experimental reduce (fadd, fmul, add, mul, and, or, xor, smax, smin, umax, umin, fmax, fmin)
  • ICmp, FCmp
  • Select
  • All of the llvm.masked.* namespace (load, store, gather, scatter, expandload, compressstore)

All of the intrinsics in the llvm.evl namespace take two predicating parameters: a mask of bit vector type (e.g. <8 x i1>) and a dynamic vector length value (i32).

Attributes

We propose three new attributes for function parameters:
mask: this parameter encodes the predicate of this operation. Inputs on masked-out lanes must not affect the results of enabled lanes in any way.
vlen: this parameter encodes the explicit vector length (VL) of the instruction. The operation does not apply for lanes beyond this parameter. The result for lanes >= vlen is "undef".
maskedout_ret: this parameter contains the return value of masked-out lanes (within the vector length).

We show the semantics in the example below.
The attributes are intended for general use in IR functions, not just the EVL intrinsics.

An example

Let the predicated fdiv have the following signature:

llvm.evl.fdiv.v4f64(<4 x double> maskedout_ret %a, <4 x double> %b, <4 x i1> mask %mask, i32 vlen %dynamic)

Consider this application of fdiv:

llvm.evl.fdiv.v4f64(<4 x double> <4.2, 6.0, 1.0, 1.0>, <4 x double> <0.0, 3.0, nan, 0>, <4 x i1> <0, 1, 1, 1>, 2)
== <4.2, 2.0, undef, undef>

The first %mask bit is '0' and the operation will not execute for the first lane. Yet, since the first parameter %a has the maskedout_ret attribute, the result on the first lane is the value of %a at that lane.
The second %mask bit is '1' and so the result on the second lane is just 6.0 / 3.0.
The last two lanes are beyond the dynamic vector length %vlen and so their results are undef regardless of maskedout_ret.

Note that the outcome of the first and last two lanes can be deduced from the new attributes alone, without knowing that this is an fdiv operation.
This can be used to implement general predicate analyses and optimizations.
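The semantics above can be modeled with a small reference sketch (this is an illustration of the proposed semantics, not part of the patch; LLVM's undef is modeled as None):

```python
UNDEF = None  # stand-in for LLVM's undef

def evl_fdiv(a, b, mask, vlen):
    # %a carries the maskedout_ret attribute, so it doubles as the
    # passthru value for masked-out lanes inside the vector length.
    out = []
    for i in range(len(a)):
        if i >= vlen:
            out.append(UNDEF)        # beyond the explicit vector length
        elif not mask[i]:
            out.append(a[i])         # masked out: take the maskedout_ret lane
        else:
            out.append(a[i] / b[i])  # active lane: regular fdiv
    return out

# the example from the RFC:
evl_fdiv([4.2, 6.0, 1.0, 1.0], [0.0, 3.0, float("nan"), 0.0],
         [0, 1, 1, 1], 2)
# -> [4.2, 2.0, None, None]
```

Note that the division by 0.0 and the nan operand on disabled lanes are never evaluated, matching the requirement that inputs on disabled lanes must not affect enabled results.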

Lowering

We show possible lowering strategies for the following prototypical SIMD ISAs:

LLVM-SVE with predication and dynamic vector length (RISC-V V extension, NEC SX-Aurora)

For these targets, the intrinsics map over directly to the ISA.

Lowering for targets w/o dynamic vector length (AVX512, ARM SVE, ..)

ARM SVE does not feature a dynamic vector length register.
Hence, the vector length needs to be promoted to the bit mask predicate, shown here for an LLVM-SVE target:

Block before legalization:

..
foo (..., %mask, %dynamic_vl)
...

After legalization:

%vscale32 = call i32 @llvm.experimental.vector.vscale.32()
...
%stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
%vl_mask = icmp ult <scalable 4 x i1> %stepvector, %dynamic_vl
%new_mask = and <scalable 4 x i1> %mask, %vl_mask
foo (..., <scalable 4 x i1> %new_mask, i32 %vscale32)
...
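For reference, the promotion of the dynamic vector length into the mask can be sketched as a lane-wise model (a model of the legalization above, not actual IR):

```python
def legalize_vlen(mask, dynamic_vl):
    """Fold the explicit vector length into the bit mask:
    new_mask[i] = mask[i] and (i < dynamic_vl)."""
    stepvector = range(len(mask))  # models llvm.experimental.vector.stepvector
    vl_mask = [i < dynamic_vl for i in stepvector]
    return [bool(m) and v for m, v in zip(mask, vl_mask)]

legalize_vlen([1, 1, 0, 1], 3)  # -> [True, True, False, False]
```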
Lowering for fixed-width SIMD w/o predication (SSE, NEON, AdvSimd, ..)

Scalarization and/or speculation on a full predicate.

Example 1: safe fdiv

void foo(double * A, double * B, int n) {
  #pragma omp simd simdlen(8)
  for (int i = 0; i < n; ++i) {
    double a = A[i];
    double r = a;
    if (a > 0.0) {
      r = 42.0 / a;
    }
    B[i] = r;
  }
}

<8 x double> @llvm.evl.fdiv.v8f64(<8 x double> maskedout_ret %a, <8 x double> %b, <8 x i1> mask %mask, i32 vlen %length)
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %0 = getelementptr inbounds double, double* %A, i64 %index
  %1 = bitcast double* %0 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %1, align 8, !tbaa !2
  %2 = fcmp ogt <8 x double> %wide.load, zeroinitializer

  ; variant that LV generates today:
  ; %3 = fdiv <8 x double> <double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01>, %wide.load
  ; %4 = select <8 x i1> %2, <8 x double> %3, <8 x double> %wide.load

  ; using EVL:
  %4 = call <8 x double> @llvm.evl.fdiv.v8f64(<8 x double> <double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01>, <8 x double> %wide.load, <8 x i1> %2, i32 8)

  %5 = getelementptr inbounds double, double* %B, i64 %index
  %6 = bitcast double* %5 to <8 x double>*
  store <8 x double> %4, <8 x double>* %6, align 8, !tbaa !2
  %index.next = add i64 %index, 8
  %7 = icmp eq i64 %index.next, %n.vec
  br i1 %7, label %middle.block, label %vector.body, !llvm.loop !6

Pros & Cons

Pros

The generality of the intrinsics simplifies the job of the vectorizer's widening phase (speaking for RV, should apply to LV/VPlan as well): Scalar instruction opcodes only need to be mapped over to their respective evl intrinsic name. The mask and vlen are passed to the annotated arguments.

Regarding the evl intrinsics (instead of extending the IR):

  • The new predication scheme is completely optional and does not interfere with LLVM's vector instructions at all.
  • Existing backends can use a generic lowering scheme from evl to "classic" vector instructions.
  • Likewise, lifting passes can convert classic vector instructions to the new intrinsics if deemed beneficial for backend implementation (NEC SX-Aurora, RISC-V V(?)..)

Marking the mask and the vlen parameters with attributes has the following advantages:

  • Analyses and optimizations understand the flow of predicates from a quick glance at the functions' attributes; no further knowledge about the functions' internals is required.
  • Dynamic vlen and the vector mask may be treated specially in the target's CC (e.g. by passing dynamic vlen in a VL register, or the active mask in a dedicated register (AMDGPU(?))).
  • Legalization does not have to know the nature of the intrinsic to legalize dynamic vlen where it is not supported.
Cons
  • Intrinsic bloat.
  • Predicating architectures without a dynamic vector length have to pass in a redundant vlen to exploit these intrinsics.
  • Some of LLVM's optimizations that need to understand the intrinsics' semantics, like InstCombine, need to be taught about the evl intrinsics to be able to optimize them. This will require at least some engineering effort.

Alternatives considered

Piggybacking

This means extending the current vector instructions to feature a predicate and a dynamic vector length in some way, both of which would be optional.
One approach to achieve this is a direct extension of the existing instructions. Decorating instructions with an extended OperandBundle scheme should work as well.

Diff Detail

Event Timeline

simoll created this revision. Oct 23 2018, 2:45 PM
simoll created this object with visibility "No One".
simoll retitled this revision from RFC: Predicated Vector Intrinsics to RFC: Dynamic Vector Length Intrinsics and Attributes. Oct 23 2018, 2:47 PM
simoll changed the visibility from "No One" to "Public (No Login Required)".

This patch is just for reference: it implements the three attributes and (some) intrinsic declarations.

Limiting this to the LLVM-SVE round table for now. We can bring this to llvm-dev when there is agreement on the overall design.

Hi Simon,

thanks for contributing this proposal.

Do you think it is orthogonal to cases, like the RISC-V V-extension, where we may not want to commit to a specific simdlen? I'm thinking of a vectorisation scheme similar to the one described here https://content.riscv.org/wp-content/uploads/2018/05/15.20-15.55-18.05.06.VEXT-bcn-v1.pdf (slides 22 to 41). (I understand it still needs some "variable-length vector" but there are proposals already in this area).

Regards

Hi Roger,

good to see RISC-V people getting back on this! We couldn't get a hold of anybody working on the V extension at the DevMtg.

Yes, the changes in this RFC are compatible with a physical simdlen that is unknown at compile time.

Regarding the re-configurable MVL of the V extension: if this scheme were adopted, it would definitely work with the Fixed-MVL-Per-Function approach that @rkruppe presented.

Here is an idea (...at the risk of side-tracking this RFC): I believe it will still work if a more flexible approach to re-configuring MVL is taken: e.g. if you used DVL intrinsics for RISC-V V, then the dynamic vector length for every call will be derived from some llvm.r5v.getmvl() intrinsic call. You could then pick one common MVL for all DVL intrinsics that derive their dynamicvl from the same call to this intrinsic. More appropriately, the intrinsic's name should then be more like llvm.r5v.configuremvl(). You could then pass down the MVL to callees the same way it is done here for the dynamic vector length (function argument plus attribute). So functions that inherit their MVL from the caller could transparently do so by deriving dynamicvl from that argument (instead of the intrinsic, which would re-roll the MVL die).

Thanks a lot for this proposal! It's very unfortunate I couldn't be at the dev meeting to discuss in person.

This basic approach of intrinsics with both a regular mask parameter and an integer parameter for the number of lanes to process matches what I've been doing for RISC-V and it works well for that purpose, especially for the strip-mined loops @rogfer01 highlighted. I think the same approach should generalize to other architectures and to vectorized code that does not follow the strip-mining style:

  • The dynamic vector length, if not needed, can be set to the constant -1 (= UINT_MAX)
  • The mask parameter, if not needed, can be set to the constant <i1 true, i1 true, ...>

Both of these settings are easy to recognize in legalization/ISel and in IR passes, so unpredicated dynamic-vl operations as well as predicated-but-full-length operations can be represented with (in principle) no loss of optimization power and only a little bit of boilerplate. That seems like a good trade off for not having multiple variants of every intrinsic.
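A sketch of how such neutral arguments could be recognized (function name and representation are hypothetical, for illustration only):

```python
UINT_MAX = 2**32 - 1

def is_effectively_unpredicated(mask, vlen):
    # an all-true mask plus vlen == -1 (UINT_MAX as i32) means the
    # call degenerates to the plain, unpredicated vector operation
    return all(mask) and vlen % 2**32 == UINT_MAX

is_effectively_unpredicated([True] * 8, -1)     # -> True
is_effectively_unpredicated([True, False], -1)  # -> False
```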

I am a bit less sure about the new attributes. If it was just about the intrinsics, I'd argue for creating helper query functions that extract the relevant arguments from a call or Function object, using knowledge of the intrinsic signatures. But on my third reading of the text I finally realized you want to apply them to non-intrinsics as well. An example of how each of these would be used (e.g. by RV or an OpenMP implementation) would be useful. I can see the value of passing the dynamic vector length in a specific register, but at a glance, unmasked_return seems rarely applicable to user-defined functions (similarly to the returned parameter attribute, which is a bit niche).

Editorial note: I find the way "unmasked" is used here confusing. You seem to use it for "lanes where the mask bit is 0, which are disabled", but IME "unmasked" means operations with no predication at all and lanes with mask bit 0 are called "disabled" or "masked out" or something to that effect.

PS: Representing RISC-V's MVL/vector configuration as an SSA value returned by and passed to functions is the first thought everyone has, including me, but I've tried extensively and it just can't work. I don't want to be too curt about this, but I'd really prefer to not side-track this RFC with rehashing the reasons why it doesn't work. If you want, Simon, you can start a thread on llvm-dev or email me privately and we can chat about it, but let's keep this thread on-topic.

Thanks a lot for this proposal! It's very unfortunate I couldn't be at the dev meeting to discuss in person.

This basic approach of intrinsics with both a regular mask parameter and an integer parameter for the number of lanes to process matches what I've been doing for RISC-V and it works well for that purpose, especially for the strip-mined loops @rogfer01 highlighted. I think the same approach should generalize to other architectures and to vectorized code that does not follow the strip-mining style:

  • The dynamic vector length, if not needed, can be set to the constant -1 (= UINT_MAX)
  • The mask parameter, if not needed, can be set to the constant <i1 true, i1 true, ...>

    Both of these settings are easy to recognize in legalization/ISel and in IR passes, so unpredicated dynamic-vl operations as well as predicated-but-full-length operations can be represented with (in principle) no loss of optimization power and only a little bit of boilerplate. That seems like a good trade off for not having multiple variants of every intrinsic.

Great, good to hear we are on the same page here.

I am a bit less sure about the new attributes. If it was just about the intrinsics, I'd argue for creating helper query functions that extract the relevant arguments from a call or Function object, using knowledge of the intrinsic signatures. But on my third reading of the text I finally realized you want to apply them to non-intrinsics as well. An example of how each of these would be used (e.g. by RV or an OpenMP implementation) would be useful. I can see the value of passing the dynamic vector length in a specific register, but at a glance, unmasked_return seems rarely applicable to user-defined functions (similarly to the returned parameter attribute, which is a bit niche).

Two reasons: first, we want to avoid this kind of hard-coded knowledge about intrinsics and second, the attributes allow you to coalesce vector registers. As a plus they simplify whole-function vectorization with dvl and predication beyond what's currently supported by OpenMP/VectorABI.

Example

Let's say you'd want to vectorize a loop like this for a predicating/dynamicvl architecture:

for (int i = 0; i < n; ++i) {
  double x = B[i];
  double y = C[i];
  A[i] = x > 0 ? bar(x) : y;
}

And there were a user-provided (or RV-auto-vectorized) SIMD version of bar with the following signature:

def <scalable 1 x double> @bar_dvl_nxv1(<scalable 1 x double> %a, <scalable 1 x double> unmasked_ret %b, <scalable 1 x i1> mask %mask, i32 dynamicvl %vl) {..}

Crucially, the implementation of @bar may use llvm.dvl intrinsics (or other function calls) internally but there is no way of telling the default return value (for masked-out lanes) without inspecting the IR... and worse you might just be given a declaration of that function.

However, by inspecting just the attributes, the loop vectorizer could simply vectorize the call to @bar as below (you'd still need a way to tell vector shapes, as in OpenMP's linear, aligned, ... clauses and VectorABI).

for.body.rv:
    %x = call <scalable 1 x double> @llvm.dvl.load(..., %dvl)
    %y = call <scalable 1 x double> @llvm.dvl.load(..., %dvl)
    %cond = fcmp ogt <scalable 1 x double> %x, splat 0.0
    ...
    %result = call <scalable 1 x double> @bar_dvl_nxv1( %x, %y, %cond, %dvl)
    ...
    llvm.dvl.store(%Aptr, %result, ...)
    ...

The select is folded into the vectorized function call, which is not possible otherwise.
Moreover if RV auto-vectorizes @bar it will automatically annotate the vectorized functions with these attributes for you.

Vector register coalescing

If you have unmasked_ret knowledge about the data flow in your vector code, register allocation can exploit that to save registers and avoid vector register spills.
Vector values with complementary masks can be coalesced into one register, relying on the fact that the masked-out part will be preserved through all arithmetic and function calls.
This already applies to the earlier example because there is no need to keep %y alive across the call site: instead we know that the return value of call @bar_dvl_nxv1 contains the parts of %y we care about.
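The coalescing argument can be illustrated with a lane-wise sketch (the values are hypothetical; the point is that one register can hold both %y and the call result):

```python
def merge_lanes(dest, src, mask):
    # the unmasked_ret/maskedout_ret guarantee: masked-out lanes of the
    # destination register are left untouched by the operation
    return [s if m else d for d, s, m in zip(dest, src, mask)]

y    = [10.0, 20.0, 30.0, 40.0]   # lanes we still need where cond is false
cond = [1, 0, 1, 0]
bar  = [1.0, 2.0, 3.0, 4.0]       # hypothetical result of the vectorized bar
reg  = merge_lanes(y, bar, cond)  # -> [1.0, 20.0, 3.0, 40.0]
```

After the merge, the single register `reg` carries the select's result, so %y needs no live range of its own across the call site.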

Editorial note: I find the way "unmasked" is used here confusing. You seem to use it for "lanes where the mask bit is 0, which are disabled", but IME "unmasked" means operations with no predication at all and lanes with mask bit 0 are called "disabled" or "masked out" or something to that effect.

Sure. Do you have a specific suggestion? How about `maskedout_ret` instead?

PS: Representing RISC-V's MVL/vector configuration as an SSA value returned by and passed to functions is the first thought everyone has, including me, but I've tried extensively and it just can't work. I don't want to be too curt about this, but I'd really prefer to not side-track this RFC with rehashing the reasons why it doesn't work. If you want, Simon, you can start a thread on llvm-dev or email me privately and we can chat about it, but let's keep this thread on-topic.

I can see how transformations may accidentally interleave operations with different MVLs. Let's focus on this RFC for now.

Hi @simoll

Yes, the changes in this RFC are compatible with a physical simdlen that is unknown at compile time.

thanks, this is good to know. Apologies if I side-tracked a little bit the discussion, I was a bit concerned of seeing fixed-size vectors here and I wanted to dispel my doubts.

Regards,

simoll planned changes to this revision.Oct 29 2018, 3:51 AM

Hi @simoll

Yes, the changes in this RFC are compatible with a physical simdlen that is unknown at compile time.

thanks, this is good to know. Apologies if I side-tracked a little bit the discussion, I was a bit concerned of seeing fixed-size vectors here and I wanted to dispel my doubts.

Regards,

No worries. Since R5V is one of the targets with a dynamic vector length that this RFC is aimed at, it is good to make clear that dvl intrinsics will work with R5V's re-configurable MVL.

Is there any further input on this RFC? Otherwise, I will send out an updated version to llvm-dev next week.

I do not think we need to unnecessarily tie this proposal to a "dynamic" vector length. These just have an "explicit vector length" that is not implied by the operand vector length. May I suggest "evl" instead of "dvl"? It stands for "Explicit Vector Length".

I am a bit less sure about the new attributes. If it was just about the intrinsics, I'd argue for creating helper query functions that extract the relevant arguments from a call or Function object, using knowledge of the intrinsic signatures. But on my third reading of the text I finally realized you want to apply them to non-intrinsics as well. An example of how each of these would be used (e.g. by RV or an OpenMP implementation) would be useful. I can see the value of passing the dynamic vector length in a specific register, but at a glance, unmasked_return seems rarely applicable to user-defined functions (similarly to the returned parameter attribute, which is a bit niche).

Two reasons: first, we want to avoid this kind of hard-coded knowledge about intrinsics and second, the attributes allow you to coalesce vector registers.

I don't really follow. If this information was only attached to intrinsics, then the choice is between specifying the meaning of the arguments once in the intrinsics TableGen file versus specifying it once in a single location in the C++ code. That doesn't seem like a significant difference. Of course, non-intrinsic functions are another matter, so this is entirely hypothetical anyway.

As a plus they simplify whole-function vectorization with dvl and predication beyond what's currently supported by OpenMP/VectorABI.

Example: Let's say you'd want to vectorize a loop like this for a predicating/dynamicvl architecture:
for (int i = 0; i < n; ++i) {
  double x = B[i];
  double y = C[i];
  A[i] = x > 0 ? bar(x) : y;
}

And there were a user-provided (or RV-auto-vectorized) SIMD version of bar with the following signature:

def <scalable 1 x double> @bar_dvl_nxv1(<scalable 1 x double> %a, <scalable 1 x double> unmasked_ret %b, <scalable 1 x i1> mask %mask, i32 dynamicvl %vl) {..}

Crucially, the implementation of @bar may use llvm.dvl intrinsics (or other function calls) internally but there is no way of telling the default return value (for masked-out lanes) without inspecting the IR... and worse you might just be given a declaration of that function.

However, by inspecting just the attributes, the loop vectorizer could simply vectorize the call to @bar as below (you'd still need a way to tell vector shapes, as in OpenMP's linear, aligned, ... clauses and VectorABI).

for.body.rv:
    %x = call <scalable 1 x double> @llvm.dvl.load(..., %dvl)
    %y = call <scalable 1 x double> @llvm.dvl.load(..., %dvl)
    %cond = fcmp ogt <scalable 1 x double> %x, splat 0.0
    ...
    %result = call <scalable 1 x double> @bar_dvl_nxv1( %x, %y, %cond, %dvl)
    ...
    llvm.dvl.store(%Aptr, %result, ...)
    ...

The select is folded into the vectorized function call, which is not possible otherwise.
Moreover if RV auto-vectorizes @bar it will automatically annotate the vectorized functions with these attributes for you.

Thank you for this example. It makes sense, though it still seems like a relatively small win (saves just a few data movement instructions here and there and removes only one overlapping live range). I don't know whether I'd bother introducing an attribute for that, but I won't object.

Vector register coalescing: If you have unmasked_ret knowledge about the data flow in your vector code, register allocation can exploit that to save registers and avoid vector register spills. Vector values with complementary masks can be coalesced into one register, relying on the fact that the masked-out part will be preserved through all arithmetic and function calls. This already applies to the earlier example because there is no need to keep %y alive across the call site: instead we know that the return value of the call to @bar_dvl_nxv1 contains the parts of %y we care about.

Note that for arithmetic and everything else except calls to user-defined functions, this optimization is feasible to do in the backend without any changes to the IR (I know of one out-of-tree backend doing this, Nyuzi). Only when crossing function boundaries you need the different ABI to be able to ensure the optimization can happen.

Editorial note: I find the way "unmasked" is used here confusing. You seem to use it for "lanes where the mask bit is 0, which are disabled", but IME "unmasked" means operations with no predication at all and lanes with mask bit 0 are called "disabled" or "masked out" or something to that effect.

Sure. Do you have a specific suggestion? How about `maskedout_ret` instead?

That seems good enough for me, further bikeshedding can happen on the mailing list if someone cares enough.


Regarding the "dvl" naming concern @hsaito brought up, in RISC-V we call this concept *active vector length* and I've used that name on llvm-dev in the past. It's a bit more specific than "dynamic" (e.g., one might say "the first n lanes are *active*") and it's applicable to fixed-width SIMD. It could be abbreviated to llvm.avl.* (minor name collision in that AVL is sometimes used for the *application* vector length, which usually exceeds the vector register size, but c'est la vie).

Hi @simoll ,

thank you for sending this out. I was wondering whether we really want to separate the concepts of vector lane predication and dynamic vector length.

Before explaining what I mean with this, I need to define two concepts. Note that these definitions apply both to Vector Length Agnostic (<scalable n x type>) and Vector Length Fixed (<n x type>) vector types.

Definition 1: Vector Lane Predication (VLP)

For a given operation INST that operates on vector inputs of type <{scalable }n x type>, Vector Lane Predication is the operation of attaching to INST a vector with the same number of lanes {scalable }n, but with (boolean) lanes of type i1, that selects on which lanes the operation INST needs to be executed. This concept can be applied to any IR instruction that has vectors among its input operands or in its result.

This can be represented in the language as an additional parameter in form of a vector of i1 lanes.

To achieve VLP, the instruction could be extended by adding the VLP parameter as the last parameter (this would make the predicated INST distinguishable from the non-predicated version):

%ret = <{scalable }n x ret_type> INST(<{scalable }n x type1> %in1, <{scalable }n x type2> %in2, ..., <{scalable }n x i1> %VLP)

Definition 2: Dynamic Vector Length (DVL)

For a given operation INST that operates on vector inputs of type <{scalable }n x type>, Dynamic Vector Length is the concept of attaching a run-time value of integer type, say i32 %DVL, that informs the instruction to operate on the first DVL lanes of the vector and discard (or set as undef) the remaining ones.

With this concept, the instruction INST should be extended to accept an additional scalar parameter that would represent the number of active lanes in the operation:

%ret = <{scalable }n x ret_type> INST(<{scalable }n x type1> %in1, <{scalable }n x type2> %in2, ..., i32 %DVL)

DVL, VLP, Vector Length Agnostic (VLA) and Vector Length Specific (VLS) ISAs

To my understanding, DVL is orthogonal to VLA. In principle, you could have a VLA ISA that has an additional register for setting the DVL of the vectors.

DVL as part of VLP

Extending any language to support DVL and VLP would require some form of polymorphism in the language itself, because the same INST would need to have an additional parameter of two different types, the scalar i32 %DVL and the vector <{scalable }n x i1> %VLP. I would like to argue that this is not necessary, because DVL is just a specific case of VLP. In fact, the same result of a DVL parameter could be obtained with a VLP parameter that would represent the split of the vector into active and inactive parts.

By using the extension of INST in the VLP definition, we could achieve DVL predication by introducing a new instruction, say VECTOR_SPLIT, that could generate the appropriate predicate partition for the instruction, as follows:

%VLP = <{scalable }n x i1> VECTOR_SPLIT(i32 %DVL)
%ret = <{scalable }n x ret_type> INST(<{scalable }n x type1> %in1, <{scalable }n x type2> %in2, ..., <{scalable }n x i1> %VLP)
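A lane-wise model of the proposed VECTOR_SPLIT (the name comes from the comment above; the Python rendering is only an illustrative sketch): it expands a scalar DVL into the equivalent prefix predicate.

```python
def vector_split(dvl, n):
    # lane i is active iff i < dvl, i.e. a prefix of 1s followed by 0s
    return [i < dvl for i in range(n)]

vector_split(3, 8)
# -> [True, True, True, False, False, False, False, False]
```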

Lowering VLP vs lowering DVL

The pattern for lowering DVL would involve a VECTOR_SPLIT (an intrinsic or instruction that would generate a single DVL instruction using an implicit DVL register). For non-DVL ISAs that support predication, such as SVE or AVX512, this special VECTOR_SPLIT instruction would be rendered as per-lane predication.

Please let me know if you don't agree with this reasoning. I might have an incomplete view of the problems introduced by such a simplification, but I believe that this approach simplifies the process of introducing predication into the IR because it merges two concepts that (to my understanding, or ignorance!) have been treated separately until now.

Predication: extending IR instructions vs intrinsics

Whether we decide to extend the IR or to add new intrinsics, with the reasoning above about DVL and VLP we can assume that both cases require just adding an additional VLP parameter, with, of course, the additional cost of a VECTOR_SPLIT instruction or intrinsic.

I personally would prefer to go with adding a new input parameter to the IR instructions (defaulted to "all true" when no predication is used), for two reasons:

  1. avoid the explosion of intrinsics, which would need to be treated separately in all passes.
  2. from experience extending LLVM IR instructions to support the scalable vector type, I believe that there is little risk in doing so.

Please understand that this is my personal preference. I understand that the community and also my colleagues at Arm might have a different view on this.

Do we really need to extend the IR?

My gut feeling is that we don't. I think we can ignore the scalable vs fixed vector types, and just consider the pure task of deciding whether we need to perform VLP predication vs DVL predication. Consider the following sequence.

%ret1 = <{scalable }n x ret_type> INST1(<{scalable }n x type1> ..., <{scalable }n x type2> , ...,)
%ret2 = <{scalable }n x ret_type> INST2(<{scalable }n x type1> ..., <{scalable }n x type2> , ...,)
%ret3 = <{scalable }n x ret_type> INST3(<{scalable }n x type1> ..., <{scalable }n x type2> , ...,)
%ret4 = <{scalable }n x ret_type> INST4(<{scalable }n x type1> ..., <{scalable }n x type2> , ...,)

It should be fairly easy to detect whether any SELECT instructions used on the input parameters or return values of the sequence are done using VLP predication or DVL predication. At Arm, we use this pattern-matching mechanism to lower selects and unpredicated IR instructions into predicated SVE instructions. As far as I know, there are specific cases that might require switching to a more sophisticated method (like native predication support in IR), but the pattern-matching mechanism on select has allowed us to cover the majority of the cases that we see.

I agree with @fpetrogalli here that there is some overlap between a "dynamic vector length" i32 %dvl and a mask of the form %m = <i1 1, ..., i1 1, i1 0, ..., i1 0> (or the reverse, if the lanes are to be interpreted in the other direction) where %dvl = llvm.ctpop(%m). As Francesco puts it, we can always construct %m from %dvl. Perhaps I'm wrong, but I think in the context of strip-mined loops or whole-function vectorisation, more elaborate masks that might arise due to control flow would always be subsumed by %m (i.e. will have a strict subset of the lanes enabled by %dvl).
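The claimed correspondence holds for prefix masks, as a quick sketch shows (helper name is mine, for illustration):

```python
def dvl_from_mask(m):
    # valid only for prefix masks <1, ..., 1, 0, ..., 0>: the
    # equivalent %dvl is the mask's population count (llvm.ctpop)
    assert all(a >= b for a, b in zip(m, m[1:])), "not a prefix mask"
    return sum(m)

dvl_from_mask([1, 1, 1, 0, 0])  # -> 3
```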

In the original RFC proposal above:

All of the intrinsics in the llvm.dvl namespace take in two predicating parameters: a mask of bit vector type (eg <8 x i1>) and a dynamic vector length value (i32).

Masks seem more general to me, which would make me think there is no real need to have intrinsics with %dvl. Perhaps I'm missing something really obvious here?

Kind regards,

simoll added a comment. Edited Oct 31 2018, 2:21 AM

Hi @fpetrogalli , @rogfer01 ,

The dynamic vector length is explicit because it crucially impacts the performance of vector instructions for SX-Aurora and RISC-V V (depending on hardware implementation).

SIMD instructions on NEC SX-Aurora execute in a pipelined fashion. While the vector registers hold 256 elements, the SIMD execution units operate on chunks of 32 elements.

Here is an example of two dvl invocations, which compute the same result:

a) llvm_dvl_fadd.v256f64(%x, %y, <full mask>, 13)
Since 13 < 32, the hardware will only issue 1 operation to its SIMD execution units. The occupation is thus something like 13/32 ~ 40%.

b) llvm_dvl_fadd.v256f64(%x, %y, <mask with first 13 bits set>, 256)
Since the DVL is 256, the hardware will issue 8 operations to its SIMD units. However, only the first 13 elements are relevant leading to an occupation of 13/256 ~ 5%.

By keeping the bit mask and the DVL value separate, DVL intrinsics allow us to make this distinction cleanly on IR level.
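The occupation numbers in a) and b) follow from the chunked-pipeline model; a small Python sketch (32-element chunks, as on SX-Aurora; names are illustrative):

```python
import math

def issued_ops(dvl, chunk=32):
    """Number of chunk-sized operations a pipelined SIMD unit issues for a
    given dynamic vector length."""
    return max(1, math.ceil(dvl / chunk))

def occupation(active_lanes, dvl, chunk=32):
    """Fraction of issued lane slots doing useful work."""
    return active_lanes / (issued_ops(dvl, chunk) * chunk)

# a) dvl = 13: one issued chunk, occupation 13/32 ~ 40%
assert issued_ops(13) == 1
assert abs(occupation(13, 13) - 13 / 32) < 1e-12
# b) dvl = 256 with only 13 masked-in lanes: eight chunks, occupation 13/256 ~ 5%
assert issued_ops(256) == 8
assert abs(occupation(13, 256) - 13 / 256) < 1e-12
```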

ARM SVE does not have a DVL and so it will be lowered to the mask (as it's described in the RFC):

Lowering for targets w/o dynamic vector length (AVX512, ARM SVE, ..)

ARM SVE does not feature a dynamic vector length register.
Hence, the vector length needs to be promoted to the bit mask predicate, shown here for an LLVM-SVE target:

Block before legalization:

..
foo (..., %mask, %dynamic_vl)
...

After legalization:

%vscale32 = call i32 @llvm.experimental.vector.vscale.32()
...
%stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
...
%dvl_splat = <i32 %dynamic_vl splat across all lanes of <scalable 4 x i32>>
%vl_mask = icmp ult <scalable 4 x i32> %stepvector, %dvl_splat
%new_mask = and <scalable 4 x i1> %mask, %vl_mask
foo (..., <scalable 4 x i1> %new_mask, i32 %vscale32)
...
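The promotion can be modeled lane-wise in Python (a sketch of the legalization above, not the actual lowering code):

```python
def legalize_dvl_to_mask(mask, dvl):
    """Fold the dynamic vector length into the bit mask, as a target without
    a vector-length register would: %new_mask = %mask & (stepvector < %dvl)."""
    stepvector = range(len(mask))            # llvm.experimental.vector.stepvector
    vl_mask = [i < dvl for i in stepvector]  # icmp ult %stepvector, %dvl_splat
    return [m and v for m, v in zip(mask, vl_mask)]

assert legalize_dvl_to_mask([True] * 8, 5) == [True] * 5 + [False] * 3
assert legalize_dvl_to_mask([True, False, True, True], 3) == [True, False, True, False]
```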

With the semantics defined in @simoll's proposal, the active vector length is actually subtly different from predication in that the former makes some lanes undef while predication takes the lane value from another parameter. I actually don't know what motivates this; in RISC-V, masked-out lanes and lanes beyond VL are treated the same, and this seems the most consistent choice in any ISA that has both concepts (and ISAs that only have predication would legalize the latter with predication, so they too would treat all lanes the same). Is there an architecture I'm not aware of that makes past-VL lanes undef but leaves masked-out lanes undisturbed?

Ignoring that difference for the rest of this post, it's true that you can implement "active vector length"-style loop control with just predication. Arm SVE even has dedicated instructions for generating these sorts of masks. So functionality-wise there would be no problem. However, when compiling for an architecture that has a vector length mechanism in hardware, we want to be able to reliably use it, not badly emulate it with the predication mechanism. If AVL is kept separate from masks, that is trivial. If the two are intermingled, code quality heavily depends on how good the backend is at disentangling the instructions computing the mask and separating it into mask_for_avl(%n) & ordinary_mask. That isn't too hard if every mask computation is in the canonical form emitted by the vectorizer, but experience shows that complex canonical forms tend to be mangled by optimizations before they reach the backend. For example, we run InstCombine after LV and that can do a whole lot to a tree composed of bitwise operators and to the selects they feed into.

To be clear, I've not (yet) tried it out, tried hard, and found that it actually doesn't work well enough. This is just educated speculation. But it seems like a plausible enough problem to me that it outweighs the complexity of an extra argument/concept (which isn't all that big, since anyone who doesn't care about active vector lengths can still understand the intrinsics in terms of masking).

Hi @simoll

Here is an example of two dvl invocations, which compute the same result:

a) llvm_dvl_fadd.v256f64(%x, %y, <full mask>, 13)
Since 13 < 32, the hardware will only issue 1 operation to its SIMD execution units. The occupation is thus something like 13/32 ~ 40%.

b) llvm_dvl_fadd.v256f64(%x, %y, <mask with first 13 bits set>, 256)
Since the DVL is 256, the hardware will issue 8 operations to its SIMD units. However, only the first 13 elements are relevant leading to an occupation of 13/256 ~ 5%.

Aha I see.

I guess we anticipate that the compiler would not be able to tell that a mask corresponds to a dvl if we chose not to represent it explicitly. I can imagine this happening to a function vectorised with an arbitrary mask; does this align with your expectations too?

Thanks a lot for the clarification.

With the semantics defined in @simoll's proposal, the active vector length is actually subtly different from predication in that the former makes some lanes undef while predication takes the lane value from another parameter. I actually don't know what motivates this; in RISC-V, masked-out lanes and lanes beyond VL are treated the same, and this seems the most consistent choice in any ISA that has both concepts (and ISAs that only have predication would legalize the latter with predication, so they too would treat all lanes the same). Is there an architecture I'm not aware of that makes past-VL lanes undef but leaves masked-out lanes undisturbed?

With the current unmasked_ret semantics, we know exactly the defined range of the result vector because all lanes beyond the dynamicvl argument are undef.
This means that the backend only needs to spill registers up to that value. This matters a lot for wide SIMD architectures like the SX-Aurora (and ARM SVE btw..) where one full vector register comes in at 256x8 bytes.

Excess lane semantics

While I expect this to be a rare event, I agree that you might still want to be able to preserve those excess (== beyond dvl in the defining instruction) lanes.
For regular function calls, semantics of excess lanes will depend on the calling convention (so there is some interplay between unmasked_ret semantics and the CC).
For dvl intrinsics, i see the following approach:

; %r is defined from [0, .. 42)
%r = llvm.dvl.fma.nxv1f64(%a, %b, <full mask>, 42)

; %full is defined from [0, .., %MVL)
%full = llvm.dvl.compose.nxv1f64(%a, %r, 42, %MVL)

declare @llvm.dvl.compose.nxv1f64(.. %a, .. %b, %split, i32 dynamicvl  %MVL)

The semantics of llvm.dvl.compose would be like a select with lane index: all lanes at or beyond '42' are taken from '%a' while the lanes below 42 are taken from '%r'.
The advantage here is that the defined range of the result of llvm.dvl.compose is still encoded using the dynamicvl attribute.
The backend can fold this pattern down into the appropriate excess-preserving instruction.
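A lane-wise Python model of the sketched llvm.dvl.compose semantics, following the defined ranges annotated in the IR comments above (lanes below %split come from the second operand, lanes at or beyond it keep the first):

```python
def dvl_compose(a, b, split):
    """Model of %full = llvm.dvl.compose(%a, %b, %split, %MVL):
    lanes below %split come from %b, lanes at or beyond %split keep %a."""
    return [b[i] if i < split else a[i] for i in range(len(a))]

old = [10, 20, 30, 40, 50]
new = [1, 2, None, None, None]  # lanes >= 2 were undef in the defining op
assert dvl_compose(old, new, 2) == [1, 2, 30, 40, 50]
```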

Ignoring that difference for the rest of this post, it's true that you can implement "active vector length"-style loop control with just predication.

Lanes beyond dvl are undef... no need to "ignore" that difference; the semantics allow for any value on those lanes.

Hi @simoll

Here is an example of two dvl invocations, which compute the same result:

a) llvm_dvl_fadd.v256f64(%x, %y, <full mask>, 13)
Since 13 < 32, the hardware will only issue 1 operation to its SIMD execution units. The occupation is thus something like 13/32 ~ 40%.

b) llvm_dvl_fadd.v256f64(%x, %y, <mask with first 13 bits set>, 256)
Since the DVL is 256, the hardware will issue 8 operations to its SIMD units. However, only the first 13 elements are relevant leading to an occupation of 13/256 ~ 5%.

Aha I see.

I guess we anticipate that the compiler would not be able to tell that a mask corresponds to a dvl if we chose not to represent it explicitly. I can imagine this happening to a function vectorised with an arbitrary mask; does this align with your expectations too?

Thanks a lot for the clarification.

Yep, a call boundary would be the ultimate limit to the inferring-dvl-from-mask approach.

With the semantics defined in @simoll's proposal, the active vector length is actually subtly different from predication in that the former makes some lanes undef while predication takes the lane value from another parameter. I actually don't know what motivates this; in RISC-V, masked-out lanes and lanes beyond VL are treated the same, and this seems the most consistent choice in any ISA that has both concepts (and ISAs that only have predication would legalize the latter with predication, so they too would treat all lanes the same). Is there an architecture I'm not aware of that makes past-VL lanes undef but leaves masked-out lanes undisturbed?

With the current unmasked_ret semantics, we know exactly the defined range of the result vector because all lanes beyond the dynamicvl argument are undef.
This means that the backend only needs to spill registers up to that value. This matters a lot for wide SIMD architectures like the SX-Aurora (and ARM SVE btw..) where one full vector register comes in at 256x8 bytes.

Spilling only the useful prefix of each vector is important, but I don't think we need to change the IR intrinsics' semantics to enable that. I've sketched an analysis that determines the demanded/consumed vector lengths of each vector value (on MIR in SSA form). With this information the backend can do the same optimization whenever the lanes beyond VL are not ever actually observed. This information is already necessary for many reasons other than spilling, such as implementing regular full-width vector operations (i.e., pretty much everything aside from the intrinsics we discuss here) that can sneak into the IR, or even ordinary register copies (on RISC-V at least). Normally I'd be hesitant to stake such an important aspect of code quality on a best-effort analysis, but in this case it seems very feasible to have very high precision:

  • All the sinks which normally let values escape and thus force "demanded-X" style analyses to be conservative (stores, calls, etc.) are restricted with a VL, so we don't need to worry about other code using the higher lanes
  • In natural code (strip-mined loops or vectorized functions), most instructions trivially have the same VL (same SSA value) -- we don't need algebraic transformations or anything to show that two instructions access the same number of lanes
  • When the vector length does change in the middle of a computation, it typically becomes monotonically smaller in a fairly obvious way (e.g., speculative loads for search loop vectorization, or early exits in functions), so we don't need to worry about later instructions accessing elements that an earlier definition didn't write to
  • In the case where two instructions have completely unrelated vector lengths, they typically belong to two separate loops, between which you normally don't have any (vector-shaped) data flow to begin with, so the problem of what to do with the disabled lanes doesn't even arise
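A toy backward pass over SSA-like tuples may make the intended analysis concrete (the op encoding is invented for illustration; this is not an LLVM data structure):

```python
def demanded_vl(ops):
    """ops: list of (result, opcode, operands, vl) in program order. Sinks
    (stores) demand their own VL; every other value demands only the maximum
    VL any of its users observes."""
    demand = {}
    for res, opcode, operands, vl in reversed(ops):
        if opcode == "store":
            need = vl                  # sink restricted by its VL
        else:
            need = demand.get(res, 0)  # only what users actually read
        for src in operands:
            demand[src] = max(demand.get(src, 0), need)
    return demand

prog = [
    ("x",  "load",  ["p"],      256),
    ("y",  "fadd",  ["x", "x"], 256),
    (None, "store", ["y"],       13),  # only 13 lanes escape
]
d = demanded_vl(prog)
assert d["y"] == 13 and d["x"] == 13   # spill/fill only needs 13 lanes
```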

I haven't been able to evaluate this idea yet (partly because I don't have a complete enough compiler to compile benchmarks and see what happens) but for the above reasons I am quite optimistic that it will be enough to spill&fill vectors only up to VL, and solve other problems in the same stroke.

Aside: the advantage for spilling I see is just less memory traffic, I don't think you can make stack frames smaller since you generally won't have an upper bound on the vector lengths and recomputing frame layout every time the active vector length changes seems both very difficult for the backend and at best performance-neutral, more likely causing slowdowns.

  1. Excess lane semantics: While I expect this to be a rare event, I agree that you might still want to be able to preserve those excess (== beyond dvl in the defining instruction) lanes. For regular function calls, the semantics of excess lanes will depend on the calling convention (so there is some interplay between unmasked_ret semantics and the CC). For dvl intrinsics, I see the following approach:
; %r is defined from [0, .. 42)
%r = llvm.dvl.fma.nxv1f64(%a, %b, <full mask>, 42)

; %full is defined from [0, .., %MVL)
%full = llvm.dvl.compose.nxv1f64(%a, %r, 42, %MVL)

declare @llvm.dvl.compose.nxv1f64(.. %a, .. %b, %split, i32 dynamicvl  %MVL)

The semantics of llvm.dvl.compose would be like a select with lane index: all lanes at or beyond '42' are taken from '%a' while the lanes below 42 are taken from '%r'.
The advantage here is that the defined range of the result of llvm.dvl.compose is still encoded using the dynamicvl attribute.
The backend can fold this pattern down into the appropriate excess-preserving instruction.

I'm not really sure what use the dynamicvl attribute on %MVL is if that parameter is always MVL. I guess you envision also using this intrinsic to stitch together shorter vectors? That's not an operation I've encountered before.

Thinking some more about it, a more convincing reason to keep it that way (to me) is that different architectures might have different behavior for masking and lanes-beyond-VL. RISC-V V, for a long time, wanted to zero lanes in both cases rather than keeping the old value around. For masking, merging with an "old" value is usually done because the application needs it, so generating an explicit merge instruction in those cases is probably fine, but having to copy over the completely irrelevant higher lanes would be pretty bad. The demanded-elements analysis mentioned above could help with this, though.

Ignoring that difference for the rest of this post, it's true that you can implement "active vector length"-style loop control with just predication.

Lanes beyond dvl are undef... no need to "ignore" that difference, semantics allows for any value on those lanes.

You're right, in that direction everything's fine, I was just hesitant to equate the two approaches to loop control entirely because (if vector length-predication makes lanes undef) the opposite direction doesn't work.

simoll added a comment. Edited Oct 31 2018, 6:19 AM

With the semantics defined in @simoll's proposal, the active vector length is actually subtly different from predication in that the former makes some lanes undef while predication takes the lane value from another parameter. I actually don't know what motivates this; in RISC-V, masked-out lanes and lanes beyond VL are treated the same, and this seems the most consistent choice in any ISA that has both concepts (and ISAs that only have predication would legalize the latter with predication, so they too would treat all lanes the same). Is there an architecture I'm not aware of that makes past-VL lanes undef but leaves masked-out lanes undisturbed?

With the current unmasked_ret semantics, we know exactly the defined range of the result vector because all lanes beyond the dynamicvl argument are undef.
This means that the backend only needs to spill registers up to that value. This matters a lot for wide SIMD architectures like the SX-Aurora (and ARM SVE btw..) where one full vector register comes in at 256x8 bytes.

Spilling only the useful prefix of each vector is important, but I don't think we need to change the IR intrinsics' semantics to enable that. I've sketched an analysis that determines the demanded/consumed vector lengths of each vector value (on MIR in SSA form). With this information the backend can do the same optimization whenever the lanes beyond VL are not ever actually observed. This information is already necessary for many reasons other than spilling, such as implementing regular full-width vector operations (i.e., pretty much everything aside from the intrinsics we discuss here) that can sneak into the IR, or even ordinary register copies (on RISC-V at least). Normally I'd be hesitant to stake such an important aspect of code quality on a best-effort analysis, but in this case it seems very feasible to have very high precision:

Actually, you could translate regular vector code to EVL intrinsics first and have your backend only work on that. This is the route we are aiming for with the SX-Aurora SVE backend. We propose undef-on-excess-lanes as the default semantics of dynamicvl. There is no special interpretation, nor a change to the IR intrinsics' semantics.

  • All the sinks which normally let values escape and thus force "demanded-X" style analyses to be conservative (stores, calls, etc.) are restricted with a VL, so we don't need to worry about other code using the higher lanes

I disagree. You brought up regular vector instructions yourself: what happens if a full vector store (non-EVL) goes to a buffer and that buffer is then passed to a scalar function bar(double * rawData)? You have no idea which lanes bar is going to access. If the store value is defined with an explicit %dvl and undef on excess, you will know, however.

  • In natural code (strip-mined loops or vectorized functions), most instructions trivially have the same VL (same SSA value) -- we don't need algebraic transformations or anything to show that two instructions access the same number of lanes

This applies to both interpretations of excess lanes.

  • When the vector length does change in the middle of a computation, it typically becomes monotonically smaller in a fairly obvious way (e.g., speculative loads for search loop vectorization, or early exits in functions), so we don't need to worry about later instructions accessing elements that an earlier definition didn't write to

This may hold for some basic vectorization schemes but it is too restrictive for more advanced techniques: e.g. "FlexVec: auto-vectorization for irregular loops" (PLDI '16), or Dynamic SIMD Vector Lane Scheduling, where new work items are pulled into a spinning loop to avoid reducing the DVL whenever a thread drops out, but you may decide to only do so if the occupation drops below a certain threshold.
RV already implements a transformation of this kind for regular LLVM IR and it will do so using the intrinsics and attributes we propose here.

Aside: the advantage for spilling I see is just less memory traffic, I don't think you can make stack frames smaller since you generally won't have an upper bound on the vector lengths and recomputing frame layout every time the active vector length changes seems both very difficult for the backend and at best performance-neutral, more likely causing slowdowns.

The impact can be significant if the difference is a spill+reload of 256x8 bytes worth of data. Besides, you could actually avoid memory altogether, e.g. by compressing two vectors of size MVL/2 into one.
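As a toy model of that register-pressure idea, two short vectors can share one register-sized spill slot when their dynamic vector lengths fit (purely illustrative, not a proposed mechanism):

```python
def pack_for_spill(v1, dvl1, v2, dvl2, mvl):
    """Pack the live prefixes of two vectors into a single MVL-sized slot;
    only valid because lanes beyond each dvl are undef and need not survive."""
    assert dvl1 + dvl2 <= mvl, "live prefixes must fit in one register"
    packed = v1[:dvl1] + v2[:dvl2]
    return packed + [None] * (mvl - len(packed))  # remaining lanes undef

def unpack_after_fill(packed, dvl1, dvl2):
    """Recover both live prefixes from the shared slot."""
    return packed[:dvl1], packed[dvl1:dvl1 + dvl2]

p = pack_for_spill([1, 2, 3, 9], 3, [7, 8, 6, 6], 1, 8)
assert unpack_after_fill(p, 3, 1) == ([1, 2, 3], [7])
```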

  1. Excess lane semantics: While I expect this to be a rare event, I agree that you might still want to be able to preserve those excess (== beyond dvl in the defining instruction) lanes. For regular function calls, the semantics of excess lanes will depend on the calling convention (so there is some interplay between unmasked_ret semantics and the CC). For dvl intrinsics, I see the following approach:
; %r is defined from [0, .. 42)
%r = llvm.dvl.fma.nxv1f64(%a, %b, <full mask>, 42)

; %full is defined from [0, .., %MVL)
%full = llvm.dvl.compose.nxv1f64(%a, %r, 42, %MVL)

declare @llvm.dvl.compose.nxv1f64(.. %a, .. %b, %split, i32 dynamicvl  %MVL)

The semantics of llvm.dvl.compose would be like a select with lane index: all lanes at or beyond '42' are taken from '%a' while the lanes below 42 are taken from '%r'.
The advantage here is that the defined range of the result of llvm.dvl.compose is still encoded using the dynamicvl attribute.
The backend can fold this pattern down into the appropriate excess-preserving instruction.

I'm not really sure what use the dynamicvl attribute on %MVL is if that parameter is always MVL. I guess you envision also using this intrinsic to stitch together shorter vectors? That's not an operation I've encountered before.

dynamicvl %MVL is the canonical way to express this without adding hard-coded knowledge. Lowering evl.compose is cheap on ARM SVE and SX-Aurora (1. generate mask, 2. blend).

Thinking some more about it, a more convincing reason to keep it that way (to me) is that different architectures might have different behavior for masking and lanes-beyond-VL. RISC-V V, for a long time, wanted to zero lanes in both cases rather than keeping the old value around. For masking, merging with an "old" value is usually done because the application needs it, so generating an explicit merge instruction in those cases is probably fine, but having to copy over the completely irrelevant higher lanes would be pretty bad. The demanded-elements analysis mentioned above could help with this, though.

With undef-on-excess-lanes, you could actually implement a demanded-elements analysis on the IR level. You'd then store your findings to lower the dynamicvl %dvl argument of the producing instruction. This may have positive interactions with other parts of LLVM as well. For example, in legalization for non-dvl targets it could mean that an EVL intrinsic can be pruned to the native vector length, implying that it lowers to a plain vector instruction (without any DVL/MVL loop).

Spilling only the useful prefix of each vector is important, but I don't think we need to change the IR intrinsics' semantics to enable that. I've sketched an analysis that determines the demanded/consumed vector lengths of each vector value (on MIR in SSA form). With this information the backend can do the same optimization whenever the lanes beyond VL are not ever actually observed. This information is already necessary for many reasons other than spilling, such as implementing regular full-width vector operations (i.e., pretty much everything aside from the intrinsics we discuss here) that can sneak into the IR, or even ordinary register copies (on RISC-V at least). Normally I'd be hesitant to stake such an important aspect of code quality on a best-effort analysis, but in this case it seems very feasible to have very high precision:

Actually, you could translate regular vector code to EVL intrinsics first and have your backend only work on that. This is the route we are aiming for with the SX-Aurora SVE backend. We propose undef-on-excess-lanes as the default semantics of dynamicvl. There is no special interpretation, nor a change to the IR intrinsics' semantics.

Ideally you'd want these intrinsics for all code, yes, but

  1. since backends don't dictate the IR pass pipeline, it will be fragile/impossible to guarantee that your pass for turning full vector operations into intrinsics runs last
  2. there are operations not visible in the IR (such as register copies) for which you'll probably also need this sort of analysis

But yes, other than that you can run the same analysis before ISel.

  • All the sinks which normally let values escape and thus force "demanded-X" style analyses to be conservative (stores, calls, etc.) are restricted with a VL, so we don't need to worry about other code using the higher lanes

I disagree. You brought up regular vector instructions yourself: what happens if a full vector store (non-EVL) goes to a buffer and that buffer is then passed to a scalar function bar(double * rawData)? You have no idea which lanes bar is going to access. If it's an EVL-store, you will know, however.

If there is a store of this sort in the input program, there is nothing you can do about it, you can't justify changing it to an EVL-store at any stage, MIR or IR. It will just have to store the whole vector. Vector-ignorant IR optimizations don't introduce this sort of code (they should keep things in SSA values or promote from memory to SSA values, not demote) and vectorization passes which turn scalar stores into vector stores know to use the bounded intrinsics -- and usually have to use them for correctness anyway.

  • In natural code (strip-mined loops or vectorized functions), most instructions trivially have the same VL (same SSA value) -- we don't need algebraic transformations or anything to show that two instructions access the same number of lanes

This applies to both interpretations of excess lanes.

Yes, I am not saying this analysis isn't possible with excess lanes being undef, just arguing excess lanes being undef is not necessary for good codegen.

  • When the vector length does change in the middle of a computation, it typically becomes monotonically smaller in a fairly obvious way (e.g., speculative loads for search loop vectorization, or early exits in functions), so we don't need to worry about later instructions accessing elements that an earlier definition didn't write to

This may hold for some basic vectorization schemes but it is too restrictive for more advanced techniques: e.g. "FlexVec: auto-vectorization for irregular loops" (PLDI '16), or Dynamic SIMD Vector Lane Scheduling, where new work items are pulled into a spinning loop to avoid reducing the DVL whenever a thread drops out, but you may decide to only do so if the occupation drops below a certain threshold.
RV already implements these transformations for regular LLVM IR and it will do so using the intrinsics and attributes we propose here.

Sorry, I was only talking about changes within an iteration of the vectorized loop. Across iterations, of course the vector length doesn't need to fall monotonically, even basic search loop vectorization (e.g. of strlen) doesn't satisfy that. It seems all the strategies you mention are smarter about how they pack scalar work items into vectors, but still only do this between iterations of the vectorized loop, i.e., don't do something like this:

loop {
    // first half of the work
    // pull in more work items
    // second half of the work on existing+newly pulled in work items
}

Is that right?

  • In the case where two instructions have completely unrelated vector lengths, they typically belong to two separate loops, between which you normally don't have any (vector-shaped) data flow to begin with, so the problem of what to do with the disabled lanes doesn't even arise

    I haven't been able to evaluate this idea yet (partly because I don't have a complete enough compiler to compile benchmarks and see what happens) but for the above reasons I am quite optimistic that it will be enough to spill&fill vectors only up to VL, and solve other problems in the same stroke.

Aside: the advantage for spilling I see is just less memory traffic, I don't think you can make stack frames smaller since you generally won't have an upper bound on the vector lengths and recomputing frame layout every time the active vector length changes seems both very difficult for the backend and at best performance-neutral, more likely causing slowdowns.

The impact can be significant if the difference is a spill+reload of 256x8 bytes worth of data. Besides, you could actually avoid memory altogether, e.g. by compressing two vectors of size MVL/2 into one.

Impact on what? It's obvious that one shouldn't have to spill 256x8 bytes if only a small subset of the lanes is needed; I'm just saying I don't see a good way to avoid allocating stack space for the full vector (assuming one needs a stack slot at all).

I'm not really sure what use the dynamicvl attribute on %MVL is if that parameter is always MVL. I guess you envision also using this intrinsic to stitch together shorter vectors? That's not an operation I've encountered before.

dynamicvl %MVL is the canonical way to express this without adding hard-coded knowledge. Lowering evl.compose is cheap on ARM-SVE and SX-Aurora (1. generate mask, 2. blend).

If this intrinsic is only ever used with dynamicvl %MVL and not with shorter dynamicvls, there's nothing there to express and we could drop the parameter altogether (not just the attribute, as I realize now). If you have a use case for compose(..., dynamicvl %something_shorter), then sure.

Thinking some more about it, a more convincing reason to keep it that way (to me) is that different architectures might have different behavior for masking and lanes-beyond-VL. RISC-V V, for a long time, wanted to zero lanes in both cases rather than keeping the old value around. For masking, merging with an "old" value is usually done because the application needs it, so generating an explicit merge instruction in those cases is probably fine, but having to copy over the completely irrelevant higher lanes would be pretty bad. The demanded-elements analysis mentioned above could help with this, though.

With undef-on-excess-lanes, you could actually implement a demanded-elements analysis on the IR level. You'd then store your findings to lower the dynamicvl %dvl argument of the producing instruction. This may have positive interactions with other parts of LLVM as well. For example, in legalization for non-dvl targets it could mean that an EVL intrinsic can be pruned to the native vector length, implying that it lowers to a plain vector instruction (without any DVL/MVL loop).

You can have the analysis on IR either way, that some lanes are never read isn't affected by what you put in those lanes. I don't quite understand your point about legalization on other targets -- the analysis I propose makes the vector length shorter, to use a packed-SIMD architecture's full vectors you need the opposite (proving you're allowed to widen it to a full operation), which is a rather different task. If you can apply my analysis then you either already have full-width vector ops and don't need to do anything more to lower them well on a conventional SIMD architecture, or it will just replace one unknown dynamic vector length with another, possibly shorter, one.


All this being said, I want to be clear I don't really oppose excess lanes being undef. It doesn't seem necessary or even particularly helpful for the optimizations and codegen strategy I have planned, but it's not an obstacle either, so I'm happy to let the intrinsics be defined this way.

Spilling only the useful prefix of each vector is important, but I don't think we need to change the IR intrinsics' semantics to enable that. I've sketched an analysis that determines the demanded/consumed vector lengths of each vector value (on MIR in SSA form). With this information the backend can do the same optimization whenever the lanes beyond VL are not ever actually observed. This information is already necessary for many reasons other than spilling, such as implementing regular full-width vector operations (i.e., pretty much everything aside from the intrinsics we discuss here) that can sneak into the IR, or even ordinary register copies (on RISC-V at least). Normally I'd be hesitant to stake such an important aspect of code quality on a best-effort analysis, but in this case it seems very feasible to have very high precision:

Actually, you could translate regular vector code to EVL intrinsics first and have your backend only work on that. This is the route we are aiming for with the SX-Aurora SVE backend. We propose undef-on-excess-lanes as the default semantics of dynamicvl. There is no special interpretation, nor a change to the IR intrinsics' semantics.

Ideally you'd want these intrinsics for all code, yes, but

  1. since backends don't dictate the IR pass pipeline, it will be fragile/impossible to guarantee that your pass for turning full vector operations into intrinsics runs last

Actually, you could use custom legalization in ISelLowering for this. No pass involved.

  1. there are operations not visible in the IR (such as register copies) for which you'll probably also need this sort of analysis

Fair enough. Would it be possible to simply extend the %dvl of the defining operation to the newly created register (instead of re-running a full-fledged analysis)?

But yes, other than that you can run the same analysis before ISel.

  • All the sinks which normally let values escape and thus force "demanded-X" style analyses to be conservative (stores, calls, etc.) are restricted with a VL, so we don't need to worry about other code using the higher lanes

I disagree. You brought up regular vector instructions yourself: what happens if a full vector store (non-EVL) goes to a buffer and that buffer is then passed to a scalar function bar(double * rawData)? You have no idea which lanes bar is going to access. If it's an EVL-store, you will know, however.

If there is a store of this sort in the input program, there is nothing you can do about it, you can't justify changing it to an EVL-store at any stage, MIR or IR. It will just have to store the whole vector. Vector-ignorant IR optimizations don't introduce this sort of code (they should keep things in SSA values or promote from memory to SSA values, not demote) and vectorization passes which turn scalar stores into vector stores know to use the bounded intrinsics -- and usually have to use them for correctness anyway.

Good point.

I think for me it boils down to this: without undef-on-excess there is no obvious way for programmers to specify that the result of an operation does not matter beyond the excess lanes and instead they are at the mercy of some clever analysis in the backend. HPC programmers dislike this sort of thing. A good example for this is register allocation/spilling where developers go a long way to guide the spilling heuristics in the right direction by placing branch probabilities and the like.

Sorry, I was only talking about changes within an iteration of the vectorized loop. Across iterations, of course the vector length doesn't need to fall monotonically, even basic search loop vectorization (e.g. of strlen) doesn't satisfy that. It seems all the strategies you mention are smarter about how they pack scalar work items into vectors, but still only do this between iterations of the vectorized loop, i.e., don't do something like this:

loop {
    // first half of the work
    // pull in more work items
    // second half of the work on existing+newly pulled in work items
}

Is that right?

Basically, yes:

while (any thread live) {
  // perform work
  // evaluate loop exit condition and deactivate leaving lanes
  if (/*number of active lanes*/ < threshold) {
    // pull in new work onto the inactive lanes
  }
}
  • In the case where two instructions have completely unrelated vector lengths, they typically belong to two separate loops, between which you normally don't have any (vector-shaped) data flow to begin with, so the problem of what to do with the disabled lanes doesn't even arise

    I haven't been able to evaluate this idea yet (partly because I don't have a complete enough compiler to compile benchmarks and see what happens) but for the above reasons I am quite optimistic that it will be enough to spill&fill vectors only up to VL, and solve other problems in the same stroke.

Aside: the advantage for spilling I see is just less memory traffic, I don't think you can make stack frames smaller since you generally won't have an upper bound on the vector lengths and recomputing frame layout every time the active vector length changes seems both very difficult for the backend and at best performance-neutral, more likely causing slowdowns.

The impact can be significant if the difference is a spill+reload of 256x8 bytes worth of data. Besides, you could actually avoid memory altogether, e.g. by compressing two vectors of size MVL/2 into one.

Impact on what? It's obvious that one shouldn't have to spill 256x8 bytes if only a small subset of the lanes is needed, I'm just saying I don't see a good way to avoid allocating stack space for the full vector (assuming one needs a stack slot at all).

Allocating stack space does not imply memory traffic per se. Spilling however does.

I'm not really sure what use the dynamicvl attribute on %MVL is if that parameter is always MVL. I guess you envision also using this intrinsic to stitch together shorter vectors? That's not an operation I've encountered before.

dynamicvl %MVL is the canonical way to express this without adding hard-coded knowledge. Lowering evl.compose is cheap on ARM-SVE and SX-Aurora (1. generate mask, 2. blend).
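A sketch of that two-step lowering for llvm.evl.compose (the fixed <8 x ...> types and the splat idiom are illustrative assumptions, not part of the RFC):

```llvm
; %r = llvm.evl.compose(%A, %B, %pivot, %mvl):
; lanes [0, %pivot) come from %A, the remaining lanes from %B.

; 1. generate a mask that is true for lane indices below %pivot
%pivot.ins = insertelement <8 x i32> undef, i32 %pivot, i32 0
%pivot.v   = shufflevector <8 x i32> %pivot.ins, <8 x i32> undef,
             <8 x i32> zeroinitializer
%mask      = icmp ult <8 x i32> <i32 0, i32 1, i32 2, i32 3,
                                 i32 4, i32 5, i32 6, i32 7>, %pivot.v

; 2. blend
%r = select <8 x i1> %mask, <8 x double> %A, <8 x double> %B
```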

If this intrinsic is only ever used with dynamicvl %MVL and not with shorter dynamicvls, there's nothing there to express and we could drop the parameter altogether (not just the attribute, as I realize now). If you have a use case for compose(..., dynamicvl %something_shorter), then sure.

It's likely we will see vector code that only uses fractions of MVL (e.g. %MVL/2).

Thinking some more about it, a more convincing reason to keep it that way (to me) is that different architectures might have different behavior for masking and lanes-beyond-VL. RISC-V V, for a long time, wanted to zero lanes in both cases rather than keeping the old value around. For masking, merging with an "old" value is usually done because the application needs it, so generating an explicit merge instruction in those cases is probably fine, but having to copy over the completely irrelevant higher lanes would be pretty bad. The demanded-elements analysis mentioned above could help with this, though.

With undef-on-excess-lanes, you could actually implement a demanded-elements analysis on the IR level. You'd then use your findings to lower the dynamicvl %dvl argument of the producing instruction. This may have positive interactions with other parts of LLVM as well. For example, in legalization for non-dvl targets it could mean that an EVL intrinsic can be pruned to the native vector length, implying that it lowers to a plain vector instruction (without any DVL/MVL loop).

You can have the analysis on IR either way, that some lanes are never read isn't affected by what you put in those lanes. I don't quite understand your point about legalization on other targets -- the analysis I propose makes the vector length shorter, to use a packed-SIMD architecture's full vectors you need the opposite (proving you're allowed to widen it to a full operation), which is a rather different task. If you can apply my analysis then you either already have full-width vector ops and don't need to do anything more to lower them well on a conventional SIMD architecture, or it will just replace one unknown dynamic vector length with another, possibly shorter, one.

%dvl = <complex integer arithmetic>
%dvl2 = <some more complex integer arithmetic>

%R = llvm.evl.fadd.v512f32(%a, %b, %m, 16)
%userOne = ... evl.fma.v512f32(%R, ..., %dvl)
%userTwo = ... evl.fma.v512f32(%R, ..., %dvl2)

With undef-on-excess, operations %userOne and %userTwo can be pruned to a width of 16.
Without looking any further at users of %userOne and %userTwo, this could be legalized into

%R = avx512.fadd.v16f32(%a, %b, %m)
%userOne = avx512.fma.v16f32(%R, ...)
%userTwo = avx512.fma.v16f32(%R, ...)

That's hard to do with maskedout-on-excess since it would all depend on %dvl and %dvl2, and there would be no simple way for users to convey the information "don't care about %R beyond 16" in IR.


All this being said, I want to be clear I don't really oppose excess lanes being undef. It doesn't seem necessary or even particularly helpful for the optimizations and codegen strategy I have planned, but it's not an obstacle either, so I'm happy to let the intrinsics be defined this way.

Thank you for the scrutiny! It's important to validate this proposal so we have a robust representation that should work for all targets. I'll update the RFC shortly keeping undef-on-excess in place.

simoll added a comment.Nov 1 2018, 3:02 AM

Hi @fpetrogalli,

Hi @simoll ,

Definition 1: Vector Lane Predication (VLP)

For a given operation INST that operates on vector inputs of type <{scalable }n x type>, Vector Lane Predication is the operation of attaching to INST a vector with the same number of lanes {scalable }n, but with (boolean) lanes of type i1, that selects on which lanes the operation INST needs to be executed. This concept can be applied to any IR instruction that has vectors in any of its input operands or in its result.

This can be represented in the language as an additional parameter in form of a vector of i1 lanes.

To achieve VLP, the instruction could be extended by adding the VLP parameter as the last parameter (this would make the predicated INST distinguishable from the non-predicated version):

%ret = <{scalable }n x ret_type> INST(<{scalable }n x type1> %in1, <{scalable }n x type2> %in2, ..., <{scalable }n x i1> %VLP)

As you state here, predicated INSTs would be indistinguishable from unpredicated INSTs if you are unaware of predication. As a result, every existing transformation that touches vector instructions will happily ignore the predicate and break your code. In effect this is similar to using metadata to annotate the predicate.

Definition 2: Dynamic Vector Length (DVL)

For a given operation INST that operates on vector inputs of type <{scalable }n x type>, Dynamic Vector Length is the concept of attaching a run-time value of integer type, say i32 %DVL, that informs the instruction to operate on the first %DVL lanes of the vector and discard (or set as undef) the remaining ones.

With this concept, the instruction INST should be extended to accept an additional scalar parameter that would represent the number of active lanes in the operation:

%ret = <{scalable }n x ret_type> INST(<{scalable }n x type1> %in1, <{scalable }n x type2> %in2, ..., i32 %DVL)

Same reasoning as above.

DVL, VLP, Vector Length Agnostic (VLA) and Vector Length Specific (VLS) ISAs

To my understanding, DVL is orthogonal to VLA. In principle, you could have a VLA ISA that has an additional register for setting the DVL of the vectors.

DVL as part of VLP

Extending any language to support DVL and VLP would require some form of polymorphism in the language itself, because the same INST would need to have an additional parameter of two different types: the scalar i32 %DVL and the vector <{scalable }n x i1> %VLP. I would like to argue that this is not necessary, because DVL is just a specific case of VLP. In fact, the same result as with a DVL parameter could be obtained with a VLP parameter that represents the split of the vector into active and inactive parts.

By using the extension of INST in the VLP definition, we could achieve DVL predication by introducing a new instruction, say VECTOR_SPLIT, that could generate the appropriate predicate partition for the instruction, as follows:

%VLP = <{scalable }n x i1> VECTOR_SPLIT(i32 %DVL)
%ret = <{scalable }n x ret_type> INST(<{scalable }n x type1> %in1, <{scalable }n x type2> %in2, ..., <{scalable }n x i1> %VLP)

Call boundaries will obscure any predicate-producing code (also see the comment by @rogfer01).

Predication: extending IR instructions vs intrinsics

Whether we decide to extend the IR or to add new intrinsics, with the reasoning about DVL and VLP we can assume that both cases will require just adding an additional VLP parameter, with, of course, the additional cost of adding a VECTOR_SPLIT instruction or intrinsic.

I personally would prefer to go by adding a new input parameter to the IR instruction (defaulted to "all true" when no predication is used), for two reasons:

  1. avoid the explosion of intrinsics, which would need to be treated separately in all passes.

In fact, there already is an explosion of intrinsics. Basically every SIMD target exposes its ISA via builtin functions. These are highly target-specific and their semantics might not even be publicly documented.
This RFC actually tries to come up with a set of target-agnostic, predicated, EVL primitives that should work well on all SIMD targets so we do not have to rely on target-specific intrinsics to get this functionality in IR (e.g. for predicated fdiv, if floating point is trapping).
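As an illustration, a common legalization of a predicated fdiv on a non-predicating target might look like the following sketch (the safe-divisor trick and the fixed <8 x ...> types are my assumptions, not part of the RFC):

```llvm
; predicated division: masked-off lanes must not trap
%q = call <8 x double> @llvm.evl.fdiv.v8f64(<8 x double> %a, <8 x double> %b,
                                            <8 x i1> %m, i32 %dvl)

; possible legalization for a target without native predication:
; substitute a harmless divisor on inactive lanes, then discard their results.
%safe.b = select <8 x i1> %m, <8 x double> %b,
          <8 x double> <double 1.0, double 1.0, double 1.0, double 1.0,
                        double 1.0, double 1.0, double 1.0, double 1.0>
%q.full   = fdiv <8 x double> %a, %safe.b
%q.masked = select <8 x i1> %m, <8 x double> %q.full, <8 x double> undef
```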

  1. from experience in extending LLVM IR instructions to support the scalable vector type, I believe that there is little risk in doing so.

It's deceptively easy to modify the core IR. However, this has ramifications for all users (transformations, analyses, backends, ...). E.g. consider the concerns and discussions surrounding LLVM-SVE (llvm-dev, also D53695).

Please understand that this is my personal preference. I understand that the community and also my colleagues at Arm might have a different view on this.

Thank you for your take on this!

Do we really need to extend the IR?

My gut feeling is that we don't. I think we can ignore the scalable vs fixed vector types, and just consider the pure task of deciding whether we need to perform VLP predication vs DVL predication. Consider the following sequence.

%ret1 = <{scalable }n x ret_type> INST1(<{scalable }n x type1> ..., <{scalable }n x type2> , ...,)
%ret2 = <{scalable }n x ret_type> INST2(<{scalable }n x type1> ..., <{scalable }n x type2> , ...,)
%ret3 = <{scalable }n x ret_type> INST3(<{scalable }n x type1> ..., <{scalable }n x type2> , ...,)
%ret4 = <{scalable }n x ret_type> INST4(<{scalable }n x type1> ..., <{scalable }n x type2> , ...,)

It should be fairly easy to detect whether any SELECT instructions used on the input parameters or return values of the sequence use VLP predication or DVL predication. At Arm, we use this pattern matching mechanism to lower selects and unpredicated IR instructions into predicated SVE instructions. As far as I know, there are specific cases that might require us to switch to a more sophisticated method (like native predication support in IR), but the pattern matching mechanism on select has allowed us to cover the majority of the cases that we see.
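A sketch of the select pattern being matched, in the thread's own pseudo-IR notation (the SVE mnemonic in the comment is for illustration only):

```llvm
; an unpredicated IR operation followed by a select on the predicate ...
%sum = fadd <{scalable }n x double> %a, %b
%ret = select <{scalable }n x i1> %p, <{scalable }n x double> %sum,
                                      <{scalable }n x double> %passthru

; ... can be pattern-matched in ISel into a single predicated (merging)
; SVE instruction, e.g.:
;   FADD  Z0.D, P0/M, Z0.D, Z1.D
```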

This does not work for general function calls or any instructions with side effects (e.g. memory accesses), which is why there are already a bunch of intrinsics in the llvm.masked namespace (which this RFC hopes to supersede).
For the reasons brought up earlier (e.g. by @rkruppe) this is a non-starter for Dynamic Vector Length targets.

Ideally you'd want these intrinsics for all code, yes, but

  1. since backends don't dictate the IR pass pipeline it will be fragile/impossible to guarantee your pass for turning full vector operations into intrinsics will be last

Actually, you could use custom legalization in ISelLowering for this. No pass involved.

Oh, I misunderstood you, sorry. If you meant lowering "add <n x i32>" to "dvl.add with dvl = MAX" that makes sense and it's basically what I'll be doing in RISC-V too (though I think I can just use patterns directly, no custom lowering code required). However, that still produces an inefficient full-width operation that isn't always necessary and fixing that needs some analysis.

  1. there are operations not visible in the IR (such as register copies) for which you'll probably also need this sort of analysis

Fair enough. Would it be possible to simply extend the %dvl of the defining operation to the newly created register? (instead of re-running a full-fledged analysis).

At MIR level, using the semantics of RISC-V instructions, that is not generally correct: uses of the copied register can run with a different VL and therefore use lanes that wouldn't be copied by this approach.

I think for me it boils down to this: without undef-on-excess there is no obvious way for programmers to specify that the result of an operation does not matter beyond the excess lanes and instead they are at the mercy of some clever analysis in the backend. HPC programmers dislike this sort of thing. A good example for this is register allocation/spilling where developers go a long way to guide the spilling heuristics in the right direction by placing branch probabilities and the like.

This is a good point. There's a bit more subtlety to it (users don't write LLVM IR themselves, we could pick a multi-instruction canonical form for making excess lanes undef and lower the code they write to that) but overall I agree that predictable optimizations are very important and since "don't care about excess lanes" is by far the more common choice we should optimize for representing that intent more naturally & reliably.

%dvl = <complex integer arithmetic>
%dvl2 = <some more complex integer arithmetic>

%R = llvm.evl.fadd.v512f32(%a, %b, %m, 16)
%userOne = ... evl.fma.v512f32(%R, ..., %dvl)
%userTwo = ... evl.fma.v512f32(%R, ..., %dvl2)

With undef-on-excess, operations %userOne and %userTwo can be pruned to a width of 16.
Without looking any further at users of %userOne and %userTwo, this could be legalized into

%R = avx512.fadd.v16f32(%a, %b, %m)
%userOne = avx512.fma.v16f32(%R, ...)
%userTwo = avx512.fma.v16f32(%R, ...)

That's hard to do with maskedout-on-excess since it would all depend on %dvl and %dvl2, and there would be no simple way for users to convey the information "don't care about %R beyond 16" in IR.

Thank you for the example, it makes sense; I guess I am too focused on the dynamic-MVL case. (I would not call this a "demanded"-style pass though, it just propagates the vector length forward and uses undef-excess-lanes to justify not having to compare 16 with %dvl and %dvl2.)

simoll added a comment.Nov 3 2018, 2:50 PM
  1. there are operations not visible in the IR (such as register copies) for which you'll probably also need this sort of analysis

Fair enough. Would it be possible to simply extend the %dvl of the defining operation to the newly created register? (instead of re-running a full-fledged analysis).

At MIR level, using the semantics of RISC-V instructions, that is not generally correct: uses of the copied register can run with a different VL and therefore use lanes that wouldn't be copied by this approach.

Well, if you generate RISC-V instructions starting from EVL intrinsics then undef-on-excess still holds. So, excess lanes should be fair game for spilling. My hope is that %dvl could be annotated on MIR level like divergence is in the AMDGPU backend today. If the annotation is missing, you'd spill the full register.


Yeah there's ways to pass on this information through MIR. Defining the RVV MachineInsts differently than the architecture defines the corresponding instructions isn't a good way in my opinion, but metadata on the instructions might work well. In any case this has drifted away from being directly relevant to this RFC. Thank you for the interesting discussion and once again for creating this RFC!

Today I took a stab at changing my RVV patches to use these intrinsics and that basically went well, affirming my belief that these intrinsics are a good fit for RISC-V vectors. I stashed those changes for now rather than continuing to build on them because currently I can't match them with plain old isel patterns so I'd have to write annoying and error-prone custom lowering. That should be a temporary issue, partly due to how I don't really handle predication at the moment, partly due to a surprising extra argument on loads and stores (see inline comment).

FYI I noticed the argument numbers for the new attributes don't match the actual parameters in many cases (they often seem to be off by one). No big deal, just something to keep in mind for when the RFC goes through and the patch gets submitted for real.

include/llvm/IR/Intrinsics.td
928

One of these i32 arguments is the dynamic_vl argument, what's the other? Alignment?

simoll updated this revision to Diff 172959.Nov 7 2018, 8:03 AM
simoll retitled this revision from RFC: Dynamic Vector Length Intrinsics and Attributes to RFC: Explicit Vector Length Intrinsics and Attributes.
simoll edited the summary of this revision. (Show Details)
Changes
  • dynamic_vl -> vlen.
  • unmasked_ret -> maskedout_ret.
  • DVL -> EVL (Explicit Vector Length).
  • Added llvm.evl.compose(%A, %B, %pivot, %mvl) intrinsic (select on lane pivot).
simoll added a comment.Nov 7 2018, 8:40 AM

Today I took a stab at changing my RVV patches to use these intrinsics and that basically went well, affirming my belief that these intrinsics are a good fit for RISC-V vectors. I stashed those changes for now rather than continuing to build on them because currently I can't match them with plain old isel patterns so I'd have to write annoying and error-prone custom lowering. That should be a temporary issue, partly due to how I don't really handle predication at the moment, partly due to a surprising extra argument on loads and stores (see inline comment).

That's great news! Thanks for trying it out. Speaking of ISel, there should probably be one new ISD node type per EVL intrinsic.

FYI I noticed the argument numbers for the new attributes don't match the actual parameters in many cases (they often seem to be off by one). No big deal, just something to keep in mind for when the RFC goes through and the patch gets submitted for real.

The patch in this RFC is a showcase version to discuss the general concept (and sort out bike shedding issues). The actual patches will be cleaner.

include/llvm/IR/Intrinsics.td
928

Yep. That's alignment as in llvm.masked.store.

rkruppe added inline comments.Nov 7 2018, 9:18 AM
include/llvm/IR/Intrinsics.td
928

Ah, I forgot that the existing masked intrinsics also take the alignment as a normal parameter. I think new intrinsics shouldn't follow that precedent; nowadays we have the align attribute for call sites (already used e.g. by llvm.memcpy), so the alignment information can be supplied like this:

call void @llvm.evl.store(<4 x i32> %v, <4 x i32>* align 16 %p, ...)

This ensures that the alignment is a compile time constant, and during instruction selection and later stages, it should be stored in the MachineMemOperand, not be an extra operand (that's the part that caused me trouble in my experiments).

simoll marked 3 inline comments as done.Nov 9 2018, 5:15 AM
simoll added inline comments.
include/llvm/IR/Intrinsics.td
928

Ok. I'll drop the alignment arguments in the next update.