This is a proposal to add vector intrinsics and function attributes to LLVM IR to better support predicated vector code, including targets with a dynamic vector length (RISC-V V, NEC SX-Aurora).

The attributes are designed to simplify automatic vectorization and optimization of predicated data flow. Non-predicating SIMD architectures should benefit from these changes as well through a common legalization scheme (e.g. the lowering of fdiv in predicated contexts).

This is a follow-up on my tech talk at last week's LLVM DevMtg, "Stories from RV..." (https://llvm.org/devmtg/2018-10/talk-abstracts.html#talk22), and the subsequent discussions at the round table.

### Rationale

LLVM IR does not support predicated execution as a first-order concept. Instead there is a growing body of intrinsics (`llvm.masked.*`) and workarounds (`select` for arithmetic, VectorABI for general function calls), which encode or at least emulate predication in their respective contexts. The discussions and patches for LLVM-SVE show that there is a need to accommodate architectures with a dynamic vector length (RISC-V V extension, NEC SX-Aurora TSUBASA).

This RFC provides a coherent set of intrinsics and attributes that enable predication through bit masks and EVL in LLVM IR.

### Proposed changes

#### Intrinsics

We propose to add a new set of intrinsics under the `llvm.evl.*` prefix. After the change, it will include the following operations:

- all standard binary operators (Add, FAdd, Sub, FSub, Mul, FMul, UDiv, SDiv, FDiv, URem, SRem, FRem)
- shift and bitwise operators (Shl, LShr, AShr, And, Or, Xor)
- experimental reductions (fadd, fmul, add, mul, and, or, xor, smax, smin, umax, umin, fmax, fmin)
- ICmp, FCmp
- Select
- all of the `llvm.masked.*` namespace (load, store, gather, scatter, expandload, compressstore)

All of the intrinsics in the `llvm.evl` namespace take two predicating parameters: a mask of bit-vector type (e.g. `<8 x i1>`) and a dynamic vector length value (`i32`).

#### Attributes

We propose three new attributes for function parameters:

- `mask`: this parameter encodes the predicate of the operation. Inputs on disabled (masked-out) lanes must not affect enabled result lanes in any way.
- `vlen`: this parameter encodes the explicit vector length (VL) of the operation. The operation does not apply to lanes at or beyond this value; the result for lanes `>= vlen` is `undef`.
- `maskedout_ret`: this parameter provides the return value for masked-out lanes (within the vector length).

We show the semantics in the example below.

The attributes are intended for general use in IR functions, not just the EVL intrinsics.

#### An example

Let the predicated fdiv have the following signature:

```llvm
<4 x double> @llvm.evl.fdiv.v4f64(<4 x double> maskedout_ret %a, <4 x double> %b, <4 x i1> mask %mask, i32 vlen %dynamic)
```

Consider this application of fdiv:

```llvm
llvm.evl.fdiv.v4f64(<4 x double> <4.2, 6.0, 1.0, 1.0>, <4 x double> <0.0, 3.0, nan, 0.0>, <4 x i1> <0, 1, 1, 1>, i32 2)
  == <4.2, 2.0, undef, undef>
```

The first `%mask` bit is '0', so the operation does not execute on the first lane. Yet, since the first parameter `%a` carries the `maskedout_ret` attribute, the result on the first lane is the value of `%a` at that lane.

The second `%mask` bit is '1' and so the result on the second lane is just `6.0 / 3.0`.

The last two lanes are beyond the dynamic vector length `%vlen` and so their results are *undef* regardless of `maskedout_ret`.

Note that the outcome of the first and last two lanes can be deduced from the new attributes alone, without knowing that this is an `fdiv` operation.

This can be used to implement general predicate analyses and optimizations.
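As a reference model, the semantics described above can be sketched as a scalar simulation in Python (the names `evl_fdiv` and `UNDEF` are illustrative, not part of the proposal):

```python
UNDEF = object()  # placeholder for LLVM's undef

def evl_fdiv(a, b, mask, vlen):
    """Scalar reference model of a predicated fdiv with maskedout_ret on %a:
    disabled lanes return a's value, lanes >= vlen are undef."""
    result = []
    for i in range(len(a)):
        if i >= vlen:
            result.append(UNDEF)        # beyond the explicit vector length
        elif not mask[i]:
            result.append(a[i])         # masked-out lane: maskedout_ret value
        else:
            result.append(a[i] / b[i])  # enabled lane: ordinary fdiv
    return result

# Reproduces the worked example from the text; note that the trapping
# division on the first lane (4.2 / 0.0) is never executed:
r = evl_fdiv([4.2, 6.0, 1.0, 1.0], [0.0, 3.0, float("nan"), 0.0],
             [0, 1, 1, 1], 2)
# r == [4.2, 2.0, UNDEF, UNDEF]
```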

### Lowering

We show possible lowering strategies for the following prototypical SIMD ISAs:

###### LLVM-SVE with predication and dynamic vector length (RISC-V V extension, NEC SX-Aurora)

For these targets, the intrinsics map directly to the ISA.

###### Lowering for targets w/o dynamic vector length (AVX512, ARM SVE, ..)

ARM SVE does not feature a dynamic vector length register.

Hence, the vector length needs to be promoted into the bit-mask predicate, shown here for an LLVM-SVE target:

Block before legalization:

```llvm
.. foo (..., %mask, %dynamic_vl) ...
```

After legalization:

```llvm
%vscale32 = call i32 @llvm.experimental.vector.vscale.32()
...
%stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
; %vl.splat = %dynamic_vl broadcast to every lane
%vl_mask = icmp ult <scalable 4 x i32> %stepvector, %vl.splat
%new_mask = and <scalable 4 x i1> %mask, %vl_mask
... foo (..., <scalable 4 x i1> %new_mask, i32 %vscale32) ...
```
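In scalar terms, this legalization folds the explicit vector length into the predicate: a lane stays enabled only if its index is below the vector length. A minimal Python sketch (the function name is illustrative):

```python
def promote_vlen_to_mask(mask, vlen):
    """Fold a dynamic vector length into a bit mask:
    lane i survives only if mask[i] is set and i < vlen."""
    return [bool(m) and (i < vlen) for i, m in enumerate(mask)]

# With a full mask and vlen 2, only the first two lanes stay enabled:
promote_vlen_to_mask([1, 1, 1, 1], 2)  # -> [True, True, False, False]
```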

###### Lowering for fixed-width SIMD w/o predication (SSE, NEON, AdvSimd, ..)

Scalarization and/or speculation on a full predicate.
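For such targets, a masked fdiv can be scalarized so that the (potentially trapping) division only executes on enabled lanes. A minimal Python sketch of the resulting semantics (the function name is illustrative):

```python
def scalarize_masked_fdiv(a, b, mask):
    """Scalarized lowering of a masked fdiv with maskedout_ret on a:
    guard each lane's division individually so disabled lanes never trap."""
    result = list(a)  # disabled lanes keep the maskedout_ret value
    for i, m in enumerate(mask):
        if m:  # per-lane guard; a real lowering emits a branch per lane
            result[i] = a[i] / b[i]
    return result

# The division by zero on the disabled second lane is never executed:
scalarize_masked_fdiv([42.0, 42.0], [2.0, 0.0], [1, 0])  # -> [21.0, 42.0]
```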

#### Example 1: safe fdiv

```c
void foo(double * A, double * B, int n) {
  #pragma omp simd simdlen(8)
  for (int i = 0; i < n; ++i) {
    double a = A[i];
    double r = a;
    if (a > 0.0) {
      r = 42.0 / a;
    }
    B[i] = r;
  }
}
```

```llvm
declare <8 x double> @llvm.evl.fdiv.v8f64(<8 x double> maskedout_ret %a, <8 x double> %b, <8 x i1> mask %mask, i32 vlen %length)
```

```llvm
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %0 = getelementptr inbounds double, double* %A, i64 %index
  %1 = bitcast double* %0 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %1, align 8, !tbaa !2
  %2 = fcmp ogt <8 x double> %wide.load, zeroinitializer
  ; variant that LV generates today:
  ; %3 = fdiv <8 x double> <double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01>, %wide.load
  ; %4 = select <8 x i1> %2, <8 x double> %3, <8 x double> %wide.load
  ; using EVL:
  %4 = call <8 x double> @llvm.evl.fdiv.v8f64(<8 x double> <double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01>, <8 x double> %wide.load, <8 x i1> %2, i32 8)
  %5 = getelementptr inbounds double, double* %B, i64 %index
  %6 = bitcast double* %5 to <8 x double>*
  store <8 x double> %4, <8 x double>* %6, align 8, !tbaa !2
  %index.next = add i64 %index, 8
  %7 = icmp eq i64 %index.next, %n.vec
  br i1 %7, label %middle.block, label %vector.body, !llvm.loop !6
```

### Pros & Cons

###### Pros

The generality of the intrinsics simplifies the job of the vectorizer's widening phase (speaking for RV; this should apply to LV/VPlan as well): scalar instruction opcodes only need to be mapped to their respective evl intrinsic names, and the mask and vlen are passed via the annotated arguments.
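The opcode-to-intrinsic mapping the widening phase needs is essentially a table lookup; a Python sketch with a hypothetical (deliberately incomplete) table:

```python
# Hypothetical scalar-opcode -> EVL intrinsic table a vectorizer might keep;
# the full table would cover every opcode listed in this proposal.
EVL_INTRINSIC = {
    "fadd": "llvm.evl.fadd",
    "fsub": "llvm.evl.fsub",
    "fmul": "llvm.evl.fmul",
    "fdiv": "llvm.evl.fdiv",
}

def widen(opcode, width):
    """Map a scalar opcode to its EVL intrinsic name for a given
    f64 vector width; mask and vlen are then passed as arguments."""
    return f"{EVL_INTRINSIC[opcode]}.v{width}f64"

widen("fdiv", 8)  # -> "llvm.evl.fdiv.v8f64"
```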

Regarding the `evl` intrinsics (instead of extending the IR):

- The new predication scheme is completely optional and does not interfere with LLVM's vector instructions at all.
- Existing backends can use a generic lowering scheme from `evl` to "classic" vector instructions.
- Likewise, lifting passes can convert classic vector instructions to the new intrinsics if deemed beneficial for backend implementation (NEC SX-Aurora, RISC-V V(?), ..).

Marking the mask and the vlen parameters with attributes has the following advantages:

- Analyses and optimizations understand the flow of predicates from a quick glance at the functions' attributes; no further knowledge about the functions' internals is required.
- Dynamic vlen and the vector mask may be treated specially in the target's calling convention (e.g. by passing the dynamic vlen in a VL register, or the active mask in a dedicated register (AMDGPU(?))).
- Legalization does not have to know the nature of the intrinsic to legalize dynamic vlen where it is not supported.

###### Cons

- Intrinsic bloat.
- Predicating architectures without a dynamic vector length have to pass in a redundant `vlen` to exploit these intrinsics.
- LLVM optimizations that need to understand the intrinsics' semantics, such as `InstCombine`, have to be taught about the evl intrinsics before they can optimize them. This will require at least some engineering effort.

### Alternatives considered

###### Piggybacking

This means extending the current vector instructions to feature a predicate and a dynamic vector length in some way, both of which would be optional.

One approach to achieve this is a direct extension of the existing instructions. Decorating instructions with an extended OperandBundle scheme should work as well.