This is a proposal to add vector intrinsics and function attributes to LLVM IR to better support predicated vector code, including targets with a dynamic vector length (RISC-V V, NEC SX-Aurora).

The attributes are designed to simplify automatic vectorization and the optimization of predicated data flow. Non-predicating SIMD architectures should benefit from these changes as well, through a common legalization scheme (e.g. the lowering of fdiv in predicated contexts).

This is a follow-up on my tech talk at the 2018 LLVM DevMtg, "Stories from RV..." (https://llvm.org/devmtg/2018-10/talk-abstracts.html#talk22), and the subsequent discussions at the round table.

### Rationale

LLVM IR does not support predicated execution as a first-class concept. Instead there is a growing body of intrinsics (`llvm.masked.*`) and workarounds (`select` for arithmetic, VectorABI for general function calls) that encode, or at least emulate, predication in their respective contexts. The discussions and patches for LLVM-SVE show that there is also a need to accommodate architectures with a dynamic vector length (RISC-V V extension, NEC SX-Aurora TSUBASA).

This RFC provides a coherent set of intrinsics and attributes that enable predication through bit masks and an explicit vector length (EVL) in LLVM IR.

### Proposed changes

#### Intrinsics

We propose to add a new set of intrinsics under the `llvm.evl.*` prefix. After the change, it will include the following operations:

- all standard binary operators (Add, FAdd, Sub, FSub, Mul, FMul, UDiv, SDiv, FDiv, URem, SRem, FRem)
- bitwise and shift operators (Shl, LShr, AShr, And, Or, Xor)
- experimental reductions (fadd, fmul, add, mul, and, or, xor, smax, smin, umax, umin, fmax, fmin)
- ICmp, FCmp
- Select
- all of the `llvm.masked.*` namespace (load, store, gather, scatter, expandload, compressstore)

All of the intrinsics in the `llvm.evl` namespace take two predicating parameters: a mask of bit vector type (e.g. `<8 x i1>`) and an explicit vector length value (`i32`).
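To illustrate how the two predicating parameters interact, here is a minimal Python model of the intended semantics of a masked reduction. The function name `evl_reduce_add` and the scalar loop are illustrative only, not part of the proposal:

```python
def evl_reduce_add(v, mask, vlen, start=0):
    # Only lanes that are both enabled in the mask and below the
    # explicit vector length contribute to the reduction.
    acc = start
    for i in range(min(vlen, len(v))):
        if mask[i]:
            acc += v[i]
    return acc

evl_reduce_add([1, 2, 3, 4], [True, False, True, True], 3)
# == 4 (only lanes 0 and 2 contribute; lane 3 is beyond vlen)
```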

#### Attributes

We propose three new attributes for function parameters:

- `mask`: this parameter encodes the predicate of the operation. Inputs on masked-off lanes must not affect the enabled result lanes in any way.
- `vlen`: this parameter encodes the explicit vector length (VL) of the operation. The operation does not apply to lanes at or beyond this value; the result for lanes >= vlen is undef.
- `passthru`: lanes of this parameter are returned wherever the same lane in the mask is false. This only applies to lanes below the value of the `vlen` parameter (if there is one). This attribute is intended for use on general IR functions.

We show the semantics in the example below.

The attributes are intended for general use in IR functions, not just the EVL intrinsics.
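As a hedged sketch of how `mask`, `vlen` and `passthru` interact, here is a Python reference model of a predicated binary operation. `UNDEF` and `evl_binop` are illustrative names, not proposed API; the first call mirrors the fdiv example in the next section:

```python
UNDEF = object()  # stand-in for LLVM's undef

def evl_binop(op, a, b, mask, vlen, passthru=None):
    """Lane-wise semantics of a predicated binary operation
    under a mask, an explicit vector length and an optional
    passthru vector."""
    result = []
    for i in range(len(a)):
        if i >= vlen:
            # lanes at or beyond vlen are undef, regardless of passthru
            result.append(UNDEF)
        elif not mask[i]:
            # disabled lanes return passthru if given, otherwise undef
            result.append(passthru[i] if passthru is not None else UNDEF)
        else:
            result.append(op(a[i], b[i]))
    return result

r = evl_binop(lambda x, y: x / y,
              [4.2, 6.0, 1.0, 1.0], [0.0, 3.0, float("nan"), 0.0],
              [0, 1, 1, 1], 2)
# r == [UNDEF, 2.0, UNDEF, UNDEF]

r2 = evl_binop(lambda x, y: x / y,
               [4.2, 6.0], [0.0, 3.0],
               [False, True], 2, passthru=[7.0, 7.0])
# r2 == [7.0, 2.0]
```

Note that the division by zero on the disabled first lane is never executed, which is exactly the "safe fdiv" property exploited in the lowering example further below.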

#### An example

Let the predicated fdiv have the following signature:

```llvm
llvm.evl.fdiv.v4f64(<4 x double> %a, <4 x double> %b, <4 x i1> mask %mask, i32 vlen %dynamic)
```

Consider this application of fdiv:

```llvm
llvm.evl.fdiv.v4f64(<4 x double> <4.2, 6.0, 1.0, 1.0>, <4 x double> <0.0, 3.0, nan, 0.0>, <4 x i1> <0, 1, 1, 1>, i32 2)
    == <undef, 2.0, undef, undef>
```

The first `%mask` bit is '0', so the operation yields undef on the first lane; the division by zero there must not trap.

The second `%mask` bit is '1', so the result on the second lane is simply `6.0 / 3.0`.

The last two lanes lie beyond the explicit vector length (`%dynamic` is 2), so their results are *undef* regardless of any `passthru`.

### Lowering

We show possible lowering strategies for the following prototypical SIMD ISAs:

###### Targets with predication and a dynamic vector length (RISC-V V extension, NEC SX-Aurora)

For these targets, the intrinsics map directly onto the ISA.

###### Lowering for targets w/o dynamic vector length (AVX512, ARM SVE, ..)

ARM SVE does not feature a dynamic vector length register.

Hence, the dynamic vector length needs to be folded into the bit mask predicate, shown here for an LLVM-SVE (scalable vector) target:

Block before legalization:

```llvm
... foo (..., %mask, %dynamic_vl) ...
```

After legalization:

```llvm
%vscale32 = call i32 @llvm.experimental.vector.vscale.32()
...
%stepvector = call <scalable 4 x i32> @llvm.experimental.vector.stepvector.nxv4i32()
%vl_mask = icmp ult <scalable 4 x i32> %stepvector, %dynamic_vl ; %dynamic_vl broadcast to all lanes
%new_mask = and <scalable 4 x i1> %mask, %vl_mask
foo (..., <scalable 4 x i1> %new_mask, i32 %vscale32)
...
```
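In scalar terms, this legalization simply folds the vector length into the mask. A minimal Python sketch of the same transformation (illustrative, not an LLVM API):

```python
def promote_vlen_to_mask(mask, dynamic_vl):
    # A lane remains enabled only if it is enabled in the original
    # mask and its index is below the dynamic vector length.
    return [m and i < dynamic_vl for i, m in enumerate(mask)]

promote_vlen_to_mask([True, True, False, True], 2)
# == [True, True, False, False]
```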

###### Lowering for fixed-width SIMD w/o predication (SSE, NEON, AdvSimd, ..)

Lower to scalarized code, or speculate the whole vector operation on a full predicate where that is safe.
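A sketch of the scalarization fallback for a masked fdiv, written in Python for illustration. Returning the left operand on disabled lanes is one legal refinement of the undef result; the function name is hypothetical:

```python
def scalarize_masked_fdiv(a, b, mask):
    # Execute the division lane by lane, but only where the mask is
    # set, so disabled lanes never trap (e.g. on division by zero).
    result = list(a)  # any value is legal on disabled (undef) lanes
    for i, m in enumerate(mask):
        if m:
            result[i] = a[i] / b[i]
    return result

scalarize_masked_fdiv([42.0, 42.0], [2.0, 0.0], [True, False])
# == [21.0, 42.0] -- the disabled second lane never divides by zero
```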

#### Example 1: safe fdiv

```c
void foo(double * A, double * B, int n) {
  #pragma omp simd simdlen(8)
  for (int i = 0; i < n; ++i) {
    double a = A[i];
    double r = a;
    if (a > 0.0) { r = 42.0 / a; }
    B[i] = r;
  }
}
```

```llvm
<8 x double> @llvm.evl.fdiv.v8f64(<8 x double> %a, <8 x double> %b, <8 x i1> mask %mask, i32 vlen %length)
```

```llvm
vector.body:                         ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %0 = getelementptr inbounds double, double* %A, i64 %index
  %1 = bitcast double* %0 to <8 x double>*
  %wide.load = load <8 x double>, <8 x double>* %1, align 8, !tbaa !2
  %2 = fcmp ogt <8 x double> %wide.load, zeroinitializer
  ; variant that LV generates today:
  ; %3 = fdiv <8 x double> <double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01>, %wide.load
  ; %4 = select <8 x i1> %2, <8 x double> %3, <8 x double> %wide.load
  ; using EVL:
  %4 = call <8 x double> @llvm.evl.fdiv.v8f64(<8 x double> <double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01, double 4.200000e+01>, <8 x double> %wide.load, <8 x i1> %2, i32 8)
  %5 = getelementptr inbounds double, double* %B, i64 %index
  %6 = bitcast double* %5 to <8 x double>*
  store <8 x double> %4, <8 x double>* %6, align 8, !tbaa !2
  %index.next = add i64 %index, 8
  %7 = icmp eq i64 %index.next, %n.vec
  br i1 %7, label %middle.block, label %vector.body, !llvm.loop !6
```

### Pros & Cons

###### Pros

The generality of the intrinsics simplifies the vectorizer's widening phase (speaking for RV; this should apply to LV/VPlan as well): scalar instruction opcodes only need to be mapped to their respective evl intrinsic names, and the mask and vlen are passed to the annotated arguments.

Regarding the `evl` intrinsics (instead of extending the IR):

- The new predication scheme is completely optional and does not interfere with LLVM's existing vector instructions.
- Existing backends can use a generic lowering scheme from `evl` to "classic" vector instructions.
- Likewise, lifting passes can convert classic vector instructions to the new intrinsics if deemed beneficial for the backend implementation (NEC SX-Aurora, RISC-V V(?), ..).

Marking out the mask and the vlen parameters with attributes has the following advantages:

- Analyses and optimizations can understand the flow of predicates from a quick glance at a function's attributes; no further knowledge about the function's internals is required.
- The dynamic vlen and the vector mask may be treated specially in the target's calling convention (e.g. by passing the dynamic vlen in a VL register, or the active mask in a dedicated register (AMDGPU(?))).
- Legalization does not have to know the nature of the intrinsic to legalize the dynamic vlen where it is not supported.

###### Cons

- Intrinsic bloat.
- Predicating architectures without a dynamic vector length have to pass in a redundant `vlen` to exploit these intrinsics.
- LLVM optimizations that need to understand the intrinsics' semantics, such as `InstCombine`, have to be taught about the evl intrinsics before they can optimize them. This will require at least some engineering effort.