As Vector Predication intrinsics are being introduced in LLVM, we propose extending the Loop Vectorizer to target these intrinsics. SIMD ISAs with active vector length predication support, such as the RISC-V V-extension, NEC SX-Aurora, and Power VSX, can especially benefit from this, since there is currently no reasonable way in the IR to model the active vector length of vector instructions.
ISAs with masked vector predication support, such as AVX512 and ARM SVE, would benefit by being able to predicate operations beyond just the memory operations currently covered by the masked load/store/gather/scatter intrinsics.
This patch is a proof-of-concept implementation that demonstrates the Loop Vectorizer generating VP intrinsics for simple integer operations on fixed-width vectors.
Details and Strategy
Currently the Loop Vectorizer supports vector predication in a very limited capacity, via tail-folding and the masked load/store/gather/scatter intrinsics. However, this does not let architectures with active vector length predication take advantage of their capabilities, and architectures with general masked predication can apply predication only to memory operations. By giving the Loop Vectorizer a way to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, these architectures can make better use of their predication capabilities.
Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV, but instead of generating masked intrinsics for memory operations it generates VP intrinsics for all arithmetic operations as well.
The other important part of this approach is how the Explicit Vector Length is computed. (We use "active vector length" and "explicit vector length" interchangeably; the VP intrinsics define this vector length parameter as the Explicit Vector Length (EVL).) We consider the following three ways to compute the EVL parameter for the VP intrinsics.
- The simplest way is to use the VF as EVL and rely solely on the mask parameter to control predication. The mask parameter is the same as computed for current tail-folding implementation.
- The second way is to insert instructions to compute min(VF, trip_count - index) for each vector iteration.
- For architectures like RISC-V, which have a special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic, set_vector_length, that can be lowered to architecture-specific instruction(s) to compute the EVL.
For the last two ways, if there is no outer mask, we use an all-true boolean vector for the mask parameter of the VP intrinsics. (We do not yet support control flow in the loop body.)
We have also extended VPlan to add new recipes for PREDICATED-WIDENING of arithmetic operations and memory operations, and a recipe to emit instructions for computing EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.
Alternate vectorization strategies with predication
Besides the tail-folding based vectorization strategy, we are considering two other vectorization strategies (not yet implemented):
- Non-predicated body followed by a predicated vectorized tail - This generates a vector body without any predication (except for control flow), the same as the existing approach of a vector body with a scalar tail loop. The tail, however, is vectorized using the VP intrinsics with EVL = trip_count % VF. While this approach results in larger code size, it might be more efficient than our currently implemented approach: the tail is straight-line code, and the vector body is free of all the overhead of using the intrinsics.
- Another strategy could be to use tail-folding based approach but use predication only for memory operations. This might be beneficial for architectures like Power VSX that support vector length predication only for memory operations.
Caveats / Current limitations / Current status of things
This patch is far from complete; it is meant as a proof of concept (and will be broken into smaller, more concrete patches once we have more feedback from the community) with the aim to:
- demonstrate the feasibility of the Loop Vectorizer to target VP intrinsics.
- start a deeper implementation-backed discussion around vector predication support in LLVM.
That being said, there are several limitations at the moment; some need more supporting implementation and some need more discussion:
- For the purpose of demonstration, VP intrinsic support is forced via a command line switch, and it requires tail-folding to be enabled in order to work.
- VP intrinsic development is going on in parallel; upstream currently supports only the integer arithmetic intrinsics.
- No support for control flow in the loop.
- No support for interleaving.
- We need more discussion around the best approach for computing the EVL parameter. If an intrinsic is used, more thought needs to go into its semantics. Also, the VPlan recipe for the EVL is currently a dummy recipe, with the widening delegated to the vectorizer.
- We also do not use the get.active.lane.mask intrinsic yet, but it is something we consider for the future.
- No support for scalable vectors yet (due to missing support for tail folding with scalable vectors).
Note: If you are interested in how this may work end-to-end for scalable vectors, take a look at our downstream implementation for RISC-V [ RVV-Impl ] and an end-to-end demo on the Godbolt compiler explorer [ Demo ].
Note: This patch also includes our own implementation of the vp_load and vp_store intrinsics. There is currently a more complete patch [ D99355 ] open for review, which we will switch to once it lands.
Tentative Development Roadmap
Our plan is to start by integrating the functionality in this patch, with the changes/enhancements agreed upon by the community. As next steps, we want to:
- Support VP intrinsics for vectorization of scalable vectors (starting with enabling tail folding for scalable vectors, if that is still required by then).
- Support for floating point operations.
- Support for control flow in the loop.
- Support for more complicated loops - reductions, inductions, recurrences, reverse.