This patch introduces support for generating VP (vector predication) intrinsics in the Loop Vectorizer.
Currently the Loop Vectorizer supports vector predication only in a limited capacity, via tail-folding and the masked load/store/gather/scatter intrinsics. However, this does not let architectures with active vector length predication take advantage of their capabilities, and architectures with general masked predication support can only apply predication to memory operations. By giving the Loop Vectorizer a way to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, these architectures can make better use of their predication capabilities.
Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV: instead of generating masked intrinsics for memory operations, it generates VP intrinsics for load/store instructions.
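To illustrate the difference, here is a rough sketch of a predicated load in both forms, assuming <vscale x 4 x i32> vectors (%addr, %mask, and %evl are illustrative names; the VP form carries an explicit vector length operand in addition to the mask):

```
; Current tail folding: predication is expressed through the mask alone.
%v0 = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0(ptr %addr, i32 4, <vscale x 4 x i1> %mask, <vscale x 4 x i32> poison)

; With this patch: a VP intrinsic carrying both a mask and an EVL operand.
%v1 = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %addr, <vscale x 4 x i1> %mask, i32 %evl)
```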
Another important part of this approach is how the Explicit Vector Length is computed. (We use "active vector length" and "explicit vector length" interchangeably; the VP intrinsics define this vector length parameter as the Explicit Vector Length (EVL).) We consider the following three ways to compute the EVL parameter for the VP intrinsics; the last two are sketched in IR after the list.
- The simplest way is to use the VF as the EVL and rely solely on the mask parameter to control predication. The mask parameter is the same one computed by the current tail-folding implementation.
- The second way is to insert instructions that compute min(VF, trip_count - index) for each vector iteration.
- For architectures like RISC-V, which have a special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic, get_vector_length, that can be lowered to architecture-specific instruction(s) to compute the EVL.
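As a rough sketch (assuming a base VF of 4 with scalable vectors; all names are illustrative), the last two options could materialize as:

```
; Option 2: explicitly compute min(VF, trip_count - index).
%remaining = sub i64 %trip.count, %index
%vscale = call i64 @llvm.vscale.i64()
%vf = shl i64 %vscale, 2                ; VF = vscale x 4
%cmp = icmp ult i64 %remaining, %vf
%min = select i1 %cmp, i64 %remaining, i64 %vf
%evl.min = trunc i64 %min to i32        ; the EVL operand of the VP intrinsics is i32

; Option 3: let the target compute the EVL via the experimental intrinsic
; (vf = 4, scalable = true).
%evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %remaining, i32 4, i1 true)
```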
We also added a new recipe to emit the instructions for computing the EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives. A sketch of the resulting loop structure follows.
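Putting the pieces together, for a simple copy loop the vectorized body would look roughly like this (a simplified sketch using the get_vector_length flavor; with EVL-based predication %mask can be all-true, or the tail-folding mask when further masking is needed):

```
vector.body:
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %remaining = sub i64 %trip.count, %index
  %evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %remaining, i32 4, i1 true)
  %src.gep = getelementptr inbounds i32, ptr %src, i64 %index
  %v = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %src.gep, <vscale x 4 x i1> %mask, i32 %evl)
  %dst.gep = getelementptr inbounds i32, ptr %dst, i64 %index
  call void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32> %v, ptr %dst.gep, <vscale x 4 x i1> %mask, i32 %evl)
  %evl.zext = zext i32 %evl to i64
  %index.next = add nuw i64 %index, %evl.zext
  %done = icmp eq i64 %index.next, %trip.count
  br i1 %done, label %exit, label %vector.body
```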
Tentative Development Roadmap
- Use VP intrinsics for all possible vector operations, not just memory operations (an IR sketch follows this list). That work has two possible implementations:
- Introduce a new pass which transforms emitted vector instructions into VP intrinsics if the loop was transformed to use predication for loads/stores. The advantage of this approach is that it does not require many changes in the loop vectorizer itself. The disadvantage is that it may require copying some existing functionality from the loop vectorizer into a separate pass, maintaining similar code in different passes, and performing the same analysis at least twice.
- Extend the Loop Vectorizer using VectorBuilder and make it emit VP intrinsics automatically in the presence of an EVL value. The advantage is that it does not require a separate pass, which may reduce compile time, and we can avoid code duplication. It requires some extra work in the LoopVectorizer to add VectorBuilder support and smart emission of vector instructions/VP intrinsics. Also, fully supporting the Loop Vectorizer will require adding a new PHI recipe to handle the EVL from the previous iteration, plus extending several existing recipes with new operands (depending on the design).
- Switch to VP intrinsics for memory operations for both VLS (vector-length-specific) and VLA (vector-length-agnostic) vectorization.
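As an example of the first roadmap item, an ordinary vector add in the loop body would become a VP intrinsic carrying the same mask and EVL as the surrounding memory operations (a sketch; operand names are illustrative):

```
; Plain IR emitted today.
%sum = add <vscale x 4 x i32> %a, %b

; VP form carrying the loop's mask and EVL.
%sum.vp = call <vscale x 4 x i32> @llvm.vp.add.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, <vscale x 4 x i1> %mask, i32 %evl)
```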
Part of me wants to say this should be usesCustomActiveVectorLengthIntrinsic, following the pattern of "the target hasActiveVectorLength" and "the target supportsScalableVectors".
I'm not arguing very strongly, though, because we already have preferPredicatedReductionSelect. I know, it's the epitome of bikeshedding. I don't know if it's any better if it's self-aware.