
[LV, VP] RFC: VP intrinsics support for the Loop Vectorizer (Proof-of-Concept)
Needs Review · Public

Authored by vkmr on Apr 1 2021, 10:32 AM.




As Vector Predication intrinsics are being introduced in LLVM, we propose extending the Loop Vectorizer to target these intrinsics. SIMD ISAs such as the RISC-V V-extension, NEC SX-Aurora, and Power VSX, which support active vector length predication, can especially benefit from this, since there is currently no reasonable way in the IR to model an active vector length in vector instructions.

ISAs such as AVX512 and ARM SVE with masked vector predication support would benefit by being able to use predicated operations other than just memory operations (via masked load/store/gather/scatter intrinsics).

This patch shows a proof of concept implementation that demonstrates Loop Vectorizer generating VP intrinsics for simple integer operations on fixed vectors.
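As a mental model of what such an intrinsic computes, a VP operation can be pictured with a scalar Python sketch (this is only an illustration, not the LangRef semantics - in LLVM the disabled lanes are undefined, which we model here as `None`):

```python
def vp_add(a, b, mask, evl):
    """Scalar model of an EVL- and mask-predicated vector add (cf. llvm.vp.add).

    A lane is enabled iff its index is below evl AND its mask bit is set;
    disabled lanes are modeled as None (LLVM leaves their value undefined).
    """
    return [x + y if i < evl and mask[i] else None
            for i, (x, y) in enumerate(zip(a, b))]
```

For example, `vp_add([1, 2, 3, 4], [10, 20, 30, 40], [True] * 4, evl=2)` computes only the first two lanes.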

Details and Strategy

Currently the Loop Vectorizer supports vector predication only in a very limited capacity, via tail-folding and the masked load/store/gather/scatter intrinsics. However, this does not let architectures with active vector length predication support take advantage of their capabilities, and architectures with general masked predication support can only apply predication to memory operations. By having a way for the Loop Vectorizer to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, these architectures can make better use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV, but instead of generating masked intrinsics only for memory operations, it generates VP intrinsics for the arithmetic operations as well.

The other important part of this approach is how the Explicit Vector Length is computed. (We use "active vector length" and "explicit vector length" interchangeably; VP intrinsics define this vector length parameter as the Explicit Vector Length (EVL).) We consider the following three ways to compute the EVL parameter for the VP intrinsics.

  • The simplest way is to use the VF as the EVL and rely solely on the mask parameter to control predication. The mask parameter is the same as computed by the current tail-folding implementation.
  • The second way is to insert instructions that compute min(VF, trip_count - index) for each vector iteration.
  • For architectures like RISC-V, which have a special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic, set_vector_length, that can be lowered to architecture-specific instruction(s) to compute the EVL.

For the last two ways, if there is no outer mask, we use an all-true boolean vector for the mask parameter of the VP intrinsics. (We do not yet support control flow in the loop body.)
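The second scheme can be illustrated with a scalar Python sketch of the resulting strip-mined loop (the VF value and all names here are illustrative, not taken from the patch):

```python
# Assumed vectorization factor for this sketch.
VF = 4

def strip_mined_add(a, b):
    """Sketch of tail-folded vectorization where each vector iteration
    computes EVL = min(VF, trip_count - index) and passes an all-true
    mask (no control flow in the loop body)."""
    n = len(a)
    out = [0] * n
    index = 0
    while index < n:
        evl = min(VF, n - index)   # the second EVL computation scheme
        mask = [True] * VF         # all-true mask; the EVL alone folds the tail
        for lane in range(evl):    # stands in for a single VP add of width VF
            if mask[lane]:
                out[index + lane] = a[index + lane] + b[index + lane]
        index += VF
    return out
```

On a trip count of 7 with VF = 4, the loop runs two vector iterations with EVL values 4 and 3, and no scalar tail loop is needed.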

We have also extended VPlan to add new recipes for PREDICATED-WIDENING of arithmetic operations and memory operations, and a recipe to emit instructions for computing EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.

Alternate vectorization strategies with predication

Other than the tail-folding based vectorization strategy, we are considering two other vectorization strategies (not implemented yet):

  1. Non-predicated body followed by a predicated vectorized tail - This will generate a vector body without any predication (except control flow), the same as for the existing approach of a vector body with a scalar tail loop. The tail, however, will be vectorized using the VP intrinsics with EVL = trip_count % VF. While this approach will result in larger code size, it might be more efficient than our currently implemented approach; it has straight-line code for the tail, and the vector body is free of all the overhead of using intrinsics.
  2. Another strategy could be to use tail-folding based approach but use predication only for memory operations. This might be beneficial for architectures like Power VSX that support vector length predication only for memory operations.
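Strategy 1 can be sketched in scalar Python as follows (the VF value and names are illustrative; the real tail would be a single VP operation, modeled here by the per-lane loop):

```python
# Assumed vectorization factor for this sketch.
VF = 4

def body_plus_predicated_tail(a, b):
    """Sketch of strategy 1: an unpredicated vector body followed by a
    single predicated vector tail with EVL = trip_count % VF."""
    n = len(a)
    out = [0] * n
    main = n - n % VF
    # Unpredicated vector body: full-width operations, no mask/EVL overhead.
    for index in range(0, main, VF):
        for lane in range(VF):
            out[index + lane] = a[index + lane] + b[index + lane]
    # Straight-line predicated tail: one VP operation with EVL = n % VF.
    evl = n % VF
    for lane in range(evl):
        out[main + lane] = a[main + lane] + b[main + lane]
    return out
```

When the trip count is a multiple of VF, the EVL is zero and the tail does no work, matching the "larger code size, but cheaper body" trade-off described above.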

Caveats / Current limitations / Current status of things

This patch is far from complete and is simply meant to be a proof of concept (it will also be broken into smaller, more concrete patches once we have more feedback from the community), with the aim to:

  • demonstrate the feasibility of having the Loop Vectorizer target VP intrinsics.
  • start a deeper implementation-backed discussion around vector predication support in LLVM.

That being said, there are several limitations at the moment; some need more supporting implementation and some need more discussion:

  • For the purpose of demonstration, we use a command-line switch to force VP intrinsic support; it requires tail-folding to be enabled to work.
  • VP intrinsic development is going on in parallel; currently only the integer arithmetic intrinsics are supported upstream.
  • No support for control flow in the loop.
  • No support for interleaving.
  • We need more discussion around the best approach for computing the EVL parameter. If using an intrinsic, more thought needs to go into its semantics. Also, the VPlan recipe for the EVL is sort of a dummy recipe, with widening delegated to the vectorizer.
  • We also do not use the active.vector.lane.mask intrinsic yet, but it is something we consider for the future.
  • No support for scalable vectors yet (due to missing support for tail-folding for scalable vectors).
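For reference, the lane mask that the active lane mask intrinsic mentioned above produces can be modeled as a simple per-lane comparison (a simplified sketch that elides the unsigned-overflow handling of the real intrinsic):

```python
def active_lane_mask(index, trip_count, vf):
    """Model of an active-lane-mask computation: lane i is active
    iff index + i < trip_count (unsigned-overflow handling elided)."""
    return [index + i < trip_count for i in range(vf)]
```

For example, at the final iteration of a trip count of 10 with VF = 4, only the first two lanes remain active.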

Note: If you are interested in how it may work end-to-end for scalable vectors,
do take a look at our downstream implementation for RISC-V [ RVV-Impl ] and an
end-to-end demo on the Godbolt Compiler Explorer [ Demo ].

Note: This patch also includes our implementation of vp_load and vp_store
intrinsics. There is currently a more complete patch [ D99355 ] open for review,
which we will merge when it lands.

Tentative Development Roadmap

Our plan is to start by integrating the functionality in this patch, with changes/enhancements agreed upon by the community. As next steps, we want to:

  • Support VP intrinsics for vectorization of scalable vectors (starting with enabling tail-folding for scalable vectors, if still required by then).
  • Support for floating point operations.
  • Support for control flow in the loop.
  • Support for more complicated loops - reductions, inductions, recurrences, reverse.

[ RVV-Impl ]:
[ Demo ]:
[ D99355 ]:

Diff Detail

Event Timeline

vkmr updated this revision to Diff 334878.Apr 1 2021, 6:29 PM

Merged more recent changes on main.

craig.topper added inline comments.Apr 7 2021, 2:39 PM

I think you can use Builder.getInt32Ty


Same here


Drop curly braces since it's only 1 line


Drop curly braces


intrinisc -> intrinsic

khchen added a subscriber: khchen.Apr 8 2021, 9:25 AM
frasercrmck added inline comments.Apr 12 2021, 2:10 AM

This method doesn't seem used by this patch.


Does this need to be a VectorType? Can't you pass StoredVal->getType() straight through to CreateIntrinsic?


Is this comment out of place?


Unnecessary parens around this statement.

This LGTM on a conceptual level from the standpoint of SX-Aurora.
I guess @fhahn would be the right person to weigh in on the VPlan changes.
When it comes to the actual patches for this RFC, it'd help to upstream the VectorBuilder class. It allows you to emit VP code without having to juggle with VP intrinsics.

vkmr updated this revision to Diff 337792.Apr 15 2021, 9:33 AM

Addressed review comments by @craig.topper and @frasercrmck.

vkmr marked 9 inline comments as done.Apr 15 2021, 9:37 AM

Overall, at a high-level, this LGTM too. I might play around with it locally if I find the time.


Part of me wants to say this should be usesCustomActiveVectorLengthIntrinsic like "the target hasActiveVectorLength" and "the target supportsScalableVectors".

I'm not arguing very strongly though because we have preferPredicatedReductionSelect. I know, it's the epitome of bikeshedding. I don't know if it's any better if it's self-aware.


These need to go in the LangRef, don't they?

simoll added inline comments.Apr 16 2021, 2:45 AM

Yep. The patch for vp.load/store/gather/scatter is D99355

frasercrmck added inline comments.Apr 16 2021, 2:48 AM

Ah yes you mean that revision I've already seen and commented on! Sorry for the noise :)

The overall approach looks good to me. I might have gone a little too deep into the details with my inline comments, but the direction makes sense to me.


I don't see ForceAVLSupport being used anywhere in this patch.


Could this be moved to the VPWidenEVLRecipe class?


How do we guard against this situation? I think this should be part of the legality check in isLegalMaskedLoad for Legal->isConsecutivePtr(Ptr)... if we had a function similar to isLegalMaskedLoad (say isLegalEVLLoad), then it could check for a -1 stride and return false... then we'd be guaranteed not to hit this assert.


Not sure if this is already on your todo list... but apart from "preferring" to predicate, we need to check that the predication is legal for the target platform. This should probably be done in isScalarWithPredication, with calls similar to isLegalMaskedLoad.


Value *Remaining = Builder.CreateSub(TripCount, Induction, "tc.minus.iv");


can we try to initialize RuntimeVL with the right type, to avoid having to zero extend it here?


You can avoid calling the intrinsic by just doing a cmp and a select (i.e., check whether Remaining is less than RuntimeVL and if so pick Remaining, otherwise select RuntimeVL).


Can we avoid truncating to 32-bit on 64-bit targets that take in 64-bit length? I think the type should be the same as the IV.


rename to tryToPredicateWidenMemory or tryToWidenPredicatedMemory


"PREDICATED-WIDEN-MEMORY" to distinguish from VPPredicatedWidenRecipe


I suppose the EVL is not one of the operands, right? If so, I think printing the EVL here would be useful too.

bmahjour added inline comments.Apr 16 2021, 9:11 AM

I guess the targets that don't need/have the concept of predicated binary operations, must provide lowering for all these calls to ultimately generate non-predicated vector code. I'd expect that to be a large effort with little gain. To allow the EVL exploitation while the lowering is being provided, would it make sense to have a path in this function where we just fall back to InnerLoopVectorizer::widenInstruction under an option?

Matt added a subscriber: Matt.Apr 17 2021, 9:12 AM
simoll added inline comments.Apr 19 2021, 1:19 AM

D78203 implements lowering from VP intrinsics to regular SIMD instructions for all targets.

If targets support VP, they should get it. If IR with VP intrinsics ends up getting compiled for a non-VP targets, the intrinsics will disappear before they hit lowering.

vkmr added inline comments.Apr 19 2021, 7:56 AM

Hmmm... I am of two minds here too.
One way to look at it is that with useCustomActiveVectorLengthIntrinsic or preferPredicatedReductionSelect there is a notion of choice - even if the target can use a custom vector length intrinsic, it might choose not to; it is being told to pick one option. Whereas usesCustomActiveVectorLengthIntrinsic or hasActiveVectorLength implies reporting a fact. Not sure if it makes much sense!

Maybe others can chime in too. I am not too particular about names as long as they are reasonably sensible.

I know, it's the epitome of bikeshedding. I don't know if it's any better if it's self-aware.

"There are only two hard things in Computer Science: cache invalidation and naming things." -- Phil Karlton


Not explicitly but implicitly as the default choice if other options do not match. See the test cases for more clarity on how it works and what it results in.


This is called from the execute method of the recipe and is responsible for generating IR instructions to compute EVL. Keeping it in InnerLoopVectorizer is consistent with how the widening methods for other recipes are a part of InnerLoopVectorizer.


Do you mean the legality checker should determine if it's a reverse operation then it is illegal to use EVL predicated Load and should default to Masked load (if that is legal)?

Eventually, there will be support for reverse with EVL predication and required legality checks will be added. For this PoC patch however, it is more explicit and straightforward to have an assert here.


To add to @simoll's comment, eventually we will be using the VectorBuilder (earlier VPBuilder) to widen to VP intrinsics instead of manually creating each intrinsic here. VectorBuilder can also make some decisions on when and how to use the intrinsics.


preferPredicatedWiden takes into account FoldTailByMasking and TTI.hasActiveVectorLength(). TTI.hasActiveVectorLength() checks whether the EVL based predication is legal for the target or not.

For this initial patch, legality checker and cost model are purposely avoided and the explicit command line option is used to 1) keep things simple and 2) demonstrate different approaches.


The core issue here is that the Vectorizer sort of implicitly uses i32 for everything related to the VF and the vector length. The %evl parameter in VP intrinsics is also of type i32.
The TC, BTC and IV are, however, i64. So whenever we mix the two, we end up with extends and truncates.

That being said, in this particular case, it should be possible to initialize MinVF and thus RuntimeVL to i64 if we are going to use it with TC and IV and i32 otherwise.


I may be wrong here, but from what I understand, patches D81829 and D84125 recently introduced these intrinsics to avoid these IR patterns.
I am not sure if their use is discouraged for some reason.


See previous related comment. %evl parameter in VP intrinsics is of type i32.


Not sure I understand your reasoning here.

My understanding is that we are trying to do predicated widening of a memory instruction.
tryToPredicateWidenMemory would imply that we are trying to predicate and/or widen a memory instruction, whereas tryToWidenPredicatedMemory would imply that we are trying to widen a predicated memory instruction, which might not be the case (VP intrinsics can be used to widen non-predicated instructions too).

Also, it is consistent with wording in other places (tryToPredicatedWiden, VPPredicatedWidenRecipe, PREDICATED-WIDEN, etc.). If we change here, it should be changed everywhere too.

If others also think that the current wording should be changed everywhere, I am open to changing it.


"PREDICATED-WIDEN" is consistent with "WIDEN" (and not "WIDEN-MEMORY") used for VPWidenMemoryInstructionRecipe.


It is appended as one of the operands and is printed correctly here.
See the vplan-vp-intrinsics.ll test case for how the printed VPlan looks.

vkmr marked 2 inline comments as done.Apr 20 2021, 2:08 AM
vkmr updated this revision to Diff 343530.May 6 2021, 4:33 PM

Sync up with ongoing changes in VPlan. Minor refactor.

fhahn added inline comments.May 18 2021, 2:59 PM

Do we gain a lot of code-reuse by adding this to ::vectorizeMemoryInstruction? Seems like it would be cleaner to handle predicated codegen in a separate function?


IIUC codegen depends on the induction and trip count for targets with active vector length? Those should probably be modeled as operands for the recipe.

For the non-active-vector length case, the recipe is loop invariant, right? Could we just use live-in VPValue in those cases instead of the recipe? We may have to set up the value outside the loop for scalable VFs.

It would also be good to expand the comment to describe the behavior.


Not sure how well mirroring predicated recipes will scale as more recipes get added. Did you consider extending the existing recipes to take an extra operand for the EVL?


nit: not needed, same for the zext i32 %N to i64, you can pass N as i64 directly.

vkmr added a comment.May 19 2021, 10:23 AM

Thanks @fhahn for a look at the VPlan bits of the patch.


I am not too happy about handling this inside ::vectorizeMemoryInstruction either! But there is substantial code overlap. Perhaps a better approach would be to abstract out much of the shared code into a separate function and duplicate some parts in a separate vectorizePredicatedMemoryInstruction.


IIUC codegen depends on the induction and trip count for targets with active vector length? Those should probably be modeled as operands for the recipe.

Correct. Given that TripCount and Induction are members of ILV, from VPlan's perspective does modeling them as operands for the recipe give any functional benefit or is it more of a VPlan design principle thing?

For the non-active-vector length case, the recipe is loop invariant, right? Could we just use live-in VPValue in those cases instead of the recipe? We may have to set up the value outside the loop for scalable VFs.

Right. It is the runtime VF of the vector loop. Just to make sure, in the VPlan terminology, a "live-in VPValue" is a direct mapping from an LLVM IR Value defined outside the loop body that is used in the loop body, right?

For scalable vectors, yes, we would have to create an IR instruction for vscale x VF. IIUC, there's supposed to be a VPlan built for each possible VF, but it wouldn't be desirable to create an IR instruction for each VF; does VPlan allow creating a VPValue (that is not a recipe) that would be expanded to an IR instruction on plan execution? (Pardon me if I am missing something obvious here!)


I am actually considering and would prefer to extend the existing recipes to support EVL, for the reason you mentioned.

IIRC at the time of our downstream implementation, the existing recipes were different enough and in a state of flux, that it felt cleaner to have mirrored recipes (besides we needed just 2 of them). Doesn't make much sense now, though.


Good catch!

vkmr added inline comments.May 19 2021, 1:20 PM

BTW, when you say "extending the existing recipes", do you mean modifying the existing recipe implementation to add a new constructor and EVL and Mask operands, or implementing the predicated recipe class as a class inherited from the non-predicated recipe as the base class?
In my previous reply I assumed the latter, and it is what I prefer over modifying the existing recipes (unless it doesn't align with VPlan design principles!) - at the expense of a little extra boilerplate code, it keeps the two approaches separate and clean.

vkmr updated this revision to Diff 353243.Jun 20 2021, 2:51 PM

Address VPlan related comments.

mdchen added a subscriber: mdchen.Oct 21 2021, 1:33 AM
SamB added a subscriber: SamB.EditedFeb 21 2022, 1:51 PM

So, um, I would like to look at your patches, but I'm not having much success running "arc patch" against them.

For example, trying to apply this one results in something that starts like this:

 ARGV  '/usr/share/arcanist/bin/../scripts/arcanist.php' 'patch' '--trace' 'D99750'
 LOAD  Loaded "phutil" from "/usr/share/libphutil/src".
 LOAD  Loaded "arcanist" from "/usr/share/arcanist/src".
Config: Reading user configuration file "/home/naesten/.arcrc"...
Config: Did not find system configuration at "/etc/arcconfig".
Working Copy: Reading .arcconfig from "/home/naesten/llvm-project/.arcconfig".
Working Copy: Path "/home/naesten/llvm-project" is part of `git` working copy "/home/naesten/llvm-project".
Working Copy: Project root is at "/home/naesten/llvm-project".
Config: Did not find local configuration at "/home/naesten/llvm-project/.git/arc/config".
>>> [0] (+0) <http>
<<< [0] (+130) <http> 130,791 us
>>> [1] (+130) <http>
<<< [1] (+191) <http> 60,373 us
>>> [2] (+191) <http>
<<< [2] (+775) <http> 584,009 us
>>> [3] (+801) <http>
<<< [3] (+869) <http> 68,718 us
>>> [4] (+870) <exec> $ git --version
<<< [4] (+872) <exec> 1,854 us
>>> [5] (+872) <exec> $ git status --porcelain=2 -z
<<< [5] (+1,016) <exec> 143,931 us
>>> [6] (+1,016) <exec> $ git show -s --format='%H' '' --
<<< [6] (+1,018) <exec> 1,804 us
 INFO  Base commit is not in local repository; trying to fetch.
>>> [7] (+1,018) <exec> $ git fetch --quiet --all
<<< [7] (+1,288) <exec> 269,787 us
>>> [8] (+1,288) <exec> $ git show -s --format='%H' '' --
<<< [8] (+1,290) <exec> 1,654 us
>>> [9] (+1,290) <exec> $ git symbolic-ref --quiet HEAD
<<< [9] (+1,291) <exec> 1,428 us
>>> [10] (+1,292) <exec> $ git rev-parse --verify 'arcpatch-D99750'
<<< [10] (+1,293) <exec> 1,565 us
Branch name arcpatch-D99750 already exists; trying a new name.

I trimmed the rest, but since it didn't really find a base commit -- and it kinda looks like it was looking for one called '' -- it shouldn't be particularly surprising that the patch failed to apply.

So, um, yeah. Do you have one or more git branches I could play with instead, or perhaps you could somehow communicate the proper context to Differential?

vkmr added a subscriber: ABataev.EditedFeb 25 2022, 8:23 AM

So, um, I would like to look at your patches, but I'm not having much success running "arc patch" against them.

@SamB, this patch is highly out of sync with upstream and won't work in its current state. The VPlan infrastructure has changed significantly since this patch. If you want to focus on the VPlan part of this patch, the patch stack D104608, D104610, D104673, D104979 is a bit more recent, although I suspect it would still conflict with upstream.

The good news is that there has been a lot of backend work done to support this, and there is renewed interest in this work. I won't be maintaining this work anymore; I believe @ABataev and @simoll will be taking care of it going forward. You can also join the LLVM-VP discord channel to follow/contribute to its progress.

You can also join this LLVM-VP discord channel to follow/contribute to its progress.

Is this open to anyone? I am unable to access it but very interested in following the discussion. Thanks.

Herald added a project: Restricted Project.Mar 7 2022, 8:20 AM
Aries added a subscriber: Aries.Apr 14 2022, 5:52 PM
XiaPZ added a subscriber: XiaPZ.Nov 13 2022, 4:57 AM