Page MenuHomePhabricator

[LV, VP] RFC: VP intrinsics support for the Loop Vectorizer (Proof-of-Concept)
Needs ReviewPublic

Authored by vkmr on Thu, Apr 1, 10:32 AM.

Details

Summary

Abstract

As Vector Predication intriniscs are being introduced in LLVM, we propose extending the Loop Vectorizer to target these intrinsics. SIMD ISAs such as RISC-V V-extension, NEC SX-Aurora and Power VSX with active vector length predication support can specially benefit from this since there is currently no reasonable way in the IR to model active vector length in the vector instructions.

ISAs such as AVX512 and ARM SVE with masked vector predication support would benefit by being able to use predicated operations other than just memory operations (via masked load/store/gather/scatter intrinsics).

This patch shows a proof of concept implementation that demonstrates Loop Vectorizer generating VP intrinsics for simple integer operations on fixed vectors.

Details and Strategy

Currently the Loop Vectorizer supports vector predication in a very limited capacity via tail-folding and masked load/store/gather/scatter intrinsics. However, this does not let architectures with active vector length predication support take advantage of their capabilities. Architectures with general masked predication support also can only take advantage of predication on memory operations. By having a way for the Loop Vectorizer to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, These architectures can make better use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV, but instead of generating masked intrinsics for memory operations it generates VP intrinsics for all arithmetic operations as well.

Other important part of this approach is how the Explicit Vector Length is computed. (We use active vector length and explicit vector length interchangeably; VP intrinsics define this vector length parameter as Explicit Vector Length (EVL)). We consider the following three ways to compute the EVL parameter for the VP Intrinsics.

  • The simplest way is to use the VF as EVL and rely solely on the mask parameter to control predication. The mask parameter is the same as computed for current tail-folding implementation.
  • The second way is to insert instructions to compute min(VF, trip_count - index) for each vector iteration.
  • For architectures like RISC-V, which have special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic set_vector_length, that can be lowered to architecture specific instruction(s) to compute EVL.

For the last two ways, if there is no outer mask, we use an all-true boolean vector for the mask parameter of the VP intrinsics. (We do not yet support control flow in the loop body.)

We have also extended VPlan to add new recipes for PREDICATED-WIDENING of arithmetic operations and memory operations, and a recipe to emit instructions for computing EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.

Alternate vectorization strategies with predication

Other than the tail-folding based vectorization strategy, we are considering two other vectorization strategies (not implemented yet):

  1. Non-predicated body followed by a predicated vectorized tail - This will generate a vector body without any predication (except control flow), same as for the existing approach of a vector body with scalar tail loop. The tail however will be vectorized using the VP intrinsics with EVL = trip_count % VF. While this approach will result in larger code size, it might be more efficient than our currently implemented approach; It will have a straight line code for the tail and the vector body will be free of all the overhead of using intrinsics.
  2. Another strategy could be to use tail-folding based approach but use predication only for memory operations. This might be beneficial for architectures like Power VSX that support vector length predication only for memory operations.

Caveats / Current limitations / Current status of things

This patch is far from complete and simply meant to be a proof-of-concept with the aim to (it will also be broken into smaller more concrete patches once we have more feedback from the community):

  • demonstrate the feasibility of the Loop Vectorizer to target VP intrinsics.
  • start a deeper implementation-backed discussion around vector predication support in LLVM.

That being said, there are several limitations at the moment; Some need more supporting implementation and some need more discussion:

  • For the purpose of demonstration, we use a command line switch that can be used to force VP intrinsic support and needs tail-folding enabled to work.
  • VP Intrinsic development is going on in parallel, and currently only supports integer arithmetic intrinsics in the upstream.
  • No support for control flow in the loop.
  • No support for interleaving.
  • We need more discussion around the best approach for computing EVL parameter. If using an intrinsic, more thought needs to go into its semantics. Also the VPlan recipe for EVL is sort of a dummy recipe with widening delegated to the vectorizer.
  • We also do not use the active.vector.lane.mask intrinsic yet, but it is something we consider for the future.
  • No support for scalable vectors yet (Due to missing support for tail folding for scalable vectors).

Note: If you are interested in how it may work end-to-end for scalable vectors,
do take a look at our downstream implementation for RISC-V [ RVV-Impl ] and an
end-to-end demo on Godot compiler explorer [ Demo ].

Note: This patch also includes our implementation of vp_load and vp_store
intrinsics. There is currently a more complete patch [ D99355 ] open for review,
which we will merge when it lands.

Tentative Development Roadmap

Our plan is to start with integrating the functionality in this patch, with changes/enhancements agreed upon by the community. For next steps, we want to:

  • Support VP intrinsics for vectorization for scalable vectors (starting with enabling tail folding for scalable vectors if required by the time.)
  • Support for floating point operations.
  • Support for control flow in the loop.
  • Support for more complicated loops - reductions, inductions, recurrences, reverse.

[ RVV-Impl ]: https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi
[ Demo ]: https://repo.hca.bsc.es/epic/z/9eYRIF
[ D99355 ]: https://reviews.llvm.org/D99355

Diff Detail

Event Timeline

vkmr created this revision.Thu, Apr 1, 10:32 AM
vkmr requested review of this revision.Thu, Apr 1, 10:32 AM
Herald added a project: Restricted Project. · View Herald TranscriptThu, Apr 1, 10:32 AM
vkmr edited the summary of this revision. (Show Details)Thu, Apr 1, 11:19 AM
vkmr added a project: Restricted Project.
vkmr updated this revision to Diff 334842.Thu, Apr 1, 4:03 PM

Remove unused header include

vkmr updated this revision to Diff 334860.Thu, Apr 1, 4:57 PM

Remove unused header

vkmr updated this revision to Diff 334865.Thu, Apr 1, 5:02 PM

Revert diff 334860

vkmr updated this revision to Diff 334878.Thu, Apr 1, 6:29 PM

Merged more recent changes on main.

craig.topper added inline comments.Wed, Apr 7, 2:39 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3021

I think you can use Builder.getInt32Ty

3062

Same here

9017

Drop curly braces since it's only 1 line

9060

Drop curly braces

9699

intrinisc -> intrinsic

khchen added a subscriber: khchen.Thu, Apr 8, 9:25 AM
frasercrmck added inline comments.Mon, Apr 12, 2:10 AM
llvm/include/llvm/IR/IRBuilder.h
2545

This method doesn't seem used by this patch.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
3018

Does this need to be a VectorType? Can't you pass StoredVal->getType() straight through to CreateIntrinsic?

5010

Is this comment out of place?

8683–8690

Unnecessary parens around this statement.

This LGTM on a conceptual level from the standpoint of SX-Aurora.
I guess @fhahn would be the right person to weigh in on the VPlan changes.
When it comes to the actual patches for this RFC, it'd help to upstream the VectorBuilder class. It allows you to emit VP code without having to juggle with VP intrinsics.

vkmr updated this revision to Diff 337792.Thu, Apr 15, 9:33 AM

Addressed review comments by @craig.topper and @frasercrmck.

vkmr marked 9 inline comments as done.Thu, Apr 15, 9:37 AM

Overall, at a high-level, this LGTM too. I might play around with it locally if I find the time.

llvm/include/llvm/Analysis/TargetTransformInfo.h
1381

Part of me wants to say this should be usesCustomActiveVectorLengthIntrinsic like "the target hasActiveVectorLength" and "the target supportsScalableVectors".

I'm not arguing very strongly though because we have preferPredicatedReductionSelect. I know, it's the epitome of bikeshedding. I don't know if it's any better if it's self-aware.

llvm/include/llvm/IR/Intrinsics.td
1346

These need to go in the LangRef, don't they?

simoll added inline comments.Fri, Apr 16, 2:45 AM
llvm/include/llvm/IR/Intrinsics.td
1346

Yep. The patch for vp.load/store/gather/scatter is D99355

frasercrmck added inline comments.Fri, Apr 16, 2:48 AM
llvm/include/llvm/IR/Intrinsics.td
1346

Ah yes you mean that revision I've already seen and commented on! Sorry for the noise :)

The overall approach looks good to me. I might have gone a little too deep into the details with my inline comments, but the direction makes sense to me.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
280

don't see ForceAVLSupport being used anywhere in this patch.

628

Could this be moved to the VPWidenEVLRecipe class?

2931

how do we guard against this situation? I think this should be part of the legality check...eg isLegalMaskedLoad check for Legal->isConsecutivePtr(Ptr)...if we had a function similar to isLegalMaskedLoad (say isLegalEVLLoad), then it could check for -1 stride and return false...then we'd be guaranteed not to hit this assert.

9017

not sure if this is already in your todo list...but apart from "preferring" to predicate, we need to check that the predication is legal for the target platform. This should probably be done in isScalarWithPredication with calls similar to isLegalMaskedLoad.

9697

Value *Remaining = Builder.CreateSub(TripCount, Induction, "tc.minus.iv");

9712

can we try to initialize RuntimeVL with the right type, to avoid having to zero extend it here?

9714

you can avoid calling the intrinsic, by just doing a cmp and a select. (ie check that Remaining is less than RuntimeVL and if so pick Ramining, otherwise select RuntimeVL).

9715

Can we avoid truncating to 32-bit on 64-bit targets that take in 64-bit length? I think the type should be the same as the IV.

llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h
74

rename to tryToPredicateWidenMemory or tryToWidenPredicatedMemory

llvm/lib/Transforms/Vectorize/VPlan.cpp
1144

"PREDICATED-WIDEN-MEMORY" to distinguish from VPPredicatedWidenRecipe

1152

I suppose the EVL is not one of the operands, right? If so, I think printing the EVL here would be useful too.

bmahjour added inline comments.Fri, Apr 16, 9:11 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4969

I guess the targets that don't need/have the concept of predicated binary operations, must provide lowering for all these calls to ultimately generate non-predicated vector code. I'd expect that to be a large effort with little gain. To allow the EVL exploitation while the lowering is being provided, would it make sense to have a path in this function where we just fall back to InnerLoopVectorizer::widenInstruction under an option?

Matt added a subscriber: Matt.Sat, Apr 17, 9:12 AM
simoll added inline comments.Mon, Apr 19, 1:19 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
4969

D78203 implements lowering from VP intrinsics to regular SIMD instructions for all targets.

If targets support VP, they should get it. If IR with VP intrinsics ends up getting compiled for a non-VP targets, the intrinsics will disappear before they hit lowering.

vkmr added inline comments.Mon, Apr 19, 7:56 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
1381

Hmmm...I am in double-mind here too.
One way to look at it is that with useCustomActiveVectorLengthIntrinsic or preferPredicatedReductionSelect there is a notion of choice - even if it can use a custom vector length intrinsic it might choose not to; It is being told to pick one choice . Whereas usesCustomActiveVectorLengthIntrinsic or hasActiveVectorLength implies reporting a fact. Not sure if it makes much sense!

May be others can chime in too. I am not too particular about names as long as they are reasonably sensible.

I know, it's the epitome of bikeshedding. I don't know if it's any better if it's self-aware.

"There are only two hard things in Computer Science: cache invalidation and naming things." -- Phil Karlton

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
280

Not explicitly but implicitly as the default choice if other options do not match. See the test cases for more clarity on how it works and what it results in.

628

This is called from the execute method of the recipe and is responsible for generating IR instructions to compute EVL. Keeping it in InnerLoopVectorizer is consistent with how the widening methods for other recipes are a part of InnerLoopVectorizer.

2931

Do you mean the legality checker should determine if it's a reverse operation then it is illegal to use EVL predicated Load and should default to Masked load (if that is legal)?

Eventually, there will be support for reverse with EVL predication and required legality checks will be added. For this PoC patch however, it is more explicit and straightforward to have an assert here.

4969

To add to @simoll's comment, eventually we will be using the VectorBuilder (earlier VPBuilder) to widen to VP intrinsics instead of manually creating each intrinsic here. VectorBuilder can also make some decisions on when and how to use the intrinsics.

9017

preferPredicatedWiden takes into account FoldTailByMasking and TTI.hasActiveVectorLength(). TTI.hasActiveVectorLength() checks whether the EVL based predication is legal for the target or not.

For this initial patch, legality checker and cost model are purposely avoided and the explicit command line option is used to 1) keep things simple and 2) demonstrate different approaches.

9712

The core issue here is that the Vectorizer sort of implicitly uses i32 for everything related to VF and Vector Lenght. %evl parameter in VP intrinsics is also of type i32.
TC, BTC and IV are however i64. So whenever we mix these two we end up having extends and truncates.

That being said, in this particular case, it should be possible to initialize MinVF and thus RuntimeVL to i64 if we are going to use it with TC and IV and i32 otherwise.

9714

I may be wrong here but from what I understand, patches D81829 and D84125 recently introduces these intrinsics to avoid these IR patterns.
I am not sure if their use is discouraged for some reason.

9715

See previous related comment. %evl parameter in VP intrinsics is of type i32.

llvm/lib/Transforms/Vectorize/VPRecipeBuilder.h
74

Not sure I understand your reasoning here.

My understanding is that we are trying to do predicated widening of memory instruction.
tryToPredicateWidenMemory would imply that we are trying to predicate and/or widen memory instruction, whereas tryToWidenPredicatedMemory would imply we are trying to widen predicated memory instruction, which might not be the case (VP intrinsics can be used to widen non predicated instructions too).

Also, it is consistent with wording in other places (tryToPredicatedWiden, VPPredicatedWidenRecipe, PREDICATED-WIDEN, etc.). If we change here, it should be changed everywhere too.

If others also think that the current wording should be changed everywhere, I am open to changing it.

llvm/lib/Transforms/Vectorize/VPlan.cpp
1144

"PREDICATED-WIDEN" is consistent with "WIDEN" (and not "WIDEN-MEMORY") used for VPWidenMemoryInstructionRecipe.

1152

It is appended as one of the operands and is printed correctly here.
See the vplan-vp-intrinsics.ll test case for how the printed VPlan looks.