[LV, VP] VP intrinsics support for the Loop Vectorizer
Needs Review · Public

Authored by ABataev on Apr 1 2021, 10:32 AM.

Details

Summary

This patch introduces generating VP intrinsics in the Loop Vectorizer.

Currently the Loop Vectorizer supports vector predication in a very limited capacity via tail-folding and masked load/store/gather/scatter intrinsics. However, this does not let architectures with active-vector-length predication take advantage of their capabilities, and architectures with general masked predication support can only take advantage of predication on memory operations. By giving the Loop Vectorizer a way to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, these architectures can make better use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV, but instead of generating masked intrinsics for memory operations it generates VP intrinsics for load/store instructions.

Another important part of this approach is how the Explicit Vector Length is computed. (We use active vector length and explicit vector length interchangeably; the VP intrinsics define this vector-length parameter as the Explicit Vector Length (EVL).) We consider the following three ways to compute the EVL parameter for the VP intrinsics; a rough IR sketch of the third approach follows the list.

  • The simplest way is to use the VF as the EVL and rely solely on the mask parameter to control predication. The mask parameter is the same as the one computed for the current tail-folding implementation.
  • The second way is to insert instructions that compute min(VF, trip_count - index) for each vector iteration.
  • For architectures like RISC-V, which have a special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic, get_vector_length, that can be lowered to architecture-specific instruction(s) to compute the EVL.
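
As a rough illustration of the third option, the body of an EVL-predicated vector loop for a simple a[i] = b[i] copy could look like the sketch below. This is hand-written for this summary rather than the literal output of the patch; the value names, the <vscale x 4 x i32> element type, and the all-true mask are illustrative assumptions.

; %remaining holds trip_count - index for the current vector iteration
%evl = call i32 @llvm.experimental.get.vector.length.i64(i64 %remaining, i32 4, i1 true)
%v = call <vscale x 4 x i32> @llvm.vp.load.nxv4i32.p0(ptr %b.gep, <vscale x 4 x i1> %alltrue, i32 %evl)
call void @llvm.vp.store.nxv4i32.p0(<vscale x 4 x i32> %v, ptr %a.gep, <vscale x 4 x i1> %alltrue, i32 %evl)
; the (zero-extended) %evl is then used to advance the induction variable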

We also added a new recipe to emit the instructions for computing the EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.

Tentative Development Roadmap

  • Use VP intrinsics for all possible vector operations (a rough before/after sketch follows this list). That work has two possible implementations:
    1. Introduce a new pass that transforms the emitted vector instructions into VP intrinsics if the loop was transformed to use predication for loads/stores. The advantage of this approach is that it does not require many changes in the Loop Vectorizer itself. The disadvantage is that it may require copying some existing functionality from the Loop Vectorizer into a separate patch, keeping similar code in different passes, and performing the same analysis at least twice.
    2. Extend the Loop Vectorizer using VectorBuilder and make it emit VP intrinsics automatically in the presence of an EVL value. The advantage is that it does not require a separate pass, so it may reduce compile time, and we can avoid code duplication. It requires some extra work in the Loop Vectorizer to add VectorBuilder support and smart vector instructions/vp intrinsics emission. Also, fully supporting the Loop Vectorizer will require adding a new PHI recipe to handle the EVL of the previous iteration, plus extending several existing recipes with new operands (depending on the design).
  • Switch to VP intrinsics for memory operations for VLS and VLA vectorization.
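
To make the first roadmap item more concrete, a widened arithmetic operation emitted by the vectorizer would be rewritten into its VP counterpart, for example (a sketch only; the element type and the %mask/%evl operands are assumed to be whatever the enclosing predicated loop already provides):

%add = add <vscale x 4 x i32> %a, %b

would become

%add = call <vscale x 4 x i32> @llvm.vp.add.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, <vscale x 4 x i1> %mask, i32 %evl)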

Diff Detail

Unit Tests: Failed

Time    Test
0 ms    Linux x64 > LLVM-Unit.Transforms/Vectorize/_/VectorizeTests/41::44
Script(shard): -- GTEST_OUTPUT=json:/var/lib/buildkite-agent/builds/linux-56-7f758798dd-z9fgd-1/llvm-project/phabricator-build/build/unittests/Transforms/Vectorize/./VectorizeTests-LLVM-Unit-3825689-41-44.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=44 GTEST_SHARD_INDEX=41 /var/lib/buildkite-agent/builds/linux-56-7f758798dd-z9fgd-1/llvm-project/phabricator-build/build/unittests/Transforms/Vectorize/./VectorizeTests
30 ms   Linux x64 > LLVM-Unit.Transforms/Vectorize/_/VectorizeTests/43::44
Script(shard): -- GTEST_OUTPUT=json:/var/lib/buildkite-agent/builds/linux-56-7f758798dd-z9fgd-1/llvm-project/phabricator-build/build/unittests/Transforms/Vectorize/./VectorizeTests-LLVM-Unit-3825689-43-44.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=44 GTEST_SHARD_INDEX=43 /var/lib/buildkite-agent/builds/linux-56-7f758798dd-z9fgd-1/llvm-project/phabricator-build/build/unittests/Transforms/Vectorize/./VectorizeTests
0 ms    Windows x64 > Clang-Unit.Analysis/FlowSensitive/_/ClangAnalysisFlowSensitiveTests_exe/BoolValueDebugStringTest::ComplexBooleanWithSomeNames
Script: -- C:\ws\src\build\tools\clang\unittests\Analysis\FlowSensitive\.\ClangAnalysisFlowSensitiveTests.exe --gtest_filter=BoolValueDebugStringTest.ComplexBooleanWithSomeNames

Event Timeline

There are a very large number of changes, so older changes are hidden.
ABataev added inline comments.Aug 15 2023, 7:02 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9618

Shall I wait for your patch, or can we land this with the current implementation so that you can adjust it later in your patch?

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
378

Yep, we can do it. Same question as before: shall I wait for your patch to land, or can we commit this and then you update it later in your patch?

Ayal added a comment.Aug 15 2023, 8:48 AM

Joining this review late admittedly, adding comments relative to recent version.

Which loops are amenable to using EVL via this patch should be clearly documented, tested, and potentially VPlanVerified:
Innermost loops, must not be unrolled, must not use masked interleave groups, must produce tail - which must be folded.
What about loops processing elements of multiple/mixed sizes?
What about reductions, fixed-order recurrences?

What is the benefit of using EVL on top of scalable vectors - only replacing the active-lane-mask / tail-folding compare setting the header mask? Can that be done as a VPlan-to-VPlan transform?

Should the title be more accurate, as it Introduces VP Intrinsics, or rather Introduces EVL? Seems also possible to first introduce VP Intrinsics while setting EVL to a maximum default, i.e., w/o actually introducing EVL, hence would be good to emphasize in the title.

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
81

Better called hasEffectiveVectorLength() as in hasEVL - for consistency?
From the summary: "We use active vector length and explicit vector length interchangeably" - why? Could only EVL be used, consistently? (RISC-V documentation of setvl has AVL as a parameter and [E]VL as return/set value.)

Add Opcode/DataType/Alignment parameters later - when used?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
253

Drop "experimental"?

264

Drop "experimental" and "which will be removed in [the] future"?

Or drop altogether - is a switch enabling EVL for default target needed or suffice to test for concrete RISC-V EVL target?

266–272

Do you mean to say "Some optimizations work better for an invariant VF", i.e., that of VL0, rather than "for countable loops"?
Vectorizing tail with EVL may still produce countable vector loops, as noted above.

277

Description talks about tail-folding, name of variable and command-line argument do not.

1647

+1

The acronym VP is overloaded here and mainly short for "VPlan", as in VPInstruction.
Perhaps VPI would be clearer, short for VP Intrinsics.
Seems to mean "foldTailByEVL()" or "foldTailByVPI()"?

2759

interleave groups are invalidated if foldTailByMasking() is true

unless useMaskedInterleavedAccesses() is also true?

2762

What is the meaning of setting VectorTripCount = TC?

5120

It may be confusing to keep PreferVPIntrinsics = false when "Preference for VP intrinsics indicated".

The (names and) relation between PreferPredicateWithVPIntrinsics and PreferVPIntrinsics should be clarified.

Separate the part that sets PreferVPIntrinsics, possibly switching on PreferPredicateWithVPIntrinsics?

5127

Add expected fail tests to cover opcode/data types that are not supported nor checked?

5131

May as well also inform if the target supports it or not?

5138

So EVL is applied to scalable VFs only, right?

8213–8214

If useVPVectorization() is another reason to predicate the header BB, it should be included in blockNeedsPredicationForAnyReason(). But since the former requires tail folding, is it really another reason, or can it be dropped?

8987

(Another) Premature creation of a recipe out of place/time?

9127

Restrict EVL to inner loop vectorization only, for now?

9548

Recipes should strive to have straightforward code-gen as much as possible (contrary to "smart vector instructions/vp intrinsics emission" of the 2nd bullet in the summary's Tentative Development Roadmap).
This is already challenged by the existing (non-EVL) VPWidenMemoryInstructionRecipe::execute.

Design dedicated recipe(s) for widening memory instructions under EVL, and introduce them instead of the existing non-EVL recipes, preferably as a VPlan-to-VPlan transformation, rather than try to fit everything here, and potentially elsewhere?
Also discussed below in https://reviews.llvm.org/D99750#inline-967127, and iinm in earlier revisions.

9642

if foldTailByMasking() is true (for VP support it is true), interleave groups are invalidated.

unless useMaskedInterleavedAccesses() is also true?

9645

nit: if (State.EVL) { Value *EVLPart = State.get(State.EVL, Part); ...?

Why deal with Parts when EVL mandates UF=1?

llvm/lib/Transforms/Vectorize/VPlan.h
2114

But VPEVLRecipe seems to compute min(VF,TC-I) aka "the second way" in the above summary, rather than RISC-V's special setvl(TC-I,SEW,LMUL) instruction aka third bullet there?

The use of "VF" here is confusing, as it is effectively used to compute VF... should it be MinVF or MaxVF instead?

2116

Good clarification!

Would be good to explain at the outset "what the recipe actually computes"? (aka "3" - whose definition was lost somewhere?).

Trying to reason about this further: an original scalar loop

for(I=0; I<TC; ++I) {
  ...
}

processing Standard Elements of Width SEW can be vectorized, along with tail if needed, by doing

for (I=0, N=TC; I<TC; EVL=vsetvl(N, SEW), I+=EVL, N-=EVL) {
  ...
}

The spec and comments above imply that, strictly speaking:
(a) I may advance in non-fixed increments, in the end, and thus is not an Induction Variable in general.
(b) the vector loop produced as such is uncountable in general - does not have a predetermined vector-trip-count.
(c) a countable vector loop can be produced by considering the VL of the first vector iteration, e.g.,:

VL0 = vsetvl(TC, SEW);
VTC = ceil(TC/VL0);
for (I=0, N=TC, VIV=0; VIV<VTC; ++VIV, EVL=vsetvl(N, SEW), I+=EVL, N-=EVL) {
  ...
}

where VIV is an Induction Variable counting Vector Iterations and is separate from I which specifies the lanes processed in each vector iteration. A countable vector loop may be amenable to subsequent optimizations such as unrolling, modulo-scheduling, translation to hardware-loop.

2124

VPEVLRecipe holds no state, so deserves an opcode in VPInstruction rather than a dedicated recipe?

Regarding the two operands of VPEVLRecipe - strictly speaking neither an Induction Variable nor a Vector Trip Count seem applicable, following the above?
The current 2nd operand is received in the constructor as TC, but retrieved as VTC?

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
378

Or keep CanonicalIVIncrement intact using an implicit VF*UF step, and introduce separate recipe(s) to take care of controlling the loop and/or computing the lanes each vector iteration, under EVL?

nit: consider checking asserts in VPlanVerifier instead.

1484

Placing vsetvli outside the main loop will compute VL0 as above. Idea of the TODO is to use VL0 repeatedly for all floor(TC/VL0) iterations excluding tail, thereby producing "explicit remainder loop" aka tail, and a countable main vector loop facilitating UF>1?

1486

"interleave or unroll", "Neither unrolling, nor interleaving" - what's the distinction between unrolling and interleaving?

If UF must be 1 then the loop over UF should be omitted.

1491

This indeed looks like the original TripCount rather than a VectorTripCount?

The Sub can be taken care of by a separate VPInstruction, which should feed VPEVLRecipe, rather than having the latter depend on IV and [Vector]TripCount.

The result EVL produced by VPEVLRecipe should feed its dependent recipes directly, rather than via State.EVL?

llvm/lib/Transforms/Vectorize/VPlanValue.h
353 (On Diff #547860)

nit: lex order.

llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
205

nit: retrieve Entry from the loop region enclosing VPBB, rather than pass it in.

208

nit: may as well dump the error message here.

A loop region under EVL may deserve its own verification to check its various constraints, e.g., no inner replicate regions, no replicate recipes, UF=1 only - should also be indicated when printed.

Hi Ayal, thanks for the review.

Joining this review late admittedly, adding comments relative to recent version.

Which loops are amenable to using EVL via this patch should be clearly documented, tested, and potentially VPlanVerified:
Innermost loops, must not be unrolled, must not use masked interleave groups, must produce tail - which must be folded.

You're asking me to add this to the summary, right?

What about loops processing elements of multiple/mixed sizes?
What about reductions, fixed-order recurrences?

No restrictions for now, AFAIR, but some extra work is required for better results.

What is the benefit of using EVL on top of scalable vectors - only replacing the active-lane-mask / tail-folding compare setting the header mask? Can that be done as a VPlan-to-VPlan transform?

Well, theoretically this can be considered a kind of active-lane-mask replacement, and there are RISC-V CPUs that work this way. But for some of them this is not quite so, and it may require extra machinery, such as knowledge of the active vector length on the previous iteration(s) plus dynamically balancing the actual vector length.
We're going to add this recipe in the future (it is required for reductions, AFAIR). Would this be a blocker for a VPlan-to-VPlan transform?

Should the title be more accurate, as it Introduces VP Intrinsics, or rather Introduces EVL? Seems also possible to first introduce VP Intrinsics while setting EVL to a maximum default, i.e., w/o actually introducing EVL, hence would be good to emphasize in the title.

No, setting EVL to the default maximum won't work for loads/stores.

ABataev marked 5 inline comments as done.Aug 15 2023, 2:04 PM
ABataev added inline comments.
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
81

This is separate. That function is already introduced in TTI; the purpose of this patch is not directly related to TTI.
As for the params, that is done in the implementation.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
266–272

Currently, vectorization with EVL will result in an uncountable loop. Tail vectorization is not always the best option for some targets.

2759

I can bring the check back, but currently I cannot provide a test, since the target invalidates the costs for interleaved groups when making cost-based decisions.

2762

We don't need to round the vector trip count, so we just set it to the original trip count value (the EVL is automatically adjusted to be no larger than the trip count).

5120

PreferPredicateWithVPIntrinsics is just the option that controls whether we can emit VP intrinsics at all. If we later see that it is not possible for some reason, we do not do it, so that we can still produce correct code. I don't see what else should be adjusted here.

5127

Currently there are no such tests, since we support only loads/stores, which are supported by all potential targets.

5138

For now - yes.

8213–8214

It is another reason and cannot be dropped.

9548

Why? It already handles masking, so why shouldn't it be extended to handle EVL? A new recipe would not add anything new here; it would just be a copy/paste of the existing recipes.

9645

The loop executes up to State.UF (which is 1 here, yes), so everything is fine here.

llvm/lib/Transforms/Vectorize/VPlan.h
2114
  1. No, it emits setvl
  2. There is no VF.
2116

Good clarification!

Would be good to explain at the outset "what the recipe actually computes"? (aka "3" - whose definition was lost somewhere?).

Trying to reason about this further: an original scalar loop

for(I=0; I<TC; ++I) {
  ...
}

processing Standard Elements of Width SEW can be vectorized, along with tail if needed, by doing

for (I=0, N=TC; I<TC; EVL=vsetvl(N, SEW), I+=EVL, N-=EVL) {
  ...
}

The spec and comments above imply that, strictly speaking:
(a) I may advance in non-fixed increments, in the end, and thus is not an Induction Variable in general.
(b) the vector loop produced as such is uncountable in general - does not have a predetermined vector-trip-count.

And I described this already.

(c) a countable vector loop can be produced by considering the VL of the first vector iteration, e.g.,:

VL0 = vsetvl(TC, SEW);
VTC = ceil(TC/VL0);
for (I=0, N=TC, VIV=0; VIV<VTC; ++VIV, EVL=vsetvl(N, SEW), I+=EVL, N-=EVL) {
  ...
}

where VIV is an Induction Variable counting Vector Iterations and is separate from I which specifies the lanes processed in each vector iteration. A countable vector loop may be amenable to subsequent optimizations such as unrolling, modulo-scheduling, translation to hardware-loop.

Yes, we're going to implement this later.

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
1484

It is not possible, generally speaking, since not only the tail might be affected.

ABataev updated this revision to Diff 550471.Aug 15 2023, 2:11 PM

Rebase + address comments

fhahn added inline comments.Aug 20 2023, 2:11 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9652

Ah I originally thought that the vector builder provides the same interface as IRBuilder, but uses the vector predication intrinsics when mask/EVL is set. Together with performing the MaskValue simplification as VP2VP transform, this would effectively reduce the changes here to a few lines, i.e. only setting EVL in addition to the mask.

Another thing came to mind: is there a plan to converge the separate masked memory intrinsics and the vector predication versions, i.e. long term, should the masked memory intrinsics be superseded by the vector predication ones?

ABataev added inline comments.Aug 20 2023, 2:23 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9652

Ah I originally thought that the vector builder provides the same interface as IRBuilder, but uses the vector predication intrinsics when mask/EVL is set. Together with performing the MaskValue simplification as VP2VP transform, this would effectively reduce the changes here to a few lines, i.e. only setting EVL in addition to the mask.

Another thing came to mind: is there a plan to converge the separate masked memory intrinsics and the vector predication versions, i.e. long term, should the masked memory intrinsics be superseded by the vector predication ones?

We have not discussed it yet, and it should be a separate discussion/decision. Though it would be good to make it part of the masked intrinsics or even instructions (along with the mask).

nikolaypanchenko edited the summary of this revision. (Show Details)Aug 21 2023, 11:55 AM
ABataev updated this revision to Diff 555150.Aug 31 2023, 1:09 PM

Make predicated VPlan a VPlan-to-VPlan transformation

iamlouk added a subscriber: iamlouk.Wed, Sep 6, 4:14 AM
fhahn added a comment.Thu, Sep 7, 3:10 PM

Thanks for the latest update! A few more comments inline and it looks like there are a few older comments that haven't been addressed yet.

Overall it looks like we are converging on a good initial version. As for modeling EVL, IMO having the EVL VPInstruction set the active vector length for the remainder of the region would be OK to start with, but it would be good to wait and hear @Ayal's thoughts as well.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
277

Looks like the description still needs updating?

1647

Agreed, it should be possible to use VP intrinsics without EVL, so it would be good to have the name reflect that this is focused on introducing EVL. Same for the user-exposed options.

1803

This is only for using EVL for now, would be good to clarify name and comment.

8163

Better to replace the mask together with introducing EVL to make sure EVL gets added when the mask gets removed?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
876

I think turning the step of the canonical induction non-loop-invariant technically turns the canonical IV into a phi that's not a canonical IV any more (the canonical IV is guaranteed to step the same amount each iteration). Would it work to keep the increment unchanged and keep rounding up the trip count as with regular tail folding initially? Further down the line, the canonical IV issue may be resolved by also replacing the canonical IV node with a regular scalar phi when doing the replacement here.

llvm/test/Transforms/LoopVectorize/RISCV/vectorize-vp-intrinsics.ll
429

one more test that would probably be good to have is one with a conditional memory operation, which needs EVL and the block mask, right?

ABataev added inline comments.Wed, Sep 13, 10:30 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8163

Currently that would require some extra work. We'd need to handle both cases, with the active-lane-mask intrinsic and with a direct comparison. Would it be possible to keep this for now and fix it once you land emission of the active-lane-mask intrinsic in a VPlan-to-VPlan transform?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
876

I'll try to improve this.

ABataev updated this revision to Diff 556711.Wed, Sep 13, 12:58 PM

Rebase, address comments

fhahn added inline comments.Mon, Sep 25, 6:01 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8163

With the latest version, can the useVPWithVPEVLVectorization part be dropped (if the transform is updated to remove the mask from load/stores)?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
876

Did you get a chance to try this out yet?

97687b7aea17 landed, it would probably be good to also remove the header mask from load/store recipes here, to make clear that this optimizes the tail-folded loop?

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
70

It looks like the current implementation needs to be updated to actually replace the checks. It also adjusts the induction increment; it would be good to check whether that is actually needed in the initial version, as per comments elsewhere.

ABataev added inline comments.Mon, Sep 25, 8:17 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8163

Not quite, it will require an extra VPValue, something like VPAllTrueMask, which should replace IV <= BTC. Shall I add it?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
876

Already did. The loop is countable; adding a new phi won't gain anything, just some extra work without any effect.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
70
  1. Yes, we need to add a VPAllTrueMask VPValue.
  2. It is required!
fhahn added inline comments.Mon, Oct 2, 1:05 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8163

Would a live in i1 true work? I think that may work as is. As EVL is only used for lowering of loads/stores at the moment, it should be only removed there for now?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
876

Oh right I missed that, sorry!

Does the latest version actually have to update the canonical IV increment?

I might be missing something, but shouldn't the exit condition now use the rounded-up version (a multiple of the VF) of the trip count for the compare? If we increment by EVL, then we might not reach the exit condition?

ABataev added inline comments.Mon, Oct 2, 1:16 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8163

You mean scalar i1 true?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
876

There are 2 increments now: one that feeds into the PHI and another used for the exit condition. The latter uses the rounded-up version of the trip count for the comparison.

fhahn added inline comments.Mon, Oct 2, 1:42 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8163

Yes, that should be broadcast across all vector lanes.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
876

Right, but why are both needed? Can there be more than 1 iteration where EVL < VF?

ABataev added inline comments.Mon, Oct 2, 1:47 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8163

I don't like that we won't match the actual type here. I thought about another possible solution: overload VPActiveLaneMask and make it return an all-true mask for RVL targets. WDYT?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
876

The first one is used to increment the IV, the other to check the number of iterations. We need the first one (which depends on vsetvli) to correctly count the IV.
But it looks like I need to adjust the IV here, because otherwise we may end up with a wrong comparison. I'll think more about it.

craig.topper added inline comments.Mon, Oct 2, 8:59 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
876

Right, but why are both needed? Can there be more than 1 iteration where EVL < VF?

Yes, the last 2 iterations can have an EVL less than VL for RISC-V. The vsetvli instruction on RISC-V takes an input called AVL that contains the number of values to process and returns VL subject to the following constraints:

  1. vl = AVL if AVL ≤ VLMAX
  2. ceil(AVL / 2) ≤ vl ≤ VLMAX if AVL < (2 * VLMAX)
  3. vl = VLMAX if AVL ≥ (2 * VLMAX)

Bullet 2 there allows the AVL between VLMAX and 2*VLMAX to be split over the last 2 iterations. Not all microarchitectures implement this.

On each iteration VL cannot be larger than the VL on the previous iteration.
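
As a concrete illustration of constraint 2 (numbers chosen for the example, not taken from the spec text): with VLMAX = 4 and AVL = 6 remaining elements, ceil(6/2) = 3 ≤ vl ≤ 4, so an implementation may legally return vl = 3 on each of the last two iterations (3 is not larger than 3) instead of 4 followed by 2.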

michalt added a subscriber: michalt.Tue, Oct 3, 3:42 AM