This is an archive of the discontinued LLVM Phabricator instance.

[LV, VP]VP intrinsics support for the Loop Vectorizer
Needs ReviewPublic

Authored by ABataev on Apr 1 2021, 10:32 AM.

Details

Summary

This patch introduces generating VP intrinsics in the Loop Vectorizer.

Currently the Loop Vectorizer supports vector predication in a very limited capacity via tail-folding and masked load/store/gather/scatter intrinsics. However, this does not let architectures with active vector length predication support take advantage of their capabilities. Architectures with general masked predication support also can only take advantage of predication on memory operations. By having a way for the Loop Vectorizer to generate Vector Predication intrinsics, which (will) provide a target-independent way to model predicated vector instructions, These architectures can make better use of their predication capabilities.

Our first approach (implemented in this patch) builds on top of the existing tail-folding mechanism in the LV, but instead of generating masked intrinsics for memory operations it generates VP intrinsics for loads/stores instructions.

Other important part of this approach is how the Explicit Vector Length is computed. (We use active vector length and explicit vector length interchangeably; VP intrinsics define this vector length parameter as Explicit Vector Length (EVL)). We consider the following three ways to compute the EVL parameter for the VP Intrinsics.

  • The simplest way is to use the VF as EVL and rely solely on the mask parameter to control predication. The mask parameter is the same as computed for current tail-folding implementation.
  • The second way is to insert instructions to compute min(VF, trip_count - index) for each vector iteration.
  • For architectures like RISC-V, which have special instruction to compute/set an explicit vector length, we also introduce an experimental intrinsic get_vector_length, that can be lowered to architecture specific instruction(s) to compute EVL.

Also, added a new recipe to emit instructions for computing EVL. Using VPlan in this way will eventually help build and compare VPlans corresponding to different strategies and alternatives.

Tentative Development Roadmap

  • Use vp-intrinsics for all possible vector operations. That work has 2 possible implementations:
    1. Introduce a new pass which transforms emitted vector instructions to vp intrinsics if the the loop was transformed to use predication for loads/stores. The advantage of this approach is that it does not require many changes in the loop vectorizer itself. The disadvantage is that it may require to copy some existing functionality from the loop vectorizer in a separate patch, have similar code in the different passes and perform the same analysis 2 times, at least.
    2. Extend Loop Vectorizer using VectorBuildor and make it emit vp intrinsics automatically in presence of EVL value. The advantage is that it does not require a separate pass, thus it may reduce compile time. Plus, we can avoid code duplication. It requires some extra work in the LoopVectorizer to add VectorBuilder support and smart vector instructions/vp intrinsics emission. Also, to fully support Loop Vectorizer it will require adding a new PHI recipe to handle EVL on the previous iteration + extending several existing recipes with the new operands (depends on the design).
  • Switch to vp-intrinsics for memory operations for VLS and VLA vectorizations.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Rebase, fix issue with exit condition

fhahn added a comment.Oct 16 2023, 3:33 AM

Thanks for the latest update! It looks like some of the unit tests (check-llvm-unit) are failing (maybe related to the verifier changes?), would be good to check and update them.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6338

nit: if EVL is preferred?

9608

nit: EC unused?

9613

nit: If EVLPart

llvm/lib/Transforms/Vectorize/VPlan.h
2144 ↗(On Diff #557696)

VPExplicitVectorLengthExitPHISC?

2155 ↗(On Diff #557696)

nit: active lane mask phi?

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
1725 ↗(On Diff #557696)

nit: VPWidenPHIRecipe implies vector phi, but this is a scalar phi.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
966 ↗(On Diff #557696)

I think using Exit condition may be confusing, it replaces the predicate for the header mask.

1026 ↗(On Diff #557696)

nit: Entry->Header

1028 ↗(On Diff #557696)

How ActiveLaneMask factor in here?

1037 ↗(On Diff #557696)

Would be good to describe here what recipes are added and what's changed.

1039 ↗(On Diff #557696)

Simpler to first get the canonical IV via getCanonicalIV and use it to access its increment?

1055 ↗(On Diff #557696)

IIUC this introduces EVLExitPhi and uses it for the canonical IV increment only, which controls the exit and adjust the canonical IV. Would it be possible to do it the other way around, i.e. keep the canonical induction incremented by the canonical IV increment (thus keeping it canonical), and instead have EVLExitPhi updated by ExplicitVectorLengthIVIncrement, and updating the relevant users (all except the canonical increment?) to use that one?

832 ↗(On Diff #555150)

Thanks for clarifying! Does the current codegen in the patch work correctly for cases where we execute more than 1 iteration for EVL < VF? IIUC the current approach with rounding up the trip count and using VF as increment assumes only one extra iteration.

ABataev added inline comments.Oct 16 2023, 4:08 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
832 ↗(On Diff #555150)

The total number of iterations is the same, just the vector length changes by balancing the value. If EVL is less than VLMAX, EVL is used as vector length. Only if VLMAX < AVL < 2 * VLMAX some magic may happen, i.e. in last 2 (vectorized) iterations.

ABataev updated this revision to Diff 557762.Oct 18 2023, 9:03 AM

Rebase, address comments

fhahn added inline comments.Oct 19 2023, 7:53 AM
llvm/lib/Transforms/Vectorize/VPlan.h
2139 ↗(On Diff #557762)

This is out of date now I think as now the CanonicalIV is left unchanged and used for the countable exit condition. There might be a better name for the recipe, as the current one doesn't seem to capture the essence of the recipe, which effectively represents the current index of elements to process in the current iteration, by account for EVL.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
966 ↗(On Diff #557762)

nit: replaceHeaderPredicateWithIdiom?

1027 ↗(On Diff #557762)

nit: all uses except the canonical IV increment.

1028 ↗(On Diff #557762)

nit: the only user is CanonicalIVIncrement I think, branch-on-count uses the increment.

1079 ↗(On Diff #557762)

The only users remaining should be CanonicalIVIncrement. Simpler to just do the below?

   // Replace all uses of VPCanonicalIVPHIRecipe by
   // VPExplicitVectorLengthPHIRecipe except for
-  // VPInstruction::CanonicalIVIncrement and VPCanonicalIVPHIRecipe itself.
-  for (VPUser *U : to_vector(CanonicalIVPHI->users())) {
-    if (auto *I = dyn_cast<VPInstruction>(U);
-        I && I->getOpcode() == VPInstruction::CanonicalIVIncrement)
-      continue;
-    if (isa<VPCanonicalIVPHIRecipe>(U))
-      continue;
-    auto *UI = dyn_cast<VPRecipeBase>(U);
-    if (!UI)
-      continue;
-    for (unsigned Idx = 0, E = UI->getNumOperands(); Idx < E; ++Idx)
-      if (UI->getOperand(Idx) == CanonicalIVPHI)
-        UI->setOperand(Idx, EVLPhi);
-  }
-  // Cleanup dead recipes after the transformation.
-  removeDeadRecipes(Plan);
+  // VPInstruction::CanonicalIVIncrement.
+  CanonicalIVPHI->replaceAllUsesWith(EVLPhi);
+  CanonicalIVIncrement->setOperand(0, CanonicalIVPHI);
1093 ↗(On Diff #557762)

This shouldn't introduce any new dead recipes, unless the canonical IV is only used to control the exit, so IIUC the only dead recipes introduced here should be due to replacePredicateWithIdiom.

May be good to either clean up the recipes there or move addVectorPredication after optimizeInductions, but that would require passing in an extra flag to optimize(), so maybe better to clean up the dead recipes directly after removing their users.

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
83 ↗(On Diff #557762)

nit: not all users, all users except the canonical IV increment, right?

87 ↗(On Diff #557762)

nit: addExplicitVectorLength?

ABataev updated this revision to Diff 557781.Oct 19 2023, 8:30 AM

Rebase, address comments

fhahn added a comment.Oct 20 2023, 3:03 PM

Thanks for the update. Some more comments inline. Mostly small suggestions, but there's one question if masked mem ops are handled correctly and a clarification about active vs effective vector length.

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
84 ↗(On Diff #547860)

Just following up on this, should this the name be changed in TTI? Do you know the reason for referring to it has active vector length there vs effective vector length in the patch?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9607–9608

Is the gather scatter case handled correctly for EVL at the moment?

llvm/lib/Transforms/Vectorize/VPlan.h
2140 ↗(On Diff #557781)

Might be good to explicit say that it starts at 0 and gets incremented by EVL in each iteration of the vector loop?

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
338 ↗(On Diff #557781)

nit: Compute EVL. would be more accurate IIUC, nothing gets set in this function AFACIT.

371 ↗(On Diff #557781)

might be good to have a test case where the induction is i32 and no cast is needed.

1730 ↗(On Diff #557781)

evl.based.iv?

1739 ↗(On Diff #557781)

This still needs updating I think

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
991 ↗(On Diff #557781)

VPWidenCanonicalIVRecipe should inherit from VPValue via VPHeaderPHIRecipe, so it should only define a single value and WideCanonicalIV->getNumUsers() should be sufficient.

1028 ↗(On Diff #557781)

needs updating with new name

1030 ↗(On Diff #557781)

needs updating with new name

1062 ↗(On Diff #557781)

name needs updating

1080 ↗(On Diff #557781)

name needs updating

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
84 ↗(On Diff #557781)

VPExplicitVectorLengthPHIRecipe needs updating with new name

85 ↗(On Diff #557781)

maybe only used to control the loop or something like that

llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
210 ↗(On Diff #557781)

Plan here should never be nullptr

ABataev marked 13 inline comments as done.Oct 23 2023, 8:03 AM
ABataev added inline comments.
llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
84 ↗(On Diff #547860)

We can do it later in a separate patch. EVL stands for Explicit vector length, TTI interface was introduced long before this patch. There are just different abbrevs for the same technique - Active Vector Length, Explicit Vector length, etc.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9607–9608

Added support for this.

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
371 ↗(On Diff #557781)

Added @iv32 function in the test

llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
210 ↗(On Diff #557781)

It can be nullptr in the unit tests, had to add this to avoid crashes in unit tests.

ABataev updated this revision to Diff 557842.Oct 23 2023, 8:11 AM
ABataev marked 4 inline comments as done.

Rebase, address comments

fhahn added inline comments.Oct 26 2023, 11:46 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9607–9608

Not sure if we also have a test case for this path, do you know if this would be handled correctly at the moment?

9607–9608

Great thanks! Now that there is VP intrinsic handling in multiple places, would it be better to handle all EVL related codegen together, i.e. something like below to avoid complicating reading the existing non-EVL code. WDYT?

for ()
   Value *VectorGep = State.get(getAddr(), Part);

   if (Value *EVLPart = State.EVL ? State.get(State.EVL, Part) : nullptr) {
      NewSI = lowerUsingVectorIntrinsics(vectorGEP..)
    } else {
      existing code....
    }
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
1089 ↗(On Diff #557842)

Thinking a bit more about this, at the moment is only safe to replace the with TrueMask for VPWidenMemoryInstructionRecipes, for other recipes that use the mask it would be a potential mis-compile, correct? If not, then it would be good to have an assert that all users of the mask we replace are VPWidenMemoryInstructionRecipes. Might be good to also have a test case for such a scenario

llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
210 ↗(On Diff #557781)

Ah I see. Do you happen to know which unit tests? I suspect they need fixing and also checking why this isn't already caught by verification.

ABataev updated this revision to Diff 557904.Oct 26 2023, 1:29 PM

Rebase, address comments

ABataev updated this revision to Diff 557905.Oct 26 2023, 1:33 PM

Fixed reversed loads/stores

fhahn added inline comments.Oct 27 2023, 6:51 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9607–9608

can leave unchanged?

9607–9608

can leave unchanged?

9607–9608

can leave unchanged now?

9608–9609

can leave unchanged?

9613

Could you elaborate what better means here? Might have missed it, does the current code handle reverse?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
1089 ↗(On Diff #557842)

Just do double-check as it looks like the above may have been missed in the latest update, WDYT?

ABataev added inline comments.Oct 27 2023, 6:55 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9613

I disabled reverse support, see useVPWithVPEVLVectorization(), need support for vp_reverse intrinsic, which is not added yet.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
1089 ↗(On Diff #557842)

Yes, missed it, will fix

ABataev updated this revision to Diff 557913.Oct 27 2023, 7:47 AM

Rebase, address comments

ABataev marked 4 inline comments as done.Oct 27 2023, 7:47 AM
fhahn added a comment.Nov 6 2023, 8:10 AM

Thanks for the updates! I think all correctness issues should now have been addressed AFAICT, some minor comments left inline

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
9607–9608

Indent looks off here

9613

nit: drop Better here, as reverse loading isn't supported at all a the moment. Would be good to assert here that the load isn't reversed

9613

nit: Drop better here, as reverse storing isn't supported at all at the moment. Would be good to also assert.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
1020 ↗(On Diff #557913)

Would it be simpler just have a common function for finding all the header predicates and then have the code to update the users at the call sites?

1023 ↗(On Diff #557913)

Added VPValue::replaceUsesWithIf in a002271972fb3fb2877bdb4abf9275b2c1291036 as there were already multiple places hand-rolling that functionality.

ABataev marked 5 inline comments as done.Nov 7 2023, 6:26 AM
ABataev added inline comments.
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
1020 ↗(On Diff #557913)

Tried it, does not look better. Need to remove WideCanonicalIV, it means need it to find it again after the procedure, this leads to code duplication and size increase. Replace parameter with the predicate function instead.

ABataev updated this revision to Diff 558035.Nov 7 2023, 6:27 AM
ABataev marked an inline comment as done.

Rebase, address comments

shiva0217 added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1650

If the loop contains reduction variables, there might need a mask to merge the last two iteration results.

int a[128];
int foo (int end) {
  int size = 0;
  for (int i = 0; i < end; i++)
    size += a[i];
  return size;
}

Should the case be guarded by Legal->getReductionVars().empty() && ?

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
1743 ↗(On Diff #558035)

There is a case that the PHI didnt' been inserted at top of basic block.

int foo (int value, int *buf, int *end) {
  int *tmp;
  for (tmp = buf; tmp < end; tmp++)
    value -= *tmp;
  return value;
}

Should we specify insertion point?
Something like:

PHINode *EntryPart = PHINode::Create(
    Start->getType(), 2, "evl.based.iv", &*State.CFG.PrevBB->getFirstInsertionPt());
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
986 ↗(On Diff #558035)

There is a case that VPWidenCanonicalIVRecipe didn't be generated with tail folding.

int i;
int foo (int q, int z)
{
  int e = 0;
  while (z < 1)
    {
      q = z * 2;
      if (q != 0)
        for (i = 0; i < 2; ++i)
          e = 5;
      ++z;
    }
  return e;
}

for (i = 0; i < 2; ++i) been simplifed as store i32 2, ptr @i.
Both pointer and store value are loop-invariant, so the mask(VPWidenCanonicalIVRecipe) might not be generated.
Should we suppress the replacement when the mask is not available?

ABataev updated this revision to Diff 558054.Nov 8 2023, 6:36 AM
ABataev marked 3 inline comments as done.

Rebase, address comments

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1650

Added

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
1743 ↗(On Diff #558035)

Fixed in VPlanTransforms.cpp by inserting the recipe immediately after CanonicalIVPHI.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
986 ↗(On Diff #558035)

Fixed, added the test

ABataev updated this revision to Diff 558097.Nov 14 2023, 6:49 AM

Rebase, ping!

fhahn added inline comments.Nov 15 2023, 6:43 AM
llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
1743 ↗(On Diff #558035)

I think VPEVLBasedIVPHIRecipe should be turned into a subclass of VPHeaderPHIRecipe, this will also ensure that the VPlan verifier checks it is in the header section

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
986 ↗(On Diff #558035)

Was this fixed by adding the bool KeepVPCanonicalWidenRecipes flag? What's the test case for this? There's a new no_masking case, but it has an empty body and no vector code is generated?

llvm/test/Transforms/LoopVectorize/RISCV/vectorize-vp-intrinsics.ll
5 ↗(On Diff #558097)

no need to redirect stderr here?

7 ↗(On Diff #558097)

Can this configuration be used for target-independent tests?

385 ↗(On Diff #558097)

This test file is getting quite big with 3 different run lines. I think it would be good to try to split this up a bit, to make it easier to see what's going on.

I'd recommend having the test cases for various legality issues as target-independent tests with force flags (force EVL support, VF and IC). And keep cost-model specific tests target specific.

llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics.ll
3 ↗(On Diff #558097)

Can this test be target independent? does it need to check the no VP case?

ABataev added inline comments.Nov 15 2023, 6:55 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
986 ↗(On Diff #558035)
  1. Yes.
  2. Yes, this test is a reduced version of the failed case
llvm/test/Transforms/LoopVectorize/RISCV/vectorize-vp-intrinsics.ll
5 ↗(On Diff #558097)

Will drop

7 ↗(On Diff #558097)

Not now, it relies on the check of the TTI interface for now

llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics.ll
3 ↗(On Diff #558097)
  1. No
  2. Yes, need to check that the option works correctly
ABataev added inline comments.Nov 16 2023, 8:47 AM
llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
1743 ↗(On Diff #558035)

You mean it should be dervied from VPEVLBasedIVPHIRecipe? Already done.

ABataev updated this revision to Diff 558115.Nov 16 2023, 1:54 PM
ABataev marked 5 inline comments as done.

Rebase, address comments

llvm/test/Transforms/LoopVectorize/RISCV/vectorize-vp-intrinsics.ll
7 ↗(On Diff #558097)

Added several target independent tests

385 ↗(On Diff #558097)

Done

llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics.ll
3 ↗(On Diff #558097)

Added also target independent version

fhahn added inline comments.Nov 17 2023, 12:51 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
986 ↗(On Diff #558035)

Is it possible it is over-reduced? The same IR seems to be generated both with and without KeepVPCanonicalWidenRecipes IIUC, because the loop is not vectorized due to being empty.

What's the issue if there's no mask/canonical widen recipe? Wouldn't it be fine to jus not replace anything?

fhahn added inline comments.Nov 17 2023, 12:53 PM
llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
1743 ↗(On Diff #558035)

Great!

ABataev added inline comments.Nov 17 2023, 12:56 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
986 ↗(On Diff #558035)

When I checked it, it crashed without this parameter, maybe there were some other changes.

fhahn added inline comments.Nov 17 2023, 1:20 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
986 ↗(On Diff #558035)

Can you check if it still crashes? Would be good to understand exactly what the issue is, and if possible avoid having a separate KeepVPCanonicalWidenRecipes flag

shiva0217 added inline comments.Nov 19 2023, 11:57 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
986 ↗(On Diff #558035)

KeepVPCanonicalWidenRecipes might be motivated by the case that VPWidenCanonicalIVRecipe once exist but been optimized out to the (VPWidenIntOrFpInductionRecipe IV ule trip-count) which let replaceHeaderPredicateWithIdiom fail to replace (VPWidenCanonicalIV ule trip-count) to all-true mask.

void foo (char *a) {
  for (int i = 0; i < 256; i++)
    if (i != '\n')
      a[i] = 0;
}
ABataev updated this revision to Diff 558136.Nov 20 2023, 9:33 AM

Rebase
Removed KeepVPCanonicalWidenRecipes parameter since there is check for VPWidenCanonicalIVRecipe presence in replaceHeaderPredicateWithIdiom

ABataev updated this revision to Diff 558174.Nov 27 2023, 5:01 AM

Rebase, ping!

fhahn added inline comments.Nov 27 2023, 12:43 PM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
986 ↗(On Diff #558035)

VPWidenCanonicalIV should only be replaced by something else if here's user that accesses the vector values of it I think. Could you share an IR test case that would show an issue? It is still not clear to me what the exact issue would be.

@ABataev just to double check, the latest version shouldn't have any issues with @shiva0217's test case, correct?

fhahn added a comment.Nov 27 2023, 1:01 PM

I think it may be also good to run clang-format again on the latest version of the patch if possible

I think it may be also good to run clang-format again on the latest version of the patch if possible

I have a special hook for formatting checking, it does not report any issues about formatting

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
986 ↗(On Diff #558035)

Yes, I added reduced test cases to this patch already and they do not crash

ABataev updated this revision to Diff 558199.Nov 30 2023, 4:01 PM

Rebase, fixes.

Ayal added a comment.Dec 12 2023, 2:49 PM

Went over parts of this patch again, have other parts yet to go over again. Status of various past comments should be clarified.

llvm/include/llvm/IR/IRBuilder.h
2576–2581 ↗(On Diff #558054)

This is used by VectorBuilder, committed in 9959cdb66a02b, some 1800 lines above.

838 ↗(On Diff #558136)

Remove getTrueVector() below, given getAllOnesMask() here.

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
79 ↗(On Diff #512838)

Documentation of "hasActiveVectorLength()" indicator here should better explain the intent. E.g., whether the target supports Active Vector Length intrinsics for given \p Opcode, \p DataType and \p Alignment.

82 ↗(On Diff #512838)

Looks like it to me, specifically to its 3rd TTI related bullet, but @ABataev should confirm.

84 ↗(On Diff #547860)

Better use a single term, consistently, than have different abbreviations for the same thing. To clarify:
VF: a constant number-of-elements provided within the Type, either a compile-time constant or a runtime constant, aka Fixed or Scalable, respectively.
mask: a boolean vector of type <VF x i1> used by all vector intrinsics - both llvm.masked.loads/stores/gather/scatter and llvm.vp.*.

Let FirstVF denote the number-of-elements that can be processed in the first vector iteration. If all iterations can process FirstVF elements - because we know the trip count of the loop N is a multiple of FirstVF or because we leave a tail of remaining leftover iterations to be processed by a subsequent loop, then there is no need for any "Active/Explicit Vector Length" support.

Otherwise, when folding the tail - the last vector iteration operates on LastVF = N - IV <= FirstVF elements where IV is the canonical induction variable of the loop. Combining LastVF with VF for all but last iteration yields IterVF = min(N - IV, VF) which varies per iteration. Conceptually, the vector type inside a tail-folded loop should use IterVF as its number-of-elements rather than VF, if types would support non-constant VF's. Instead, the type continues to use a conservative constant VF as a maximal/full number-of-elements across all loop iterations, and the excessive lanes of last iteration are masked-out by computing a vector tail_mask using a compare: <VF x i1> %tail_mask = icmp ule <IV,IV+1,...,IV+VF-1>, <N,N,...,N> or an intrinsic <VF x i1> %tail_mask = llvm.get.active.lane.mask_VF(IV, N) which captures this comparison w/o relying on vectorizing or broadcasting IV and N explicitly. This tail_mask is then folded into mask &= %tail_mask.

The term active seems to refer to the dynamic, changing nature of the mask. But the name get.active.lane.mask is confusing - every mask by definition is there to indicate which lanes are active. Perhaps a more accurate name would be get.variable.vf.mask. Another option could be get.prefix.mask_VF(N-IV) which simply produces a mask whose 'on' bits are the first N-IV bits, but there's an overflow case to consider.

Now the evl support mentioned here, is closely tied to RVV's setvl, and involves two aspects: (1) providing a variable VF as a separate scalar operand to vector intrinsics - only the llvm.vp.* ones - alongside the mask operand rather than inside it; and (2) obtaining the variable VF for a given iteration by some target-dependent computation which may differ from min(N - IV, VF) for the last two iterations.

The term explicit seems to refer to (1), in contrast to implicitly embedding the variable-VF inside the constant-VF via mask. Perhaps a more accurate name would be get.variable.vf, analogous to the above suggestion but providing the scalar variable VF rather than a mask thereof.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
297–303

(zlxu-rivai) IMHO, the vector length predication could be regarded as an another form of tail-folding, but not like ARM SVE which is based on active.lane.mask but based on an integer as vector length.

Based on this notion of unification, it would be better to augment the enum TailFoldingStyle, for example, add new entries like TailFoldingStyle::DataWithEVL (serving the purpose of EVLOption::IfEVLSupport) and the existent TailFoldingStyle::None can be reused as EVLOption::NoPredication.

(ABataev) Not necessarily. Generally speaking, we would like to support vectorized loop remainder in the future as one of the alternative solutions, so better to keep them separate.

I agree with zlxu-rivai, this patch deals with yet another Style of tail folding, as in the proposed DataWithEVL. If/when EVL is desired w/o tail folding, a separate patch with potential options should be discussed.

308

Looks like it. Tail folding is "enabled", "forced", or absent.

1206

Better check if CurRec is a header PHI recipe?

1630

Reiterating the suggestion to use VPI to denote llvm.vp.* vector predicate intrinsicts, if/where needed, and "foldTailByEVL()" for deciding to use EVL style tail-folding.

1648

nit: move this TODO into a FIXME next to checking isSafeForAnyVectorWidth as other cases?

1774

+1
Suffice to say PreferEVL, as that implies VPI.

5428

Shouldn't/Doesn't ForceEVLSupport imply useVPWithVPEVLVectorization()? Other cases of using EVL can scalarize-and-predicate loads and stores? Early-return earlier.

5837

The logic of this method is getting excessively complicated.
Suggest to turn this into an early-exiting if (!Legal->prepareToFoldTailByMasking()).

5842

Is there a test?
Message intends to say that preference was indicated but ignored, i.e., tail will be folded (with UF>1) but w/o VP intrinsics - w/ or w/o EVL?

5849

ScalableVF is surely scalable, assert rather than ask?
isNonZero or isVector?

5852

This method already has an undesired side-effect of setting CanFoldTailByMasking independent of returning MaxFactors, better avoid having it set yet another side-effect PreferVPWithVPEVLIntrinsics.
Perhaps extend CanFoldTailByMasking from a bool to also indicate how the tail is folded, and/or extend the return value to carry more than a FixedScalableVFPair?

Logic can be simplified into

PreferVPWithVPEVLIntrinsics = PreferPredicateWithVPEVLIntrinsics == EVLOption::IfEVLSupported ||
  TTI.hasActiveVectorLength(0, nullptr, Align());

the rest is debug prints placed under !NDEBUG, or a single LLVM_DEBUG with some ternary ? selecting if the preference indicated is followed or ignored.

5858

"if the target support vector length predication" - we already checked that it does?

5869

This deserves a comment, explaining that a tail folded using VP intrinsics restricts the VF to be scalable.

9252–9253

What's the relation between useVPWithVPEVLVectorization and useActiveLaneMask? Should the latter cover the former, so that it suffices to check if (useActiveLaneMask(Style)), or is it meaningful to have both true?

9601

Reason explained above: execute() of recipes should be straightforward.
This is one of the main guidelines outlined in the VPlan roadmap.
This recipe is getting too complicated, should probably separate gather/scatter from wide load/store, and separate the pointer setting (as in [VPlan] Model address separately. #72164), independent of this patch.
Recipe should indicate statically if EVL is used or not, to simplify code-gen and facilitate cost estimation, rather than having to check State.EVL during execute().
If multiple recipes share some common core, it can be shared via a common base class, as in VPHeaderPHIRecipe and VPRecipeWithIRFlags.

9602

These two functions each have a single caller, better defined as lambdas next to them?
Simpler to separate into two separate store/scatter functions?
Remove EVLPart because EVL currently works w/o unrolling, introduce it in the future as part of enabling EVL with unrolling?

9607–9608

Better simply check if (State.EVL) and then State.get() it, or could the latter return null? Raised as a nit below.

Better avoid checking if State.EVL altogether during VPlan execution, as noted above.

9614

Moving tail folding to a transform, following VPlan roadmap, is already quite an endeavor, which would hopefully better facilitate this patch. Better avoid complicating it further.

llvm/lib/Transforms/Vectorize/VPlan.h
2155 ↗(On Diff #557696)

Is a dedicated recipe needed, or would some general header-phi recipe/VPInstruction suffice?

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
371 ↗(On Diff #557781)

Have Phi and EVL be of the same type instead of needs to cast the latter to the former?

Is a dedicated VPInstruction needed, or would a general Add suffice, similar to [VPlan] Initial modeling of VF * UF as VPValue. #74761?

349 ↗(On Diff #558136)

This patch enables EVL for scalable VF's only, so best avoid pretending otherwise (w/o testing) and pass true for now. Extend to VF.isScalable when EVL is extended to handle fixed VF's - along with suitable tests.

357 ↗(On Diff #558136)
359 ↗(On Diff #558136)

to match RISC-V vsetvli terminology. Avoid AVL which also stands for Active Vector Length...

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
832 ↗(On Diff #555150)

As @nikolaypanchenko clarified in https://reviews.llvm.org/D99750#inline-1521321

Hence the trip-count can be computed by taking the ceiling of TC divided by the EVL of first vector iteration, which can be computed in the pre-header. But if the loop iterates until an index reaches an upper bound, where the index is repeatedly bumped by a non-invariant EVL, then the index is not an Induction Variable and the upper bound does not qualify as a (scaled) trip-count, countability of the loop is hidden away.

llvm/test/Transforms/LoopVectorize/RISCV/vplan-vp-intrinsics.ll
19 ↗(On Diff #558054)

UF>=1 or UF=1?

31 ↗(On Diff #558054)

scalar-steps is needed only for UF>1, as this feeds scalar/cloned geps only?

33 ↗(On Diff #558054)

Widened loads and store here should say that they use EVL, implicitly.

39 ↗(On Diff #558054)

Can this simply be an ADD of vp<[[EVL_PHI]]> and vp<[[EVL]]>?

41 ↗(On Diff #558054)

IF-EVL and FORCE-EVL are the same, check them together?

82 ↗(On Diff #558054)

MASK is unused?

116 ↗(On Diff #558054)

safe dep relies on VF*UF < 100, where VF relies on -riscv-v-vector-bits-max=128, vscale x 4 is omitted from VF, and UF>=1 is restricted?

llvm/test/Transforms/LoopVectorize/X86/vectorize-vp-intrinsics.ll
19–20 ↗(On Diff #558054)

nit: can remove CHECKs for this redundant block which never jumps to the dead scalar loop.

50–67 ↗(On Diff #558054)

nit: can remove the CHECKs for the dead scalar loop.

1 ↗(On Diff #522575)

The three runs produce the same results, can combine their CHECKs.

llvm/test/Transforms/LoopVectorize/X86/vplan-vp-intrinsics.ll
19–43 ↗(On Diff #558054)

All three runs {IF-EVL-NEXT, FORCE-EVL, NO-VP} result in the same VPlan; combine their checks together?

llvm/test/Transforms/LoopVectorize/vectorize-vp-intrinsics.ll
18 ↗(On Diff #558136)

IF-EVL can share checks with NO-VP?

llvm/test/Transforms/LoopVectorize/vplan-vp-intrinsics.ll
4 ↗(On Diff #558136)

This is IF-EVL, but its checks coincide with those of NO-VP? Perhaps use some combined CHECK

ABataev updated this revision to Diff 558247.Thu, Dec 21, 10:11 AM
ABataev marked 34 inline comments as done.

Rebase, address comments.
Moving it to github, since Phab does not send e-mails anymore.