Page MenuHomePhabricator

[LoopVectorize] Add option to use active lane mask for loop control flow
ClosedPublic

Authored by david-arm on May 10 2022, 3:40 AM.

Details

Summary

Currently, for vectorised loops that use the get.active.lane.mask
intrinsic we only use the mask for predicated vector operations,
such as masked loads and stores, etc. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, for some targets this is inefficient when it's
cheap to use the mask itself to control the loop.

This patch adds support for using the active lane mask for control
flow by:

  1. Generating the active lane mask for the next iteration of the

vector loop, rather than the current one. If there are still any
remaining iterations then at least the first bit of the mask will
be set.

  1. Extract the first bit of this mask and use this bit for the

conditional branch.

I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
fhahn added inline comments.Jun 7 2022, 2:18 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

It looks like the update for the phi doesn't depend on the phi, but only on trip count & main induction.

I might be missing something, but is the phi actually needed or would it be possible to compute the active lane mask at the beginning of each iteration instead of at the end of the iteration?

llvm/lib/Transforms/Vectorize/VPlan.cpp
1553

The start value should be created in the preheader block in VPlan instead of in the phi node here. There there should also be no need for a TC argument?

llvm/lib/Transforms/Vectorize/VPlan.h
764

Can BranchOnCond be used instead of the dedicated BranchOnactiveLaneMask?

david-arm marked an inline comment as done.Jun 8 2022, 1:08 AM
david-arm added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

Hi @fhahn, perhaps I've misunderstood your question here, but actually the point of this patch is to do the exact opposite of that, i.e. we *don't* want to generate the active lane mask for the current iteration - we want to create the mask for the next iteration. This requires a PHI to carry the live value around the loop and is the only way to use the mask for control flow because at the point of branching we want to know if there are any active lanes in the mask for the *next* iteration.

With particular reference to SVE, the motivation for this work is to use the 'whilelo' instruction to both generate the lane mask and set the flags to branch on. Effectively, the whilelo instruction is doing the comparison already, which makes the traditional scalar IV comparison redundant.

I could be wrong, but I believe this form of vectorised loop would be beneficial for some other targets with a predicated instruction set too, such as RISC-V.

llvm/lib/Transforms/Vectorize/VPlan.h
764

I created this patch a month ago, which predated your BranchOnCond work. That's why I haven't used it. I can certainly look into this and see if it's possible though?

david-arm added inline comments.Jun 10 2022, 12:51 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
1553

Hi @fhahn, I'm not really sure what you mean here. Do you mean that LoopVectorize.cpp should create a VPInstruction with type ActiveLaneMask that lives in the VPlan's preheader block? If so, then VPInstruction::ActiveLaneMask needs a trip count operand.

Or do you mean I should deviate from the expected recipe creation somehow in a VPlan function such as prepareToExecute by manually adding the incoming value?

fhahn added inline comments.Jun 13 2022, 1:05 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

With particular reference to SVE, the motivation for this work is to use the 'whilelo' instruction to both generate the lane mask and set the flags to branch on.

Oh right, I missed this in the test case I was looking at! I originally thought the intention of the new phi recipe was to encode some extra guarantees/information like we do for inductions or reductions.

From the latest comment, it sounds like there would be no need to have a new recipe class for the phi, but perhaps VPWidenPHIRecipe could be used instead, if the setup can be moved to the pre-header.

llvm/lib/Transforms/Vectorize/VPlan.cpp
1553

Do you mean that LoopVectorize.cpp should create a VPInstruction with type ActiveLaneMask that lives in the VPlan's preheader block?

Yes exactly.

If so, then VPInstruction::ActiveLaneMask needs a trip count operand.

Looking at VPInstruction::generateInstruction, it looks like it already takes a trip count operand? The first operand can be set to zero by wrapping a zero constant in a VPValue.

case VPInstruction::ActiveLaneMask: {
  // Get first lane of vector induction variable.
  Value *VIVElem0 = State.get(getOperand(0), VPIteration(Part, 0));
  // Get the original loop tripcount.
  Value *ScalarTC = State.get(getOperand(1), Part);
david-arm added inline comments.Jun 13 2022, 1:12 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

So I actually tried doing exactly this initially, but I think that VPWidenPHIRecipe requires an underlying scalar instruction to widen due to the execute function that lives in LoopVectorize.cpp:

void VPWidenPHIRecipe::execute(VPTransformState &State) {
  State.ILV->widenPHIInstruction(cast<PHINode>(getUnderlyingValue()), this,
                                 State);
}

but the PHI node does not exist in the original scalar loop. It's not currently possible to widen a PHI that didn't previously exist, which means I would have to modify VPWidenPHIRecipe to test for the existence of an underlying value and take different paths accordingly.

david-arm updated this revision to Diff 436733.Jun 14 2022, 4:04 AM
  • Removed the creation of the initial lane mask values from ActiveLaneMaskPHIRecipe::execute. We now create the appropriate set of VPInstructions in the vplan preheader block and feed these into the ActiveLaneMaskPHI recipe.
david-arm marked 6 inline comments as done.Jun 14 2022, 4:14 AM
david-arm added inline comments.
llvm/lib/Transforms/Vectorize/VPlan.h
764

So I did look into this. In order to do it this way I have to explicitly generate the Not and ExtractElement operations using VPInstructions, which requires a new VPInstruction::ExtractElement type. It's possible to do this, but then I wasn't sure about the semantics of this new instruction. When passing in a scalar constant of 0 for the lane, it gets widened to something like <vscale x 4 x i32> zeroinitializer for every part. However, I only need a single lane so I'd have to do something like:

case VPInstruction::ExtractElement: {
  Value *Vec = State.get(getOperand(0), Part);
  Value *Lane = State.get(getOperand(1), VPIteration(0, 0));
  Value *V = Builder.CreateExtractElement(Vec, Lane);
  State.set(this, V, Part);
  break;
}

It feels quite inefficient to go to all the effort of widening, only to discard everything!

If you still prefer me to proceed with this approach I'm happy to try if you can provide your thoughts on what the new ExtractElement operation should look like?

fhahn added inline comments.Jun 19 2022, 2:21 PM
llvm/lib/Transforms/Vectorize/VPlan.h
764

So I did look into this. In order to do it this way I have to explicitly generate the Not and ExtractElement operations using VPInstructions, which requires a new VPInstruction::ExtractElement type.

Extracts are not modeled explicitly at the moment and usually State.get will take care of interesting an extract when requesting scalar lanes if it is needed. I *think* when using a VPInstruction::not as operand for BranchOnCond, State.get should insert the extract for the first lane, as this is what BranchOnCond uses.

fhahn added inline comments.Jun 19 2022, 2:27 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

So I actually tried doing exactly this initially, but I think that VPWidenPHIRecipe requires an underlying scalar instruction to widen due to the execute function that lives in LoopVectorize.cpp:

Yeah this was a bit unfortunate! I landed 2 patches that removed the unnecessary dependence on the underlying instruction by instead using the type of its operand.

david-arm added inline comments.Jun 20 2022, 12:56 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

OK well thanks a lot for tidying that class up! The only problem is that even after your patches landed VPWidenPhiRecipe::execute only ever deals with Part 0, whereas we need a Phi for every Part. So unfortunately I still can't use the class in it's current state. If you have suggestions about how to fix this I'm happy to take another look? Perhaps I can add a boolean flag to the VPWidenPhiRecipe constructor to indicate how many parts to generate?

llvm/lib/Transforms/Vectorize/VPlan.h
764

OK I'll take another look and give this a try!

david-arm updated this revision to Diff 438346.Jun 20 2022, 5:14 AM
  • Removed BranchOnActiveLaneMask and now use BranchOnCond instead
david-arm edited the summary of this revision. (Show Details)Jun 20 2022, 5:15 AM
david-arm marked an inline comment as done.

Out of interest are the test output produced manually or via an update script. I ask because I cannot see any of the normal headers and am wondering why.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8635–8637

This comment is now out of date. Perhaps it's best to replace it with something more generic and have the implementation specific part nearer the code?

8673–8674

Please ignore if too awkward but I think the output IR will be more readable if we have tighter control over the name used for EntryALM and ALM. Currently the output is just littered with active.lane.mask### making it harder to understand the data flow.

If the entry masks were called active.lane.mask.entry###, next iteration masks called active.lane.mask.next### and the PHIs keeping their current active.lane.mask### name, the loops would be easier to read.

8697–8698

This seems a bit wasteful given it's only the first lane we care about. perhaps VPInstruction::BranchOnCount could take an invert flag that switches the branch destinations?

llvm/lib/Transforms/Vectorize/VPlan.cpp
795–797

I know it'll get folded away by later passes but would it be wrong to handling the Part == 0 case here? so that we don't litter the loop vectorize tests with pointless adds of zero. I see CanonicalIVIncrement doing similar for the non zero parts so I'll take that as precedent for doing such common sense early optimisation.

llvm/test/Transforms/LoopVectorize/AArch64/scalable-reductions-tf.ll
7

Given the structure has changed, is it worth showing how the entry active lane mask is constructed?

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
4

What's this protecting?

david-arm updated this revision to Diff 440927.Jun 29 2022, 4:03 AM
  • Fixed some incorrect comments.
  • Added more descriptive names for ActiveLaneMask VPInstructions.
  • Minor improvement in code quality for some canonical IV increments.
fhahn added inline comments.Jun 29 2022, 4:04 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

Hm, let me double check!

david-arm marked 6 inline comments as done.Jun 29 2022, 4:06 AM

Out of interest are the test output produced manually or via an update script. I ask because I cannot see any of the normal headers and am wondering why.

So we've been using the update_test_checks.py to generate the CHECK lines, but then manually deleting CHECK lines after the middle.block since they were unnecessary for the tests. That reduced the amount of noise, but it does mean we can't really have the NOTE at the top of the tests to say "CHECK lines generated by update_test_checks.py".

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8697–8698

It's not as trivial as you'd think passing in flags to VPInstructions. :( You have to create an IR Constant, pass it in as a 3rd VPValue argument to a VPInstruction and mark it as IR "live value in" operand. We do end up with nice code after InstCombine, since it swaps the labels around for us anyway.

If you really prefer me to go down this route I can - it's just a little awkward that's all.

paulwalker-arm accepted this revision.Jun 30 2022, 5:58 AM

A few comment related issues but otherwise looks good to me.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8649

remove the "it"?

8697–8698

No need if this is how the interface works. Perhaps VPIntruction could do with some domain specific flags, which might also simplify the NUW split. Either way it's not directly relevant to this patch.

llvm/lib/Transforms/Vectorize/VPlan.cpp
797–798

This looks to be copied from case VPInstruction::CanonicalIVIncrement but it not really correct within the context of this opcode.

llvm/lib/Transforms/Vectorize/VPlan.h
2484

Could do with a comment.

This revision is now accepted and ready to land.Jun 30 2022, 5:58 AM
fhahn added a reviewer: Ayal.Jun 30 2022, 7:35 AM
fhahn added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
536

Looks like this comment needs updating, as the interface is more fine grained now?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

Ok it looks like the only reason is that VPWidenPHIRecipe is only used in plans for outer-loop vectorization and there interleaving isn't supported.

So I think it should be fine to just extend it to generate phis for all parts. It would be beneficial to ensure we use generic recipes where possible to avoid duplication and confusion. This will also help with further unifying the split between native/inner loop planning.

llvm/lib/Transforms/Vectorize/VPlan.cpp
1548

Can Builder.CreatePhi be used without the need to manually pick the insertion point?

llvm/lib/Transforms/Vectorize/VPlan.h
773

This should be documented more clearly. At the moment this is used as the name of the generated IR instruction? Also, this is only used for VPInstruction::ActiveLaneMask which is a bit surprising.

2564

Do we need to keep this as member variable? I think at the moment we generally avoid having links to recipes here, as this requires transforms that may remove or replace recipes to be aware of this state (see the similar getCanonicalIV)

It should be trivial to find the active lane mask phi recipe, if there is one (just look at the phi recipes of the header)

david-arm marked an inline comment as done.Jun 30 2022, 8:56 AM
david-arm added inline comments.
llvm/lib/Transforms/Vectorize/VPlan.cpp
1549

Hi @fhahn, if I use VPWidenPHIRecipe instead it's not obvious how I add the incoming start mask. It looks like VPWidenPHIRecipe::execute doesn't add it - do you know how?

fhahn added inline comments.Jun 30 2022, 12:40 PM
llvm/lib/Transforms/Vectorize/VPlan.cpp
1549

I think D128937 should do the trick, together with the snippet below. If D128937 allows VPWidenPHIRecipe to be used here I'll send it for review.

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index d8fb7a9c12b0..1ba76bf52cc6 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3646,8 +3646,7 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
     truncateToMinimalBitwidths(State);

   // Fix widened non-induction PHIs by setting up the PHI operands.
-  if (EnableVPlanNativePath)
-    fixNonInductionPHIs(Plan, State);
+  fixNonInductionPHIs(Plan, State);

   // At this point every instruction in the original loop is widened to a
   // vector form. Now we need to fix the recurrences in the loop. These PHI
david-arm updated this revision to Diff 441660.Jul 1 2022, 3:59 AM
  • Changed getActiveLaneMaskPHI to look up the recipe from the header, and removed the member variable.
  • Improved comments in code.
  • Split out the VPInstruction Name parameter work into a separate NFC patch - D128982.
david-arm marked 8 inline comments as done.Jul 1 2022, 4:01 AM
david-arm added inline comments.Jul 4 2022, 6:20 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
1549

Hi @fhahn, if you're happy with everything else in the patch, would you be ok with submitting this patch as it is now, then look into using VPWidenPHIRecipe as a follow-on piece of refactoring? We're quite keen to get this work submitted soonish so we can get sufficient testing in the run-up to the LLVM 15 release. We also have quite a few other pieces of work on-going that depend upon this too.

fhahn added inline comments.Jul 5 2022, 6:03 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
718

nit: stray whitespace.

927

Is this covered by a test? The comment and logic above applies specifically to BranchOnCount. It is not entirely clear to me that it would be safe for any BranchOnCond?

1549

Ok sounds good to me, could you please add a FIXME/TODO to the code in the meantime?

We also have quite a few other pieces of work on-going that depend upon this too.

If other stuff depends on the patch it might be helpful to link them together as patch stack, to get a better idea on the wider impact.

1559

Should this also print the operands here?

llvm/lib/Transforms/Vectorize/VPlan.h
2562

What about multiple VPActiveLaneMasks? Should the verifier check to make sure there's only a single one?

It would probably also be good to place this after getCanonicalIV, which seems related.

llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
193

This still references BranchOnActiveLaneMask?

david-arm added inline comments.Jul 6 2022, 5:48 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
927

It is covered by sve-low-trip-count.ll, since we create BranchOnCond using the active lane mask. I did this after you suggested in earlier review comments to use BranchOnCond instead of BranchOnActiveLaneMask because we don't want to regress for low trip counts. We also don't have access to the TripCount in addCanonicalIVRecipes, so I can't use that to determine what type of Branch instruction to use.

It's a shame that there is no easy way to add a flag to the VPInstruction to mark it as safe for optimisation here. For example, it would be nice to do something like

(Term->getOpcode() == VPInstruction::BranchOnCount ||
 (Term->getOpcode() == VPInstruction::BranchOnCond && Term->isBranchSafeToOptimise())
david-arm updated this revision to Diff 442552.Jul 6 2022, 6:45 AM
  • Updated new VPActiveLaneMaskPHIRecipe::print function to write out the operands as well.
  • Add a check to the vplan verifier that we only have one active lane mask phi recipe.
  • Added a TODO above VPActiveLaneMaskPHIRecipe::execute to try to use VPWidenPHIRecipe.
david-arm marked 5 inline comments as done.Jul 6 2022, 6:46 AM
fhahn added inline comments.Jul 6 2022, 10:10 PM
llvm/lib/Transforms/Vectorize/VPlan.cpp
927

Thanks, I don't think a flag is actually needed. IIUC it should be sufficient to check if the operand of the BranchOnCond is fed by the current active lane mask and update the comment?

david-arm added inline comments.Jul 7 2022, 12:18 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
927

OK, it's more complicated than that though, since we have to check for not(active_lane_mask).

david-arm updated this revision to Diff 442820.Jul 7 2022, 1:48 AM
david-arm marked 2 inline comments as done.
david-arm added inline comments.
llvm/lib/Transforms/Vectorize/VPlan.cpp
927

I've added explicit checks for the right flavour of BranchOnCond now, although the logic is considerably more complicated now. :)

fhahn accepted this revision.Jul 7 2022, 10:11 PM

LGTM, thanks! I left a few inline comments where some of the comments could still be update. Also it would be good to add to TODO to replace the phi recipe and do this as follow up soon.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1582

nit: would be good to add a quick comment here.

llvm/lib/Transforms/Vectorize/VPlan.cpp
923

nit: comment still needs updating?

llvm/lib/Transforms/Vectorize/VPlan.h
1845

nit: TOOD/FIXME to replace this by the more general VPWidenPHIRecipe?

This revision was landed with ongoing or failed builds.Jul 11 2022, 5:47 AM
This revision was automatically updated to reflect the committed changes.

Hi @fhahn, I've tried working on a patch to use VPWidenPHIRecipe instead of VPActiveLaneMaskPHIRecipe (see D129744), which is based off another WIP of yours (D128937). However, there are still problems with this approach:

  1. Still quite a lot of code assumes that VPWidenPHIRecipe has an underlying value, which I've tried to hack my way around.
  2. Updating VPlan::getActiveLaneMaskPHI() is tricky now - I've hacked this by looking for a VPWidenPHIRecipe without an underlying value.
  3. In addCanonicalIVRecipes the loop exit block that I add as an incoming block for the VPWidenPHIRecipe is not the final exit block that you see in VPlan::execute. In fixNonInductionPHIs we crash because we cannot map this bad exit VPBasicBlock to a real BasicBlock. It looks like I still may have to let VPlan::execute add the latch incoming value for VPWidenPHIRecipes.