This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize] Add option to use active lane mask for loop control flow
ClosedPublic

Authored by david-arm on May 10 2022, 3:40 AM.

Details

Summary

Currently, for vectorised loops that use the get.active.lane.mask
intrinsic we only use the mask for predicated vector operations,
such as masked loads and stores, etc. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, for some targets this is inefficient when it's
cheap to use the mask itself to control the loop.

This patch adds support for using the active lane mask for control
flow by:

  1. Generating the active lane mask for the next iteration of the
     vector loop, rather than the current one. If there are still any
     remaining iterations then at least the first bit of the mask will
     be set.

  2. Extracting the first bit of this mask and using it for the
     conditional branch.

I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.
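
To illustrate the shape of the transformed loop, here is a minimal scalar C++ model of the control flow described above (a sketch only, assuming a fixed width of 4; the vectoriser of course emits IR, not C++):

#include <cstdint>

// Toy stand-in for llvm.get.active.lane.mask: lane I is active if Base + I < TC.
static void getActiveLaneMask(bool Mask[4], uint64_t Base, uint64_t TC) {
  for (unsigned I = 0; I < 4; ++I)
    Mask[I] = Base + I < TC;
}

// Memset-style loop in the new form: the mask for the *next* iteration is
// computed at the bottom of the loop and its first lane drives the
// back-branch, so no separate compare of the IV against the trip count
// is needed.
void memsetLoop(int *Dst, int Val, uint64_t TC) {
  bool Mask[4];
  getActiveLaneMask(Mask, 0, TC); // entry mask, set up in the preheader
  uint64_t Index = 0;
  do {
    for (unsigned I = 0; I < 4; ++I) // predicated "vector" body
      if (Mask[I])
        Dst[Index + I] = Val;
    Index += 4;
    getActiveLaneMask(Mask, Index, TC); // mask for the next iteration
  } while (Mask[0]); // lane 0 is set whenever any iterations remain
}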

Diff Detail

Event Timeline

david-arm created this revision.May 10 2022, 3:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 10 2022, 3:40 AM
david-arm requested review of this revision.May 10 2022, 3:40 AM
david-arm edited reviewers, added: dmgreen; removed: greened.May 10 2022, 3:41 AM
sdesmalen added inline comments.May 25 2022, 1:38 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
536–541

I find this interface a bit cumbersome. Can we make emitGetActiveLaneMask return an enum value that explains how the lane mask must be used? Something like:

enum class ActiveLaneMask {
  None = 0,
  PredicateOnly,
  PredicateAndControlFlow
};

(but then with possibly better names)

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8035–8038

this comment is no longer correct.

8044

Can you add a comment explaining what this does?

From my understanding, it sets the predicate for the vectorized loop's header block (which would be the vectorized loop's main predicate) to the ActiveLaneMaskPhi cached in the Plan. Is that right?

8596

nit: s/UseLaneMaskForControlFlow/UseLaneMaskForLoopControlFlow/

llvm/lib/Transforms/Vectorize/VPlan.cpp
593–594

nit: s/CanonicalIVIncrementParts/CanonicalIVPIncrementForPart/ ?

605

nit: IV is probably not a suitable name, because it's the active lane mask, not the induction variable.

982

Is this extra scope necessary? There are no uses of IR builder after this scope, so I think you can just as well use the scope of the loop body for the InsertPointGuard.

llvm/lib/Transforms/Vectorize/VPlan.h
787

It would be nice to have a one-line description here of what this does and how it's different from CanonicalIVIncrement.

Would it be better for this be done later, by the backend? I worry that the BETC would not be calculable any more for the loop, and the new structure would be more difficult to analyze in general. Handling it separately from the vectorizer would also allow other loops to be transformed. What happens at the moment for loops with ACLE intrinsics, for example?

Hi @dmgreen, it's worth pointing out that this behaviour is toggled under a TTI interface so it is currently only enabled for SVE targets where this structure happens to be the most optimal. Any targets that rely upon the existing structure (e.g. MVE) remain unaffected.

Sure - that's good, but it wasn't my point. My point was that loops have a structure to them that usually includes a latch block controlled by an induction variable. My worry for SVE is that we are creating loops that do not have that structure. Do all the transforms such as LSR, and anything that might happen in an LTO pipeline, still behave optimally, or do they work less well because the loop is less analyzable? A simple example that comes up quite a lot is that during the compile step of LTO the loop is vectorized, then during the LTO step the trip count becomes a known constant. The loop should be fully unrolled and simplified away nicely. I think the simplification should still happen given the chance, because the active lane mask will propagate constants, but would the unrolling happen if the trip count is not obvious any more? There may be any number of other similar transforms that could happen during an LTO phase.

I haven't looked at the details, it sounds like this is a good thing to do. But my gut would say don't mess with the structure of the loop in the vectorizer, do the transform later.

david-arm updated this revision to Diff 432006.May 25 2022, 8:20 AM
  • Changed TTI interface emitGetActiveLaneMask to return a new ActiveLaneMask enum class.
  • Added/moved comments in a few places.
  • Tidied up the ::execute function.
david-arm marked 8 inline comments as done.May 25 2022, 8:22 AM

Hi @dmgreen, thanks for the explanation! The reason for doing this in LoopVectorize.cpp is that it feels much more natural to express it this way than writing a complicated pass just before lowering to transform it afterwards. There is no guarantee that such a transform would even succeed if subsequent passes change aspects of the loop. I believe this is actually the way it's done in GCC as well. From a performance perspective, I've not observed any degradations in code quality or performance so far with this approach.

I think that in general LLVM is moving towards supporting predication in IR and this patch is aligned with that. You're right that it's possible we may occasionally have work to do in future looking for backedge counts, but I don't feel we should be afraid of that. The information is there if we need it, since it's inferred from the trip count passed to the active lane mask intrinsic.

Matt added a subscriber: Matt.May 25 2022, 2:05 PM
paulwalker-arm added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
163

Please say if I'm being picky but to me ActiveLaneMask is a thing, as in, it's the active lane mask. Here I think you're asking the target to choose the style of predication. So something like the following reads better to me:

enum class PredicationStyle { None, Data, DataAndControlFlow };
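
As a concrete sketch of how a target could then answer such a hook (the method and subtarget query names here are assumptions based on this thread, not necessarily the committed code):

PredicationStyle emitGetActiveLaneMask() const {
  // SVE can generate the mask cheaply (whilelo) and branch on its flags,
  // so ask for the mask to drive both the data path and the control flow.
  if (ST->hasSVE())
    return PredicationStyle::DataAndControlFlow;
  return PredicationStyle::None;
}
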
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
338–339

Will we do the correct thing if the vectoriser chooses fixed length vectorisation? Not sure what "correct" really means here other than not crash or regress performance? I just want to ensure we've considered the possibility.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8642

Given the new block is already EB->appendRecipe'ing several instructions, I don't see any value in breaking out the last instruction like this.

9086–9087

This idiom occurs in a few places. Is it worth adding something like CM.useActiveLaneMaskForControlFlow()? I think it makes sense for this to imply tail folding by mask, as why else would you be doing it.
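
A sketch of the helper being suggested (a hedged sketch only; the names come from this thread and the enum values from the earlier suggestion, so the exact form in the patch may differ):

// In LoopVectorizationCostModel (sketch): using the lane mask for control
// flow only makes sense when the tail is already folded by masking.
bool useActiveLaneMaskForControlFlow() const {
  return foldTailByMasking() &&
         TTI.emitGetActiveLaneMask() == ActiveLaneMask::PredicateAndControlFlow;
}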

llvm/lib/Transforms/Vectorize/VPlan.cpp
603

no active lanes?

984

Why is this here, and not where StartMask is assigned a couple of lines below?

david-arm updated this revision to Diff 434840.Jun 7 2022, 9:14 AM
  • Changed name of new enum to PredicationStyle.
  • Addressed other refactoring comments.
  • Added new test simple_memset_v4i32 to demonstrate that the new style of tail-folding works when choosing a fixed-width VF. This can happen because SVE is enabled, but the cost model may still decide to choose a fixed-width VF.
david-arm marked 6 inline comments as done.Jun 7 2022, 9:16 AM
david-arm added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
338–339

I think it does yeah. I've added a new test @simple_memset_v4i32 in sve-tail-folding.ll to defend the case where SVE is enabled and we've chosen a fixed-width VF. The codegen side of things is safe because we have tests to defend lowering of get.active.lane.mask calls for fixed-width too.

fhahn added inline comments.Jun 7 2022, 2:18 PM
llvm/lib/Transforms/Vectorize/VPlan.cpp
986

The start value should be created in the preheader block in VPlan instead of in the phi node here. Then there should also be no need for a TC argument?

llvm/lib/Transforms/Vectorize/VPlan.h
793

Can BranchOnCond be used instead of the dedicated BranchOnActiveLaneMask?

1880

TC would need to be an operand instead of just a member variable, otherwise things will break if something does TC->replaceAllUsesWith() outside.

fhahn added inline comments.Jun 7 2022, 2:18 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629

It looks like the update for the phi doesn't depend on the phi, but only on trip count & main induction.

I might be missing something, but is the phi actually needed or would it be possible to compute the active lane mask at the beginning of each iteration instead of at the end of the iteration?

david-arm marked an inline comment as done.Jun 8 2022, 1:08 AM
david-arm added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629

Hi @fhahn, perhaps I've misunderstood your question here, but actually the point of this patch is to do the exact opposite of that, i.e. we *don't* want to generate the active lane mask for the current iteration - we want to create the mask for the next iteration. This requires a PHI to carry the live value around the loop and is the only way to use the mask for control flow because at the point of branching we want to know if there are any active lanes in the mask for the *next* iteration.

With particular reference to SVE, the motivation for this work is to use the 'whilelo' instruction to both generate the lane mask and set the flags to branch on. Effectively, the whilelo instruction is doing the comparison already, which makes the traditional scalar IV comparison redundant.

I could be wrong, but I believe this form of vectorised loop would be beneficial for some other targets with a predicated instruction set too, such as RISC-V.

llvm/lib/Transforms/Vectorize/VPlan.h
793

I created this patch a month ago, which predated your BranchOnCond work. That's why I haven't used it. I can certainly look into this and see if it's possible though?

david-arm added inline comments.Jun 10 2022, 12:51 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
986

Hi @fhahn, I'm not really sure what you mean here. Do you mean that LoopVectorize.cpp should create a VPInstruction with type ActiveLaneMask that lives in the VPlan's preheader block? If so, then VPInstruction::ActiveLaneMask needs a trip count operand.

Or do you mean I should deviate from the expected recipe creation somehow in a VPlan function such as prepareToExecute by manually adding the incoming value?

fhahn added inline comments.Jun 13 2022, 1:05 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629

With particular reference to SVE, the motivation for this work is to use the 'whilelo' instruction to both generate the lane mask and set the flags to branch on.

Oh right, I missed this in the test case I was looking at! I originally thought the intention of the new phi recipe was to encode some extra guarantees/information like we do for inductions or reductions.

From the latest comment, it sounds like there would be no need to have a new recipe class for the phi, but perhaps VPWidenPHIRecipe could be used instead, if the setup can be moved to the pre-header.

llvm/lib/Transforms/Vectorize/VPlan.cpp
986

Do you mean that LoopVectorize.cpp should create a VPInstruction with type ActiveLaneMask that lives in the VPlan's preheader block?

Yes exactly.

If so, then VPInstruction::ActiveLaneMask needs a trip count operand.

Looking at VPInstruction::generateInstruction, it looks like it already takes a trip count operand? The first operand can be set to zero by wrapping a zero constant in a VPValue.

case VPInstruction::ActiveLaneMask: {
  // Get first lane of vector induction variable.
  Value *VIVElem0 = State.get(getOperand(0), VPIteration(Part, 0));
  // Get the original loop tripcount.
  Value *ScalarTC = State.get(getOperand(1), Part);
david-arm added inline comments.Jun 13 2022, 1:12 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629

So I actually tried doing exactly this initially, but I think that VPWidenPHIRecipe requires an underlying scalar instruction to widen due to the execute function that lives in LoopVectorize.cpp:

void VPWidenPHIRecipe::execute(VPTransformState &State) {
  State.ILV->widenPHIInstruction(cast<PHINode>(getUnderlyingValue()), this,
                                 State);
}

but the PHI node does not exist in the original scalar loop. It's not currently possible to widen a PHI that didn't previously exist, which means I would have to modify VPWidenPHIRecipe to test for the existence of an underlying value and take different paths accordingly.
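
For concreteness, the kind of change being described would look roughly like this (a sketch only, not part of the patch; the no-underlying-value path is left as a stub):

void VPWidenPHIRecipe::execute(VPTransformState &State) {
  // Existing behaviour: widen the scalar PHI from the original loop.
  if (auto *UV = getUnderlyingValue()) {
    State.ILV->widenPHIInstruction(cast<PHINode>(UV), this, State);
    return;
  }
  // Path that would have to be added: no scalar PHI exists to widen, so a
  // fresh vector PHI would need to be built from the recipe's VPlan operands.
}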

david-arm updated this revision to Diff 436733.Jun 14 2022, 4:04 AM
  • Removed the creation of the initial lane mask values from ActiveLaneMaskPHIRecipe::execute. We now create the appropriate set of VPInstructions in the vplan preheader block and feed these into the ActiveLaneMaskPHI recipe.
david-arm marked 6 inline comments as done.Jun 14 2022, 4:14 AM
david-arm added inline comments.
llvm/lib/Transforms/Vectorize/VPlan.h
793

So I did look into this. In order to do it this way I have to explicitly generate the Not and ExtractElement operations using VPInstructions, which requires a new VPInstruction::ExtractElement type. It's possible to do this, but then I wasn't sure about the semantics of this new instruction. When passing in a scalar constant of 0 for the lane, it gets widened to something like <vscale x 4 x i32> zeroinitializer for every part. However, I only need a single lane so I'd have to do something like:

case VPInstruction::ExtractElement: {
  Value *Vec = State.get(getOperand(0), Part);
  Value *Lane = State.get(getOperand(1), VPIteration(0, 0));
  Value *V = Builder.CreateExtractElement(Vec, Lane);
  State.set(this, V, Part);
  break;
}

It feels quite inefficient to go to all the effort of widening, only to discard everything!

If you still prefer me to proceed with this approach I'm happy to try if you can provide your thoughts on what the new ExtractElement operation should look like?

fhahn added inline comments.Jun 19 2022, 2:21 PM
llvm/lib/Transforms/Vectorize/VPlan.h
793

So I did look into this. In order to do it this way I have to explicitly generate the Not and ExtractElement operations using VPInstructions, which requires a new VPInstruction::ExtractElement type.

Extracts are not modeled explicitly at the moment and usually State.get will take care of inserting an extract when requesting scalar lanes if it is needed. I *think* when using a VPInstruction::not as operand for BranchOnCond, State.get should insert the extract for the first lane, as this is what BranchOnCond uses.

fhahn added inline comments.Jun 19 2022, 2:27 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629

So I actually tried doing exactly this initially, but I think that VPWidenPHIRecipe requires an underlying scalar instruction to widen due to the execute function that lives in LoopVectorize.cpp:

Yeah this was a bit unfortunate! I landed 2 patches that removed the unnecessary dependence on the underlying instruction by instead using the type of its operand.

david-arm added inline comments.Jun 20 2022, 12:56 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629

OK well thanks a lot for tidying that class up! The only problem is that even after your patches landed, VPWidenPhiRecipe::execute only ever deals with Part 0, whereas we need a Phi for every Part. So unfortunately I still can't use the class in its current state. If you have suggestions about how to fix this I'm happy to take another look? Perhaps I can add a boolean flag to the VPWidenPhiRecipe constructor to indicate how many parts to generate?
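
As a rough sketch of the "one PHI per part" requirement (MaskTy, VectorPH and getStartValue() are assumed names here, not necessarily the real interfaces):

// Create one PHI of the mask type per unrolled part, each seeded with the
// corresponding entry mask from the preheader.
for (unsigned Part = 0; Part < State.UF; ++Part) {
  PHINode *Phi = State.Builder.CreatePHI(MaskTy, /*NumReservedValues=*/2,
                                         "active.lane.mask");
  Phi->addIncoming(State.get(getStartValue(), Part), VectorPH);
  State.set(this, Phi, Part);
}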

llvm/lib/Transforms/Vectorize/VPlan.h
793

OK I'll take another look and give this a try!

david-arm updated this revision to Diff 438346.Jun 20 2022, 5:14 AM
  • Removed BranchOnActiveLaneMask and now use BranchOnCond instead
david-arm edited the summary of this revision.Jun 20 2022, 5:15 AM
david-arm marked an inline comment as done.

Out of interest, is the test output produced manually or via an update script? I ask because I cannot see any of the normal headers and am wondering why.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8591–8593

This comment is now out of date. Perhaps it's best to replace it with something more generic and have the implementation specific part nearer the code?

8632–8633

Please ignore if too awkward but I think the output IR will be more readable if we have tighter control over the name used for EntryALM and ALM. Currently the output is just littered with active.lane.mask### making it harder to understand the data flow.

If the entry masks were called active.lane.mask.entry###, next iteration masks called active.lane.mask.next### and the PHIs keeping their current active.lane.mask### name, the loops would be easier to read.

8656–8657

This seems a bit wasteful given it's only the first lane we care about. Perhaps VPInstruction::BranchOnCount could take an invert flag that switches the branch destinations?

llvm/lib/Transforms/Vectorize/VPlan.cpp
578–580

I know it'll get folded away by later passes, but would it be wrong to handle the Part == 0 case here, so that we don't litter the loop vectorize tests with pointless adds of zero? I see CanonicalIVIncrement doing similar for the non-zero parts, so I'll take that as precedent for doing such common-sense early optimisation.
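
Concretely, something along these lines is what's being asked for (a sketch only; the opcode and helper names, e.g. createStepForVF, follow existing LLVM utilities as I understand them and may not match the final patch exactly):

case VPInstruction::CanonicalIVIncrementForPart: {
  auto *IV = State.get(getOperand(0), VPIteration(0, 0));
  if (Part == 0) {
    // The per-part step is zero for part 0, so reuse the IV directly rather
    // than emitting an add of zero for later passes to fold away.
    State.set(this, IV, Part);
    break;
  }
  Value *Step = createStepForVF(Builder, IV->getType(), State.VF, Part);
  State.set(this, Builder.CreateAdd(IV, Step), Part);
  break;
}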

llvm/test/Transforms/LoopVectorize/AArch64/scalable-reductions-tf.ll
9

Given the structure has changed, is it worth showing how the entry active lane mask is constructed?

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
3–4

What's this protecting?

david-arm updated this revision to Diff 440927.Jun 29 2022, 4:03 AM
  • Fixed some incorrect comments.
  • Added more descriptive names for ActiveLaneMask VPInstructions.
  • Minor improvement in code quality for some canonical IV increments.
fhahn added inline comments.Jun 29 2022, 4:04 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629

Hm, let me double check!

david-arm marked 6 inline comments as done.Jun 29 2022, 4:06 AM

Out of interest, is the test output produced manually or via an update script? I ask because I cannot see any of the normal headers and am wondering why.

So we've been using the update_test_checks.py script to generate the CHECK lines, but then manually deleting the CHECK lines after the middle.block since they were unnecessary for the tests. That reduced the amount of noise, but it does mean we can't really have the NOTE at the top of the tests to say "CHECK lines generated by update_test_checks.py".

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8656–8657

It's not as trivial as you'd think to pass flags into VPInstructions. :( You have to create an IR Constant, pass it in as a 3rd VPValue argument to the VPInstruction and mark it as an IR "live value in" operand. We do end up with nice code after InstCombine, since it swaps the labels around for us anyway.

If you really prefer me to go down this route I can - it's just a little awkward, that's all.

paulwalker-arm accepted this revision.Jun 30 2022, 5:58 AM

A few comment-related issues but otherwise looks good to me.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8606

remove the "it"?

8656–8657

No need if this is how the interface works. Perhaps VPInstruction could do with some domain-specific flags, which might also simplify the NUW split. Either way it's not directly relevant to this patch.

llvm/lib/Transforms/Vectorize/VPlan.cpp
580–581

This looks to be copied from case VPInstruction::CanonicalIVIncrement, but it's not really correct within the context of this opcode.

llvm/lib/Transforms/Vectorize/VPlan.h
2519

Could do with a comment.

This revision is now accepted and ready to land.Jun 30 2022, 5:58 AM
fhahn added a reviewer: Ayal.Jun 30 2022, 7:35 AM
fhahn added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
536–541

Looks like this comment needs updating, as the interface is more fine grained now?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8629

Ok it looks like the only reason is that VPWidenPHIRecipe is only used in plans for outer-loop vectorization, and interleaving isn't supported there.

So I think it should be fine to just extend it to generate phis for all parts. It would be beneficial to ensure we use generic recipes where possible to avoid duplication and confusion. This will also help with further unifying the split between native/inner loop planning.

llvm/lib/Transforms/Vectorize/VPlan.cpp
981

Can Builder.CreatePhi be used without the need to manually pick the insertion point?

llvm/lib/Transforms/Vectorize/VPlan.h
802

This should be documented more clearly. At the moment this is used as the name of the generated IR instruction? Also, this is only used for VPInstruction::ActiveLaneMask which is a bit surprising.

2598

Do we need to keep this as a member variable? I think at the moment we generally avoid having links to recipes here, as this requires transforms that may remove or replace recipes to be aware of this state (see the similar getCanonicalIV).

It should be trivial to find the active lane mask phi recipe, if there is one (just look at the phi recipes of the header)
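
The lookup being suggested would be roughly the following (a sketch, mirroring how the canonical IV is found by scanning the header phis; the exact class and accessor names may differ):

VPActiveLaneMaskPHIRecipe *VPlan::getActiveLaneMaskPHI() {
  VPBasicBlock *Header = getVectorLoopRegion()->getEntryBasicBlock();
  for (VPRecipeBase &R : Header->phis())
    if (auto *LaneMaskPhi = dyn_cast<VPActiveLaneMaskPHIRecipe>(&R))
      return LaneMaskPhi;
  return nullptr;
}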

david-arm marked an inline comment as done.Jun 30 2022, 8:56 AM
david-arm added inline comments.
llvm/lib/Transforms/Vectorize/VPlan.cpp
982

Hi @fhahn, if I use VPWidenPHIRecipe instead it's not obvious how I add the incoming start mask. It looks like VPWidenPHIRecipe::execute doesn't add it - do you know how?

fhahn added inline comments.Jun 30 2022, 12:40 PM
llvm/lib/Transforms/Vectorize/VPlan.cpp
982

I think D128937 should do the trick, together with the snippet below. If D128937 allows VPWidenPHIRecipe to be used here I'll send it for review.

diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index d8fb7a9c12b0..1ba76bf52cc6 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3646,8 +3646,7 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
     truncateToMinimalBitwidths(State);

   // Fix widened non-induction PHIs by setting up the PHI operands.
-  if (EnableVPlanNativePath)
-    fixNonInductionPHIs(Plan, State);
+  fixNonInductionPHIs(Plan, State);

   // At this point every instruction in the original loop is widened to a
   // vector form. Now we need to fix the recurrences in the loop. These PHI
david-arm updated this revision to Diff 441660.Jul 1 2022, 3:59 AM
  • Changed getActiveLaneMaskPHI to look up the recipe from the header, and removed the member variable.
  • Improved comments in code.
  • Split out the VPInstruction Name parameter work into a separate NFC patch - D128982.
david-arm marked 8 inline comments as done.Jul 1 2022, 4:01 AM
david-arm added inline comments.Jul 4 2022, 6:20 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
982

Hi @fhahn, if you're happy with everything else in the patch, would you be ok with submitting this patch as it is now, then looking into using VPWidenPHIRecipe as a follow-on piece of refactoring? We're quite keen to get this work submitted soonish so we can get sufficient testing in the run-up to the LLVM 15 release. We also have quite a few other pieces of work on-going that depend upon this too.

fhahn added inline comments.Jul 5 2022, 6:03 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
571

nit: stray whitespace.

596

Is this covered by a test? The comment and logic above applies specifically to BranchOnCount. It is not entirely clear to me that it would be safe for any BranchOnCond?

982

Ok sounds good to me, could you please add a FIXME/TODO to the code in the meantime?

We also have quite a few other pieces of work on-going that depend upon this too.

If other stuff depends on the patch it might be helpful to link them together as a patch stack, to get a better idea of the wider impact.

992

Should this also print the operands here?

llvm/lib/Transforms/Vectorize/VPlan.h
2596

What about multiple VPActiveLaneMasks? Should the verifier check to make sure there's only a single one?

It would probably also be good to place this after getCanonicalIV, which seems related.

llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp
202

This still references BranchOnActiveLaneMask?

david-arm added inline comments.Jul 6 2022, 5:48 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
596

It is covered by sve-low-trip-count.ll, since we create BranchOnCond using the active lane mask. I did this after you suggested in earlier review comments to use BranchOnCond instead of BranchOnActiveLaneMask because we don't want to regress for low trip counts. We also don't have access to the TripCount in addCanonicalIVRecipes, so I can't use that to determine what type of Branch instruction to use.

It's a shame that there is no easy way to add a flag to the VPInstruction to mark it as safe for optimisation here. For example, it would be nice to do something like

(Term->getOpcode() == VPInstruction::BranchOnCount ||
 (Term->getOpcode() == VPInstruction::BranchOnCond && Term->isBranchSafeToOptimise()))
david-arm updated this revision to Diff 442552.Jul 6 2022, 6:45 AM
  • Updated new VPActiveLaneMaskPHIRecipe::print function to write out the operands as well.
  • Add a check to the vplan verifier that we only have one active lane mask phi recipe.
  • Added a TODO above VPActiveLaneMaskPHIRecipe::execute to try to use VPWidenPHIRecipe.
david-arm marked 5 inline comments as done.Jul 6 2022, 6:46 AM
fhahn added inline comments.Jul 6 2022, 10:10 PM
llvm/lib/Transforms/Vectorize/VPlan.cpp
596

Thanks, I don't think a flag is actually needed. IIUC it should be sufficient to check if the operand of the BranchOnCond is fed by the current active lane mask and update the comment?

david-arm added inline comments.Jul 7 2022, 12:18 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
596

OK, it's more complicated than that though, since we have to check for not(active_lane_mask).
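
A sketch of the extra case (the helper name here is made up; this is not the exact code that landed): accept BranchOnCount as before, and additionally a BranchOnCond whose condition is not(active.lane.mask).

static bool isTripCountControlledBranch(const VPInstruction *Term) {
  if (Term->getOpcode() == VPInstruction::BranchOnCount)
    return true;
  if (Term->getOpcode() != VPInstruction::BranchOnCond)
    return false;
  // Expected pattern for the new tail-folding style: branch on
  // not(active.lane.mask), i.e. exit once no lanes remain active.
  auto *Not = dyn_cast<VPInstruction>(Term->getOperand(0));
  if (!Not || Not->getOpcode() != VPInstruction::Not)
    return false;
  auto *ALM = dyn_cast<VPInstruction>(Not->getOperand(0));
  return ALM && ALM->getOpcode() == VPInstruction::ActiveLaneMask;
}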

david-arm updated this revision to Diff 442820.Jul 7 2022, 1:48 AM
david-arm marked 2 inline comments as done.
david-arm added inline comments.
llvm/lib/Transforms/Vectorize/VPlan.cpp
596

I've added explicit checks for the right flavour of BranchOnCond now, although the logic is considerably more complicated. :)

fhahn accepted this revision.Jul 7 2022, 10:11 PM

LGTM, thanks! I left a few inline comments where some of the comments could still be updated. Also it would be good to add a TODO to replace the phi recipe and do this as a follow-up soon.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1515

nit: would be good to add a quick comment here.

llvm/lib/Transforms/Vectorize/VPlan.cpp
595–596

nit: comment still needs updating?

llvm/lib/Transforms/Vectorize/VPlan.h
1878

nit: TODO/FIXME to replace this by the more general VPWidenPHIRecipe?

This revision was landed with ongoing or failed builds.Jul 11 2022, 5:47 AM
This revision was automatically updated to reflect the committed changes.

Hi @fhahn, I've tried working on a patch to use VPWidenPHIRecipe instead of VPActiveLaneMaskPHIRecipe (see D129744), which is based off another WIP of yours (D128937). However, there are still problems with this approach:

  1. Still quite a lot of code assumes that VPWidenPHIRecipe has an underlying value, which I've tried to hack my way around.
  2. Updating VPlan::getActiveLaneMaskPHI() is tricky now - I've hacked this by looking for a VPWidenPHIRecipe without an underlying value.
  3. In addCanonicalIVRecipes the loop exit block that I add as an incoming block for the VPWidenPHIRecipe is not the final exit block that you see in VPlan::execute. In fixNonInductionPHIs we crash because we cannot map this bad exit VPBasicBlock to a real BasicBlock. It looks like I still may have to let VPlan::execute add the latch incoming value for VPWidenPHIRecipes.