Page MenuHomePhabricator

[LoopVectorize] Add option to use active lane mask for loop control flow
Needs ReviewPublic

Authored by david-arm on May 10 2022, 3:40 AM.

Details

Summary

Currently, for vectorised loops that use the get.active.lane.mask
intrinsic we only use the mask for predicated vector operations,
such as masked loads and stores, etc. The loop itself is still
controlled by comparing the canonical induction variable with the
trip count. However, for some targets this is inefficient when it's
cheap to use the mask itself to control the loop.

This patch adds support for using the active lane mask for control
flow by:

  1. Generating the active lane mask for the next iteration of the

vector loop, rather than the current one. If there are still any
remaining iterations then at least the first bit of the mask will
be set.

  1. Extract the first bit of this mask and use this bit for the

conditional branch.

I did this by creating a new VPActiveLaneMaskPHIRecipe that sets
up the initial PHI values in the vector loop pre-header. I've also
made use of the new BranchOnCond VPInstruction for the final
instruction in the loop region.

Diff Detail

Event Timeline

david-arm created this revision.May 10 2022, 3:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 10 2022, 3:40 AM
david-arm requested review of this revision.May 10 2022, 3:40 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 10 2022, 3:40 AM
david-arm edited reviewers, added: dmgreen; removed: greened.May 10 2022, 3:41 AM
sdesmalen added inline comments.May 25 2022, 1:38 AM
llvm/include/llvm/Analysis/TargetTransformInfo.h
536

I find this interface a bit cumbersome, can we make emitGetActiveLaneMask return an enum value that explains how the lane mask must be used, something like:

enum class ActiveLaneMask {
  None = 0,
  PredicateOnly,
  PredicateAndControlFlow
};

(but then with possibly better names)

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8071–8074

this comment is no longer correct.

8080

Can you add a comment explaining what this does?

From my understanding, it sets the predicate for the vectorized loop's header block (which would be the vectorized loop's main predicate) to the ActiveLaneMaskPhi cached in the Plan. Is that right?

8640

nit: s/UseLaneMaskForControlFlow/UseLaneMaskForLoopControlFlow`/

llvm/lib/Transforms/Vectorize/VPlan.cpp
800

nit: s/CanonicalIVIncrementParts/CanonicalIVPIncrementForPart/ ?

822

nit: IV is probably not a suitable name, because it's the active lane mask, not the induction variable.

1542

Is this extra scope necessary? There are no uses of IR builder after this scope, so I think you can just as well use the scope of the loop body for the InsertPointGuard.

llvm/lib/Transforms/Vectorize/VPlan.h
758

It would be nice to have a one-line description here of what this does and how it's different from CanonicalIVIncrement.

Would it be better for this be done later, by the backend? I worry that the BETC would not be calculable any more for the loop, and the new structure would be more difficult to analyze in general. Handling it separately from the vectorizer would also allow other loops to be transformed. What happens at the moment for loops with ACLE intrinsics, for example?

Would it be better for this be done later, by the backend? I worry that the BETC would not be calculable any more for the loop, and the new structure would be more difficult to analyze in general. Handling it separately from the vectorizer would also allow other loops to be transformed. What happens at the moment for loops with ACLE intrinsics, for example?

Hi @dmgreen, it's worth pointing out that this behaviour is toggled under a TTI interface so it is currently only enabled for SVE targets where this structure happens to be the most optimal. Any targets that rely upon the existing structure (e.g. MVE) remain unaffected.

Would it be better for this be done later, by the backend? I worry that the BETC would not be calculable any more for the loop, and the new structure would be more difficult to analyze in general. Handling it separately from the vectorizer would also allow other loops to be transformed. What happens at the moment for loops with ACLE intrinsics, for example?

Hi @dmgreen, it's worth pointing out that this behaviour is toggled under a TTI interface so it is currently only enabled for SVE targets where this structure happens to be the most optimal. Any targets that rely upon the existing structure (e.g. MVE) remain unaffected.

Sure - that's good, but it wasn't my point. My point was that loops have a structure to them that usually includes latch block controlled by an induction variable. My worry is for SVE, if we are creating loops that do not have that structure. Do all the transforms such as LSR and anything that might happen in an LTO pipeline still behave optimally, or do they work less well because the loop is less analyzable. A simple example that comes up quite a lot is that during the compile step of lto the loop is vectorized, then during the lto step the trip count becomes a known constant. The loop should be fully unrolled and simplified away nicely. I think the simplification should still happen given the chance because the activelanemask will propagate constants, but would the unrolling happen if the trip count is not obvious any more? There may be any number of other similar transforms that could happen during an LTO phase.

I haven't looked at the details, it sounds like this is a good thing to do. But my gut would say don't mess with the structure of the loop in the vectorizer, do the transform later.

david-arm updated this revision to Diff 432006.May 25 2022, 8:20 AM
  • Changed TTI interface emitGetActiveLaneMask to return a new ActiveLaneMask enum class.
  • Added/moved comments in a few places.
  • Tidied up the ::execute function.
david-arm marked 8 inline comments as done.May 25 2022, 8:22 AM

Hi @dmgreen, thanks for the explanation! The reason for doing this in LoopVectorize.cpp is because it feels much more natural to express it this way than writing a complicated pass just before lowering to transform it afterwards. There is no guarantee that such a transform would even succeed if subsequent passes change aspects of the loop. I believe this is actually the way it's done in GCC as well. From a performance perspective I've not observed any degradations in code quality or performance so far with this approach.

I think that in general LLVM is moving towards supporting predication in IR and this patch is aligned with that. You're right that it's possible we may occasionally have work to do in future looking for backedge counts, but I don't feel we should be afraid of that. The information is there if we need it, since it's inferred from the trip count passed to the active lane mask intrinsic.

Matt added a subscriber: Matt.May 25 2022, 2:05 PM
paulwalker-arm added inline comments.
llvm/include/llvm/Analysis/TargetTransformInfo.h
162

Please say if I'm being picky but to me ActiveLaneMask is a thing, as in, it's the active lane mask. Here I think you're asking the target to choose the style of predication. So something like the following reads better to me:

enum class PredicationStyle { None, Data, DataAndControlFlow };
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
338–339

Will we do the correct thing if the vectoriser chooses fixed length vectorisation? Not sure what "correct" really means here other than not crash or regress performance? I just want to ensure we've considered the possibility.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8683

Given the new block is already EB->appendRecipeing several instructions I don't see any value in breaking out the last instruction like this.

9121–9122

This idiom occurs in a few places. Is it worth adding something like CM.useActiveLaneMaskForControlFlow(). I think it makes sense for this to imply tail folding by mask as why else would you be doing it.

llvm/lib/Transforms/Vectorize/VPlan.cpp
820

no active lanes?

1551

Why is this here? and not where StartMask is assigned a couple of lines below.

david-arm updated this revision to Diff 434840.Tue, Jun 7, 9:14 AM
  • Changed name of new enum to PredicationStyle.
  • Addressed other refactoring comments.
  • Added new test simple_memset_v4i32 to demonstrate that the new style of tail-folding works when choosing a fixed-width VF. This can happen because SVE is enabled, but the cost model may still decide to choose a fixed-width VF.
david-arm marked 6 inline comments as done.Tue, Jun 7, 9:16 AM
david-arm added inline comments.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
338–339

I think it does yeah. I've added a new test @simple_memset_v4i32 in sve-tail-folding.ll to defend the case where SVE is enabled and we've chosen a fixed-width VF. The codegen side of things is safe because we have tests to defend lowering of get.active.lane.mask calls for fixed-width too.

fhahn added inline comments.Tue, Jun 7, 2:18 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

It looks like the update for the phi doesn't depend on the phi, but only on trip count & main induction.

I might be missing something, but is the phi actually needed or would it be possible to compute the active lane mask at the beginning of each iteration instead of at the end of the iteration?

llvm/lib/Transforms/Vectorize/VPlan.cpp
1553

The start value should be created in the preheader block in VPlan instead of in the phi node here. There there should also be no need for a TC argument?

llvm/lib/Transforms/Vectorize/VPlan.h
764

Can BranchOnCond be used instead of the dedicated BranchOnactiveLaneMask?

1847

TC would need to be an operand instead of just a member variable, otherwise things will break if something does TC->replaceAllUsesWith() outside.

david-arm marked an inline comment as done.Wed, Jun 8, 1:08 AM
david-arm added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

Hi @fhahn, perhaps I've misunderstood your question here, but actually the point of this patch is to do the exact opposite of that, i.e. we *don't* want to generate the active lane mask for the current iteration - we want to create the mask for the next iteration. This requires a PHI to carry the live value around the loop and is the only way to use the mask for control flow because at the point of branching we want to know if there are any active lanes in the mask for the *next* iteration.

With particular reference to SVE, the motivation for this work is to use the 'whilelo' instruction to both generate the lane mask and set the flags to branch on. Effectively, the whilelo instruction is doing the comparison already, which makes the traditional scalar IV comparison redundant.

I could be wrong, but I believe this form of vectorised loop would be beneficial for some other targets with a predicated instruction set too, such as RISC-V.

llvm/lib/Transforms/Vectorize/VPlan.h
764

I created this patch a month ago, which predated your BranchOnCond work. That's why I haven't used it. I can certainly look into this and see if it's possible though?

david-arm added inline comments.Fri, Jun 10, 12:51 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
1553

Hi @fhahn, I'm not really sure what you mean here. Do you mean that LoopVectorize.cpp should create a VPInstruction with type ActiveLaneMask that lives in the VPlan's preheader block? If so, then VPInstruction::ActiveLaneMask needs a trip count operand.

Or do you mean I should deviate from the expected recipe creation somehow in a VPlan function such as prepareToExecute by manually adding the incoming value?

fhahn added inline comments.Mon, Jun 13, 1:05 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

With particular reference to SVE, the motivation for this work is to use the 'whilelo' instruction to both generate the lane mask and set the flags to branch on.

Oh right, I missed this in the test case I was looking at! I originally thought the intention of the new phi recipe was to encode some extra guarantees/information like we do for inductions or reductions.

From the latest comment, it sounds like there would be no need to have a new recipe class for the phi, but perhaps VPWidenPHIRecipe could be used instead, if the setup can be moved to the pre-header.

llvm/lib/Transforms/Vectorize/VPlan.cpp
1553

Do you mean that LoopVectorize.cpp should create a VPInstruction with type ActiveLaneMask that lives in the VPlan's preheader block?

Yes exactly.

If so, then VPInstruction::ActiveLaneMask needs a trip count operand.

Looking at VPInstruction::generateInstruction, it looks like it already takes a trip count operand? The first operand can be set to zero by wrapping a zero constant in a VPValue.

case VPInstruction::ActiveLaneMask: {
  // Get first lane of vector induction variable.
  Value *VIVElem0 = State.get(getOperand(0), VPIteration(Part, 0));
  // Get the original loop tripcount.
  Value *ScalarTC = State.get(getOperand(1), Part);
david-arm added inline comments.Mon, Jun 13, 1:12 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

So I actually tried doing exactly this initially, but I think that VPWidenPHIRecipe requires an underlying scalar instruction to widen due to the execute function that lives in LoopVectorize.cpp:

void VPWidenPHIRecipe::execute(VPTransformState &State) {
  State.ILV->widenPHIInstruction(cast<PHINode>(getUnderlyingValue()), this,
                                 State);
}

but the PHI node does not exist in the original scalar loop. It's not currently possible to widen a PHI that didn't previously exist, which means I would have to modify VPWidenPHIRecipe to test for the existence of an underlying value and take different paths accordingly.

david-arm updated this revision to Diff 436733.Tue, Jun 14, 4:04 AM
  • Removed the creation of the initial lane mask values from ActiveLaneMaskPHIRecipe::execute. We now create the appropriate set of VPInstructions in the vplan preheader block and feed these into the ActiveLaneMaskPHI recipe.
david-arm marked 6 inline comments as done.Tue, Jun 14, 4:14 AM
david-arm added inline comments.
llvm/lib/Transforms/Vectorize/VPlan.h
764

So I did look into this. In order to do it this way I have to explicitly generate the Not and ExtractElement operations using VPInstructions, which requires a new VPInstruction::ExtractElement type. It's possible to do this, but then I wasn't sure about the semantics of this new instruction. When passing in a scalar constant of 0 for the lane, it gets widened to something like <vscale x 4 x i32> zeroinitializer for every part. However, I only need a single lane so I'd have to do something like:

case VPInstruction::ExtractElement: {
  Value *Vec = State.get(getOperand(0), Part);
  Value *Lane = State.get(getOperand(1), VPIteration(0, 0));
  Value *V = Builder.CreateExtractElement(Vec, Lane);
  State.set(this, V, Part);
  break;
}

It feels quite inefficient to go to all the effort of widening, only to discard everything!

If you still prefer me to proceed with this approach I'm happy to try if you can provide your thoughts on what the new ExtractElement operation should look like?

fhahn added inline comments.Sun, Jun 19, 2:21 PM
llvm/lib/Transforms/Vectorize/VPlan.h
764

So I did look into this. In order to do it this way I have to explicitly generate the Not and ExtractElement operations using VPInstructions, which requires a new VPInstruction::ExtractElement type.

Extracts are not modeled explicitly at the moment and usually State.get will take care of interesting an extract when requesting scalar lanes if it is needed. I *think* when using a VPInstruction::not as operand for BranchOnCond, State.get should insert the extract for the first lane, as this is what BranchOnCond uses.

fhahn added inline comments.Sun, Jun 19, 2:27 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

So I actually tried doing exactly this initially, but I think that VPWidenPHIRecipe requires an underlying scalar instruction to widen due to the execute function that lives in LoopVectorize.cpp:

Yeah this was a bit unfortunate! I landed 2 patches that removed the unnecessary dependence on the underlying instruction by instead using the type of its operand.

david-arm added inline comments.Mon, Jun 20, 12:56 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8670

OK well thanks a lot for tidying that class up! The only problem is that even after your patches landed VPWidenPhiRecipe::execute only ever deals with Part 0, whereas we need a Phi for every Part. So unfortunately I still can't use the class in it's current state. If you have suggestions about how to fix this I'm happy to take another look? Perhaps I can add a boolean flag to the VPWidenPhiRecipe constructor to indicate how many parts to generate?

llvm/lib/Transforms/Vectorize/VPlan.h
764

OK I'll take another look and give this a try!

david-arm updated this revision to Diff 438346.Mon, Jun 20, 5:14 AM
  • Removed BranchOnActiveLaneMask and now use BranchOnCond instead
david-arm edited the summary of this revision. (Show Details)Mon, Jun 20, 5:15 AM
david-arm marked an inline comment as done.