This is an archive of the discontinued LLVM Phabricator instance.

[VPlan] Add active-lane-mask as VPlan-to-VPlan transformation.
ClosedPublic

Authored by fhahn on Aug 24 2023, 2:09 PM.

Details

Summary

This patch updates the mask creation code to always create compares of
the form (ICMP_ULE, wide canonical IV, backedge-taken-count) up front
when tail folding and introduce active-lane-mask as later
transformation.

This effectively makes (ICMP_ULE, wide canonical IV, backedge-taken-count)
the canonical form for tail-folding early on. Introducing more specific
active-lane-mask recipes is treated as a VPlan-to-VPlan optimization.

This has the advantage of keeping the logic (and complexity) of
introducing active-lane-mask recipes in a single place, instead of
spreading the logic out across multiple functions. It also simplifies
initial VPlan construction and enables treating introducing EVL as
similar optimization.

Diff Detail

Event Timeline

fhahn created this revision.Aug 24 2023, 2:09 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 24 2023, 2:09 PM
fhahn requested review of this revision.Aug 24 2023, 2:09 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 24 2023, 2:09 PM
Herald added subscribers: wangpc, vkmr. · View Herald Transcript
Matt added a subscriber: Matt.Aug 24 2023, 2:13 PM
ABataev added inline comments.Aug 25 2023, 3:59 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
823

auto *

886

auto *

903–904

Looks like boolean result of Fn is not used, you can make it simply return void.

Ayal added a comment.Aug 30 2023, 3:38 PM

Trying to simplify and clarify the logic as it gets refactored.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8728–8729

nit: may be clearer to set and pass bool HasNUW = CM.getTailFoldingStyle() == TailFoldingStyle::None;, and possibly auto DL = DLInst ? DLInst->getDebugLoc() : DebugLoc());

8900

nit: may be slightly clearer to set and pass

bool ForControlFlow = useActiveLaneMaskForControlFlow(Style);
bool WithoutRuntimeCheck = Style == TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck;
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
814

This adds much more than a VPActiveLaneMaskPHIRecipe. One part adds recipes in preheader and this phi, returning it for further/any use. Another part replaces the loop branch with several recipes including an ActiveLaneMask. The two parts are apparently independent of each other.

853

nit: this seems like the simpler case so should probably go first.

854

nit: aka CanonicalIVIncrement?

858

Better to erase the (BranchOnCount?) terminator next to inserting the new one which replaces it below - a BranchOnCond which effectively turns the loop into an uncountable one.

870

Better to use separate variables than redefine same CanonicalIVIncrementParts twice. As in ALM and EntryALM. Especially for something canonical...

892–893
894

Would vp_find_if_user() be more accurate, as in std::find_if(), returning the found VPUser* or null?

903

nit: check if (Fn(U)) before inserting U's users to Worklist.

903–904

Agreed, Fn always returns false below, resulting in "for_all". Suggest to use "find_if" instead, in this case.

913

This is a major part of the transform. If replacing the loop branch can indeed be separated from generating LaneMaskPhi, perhaps it may be clearer to have here

if (UseActiveLaneMaskForControlFlow)
  useActiveLaneMaskForLoopBranch(Plan, DataAndControlFlowWithoutRuntimeCheck);

and later have

if (UseActiveLaneMaskForControlFlow) {
    auto *LaneMaskPhi = useActiveLaneMaskForPhi(Plan);
    VPI->replaceAllUsesWith(LaneMaskPhi);
}
else { ...
917

How many instances of the same "WidenIV <= BTC" are we expecting?
Why replace each with a separate copy of ActiveLaneMask(IV, TC)?
Suffice to find_if one exists and process it? Can then verify find_if again does not find another.

928

nit: erase from parent here too, or leave both to DCE? Erasing a user during traversal?

fhahn updated this revision to Diff 556278.Sep 8 2023, 9:59 AM
fhahn marked 17 inline comments as done.

Address latest comments, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8728–8729

Simplified, separately updated getDebugLocFromInstOrOperands to return the DebugLoc directly in 08de6508ab6a

8900

Adjusted, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
814

Clarified the name & comment. The branch part only adds a Not and Branch recipe, so not sure if it is worth splitting this off?

823

Adjusted, thanks!

853

Reordered the blocks, thanks!

854

Yep, adjusted, thanks!

858

Updated to use VPBuilder to construction (and insert) recipes and moved removal down.

870

Split up, thanks!

886

Now that this is using VPBuilder, there's no. need to assign the result, thanks!

892–893

Adjusted, thanks!

894

Bool return value was intended to let the callback stop iteration, but it is not needed in the current version, removed the bool return value.

903

Boolean removed, thanks!

903–904

Updated to use void, thanks!

913

I am not sure, creating the phi and adjusting the branch is tied together at the moment, as in whenever we introduce a active-lane-mask-phi we will also replace the branch.

917

Updated to re-use created lane mask. At the moment there should be only one, but not sure if it is worth relying on that here.

928

Restructured the code to replace & erase once for both cases.

Ayal added a comment.Sep 17 2023, 2:37 PM

Trying to (further) simplify and clarify the logic as it gets refactored.

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
139

nit: these “create op”’s should arguably return an op/recipe/vpinstruction rather than a vpvalue, unless they’re expected to potentially fold constants?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8617

nit: canonical IV also serves as a basis for derived IV's.

Note that the CanonicalIVPHI - CanonicalIVIncrement cycle introduced here along with BranchOnCount produce a single value/instruction per vector iteration, with StartV producing a single live-in (scalar) value. In contrast to subsequent IV/AVL recipes producing values per part.

Note that the three recipes are created and introduced into header and latch directly, w/o VPBuilder.

8724

nit: worth a TODO to revisit this note?

8730

This call of getTailFoldingStyle() need not pass IVUpdateMayOverflow?
Suffice to call getTailFoldingStyle() once and record its result in Style, to be reused below?
nit: worth a comment explaining why NUW corresponds with not folding tail.

8942

nit: comment that true stands for HasNUW and why that is chosen here.

llvm/lib/Transforms/Vectorize/VPlan.h
2084

Where/is this (non-const variant) needed?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
518

nit: is this related?

814

Worth explaining what is returned.

Worth sketching the Inc-for-part(StartV)/ALM at preheader plus Inc-for-part(CanInc/CanPhi)/ALM both feeding ALM-Phi, the latter also feeding Not/BrOnCond which replaces BrOnCount.

Worth mentioning that StartV/CanPhi/CanInc and their users remain intact, only BrOnCount is replaced, and ALM-phi is returned.

821
823

Place this comment below, before actually creating EntryALM and its Increment-for-part?

825
830

This depends on how ALM recipe is defined - it could generate a mask per part based on using a single value per vector iteration of its operand, effectively folding the increment-per-part recipe into the ALM recipe. This could first be set up w/o unrolling in mind, i.e., when UF=1, CanonicalIVIncrementForPart is redundant; and then extend to multiple parts, eventually to replicate them in VPlan.

833
834

nit: worth having a constructor of VPBuilder that takes an insertion point as parameter.

852

nit: keep the order of setting IncrementValue and TripCount consistent across both then and else.

868

nit: clarify that new Inc-for-part/ALM/Not/BrOnCond sequence is generated before BrOnCount, which is then eliminated.

869

CanonicalIVIncrementParts is essentially the opcode. Better distinguish it from the other CanonicalIVIncrementForPart above, called EntryIncrement. How about InLoopIncrement?

880

Maybe worth having both BranchOnCondTrue and BranchOnCondFalse.

908

This def-use walk over all in/direct users of the canonical induction may end up scanning all recipes.
To visit all (ICMP_ULE, wide canonical IV, backedge-taken-count) patterns, suffice to look at all direct users of BTC - looking for ICMP_ULE's whose other operand is a wide canonical IV?

912

VPI >> CompareToReplace ?

917

In order to re-use a lane mask, need to make sure it is placed such that it dominates all uses? When its an ALM phi its fine, but here it is placed after a wide canonical IV, which it uses.

Seems clearer to generate a (wide canonical IV and) ALM and place LaneMask in the header at the outset, to be reused by any ICMP_ULE pattern, or collected by DCE later if no such pattern exists(*)? I.e., something like

if (UseActiveLaneMaskForControlFlow)
  LaneMask = addVPLaneMaskPhiAndUpdateExitBranch();
  // Can update exit branch here, subsequently.
else { // generate it here
  LaneMask = ...;
}

(*) Note that addALM is (currently) called only if tail is folded, so at-least one ICMP_ULE pattern is expected, and could be asserted.

924

Define LaneMask as a RecipeBase instead of VPValue, to avoid VPInstruction->VPValue->VPRecipeBase?

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
72

Can more explicitly explain that (ICMP_ULE, WideCanonIIV, BTC) is replaced with (ActiveLaneMask, IV, BTC).

Worth also explaining here about DataAndControlFlowWithoutRuntimeCheck.

fhahn updated this revision to Diff 557046.Sep 19 2023, 8:22 AM
fhahn marked 21 inline comments as done.
fhahn added a comment.Sep 19 2023, 8:22 AM

Address latest comments, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
139

Updated for the new one, can adjust the others as follow-up cleanup, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8617

Should the comment be clarified separately with more details here, after landing the change now that it is much simpler to spell out what is going on?

8724

Can do separately, thanks!

8730

Adjusted and added a comment, thanks!

8942

Commented and use variable to attach name, thanks!

llvm/lib/Transforms/Vectorize/VPlan.h
2084

Not needed in the current version. Removed, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
518

Removed , thanks!

821

Adjusted, thanks! Also update to use getBackedgeValue below.

823

Moved, thanks!

825

Adjusted, thanks!

830

I've a set of patches that introduce explicit unrolling, will share that soon. To keep the current patch as simple as possible, it would be great to keep the increment recipe as is for now.

833

Adjusted, thanks!

834

Added, thanks!

852

reordered, thanks!

868

Expanded comment and also clarified the code by using EB->getTerminator().

869

Renamed, thanks!

880

Maybe it would be better to just have BranchOnCond to use the blocks Successor to set destinations for the branch, then this can be controlled by adjusting the order of successors? In any case, would be good as follow up.

908

Adjusted, thanks!

912

Renamed, thanks!

924

Adjusted, thanks! Trade-off is the need to use getVPSingleValue. Really need a base class for recipes defining a single VPValue....

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
72

Expanded comment, thanks! Also added that DataAndControlFlowWithoutRuntimeCheck implies UseActiveLaneMaskForControlFlow ( and added an assert)

Ayal added inline comments.Sep 19 2023, 11:04 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8945
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
824

nit: worth commenting why poison generating flags are dropped from CanonicalIVIncrement.

911

Instead of setting to nullptr, create an ALM instruction following canonical IV widened/incrementedForPart placing both before first non-phi of loop header, to make sure the former can serve all potential users of ICMP_ULE's?

916

Only immediate users of BTC are sought, suffice to traverse them via
for (VPUser &U : BTC->users()) {
rather than
vp_for_all_users() { ?

923

nit: can assert operand 1 is BTC, given that operand 0 is not and the 2-operand compare uses BTC.

931

The first operand of ALM when addVPLaneMaskPhiAndUpdateExitBranch() generates it in

  • preheader is a CanonicalIVIncrementForPart(StartV)
  • latch is CanonicalIVIncrementForPart(CanonicalIVPHI/CanonicalIVIncrement)

When generating it here in header it is VPWidenCanonicalIVRecipe(CanonicalIVPHI).

Can ALM use CanonicalIVIncrementForPart(CanonicalIVPHI) here too, for consistency?

Are VPWidenCanonicalIVRecipe and CanonicalIVIncrementForPart interchangeable, are both needed?

fhahn updated this revision to Diff 557067.Sep 19 2023, 12:08 PM
fhahn marked 6 inline comments as done.

Address latest comments, thanks

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8945

Argh, yes, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
824

IIUC we should only need to drop them for DataAndControlFlowWithoutRuntimeCheck? Might be good to move the code as follow-up and document then.

911

I adjusted the code to first look for the widened IV. If it exists and !UseActiveLaneMaskForControlFlow, create ALM up front.

916

Adjusted the code to first look for the wide canonical IV and then iterate here over just its users. Overall that makes the structure of the code easier to follow IMO.

923

added assert, thanks!

931

ALM needs scalar arguments, so it uses lane 0. This means here it would be better to use the canonical IV instead of the widened recipe. There are some small codegen changes, so I think that would be better as follow up.

Ayal added inline comments.Sep 19 2023, 3:07 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8904

To move all this from tryToBuildVPlanWithVPRecipes() into LVP::optimize() would require moving getTailFoldingStyle(), useActiveLaneMask() et al. from LoopVectorizer.cpp?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
824

Worth leaving some note behind for now.

895

Look for a user U that is a VPWidenCanonicalIVRecipe and resides in the header?
(There could also be multiple such U's....)
As we're considering multiple ICMP_ULE's, they could be using multiple R's, and both could spread out of the header.

Or simply create a WideCanonicalIV before the first non-phi of the header.

901

This should be an assert, before checking if UseActiveLaneMaskForControlFlow, because addActiveLaneMask() is called only when tail is folded, which implies an ICMP_ULE pattern.

924

OK to erase a user from parent during traversal over users?

Really need a base class for recipes defining a single VPValue....

+1

fhahn updated this revision to Diff 557117.Sep 20 2023, 8:48 AM
fhahn marked 5 inline comments as done.

Address latest comments, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8904

Yes, it would require plugging all this through to optimize(). Would need to think about how to consolidate passing the required info through in a general way.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
824

Thanks, added a comment + TODO

895

Added an assert that it is in the header. All compares are users of the WideCanonicalIV, the created ALM should be placed to be valid for all of them. Might be good to ensure there's a single VPWidenCanonicalIVRecipe in the future as part of the verifier?

901

Converted to assert, thanks!

924

Not necessarily, created a temporary vector, thanks!

Ayal added inline comments.Sep 21 2023, 2:47 PM
llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
139

All callers of createOverflowingOp() below pass {false, false} as their WrapFlags, i.e., both HasNUW and HasNSW are set to false, which is the default. Suffice to call createNaryOp() instead?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8904

Worth a TODO to wrap this all within a VPlanTransforms::addActiveLaneMask(*Plan)?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
824

TODO to revisit poison droppings still needs to be added?

868

nit: should be simply Builder.setInsertPoint(OriginalTerminator); by introducing VPBuilder::setInsertPoint(VPRecipeBase*);

893

?

895
901

assert still needs to be added, above?

924

Can alternatively avoid erasing from parent here and leave it to dead recipe elimination.

fhahn updated this revision to Diff 557273.Sep 23 2023, 2:20 PM
fhahn marked 10 inline comments as done.

Address latest comments, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
139

The current flag-handling code assumes that flags are only queried when the correct OperationType is set. Using createNaryOp would not set the operation type, so during recipe executing an assert would be triggered when hasNUW is called. The main motivation for strict checking is to ensure consistent handling. It effectively enforces that hasNUW is only used on recipes/opcodes that have associated flags, which we would have to weaken, either by having hasNUW return false on mismatch or adding extra checks for the operation type. The current handling is similar to how it is handled in LLVM IR.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8904

added, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
814

Thanks, expanded comment!

824

added, thanks!

868

Done, thanks!

893

Done, thanks!

895

done ,thanks!

901

Removed, thanks!

924

Currently we only run dead recipe removal after induction optimizations, so leaving the compares around keeps the wide induction live. Could be addressed by running dead recipe removal at the beginning of the VPlan optimizations as well separately.

Ayal accepted this revision.Sep 24 2023, 3:51 AM

This looks good to me, adding some final thoughts and comments, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
139

Yes, OperationType needs to be set correctly for any VPRecipeWithIRFlags, and to OverflowingBinOp for any VPInstruction with CanonicalIVIncrementForPart as its opcode. This could presumably be handled by createNaryOp/createInstruction/VPInstruction-constructor, possibly with a protected setOperationType if needed.
OTOH, if non-default wrap flags are desired, such a designated creation method would be needed.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
816

(note that replacing branch-on-count with branch-on-cond(!ALM) effectively turns a countable loop into an uncountable one, yet VectorTripCount remains intact.)

817

(plus some poison droppings, to be percise.)

820
827

// The function adds the following recipes:

828

// %TripCount = calculate-trip-count-minus-VF (original TC) [if DataWithControlFlowWithoutRuntimeCheck]

850
895

Might be good to ensure there's a single VPWidenCanonicalIVRecipe in the future as part of the verifier?

+1

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
418

nit (independent of this patch): suffice to generate this saturating subtraction of N - vscale*32 once instead of replicating it four times.

This revision is now accepted and ready to land.Sep 24 2023, 3:51 AM
fhahn marked 8 inline comments as done.Sep 25 2023, 5:32 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
816

added a note for now/

817

Added, thanks !

820

Adjusted and moved the add part below.

827

Added, thanks!

828

Added, thanks!

850

Fixed, thanks!

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
418

Yes, hopefully should be cleaned up by explicit unrolling, but I could check if we can adjust this during execute to start with ,like done in other places.

This revision was landed with ongoing or failed builds.Sep 25 2023, 5:35 AM
This revision was automatically updated to reflect the committed changes.
fhahn marked 7 inline comments as done.