This is an archive of the discontinued LLVM Phabricator instance.

[VPlan] Add active-lane-mask as VPlan-to-VPlan transformation.
ClosedPublic

Authored by fhahn on Aug 24 2023, 2:09 PM.

Details

Summary

This patch updates the mask creation code to always create compares of
the form (ICMP_ULE, wide canonical IV, backedge-taken-count) up front
when tail folding and introduce active-lane-mask as later
transformation.

This effectively makes (ICMP_ULE, wide canonical IV, backedge-taken-count)
the canonical form for tail-folding early on. Introducing more specific
active-lane-mask recipes is treated as a VPlan-to-VPlan optimization.

This has the advantage of keeping the logic (and complexity) of
introducing active-lane-mask recipes in a single place, instead of
spreading the logic out across multiple functions. It also simplifies
initial VPlan construction and enables treating introducing EVL as
similar optimization.

Diff Detail

Event Timeline

fhahn created this revision.Aug 24 2023, 2:09 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 24 2023, 2:09 PM
fhahn requested review of this revision.Aug 24 2023, 2:09 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 24 2023, 2:09 PM
Herald added subscribers: wangpc, vkmr. · View Herald Transcript
Matt added a subscriber: Matt.Aug 24 2023, 2:13 PM
ABataev added inline comments.Aug 25 2023, 3:59 AM
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
869

auto *

932

auto *

949–950

Looks like boolean result of Fn is not used, you can make it simply return void.

Ayal added a comment.Aug 30 2023, 3:38 PM

Trying to simplify and clarify the logic as it gets refactored.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8730–8731

nit: may be clearer to set and pass bool HasNUW = CM.getTailFoldingStyle() == TailFoldingStyle::None;, and possibly auto DL = DLInst ? DLInst->getDebugLoc() : DebugLoc());

8905

nit: may be slightly clearer to set and pass

bool ForControlFlow = useActiveLaneMaskForControlFlow(Style);
bool WithoutRuntimeCheck = Style == TailFoldingStyle::DataAndControlFlowWithoutRuntimeCheck;
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
860

This adds much more than a VPActiveLaneMaskPHIRecipe. One part adds recipes in preheader and this phi, returning it for further/any use. Another part replaces the loop branch with several recipes including an ActiveLaneMask. The two parts are apparently independent of each other.

899

nit: this seems like the simpler case so should probably go first.

900

nit: aka CanonicalIVIncrement?

904

Better to erase the (BranchOnCount?) terminator next to inserting the new one which replaces it below - a BranchOnCond which effectively turns the loop into an uncountable one.

916

Better to use separate variables than redefine same CanonicalIVIncrementParts twice. As in ALM and EntryALM. Especially for something canonical...

938–939
940

Would vp_find_if_user() be more accurate, as in std::find_if(), returning the found VPUser* or null?

949

nit: check if (Fn(U)) before inserting U's users to Worklist.

949–950

Agreed, Fn always returns false below, resulting in "for_all". Suggest to use "find_if" instead, in this case.

959

This is a major part of the transform. If replacing the loop branch can indeed be separated from generating LaneMaskPhi, perhaps it may be clearer to have here

if (UseActiveLaneMaskForControlFlow)
  useActiveLaneMaskForLoopBranch(Plan, DataAndControlFlowWithoutRuntimeCheck);

and later have

if (UseActiveLaneMaskForControlFlow) {
    auto *LaneMaskPhi = useActiveLaneMaskForPhi(Plan);
    VPI->replaceAllUsesWith(LaneMaskPhi);
}
else { ...
963

How many instances of the same "WidenIV <= BTC" are we expecting?
Why replace each with a separate copy of ActiveLaneMask(IV, TC)?
Suffice to find_if one exists and process it? Can then verify find_if again does not find another.

974

nit: erase from parent here too, or leave both to DCE? Erasing a user during traversal?

fhahn updated this revision to Diff 556278.Sep 8 2023, 9:59 AM
fhahn marked 17 inline comments as done.

Address latest comments, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8730–8731

Simplified, separately updated getDebugLocFromInstOrOperands to return the DebugLoc directly in 08de6508ab6a

8905

Adjusted, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
860

Clarified the name & comment. The branch part only adds a Not and Branch recipe, so not sure if it is worth splitting this off?

869

Adjusted, thanks!

899

Reordered the blocks, thanks!

900

Yep, adjusted, thanks!

904

Updated to use VPBuilder to construction (and insert) recipes and moved removal down.

916

Split up, thanks!

932

Now that this is using VPBuilder, there's no. need to assign the result, thanks!

938–939

Adjusted, thanks!

940

Bool return value was intended to let the callback stop iteration, but it is not needed in the current version, removed the bool return value.

949

Boolean removed, thanks!

949–950

Updated to use void, thanks!

959

I am not sure, creating the phi and adjusting the branch is tied together at the moment, as in whenever we introduce a active-lane-mask-phi we will also replace the branch.

963

Updated to re-use created lane mask. At the moment there should be only one, but not sure if it is worth relying on that here.

974

Restructured the code to replace & erase once for both cases.

Ayal added a comment.Sep 17 2023, 2:37 PM

Trying to (further) simplify and clarify the logic as it gets refactored.

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
147

nit: these “create op”’s should arguably return an op/recipe/vpinstruction rather than a vpvalue, unless they’re expected to potentially fold constants?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8619–8620

nit: canonical IV also serves as a basis for derived IV's.

Note that the CanonicalIVPHI - CanonicalIVIncrement cycle introduced here along with BranchOnCount produce a single value/instruction per vector iteration, with StartV producing a single live-in (scalar) value. In contrast to subsequent IV/AVL recipes producing values per part.

Note that the three recipes are created and introduced into header and latch directly, w/o VPBuilder.

8726

nit: worth a TODO to revisit this note?

8732–8736

This call of getTailFoldingStyle() need not pass IVUpdateMayOverflow?
Suffice to call getTailFoldingStyle() once and record its result in Style, to be reused below?
nit: worth a comment explaining why NUW corresponds with not folding tail.

8948–8952

nit: comment that true stands for HasNUW and why that is chosen here.

llvm/lib/Transforms/Vectorize/VPlan.h
2086

Where/is this (non-const variant) needed?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
518

nit: is this related?

860

Worth explaining what is returned.

Worth sketching the Inc-for-part(StartV)/ALM at preheader plus Inc-for-part(CanInc/CanPhi)/ALM both feeding ALM-Phi, the latter also feeding Not/BrOnCond which replaces BrOnCount.

Worth mentioning that StartV/CanPhi/CanInc and their users remain intact, only BrOnCount is replaced, and ALM-phi is returned.

867
869

Place this comment below, before actually creating EntryALM and its Increment-for-part?

871
876

This depends on how ALM recipe is defined - it could generate a mask per part based on using a single value per vector iteration of its operand, effectively folding the increment-per-part recipe into the ALM recipe. This could first be set up w/o unrolling in mind, i.e., when UF=1, CanonicalIVIncrementForPart is redundant; and then extend to multiple parts, eventually to replicate them in VPlan.

879
880

nit: worth having a constructor of VPBuilder that takes an insertion point as parameter.

898

nit: keep the order of setting IncrementValue and TripCount consistent across both then and else.

914

nit: clarify that new Inc-for-part/ALM/Not/BrOnCond sequence is generated before BrOnCount, which is then eliminated.

915

CanonicalIVIncrementParts is essentially the opcode. Better distinguish it from the other CanonicalIVIncrementForPart above, called EntryIncrement. How about InLoopIncrement?

926

Maybe worth having both BranchOnCondTrue and BranchOnCondFalse.

954

This def-use walk over all in/direct users of the canonical induction may end up scanning all recipes.
To visit all (ICMP_ULE, wide canonical IV, backedge-taken-count) patterns, suffice to look at all direct users of BTC - looking for ICMP_ULE's whose other operand is a wide canonical IV?

958

VPI >> CompareToReplace ?

963

In order to re-use a lane mask, need to make sure it is placed such that it dominates all uses? When its an ALM phi its fine, but here it is placed after a wide canonical IV, which it uses.

Seems clearer to generate a (wide canonical IV and) ALM and place LaneMask in the header at the outset, to be reused by any ICMP_ULE pattern, or collected by DCE later if no such pattern exists(*)? I.e., something like

if (UseActiveLaneMaskForControlFlow)
  LaneMask = addVPLaneMaskPhiAndUpdateExitBranch();
  // Can update exit branch here, subsequently.
else { // generate it here
  LaneMask = ...;
}

(*) Note that addALM is (currently) called only if tail is folded, so at-least one ICMP_ULE pattern is expected, and could be asserted.

970

Define LaneMask as a RecipeBase instead of VPValue, to avoid VPInstruction->VPValue->VPRecipeBase?

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
72

Can more explicitly explain that (ICMP_ULE, WideCanonIIV, BTC) is replaced with (ActiveLaneMask, IV, BTC).

Worth also explaining here about DataAndControlFlowWithoutRuntimeCheck.

fhahn updated this revision to Diff 557046.Sep 19 2023, 8:22 AM
fhahn marked 21 inline comments as done.
fhahn added a comment.Sep 19 2023, 8:22 AM

Address latest comments, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
147

Updated for the new one, can adjust the others as follow-up cleanup, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8619–8620

Should the comment be clarified separately with more details here, after landing the change now that it is much simpler to spell out what is going on?

8726

Can do separately, thanks!

8732–8736

Adjusted and added a comment, thanks!

8948–8952

Commented and use variable to attach name, thanks!

llvm/lib/Transforms/Vectorize/VPlan.h
2086

Not needed in the current version. Removed, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
518

Removed , thanks!

867

Adjusted, thanks! Also update to use getBackedgeValue below.

869

Moved, thanks!

871

Adjusted, thanks!

876

I've a set of patches that introduce explicit unrolling, will share that soon. To keep the current patch as simple as possible, it would be great to keep the increment recipe as is for now.

879

Adjusted, thanks!

880

Added, thanks!

898

reordered, thanks!

914

Expanded comment and also clarified the code by using EB->getTerminator().

915

Renamed, thanks!

926

Maybe it would be better to just have BranchOnCond to use the blocks Successor to set destinations for the branch, then this can be controlled by adjusting the order of successors? In any case, would be good as follow up.

954

Adjusted, thanks!

958

Renamed, thanks!

970

Adjusted, thanks! Trade-off is the need to use getVPSingleValue. Really need a base class for recipes defining a single VPValue....

llvm/lib/Transforms/Vectorize/VPlanTransforms.h
72

Expanded comment, thanks! Also added that DataAndControlFlowWithoutRuntimeCheck implies UseActiveLaneMaskForControlFlow ( and added an assert)

Ayal added inline comments.Sep 19 2023, 11:04 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8951
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
870

nit: worth commenting why poison generating flags are dropped from CanonicalIVIncrement.

957

Instead of setting to nullptr, create an ALM instruction following canonical IV widened/incrementedForPart placing both before first non-phi of loop header, to make sure the former can serve all potential users of ICMP_ULE's?

962

Only immediate users of BTC are sought, suffice to traverse them via
for (VPUser &U : BTC->users()) {
rather than
vp_for_all_users() { ?

969

nit: can assert operand 1 is BTC, given that operand 0 is not and the 2-operand compare uses BTC.

977

The first operand of ALM when addVPLaneMaskPhiAndUpdateExitBranch() generates it in

  • preheader is a CanonicalIVIncrementForPart(StartV)
  • latch is CanonicalIVIncrementForPart(CanonicalIVPHI/CanonicalIVIncrement)

When generating it here in header it is VPWidenCanonicalIVRecipe(CanonicalIVPHI).

Can ALM use CanonicalIVIncrementForPart(CanonicalIVPHI) here too, for consistency?

Are VPWidenCanonicalIVRecipe and CanonicalIVIncrementForPart interchangeable, are both needed?

fhahn updated this revision to Diff 557067.Sep 19 2023, 12:08 PM
fhahn marked 6 inline comments as done.

Address latest comments, thanks

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8951

Argh, yes, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
870

IIUC we should only need to drop them for DataAndControlFlowWithoutRuntimeCheck? Might be good to move the code as follow-up and document then.

957

I adjusted the code to first look for the widened IV. If it exists and !UseActiveLaneMaskForControlFlow, create ALM up front.

962

Adjusted the code to first look for the wide canonical IV and then iterate here over just its users. Overall that makes the structure of the code easier to follow IMO.

969

added assert, thanks!

977

ALM needs scalar arguments, so it uses lane 0. This means here it would be better to use the canonical IV instead of the widened recipe. There are some small codegen changes, so I think that would be better as follow up.

Ayal added inline comments.Sep 19 2023, 3:07 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8909

To move all this from tryToBuildVPlanWithVPRecipes() into LVP::optimize() would require moving getTailFoldingStyle(), useActiveLaneMask() et al. from LoopVectorizer.cpp?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
870

Worth leaving some note behind for now.

941

Look for a user U that is a VPWidenCanonicalIVRecipe and resides in the header?
(There could also be multiple such U's....)
As we're considering multiple ICMP_ULE's, they could be using multiple R's, and both could spread out of the header.

Or simply create a WideCanonicalIV before the first non-phi of the header.

947

This should be an assert, before checking if UseActiveLaneMaskForControlFlow, because addActiveLaneMask() is called only when tail is folded, which implies an ICMP_ULE pattern.

970

OK to erase a user from parent during traversal over users?

Really need a base class for recipes defining a single VPValue....

+1

fhahn updated this revision to Diff 557117.Sep 20 2023, 8:48 AM
fhahn marked 5 inline comments as done.

Address latest comments, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8909

Yes, it would require plugging all this through to optimize(). Would need to think about how to consolidate passing the required info through in a general way.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
870

Thanks, added a comment + TODO

941

Added an assert that it is in the header. All compares are users of the WideCanonicalIV, the created ALM should be placed to be valid for all of them. Might be good to ensure there's a single VPWidenCanonicalIVRecipe in the future as part of the verifier?

947

Converted to assert, thanks!

970

Not necessarily, created a temporary vector, thanks!

Ayal added inline comments.Sep 21 2023, 2:47 PM
llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
147

All callers of createOverflowingOp() below pass {false, false} as their WrapFlags, i.e., both HasNUW and HasNSW are set to false, which is the default. Suffice to call createNaryOp() instead?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8909

Worth a TODO to wrap this all within a VPlanTransforms::addActiveLaneMask(*Plan)?

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
870

TODO to revisit poison droppings still needs to be added?

914

nit: should be simply Builder.setInsertPoint(OriginalTerminator); by introducing VPBuilder::setInsertPoint(VPRecipeBase*);

939

?

941
947

assert still needs to be added, above?

970

Can alternatively avoid erasing from parent here and leave it to dead recipe elimination.

fhahn updated this revision to Diff 557273.Sep 23 2023, 2:20 PM
fhahn marked 10 inline comments as done.

Address latest comments, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
147

The current flag-handling code assumes that flags are only queried when the correct OperationType is set. Using createNaryOp would not set the operation type, so during recipe executing an assert would be triggered when hasNUW is called. The main motivation for strict checking is to ensure consistent handling. It effectively enforces that hasNUW is only used on recipes/opcodes that have associated flags, which we would have to weaken, either by having hasNUW return false on mismatch or adding extra checks for the operation type. The current handling is similar to how it is handled in LLVM IR.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8909

added, thanks!

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
860

Thanks, expanded comment!

870

added, thanks!

914

Done, thanks!

939

Done, thanks!

941

done ,thanks!

947

Removed, thanks!

970

Currently we only run dead recipe removal after induction optimizations, so leaving the compares around keeps the wide induction live. Could be addressed by running dead recipe removal at the beginning of the VPlan optimizations as well separately.

Ayal accepted this revision.Sep 24 2023, 3:51 AM

This looks good to me, adding some final thoughts and comments, thanks!

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
147

Yes, OperationType needs to be set correctly for any VPRecipeWithIRFlags, and to OverflowingBinOp for any VPInstruction with CanonicalIVIncrementForPart as its opcode. This could presumably be handled by createNaryOp/createInstruction/VPInstruction-constructor, possibly with a protected setOperationType if needed.
OTOH, if non-default wrap flags are desired, such a designated creation method would be needed.

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
862

(note that replacing branch-on-count with branch-on-cond(!ALM) effectively turns a countable loop into an uncountable one, yet VectorTripCount remains intact.)

863

(plus some poison droppings, to be percise.)

866
873

// The function adds the following recipes:

874

// %TripCount = calculate-trip-count-minus-VF (original TC) [if DataWithControlFlowWithoutRuntimeCheck]

896
941

Might be good to ensure there's a single VPWidenCanonicalIVRecipe in the future as part of the verifier?

+1

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
418

nit (independent of this patch): suffice to generate this saturating subtraction of N - vscale*32 once instead of replicating it four times.

This revision is now accepted and ready to land.Sep 24 2023, 3:51 AM
fhahn marked 8 inline comments as done.Sep 25 2023, 5:32 AM
fhahn added inline comments.
llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp
862

added a note for now/

863

Added, thanks !

866

Adjusted and moved the add part below.

873

Added, thanks!

874

Added, thanks!

896

Fixed, thanks!

llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
418

Yes, hopefully should be cleaned up by explicit unrolling, but I could check if we can adjust this during execute to start with ,like done in other places.

This revision was landed with ongoing or failed builds.Sep 25 2023, 5:35 AM
This revision was automatically updated to reflect the committed changes.
fhahn marked 7 inline comments as done.