This is an archive of the discontinued LLVM Phabricator instance.

[LV] Plan with and without FoldTailByMasking
Needs ReviewPublic

Authored by dmgreen on Jan 18 2023, 6:33 AM.

Details

Summary

Currently the loop vectorizer has a single parameter in the CostModel that controls FoldTailByMasking. It is set fairly early, and can't be changed later meaning we need to pick between tail folding and non tail folding before we have done much cost modelling. This patch aims to alter that so that there is not a single parameter, moving it eventually into the vplan, so that we can have plans both with and without FoldTailByMasking, that can be costed against one another and the best one picked for vectorization.

A lot of the changes are fairly mechanical and attempted to be non-disruptive to keep the patch simpler, but there are still a fair number of changes. The important parts are:

  • FoldTailByMasking is removed from CostModel. It now has a MayFoldTailByMasking variable to hold whether FoldTailByMasking VPlans can be created.
  • A number of maps in the cost model like InstsToScalarize/Uniforms/Scalars/ForcedScalars were made conditional on the pair of (FoldTailByMasking, VF), so that they still contain the information required in both FoldTailByMasking=true and FoldTailByMasking=false cases. The semi-random name VectorizationScheme was used to describe the pair of (FoldTailByMasking, VF). Alternative suggestions welcome.
  • We then create vplans with both FoldTailByMasking=false and FoldTailByMasking=true if we MayFoldTailByMasking. The VPlan from then stores whether it is FoldTailByMasking. For VF=1 only non-predicated plans are created.
  • Tail folded and non tail folded vplans then need to be able to be costed against one another. This patch makes FoldTailByMasking win on a tie if the costs are equal, which will be more consistent with the vectorization before this patch.
  • Epilogue loops for the moment are always use FoldTailByMasking=false. This can hopefully be changed in the near future to allow predicated epilogues for unpredicated loop bodies.

Overall this has the effect of allowing us to model and cost tail folded and non tail folded vplans against one another, and should in the future allow us to generate predicated epilogues for unpredicated vector loops.

Diff Detail

Event Timeline

dmgreen created this revision.Jan 18 2023, 6:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 18 2023, 6:33 AM
dmgreen requested review of this revision.Jan 18 2023, 6:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 18 2023, 6:33 AM

Thanks for working on this. General direction looks good to me, this is exactly what we want.

VectorizationScheme was used to describe the pair of (FoldTailByMasking, VF). Alternative suggestions welcome

I was going to suggest PredicationScheme. But the VF is also part of it, so PredicationScheme does not covert it. Perhaps VectorizationScheme is just fine. :)

I have done only a first scan of the patch, but it is big, so need to do it again, which I will do soon.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5783

Nit: perhaps rename this and in some other places to VS, so that the VF.Width references below become VS.Width.

SjoerdMeijer accepted this revision.Jan 26 2023, 12:46 AM

This makes a lot of sense to me, so with the nits addressed, this LGTM.
But wait a day in case there are other ideas about this.

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
182

This comment needs to be moved to line no. 224.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7618

nit: TODO

This revision is now accepted and ready to land.Jan 26 2023, 12:46 AM

Hi @dmgreen, I'm sorry I've not looked into this in more detail, but from the description you gave (which is very detailed - thanks!) I do have one concern about choosing tail-folding by default in a tie. One major problem we have at the moment is that the active lane mask call is effectively free because, unless I'm mistaken, we don't add the cost of this intrinsic to the loop. A simple IV increment (an add instruction) is likely to be cheaper than the codegen for the loop predicate for many targets? However, I do appreciate that targets that can't efficiently generate loop predicates probably also can't do masked loads and stores, so the tail-folded loop cost is likely to be high anyway. I wonder if it's better to be conservative and choose the non-tail-folded version in a tie until we've got a fairer comparison between different vectorisation styles?

fhahn requested changes to this revision.Jan 26 2023, 3:09 PM

I think the overall direction is very desirable! However I am not sure if adding a FoldTail argument to the cost functions is an ideal way to get there as it requires a large number of changes and is not really scalable (e.g. adding support to generate a 3rd variant of plans may mean adding another argument to all functions)

I am not sure if I am missing something, but would it be possible to instead run the planner multiple times with different cost-model configurations? I tried to sketch an option for more general support for that in D142669. Then we could generate the plans for tail-folding by just instantiating another instance of the cost model, sketched in D142670. Maybe that could help to simplify the patch?

Another thing I noticed that the patch changes skeleton creation to take a VPlan as argument. I think it would be preferable to do this the other way around if possible, so instead model the difference between plans with and without tail-folding explicitly in the pre-header blocks in the VPlan. I've not looked at the details here to see how difficult that would be yet unfortunately though. But more than happy to iterate on that aspect together!

This revision now requires changes to proceed.Jan 26 2023, 3:09 PM
tschuett added a subscriber: tschuett.EditedJan 27 2023, 12:10 AM

Could you model this approach instead as a VPlan transformation, instead of hardcoding unscalable flags. Transform: add tail folding. Transform: add masked tail folding. Transform: add scalar tail. Then you can take the cost model to decide which is more desirable.

Thanks for working on this. General direction looks good to me, this is exactly what we want.

VectorizationScheme was used to describe the pair of (FoldTailByMasking, VF). Alternative suggestions welcome

I was going to suggest PredicationScheme. But the VF is also part of it, so PredicationScheme does not covert it. Perhaps VectorizationScheme is just fine. :)

Thanks. There are still some MVE performance issues I need to work through, where it now picks a higher unpredicated trip count over tail predication, and a combination of it being inside a nested loop and low cheap counts make it unprofitable. There are some good improvements too, but that one is a little too large. I think with epilog vectorization and a few other tricks it can get the the point where it too is an improvement.

Hi @dmgreen, I'm sorry I've not looked into this in more detail, but from the description you gave (which is very detailed - thanks!) I do have one concern about choosing tail-folding by default in a tie. One major problem we have at the moment is that the active lane mask call is effectively free because, unless I'm mistaken, we don't add the cost of this intrinsic to the loop. A simple IV increment (an add instruction) is likely to be cheaper than the codegen for the loop predicate for many targets? However, I do appreciate that targets that can't efficiently generate loop predicates probably also can't do masked loads and stores, so the tail-folded loop cost is likely to be high anyway. I wonder if it's better to be conservative and choose the non-tail-folded version in a tie until we've got a fairer comparison between different vectorisation styles?

Yep - that's hopefully step 3. Once we have this and epilog vectorization (which should be a fairly simple addition) we can then have the option on SVE of choosing between unpredicated body + predicated remainder vs a full predicated body, depending on the target. Targets where the while instructions are not a bottleneck can get the benefits of full predication. I'm not sure yet whether it will just be a hard limit or an added cost. That's the idea at least, there are plenty of details to sort out first and it we will have to see how it performs in practice.

I think the overall direction is very desirable! However I am not sure if adding a FoldTail argument to the cost functions is an ideal way to get there as it requires a large number of changes and is not really scalable (e.g. adding support to generate a 3rd variant of plans may mean adding another argument to all functions)

I am not sure if I am missing something, but would it be possible to instead run the planner multiple times with different cost-model configurations? I tried to sketch an option for more general support for that in D142669. Then we could generate the plans for tail-folding by just instantiating another instance of the cost model, sketched in D142670. Maybe that could help to simplify the patch?

Multiple cost models was something I considered, but dismissed fairly quickly. I didn't think it would be something that anyone would agree to in a review - that we invent a new hierarchy of multiple cost-models each containing multiple vplans. The vplans should be flatter than that, and there is more in the cost-model that isn't dependent on FoldTail than is. I think we should be trying to reduce the number of cost-models, not increase it! The same argument about new variants could equally apply, just to an explosion in the number of cost-models needed.

This patch only pushed the FoldTailByMasking to the places that need it, just treating FoldTailByMasking like the VF already is. How married are you to the idea of multiple cost models? (Or how anti this are you?) I know this patch is fairly large but it feels like a cleaner end result once it is done. Let me know and I can try and get everything working the other way if necessary.

Another thing I noticed that the patch changes skeleton creation to take a VPlan as argument. I think it would be preferable to do this the other way around if possible, so instead model the difference between plans with and without tail-folding explicitly in the pre-header blocks in the VPlan. I've not looked at the details here to see how difficult that would be yet unfortunately though. But more than happy to iterate on that aspect together!

I was hoping for one step at a time. Firstly we can break the dependency on specifying FoldTailByMasking early in the costmodel, whilst keeping then number of other changes needed minimal. Bigger changes can come later, but might require more changes. (It is useful, for example, to know that a plan is FoldTailByMasking when picking epilog plans).

Matt added a subscriber: Matt.Jan 30 2023, 3:01 PM
fhahn added a comment.Feb 8 2023, 2:50 PM

I think the overall direction is very desirable! However I am not sure if adding a FoldTail argument to the cost functions is an ideal way to get there as it requires a large number of changes and is not really scalable (e.g. adding support to generate a 3rd variant of plans may mean adding another argument to all functions)

I am not sure if I am missing something, but would it be possible to instead run the planner multiple times with different cost-model configurations? I tried to sketch an option for more general support for that in D142669. Then we could generate the plans for tail-folding by just instantiating another instance of the cost model, sketched in D142670. Maybe that could help to simplify the patch?

Multiple cost models was something I considered, but dismissed fairly quickly. I didn't think it would be something that anyone would agree to in a review - that we invent a new hierarchy of multiple cost-models each containing multiple vplans. The vplans should be flatter than that, and there is more in the cost-model that isn't dependent on FoldTail than is. I think we should be trying to reduce the number of cost-models, not increase it! The same argument about new variants could equally apply, just to an explosion in the number of cost-models needed.

IIUC we in practice already have multiple cost models (e.g. tail folded vs regular), but at the moment we effectively pick which one to run very early. So I don't think this adds a new hierarchy, but I might be missing some aspects you are thinking about here?

I agree that we shouldn't have a new hierarchies containing multiple plans. IIUC this is referring to D142669 where it picks selects the list of plans for the most profitable VF returned by plan. This was mostly to keep the sketch of supporting multiple cost models simple (in terms of time to implement). I *think* this could be improved by computing the cost for each plan directly, instead of doing so after constructing all plans in selectVectorizationFactor. I'll try to see if there are any stumbling blocks down this path.

I was also hoping to untangle some of the cost model functionality that computes the max VF, but unfortunately things are very tied together and it is not really feasible to split those parts of from computing costs per VF.

This patch only pushed the FoldTailByMasking to the places that need it, just treating FoldTailByMasking like the VF already is. How married are you to the idea of multiple cost models? (Or how anti this are you?) I know this patch is fairly large but it feels like a cleaner end result once it is done. Let me know and I can try and get everything working the other way if necessary.

The changes are mostly mechanical, but it makes the mappings in the cost model more complicated and adding the additional argument potentially increases maintenance cost and makes the signatures slightly more complex. I just want to make sure we the considered potential tradeoffs/alternatives if we go down that route.

Another thing I noticed that the patch changes skeleton creation to take a VPlan as argument. I think it would be preferable to do this the other way around if possible, so instead model the difference between plans with and without tail-folding explicitly in the pre-header blocks in the VPlan. I've not looked at the details here to see how difficult that would be yet unfortunately though. But more than happy to iterate on that aspect together!

I was hoping for one step at a time. Firstly we can break the dependency on specifying FoldTailByMasking early in the costmodel, whilst keeping then number of other changes needed minimal. Bigger changes can come later, but might require more changes. (It is useful, for example, to know that a plan is FoldTailByMasking when picking epilog plans).

Agreed, I think there are existing places where it would be convenient to know if the plan is FoldTailByMasking. One example is adjustRecipesForReductions, where this is the main thing for removing the reliance on the cost-model there.

I agree that we shouldn't have a new hierarchies containing multiple plans. IIUC this is referring to D142669 where it picks selects the list of plans for the most profitable VF returned by plan. This was mostly to keep the sketch of supporting multiple cost models simple (in terms of time to implement). I *think* this could be improved by computing the cost for each plan directly, instead of doing so after constructing all plans in selectVectorizationFactor. I'll try to see if there are any stumbling blocks down this path.

Sketched in D143938 and updated D142669.

dmgreen updated this revision to Diff 508526.Mar 27 2023, 1:48 AM

This is just a rebase of the existing patch - I think not much has changed other than adjustments needed for the rebase.

dmgreen updated this revision to Diff 508547.Mar 27 2023, 2:55 AM

OK This now attempt to use the scheme @fhahn suggested where there are multiple cost models that get created with TailFold=true and TailFold=false. It should hopefully work with allowing planning with and without tail folding (this patch), plus the predicated epilog vectorization in the followup.

Some of what was in CostModel has moved into the Planner. This patch adds a reference from the vplan to the costmodel to make sure the correct model is used with the correct plan. Hopefully that can be reduced and removed eventually. It does manage to remove the VFCandidates array which is a nice little simplification.

Hi @dmgreen, I'm starting to review this patch again, but it's quite large so it might take a while. Just a quick skim through so far I thought of a few ideas about some NFC patches that could be useful in their own right and might help to reduce the diff, if you agree?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5317–5318

Some changes like this don't really need to be in this patch, right?

5321

Can't we just pass in TheFunction instead?

5382

Again, I'm a bit concerned about the effect this will have when the active lane mask call is not costed into the loop. I'd prefer for now to be conservative and opt for the non-predicated scheme until we've accounted for the IV costs.

5784

Again, it looks like the change in prototype doesn't need to be part of this patch and might be a useful tidy-up NFC patch.

llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge.ll
973

These debug output changes look useful by themselves outside of this patch - not sure if it's possible to pass in the FoldTailByMasking flag in a separate patch?

fhahn added a comment.Apr 4 2023, 9:17 AM

OK This now attempt to use the scheme @fhahn suggested where there are multiple cost models that get created with TailFold=true and TailFold=false. It should hopefully work with allowing planning with and without tail folding (this patch), plus the predicated epilog vectorization in the followup.

Some of what was in CostModel has moved into the Planner. This patch adds a reference from the vplan to the costmodel to make sure the correct model is used with the correct plan. Hopefully that can be reduced and removed eventually. It does manage to remove the VFCandidates array which is a nice little simplification.

I updated D143938 and D142669 recently and I think they should be in good shape now and should be ready for review. Rebasing the patch on top of them should hopefully simplify the diff. There are a few additional dependencies to remove a few more places that rely on the cost model after construction.

dmgreen added inline comments.Apr 6 2023, 9:30 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5317–5318

Yeah sure, I can give that a try. It seems a bit odd on it's own to be honest, like a patch that makes things worse on its own and feels like we change it just so that we change it again later. Creating needless busywork. But I have for the moment moved it out of this patch.

5321

I think I chose Loop because the LoopVectorizationPlanner doesn't store the Function, so would need pass L->getHeader()->getParent() as argument and it simplified the interface a little. Happy to change it though.

5351

I also have pulled this part out into D147720, as a functional change that can be done separately.

5382

For MVE in the past, when we returned true for preferPredicateOverEpilogue, we would get a tail-predicated loop (or not vectorize). Now that we get both tail-folded and non-tail-folded loops to cost against one another, we need to pick the tail-folded version on a tie to not be worse than before. The scores between the two are often the same, so to be closer to the old codegen the conservative option is to chose FoldTail on a tie.

I believe that the only target that returns true from preferPredicateOverEpilogue at the moment is MVE. SVE can be adjusted later if needed, but my current thinking was to use UsePredicatedEpilogue from D145925 as a first step to get the benefits of epilog vectorization whilst hopefully not messing anything else up. That was the plan for how to treat SVE conservatively, and we can expand things if needed in the future. It doesn't really make a lot of sense to add up disparate throughput costs and expect them to mean anything, but we can perhaps come up with something if we need and if not we can always just force unpredicated body + predicated epilog. Let me know what you think.

5784

Thanks Yeah - That is a good idea. I can actually just remove this from the current version of the patch, I believe.

llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge.ll
973

Hmm. It would involve pulling out VPlan->FoldTailByMasking into a separate patch. That feels like it is the core of this patch, to be honest. Pulling it out just for some debug messages that are usually present elsewhere feels like a bit of an odd patch on it's own. When you look at the whole debug output there is already parts explaining whether the CostModel is FoldTailByMasking.

dmgreen updated this revision to Diff 511438.Apr 6 2023, 9:31 AM

Hi @dmgreen, this patch looks a lot smaller and tidier than last time - thanks a lot! I've just got a few more comments.

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
192–195

Perhaps this comment should now be something like:

/// Cost of the loop with that width and vectorization style.

What do you think?

301–302

I think the comment needs updating now because it no longer returns a VF.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1550–1551

This is not your fault, but whilst you are here could you update the comment to reflect the function behaviour? It looks like the comment probably got out of sync with the function at some point. :)

Perhaps something like:

/// Returns true if a scalar epilogue is allowed. It may return false if:
///   1. We are optimising for code size
///   2. There is a loop hint annotation
///   etc.
1557

Do you know the scenarios in which we aren't folding the tail by masking, but a scalar epilogue is not needed? I think CM_ScalarEpilogueNotNeededUsePredicate is only set for these cases:

  1. The user has supplied a hint.
  2. The user has set the prefer-predicate-over-epilogue flag.
  3. The TTI hook preferPredicateOverEpilogue has returned true.

but I'd expect that for at least 3) we've set FoldTailByMasking to true?

5088–5095

In this case it looks like we're going to create two sets of plans - one with tail-folding and one without - and in both cases we're actually going to tail-fold anyway. Is that right? Not saying we shouldn't do that, but just trying to understand how this works.

5101

Don't we have to also set ScalarEpilogueStatus = CM_ScalarEpilogueAllowed here?

5493

This should always be false for VF=1, right?

7577

Perhaps worth adding a /* ... */ comment for the new argument too? Same for any other similar places in the file.

7637

Is it worth changing hasVF to take the FoldTailByMasking as a second argument so you can just write:

assert(count_if(VPlans,
                [VF, FoldTailByMasking](const VPlanPtr &Plan) {
                  return Plan->hasVF(VF, FoldTailByMasking);
                }) == 1 &&
       "Best VF has not a single VPlan.");
8710

Wouldn't it be simpler to just write:

for (ElementCount VF = MinVF; ElementCount::isKnownLE(VF, MaxVF);
llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll
17

This seems odd. Perhaps I'm mistakend, but I thought with your patch we wouldn't decide to tail-fold with a VF of 1?

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
805

I was confused at first, but actually this looks like an improvement - nice! I think before we were still using an interleave count of 1 as if we were tail-folding, but now we've fallen back on the non-tail-folding plan that uses interleaving.

llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
2915 ↗(On Diff #511438)

For what it's worth this new version looks better, but do you know why? Is it because we no longer allow tail-folding for a VF of 1? I assume it was trying to tail-fold before due to the low trip count.

dmgreen updated this revision to Diff 518200.Apr 29 2023, 12:31 PM
dmgreen marked 14 inline comments as done.

Rebase and address comments. Thanks for taking a look.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1550–1551

Yeah this is a bit of a weird one nowadays. I had tried to remove it, but I think it makes more sense to keep around for specifying when a scalar epilogue cannot be used.

1557

For 3 it would be both with and without FoldTailByMasking. D145925 adds a way for the target to control the SEL more directly.

5088–5095

This falls though in both cases. In the code below we may find that the the MaxVF is a multiple of the known tripcount, and if so return plans with TailFold=false. Else it will not create TailFold=false plans, just using the TailFold=true.
There are a lot of edge cases.

5101

I don't believe so. The ScalarEpilogueStatus tell us at a high level whether we should be predicating. FoldTailByMasking then controls whether the individual vplans are predicated. So it felt better to keep the original ScalarEpilogueStatus inplace to refer back to, in case we would like to differentiate between CM_ScalarEpilogueAllowed and CM_ScalarEpilogueNotNeededUsePredicate cases.

5493

Not in all cases. Cases that are always predicated (like CM_ScalarEpilogueNotAllowedUsePredicate or CM_ScalarEpilogueNotAllowedLowTripLoop) will have VF=1 with tail folding still. They won't have any unpredicated vplans.

7637

It feels like a separate parameter to me, that would want to be checked separately. I can change it if you think its better but there are places that call hasVF on it's own.

llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
2915 ↗(On Diff #511438)

This was because:

// Don't use a predicated vector body if the user has forced a vectorization
// factor of 1.
  if (UserVF.isScalar())
    SEL = CM_ScalarEpilogueAllowed;

It overrides the "tiny trip count" SEL that was applying previously. I think that for VF=1 it makes a lot sense to not predicated the body, but I've removed that from this patch to keep the diff down. We can re-add it in the future if needed.