This is an archive of the discontinued LLVM Phabricator instance.

Differential D142015

[LV] Plan with and without FoldTailByMasking
Needs Review (Public)

Authored by dmgreen on Jan 18 2023, 6:33 AM.

Details

Reviewers

fhahn
Ayal
SjoerdMeijer
sdesmalen
david-arm
bmahjour

Summary

Currently the loop vectorizer has a single parameter in the CostModel that controls FoldTailByMasking. It is set fairly early and can't be changed later, meaning we need to pick between tail folding and non-tail folding before we have done much cost modelling. This patch aims to change that by removing the single parameter (eventually moving it into the vplan), so that we can have plans both with and without FoldTailByMasking that can be costed against one another, with the best one picked for vectorization.

A lot of the changes are fairly mechanical and were kept as non-disruptive as possible to keep the patch simpler, but there are still a fair number of them. The important parts are:

FoldTailByMasking is removed from CostModel. It now has a MayFoldTailByMasking variable to hold whether FoldTailByMasking VPlans can be created.
A number of maps in the cost model, like InstsToScalarize/Uniforms/Scalars/ForcedScalars, are now keyed on the pair of (FoldTailByMasking, VF), so that they still contain the information required in both the FoldTailByMasking=true and FoldTailByMasking=false cases. The semi-random name VectorizationScheme was used to describe the pair of (FoldTailByMasking, VF). Alternative suggestions welcome.
We then create vplans with both FoldTailByMasking=false and FoldTailByMasking=true if MayFoldTailByMasking is set. Each VPlan then stores whether it is FoldTailByMasking. For VF=1 only non-predicated plans are created.
Tail-folded and non-tail-folded vplans then need to be costed against one another. This patch makes FoldTailByMasking win a tie when the costs are equal, which is more consistent with the vectorization behaviour before this patch.
For the moment, epilogue loops always use FoldTailByMasking=false. This can hopefully be changed in the near future to allow predicated epilogues for unpredicated loop bodies.
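To illustrate the second point above, here is a minimal sketch of a cost-model map keyed on the (FoldTailByMasking, VF) pair instead of on VF alone. It is not the patch's code: it assumes a fixed-width `unsigned` in place of LLVM's `ElementCount`, names instructions by string, and `ScalarsMap`, `markScalar` and `isScalarFor` are hypothetical stand-ins for maps such as Scalars.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <tuple>

// Hypothetical, simplified stand-in for the patch's VectorizationScheme:
// the (FoldTailByMasking, VF) pair. A plain unsigned replaces ElementCount.
struct VectorizationScheme {
  bool FoldTailByMasking;
  unsigned Width;

  // Strict weak ordering so the pair can be used as a std::map key.
  bool operator<(const VectorizationScheme &Other) const {
    return std::tie(FoldTailByMasking, Width) <
           std::tie(Other.FoldTailByMasking, Other.Width);
  }
};

// Hypothetical map: which instructions (named by string here) stay scalar
// for a given scheme. Keying on the pair lets the tail-folded and
// non-tail-folded results for the same VF coexist.
using ScalarsMap = std::map<VectorizationScheme, std::set<std::string>>;

// Records Inst as scalar for scheme VS.
void markScalar(ScalarsMap &Scalars, VectorizationScheme VS,
                const std::string &Inst) {
  Scalars[VS].insert(Inst);
}

// Returns true if Inst was recorded as scalar for exactly this scheme.
bool isScalarFor(const ScalarsMap &Scalars, VectorizationScheme VS,
                 const std::string &Inst) {
  auto It = Scalars.find(VS);
  return It != Scalars.end() && It->second.count(Inst) != 0;
}
```

Because the pair is the key, the tail-folded and non-tail-folded entries for the same VF never overwrite each other. The field is named `Width` so call sites read as `VS.Width`, in line with the renaming nit raised in review.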

Overall this has the effect of allowing us to model and cost tail folded and non tail folded vplans against one another, and should in the future allow us to generate predicated epilogues for unpredicated vector loops.
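A sketch of how candidates from both kinds of plan might be compared, including the tie-break in favour of FoldTailByMasking described above. `Candidate`, `preferOver` and `selectBest` are hypothetical, and each cost is assumed to already be a single comparable number (for example, cost per lane); the real selection logic in LoopVectorize.cpp is more involved.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical candidate: one plan's scheme and its estimated cost
// (lower is better; assumed already comparable across widths).
struct Candidate {
  bool FoldTailByMasking;
  unsigned Width;
  double Cost;
};

// Returns true if A should be preferred over B. On equal cost the
// tail-folded candidate wins, mirroring the tie-break in the summary.
bool preferOver(const Candidate &A, const Candidate &B) {
  if (A.Cost != B.Cost)
    return A.Cost < B.Cost;
  return A.FoldTailByMasking && !B.FoldTailByMasking;
}

// Picks the most profitable candidate, or nothing if the list is empty.
std::optional<Candidate> selectBest(const std::vector<Candidate> &Cands) {
  std::optional<Candidate> Best;
  for (const Candidate &C : Cands)
    if (!Best || preferOver(C, *Best))
      Best = C;
  return Best;
}
```

Flipping the last line of `preferOver` is all it would take to make the non-tail-folded plan win ties instead, which is the alternative discussed later in the review.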

Diff Detail

Event Timeline

dmgreen created this revision. Jan 18 2023, 6:33 AM

Herald added a project: Restricted Project. Jan 18 2023, 6:33 AM

Herald added subscribers: luke, StephenFan, frasercrmck and 23 others.

dmgreen requested review of this revision. Jan 18 2023, 6:33 AM

Herald added subscribers: pcwang-thead, vkmr, MaskRay.

Harbormaster completed remote builds in B208473: Diff 490112. Jan 18 2023, 6:34 AM

Thanks for working on this. General direction looks good to me, this is exactly what we want.

VectorizationScheme was used to describe the pair of (FoldTailByMasking, VF). Alternative suggestions welcome

I was going to suggest PredicationScheme. But the VF is also part of it, so PredicationScheme does not cover it. Perhaps VectorizationScheme is just fine. :)

I have only done a first scan of the patch, but it is big, so I need to go through it again, which I will do soon.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5783	Nit: perhaps rename this and in some other places to VS, so that the VF.Width references below become VS.Width.

This makes a lot of sense to me, so with the nits addressed, this LGTM.
But wait a day in case there are other ideas about this.

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
182	This comment needs to be moved to line no. 224.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
7618	nit: TODO

This revision is now accepted and ready to land. Jan 26 2023, 12:46 AM

Hi @dmgreen, I'm sorry I've not looked into this in more detail, but from the description you gave (which is very detailed - thanks!) I do have one concern about choosing tail-folding by default in a tie. One major problem we have at the moment is that the active lane mask call is effectively free because, unless I'm mistaken, we don't add the cost of this intrinsic to the loop. A simple IV increment (an add instruction) is likely to be cheaper than the codegen for the loop predicate for many targets? However, I do appreciate that targets that can't efficiently generate loop predicates probably also can't do masked loads and stores, so the tail-folded loop cost is likely to be high anyway. I wonder if it's better to be conservative and choose the non-tail-folded version in a tie until we've got a fairer comparison between different vectorisation styles?
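The concern above can be made concrete with a toy per-iteration cost model. All numbers and names (`LoopCosts`, `unpredicatedCost`, `tailFoldedCost`) are hypothetical: both loops pay the same body cost and an induction-variable add, and the tail-folded loop may additionally pay for computing the active lane mask. When that mask is treated as free the two costs tie, so a tie-break toward tail folding picks the predicated loop; once the mask is counted, the unpredicated loop is cheaper.

```cpp
#include <cassert>

// Hypothetical per-iteration costs for one vector iteration.
struct LoopCosts {
  unsigned Body;        // (possibly masked) loads, stores and arithmetic
  unsigned IVIncrement; // the add that advances the induction variable
  unsigned LaneMask;    // generating the active lane mask for the iteration
};

// Unpredicated body: no per-iteration mask is needed.
unsigned unpredicatedCost(const LoopCosts &C) {
  return C.Body + C.IVIncrement;
}

// Tail-folded body: also computes a lane mask every iteration. Passing
// CountLaneMask = false models the situation described in the comment,
// where the mask is effectively treated as free.
unsigned tailFoldedCost(const LoopCosts &C, bool CountLaneMask) {
  return C.Body + C.IVIncrement + (CountLaneMask ? C.LaneMask : 0);
}
```

This deliberately ignores the epilogue the unpredicated loop still needs, which is part of why a fair comparison between the two styles is harder than it looks.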

fhahn mentioned this in D142669: [VPlan] Allow planning with different cost models. Jan 26 2023, 2:12 PM

fhahn mentioned this in D142670: [LV] Allow forcing tail folding when constructing the cost model (WIP). Jan 26 2023, 2:15 PM

I think the overall direction is very desirable! However I am not sure if adding a FoldTail argument to the cost functions is an ideal way to get there as it requires a large number of changes and is not really scalable (e.g. adding support to generate a 3rd variant of plans may mean adding another argument to all functions)

I am not sure if I am missing something, but would it be possible to instead run the planner multiple times with different cost-model configurations? I tried to sketch an option for more general support for that in D142669. Then we could generate the plans for tail-folding by just instantiating another instance of the cost model, sketched in D142670. Maybe that could help to simplify the patch?

Another thing I noticed is that the patch changes skeleton creation to take a VPlan as an argument. I think it would be preferable to do this the other way around if possible, and instead model the difference between plans with and without tail-folding explicitly in the pre-header blocks in the VPlan. Unfortunately I've not yet looked at the details to see how difficult that would be, but I'm more than happy to iterate on that aspect together!

This revision now requires changes to proceed. Jan 26 2023, 3:09 PM

Could you model this approach as a VPlan transformation instead of hardcoding unscalable flags? Transform: add tail folding. Transform: add masked tail folding. Transform: add scalar tail. Then you can use the cost model to decide which is more desirable.
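One way to read this suggestion: build a base plan once, derive each tail-handling variant with a separate transform, and let the cost model pick. The sketch below assumes a trivially simplified `Plan` (a description plus an abstract cost); `Plan`, `PlanTransform` and `pickCheapestVariant` are hypothetical and are not existing VPlan APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, heavily simplified plan: a description plus an abstract cost.
struct Plan {
  std::string Desc;
  unsigned Cost;
};

// A transform takes the base plan and returns one tail-handling variant.
using PlanTransform = std::function<Plan(const Plan &)>;

// Applies every transform to the same base plan and returns the cheapest
// variant. Ties keep the earlier transform in the list.
Plan pickCheapestVariant(const Plan &Base,
                         const std::vector<PlanTransform> &Transforms) {
  assert(!Transforms.empty() && "need at least one way to handle the tail");
  Plan Best = Transforms.front()(Base);
  for (std::size_t I = 1; I < Transforms.size(); ++I) {
    Plan Variant = Transforms[I](Base);
    if (Variant.Cost < Best.Cost)
      Best = Variant;
  }
  return Best;
}
```

Adding another tail-handling strategy then means adding one more transform to the list rather than threading another flag through every cost function, which is the scalability argument being made.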

dmgreen added a child revision: D142875: [LV] Predicated epilog vectorization. Jan 30 2023, 1:18 AM

In D142015#4066159, @SjoerdMeijer wrote:

Thanks for working on this. General direction looks good to me, this is exactly what we want.

VectorizationScheme was used to describe the pair of (FoldTailByMasking, VF). Alternative suggestions welcome

I was going to suggest PredicationScheme. But the VF is also part of it, so PredicationScheme does not cover it. Perhaps VectorizationScheme is just fine. :)

Thanks. There are still some MVE performance issues I need to work through, where it now picks a higher unpredicated trip count over tail predication, and a combination of it being inside a nested loop and low cheap counts makes it unprofitable. There are some good improvements too, but that one is a little too large. I think with epilog vectorization and a few other tricks it can get to the point where it too is an improvement.

In D142015#4082011, @david-arm wrote:

Hi @dmgreen, I'm sorry I've not looked into this in more detail, but from the description you gave (which is very detailed - thanks!) I do have one concern about choosing tail-folding by default in a tie. One major problem we have at the moment is that the active lane mask call is effectively free because, unless I'm mistaken, we don't add the cost of this intrinsic to the loop. A simple IV increment (an add instruction) is likely to be cheaper than the codegen for the loop predicate for many targets? However, I do appreciate that targets that can't efficiently generate loop predicates probably also can't do masked loads and stores, so the tail-folded loop cost is likely to be high anyway. I wonder if it's better to be conservative and choose the non-tail-folded version in a tie until we've got a fairer comparison between different vectorisation styles?

Yep, that's hopefully step 3. Once we have this and epilog vectorization (which should be a fairly simple addition), we can then have the option on SVE of choosing between an unpredicated body + predicated remainder vs a fully predicated body, depending on the target. Targets where the while instructions are not a bottleneck can get the benefits of full predication. I'm not sure yet whether it will just be a hard limit or an added cost. That's the idea at least; there are plenty of details to sort out first and we will have to see how it performs in practice.
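The two loop shapes being compared can be illustrated with a scalar simulation in which an inner loop over `VF` lanes stands in for one vector operation. This only shows the control structure, not the costs; the function names are hypothetical and `VF` is assumed to be non-zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar simulation of two ways to vectorize "A[i] += 1" with VF lanes.
// Each returns the number of vector iterations executed. VF must be > 0.

// (a) Fully predicated body: every iteration checks a lane mask
//     (lane active iff I + Lane < N), so no remainder loop is needed.
std::size_t predicatedBody(std::vector<int> &A, std::size_t VF) {
  std::size_t N = A.size(), Iters = 0;
  for (std::size_t I = 0; I < N; I += VF, ++Iters)
    for (std::size_t Lane = 0; Lane < VF; ++Lane)
      if (I + Lane < N) // per-iteration mask check
        A[I + Lane] += 1;
  return Iters;
}

// (b) Unpredicated body over full vectors only, followed by a single
//     predicated epilogue iteration for the leftover N % VF elements.
std::size_t unpredicatedBodyPredicatedEpilogue(std::vector<int> &A,
                                               std::size_t VF) {
  std::size_t N = A.size(), Iters = 0, I = 0;
  for (; I + VF <= N; I += VF, ++Iters)
    for (std::size_t Lane = 0; Lane < VF; ++Lane)
      A[I + Lane] += 1; // no mask needed for a full vector
  if (I < N) {          // masked remainder
    for (std::size_t Lane = 0; Lane < VF; ++Lane)
      if (I + Lane < N)
        A[I + Lane] += 1;
    ++Iters;
  }
  return Iters;
}
```

For N = 10 and VF = 4 both versions run three vector iterations and produce the same result, but the second only pays for the mask check in its final iteration.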

In D142015#4084191, @fhahn wrote:

I think the overall direction is very desirable! However I am not sure if adding a FoldTail argument to the cost functions is an ideal way to get there as it requires a large number of changes and is not really scalable (e.g. adding support to generate a 3rd variant of plans may mean adding another argument to all functions)

I am not sure if I am missing something, but would it be possible to instead run the planner multiple times with different cost-model configurations? I tried to sketch an option for more general support for that in D142669. Then we could generate the plans for tail-folding by just instantiating another instance of the cost model, sketched in D142670. Maybe that could help to simplify the patch?

Multiple cost models was something I considered, but dismissed fairly quickly. I didn't think it would be something that anyone would agree to in a review: that we invent a new hierarchy of multiple cost-models, each containing multiple vplans. The vplans should be flatter than that, and there is more in the cost-model that isn't dependent on FoldTail than is. I think we should be trying to reduce the number of cost-models, not increase it! The same argument about new variants could equally apply, just leading to an explosion in the number of cost-models needed.

This patch only pushed the FoldTailByMasking to the places that need it, just treating FoldTailByMasking like the VF already is. How married are you to the idea of multiple cost models? (Or how anti this are you?) I know this patch is fairly large but it feels like a cleaner end result once it is done. Let me know and I can try and get everything working the other way if necessary.

Another thing I noticed is that the patch changes skeleton creation to take a VPlan as an argument. I think it would be preferable to do this the other way around if possible, and instead model the difference between plans with and without tail-folding explicitly in the pre-header blocks in the VPlan. Unfortunately I've not yet looked at the details to see how difficult that would be, but I'm more than happy to iterate on that aspect together!

I was hoping for one step at a time. Firstly we can break the dependency on specifying FoldTailByMasking early in the costmodel, whilst keeping the number of other changes needed minimal. Bigger changes can come later, but might require more changes. (It is useful, for example, to know that a plan is FoldTailByMasking when picking epilog plans.)

Matt added a subscriber: Matt. Jan 30 2023, 3:01 PM

syzaara added a subscriber: syzaara. Jan 31 2023, 9:41 AM

In D142015#4090401, @dmgreen wrote:

In D142015#4084191, @fhahn wrote:

I think the overall direction is very desirable! However I am not sure if adding a FoldTail argument to the cost functions is an ideal way to get there as it requires a large number of changes and is not really scalable (e.g. adding support to generate a 3rd variant of plans may mean adding another argument to all functions)

I am not sure if I am missing something, but would it be possible to instead run the planner multiple times with different cost-model configurations? I tried to sketch an option for more general support for that in D142669. Then we could generate the plans for tail-folding by just instantiating another instance of the cost model, sketched in D142670. Maybe that could help to simplify the patch?

Multiple cost models was something I considered, but dismissed fairly quickly. I didn't think it would be something that anyone would agree to in a review: that we invent a new hierarchy of multiple cost-models, each containing multiple vplans. The vplans should be flatter than that, and there is more in the cost-model that isn't dependent on FoldTail than is. I think we should be trying to reduce the number of cost-models, not increase it! The same argument about new variants could equally apply, just leading to an explosion in the number of cost-models needed.

IIUC we in practice already have multiple cost models (e.g. tail folded vs regular), but at the moment we effectively pick which one to run very early. So I don't think this adds a new hierarchy, but I might be missing some aspects you are thinking about here?

I agree that we shouldn't have new hierarchies containing multiple plans. IIUC this is referring to D142669, where it selects the list of plans for the most profitable VF returned by plan. This was mostly to keep the sketch of supporting multiple cost models simple (in terms of time to implement). I *think* this could be improved by computing the cost for each plan directly, instead of doing so after constructing all plans in selectVectorizationFactor. I'll try to see if there are any stumbling blocks down this path.

I was also hoping to untangle some of the cost model functionality that computes the max VF, but unfortunately things are very tied together and it is not really feasible to split those parts off from computing costs per VF.

This patch only pushed the FoldTailByMasking to the places that need it, just treating FoldTailByMasking like the VF already is. How married are you to the idea of multiple cost models? (Or how anti this are you?) I know this patch is fairly large but it feels like a cleaner end result once it is done. Let me know and I can try and get everything working the other way if necessary.

The changes are mostly mechanical, but it makes the mappings in the cost model more complicated, and adding the additional argument potentially increases maintenance cost and makes the signatures slightly more complex. I just want to make sure we have considered the potential tradeoffs/alternatives if we go down that route.

Another thing I noticed is that the patch changes skeleton creation to take a VPlan as an argument. I think it would be preferable to do this the other way around if possible, and instead model the difference between plans with and without tail-folding explicitly in the pre-header blocks in the VPlan. Unfortunately I've not yet looked at the details to see how difficult that would be, but I'm more than happy to iterate on that aspect together!

I was hoping for one step at a time. Firstly we can break the dependency on specifying FoldTailByMasking early in the costmodel, whilst keeping the number of other changes needed minimal. Bigger changes can come later, but might require more changes. (It is useful, for example, to know that a plan is FoldTailByMasking when picking epilog plans.)

Agreed, I think there are existing places where it would be convenient to know if the plan is FoldTailByMasking. One example is adjustRecipesForReductions, where this is the main thing needed to remove the reliance on the cost-model there.

fhahn mentioned this in D143938: [VPlan] Compute costs for plans directly after construction. Feb 13 2023, 12:10 PM

In D142015#4114045, @fhahn wrote:

I agree that we shouldn't have new hierarchies containing multiple plans. IIUC this is referring to D142669, where it selects the list of plans for the most profitable VF returned by plan. This was mostly to keep the sketch of supporting multiple cost models simple (in terms of time to implement). I *think* this could be improved by computing the cost for each plan directly, instead of doing so after constructing all plans in selectVectorizationFactor. I'll try to see if there are any stumbling blocks down this path.

Sketched in D143938 and updated D142669.

This is just a rebase of the existing patch - I think not much has changed other than adjustments needed for the rebase.

Harbormaster completed remote builds in B221918: Diff 508526.Mar 27 2023, 1:49 AM

OK, this now attempts to use the scheme @fhahn suggested, where there are multiple cost models that get created with TailFold=true and TailFold=false. It should hopefully work for planning with and without tail folding (this patch), plus the predicated epilog vectorization in the followup.

Some of what was in CostModel has moved into the Planner. This patch adds a reference from the vplan to the costmodel to make sure the correct model is used with the correct plan. Hopefully that can be reduced and removed eventually. It does manage to remove the VFCandidates array which is a nice little simplification.
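A minimal sketch of the shape described here: one cost model per tail-folding configuration, with each plan holding a reference back to the model that built it so it is always costed by the matching model. `CostModel`, `Plan`, `buildPlans` and the placeholder cost formula are all hypothetical simplifications, not the patch's classes.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical, simplified cost model configured once with a TailFold flag.
class CostModel {
public:
  explicit CostModel(bool FoldTailByMasking) : FoldTail(FoldTailByMasking) {}
  bool foldTailByMasking() const { return FoldTail; }
  // Placeholder cost only: wider VFs are cheaper, masking adds a fixed
  // overhead per vector iteration.
  unsigned costFor(unsigned VF) const { return 8 / VF + (FoldTail ? 1 : 0); }

private:
  bool FoldTail;
};

// Each plan keeps a reference to the cost model it was built with, so it
// can only be costed by the matching model.
struct Plan {
  const CostModel &CM;
  unsigned VF;
  unsigned cost() const { return CM.costFor(VF); }
};

// Builds plans for VF = 2, 4, 8 under every cost-model configuration.
std::vector<Plan>
buildPlans(const std::vector<std::unique_ptr<CostModel>> &Models) {
  std::vector<Plan> Plans;
  for (const auto &CM : Models)
    for (unsigned VF = 2; VF <= 8; VF *= 2)
      Plans.push_back(Plan{*CM, VF});
  return Plans;
}
```

The models are held by `std::unique_ptr` so that the references stored in the plans stay valid even if the vector of models reallocates.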

Harbormaster completed remote builds in B221934: Diff 508547.Mar 27 2023, 2:56 AM

Hi @dmgreen, I'm starting to review this patch again, but it's quite large so it might take a while. Just a quick skim through so far I thought of a few ideas about some NFC patches that could be useful in their own right and might help to reduce the diff, if you agree?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5317–5318	Some changes like this don't really need to be in this patch, right?
5321	Can't we just pass in `TheFunction` instead?
5382	Again, I'm a bit concerned about the effect this will have when the active lane mask call is not costed into the loop. I'd prefer for now to be conservative and opt for the non-predicated scheme until we've accounted for the IV costs.
5784	Again, it looks like the change in prototype doesn't need to be part of this patch and might be a useful tidy-up NFC patch.
llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge.ll
973	These debug output changes look useful by themselves outside of this patch - not sure if it's possible to pass in the `FoldTailByMasking` flag in a separate patch?

In D142015#4223465, @dmgreen wrote:

OK, this now attempts to use the scheme @fhahn suggested, where there are multiple cost models that get created with TailFold=true and TailFold=false. It should hopefully work for planning with and without tail folding (this patch), plus the predicated epilog vectorization in the followup.

Some of what was in CostModel has moved into the Planner. This patch adds a reference from the vplan to the costmodel to make sure the correct model is used with the correct plan. Hopefully that can be reduced and removed eventually. It does manage to remove the VFCandidates array which is a nice little simplification.

I updated D143938 and D142669 recently and I think they should be in good shape now and should be ready for review. Rebasing the patch on top of them should hopefully simplify the diff. There are a few additional dependencies to remove a few more places that rely on the cost model after construction.

I put all patches on a branch, if that helps: https://github.com/fhahn/llvm-project/tree/vplan-cost-upfront

dmgreen added inline comments. Apr 6 2023, 9:30 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5317–5318	Yeah sure, I can give that a try. To be honest, it seems a bit odd on its own, like a patch that makes things worse by itself and that we change just so that we can change it again later, creating needless busywork. But I have for the moment moved it out of this patch.
5321	I think I chose Loop because the LoopVectorizationPlanner doesn't store the Function, so it would need to pass L->getHeader()->getParent() as an argument, and using Loop simplified the interface a little. Happy to change it though.
5351	I also have pulled this part out into D147720, as a functional change that can be done separately.
5382	For MVE in the past, when we returned true for preferPredicateOverEpilogue, we would get a tail-predicated loop (or not vectorize). Now that we get both tail-folded and non-tail-folded loops to cost against one another, we need to pick the tail-folded version on a tie to not be worse than before. The scores between the two are often the same, so to stay closer to the old codegen the conservative option is to choose FoldTail on a tie. I believe that the only target that currently returns true from preferPredicateOverEpilogue is MVE. SVE can be adjusted later if needed, but my current thinking was to use UsePredicatedEpilogue from D145925 as a first step to get the benefits of epilog vectorization whilst hopefully not messing anything else up. That was the plan for how to treat SVE conservatively, and we can expand things if needed in the future. It doesn't really make a lot of sense to add up disparate throughput costs and expect them to mean anything, but we can perhaps come up with something if we need to, and if not we can always just force an unpredicated body + predicated epilog. Let me know what you think.
5784	Thanks, yeah, that is a good idea. I believe I can actually just remove this from the current version of the patch.
llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge.ll
973	Hmm. It would involve pulling VPlan->FoldTailByMasking out into a separate patch, and that feels like the core of this patch, to be honest. Pulling it out just for some debug messages that are usually present elsewhere feels like a bit of an odd patch on its own. When you look at the whole debug output, there are already parts explaining whether the CostModel is FoldTailByMasking.

dmgreen updated this revision to Diff 511438. Apr 6 2023, 9:31 AM

Harbormaster completed remote builds in B224041: Diff 511438. Apr 6 2023, 9:32 AM

Hi @dmgreen, this patch looks a lot smaller and tidier than last time - thanks a lot! I've just got a few more comments.

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
192–195	Perhaps this comment should now be something like: /// Cost of the loop with that width and vectorization style. What do you think?
301–302	I think the comment needs updating now because it no longer returns a VF.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1550–1551	This is not your fault, but whilst you are here could you update the comment to reflect the function behaviour? It looks like the comment probably got out of sync with the function at some point. :) Perhaps something like:
/// Returns true if a scalar epilogue is allowed. It may return false if:
/// 1. We are optimising for code size
/// 2. There is a loop hint annotation
/// etc.
1557	Do you know the scenarios in which we aren't folding the tail by masking, but a scalar epilogue is not needed? I think CM_ScalarEpilogueNotNeededUsePredicate is only set for these cases: 1) the user has supplied a hint; 2) the user has set the prefer-predicate-over-epilogue flag; 3) the TTI hook preferPredicateOverEpilogue has returned true. But I'd expect that for at least 3) we've set FoldTailByMasking to true?
5088–5095	In this case it looks like we're going to create two sets of plans - one with tail-folding and one without - and in both cases we're actually going to tail-fold anyway. Is that right? Not saying we shouldn't do that, but just trying to understand how this works.
5101	Don't we have to also set `ScalarEpilogueStatus = CM_ScalarEpilogueAllowed` here?
5493	This should always be false for VF=1, right?
7577	Perhaps worth adding a `/* ... */` comment for the new argument too? Same for any other similar places in the file.
7637	Is it worth changing `hasVF` to take the `FoldTailByMasking` as a second argument so you can just write:
assert(count_if(VPlans,
                [VF, FoldTailByMasking](const VPlanPtr &Plan) {
                  return Plan->hasVF(VF, FoldTailByMasking);
                }) == 1 &&
       "Best VF has not a single VPlan.");
8710	Wouldn't it be simpler to just write: for (ElementCount VF = MinVF; ElementCount::isKnownLE(VF, MaxVF);
llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll
17	This seems odd. Perhaps I'm mistaken, but I thought with your patch we wouldn't decide to tail-fold with a VF of 1?
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
805	I was confused at first, but actually this looks like an improvement - nice! I think before we were still using an interleave count of 1 as if we were tail-folding, but now we've fallen back on the non-tail-folding plan that uses interleaving.
llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
2915 ↗	(On Diff #511438)	For what it's worth this new version looks better, but do you know why? Is it because we no longer allow tail-folding for a VF of 1? I assume it was trying to tail-fold before due to the low trip count.
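The assertion suggested in the 7637 comment can be sketched with a minimal, hypothetical `Plan` type. Here `hasVF` keeps its single-argument form and the tail-folding flag is compared separately inside the predicate, which checks the same condition as the proposed two-argument `hasVF`. `Plan`, `PlanPtr` and `getBestPlanFor` are illustrative names, not LLVM APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <set>
#include <vector>

// Hypothetical stand-in for VPlan: the VFs it covers plus whether it
// folds the tail by masking.
struct Plan {
  std::set<unsigned> VFs;
  bool FoldTailByMasking;
  bool hasVF(unsigned VF) const { return VFs.count(VF) != 0; }
};
using PlanPtr = std::unique_ptr<Plan>;

// Returns the unique plan matching both VF and FoldTailByMasking,
// asserting that exactly one such plan exists.
const Plan &getBestPlanFor(const std::vector<PlanPtr> &Plans, unsigned VF,
                           bool FoldTailByMasking) {
  auto Matches = [VF, FoldTailByMasking](const PlanPtr &P) {
    return P->hasVF(VF) && P->FoldTailByMasking == FoldTailByMasking;
  };
  assert(std::count_if(Plans.begin(), Plans.end(), Matches) == 1 &&
         "Best VF has not a single VPlan.");
  return **std::find_if(Plans.begin(), Plans.end(), Matches);
}
```

With plans for both tail-folding settings covering the same VFs, matching on VF alone would find two plans; the extra flag comparison is what makes the match unique.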

dmgreen added a parent revision: D147720: [LV] Use the known trip count when costing non-tail folded VFs. Apr 24 2023, 2:31 AM

Rebase and address comments. Thanks for taking a look.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1550–1551	Yeah this is a bit of a weird one nowadays. I had tried to remove it, but I think it makes more sense to keep around for specifying when a scalar epilogue cannot be used.
1557	For 3 it would be both with and without FoldTailByMasking. D145925 adds a way for the target to control the SEL more directly.
5088–5095	This falls through in both cases. In the code below we may find that the MaxVF is a multiple of the known trip count, and if so return plans with TailFold=false. Otherwise it will not create TailFold=false plans, just using TailFold=true. There are a lot of edge cases.
5101	I don't believe so. The ScalarEpilogueStatus tells us at a high level whether we should be predicating; FoldTailByMasking then controls whether the individual vplans are predicated. So it felt better to keep the original ScalarEpilogueStatus in place to refer back to, in case we would like to differentiate between the CM_ScalarEpilogueAllowed and CM_ScalarEpilogueNotNeededUsePredicate cases.
5493	Not in all cases. Cases that are always predicated (like CM_ScalarEpilogueNotAllowedUsePredicate or CM_ScalarEpilogueNotAllowedLowTripLoop) will have VF=1 with tail folding still. They won't have any unpredicated vplans.
7637	It feels like a separate parameter to me, one that would want to be checked separately. I can change it if you think it's better, but there are places that call hasVF on its own.
llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll
2915 ↗	(On Diff #511438)	This was because of:
// Don't use a predicated vector body if the user has forced a vectorization
// factor of 1.
if (UserVF.isScalar())
  SEL = CM_ScalarEpilogueAllowed;
It overrides the "tiny trip count" SEL that was applying previously. I think that for VF=1 it makes a lot of sense not to predicate the body, but I've removed that from this patch to keep the diff down. We can re-add it in the future if needed.
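The answers above about which plans get built can be summarized in a small, hypothetical sketch. It mirrors the `ScalarEpilogueLowering` value names used in the discussion but is a simplification of what the patch does: it ignores the known-trip-count special cases, assumes `CM_ScalarEpilogueAllowed` never builds tail-folded plans, and `plannedFoldTailVariants` is not a real LLVM function.

```cpp
#include <cassert>
#include <vector>

// Re-declared here so the sketch is self-contained; the value names mirror
// the ScalarEpilogueLowering values mentioned in the discussion.
enum ScalarEpilogueLowering {
  CM_ScalarEpilogueAllowed,
  CM_ScalarEpilogueNotAllowedOptSize,
  CM_ScalarEpilogueNotAllowedLowTripLoop,
  CM_ScalarEpilogueNotNeededUsePredicate,
  CM_ScalarEpilogueNotAllowedUsePredicate
};

// Hypothetical, simplified: which FoldTailByMasking values get plans for a
// given status and VF.
std::vector<bool> plannedFoldTailVariants(ScalarEpilogueLowering SEL,
                                          unsigned VF) {
  switch (SEL) {
  case CM_ScalarEpilogueAllowed:
    // Assumption for this sketch: no tail-folded plans at all.
    return {false};
  case CM_ScalarEpilogueNotNeededUsePredicate:
    // Predication is preferred but optional, so both are costed, except
    // that VF=1 only gets a non-predicated plan.
    if (VF == 1)
      return {false};
    return {false, true};
  default:
    // A scalar epilogue is not allowed: every plan, VF=1 included, is
    // tail-folded and there are no unpredicated plans.
    return {true};
  }
}
```

The default branch corresponds to the always-predicated cases from the 5493 reply, such as CM_ScalarEpilogueNotAllowedUsePredicate and CM_ScalarEpilogueNotAllowedLowTripLoop.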

Harbormaster completed remote builds in B229059: Diff 518200. Apr 29 2023, 12:32 PM

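Revision Contents

Path | Size
llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h | 15 lines
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp | 11 lines
llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h | 82 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | 614 lines
llvm/lib/Transforms/Vectorize/VPlan.h | 19 lines
llvm/lib/Transforms/Vectorize/VPlan.cpp | 2 lines
llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll | 2 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll | 2 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll | 34 lines
llvm/test/Transforms/LoopVectorize/AArch64/tail-folding-styles.ll | 43 lines
llvm/test/Transforms/LoopVectorize/ARM/mve-known-trip-count.ll | 6 lines
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll | 2 lines
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll | 2 lines
llvm/test/Transforms/LoopVectorize/PowerPC/reg-usage.ll | 10 lines
llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll | 4 lines
llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll | 12 lines
llvm/test/Transforms/LoopVectorize/icmp-uniforms.ll | 2 lines
llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge.ll | 22 lines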

Diff 518200

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

@@ (369 lines not shown) @@ public:
   uint64_t getMaxSafeVectorWidthInBits() const {
     return LAI->getDepChecker().getMaxSafeVectorWidthInBits();
   }

   bool hasStride(Value *V) { return LAI->hasStride(V); }

   /// Returns true if vector representation of the instruction \p I
   /// requires mask.
-  bool isMaskRequired(const Instruction *I) const {
-    return MaskedOp.contains(I);
+  bool isMaskRequired(bool FoldTailByMasking, const Instruction *I) const {
+    return MaskedOp.contains(I) ||
+           (FoldTailByMasking && FoldTailMaskedOp.contains(I));
   }

   unsigned getNumStores() const { return LAI->getNumStores(); }
   unsigned getNumLoads() const { return LAI->getNumLoads(); }

   /// Returns all assume calls in predicated blocks. They need to be dropped
   /// when flattening the CFG.
-  const SmallPtrSetImpl<Instruction *> &getConditionalAssumes() const {
-    return ConditionalAssumes;
+  const SmallPtrSetImpl<Instruction *> &
+  getConditionalAssumes(bool FoldTailByMasking) const {
+    return FoldTailByMasking ? FoldTailConditionalAssumes : ConditionalAssumes;
   }

   PredicatedScalarEvolution *getPredicatedScalarEvolution() const {
     return &PSE;
   }

   Loop *getLoop() const { return TheLoop; }

@@ (143 lines not shown) @@ private:
   /// While vectorizing these instructions we have to generate a
   /// call to the appropriate masked intrinsic
   SmallPtrSet<const Instruction *, 8> MaskedOp;

   /// Assume instructions in predicated blocks must be dropped if the CFG gets
   /// flattened.
   SmallPtrSet<Instruction *, 8> ConditionalAssumes;

+  /// Same as MaskedOp above when folding tail by masking.
+  SmallPtrSet<const Instruction *, 8> FoldTailMaskedOp;
+  /// Same as ConditionalAssumes above when folding tail by masking.
+  SmallPtrSet<Instruction *, 8> FoldTailConditionalAssumes;
+
   /// BFI and PSI are used to check for profile guided size optimizations.
   BlockFrequencyInfo *BFI;
   ProfileSummaryInfo *PSI;
 };

 } // namespace llvm

 #endif // LLVM_TRANSFORMS_VECTORIZE_LOOPVECTORIZATIONLEGALITY_H
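The effect of the `isMaskRequired` change can be shown with a reduced model: masks needed in every plan stay in `MaskedOp`, while masks only needed under tail folding go in `FoldTailMaskedOp` and are only reported when the caller asks about a tail-folded plan. `LegalitySketch` and the string instruction names are hypothetical; the real class stores `const Instruction *` in `SmallPtrSet`s.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical reduction of the legality bookkeeping in the hunk above:
// instructions are named by string instead of Instruction pointers.
struct LegalitySketch {
  // Ops that need a mask in every plan (e.g. inside conditional blocks).
  std::set<std::string> MaskedOp;
  // Ops that additionally need a mask only when the tail is folded.
  std::set<std::string> FoldTailMaskedOp;

  bool isMaskRequired(bool FoldTailByMasking, const std::string &I) const {
    return MaskedOp.count(I) != 0 ||
           (FoldTailByMasking && FoldTailMaskedOp.count(I) != 0);
  }
};
```

Keeping the two sets apart, rather than merging the tail-folding masks into `MaskedOp`, is what lets a non-tail-folded plan ask the same question and get a different answer.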

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

Show First 20 Lines • Show All 1,424 Lines • ▼ Show 20 Lines	for (User *U : AE->users()) {
<< *UI << "\n");		<< *UI << "\n");
return false;		return false;
}		}
}		}

// The list of pointers that we can safely read and write to remains empty.		// The list of pointers that we can safely read and write to remains empty.
SmallPtrSet<Value *, 8> SafePointers;		SmallPtrSet<Value *, 8> SafePointers;

SmallPtrSet<const Instruction *, 8> TmpMaskedOp;
SmallPtrSet<Instruction *, 8> TmpConditionalAssumes;

// Check and mark all blocks for predication, including those that ordinarily		// Check and mark all blocks for predication, including those that ordinarily
// do not need predication such as the header block.		// do not need predication such as the header block.
for (BasicBlock *BB : TheLoop->blocks()) {		for (BasicBlock *BB : TheLoop->blocks()) {
if (!blockCanBePredicated(BB, SafePointers, TmpMaskedOp,		if (!blockCanBePredicated(BB, SafePointers, FoldTailMaskedOp,
TmpConditionalAssumes)) {		FoldTailConditionalAssumes)) {
LLVM_DEBUG(dbgs() << "LV: Cannot fold tail by masking as requested.\n");		LLVM_DEBUG(dbgs() << "LV: Cannot fold tail by masking as requested.\n");
return false;		return false;
}		}
}		}

LLVM_DEBUG(dbgs() << "LV: can fold tail by masking.\n");		LLVM_DEBUG(dbgs() << "LV: can fold tail by masking.\n");

MaskedOp.insert(TmpMaskedOp.begin(), TmpMaskedOp.end());
ConditionalAssumes.insert(TmpConditionalAssumes.begin(),
TmpConditionalAssumes.end());

return true;		return true;
}		}

} // namespace llvm		} // namespace llvm
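The old code collected into TmpMaskedOp/TmpConditionalAssumes and merged them into MaskedOp and ConditionalAssumes only on success; the patched version writes straight into the tail-folding sets and leaves the ordinary ones alone. A rough sketch of that flow, assuming a hypothetical BlockSketch type in place of BasicBlock plus blockCanBePredicated, and tracking only the masked-op set (conditional assumes would follow the same path):

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a loop block: whether it can be predicated, and
// which of its operations would need a masked intrinsic if it is.
struct BlockSketch {
  bool CanBePredicated;
  std::vector<std::string> OpsNeedingMask;
};

// Sketch of the patched canFoldTailByMasking flow: every block is checked,
// and ops needing a mask go into the set reserved for tail-folded plans.
// The set used by non-tail-folded plans is never touched here.
bool canFoldTailByMaskingSketch(const std::vector<BlockSketch> &Blocks,
                                std::set<std::string> &FoldTailMaskedOp) {
  for (const BlockSketch &BB : Blocks) {
    if (!BB.CanBePredicated)
      return false; // Entries from earlier blocks remain in FoldTailMaskedOp.
    FoldTailMaskedOp.insert(BB.OpsNeedingMask.begin(),
                            BB.OpsNeedingMask.end());
  }
  return true;
}
```

One visible difference from the old temporaries: on failure the tail-folding set can be left partially filled, which is presumably harmless as long as it is only consulted for plans that actually fold the tail.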

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

Show First 20 Lines • Show All 173 Lines • ▼ Show 20 Lines	public:

InsertPointGuard(const InsertPointGuard &) = delete;		InsertPointGuard(const InsertPointGuard &) = delete;
InsertPointGuard &operator=(const InsertPointGuard &) = delete;		InsertPointGuard &operator=(const InsertPointGuard &) = delete;

~InsertPointGuard() { Builder.restoreIP(VPInsertPoint(Block, Point)); }		~InsertPointGuard() { Builder.restoreIP(VPInsertPoint(Block, Point)); }
};		};
};		};

/// TODO: The following VectorizationFactor was pulled out of		/// TODO: The following VectorizationFactor was pulled out of
		SjoerdMeijer (inline comment, not done): This comment needs to be moved to line no. 224.
/// LoopVectorizationCostModel class. LV also deals with		/// LoopVectorizationCostModel class. LV also deals with
/// VectorizerParams::VectorizationFactor and VectorizationCostTy.		/// VectorizerParams::VectorizationFactor and VectorizationCostTy.
/// We need to streamline them.		/// We need to streamline them.

/// Information about vectorization costs.		/// Information about vectorization costs.
struct VectorizationFactor {		struct VectorizationFactor {
/// Vector width with best cost.		/// Vector width with best cost.
ElementCount Width;		ElementCount Width;

/// Cost of the loop with that width.		/// Whether the entire loop is predicated.
		bool FoldTailByMasking;

		/// Cost of the loop with that width and vectorization style.
		david-arm (inline comment, done): Perhaps this comment should now be something like: /// Cost of the loop with that width and vectorization style. What do you think?
InstructionCost Cost;		InstructionCost Cost;

/// Cost of the scalar loop.		/// Cost of the scalar loop.
InstructionCost ScalarCost;		InstructionCost ScalarCost;

/// The minimum trip count required to make vectorization profitable, e.g. due		/// The minimum trip count required to make vectorization profitable, e.g. due
/// to runtime checks.		/// to runtime checks.
ElementCount MinProfitableTripCount;		ElementCount MinProfitableTripCount;

VectorizationFactor(ElementCount Width, InstructionCost Cost,		VectorizationFactor(ElementCount Width, bool FoldTailByMasking,
InstructionCost ScalarCost)		InstructionCost Cost, InstructionCost ScalarCost)
: Width(Width), Cost(Cost), ScalarCost(ScalarCost) {}		: Width(Width), FoldTailByMasking(FoldTailByMasking), Cost(Cost),
		ScalarCost(ScalarCost) {}

/// Width 1 means no vectorization, cost 0 means uncomputed cost.		/// Width 1 means no vectorization, cost 0 means uncomputed cost.
static VectorizationFactor Disabled() {		static VectorizationFactor Disabled() {
return {ElementCount::getFixed(1), 0, 0};		return {ElementCount::getFixed(1), false, 0, 0};
}		}

bool operator==(const VectorizationFactor &rhs) const {		bool operator==(const VectorizationFactor &rhs) const {
return Width == rhs.Width && Cost == rhs.Cost;		return Width == rhs.Width && FoldTailByMasking == rhs.FoldTailByMasking &&
		Cost == rhs.Cost;
}		}

bool operator!=(const VectorizationFactor &rhs) const {		bool operator!=(const VectorizationFactor &rhs) const {
return !(*this == rhs);		return !(*this == rhs);
}		}
};		};

/// A class that represents two vectorization factors (initialized with 0 by		/// A class that represents two vectorization factors (initialized with 0 by
Show All 40 Lines	class LoopVectorizationPlanner {
const TargetLibraryInfo *TLI;		const TargetLibraryInfo *TLI;

/// Target Transform Info.		/// Target Transform Info.
const TargetTransformInfo *TTI;		const TargetTransformInfo *TTI;

/// The legality analysis.		/// The legality analysis.
LoopVectorizationLegality *Legal;		LoopVectorizationLegality *Legal;

/// The profitability analysis.
LoopVectorizationCostModel &CM;

/// The interleaved access analysis.		/// The interleaved access analysis.
InterleavedAccessInfo &IAI;		InterleavedAccessInfo &IAI;

PredicatedScalarEvolution &PSE;		PredicatedScalarEvolution &PSE;

const LoopVectorizeHints &Hints;		const LoopVectorizeHints &Hints;

OptimizationRemarkEmitter *ORE;		OptimizationRemarkEmitter *ORE;

SmallVector<VPlanPtr, 4> VPlans;		SmallVector<VPlanPtr, 4> VPlans;

/// A builder used to construct the current plan.		/// A builder used to construct the current plan.
VPBuilder Builder;		VPBuilder Builder;

		/// Profitable vector factors.
		SmallVector<VectorizationFactor, 8> ProfitableVFs;

public:		public:
LoopVectorizationPlanner(Loop *L, LoopInfo *LI, const TargetLibraryInfo *TLI,		LoopVectorizationPlanner(Loop *L, LoopInfo *LI, const TargetLibraryInfo *TLI,
const TargetTransformInfo *TTI,		const TargetTransformInfo *TTI,
LoopVectorizationLegality *Legal,		LoopVectorizationLegality *Legal,
LoopVectorizationCostModel &CM,
InterleavedAccessInfo &IAI,		InterleavedAccessInfo &IAI,
PredicatedScalarEvolution &PSE,		PredicatedScalarEvolution &PSE,
const LoopVectorizeHints &Hints,		const LoopVectorizeHints &Hints,
OptimizationRemarkEmitter *ORE)		OptimizationRemarkEmitter *ORE)
: OrigLoop(L), LI(LI), TLI(TLI), TTI(TTI), Legal(Legal), CM(CM), IAI(IAI),		: OrigLoop(L), LI(LI), TLI(TLI), TTI(TTI), Legal(Legal), IAI(IAI),
PSE(PSE), Hints(Hints), ORE(ORE) {}		PSE(PSE), Hints(Hints), ORE(ORE) {}

/// Plan how to best vectorize, return the best VF and its cost, or		/// Plan how to best vectorize with a given cost model.
		david-arm (inline comment, done): I think the comment needs updating now because it no longer returns a VF.
/// std::nullopt if vectorization and interleaving should be avoided up front.		void plan(LoopVectorizationCostModel &CM, ElementCount UserVF,
std::optional<VectorizationFactor> plan(ElementCount UserVF, unsigned UserIC);		unsigned UserIC);

		/// \return The most profitable vectorization factor and the cost of that VF.
		/// This method checks every VF in the plans in \p VPlans. If UserVF is not
		/// ZERO then this vectorization factor will be selected if vectorization is
		/// possible.
		std::optional<VectorizationFactor> selectVectorizationFactor();

		VectorizationFactor
		selectEpilogueVectorizationFactor(const VectorizationFactor &MainVF);

/// Use the VPlan-native path to plan how to best vectorize, return the best		/// Use the VPlan-native path to plan how to best vectorize, return the best
/// VF and its cost.		/// VF and its cost.
VectorizationFactor planInVPlanNativePath(ElementCount UserVF);		VectorizationFactor planInVPlanNativePath(LoopVectorizationCostModel &CM,
		ElementCount UserVF);

/// Return the best VPlan for \p VF.		/// Return the best VPlan for \p VF.
VPlan &getBestPlanFor(ElementCount VF) const;		VPlan &getBestPlanFor(ElementCount VF, bool FoldTailByMasking) const;

/// Generate the IR code for the body of the vectorized loop according to the		/// Generate the IR code for the body of the vectorized loop according to the
/// best selected \p VF, \p UF and VPlan \p BestPlan.		/// best selected \p VF, \p UF and VPlan \p BestPlan.
/// TODO: \p IsEpilogueVectorization is needed to avoid issues due to epilogue		/// TODO: \p IsEpilogueVectorization is needed to avoid issues due to epilogue
/// vectorization re-using plans for both the main and epilogue vector loops.		/// vectorization re-using plans for both the main and epilogue vector loops.
/// It should be removed once the re-use issue has been fixed.		/// It should be removed once the re-use issue has been fixed.
void executePlan(ElementCount VF, unsigned UF, VPlan &BestPlan,		void executePlan(ElementCount VF, unsigned UF, VPlan &BestPlan,
InnerLoopVectorizer &LB, DominatorTree *DT,		InnerLoopVectorizer &LB, DominatorTree *DT,
bool IsEpilogueVectorization);		bool IsEpilogueVectorization);

#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)		#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
void printPlans(raw_ostream &O);		void printPlans(raw_ostream &O);
#endif		#endif

/// Look through the existing plans and return true if we have one with all		/// Look through the existing plans and return true if we have one with all
/// the vectorization factors in question.		/// the vectorization factors in question.
bool hasPlanWithVF(ElementCount VF) const {		bool hasPlanWithVF(ElementCount VF, bool FoldTailByMasking) const {
return any_of(VPlans,		return any_of(VPlans, [&](const VPlanPtr &Plan) {
[&](const VPlanPtr &Plan) { return Plan->hasVF(VF); });		return Plan->hasVF(VF) && Plan->foldTailByMasking() == FoldTailByMasking;
		});
}		}

/// Test a \p Predicate on a \p Range of VF's. Return the value of applying		/// Test a \p Predicate on a \p Range of VF's. Return the value of applying
/// \p Predicate on Range.Start, possibly decreasing Range.End such that the		/// \p Predicate on Range.Start, possibly decreasing Range.End such that the
/// returned value holds for the entire \p Range.		/// returned value holds for the entire \p Range.
static bool		static bool
getDecisionAndClampRange(const std::function<bool(ElementCount)> &Predicate,		getDecisionAndClampRange(const std::function<bool(ElementCount)> &Predicate,
VFRange &Range);		VFRange &Range);

/// Check if the number of runtime checks exceeds the threshold.		/// Check if the number of runtime checks exceeds the threshold.
bool requiresTooManyRuntimeChecks() const;		bool requiresTooManyRuntimeChecks() const;

protected:		protected:
/// Build VPlans for power-of-2 VF's between \p MinVF and \p MaxVF inclusive,		/// Build VPlans for power-of-2 VF's between \p MinVF and \p MaxVF inclusive,
/// according to the information gathered by Legal when it checked if it is		/// according to the information gathered by Legal when it checked if it is
/// legal to vectorize the loop.		/// legal to vectorize the loop.
void buildVPlans(ElementCount MinVF, ElementCount MaxVF);		void buildVPlans(LoopVectorizationCostModel &CM, ElementCount MinVF,
		ElementCount MaxVF);

private:		private:
/// Build a VPlan according to the information gathered by Legal. \return a		/// Build a VPlan according to the information gathered by Legal. \return a
/// VPlan for vectorization factors \p Range.Start and up to \p Range.End		/// VPlan for vectorization factors \p Range.Start and up to \p Range.End
/// exclusive, possibly decreasing \p Range.End.		/// exclusive, possibly decreasing \p Range.End.
VPlanPtr buildVPlan(VFRange &Range);		VPlanPtr buildVPlan(LoopVectorizationCostModel &CM, VFRange &Range);

/// Build a VPlan using VPRecipes according to the information gathered by		/// Build a VPlan using VPRecipes according to the information gathered by
/// Legal. This method is only used for the legacy inner loop vectorizer.		/// Legal. This method is only used for the legacy inner loop vectorizer.
/// \p Range's largest included VF is restricted to the maximum VF the		/// \p Range's largest included VF is restricted to the maximum VF the
/// returned VPlan is valid for. If no VPlan can be built for the input range,		/// returned VPlan is valid for. If no VPlan can be built for the input range,
/// set the largest included VF to the maximum VF for which no plan could be		/// set the largest included VF to the maximum VF for which no plan could be
/// built.		/// built.
std::optional<VPlanPtr> tryToBuildVPlanWithVPRecipes(		std::optional<VPlanPtr> tryToBuildVPlanWithVPRecipes(
VFRange &Range, SmallPtrSetImpl<Instruction *> &DeadInstructions);		LoopVectorizationCostModel &CM, VFRange &Range,
		SmallPtrSetImpl<Instruction *> &DeadInstructions);

/// Build VPlans for power-of-2 VF's between \p MinVF and \p MaxVF inclusive,		/// Build VPlans for power-of-2 VF's between \p MinVF and \p MaxVF inclusive,
/// according to the information gathered by Legal when it checked if it is		/// according to the information gathered by Legal when it checked if it is
/// legal to vectorize the loop. This method creates VPlans using VPRecipes.		/// legal to vectorize the loop. This method creates VPlans using VPRecipes.
void buildVPlansWithVPRecipes(ElementCount MinVF, ElementCount MaxVF);		void buildVPlansWithVPRecipes(LoopVectorizationCostModel &CM,
		ElementCount MinVF, ElementCount MaxVF);

// Adjust the recipes for reductions. For in-loop reductions the chain of		// Adjust the recipes for reductions. For in-loop reductions the chain of
// instructions leading from the loop exit instr to the phi need to be		// instructions leading from the loop exit instr to the phi need to be
// converted to reductions, with one operand being vector and the other being		// converted to reductions, with one operand being vector and the other being
// the scalar reduction chain. For other reductions, a select is introduced		// the scalar reduction chain. For other reductions, a select is introduced
// between the phi and live-out recipes when folding the tail.		// between the phi and live-out recipes when folding the tail.
void adjustRecipesForReductions(VPBasicBlock *LatchVPBB, VPlanPtr &Plan,		void adjustRecipesForReductions(LoopVectorizationCostModel &CM,
		VPBasicBlock *LatchVPBB, VPlanPtr &Plan,
VPRecipeBuilder &RecipeBuilder,		VPRecipeBuilder &RecipeBuilder,
ElementCount MinVF);		ElementCount MinVF);

		/// Determines if we have the infrastructure to vectorize loop \p L and its
		/// epilogue, assuming the main loop is vectorized by \p VF.
		bool isCandidateForEpilogueVectorization(const Loop &L,
		const ElementCount VF) const;

		/// Returns true if the per-lane cost of VectorizationFactor A is lower than
		/// that of B.
		bool isMoreProfitable(const VectorizationFactor &A,
		const VectorizationFactor &B) const;

		/// Returns true if epilogue vectorization is considered profitable, and
		/// false otherwise.
		/// \p VF is the vectorization factor chosen for the original loop.
		bool isEpilogueVectorizationProfitable(const ElementCount VF) const;
};		};

} // namespace llvm		} // namespace llvm

#endif // LLVM_TRANSFORMS_VECTORIZE_LOOPVECTORIZATIONPLANNER_H		#endif // LLVM_TRANSFORMS_VECTORIZE_LOOPVECTORIZATIONPLANNER_H
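With FoldTailByMasking now part of VectorizationFactor, the planner can rank candidates from tail-folded and non-tail-folded plans against each other. The sketch below illustrates that selection under simplifying assumptions: fixed widths, integer costs, and a plain per-lane comparison (A.Cost / A.Width < B.Cost / B.Width, cross-multiplied to avoid division) that ignores any trip-count or scalable-width adjustments the real isMoreProfitable may make. VFSketch, isMoreProfitableSketch and selectSketch are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Simplified, hypothetical stand-in for the patched VectorizationFactor:
// fixed widths only, and integer costs instead of InstructionCost.
struct VFSketch {
  unsigned Width;         // 1 means scalar.
  bool FoldTailByMasking; // Whether the entire loop is predicated.
  uint64_t Cost;          // Cost of one vector iteration.

  // Like the patched operator==, the tail-folding choice is compared too.
  bool operator==(const VFSketch &RHS) const {
    return Width == RHS.Width && FoldTailByMasking == RHS.FoldTailByMasking &&
           Cost == RHS.Cost;
  }
};

// Per-lane comparison, cross-multiplied to avoid division:
// A.Cost / A.Width < B.Cost / B.Width  <=>  A.Cost * B.Width < B.Cost * A.Width
bool isMoreProfitableSketch(const VFSketch &A, const VFSketch &B) {
  return A.Cost * B.Width < B.Cost * A.Width;
}

// Pick the cheapest candidate across plans built with and without tail
// folding; on a tie the earlier candidate is kept.
std::optional<VFSketch> selectSketch(const std::vector<VFSketch> &Candidates) {
  std::optional<VFSketch> Best;
  for (const VFSketch &C : Candidates)
    if (!Best || isMoreProfitableSketch(C, *Best))
      Best = C;
  return Best;
}
```

For example, with candidates {1, false, 10}, {4, false, 24}, {4, true, 20} and {8, true, 48}, the per-lane costs are 10, 6, 5 and 6, so the tail-folded VF 4 candidate is selected. Because equality now also compares FoldTailByMasking, two candidates with the same width and cost but different tail-folding choices are no longer treated as equal.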

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

Show First 20 Lines • Show All 228 Lines • ▼ Show 20 Lines	cl::values(clEnumValN(PreferPredicateTy::ScalarEpilogue,
"predicate-dont-vectorize",		"predicate-dont-vectorize",
"prefers tail-folding, don't attempt vectorization if "		"prefers tail-folding, don't attempt vectorization if "
"tail-folding fails.")));		"tail-folding fails.")));

static cl::opt<TailFoldingStyle> ForceTailFoldingStyle(		static cl::opt<TailFoldingStyle> ForceTailFoldingStyle(
"force-tail-folding-style", cl::desc("Force the tail folding style"),		"force-tail-folding-style", cl::desc("Force the tail folding style"),
cl::init(TailFoldingStyle::None),		cl::init(TailFoldingStyle::None),
cl::values(		cl::values(
clEnumValN(TailFoldingStyle::None, "none", "Disable tail folding"),
clEnumValN(		clEnumValN(
TailFoldingStyle::Data, "data",		TailFoldingStyle::Data, "data",
"Create lane mask for data only, using active.lane.mask intrinsic"),		"Create lane mask for data only, using active.lane.mask intrinsic"),
clEnumValN(TailFoldingStyle::DataWithoutLaneMask,		clEnumValN(TailFoldingStyle::DataWithoutLaneMask,
"data-without-lane-mask",		"data-without-lane-mask",
"Create lane mask with compare/stepvector"),		"Create lane mask with compare/stepvector"),
clEnumValN(TailFoldingStyle::DataAndControlFlow, "data-and-control",		clEnumValN(TailFoldingStyle::DataAndControlFlow, "data-and-control",
"Create lane mask using active.lane.mask intrinsic, and use "		"Create lane mask using active.lane.mask intrinsic, and use "
▲ Show 20 Lines • Show All 919 Lines • ▼ Show 20 Lines
/// vectorization.		/// vectorization.
/// In many cases vectorization is not profitable. This can happen because of		/// In many cases vectorization is not profitable. This can happen because of
/// a number of reasons. In this class we mainly attempt to predict the		/// a number of reasons. In this class we mainly attempt to predict the
/// expected speedup/slowdowns due to the supported instruction set. We use the		/// expected speedup/slowdowns due to the supported instruction set. We use the
/// TargetTransformInfo to query the different backends for the cost of		/// TargetTransformInfo to query the different backends for the cost of
/// different operations.		/// different operations.
class LoopVectorizationCostModel {		class LoopVectorizationCostModel {
public:		public:
LoopVectorizationCostModel(ScalarEpilogueLowering SEL, Loop *L,		LoopVectorizationCostModel(bool FoldTailByMasking, ScalarEpilogueLowering SEL,
PredicatedScalarEvolution &PSE, LoopInfo *LI,		Loop *L, PredicatedScalarEvolution &PSE,
LoopVectorizationLegality *Legal,		LoopInfo *LI, LoopVectorizationLegality *Legal,
const TargetTransformInfo &TTI,		const TargetTransformInfo &TTI,
const TargetLibraryInfo *TLI, DemandedBits *DB,		const TargetLibraryInfo *TLI, DemandedBits *DB,
AssumptionCache *AC,		AssumptionCache *AC,
OptimizationRemarkEmitter *ORE, const Function *F,		OptimizationRemarkEmitter *ORE, const Function *F,
const LoopVectorizeHints *Hints,		const LoopVectorizeHints *Hints,
InterleavedAccessInfo &IAI)		InterleavedAccessInfo &IAI)
: ScalarEpilogueStatus(SEL), TheLoop(L), PSE(PSE), LI(LI), Legal(Legal),		: ScalarEpilogueStatus(SEL), FoldTailByMasking(FoldTailByMasking),
TTI(TTI), TLI(TLI), DB(DB), AC(AC), ORE(ORE), TheFunction(F),		TheLoop(L), PSE(PSE), LI(LI), Legal(Legal), TTI(TTI), TLI(TLI), DB(DB),
Hints(Hints), InterleaveInfo(IAI) {}		AC(AC), ORE(ORE), TheFunction(F), Hints(Hints), InterleaveInfo(IAI) {}

/// \return An upper bound for the vectorization factors (both fixed and		/// \return An upper bound for the vectorization factors (both fixed and
/// scalable). If the factors are 0, vectorization and interleaving should be		/// scalable). If the factors are 0, vectorization and interleaving should be
/// avoided up front.		/// avoided up front.
FixedScalableVFPair computeMaxVF(ElementCount UserVF, unsigned UserIC);		FixedScalableVFPair computeMaxVF(ElementCount UserVF, unsigned UserIC);

/// \return True if runtime checks are required for vectorization, and false		/// \return True if runtime checks are required for vectorization, and false
/// otherwise.		/// otherwise.
bool runtimeChecksRequired();		bool runtimeChecksRequired();

/// \return The most profitable vectorization factor and the cost of that VF.
/// This method checks every VF in \p CandidateVFs. If UserVF is not ZERO
/// then this vectorization factor will be selected if vectorization is
/// possible.
VectorizationFactor
selectVectorizationFactor(const ElementCountSet &CandidateVFs);

VectorizationFactor
selectEpilogueVectorizationFactor(const ElementCount MaxVF,
const LoopVectorizationPlanner &LVP);

/// Setup cost-based decisions for user vectorization factor.		/// Setup cost-based decisions for user vectorization factor.
/// \return true if the UserVF is a feasible VF to be chosen.		/// \return true if the UserVF is a feasible VF to be chosen.
bool selectUserVectorizationFactor(ElementCount UserVF) {		bool selectUserVectorizationFactor(ElementCount UserVF) {
collectUniformsAndScalars(UserVF);		collectUniformsAndScalars(UserVF);
collectInstsToScalarize(UserVF);		collectInstsToScalarize(UserVF);
return expectedCost(UserVF).first.isValid();		return expectedCost(UserVF).first.isValid();
}		}

▲ Show 20 Lines • Show All 340 Lines • ▼ Show 20 Lines	auto RequiresScalarEpilogue = [this](ElementCount VF) {
return requiresScalarEpilogue(VF);		return requiresScalarEpilogue(VF);
};		};
bool IsRequired = all_of(Range, RequiresScalarEpilogue);		bool IsRequired = all_of(Range, RequiresScalarEpilogue);
assert(		assert(
(IsRequired || none_of(Range, RequiresScalarEpilogue)) &&		(IsRequired || none_of(Range, RequiresScalarEpilogue)) &&
"all VFs in range must agree on whether a scalar epilogue is required");		"all VFs in range must agree on whether a scalar epilogue is required");
return IsRequired;		return IsRequired;
}		}

/// Returns true if a scalar epilogue is not allowed due to optsize or a		/// Returns false if a scalar epilogue is not allowed due to, for example,
		david-arm (inline comment, not done): This is not your fault, but whilst you are here could you update the comment to reflect the function behaviour? It looks like the comment probably got out of sync with the function at some point. :) Perhaps something like: /// Returns true if a scalar epilogue is allowed. It may return false if: /// 1. We are optimising for code size /// 2. There is a loop hint annotation /// etc.
		dmgreen (author, reply, done): Yeah this is a bit of a weird one nowadays. I had tried to remove it, but I think it makes more sense to keep around for specifying when a scalar epilogue cannot be used.
/// loop hint annotation.		/// optsize or tail folding. It is used either as a check for when
		/// interleaving/epilogue vectorization can occur, or for checking cases where an
		/// epilogue would be required for correctness.
bool isScalarEpilogueAllowed() const {		bool isScalarEpilogueAllowed() const {
return ScalarEpilogueStatus == CM_ScalarEpilogueAllowed;		return ScalarEpilogueStatus == CM_ScalarEpilogueAllowed ||
		(!FoldTailByMasking &&
		david-arm (inline comment, not done): Do you know the scenarios in which we aren't folding the tail by masking, but a scalar epilogue is not needed? I think CM_ScalarEpilogueNotNeededUsePredicate is only set for these cases: 1) The user has supplied a hint. 2) The user has set the prefer-predicate-over-epilogue flag. 3) The TTI hook preferPredicateOverEpilogue has returned true. But I'd expect that for at least 3) we've set FoldTailByMasking to true?
		dmgreen (author, reply, done): For 3 it would be both with and without FoldTailByMasking. D145925 adds a way for the target to control the SEL more directly.
		ScalarEpilogueStatus == CM_ScalarEpilogueNotNeededUsePredicate);
}		}

/// Returns the TailFoldingStyle that is best for the current loop.		/// Returns the TailFoldingStyle that is best for the current loop.
TailFoldingStyle		TailFoldingStyle
getTailFoldingStyle(bool IVUpdateMayOverflow = true) const {		getTailFoldingStyle(bool IVUpdateMayOverflow = true) const {
if (!CanFoldTailByMasking)		if (!FoldTailByMasking)
return TailFoldingStyle::None;		return TailFoldingStyle::None;

if (ForceTailFoldingStyle.getNumOccurrences())		if (ForceTailFoldingStyle.getNumOccurrences())
return ForceTailFoldingStyle;		return ForceTailFoldingStyle;

return TTI.getPreferredTailFoldingStyle(IVUpdateMayOverflow);		return TTI.getPreferredTailFoldingStyle(IVUpdateMayOverflow);
}		}

/// Returns true if all loop blocks should be masked to fold tail loop.		/// Returns true if all loop blocks should be masked to fold tail loop.
bool foldTailByMasking() const {		bool foldTailByMasking() const { return FoldTailByMasking; }
return getTailFoldingStyle() != TailFoldingStyle::None;
}

/// Returns true if the instructions in this block requires predication		/// Returns true if the instructions in this block requires predication
/// for any reason, e.g. because tail folding now requires a predicate		/// for any reason, e.g. because tail folding now requires a predicate
/// or because the block in the original loop was predicated.		/// or because the block in the original loop was predicated.
bool blockNeedsPredicationForAnyReason(BasicBlock *BB) const {		bool blockNeedsPredicationForAnyReason(BasicBlock *BB) const {
return foldTailByMasking() || Legal->blockNeedsPredication(BB);		return foldTailByMasking() || Legal->blockNeedsPredication(BB);
}		}
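In the patched cost model, isScalarEpilogueAllowed() and getTailFoldingStyle() both consult the FoldTailByMasking flag passed to the constructor. A small sketch of that decision logic, using hypothetical trimmed-down enums (SELSketch, TailFoldingStyleSketch) and omitting the -force-tail-folding-style override and the TTI query:

```cpp
#include <cassert>

// Hypothetical, trimmed-down stand-ins for ScalarEpilogueLowering and
// TailFoldingStyle. OtherNotAllowed lumps together every status that
// forbids a scalar epilogue outright (for example under optsize).
enum class SELSketch { Allowed, NotNeededUsePredicate, OtherNotAllowed };
enum class TailFoldingStyleSketch { None, Data };

// Mirrors the patched isScalarEpilogueAllowed(): a cost model that is not
// folding the tail may still use a scalar epilogue when predication was
// only preferred (NotNeededUsePredicate) rather than required.
bool isScalarEpilogueAllowedSketch(SELSketch Status, bool FoldTailByMasking) {
  return Status == SELSketch::Allowed ||
         (!FoldTailByMasking && Status == SELSketch::NotNeededUsePredicate);
}

// Mirrors the patched getTailFoldingStyle(), minus the command-line
// override: a cost model built without tail folding always reports None;
// otherwise the target's preferred style is returned.
TailFoldingStyleSketch
getTailFoldingStyleSketch(bool FoldTailByMasking,
                          TailFoldingStyleSketch TargetPreferred) {
  if (!FoldTailByMasking)
    return TailFoldingStyleSketch::None;
  return TargetPreferred;
}
```

In particular, a NotNeededUsePredicate status only permits a scalar epilogue in the cost model that is not folding the tail, which is the case discussed in the review thread on isScalarEpilogueAllowed above.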

Show All 22 Lines	public:
/// VF. Return the cost of the instruction, including scalarization overhead		/// VF. Return the cost of the instruction, including scalarization overhead
/// if it's needed. The flag NeedToScalarize shows if the call needs to be		/// if it's needed. The flag NeedToScalarize shows if the call needs to be
/// scalarized -		/// scalarized -
/// i.e. either vector version isn't available, or is too expensive.		/// i.e. either vector version isn't available, or is too expensive.
InstructionCost getVectorCallCost(CallInst *CI, ElementCount VF,		InstructionCost getVectorCallCost(CallInst *CI, ElementCount VF,
Function **Variant,		Function **Variant,
bool *NeedsMask = nullptr) const;		bool *NeedsMask = nullptr) const;

/// Returns true if the per-lane cost of VectorizationFactor A is lower than
/// that of B.
bool isMoreProfitable(const VectorizationFactor &A,
const VectorizationFactor &B) const;

/// Invalidates decisions already taken by the cost model.		/// Invalidates decisions already taken by the cost model.
void invalidateCostModelingDecisions() {		void invalidateCostModelingDecisions() {
WideningDecisions.clear();		WideningDecisions.clear();
Uniforms.clear();		Uniforms.clear();
Scalars.clear();		Scalars.clear();
}		}

		/// The vectorization cost is a combination of the cost itself and a boolean
		/// indicating whether any of the contributing operations will actually
		/// operate on vector values after type legalization in the backend. If this
		/// latter value is false, then all operations will be scalarized (i.e. no
		/// vectorization has actually taken place).
		using VectorizationCostTy = std::pair<InstructionCost, bool>;

		/// Returns the expected execution cost. The unit of the cost does
		/// not matter because we use the 'cost' units to compare different
		/// vector widths. The cost that is returned is not normalized by
		/// the factor width. If \p Invalid is not nullptr, this function
		/// will add a pair(Instruction*, ElementCount) to \p Invalid for
		/// each instruction that has an Invalid cost for the given VF.
		VectorizationCostTy
		expectedCost(ElementCount VF,
		SmallVectorImpl<InstructionVFPair> *Invalid = nullptr);

		/// Return the NumPredStores, to be checked by the Planner.
		unsigned getNumPredStores() { return NumPredStores; }

private:		private:
unsigned NumPredStores = 0;		unsigned NumPredStores = 0;

/// \return An upper bound for the vectorization factors for both		/// \return An upper bound for the vectorization factors for both
/// fixed and scalable vectorization, where the minimum-known number of		/// fixed and scalable vectorization, where the minimum-known number of
/// elements is a power-of-2 larger than zero. If scalable vectorization is		/// elements is a power-of-2 larger than zero. If scalable vectorization is
/// disabled or unsupported, then the scalable part will be equal to		/// disabled or unsupported, then the scalable part will be equal to
/// ElementCount::getScalable(0).		/// ElementCount::getScalable(0).
Show All 9 Lines	ElementCount getMaximizedVFForTarget(unsigned ConstTripCount,
unsigned WidestType,		unsigned WidestType,
ElementCount MaxSafeVF,		ElementCount MaxSafeVF,
bool FoldTailByMasking);		bool FoldTailByMasking);

/// \return the maximum legal scalable VF, based on the safe max number		/// \return the maximum legal scalable VF, based on the safe max number
/// of elements.		/// of elements.
ElementCount getMaxLegalScalableVF(unsigned MaxSafeElements);		ElementCount getMaxLegalScalableVF(unsigned MaxSafeElements);

/// The vectorization cost is a combination of the cost itself and a boolean
/// indicating whether any of the contributing operations will actually
/// operate on vector values after type legalization in the backend. If this
/// latter value is false, then all operations will be scalarized (i.e. no
/// vectorization has actually taken place).
using VectorizationCostTy = std::pair<InstructionCost, bool>;

/// Returns the expected execution cost. The unit of the cost does
/// not matter because we use the 'cost' units to compare different
/// vector widths. The cost that is returned is not normalized by
/// the factor width. If \p Invalid is not nullptr, this function
/// will add a pair(Instruction*, ElementCount) to \p Invalid for
/// each instruction that has an Invalid cost for the given VF.
VectorizationCostTy
expectedCost(ElementCount VF,
SmallVectorImpl<InstructionVFPair> *Invalid = nullptr);

/// Returns the execution time cost of an instruction for a given vector		/// Returns the execution time cost of an instruction for a given vector
/// width. Vector width of one means scalar.		/// width. Vector width of one means scalar.
VectorizationCostTy getInstructionCost(Instruction *I, ElementCount VF);		VectorizationCostTy getInstructionCost(Instruction *I, ElementCount VF);

/// The cost-computation logic from getInstructionCost which provides		/// The cost-computation logic from getInstructionCost which provides
/// the vector type as an output parameter.		/// the vector type as an output parameter.
InstructionCost getInstructionCost(Instruction *I, ElementCount VF,		InstructionCost getInstructionCost(Instruction *I, ElementCount VF,
Type *&VectorTy);		Type *&VectorTy);
▲ Show 20 Lines • Show All 55 Lines • ▼ Show 20 Lines	private:
/// aliasing/dependence checks fail, or to handle the tail/remainder		/// aliasing/dependence checks fail, or to handle the tail/remainder
/// iterations when the trip count is unknown or doesn't divide by the VF,		/// iterations when the trip count is unknown or doesn't divide by the VF,
/// or as a peel-loop to handle gaps in interleave-groups.		/// or as a peel-loop to handle gaps in interleave-groups.
/// Under optsize and when the trip count is very small we don't allow any		/// Under optsize and when the trip count is very small we don't allow any
/// iterations to execute in the scalar loop.		/// iterations to execute in the scalar loop.
ScalarEpilogueLowering ScalarEpilogueStatus = CM_ScalarEpilogueAllowed;		ScalarEpilogueLowering ScalarEpilogueStatus = CM_ScalarEpilogueAllowed;

/// All blocks of loop are to be masked to fold tail of scalar iterations.		/// All blocks of loop are to be masked to fold tail of scalar iterations.
bool CanFoldTailByMasking = false;		bool FoldTailByMasking = false;

/// A map holding scalar costs for different vectorization factors. The		/// A map holding scalar costs for different vectorization factors. The
/// presence of a cost for an instruction in the mapping indicates that the		/// presence of a cost for an instruction in the mapping indicates that the
/// instruction will be scalarized when vectorizing with the associated		/// instruction will be scalarized when vectorizing with the associated
/// vectorization factor. The entries are VF-ScalarCostTy pairs.		/// vectorization factor. The entries are VF-ScalarCostTy pairs.
DenseMap<ElementCount, ScalarCostsTy> InstsToScalarize;		DenseMap<ElementCount, ScalarCostsTy> InstsToScalarize;

/// Holds the instructions known to be uniform after vectorization.		/// Holds the instructions known to be uniform after vectorization.
⋯ 74 lines not shown ⋯	private:

/// Returns a range containing only operands needing to be extracted.		/// Returns a range containing only operands needing to be extracted.
SmallVector<Value *, 4> filterExtractingOperands(Instruction::op_range Ops,		SmallVector<Value *, 4> filterExtractingOperands(Instruction::op_range Ops,
ElementCount VF) const {		ElementCount VF) const {
return SmallVector<Value *, 4>(make_filter_range(		return SmallVector<Value *, 4>(make_filter_range(
Ops, [this, VF](Value *V) { return this->needsExtract(V, VF); }));		Ops, [this, VF](Value *V) { return this->needsExtract(V, VF); }));
}		}

/// Determines if we have the infrastructure to vectorize loop \p L and its
/// epilogue, assuming the main loop is vectorized by \p VF.
bool isCandidateForEpilogueVectorization(const Loop &L,
const ElementCount VF) const;

/// Returns true if epilogue vectorization is considered profitable, and
/// false otherwise.
/// \p VF is the vectorization factor chosen for the original loop.
bool isEpilogueVectorizationProfitable(const ElementCount VF) const;

public:		public:
/// The loop that we evaluate.		/// The loop that we evaluate.
Loop *TheLoop;		Loop *TheLoop;

/// Predicated scalar evolution analysis.		/// Predicated scalar evolution analysis.
PredicatedScalarEvolution &PSE;		PredicatedScalarEvolution &PSE;

/// Loop Info analysis.		/// Loop Info analysis.
⋯ 29 lines not shown ⋯	public:
/// Values to ignore in the cost model.		/// Values to ignore in the cost model.
SmallPtrSet<const Value *, 16> ValuesToIgnore;		SmallPtrSet<const Value *, 16> ValuesToIgnore;

/// Values to ignore in the cost model when VF > 1.		/// Values to ignore in the cost model when VF > 1.
SmallPtrSet<const Value *, 16> VecValuesToIgnore;		SmallPtrSet<const Value *, 16> VecValuesToIgnore;

/// All element types found in the loop.		/// All element types found in the loop.
SmallPtrSet<Type *, 16> ElementTypesInLoop;		SmallPtrSet<Type *, 16> ElementTypesInLoop;

/// Profitable vector factors.
SmallVector<VectorizationFactor, 8> ProfitableVFs;
};		};
} // end namespace llvm		} // end namespace llvm

namespace {		namespace {
/// Helper struct to manage generating runtime checks for vectorization.		/// Helper struct to manage generating runtime checks for vectorization.
///		///
/// The runtime checks are created up-front in temporary blocks to allow better		/// The runtime checks are created up-front in temporary blocks to allow better
/// estimating the cost and un-linked from the existing IR. After deciding to		/// estimating the cost and un-linked from the existing IR. After deciding to
⋯ 1,545 lines not shown ⋯	static void cse(BasicBlock *BB) {
}		}
}		}

InstructionCost LoopVectorizationCostModel::getVectorCallCost(		InstructionCost LoopVectorizationCostModel::getVectorCallCost(
CallInst *CI, ElementCount VF, Function **Variant, bool *NeedsMask) const {		CallInst *CI, ElementCount VF, Function **Variant, bool *NeedsMask) const {
Function *F = CI->getCalledFunction();		Function *F = CI->getCalledFunction();
Type *ScalarRetTy = CI->getType();		Type *ScalarRetTy = CI->getType();
SmallVector<Type *, 4> Tys, ScalarTys;		SmallVector<Type *, 4> Tys, ScalarTys;
bool MaskRequired = Legal->isMaskRequired(CI);		bool MaskRequired = Legal->isMaskRequired(foldTailByMasking(), CI);
for (auto &ArgOp : CI->args())		for (auto &ArgOp : CI->args())
ScalarTys.push_back(ArgOp->getType());		ScalarTys.push_back(ArgOp->getType());

// Estimate cost of scalarized vector call. The source operands are assumed		// Estimate cost of scalarized vector call. The source operands are assumed
// to be vectors, so we need to extract individual elements from there,		// to be vectors, so we need to extract individual elements from there,
// execute VF scalar calls, and then gather the result into the vector return		// execute VF scalar calls, and then gather the result into the vector return
// value.		// value.
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
⋯ 469 lines not shown ⋯	void InnerLoopVectorizer::fixReduction(VPReductionPHIRecipe *PhiR,

VPBasicBlock *LatchVPBB =		VPBasicBlock *LatchVPBB =
PhiR->getParent()->getEnclosingLoopRegion()->getExitingBasicBlock();		PhiR->getParent()->getEnclosingLoopRegion()->getExitingBasicBlock();
BasicBlock *VectorLoopLatch = State.CFG.VPBB2IRBB[LatchVPBB];		BasicBlock *VectorLoopLatch = State.CFG.VPBB2IRBB[LatchVPBB];
// If tail is folded by masking, the vector value to leave the loop should be		// If tail is folded by masking, the vector value to leave the loop should be
// a Select choosing between the vectorized LoopExitInst and vectorized Phi,		// a Select choosing between the vectorized LoopExitInst and vectorized Phi,
// instead of the former. For an inloop reduction the reduction will already		// instead of the former. For an inloop reduction the reduction will already
// be predicated, and does not need to be handled here.		// be predicated, and does not need to be handled here.
if (Cost->foldTailByMasking() && !PhiR->isInLoop()) {		if (State.Plan->foldTailByMasking() && !PhiR->isInLoop()) {
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
Value *VecLoopExitInst = State.get(LoopExitInstDef, Part);		Value *VecLoopExitInst = State.get(LoopExitInstDef, Part);
SelectInst *Sel = nullptr;		SelectInst *Sel = nullptr;
for (User *U : VecLoopExitInst->users()) {		for (User *U : VecLoopExitInst->users()) {
if (isa<SelectInst>(U)) {		if (isa<SelectInst>(U)) {
assert(!Sel && "Reduction exit feeding two selects");		assert(!Sel && "Reduction exit feeding two selects");
Sel = cast<SelectInst>(U);		Sel = cast<SelectInst>(U);
} else		} else
⋯ 513 lines not shown ⋯	bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {

// Can we prove this instruction is safe to unconditionally execute?		// Can we prove this instruction is safe to unconditionally execute?
// If not, we must use some form of predication.		// If not, we must use some form of predication.
switch(I->getOpcode()) {		switch(I->getOpcode()) {
default:		default:
return false;		return false;
case Instruction::Load:		case Instruction::Load:
case Instruction::Store: {		case Instruction::Store: {
if (!Legal->isMaskRequired(I))		if (!Legal->isMaskRequired(foldTailByMasking(), I))
return false;		return false;
// When we know the load's address is loop invariant and the instruction		// When we know the load's address is loop invariant and the instruction
// in the original scalar loop was unconditionally executed then we		// in the original scalar loop was unconditionally executed then we
// don't need to mark it as a predicated instruction. Tail folding may		// don't need to mark it as a predicated instruction. Tail folding may
// introduce additional predication, but we're guaranteed to always have		// introduce additional predication, but we're guaranteed to always have
// at least one active lane. We call Legal->blockNeedsPredication here		// at least one active lane. We call Legal->blockNeedsPredication here
// because it doesn't query tail-folding. For stores, we need to prove		// because it doesn't query tail-folding. For stores, we need to prove
// both speculation safety (which follows from the same argument as loads),		// both speculation safety (which follows from the same argument as loads),
⋯ 10 lines not shown ⋯	bool LoopVectorizationCostModel::isPredicatedInst(Instruction *I) const {
case Instruction::UDiv:		case Instruction::UDiv:
case Instruction::SDiv:		case Instruction::SDiv:
case Instruction::SRem:		case Instruction::SRem:
case Instruction::URem:		case Instruction::URem:
// TODO: We can use the loop-preheader as context point here and get		// TODO: We can use the loop-preheader as context point here and get
// context sensitive reasoning		// context sensitive reasoning
return !isSafeToSpeculativelyExecute(I);		return !isSafeToSpeculativelyExecute(I);
case Instruction::Call:		case Instruction::Call:
return Legal->isMaskRequired(I);		return Legal->isMaskRequired(foldTailByMasking(), I);
}		}
}		}

std::pair<InstructionCost, InstructionCost>		std::pair<InstructionCost, InstructionCost>
LoopVectorizationCostModel::getDivRemSpeculationCost(Instruction *I,		LoopVectorizationCostModel::getDivRemSpeculationCost(Instruction *I,
ElementCount VF) const {		ElementCount VF) const {
assert(I->getOpcode() == Instruction::UDiv ||		assert(I->getOpcode() == Instruction::UDiv ||
I->getOpcode() == Instruction::SDiv ||		I->getOpcode() == Instruction::SDiv ||
I->getOpcode() == Instruction::SRem ||		I->getOpcode() == Instruction::SRem ||
I->getOpcode() == Instruction::URem);		I->getOpcode() == Instruction::URem);
assert(!isSafeToSpeculativelyExecute(I));		assert(!isSafeToSpeculativelyExecute(I));

const TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;		const TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;

⋯ 89 lines not shown ⋯	bool LoopVectorizationCostModel::interleavedAccessCanBeWidened(

// Check if masking is required.		// Check if masking is required.
// A Group may need masking for one of two reasons: it resides in a block that		// A Group may need masking for one of two reasons: it resides in a block that
// needs predication, or it was decided to use masking to deal with gaps		// needs predication, or it was decided to use masking to deal with gaps
// (either a gap at the end of a load-access that may result in a speculative		// (either a gap at the end of a load-access that may result in a speculative
// load, or any gaps in a store-access).		// load, or any gaps in a store-access).
bool PredicatedAccessRequiresMasking =		bool PredicatedAccessRequiresMasking =
blockNeedsPredicationForAnyReason(I->getParent()) &&		blockNeedsPredicationForAnyReason(I->getParent()) &&
Legal->isMaskRequired(I);		Legal->isMaskRequired(foldTailByMasking(), I);
bool LoadAccessWithGapsRequiresEpilogMasking =		bool LoadAccessWithGapsRequiresEpilogMasking =
isa<LoadInst>(I) && Group->requiresScalarEpilogue() &&		isa<LoadInst>(I) && Group->requiresScalarEpilogue() &&
!isScalarEpilogueAllowed();		!isScalarEpilogueAllowed();
bool StoreAccessWithGapsRequiresMasking =		bool StoreAccessWithGapsRequiresMasking =
isa<StoreInst>(I) && (Group->getNumMembers() < Group->getFactor());		isa<StoreInst>(I) && (Group->getNumMembers() < Group->getFactor());
if (!PredicatedAccessRequiresMasking &&		if (!PredicatedAccessRequiresMasking &&
!LoadAccessWithGapsRequiresEpilogMasking &&		!LoadAccessWithGapsRequiresEpilogMasking &&
!StoreAccessWithGapsRequiresMasking)		!StoreAccessWithGapsRequiresMasking)
⋯ 483 lines not shown ⋯	reportVectorizationFailure("Single iteration (non) loop",
"SingleIterationLoop", ORE, TheLoop);		"SingleIterationLoop", ORE, TheLoop);
return FixedScalableVFPair::getNone();		return FixedScalableVFPair::getNone();
}		}

switch (ScalarEpilogueStatus) {		switch (ScalarEpilogueStatus) {
case CM_ScalarEpilogueAllowed:		case CM_ScalarEpilogueAllowed:
return computeFeasibleMaxVF(TC, UserVF, false);		return computeFeasibleMaxVF(TC, UserVF, false);
case CM_ScalarEpilogueNotAllowedUsePredicate:		case CM_ScalarEpilogueNotAllowedUsePredicate:
[[fallthrough]];		LLVM_DEBUG(dbgs() << "LV: vector predicate hint/switch found.\n"
		<< "LV: Not allowing scalar epilogue, creating "
		"predicated vector loop.\n");
		// We cannot add a scalar tail, but fall through to the code below both with
		// and without FoldTailByMasking. FoldTailByMasking=false will only be
		// allowed if the trip count is known to be a multiple of the VF. Otherwise
		// FoldTailByMasking=true plans will be used.
		break;
		david-arm: In this case it looks like we're going to create two sets of plans, one with tail-folding and one without, and in both cases we're actually going to tail-fold anyway. Is that right? Not saying we shouldn't do that, but just trying to understand how this works.
		dmgreen (author): This falls through in both cases. In the code below we may find that the known trip count is a multiple of the MaxVF, and if so we return plans with TailFold=false. Otherwise it will not create TailFold=false plans and just uses the TailFold=true ones. There are a lot of edge cases.
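To make the fall-through behaviour in the thread above concrete, here is a minimal, hypothetical C++ model (not LLVM code) of which of the two cost models hands back candidate VFs under the two predicate statuses, based only on the new-side logic visible in this diff. `producesPlans`, `EpilogueStatus`, `SingleExitAtLatch` and `CanFoldTail` are illustrative names: `CanFoldTail` stands in for `Legal->prepareToFoldTailByMasking()`, a known trip count stands in for the SCEV remainder check against `MaxVFtimesIC`, and it is an assumption that the elided code after the `CM_ScalarEpilogueNotAllowedUsePredicate` bail-out also returns no VFs for the predicated model.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the two "UsePredicate" ScalarEpilogueLowering
// values; only the decisions visible in this diff are modelled.
enum class EpilogueStatus { NotAllowedUsePredicate, NotNeededUsePredicate };

// Returns true if the cost model with the given FoldTailByMasking setting
// would return candidate VFs, i.e. plans of that kind get built.
bool producesPlans(EpilogueStatus Status, bool FoldTailByMasking,
                   std::optional<unsigned> TripCount, unsigned MaxVFTimesIC,
                   bool SingleExitAtLatch, bool CanFoldTail) {
  // Tail folding is only a hint: the unpredicated model keeps a scalar
  // epilogue and returns its feasible max VF straight away.
  if (Status == EpilogueStatus::NotNeededUsePredicate && !FoldTailByMasking)
    return true;
  // Without a single exit at the latch there is no way to vectorize.
  if (!SingleExitAtLatch)
    return false;
  // A known trip count with no remainder leaves no tail, so only the
  // unpredicated model returns VFs.
  if (TripCount && *TripCount % MaxVFTimesIC == 0)
    return !FoldTailByMasking;
  // A tail remains: only the predicated model continues from here.
  if (!FoldTailByMasking)
    return false;
  // Assumption: if the tail cannot be folded, no VFs are returned.
  return CanFoldTail;
}
```

Under `CM_ScalarEpilogueNotAllowedUsePredicate` this model builds exactly one kind of plan: unpredicated when the trip count divides evenly, predicated otherwise. Under `CM_ScalarEpilogueNotNeededUsePredicate` both kinds can coexist and are then costed against each other.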
case CM_ScalarEpilogueNotNeededUsePredicate:		case CM_ScalarEpilogueNotNeededUsePredicate:
LLVM_DEBUG(		// If this cost model is for predicated plans then fall through to the
dbgs() << "LV: vector predicate hint/switch found.\n"		// prepareToFoldTailByMasking checks below, else return the unpredicated max
<< "LV: Not allowing scalar epilogue, creating predicated "		// size.
<< "vector loop.\n");		if (!FoldTailByMasking)
		return computeFeasibleMaxVF(TC, UserVF, false);
		david-arm: Don't we also have to set `ScalarEpilogueStatus = CM_ScalarEpilogueAllowed` here?
		dmgreen (author): I don't believe so. The ScalarEpilogueStatus tells us at a high level whether we should be predicating; FoldTailByMasking then controls whether the individual VPlans are predicated. So it felt better to keep the original ScalarEpilogueStatus in place to refer back to, in case we want to differentiate between the CM_ScalarEpilogueAllowed and CM_ScalarEpilogueNotNeededUsePredicate cases.
		LLVM_DEBUG(dbgs() << "LV: vector predicate hint/switch found.\n"
		<< "LV: Trying predicated vector loop.\n");
break;		break;
case CM_ScalarEpilogueNotAllowedLowTripLoop:		case CM_ScalarEpilogueNotAllowedLowTripLoop:
// fallthrough as a special case of OptForSize		// fallthrough as a special case of OptForSize
case CM_ScalarEpilogueNotAllowedOptSize:		case CM_ScalarEpilogueNotAllowedOptSize:
if (ScalarEpilogueStatus == CM_ScalarEpilogueNotAllowedOptSize)		if (ScalarEpilogueStatus == CM_ScalarEpilogueNotAllowedOptSize)
LLVM_DEBUG(		LLVM_DEBUG(
dbgs() << "LV: Not allowing scalar epilogue due to -Os/-Oz.\n");		dbgs() << "LV: Not allowing scalar epilogue due to -Os/-Oz.\n");
else		else
LLVM_DEBUG(dbgs() << "LV: Not allowing scalar epilogue due to low trip "		LLVM_DEBUG(dbgs() << "LV: Not allowing scalar epilogue due to low trip "
<< "count.\n");		<< "count.\n");

// Bail if runtime checks are required, which are not good when optimising		// Bail if runtime checks are required, which are not good when optimising
// for size.		// for size.
if (runtimeChecksRequired())		if (runtimeChecksRequired())
return FixedScalableVFPair::getNone();		return FixedScalableVFPair::getNone();

break;		break;
}		}

// The only loops we can vectorize without a scalar epilogue are loops with		// The only loops we can vectorize without a scalar epilogue are loops with
// a bottom-test and a single exiting block. We'd have to handle the fact		// a bottom-test and a single exiting block. We'd have to handle the fact
// that not every instruction executes on the last iteration. This will		// that not every instruction executes on the last iteration. This will
// require a lane mask which varies through the vector loop body. (TODO)		// require a lane mask which varies through the vector loop body. (TODO)
if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {		if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch())
// If there was a tail-folding hint/switch, but we can't fold the tail by
// masking, fallback to a vectorization with a scalar epilogue.
if (ScalarEpilogueStatus == CM_ScalarEpilogueNotNeededUsePredicate) {
LLVM_DEBUG(dbgs() << "LV: Cannot fold tail by masking: vectorize with a "
"scalar epilogue instead.\n");
ScalarEpilogueStatus = CM_ScalarEpilogueAllowed;
return computeFeasibleMaxVF(TC, UserVF, false);
}
return FixedScalableVFPair::getNone();		return FixedScalableVFPair::getNone();
}

// Now try the tail folding		// Now try the tail folding

// Invalidate interleave groups that require an epilogue if we can't mask		// Invalidate interleave groups that require an epilogue if we can't mask
// the interleave-group.		// the interleave-group.
if (!useMaskedInterleavedAccesses(TTI)) {		if (!useMaskedInterleavedAccesses(TTI)) {
assert(WideningDecisions.empty() && Uniforms.empty() && Scalars.empty() &&		assert(WideningDecisions.empty() && Uniforms.empty() && Scalars.empty() &&
"No decisions should have been taken at this point");		"No decisions should have been taken at this point");
⋯ 28 lines not shown ⋯	if (MaxPowerOf2RuntimeVF && *MaxPowerOf2RuntimeVF > 0) {
const SCEV *ExitCount = SE->getAddExpr(		const SCEV *ExitCount = SE->getAddExpr(
BackedgeTakenCount, SE->getOne(BackedgeTakenCount->getType()));		BackedgeTakenCount, SE->getOne(BackedgeTakenCount->getType()));
const SCEV *Rem = SE->getURemExpr(		const SCEV *Rem = SE->getURemExpr(
SE->applyLoopGuards(ExitCount, TheLoop),		SE->applyLoopGuards(ExitCount, TheLoop),
SE->getConstant(BackedgeTakenCount->getType(), MaxVFtimesIC));		SE->getConstant(BackedgeTakenCount->getType(), MaxVFtimesIC));
if (Rem->isZero()) {		if (Rem->isZero()) {
// Accept MaxFixedVF if we do not have a tail.		// Accept MaxFixedVF if we do not have a tail.
LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");		LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
return MaxFactors;		return FoldTailByMasking ? FixedScalableVFPair::getNone() : MaxFactors;
}		}
}		}

		// If this cost model is not for tail folding then return at this point and
		// leave it for the other model.
		if (!FoldTailByMasking &&
		ScalarEpilogueStatus != CM_ScalarEpilogueNotNeededUsePredicate)
		return FixedScalableVFPair::getNone();

// If we don't know the precise trip count, or if the trip count that we		// If we don't know the precise trip count, or if the trip count that we
// found modulo the vectorization factor is not zero, try to fold the tail		// found modulo the vectorization factor is not zero, try to fold the tail
// by masking.		// by masking.
// FIXME: look for a smaller MaxVF that does divide TC rather than masking.		// FIXME: look for a smaller MaxVF that does divide TC rather than masking.
if (Legal->prepareToFoldTailByMasking()) {		if (Legal->prepareToFoldTailByMasking()) {
CanFoldTailByMasking = true;		assert(FoldTailByMasking);
return MaxFactors;
}

// If there was a tail-folding hint/switch, but we can't fold the tail by
// masking, fallback to a vectorization with a scalar epilogue.
if (ScalarEpilogueStatus == CM_ScalarEpilogueNotNeededUsePredicate) {
LLVM_DEBUG(dbgs() << "LV: Cannot fold tail by masking: vectorize with a "
"scalar epilogue instead.\n");
ScalarEpilogueStatus = CM_ScalarEpilogueAllowed;
return MaxFactors;		return MaxFactors;
}		}

if (ScalarEpilogueStatus == CM_ScalarEpilogueNotAllowedUsePredicate) {		if (ScalarEpilogueStatus == CM_ScalarEpilogueNotAllowedUsePredicate) {
LLVM_DEBUG(dbgs() << "LV: Can't fold tail by masking: don't vectorize\n");		LLVM_DEBUG(dbgs() << "LV: Can't fold tail by masking: don't vectorize\n");
return FixedScalableVFPair::getNone();		return FixedScalableVFPair::getNone();
}		}

⋯ 112 lines not shown ⋯	if (MaximizeBandwidth || (MaximizeBandwidth.getNumOccurrences() == 0 &&

// Invalidate any widening decisions we might have made, in case the loop		// Invalidate any widening decisions we might have made, in case the loop
// requires predication (decided later), but we have already made some		// requires predication (decided later), but we have already made some
// load/store widening decisions.		// load/store widening decisions.
invalidateCostModelingDecisions();		invalidateCostModelingDecisions();
}		}
return MaxVF;		return MaxVF;
}		}

/// Convenience function that returns the value of vscale_range iff		/// Convenience function that returns the value of vscale_range iff
		david-arm: Some changes like this don't really need to be in this patch, right?
		dmgreen (author): Yeah sure, I can give that a try. To be honest it seems a bit odd on its own: a patch that makes things worse in isolation, which we would change only so that we can change it again later, creating needless busywork. But for the moment I have moved it out of this patch.
/// vscale_range.min == vscale_range.max or otherwise returns the value		/// vscale_range.min == vscale_range.max or otherwise returns the value
/// returned by the corresponding TLI method.		/// returned by the corresponding TLI method.
static std::optional<unsigned>		static std::optional<unsigned>
		david-arm: Can't we just pass in `TheFunction` instead?
		dmgreen (author): I think I chose Loop because the LoopVectorizationPlanner doesn't store the Function, so it would need to pass L->getHeader()->getParent() as an argument, and taking the Loop simplified the interface a little. Happy to change it though.
getVScaleForTuning(const Function *F, const TargetTransformInfo &TTI) {		getVScaleForTuning(const Function *F, const TargetTransformInfo &TTI) {
if (F->hasFnAttribute(Attribute::VScaleRange)) {		if (F->hasFnAttribute(Attribute::VScaleRange)) {
auto Attr = F->getFnAttribute(Attribute::VScaleRange);		auto Attr = F->getFnAttribute(Attribute::VScaleRange);
auto Min = Attr.getVScaleRangeMin();		auto Min = Attr.getVScaleRangeMin();
auto Max = Attr.getVScaleRangeMax();		auto Max = Attr.getVScaleRangeMax();
if (Max && Min == Max)		if (Max && Min == Max)
return Max;		return Max;
}		}

return TTI.getVScaleForTuning();		return TTI.getVScaleForTuning();
}		}
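As a rough illustration of how `getVScaleForTuning` feeds the width estimates in `isMoreProfitable`, the following hypothetical sketch models the `vscale_range` attribute as a plain struct. `VScaleRange`, `vscaleForTuningModel` and `estimatedWidth` are illustrative names, not LLVM APIs. The attribute is trusted only when its minimum and maximum agree; otherwise the target's tuning value is used, and a scalable width's known-minimum lane count is multiplied by that value when one is available.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the vscale_range(min, max) function attribute;
// Max is empty when the attribute leaves the upper bound open.
struct VScaleRange {
  unsigned Min;
  std::optional<unsigned> Max;
};

// Mirrors the helper above: use the attribute only when it pins vscale to a
// single value, otherwise fall back to the target's tuning value (if any).
std::optional<unsigned> vscaleForTuningModel(std::optional<VScaleRange> Attr,
                                             std::optional<unsigned> TargetTuning) {
  if (Attr && Attr->Max && Attr->Min == *Attr->Max)
    return Attr->Max;
  return TargetTuning;
}

// Estimated lane count used when comparing candidates: the known-minimum
// lanes, scaled by the tuning vscale for scalable vectors when one is known.
unsigned estimatedWidth(unsigned KnownMinLanes, bool Scalable,
                        std::optional<unsigned> VScale) {
  return (Scalable && VScale) ? KnownMinLanes * *VScale : KnownMinLanes;
}
```

For example, a scalable VF with a known minimum of 4 lanes and a tuning vscale of 2 is treated as 8 lanes when it is compared against fixed-width candidates.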

bool LoopVectorizationCostModel::isMoreProfitable(		bool LoopVectorizationPlanner::isMoreProfitable(
const VectorizationFactor &A, const VectorizationFactor &B) const {		const VectorizationFactor &A, const VectorizationFactor &B) const {
InstructionCost CostA = A.Cost;		InstructionCost CostA = A.Cost;
InstructionCost CostB = B.Cost;		InstructionCost CostB = B.Cost;

unsigned MaxTripCount = PSE.getSE()->getSmallConstantMaxTripCount(TheLoop);		unsigned MaxTripCount = PSE.getSE()->getSmallConstantMaxTripCount(OrigLoop);

if (!A.Width.isScalable() && !B.Width.isScalable() && MaxTripCount) {		if (!A.Width.isScalable() && !B.Width.isScalable() && MaxTripCount) {
// If the trip count is a known (possibly small) constant, the trip count		// If the trip count is a known (possibly small) constant, the trip count
// will be rounded up to an integer number of iterations under		// will be rounded up to an integer number of iterations under
// FoldTailByMasking. The total cost in that case will be		// FoldTailByMasking. The total cost in that case will be
// VecCost*ceil(TripCount/VF). When not folding the tail, the total		// VecCost*ceil(TripCount/VF). When not folding the tail, the total
// cost will be VecCost*floor(TC/VF) + ScalarCost*(TC%VF). There will be		// cost will be VecCost*floor(TC/VF) + ScalarCost*(TC%VF). There will be
// some extra overheads, but for the purpose of comparing the costs of		// some extra overheads, but for the purpose of comparing the costs of
// different VFs we can use this to compare the total loop-body cost		// different VFs we can use this to compare the total loop-body cost
// expected after vectorization.		// expected after vectorization.
auto GetCostForTC = [MaxTripCount, this](unsigned VF,		auto GetCostForTC = [MaxTripCount](bool FoldTailByMasking, unsigned VF,
InstructionCost VectorCost,		InstructionCost VectorCost,
		dmgreen (author): I have also pulled this part out into D147720, as a functional change that can be done separately.
InstructionCost ScalarCost) {		InstructionCost ScalarCost) {
return foldTailByMasking() ? VectorCost * divideCeil(MaxTripCount, VF)		return FoldTailByMasking ? VectorCost * divideCeil(MaxTripCount, VF)
: VectorCost * (MaxTripCount / VF) +		: VectorCost * (MaxTripCount / VF) +
ScalarCost * (MaxTripCount % VF);		ScalarCost * (MaxTripCount % VF);
};		};
auto RTCostA = GetCostForTC(A.Width.getFixedValue(), CostA, A.ScalarCost);		auto RTCostA = GetCostForTC(A.FoldTailByMasking, A.Width.getFixedValue(),
auto RTCostB = GetCostForTC(B.Width.getFixedValue(), CostB, B.ScalarCost);		CostA, A.ScalarCost);
		auto RTCostB = GetCostForTC(B.FoldTailByMasking, B.Width.getFixedValue(),
		CostB, B.ScalarCost);

		if (A.FoldTailByMasking && !B.FoldTailByMasking)
		return RTCostA <= RTCostB;

return RTCostA < RTCostB;		return RTCostA < RTCostB;
}		}

// Improve estimate for the vector width if it is scalable.		// Improve estimate for the vector width if it is scalable.
unsigned EstimatedWidthA = A.Width.getKnownMinValue();		unsigned EstimatedWidthA = A.Width.getKnownMinValue();
unsigned EstimatedWidthB = B.Width.getKnownMinValue();		unsigned EstimatedWidthB = B.Width.getKnownMinValue();
if (std::optional<unsigned> VScale = getVScaleForTuning(TheFunction, TTI)) {		if (std::optional<unsigned> VScale =
		getVScaleForTuning(OrigLoop->getHeader()->getParent(), *TTI)) {
if (A.Width.isScalable())		if (A.Width.isScalable())
EstimatedWidthA *= *VScale;		EstimatedWidthA *= *VScale;
if (B.Width.isScalable())		if (B.Width.isScalable())
EstimatedWidthB *= *VScale;		EstimatedWidthB *= *VScale;
}		}

		// If one plan is predicated and the other is not, opt for the predicated
		// scheme on a tie.
		if (A.FoldTailByMasking && !B.FoldTailByMasking)
		return (CostA * EstimatedWidthB) <= (CostB * EstimatedWidthA);
		david-arm: Again, I'm a bit concerned about the effect this will have when the active lane mask call is not costed into the loop. I'd prefer for now to be conservative and opt for the non-predicated scheme until we've accounted for the IV costs.
		dmgreen (author): For MVE in the past, when we returned true for preferPredicateOverEpilogue, we would get a tail-predicated loop (or not vectorize). Now that we get both tail-folded and non-tail-folded loops to cost against one another, we need to pick the tail-folded version on a tie so as not to be worse than before. The scores of the two are often the same, so to stay closer to the old codegen the conservative option is to choose FoldTail on a tie. I believe that MVE is currently the only target that returns true from preferPredicateOverEpilogue. SVE can be adjusted later if needed, but my current thinking was to use UsePredicatedEpilogue from D145925 as a first step, to get the benefits of epilogue vectorization while hopefully not messing anything else up. That was the plan for treating SVE conservatively, and we can expand things in the future if needed. It doesn't really make a lot of sense to add up disparate throughput costs and expect them to mean anything, but we can perhaps come up with something if we need to, and if not we can always just force an unpredicated body with a predicated epilogue. Let me know what you think.
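The tie-breaking rule discussed above can be illustrated with a small fixed-width model of the new `isMoreProfitable` logic. This is a hypothetical sketch: `Candidate`, `costForTripCount` and `isMoreProfitableModel` are illustrative names, costs are plain integers rather than `InstructionCost`, and the scalable-vector adjustment is left out. It applies the `GetCostForTC` formula when a maximum trip count is known and the division-free per-lane comparison otherwise, letting a predicated candidate win a tie against an unpredicated one.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified candidate: fixed-width VF only.
struct Candidate {
  unsigned VF;             // fixed vector width
  bool FoldTail;           // predicated (tail-folded) plan?
  std::int64_t Cost;       // cost of one vector iteration
  std::int64_t ScalarCost; // cost of one scalar iteration
};

// Total loop-body cost for a known trip count, following GetCostForTC:
// VecCost * ceil(TC / VF) when folding the tail, otherwise
// VecCost * floor(TC / VF) + ScalarCost * (TC % VF).
std::int64_t costForTripCount(const Candidate &C, unsigned TC) {
  if (C.FoldTail)
    return C.Cost * ((TC + C.VF - 1) / C.VF);
  return C.Cost * (TC / C.VF) + C.ScalarCost * (TC % C.VF);
}

// Returns true if A should be preferred over B. A predicated A wins a tie
// against an unpredicated B; otherwise A must be strictly cheaper.
bool isMoreProfitableModel(const Candidate &A, const Candidate &B,
                           unsigned MaxTripCount) {
  const bool PreferAOnTie = A.FoldTail && !B.FoldTail;
  std::int64_t LHS, RHS;
  if (MaxTripCount) {
    LHS = costForTripCount(A, MaxTripCount);
    RHS = costForTripCount(B, MaxTripCount);
  } else {
    // Unknown trip count: compare Cost/VF by cross-multiplying, which avoids
    // floating-point division.
    LHS = A.Cost * B.VF;
    RHS = B.Cost * A.VF;
  }
  return PreferAOnTie ? LHS <= RHS : LHS < RHS;
}
```

Assuming a vector-iteration cost of 10 and a scalar-iteration cost of 4 at VF 4: with a trip count of 8 both variants cost 20 and the tail-folded plan is kept; with a trip count of 10 the unpredicated plan (10*2 + 4*2 = 28) beats the tail-folded one (10*3 = 30).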

// Assume vscale may be larger than 1 (or the value being tuned for),		// Assume vscale may be larger than 1 (or the value being tuned for),
// so that scalable vectorization is slightly favorable over fixed-width		// so that scalable vectorization is slightly favorable over fixed-width
// vectorization.		// vectorization.
if (A.Width.isScalable() && !B.Width.isScalable())		if (A.Width.isScalable() && !B.Width.isScalable())
return (CostA * B.Width.getFixedValue()) <= (CostB * EstimatedWidthA);		return (CostA * B.Width.getFixedValue()) <= (CostB * EstimatedWidthA);

// To avoid the need for FP division:		// To avoid the need for FP division:
// (CostA / A.Width) < (CostB / B.Width)		// (CostA / A.Width) < (CostB / B.Width)
⋯ 60 lines not shown ⋯	if (Subset == Tail || Tail[Subset.size()].first != I) {
Tail = Tail.drop_front(Subset.size());		Tail = Tail.drop_front(Subset.size());
Subset = {};		Subset = {};
} else		} else
// Grow the subset by one element		// Grow the subset by one element
Subset = Tail.take_front(Subset.size() + 1);		Subset = Tail.take_front(Subset.size() + 1);
} while (!Tail.empty());		} while (!Tail.empty());
}		}

VectorizationFactor LoopVectorizationCostModel::selectVectorizationFactor(		std::optional<VectorizationFactor>
const ElementCountSet &VFCandidates) {		LoopVectorizationPlanner::selectVectorizationFactor() {
InstructionCost ExpectedCost = expectedCost(ElementCount::getFixed(1)).first;		LLVM_DEBUG(printPlans(dbgs()));

		// If we had no plans as they were all invalid, return the invalid cost
		if (VPlans.size() == 0)
		return std::nullopt;

		// If we only have one plan due to the UserVF, return it. We try with both
		// predicated and unpredicated loops.
		ElementCount UserVF = Hints.getWidth();
		bool UserPredicated = Hints.getPredicate();
		if (UserVF && hasPlanWithVF(UserVF, UserPredicated)) {
		VPlan &Plan = getBestPlanFor(UserVF, UserPredicated);
		auto Cost = Plan.getCostModel()->expectedCost(UserVF);
		if (Cost.first.isValid())
		return VectorizationFactor(UserVF, UserPredicated, Cost.first, 0);
		} else if (UserVF && hasPlanWithVF(UserVF, !UserPredicated)) {
		VPlan &Plan = getBestPlanFor(UserVF, !UserPredicated);
		auto Cost = Plan.getCostModel()->expectedCost(UserVF);
		if (Cost.first.isValid())
		return VectorizationFactor(UserVF, !UserPredicated, Cost.first, 0);
		}

		assert(VPlans[0]->hasScalarVFOnly() &&
		"Expected Scalar VPlan to be the first candidate");

		InstructionCost ExpectedCost =
		VPlans[0]->getCostModel()->expectedCost(ElementCount::getFixed(1)).first;
LLVM_DEBUG(dbgs() << "LV: Scalar loop costs: " << ExpectedCost << ".\n");		LLVM_DEBUG(dbgs() << "LV: Scalar loop costs: " << ExpectedCost << ".\n");
assert(ExpectedCost.isValid() && "Unexpected invalid cost for scalar loop");		assert(ExpectedCost.isValid() && "Unexpected invalid cost for scalar loop");
assert(VFCandidates.count(ElementCount::getFixed(1)) &&
"Expected Scalar VF to be a candidate");

const VectorizationFactor ScalarCost(ElementCount::getFixed(1), ExpectedCost,		const VectorizationFactor ScalarCost(ElementCount::getFixed(1),
ExpectedCost);		VPlans[0]->foldTailByMasking(),
		david-arm: This should always be false for VF=1, right?
		dmgreen (Author): Not in all cases. Cases that are always predicated (like CM_ScalarEpilogueNotAllowedUsePredicate or CM_ScalarEpilogueNotAllowedLowTripLoop) will have VF=1 with tail folding still. They won't have any unpredicated vplans.
		ExpectedCost, ExpectedCost);
VectorizationFactor ChosenFactor = ScalarCost;		VectorizationFactor ChosenFactor = ScalarCost;

bool ForceVectorization = Hints->getForce() == LoopVectorizeHints::FK_Enabled;		bool ForceVectorization = Hints.getForce() == LoopVectorizeHints::FK_Enabled;
if (ForceVectorization && VFCandidates.size() > 1) {		if (ForceVectorization && VPlans.size() > 1) {
// Ignore scalar width, because the user explicitly wants vectorization.		// Ignore scalar width, because the user explicitly wants vectorization.
// Initialize cost to max so that VF = 2 is, at least, chosen during cost		// Initialize cost to max so that VF = 2 is, at least, chosen during cost
// evaluation.		// evaluation.
ChosenFactor.Cost = InstructionCost::getMax();		ChosenFactor.Cost = InstructionCost::getMax();
}		}

SmallVector<InstructionVFPair> InvalidCosts;		SmallVector<InstructionVFPair> InvalidCosts;
for (const auto &i : VFCandidates) {		for (const VPlanPtr &VPlan : drop_begin(VPlans)) {
		for (const ElementCount &i : VPlan->getVFs()) {
// The cost for scalar VF=1 is already calculated, so ignore it.		// The cost for scalar VF=1 is already calculated, so ignore it.
if (i.isScalar())		if (i.isScalar())
continue;		continue;

VectorizationCostTy C = expectedCost(i, &InvalidCosts);		LoopVectorizationCostModel::VectorizationCostTy C =
VectorizationFactor Candidate(i, C.first, ScalarCost.ScalarCost);		VPlan->getCostModel()->expectedCost(i, &InvalidCosts);
		VectorizationFactor Candidate(i, VPlan->foldTailByMasking(), C.first,
		ScalarCost.ScalarCost);

#ifndef NDEBUG		#ifndef NDEBUG
unsigned AssumedMinimumVscale = 1;		unsigned AssumedMinimumVscale = 1;
if (std::optional<unsigned> VScale = getVScaleForTuning(TheFunction, TTI))		if (std::optional<unsigned> VScale =
		getVScaleForTuning(OrigLoop->getHeader()->getParent(), *TTI))
AssumedMinimumVscale = *VScale;		AssumedMinimumVscale = *VScale;
unsigned Width =		unsigned Width =
Candidate.Width.isScalable()		Candidate.Width.isScalable()
? Candidate.Width.getKnownMinValue() * AssumedMinimumVscale		? Candidate.Width.getKnownMinValue() * AssumedMinimumVscale
: Candidate.Width.getFixedValue();		: Candidate.Width.getFixedValue();
LLVM_DEBUG(dbgs() << "LV: Vector loop of width " << i << " costs: "		LLVM_DEBUG(
<< Candidate.Cost << " => " << (Candidate.Cost / Width));		dbgs() << "LV: " << (VPlan->foldTailByMasking() ? "Tail folded " : "")
		<< "Vector loop of width " << i << " costs: " << Candidate.Cost
		<< " => " << (Candidate.Cost / Width));
if (i.isScalable())		if (i.isScalable())
LLVM_DEBUG(dbgs() << " (assuming a minimum vscale of "		LLVM_DEBUG(dbgs() << " (assuming a minimum vscale of "
<< AssumedMinimumVscale << ")");		<< AssumedMinimumVscale << ")");
LLVM_DEBUG(dbgs() << ".\n");		LLVM_DEBUG(dbgs() << ".\n");
#endif		#endif

if (!C.second && !ForceVectorization) {		if (!C.second && !ForceVectorization) {
LLVM_DEBUG(		LLVM_DEBUG(
dbgs() << "LV: Not considering vector loop of width " << i		dbgs()
		<< "LV: Not considering vector loop of width " << i
<< " because it will not generate any vector instructions.\n");		<< " because it will not generate any vector instructions.\n");
continue;		continue;
}		}

		// FIXME: Possibly remove EnableCondStoresVectorization now.
		if (!EnableCondStoresVectorization &&
		VPlan->getCostModel()->getNumPredStores()) {
		reportVectorizationFailure(
		"There are conditional stores.",
		"store that is conditionally executed prevents vectorization",
		"ConditionalStore", ORE, OrigLoop);
		continue;
		}

// If profitable add it to ProfitableVF list.		// If profitable add it to ProfitableVF list.
if (isMoreProfitable(Candidate, ScalarCost))		if (isMoreProfitable(Candidate, ScalarCost))
ProfitableVFs.push_back(Candidate);		ProfitableVFs.push_back(Candidate);

if (isMoreProfitable(Candidate, ChosenFactor))		if (isMoreProfitable(Candidate, ChosenFactor))
ChosenFactor = Candidate;		ChosenFactor = Candidate;
}		}

emitInvalidCostRemarks(InvalidCosts, ORE, TheLoop);

if (!EnableCondStoresVectorization && NumPredStores) {
reportVectorizationFailure("There are conditional stores.",
"store that is conditionally executed prevents vectorization",
"ConditionalStore", ORE, TheLoop);
ChosenFactor = ScalarCost;
}		}

LLVM_DEBUG(if (ForceVectorization && !ChosenFactor.Width.isScalar() &&		emitInvalidCostRemarks(InvalidCosts, ORE, OrigLoop);
!isMoreProfitable(ChosenFactor, ScalarCost)) dbgs()
<< "LV: Vectorization seems to be not beneficial, "		LLVM_DEBUG({
<< "but was forced by a user.\n");		if (ForceVectorization && !ChosenFactor.Width.isScalar() &&
LLVM_DEBUG(dbgs() << "LV: Selecting VF: " << ChosenFactor.Width << ".\n");		!isMoreProfitable(ChosenFactor, ScalarCost))
		dbgs() << "LV: Vectorization seems to be not beneficial, "
		<< "but was forced by a user.\n";
		});
		LLVM_DEBUG(dbgs() << "LV: Selecting "
		<< (ChosenFactor.FoldTailByMasking ? "Tail folded " : "")
		<< "VF: " << ChosenFactor.Width << ".\n");
		assert((ChosenFactor.Width.isScalar() || ChosenFactor.ScalarCost > 0) &&
		"when vectorizing, the scalar cost must be non-zero.");
return ChosenFactor;		return ChosenFactor;
}		}
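In the new version, selectVectorizationFactor walks every VPlan and every VF inside it, so a tail-folded plan and an unfolded plan of the same width are costed directly against each other, with the scalar plan as the baseline. A minimal sketch of that comparison, assuming a hypothetical `Candidate` struct in place of `VectorizationFactor` and a plain cost-per-lane rule for `isMoreProfitable`; LLVM's real `isMoreProfitable` applies additional trip-count and scalable-vector rules that are not modelled here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for VectorizationFactor.
struct Candidate {
  unsigned Width;          // lanes per iteration (1 == scalar loop)
  bool FoldTailByMasking;  // true if the plan predicates the tail
  uint64_t Cost;           // estimated cost of one loop iteration
};

// A is more profitable than B if its cost per lane is lower:
// A.Cost / A.Width < B.Cost / B.Width, cross-multiplied to avoid division.
static bool isMoreProfitable(const Candidate &A, const Candidate &B) {
  return A.Cost * B.Width < B.Cost * A.Width;
}

// Start from the scalar plan and keep the cheapest candidate per lane.
// Tail-folded and unfolded plans compete in the same loop.
static Candidate selectBest(const Candidate &Scalar,
                            const std::vector<Candidate> &VectorCandidates) {
  assert(Scalar.Width == 1 && "expected the scalar plan as the baseline");
  Candidate Chosen = Scalar;
  for (const Candidate &C : VectorCandidates)
    if (isMoreProfitable(C, Chosen))
      Chosen = C;
  return Chosen;
}
```

Cross-multiplying keeps the comparison exact; dividing integer costs by the width would round both per-lane costs down and could make distinct candidates look equal.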

bool LoopVectorizationCostModel::isCandidateForEpilogueVectorization(		bool LoopVectorizationPlanner::isCandidateForEpilogueVectorization(
const Loop &L, ElementCount VF) const {		const Loop &L, ElementCount VF) const {
// Cross iteration phis such as reductions need special handling and are		// Cross iteration phis such as reductions need special handling and are
// currently unsupported.		// currently unsupported.
if (any_of(L.getHeader()->phis(),		if (any_of(L.getHeader()->phis(),
[&](PHINode &Phi) { return Legal->isFixedOrderRecurrence(&Phi); }))		[&](PHINode &Phi) { return Legal->isFixedOrderRecurrence(&Phi); }))
return false;		return false;

// Phis with uses outside of the loop require special handling and are		// Phis with uses outside of the loop require special handling and are
(14 unchanged lines not shown)	bool LoopVectorizationPlanner::isCandidateForEpilogueVectorization(
// non-latch exits properly. It may be fine, but it needs auditted and		// non-latch exits properly. It may be fine, but it needs auditted and
// tested.		// tested.
if (L.getExitingBlock() != L.getLoopLatch())		if (L.getExitingBlock() != L.getLoopLatch())
return false;		return false;

return true;		return true;
}		}

bool LoopVectorizationCostModel::isEpilogueVectorizationProfitable(		bool LoopVectorizationPlanner::isEpilogueVectorizationProfitable(
const ElementCount VF) const {		const ElementCount VF) const {
// FIXME: We need a much better cost-model to take different parameters such		// FIXME: We need a much better cost-model to take different parameters such
// as register pressure, code size increase and cost of extra branches into		// as register pressure, code size increase and cost of extra branches into
// account. For now we apply a very crude heuristic and only consider loops		// account. For now we apply a very crude heuristic and only consider loops
// with vectorization factors larger than a certain value.		// with vectorization factors larger than a certain value.

// Allow the target to opt out entirely.		// Allow the target to opt out entirely.
if (!TTI.preferEpilogueVectorization())		if (!TTI->preferEpilogueVectorization())
return false;		return false;

// We also consider epilogue vectorization unprofitable for targets that don't		// We also consider epilogue vectorization unprofitable for targets that don't
// consider interleaving beneficial (eg. MVE).		// consider interleaving beneficial (eg. MVE).
if (TTI.getMaxInterleaveFactor(VF) <= 1)		if (TTI->getMaxInterleaveFactor(VF) <= 1)
return false;		return false;

unsigned Multiplier = 1;		unsigned Multiplier = 1;
if (VF.isScalable())		if (VF.isScalable())
Multiplier = getVScaleForTuning(TheFunction, TTI).value_or(1);		Multiplier = getVScaleForTuning(OrigLoop->getHeader()->getParent(), *TTI)
		.value_or(1);
if ((Multiplier * VF.getKnownMinValue()) >= EpilogueVectorizationMinVF)		if ((Multiplier * VF.getKnownMinValue()) >= EpilogueVectorizationMinVF)
return true;		return true;
return false;		return false;
}		}
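The last step of isEpilogueVectorizationProfitable is plain arithmetic on the main-loop VF: a scalable VF is scaled by the tuning vscale (defaulting to 1), and the result is compared against `EpilogueVectorizationMinVF`. A sketch of only that step, using a hypothetical helper `epilogueWorthIt` that takes the threshold as a parameter (no particular default value is assumed) and omits the two TTI opt-outs:

```cpp
#include <cassert>
#include <optional>

// Hypothetical helper mirroring the final check of
// isEpilogueVectorizationProfitable. Fixed VFs ignore the tuning vscale.
static bool epilogueWorthIt(unsigned KnownMinVF, bool IsScalable,
                            std::optional<unsigned> VScaleForTuning,
                            unsigned MinVF) {
  assert(KnownMinVF > 0 && "VF must be non-zero");
  unsigned Multiplier = 1;
  if (IsScalable)
    Multiplier = VScaleForTuning.value_or(1);
  return Multiplier * KnownMinVF >= MinVF;
}
```

For example, a main loop of `vscale x 4` with a tuning vscale of 2 is treated as 8 lanes, so with a threshold of 16 the epilogue would not be vectorized.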

VectorizationFactor		VectorizationFactor LoopVectorizationPlanner::selectEpilogueVectorizationFactor(
LoopVectorizationCostModel::selectEpilogueVectorizationFactor(		const VectorizationFactor &MainLoopVF) {
const ElementCount MainLoopVF, const LoopVectorizationPlanner &LVP) {
VectorizationFactor Result = VectorizationFactor::Disabled();		VectorizationFactor Result = VectorizationFactor::Disabled();
if (!EnableEpilogueVectorization) {		if (!EnableEpilogueVectorization) {
LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization is disabled.\n");		LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization is disabled.\n");
return Result;		return Result;
}		}

if (!isScalarEpilogueAllowed()) {		if (MainLoopVF.FoldTailByMasking) {
LLVM_DEBUG(dbgs() << "LEV: Unable to vectorize epilogue because no "		LLVM_DEBUG(dbgs() << "LEV: Epilogue not required as the vector loop is "
"epilogue is allowed.\n");		"predicated.\n";);
return Result;		return Result;
}		}

// Not really a cost consideration, but check for unsupported cases here to		// Not really a cost consideration, but check for unsupported cases here to
// simplify the logic.		// simplify the logic.
if (!isCandidateForEpilogueVectorization(*TheLoop, MainLoopVF)) {		if (!isCandidateForEpilogueVectorization(*OrigLoop, MainLoopVF.Width)) {
LLVM_DEBUG(dbgs() << "LEV: Unable to vectorize epilogue because the loop "		LLVM_DEBUG(dbgs() << "LEV: Unable to vectorize epilogue because the loop "
"is not a supported candidate.\n");		"is not a supported candidate.\n");
return Result;		return Result;
}		}

if (EpilogueVectorizationForceVF > 1) {		if (EpilogueVectorizationForceVF > 1) {
LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization factor is forced.\n");		LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization factor is forced.\n");
ElementCount ForcedEC = ElementCount::getFixed(EpilogueVectorizationForceVF);		ElementCount ForcedEC = ElementCount::getFixed(EpilogueVectorizationForceVF);
if (LVP.hasPlanWithVF(ForcedEC))		if (hasPlanWithVF(ForcedEC, false))
return {ForcedEC, 0, 0};		return {ForcedEC, false, 0, 0};
else {		else {
LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization forced factor is not "		LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization forced factor is not "
"viable.\n");		"viable.\n");
return Result;		return Result;
}		}
}		}

if (TheLoop->getHeader()->getParent()->hasOptSize() ||		Function *TheFunction = OrigLoop->getHeader()->getParent();
TheLoop->getHeader()->getParent()->hasMinSize()) {		if (TheFunction->hasOptSize() || TheFunction->hasMinSize()) {
LLVM_DEBUG(		LLVM_DEBUG(
dbgs() << "LEV: Epilogue vectorization skipped due to opt for size.\n");		dbgs() << "LEV: Epilogue vectorization skipped due to opt for size.\n");
return Result;		return Result;
}		}

if (!isEpilogueVectorizationProfitable(MainLoopVF)) {		if (!isEpilogueVectorizationProfitable(MainLoopVF.Width)) {
LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization is not profitable for "		LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization is not profitable for "
"this loop\n");		"this loop\n");
return Result;		return Result;
}		}

// If MainLoopVF = vscale x 2, and vscale is expected to be 4, then we know		// If MainLoopVF = vscale x 2, and vscale is expected to be 4, then we know
// the main loop handles 8 lanes per iteration. We could still benefit from		// the main loop handles 8 lanes per iteration. We could still benefit from
// vectorizing the epilogue loop with VF=4.		// vectorizing the epilogue loop with VF=4.
ElementCount EstimatedRuntimeVF = MainLoopVF;		ElementCount EstimatedRuntimeVF = MainLoopVF.Width;
if (MainLoopVF.isScalable()) {		if (MainLoopVF.Width.isScalable()) {
EstimatedRuntimeVF = ElementCount::getFixed(MainLoopVF.getKnownMinValue());		EstimatedRuntimeVF =
if (std::optional<unsigned> VScale = getVScaleForTuning(TheFunction, TTI))		ElementCount::getFixed(MainLoopVF.Width.getKnownMinValue());
		if (std::optional<unsigned> VScale = getVScaleForTuning(TheFunction, *TTI))
EstimatedRuntimeVF = VScale;		EstimatedRuntimeVF = VScale;
}		}

for (auto &NextVF : ProfitableVFs)		for (auto &NextVF : ProfitableVFs)
if (((!NextVF.Width.isScalable() && MainLoopVF.isScalable() &&		if (((!NextVF.Width.isScalable() && MainLoopVF.Width.isScalable() &&
ElementCount::isKnownLT(NextVF.Width, EstimatedRuntimeVF)) ||		ElementCount::isKnownLT(NextVF.Width, EstimatedRuntimeVF)) ||
ElementCount::isKnownLT(NextVF.Width, MainLoopVF)) &&		ElementCount::isKnownLT(NextVF.Width, MainLoopVF.Width)) &&
(Result.Width.isScalar() || isMoreProfitable(NextVF, Result)) &&		(Result.Width.isScalar() || isMoreProfitable(NextVF, Result)) &&
LVP.hasPlanWithVF(NextVF.Width))		hasPlanWithVF(NextVF.Width, NextVF.FoldTailByMasking))
Result = NextVF;		Result = NextVF;

if (Result != VectorizationFactor::Disabled())		if (Result != VectorizationFactor::Disabled())
LLVM_DEBUG(dbgs() << "LEV: Vectorizing epilogue loop with VF = "		LLVM_DEBUG(dbgs() << "LEV: Vectorizing epilogue loop with VF = "
<< Result.Width << "\n");		<< Result.Width << "\n");
return Result;		return Result;
}		}
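The patched selectEpilogueVectorizationFactor now returns early when the chosen main loop is tail folded, since a predicated main loop covers every iteration and leaves no remainder for an epilogue; otherwise it picks the most profitable VF narrower than the main loop's. A simplified sketch, assuming fixed-width VFs only and a hypothetical `VFChoice` struct; the scalable-VF runtime estimate and the `hasPlanWithVF` check are left out.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical, fixed-width-only stand-in for VectorizationFactor.
struct VFChoice {
  unsigned Width;
  bool FoldTailByMasking;
  unsigned Cost;
};

// No epilogue for a tail-folded main loop; otherwise return the cheapest
// (per lane) profitable VF that is narrower than the main loop's VF.
static std::optional<VFChoice>
pickEpilogueVF(const VFChoice &MainLoopVF,
               const std::vector<VFChoice> &ProfitableVFs) {
  assert(MainLoopVF.Width > 1 && "expected a vectorized main loop");
  if (MainLoopVF.FoldTailByMasking)
    return std::nullopt; // the predicated main loop covers every iteration
  std::optional<VFChoice> Result;
  for (const VFChoice &Next : ProfitableVFs) {
    if (Next.Width >= MainLoopVF.Width)
      continue;
    // Cost-per-lane comparison, cross-multiplied to avoid division.
    if (!Result || Next.Cost * Result->Width < Result->Cost * Next.Width)
      Result = Next;
  }
  return Result;
}
```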

(68 unchanged lines not shown)	for (Instruction &I : BB->instructionsWithoutDebug()) {
"Expected the load/store/recurrence type to be sized");		"Expected the load/store/recurrence type to be sized");

ElementTypesInLoop.insert(T);		ElementTypesInLoop.insert(T);
}		}
}		}
}		}

unsigned		unsigned
LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,		LoopVectorizationCostModel::selectInterleaveCount(ElementCount VF,
		SjoerdMeijer: Nit: perhaps rename this and in some other places to VS, so that the VF.Width references below become VS.Width.
InstructionCost LoopCost) {		InstructionCost LoopCost) {
		david-arm: Again, it looks like the change in prototype doesn't need to be part of this patch and might be a useful tidy-up NFC patch.
		dmgreen (Author): Thanks Yeah - That is a good idea. I can actually just remove this from the current version of the patch, I believe.
// -- The interleave heuristics --		// -- The interleave heuristics --
// We interleave the loop in order to expose ILP and reduce the loop overhead.		// We interleave the loop in order to expose ILP and reduce the loop overhead.
// There are many micro-architectural considerations that we can't predict		// There are many micro-architectural considerations that we can't predict
// at this level. For example, frontend pressure (on decode or fetch) due to		// at this level. For example, frontend pressure (on decode or fetch) due to
// code size, or the number and capabilities of the execution ports.		// code size, or the number and capabilities of the execution ports.
//		//
// We use the following heuristics to select the interleave count:		// We use the following heuristics to select the interleave count:
// 1. If the code has reductions, then we interleave to break the cross		// 1. If the code has reductions, then we interleave to break the cross
(758 unchanged lines not shown)	LoopVectorizationCostModel::getConsecutiveMemOpCost(Instruction *I,
unsigned AS = getLoadStoreAddressSpace(I);		unsigned AS = getLoadStoreAddressSpace(I);
int ConsecutiveStride = Legal->isConsecutivePtr(ValTy, Ptr);		int ConsecutiveStride = Legal->isConsecutivePtr(ValTy, Ptr);
enum TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;		enum TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;

assert((ConsecutiveStride == 1 || ConsecutiveStride == -1) &&		assert((ConsecutiveStride == 1 || ConsecutiveStride == -1) &&
"Stride should be 1 or -1 for consecutive memory access");		"Stride should be 1 or -1 for consecutive memory access");
const Align Alignment = getLoadStoreAlignment(I);		const Align Alignment = getLoadStoreAlignment(I);
InstructionCost Cost = 0;		InstructionCost Cost = 0;
if (Legal->isMaskRequired(I)) {		if (Legal->isMaskRequired(foldTailByMasking(), I)) {
Cost += TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS,		Cost += TTI.getMaskedMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS,
CostKind);		CostKind);
} else {		} else {
TTI::OperandValueInfo OpInfo = TTI::getOperandInfo(I->getOperand(0));		TTI::OperandValueInfo OpInfo = TTI::getOperandInfo(I->getOperand(0));
Cost += TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS,		Cost += TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS,
CostKind, OpInfo, I);		CostKind, OpInfo, I);
}		}

(37 unchanged lines not shown)	LoopVectorizationCostModel::getGatherScatterCost(Instruction *I,
ElementCount VF) {		ElementCount VF) {
Type *ValTy = getLoadStoreType(I);		Type *ValTy = getLoadStoreType(I);
auto *VectorTy = cast<VectorType>(ToVectorTy(ValTy, VF));		auto *VectorTy = cast<VectorType>(ToVectorTy(ValTy, VF));
const Align Alignment = getLoadStoreAlignment(I);		const Align Alignment = getLoadStoreAlignment(I);
const Value *Ptr = getLoadStorePointerOperand(I);		const Value *Ptr = getLoadStorePointerOperand(I);

return TTI.getAddressComputationCost(VectorTy) +		return TTI.getAddressComputationCost(VectorTy) +
TTI.getGatherScatterOpCost(		TTI.getGatherScatterOpCost(
I->getOpcode(), VectorTy, Ptr, Legal->isMaskRequired(I), Alignment,		I->getOpcode(), VectorTy, Ptr,
		Legal->isMaskRequired(foldTailByMasking(), I), Alignment,
TargetTransformInfo::TCK_RecipThroughput, I);		TargetTransformInfo::TCK_RecipThroughput, I);
}		}

InstructionCost		InstructionCost
LoopVectorizationCostModel::getInterleaveGroupCost(Instruction *I,		LoopVectorizationCostModel::getInterleaveGroupCost(Instruction *I,
ElementCount VF) {		ElementCount VF) {
// TODO: Once we have support for interleaving with scalable vectors		// TODO: Once we have support for interleaving with scalable vectors
// we can calculate the cost properly here.		// we can calculate the cost properly here.
(18 unchanged lines not shown)	if (Group->getMember(IF))
Indices.push_back(IF);		Indices.push_back(IF);

// Calculate the cost of the whole interleaved group.		// Calculate the cost of the whole interleaved group.
bool UseMaskForGaps =		bool UseMaskForGaps =
(Group->requiresScalarEpilogue() && !isScalarEpilogueAllowed()) ||		(Group->requiresScalarEpilogue() && !isScalarEpilogueAllowed()) ||
(isa<StoreInst>(I) && (Group->getNumMembers() < Group->getFactor()));		(isa<StoreInst>(I) && (Group->getNumMembers() < Group->getFactor()));
InstructionCost Cost = TTI.getInterleavedMemoryOpCost(		InstructionCost Cost = TTI.getInterleavedMemoryOpCost(
I->getOpcode(), WideVecTy, Group->getFactor(), Indices, Group->getAlign(),		I->getOpcode(), WideVecTy, Group->getFactor(), Indices, Group->getAlign(),
AS, CostKind, Legal->isMaskRequired(I), UseMaskForGaps);		AS, CostKind, Legal->isMaskRequired(foldTailByMasking(), I),
		UseMaskForGaps);

if (Group->isReverse()) {		if (Group->isReverse()) {
// TODO: Add support for reversed masked interleaved access.		// TODO: Add support for reversed masked interleaved access.
assert(!Legal->isMaskRequired(I) &&		assert(!Legal->isMaskRequired(foldTailByMasking(), I) &&
"Reverse masked interleaved access not supported.");		"Reverse masked interleaved access not supported.");
Cost += Group->getNumMembers() *		Cost += Group->getNumMembers() *
TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy,		TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy,
std::nullopt, CostKind, 0);		std::nullopt, CostKind, 0);
}		}
return Cost;		return Cost;
}		}
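In getConsecutiveMemOpCost, getGatherScatterCost and getInterleaveGroupCost, each `Legal->isMaskRequired(I)` call now also takes `foldTailByMasking()`, because whether an access needs a mask depends on which plan is being costed. One way to picture that, purely as an assumption about the data layout (the LoopVectorizationLegality side of the patch is not shown in this section), is two masked-instruction sets keyed by the tail-folding flag. `MaskedOps`, `hintFor` and the string instruction names are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical reduced version of TTI::CastContextHint.
enum class CastContextHint { Normal, Masked };

// Hypothetical: one set of masked instructions per tail-folding choice.
// WithoutTailFold holds accesses that are conditional in the source loop;
// WithTailFold additionally holds accesses that only need a mask because
// the whole body is predicated on the tail mask.
struct MaskedOps {
  std::set<std::string> WithoutTailFold;
  std::set<std::string> WithTailFold;

  bool isMaskRequired(bool FoldTailByMasking, const std::string &Inst) const {
    const std::set<std::string> &Set =
        FoldTailByMasking ? WithTailFold : WithoutTailFold;
    return Set.count(Inst) != 0;
  }
};

// Maps the mask decision for a widened access to a cast-context hint.
static CastContextHint hintFor(const MaskedOps &Ops, bool FoldTailByMasking,
                               const std::string &Inst) {
  return Ops.isMaskRequired(FoldTailByMasking, Inst) ? CastContextHint::Masked
                                                     : CastContextHint::Normal;
}
```

The same unconditional load is costed as a plain access in the unfolded plan and as a masked access in the tail-folded plan, which is why these maps can no longer be keyed on VF alone.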

(699 unchanged lines not shown)	auto ComputeCCH = [&](Instruction *I) -> TTI::CastContextHint {

switch (getWideningDecision(I, VF)) {		switch (getWideningDecision(I, VF)) {
case LoopVectorizationCostModel::CM_GatherScatter:		case LoopVectorizationCostModel::CM_GatherScatter:
return TTI::CastContextHint::GatherScatter;		return TTI::CastContextHint::GatherScatter;
case LoopVectorizationCostModel::CM_Interleave:		case LoopVectorizationCostModel::CM_Interleave:
return TTI::CastContextHint::Interleave;		return TTI::CastContextHint::Interleave;
case LoopVectorizationCostModel::CM_Scalarize:		case LoopVectorizationCostModel::CM_Scalarize:
case LoopVectorizationCostModel::CM_Widen:		case LoopVectorizationCostModel::CM_Widen:
return Legal->isMaskRequired(I) ? TTI::CastContextHint::Masked		return Legal->isMaskRequired(foldTailByMasking(), I)
		? TTI::CastContextHint::Masked
: TTI::CastContextHint::Normal;		: TTI::CastContextHint::Normal;
case LoopVectorizationCostModel::CM_Widen_Reverse:		case LoopVectorizationCostModel::CM_Widen_Reverse:
return TTI::CastContextHint::Reversed;		return TTI::CastContextHint::Reversed;
case LoopVectorizationCostModel::CM_Unknown:		case LoopVectorizationCostModel::CM_Unknown:
llvm_unreachable("Instr did not go through cost modelling?");		llvm_unreachable("Instr did not go through cost modelling?");
}		}

llvm_unreachable("Unhandled case!");		llvm_unreachable("Unhandled case!");
};		};
(152 unchanged lines not shown)
static unsigned determineVPlanVF(const unsigned WidestVectorRegBits,		static unsigned determineVPlanVF(const unsigned WidestVectorRegBits,
LoopVectorizationCostModel &CM) {		LoopVectorizationCostModel &CM) {
unsigned WidestType;		unsigned WidestType;
std::tie(std::ignore, WidestType) = CM.getSmallestAndWidestTypes();		std::tie(std::ignore, WidestType) = CM.getSmallestAndWidestTypes();
return WidestVectorRegBits / WidestType;		return WidestVectorRegBits / WidestType;
}		}
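determineVPlanVF is a single division: the number of lanes of the widest element type that fit in the widest vector register. For example, 128-bit registers and a widest type of 32 bits give VF = 4. An equivalent standalone helper; the name `determineVF` and the zero-width guard are hypothetical additions not present in the original function.

```cpp
#include <cassert>

// Lanes of the widest element type that fit in the widest vector register.
static unsigned determineVF(unsigned WidestVectorRegBits,
                            unsigned WidestTypeBits) {
  assert(WidestTypeBits != 0 && "no element types were recorded");
  return WidestVectorRegBits / WidestTypeBits;
}
```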

VectorizationFactor		VectorizationFactor
LoopVectorizationPlanner::planInVPlanNativePath(ElementCount UserVF) {		LoopVectorizationPlanner::planInVPlanNativePath(LoopVectorizationCostModel &CM,
		ElementCount UserVF) {
assert(!UserVF.isScalable() && "scalable vectors not yet supported");		assert(!UserVF.isScalable() && "scalable vectors not yet supported");
ElementCount VF = UserVF;		ElementCount VF = UserVF;
// Outer loop handling: They may require CFG and instruction level		// Outer loop handling: They may require CFG and instruction level
// transformations before even evaluating whether vectorization is profitable.		// transformations before even evaluating whether vectorization is profitable.
// Since we cannot modify the incoming IR, we need to build VPlan upfront in		// Since we cannot modify the incoming IR, we need to build VPlan upfront in
// the vectorization pipeline.		// the vectorization pipeline.
if (!OrigLoop->isInnermost()) {		if (!OrigLoop->isInnermost()) {
// If the user doesn't provide a vectorization factor, determine a		// If the user doesn't provide a vectorization factor, determine a
(12 unchanged lines not shown)	if (UserVF.isZero()) {
VF = ElementCount::getFixed(4);		VF = ElementCount::getFixed(4);
}		}
}		}
assert(EnableVPlanNativePath && "VPlan-native path is not enabled.");		assert(EnableVPlanNativePath && "VPlan-native path is not enabled.");
assert(isPowerOf2_32(VF.getKnownMinValue()) &&		assert(isPowerOf2_32(VF.getKnownMinValue()) &&
"VF needs to be a power of two");		"VF needs to be a power of two");
LLVM_DEBUG(dbgs() << "LV: Using " << (!UserVF.isZero() ? "user " : "")		LLVM_DEBUG(dbgs() << "LV: Using " << (!UserVF.isZero() ? "user " : "")
<< "VF " << VF << " to build VPlans.\n");		<< "VF " << VF << " to build VPlans.\n");
buildVPlans(VF, VF);		buildVPlans(CM, VF, VF);

// For VPlan build stress testing, we bail out after VPlan construction.		// For VPlan build stress testing, we bail out after VPlan construction.
if (VPlanBuildStressTest)		if (VPlanBuildStressTest)
return VectorizationFactor::Disabled();		return VectorizationFactor::Disabled();

return {VF, 0 /*Cost*/, 0 /* ScalarCost */};		return {VF, false /*TailFold*/, 0 /*Cost*/, 0 /* ScalarCost */};
		david-arm: Perhaps worth adding a `/* ... */` comment for the new argument too? Same for any other similar places in the file.
}		}

LLVM_DEBUG(		LLVM_DEBUG(
dbgs() << "LV: Not vectorizing. Inner loops aren't supported in the "		dbgs() << "LV: Not vectorizing. Inner loops aren't supported in the "
"VPlan-native path.\n");		"VPlan-native path.\n");
return VectorizationFactor::Disabled();		return VectorizationFactor::Disabled();
}		}

std::optional<VectorizationFactor>		void LoopVectorizationPlanner::plan(LoopVectorizationCostModel &CM,
LoopVectorizationPlanner::plan(ElementCount UserVF, unsigned UserIC) {		ElementCount UserVF, unsigned UserIC) {
		CM.collectValuesToIgnore();
		CM.collectElementTypesForWidening();

assert(OrigLoop->isInnermost() && "Inner loop expected.");		assert(OrigLoop->isInnermost() && "Inner loop expected.");
FixedScalableVFPair MaxFactors = CM.computeMaxVF(UserVF, UserIC);		FixedScalableVFPair MaxFactors = CM.computeMaxVF(UserVF, UserIC);
if (!MaxFactors) // Cases that should not to be vectorized nor interleaved.		if (!MaxFactors) // Cases that should not to be vectorized nor interleaved.
return std::nullopt;		return;

// Invalidate interleave groups if all blocks of loop will be predicated.		// Invalidate interleave groups if all blocks of loop will be predicated.
if (CM.blockNeedsPredicationForAnyReason(OrigLoop->getHeader()) &&		if (CM.blockNeedsPredicationForAnyReason(OrigLoop->getHeader()) &&
!useMaskedInterleavedAccesses(*TTI)) {		!useMaskedInterleavedAccesses(*TTI)) {
LLVM_DEBUG(		LLVM_DEBUG(
dbgs()		dbgs()
<< "LV: Invalidate all interleaved groups due to fold-tail by masking "		<< "LV: Invalidate all interleaved groups due to fold-tail by masking "
"which requires masked-interleaved support.\n");		"which requires masked-interleaved support.\n");
if (CM.InterleaveInfo.invalidateGroups())		if (CM.InterleaveInfo.invalidateGroups())
// Invalidating interleave groups also requires invalidating all decisions		// Invalidating interleave groups also requires invalidating all decisions
// based on them, which includes widening decisions and uniform and scalar		// based on them, which includes widening decisions and uniform and scalar
// values.		// values.
CM.invalidateCostModelingDecisions();		CM.invalidateCostModelingDecisions();
}		}

ElementCount MaxUserVF =		ElementCount MaxUserVF =
UserVF.isScalable() ? MaxFactors.ScalableVF : MaxFactors.FixedVF;		UserVF.isScalable() ? MaxFactors.ScalableVF : MaxFactors.FixedVF;
bool UserVFIsLegal = ElementCount::isKnownLE(UserVF, MaxUserVF);		bool UserVFIsLegal = ElementCount::isKnownLE(UserVF, MaxUserVF);
if (!UserVF.isZero() && UserVFIsLegal) {		if (!UserVF.isZero() && UserVFIsLegal) {
assert(isPowerOf2_32(UserVF.getKnownMinValue()) &&		assert(isPowerOf2_32(UserVF.getKnownMinValue()) &&
"VF needs to be a power of two");		"VF needs to be a power of two");
// Collect the instructions (and their associated costs) that will be more		// Collect the instructions (and their associated costs) that will be more
// profitable to scalarize.		// profitable to scalarize.
if (CM.selectUserVectorizationFactor(UserVF)) {		if (CM.selectUserVectorizationFactor(UserVF)) {
		SjoerdMeijer: nit: TODO
LLVM_DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");		LLVM_DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");
CM.collectInLoopReductions();		CM.collectInLoopReductions();
buildVPlansWithVPRecipes(UserVF, UserVF);		buildVPlansWithVPRecipes(CM, UserVF, UserVF);
if (!hasPlanWithVF(UserVF)) {		return;
LLVM_DEBUG(dbgs() << "LV: No VPlan could be built for " << UserVF
<< ".\n");
return std::nullopt;
}

LLVM_DEBUG(printPlans(dbgs()));
return {{UserVF, 0, 0}};
} else		} else
reportVectorizationInfo("UserVF ignored because of invalid costs.",		reportVectorizationInfo("UserVF ignored because of invalid costs.",
"InvalidCost", ORE, OrigLoop);		"InvalidCost", ORE, OrigLoop);
}		}

// Populate the set of Vectorization Factor Candidates.
ElementCountSet VFCandidates;
for (auto VF = ElementCount::getFixed(1);
ElementCount::isKnownLE(VF, MaxFactors.FixedVF); VF *= 2)
VFCandidates.insert(VF);
for (auto VF = ElementCount::getScalable(1);
ElementCount::isKnownLE(VF, MaxFactors.ScalableVF); VF *= 2)
VFCandidates.insert(VF);

for (const auto &VF : VFCandidates) {
// Collect Uniform and Scalar instructions after vectorization with VF.
CM.collectUniformsAndScalars(VF);

// Collect the instructions (and their associated costs) that will be more
// profitable to scalarize.
if (VF.isVector())
CM.collectInstsToScalarize(VF);
}

CM.collectInLoopReductions();		CM.collectInLoopReductions();
buildVPlansWithVPRecipes(ElementCount::getFixed(1), MaxFactors.FixedVF);		buildVPlansWithVPRecipes(CM, ElementCount::getFixed(1), MaxFactors.FixedVF);
buildVPlansWithVPRecipes(ElementCount::getScalable(1), MaxFactors.ScalableVF);		buildVPlansWithVPRecipes(CM, ElementCount::getScalable(1), MaxFactors.ScalableVF);

LLVM_DEBUG(printPlans(dbgs()));
if (!MaxFactors.hasVector())
return VectorizationFactor::Disabled();

// Select the optimal vectorization factor.
VectorizationFactor VF = CM.selectVectorizationFactor(VFCandidates);
assert((VF.Width.isScalar() || VF.ScalarCost > 0) && "when vectorizing, the scalar cost must be non-zero.");
if (!hasPlanWithVF(VF.Width)) {
LLVM_DEBUG(dbgs() << "LV: No VPlan could be built for " << VF.Width
<< ".\n");
return std::nullopt;
}
return VF;
}		}
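With FoldTailByMasking removed from the cost model, planning can produce plans for each tail-folding choice over the same power-of-two VFs. As noted in the review thread, loops that must be predicated (for example under CM_ScalarEpilogueNotAllowedUsePredicate) get only tail-folded plans, including VF=1. The sketch enumerates such (FoldTailByMasking, VF) pairs, the pair the summary calls a VectorizationScheme. `enumerateSchemes`, `MustFoldTail` and the use of a plain bool for `MayFoldTailByMasking` are hypothetical simplifications, and only fixed VFs are covered.

```cpp
#include <cassert>
#include <initializer_list>
#include <utility>
#include <vector>

// Hypothetical enumeration of (FoldTailByMasking, VF) pairs over fixed VFs
// 1, 2, 4, ..., MaxFixedVF. Unfolded schemes come first, then folded ones.
static std::vector<std::pair<bool, unsigned>>
enumerateSchemes(unsigned MaxFixedVF, bool MayFoldTailByMasking,
                 bool MustFoldTail) {
  assert(MaxFixedVF != 0 && (MaxFixedVF & (MaxFixedVF - 1)) == 0 &&
         MaxFixedVF < (1u << 31) && "expected a power-of-two max VF");
  assert((!MustFoldTail || MayFoldTailByMasking) &&
         "a loop that must fold its tail must be allowed to");
  std::vector<std::pair<bool, unsigned>> Schemes;
  for (bool FoldTail : {false, true}) {
    if (FoldTail ? !MayFoldTailByMasking : MustFoldTail)
      continue; // this tail-folding choice is not legal for the loop
    for (unsigned VF = 1; VF <= MaxFixedVF; VF *= 2)
      Schemes.emplace_back(FoldTail, VF);
  }
  return Schemes;
}
```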

VPlan &LoopVectorizationPlanner::getBestPlanFor(ElementCount VF) const {		VPlan &LoopVectorizationPlanner::getBestPlanFor(ElementCount VF,
		bool FoldTailByMasking) const {
assert(count_if(VPlans,		assert(count_if(VPlans,
[VF](const VPlanPtr &Plan) { return Plan->hasVF(VF); }) ==		[VF, FoldTailByMasking](const VPlanPtr &Plan) {
1 &&		return Plan->hasVF(VF) &&
		david-arm: Is it worth changing `hasVF` to take the `FoldTailByMasking` as a second argument so you can just write: `assert(count_if(VPlans, [VF, FoldTailByMasking](const VPlanPtr &Plan) { return Plan->hasVF(VF, FoldTailByMasking); }) == 1 && "Best VF has not a single VPlan.");`
		dmgreen (Author): It feels like a separate parameter to me, one that would want to be checked separately. I can change it if you think it's better, but there are places that call hasVF on its own.
		Plan->foldTailByMasking() == FoldTailByMasking;
		}) == 1 &&
"Best VF has not a single VPlan.");		"Best VF has not a single VPlan.");

for (const VPlanPtr &Plan : VPlans) {		for (const VPlanPtr &Plan : VPlans) {
if (Plan->hasVF(VF))		if (Plan->hasVF(VF) && Plan->foldTailByMasking() == FoldTailByMasking)
return *Plan.get();		return *Plan.get();
}		}
llvm_unreachable("No plan found!");		llvm_unreachable("No plan found!");
}		}
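getBestPlanFor now matches on both the VF and the plan's tail-folding flag, keeping the two as separate checks as discussed in the review thread, and still expects exactly one match. A self-contained analogue, assuming a hypothetical `PlanSketch` type (a set of fixed VFs plus a flag) in place of `VPlan` and a hypothetical `findPlan` in place of `getBestPlanFor`:

```cpp
#include <cassert>
#include <cstdlib>
#include <set>
#include <vector>

// Hypothetical stand-in for VPlan: the fixed VFs it covers plus its
// tail-folding flag, checked separately rather than folded into hasVF.
struct PlanSketch {
  std::set<unsigned> VFs;
  bool FoldTailByMasking;
  bool hasVF(unsigned VF) const { return VFs.count(VF) != 0; }
};

// Return the unique plan covering VF with the requested tail-folding choice.
static const PlanSketch &findPlan(const std::vector<PlanSketch> &Plans,
                                  unsigned VF, bool FoldTailByMasking) {
  const PlanSketch *Found = nullptr;
  for (const PlanSketch &Plan : Plans) {
    if (Plan.hasVF(VF) && Plan.FoldTailByMasking == FoldTailByMasking) {
      assert(!Found && "expected exactly one matching plan");
      Found = &Plan;
    }
  }
  if (!Found)
    std::abort(); // plays the role of llvm_unreachable("No plan found!")
  return *Found;
}
```

Keeping `hasVF` single-argument means callers that only care about the VF (such as the `BestVPlan.hasVF(BestVF)` assertion in executePlan) need no change.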

static void AddRuntimeUnrollDisableMetaData(Loop *L) {		static void AddRuntimeUnrollDisableMetaData(Loop *L) {
SmallVector<Metadata *, 4> MDs;		SmallVector<Metadata *, 4> MDs;
// Reserve first location for self reference to the LoopID metadata node.		// Reserve first location for self reference to the LoopID metadata node.
[... unchanged lines omitted ...]	void LoopVectorizationPlanner::executePlan(ElementCount BestVF, unsigned BestUF,
InnerLoopVectorizer &ILV,		InnerLoopVectorizer &ILV,
DominatorTree *DT,		DominatorTree *DT,
bool IsEpilogueVectorization) {		bool IsEpilogueVectorization) {
assert(BestVPlan.hasVF(BestVF) &&		assert(BestVPlan.hasVF(BestVF) &&
"Trying to execute plan with unsupported VF");		"Trying to execute plan with unsupported VF");
assert(BestVPlan.hasUF(BestUF) &&		assert(BestVPlan.hasUF(BestUF) &&
"Trying to execute plan with unsupported UF");		"Trying to execute plan with unsupported UF");

LLVM_DEBUG(dbgs() << "Executing best plan with VF=" << BestVF << ", UF=" << BestUF		LLVM_DEBUG(dbgs() << "Executing best plan with TailFold="
<< '\n');		<< (BestVPlan.foldTailByMasking() ? "true" : "false")
		<< ", VF=" << BestVF << ", UF=" << BestUF << '\n');

// Workaround! Compute the trip count of the original loop and cache it		// Workaround! Compute the trip count of the original loop and cache it
// before we start modifying the CFG. This code has a systemic problem		// before we start modifying the CFG. This code has a systemic problem
// wherein it tries to run analysis over partially constructed IR; this is		// wherein it tries to run analysis over partially constructed IR; this is
// wrong, and not simply for SCEV. The trip count of the original loop		// wrong, and not simply for SCEV. The trip count of the original loop
// simply happens to be prone to hitting this in practice. In theory, we		// simply happens to be prone to hitting this in practice. In theory, we
// can hit the same issue for any SCEV, or ValueTracking query done during		// can hit the same issue for any SCEV, or ValueTracking query done during
// mutation. See PR49900.		// mutation. See PR49900.
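The debug output in this hunk now prints the tail-folding choice next to VF and UF, because the three together decide how the iterations of the original loop are split. A back-of-the-envelope sketch, assuming a known trip count and ignoring runtime checks, epilogue vectorization and loops that must keep at least one scalar iteration; IterationCounts and countIterations are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

struct IterationCounts {
  uint64_t VectorIters; // Iterations of the vector loop.
  uint64_t ScalarIters; // Iterations left for the scalar epilogue.
};

// Iterations executed by a plan with the given (TailFold, VF, UF) for a loop
// with trip count TC. Simplified: no runtime checks, no epilogue
// vectorization, no "must keep a scalar iteration" requirement.
IterationCounts countIterations(uint64_t TC, bool TailFold, unsigned VF,
                                unsigned UF) {
  const uint64_t Step = uint64_t(VF) * UF; // Elements per vector iteration.
  if (TailFold) // Masked-off lanes absorb the remainder: round up, no epilogue.
    return {(TC + Step - 1) / Step, 0};
  return {TC / Step, TC % Step}; // The remainder runs in the scalar loop.
}
```

For a trip count of 10 at VF=4, UF=1, the folded plan runs 3 vector iterations (the last with two lanes masked off), while the non-folded plan runs 2 vector iterations plus 2 scalar ones. Which is cheaper depends on costs the two cost models now estimate separately.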
[... unchanged lines omitted ...]	bool LoopVectorizationPlanner::getDecisionAndClampRange(
return PredicateAtRangeStart;		return PredicateAtRangeStart;
}		}
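getDecisionAndClampRange answers a yes/no question at the first VF of a range and shrinks the range so that every VF still in it gets the same answer; that is how willWiden and applyIG in this diff split the VF range. The following is a minimal model with fixed-width power-of-two VFs as plain unsigned values. VFRangeStub and decideAndClamp are hypothetical, not LLVM's VFRange or its API.

```cpp
#include <cassert>
#include <functional>

// Hypothetical half-open range of power-of-two VFs: [Start, End).
struct VFRangeStub {
  unsigned Start;
  unsigned End;
};

// Evaluate Predicate at Range.Start, then shrink Range.End to the first
// power-of-two VF at which the predicate gives a different answer.
bool decideAndClamp(const std::function<bool(unsigned)> &Predicate,
                    VFRangeStub &Range) {
  assert(Range.Start < Range.End && "empty VF range");
  bool AtStart = Predicate(Range.Start);
  for (unsigned VF = Range.Start * 2; VF < Range.End; VF *= 2) {
    if (Predicate(VF) != AtStart) {
      Range.End = VF;
      break;
    }
  }
  return AtStart;
}
```

For example, a decision that holds only for VF <= 4, asked over [2, 32), returns true and clamps the range to [2, 8). The next plan then starts at VF=8, where the decision is false.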

/// Build VPlans for the full range of feasible VF's = {\p MinVF, 2 * \p MinVF,		/// Build VPlans for the full range of feasible VF's = {\p MinVF, 2 * \p MinVF,
/// 4 * \p MinVF, ..., \p MaxVF} by repeatedly building a VPlan for a sub-range		/// 4 * \p MinVF, ..., \p MaxVF} by repeatedly building a VPlan for a sub-range
/// of VF's starting at a given VF and extending it as much as possible. Each		/// of VF's starting at a given VF and extending it as much as possible. Each
/// vectorization decision can potentially shorten this sub-range during		/// vectorization decision can potentially shorten this sub-range during
/// buildVPlan().		/// buildVPlan().
void LoopVectorizationPlanner::buildVPlans(ElementCount MinVF,		void LoopVectorizationPlanner::buildVPlans(LoopVectorizationCostModel &CM,
		ElementCount MinVF,
ElementCount MaxVF) {		ElementCount MaxVF) {
auto MaxVFTimes2 = MaxVF * 2;		auto MaxVFTimes2 = MaxVF * 2;
for (ElementCount VF = MinVF; ElementCount::isKnownLT(VF, MaxVFTimes2);) {		for (ElementCount VF = MinVF; ElementCount::isKnownLT(VF, MaxVFTimes2);) {
VFRange SubRange = {VF, MaxVFTimes2};		VFRange SubRange = {VF, MaxVFTimes2};
VPlans.push_back(buildVPlan(SubRange));		VPlans.push_back(buildVPlan(CM, SubRange));
VF = SubRange.End;		VF = SubRange.End;
}		}
}		}
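The loop above walks the power-of-two VFs and lets each buildVPlan call cover as long a prefix of the remaining VFs as it can. Because the sub-ranges are half-open, the exclusive bound is MaxVF * 2, so MaxVF itself is covered. The sketch below reproduces only the partitioning. ClampFn and partitionVFs are hypothetical, and ClampEnd stands in for whatever clamping buildVPlan performs.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical: ClampEnd(Start, End) returns the end of the largest sub-range
// starting at Start that one plan can cover (End if nothing clamps it).
using ClampFn = std::function<unsigned(unsigned, unsigned)>;

// Partition the power-of-two VFs {MinVF, ..., MaxVF} into consecutive
// half-open sub-ranges [Start, End), one per plan.
std::vector<std::pair<unsigned, unsigned>>
partitionVFs(unsigned MinVF, unsigned MaxVF, const ClampFn &ClampEnd) {
  std::vector<std::pair<unsigned, unsigned>> SubRanges;
  const unsigned MaxVFTimes2 = MaxVF * 2; // Exclusive upper bound.
  for (unsigned VF = MinVF; VF < MaxVFTimes2;) {
    unsigned End = ClampEnd(VF, MaxVFTimes2);
    assert(End > VF && "each plan must cover at least one VF");
    SubRanges.emplace_back(VF, End);
    VF = End; // The next plan starts where this one stopped.
  }
  return SubRanges;
}
```

With MinVF=2, MaxVF=16 and a decision that changes at VF=8, this yields two plans: [2, 8) covering {2, 4} and [8, 32) covering {8, 16}.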

VPValue *VPRecipeBuilder::createEdgeMask(BasicBlock *Src, BasicBlock *Dst,		VPValue *VPRecipeBuilder::createEdgeMask(BasicBlock *Src, BasicBlock *Dst,
VPlan &Plan) {		VPlan &Plan) {
assert(is_contained(predecessors(Dst), Src) && "Invalid edge");		assert(is_contained(predecessors(Dst), Src) && "Invalid edge");

[... unchanged lines omitted ...]	VPValue *VPRecipeBuilder::createBlockInMask(BasicBlock *BB, VPlan &Plan) {
// All-one mask is modelled as no-mask following the convention for masked		// All-one mask is modelled as no-mask following the convention for masked
// load/store/gather/scatter. Initialize BlockMask to no-mask.		// load/store/gather/scatter. Initialize BlockMask to no-mask.
VPValue *BlockMask = nullptr;		VPValue *BlockMask = nullptr;

if (OrigLoop->getHeader() == BB) {		if (OrigLoop->getHeader() == BB) {
if (!CM.blockNeedsPredicationForAnyReason(BB))		if (!CM.blockNeedsPredicationForAnyReason(BB))
return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.		return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.

assert(CM.foldTailByMasking() && "must fold the tail");		assert(Plan.foldTailByMasking() && "must fold the tail");

// If we're using the active lane mask for control flow, then we get the		// If we're using the active lane mask for control flow, then we get the
// mask from the active lane mask PHI that is cached in the VPlan.		// mask from the active lane mask PHI that is cached in the VPlan.
TailFoldingStyle TFStyle = CM.getTailFoldingStyle();		TailFoldingStyle TFStyle = CM.getTailFoldingStyle();
if (useActiveLaneMaskForControlFlow(TFStyle))		if (useActiveLaneMaskForControlFlow(TFStyle))
return BlockMaskCache[BB] = Plan.getActiveLaneMaskPhi();		return BlockMaskCache[BB] = Plan.getActiveLaneMaskPhi();

// Introduce the early-exit compare IV <= BTC to form header block mask.		// Introduce the early-exit compare IV <= BTC to form header block mask.
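When the tail is folded, the header mask switches off lanes that would run past the original trip count. The comment above describes it as the compare IV <= BTC, where BTC is the backedge-taken count (trip count minus one). One reason to compare against BTC rather than the trip count is that the trip count can wrap to zero when BTC is the largest value of the IV type. When the active-lane-mask style drives control flow, the mask comes from the active lane mask phi instead. Below is a scalar model of the per-lane compare, assuming a fixed-width VF and a canonical IV that starts at 0 and steps by VF; headerMask is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Header block mask for a tail-folded loop: lane L of the vector iteration
// whose canonical IV is IV is active iff IV + L <= BTC.
std::vector<bool> headerMask(uint64_t IV, unsigned VF, uint64_t BTC) {
  std::vector<bool> Mask(VF);
  for (unsigned L = 0; L < VF; ++L)
    Mask[L] = IV + L <= BTC;
  return Mask;
}
```

With a trip count of 10 (BTC = 9) and VF = 4, the iterations start at IV = 0, 4 and 8. The last one gets the mask {1, 1, 0, 0}, so elements 10 and 11 are never touched.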
[... unchanged lines omitted ...]	if (CM.isScalarAfterVectorization(I, VF) ||
return false;		return false;
return Decision != LoopVectorizationCostModel::CM_Scalarize;		return Decision != LoopVectorizationCostModel::CM_Scalarize;
};		};

if (!LoopVectorizationPlanner::getDecisionAndClampRange(willWiden, Range))		if (!LoopVectorizationPlanner::getDecisionAndClampRange(willWiden, Range))
return nullptr;		return nullptr;

VPValue *Mask = nullptr;		VPValue *Mask = nullptr;
if (Legal->isMaskRequired(I))		if (Legal->isMaskRequired(Plan->foldTailByMasking(), I))
Mask = createBlockInMask(I->getParent(), *Plan);		Mask = createBlockInMask(I->getParent(), *Plan);

// Determine if the pointer operand of the access is either consecutive or		// Determine if the pointer operand of the access is either consecutive or
// reverse consecutive.		// reverse consecutive.
LoopVectorizationCostModel::InstWidening Decision =		LoopVectorizationCostModel::InstWidening Decision =
CM.getWideningDecision(I, Range.Start);		CM.getWideningDecision(I, Range.Start);
bool Reverse = Decision == LoopVectorizationCostModel::CM_Widen_Reverse;		bool Reverse = Decision == LoopVectorizationCostModel::CM_Widen_Reverse;
bool Consecutive =		bool Consecutive =
[... unchanged lines omitted ...]	if (NeedsMask) {
// We have 2 cases that would require a mask:		// We have 2 cases that would require a mask:
// 1) The block needs to be predicated, either due to a conditional		// 1) The block needs to be predicated, either due to a conditional
// in the scalar loop or use of an active lane mask with		// in the scalar loop or use of an active lane mask with
// tail-folding, and we use the appropriate mask for the block.		// tail-folding, and we use the appropriate mask for the block.
// 2) No mask is required for the block, but the only available		// 2) No mask is required for the block, but the only available
// vector variant at this VF requires a mask, so we synthesize an		// vector variant at this VF requires a mask, so we synthesize an
// all-true mask.		// all-true mask.
VPValue *Mask = nullptr;		VPValue *Mask = nullptr;
if (Legal->isMaskRequired(CI))		if (Legal->isMaskRequired(Plan->foldTailByMasking(), CI))
Mask = createBlockInMask(CI->getParent(), *Plan);		Mask = createBlockInMask(CI->getParent(), *Plan);
else		else
Mask = Plan->getVPValueOrAddLiveIn(ConstantInt::getTrue(		Mask = Plan->getVPValueOrAddLiveIn(ConstantInt::getTrue(
IntegerType::getInt1Ty(Variant->getFunctionType()->getContext())));		IntegerType::getInt1Ty(Variant->getFunctionType()->getContext())));

VFShape Shape = VFShape::get(CI, VariantVF, /*HasGlobalPred=*/true);		VFShape Shape = VFShape::get(CI, VariantVF, /*HasGlobalPred=*/true);
unsigned MaskPos = 0;		unsigned MaskPos = 0;
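With this patch the first of the two cases above depends on the plan, because isMaskRequired now takes the plan's tail-folding flag. The same unconditional call can get the block mask in a tail-folded plan but a synthesized all-true mask in a non-folded one. The sketch below illustrates that split. needsBlockMask encodes a simplifying assumption of ours (conditional block, or tail folding, implies a mask), not LLVM's exact legality rule. BlockMask stands in for what createBlockInMask would return, and only its width is used in the second case.

```cpp
#include <cassert>
#include <vector>

// Simplifying assumption (not LLVM's exact rule): an operation needs its
// block's mask if the block is conditional in the scalar loop, or if the plan
// folds the tail, since then every block, header included, is predicated.
bool needsBlockMask(bool FoldTailByMasking, bool InConditionalBlock) {
  return InConditionalBlock || FoldTailByMasking;
}

// The two cases for a call whose only vector variant takes a mask operand.
std::vector<bool> maskForMaskedVariant(bool FoldTailByMasking,
                                       bool InConditionalBlock,
                                       const std::vector<bool> &BlockMask) {
  // Case 1: the block is predicated, so pass the block's mask.
  if (needsBlockMask(FoldTailByMasking, InConditionalBlock))
    return BlockMask;
  // Case 2: no mask is needed, so synthesize an all-true mask of equal width.
  return std::vector<bool>(BlockMask.size(), true);
}
```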

[... unchanged lines omitted ...]	VPRecipeBuilder::tryToCreateWidenRecipe(Instruction *Instr,
if (auto *SI = dyn_cast<SelectInst>(Instr)) {		if (auto *SI = dyn_cast<SelectInst>(Instr)) {
return toVPRecipeResult(new VPWidenSelectRecipe(		return toVPRecipeResult(new VPWidenSelectRecipe(
*SI, make_range(Operands.begin(), Operands.end())));		*SI, make_range(Operands.begin(), Operands.end())));
}		}

return toVPRecipeResult(tryToWiden(Instr, Operands, VPBB, Plan));		return toVPRecipeResult(tryToWiden(Instr, Operands, VPBB, Plan));
}		}

void LoopVectorizationPlanner::buildVPlansWithVPRecipes(ElementCount MinVF,		void LoopVectorizationPlanner::buildVPlansWithVPRecipes(
ElementCount MaxVF) {		LoopVectorizationCostModel &CM, ElementCount MinVF, ElementCount MaxVF) {
assert(OrigLoop->isInnermost() && "Inner loop expected.");		assert(OrigLoop->isInnermost() && "Inner loop expected.");

		for (ElementCount VF = MinVF; ElementCount::isKnownLE(VF, MaxVF); VF *= 2) {
		// Collect Uniform and Scalar instructions after vectorization with VF.
		david-arm: Wouldn't it be simpler to just write: for (ElementCount VF = MinVF; ElementCount::isKnownLE(VF, MaxVF); …
		CM.collectUniformsAndScalars(VF);

		// Collect the instructions (and their associated costs) that will be more
		// profitable to scalarize.
		if (VF.isVector())
		CM.collectInstsToScalarize(VF);
		}

// Add assume instructions we need to drop to DeadInstructions, to prevent		// Add assume instructions we need to drop to DeadInstructions, to prevent
// them from being added to the VPlan.		// them from being added to the VPlan.
// TODO: We only need to drop assumes in blocks that get flattened. If the		// TODO: We only need to drop assumes in blocks that get flattened. If the
// control flow is preserved, we should keep them.		// control flow is preserved, we should keep them.
SmallPtrSet<Instruction *, 4> DeadInstructions;		SmallPtrSet<Instruction *, 4> DeadInstructions;
auto &ConditionalAssumes = Legal->getConditionalAssumes();		auto &ConditionalAssumes =
		Legal->getConditionalAssumes(CM.foldTailByMasking());
DeadInstructions.insert(ConditionalAssumes.begin(), ConditionalAssumes.end());		DeadInstructions.insert(ConditionalAssumes.begin(), ConditionalAssumes.end());

auto MaxVFTimes2 = MaxVF * 2;		auto MaxVFTimes2 = MaxVF * 2;
for (ElementCount VF = MinVF; ElementCount::isKnownLT(VF, MaxVFTimes2);) {		for (ElementCount VF = MinVF; ElementCount::isKnownLT(VF, MaxVFTimes2);) {
VFRange SubRange = {VF, MaxVFTimes2};		VFRange SubRange = {VF, MaxVFTimes2};
if (auto Plan = tryToBuildVPlanWithVPRecipes(SubRange, DeadInstructions))		if (auto Plan = tryToBuildVPlanWithVPRecipes(CM, SubRange, DeadInstructions))
VPlans.push_back(std::move(*Plan));		VPlans.push_back(std::move(*Plan));
VF = SubRange.End;		VF = SubRange.End;
}		}
}		}
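buildVPlansWithVPRecipes now starts with a pre-pass that runs the uniform/scalar and scalarization analyses for each VF on the cost model passed in. Each of the two cost models (folded and non-folded) therefore fills its own per-VF results before any plan is built. The pre-pass uses an inclusive bound (isKnownLE against MaxVF), which visits the same VFs as the half-open MaxVF * 2 bound of the plan-building loop. Below is a sketch that only records which VFs each analysis would see, assuming fixed-width VFs as unsigned values; CollectionLog and collectPerVF are hypothetical.

```cpp
#include <cassert>
#include <vector>

struct CollectionLog {
  std::vector<unsigned> UniformsAndScalars; // VFs given to the first analysis.
  std::vector<unsigned> InstsToScalarize;   // VFs given to the second analysis.
};

// Visit every power-of-two VF in [MinVF, MaxVF] (inclusive) and record which
// analyses would run, mirroring the pre-pass in this function.
CollectionLog collectPerVF(unsigned MinVF, unsigned MaxVF) {
  CollectionLog Log;
  for (unsigned VF = MinVF; VF <= MaxVF; VF *= 2) {
    Log.UniformsAndScalars.push_back(VF);
    if (VF > 1) // VF.isVector(): scalarization costs only matter for vectors.
      Log.InstsToScalarize.push_back(VF);
  }
  return Log;
}
```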

// Add the necessary canonical IV and branch recipes required to control the		// Add the necessary canonical IV and branch recipes required to control the
// loop.		// loop.
static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, DebugLoc DL,		static void addCanonicalIVRecipes(VPlan &Plan, Type *IdxTy, DebugLoc DL,
[... unchanged lines omitted ...]	for (PHINode &ExitPhi : ExitBB->phis()) {
Value *IncomingValue =		Value *IncomingValue =
ExitPhi.getIncomingValueForBlock(ExitingBB);		ExitPhi.getIncomingValueForBlock(ExitingBB);
VPValue *V = Plan.getVPValueOrAddLiveIn(IncomingValue);		VPValue *V = Plan.getVPValueOrAddLiveIn(IncomingValue);
Plan.addLiveOut(&ExitPhi, V);		Plan.addLiveOut(&ExitPhi, V);
}		}
}		}

std::optional<VPlanPtr> LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(		std::optional<VPlanPtr> LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(
VFRange &Range, SmallPtrSetImpl<Instruction *> &DeadInstructions) {		LoopVectorizationCostModel &CM, VFRange &Range,
		SmallPtrSetImpl<Instruction *> &DeadInstructions) {

SmallPtrSet<const InterleaveGroup<Instruction> *, 1> InterleaveGroups;		SmallPtrSet<const InterleaveGroup<Instruction> *, 1> InterleaveGroups;

VPRecipeBuilder RecipeBuilder(OrigLoop, TLI, Legal, CM, PSE, Builder);		VPRecipeBuilder RecipeBuilder(OrigLoop, TLI, Legal, CM, PSE, Builder);

// ---------------------------------------------------------------------------		// ---------------------------------------------------------------------------
// Pre-construction: record ingredients whose recipes we'll need to further		// Pre-construction: record ingredients whose recipes we'll need to further
// process after constructing the initial VPlan.		// process after constructing the initial VPlan.
[... unchanged lines omitted ...]	for (const auto &Reduction : CM.getInLoopReductionChains()) {
}		}
}		}

// For each interleave group which is relevant for this (possibly trimmed)		// For each interleave group which is relevant for this (possibly trimmed)
// Range, add it to the set of groups to be later applied to the VPlan and add		// Range, add it to the set of groups to be later applied to the VPlan and add
// placeholders for its members' Recipes which we'll be replacing with a		// placeholders for its members' Recipes which we'll be replacing with a
// single VPInterleaveRecipe.		// single VPInterleaveRecipe.
for (InterleaveGroup<Instruction> *IG : IAI.getInterleaveGroups()) {		for (InterleaveGroup<Instruction> *IG : IAI.getInterleaveGroups()) {
auto applyIG = [IG, this](ElementCount VF) -> bool {		auto applyIG = [IG, &CM](ElementCount VF) -> bool {
return (VF.isVector() && // Query is illegal for VF == 1		return (VF.isVector() && // Query is illegal for VF == 1
CM.getWideningDecision(IG->getInsertPos(), VF) ==		CM.getWideningDecision(IG->getInsertPos(), VF) ==
LoopVectorizationCostModel::CM_Interleave);		LoopVectorizationCostModel::CM_Interleave);
};		};
if (!getDecisionAndClampRange(applyIG, Range))		if (!getDecisionAndClampRange(applyIG, Range))
continue;		continue;
InterleaveGroups.insert(IG);		InterleaveGroups.insert(IG);
for (unsigned i = 0; i < IG->getFactor(); i++)		for (unsigned i = 0; i < IG->getFactor(); i++)
if (Instruction *Member = IG->getMember(i))		if (Instruction *Member = IG->getMember(i))
RecipeBuilder.recordRecipeOf(Member);		RecipeBuilder.recordRecipeOf(Member);
};		};

// ---------------------------------------------------------------------------		// ---------------------------------------------------------------------------
// Build initial VPlan: Scan the body of the loop in a topological order to		// Build initial VPlan: Scan the body of the loop in a topological order to
// visit each basic block after having visited its predecessor basic blocks.		// visit each basic block after having visited its predecessor basic blocks.
// ---------------------------------------------------------------------------		// ---------------------------------------------------------------------------

// Create initial VPlan skeleton, starting with a block for the pre-header,		// Create initial VPlan skeleton, starting with a block for the pre-header,
// followed by a region for the vector loop, followed by the middle block. The		// followed by a region for the vector loop, followed by the middle block. The
// skeleton vector loop region contains a header and latch block.		// skeleton vector loop region contains a header and latch block.
VPBasicBlock *Preheader = new VPBasicBlock("vector.ph");		VPBasicBlock *Preheader = new VPBasicBlock("vector.ph");
auto Plan = std::make_unique<VPlan>(Preheader);		auto Plan = std::make_unique<VPlan>(CM.foldTailByMasking(), &CM, Preheader);

VPBasicBlock *HeaderVPBB = new VPBasicBlock("vector.body");		VPBasicBlock *HeaderVPBB = new VPBasicBlock("vector.body");
VPBasicBlock *LatchVPBB = new VPBasicBlock("vector.latch");		VPBasicBlock *LatchVPBB = new VPBasicBlock("vector.latch");
VPBlockUtils::insertBlockAfter(LatchVPBB, HeaderVPBB);		VPBlockUtils::insertBlockAfter(LatchVPBB, HeaderVPBB);
auto *TopRegion = new VPRegionBlock(HeaderVPBB, LatchVPBB, "vector loop");		auto *TopRegion = new VPRegionBlock(HeaderVPBB, LatchVPBB, "vector loop");
VPBlockUtils::insertBlockAfter(TopRegion, Preheader);		VPBlockUtils::insertBlockAfter(TopRegion, Preheader);
VPBasicBlock *MiddleVPBB = new VPBasicBlock("middle.block");		VPBasicBlock *MiddleVPBB = new VPBasicBlock("middle.block");
VPBlockUtils::insertBlockAfter(MiddleVPBB, TopRegion);		VPBlockUtils::insertBlockAfter(MiddleVPBB, TopRegion);
[... unchanged lines omitted ...]	std::optional<VPlanPtr> LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(
// Transform initial VPlan: Apply previously taken decisions, in order, to		// Transform initial VPlan: Apply previously taken decisions, in order, to
// bring the VPlan to its final state.		// bring the VPlan to its final state.
// ---------------------------------------------------------------------------		// ---------------------------------------------------------------------------

VPlanTransforms::removeRedundantCanonicalIVs(*Plan);		VPlanTransforms::removeRedundantCanonicalIVs(*Plan);
VPlanTransforms::removeRedundantInductionCasts(*Plan);		VPlanTransforms::removeRedundantInductionCasts(*Plan);

// Adjust the recipes for any inloop reductions.		// Adjust the recipes for any inloop reductions.
adjustRecipesForReductions(cast<VPBasicBlock>(TopRegion->getExiting()), Plan,		adjustRecipesForReductions(CM, cast<VPBasicBlock>(TopRegion->getExiting()),
RecipeBuilder, Range.Start);		Plan, RecipeBuilder, Range.Start);

// Sink users of fixed-order recurrence past the recipe defining the previous		// Sink users of fixed-order recurrence past the recipe defining the previous
// value and introduce FirstOrderRecurrenceSplice VPInstructions.		// value and introduce FirstOrderRecurrenceSplice VPInstructions.
if (!VPlanTransforms::adjustFixedOrderRecurrences(*Plan, Builder))		if (!VPlanTransforms::adjustFixedOrderRecurrences(*Plan, Builder))
return std::nullopt;		return std::nullopt;

// Interleave memory: for each Interleave Group we marked earlier as relevant		// Interleave memory: for each Interleave Group we marked earlier as relevant
// for this VPlan, replace the Recipes widening its memory instructions with a		// for this VPlan, replace the Recipes widening its memory instructions with a
[... unchanged lines omitted ...]	std::optional<VPlanPtr> LoopVectorizationPlanner::tryToBuildVPlanWithVPRecipes(

VPlanTransforms::removeRedundantExpandSCEVRecipes(*Plan);		VPlanTransforms::removeRedundantExpandSCEVRecipes(*Plan);
VPlanTransforms::mergeBlocksIntoPredecessors(*Plan);		VPlanTransforms::mergeBlocksIntoPredecessors(*Plan);

assert(VPlanVerifier::verifyPlanIsValid(*Plan) && "VPlan is invalid");		assert(VPlanVerifier::verifyPlanIsValid(*Plan) && "VPlan is invalid");
return std::make_optional(std::move(Plan));		return std::make_optional(std::move(Plan));
}		}

VPlanPtr LoopVectorizationPlanner::buildVPlan(VFRange &Range) {		VPlanPtr LoopVectorizationPlanner::buildVPlan(LoopVectorizationCostModel &CM,
		VFRange &Range) {
// Outer loop handling: They may require CFG and instruction level		// Outer loop handling: They may require CFG and instruction level
// transformations before even evaluating whether vectorization is profitable.		// transformations before even evaluating whether vectorization is profitable.
// Since we cannot modify the incoming IR, we need to build VPlan upfront in		// Since we cannot modify the incoming IR, we need to build VPlan upfront in
// the vectorization pipeline.		// the vectorization pipeline.
assert(!OrigLoop->isInnermost());		assert(!OrigLoop->isInnermost());
assert(EnableVPlanNativePath && "VPlan-native path is not enabled.");		assert(EnableVPlanNativePath && "VPlan-native path is not enabled.");

// Create new empty VPlan		// Create new empty VPlan
auto Plan = std::make_unique<VPlan>();		auto Plan = std::make_unique<VPlan>(CM.foldTailByMasking(), &CM);

// Build hierarchical CFG		// Build hierarchical CFG
VPlanHCFGBuilder HCFGBuilder(OrigLoop, LI, *Plan);		VPlanHCFGBuilder HCFGBuilder(OrigLoop, LI, *Plan);
HCFGBuilder.buildHierarchicalCFG();		HCFGBuilder.buildHierarchicalCFG();

for (ElementCount VF : Range)		for (ElementCount VF : Range)
Plan->addVF(VF);		Plan->addVF(VF);

SmallPtrSet<Instruction *, 1> DeadInstructions;		SmallPtrSet<Instruction *, 1> DeadInstructions;
VPlanTransforms::VPInstructionsToVPRecipes(		VPlanTransforms::VPInstructionsToVPRecipes(
Plan,		Plan,
[this](PHINode *P) { return Legal->getIntOrFpInductionDescriptor(P); },		[this](PHINode *P) { return Legal->getIntOrFpInductionDescriptor(P); },
DeadInstructions, PSE.getSE(), TLI);		DeadInstructions, PSE.getSE(), TLI);

// Remove the existing terminator of the exiting block of the top-most region.		// Remove the existing terminator of the exiting block of the top-most region.
// A BranchOnCount will be added instead when adding the canonical IV recipes.		// A BranchOnCount will be added instead when adding the canonical IV recipes.
auto *Term =		auto *Term =
Plan->getVectorLoopRegion()->getExitingBasicBlock()->getTerminator();		Plan->getVectorLoopRegion()->getExitingBasicBlock()->getTerminator();
Term->eraseFromParent();		Term->eraseFromParent();

addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), DebugLoc(),		addCanonicalIVRecipes(*Plan, Legal->getWidestInductionType(), DebugLoc(),
CM.getTailFoldingStyle());		CM.getTailFoldingStyle(false));
return Plan;		return Plan;
}		}

// Adjust the recipes for reductions. For in-loop reductions the chain of		// Adjust the recipes for reductions. For in-loop reductions the chain of
// instructions leading from the loop exit instr to the phi need to be converted		// instructions leading from the loop exit instr to the phi need to be converted
// to reductions, with one operand being vector and the other being the scalar		// to reductions, with one operand being vector and the other being the scalar
// reduction chain. For other reductions, a select is introduced between the phi		// reduction chain. For other reductions, a select is introduced between the phi
// and live-out recipes when folding the tail.		// and live-out recipes when folding the tail.
void LoopVectorizationPlanner::adjustRecipesForReductions(		void LoopVectorizationPlanner::adjustRecipesForReductions(
VPBasicBlock *LatchVPBB, VPlanPtr &Plan, VPRecipeBuilder &RecipeBuilder,		LoopVectorizationCostModel &CM, VPBasicBlock *LatchVPBB, VPlanPtr &Plan,
ElementCount MinVF) {		VPRecipeBuilder &RecipeBuilder, ElementCount MinVF) {
for (const auto &Reduction : CM.getInLoopReductionChains()) {		for (const auto &Reduction : CM.getInLoopReductionChains()) {
PHINode *Phi = Reduction.first;		PHINode *Phi = Reduction.first;
const RecurrenceDescriptor &RdxDesc =		const RecurrenceDescriptor &RdxDesc =
Legal->getReductionVars().find(Phi)->second;		Legal->getReductionVars().find(Phi)->second;
const SmallVector<Instruction *, 4> &ReductionOperations = Reduction.second;		const SmallVector<Instruction *, 4> &ReductionOperations = Reduction.second;

if (MinVF.isScalar() && !CM.useOrderedReductions(RdxDesc))		if (MinVF.isScalar() && !CM.useOrderedReductions(RdxDesc))
continue;		continue;
[... unchanged lines omitted ...]	for (Instruction *R : ReductionOperations) {
}		}
Chain = R;		Chain = R;
}		}
}		}

// If tail is folded by masking, introduce selects between the phi		// If tail is folded by masking, introduce selects between the phi
// and the live-out instruction of each reduction, at the beginning of the		// and the live-out instruction of each reduction, at the beginning of the
// dedicated latch block.		// dedicated latch block.
if (CM.foldTailByMasking()) {		if (Plan->foldTailByMasking()) {
Builder.setInsertPoint(LatchVPBB, LatchVPBB->begin());		Builder.setInsertPoint(LatchVPBB, LatchVPBB->begin());
for (VPRecipeBase &R :		for (VPRecipeBase &R :
Plan->getVectorLoopRegion()->getEntryBasicBlock()->phis()) {		Plan->getVectorLoopRegion()->getEntryBasicBlock()->phis()) {
VPReductionPHIRecipe *PhiR = dyn_cast<VPReductionPHIRecipe>(&R);		VPReductionPHIRecipe *PhiR = dyn_cast<VPReductionPHIRecipe>(&R);
if (!PhiR || PhiR->isInLoop())		if (!PhiR || PhiR->isInLoop())
continue;		continue;
VPValue *Cond =		VPValue *Cond =
RecipeBuilder.createBlockInMask(OrigLoop->getHeader(), *Plan);		RecipeBuilder.createBlockInMask(OrigLoop->getHeader(), *Plan);
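The select introduced here, now guarded by Plan->foldTailByMasking() rather than the cost model's flag, keeps masked-off lanes from polluting an out-of-loop reduction: each lane's new partial sum becomes select(HeaderMask, Phi + X, Phi). Below is a scalar worked example for an integer sum. It assumes that inactive lanes see an arbitrary junk value in place of a real load, which models the fact that masked-off lanes carry no meaningful data. foldedSum and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Model a tail-folded sum reduction of width VF. Lanes past the trip count
// are masked off; their loaded value is modelled as Junk. With UseSelect,
// each lane's update is select(Mask, Phi + X, Phi).
long foldedSum(const std::vector<long> &X, unsigned VF, bool UseSelect,
               long Junk = 1000) {
  std::vector<long> Phi(VF, 0); // Per-lane partial sums (the reduction phi).
  const std::size_t TC = X.size();
  for (std::size_t IV = 0; IV < TC; IV += VF) {
    for (unsigned L = 0; L < VF; ++L) {
      bool Active = IV + L < TC; // Header mask for this lane.
      long Loaded = Active ? X[IV + L] : Junk;
      long Next = Phi[L] + Loaded;
      Phi[L] = (UseSelect && !Active) ? Phi[L] : Next;
    }
  }
  // Final horizontal reduction, as done after the vector loop.
  return std::accumulate(Phi.begin(), Phi.end(), 0L);
}
```

For X = {1, 2, 3, 4, 5, 6} and VF = 4, the second vector iteration has two inactive lanes. With the select the result is 21, the correct sum. Without it those lanes each add the junk value and the result is 2021.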
[... unchanged lines omitted ...]	static bool processLoopInVPlanNativePath(
}		}
assert(EnableVPlanNativePath && "VPlan-native path is disabled.");		assert(EnableVPlanNativePath && "VPlan-native path is disabled.");
Function *F = L->getHeader()->getParent();		Function *F = L->getHeader()->getParent();
InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL->getLAI());		InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL->getLAI());

ScalarEpilogueLowering SEL =		ScalarEpilogueLowering SEL =
getScalarEpilogueLowering(F, L, Hints, PSI, BFI, TTI, TLI, *LVL, &IAI);		getScalarEpilogueLowering(F, L, Hints, PSI, BFI, TTI, TLI, *LVL, &IAI);

LoopVectorizationCostModel CM(SEL, L, PSE, LI, LVL, *TTI, TLI, DB, AC, ORE, F,		LoopVectorizationCostModel CM(false, SEL, L, PSE, LI, LVL, *TTI, TLI, DB, AC,
&Hints, IAI);		ORE, F, &Hints, IAI);
// Use the planner for outer loop vectorization.		// Use the planner for outer loop vectorization.
// TODO: CM is not used at this point inside the planner. Turn CM into an		LoopVectorizationPlanner LVP(L, LI, TLI, TTI, LVL, IAI, PSE, Hints, ORE);
// optional argument if we don't need it in the future.
LoopVectorizationPlanner LVP(L, LI, TLI, TTI, LVL, CM, IAI, PSE, Hints, ORE);

// Get user vectorization factor.		// Get user vectorization factor.
ElementCount UserVF = Hints.getWidth();		ElementCount UserVF = Hints.getWidth();

CM.collectElementTypesForWidening();		CM.collectElementTypesForWidening();

// Plan how to best vectorize, return the best VF and its cost.		// Plan how to best vectorize, return the best VF and its cost.
const VectorizationFactor VF = LVP.planInVPlanNativePath(UserVF);		const VectorizationFactor VF = LVP.planInVPlanNativePath(CM, UserVF);

// If we are stress testing VPlan builds, do not attempt to generate vector		// If we are stress testing VPlan builds, do not attempt to generate vector
// code. Masked vector code generation support will follow soon.		// code. Masked vector code generation support will follow soon.
// Also, do not attempt to vectorize if no vector code will be produced.		// Also, do not attempt to vectorize if no vector code will be produced.
if (VPlanBuildStressTest || VectorizationFactor::Disabled() == VF)		if (VPlanBuildStressTest || VectorizationFactor::Disabled() == VF)
return false;		return false;

VPlan &BestPlan = LVP.getBestPlanFor(VF.Width);		VPlan &BestPlan = LVP.getBestPlanFor(VF.Width, VF.FoldTailByMasking);

{		{
GeneratedRTChecks Checks(*PSE.getSE(), DT, LI, TTI,		GeneratedRTChecks Checks(*PSE.getSE(), DT, LI, TTI,
F->getParent()->getDataLayout());		F->getParent()->getDataLayout());
InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width,		InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width,
VF.Width, 1, LVL, &CM, BFI, PSI, Checks);		VF.Width, 1, LVL, &CM, BFI, PSI, Checks);
LLVM_DEBUG(dbgs() << "Vectorizing outer loop in \""		LLVM_DEBUG(dbgs() << "Vectorizing outer loop in \""
<< L->getHeader()->getParent()->getName() << "\"\n");		<< L->getHeader()->getParent()->getName() << "\"\n");
[... unchanged lines omitted ...]	#endif /* NDEBUG */
// the incoming IR, we need to build VPlan upfront in the vectorization		// the incoming IR, we need to build VPlan upfront in the vectorization
// pipeline.		// pipeline.
if (!L->isInnermost())		if (!L->isInnermost())
return processLoopInVPlanNativePath(L, PSE, LI, DT, &LVL, TTI, TLI, DB, AC,		return processLoopInVPlanNativePath(L, PSE, LI, DT, &LVL, TTI, TLI, DB, AC,
ORE, BFI, PSI, Hints, Requirements);		ORE, BFI, PSI, Hints, Requirements);

assert(L->isInnermost() && "Inner loop expected.");		assert(L->isInnermost() && "Inner loop expected.");

InterleavedAccessInfo IAI(PSE, L, DT, LI, LVL.getLAI());		InterleavedAccessInfo BaseIAI(PSE, L, DT, LI, LVL.getLAI());
bool UseInterleaved = TTI->enableInterleavedAccessVectorization();		bool UseInterleaved = TTI->enableInterleavedAccessVectorization();

// If an override option has been passed in for interleaved accesses, use it.		// If an override option has been passed in for interleaved accesses, use it.
if (EnableInterleavedMemAccesses.getNumOccurrences() > 0)		if (EnableInterleavedMemAccesses.getNumOccurrences() > 0)
UseInterleaved = EnableInterleavedMemAccesses;		UseInterleaved = EnableInterleavedMemAccesses;

// Analyze interleaved memory accesses.		// Analyze interleaved memory accesses.
if (UseInterleaved)		if (UseInterleaved)
IAI.analyzeInterleaving(useMaskedInterleavedAccesses(*TTI));		BaseIAI.analyzeInterleaving(useMaskedInterleavedAccesses(*TTI));

// Check the function attributes and profiles to find out if this function		// Check the function attributes and profiles to find out if this function
// should be optimized for size.		// should be optimized for size.
ScalarEpilogueLowering SEL =		ScalarEpilogueLowering SEL =
getScalarEpilogueLowering(F, L, Hints, PSI, BFI, TTI, TLI, LVL, &IAI);		getScalarEpilogueLowering(F, L, Hints, PSI, BFI, TTI, TLI, LVL, &BaseIAI);

// Check the loop for a trip count threshold: vectorize loops with a tiny trip		// Check the loop for a trip count threshold: vectorize loops with a tiny trip
// count by optimizing for size, to minimize overheads.		// count by optimizing for size, to minimize overheads.
auto ExpectedTC = getSmallBestKnownTC(*SE, L);		auto ExpectedTC = getSmallBestKnownTC(*SE, L);
if (ExpectedTC && *ExpectedTC < TinyTripCountVectorThreshold) {		if (ExpectedTC && *ExpectedTC < TinyTripCountVectorThreshold) {
LLVM_DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "		LLVM_DEBUG(dbgs() << "LV: Found a loop with a very small trip count. "
<< "This loop is worth vectorizing only if no scalar "		<< "This loop is worth vectorizing only if no scalar "
<< "iteration overheads are incurred.");		<< "iteration overheads are incurred.");
[... unchanged lines omitted ...]	ORE->emit([&]() {
"floating-point operations";		"floating-point operations";
});		});
LLVM_DEBUG(dbgs() << "LV: loop not vectorized: cannot prove it is safe to "		LLVM_DEBUG(dbgs() << "LV: loop not vectorized: cannot prove it is safe to "
"reorder floating-point operations\n");		"reorder floating-point operations\n");
Hints.emitRemarkWithHints();		Hints.emitRemarkWithHints();
return false;		return false;
}		}

// Use the cost model.
LoopVectorizationCostModel CM(SEL, L, PSE, LI, &LVL, *TTI, TLI, DB, AC, ORE,
F, &Hints, IAI);
CM.collectValuesToIgnore();
CM.collectElementTypesForWidening();

// Use the planner for vectorization.		// Use the planner for vectorization.
LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, CM, IAI, PSE, Hints, ORE);		LoopVectorizationPlanner LVP(L, LI, TLI, TTI, &LVL, BaseIAI, PSE, Hints, ORE);

// Get user vectorization factor and interleave count.		// Get user vectorization factor and interleave count.
ElementCount UserVF = Hints.getWidth();		ElementCount UserVF = Hints.getWidth();
unsigned UserIC = Hints.getInterleave();		unsigned UserIC = Hints.getInterleave();

		// We plan with two different cost models with FoldTailByMasking = false and
		// true, adding the useful vplans from each and picking the best below in
		// selectVectorizationFactor.
		LoopVectorizationCostModel BaseCM(false, SEL, L, PSE, LI, &LVL, *TTI, TLI, DB,
		AC, ORE, F, &Hints, BaseIAI);
		LVP.plan(BaseCM, UserVF, UserIC);

		InterleavedAccessInfo PredIAI(PSE, L, DT, LI, LVL.getLAI());
		if (UseInterleaved && SEL != CM_ScalarEpilogueAllowed)
		PredIAI.analyzeInterleaving(useMaskedInterleavedAccesses(*TTI));
		LoopVectorizationCostModel PredCM(true, SEL, L, PSE, LI, &LVL, *TTI, TLI, DB,
		AC, ORE, F, &Hints, PredIAI);
		if (SEL != CM_ScalarEpilogueAllowed)
		LVP.plan(PredCM, UserVF, UserIC);

// Plan how to best vectorize, return the best VF and its cost.		// Plan how to best vectorize, return the best VF and its cost.
std::optional<VectorizationFactor> MaybeVF = LVP.plan(UserVF, UserIC);		// Use the cost model.
		std::optional<VectorizationFactor> MaybeVF = LVP.selectVectorizationFactor();

VectorizationFactor VF = VectorizationFactor::Disabled();		VectorizationFactor VF = VectorizationFactor::Disabled();
unsigned IC = 1;		unsigned IC = 1;

GeneratedRTChecks Checks(*PSE.getSE(), DT, LI, TTI,		GeneratedRTChecks Checks(*PSE.getSE(), DT, LI, TTI,
F->getParent()->getDataLayout());		F->getParent()->getDataLayout());
if (MaybeVF) {		if (MaybeVF) {
VF = *MaybeVF;		VF = *MaybeVF;
// Select the interleave count.		// Select the interleave count.
IC = CM.selectInterleaveCount(VF.Width, VF.Cost);		IC = VF.FoldTailByMasking ? PredCM.selectInterleaveCount(VF.Width, VF.Cost)
		: BaseCM.selectInterleaveCount(VF.Width, VF.Cost);

unsigned SelectedIC = std::max(IC, UserIC);		unsigned SelectedIC = std::max(IC, UserIC);
// Optimistically generate runtime checks if they are needed. Drop them if		// Optimistically generate runtime checks if they are needed. Drop them if
// they turn out to not be profitable.		// they turn out to not be profitable.
if (VF.Width.isVector() || SelectedIC > 1)		if (VF.Width.isVector() || SelectedIC > 1)
Checks.Create(L, *LVL.getLAI(), PSE.getPredicate(), VF.Width, SelectedIC);		Checks.Create(L, *LVL.getLAI(), PSE.getPredicate(), VF.Width, SelectedIC);

// Check if it is profitable to vectorize with runtime checks.		// Check if it is profitable to vectorize with runtime checks.
… 100 lines omitted …	#endif /* NDEBUG */
bool DisableRuntimeUnroll = false;		bool DisableRuntimeUnroll = false;
MDNode *OrigLoopID = L->getLoopID();		MDNode *OrigLoopID = L->getLoopID();
{		{
using namespace ore;		using namespace ore;
if (!VectorizeLoop) {		if (!VectorizeLoop) {
assert(IC > 1 && "interleave count should not be 1 or 0");		assert(IC > 1 && "interleave count should not be 1 or 0");
// If we decided that it is not legal to vectorize the loop, then		// If we decided that it is not legal to vectorize the loop, then
// interleave it.		// interleave it.
		VPlan &BestPlan = LVP.getBestPlanFor(VF.Width, VF.FoldTailByMasking);
InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL,		InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL,
&CM, BFI, PSI, Checks);		BestPlan.getCostModel(), BFI, PSI, Checks);

VPlan &BestPlan = LVP.getBestPlanFor(VF.Width);
LVP.executePlan(VF.Width, IC, BestPlan, Unroller, DT, false);		LVP.executePlan(VF.Width, IC, BestPlan, Unroller, DT, false);

ORE->emit([&]() {		ORE->emit([&]() {
return OptimizationRemark(LV_NAME, "Interleaved", L->getStartLoc(),		return OptimizationRemark(LV_NAME, "Interleaved", L->getStartLoc(),
L->getHeader())		L->getHeader())
<< "interleaved loop (interleaved count: "		<< "interleaved loop (interleaved count: "
<< NV("InterleaveCount", IC) << ")";		<< NV("InterleaveCount", IC) << ")";
});		});
} else {		} else {
// If we decided that it is legal to vectorize the loop, then do it.		// If we decided that it is legal to vectorize the loop, then do it.

// Consider vectorizing the epilogue too if it's profitable.		// Consider vectorizing the epilogue too if it's profitable.
VectorizationFactor EpilogueVF =		VectorizationFactor EpilogueVF =
CM.selectEpilogueVectorizationFactor(VF.Width, LVP);		LVP.selectEpilogueVectorizationFactor(VF);
if (EpilogueVF.Width.isVector()) {		if (EpilogueVF.Width.isVector()) {

// The first pass vectorizes the main loop and creates a scalar epilogue		// The first pass vectorizes the main loop and creates a scalar epilogue
// to be vectorized by executing the plan (potentially with a different		// to be vectorized by executing the plan (potentially with a different
// factor) again shortly afterwards.		// factor) again shortly afterwards.
		// TODO: Predicated remainders
EpilogueLoopVectorizationInfo EPI(VF.Width, IC, EpilogueVF.Width, 1);		EpilogueLoopVectorizationInfo EPI(VF.Width, IC, EpilogueVF.Width, 1);
EpilogueVectorizerMainLoop MainILV(L, PSE, LI, DT, TLI, TTI, AC, ORE,
EPI, &LVL, &CM, BFI, PSI, Checks);

VPlan &BestMainPlan = LVP.getBestPlanFor(EPI.MainLoopVF);		VPlan &BestMainPlan = LVP.getBestPlanFor(EPI.MainLoopVF, false);
		EpilogueVectorizerMainLoop MainILV(
		L, PSE, LI, DT, TLI, TTI, AC, ORE, EPI, &LVL,
		BestMainPlan.getCostModel(), BFI, PSI, Checks);
LVP.executePlan(EPI.MainLoopVF, EPI.MainLoopUF, BestMainPlan, MainILV,		LVP.executePlan(EPI.MainLoopVF, EPI.MainLoopUF, BestMainPlan, MainILV,
DT, true);		DT, true);
++LoopsVectorized;		++LoopsVectorized;

// Second pass vectorizes the epilogue and adjusts the control flow		// Second pass vectorizes the epilogue and adjusts the control flow
// edges from the first pass.		// edges from the first pass.
EPI.MainLoopVF = EPI.EpilogueVF;		EPI.MainLoopVF = EPI.EpilogueVF;
EPI.MainLoopUF = EPI.EpilogueUF;		EPI.MainLoopUF = EPI.EpilogueUF;
EpilogueVectorizerEpilogueLoop EpilogILV(L, PSE, LI, DT, TLI, TTI, AC,		VPlan &BestEpiPlan = LVP.getBestPlanFor(EPI.EpilogueVF, false);
ORE, EPI, &LVL, &CM, BFI, PSI,		EpilogueVectorizerEpilogueLoop EpilogILV(
Checks);		L, PSE, LI, DT, TLI, TTI, AC, ORE, EPI, &LVL,
		BestEpiPlan.getCostModel(), BFI, PSI, Checks);

VPlan &BestEpiPlan = LVP.getBestPlanFor(EPI.EpilogueVF);
VPRegionBlock *VectorLoop = BestEpiPlan.getVectorLoopRegion();		VPRegionBlock *VectorLoop = BestEpiPlan.getVectorLoopRegion();
VPBasicBlock *Header = VectorLoop->getEntryBasicBlock();		VPBasicBlock *Header = VectorLoop->getEntryBasicBlock();
Header->setName("vec.epilog.vector.body");		Header->setName("vec.epilog.vector.body");

// Ensure that the start values for any VPWidenIntOrFpInductionRecipe,		// Ensure that the start values for any VPWidenIntOrFpInductionRecipe,
// VPWidenPointerInductionRecipe and VPReductionPHIRecipes are updated		// VPWidenPointerInductionRecipe and VPReductionPHIRecipes are updated
// before vectorizing the epilogue loop.		// before vectorizing the epilogue loop.
for (VPRecipeBase &R : Header->phis()) {		for (VPRecipeBase &R : Header->phis()) {
… 30 lines omitted …	if (!VectorizeLoop) {

LVP.executePlan(EPI.EpilogueVF, EPI.EpilogueUF, BestEpiPlan, EpilogILV,		LVP.executePlan(EPI.EpilogueVF, EPI.EpilogueUF, BestEpiPlan, EpilogILV,
DT, true);		DT, true);
++LoopsEpilogueVectorized;		++LoopsEpilogueVectorized;

if (!MainILV.areSafetyChecksAdded())		if (!MainILV.areSafetyChecksAdded())
DisableRuntimeUnroll = true;		DisableRuntimeUnroll = true;
} else {		} else {
		VPlan &BestPlan = LVP.getBestPlanFor(VF.Width, VF.FoldTailByMasking);
InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width,		InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width,
VF.MinProfitableTripCount, IC, &LVL, &CM, BFI,		VF.MinProfitableTripCount, IC, &LVL,
PSI, Checks);		BestPlan.getCostModel(), BFI, PSI, Checks);

VPlan &BestPlan = LVP.getBestPlanFor(VF.Width);
LVP.executePlan(VF.Width, IC, BestPlan, LB, DT, false);		LVP.executePlan(VF.Width, IC, BestPlan, LB, DT, false);
++LoopsVectorized;		++LoopsVectorized;

// Add metadata to disable runtime unrolling a scalar loop when there		// Add metadata to disable runtime unrolling a scalar loop when there
// are no runtime checks about strides and memory. A scalar loop that is		// are no runtime checks about strides and memory. A scalar loop that is
// rarely used is not worth unrolling.		// rarely used is not worth unrolling.
if (!LB.areSafetyChecksAdded())		if (!LB.areSafetyChecksAdded())
DisableRuntimeUnroll = true;		DisableRuntimeUnroll = true;
… 169 lines omitted …
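The planning flow in this hunk builds a cost model with FoldTailByMasking = false (BaseCM) and, when SEL != CM_ScalarEpilogueAllowed, a second one with FoldTailByMasking = true (PredCM), plans with both, and then lets selectVectorizationFactor pick the best plan across the two. The interleave count is afterwards chosen by whichever cost model matches VF.FoldTailByMasking. The following is a standalone illustration of that cross-plan choice only, assuming a simple cost-per-lane comparison with made-up costs; Candidate and selectBest are hypothetical names, not LoopVectorize APIs, and the real selection logic is more involved.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for a costed plan; not llvm::VectorizationFactor.
struct Candidate {
  unsigned VF;            // fixed-width vectorization factor (1 = scalar)
  bool FoldTailByMasking; // whether the plan folds the tail by masking
  double Cost;            // estimated cost of one vector iteration
};

// Return the candidate with the lowest cost per scalar lane, considering
// tail-folded and non-tail-folded plans together. On a tie the earlier
// candidate is kept. Returns std::nullopt for an empty list.
std::optional<Candidate> selectBest(const std::vector<Candidate> &Cands) {
  std::optional<Candidate> Best;
  for (const Candidate &C : Cands) {
    assert(C.VF > 0 && "VF must be non-zero");
    if (!Best || C.Cost / C.VF < Best->Cost / Best->VF)
      Best = C;
  }
  return Best;
}
```

For example, given a scalar candidate {1, false, 4.0}, an unpredicated {4, false, 10.0} and a tail-folded {4, true, 8.0}, selectBest returns the tail-folded VF 4 candidate, since 8.0 / 4 is the lowest cost per lane. Carrying FoldTailByMasking in the selected candidate is what lets the caller dispatch to the matching cost model afterwards.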

llvm/lib/Transforms/Vectorize/VPlan.h

… 43 lines omitted …

namespace llvm {		namespace llvm {

class BasicBlock;		class BasicBlock;
class DominatorTree;		class DominatorTree;
class InnerLoopVectorizer;		class InnerLoopVectorizer;
class IRBuilderBase;		class IRBuilderBase;
class LoopInfo;		class LoopInfo;
		class LoopVectorizationCostModel;
class PredicateScalarEvolution;		class PredicateScalarEvolution;
class raw_ostream;		class raw_ostream;
class RecurrenceDescriptor;		class RecurrenceDescriptor;
class SCEV;		class SCEV;
class Type;		class Type;
class VPBasicBlock;		class VPBasicBlock;
class VPRegionBlock;		class VPRegionBlock;
class VPlan;		class VPlan;
… 2,188 lines omitted …	class VPlan {

/// Indicates whether it is safe to use the Value2VPValue mapping or if the		/// Indicates whether it is safe to use the Value2VPValue mapping or if the
/// mapping cannot be used any longer, because it is stale.		/// mapping cannot be used any longer, because it is stale.
bool Value2VPValueEnabled = true;		bool Value2VPValueEnabled = true;

/// Values used outside the plan.		/// Values used outside the plan.
MapVector<PHINode *, VPLiveOut *> LiveOuts;		MapVector<PHINode *, VPLiveOut *> LiveOuts;

		/// Whether this plan should foldTailByMasking
		bool FoldTailByMasking = false;

		/// The cost model used to construct and cost this VPlan.
		/// TODO: Remove this and the dependencies on the costmodel.
		LoopVectorizationCostModel *Cost;

public:		public:
VPlan(VPBlockBase *Entry = nullptr) : Entry(Entry) {		VPlan(bool FoldTailByMasking = false,
		LoopVectorizationCostModel *Cost = nullptr,
		VPBlockBase *Entry = nullptr)
		: Entry(Entry), FoldTailByMasking(FoldTailByMasking), Cost(Cost) {
if (Entry)		if (Entry)
Entry->setPlan(this);		Entry->setPlan(this);
}		}

~VPlan();		~VPlan();

/// Prepare the plan for execution, setting up the required live-in values.		/// Prepare the plan for execution, setting up the required live-in values.
void prepareToExecute(Value *TripCount, Value *VectorTripCount,		void prepareToExecute(Value *TripCount, Value *VectorTripCount,
… 28 lines omitted …	public:

/// The vector trip count.		/// The vector trip count.
VPValue &getVectorTripCount() { return VectorTripCount; }		VPValue &getVectorTripCount() { return VectorTripCount; }

/// Mark the plan to indicate that using Value2VPValue is not safe any		/// Mark the plan to indicate that using Value2VPValue is not safe any
/// longer, because it may be stale.		/// longer, because it may be stale.
void disableValue2VPValue() { Value2VPValueEnabled = false; }		void disableValue2VPValue() { Value2VPValueEnabled = false; }

		const SmallSetVector<ElementCount, 2> &getVFs() const { return VFs; }

void addVF(ElementCount VF) { VFs.insert(VF); }		void addVF(ElementCount VF) { VFs.insert(VF); }

void setVF(ElementCount VF) {		void setVF(ElementCount VF) {
assert(hasVF(VF) && "Cannot set VF not already in plan");		assert(hasVF(VF) && "Cannot set VF not already in plan");
VFs.clear();		VFs.clear();
VFs.insert(VF);		VFs.insert(VF);
}		}

… 101 lines omitted …	void removeLiveOut(PHINode *PN) {
delete LiveOuts[PN];		delete LiveOuts[PN];
LiveOuts.erase(PN);		LiveOuts.erase(PN);
}		}

const MapVector<PHINode *, VPLiveOut *> &getLiveOuts() const {		const MapVector<PHINode *, VPLiveOut *> &getLiveOuts() const {
return LiveOuts;		return LiveOuts;
}		}

		bool foldTailByMasking() const { return FoldTailByMasking; }

		LoopVectorizationCostModel *getCostModel() const { return Cost; }

private:		private:
/// Add to the given dominator tree the header block and every new basic block		/// Add to the given dominator tree the header block and every new basic block
/// that was created between it and the latch block, inclusive.		/// that was created between it and the latch block, inclusive.
static void updateDominatorTree(DominatorTree *DT, BasicBlock *LoopLatchBB,		static void updateDominatorTree(DominatorTree *DT, BasicBlock *LoopLatchBB,
BasicBlock *LoopPreHeaderBB,		BasicBlock *LoopPreHeaderBB,
BasicBlock *LoopExitBB);		BasicBlock *LoopExitBB);
};		};

… 320 lines omitted …

llvm/lib/Transforms/Vectorize/VPlan.cpp

… 788 lines omitted …	void VPlan::print(raw_ostream &O) const {

O << "}\n";		O << "}\n";
}		}

std::string VPlan::getName() const {		std::string VPlan::getName() const {
std::string Out;		std::string Out;
raw_string_ostream RSO(Out);		raw_string_ostream RSO(Out);
RSO << Name << " for ";		RSO << Name << " for ";
		if (FoldTailByMasking)
		RSO << "Tail Folded ";
if (!VFs.empty()) {		if (!VFs.empty()) {
RSO << "VF={" << VFs[0];		RSO << "VF={" << VFs[0];
for (ElementCount VF : drop_begin(VFs))		for (ElementCount VF : drop_begin(VFs))
RSO << "," << VF;		RSO << "," << VF;
RSO << "},";		RSO << "},";
}		}

if (UFs.empty()) {		if (UFs.empty()) {
… 336 lines omitted …
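The getName() hunk inserts a "Tail Folded " prefix before the VF list when the plan folds the tail, which is why expected plan names in the tests change to forms such as 'Initial VPlan for Tail Folded VF={vscale x 1,vscale x 2,vscale x 4},UF>=1'. The following standalone C++ reproduces that formatting with std::ostringstream. Count is a hypothetical stand-in for llvm::ElementCount, and the trailing "UF>=1" assumes no unroll factors have been recorded yet (the UF branch of getName() is not shown in this excerpt; the text is taken from the test expectations).

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical simplified stand-in for llvm::ElementCount.
struct Count {
  unsigned Min;  // known minimum number of elements
  bool Scalable; // true for "vscale x Min"
};

static void printCount(std::ostringstream &OS, Count C) {
  if (C.Scalable)
    OS << "vscale x ";
  OS << C.Min;
}

// Mirrors the naming logic from the getName() hunk: a "Tail Folded "
// prefix is emitted before the VF list when the plan folds the tail.
// UF handling is simplified to the "no UFs recorded" case.
std::string planName(const std::string &Name, bool FoldTailByMasking,
                     const std::vector<Count> &VFs) {
  std::ostringstream OS;
  OS << Name << " for ";
  if (FoldTailByMasking)
    OS << "Tail Folded ";
  if (!VFs.empty()) {
    OS << "VF={";
    printCount(OS, VFs[0]);
    for (std::size_t I = 1; I < VFs.size(); ++I) {
      OS << ",";
      printCount(OS, VFs[I]);
    }
    OS << "},";
  }
  OS << "UF>=1";
  return OS.str();
}
```

Calling planName("Initial VPlan", true, {{1, true}, {2, true}, {4, true}}) yields the tail-folded name used in sve-tail-folding-forced.ll, while passing false yields the unprefixed form.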

llvm/test/Transforms/LoopVectorize/AArch64/maximize-bandwidth-invalidate.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; REQUIRES: asserts			; REQUIRES: asserts
	; RUN: opt < %s -passes=loop-vectorize -vectorizer-maximize-bandwidth -S 2>&1 | FileCheck %s			; RUN: opt < %s -passes=loop-vectorize -vectorizer-maximize-bandwidth -S 2>&1 | FileCheck %s
	; RUN: opt < %s -passes=loop-vectorize -vectorizer-maximize-bandwidth -S -debug-only=loop-vectorize 2>&1 -disable-output | FileCheck %s --check-prefix=COST			; RUN: opt < %s -passes=loop-vectorize -vectorizer-maximize-bandwidth -S -debug-only=loop-vectorize 2>&1 -disable-output | FileCheck %s --check-prefix=COST

	target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
	target triple = "aarch64-none-unknown-eabi"			target triple = "aarch64-none-unknown-eabi"

	; Check that the maximize vector bandwidth option does not give incorrect costs			; Check that the maximize vector bandwidth option does not give incorrect costs
	; due to invalid cost decisions. The loop below has a low maximum trip count,			; due to invalid cost decisions. The loop below has a low maximum trip count,
	; so will be masked.			; so will be masked.

	; COST: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %0 = load			; COST: LV: Found an estimated cost of 3000000 for VF 2 For instruction: %0 = load
	; COST: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %0 = load			; COST: LV: Found an estimated cost of 3000000 for VF 4 For instruction: %0 = load
	; COST: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %0 = load			; COST: LV: Found an estimated cost of 3000000 for VF 8 For instruction: %0 = load
	; COST: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %0 = load			; COST: LV: Found an estimated cost of 3000000 for VF 16 For instruction: %0 = load
	; COST: LV: Selecting VF: 1.			; COST: LV: Selecting Tail folded VF: 1.
				david-arm: This seems odd. Perhaps I'm mistaken, but I thought with your patch we wouldn't decide to tail-fold with a VF of 1?

	define i32 @test(ptr nocapture noundef readonly %pInVec, ptr nocapture noundef readonly %pInA1, ptr nocapture noundef readonly %pInA2, ptr nocapture noundef readonly %pInA3, ptr nocapture noundef readonly %pInA4, i32 noundef %numCols) {			define i32 @test(ptr nocapture noundef readonly %pInVec, ptr nocapture noundef readonly %pInA1, ptr nocapture noundef readonly %pInA2, ptr nocapture noundef readonly %pInA3, ptr nocapture noundef readonly %pInA4, i32 noundef %numCols) {
	; CHECK-LABEL: @test(			; CHECK-LABEL: @test(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[AND:%.*]] = and i32 [[NUMCOLS:%.*]], 3			; CHECK-NEXT: [[AND:%.*]] = and i32 [[NUMCOLS:%.*]], 3
	; CHECK-NEXT: [[CMP_NOT32:%.*]] = icmp eq i32 [[AND]], 0			; CHECK-NEXT: [[CMP_NOT32:%.*]] = icmp eq i32 [[AND]], 0
	; CHECK-NEXT: br i1 [[CMP_NOT32]], label [[WHILE_END:%.*]], label [[WHILE_BODY_PREHEADER:%.*]]			; CHECK-NEXT: br i1 [[CMP_NOT32]], label [[WHILE_END:%.*]], label [[WHILE_BODY_PREHEADER:%.*]]
	; CHECK: while.body.preheader:			; CHECK: while.body.preheader:
	… 104 lines omitted …

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; REQUIRES: asserts			; REQUIRES: asserts
	; RUN: opt -S -passes=loop-vectorize -debug-only=loop-vectorize < %s 2>%t | FileCheck %s			; RUN: opt -S -passes=loop-vectorize -debug-only=loop-vectorize < %s 2>%t | FileCheck %s
	; RUN: cat %t | FileCheck %s --check-prefix=VPLANS			; RUN: cat %t | FileCheck %s --check-prefix=VPLANS

	; These tests ensure that tail-folding is enabled when the predicate.enable			; These tests ensure that tail-folding is enabled when the predicate.enable
	; loop attribute is set to true.			; loop attribute is set to true.

	target triple = "aarch64-unknown-linux-gnu"			target triple = "aarch64-unknown-linux-gnu"

	; VPLANS-LABEL: Checking a loop in 'simple_memset'			; VPLANS-LABEL: Checking a loop in 'simple_memset'
	; VPLANS: VPlan 'Initial VPlan for VF={vscale x 1,vscale x 2,vscale x 4},UF>=1' {			; VPLANS: VPlan 'Initial VPlan for Tail Folded VF={vscale x 1,vscale x 2,vscale x 4},UF>=1' {
	; VPLANS-NEXT: Live-in vp<[[TC:%[0-9]+]]> = original trip-count			; VPLANS-NEXT: Live-in vp<[[TC:%[0-9]+]]> = original trip-count
	; VPLANS-EMPTY:			; VPLANS-EMPTY:
	; VPLANS-NEXT: vector.ph:			; VPLANS-NEXT: vector.ph:
	; VPLANS-NEXT: EMIT vp<[[VF:%[0-9]+]]> = VF * Part + ir<0>			; VPLANS-NEXT: EMIT vp<[[VF:%[0-9]+]]> = VF * Part + ir<0>
	; VPLANS-NEXT: EMIT vp<[[NEWTC:%[0-9]+]]> = TC > VF ? TC - VF : 0 vp<[[TC]]>			; VPLANS-NEXT: EMIT vp<[[NEWTC:%[0-9]+]]> = TC > VF ? TC - VF : 0 vp<[[TC]]>
	; VPLANS-NEXT: EMIT vp<[[LANEMASK_ENTRY:%[0-9]+]]> = active lane mask vp<[[VF]]> vp<[[TC]]>			; VPLANS-NEXT: EMIT vp<[[LANEMASK_ENTRY:%[0-9]+]]> = active lane mask vp<[[VF]]> vp<[[TC]]>
	; VPLANS-NEXT: Successor(s): vector loop			; VPLANS-NEXT: Successor(s): vector loop
	; VPLANS-EMPTY:			; VPLANS-EMPTY:
	… 87 lines omitted …

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll

	… 772 lines omitted …
	while.end.loopexit: ; preds = %while.body			while.end.loopexit: ; preds = %while.body
	ret void			ret void
	}			}

	define void @simple_memset_trip1024(i32 %val, ptr %ptr, i64 %n) #0 {			define void @simple_memset_trip1024(i32 %val, ptr %ptr, i64 %n) #0 {
	; CHECK-LABEL: @simple_memset_trip1024(			; CHECK-LABEL: @simple_memset_trip1024(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4			; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 8
	; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP1]]			; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP1]]
	; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4			; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 8
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP3]]			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP3]]
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 1024, [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 1024, [[N_MOD_VF]]
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i64 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i64 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT2:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT3:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT2]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT4:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP4:%.*]] = add i64 [[INDEX1]], 0			; CHECK-NEXT: [[TMP4:%.*]] = add i64 [[INDEX1]], 0
	; CHECK-NEXT: [[TMP5:%.*]] = getelementptr i32, ptr [[PTR:%.*]], i64 [[TMP4]]			; CHECK-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
	; CHECK-NEXT: [[TMP6:%.*]] = getelementptr i32, ptr [[TMP5]], i32 0			; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 4
	; CHECK-NEXT: store <vscale x 4 x i32> [[BROADCAST_SPLAT]], ptr [[TMP6]], align 4			; CHECK-NEXT: [[TMP7:%.*]] = add i64 [[TMP6]], 0
	; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()			; CHECK-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 1
	; CHECK-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 4			; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX1]], [[TMP8]]
	; CHECK-NEXT: [[INDEX_NEXT2]] = add nuw i64 [[INDEX1]], [[TMP8]]			; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, ptr [[PTR:%.*]], i64 [[TMP4]]
	; CHECK-NEXT: [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]			; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, ptr [[PTR]], i64 [[TMP9]]
	; CHECK-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP22:![0-9]+]]			; CHECK-NEXT: [[TMP12:%.*]] = getelementptr i32, ptr [[TMP10]], i32 0
				; CHECK-NEXT: store <vscale x 4 x i32> [[BROADCAST_SPLAT]], ptr [[TMP12]], align 4
				david-arm: I was confused at first, but actually this looks like an improvement - nice! I think before we were still using an interleave count of 1 as if we were tail-folding, but now we've fallen back on the non-tail-folding plan that uses interleaving.
				; CHECK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP14:%.*]] = mul i64 [[TMP13]], 4
				; CHECK-NEXT: [[TMP15:%.*]] = getelementptr i32, ptr [[TMP10]], i64 [[TMP14]]
				; CHECK-NEXT: store <vscale x 4 x i32> [[BROADCAST_SPLAT3]], ptr [[TMP15]], align 4
				; CHECK-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 8
				; CHECK-NEXT: [[INDEX_NEXT4]] = add nuw i64 [[INDEX1]], [[TMP17]]
				; CHECK-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT4]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP22:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]
	; CHECK-NEXT: br i1 [[CMP_N]], label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
	; CHECK-NEXT: br label [[WHILE_BODY:%.*]]			; CHECK-NEXT: br label [[WHILE_BODY:%.*]]
	; CHECK: while.body:			; CHECK: while.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ [[INDEX_NEXT:%.*]], [[WHILE_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ [[INDEX_NEXT:%.*]], [[WHILE_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
	… 30 lines omitted …
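The updated CHECK lines for simple_memset_trip1024 show the non-tail-folded plan with VF = vscale x 4 and an interleave count of 2 (two stores per iteration): the step becomes vscale * 8, the vector trip count N_VEC is 1024 rounded down to a multiple of that step, and any remainder runs in the scalar loop. A small arithmetic sketch of that bookkeeping follows. The struct and function names are hypothetical, and vscale is passed in as a plain integer for illustration, although on SVE hardware it is only known at run time (llvm.vscale).

```cpp
#include <cstdint>

// Hypothetical model of the vector-loop bookkeeping for a non-tail-folded
// plan with VF = vscale x MinVF, interleaved IC times.
struct VectorLoopShape {
  std::uint64_t Step;            // elements processed per vector iteration
  std::uint64_t VectorTripCount; // N_VEC: iterations covered by the vector loop
  std::uint64_t Remainder;       // iterations left for the scalar loop
  bool SkipVectorLoop;           // MIN_ITERS_CHECK: TripCount < Step
};

// Assumes VScale, MinVF and IC are all non-zero.
VectorLoopShape shapeFor(std::uint64_t TripCount, std::uint64_t VScale,
                         std::uint64_t MinVF, std::uint64_t IC) {
  VectorLoopShape S;
  S.Step = VScale * MinVF * IC;            // e.g. vscale * 4 * 2 = vscale * 8
  S.SkipVectorLoop = TripCount < S.Step;   // icmp ult TripCount, Step
  S.Remainder = TripCount % S.Step;        // N_MOD_VF = urem
  S.VectorTripCount = TripCount - S.Remainder; // N_VEC = sub
  return S;
}
```

With vscale = 1 and a trip count of 1024, the step is 8 and the vector loop covers all 1024 iterations, so middle.block branches straight to the exit. With a trip count that is not a multiple of the step, the remainder is executed by the scalar loop entered through scalar.ph.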

llvm/test/Transforms/LoopVectorize/AArch64/tail-folding-styles.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=none < %s | FileCheck %s --check-prefix=NONE
	; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=data < %s | FileCheck %s --check-prefix=DATA			; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=data < %s | FileCheck %s --check-prefix=DATA
	; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=data-without-lane-mask < %s | FileCheck %s --check-prefix=DATA_NO_LANEMASK			; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=data-without-lane-mask < %s | FileCheck %s --check-prefix=DATA_NO_LANEMASK
	; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=data-and-control < %s | FileCheck %s --check-prefix=DATA_AND_CONTROL			; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=data-and-control < %s | FileCheck %s --check-prefix=DATA_AND_CONTROL
	; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=data-and-control-without-rt-check < %s | FileCheck %s --check-prefix=DATA_AND_CONTROL_NO_RT_CHECK			; RUN: opt -S -passes=loop-vectorize -force-tail-folding-style=data-and-control-without-rt-check < %s | FileCheck %s --check-prefix=DATA_AND_CONTROL_NO_RT_CHECK

	target triple = "aarch64-unknown-linux-gnu"			target triple = "aarch64-unknown-linux-gnu"

	; Test the different tail folding styles.			; Test the different tail folding styles.

	define void @simple_memset_tailfold(i32 %val, ptr %ptr, i64 %n) "target-features" = "+sve" {			define void @simple_memset_tailfold(i32 %val, ptr %ptr, i64 %n) "target-features" = "+sve" {
	; NONE-LABEL: @simple_memset_tailfold(
	; NONE-NEXT: entry:
	; NONE-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
	; NONE-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
	; NONE-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
	; NONE-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[UMAX]], [[TMP1]]
	; NONE-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; NONE: vector.ph:
	; NONE-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
	; NONE-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
	; NONE-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[UMAX]], [[TMP3]]
	; NONE-NEXT: [[N_VEC:%.*]] = sub i64 [[UMAX]], [[N_MOD_VF]]
	; NONE-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i64 0
	; NONE-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
	; NONE-NEXT: br label [[VECTOR_BODY:%.*]]
	; NONE: vector.body:
	; NONE-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VECTOR_BODY]] ]
	; NONE-NEXT: [[TMP4:%.*]] = add i64 [[INDEX1]], 0
	; NONE-NEXT: [[TMP5:%.*]] = getelementptr i32, ptr [[PTR:%.*]], i64 [[TMP4]]
	; NONE-NEXT: [[TMP6:%.*]] = getelementptr i32, ptr [[TMP5]], i32 0
	; NONE-NEXT: store <vscale x 4 x i32> [[BROADCAST_SPLAT]], ptr [[TMP6]], align 4
	; NONE-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
	; NONE-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 4
	; NONE-NEXT: [[INDEX_NEXT2]] = add nuw i64 [[INDEX1]], [[TMP8]]
	; NONE-NEXT: [[TMP9:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]
	; NONE-NEXT: br i1 [[TMP9]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
	; NONE: middle.block:
	; NONE-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[UMAX]], [[N_VEC]]
	; NONE-NEXT: br i1 [[CMP_N]], label [[WHILE_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	; NONE: scalar.ph:
	; NONE-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
	; NONE-NEXT: br label [[WHILE_BODY:%.*]]
	; NONE: while.body:
	; NONE-NEXT: [[INDEX:%.*]] = phi i64 [ [[INDEX_NEXT:%.*]], [[WHILE_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
	; NONE-NEXT: [[GEP:%.*]] = getelementptr i32, ptr [[PTR]], i64 [[INDEX]]
	; NONE-NEXT: store i32 [[VAL]], ptr [[GEP]], align 4
	; NONE-NEXT: [[INDEX_NEXT]] = add nsw i64 [[INDEX]], 1
	; NONE-NEXT: [[CMP10:%.*]] = icmp ult i64 [[INDEX_NEXT]], [[N]]
	; NONE-NEXT: br i1 [[CMP10]], label [[WHILE_BODY]], label [[WHILE_END_LOOPEXIT]], !llvm.loop [[LOOP3:![0-9]+]]
	; NONE: while.end.loopexit:
	; NONE-NEXT: ret void
	;
	; DATA-LABEL: @simple_memset_tailfold(			; DATA-LABEL: @simple_memset_tailfold(
	; DATA-NEXT: entry:			; DATA-NEXT: entry:
	; DATA-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)			; DATA-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
	; DATA-NEXT: [[TMP0:%.*]] = sub i64 -1, [[UMAX]]			; DATA-NEXT: [[TMP0:%.*]] = sub i64 -1, [[UMAX]]
	; DATA-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()			; DATA-NEXT: [[TMP1:%.*]] = call i64 @llvm.vscale.i64()
	; DATA-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 4			; DATA-NEXT: [[TMP2:%.*]] = mul i64 [[TMP1]], 4
	; DATA-NEXT: [[TMP3:%.*]] = icmp ult i64 [[TMP0]], [[TMP2]]			; DATA-NEXT: [[TMP3:%.*]] = icmp ult i64 [[TMP0]], [[TMP2]]
	; DATA-NEXT: br i1 [[TMP3]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; DATA-NEXT: br i1 [[TMP3]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	… 215 lines omitted …

llvm/test/Transforms/LoopVectorize/ARM/mve-known-trip-count.ll

; RUN: opt -passes=loop-vectorize -debug-only=loop-vectorize -disable-output < %s 2>&1 | FileCheck %s		; RUN: opt -passes=loop-vectorize -debug-only=loop-vectorize -disable-output < %s 2>&1 | FileCheck %s
; REQUIRES: asserts		; REQUIRES: asserts

target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"		target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "thumbv8.1m.main-arm-none-eabi"		target triple = "thumbv8.1m.main-arm-none-eabi"

; Trip count of 5 - shouldn't be vectorized.		; Trip count of 5 - shouldn't be vectorized.
; CHECK-LABEL: tripcount5		; CHECK-LABEL: tripcount5
; CHECK: LV: Selecting VF: 1		; CHECK: LV: Selecting Tail folded VF: 1
define void @tripcount5(ptr nocapture readonly %in, ptr nocapture %out, ptr nocapture readonly %consts, i32 %n) #0 {		define void @tripcount5(ptr nocapture readonly %in, ptr nocapture %out, ptr nocapture readonly %consts, i32 %n) #0 {
entry:		entry:
%arrayidx20 = getelementptr inbounds i32, ptr %out, i32 1		%arrayidx20 = getelementptr inbounds i32, ptr %out, i32 1
%arrayidx38 = getelementptr inbounds i32, ptr %out, i32 2		%arrayidx38 = getelementptr inbounds i32, ptr %out, i32 2
%arrayidx56 = getelementptr inbounds i32, ptr %out, i32 3		%arrayidx56 = getelementptr inbounds i32, ptr %out, i32 3
%arrayidx74 = getelementptr inbounds i32, ptr %out, i32 4		%arrayidx74 = getelementptr inbounds i32, ptr %out, i32 4
%arrayidx92 = getelementptr inbounds i32, ptr %out, i32 5		%arrayidx92 = getelementptr inbounds i32, ptr %out, i32 5
%arrayidx110 = getelementptr inbounds i32, ptr %out, i32 6		%arrayidx110 = getelementptr inbounds i32, ptr %out, i32 6
… 365 lines omitted …	for.body: ; preds = %entry, %for.body
%add138 = add nsw i32 %mul136, %add129		%add138 = add nsw i32 %mul136, %add129
%add139 = add nuw nsw i32 %hop.0236, 16		%add139 = add nuw nsw i32 %hop.0236, 16
%cmp = icmp ult i32 %hop.0236, 112		%cmp = icmp ult i32 %hop.0236, 112
br i1 %cmp, label %for.body, label %for.cond.cleanup		br i1 %cmp, label %for.body, label %for.cond.cleanup
}		}

; Larger example with predication that should also not be vectorized		; Larger example with predication that should also not be vectorized
; CHECK-LABEL: predicated		; CHECK-LABEL: predicated
; CHECK: LV: Selecting VF: 1		; CHECK: LV: Selecting Tail folded VF: 1
; CHECK: LV: Selecting VF: 1		; CHECK: LV: Selecting Tail folded VF: 1
define dso_local i32 @predicated(i32 noundef %0, ptr %glob) #0 {		define dso_local i32 @predicated(i32 noundef %0, ptr %glob) #0 {
%2 = alloca [101 x i32], align 4		%2 = alloca [101 x i32], align 4
%3 = alloca [21 x i32], align 4		%3 = alloca [21 x i32], align 4
call void @llvm.lifetime.start.p0(i64 404, ptr nonnull %2)		call void @llvm.lifetime.start.p0(i64 404, ptr nonnull %2)
call void @llvm.lifetime.start.p0(i64 84, ptr nonnull %3)		call void @llvm.lifetime.start.p0(i64 84, ptr nonnull %3)
%4 = icmp sgt i32 %0, 0		%4 = icmp sgt i32 %0, 0
br i1 %4, label %5, label %159		br i1 %4, label %5, label %159

[198 lines not shown]

llvm/test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll

[127 lines not shown]
	entry:			entry:
	%cmp5 = icmp sgt i32 %N, 0			%cmp5 = icmp sgt i32 %N, 0
	br i1 %cmp5, label %while.body.preheader, label %while.end			br i1 %cmp5, label %while.body.preheader, label %while.end

	while.body.preheader:			while.body.preheader:
	br label %while.body			br label %while.body

	while.body:			while.body:
	%N.addr.09 = phi i32 [ %dec, %while.body ], [ 2049, %while.body.preheader ]			%N.addr.09 = phi i32 [ %dec, %while.body ], [ 2051, %while.body.preheader ]
	%c.addr.08 = phi i8* [ %incdec.ptr4, %while.body ], [ %c, %while.body.preheader ]			%c.addr.08 = phi i8* [ %incdec.ptr4, %while.body ], [ %c, %while.body.preheader ]
	%b.addr.07 = phi i8* [ %incdec.ptr1, %while.body ], [ %b, %while.body.preheader ]			%b.addr.07 = phi i8* [ %incdec.ptr1, %while.body ], [ %b, %while.body.preheader ]
	%a.addr.06 = phi i8* [ %incdec.ptr, %while.body ], [ %a, %while.body.preheader ]			%a.addr.06 = phi i8* [ %incdec.ptr, %while.body ], [ %a, %while.body.preheader ]
	%dec = add nsw i32 %N.addr.09, -1			%dec = add nsw i32 %N.addr.09, -1
	%incdec.ptr = getelementptr inbounds i8, i8* %a.addr.06, i32 1			%incdec.ptr = getelementptr inbounds i8, i8* %a.addr.06, i32 1
	%0 = load i8, i8* %a.addr.06, align 1			%0 = load i8, i8* %a.addr.06, align 1
	%incdec.ptr1 = getelementptr inbounds i8, i8* %b.addr.07, i32 1			%incdec.ptr1 = getelementptr inbounds i8, i8* %b.addr.07, i32 1
	%1 = load i8, i8* %b.addr.07, align 1			%1 = load i8, i8* %b.addr.07, align 1
[294 lines not shown]

llvm/test/Transforms/LoopVectorize/ARM/tail-folding-reduces-vf.ll

	; RUN: opt -opaque-pointers=0 < %s -mattr=+mve,+mve.fp -passes=loop-vectorize -tail-predication=disabled -S | FileCheck %s --check-prefixes=DEFAULT			; RUN: opt -opaque-pointers=0 < %s -mattr=+mve,+mve.fp -passes=loop-vectorize -tail-predication=disabled -S | FileCheck %s --check-prefixes=DEFAULT
				; RUN: opt -opaque-pointers=0 < %s -mattr=+mve,+mve.fp -passes=loop-vectorize -prefer-predicate-over-epilogue=predicate-dont-vectorize -S | FileCheck %s --check-prefixes=TAILPRED
	; RUN: opt -opaque-pointers=0 < %s -mattr=+mve,+mve.fp -passes=loop-vectorize -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -S | FileCheck %s --check-prefixes=TAILPRED			; RUN: opt -opaque-pointers=0 < %s -mattr=+mve,+mve.fp -passes=loop-vectorize -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue -S | FileCheck %s --check-prefixes=TAILPRED

	target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"			target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
	target triple = "thumbv8.1m.main-arm-none-eabi"			target triple = "thumbv8.1m.main-arm-none-eabi"

	; When TP is disabled, this test can vectorize with a VF of 16.			; When TP is disabled, this test can vectorize with a VF of 16.
	; When TP is enabled, this test should vectorize with a VF of 8.			; When TP is enabled, this test should vectorize with a VF of 8.
				; When both are allowed, the VF=8 with tail folding should win out.
	;			;
	; DEFAULT: load <16 x i8>, <16 x i8>*			; DEFAULT: load <16 x i8>, <16 x i8>*
	; DEFAULT: sext <16 x i8> %{{.*}} to <16 x i16>			; DEFAULT: sext <16 x i8> %{{.*}} to <16 x i16>
	; DEFAULT: add <16 x i16>			; DEFAULT: add <16 x i16>
	; DEFAULT-NOT: llvm.masked.load			; DEFAULT-NOT: llvm.masked.load
	; DEFAULT-NOT: llvm.masked.store			; DEFAULT-NOT: llvm.masked.store
	;			;
	; TAILPRED: llvm.masked.load.v8i8.p0v8i8			; TAILPRED: llvm.masked.load.v8i8.p0v8i8
[95 lines not shown]

llvm/test/Transforms/LoopVectorize/PowerPC/reg-usage.ll

; RUN: opt < %s -debug-only=loop-vectorize -passes='function(loop-vectorize),default<O2>' -vectorizer-maximize-bandwidth -mtriple=powerpc64-unknown-linux -S -mcpu=pwr8 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-PWR8		; RUN: opt < %s -debug-only=loop-vectorize -passes='function(loop-vectorize),default<O2>' -vectorizer-maximize-bandwidth -mtriple=powerpc64-unknown-linux -S -mcpu=pwr8 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-PWR8
; RUN: opt < %s -debug-only=loop-vectorize -passes='function(loop-vectorize),default<O2>' -vectorizer-maximize-bandwidth -mtriple=powerpc64le-unknown-linux -S -mcpu=pwr9 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-PWR9		; RUN: opt < %s -debug-only=loop-vectorize -passes='function(loop-vectorize),default<O2>' -vectorizer-maximize-bandwidth -mtriple=powerpc64le-unknown-linux -S -mcpu=pwr9 2>&1 | FileCheck %s --check-prefixes=CHECK,CHECK-PWR9
; REQUIRES: asserts		; REQUIRES: asserts

@a = global [1024 x i8] zeroinitializer, align 16		@a = global [1024 x i8] zeroinitializer, align 16
@b = global [1024 x i8] zeroinitializer, align 16		@b = global [1024 x i8] zeroinitializer, align 16

define i32 @foo() {		define i32 @foo() {
; CHECK-LABEL: foo		; CHECK-LABEL: foo

; CHECK-PWR8: Executing best plan with VF=16, UF=4		; CHECK-PWR8: Executing best plan with TailFold=false, VF=16, UF=4

; CHECK-PWR9: Executing best plan with VF=8, UF=8		; CHECK-PWR9: Executing best plan with TailFold=false, VF=8, UF=8


entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup:		for.cond.cleanup:
%add.lcssa = phi i32 [ %add, %for.body ]		%add.lcssa = phi i32 [ %add, %for.body ]
ret i32 %add.lcssa		ret i32 %add.lcssa
[19 lines not shown]

define i32 @goo() {		define i32 @goo() {
; For indvars.iv used in a computating chain only feeding into getelementptr or cmp,		; For indvars.iv used in a computating chain only feeding into getelementptr or cmp,
; it will not have vector version and the vector register usage will not exceed the		; it will not have vector version and the vector register usage will not exceed the
; available vector register number.		; available vector register number.

; CHECK-LABEL: goo		; CHECK-LABEL: goo

; CHECK: Executing best plan with VF=16, UF=4		; CHECK: Executing best plan with TailFold=false, VF=16, UF=4

entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup: ; preds = %for.body		for.cond.cleanup: ; preds = %for.body
%add.lcssa = phi i32 [ %add, %for.body ]		%add.lcssa = phi i32 [ %add, %for.body ]
ret i32 %add.lcssa		ret i32 %add.lcssa

[16 lines not shown]
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond = icmp eq i64 %indvars.iv.next, 1024		%exitcond = icmp eq i64 %indvars.iv.next, 1024
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body
}		}

define i64 @bar(ptr nocapture %a) {		define i64 @bar(ptr nocapture %a) {
; CHECK-LABEL: bar		; CHECK-LABEL: bar

; CHECK: Executing best plan with VF=2, UF=12		; CHECK: Executing best plan with TailFold=false, VF=2, UF=12

entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup:		for.cond.cleanup:
%add2.lcssa = phi i64 [ %add2, %for.body ]		%add2.lcssa = phi i64 [ %add2, %for.body ]
ret i64 %add2.lcssa		ret i64 %add2.lcssa

[11 lines not shown]
}		}

@d = external global [0 x i64], align 8		@d = external global [0 x i64], align 8
@e = external global [0 x i32], align 4		@e = external global [0 x i32], align 4
@c = external global [0 x i32], align 4		@c = external global [0 x i32], align 4

define void @hoo(i32 %n) {		define void @hoo(i32 %n) {
; CHECK-LABEL: hoo		; CHECK-LABEL: hoo
; CHECK: Executing best plan with VF=1, UF=12		; CHECK: Executing best plan with TailFold=false, VF=1, UF=12

entry:		entry:
br label %for.body		br label %for.body

for.body: ; preds = %for.body, %entry		for.body: ; preds = %for.body, %entry
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]		%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
%arrayidx = getelementptr inbounds [0 x i64], ptr @d, i64 0, i64 %indvars.iv		%arrayidx = getelementptr inbounds [0 x i64], ptr @d, i64 0, i64 %indvars.iv
%tmp = load i64, ptr %arrayidx, align 8		%tmp = load i64, ptr %arrayidx, align 8
[163 lines not shown]

llvm/test/Transforms/LoopVectorize/RISCV/riscv-vector-reverse.ll

[106 lines not shown]
	; CHECK-NEXT: LV: The target has 32 registers of RISCV::VRRC register class			; CHECK-NEXT: LV: The target has 32 registers of RISCV::VRRC register class
	; CHECK-NEXT: LV: Loop cost is 25			; CHECK-NEXT: LV: Loop cost is 25
	; CHECK-NEXT: LV: IC is 1			; CHECK-NEXT: LV: IC is 1
	; CHECK-NEXT: LV: VF is vscale x 4			; CHECK-NEXT: LV: VF is vscale x 4
	; CHECK-NEXT: LV: Not Interleaving.			; CHECK-NEXT: LV: Not Interleaving.
	; CHECK-NEXT: LV: Interleaving is not beneficial.			; CHECK-NEXT: LV: Interleaving is not beneficial.
	; CHECK-NEXT: LV: Found a vectorizable loop (vscale x 4) in <stdin>			; CHECK-NEXT: LV: Found a vectorizable loop (vscale x 4) in <stdin>
	; CHECK-NEXT: LEV: Epilogue vectorization is not profitable for this loop			; CHECK-NEXT: LEV: Epilogue vectorization is not profitable for this loop
	; CHECK-NEXT: Executing best plan with VF=vscale x 4, UF=1			; CHECK-NEXT: Executing best plan with TailFold=false, VF=vscale x 4, UF=1
	; CHECK-NEXT: LV: Interleaving disabled by the pass manager			; CHECK-NEXT: LV: Interleaving disabled by the pass manager
	;			;
	entry:			entry:
	%cmp7 = icmp sgt i32 %n, 0			%cmp7 = icmp sgt i32 %n, 0
	br i1 %cmp7, label %for.body.preheader, label %for.cond.cleanup			br i1 %cmp7, label %for.body.preheader, label %for.cond.cleanup

	for.body.preheader: ; preds = %entry			for.body.preheader: ; preds = %entry
	%0 = zext i32 %n to i64			%0 = zext i32 %n to i64
[115 lines not shown]
	; CHECK-NEXT: LV: The target has 32 registers of RISCV::VRRC register class			; CHECK-NEXT: LV: The target has 32 registers of RISCV::VRRC register class
	; CHECK-NEXT: LV: Loop cost is 25			; CHECK-NEXT: LV: Loop cost is 25
	; CHECK-NEXT: LV: IC is 1			; CHECK-NEXT: LV: IC is 1
	; CHECK-NEXT: LV: VF is vscale x 4			; CHECK-NEXT: LV: VF is vscale x 4
	; CHECK-NEXT: LV: Not Interleaving.			; CHECK-NEXT: LV: Not Interleaving.
	; CHECK-NEXT: LV: Interleaving is not beneficial.			; CHECK-NEXT: LV: Interleaving is not beneficial.
	; CHECK-NEXT: LV: Found a vectorizable loop (vscale x 4) in <stdin>			; CHECK-NEXT: LV: Found a vectorizable loop (vscale x 4) in <stdin>
	; CHECK-NEXT: LEV: Epilogue vectorization is not profitable for this loop			; CHECK-NEXT: LEV: Epilogue vectorization is not profitable for this loop
	; CHECK-NEXT: Executing best plan with VF=vscale x 4, UF=1			; CHECK-NEXT: Executing best plan with TailFold=false, VF=vscale x 4, UF=1
	; CHECK-NEXT: LV: Interleaving disabled by the pass manager			; CHECK-NEXT: LV: Interleaving disabled by the pass manager
	;			;
	entry:			entry:
	%cmp7 = icmp sgt i32 %n, 0			%cmp7 = icmp sgt i32 %n, 0
	br i1 %cmp7, label %for.body.preheader, label %for.cond.cleanup			br i1 %cmp7, label %for.body.preheader, label %for.cond.cleanup

	for.body.preheader: ; preds = %entry			for.body.preheader: ; preds = %entry
	%0 = zext i32 %n to i64			%0 = zext i32 %n to i64
[25 lines not shown]

llvm/test/Transforms/LoopVectorize/first-order-recurrence-sink-replicate-region.ll

; REQUIRES: asserts		; REQUIRES: asserts
; RUN: opt < %s -passes=loop-vectorize -force-vector-width=2 -force-vector-interleave=1 -force-widen-divrem-via-safe-divisor=0 -disable-output -debug-only=loop-vectorize 2>&1 | FileCheck %s		; RUN: opt < %s -passes=loop-vectorize -force-vector-width=2 -force-vector-interleave=1 -force-widen-divrem-via-safe-divisor=0 -disable-output -debug-only=loop-vectorize 2>&1 | FileCheck %s

target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"		target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"

; Test cases for PR50009, which require sinking a replicate-region due to a		; Test cases for PR50009, which require sinking a replicate-region due to a
; first-order recurrence.		; first-order recurrence.

define void @sink_replicate_region_1(i32 %x, ptr %ptr, ptr noalias %dst) optsize {		define void @sink_replicate_region_1(i32 %x, ptr %ptr, ptr noalias %dst) optsize {
; CHECK-LABEL: sink_replicate_region_1		; CHECK-LABEL: sink_replicate_region_1
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[72 lines not shown]
br i1 %ec, label %exit, label %loop		br i1 %ec, label %exit, label %loop

exit:		exit:
ret void		ret void
}		}

define void @sink_replicate_region_2(i32 %x, i8 %y, ptr %ptr) optsize {		define void @sink_replicate_region_2(i32 %x, i8 %y, ptr %ptr) optsize {
; CHECK-LABEL: sink_replicate_region_2		; CHECK-LABEL: sink_replicate_region_2
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[51 lines not shown]
br i1 %ec, label %exit, label %loop		br i1 %ec, label %exit, label %loop

exit:		exit:
ret void		ret void
}		}

define i32 @sink_replicate_region_3_reduction(i32 %x, i8 %y, ptr %ptr) optsize {		define i32 @sink_replicate_region_3_reduction(i32 %x, i8 %y, ptr %ptr) optsize {
; CHECK-LABEL: sink_replicate_region_3_reduction		; CHECK-LABEL: sink_replicate_region_3_reduction
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[56 lines not shown]
%res = phi i32 [ %and.red.next, %loop ]		%res = phi i32 [ %and.red.next, %loop ]
ret i32 %res		ret i32 %res
}		}

; To sink the replicate region containing %rem, we need to split the block		; To sink the replicate region containing %rem, we need to split the block
; containing %conv at the end, because %conv is the last recipe in the block.		; containing %conv at the end, because %conv is the last recipe in the block.
define void @sink_replicate_region_4_requires_split_at_end_of_block(i32 %x, ptr %ptr, ptr noalias %dst) optsize {		define void @sink_replicate_region_4_requires_split_at_end_of_block(i32 %x, ptr %ptr, ptr noalias %dst) optsize {
; CHECK-LABEL: sink_replicate_region_4_requires_split_at_end_of_block		; CHECK-LABEL: sink_replicate_region_4_requires_split_at_end_of_block
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[80 lines not shown]

exit:		exit:
ret void		ret void
}		}

; Test case that requires sinking a recipe in a replicate region after another replicate region.		; Test case that requires sinking a recipe in a replicate region after another replicate region.
define void @sink_replicate_region_after_replicate_region(ptr %ptr, ptr noalias %dst.2, i32 %x, i8 %y) optsize {		define void @sink_replicate_region_after_replicate_region(ptr %ptr, ptr noalias %dst.2, i32 %x, i8 %y) optsize {
; CHECK-LABEL: sink_replicate_region_after_replicate_region		; CHECK-LABEL: sink_replicate_region_after_replicate_region
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[56 lines not shown]
br i1 %C, label %exit, label %loop		br i1 %C, label %exit, label %loop

exit: ; preds = %loop		exit: ; preds = %loop
ret void		ret void
}		}

define void @need_new_block_after_sinking_pr56146(i32 %x, ptr %src, ptr noalias %dst) {		define void @need_new_block_after_sinking_pr56146(i32 %x, ptr %src, ptr noalias %dst) {
; CHECK-LABEL: need_new_block_after_sinking_pr56146		; CHECK-LABEL: need_new_block_after_sinking_pr56146
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[55 lines not shown]

llvm/test/Transforms/LoopVectorize/icmp-uniforms.ll

[30 lines not shown]

	for.end:			for.end:
	%tmp4 = phi i32 [ %tmp3, %for.body ]			%tmp4 = phi i32 [ %tmp3, %for.body ]
	ret i32 %tmp4			ret i32 %tmp4
	}			}

	; Check for crash exposed by D76992.			; Check for crash exposed by D76992.
	; CHECK-LABEL: 'test'			; CHECK-LABEL: 'test'
	; CHECK: VPlan 'Initial VPlan for VF={4},UF>=1' {			; CHECK: VPlan 'Initial VPlan for Tail Folded VF={4},UF>=1' {
	; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count			; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
	; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count			; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
	; CHECK-EMPTY:			; CHECK-EMPTY:
	; CHECK-NEXT: vector.ph:			; CHECK-NEXT: vector.ph:
	; CHECK-NEXT: Successor(s): vector loop			; CHECK-NEXT: Successor(s): vector loop
	; CHECK-EMPTY:			; CHECK-EMPTY:
	; CHECK-NEXT: <x1> vector loop: {			; CHECK-NEXT: <x1> vector loop: {
	; CHECK-NEXT: vector.body:			; CHECK-NEXT: vector.body:
[45 lines not shown]

llvm/test/Transforms/LoopVectorize/vplan-sink-scalars-and-merge.ll

; REQUIRES: asserts		; REQUIRES: asserts

; RUN: opt -passes=loop-vectorize -force-vector-interleave=1 -force-vector-width=2 -debug -disable-output %s 2>&1 | FileCheck %s		; RUN: opt -passes=loop-vectorize -force-vector-interleave=1 -force-vector-width=2 -debug -disable-output %s 2>&1 | FileCheck %s

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"		target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"

@a = common global [2048 x i32] zeroinitializer, align 16		@a = common global [2048 x i32] zeroinitializer, align 16
@b = common global [2048 x i32] zeroinitializer, align 16		@b = common global [2048 x i32] zeroinitializer, align 16
@c = common global [2048 x i32] zeroinitializer, align 16		@c = common global [2048 x i32] zeroinitializer, align 16


; CHECK-LABEL: LV: Checking a loop in 'sink1'		; CHECK-LABEL: LV: Checking a loop in 'sink1'
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[46 lines not shown]
%realexit = or i1 %large, %exitcond		%realexit = or i1 %large, %exitcond
br i1 %realexit, label %exit, label %loop		br i1 %realexit, label %exit, label %loop

exit:		exit:
ret void		ret void
}		}

; CHECK-LABEL: LV: Checking a loop in 'sink2'		; CHECK-LABEL: LV: Checking a loop in 'sink2'
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[61 lines not shown]
%realexit = or i1 %large, %exitcond		%realexit = or i1 %large, %exitcond
br i1 %realexit, label %exit, label %loop		br i1 %realexit, label %exit, label %loop

exit:		exit:
ret void		ret void
}		}

; CHECK-LABEL: LV: Checking a loop in 'sink3'		; CHECK-LABEL: LV: Checking a loop in 'sink3'
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[63 lines not shown]

exit:		exit:
ret void		ret void
}		}

; Make sure we do not sink uniform instructions.		; Make sure we do not sink uniform instructions.
define void @uniform_gep(i64 %k, ptr noalias %A, ptr noalias %B) {		define void @uniform_gep(i64 %k, ptr noalias %A, ptr noalias %B) {
; CHECK-LABEL: LV: Checking a loop in 'uniform_gep'		; CHECK-LABEL: LV: Checking a loop in 'uniform_gep'
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[52 lines not shown]
br i1 %cmp179, label %loop, label %exit		br i1 %cmp179, label %loop, label %exit
exit:		exit:
ret void		ret void
}		}

; Loop with predicated load.		; Loop with predicated load.
define void @pred_cfg1(i32 %k, i32 %j) {		define void @pred_cfg1(i32 %k, i32 %j) {
; CHECK-LABEL: LV: Checking a loop in 'pred_cfg1'		; CHECK-LABEL: LV: Checking a loop in 'pred_cfg1'
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[77 lines not shown]
exit:		exit:
ret void		ret void
}		}

; Loop with predicated load and store in separate blocks, store depends on		; Loop with predicated load and store in separate blocks, store depends on
; loaded value.		; loaded value.
define void @pred_cfg2(i32 %k, i32 %j) {		define void @pred_cfg2(i32 %k, i32 %j) {
; CHECK-LABEL: LV: Checking a loop in 'pred_cfg2'		; CHECK-LABEL: LV: Checking a loop in 'pred_cfg2'
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[86 lines not shown]
exit:		exit:
ret void		ret void
}		}

; Loop with predicated load and store in separate blocks, store does not depend		; Loop with predicated load and store in separate blocks, store does not depend
; on loaded value.		; on loaded value.
define void @pred_cfg3(i32 %k, i32 %j) {		define void @pred_cfg3(i32 %k, i32 %j) {
; CHECK-LABEL: LV: Checking a loop in 'pred_cfg3'		; CHECK-LABEL: LV: Checking a loop in 'pred_cfg3'
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[86 lines not shown]
br i1 %realexit, label %exit, label %loop		br i1 %realexit, label %exit, label %loop

exit:		exit:
ret void		ret void
}		}

define void @merge_3_replicate_region(i32 %k, i32 %j) {		define void @merge_3_replicate_region(i32 %k, i32 %j) {
; CHECK-LABEL: LV: Checking a loop in 'merge_3_replicate_region'		; CHECK-LABEL: LV: Checking a loop in 'merge_3_replicate_region'
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[82 lines not shown]

exit:		exit:
ret void		ret void
}		}


define void @update_2_uses_in_same_recipe_in_merged_block(i32 %k) {		define void @update_2_uses_in_same_recipe_in_merged_block(i32 %k) {
; CHECK-LABEL: LV: Checking a loop in 'update_2_uses_in_same_recipe_in_merged_block'		; CHECK-LABEL: LV: Checking a loop in 'update_2_uses_in_same_recipe_in_merged_block'
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
[44 lines not shown]
br i1 %realexit, label %exit, label %loop		br i1 %realexit, label %exit, label %loop

exit:		exit:
ret void		ret void
}		}

define void @recipe_in_merge_candidate_used_by_first_order_recurrence(i32 %k) {		define void @recipe_in_merge_candidate_used_by_first_order_recurrence(i32 %k) {
; CHECK-LABEL: LV: Checking a loop in 'recipe_in_merge_candidate_used_by_first_order_recurrence'		; CHECK-LABEL: LV: Checking a loop in 'recipe_in_merge_candidate_used_by_first_order_recurrence'
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
exit:		exit:
ret void		ret void
}		}

; Test case with a dead GEP between the load and store regions. Dead recipes		; Test case with a dead GEP between the load and store regions. Dead recipes
; need to be removed before merging.		; need to be removed before merging.
define void @merge_with_dead_gep_between_regions(i32 %n, ptr noalias %src, ptr noalias %dst) optsize {		define void @merge_with_dead_gep_between_regions(i32 %n, ptr noalias %src, ptr noalias %dst) optsize {
; CHECK-LABEL: LV: Checking a loop in 'merge_with_dead_gep_between_regions'		; CHECK-LABEL: LV: Checking a loop in 'merge_with_dead_gep_between_regions'
; CHECK: VPlan 'Initial VPlan for VF={2},UF>=1' {		; CHECK: VPlan 'Initial VPlan for Tail Folded VF={2},UF>=1' {
		david-arm: These debug output changes look useful by themselves outside of this patch. Not sure if it's possible to pass in the `FoldTailByMasking` flag in a separate patch?
		dmgreen (author): Hmm. It would involve pulling out VPlan->FoldTailByMasking into a separate patch, and that feels like the core of this patch, to be honest. Pulling it out just for some debug messages that are usually present elsewhere feels like a bit of an odd patch on its own. When you look at the whole debug output, there are already parts explaining whether the CostModel is FoldTailByMasking.
; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count		; CHECK-NEXT: Live-in vp<[[VEC_TC:%.+]]> = vector-trip-count
; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count		; CHECK-NEXT: Live-in vp<[[BTC:%.+]]> = backedge-taken count
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: vector.ph:		; CHECK-NEXT: vector.ph:
; CHECK-NEXT: Successor(s): vector loop		; CHECK-NEXT: Successor(s): vector loop
; CHECK-EMPTY:		; CHECK-EMPTY:
; CHECK-NEXT: <x1> vector loop: {		; CHECK-NEXT: <x1> vector loop: {
; CHECK-NEXT: vector.body:		; CHECK-NEXT: vector.body:
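The review thread turns on the plan itself carrying the FoldTailByMasking flag, which is what lets the debug name tell the two kinds of plan apart, as in the updated CHECK lines ("Initial VPlan for Tail Folded VF={2},UF>=1"). The following is a minimal C++ sketch of that naming, assuming a hypothetical `PlanSketch` struct and `planName` helper; neither is LLVM API nor the code from this patch, and the real implementation may format multiple VFs differently.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a VPlan: only the fields needed to build its name.
struct PlanSketch {
  bool FoldTailByMasking;     // assumed per-plan flag, as proposed in this review
  std::vector<unsigned> VFs;  // vectorization factors the plan covers
};

// Builds a debug name in the style matched by the CHECK lines in this test.
// Tail-folded plans get a "Tail Folded " prefix before the VF list.
std::string planName(const PlanSketch &P) {
  assert(!P.VFs.empty() && "a plan covers at least one VF");
  std::ostringstream OS;
  OS << "Initial VPlan for ";
  if (P.FoldTailByMasking)
    OS << "Tail Folded ";
  OS << "VF={";
  for (std::size_t I = 0; I < P.VFs.size(); ++I) {
    if (I != 0)
      OS << ",";  // comma-separated VF list
    OS << P.VFs[I];
  }
  OS << "},UF>=1";
  return OS.str();
}
```

For example, `planName(PlanSketch{true, {2}})` returns `"Initial VPlan for Tail Folded VF={2},UF>=1"`, while the same plan with the flag cleared returns the old `"Initial VPlan for VF={2},UF>=1"`. Keeping the flag on the plan, rather than on the cost model, is what allows both variants to exist side by side.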

Diff 518200