This is an archive of the discontinued LLVM Phabricator instance.


Differential D79100

[LV] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops
ClosedPublic

Authored by SjoerdMeijer on Apr 29 2020, 9:24 AM.


Details

Reviewers

Ayal
fhahn
samparker
dmgreen
gilr
rengolin
simoll
rkruppe
sdesmalen
rogfer01
efriedma
lebedev.ri

Commits

rG47650451738c: [LV] Emit @llvm.get.active.mask for tail-folded loops

Summary

Tail-predication is a new form of predication in MVE for vector loops that implicitly predicates the last vector loop iteration by implicitly setting active/inactive lanes, i.e. the tail loop is predicated. In order to set up a tail-predicated vector loop, we need to know the number of data elements processed by the vector loop, which corresponds to the trip count of the scalar loop. We would like to propagate the scalar trip count to the backend, so that this can be picked up by the MVE tail-predication pass.

This implements the approach as discussed on the llvm-dev list, see Eli's comment in http://lists.llvm.org/pipermail/llvm-dev/2020-May/141360.html. The approach is based on emitting an intrinsic for deriving the mask. The vectoriser emits this new intrinsic in the vector preheader block when the new TTI hook indicates that the target can lower this intrinsic and that it is desired to do so for this loop. For MVE, we do this when the loop is tail-folded, which is the very first step in tail-predicating a loop. For all the other targets, this intrinsic won't be emitted as the default of the hook is of course not to do this.

This change will be followed up by MVE specific changes to lower this intrinsic.
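
The selection logic described in the summary can be sketched in isolation as follows. This is a simplified, self-contained model and not the LLVM code: `TargetHooks`, `MaskOpcode` and `chooseHeaderMaskOpcode` are hypothetical names introduced for illustration. Only the hook name `emitGetActiveLaneMask` and its default of "do not emit" are taken from this review.

```cpp
#include <cassert>

// Hypothetical stand-in for the two ways the header block mask can be formed.
enum class MaskOpcode { ICmpULE, ActiveLaneMask };

// Hypothetical stand-in for the target hook. The default mirrors the
// "do not emit the intrinsic" default described in the summary.
struct TargetHooks {
  bool emitGetActiveLaneMask = false;
};

// Use the intrinsic only when the loop is tail-folded and the target opts in;
// otherwise fall back to the plain compare.
MaskOpcode chooseHeaderMaskOpcode(bool tailFolded, const TargetHooks &tti) {
  if (tailFolded && tti.emitGetActiveLaneMask)
    return MaskOpcode::ActiveLaneMask;
  return MaskOpcode::ICmpULE;
}
```

Under these assumptions, a target that leaves the hook at its default keeps getting the compare, and an opted-in target still gets the compare for loops that are not tail-folded.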

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

SjoerdMeijer created this revision. · Apr 29 2020, 9:24 AM

Herald added a reviewer: rengolin. · Apr 29 2020, 9:24 AM

Herald added a project: Restricted Project.

Herald added subscribers: jdoerfert, rkruppe, hiraditya.

SjoerdMeijer mentioned this in D79175: [ARM][MVE] Tail-Predication: use @llvm.get.active.lane.mask to get the BTC. · Apr 30 2020, 6:59 AM

samparker added reviewers: simoll, rkruppe, sdesmalen. · Apr 30 2020, 7:19 AM

samparker added a reviewer: rogfer01. · Apr 30 2020, 7:23 AM

This is used in the ARM backend, in our tail-predication pass, see D79175.

added a test case, minor fixes in comments.

This was discussed on the list and implements the approach as suggested by @efriedma :

http://lists.llvm.org/pipermail/llvm-dev/2020-May/141360.html

I.e., we now emit an llvm.get.active.mask intrinsic into the vector body, which is used by the masked loads/stores, and its operands allow the required information to be reconstructed in backend passes. For example, a relevant snippet of a tail-folded vector body looks like this:

%induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
%[[PRED:.*]] = icmp ule <4 x i64> %induction, <i64 429, i64 429, i64 429, i64 429>
%[[MASK:.*]] = call <4 x i1> @llvm.get.active.mask.v4i1.v4i1(<4 x i1> %[[PRED]])
%[[WML1:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}<4 x i1> %[[MASK]]
%[[WML2:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}<4 x i1> %[[MASK]]

The induction and cmp feed the active.mask intrinsic, which is used by the masked loads/stores. This new intrinsic can now be easily picked up in backend passes, and the 429 in this example represents the backedge taken iteration count of the scalar loop, which we need to set up the tail-predication.
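
To make the meaning of the mask concrete, here is a minimal scalar model of the compare in the snippet above. It is a sketch under stated assumptions, not vectoriser code: `activeLaneMask` is a hypothetical helper, `base` stands for the induction value of lane 0, and the model assumes `base + lane` does not wrap.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of LLVM): models the per-lane predicate
// IV[lane] <= BTC for one vector iteration of a tail-folded loop.
// `base` is the scalar induction value of lane 0, `btc` the backedge-taken
// count of the scalar loop, `vf` the vectorization factor.
std::vector<bool> activeLaneMask(uint64_t base, uint64_t btc, unsigned vf) {
  assert(vf > 0 && "vectorization factor must be positive");
  std::vector<bool> mask(vf);
  for (unsigned lane = 0; lane < vf; ++lane)
    mask[lane] = (base + lane) <= btc; // assumes base + lane does not wrap
  return mask;
}
```

With a backedge-taken count of 429 (a scalar trip count of 430) and a vectorization factor of 4, the vector iteration starting at 428 has lanes 428 and 429 active and lanes 430 and 431 inactive, while every earlier iteration has all four lanes active.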

Herald added a subscriber: vkmr. · May 18 2020, 3:57 AM

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

In D79100#2041646, @samparker wrote:

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

That was my initial thought too, so I started drafting an intrinsic that would take the induction step, backedge taken count, and the comparison operator, thus replacing the icmp, and feeding its result into the masked load/stores.
This however, turned out to be a massively invasive change because of the different places where the induction variable is widened which creates the induction step and icmp, and where the masking happens. This change is very minimal, makes explicit exactly the same information, and thus had my preference.

In D79100#2041747, @SjoerdMeijer wrote:

In D79100#2041646, @samparker wrote:

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

That was my initial thought too, so I started drafting an intrinsic that would take the induction step, backedge taken count, and the comparison operator, thus replacing the icmp, and feeding its result into the masked load/stores.
This however, turned out to be a massively invasive change because of the different places where the induction variable is widened which creates the induction step and icmp, and where the masking happens. This change is very minimal, makes explicit exactly the same information, and thus had my preference.

Hmm, but thinking about it, after my initial attempt to do this I got some more VPlan plumbing experience, and I now see that I could try again and that it is perhaps not very different. If it helps for acceptance, I can try this, but I think my previous points still stand that this change is minimal and leaves the IR intact.

Tail-predication is a new form of predication in MVE for vector loops that implicitly predicates the last vector loop iteration by implicitly setting active/inactive lanes, i.e. the tail loop is predicated. In order to set up a tail-predicated vector loop, we need to know the number of data elements processed by the vector loop, which corresponds to the trip count of the scalar loop. We would like to propagate the scalar trip count to the backend, so that this can be picked up by the MVE tail-predication pass.

Just to make sure I understand: is this text still current? Your intrinsic does not seem to propagate the scalar trip count anymore (at least not explicitly). If I understand it correctly, now you communicate the active mask of the current vector iteration which you compute using i <= backedge taken-count. Did I get it right?

llvm/lib/Transforms/Vectorize/VPlan.cpp
386	Minor nit: Here you use `TC` (as in trip count?) but above when you created this `VPInstruction` you used `BTC` (backedge taken count). This might be confusing for a future reader so I'd suggest to stick to your own convention and use `BTC` here as well. This code seems similar to that of `VPRecipeBuilder::createBlockInMask` (except that one creates the mask for a given block and this one creates a mask similar to the case `OrigLoop->getHeader() == BB`). That function uses `BTC` as well.

What are the semantics of llvm.get.active.mask? I don't see an actual description anywhere beyond "it enables tail folding, somehow".

llvm/include/llvm/IR/Intrinsics.td
1420 ↗	(On Diff #264579)	I assume you meant to write LLVMMatchType<0>

vkmr marked an inline comment as done. · May 18 2020, 3:42 PM

vkmr added inline comments.

llvm/lib/Transforms/Vectorize/VPlan.cpp
389–392	Minor nit: Probably cleaner to use `Builder.CreateIntrinsic(Intrinsic::get_active_mask, { V->getType(), V->getType() }, {V}, nullptr, "active.mask");`.

vkmr added inline comments. · May 18 2020, 4:18 PM

llvm/lib/Transforms/Vectorize/VPlan.cpp
425–444	Add `case VPInstruction::ActiveMask` to print the correct VPInstruction when printing VPlan. You can pass the flag `-debug-only=loop-vectorize` to `opt` to see the generated VPlan.

That was my initial thought too, so I started drafting an intrinsic that would take the induction step, backedge taken count, the comparison operator

If the epilogue folding always uses ULE, then the operator wouldn't be needed. I think it would be really nice not to have the IV and upper limit splatted, plus we could remove the vector add on the IV too.

Thanks all for your comments! I think this addresses all comments:

I did a rename of ActiveMask to ActiveLaneMask because that describes it better I think,
The intrinsic now takes 2 arguments: the induction step, and the backedge taken count. It indeed represents the IV <= BTC comparison, thus giving it semantics rather than let it only pass on a value, which I have also clarified with comments at different places,
The VP instruction is now lowered using Builder.CreateIntrinsic, and have added a case to print it.

SjoerdMeijer marked an inline comment as done. · May 19 2020, 7:19 AM

SjoerdMeijer added inline comments.

llvm/include/llvm/IR/Intrinsics.td
1425 ↗	(On Diff #264890)	Ah, forgot to say that I am looking into LLVMMatchType<0>, this was giving me some problems. This is very minor, am looking into it, and in the mean time already wanted to show the important changes.

Semantics are still unspecified. Before adding even more intrinsics,
i'd strongly suggest to specify at least the already-committed ones.
Because as far as i can tell, i don't see anything in langref for any of them.

This revision now requires changes to proceed. · May 19 2020, 7:42 AM

In D79100#2044165, @lebedev.ri wrote:

Semantics are still unspecified. Before adding even more intrinsics,
i'd strongly suggest to specify at least the already-committed ones.
Because as far as i can tell, i don't see anything in langref for any of them.

This was intentional. With the already-committed ones you mean the hardware loops ones, and they are not meant to be user-facing intrinsics. That is, we don't expect user to play around with e.g. the hwloop.decrement intrinsic; at least these are really meant to be generated by the optimisers.
This new intrinsic here is slightly different, in that it probably is useful as a user facing intrinsic, so don't mind documenting it.

In D79100#2044289, @SjoerdMeijer wrote:

In D79100#2044165, @lebedev.ri wrote:

Semantics are still unspecified. Before adding even more intrinsics,
i'd strongly suggest to specify at least the already-committed ones.
Because as far as i can tell, i don't see anything in langref for any of them.

This was intentional. With the already-committed ones you mean the hardware loops ones, and they are not meant to be user-facing intrinsics. That is, we don't expect user to play around with e.g. the hwloop.decrement intrinsic; at least these are really meant to be generated by the optimisers.
This new intrinsic here is slightly different, in that it probably is useful as a user facing intrinsic, so don't mind documenting it.

@SjoerdMeijer, i don't believe that is how langref works.

Ok, perhaps I got that wrong then. @samparker can correct me here perhaps, but as I said, for the hardware loop intrinsics I believe this was intentional. But anyway, as I said, will document this new one. As the hardware loop intrinsics are completely separate from this, I will do that separately.

In D79100#2044530, @SjoerdMeijer wrote:

Ok, perhaps I got that wrong then. @samparker can correct me here perhaps, but as I said, for the hardware loop intrinsics I believe this was intentional. But anyway, as I said, will document this new one. As the hardware loop intrinsics are completely separate from this, I will do that separately.

I think that would still be backwards.
langref is not documentation, it is specification.
To reword, langref is source of truth, not the code.

Understood, and we don't disagree, this will be fixed.

+ LangRef description

efriedma added inline comments. · May 19 2020, 2:54 PM

llvm/docs/LangRef.rst
16198 ↗	(On Diff #265018)	Is this semantically equivalent to icmp ule? If it is, you should probably make that more clear, and explain that it's a hint to the backend. If not, this needs a much more thorough explanation.
llvm/include/llvm/IR/Intrinsics.td
1240 ↗	(On Diff #265018)	Is IntrNoDuplicate here actually semantically significant? The LangRef explanation doesn't really indicate why it needs to be noduplicate. Please use LLVMMatchType/LLVMScalarOrSameVectorWidth to ensure the argument/result types match.

Thanks @efriedma : added a statement about the equivalence to the langref description, and fixed the intrinsic description.

In D79100#2041782, @SjoerdMeijer wrote:

In D79100#2041747, @SjoerdMeijer wrote:

In D79100#2041646, @samparker wrote:

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

That was my initial thought too, so I started drafting an intrinsic that would take the induction step, backedge taken count, and the comparison operator, thus replacing the icmp, and feeding its result into the masked load/stores.
This however, turned out to be a massively invasive change because of the different places where the induction variable is widened which creates the induction step and icmp, and where the masking happens. This change is very minimal, makes explicit exactly the same information, and thus had my preference.

Hmm, but thinking about it, after my initial attempt to do this I got some more VPlan plumbing experience, and I now see that I could try again and that it is perhaps not very different. If it helps for acceptance, I can try this, but I think my previous points still stand that this change is minimal and leaves the IR intact.

The comparison really should be encapsulated in the intrinsic itself because for scalable types it is not clear how many bits the stepvector type needs to enumerate its lanes without overflow:

%lane_ids = <vscale x 1 x i8> llvm.stepvector() ; This will overflow if 'vscale > 256' at runtime (note that a 'stepvector' intrinsic does not even exist at this point)
%lane_mask = icmp ule %lane_ids, (splat %n)

That is not a problem if you have

%lane_mask = <vscale x 1 x i1> llvm.active.mask(i32 0, i32 %n)

We need such an intrinsic anyway for the VP expansion pass to legalize the EVL parameter of VP intrinsics for targets that have scalable types but no active vector length (tail loop predication).
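
The overflow concern above can be illustrated with a small scalar model. This is only a sketch: `maskWithNarrowLaneIds` and `maskFromScalars` are hypothetical helpers, the runtime lane count is passed in as a plain integer, and 8-bit lane ids are chosen to match the `i8` step vector in the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Lane ids held in 8 bits, like the <vscale x 1 x i8> step vector in the
// comment above; ids wrap back to 0 from lane 256 onwards.
std::vector<bool> maskWithNarrowLaneIds(unsigned numLanes, uint32_t n) {
  assert(numLanes > 0);
  std::vector<bool> mask(numLanes);
  for (unsigned lane = 0; lane < numLanes; ++lane) {
    uint8_t id = static_cast<uint8_t>(lane); // truncates modulo 256
    mask[lane] = id <= n;
  }
  return mask;
}

// The compare is done on a scalar base plus the lane number in a wide type,
// so no lane id has to be materialised in a narrow element type.
std::vector<bool> maskFromScalars(unsigned numLanes, uint32_t base, uint32_t n) {
  assert(numLanes > 0);
  std::vector<bool> mask(numLanes);
  for (unsigned lane = 0; lane < numLanes; ++lane)
    mask[lane] = static_cast<uint64_t>(base) + lane <= n;
  return mask;
}
```

With 300 lanes and a bound of 10, the narrow version wrongly marks lane 256 as active because its id wraps to 0, whereas the scalar-based version keeps it inactive.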

This patch, this new intrinsic, is a straightforward translation from an icmp. It provides the required information, and exactly the same as in your example, albeit in a slightly different form. Scalable vectors, stepvector intrinsics, etc., are not used yet by the vectoriser and are not yet defined, respectively, so using that at this point is problematic.

As this is such a straightforward translation, with straightforward semantics, this won't block in any way future developments. In fact, this is the first step in that direction, and it is good we get some experience with it. We offer our support to adapt this approach/intrinsic and port it, should this be necessary, which should be straightforward.

SjoerdMeijer mentioned this in D80316: [HardwareLoops] Intrinsic LangRef descriptions. · May 20 2020, 11:20 AM

I think all Simon is asking for is that the first argument of the intrinsic should be a scalar, equal to the first lane of the vector induction variable. How much work would that be?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6821	It would be nice to explicitly note here that we're assuming the vector factor is a power of two (so the induction variable can't wrap).

In D79100#2047545, @efriedma wrote:

I think all Simon is asking for is that the first argument of the intrinsic should be a scalar, equal to the first lane of the vector induction variable. How much work would that be?

Ah okay, sorry, I missed that. That should hopefully be straightforward, and if we have everyone on board with this, I will fix that right away tomorrow.

Uploading a new revision to see if we can reach a conclusion on this.

The intrinsic now takes two scalar arguments, the first element of the VIV, and the scalar BTC, hopefully in the way you had in mind.

On the dev list, Ayal pointed out an alternative to rediscover the BTC in the backend (I still need to spend time to confirm we then don't miss anything). This intrinsic still would have my preference because of its convenience, and as it is a general concept it seems to support different uses too (vp intrinsics, and it provides a target independent way of describing the active mask).

The intrinsic now takes two scalar arguments, the first element of the VIV, and the scalar BTC, hopefully in the way you had in mind.

Yes, this is what I expected.

On the dev list, Ayal pointed out an alternative to rediscover the BTC in the backend (I still need to spend time to confirm we then don't miss anything). This intrinsic still would have my preference because of its convenience, and as it is a general concept it seems to support different uses too (vp intrinsics, and it provides a target independent way of describing the active mask).

In general, given two expressions that are equivalent, there's always some way to pattern-match. For power-of-two fixed-width vectors, the pattern you'd need to match is not that complicated (basically just an icmp where one operand is a vector induction variable, and the other operand is a splat.) But like you say, the intrinsic is more convenient.
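
As a toy illustration of the pattern described above, the following sketch recovers the bound from the two compare operands when they are given as plain constant vectors. It is not how a backend pass would inspect IR: `recoverBTC` is a hypothetical helper, and it assumes a unit-stride induction vector and a fixed-width vector type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: returns the splatted bound if `iv` looks like a
// unit-stride vector induction variable and `bound` is a splat.
std::optional<uint64_t> recoverBTC(const std::vector<uint64_t> &iv,
                                   const std::vector<uint64_t> &bound) {
  if (iv.empty() || iv.size() != bound.size())
    return std::nullopt;
  for (std::size_t i = 1; i < iv.size(); ++i) {
    if (iv[i] != iv[0] + i)
      return std::nullopt; // not a unit-stride induction vector
    if (bound[i] != bound[0])
      return std::nullopt; // not a splat
  }
  return bound[0];
}
```

The intrinsic makes this recovery unnecessary, since the scalar bound is passed directly as an operand.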

If you're dealing with scalable vectors, expanding the intrinsic to arithmetic would result in a very complex expression due to the potential for overflow, so we'll almost certainly want this intrinsic.

The changes to VPlan.cpp look a little suspicious, but I'm not really familiar with that code, so I'll let someone else review it.

llvm/docs/LangRef.rst
16202 ↗	(On Diff #265969)	Maybe qualify this with the caveat that it's only equivalent if the vector induction variable doesn't overflow. We probably want to return false in the lanes where it overflows.
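
One possible reading of "return false in the lanes where it overflows" is sketched below. This is an assumption about the intended semantics, not the LangRef wording: `activeLaneMaskNoWrap` is a hypothetical helper operating on 32-bit scalars.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical model: lane i is active iff base + i does not wrap around
// in 32 bits and base + i <= btc.
std::vector<bool> activeLaneMaskNoWrap(uint32_t base, uint32_t btc, unsigned vf) {
  assert(vf > 0);
  std::vector<bool> mask(vf);
  for (unsigned lane = 0; lane < vf; ++lane) {
    // base + lane would exceed the 32-bit range for this lane.
    bool wraps = lane > std::numeric_limits<uint32_t>::max() - base;
    mask[lane] = !wraps && (base + lane) <= btc;
  }
  return mask;
}
```

For a base of 0xFFFFFFFE and a bound of 0xFFFFFFFF, a plain wrapping compare would treat lanes 2 and 3 (which wrap to 0 and 1) as active; with the check above they are inactive.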

Thanks Eli for commenting and explaining.

I guess it was the TODO in VPlan.cpp that looked a bit suspicious. But there wasn't much going on here, that was very straightforward, so I fixed that. I.e., Ayal added support for decrementing loops in VPlan recently. The Backedge Taken Count value looks slightly different for these cases, but extracting it is easy, which I have added here. As a result, this now also triggers in test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll, which I updated.

simoll mentioned this in D78203: [VP,Integer,#2] ExpandVectorPredication pass. · May 26 2020, 5:23 AM

In D79100#2054437, @SjoerdMeijer wrote:

Thanks Eli for commenting and explaining.

I guess it was the TODO in VPlan.cpp that looked a bit suspicious. But there wasn't much going on here, that was very straightforward, so I fixed that. I.e., Ayal added support for decrementing loops in VPlan recently. The Backedge Taken Count value looks slightly different for these cases, but extracting it is easy, which I have added here. As a result, this now also triggers in test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll, which I updated.

I think introducing a VPInstruction opcode for the new intrinsic makes sense and fits in the current scheme. But I think there's no need to bundle the langref, TTI and LV changes into a single patch. IMO would be good to split at least the LV part off, to focus on discussing the implementation details in LV there.

Many thanks for taking a look Florian!

I think introducing a VPInstruction opcode for the new intrinsic makes sense and fits in the current scheme. But I think there's no need to bundle the langref, TTI and LV changes into a single patch. IMO would be good to split at least the LV part off, to focus on discussing the implementation details in LV there

Ok, but Just double checking that I get this right:

In the LV part, we do use and check TTI.emitGetActiveLaneMask, so I guess that means we have LV + TTI in one patch,
and separate we have LangRef.

That looks like a minimal change, but certainly don't mind doing this if that helps.

In D79100#2055802, @SjoerdMeijer wrote:

Many thanks for taking a look Florian!

I think introducing a VPInstruction opcode for the new intrinsic makes sense and fits in the current scheme. But I think there's no need to bundle the langref, TTI and LV changes into a single patch. IMO would be good to split at least the LV part off, to focus on discussing the implementation details in LV there

Ok, but Just double checking that I get this right:

In the LV part, we do use and check TTI.emitGetActiveLaneMask, so I guess that means we have LV + TTI in one patch,

I'd just put the LV parts in a separate patch and the TTI stuff in a different patch (the former depending on the latter). The way I see it, those are changes to 2 separate areas of the code, with potentially different people to approve. Also, if there's a problem with the LV patch, it can be reverted in isolation.

SjoerdMeijer mentioned this in D80596: New intrinsic @llvm.get.active.lane.mask(). · May 26 2020, 3:52 PM

Ok, thanks, sounds like a plan.

I will keep this as a reference, I think that's convenient, and will split off the different bits and pieces.
I have started with the easiest patch to create, the intrinsic description and documentation: D80596

SjoerdMeijer mentioned this in D80597: [TTI] New target hook emitGetActiveLaneMask. · May 26 2020, 4:10 PM

SjoerdMeijer updated this revision to Diff 266369. · May 26 2020, 4:21 PM

SjoerdMeijer retitled this revision from [LV][TTI] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops to [LV] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops.

SjoerdMeijer mentioned this in rG7fb8a40e5220: New intrinsic @llvm.get.active.lane.mask(). · May 29 2020, 1:03 AM

SjoerdMeijer mentioned this in rG7480ccbfc9d2: [TTI] New target hook emitGetActiveLaneMask. · May 29 2020, 1:36 AM

I have just committed the required scaffolding for this:

the new intrinsic in rG7fb8a40e5220
the new TTI hook in rG7480ccbfc9d2

So was wondering if we are happy with this LV part too?

Just FYI, I am addressing comments on our backend support for this intrinsic in D79175. Once that is ready, I will commit this and D79175 at the same time.

fhahn added inline comments. · May 29 2020, 4:25 AM

llvm/lib/Transforms/Vectorize/VPlan.cpp
425	IIUC we want the first lanes of both the BTC and the IV, right? If that's the case, I think it would be more straightforward to just request the specific lane when looking up the values, e.g. something like:

  // Get first lane of vector induction variable.
  Value *VIVE0 = State.get(getOperand(0), {Part, 0});
  // Get first lane of backedge-taken-count.
  Value *ScalarBTC = State.get(getOperand(1), {Part, 0});
  auto Int1Ty = Type::getInt1Ty(Builder.getContext());
  auto PredTy = VectorType::get(Int1Ty, State.VF);
  Instruction *Call = Builder.CreateIntrinsic(
      Intrinsic::get_active_lane_mask, {PredTy, ScalarBTC->getType()},
      {VIVE0, ScalarBTC}, nullptr, "active.lane.mask");
  State.set(this, Call, Part);

This should have the advantage of re-using some scalar values, if they have been requested already by other recipes. (Note: this currently crashes unfortunately, as getting lanes for defs managed by VPTransformState does not work, but I put up D80787 to fix that.)
llvm/test/Transforms/LoopVectorize/ARM/tail-loop-folding.ll
82–86	We should also have a test where the unroll-factor/interleave-count > 1 with `llvm.get.active.lane.mask` (unless that's not possible at the moment)
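
For the interleaved case mentioned above, a scalar model of what one would expect per unrolled part might look like this. It is a sketch under assumptions, not the vectoriser's behaviour as confirmed by this review: `masksForParts` is a hypothetical helper, and it assumes part `p` covers the `vf` consecutive iterations starting at `base + p * vf`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: one mask per unrolled part; part p is assumed to start
// at the scalar induction value base + p * vf.
std::vector<std::vector<bool>> masksForParts(uint64_t base, uint64_t btc,
                                             unsigned vf, unsigned uf) {
  assert(vf > 0 && uf > 0);
  std::vector<std::vector<bool>> masks(uf, std::vector<bool>(vf));
  for (unsigned part = 0; part < uf; ++part) {
    uint64_t first = base + static_cast<uint64_t>(part) * vf;
    for (unsigned lane = 0; lane < vf; ++lane)
      masks[part][lane] = (first + lane) <= btc; // assumes no wrap-around
  }
  return masks;
}
```

With a vectorization factor of 4, an interleave count of 2, a backedge-taken count of 429 and an iteration starting at 424, the first part is fully active and the second part has only its first two lanes active.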

fhahn added inline comments. · May 29 2020, 4:48 AM

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1417	Just return `TailFolded`? side note: what are all the extra parameters needed/used somewhere?

SjoerdMeijer marked an inline comment as done. · May 29 2020, 5:24 AM

SjoerdMeijer added inline comments.

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1417	Just a quick first comment on your side note: "what are all the extra parameters needed/used somewhere?" That's just to provide an interface consistent with most other TTI functions here. Currently, we only look at `TailFolded`, the rest is indeed unused. But if (other) targets want to do a bit more analysis here, they most definitely want to look at `L` and `LI`, and possibly `SE`. So, as I said, mostly unused at the moment, and added for consistency, which I accept may or may not be a good reason. I have no strong opinions, so will refactor this TTI hook right away if you think that is better/necessary. I will address your other remarks soon. Thanks for that.

IIUC we want the first lanes of both the BTC and the IV, right?

Yep, exactly that.

If that's the case, I think it would be more straightforward to just request the specific lane when looking up the values, e.g. something like:
<SNIP>
This should have the advantage of re-using some scalar values, if they have been requested already by other recipes.

(Note: this currently crashes unfortunately, as getting lanes for defs managed by VPTransformState does not work, but I put up D80787 to fix that)

Thanks for that. I took D80787, and that worked like a charm, greatly simplifying things here: I have simplified lowering of VPInstruction::ActiveLaneMask as per your suggestion.

SjoerdMeijer mentioned this in D80787: [VPlan] Support extracting lanes for defs managed in VPTransformState. · Jun 2 2020, 8:59 AM

With D80787 committed, and this using that, does this look okay now?

LGTM with the comment, thanks! Please wait a day or so with committing in case there are additional comments. Also, I might have missed it, but it would be good to have a test case with unroll-factor/interleave-count > 1 (as mentioned in an inline comment).

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1417	I am not sure, but I think adding all those unused parameters because they might get used in the future seems a bit odd. It would be easy enough to add them if required.

Many thanks for looking at this!

This needs to be committed together with the backend patch that adds support for lowering this intrinsic. That is nearly done (hopefully), and I am expecting a commit sometime next week, so will definitely wait a few days.

I will add that test case (sorry, I overlooked and missed that), and will follow up to reduce the arguments in the TTI hook.

This revision was not accepted when it landed; it landed in state Needs Review. · Jun 17 2020, 2:08 AM

Closed by commit rG47650451738c: [LV] Emit @llvm.get.active.mask for tail-folded loops (authored by SjoerdMeijer).

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer mentioned this in rG20835cff272e: [TTI] Refactor emitGetActiveLaneMask.

Herald added a subscriber: bmahjour. · Jun 17 2020, 2:08 AM

Hi, seems that this patch is causing some failures:
Failed Tests (2):

LLVM :: Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll
LLVM :: Transforms/LoopVectorize/ARM/tail-loop-folding.ll

I replied by mail, but just for completeness, I had noticed and reverted it, and now recommitted it.

SjoerdMeijer mentioned this in rGd1522513d4c4: [ARM] Reimplement MVE Tail-Predication pass using @llvm.get.active.lane.mask. · Jun 17 2020, 7:32 AM

Revision Contents

Path	Size
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp	6 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	6 lines
llvm/lib/Transforms/Vectorize/VPlan.h	1 line
llvm/lib/Transforms/Vectorize/VPlan.cpp	18 lines
llvm/test/Transforms/LoopVectorize/ARM/prefer-tail-loop-folding.ll	19 lines
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll	10 lines
llvm/test/Transforms/LoopVectorize/ARM/tail-loop-folding.ll	82 lines

Diff 271304

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp

@@ LLVM_DEBUG(dbgs() << "preferPredicateOverEpilogue: hardware-loop is not "
                          "a candidate.\n");
     return false;
   }

   return canTailPredicateLoop(L, LI, SE, DL, LAI);
 }

 bool ARMTTIImpl::emitGetActiveLaneMask() const {
-  if (!ST->hasMVEIntegerOps())
+  if (!ST->hasMVEIntegerOps() || DisableTailPredication)
     return false;

-  // TODO: Intrinsic @llvm.get.active.lane.mask is supported.
+  // Intrinsic @llvm.get.active.lane.mask is supported.
   // It is used in the MVETailPredication pass, which requires the number of
   // elements processed by this vector loop to setup the tail-predicated
   // loop.
-  return false;
+  return true;
 }
 void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
                                          TTI::UnrollingPreferences &UP) {
   // Only currently enable these preferences for M-Class cores.
   if (!ST->isMClass())
     return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP);

   // Disable loop unrolling for Oz and Os.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp


Show First 20 Lines • Show All 6,812 Lines • ▼ Show 20 Lines	VPValue VPRecipeBuilder::createBlockInMask(BasicBlock BB, VPlanPtr &Plan) {
// load/store/gather/scatter. Initialize BlockMask to no-mask.		// load/store/gather/scatter. Initialize BlockMask to no-mask.
VPValue *BlockMask = nullptr;		VPValue *BlockMask = nullptr;

if (OrigLoop->getHeader() == BB) {		if (OrigLoop->getHeader() == BB) {
if (!CM.blockNeedsPredication(BB))		if (!CM.blockNeedsPredication(BB))
return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.		return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.

// Introduce the early-exit compare IV <= BTC to form header block mask.		// Introduce the early-exit compare IV <= BTC to form header block mask.
// This is used instead of IV < TC because TC may wrap, unlike BTC.		// This is used instead of IV < TC because TC may wrap, unlike BTC.
		efriedmaUnsubmitted Not Done Reply Inline Actions It would be nice to explicitly note here that we're assuming the vector factor is a power of two (so the induction variable can't wrap). efriedma: It would be nice to explicitly note here that we're assuming the vector factor is a power of…
// Start by constructing the desired canonical IV.		// Start by constructing the desired canonical IV.
VPValue *IV = nullptr;		VPValue *IV = nullptr;
if (Legal->getPrimaryInduction())		if (Legal->getPrimaryInduction())
IV = Plan->getVPValue(Legal->getPrimaryInduction());		IV = Plan->getVPValue(Legal->getPrimaryInduction());
else {		else {
auto IVRecipe = new VPWidenCanonicalIVRecipe();		auto IVRecipe = new VPWidenCanonicalIVRecipe();
Builder.getInsertBlock()->appendRecipe(IVRecipe);		Builder.getInsertBlock()->appendRecipe(IVRecipe);
IV = IVRecipe->getVPValue();		IV = IVRecipe->getVPValue();
}		}
VPValue *BTC = Plan->getOrCreateBackedgeTakenCount();		VPValue *BTC = Plan->getOrCreateBackedgeTakenCount();
		bool TailFolded = !CM.isScalarEpilogueAllowed();
		if (TailFolded && CM.TTI.emitGetActiveLaneMask())
		BlockMask = Builder.createNaryOp(VPInstruction::ActiveLaneMask, {IV, BTC});
		else
BlockMask = Builder.createNaryOp(VPInstruction::ICmpULE, {IV, BTC});		BlockMask = Builder.createNaryOp(VPInstruction::ICmpULE, {IV, BTC});
return BlockMaskCache[BB] = BlockMask;		return BlockMaskCache[BB] = BlockMask;
}		}

// This is the block mask. We OR all incoming edges.		// This is the block mask. We OR all incoming edges.
for (auto *Predecessor : predecessors(BB)) {		for (auto *Predecessor : predecessors(BB)) {
VPValue *EdgeMask = createEdgeMask(Predecessor, BB, Plan);		VPValue *EdgeMask = createEdgeMask(Predecessor, BB, Plan);
if (!EdgeMask) // Mask of predecessor is all-one so mask of block is too.		if (!EdgeMask) // Mask of predecessor is all-one so mask of block is too.
return BlockMaskCache[BB] = EdgeMask;		return BlockMaskCache[BB] = EdgeMask;
▲ Show 20 Lines • Show All 1,262 Lines • Show Last 20 Lines
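
The comment in createBlockInMask says the header mask compares IV <= BTC rather than IV < TC because TC may wrap. A minimal sketch of that reasoning, not taken from the patch: the helper names are hypothetical, and 8-bit counters stand in for the loop's IV type so the wrap is easy to reach.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical scalar model, not LLVM code: build the header mask from the
// trip count (TC = BTC + 1). TC wraps to 0 when BTC is the maximum value.
std::vector<bool> maskFromTC(uint8_t IV, uint8_t BTC, unsigned VF) {
  assert(VF > 0 && "vector factor must be non-zero");
  uint8_t TC = static_cast<uint8_t>(BTC + 1);
  std::vector<bool> Mask(VF);
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    Mask[Lane] = static_cast<uint8_t>(IV + Lane) < TC;
  return Mask;
}

// Same model, but comparing against the backedge-taken count directly, as
// the ICmpULE VPInstruction does. No addition on the count, so no wrap there.
std::vector<bool> maskFromBTC(uint8_t IV, uint8_t BTC, unsigned VF) {
  assert(VF > 0 && "vector factor must be non-zero");
  std::vector<bool> Mask(VF);
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    Mask[Lane] = static_cast<uint8_t>(IV + Lane) <= BTC;
  return Mask;
}
```

With BTC = 255 the scalar loop runs 256 iterations, which an 8-bit TC cannot represent: maskFromTC(252, 255, 4) disables every lane, while maskFromBTC(252, 255, 4) keeps all four active. For counts that do not wrap, the two forms agree.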

llvm/lib/Transforms/Vectorize/VPlan.h

	Show First 20 Lines • Show All 679 Lines • ▼ Show 20 Lines

	public:			public:
	/// VPlan opcodes, extending LLVM IR with idiomatics instructions.			/// VPlan opcodes, extending LLVM IR with idiomatics instructions.
	enum {			enum {
	Not = Instruction::OtherOpsEnd + 1,			Not = Instruction::OtherOpsEnd + 1,
	ICmpULE,			ICmpULE,
	SLPLoad,			SLPLoad,
	SLPStore,			SLPStore,
				ActiveLaneMask,
	};			};

	private:			private:
	typedef unsigned char OpcodeTy;			typedef unsigned char OpcodeTy;
	OpcodeTy Opcode;			OpcodeTy Opcode;

	/// Utility method serving execute(): generates a single instance of the			/// Utility method serving execute(): generates a single instance of the
	/// modeled instruction.			/// modeled instruction.
	▲ Show 20 Lines • Show All 1,283 Lines • Show Last 20 Lines

llvm/lib/Transforms/Vectorize/VPlan.cpp

Show First 20 Lines • Show All 374 Lines • ▼ Show 20 Lines	void VPInstruction::generateInstruction(VPTransformState &State,
case Instruction::Select: {		case Instruction::Select: {
Value *Cond = State.get(getOperand(0), Part);		Value *Cond = State.get(getOperand(0), Part);
Value *Op1 = State.get(getOperand(1), Part);		Value *Op1 = State.get(getOperand(1), Part);
Value *Op2 = State.get(getOperand(2), Part);		Value *Op2 = State.get(getOperand(2), Part);
Value *V = Builder.CreateSelect(Cond, Op1, Op2);		Value *V = Builder.CreateSelect(Cond, Op1, Op2);
State.set(this, V, Part);		State.set(this, V, Part);
break;		break;
}		}
		case VPInstruction::ActiveLaneMask: {
		// Get first lane of vector induction variable.
		Value *VIVElem0 = State.get(getOperand(0), {Part, 0});
		// Get first lane of backedge-taken-count.
		rogfer01: Minor nit: Here you use `TC` (as in trip count?) but above when you created this `VPInstruction` you used `BTC` (backedge taken count). This might be confusing for a future reader so I'd suggest to stick to your own convention and use `BTC` here as well. This code seems similar to that of `VPRecipeBuilder::createBlockInMask` (except that one creates the mask for a given block and this one creates a mask similar to the case `OrigLoop->getHeader() == BB`). That function uses `BTC` as well.
		Value *ScalarBTC = State.get(getOperand(1), {Part, 0});

		auto *Int1Ty = Type::getInt1Ty(Builder.getContext());
		auto *PredTy = VectorType::get(Int1Ty, State.VF);
		Instruction *Call = Builder.CreateIntrinsic(
		Intrinsic::get_active_lane_mask, {PredTy, ScalarBTC->getType()},
		vkmr: Minor nit: Probably cleaner to use `Builder.CreateIntrinsic(Intrinsic::get_active_mask, { V->getType(), V->getType() }, {V}, nullptr, "active.mask");`.
		{VIVElem0, ScalarBTC}, nullptr, "active.lane.mask");
		State.set(this, Call, Part);
		break;
		}
default:		default:
llvm_unreachable("Unsupported opcode for instruction");		llvm_unreachable("Unsupported opcode for instruction");
}		}
}		}

void VPInstruction::execute(VPTransformState &State) {		void VPInstruction::execute(VPTransformState &State) {
assert(!State.Instance && "VPInstruction executing an Instance");		assert(!State.Instance && "VPInstruction executing an Instance");
for (unsigned Part = 0; Part < State.UF; ++Part)		for (unsigned Part = 0; Part < State.UF; ++Part)
Show All 12 Lines
}		}

void VPInstruction::print(raw_ostream &O, VPSlotTracker &SlotTracker) const {		void VPInstruction::print(raw_ostream &O, VPSlotTracker &SlotTracker) const {
if (hasResult()) {		if (hasResult()) {
printAsOperand(O, SlotTracker);		printAsOperand(O, SlotTracker);
O << " = ";		O << " = ";
}		}

switch (getOpcode()) {		switch (getOpcode()) {
		fhahn: IIUC we want the first lanes of both the BTC and the IV, right? If that's the case, I think it would be more straightforward to just request the specific lane when looking up the values, e.g. something like:

		    // Get first lane of vector induction variable.
		    Value *VIVE0 = State.get(getOperand(0), {Part, 0});
		    // Get first lane of backedge-taken-count.
		    Value *ScalarBTC = State.get(getOperand(1), {Part, 0});
		    auto *Int1Ty = Type::getInt1Ty(Builder.getContext());
		    auto *PredTy = VectorType::get(Int1Ty, State.VF);
		    Instruction *Call = Builder.CreateIntrinsic(
		        Intrinsic::get_active_lane_mask, {PredTy, ScalarBTC->getType()},
		        {VIVE0, ScalarBTC}, nullptr, "active.lane.mask");
		    State.set(this, Call, Part);

		This should have the advantage of re-using some scalar values, if they have been requested already by other recipes. (Note: this currently crashes unfortunately, as getting lanes for defs managed by VPTransformState does not work, but I put up D80787 to fix that.)
case VPInstruction::Not:		case VPInstruction::Not:
O << "not";		O << "not";
break;		break;
case VPInstruction::ICmpULE:		case VPInstruction::ICmpULE:
O << "icmp ule";		O << "icmp ule";
break;		break;
case VPInstruction::SLPLoad:		case VPInstruction::SLPLoad:
O << "combined load";		O << "combined load";
break;		break;
case VPInstruction::SLPStore:		case VPInstruction::SLPStore:
O << "combined store";		O << "combined store";
break;		break;
		case VPInstruction::ActiveLaneMask:
		O << "active lane mask";
		break;

default:		default:
O << Instruction::getOpcodeName(getOpcode());		O << Instruction::getOpcodeName(getOpcode());
}		}
		vkmr: Add `case VPInstruction::ActiveMask` to print the correct VPInstruction when printing VPlan. You can pass the flag `-debug-only=loop-vectorize` to `opt` to see the generated VPlan.

for (const VPValue *Operand : operands()) {		for (const VPValue *Operand : operands()) {
O << " ";		O << " ";
Operand->printAsOperand(O, SlotTracker);		Operand->printAsOperand(O, SlotTracker);
}		}
}		}

/// Generate the code inside the body of the vectorized loop. Assumes a single		/// Generate the code inside the body of the vectorized loop. Assumes a single
▲ Show 20 Lines • Show All 526 Lines • Show Last 20 Lines
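
For readers following the ActiveLaneMask case above, here is a rough scalar model of the mask it requests. It assumes the intrinsic keeps the predicate of the ICmpULE it replaces (lane i is active when base + i <= BTC); the function name is hypothetical and this is not the LLVM lowering.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the mask produced for one vector iteration.
// Base is lane 0 of the vector induction variable (VIVElem0 in the patch),
// BTC is the scalar backedge-taken count (ScalarBTC in the patch).
std::vector<bool> activeLaneMask(uint64_t Base, uint64_t BTC, unsigned VF) {
  assert(VF > 0 && "vector factor must be non-zero");
  std::vector<bool> Mask(VF);
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    Mask[Lane] = (Base + Lane) <= BTC; // assumed: same predicate as ICmpULE
  return Mask;
}
```

Using the constants from the tail-loop-folding.ll checks (BTC = 429, VF = 4), every iteration up to index 424 has all lanes active, and the final iteration at index 428 keeps only lanes 0 and 1, which cover elements 428 and 429.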

llvm/test/Transforms/LoopVectorize/ARM/prefer-tail-loop-folding.ll

Show All 39 Lines
; RUN: -prefer-predicate-over-epilog=true \		; RUN: -prefer-predicate-over-epilog=true \
; RUN: -disable-mve-tail-predication=false -loop-vectorize \		; RUN: -disable-mve-tail-predication=false -loop-vectorize \
; RUN: -enable-arm-maskedldst=true -S < %s | \		; RUN: -enable-arm-maskedldst=true -S < %s | \
; RUN: FileCheck %s -check-prefixes=CHECK,FOLDING-OPT		; RUN: FileCheck %s -check-prefixes=CHECK,FOLDING-OPT

define void @prefer_folding(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) #0 {		define void @prefer_folding(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) #0 {
; CHECK-LABEL: prefer_folding(		; CHECK-LABEL: prefer_folding(
; PREFER-FOLDING: vector.body:		; PREFER-FOLDING: vector.body:
; PREFER-FOLDING: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32		; PREFER-FOLDING: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
; PREFER-FOLDING: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32		; PREFER-FOLDING: %[[VIVELEM0:.*]] = add i32 %index, 0
; PREFER-FOLDING: call void @llvm.masked.store.v4i32.p0v4i32		; PREFER-FOLDING: %active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %[[VIVELEM0]], i32 430)
		; PREFER-FOLDING: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %active.lane.mask,
		; PREFER-FOLDING: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %active.lane.mask,
		; PREFER-FOLDING: call void @llvm.masked.store.v4i32.p0v4i32({{.*}}, <4 x i1> %active.lane.mask
; PREFER-FOLDING: br i1 %{{.*}}, label %{{.*}}, label %vector.body		; PREFER-FOLDING: br i1 %{{.*}}, label %{{.*}}, label %vector.body
;		;
; NO-FOLDING-NOT: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; NO-FOLDING-NOT: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
; NO-FOLDING-NOT: call void @llvm.masked.store.v4i32.p0v4i32(		; NO-FOLDING-NOT: call void @llvm.masked.store.v4i32.p0v4i32(
; NO-FOLDING: br i1 %{{.*}}, label %{{.*}}, label %for.body		; NO-FOLDING: br i1 %{{.*}}, label %{{.*}}, label %for.body
entry:		entry:
br label %for.body		br label %for.body

▲ Show 20 Lines • Show All 443 Lines • ▼ Show 20 Lines	for.body:
%add3 = add nuw nsw i32 %i.09, 1		%add3 = add nuw nsw i32 %i.09, 1
%exitcond = icmp eq i32 %add3, 431		%exitcond = icmp eq i32 %add3, 431
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body
}		}

define void @float(float* noalias nocapture %A, float* noalias nocapture readonly %B, float* noalias nocapture readonly %C) #0 {		define void @float(float* noalias nocapture %A, float* noalias nocapture readonly %B, float* noalias nocapture readonly %C) #0 {
; CHECK-LABEL: float(		; CHECK-LABEL: float(
; PREFER-FOLDING: vector.body:		; PREFER-FOLDING: vector.body:
; PREFER-FOLDING: call <4 x float> @llvm.masked.load.v4f32.p0v4f32		; PREFER-FOLDING: %index = phi i32 [ 0, %vector.ph ], [ %index.next, %vector.body ]
; PREFER-FOLDING: call <4 x float> @llvm.masked.load.v4f32.p0v4f32		; PREFER-FOLDING: %[[VIVELEM0:.*]] = add i32 %index, 0
; PREFER-FOLDING: call void @llvm.masked.store.v4f32.p0v4f32		; PREFER-FOLDING: %active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %[[VIVELEM0]], i32 430)
		; PREFER-FOLDING: call <4 x float> @llvm.masked.load.v4f32.p0v4f32({{.*}}%active.lane.mask
		; PREFER-FOLDING: call <4 x float> @llvm.masked.load.v4f32.p0v4f32({{.*}}%active.lane.mask
		; PREFER-FOLDING: call void @llvm.masked.store.v4f32.p0v4f32({{.*}}%active.lane.mask
		; PREFER-FOLDING: %index.next = add i32 %index, 4
; PREFER-FOLDING: br i1 %{{.*}}, label %{{.*}}, label %vector.body		; PREFER-FOLDING: br i1 %{{.*}}, label %{{.*}}, label %vector.body
entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup:		for.cond.cleanup:
ret void		ret void

for.body:		for.body:
▲ Show 20 Lines • Show All 135 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll

	Show All 9 Lines
	; void f(char *a, char *b, char *c, int N) {			; void f(char *a, char *b, char *c, int N) {
	; while (N-- > 0)			; while (N-- > 0)
	; *c++ = *a++ + *b++;			; *c++ = *a++ + *b++;
	; }			; }
	;			;
	define dso_local void @sgt_loopguard(i8* noalias nocapture readonly %a, i8* noalias nocapture readonly %b, i8* noalias nocapture %c, i32 %N) local_unnamed_addr #0 {			define dso_local void @sgt_loopguard(i8* noalias nocapture readonly %a, i8* noalias nocapture readonly %b, i8* noalias nocapture %c, i32 %N) local_unnamed_addr #0 {
	; COMMON-LABEL: @sgt_loopguard(			; COMMON-LABEL: @sgt_loopguard(
	; COMMON: vector.body:			; COMMON: vector.body:
	; CHECK-TF: masked.load
	; CHECK-TF: masked.load			; CHECK-TF: %[[VIVELEM0:.*]] = extractelement <16 x i32> %vec.iv, i32 0
	; CHECK-TF: masked.store			; CHECK-TF: %[[SCALARBTC:.*]] = extractelement <16 x i32> %broadcast.splat, i32 0
				; CHECK-TF: %active.lane.mask = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 %[[VIVELEM0]], i32 %[[SCALARBTC]])
				; CHECK-TF: llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %{{.*}}, i32 1, <16 x i1> %active.lane.mask
				; CHECK-TF: llvm.masked.load.v16i8.p0v16i8(<16 x i8>* %{{.*}}, i32 1, <16 x i1> %active.lane.mask
				; CHECK-TF: llvm.masked.store.v16i8.p0v16i8(<16 x i8> %{{.*}}, <16 x i8>* %{{.*}}, i32 1, <16 x i1> %active.lane.mask)
	entry:			entry:
	%cmp5 = icmp sgt i32 %N, 0			%cmp5 = icmp sgt i32 %N, 0
	br i1 %cmp5, label %while.body.preheader, label %while.end			br i1 %cmp5, label %while.body.preheader, label %while.end

	while.body.preheader:			while.body.preheader:
	br label %while.body			br label %while.body

	while.body:			while.body:
	▲ Show 20 Lines • Show All 406 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopVectorize/ARM/tail-loop-folding.ll

Show All 35 Lines	for.body:
%exitcond = icmp eq i64 %indvars.iv.next, 430		%exitcond = icmp eq i64 %indvars.iv.next, 430
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body
}		}


define dso_local void @tail_folding_enabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {		define dso_local void @tail_folding_enabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {
; COMMON-LABEL: tail_folding_enabled(		; COMMON-LABEL: tail_folding_enabled(
; COMMON: vector.body:		; COMMON: vector.body:
; COMMON: %[[WML1:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; COMMON: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
; COMMON: %[[WML2:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; COMMON: %[[ELEM0:.*]] = add i64 %index, 0
		; COMMON: %active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 %[[ELEM0]], i64 429)
		; COMMON: %[[WML1:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}<4 x i1> %active.lane.mask
		; COMMON: %[[WML2:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}<4 x i1> %active.lane.mask
; COMMON: %[[ADD:.*]] = add nsw <4 x i32> %[[WML2]], %[[WML1]]		; COMMON: %[[ADD:.*]] = add nsw <4 x i32> %[[WML2]], %[[WML1]]
; COMMON: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %[[ADD]]		; COMMON: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %[[ADD]], {{.*}}<4 x i1> %active.lane.mask
; COMMON: br i1 %12, label %{{.*}}, label %vector.body		; COMMON: %index.next = add i64 %index, 4
		; COMMON: br i1 %{{.*}}, label %{{.*}}, label %vector.body
entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup:		for.cond.cleanup:
ret void		ret void

for.body:		for.body:
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]		%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
Show All 13 Lines
; CHECK-LABEL: tail_folding_disabled(		; CHECK-LABEL: tail_folding_disabled(
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NOT: @llvm.masked.load.v8i32.p0v8i32(		; CHECK-NOT: @llvm.masked.load.v8i32.p0v8i32(
; CHECK-NOT: @llvm.masked.store.v8i32.p0v8i32(		; CHECK-NOT: @llvm.masked.store.v8i32.p0v8i32(
; CHECK: br i1 %{{.*}}, label {{.*}}, label %vector.body		; CHECK: br i1 %{{.*}}, label {{.*}}, label %vector.body

; PREDFLAG-LABEL: tail_folding_disabled(		; PREDFLAG-LABEL: tail_folding_disabled(
; PREDFLAG: vector.body:		; PREDFLAG: vector.body:
; PREDFLAG: %wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; PREDFLAG: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
; PREDFLAG: %wide.masked.load1 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; PREDFLAG: %[[ELEM0:.*]] = add i64 %index, 0
		; PREDFLAG: %active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i64(i64 %[[ELEM0]], i64 429)
		; PREDFLAG: %wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %active.lane.mask
		; PREDFLAG: %wide.masked.load1 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %active.lane.mask
		fhahn: We should also have a test where the unroll-factor/interleave-count > 1 with `llvm.get.active.lane.mask` (unless that's not possible at the moment)
; PREDFLAG: %{{.*}} = add nsw <4 x i32> %wide.masked.load1, %wide.masked.load		; PREDFLAG: %{{.*}} = add nsw <4 x i32> %wide.masked.load1, %wide.masked.load
; PREDFLAG: call void @llvm.masked.store.v4i32.p0v4i32(		; PREDFLAG: call void @llvm.masked.store.v4i32.p0v4i32({{.*}}, <4 x i1> %active.lane.mask
; PREDFLAG: %index.next = add i64 %index, 4		; PREDFLAG: %index.next = add i64 %index, 4
; PREDFLAG: %12 = icmp eq i64 %index.next, 432		; PREDFLAG: %[[CMP:.*]] = icmp eq i64 %index.next, 432
; PREDFLAG: br i1 %{{.*}}, label %middle.block, label %vector.body, !llvm.loop !6		; PREDFLAG: br i1 %[[CMP]], label %middle.block, label %vector.body, !llvm.loop !6
entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup:		for.cond.cleanup:
ret void		ret void

for.body:		for.body:
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]		%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
%arrayidx = getelementptr inbounds i32, i32* %B, i64 %indvars.iv		%arrayidx = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
%0 = load i32, i32* %arrayidx, align 4		%0 = load i32, i32* %arrayidx, align 4
%arrayidx2 = getelementptr inbounds i32, i32* %C, i64 %indvars.iv		%arrayidx2 = getelementptr inbounds i32, i32* %C, i64 %indvars.iv
%1 = load i32, i32* %arrayidx2, align 4		%1 = load i32, i32* %arrayidx2, align 4
%add = add nsw i32 %1, %0		%add = add nsw i32 %1, %0
%arrayidx4 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv		%arrayidx4 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
store i32 %add, i32* %arrayidx4, align 4		store i32 %add, i32* %arrayidx4, align 4
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond = icmp eq i64 %indvars.iv.next, 430		%exitcond = icmp eq i64 %indvars.iv.next, 430
br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !10		br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !10
}		}

		define dso_local void @interleave4(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C, i32 %N) local_unnamed_addr #0 {
		; PREDFLAG-LABEL: interleave4(
		; PREDFLAG: %[[ADD1:.*]] = add i32 %index, 0
		; PREDFLAG: %[[ADD2:.*]] = add i32 %index, 4
		; PREDFLAG: %[[ADD3:.*]] = add i32 %index, 8
		; PREDFLAG: %[[ADD4:.*]] = add i32 %index, 12
		; PREDFLAG: %[[BTC:.*]] = extractelement <4 x i32> %broadcast.splat, i32 0
		; PREDFLAG: %[[ALM1:active.lane.mask.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %[[ADD1]], i32 %[[BTC]])
		; PREDFLAG: %[[ALM2:active.lane.mask.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %[[ADD2]], i32 %[[BTC]])
		; PREDFLAG: %[[ALM3:active.lane.mask.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %[[ADD3]], i32 %[[BTC]])
		; PREDFLAG: %[[ALM4:active.lane.mask.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %[[ADD4]], i32 %[[BTC]])
		;
		; PREDFLAG: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ALM1]],{{.*}}
		; PREDFLAG: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ALM2]],{{.*}}
		; PREDFLAG: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ALM3]],{{.*}}
		; PREDFLAG: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ALM4]],{{.*}}
		; PREDFLAG: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ALM1]],{{.*}}
		; PREDFLAG: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ALM2]],{{.*}}
		; PREDFLAG: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ALM3]],{{.*}}
		; PREDFLAG: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ALM4]],{{.*}}
		;
		; PREDFLAG: call void @llvm.masked.store.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ALM1]])
		; PREDFLAG: call void @llvm.masked.store.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ALM2]])
		; PREDFLAG: call void @llvm.masked.store.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ALM3]])
		; PREDFLAG: call void @llvm.masked.store.v4i32.p0v4i32({{.*}}, <4 x i1> %[[ALM4]])
		;
		entry:
		%cmp8 = icmp sgt i32 %N, 0
		br i1 %cmp8, label %for.body.preheader, label %for.cond.cleanup

		for.body.preheader: ; preds = %entry
		br label %for.body

		for.cond.cleanup.loopexit: ; preds = %for.body
		br label %for.cond.cleanup

		for.cond.cleanup: ; preds = %for.cond.cleanup.loopexit, %entry
		ret void

		for.body: ; preds = %for.body.preheader, %for.body
		%i.09 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
		%arrayidx = getelementptr inbounds i32, i32* %B, i32 %i.09
		%0 = load i32, i32* %arrayidx, align 4
		%arrayidx1 = getelementptr inbounds i32, i32* %C, i32 %i.09
		%1 = load i32, i32* %arrayidx1, align 4
		%add = add nsw i32 %1, %0
		%arrayidx2 = getelementptr inbounds i32, i32* %A, i32 %i.09
		store i32 %add, i32* %arrayidx2, align 4
		%inc = add nuw nsw i32 %i.09, 1
		%exitcond = icmp eq i32 %inc, %N
		br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body, !llvm.loop !14
		}
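
The interleave4 checks expect one mask per unrolled part, each built from its own base (index + 0, + 4, + 8, + 12) and the shared BTC. A small sketch of that layout follows, assuming the lane predicate is base + lane <= BTC as in the ICmpULE form; the function name and the sample trip count are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model: one mask per interleaved part. Part P starts at
// Index + P * VF. Arithmetic is done in 64 bits so this model ignores any
// wrap of the 32-bit induction variable.
std::vector<std::vector<bool>> interleavedMasks(uint32_t Index, uint32_t BTC,
                                                unsigned VF, unsigned UF) {
  assert(VF > 0 && UF > 0 && "VF and UF must be non-zero");
  std::vector<std::vector<bool>> Masks(UF, std::vector<bool>(VF));
  for (unsigned Part = 0; Part < UF; ++Part)
    for (unsigned Lane = 0; Lane < VF; ++Lane) {
      uint64_t Elem = uint64_t(Index) + uint64_t(Part) * VF + Lane;
      Masks[Part][Lane] = Elem <= BTC; // assumed predicate, see lead-in
    }
  return Masks;
}
```

For an assumed N = 22 (BTC = 21) with VF = 4 and UF = 4, the second wide iteration starts at index 16: part 0 is fully active, part 1 keeps lanes 0 and 1 (elements 20 and 21), and parts 2 and 3 are fully masked off.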

; CHECK: !0 = distinct !{!0, !1}		; CHECK: !0 = distinct !{!0, !1}
; CHECK-NEXT: !1 = !{!"llvm.loop.isvectorized", i32 1}		; CHECK-NEXT: !1 = !{!"llvm.loop.isvectorized", i32 1}
; CHECK-NEXT: !2 = distinct !{!2, !3, !1}		; CHECK-NEXT: !2 = distinct !{!2, !3, !1}
; CHECK-NEXT: !3 = !{!"llvm.loop.unroll.runtime.disable"}		; CHECK-NEXT: !3 = !{!"llvm.loop.unroll.runtime.disable"}
; CHECK-NEXT: !4 = distinct !{!4, !1}		; CHECK-NEXT: !4 = distinct !{!4, !1}
; CHECK-NEXT: !5 = distinct !{!5, !3, !1}		; CHECK-NEXT: !5 = distinct !{!5, !3, !1}
; CHECK-NEXT: !6 = distinct !{!6, !1}		; CHECK-NEXT: !6 = distinct !{!6, !1}

attributes #0 = { nofree norecurse nounwind "target-features"="+armv8.1-m.main,+mve.fp" }		attributes #0 = { nofree norecurse nounwind "target-features"="+armv8.1-m.main,+mve.fp" }

!6 = distinct !{!6, !7, !8}		!6 = distinct !{!6, !7, !8}
!7 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}		!7 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}
!8 = !{!"llvm.loop.vectorize.enable", i1 true}		!8 = !{!"llvm.loop.vectorize.enable", i1 true}

!10 = distinct !{!10, !11, !12}		!10 = distinct !{!10, !11, !12}
!11 = !{!"llvm.loop.vectorize.predicate.enable", i1 false}		!11 = !{!"llvm.loop.vectorize.predicate.enable", i1 false}
!12 = !{!"llvm.loop.vectorize.enable", i1 true}		!12 = !{!"llvm.loop.vectorize.enable", i1 true}

		!14 = distinct !{!14, !15}
		!15 = !{!"llvm.loop.interleave.count", i32 4}