This is an archive of the discontinued LLVM Phabricator instance.

Differential D79100

[LV] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops
ClosedPublic

Authored by SjoerdMeijer on Apr 29 2020, 9:24 AM.

Details

Reviewers

Ayal
fhahn
samparker
dmgreen
gilr
rengolin
simoll
rkruppe
sdesmalen
rogfer01
efriedma
lebedev.ri

Commits

rG47650451738c: [LV] Emit @llvm.get.active.mask for tail-folded loops

Summary

Tail-predication is a new form of predication in MVE for vector loops that implicitly predicates the last vector loop iteration by implicitly setting active/inactive lanes, i.e. the tail loop is predicated. In order to set up a tail-predicated vector loop, we need to know the number of data elements processed by the vector loop, which corresponds to the trip count of the scalar loop. We would like to propagate the scalar trip count to the backend, so that this can be picked up by the MVE tail-predication pass.

This implements the approach as discussed on the llvm-dev list, see Eli's comment in http://lists.llvm.org/pipermail/llvm-dev/2020-May/141360.html. The approach is based on emitting an intrinsic for deriving the mask. The vectoriser emits this new intrinsic in the vector preheader block when the new TTI hook indicates that the target can lower this intrinsic and that it is desired to do so for this loop. For MVE, we do this when the loop is tail-folded, which is the very first step in tail-predicating a loop. For all other targets, this intrinsic won't be emitted, as the default of the hook is of course not to do this.

This change will be followed up by MVE-specific changes to lower this intrinsic.
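The opt-in described in the summary can be pictured as a hook whose default answer is "no" and which an MVE-like target overrides. The following is only a minimal sketch under assumptions: the class names and the single-flag signature are hypothetical simplifications and not the TTI interface of this patch; only the hook name emitGetActiveLaneMask is taken from the discussion in this review.

```cpp
#include <cassert>

// Hypothetical, heavily simplified stand-in for the default TTI behaviour.
struct DefaultTargetInfo {
  virtual ~DefaultTargetInfo() = default;
  // Default: the target does not ask for the lane-mask intrinsic.
  virtual bool emitGetActiveLaneMask(bool /*TailFolded*/) const { return false; }
};

// Hypothetical MVE-like target: opts in only when the loop was tail-folded.
struct MVELikeTargetInfo : DefaultTargetInfo {
  bool emitGetActiveLaneMask(bool TailFolded) const override { return TailFolded; }
};

// What the vectoriser-side query boils down to in this sketch.
inline bool shouldEmitLaneMask(const DefaultTargetInfo &TTI, bool TailFolded) {
  return TTI.emitGetActiveLaneMask(TailFolded);
}

// Small self-check documenting the intended behaviour.
inline void checkHookBehaviour() {
  assert(!shouldEmitLaneMask(DefaultTargetInfo{}, /*TailFolded=*/true));
  assert(shouldEmitLaneMask(MVELikeTargetInfo{}, /*TailFolded=*/true));
  assert(!shouldEmitLaneMask(MVELikeTargetInfo{}, /*TailFolded=*/false));
}
```

Keeping the default at "false" means that targets which never override the hook see no change in the emitted IR.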

Diff Detail

Event Timeline

SjoerdMeijer created this revision. Apr 29 2020, 9:24 AM

Herald added a reviewer: rengolin. Apr 29 2020, 9:24 AM

Herald added a project: Restricted Project.

Herald added subscribers: jdoerfert, rkruppe, hiraditya.

SjoerdMeijer mentioned this in D79175: [ARM][MVE] Tail-Predication: use @llvm.get.active.lane.mask to get the BTC. Apr 30 2020, 6:59 AM

samparker added reviewers: simoll, rkruppe, sdesmalen. Apr 30 2020, 7:19 AM

samparker added a reviewer: rogfer01. Apr 30 2020, 7:23 AM

This is used in the ARM backend, in our tail-predication pass, see D79175.

added a test case, minor fixes in comments.

This was discussed on the list and implements the approach as suggested by @efriedma :

http://lists.llvm.org/pipermail/llvm-dev/2020-May/141360.html

I.e., we now emit an llvm.get.active.mask intrinsic into the vector body, which is used by the masked loads/stores, and its operands allow backend passes to reconstruct the required information. For example, a relevant snippet of a tail-folded vector body looks like this:

%induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
%[[PRED:.*]] = icmp ule <4 x i64> %induction, <i64 429, i64 429, i64 429, i64 429>
%[[MASK:.*]] = call <4 x i1> @llvm.get.active.mask.v4i1.v4i1(<4 x i1> %[[PRED]])
%[[WML1:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}<4 x i1> %[[MASK]]
%[[WML2:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}<4 x i1> %[[MASK]]

The induction and cmp feed the active.mask intrinsic, which is used by the masked loads/stores. This new intrinsic can now be easily picked up in backend passes, and the 429 in this example represents the backedge taken iteration count of the scalar loop, which we need to set up the tail-predication.
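To make the arithmetic in the snippet concrete, here is a minimal scalar sketch of what the widened induction variable and the `icmp ule` compute per lane. It assumes a vector width of 4 as in the snippet; the function name and the `Index` parameter (the scalar value that is splatted into %broadcast.splat) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned VF = 4; // vector width of the snippet above

// Mirrors `%induction = add <4 x i64> %broadcast.splat, <0, 1, 2, 3>`
// followed by `icmp ule <4 x i64> %induction, <BTC, BTC, BTC, BTC>`.
std::array<bool, VF> activeMaskFromWidenedIV(std::uint64_t Index,
                                             std::uint64_t BTC) {
  std::array<bool, VF> Mask{};
  for (unsigned Lane = 0; Lane < VF; ++Lane) {
    std::uint64_t Induction = Index + Lane; // one element of the widened IV
    Mask[Lane] = Induction <= BTC;          // unsigned less-or-equal
  }
  return Mask;
}

// With BTC = 429 the scalar loop runs 430 iterations, so the last vector
// iteration starts at index 428 and only its first two lanes are active.
inline void checkLastVectorIteration() {
  auto Mask = activeMaskFromWidenedIV(428, 429);
  assert(Mask[0] && Mask[1] && !Mask[2] && !Mask[3]);
}
```

In every earlier vector iteration all four lanes satisfy the comparison, so the mask only matters for the tail.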

Herald added a subscriber: vkmr. May 18 2020, 3:57 AM

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

In D79100#2041646, @samparker wrote:

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

That was my initial thought too, so I started drafting an intrinsic that would take the induction step, the backedge taken count, and the comparison operator, thus replacing the icmp and feeding its result into the masked loads/stores.
This, however, turned out to be a massively invasive change, because the induction variable is widened (which creates the induction step and icmp) in a different place from where the masking happens. This change is very minimal, makes explicit exactly the same information, and thus had my preference.

In D79100#2041747, @SjoerdMeijer wrote:

In D79100#2041646, @samparker wrote:

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

That was my initial thought too, so I started drafting an intrinsic that would take the induction step, the backedge taken count, and the comparison operator, thus replacing the icmp and feeding its result into the masked loads/stores.
This, however, turned out to be a massively invasive change, because the induction variable is widened (which creates the induction step and icmp) in a different place from where the masking happens. This change is very minimal, makes explicit exactly the same information, and thus had my preference.

Hmm, but thinking about it, after my initial attempt to do this I got some more VPlan plumbing experience, and I now see that I could try again and that it is perhaps not very different. If it helps for acceptance, I can try this, but I think my previous points still stand: this change is minimal and leaves the IR intact.

Tail-predication is a new form of predication in MVE for vector loops that implicitly predicates the last vector loop iteration by implicitly setting active/inactive lanes, i.e. the tail loop is predicated. In order to set up a tail-predicated vector loop, we need to know the number of data elements processed by the vector loop, which corresponds to the trip count of the scalar loop. We would like to propagate the scalar trip count to the backend, so that this can be picked up by the MVE tail-predication pass.

Just to make sure I understand: is this text still current? Your intrinsic does not seem to propagate the scalar trip count anymore (at least not explicitly). If I understand it correctly, now you communicate the active mask of the current vector iteration which you compute using i <= backedge taken-count. Did I get it right?

llvm/lib/Transforms/Vectorize/VPlan.cpp
386 ↗	(On Diff #264579)	Minor nit: Here you use `TC` (as in trip count?) but above when you created this `VPInstruction` you used `BTC` (backedge taken count). This might be confusing for a future reader so I'd suggest to stick to your own convention and use `BTC` here as well. This code seems similar to that of `VPRecipeBuilder::createBlockInMask` (except that one creates the mask for a given block and this one creates a mask similar to the case `OrigLoop->getHeader() == BB`). That function uses `BTC` as well.

What are the semantics of llvm.get.active.mask? I don't see an actual description anywhere beyond "it enables tail folding, somehow".

llvm/include/llvm/IR/Intrinsics.td
1419	I assume you meant to write LLVMMatchType<0>

vkmr marked an inline comment as done. May 18 2020, 3:42 PM

vkmr added inline comments.

llvm/lib/Transforms/Vectorize/VPlan.cpp
389–392 ↗	(On Diff #264579)	Minor nit: Probably cleaner to use `Builder.CreateIntrinsic(Intrinsic::get_active_mask, { V->getType(), V->getType() }, {V}, nullptr, "active.mask");`.

vkmr added inline comments. May 18 2020, 4:18 PM

llvm/lib/Transforms/Vectorize/VPlan.cpp
424–439 ↗	(On Diff #264579)	Add `case VPInstruction::ActiveMask` to print the correct VPInstruction when printing VPlan. You can pass the flag `-debug-only=loop-vectorize` to `opt` to see the generated VPlan.

That was my initial thought too, so I started drafting an intrinsic that would take the induction step, the backedge taken count, and the comparison operator

If the epilogue folding always uses ULE, then the operator wouldn't be needed. I think it would be really nice not to have the IV and upper limit splatted, plus we could remove the vector add on the IV too.

Thanks all for your comments! I think this addresses all comments:

I did a rename of ActiveMask to ActiveLaneMask because that describes it better I think,
The intrinsic now takes 2 arguments: the induction step, and the backedge taken count. It indeed represents the IV <= BTC comparison, thus giving it semantics rather than let it only pass on a value, which I have also clarified with comments at different places,
The VP instruction is now lowered using Builder.CreateIntrinsic, and have added a case to print it.

SjoerdMeijer marked an inline comment as done. May 19 2020, 7:19 AM

SjoerdMeijer added inline comments.

llvm/include/llvm/IR/Intrinsics.td
1424	Ah, forgot to say that I am looking into LLVMMatchType<0>; this was giving me some problems. This is very minor, I am looking into it, and in the meantime I already wanted to show the important changes.

Semantics are still unspecified. Before adding even more intrinsics,
i'd strongly suggest to specify at least the already-committed ones.
Because as far as i can tell, i don't see anything in langref for any of them.

This revision now requires changes to proceed. May 19 2020, 7:42 AM

In D79100#2044165, @lebedev.ri wrote:

Semantics are still unspecified. Before adding even more intrinsics,
i'd strongly suggest to specify at least the already-committed ones.
Because as far as i can tell, i don't see anything in langref for any of them.

This was intentional. With the already-committed ones you mean the hardware loop ones, and they are not meant to be user-facing intrinsics. That is, we don't expect users to play around with e.g. the hwloop.decrement intrinsic; these are really meant to be generated by the optimisers.
This new intrinsic here is slightly different, in that it probably is useful as a user-facing intrinsic, so I don't mind documenting it.

In D79100#2044289, @SjoerdMeijer wrote:

In D79100#2044165, @lebedev.ri wrote:

Semantics are still unspecified. Before adding even more intrinsics,
i'd strongly suggest to specify at least the already-committed ones.
Because as far as i can tell, i don't see anything in langref for any of them.

This was intentional. With the already-committed ones you mean the hardware loop ones, and they are not meant to be user-facing intrinsics. That is, we don't expect users to play around with e.g. the hwloop.decrement intrinsic; these are really meant to be generated by the optimisers.
This new intrinsic here is slightly different, in that it probably is useful as a user-facing intrinsic, so I don't mind documenting it.

@SjoerdMeijer, i don't believe that is how langref works.

Ok, perhaps I got that wrong then. @samparker can correct me here perhaps, but as I said, for the hardware loop intrinsics I believe this was intentional. But anyway, as I said, will document this new one. As the hardware loop intrinsics are completely separate from this, I will do that separately.

In D79100#2044530, @SjoerdMeijer wrote:

Ok, perhaps I got that wrong then. @samparker can correct me here perhaps, but as I said, for the hardware loop intrinsics I believe this was intentional. But anyway, as I said, will document this new one. As the hardware loop intrinsics are completely separate from this, I will do that separately.

I think that would still be backwards.
langref is not documentation, it is specification.
To reword, langref is source of truth, not the code.

Understood, and we don't disagree, this will be fixed.

+ LangRef description

efriedma added inline comments. May 19 2020, 2:54 PM

llvm/docs/LangRef.rst
16198 ↗	(On Diff #265018)	Is this semantically equivalent to icmp ule? If it is, you should probably make that more clear, and explain that it's a hint to the backend. If not, this needs a much more thorough explanation.
llvm/include/llvm/IR/Intrinsics.td
1239	Is IntrNoDuplicate here actually semantically significant? The LangRef explanation doesn't really indicate why it needs to be noduplicate. Please use LLVMMatchType/LLVMScalarOrSameVectorWidth to ensure the argument/result types match.

Thanks @efriedma : added a statement about the equivalence to the langref description, and fixed the intrinsic description.

In D79100#2041782, @SjoerdMeijer wrote:

In D79100#2041747, @SjoerdMeijer wrote:

In D79100#2041646, @samparker wrote:

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

That was my initial thought too, so I started drafting an intrinsic that would take the induction step, the backedge taken count, and the comparison operator, thus replacing the icmp and feeding its result into the masked loads/stores.
This, however, turned out to be a massively invasive change, because the induction variable is widened (which creates the induction step and icmp) in a different place from where the masking happens. This change is very minimal, makes explicit exactly the same information, and thus had my preference.

Hmm, but thinking about it, after my initial attempt to do this I got some more VPlan plumbing experience, and I now see that I could try again and that it is perhaps not very different. If it helps for acceptance, I can try this, but I think my previous points still stand: this change is minimal and leaves the IR intact.

The comparison really should be encapsulated in the intrinsic itself because for scalable types it is not clear how many bits the stepvector type needs to enumerate its lanes without overflow:

%lane_ids = <vscale x 1 x i8> llvm.stepvector() ; This will overflow if 'vscale > 256' at runtime (note that a 'stepvector' intrinsic does not even exist at this point)
%lane_mask = icmp ule %lane_ids, (splat %n)

That is not a problem if you have

%lane_mask = <vscale x 1 x i1> llvm.active.mask(i32 0, i32 %n)

We need such an intrinsic anyway for the VP expansion pass to legalize the EVL parameter of VP intrinsics for targets that have scalable types but no active vector length (tail loop predication).
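The overflow concern can be illustrated with a small scalar model. This is only a sketch under assumptions: lane ids are modelled as 8-bit values as in the i8 stepvector example, the runtime lane count is an ordinary function argument, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Lane ids held in 8 bits, as in the <vscale x 1 x i8> stepvector example:
// the ids wrap around after lane 255.
std::vector<bool> maskViaNarrowStepVector(std::size_t NumLanes, std::uint8_t N) {
  std::vector<bool> Mask(NumLanes);
  for (std::size_t Lane = 0; Lane < NumLanes; ++Lane) {
    auto LaneId = static_cast<std::uint8_t>(Lane); // truncates: 256 becomes 0
    Mask[Lane] = LaneId <= N;
  }
  return Mask;
}

// Mask defined directly per lane index from two scalars, so no lane-id
// vector type (and hence no lane-id overflow) is involved.
std::vector<bool> maskFromScalars(std::size_t NumLanes, std::uint64_t Base,
                                  std::uint64_t N) {
  std::vector<bool> Mask(NumLanes);
  for (std::size_t Lane = 0; Lane < NumLanes; ++Lane)
    Mask[Lane] = Base + Lane <= N; // 64-bit arithmetic in this model
  return Mask;
}

// With 300 lanes and n = 10, lane 256 is wrongly active in the narrow model.
inline void checkNarrowLaneIdsWrap() {
  assert(maskViaNarrowStepVector(300, 10)[256]);
  assert(!maskFromScalars(300, 0, 10)[256]);
}
```

Both variants agree on the first 256 lanes; they only diverge once the narrow lane ids start repeating.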

This patch, this new intrinsic, is a straightforward translation from an icmp. It provides the required information, and exactly the same as in your example, albeit in a slightly different form. Scalable vectors, stepvector intrinsics, etc., are not used yet by the vectoriser and are not yet defined, respectively, so using that at this point is problematic.

As this is such a straightforward translation, with straightforward semantics, this won't block in any way future developments. In fact, this is the first step in that direction, and it is good we get some experience with it. We offer our support to adapt this approach/intrinsic and port it, should this be necessary, which should be straightforward.

SjoerdMeijer mentioned this in D80316: [HardwareLoops] Intrinsic LangRef descriptions. May 20 2020, 11:20 AM

I think all Simon is asking for is that the first argument of the intrinsic should be a scalar, equal to the first lane of the vector induction variable. How much work would that be?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6815	It would be nice to explicitly note here that we're assuming the vector factor is a power of two (so the induction variable can't wrap).

In D79100#2047545, @efriedma wrote:

I think all Simon is asking for is that the first argument of the intrinsic should be a scalar, equal to the first lane of the vector induction variable. How much work would that be?

Ah okay, sorry, I missed that. That should hopefully be straightforward, and if we have everyone on board with this, I will fix that right away tomorrow.

Uploading a new revision to see if we can reach a conclusion on this.

The intrinsic now takes two scalar arguments, the first element of the VIV and the scalar BTC, hopefully in the way you had in mind.

On the dev list, Ayal pointed out an alternative to rediscover the BTC in the backend (I still need to spend time to confirm we then don't miss anything). This intrinsic still would have my preference because of its convenience, and as it is a general concept it seems to support different uses too (VP intrinsics, and it provides a target-independent way of describing the active mask).

The intrinsic now takes two scalar arguments, the first element of the VIV and the scalar BTC, hopefully in the way you had in mind.

Yes, this is what I expected.

On the dev list, Ayal pointed out an alternative to rediscover the BTC in the backend (I still need to spend time to confirm we then don't miss anything). This intrinsic still would have my preference because of its convenience, and as it is a general concept it seems to support different uses too (VP intrinsics, and it provides a target-independent way of describing the active mask).

In general, given two expressions that are equivalent, there's always some way to pattern-match. For power-of-two fixed-width vectors, the pattern you'd need to match is not that complicated (basically just an icmp where one operand is a vector induction variable, and the other operand is a splat.) But like you say, the intrinsic is more convenient.

If you're dealing with scalable vectors, expanding the intrinsic to arithmetic would result in a very complex expression due to the potential for overflow, so we'll almost certainly want this intrinsic.

The changes to VPlan.cpp look a little suspicious, but I'm not really familiar with that code, so I'll let someone else review it.

llvm/docs/LangRef.rst
16202 ↗	(On Diff #265969)	Maybe qualify this with the caveat that it's only equivalent if the vector induction variable doesn't overflow. We probably want to return false in the lanes where it overflows.
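The caveat about overflowing lanes can be shown with two scalar models of the two-argument form (first lane of the induction variable, backedge taken count). This is a sketch under assumptions: 32-bit operands, hypothetical function names, and the "inactive on overflow" behaviour suggested in the comment above rather than any confirmed final semantics.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Naive model: the per-lane index is computed in 32 bits and may wrap.
std::vector<bool> naiveLaneMask(std::uint32_t Base, std::uint32_t BTC,
                                unsigned VF) {
  std::vector<bool> Mask(VF);
  for (unsigned Lane = 0; Lane < VF; ++Lane) {
    std::uint32_t Index = Base + Lane; // wraps modulo 2^32
    Mask[Lane] = Index <= BTC;
  }
  return Mask;
}

// Guarded model: the index is computed in 64 bits, so a lane whose 32-bit
// index would have wrapped compares greater than BTC and is inactive.
std::vector<bool> guardedLaneMask(std::uint32_t Base, std::uint32_t BTC,
                                  unsigned VF) {
  std::vector<bool> Mask(VF);
  for (unsigned Lane = 0; Lane < VF; ++Lane) {
    std::uint64_t Index = std::uint64_t{Base} + Lane; // cannot wrap here
    Mask[Lane] = Index <= BTC;
  }
  return Mask;
}

// Near the top of the 32-bit range the two models disagree on lanes 2 and 3.
inline void checkOverflowLanes() {
  auto Naive = naiveLaneMask(0xFFFFFFFEu, 0xFFFFFFFEu, 4);
  auto Guarded = guardedLaneMask(0xFFFFFFFEu, 0xFFFFFFFEu, 4);
  assert(Naive[0] && !Naive[1] && Naive[2] && Naive[3]);
  assert(Guarded[0] && !Guarded[1] && !Guarded[2] && !Guarded[3]);
}
```

As long as no lane index wraps, both models are the plain `icmp ule` and produce identical masks.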

Thanks Eli for commenting and explaining.

I guess it was the TODO in VPlan.cpp that looked a bit suspicious. There wasn't much going on there and it was very straightforward, so I fixed that. I.e., Ayal added support for decrementing loops in VPlan recently. The backedge taken count value looks slightly different for these cases, but extracting it is easy, which I have added here. As a result, this now also triggers in test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll, which I updated.

simoll mentioned this in D78203: [VP,Integer,#2] ExpandVectorPredication pass. May 26 2020, 5:23 AM

In D79100#2054437, @SjoerdMeijer wrote:

Thanks Eli for commenting and explaining.

I guess it was the TODO in VPlan.cpp that looked a bit suspicious. There wasn't much going on there and it was very straightforward, so I fixed that. I.e., Ayal added support for decrementing loops in VPlan recently. The backedge taken count value looks slightly different for these cases, but extracting it is easy, which I have added here. As a result, this now also triggers in test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll, which I updated.

I think introducing a VPInstruction opcode for the new intrinsic makes sense and fits in the current scheme. But I think there's no need to bundle the langref, TTI and LV changes into a single patch. IMO would be good to split at least the LV part off, to focus on discussing the implementation details in LV there.

Many thanks for taking a look Florian!

I think introducing a VPInstruction opcode for the new intrinsic makes sense and fits in the current scheme. But I think there's no need to bundle the langref, TTI and LV changes into a single patch. IMO would be good to split at least the LV part off, to focus on discussing the implementation details in LV there

Ok, but just double checking that I get this right:

In the LV part, we do use and check TTI.emitGetActiveLaneMask, so I guess that means we have LV + TTI in one patch,
and separate we have LangRef.

That looks like a minimal change, but certainly don't mind doing this if that helps.

In D79100#2055802, @SjoerdMeijer wrote:

Many thanks for taking a look Florian!

I think introducing a VPInstruction opcode for the new intrinsic makes sense and fits in the current scheme. But I think there's no need to bundle the langref, TTI and LV changes into a single patch. IMO would be good to split at least the LV part off, to focus on discussing the implementation details in LV there

Ok, but just double checking that I get this right:

In the LV part, we do use and check TTI.emitGetActiveLaneMask, so I guess that means we have LV + TTI in one patch,

I'd just put the LV parts in a separate patch and the TTI stuff in a different patch (the former depending on the latter). The way I see it, those are changes to 2 separate areas of the code, with potentially different people to approve. Also, if there's a problem with the LV patch, it can be reverted in isolation.

SjoerdMeijer mentioned this in D80596: New intrinsic @llvm.get.active.lane.mask(). May 26 2020, 3:52 PM

Ok, thanks, sounds like a plan.

I will keep this as a reference, I think that's convenient, and will split off the different bits and pieces.
I have started with the easiest patch to create, the intrinsic description and documentation: D80596

SjoerdMeijer mentioned this in D80597: [TTI] New target hook emitGetActiveLaneMask. May 26 2020, 4:10 PM

SjoerdMeijer updated this revision to Diff 266369. May 26 2020, 4:21 PM

SjoerdMeijer retitled this revision from [LV][TTI] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops to [LV] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops.

SjoerdMeijer mentioned this in rG7fb8a40e5220: New intrinsic @llvm.get.active.lane.mask(). May 29 2020, 1:03 AM

SjoerdMeijer mentioned this in rG7480ccbfc9d2: [TTI] New target hook emitGetActiveLaneMask. May 29 2020, 1:36 AM

I have just committed the required scaffolding for this:

the new intrinsic in rG7fb8a40e5220
the new TTI hook in rG7480ccbfc9d2

So was wondering if we are happy with this LV part too?

Just FYI, I am addressing comments on our backend support for this intrinsic in D79175. Once that is ready, I will commit this and D79175 at the same time.

fhahn added inline comments. May 29 2020, 4:25 AM

llvm/lib/Transforms/Vectorize/VPlan.cpp
425 ↗	(On Diff #266369)	IIUC we want the first lanes of both the BTC and the IV, right? If that's the case, I think it would be more straightforward to just request the specific lane when looking up the values, e.g. something like:

    // Get first lane of vector induction variable.
    Value *VIVE0 = State.get(getOperand(0), {Part, 0});
    // Get first lane of backedge-taken-count.
    Value *ScalarBTC = State.get(getOperand(1), {Part, 0});

    auto Int1Ty = Type::getInt1Ty(Builder.getContext());
    auto PredTy = VectorType::get(Int1Ty, State.VF);
    Instruction *Call = Builder.CreateIntrinsic(
        Intrinsic::get_active_lane_mask, {PredTy, ScalarBTC->getType()},
        {VIVE0, ScalarBTC}, nullptr, "active.lane.mask");
    State.set(this, Call, Part);

This should have the advantage of re-using some scalar values, if they have been requested already by other recipes. (Note: this currently crashes unfortunately, as getting lanes for defs managed by VPTransformState does not work, but I put up D80787 to fix that.)
llvm/test/Transforms/LoopVectorize/ARM/tail-loop-folding.ll
79	We should also have a test where the unroll-factor/interleave-count > 1 with `llvm.get.active.lane.mask` (unless that's not possible at the moment)
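For the interleaved case requested here, each unrolled part would get its own mask, computed from the first lane of that part's induction variable. The sketch below models this with plain scalars; it assumes that part P covers the VF consecutive scalar iterations starting at Index + P * VF, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One lane mask per unrolled part. Part P is assumed to start at scalar
// iteration Index + P * VF; a lane is active iff its iteration is <= BTC.
std::vector<std::vector<bool>> laneMasksPerPart(std::uint64_t Index,
                                                std::uint64_t BTC,
                                                unsigned VF, unsigned UF) {
  std::vector<std::vector<bool>> Masks(UF, std::vector<bool>(VF));
  for (unsigned Part = 0; Part < UF; ++Part) {
    std::uint64_t FirstLane = Index + std::uint64_t{Part} * VF;
    for (unsigned Lane = 0; Lane < VF; ++Lane)
      Masks[Part][Lane] = FirstLane + Lane <= BTC;
  }
  return Masks;
}

// Trip count 6 (BTC = 5) with VF = 4 and UF = 2: the first part is fully
// active, the second part has only its first two lanes active.
inline void checkInterleavedMasks() {
  auto Masks = laneMasksPerPart(0, 5, 4, 2);
  assert(Masks[0] == std::vector<bool>({true, true, true, true}));
  assert(Masks[1] == std::vector<bool>({true, true, false, false}));
}
```

A test with interleave count 2 would therefore be expected to contain two mask computations per vector iteration, one per part.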

fhahn added inline comments. May 29 2020, 4:48 AM

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1313	Just return `TailFolded`? side note: what are all the extra parameters needed/used somewhere?

SjoerdMeijer marked an inline comment as done. May 29 2020, 5:24 AM

SjoerdMeijer added inline comments.

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1313	Just a quick first comment on your side note ("what are all the extra parameters needed/used somewhere?"): that's just to provide an interface consistent with most other TTI functions here. Currently, we only look at `TailFolded`; the rest is indeed unused. But if (other) targets want to do a bit more analysis here, they most definitely want to look at `L` and `LI`, and possibly `SE`. So, as I said, mostly unused at the moment, and added for consistency, which I accept may or may not be a good reason. I have no strong opinions, so will refactor this TTI hook right away if you think that is better/necessary. I will address your other remarks soon. Thanks for that.

IIUC we want the first lanes of both the BTC and the IV, right?

Yep, exactly that.

If that's the case, I think it would be more straightforward to just request the specific lane when looking up the values, e.g. something like:
<SNIP>
This should have the advantage of re-using some scalar values, if they have been requested already by other recipes.

(Note: this currently crashes unfortunately, as getting lanes for defs managed by VPTransformState does not work, but I put up D80787 to fix that)

Thanks for that. I took D80787, and that worked like a charm, greatly simplifying things here: I have simplified lowering of VPInstruction::ActiveLaneMask as per your suggestion.

SjoerdMeijer mentioned this in D80787: [VPlan] Support extracting lanes for defs managed in VPTransformState. Jun 2 2020, 8:59 AM

With D80787 committed, and this using that, does this look okay now?

LGTM with the comment, thanks! Please wait a day or so with committing in case there are additional comments. Also, I might have missed it, but it would be good to have a test case with unroll-factor/interleave-count > 1 (as mentioned in an inline comment).

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1313	I am not sure, but I think adding all those unused parameters because they might get used in the future seems a bit odd. It would be easy enough to add them if required.

Many thanks for looking at this!

This needs to be committed together with the backend patch that adds support for lowering this intrinsic. That is nearly done (hopefully), and am expecting a commit somewhere next week, so will definitely wait a few days.

I will add that test case (sorry, I overlooked and missed that), and will follow up to reduce the arguments in the TTI hook.

This revision was not accepted when it landed; it landed in state Needs Review. Jun 17 2020, 2:08 AM

Closed by commit rG47650451738c: [LV] Emit @llvm.get.active.mask for tail-folded loops (authored by SjoerdMeijer).

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer mentioned this in rG20835cff272e: [TTI] Refactor emitGetActiveLaneMask.

Herald added a subscriber: bmahjour. Jun 17 2020, 2:08 AM

Hi, seems that this patch is causing some failures:
Failed Tests (2):

LLVM :: Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll
LLVM :: Transforms/LoopVectorize/ARM/tail-loop-folding.ll

I replied by mail, but just for completeness, I had noticed and reverted it, and now recommitted it.

SjoerdMeijer mentioned this in rGd1522513d4c4: [ARM] Reimplement MVE Tail-Predication pass using @llvm.get.active.lane.mask. Jun 17 2020, 7:32 AM

Revision Contents

Path                                                                  Size
llvm/include/llvm/Analysis/TargetTransformInfo.h                      11 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h                  5 lines
llvm/include/llvm/CodeGen/BasicTTIImpl.h                              5 lines
llvm/include/llvm/IR/Intrinsics.td                                    5 lines
llvm/lib/Analysis/TargetTransformInfo.cpp                             5 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.h                          4 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp                        11 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp                       26 lines
llvm/test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll  4 lines
llvm/test/Transforms/LoopVectorize/ARM/tail-loop-folding.ll           8 lines

Diff 260940

llvm/include/llvm/Analysis/TargetTransformInfo.h

Show First 20 Lines • Show All 466 Lines • ▼ Show 20 Lines	public:

/// Query the target whether it would be prefered to create a predicated		/// Query the target whether it would be prefered to create a predicated
/// vector loop, which can avoid the need to emit a scalar epilogue loop.		/// vector loop, which can avoid the need to emit a scalar epilogue loop.
bool preferPredicateOverEpilogue(Loop L, LoopInfo LI, ScalarEvolution &SE,		bool preferPredicateOverEpilogue(Loop L, LoopInfo LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI) const;		const LoopAccessInfo *LAI) const;

		/// Query the target whether lowering of the llvm.set.loop.elements intrinsic
		/// is supported and desired for this loop.
		bool emitNumElementsVecLoop(Loop L, LoopInfo LI, ScalarEvolution &SE,
		bool TailFolded) const;

/// @}		/// @}

/// \name Scalar Target Information		/// \name Scalar Target Information
/// @{		/// @{

/// Flags indicating the kind of support for population count.		/// Flags indicating the kind of support for population count.
///		///
/// Compared to the SW implementation, HW support is supposed to		/// Compared to the SW implementation, HW support is supposed to
▲ Show 20 Lines • Show All 707 Lines • ▼ Show 20 Lines	public:
virtual bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,		virtual bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
AssumptionCache &AC,		AssumptionCache &AC,
TargetLibraryInfo *LibInfo,		TargetLibraryInfo *LibInfo,
HardwareLoopInfo &HWLoopInfo) = 0;		HardwareLoopInfo &HWLoopInfo) = 0;
virtual bool		virtual bool
preferPredicateOverEpilogue(Loop L, LoopInfo LI, ScalarEvolution &SE,		preferPredicateOverEpilogue(Loop L, LoopInfo LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree DT, const LoopAccessInfo LAI) = 0;		DominatorTree DT, const LoopAccessInfo LAI) = 0;
		virtual bool emitNumElementsVecLoop(Loop L, LoopInfo LI, ScalarEvolution &SE,
		bool TailFolded) = 0;
virtual bool isLegalAddImmediate(int64_t Imm) = 0;		virtual bool isLegalAddImmediate(int64_t Imm) = 0;
virtual bool isLegalICmpImmediate(int64_t Imm) = 0;		virtual bool isLegalICmpImmediate(int64_t Imm) = 0;
virtual bool isLegalAddressingMode(Type Ty, GlobalValue BaseGV,		virtual bool isLegalAddressingMode(Type Ty, GlobalValue BaseGV,
int64_t BaseOffset, bool HasBaseReg,		int64_t BaseOffset, bool HasBaseReg,
int64_t Scale, unsigned AddrSpace,		int64_t Scale, unsigned AddrSpace,
Instruction *I) = 0;		Instruction *I) = 0;
virtual bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,		virtual bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
TargetTransformInfo::LSRCost &C2) = 0;		TargetTransformInfo::LSRCost &C2) = 0;
▲ Show 20 Lines • Show All 261 Lines • ▼ Show 20 Lines	bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
return Impl.isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);		return Impl.isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);
}		}
bool preferPredicateOverEpilogue(Loop L, LoopInfo LI, ScalarEvolution &SE,		bool preferPredicateOverEpilogue(Loop L, LoopInfo LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI) override {		const LoopAccessInfo *LAI) override {
return Impl.preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);		return Impl.preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);
}		}
		bool emitNumElementsVecLoop(Loop L, LoopInfo LI, ScalarEvolution &SE,
		bool TailFolded) override {
		return Impl.emitNumElementsVecLoop(L, LI, SE, TailFolded);
		}
bool isLegalAddImmediate(int64_t Imm) override {		bool isLegalAddImmediate(int64_t Imm) override {
return Impl.isLegalAddImmediate(Imm);		return Impl.isLegalAddImmediate(Imm);
}		}
bool isLegalICmpImmediate(int64_t Imm) override {		bool isLegalICmpImmediate(int64_t Imm) override {
return Impl.isLegalICmpImmediate(Imm);		return Impl.isLegalICmpImmediate(Imm);
}		}
bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,		bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
bool HasBaseReg, int64_t Scale, unsigned AddrSpace,		bool HasBaseReg, int64_t Scale, unsigned AddrSpace,

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

Show First 20 Lines • Show All 137 Lines • ▼ Show 20 Lines	public:

bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI) const {		const LoopAccessInfo *LAI) const {
return false;		return false;
}		}

		bool emitNumElementsVecLoop(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
		bool TailFold) const {
		return false;
		}

void getUnrollingPreferences(Loop *, ScalarEvolution &,		void getUnrollingPreferences(Loop *, ScalarEvolution &,
TTI::UnrollingPreferences &) {}		TTI::UnrollingPreferences &) {}

bool isLegalAddImmediate(int64_t Imm) { return false; }		bool isLegalAddImmediate(int64_t Imm) { return false; }

bool isLegalICmpImmediate(int64_t Imm) { return false; }		bool isLegalICmpImmediate(int64_t Imm) { return false; }

bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,		bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,

llvm/include/llvm/CodeGen/BasicTTIImpl.h

Show First 20 Lines • Show All 486 Lines • ▼ Show 20 Lines	public:

bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI) {		const LoopAccessInfo *LAI) {
return BaseT::preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);		return BaseT::preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);
}		}

		bool emitNumElementsVecLoop(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
		bool TailFold) {
		return BaseT::emitNumElementsVecLoop(L, LI, SE, TailFold);
		}

int getInstructionLatency(const Instruction *I) {		int getInstructionLatency(const Instruction *I) {
if (isa<LoadInst>(I))		if (isa<LoadInst>(I))
return getST()->getSchedModel().DefaultLoadLatency;		return getST()->getSchedModel().DefaultLoadLatency;

return BaseT::getInstructionLatency(I);		return BaseT::getInstructionLatency(I);
}		}

virtual Optional<unsigned>		virtual Optional<unsigned>

llvm/include/llvm/IR/Intrinsics.td

Show First 20 Lines • Show All 1,230 Lines • ▼ Show 20 Lines	def int_vp_xor : Intrinsic<[ llvm_anyvector_ty ],
LLVMMatchType<0>,		LLVMMatchType<0>,
LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>,		LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>,
llvm_i32_ty]>;		llvm_i32_ty]>;

}		}


//===-------------------------- Masked Intrinsics -------------------------===//		//===-------------------------- Masked Intrinsics -------------------------===//
//		//
		efriedma: Is IntrNoDuplicate here actually semantically significant? The LangRef explanation doesn't really indicate why it needs to be noduplicate. Please use LLVMMatchType/LLVMScalarOrSameVectorWidth to ensure the argument/result types match.
def int_masked_store : Intrinsic<[], [llvm_anyvector_ty,		def int_masked_store : Intrinsic<[], [llvm_anyvector_ty,
LLVMAnyPointerType<LLVMMatchType<0>>,		LLVMAnyPointerType<LLVMMatchType<0>>,
llvm_i32_ty,		llvm_i32_ty,
LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],		LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
[IntrArgMemOnly, IntrWillReturn, ImmArg<2>]>;		[IntrArgMemOnly, IntrWillReturn, ImmArg<2>]>;

def int_masked_load : Intrinsic<[llvm_anyvector_ty],		def int_masked_load : Intrinsic<[llvm_anyvector_ty],
[LLVMAnyPointerType<LLVMMatchType<0>>, llvm_i32_ty,		[LLVMAnyPointerType<LLVMMatchType<0>>, llvm_i32_ty,
▲ Show 20 Lines • Show All 160 Lines • ▼ Show 20 Lines

//===---------- Intrinsics to control hardware supported loops ----------===//		//===---------- Intrinsics to control hardware supported loops ----------===//

// Specify that the value given is the number of iterations that the next loop		// Specify that the value given is the number of iterations that the next loop
// will execute.		// will execute.
def int_set_loop_iterations :		def int_set_loop_iterations :
Intrinsic<[], [llvm_anyint_ty], [IntrNoDuplicate]>;		Intrinsic<[], [llvm_anyint_ty], [IntrNoDuplicate]>;

		// Specify the number of elements processed by this (vector) loop, which
		// typically corresponds to the iteration count of the scalar loop.
		def int_set_loop_elements:
		Intrinsic<[], [llvm_anyint_ty], [IntrNoDuplicate]>;
		efriedma: I assume you meant to write LLVMMatchType<0>

// Specify that the value given is the number of iterations that the next loop		// Specify that the value given is the number of iterations that the next loop
// will execute. Also test that the given count is not zero, allowing it to		// will execute. Also test that the given count is not zero, allowing it to
// control entry to a 'while' loop.		// control entry to a 'while' loop.
def int_test_set_loop_iterations :		def int_test_set_loop_iterations :
		SjoerdMeijer (author): Ah, forgot to say that I am looking into LLVMMatchType<0>; this was giving me some problems. This is very minor, and I am looking into it. In the meantime I already wanted to show the important changes.
Intrinsic<[llvm_i1_ty], [llvm_anyint_ty], [IntrNoDuplicate]>;		Intrinsic<[llvm_i1_ty], [llvm_anyint_ty], [IntrNoDuplicate]>;

// Decrement loop counter by the given argument. Return false if the loop		// Decrement loop counter by the given argument. Return false if the loop
// should exit.		// should exit.
def int_loop_decrement :		def int_loop_decrement :
Intrinsic<[llvm_i1_ty], [llvm_anyint_ty], [IntrNoDuplicate]>;		Intrinsic<[llvm_i1_ty], [llvm_anyint_ty], [IntrNoDuplicate]>;

// Decrement the first operand (the loop counter) by the second operand (the		// Decrement the first operand (the loop counter) by the second operand (the

llvm/lib/Analysis/TargetTransformInfo.cpp

	Show First 20 Lines • Show All 227 Lines • ▼ Show 20 Lines

	bool TargetTransformInfo::preferPredicateOverEpilogue(			bool TargetTransformInfo::preferPredicateOverEpilogue(
	Loop *L, LoopInfo *LI, ScalarEvolution &SE, AssumptionCache &AC,			Loop *L, LoopInfo *LI, ScalarEvolution &SE, AssumptionCache &AC,
	TargetLibraryInfo *TLI, DominatorTree *DT,			TargetLibraryInfo *TLI, DominatorTree *DT,
	const LoopAccessInfo *LAI) const {			const LoopAccessInfo *LAI) const {
	return TTIImpl->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);			return TTIImpl->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);
	}			}

				bool TargetTransformInfo::emitNumElementsVecLoop(Loop *L, LoopInfo *LI,
				ScalarEvolution &SE, bool TailFolded) const {
				return TTIImpl->emitNumElementsVecLoop(L, LI, SE, TailFolded);
				}

	void TargetTransformInfo::getUnrollingPreferences(			void TargetTransformInfo::getUnrollingPreferences(
	Loop *L, ScalarEvolution &SE, UnrollingPreferences &UP) const {			Loop *L, ScalarEvolution &SE, UnrollingPreferences &UP) const {
	return TTIImpl->getUnrollingPreferences(L, SE, UP);			return TTIImpl->getUnrollingPreferences(L, SE, UP);
	}			}

	bool TargetTransformInfo::isLegalAddImmediate(int64_t Imm) const {			bool TargetTransformInfo::isLegalAddImmediate(int64_t Imm) const {
	return TTIImpl->isLegalAddImmediate(Imm);			return TTIImpl->isLegalAddImmediate(Imm);
	}			}

llvm/lib/Target/ARM/ARMTargetTransformInfo.h

Show First 20 Lines • Show All 231 Lines • ▼ Show 20 Lines	bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
TargetLibraryInfo *LibInfo,		TargetLibraryInfo *LibInfo,
HardwareLoopInfo &HWLoopInfo);		HardwareLoopInfo &HWLoopInfo);
bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI,		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI,
ScalarEvolution &SE,		ScalarEvolution &SE,
AssumptionCache &AC,		AssumptionCache &AC,
TargetLibraryInfo *TLI,		TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI);		const LoopAccessInfo *LAI);

void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP);		TTI::UnrollingPreferences &UP);

		bool emitNumElementsVecLoop(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
		bool TailFolded) const;

bool shouldBuildLookupTablesForConstant(Constant *C) const {		bool shouldBuildLookupTablesForConstant(Constant *C) const {
// In the ROPI and RWPI relocation models we can't have pointers to global		// In the ROPI and RWPI relocation models we can't have pointers to global
// variables or functions in constant data, so don't convert switches to		// variables or functions in constant data, so don't convert switches to
// lookup tables if any of the values would need relocation.		// lookup tables if any of the values would need relocation.
if (ST->isROPI() || ST->isRWPI())		if (ST->isROPI() || ST->isRWPI())
return !C->needsRelocation();		return !C->needsRelocation();

return true;		return true;
}		}
/// @}		/// @}
};		};

} // end namespace llvm		} // end namespace llvm

#endif // LLVM_LIB_TARGET_ARM_ARMTARGETTRANSFORMINFO_H		#endif // LLVM_LIB_TARGET_ARM_ARMTARGETTRANSFORMINFO_H

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp

Show First 20 Lines • Show All 1,303 Lines • ▼ Show 20 Lines	if (!HWLoopInfo.isHardwareLoopCandidate(SE, LI, DT)) {
LLVM_DEBUG(dbgs() << "preferPredicateOverEpilogue: hardware-loop is not "		LLVM_DEBUG(dbgs() << "preferPredicateOverEpilogue: hardware-loop is not "
"a candidate.\n");		"a candidate.\n");
return false;		return false;
}		}

return canTailPredicateLoop(L, LI, SE, DL, LAI);		return canTailPredicateLoop(L, LI, SE, DL, LAI);
}		}

		bool ARMTTIImpl::emitNumElementsVecLoop(Loop *L, LoopInfo *LI,
		ScalarEvolution &SE, bool TailFolded) const {
		fhahn: Just return `TailFolded`? side note: what are all the extra parameters needed/used somewhere?
		SjoerdMeijer (author): Just a quick first comment on your side note: "what are all the extra parameters needed/used somewhere?" That's just to provide an interface consistent with most other TTI functions here. Currently, we only look at `TailFolded`; the rest is indeed unused. But if (other) targets want to do a bit more analysis here, they most definitely want to look at `L` and `LI`, and possibly `SE`. So, as I said, mostly unused at the moment, and added for consistency, which I accept may or may not be a good reason. I have no strong opinions, so I will refactor this TTI hook right away if you think that is better/necessary. I will address your other remarks soon. Thanks for that.
		fhahn: I am not sure, but I think adding all those unused parameters because they might get used in the future seems a bit odd. It would be easy enough to add them if required.
		// If this loop is tail-folded, we always want to emit the
		// llvm.set.loop.elements intrinsic, so that this can be picked up in the
		// MVETailPredication pass that needs to know the number of elements
		// processed by this vector loop.
		if (TailFolded)
		return true;
		return false;
		}
void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP) {		TTI::UnrollingPreferences &UP) {
// Only currently enable these preferences for M-Class cores.		// Only currently enable these preferences for M-Class cores.
if (!ST->isMClass())		if (!ST->isMClass())
return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP);		return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP);

// Disable loop unrolling for Oz and Os.		// Disable loop unrolling for Oz and Os.
UP.OptSizeThreshold = 0;		UP.OptSizeThreshold = 0;
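
The hook in this hunk follows the usual TTI pattern: a conservative default that returns false, and a target override that opts in. The following is a minimal, self-contained sketch of that decision only. It assumes hypothetical stand-in types (`DefaultTTISketch`, `ARMTTISketch`) and a hypothetical helper `shouldEmitElementCount`; none of these are real LLVM classes. It also adopts the reviewer's suggestion of returning `TailFolded` directly and dropping the unused parameters.

```cpp
#include <cassert>

// Hypothetical stand-ins for the TTI implementation classes. These are not
// the real LLVM types; they only model the boolean decision made by the hook.
struct DefaultTTISketch {
  virtual ~DefaultTTISketch() = default;
  // Default: targets do not ask for the element-count intrinsic.
  virtual bool emitNumElementsVecLoop(bool /*TailFolded*/) const {
    return false;
  }
};

struct ARMTTISketch : DefaultTTISketch {
  // Opt in exactly when the loop was tail-folded.
  bool emitNumElementsVecLoop(bool TailFolded) const override {
    return TailFolded;
  }
};

// Models the TTI guard in InnerLoopVectorizer::emitNumElementsVecLoop:
// the intrinsic is emitted only if the target hook returns true.
bool shouldEmitElementCount(const DefaultTTISketch &TTI, bool TailFolded) {
  return TTI.emitNumElementsVecLoop(TailFolded);
}
```

Under these assumptions, only the ARM-like override with a tail-folded loop yields `true`; every other combination leaves the vector preheader untouched.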

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp


Show First 20 Lines • Show All 633 Lines • ▼ Show 20 Lines	protected:

/// Emit a bypass check to see if all of the SCEV assumptions we've		/// Emit a bypass check to see if all of the SCEV assumptions we've
/// had to make are correct.		/// had to make are correct.
void emitSCEVChecks(Loop *L, BasicBlock *Bypass);		void emitSCEVChecks(Loop *L, BasicBlock *Bypass);

/// Emit bypass checks to check any memory assumptions we may have made.		/// Emit bypass checks to check any memory assumptions we may have made.
void emitMemRuntimeChecks(Loop *L, BasicBlock *Bypass);		void emitMemRuntimeChecks(Loop *L, BasicBlock *Bypass);

		/// Emit the llvm.set.loop.elements IR intrinsic that models the number of
		/// data elements processed by the vector loop.
		void emitNumElementsVecLoop(BasicBlock *Bypass, Value *Count);

/// Compute the transformed value of Index at offset StartValue using step		/// Compute the transformed value of Index at offset StartValue using step
/// StepValue.		/// StepValue.
/// For integer induction, returns StartValue + Index * StepValue.		/// For integer induction, returns StartValue + Index * StepValue.
/// For pointer induction, returns StartValue[Index * StepValue].		/// For pointer induction, returns StartValue[Index * StepValue].
/// FIXME: The newly created binary instructions should contain nsw/nuw		/// FIXME: The newly created binary instructions should contain nsw/nuw
/// flags, which can be found from the original scalar operations.		/// flags, which can be found from the original scalar operations.
Value *emitTransformedIndex(IRBuilder<> &B, Value *Index, ScalarEvolution *SE,		Value *emitTransformedIndex(IRBuilder<> &B, Value *Index, ScalarEvolution *SE,
const DataLayout &DL,		const DataLayout &DL,
▲ Show 20 Lines • Show All 2,175 Lines • ▼ Show 20 Lines	void InnerLoopVectorizer::emitMemRuntimeChecks(Loop *L, BasicBlock *Bypass) {

// We currently don't use LoopVersioning for the actual loop cloning but we		// We currently don't use LoopVersioning for the actual loop cloning but we
// still use it to add the noalias metadata.		// still use it to add the noalias metadata.
LVer = std::make_unique<LoopVersioning>(*Legal->getLAI(), OrigLoop, LI, DT,		LVer = std::make_unique<LoopVersioning>(*Legal->getLAI(), OrigLoop, LI, DT,
PSE.getSE());		PSE.getSE());
LVer->prepareNoAliasMetadata();		LVer->prepareNoAliasMetadata();
}		}

		void InnerLoopVectorizer::emitNumElementsVecLoop(BasicBlock *Bypass,
		Value *Count) {
		if (EnableVPlanNativePath)
		return;

		if (!TTI->emitNumElementsVecLoop(OrigLoop, LI, *PSE.getSE(),
		Cost->foldTailByMasking()))
		return;

		IRBuilder<> Builder(Bypass->getTerminator());
		Function *NumElems = Intrinsic::getDeclaration(
		Bypass->getParent()->getParent(), Intrinsic::set_loop_elements,
		Count->getType());
		Builder.CreateCall(NumElems, Count);
		}

Value *InnerLoopVectorizer::emitTransformedIndex(		Value *InnerLoopVectorizer::emitTransformedIndex(
IRBuilder<> &B, Value *Index, ScalarEvolution *SE, const DataLayout &DL,		IRBuilder<> &B, Value *Index, ScalarEvolution *SE, const DataLayout &DL,
const InductionDescriptor &ID) const {		const InductionDescriptor &ID) const {

SCEVExpander Exp(*SE, DL, "induction");		SCEVExpander Exp(*SE, DL, "induction");
auto Step = ID.getStep();		auto Step = ID.getStep();
auto StartValue = ID.getStartValue();		auto StartValue = ID.getStartValue();
assert(Index->getType() == Step->getType() &&		assert(Index->getType() == Step->getType() &&
▲ Show 20 Lines • Show All 178 Lines • ▼ Show 20 Lines	BasicBlock *InnerLoopVectorizer::createVectorizedLoopSkeleton() {
// expressions.		// expressions.
emitSCEVChecks(Lp, LoopScalarPreHeader);		emitSCEVChecks(Lp, LoopScalarPreHeader);

// Generate the code that checks in runtime if arrays overlap. We put the		// Generate the code that checks in runtime if arrays overlap. We put the
// checks into a separate block to make the more common case of few elements		// checks into a separate block to make the more common case of few elements
// faster.		// faster.
emitMemRuntimeChecks(Lp, LoopScalarPreHeader);		emitMemRuntimeChecks(Lp, LoopScalarPreHeader);

		// Emit an intrinsic in the vector preheader that represents the number
		// of data elements processed by the vector loop, which corresponds to
		// the tripcount of the scalar loop. This queries TTI to check that this
		// intrinsic is supported by the target.
		emitNumElementsVecLoop(LoopVectorPreHeader, Count);

// Generate the induction variable.		// Generate the induction variable.
// The loop step is equal to the vectorization factor (num of SIMD elements)		// The loop step is equal to the vectorization factor (num of SIMD elements)
// times the unroll factor (num of SIMD instructions).		// times the unroll factor (num of SIMD instructions).
Value *CountRoundDown = getOrCreateVectorTripCount(Lp);		Value *CountRoundDown = getOrCreateVectorTripCount(Lp);
Constant *Step = ConstantInt::get(IdxTy, VF * UF);		Constant *Step = ConstantInt::get(IdxTy, VF * UF);
Induction =		Induction =
createInductionVariable(Lp, StartIdx, CountRoundDown, Step,		createInductionVariable(Lp, StartIdx, CountRoundDown, Step,
getDebugLocFromInstOrOperands(OldInduction));		getDebugLocFromInstOrOperands(OldInduction));
▲ Show 20 Lines • Show All 3,746 Lines • ▼ Show 20 Lines	VPValue *VPRecipeBuilder::createBlockInMask(BasicBlock *BB, VPlanPtr &Plan) {
// load/store/gather/scatter. Initialize BlockMask to no-mask.		// load/store/gather/scatter. Initialize BlockMask to no-mask.
VPValue *BlockMask = nullptr;		VPValue *BlockMask = nullptr;

if (OrigLoop->getHeader() == BB) {		if (OrigLoop->getHeader() == BB) {
if (!CM.blockNeedsPredication(BB))		if (!CM.blockNeedsPredication(BB))
return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.		return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.

// Introduce the early-exit compare IV <= BTC to form header block mask.		// Introduce the early-exit compare IV <= BTC to form header block mask.
// This is used instead of IV < TC because TC may wrap, unlike BTC.		// This is used instead of IV < TC because TC may wrap, unlike BTC.
		efriedma: It would be nice to explicitly note here that we're assuming the vector factor is a power of two (so the induction variable can't wrap).
// Start by constructing the desired canonical IV.		// Start by constructing the desired canonical IV.
VPValue *IV = nullptr;		VPValue *IV = nullptr;
if (Legal->getPrimaryInduction())		if (Legal->getPrimaryInduction())
IV = Plan->getVPValue(Legal->getPrimaryInduction());		IV = Plan->getVPValue(Legal->getPrimaryInduction());
else {		else {
auto IVRecipe = new VPWidenCanonicalIVRecipe();		auto IVRecipe = new VPWidenCanonicalIVRecipe();
Builder.getInsertBlock()->appendRecipe(IVRecipe);		Builder.getInsertBlock()->appendRecipe(IVRecipe);
IV = IVRecipe->getVPValue();		IV = IVRecipe->getVPValue();
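
The comment in `createBlockInMask` says the header mask uses `IV <= BTC` rather than `IV < TC` because the trip count (TC) may wrap while the backedge-taken count (BTC) does not. The worked example below illustrates this, assuming an 8-bit index type and a loop with 256 iterations. Both values are chosen purely for illustration, and `laneActiveBTC` and `laneActiveTC` are hypothetical helper names, not LLVM functions.

```cpp
#include <cassert>
#include <cstdint>

// Lane predicate in the form the vectorizer uses: IV <= BTC.
bool laneActiveBTC(uint8_t IV, uint8_t BTC) { return IV <= BTC; }

// Alternative form IV < TC, which breaks when TC wraps to zero.
bool laneActiveTC(uint8_t IV, uint8_t TC) { return IV < TC; }

// Backedge-taken count of a loop with 256 iterations: still fits in 8 bits.
constexpr uint8_t BTC = 255;

// Trip count is BTC + 1 = 256, which wraps to 0 in an 8-bit type.
constexpr uint8_t TC = static_cast<uint8_t>(BTC + 1);
```

With these values every lane index satisfies `IV <= BTC`, whereas `IV < TC` is false for all lanes because `TC` has wrapped to 0.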

llvm/test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll

	; RUN: opt < %s -loop-vectorize -S | FileCheck %s --check-prefixes=COMMON,DEFAULT			; RUN: opt < %s -loop-vectorize -S | FileCheck %s --check-prefixes=COMMON,DEFAULT
	; RUN: opt < %s -loop-vectorize -prefer-predicate-over-epilog -S | FileCheck %s --check-prefixes=COMMON,CHECK-TF,CHECK-PREFER			; RUN: opt < %s -loop-vectorize -prefer-predicate-over-epilog -S | FileCheck %s --check-prefixes=COMMON,CHECK-TF,CHECK-PREFER
	; RUN: opt < %s -loop-vectorize -disable-mve-tail-predication=false -S | FileCheck %s --check-prefixes=COMMON,CHECK-TF,CHECK-ENABLE-TP			; RUN: opt < %s -loop-vectorize -disable-mve-tail-predication=false -S | FileCheck %s --check-prefixes=COMMON,CHECK-TF,CHECK-ENABLE-TP

	target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"			target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
	target triple = "thumbv8.1m.main-arm-unknown-eabihf"			target triple = "thumbv8.1m.main-arm-unknown-eabihf"

	; This IR corresponds to this type of C-code:			; This IR corresponds to this type of C-code:
	;			;
	; void f(char *a, char *b, char *c, int N) {			; void f(char *a, char *b, char *c, int N) {
	; while (N-- > 0)			; while (N-- > 0)
	; *c++ = *a++ + *b++;			; *c++ = *a++ + *b++;
	; }			; }
	;			;
	define dso_local void @sgt_loopguard(i8* noalias nocapture readonly %a, i8* noalias nocapture readonly %b, i8* noalias nocapture %c, i32 %N) local_unnamed_addr #0 {			define dso_local void @sgt_loopguard(i8* noalias nocapture readonly %a, i8* noalias nocapture readonly %b, i8* noalias nocapture %c, i32 %N) local_unnamed_addr #0 {
	; COMMON-LABEL: @sgt_loopguard(			; COMMON-LABEL: @sgt_loopguard(
				; CHECK-TF: %[[TC:.*]] = sub i32 %0, %[[SMIN:.*]]
				; CHECK-TF: br i1 false, label %scalar.ph, label %vector.ph
				; CHECK-TF: vector.ph:
				; CHECK-TF: call void @llvm.set.loop.elements.i32(i32 %[[TC]])
	; COMMON: vector.body:			; COMMON: vector.body:
	; CHECK-TF: masked.load			; CHECK-TF: masked.load
	; CHECK-TF: masked.load			; CHECK-TF: masked.load
	; CHECK-TF: masked.store			; CHECK-TF: masked.store
	entry:			entry:
	%cmp5 = icmp sgt i32 %N, 0			%cmp5 = icmp sgt i32 %N, 0
	br i1 %cmp5, label %while.body.preheader, label %while.end			br i1 %cmp5, label %while.body.preheader, label %while.end


llvm/test/Transforms/LoopVectorize/ARM/tail-loop-folding.ll

; RUN: opt < %s -loop-vectorize -S | \		; RUN: opt < %s -loop-vectorize -S | \
; RUN: FileCheck %s -check-prefixes=COMMON,CHECK		; RUN: FileCheck %s -check-prefixes=COMMON,CHECK

; RUN: opt < %s -loop-vectorize -prefer-predicate-over-epilog -S | \		; RUN: opt < %s -loop-vectorize -prefer-predicate-over-epilog -S | \
; RUN: FileCheck -check-prefixes=COMMON,PREDFLAG %s		; RUN: FileCheck -check-prefixes=COMMON,PREDFLAG %s

target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"		target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "thumbv8.1m.main-arm-unknown-eabihf"		target triple = "thumbv8.1m.main-arm-unknown-eabihf"

define dso_local void @tail_folding(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) #0 {		define dso_local void @tail_folding(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) #0 {
; CHECK-LABEL: tail_folding(		; CHECK-LABEL: tail_folding(
; CHECK: vector.body:		; CHECK: vector.body:
;		; CHECK-NOT: call void @llvm.set.loop.elements
; This needs implementation of TTI::preferPredicateOverEpilogue,
; then this will be tail-folded too:
;
; CHECK-NOT: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; CHECK-NOT: call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
; CHECK-NOT: call void @llvm.masked.store.v4i32.p0v4i32(		; CHECK-NOT: call void @llvm.masked.store.v4i32.p0v4i32(
; CHECK: br i1 %{{.*}}, label %{{.*}}, label %vector.body		; CHECK: br i1 %{{.*}}, label %{{.*}}, label %vector.body
entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup:		for.cond.cleanup:
ret void		ret void
Show All 10 Lines	for.body:
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond = icmp eq i64 %indvars.iv.next, 430		%exitcond = icmp eq i64 %indvars.iv.next, 430
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body
}		}


define dso_local void @tail_folding_enabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {		define dso_local void @tail_folding_enabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {
; COMMON-LABEL: tail_folding_enabled(		; COMMON-LABEL: tail_folding_enabled(
		; COMMON: vector.ph:
		; COMMON: call void @llvm.set.loop.elements.i64(i64 430)
		; COMMON: br label %vector.body
; COMMON: vector.body:		; COMMON: vector.body:
; COMMON: %[[WML1:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; COMMON: %[[WML1:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
; COMMON: %[[WML2:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; COMMON: %[[WML2:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
; COMMON: %[[ADD:.*]] = add nsw <4 x i32> %[[WML2]], %[[WML1]]		; COMMON: %[[ADD:.*]] = add nsw <4 x i32> %[[WML2]], %[[WML1]]
; COMMON: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %[[ADD]]		; COMMON: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %[[ADD]]
; COMMON: br i1 %12, label %{{.*}}, label %vector.body		; COMMON: br i1 %12, label %{{.*}}, label %vector.body
entry:		entry:
br label %for.body		br label %for.body
Show All 20 Lines
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NOT: @llvm.masked.load.v8i32.p0v8i32(		; CHECK-NOT: @llvm.masked.load.v8i32.p0v8i32(
; CHECK-NOT: @llvm.masked.store.v8i32.p0v8i32(		; CHECK-NOT: @llvm.masked.store.v8i32.p0v8i32(
; CHECK: br i1 %{{.*}}, label {{.*}}, label %vector.body		; CHECK: br i1 %{{.*}}, label {{.*}}, label %vector.body

; PREDFLAG-LABEL: tail_folding_disabled(		; PREDFLAG-LABEL: tail_folding_disabled(
; PREDFLAG: vector.body:		; PREDFLAG: vector.body:
; PREDFLAG: %wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; PREDFLAG: %wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
; PREDFLAG: %wide.masked.load1 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; PREDFLAG: %wide.masked.load1 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(
		fhahn: We should also have a test where the unroll-factor/interleave-count > 1 with `llvm.get.active.lane.mask` (unless that's not possible at the moment)
; PREDFLAG: %{{.*}} = add nsw <4 x i32> %wide.masked.load1, %wide.masked.load		; PREDFLAG: %{{.*}} = add nsw <4 x i32> %wide.masked.load1, %wide.masked.load
; PREDFLAG: call void @llvm.masked.store.v4i32.p0v4i32(		; PREDFLAG: call void @llvm.masked.store.v4i32.p0v4i32(
; PREDFLAG: %index.next = add i64 %index, 4		; PREDFLAG: %index.next = add i64 %index, 4
; PREDFLAG: %12 = icmp eq i64 %index.next, 432		; PREDFLAG: %12 = icmp eq i64 %index.next, 432
; PREDFLAG: br i1 %{{.*}}, label %middle.block, label %vector.body, !llvm.loop !6		; PREDFLAG: br i1 %{{.*}}, label %middle.block, label %vector.body, !llvm.loop !6
entry:		entry:
br label %for.body		br label %for.body

Show All 33 Lines
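
The constants in these checks follow from the scalar trip count of 430 and a vectorization factor of 4 (the `<4 x i32>` accesses). With tail folding, the vector loop runs until the index reaches 430 rounded up to a multiple of 4, which is the 432 in the `icmp`, and the final iteration has only some lanes active. The arithmetic can be checked with the sketch below; `roundUpToVF` and `activeLanesInLastIteration` are hypothetical helper names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Round the scalar trip count up to a multiple of the vector width VF.
constexpr uint64_t roundUpToVF(uint64_t TripCount, uint64_t VF) {
  return (TripCount + VF - 1) / VF * VF;
}

// Number of lanes active in the final, predicated vector iteration.
constexpr uint64_t activeLanesInLastIteration(uint64_t TripCount,
                                              uint64_t VF) {
  return TripCount % VF == 0 ? VF : TripCount % VF;
}
```

For a trip count of 430 and VF = 4 this gives an exit bound of 432, 108 vector iterations, and 2 active lanes in the last iteration.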

Revision Contents

Diff 260940
