This is an archive of the discontinued LLVM Phabricator instance.


Differential D79100

[LV] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops
ClosedPublic

Authored by SjoerdMeijer on Apr 29 2020, 9:24 AM.

Details

Reviewers

Ayal
fhahn
samparker
dmgreen
gilr
rengolin
simoll
rkruppe
sdesmalen
rogfer01
efriedma
lebedev.ri

Commits

rG47650451738c: [LV] Emit @llvm.get.active.mask for tail-folded loops

Summary

Tail-predication is a new form of predication in MVE for vector loops that implicitly predicates the last vector loop iteration by implicitly setting active/inactive lanes, i.e. the tail loop is predicated. In order to set up a tail-predicated vector loop, we need to know the number of data elements processed by the vector loop, which corresponds to the trip count of the scalar loop. We would like to propagate the scalar trip count to the backend, so that it can be picked up by the MVE tail-predication pass.

This implements the approach as discussed on the llvm-dev list, see Eli's comment in http://lists.llvm.org/pipermail/llvm-dev/2020-May/141360.html. The approach is based on emitting an intrinsic for deriving the mask. The vectoriser emits this new intrinsic in the vector preheader block when the new TTI hook indicates that the target can lower this intrinsic and that it is desired to do so for this loop. For MVE, we do this when the loop is tail-folded, which is the very first step in tail-predicating a loop. For all other targets, this intrinsic won't be emitted, as the default of the hook is not to do this.

This change will be followed up by MVE-specific changes to lower this intrinsic.
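
The hook behaviour described in the summary can be pictured with a small stand-alone model. This is only a sketch: `TargetHooksModel`, `MVELikeHooksModel` and `shouldEmitLaneMask` are hypothetical names, the hook is reduced to a single `TailFolded` flag, and none of this is the actual TTI code from the patch.

```cpp
#include <cassert>

// Hypothetical stand-in for the target-independent default of the hook.
struct TargetHooksModel {
  virtual ~TargetHooksModel() = default;
  // By default a target does not ask for the intrinsic.
  virtual bool emitGetActiveLaneMask(bool /*TailFolded*/) const { return false; }
};

// Hypothetical MVE-like target: asks for the intrinsic for tail-folded loops.
struct MVELikeHooksModel : TargetHooksModel {
  bool emitGetActiveLaneMask(bool TailFolded) const override {
    return TailFolded;
  }
};

// Model of the vectoriser's query: emit the intrinsic only if the hook says so.
inline bool shouldEmitLaneMask(const TargetHooksModel &TTI, bool TailFolded) {
  bool Want = TTI.emitGetActiveLaneMask(TailFolded);
  // Invariant of this model: no target asks for the mask on an unfolded loop.
  assert((TailFolded || !Want) && "mask requested for a loop without tail folding");
  return Want;
}
```

In this model only the MVE-like target with a tail-folded loop gets the intrinsic; every other combination falls back to the default.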

Diff Detail

Event Timeline

SjoerdMeijer created this revision. Apr 29 2020, 9:24 AM

Herald added a reviewer: rengolin. Apr 29 2020, 9:24 AM

Herald added a project: Restricted Project.

Herald added subscribers: jdoerfert, rkruppe, hiraditya.

SjoerdMeijer mentioned this in D79175: [ARM][MVE] Tail-Predication: use @llvm.get.active.lane.mask to get the BTC. Apr 30 2020, 6:59 AM

samparker added reviewers: simoll, rkruppe, sdesmalen. Apr 30 2020, 7:19 AM

samparker added a reviewer: rogfer01. Apr 30 2020, 7:23 AM

This is used in the ARM backend, in our tail-predication pass, see D79175.

added a test case, minor fixes in comments.

This was discussed on the list and implements the approach as suggested by @efriedma :

http://lists.llvm.org/pipermail/llvm-dev/2020-May/141360.html

I.e., we now emit an llvm.get.active.mask intrinsic into the vector body, which is used by the masked loads/stores, and whose operands allow backend passes to reconstruct the required information. For example, a relevant snippet of the vector body of a tail-folded loop looks like this:

%induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
%[[PRED:.*]] = icmp ule <4 x i64> %induction, <i64 429, i64 429, i64 429, i64 429>
%[[MASK:.*]] = call <4 x i1> @llvm.get.active.mask.v4i1.v4i1(<4 x i1> %[[PRED]])
%[[WML1:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}<4 x i1> %[[MASK]]
%[[WML2:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}<4 x i1> %[[MASK]]

The induction and cmp feed the active.mask intrinsic, which is used by the masked loads/stores. This new intrinsic can now be easily picked up in backend passes, and the 429 in this example represents the backedge taken iteration count of the scalar loop, which we need to set up the tail-predication.
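
To make the snippet concrete, here is a lane-wise model of the induction/icmp pair for a vector width of 4. It is a sketch only: `tailFoldedMask` is a hypothetical helper, and the assumption that every vector iteration starts at a multiple of 4 is mine.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Models, for VF = 4:
//   %induction = splat(Base) + <0, 1, 2, 3>
//   %mask      = icmp ule %induction, splat(BTC)
inline std::array<bool, 4> tailFoldedMask(std::uint64_t Base, std::uint64_t BTC) {
  // Assumption: each vector iteration starts at a multiple of the vector width.
  assert(Base % 4 == 0 && "vector iterations start at multiples of VF");
  std::array<bool, 4> Mask{};
  for (std::uint64_t Lane = 0; Lane < 4; ++Lane)
    Mask[Lane] = (Base + Lane) <= BTC;
  return Mask;
}
```

With a backedge taken count of 429 (430 scalar iterations), the final vector iteration starts at 428, so lanes 428 and 429 are active and the two lanes beyond them are not.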

Herald added a subscriber: vkmr. May 18 2020, 3:57 AM

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

In D79100#2041646, @samparker wrote:

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

That was my initial thought too, so I started drafting an intrinsic that would take the induction step, the backedge taken count, and the comparison operator, thus replacing the icmp and feeding its result into the masked loads/stores.
This, however, turned out to be a massively invasive change, because the induction variable is widened (which creates the induction step and icmp) in different places from where the masking happens. This change is very minimal and makes exactly the same information explicit, and thus had my preference.

Hmm, but thinking about it, after my initial attempt to do this I got some more VPlan plumbing experience, and I now see that I could try again and that it is perhaps not very different. If it helps for acceptance, I can try this, but I think my previous points still stand that this change is minimal and leaves the IR intact.

Tail-predication is a new form of predication in MVE for vector loops that implicitly predicates the last vector loop iteration by implicitly setting active/inactive lanes, i.e. the tail loop is predicated. In order to set up a tail-predicated vector loop, we need to know the number of data elements processed by the vector loop, which corresponds to the trip count of the scalar loop. We would like to propagate the scalar trip count to the backend, so that it can be picked up by the MVE tail-predication pass.

Just to make sure I understand: is this text still current? Your intrinsic does not seem to propagate the scalar trip count anymore (at least not explicitly). If I understand it correctly, now you communicate the active mask of the current vector iteration which you compute using i <= backedge taken-count. Did I get it right?

llvm/lib/Transforms/Vectorize/VPlan.cpp
386	Minor nit: Here you use `TC` (as in trip count?) but above when you created this `VPInstruction` you used `BTC` (backedge taken count). This might be confusing for a future reader so I'd suggest to stick to your own convention and use `BTC` here as well. This code seems similar to that of `VPRecipeBuilder::createBlockInMask` (except that one creates the mask for a given block and this one creates a mask similar to the case `OrigLoop->getHeader() == BB`). That function uses `BTC` as well.

What are the semantics of llvm.get.active.mask? I don't see an actual description anywhere beyond "it enables tail folding, somehow".

llvm/include/llvm/IR/Intrinsics.td
1424	I assume you meant to write LLVMMatchType<0>

vkmr marked an inline comment as done. May 18 2020, 3:42 PM

vkmr added inline comments.

llvm/lib/Transforms/Vectorize/VPlan.cpp
389–392	Minor nit: Probably cleaner to use `Builder.CreateIntrinsic(Intrinsic::get_active_mask, { V->getType(), V->getType() }, {V}, nullptr, "active.mask");`.

vkmr added inline comments. May 18 2020, 4:18 PM

llvm/lib/Transforms/Vectorize/VPlan.cpp
426–445	Add `case VPInstruction::ActiveMask` to print the correct VPInstruction when printing VPlan. You can pass the flag `-debug-only=loop-vectorize` to `opt` to see the generated VPlan.

That was my initial thought too, so started drafting an intrinsic that would take the induction step, backedge taken count, the comparison operator

If the epilogue folding always uses ULE, then the operator wouldn't be needed. I think it would be really nice not to have the IV and upper limit splatted, plus we could remove the vector add on the IV too.

Thanks all for your comments! I think this addresses all of them:

- I renamed ActiveMask to ActiveLaneMask because I think that describes it better.
- The intrinsic now takes 2 arguments: the induction step and the backedge taken count. It indeed represents the IV <= BTC comparison, thus giving it semantics rather than letting it only pass on a value, which I have also clarified with comments in different places.
- The VP instruction is now lowered using Builder.CreateIntrinsic, and I have added a case to print it.

SjoerdMeijer marked an inline comment as done. May 19 2020, 7:19 AM

SjoerdMeijer added inline comments.

llvm/include/llvm/IR/Intrinsics.td
1429	Ah, I forgot to say that I am looking into LLVMMatchType<0>, which was giving me some problems. This is very minor; in the meantime I already wanted to show the important changes.

Semantics are still unspecified. Before adding even more intrinsics,
i'd strongly suggest to specify at least the already-committed ones.
Because as far as i can tell, i don't see anything in langref for any of them.

This revision now requires changes to proceed. May 19 2020, 7:42 AM

In D79100#2044165, @lebedev.ri wrote:

Semantics are still unspecified. Before adding even more intrinsics,
i'd strongly suggest to specify at least the already-committed ones.
Because as far as i can tell, i don't see anything in langref for any of them.

This was intentional. By the already-committed ones you mean the hardware loop ones, and they are not meant to be user-facing intrinsics. That is, we don't expect users to play around with e.g. the hwloop.decrement intrinsic; these are really meant to be generated by the optimisers.
This new intrinsic is slightly different, in that it is probably useful as a user-facing intrinsic, so I don't mind documenting it.

@SjoerdMeijer, i don't believe that is how langref works.

Ok, perhaps I got that wrong then. @samparker can correct me here perhaps, but as I said, for the hardware loop intrinsics I believe this was intentional. But anyway, as I said, will document this new one. As the hardware loop intrinsics are completely separate from this, I will do that separately.

In D79100#2044530, @SjoerdMeijer wrote:

Ok, perhaps I got that wrong then. @samparker can correct me here perhaps, but as I said, for the hardware loop intrinsics I believe this was intentional. But anyway, as I said, will document this new one. As the hardware loop intrinsics are completely separate from this, I will do that separately.

I think that would still be backwards.
langref is not documentation, it is specification.
To reword, langref is source of truth, not the code.

Understood, and we don't disagree, this will be fixed.

+ LangRef description

efriedma added inline comments. May 19 2020, 2:54 PM

llvm/docs/LangRef.rst
16198	Is this semantically equivalent to icmp ule? If it is, you should probably make that more clear, and explain that it's a hint to the backend. If not, this needs a much more thorough explanation.
llvm/include/llvm/IR/Intrinsics.td
1240	Is IntrNoDuplicate here actually semantically significant? The LangRef explanation doesn't really indicate why it needs to be noduplicate. Please use LLVMMatchType/LLVMScalarOrSameVectorWidth to ensure the argument/result types match.

Thanks @efriedma : added a statement about the equivalence to the langref description, and fixed the intrinsic description.

In D79100#2041782, @SjoerdMeijer wrote:

Hmm, but thinking about it, after my initial attempt to do this I got some more VPlan plumbing experience, and I now see that I could try again and that it is perhaps not very different. If it helps for acceptance, I can try this, but I think my previous points still stand that this change is minimal and leaves the IR intact.

The comparison really should be encapsulated in the intrinsic itself because for scalable types it is not clear how many bits the stepvector type needs to enumerate its lanes without overflow:

%lane_ids = <vscale x 1 x i8> llvm.stepvector() ; This will overflow if 'vscale > 256' at runtime (note that a 'stepvector' intrinsic does not even exist at this point)
%lane_mask = icmp ule %lane_ids, (splat %n)

That is not a problem if you have

%lane_mask = <vscale x 1 x i1> llvm.active.mask(i32 0, i32 %n)

We need such an intrinsic anyway for the VP expansion pass to legalize the EVL parameter of VP intrinsics for targets that have scalable types but no active vector length (tail loop predication).
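
The overflow concern can be illustrated with a small model that treats the runtime lane count as a plain integer. This is a sketch under assumptions: `maskViaNarrowStepVector` and `maskViaWideLaneIds` are hypothetical helpers, and the second one only stands in for what an intrinsic taking scalar operands could define.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Lane ids held in an 8-bit element type wrap once there are more than
// 256 lanes, so lanes 256, 257, ... compare as if they were 0, 1, ...
inline std::vector<bool> maskViaNarrowStepVector(unsigned NumLanes, std::uint8_t N) {
  assert(NumLanes > 0);
  std::vector<bool> Mask(NumLanes);
  for (unsigned Lane = 0; Lane < NumLanes; ++Lane) {
    auto LaneId = static_cast<std::uint8_t>(Lane); // wraps modulo 256
    Mask[Lane] = LaneId <= N;
  }
  return Mask;
}

// The same comparison with lane ids that are wide enough not to wrap.
inline std::vector<bool> maskViaWideLaneIds(unsigned NumLanes, std::uint32_t N) {
  assert(NumLanes > 0);
  std::vector<bool> Mask(NumLanes);
  for (unsigned Lane = 0; Lane < NumLanes; ++Lane)
    Mask[Lane] = Lane <= N;
  return Mask;
}
```

With 300 lanes and a limit of 9, the narrow version wrongly marks lane 256 as active because its lane id wraps to 0, while the wide version keeps it inactive.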

This patch, this new intrinsic, is a straightforward translation from an icmp. It provides the required information, and exactly the same as in your example, albeit in a slightly different form. Scalable vectors, stepvector intrinsics, etc., are not used yet by the vectoriser and are not yet defined, respectively, so using that at this point is problematic.

As this is such a straightforward translation, with straightforward semantics, this won't block in any way future developments. In fact, this is the first step in that direction, and it is good we get some experience with it. We offer our support to adapt this approach/intrinsic and port it, should this be necessary, which should be straightforward.

SjoerdMeijer mentioned this in D80316: [HardwareLoops] Intrinsic LangRef descriptions. May 20 2020, 11:20 AM

I think all Simon is asking for is that the first argument of the intrinsic should be a scalar, equal to the first lane of the vector induction variable. How much work would that be?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6812	It would be nice to explicitly note here that we're assuming the vector factor is a power of two (so the induction variable can't wrap).
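
One possible reading of this remark, not confirmed in the thread, is that vector iterations start at multiples of the vector factor, and a power-of-two factor divides the range of the induction type, so no lane index can run past the maximum value. The sketch below checks that for a hypothetical 8-bit induction type; `someLaneWraps` is an illustrative helper, not code from the patch.

```cpp
#include <cassert>
#include <cstdint>

// In an 8-bit induction type, does any vector iteration that starts at a
// multiple of VF contain a lane index that would wrap past 255?
inline bool someLaneWraps(unsigned VF) {
  assert(VF > 0 && VF <= 256);
  for (unsigned Base = 0; Base <= 255; Base += VF)
    if (Base + VF - 1 > 255)
      return true;
  return false;
}
```

Under this model a factor of 4 or 16 never wraps, whereas a factor of 3 does: the iteration starting at 255 would need lane indices 256 and 257.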

In D79100#2047545, @efriedma wrote:

I think all Simon is asking for is that the first argument of the intrinsic should be a scalar, equal to the first lane of the vector induction variable. How much work would that be?

Ah okay, sorry, I missed that. That should hopefully be straightforward, and if we have everyone on board with this, I will fix that right away tomorrow.

Uploading a new revision to see if we can reach a conclusion on this.

The intrinsic now takes two scalar arguments, the first element of the VIV, and the scalar BTC, hopefully in the way you had in mind.

On the dev list, Ayal pointed out an alternative: rediscovering the BTC in the backend (I still need to spend time to confirm that we then don't miss anything). This intrinsic would still have my preference because of its convenience, and as it is a general concept it seems to support different uses too (VP intrinsics, and a target-independent way of describing the active mask).

The intrinsic now takes two scalar arguments, the first element of the VIV, and the scalar BTC, hopefully in the way you had in mind.

Yes, this is what I expected.

On the dev list, Ayal pointed out an alternative: rediscovering the BTC in the backend (I still need to spend time to confirm that we then don't miss anything). This intrinsic would still have my preference because of its convenience, and as it is a general concept it seems to support different uses too (VP intrinsics, and a target-independent way of describing the active mask).

In general, given two expressions that are equivalent, there's always some way to pattern-match. For power-of-two fixed-width vectors, the pattern you'd need to match is not that complicated (basically just an icmp where one operand is a vector induction variable, and the other operand is a splat.) But like you say, the intrinsic is more convenient.

If you're dealing with scalable vectors, expanding the intrinsic to arithmetic would result in a very complex expression due to the potential for overflow, so we'll almost certainly want this intrinsic.

The changes to VPlan.cpp look a little suspicious, but I'm not really familiar with that code, so I'll let someone else review it.

llvm/docs/LangRef.rst
16202	Maybe qualify this with the caveat that it's only equivalent if the vector induction variable doesn't overflow. We probably want to return false in the lanes where it overflows.
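
The suggested caveat can be sketched as a model of the scalar form of the intrinsic, where lanes whose index would overflow the operand type are reported as inactive. This is an assumption-laden illustration: `activeLaneMaskModel` is a hypothetical name, the operands are fixed to 32 bits, and the "false on overflow" behaviour is the reviewer's suggestion rather than something the thread confirms as final semantics.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Model of a lane mask computed from two scalars: the first lane of the
// induction variable (Base) and the backedge taken count (BTC).
inline std::vector<bool> activeLaneMaskModel(std::uint32_t Base, std::uint32_t BTC,
                                             unsigned VF) {
  assert(VF > 0);
  std::vector<bool> Mask(VF, false);
  for (unsigned Lane = 0; Lane < VF; ++Lane) {
    // Compute the lane index in 64 bits so the addition itself cannot wrap.
    std::uint64_t Index = std::uint64_t{Base} + Lane;
    bool Overflows = Index > std::numeric_limits<std::uint32_t>::max();
    // Assumed behaviour: lanes that would overflow 32 bits are inactive.
    Mask[Lane] = !Overflows && Index <= BTC;
  }
  return Mask;
}
```

A plain 32-bit `icmp ule` on a wrapped induction vector would instead see the overflowing lanes as 0, 1, ... and mark them active, which is the difference the caveat is about.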

Thanks Eli for commenting and explaining.

I guess it was the TODO in VPlan.cpp that looked a bit suspicious. But there wasn't much going on here; it was very straightforward, so I fixed that. I.e., Ayal recently added support for decrementing loops in VPlan. The backedge taken count value looks slightly different for these cases, but extracting it is easy, and I have added that here. As a result, this now also triggers in test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll, which I updated.

simoll mentioned this in D78203: [VP,Integer,#2] ExpandVectorPredication pass. May 26 2020, 5:23 AM

In D79100#2054437, @SjoerdMeijer wrote:

Thanks Eli for commenting and explaining.

I guess it was the TODO in VPlan.cpp that looked a bit suspicious. But there wasn't much going on here; it was very straightforward, so I fixed that. I.e., Ayal recently added support for decrementing loops in VPlan. The backedge taken count value looks slightly different for these cases, but extracting it is easy, and I have added that here. As a result, this now also triggers in test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll, which I updated.

I think introducing a VPInstruction opcode for the new intrinsic makes sense and fits in the current scheme. But I think there's no need to bundle the langref, TTI and LV changes into a single patch. IMO would be good to split at least the LV part off, to focus on discussing the implementation details in LV there.

Many thanks for taking a look Florian!

I think introducing a VPInstruction opcode for the new intrinsic makes sense and fits in the current scheme. But I think there's no need to bundle the langref, TTI and LV changes into a single patch. IMO would be good to split at least the LV part off, to focus on discussing the implementation details in LV there

Ok, but just double checking that I get this right:

In the LV part, we do use and check TTI.emitGetActiveLaneMask, so I guess that means we have LV + TTI in one patch,
and separate we have LangRef.

That looks like a minimal change, but certainly don't mind doing this if that helps.

In D79100#2055802, @SjoerdMeijer wrote:

Many thanks for taking a look Florian!

I think introducing a VPInstruction opcode for the new intrinsic makes sense and fits in the current scheme. But I think there's no need to bundle the langref, TTI and LV changes into a single patch. IMO would be good to split at least the LV part off, to focus on discussing the implementation details in LV there

Ok, but just double checking that I get this right:

In the LV part, we do use and check TTI.emitGetActiveLaneMask, so I guess that means we have LV + TTI in one patch,

I'd just put the LV parts in a separate patch and the TTI stuff in a different patch (the former depending on the latter). The way I see it, those are changes to 2 separate areas of the code, with potentially different people to approve. Also, if there's a problem with the LV patch, it can be reverted in isolation.

SjoerdMeijer mentioned this in D80596: New intrinsic @llvm.get.active.lane.mask(). May 26 2020, 3:52 PM

Ok, thanks, sounds like a plan.

I will keep this as a reference, I think that's convenient, and will split off the different bits and pieces.
I have started with the easiest patch to create, the intrinsic description and documentation: D80596

SjoerdMeijer mentioned this in D80597: [TTI] New target hook emitGetActiveLaneMask. May 26 2020, 4:10 PM

SjoerdMeijer updated this revision to Diff 266369. May 26 2020, 4:21 PM

SjoerdMeijer retitled this revision from [LV][TTI] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops to [LV] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops.

SjoerdMeijer mentioned this in rG7fb8a40e5220: New intrinsic @llvm.get.active.lane.mask(). May 29 2020, 1:03 AM

SjoerdMeijer mentioned this in rG7480ccbfc9d2: [TTI] New target hook emitGetActiveLaneMask. May 29 2020, 1:36 AM

I have just committed the required scaffolding for this:

the new intrinsic in rG7fb8a40e5220
the new TTI hook in rG7480ccbfc9d2

So was wondering if we are happy with this LV part too?

Just FYI, I am addressing comments on our backend support for this intrinsic in D79175. Once that is ready, I will commit this and D79175 at the same time.

fhahn added inline comments. May 29 2020, 4:25 AM

llvm/lib/Transforms/Vectorize/VPlan.cpp
425	IIUC we want the first lanes of both the BTC and the IV, right? If that's the case, I think it would be more straight-forward to just request the specific lane when looking up the values, e.g. something like:

// Get first lane of vector induction variable.
Value *VIVE0 = State.get(getOperand(0), {Part, 0});
// Get first lane of backedge-taken-count.
Value *ScalarBTC = State.get(getOperand(1), {Part, 0});

auto Int1Ty = Type::getInt1Ty(Builder.getContext());
auto PredTy = VectorType::get(Int1Ty, State.VF);
Instruction *Call = Builder.CreateIntrinsic(
    Intrinsic::get_active_lane_mask, {PredTy, ScalarBTC->getType()},
    {VIVE0, ScalarBTC}, nullptr, "active.lane.mask");
State.set(this, Call, Part);

This should have the advantage of re-using some scalar values, if they have been requested already by other recipes. (Note: this currently crashes unfortunately, as getting lanes for defs managed by VPTransformState does not work, but I put up D80787 to fix that)
llvm/test/Transforms/LoopVectorize/ARM/tail-loop-folding.ll
80–83	We should also have a test where the unroll-factor/interleave-count > 1 with `llvm.get.active.lane.mask` (unless that's not possible at the moment)

fhahn added inline comments. May 29 2020, 4:48 AM

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1389	Just return `TailFolded`? side note: what are all the extra parameters needed/used somewhere?

SjoerdMeijer marked an inline comment as done. May 29 2020, 5:24 AM

SjoerdMeijer added inline comments.

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1389	Just a quick first comment on your side note "what are all the extra parameters needed/used somewhere?": that's just to provide an interface consistent with most other TTI functions here. Currently, we only look at `TailFolded`; the rest is indeed unused. But if (other) targets want to do a bit more analysis here, they will most definitely want to look at `L` and `LI`, and possibly `SE`. So, as I said, these are mostly unused at the moment and were added for consistency, which I accept may or may not be a good reason. I have no strong opinions, so I will refactor this TTI hook right away if you think that is better/necessary. I will address your other remarks soon. Thanks for that.

IIUC we want the first lanes of both the BTC and the IV, right?

Yep, exactly that.

If that's the case, I think it would be more straight-forward to just request the specific lane when looking up the values, e.g. something like:
<SNIP>
This should have the advantage of re-using some scalar values, if they have been requested already by other recipes.

(Note: this currently crashes unfortunately, as getting lanes for defs managed by VPTransformState does not work, but I put up D80787 to fix that)

Thanks for that. I took D80787, and that worked like a charm, greatly simplifying things here: I have simplified lowering of VPInstruction::ActiveLaneMask as per your suggestion.

SjoerdMeijer mentioned this in D80787: [VPlan] Support extracting lanes for defs managed in VPTransformState. Jun 2 2020, 8:59 AM

With D80787 committed, and this using that, does this look okay now?

LGTM with the comment, thanks! Please wait a day or so with committing in case there are additional comments. Also, I might have missed it, but it would be good to have a test case with unroll-factor/interleave-count > 1 (as mentioned in an inline comment).

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1389	I am not sure, but I think adding all those unused parameters because they might get used in the future seems a bit odd. It would be easy enough to add them if required.

Many thanks for looking at this!

This needs to be committed together with the backend patch that adds support for lowering this intrinsic. That is nearly done (hopefully), and am expecting a commit somewhere next week, so will definitely wait a few days.

I will add that test case (sorry, I overlooked and missed that), and will follow up to reduce the arguments in the TTI hook.

This revision was not accepted when it landed; it landed in state Needs Review. Jun 17 2020, 2:08 AM

Closed by commit rG47650451738c: [LV] Emit @llvm.get.active.mask for tail-folded loops (authored by SjoerdMeijer).

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer mentioned this in rG20835cff272e: [TTI] Refactor emitGetActiveLaneMask.

Herald added a subscriber: bmahjour. Jun 17 2020, 2:08 AM

Hi, seems that this patch is causing some failures:
Failed Tests (2):

LLVM :: Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll
LLVM :: Transforms/LoopVectorize/ARM/tail-loop-folding.ll

I replied by mail, but just for completeness, I had noticed and reverted it, and now recommitted it.

SjoerdMeijer mentioned this in rGd1522513d4c4: [ARM] Reimplement MVE Tail-Predication pass using @llvm.get.active.lane.mask. Jun 17 2020, 7:32 AM

Revision Contents

Path                                                           Size
llvm/docs/LangRef.rst                                          53 lines
llvm/include/llvm/Analysis/TargetTransformInfo.h               11 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h           5 lines
llvm/include/llvm/CodeGen/BasicTTIImpl.h                       5 lines
llvm/include/llvm/IR/Intrinsics.td                             4 lines
llvm/lib/Analysis/TargetTransformInfo.cpp                      5 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.h                   3 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp                 11 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp                6 lines
llvm/lib/Transforms/Vectorize/VPlan.h                          1 line
llvm/lib/Transforms/Vectorize/VPlan.cpp                        19 lines
llvm/test/Transforms/LoopVectorize/ARM/tail-loop-folding.ll    22 lines

Diff 265173

llvm/docs/LangRef.rst

[...]
.. code-block:: llvm

%r = call <4 x i32> @llvm.vp.xor.v4i32(<4 x i32> %a, <4 x i32> %b, <4 x i1> %mask, i32 %evl)
;; For all lanes below %evl, %r is lane-wise equivalent to %also.r

%t = xor <4 x i32> %a, %b
%also.r = select <4 x i1> %mask, <4 x i32> %t, <4 x i32> undef


		.. _int_get_active_lane_mask:

		'``llvm.get.active.lane.mask.*``' Intrinsics
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

		Syntax:
		"""""""
		This is an overloaded intrinsic.

		::

		declare <4 x i1> @llvm.get.active.lane.mask.v4i1.v4i32.v4i64(<4 x i32> <IV>, <4 x i32> <BTC>)
		declare <8 x i1> @llvm.get.active.lane.mask.v8i1.v8i64.v8i64(<8 x i64> <IV>, <8 x i64> <BTC>)
		declare <16 x i1> @llvm.get.active.lane.mask.v16i1.v16i64.v16i64(<16 x i64> <IV>, <16 x i64> <BTC>)


		Overview:
		"""""""""

		Create a mask representing active and inactive vector lanes.


		Arguments:
		""""""""""

		Both operands have the same vector of integer type. The first operand is the
		vector induction step. The second operand is the loop back-edge taken iteration
		count splat into a vector. The result is a vector with the same number of
		elements as the operands, but with the i1 element value type.


		Semantics:
		""""""""""

		The '``llvm.get.active.lane.mask``' intrinsic performs an element-wise less
		than or equal comparison of the induction step value with the back-edge taken
		iteration count, producing a mask of true/false values representing
		active/inactive vector lanes. This mask can e.g. be used in the masked
		load/store instructions. It is semantically equivalent to `icmp ule` and
		provides a hint to the backend. I.e., for a vector loop, the back-edge taken
		iteration count of the original scalar loop is explicit as the second argument.


		Examples:
		"""""""""

		.. code-block:: llvm

		%induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
		%get.active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.v4i64.v4i64(<4 x i64> %induction, <4 x i64> <i64 429, i64 429, i64 429, i64 429>)
		%wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* %3, i32 4, <4 x i1> %get.active.lane.mask, <4 x i32> undef)


.. _int_mload_mstore:		.. _int_mload_mstore:

Masked Vector Load and Store Intrinsics		Masked Vector Load and Store Intrinsics
---------------------------------------		---------------------------------------

LLVM provides intrinsics for predicated vector load and store operations. The predicate is specified by a mask operand, which holds one bit per vector element, switching the associated vector lane on or off. The memory addresses corresponding to the "off" lanes are not accessed. When all bits of the mask are on, the intrinsic is identical to a regular vector load or store. When all bits are off, no memory is accessed.		LLVM provides intrinsics for predicated vector load and store operations. The predicate is specified by a mask operand, which holds one bit per vector element, switching the associated vector lane on or off. The memory addresses corresponding to the "off" lanes are not accessed. When all bits of the mask are on, the intrinsic is identical to a regular vector load or store. When all bits are off, no memory is accessed.

.. _int_mload:		.. _int_mload:

llvm/include/llvm/Analysis/TargetTransformInfo.h

Show First 20 Lines • Show All 472 Lines • ▼ Show 20 Lines	public:

/// Query the target whether it would be prefered to create a predicated		/// Query the target whether it would be prefered to create a predicated
/// vector loop, which can avoid the need to emit a scalar epilogue loop.		/// vector loop, which can avoid the need to emit a scalar epilogue loop.
bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI) const;		const LoopAccessInfo *LAI) const;

		/// Query the target whether lowering of the llvm.get.active.lane.mask
		/// intrinsic is supported and if emitting it is desired for this loop.
		bool emitGetActiveLaneMask(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
		bool TailFolded) const;

/// @}		/// @}

/// \name Scalar Target Information		/// \name Scalar Target Information
/// @{		/// @{

/// Flags indicating the kind of support for population count.		/// Flags indicating the kind of support for population count.
///		///
/// Compared to the SW implementation, HW support is supposed to		/// Compared to the SW implementation, HW support is supposed to
▲ Show 20 Lines • Show All 734 Lines • ▼ Show 20 Lines	public:
virtual bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,		virtual bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
AssumptionCache &AC,		AssumptionCache &AC,
TargetLibraryInfo *LibInfo,		TargetLibraryInfo *LibInfo,
HardwareLoopInfo &HWLoopInfo) = 0;		HardwareLoopInfo &HWLoopInfo) = 0;
virtual bool		virtual bool
preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,		preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT, const LoopAccessInfo *LAI) = 0;		DominatorTree *DT, const LoopAccessInfo *LAI) = 0;
		virtual bool emitGetActiveLaneMask(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
		bool TailFolded) = 0;
virtual bool isLegalAddImmediate(int64_t Imm) = 0;		virtual bool isLegalAddImmediate(int64_t Imm) = 0;
virtual bool isLegalICmpImmediate(int64_t Imm) = 0;		virtual bool isLegalICmpImmediate(int64_t Imm) = 0;
virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,		virtual bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV,
int64_t BaseOffset, bool HasBaseReg,		int64_t BaseOffset, bool HasBaseReg,
int64_t Scale, unsigned AddrSpace,		int64_t Scale, unsigned AddrSpace,
Instruction *I) = 0;		Instruction *I) = 0;
virtual bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,		virtual bool isLSRCostLess(TargetTransformInfo::LSRCost &C1,
TargetTransformInfo::LSRCost &C2) = 0;		TargetTransformInfo::LSRCost &C2) = 0;
▲ Show 20 Lines • Show All 286 Lines • ▼ Show 20 Lines	bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
return Impl.isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);		return Impl.isHardwareLoopProfitable(L, SE, AC, LibInfo, HWLoopInfo);
}		}
bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI) override {		const LoopAccessInfo *LAI) override {
return Impl.preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);		return Impl.preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);
}		}
		bool emitGetActiveLaneMask(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
		bool TailFolded) override {
		return Impl.emitGetActiveLaneMask(L, LI, SE, TailFolded);
		}
bool isLegalAddImmediate(int64_t Imm) override {		bool isLegalAddImmediate(int64_t Imm) override {
return Impl.isLegalAddImmediate(Imm);		return Impl.isLegalAddImmediate(Imm);
}		}
bool isLegalICmpImmediate(int64_t Imm) override {		bool isLegalICmpImmediate(int64_t Imm) override {
return Impl.isLegalICmpImmediate(Imm);		return Impl.isLegalICmpImmediate(Imm);
}		}
bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,		bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,
bool HasBaseReg, int64_t Scale, unsigned AddrSpace,		bool HasBaseReg, int64_t Scale, unsigned AddrSpace,

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

Show First 20 Lines • Show All 138 Lines • ▼ Show 20 Lines	public:

bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI) const {		const LoopAccessInfo *LAI) const {
return false;		return false;
}		}

		bool emitGetActiveLaneMask(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
		bool TailFold) const {
		return false;
		}

void getUnrollingPreferences(Loop *, ScalarEvolution &,		void getUnrollingPreferences(Loop *, ScalarEvolution &,
TTI::UnrollingPreferences &) {}		TTI::UnrollingPreferences &) {}

bool isLegalAddImmediate(int64_t Imm) { return false; }		bool isLegalAddImmediate(int64_t Imm) { return false; }

bool isLegalICmpImmediate(int64_t Imm) { return false; }		bool isLegalICmpImmediate(int64_t Imm) { return false; }

bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,		bool isLegalAddressingMode(Type *Ty, GlobalValue *BaseGV, int64_t BaseOffset,

llvm/include/llvm/CodeGen/BasicTTIImpl.h

Show First 20 Lines • Show All 492 Lines • ▼ Show 20 Lines	public:

bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,		bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
AssumptionCache &AC, TargetLibraryInfo *TLI,		AssumptionCache &AC, TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI) {		const LoopAccessInfo *LAI) {
return BaseT::preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);		return BaseT::preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);
}		}

		bool emitGetActiveLaneMask(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
		bool TailFold) {
		return BaseT::emitGetActiveLaneMask(L, LI, SE, TailFold);
		}

int getInstructionLatency(const Instruction *I) {		int getInstructionLatency(const Instruction *I) {
if (isa<LoadInst>(I))		if (isa<LoadInst>(I))
return getST()->getSchedModel().DefaultLoadLatency;		return getST()->getSchedModel().DefaultLoadLatency;

return BaseT::getInstructionLatency(I);		return BaseT::getInstructionLatency(I);
}		}

virtual Optional<unsigned>		virtual Optional<unsigned>

llvm/include/llvm/IR/Intrinsics.td

Show First 20 Lines • Show All 1,229 Lines • ▼ Show 20 Lines	let IntrProperties = [IntrNoMem, IntrNoSync, IntrWillReturn] in {
def int_vp_xor : Intrinsic<[ llvm_anyvector_ty ],		def int_vp_xor : Intrinsic<[ llvm_anyvector_ty ],
[ LLVMMatchType<0>,		[ LLVMMatchType<0>,
LLVMMatchType<0>,		LLVMMatchType<0>,
LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>,		LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>,
llvm_i32_ty]>;		llvm_i32_ty]>;

}		}

		def int_get_active_lane_mask:
		Intrinsic<[LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
		[llvm_anyvector_ty, LLVMMatchType<0>],
		efriedma (inline comment, Not Done): Is IntrNoDuplicate here actually semantically significant? The LangRef explanation doesn't really indicate why it needs to be noduplicate. Please use LLVMMatchType/LLVMScalarOrSameVectorWidth to ensure the argument/result types match.
		[IntrNoMem, IntrNoSync, IntrWillReturn]>;

//===-------------------------- Masked Intrinsics -------------------------===//		//===-------------------------- Masked Intrinsics -------------------------===//
//		//
def int_masked_store : Intrinsic<[], [llvm_anyvector_ty,		def int_masked_store : Intrinsic<[], [llvm_anyvector_ty,
LLVMAnyPointerType<LLVMMatchType<0>>,		LLVMAnyPointerType<LLVMMatchType<0>>,
llvm_i32_ty,		llvm_i32_ty,
LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],		LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>],
[IntrArgMemOnly, IntrWillReturn, ImmArg<2>]>;		[IntrArgMemOnly, IntrWillReturn, ImmArg<2>]>;
▲ Show 20 Lines • Show All 166 Lines • ▼ Show 20 Lines
// Specify that the value given is the number of iterations that the next loop		// Specify that the value given is the number of iterations that the next loop
// will execute.		// will execute.
def int_set_loop_iterations :		def int_set_loop_iterations :
Intrinsic<[], [llvm_anyint_ty], [IntrNoDuplicate]>;		Intrinsic<[], [llvm_anyint_ty], [IntrNoDuplicate]>;

// Specify that the value given is the number of iterations that the next loop		// Specify that the value given is the number of iterations that the next loop
// will execute. Also test that the given count is not zero, allowing it to		// will execute. Also test that the given count is not zero, allowing it to
// control entry to a 'while' loop.		// control entry to a 'while' loop.
def int_test_set_loop_iterations :		def int_test_set_loop_iterations :
		efriedma (inline comment, Not Done): I assume you meant to write LLVMMatchType<0>
Intrinsic<[llvm_i1_ty], [llvm_anyint_ty], [IntrNoDuplicate]>;		Intrinsic<[llvm_i1_ty], [llvm_anyint_ty], [IntrNoDuplicate]>;

// Decrement loop counter by the given argument. Return false if the loop		// Decrement loop counter by the given argument. Return false if the loop
// should exit.		// should exit.
def int_loop_decrement :		def int_loop_decrement :
		SjoerdMeijer (author, inline comment, Done): Ah, forgot to say that I am looking into LLVMMatchType<0>; this was giving me some problems. This is very minor and I am looking into it, but in the meantime I already wanted to show the important changes.
Intrinsic<[llvm_i1_ty], [llvm_anyint_ty], [IntrNoDuplicate]>;		Intrinsic<[llvm_i1_ty], [llvm_anyint_ty], [IntrNoDuplicate]>;

// Decrement the first operand (the loop counter) by the second operand (the		// Decrement the first operand (the loop counter) by the second operand (the
// maximum number of elements processed in an iteration). Return the remaining		// maximum number of elements processed in an iteration). Return the remaining
// number of iterations still to be executed. This is effectively a sub which		// number of iterations still to be executed. This is effectively a sub which
// can be used with a phi, icmp and br to control the number of iterations		// can be used with a phi, icmp and br to control the number of iterations
// executed, as usual. Any optimisations are allowed to treat it is a sub, and		// executed, as usual. Any optimisations are allowed to treat it is a sub, and
// it's scevable, so it's the backends responsibility to handle cases where it		// it's scevable, so it's the backends responsibility to handle cases where it

llvm/lib/Analysis/TargetTransformInfo.cpp

	Show First 20 Lines • Show All 229 Lines • ▼ Show 20 Lines

	bool TargetTransformInfo::preferPredicateOverEpilogue(			bool TargetTransformInfo::preferPredicateOverEpilogue(
	Loop *L, LoopInfo *LI, ScalarEvolution &SE, AssumptionCache &AC,			Loop *L, LoopInfo *LI, ScalarEvolution &SE, AssumptionCache &AC,
	TargetLibraryInfo *TLI, DominatorTree *DT,			TargetLibraryInfo *TLI, DominatorTree *DT,
	const LoopAccessInfo *LAI) const {			const LoopAccessInfo *LAI) const {
	return TTIImpl->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);			return TTIImpl->preferPredicateOverEpilogue(L, LI, SE, AC, TLI, DT, LAI);
	}			}

				bool TargetTransformInfo::emitGetActiveLaneMask(Loop *L, LoopInfo *LI,
				ScalarEvolution &SE, bool TailFolded) const {
				return TTIImpl->emitGetActiveLaneMask(L, LI, SE, TailFolded);
				}

	void TargetTransformInfo::getUnrollingPreferences(			void TargetTransformInfo::getUnrollingPreferences(
	Loop *L, ScalarEvolution &SE, UnrollingPreferences &UP) const {			Loop *L, ScalarEvolution &SE, UnrollingPreferences &UP) const {
	return TTIImpl->getUnrollingPreferences(L, SE, UP);			return TTIImpl->getUnrollingPreferences(L, SE, UP);
	}			}

	bool TargetTransformInfo::isLegalAddImmediate(int64_t Imm) const {			bool TargetTransformInfo::isLegalAddImmediate(int64_t Imm) const {
	return TTIImpl->isLegalAddImmediate(Imm);			return TTIImpl->isLegalAddImmediate(Imm);
	}			}

llvm/lib/Target/ARM/ARMTargetTransformInfo.h

Show First 20 Lines • Show All 244 Lines • ▼ Show 20 Lines	bool preferPredicateOverEpilogue(Loop *L, LoopInfo *LI,
ScalarEvolution &SE,		ScalarEvolution &SE,
AssumptionCache &AC,		AssumptionCache &AC,
TargetLibraryInfo *TLI,		TargetLibraryInfo *TLI,
DominatorTree *DT,		DominatorTree *DT,
const LoopAccessInfo *LAI);		const LoopAccessInfo *LAI);
void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP);		TTI::UnrollingPreferences &UP);

		bool emitGetActiveLaneMask(Loop *L, LoopInfo *LI, ScalarEvolution &SE,
		bool TailFolded) const;

bool shouldBuildLookupTablesForConstant(Constant *C) const {		bool shouldBuildLookupTablesForConstant(Constant *C) const {
// In the ROPI and RWPI relocation models we can't have pointers to global		// In the ROPI and RWPI relocation models we can't have pointers to global
// variables or functions in constant data, so don't convert switches to		// variables or functions in constant data, so don't convert switches to
// lookup tables if any of the values would need relocation.		// lookup tables if any of the values would need relocation.
if (ST->isROPI() \|\| ST->isRWPI())		if (ST->isROPI() \|\| ST->isRWPI())
return !C->needsRelocation();		return !C->needsRelocation();

return true;		return true;
}		}
/// @}		/// @}
};		};

} // end namespace llvm		} // end namespace llvm

#endif // LLVM_LIB_TARGET_ARM_ARMTARGETTRANSFORMINFO_H		#endif // LLVM_LIB_TARGET_ARM_ARMTARGETTRANSFORMINFO_H

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp

Show First 20 Lines • Show All 1,379 Lines • ▼ Show 20 Lines	if (!HWLoopInfo.isHardwareLoopCandidate(SE, LI, DT)) {
LLVM_DEBUG(dbgs() << "preferPredicateOverEpilogue: hardware-loop is not "		LLVM_DEBUG(dbgs() << "preferPredicateOverEpilogue: hardware-loop is not "
"a candidate.\n");		"a candidate.\n");
return false;		return false;
}		}

return canTailPredicateLoop(L, LI, SE, DL, LAI);		return canTailPredicateLoop(L, LI, SE, DL, LAI);
}		}

		bool ARMTTIImpl::emitGetActiveLaneMask(Loop *L, LoopInfo *LI,
		ScalarEvolution &SE, bool TailFolded) const {
		fhahn (inline comment, Not Done): Just return `TailFolded`? Side note: are all the extra parameters needed/used somewhere?
		SjoerdMeijer (author, inline comment, Done): Just a quick first comment on your side note ("are all the extra parameters needed/used somewhere?"): that's just to provide an interface consistent with most other TTI functions here. Currently, we only look at `TailFolded`; the rest is indeed unused. But if (other) targets want to do a bit more analysis here, they most definitely want to look at `L` and `LI`, and possibly `SE`. So, as I said, mostly unused at the moment, and added for consistency, which I accept may or may not be a good reason. I have no strong opinions, so will refactor this TTI hook right away if you think that is better/necessary. I will address your other remarks soon. Thanks for that.
		fhahn (inline comment, Not Done): I am not sure, but I think adding all those unused parameters because they might get used in the future seems a bit odd. It would be easy enough to add them if required.
		// If this loop is tail-folded, we always want to emit the
		// llvm.get.active.lane.mask intrinsic, so that this can be picked up in the
		// MVETailPredication pass that needs to know the number of elements
		// processed by this vector loop.
		if (TailFolded)
		return true;
		return false;
		}
void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,		void ARMTTIImpl::getUnrollingPreferences(Loop *L, ScalarEvolution &SE,
TTI::UnrollingPreferences &UP) {		TTI::UnrollingPreferences &UP) {
// Only currently enable these preferences for M-Class cores.		// Only currently enable these preferences for M-Class cores.
if (!ST->isMClass())		if (!ST->isMClass())
return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP);		return BasicTTIImplBase::getUnrollingPreferences(L, SE, UP);

// Disable loop unrolling for Oz and Os.		// Disable loop unrolling for Oz and Os.
UP.OptSizeThreshold = 0;		UP.OptSizeThreshold = 0;

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 6,803 Lines • ▼ Show 20 Lines	VPValue *VPRecipeBuilder::createBlockInMask(BasicBlock *BB, VPlanPtr &Plan) {
// load/store/gather/scatter. Initialize BlockMask to no-mask.		// load/store/gather/scatter. Initialize BlockMask to no-mask.
VPValue *BlockMask = nullptr;		VPValue *BlockMask = nullptr;

if (OrigLoop->getHeader() == BB) {		if (OrigLoop->getHeader() == BB) {
if (!CM.blockNeedsPredication(BB))		if (!CM.blockNeedsPredication(BB))
return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.		return BlockMaskCache[BB] = BlockMask; // Loop incoming mask is all-one.

// Introduce the early-exit compare IV <= BTC to form header block mask.		// Introduce the early-exit compare IV <= BTC to form header block mask.
// This is used instead of IV < TC because TC may wrap, unlike BTC.		// This is used instead of IV < TC because TC may wrap, unlike BTC.
		efriedma (inline comment, Not Done): It would be nice to explicitly note here that we're assuming the vector factor is a power of two (so the induction variable can't wrap).
// Start by constructing the desired canonical IV.		// Start by constructing the desired canonical IV.
VPValue *IV = nullptr;		VPValue *IV = nullptr;
if (Legal->getPrimaryInduction())		if (Legal->getPrimaryInduction())
IV = Plan->getVPValue(Legal->getPrimaryInduction());		IV = Plan->getVPValue(Legal->getPrimaryInduction());
else {		else {
auto IVRecipe = new VPWidenCanonicalIVRecipe();		auto IVRecipe = new VPWidenCanonicalIVRecipe();
Builder.getInsertBlock()->appendRecipe(IVRecipe);		Builder.getInsertBlock()->appendRecipe(IVRecipe);
IV = IVRecipe->getVPValue();		IV = IVRecipe->getVPValue();
}		}
VPValue *BTC = Plan->getOrCreateBackedgeTakenCount();		VPValue *BTC = Plan->getOrCreateBackedgeTakenCount();
		if (CM.TTI.emitGetActiveLaneMask(CM.TheLoop, CM.LI, *CM.PSE.getSE(),
		!CM.isScalarEpilogueAllowed()))
		BlockMask = Builder.createNaryOp(VPInstruction::ActiveLaneMask, {IV, BTC});
		else
BlockMask = Builder.createNaryOp(VPInstruction::ICmpULE, {IV, BTC});		BlockMask = Builder.createNaryOp(VPInstruction::ICmpULE, {IV, BTC});
return BlockMaskCache[BB] = BlockMask;		return BlockMaskCache[BB] = BlockMask;
}		}

// This is the block mask. We OR all incoming edges.		// This is the block mask. We OR all incoming edges.
for (auto *Predecessor : predecessors(BB)) {		for (auto *Predecessor : predecessors(BB)) {
VPValue *EdgeMask = createEdgeMask(Predecessor, BB, Plan);		VPValue *EdgeMask = createEdgeMask(Predecessor, BB, Plan);
if (!EdgeMask) // Mask of predecessor is all-one so mask of block is too.		if (!EdgeMask) // Mask of predecessor is all-one so mask of block is too.
return BlockMaskCache[BB] = EdgeMask;		return BlockMaskCache[BB] = EdgeMask;
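
The way the new hook steers the choice of header-block mask can be sketched in isolation. The snippet below is a simplified stand-in, not the patch's code: `DefaultTTISketch`, `ARMTTISketch`, `MaskOpcode` and `chooseHeaderMaskOpcode` are hypothetical names, the hook is reduced to its `TailFolded` argument, and virtual dispatch is used for brevity where the real TTI classes are layered differently.

```cpp
#include <cassert>

// Hypothetical stand-in for the default TTI implementation: by default a
// target does not request the intrinsic.
struct DefaultTTISketch {
  virtual ~DefaultTTISketch() = default;
  virtual bool emitGetActiveLaneMask(bool /*TailFolded*/) const {
    return false;
  }
};

// Hypothetical stand-in for the ARM override: request the intrinsic whenever
// the loop is tail-folded.
struct ARMTTISketch : DefaultTTISketch {
  bool emitGetActiveLaneMask(bool TailFolded) const override {
    return TailFolded;
  }
};

// The two ways of forming the header mask from IV and BTC.
enum class MaskOpcode { ICmpULE, ActiveLaneMask };

// Mirrors the decision in createBlockInMask: the loop is treated as
// tail-folded when a scalar epilogue is not allowed.
MaskOpcode chooseHeaderMaskOpcode(const DefaultTTISketch &TTI,
                                  bool ScalarEpilogueAllowed) {
  return TTI.emitGetActiveLaneMask(!ScalarEpilogueAllowed)
             ? MaskOpcode::ActiveLaneMask
             : MaskOpcode::ICmpULE;
}
```

Under these assumptions, only the ARM-like target with a disallowed scalar epilogue selects `MaskOpcode::ActiveLaneMask`; every other combination falls back to the existing `icmp ule` mask.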

llvm/lib/Transforms/Vectorize/VPlan.h

	Show First 20 Lines • Show All 669 Lines • ▼ Show 20 Lines

	public:			public:
	/// VPlan opcodes, extending LLVM IR with idiomatics instructions.			/// VPlan opcodes, extending LLVM IR with idiomatics instructions.
	enum {			enum {
	Not = Instruction::OtherOpsEnd + 1,			Not = Instruction::OtherOpsEnd + 1,
	ICmpULE,			ICmpULE,
	SLPLoad,			SLPLoad,
	SLPStore,			SLPStore,
				ActiveLaneMask,
	};			};

	private:			private:
	typedef unsigned char OpcodeTy;			typedef unsigned char OpcodeTy;
	OpcodeTy Opcode;			OpcodeTy Opcode;

	/// Utility method serving execute(): generates a single instance of the			/// Utility method serving execute(): generates a single instance of the
	/// modeled instruction.			/// modeled instruction.

llvm/lib/Transforms/Vectorize/VPlan.cpp

Show First 20 Lines • Show All 374 Lines • ▼ Show 20 Lines	void VPInstruction::generateInstruction(VPTransformState &State,
case Instruction::Select: {		case Instruction::Select: {
Value *Cond = State.get(getOperand(0), Part);		Value *Cond = State.get(getOperand(0), Part);
Value *Op1 = State.get(getOperand(1), Part);		Value *Op1 = State.get(getOperand(1), Part);
Value *Op2 = State.get(getOperand(2), Part);		Value *Op2 = State.get(getOperand(2), Part);
Value *V = Builder.CreateSelect(Cond, Op1, Op2);		Value *V = Builder.CreateSelect(Cond, Op1, Op2);
State.set(this, V, Part);		State.set(this, V, Part);
break;		break;
}		}
		case VPInstruction::ActiveLaneMask: {
		// The induction step.
		Value *IV = State.get(getOperand(0), Part);
		// The Back-edge Taken Count.
		rogfer01 (inline comment, Not Done): Minor nit: Here you use `TC` (as in trip count?) but above when you created this `VPInstruction` you used `BTC` (backedge taken count). This might be confusing for a future reader so I'd suggest to stick to your own convention and use `BTC` here as well. This code seems similar to that of `VPRecipeBuilder::createBlockInMask` (except that one creates the mask for a given block and this one creates a mask similar to the case `OrigLoop->getHeader() == BB`). That function uses `BTC` as well.
		Value *BTC = State.get(getOperand(1), Part);

		// Create the intrinsic call, which has equivalent semantics to a
		// (icmp ule %IV, %BTC) comparison, generating the mask of
		// active/inactive lanes.
		Instruction *Call = Builder.CreateIntrinsic(
		vkmr (inline comment, Not Done): Minor nit: Probably cleaner to use `Builder.CreateIntrinsic(Intrinsic::get_active_mask, { V->getType(), V->getType() }, {V}, nullptr, "active.mask");`.
		Intrinsic::get_active_lane_mask, {BTC->getType()}, {IV, BTC}, nullptr,
		"active.lane.mask");
		State.set(this, Call, Part);
		break;
		}
default:		default:
llvm_unreachable("Unsupported opcode for instruction");		llvm_unreachable("Unsupported opcode for instruction");
}		}
}		}

void VPInstruction::execute(VPTransformState &State) {		void VPInstruction::execute(VPTransformState &State) {
assert(!State.Instance && "VPInstruction executing an Instance");		assert(!State.Instance && "VPInstruction executing an Instance");
for (unsigned Part = 0; Part < State.UF; ++Part)		for (unsigned Part = 0; Part < State.UF; ++Part)
Show All 11 Lines	void VPInstruction::print(raw_ostream &O) const {
print(O, SlotTracker);		print(O, SlotTracker);
}		}

void VPInstruction::print(raw_ostream &O, VPSlotTracker &SlotTracker) const {		void VPInstruction::print(raw_ostream &O, VPSlotTracker &SlotTracker) const {
if (hasResult()) {		if (hasResult()) {
printAsOperand(O, SlotTracker);		printAsOperand(O, SlotTracker);
O << " = ";		O << " = ";
}		}

		fhahn (inline comment, Not Done): IIUC we want the first lanes of both the BTC and the IV, right? If that's the case, I think it would be more straightforward to just request the specific lane when looking up the values, e.g. something like:
		    // Get first lane of vector induction variable.
		    Value *VIVE0 = State.get(getOperand(0), {Part, 0});
		    // Get first lane of backedge-taken-count.
		    Value *ScalarBTC = State.get(getOperand(1), {Part, 0});
		    auto Int1Ty = Type::getInt1Ty(Builder.getContext());
		    auto PredTy = VectorType::get(Int1Ty, State.VF);
		    Instruction *Call = Builder.CreateIntrinsic(
		        Intrinsic::get_active_lane_mask, {PredTy, ScalarBTC->getType()},
		        {VIVE0, ScalarBTC}, nullptr, "active.lane.mask");
		    State.set(this, Call, Part);
		This should have the advantage of re-using some scalar values, if they have been requested already by other recipes. (Note: this currently crashes unfortunately, as getting lanes for defs managed by VPTransformState does not work, but I put up D80787 to fix that.)
switch (getOpcode()) {		switch (getOpcode()) {
case VPInstruction::Not:		case VPInstruction::Not:
O << "not";		O << "not";
break;		break;
case VPInstruction::ICmpULE:		case VPInstruction::ICmpULE:
O << "icmp ule";		O << "icmp ule";
break;		break;
case VPInstruction::SLPLoad:		case VPInstruction::SLPLoad:
O << "combined load";		O << "combined load";
break;		break;
case VPInstruction::SLPStore:		case VPInstruction::SLPStore:
O << "combined store";		O << "combined store";
break;		break;
		case VPInstruction::ActiveLaneMask:
		O << "active lane mask";
		break;

default:		default:
O << Instruction::getOpcodeName(getOpcode());		O << Instruction::getOpcodeName(getOpcode());
}		}
		vkmr (inline comment, Not Done): Add `case VPInstruction::ActiveMask` to print the correct VPInstruction when printing VPlan. You can pass the flag `-debug-only=loop-vectorize` to `opt` to see the generated VPlan.

for (const VPValue *Operand : operands()) {		for (const VPValue *Operand : operands()) {
O << " ";		O << " ";
Operand->printAsOperand(O, SlotTracker);		Operand->printAsOperand(O, SlotTracker);
}		}
}		}

/// Generate the code inside the body of the vectorized loop. Assumes a single		/// Generate the code inside the body of the vectorized loop. Assumes a single

llvm/test/Transforms/LoopVectorize/ARM/tail-loop-folding.ll

Show All 35 Lines	for.body:
%exitcond = icmp eq i64 %indvars.iv.next, 430		%exitcond = icmp eq i64 %indvars.iv.next, 430
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body
}		}


define dso_local void @tail_folding_enabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {		define dso_local void @tail_folding_enabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {
; COMMON-LABEL: tail_folding_enabled(		; COMMON-LABEL: tail_folding_enabled(
; COMMON: vector.body:		; COMMON: vector.body:
; COMMON: %[[WML1:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; COMMON: %[[INDUCTION:.*]] = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
; COMMON: %[[WML2:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; COMMON: %active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i64(<4 x i64> %[[INDUCTION]], <4 x i64> <i64 429, i64 429, i64 429, i64 429>)
		; COMMON: %[[WML1:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}<4 x i1> %active.lane.mask
		; COMMON: %[[WML2:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}<4 x i1> %active.lane.mask
; COMMON: %[[ADD:.*]] = add nsw <4 x i32> %[[WML2]], %[[WML1]]		; COMMON: %[[ADD:.*]] = add nsw <4 x i32> %[[WML2]], %[[WML1]]
; COMMON: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %[[ADD]]		; COMMON: call void @llvm.masked.store.v4i32.p0v4i32(<4 x i32> %[[ADD]], {{.*}}<4 x i1> %active.lane.mask
; COMMON: br i1 %12, label %{{.*}}, label %vector.body		; COMMON: br i1 %{{.*}}, label %{{.*}}, label %vector.body
entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup:		for.cond.cleanup:
ret void		ret void

for.body:		for.body:
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]		%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
Show All 13 Lines
; CHECK-LABEL: tail_folding_disabled(		; CHECK-LABEL: tail_folding_disabled(
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NOT: @llvm.masked.load.v8i32.p0v8i32(		; CHECK-NOT: @llvm.masked.load.v8i32.p0v8i32(
; CHECK-NOT: @llvm.masked.store.v8i32.p0v8i32(		; CHECK-NOT: @llvm.masked.store.v8i32.p0v8i32(
; CHECK: br i1 %{{.*}}, label {{.*}}, label %vector.body		; CHECK: br i1 %{{.*}}, label {{.*}}, label %vector.body

; PREDFLAG-LABEL: tail_folding_disabled(		; PREDFLAG-LABEL: tail_folding_disabled(
; PREDFLAG: vector.body:		; PREDFLAG: vector.body:
; PREDFLAG: %wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; PREDFLAG: %[[INDUCTION:induction.*]] = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
; PREDFLAG: %wide.masked.load1 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(		; PREDFLAG: %active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i64(<4 x i64> %[[INDUCTION]], <4 x i64> <i64 429, i64 429, i64 429, i64 429>)
		; PREDFLAG: %wide.masked.load = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %active.lane.mask
		; PREDFLAG: %wide.masked.load1 = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}, <4 x i1> %active.lane.mask
		fhahn (inline comment, Not Done): We should also have a test where the unroll-factor/interleave-count > 1 with `llvm.get.active.lane.mask` (unless that's not possible at the moment).
; PREDFLAG: %{{.*}} = add nsw <4 x i32> %wide.masked.load1, %wide.masked.load		; PREDFLAG: %{{.*}} = add nsw <4 x i32> %wide.masked.load1, %wide.masked.load
; PREDFLAG: call void @llvm.masked.store.v4i32.p0v4i32(		; PREDFLAG: call void @llvm.masked.store.v4i32.p0v4i32({{.*}}, <4 x i1> %active.lane.mask
; PREDFLAG: %index.next = add i64 %index, 4		; PREDFLAG: %index.next = add i64 %index, 4
; PREDFLAG: %12 = icmp eq i64 %index.next, 432		; PREDFLAG: %[[CMP:.*]] = icmp eq i64 %index.next, 432
; PREDFLAG: br i1 %{{.*}}, label %middle.block, label %vector.body, !llvm.loop !6		; PREDFLAG: br i1 %[[CMP]], label %middle.block, label %vector.body, !llvm.loop !6
entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup:		for.cond.cleanup:
ret void		ret void

for.body:		for.body:
%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]		%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
Show All 28 Lines
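
The constants in the CHECK lines above (429 in the mask, 432 in the exit compare) follow from the scalar trip count of 430 and a vector factor of 4. A small sketch of that arithmetic, with a hypothetical `TailFoldedCounts` struct and `computeCounts` helper, and assuming the trip count is non-zero and small enough that rounding up does not overflow:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of the counts that appear in a tail-folded loop.
struct TailFoldedCounts {
  std::uint64_t BTC;                    // back-edge taken count = TripCount - 1
  std::uint64_t VectorIterations;       // iterations of the vector loop
  std::uint64_t RoundedTripCount;       // TripCount rounded up to a multiple of VF
  std::uint64_t ActiveLanesInLastIter;  // lanes left active by the mask at the end
};

// Hypothetical helper; assumes TripCount > 0, VF > 0 and no overflow in
// TripCount + VF - 1.
TailFoldedCounts computeCounts(std::uint64_t TripCount, std::uint64_t VF) {
  assert(TripCount > 0 && VF > 0);
  TailFoldedCounts C{};
  C.BTC = TripCount - 1;
  C.VectorIterations = (TripCount + VF - 1) / VF;
  C.RoundedTripCount = C.VectorIterations * VF;
  C.ActiveLanesInLastIter = TripCount - (C.VectorIterations - 1) * VF;
  return C;
}
```

For `computeCounts(430, 4)` this gives a back-edge taken count of 429, 108 vector iterations, a rounded-up count of 432, and 2 active lanes in the final iteration.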

Revision Contents

Diff 265173
