This is an archive of the discontinued LLVM Phabricator instance.

[LV] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops
ClosedPublic

Authored by SjoerdMeijer on Apr 29 2020, 9:24 AM.

Details

Summary

Tail-predication is a new form of predication in MVE for vector loops that implicitly predicates the last vector loop iteration by implicitly setting active/inactive lanes, i.e. the tail of the loop is predicated. In order to set up a tail-predicated vector loop, we need to know the number of data elements processed by the vector loop, which corresponds to the trip count of the scalar loop. We would like to propagate the scalar trip count to the backend, so that this can be picked up by the MVE tail-predication pass.

This implements the approach as discussed on the llvm-dev list; see Eli's comment in http://lists.llvm.org/pipermail/llvm-dev/2020-May/141360.html. The approach is based on emitting an intrinsic for deriving the mask. The vectoriser emits this new intrinsic in the vector preheader block when the new TTI hook indicates that the target can lower this intrinsic and that it is desirable to do so for this loop. For MVE, we do this when the loop is tail-folded, which is the very first step in tail-predicating a loop. For all other targets, this intrinsic won't be emitted, as the hook's default is of course not to do this.

This change will be followed up by MVE-specific changes to lower this intrinsic.

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Apr 29 2020, 9:24 AM
Herald added a project: Restricted Project. · View Herald Transcript

This is used in the ARM backend, in our tail-predication pass, see D79175.

added a test case, minor fixes in comments.

SjoerdMeijer retitled this revision from [LV][TTI] Emit new IR intrinsic llvm.set.loop.elements to [LV][TTI] Emit new IR intrinsic llvm.get.active.mask.
SjoerdMeijer edited the summary of this revision. (Show Details)
SjoerdMeijer added a reviewer: efriedma.
SjoerdMeijer added a subscriber: efriedma.

This was discussed on the list and implements the approach as suggested by @efriedma :

http://lists.llvm.org/pipermail/llvm-dev/2020-May/141360.html

I.e., we now emit a llvm.get.active.mask intrinsic into the vector body, which is used by the masked loads/stores, and its operands allow backend passes to reconstruct the required information. For example, a relevant snippet of a tail-folded vector body looks like this:

%induction = add <4 x i64> %broadcast.splat, <i64 0, i64 1, i64 2, i64 3>
%[[PRED:.*]] = icmp ule <4 x i64> %induction, <i64 429, i64 429, i64 429, i64 429>
%[[MASK:.*]] = call <4 x i1> @llvm.get.active.mask.v4i1.v4i1(<4 x i1> %[[PRED]])
%[[WML1:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}<4 x i1> %[[MASK]]
%[[WML2:.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32({{.*}}<4 x i1> %[[MASK]]

The induction and the compare feed the active.mask intrinsic, which is used by the masked loads/stores. This new intrinsic can now easily be picked up in backend passes, and the 429 in this example represents the backedge-taken count of the scalar loop, which we need to set up tail-predication.
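For illustration, the semantics of that mask can be modelled in plain Python (a sketch; the function name is illustrative — 429 is the backedge-taken count of a 430-iteration scalar loop):

```python
def active_lane_mask(first_lane, btc, vf=4):
    """Model of the predicate above: lane i of the mask is set
    when (first_lane + i) <= btc, i.e. the lane still corresponds
    to a valid scalar iteration."""
    return [(first_lane + i) <= btc for i in range(vf)]

# A scalar trip count of 430 gives a backedge-taken count of 429.
# The last vector iteration starts at index 428 and disables two lanes:
print(active_lane_mask(428, 429))  # [True, True, False, False]
```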

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

That was my initial thought too, so I started drafting an intrinsic that would take the induction step, the backedge-taken count, and the comparison operator, thus replacing the icmp and feeding its result into the masked loads/stores.
This, however, turned out to be a massively invasive change because of the different places where the induction variable is widened, which creates the induction step and icmp, and where the masking happens. This change is very minimal, makes exactly the same information explicit, and thus had my preference.

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

That was my initial thought too, so I started drafting an intrinsic that would take the induction step, the backedge-taken count, and the comparison operator, thus replacing the icmp and feeding its result into the masked loads/stores.
This, however, turned out to be a massively invasive change because of the different places where the induction variable is widened, which creates the induction step and icmp, and where the masking happens. This change is very minimal, makes exactly the same information explicit, and thus had my preference.

Hmm, but thinking about it: after my initial attempt to do this I got some more VPlan plumbing experience, and I now see that I could try again and that it is perhaps not very different. If it helps for acceptance, I can try this, but I think my previous points still stand that this change is minimal and leaves the IR intact.

Tail-predication is a new form of predication in MVE for vector loops that implicitly predicates the last vector loop iteration by implicitly setting active/inactive lanes, i.e. the tail of the loop is predicated. In order to set up a tail-predicated vector loop, we need to know the number of data elements processed by the vector loop, which corresponds to the trip count of the scalar loop. We would like to propagate the scalar trip count to the backend, so that this can be picked up by the MVE tail-predication pass.

Just to make sure I understand: is this text still current? Your intrinsic does not seem to propagate the scalar trip count anymore (at least not explicitly). If I understand correctly, you now communicate the active mask of the current vector iteration, which you compute as i <= backedge-taken count. Did I get that right?

llvm/lib/Transforms/Vectorize/VPlan.cpp
386 ↗(On Diff #264579)

Minor nit:

Here you use TC (as in trip count?) but above when you created this VPInstruction you used BTC (backedge taken count).

This might be confusing for a future reader so I'd suggest to stick to your own convention and use BTC here as well.

This code seems similar to that of VPRecipeBuilder::createBlockInMask (except that one creates the mask for a given block and this one creates a mask similar to the case OrigLoop->getHeader() == BB). That function uses BTC as well.

What are the semantics of llvm.get.active.mask? I don't see an actual description anywhere beyond "it enables tail folding, somehow".

llvm/include/llvm/IR/Intrinsics.td
1419

I assume you meant to write LLVMMatchType<0>

vkmr marked an inline comment as done.May 18 2020, 3:42 PM
vkmr added inline comments.
llvm/lib/Transforms/Vectorize/VPlan.cpp
389–392 ↗(On Diff #264579)

Minor nit:

Probably cleaner to use Builder.CreateIntrinsic(Intrinsic::get_active_mask, { V->getType(), V->getType() }, {V}, nullptr, "active.mask");.

vkmr added inline comments.May 18 2020, 4:18 PM
llvm/lib/Transforms/Vectorize/VPlan.cpp
424–439 ↗(On Diff #264579)

Add case VPInstruction::ActiveMask to print the correct VPInstruction when printing VPLan. You can pass the flag -debug-only=loop-vectorize to opt to see the generated VPlan.

That was my initial though too, so started drafting an intrinsic that would take the induction step, backedge taken count, the comparison operator

If the epilogue folding always uses ULE, then the operator wouldn't be needed. I think it would be really nice not to have the IV and upper limit splatted, plus we could remove the vector add on the IV too.

SjoerdMeijer retitled this revision from [LV][TTI] Emit new IR intrinsic llvm.get.active.mask to [LV][TTI] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops.

Thanks all for your comments! I think this addresses all comments:

  • I did a rename of ActiveMask to ActiveLaneMask because that describes it better I think,
  • The intrinsic now takes 2 arguments: the induction step, and the backedge-taken count. It indeed represents the IV <= BTC comparison, thus giving it semantics rather than letting it only pass on a value, which I have also clarified with comments in different places,
  • The VP instruction is now lowered using Builder.CreateIntrinsic, and have added a case to print it.
SjoerdMeijer marked an inline comment as done.May 19 2020, 7:19 AM
SjoerdMeijer added inline comments.
llvm/include/llvm/IR/Intrinsics.td
1424

Ah, forgot to say that I am looking into LLVMMatchType<0>; this was giving me some problems. This is very minor, I am looking into it, and in the meantime already wanted to show the important changes.

lebedev.ri requested changes to this revision.May 19 2020, 7:42 AM
lebedev.ri added a subscriber: lebedev.ri.

Semantics are still unspecified. Before adding even more intrinsics,
i'd strongly suggest to specify at least the already-committed ones.
Because as far as i can tell, i don't see anything in langref for any of them.

This revision now requires changes to proceed.May 19 2020, 7:42 AM

Semantics are still unspecified. Before adding even more intrinsics,
i'd strongly suggest to specify at least the already-committed ones.
Because as far as i can tell, i don't see anything in langref for any of them.

This was intentional. With the already-committed ones you mean the hardware loop ones, and they are not meant to be user-facing intrinsics. That is, we don't expect users to play around with e.g. the hwloop.decrement intrinsic; these are really meant to be generated by the optimisers.
This new intrinsic here is slightly different, in that it probably is useful as a user-facing intrinsic, so I don't mind documenting it.

Semantics are still unspecified. Before adding even more intrinsics,
i'd strongly suggest to specify at least the already-committed ones.
Because as far as i can tell, i don't see anything in langref for any of them.

This was intentional. With the already-committed ones you mean the hardware loop ones, and they are not meant to be user-facing intrinsics. That is, we don't expect users to play around with e.g. the hwloop.decrement intrinsic; these are really meant to be generated by the optimisers.
This new intrinsic here is slightly different, in that it probably is useful as a user-facing intrinsic, so I don't mind documenting it.

@SjoerdMeijer, i don't believe that is how langref works.

Ok, perhaps I got that wrong then. @samparker can correct me here perhaps, but as I said, for the hardware loop intrinsics I believe this was intentional. But anyway, as I said, will document this new one. As the hardware loop intrinsics are completely separate from this, I will do that separately.

Ok, perhaps I got that wrong then. @samparker can correct me here perhaps, but as I said, for the hardware loop intrinsics I believe this was intentional. But anyway, as I said, will document this new one. As the hardware loop intrinsics are completely separate from this, I will do that separately.

I think that would still be backwards.
langref is not documentation, it is specification.
To reword, langref is source of truth, not the code.

Understood, and we don't disagree, this will be fixed.

+ LangRef description

efriedma added inline comments.May 19 2020, 2:54 PM
llvm/docs/LangRef.rst
16198 ↗(On Diff #265018)

Is this semantically equivalent to icmp ule? If it is, you should probably make that more clear, and explain that it's a hint to the backend. If not, this needs a much more thorough explanation.

llvm/include/llvm/IR/Intrinsics.td
1239

Is IntrNoDuplicate here actually semantically significant? The LangRef explanation doesn't really indicate why it needs to be noduplicate.

Please use LLVMMatchType/LLVMScalarOrSameVectorWidth to ensure the argument/result types match.

Thanks @efriedma : added a statement about the equivalence to the langref description, and fixed the intrinsic description.

I was expecting the intrinsic to be performing the icmp because it feels as though if a target wants an intrinsic like this, that it would want it to do something?

That was my initial thought too, so I started drafting an intrinsic that would take the induction step, the backedge-taken count, and the comparison operator, thus replacing the icmp and feeding its result into the masked loads/stores.
This, however, turned out to be a massively invasive change because of the different places where the induction variable is widened, which creates the induction step and icmp, and where the masking happens. This change is very minimal, makes exactly the same information explicit, and thus had my preference.

Hmm, but thinking about it: after my initial attempt to do this I got some more VPlan plumbing experience, and I now see that I could try again and that it is perhaps not very different. If it helps for acceptance, I can try this, but I think my previous points still stand that this change is minimal and leaves the IR intact.

The comparison really should be encapsulated in the intrinsic itself because for scalable types it is not clear how many bits the stepvector type needs to enumerate its lanes without overflow:

%lane_ids = <vscale x 1 x i8> llvm.stepvector() ; This will overflow if 'vscale > 256' at runtime (note that a 'stepvector' intrinsic does not even exist at this point)
%lane_mask = icmp ule %lane_ids, (splat %n)

That is not a problem if you have

%lane_mask = <vscale x 1 x i1> llvm.active.mask(i32 0, i32 %n)

We need such an intrinsic anyway for the VP expansion pass to legalize the EVL parameter of VP intrinsics for targets that have scalable types but no active vector length (tail loop predication).
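The overflow problem described above can be demonstrated with a small Python model (hypothetical helper names; the modulo-256 wrap mirrors the <vscale x 1 x i8> lane-id example):

```python
def mask_via_icmp(num_lanes, n):
    # A stepvector enumerated in an 8-bit element type wraps modulo 256.
    lane_ids = [i % 256 for i in range(num_lanes)]
    return [lane <= n for lane in lane_ids]

def mask_via_intrinsic(num_lanes, n):
    # llvm.active.mask(0, n) semantics: no intermediate wrap-around.
    return [i <= n for i in range(num_lanes)]

# With 300 lanes and n = 290, lanes 291..299 should be inactive, but
# their 8-bit ids wrapped back to 35..43, so the icmp re-activates them:
print(mask_via_icmp(300, 290)[291])       # True (wrong)
print(mask_via_intrinsic(300, 290)[291])  # False (correct)
```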

This patch, this new intrinsic, is a straightforward translation from an icmp. It provides the required information, exactly the same as in your example, albeit in a slightly different form. Scalable vectors are not yet used by the vectoriser, and a stepvector intrinsic is not yet defined, so using them at this point is problematic.

As this is such a straightforward translation, with straightforward semantics, this won't block in any way future developments. In fact, this is the first step in that direction, and it is good we get some experience with it. We offer our support to adapt this approach/intrinsic and port it, should this be necessary, which should be straightforward.

I think all Simon is asking for is that the first argument of the intrinsic should be a scalar, equal to the first lane of the vector induction variable. How much work would that be?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6815

It would be nice to explicitly note here that we're assuming the vector factor is a power of two (so the induction variable can't wrap).

I think all Simon is asking for is that the first argument of the intrinsic should be a scalar, equal to the first lane of the vector induction variable. How much work would that be?

Ah okay, sorry, I missed that. That should hopefully be straightforward, and if we have everyone on board with this, I will fix that right away tomorrow.

Uploading a new revision to see if we can reach a conclusion on this.

The intrinsic now takes two scalar arguments, the first element of the VIV and the scalar BTC, hopefully in the way you had in mind.

On the dev list, Ayal pointed out an alternative: rediscovering the BTC in the backend (I still need to spend time confirming we then don't miss anything). This intrinsic would still have my preference because of its convenience, and as it is a general concept it seems to support different uses too (VP intrinsics, and it provides a target-independent way of describing the active mask).

The intrinsic now takes two scalar arguments, the first element of the VIV and the scalar BTC, hopefully in the way you had in mind.

Yes, this is what I expected.

On the dev list, Ayal pointed out an alternative: rediscovering the BTC in the backend (I still need to spend time confirming we then don't miss anything). This intrinsic would still have my preference because of its convenience, and as it is a general concept it seems to support different uses too (VP intrinsics, and it provides a target-independent way of describing the active mask).

In general, given two expressions that are equivalent, there's always some way to pattern-match. For power-of-two fixed-width vectors, the pattern you'd need to match is not that complicated (basically just an icmp where one operand is a vector induction variable, and the other operand is a splat.) But like you say, the intrinsic is more convenient.

If you're dealing with scalable vectors, expanding the intrinsic to arithmetic would result in a very complex expression due to the potential for overflow, so we'll almost certainly want this intrinsic.


The changes to VPlan.cpp look a little suspicious, but I'm not really familiar with that code, so I'll let someone else review it.

llvm/docs/LangRef.rst
16202 ↗(On Diff #265969)

Maybe qualify this with the caveat that it's only equivalent if the vector induction variable doesn't overflow. We probably want to return false in the lanes where it overflows.
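That caveat can be modelled in Python (a sketch under the assumption of 64-bit unsigned index arithmetic; names are illustrative):

```python
UMAX = 2**64 - 1  # unsigned 64-bit maximum

def get_active_lane_mask(base, btc, vf=4):
    """Lane i is active when base + i <= btc; a lane whose index
    computation would wrap past UMAX is defined to be inactive."""
    return [base + i <= btc if base + i <= UMAX else False
            for i in range(vf)]

# Near the top of the index range, the overflowing lanes come out false
# instead of wrapping around to small values:
print(get_active_lane_mask(UMAX - 1, UMAX))  # [True, True, False, False]
```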

Thanks Eli for commenting and explaining.

I guess it was the TODO in VPlan.cpp that looked a bit suspicious, but there wasn't much going on there; it was very straightforward, so I fixed that. I.e., Ayal recently added support for decrementing loops in VPlan. The backedge-taken count value looks slightly different for these cases, but extracting it is easy, which I have added here. As a result, this now also triggers in test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll, which I updated.

fhahn added a comment.May 26 2020, 1:06 PM

Thanks Eli for commenting and explaining.

I guess it was the TODO in VPlan.cpp that looked a bit suspicious, but there wasn't much going on there; it was very straightforward, so I fixed that. I.e., Ayal recently added support for decrementing loops in VPlan. The backedge-taken count value looks slightly different for these cases, but extracting it is easy, which I have added here. As a result, this now also triggers in test/Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll, which I updated.

I think introducing a VPInstruction opcode for the new intrinsic makes sense and fits in the current scheme. But I think there's no need to bundle the langref, TTI and LV changes into a single patch. IMO would be good to split at least the LV part off, to focus on discussing the implementation details in LV there.

Many thanks for taking a look Florian!

I think introducing a VPInstruction opcode for the new intrinsic makes sense and fits in the current scheme. But I think there's no need to bundle the langref, TTI and LV changes into a single patch. IMO would be good to split at least the LV part off, to focus on discussing the implementation details in LV there

Ok, but just double-checking that I get this right:

  • In the LV part, we do use and check TTI.emitGetActiveLaneMask, so I guess that means we have LV + TTI in one patch,
  • and separately we have LangRef.

That looks like a minimal change, but certainly don't mind doing this if that helps.

fhahn added a comment.May 26 2020, 2:29 PM

Many thanks for taking a look Florian!

I think introducing a VPInstruction opcode for the new intrinsic makes sense and fits in the current scheme. But I think there's no need to bundle the langref, TTI and LV changes into a single patch. IMO would be good to split at least the LV part off, to focus on discussing the implementation details in LV there

Ok, but just double-checking that I get this right:

  • In the LV part, we do use and check TTI.emitGetActiveLaneMask, so I guess that means we have LV + TTI in one patch,

I'd just put the LV parts in a separate patch and the TTI stuff in a different patch (the former depending on the latter). The way I see it, those are changes to 2 separate areas of the code, with potentially different people to approve. Also, if there's a problem with the LV patch, it can be reverted in isolation.

Ok, thanks, sounds like a plan.

I will keep this as a reference, I think that's convenient, and will split off the different bits and pieces.
I have started with the easiest patch to create, the intrinsic description and documentation: D80596

SjoerdMeijer retitled this revision from [LV][TTI] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops to [LV] Emit new IR intrinsic llvm.get.active.mask for tail-folded loops.

I have just committed the required scaffolding for this:

So I was wondering if we are happy with this LV part too?

Just FYI, I am addressing comments on our backend support for this intrinsic in D79175. Once that is ready, I will commit this and D79175 at the same time.

fhahn added inline comments.May 29 2020, 4:25 AM
llvm/lib/Transforms/Vectorize/VPlan.cpp
425 ↗(On Diff #266369)

IIUC we want the first lanes of both the BTC and the IV, right?

If that's the case, I think it would be more straight-forward to just request the specific lane when looking up the values, e.g. something like:

// Get first lane of vector induction variable.
 Value *VIVE0 = State.get(getOperand(0), {Part, 0});
 // Get first lane of backedge-taken-count.
 Value *ScalarBTC = State.get(getOperand(1), {Part, 0});

 auto *Int1Ty = Type::getInt1Ty(Builder.getContext());
 auto *PredTy = VectorType::get(Int1Ty, State.VF);
 Instruction *Call = Builder.CreateIntrinsic(
     Intrinsic::get_active_lane_mask, {PredTy, ScalarBTC->getType()},
     {VIVE0, ScalarBTC}, nullptr, "active.lane.mask");

 State.set(this, Call, Part);

This should have the advantage of re-using some scalar values, if they have been requested already by other recipes.

(Note: this currently crashes unfortunately, as getting lanes for defs managed by VPTransformState does not work, but I put up D80787 to fix that)

llvm/test/Transforms/LoopVectorize/ARM/tail-loop-folding.ll
79

We should also have a test where the unroll-factor/interleave-count > 1 with llvm.get.active.lane.mask (unless that's not possible at the moment)

fhahn added inline comments.May 29 2020, 4:48 AM
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1313

Just return TailFolded?

side note: what are all the extra parameters needed/used somewhere?

SjoerdMeijer marked an inline comment as done.May 29 2020, 5:24 AM
SjoerdMeijer added inline comments.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1313

Just a quick first comment on your side note:

what are all the extra parameters needed/used somewhere?

That's just to provide an interface consistent with most other TTI functions here. Currently, we only look at TailFolded; the rest is indeed unused. But if (other) targets want to do a bit more analysis here, they will most definitely want to look at L and LI, and possibly SE.

So, as I said, mostly unused at the moment, and added for consistency, which I accept may or may not be a good reason. I have no strong opinions, so will refactor this TTI hook right away if you think that is better/necessary.

I will address your other remarks soon. Thanks for that.

IIUC we want the first lanes of both the BTC and the IV, right?

Yep, exactly that.

If that's the case, I think it would be more straight-forward to just request the specific lane when looking up the values, .e.g something like:
<SNIP>
This should have the advantage of re-using some scalar values, if they have been requested already by other recipes.

(Note: this currently crashes unfortunately, as getting lanes for defs managed by VPTransformState does not work, but I put up D80787 to fix that)

Thanks for that. I took D80787, and that worked like a charm, greatly simplifying things here: I have simplified lowering of VPInstruction::ActiveLaneMask as per your suggestion.

With D80787 committed, and this using that, does this look okay now?

fhahn accepted this revision.Jun 4 2020, 3:53 PM

LGTM with the comment, thanks! Please wait a day or so with committing in case there are additional comments. Also, I might have missed it, but it would be good to have a test case with unroll-factor/interleave-count > 1 (as mentioned in an inline comment).

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1313

I am not sure, but I think adding all those unused parameters because they might get used in the future seems a bit odd. It would be easy enough to add them if required.

Many thanks for looking at this!

This needs to be committed together with the backend patch that adds support for lowering this intrinsic. That is nearly done (hopefully), and am expecting a commit somewhere next week, so will definitely wait a few days.

I will add that test case (sorry, I overlooked that), and will follow up to reduce the arguments in the TTI hook.

This revision was not accepted when it landed; it landed in state Needs Review.Jun 17 2020, 2:08 AM
This revision was automatically updated to reflect the committed changes.

Hi, seems that this patch is causing some failures:
Failed Tests (2):

LLVM :: Transforms/LoopVectorize/ARM/tail-folding-counting-down.ll
LLVM :: Transforms/LoopVectorize/ARM/tail-loop-folding.ll

I replied by mail, but just for completeness, I had noticed and reverted it, and now recommitted it.