This is an archive of the discontinued LLVM Phabricator instance.

[ARM][MVE] Tail-Predication: use @llvm.get.active.lane.mask to get the BTC
ClosedPublic

Authored by SjoerdMeijer on Apr 30 2020, 6:59 AM.

Details

Summary

With D79100 we can rely on @llvm.get.active.lane.mask() generated by the vectoriser to get the number of elements processed by the loop, which is required to set up tail-predication. This intrinsic generates the predicate for the masked loads/stores, and consumes the Backedge Taken Count (BTC) as its second argument; we can now simply extract that and use it to set up tail-predication.
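As an illustration, the intrinsic's behaviour as used in this patch can be modelled in a few lines of Python. This is an informal sketch, not the intrinsic's definition: it assumes the second operand is the BTC and that lane i is active when index + i <= BTC, compared as unsigned; the function name mirrors the intrinsic but is otherwise hypothetical.

```python
def get_active_lane_mask(index, btc, vector_width):
    # Lane i is active when (index + i) <= BTC, compared as unsigned.
    # Here the second operand is the backedge-taken count (BTC),
    # not the trip count.
    return [(index + i) <= btc for i in range(vector_width)]

# 10 elements with VectorWidth = 4: the scalar BTC is 9, so the last
# (third) vector iteration at index 8 gets a partial mask.
print(get_active_lane_mask(0, 9, 4))  # [True, True, True, True]
print(get_active_lane_mask(8, 9, 4))  # [True, True, False, False]
```

The masked loads/stores consume exactly this predicate, and the BTC is the value the pass extracts from the second operand.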

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Apr 30 2020, 6:59 AM

I was really hoping to see a lot of stuff being deleted from this pass, not added... isn't half of the original code now redundant?

I was really hoping to see a lot of stuff being deleted from this pass, not added... isn't half of the original code now redundant?

Copied from the description of this change:

Now we pick up the number of elements from this intrinsic, which simplifies the pattern matching we were doing to find this value. I have not yet removed the pattern matching because that would require changing a lot of tests. Thus, for now, the intrinsic and the pattern matching coexist, but as a follow-up we probably want to remove the latter.

I propose we focus first on handling this new intrinsic. At this moment the original code is still used, because it is triggered by all existing tests, which need updating.
Handling the intrinsic, removing half the code, and updating all tests would be a massive change that doesn't make reviewing any easier, so I thought this is best done in steps; this is the first one.

But fair enough, I will start working on that, and will do that here or in a separate diff while I wait for feedback on its parent D79100.

SjoerdMeijer updated this revision to Diff 266505.EditedMay 27 2020, 5:57 AM
SjoerdMeijer retitled this revision from [ARM][MVE] Support intrinsic @llvm.set.loop.elements in the tail-predication pass to [ARM][MVE] Tail-Predication: use @llvm.get.active.lane.mask to get the BTC.
SjoerdMeijer edited the summary of this revision. (Show Details)
SjoerdMeijer added a reviewer: efriedma.

Spring clean up: this deletes half the pass, i.e. all the pattern matching.
This is possible because of intrinsic @llvm.get.active.lane.mask() that will be generated by D79100 and friends.

I am posting this for review while I am fixing up the remaining test cases. I have updated one test to the new situation: llvm/test/CodeGen/Thumb2/LowOverheadLoops/basic-tail-pred.ll. The others still need fixing too, but that will be more of the same, so I thought it was good to post this already.

efriedma added inline comments.May 27 2020, 4:29 PM
llvm/lib/Target/ARM/MVETailPredication.cpp
657

This addition can overflow.

740

Do you need to check that the first argument to get_active_lane_mask is an induction variable for the loop "L"?

Do you need to check the types of the arguments to get_active_lane_mask?

efriedma added inline comments.May 27 2020, 5:14 PM
llvm/lib/Target/ARM/MVETailPredication.cpp
680

This subtraction can also overflow.

A glorious amount of red in this diff.

llvm/lib/Target/ARM/MVETailPredication.cpp
657

And that's not okay, right? Trip count will always be BTC + 1 and we don't handle uncountable loops.
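To see the wrap concretely: in 32-bit modular arithmetic, a loop that iterates over the whole index space has BTC == UINT_MAX, so the element count BTC + 1 wraps to 0. A Python sketch of the i32 arithmetic:

```python
UINT_MAX = 0xFFFFFFFF

def add_u32(a, b):
    # 32-bit modular addition, as the i32 SCEV expression computes it.
    return (a + b) & UINT_MAX

# For the maximal BTC, the element count BTC + 1 wraps to 0, so any
# mask derived from it would be wrong.
print(add_u32(UINT_MAX, 1))  # 0
```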

680

But that's okay, right? This predication is only really useful when wrapping happens and the intrinsic reflects that overflow can/will happen.

efriedma added inline comments.May 28 2020, 12:52 PM
llvm/lib/Target/ARM/MVETailPredication.cpp
657

The masks are wrong if it overflows.

680

The problem here is that the original get_active_lane_mask can do something like this:

Iteration 1: get_active_lane_mask(0, 5) -> all-true
Iteration 2: get_active_lane_mask(4, 5) -> <true, false, false, false>
Iteration 3: get_active_lane_mask(8, 5) -> all-false

In the rewritten code, you end up with this:

Iteration 1: vctp(5) -> all-true
Iteration 2: vctp(1) -> <true, false, false, false>
Iteration 3: vctp(-3) -> all-true

There are a couple ways you could deal with this:

  1. Use saturating subtraction
  2. Prove the trip count is small enough that this never happens, i.e. ceil(ElementCount / VectorWidth) >= TripCount.
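The wrap in the rewritten code, and option 1 (saturating subtraction), can be sketched like this. The vctp model below is an assumption for illustration: lane i is active while i < n, with n read back as a 32-bit unsigned value.

```python
UINT_MAX = 0xFFFFFFFF

def vctp(n, vector_width=4):
    # Lane i is active while i < n; n is interpreted as 32-bit unsigned.
    n &= UINT_MAX
    return [i < n for i in range(vector_width)]

def sub_u32(a, b):
    # Plain i32 subtraction: wraps modulo 2^32.
    return (a - b) & UINT_MAX

def sub_sat_u32(a, b):
    # Unsigned saturating subtraction: clamps at 0 instead of wrapping.
    return max(a - b, 0)

elems = 5
# Iteration 3 computes 5 - 8; the wrapped value is huge, so the mask is
# wrongly all-true instead of all-false.
print(vctp(sub_u32(elems, 8)))      # [True, True, True, True]
# Saturating subtraction clamps to 0 and recovers the all-false mask.
print(vctp(sub_sat_u32(elems, 8)))  # [False, False, False, False]
```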
efriedma added inline comments.May 28 2020, 12:57 PM
llvm/lib/Target/ARM/MVETailPredication.cpp
680

Using saturating subtraction would require proving that the induction variable used as the first argument to llvm.get.active.lane.mask doesn't overflow. But I guess you have to check that anyway.

samparker added inline comments.May 29 2020, 1:10 AM
llvm/lib/Target/ARM/MVETailPredication.cpp
680

I would be hesitant to introduce saturating math here. I think the reasonable assumption is that the sub will wrap, but only on the final iteration. So just asserting that the element count is within the bounds of the trip count should be fine.

SjoerdMeijer added inline comments.May 29 2020, 7:40 AM
llvm/lib/Target/ARM/MVETailPredication.cpp
680

The problem here is that the original get_active_lane_mask can do something like this:

Iteration 1: get_active_lane_mask(0, 5) -> all-true
Iteration 2: get_active_lane_mask(4, 5) -> <true, false, false, false>
Iteration 3: get_active_lane_mask(8, 5) -> all-false

In general, I see the problem, and I see that this can happen. But in this context, I was wondering whether it doesn't actually boil down to Sam's earlier remark about countable loops. That is, whereas before we were matching a particular pattern produced by the vectoriser, we are still handling (vector) loops produced by the vectoriser. Thus, we will never execute Iteration 3 from the example above, because ceil(ElementCount / VectorWidth) >= TripCount will always hold for these loops.

My question is: with the intrinsic approach, can we still rely on that; would that be a valid assumption to make? Along the same lines, this pass also relies on the check IsPredicatedVectorLoop and the presence of masked loads/stores currently produced by the vectoriser, which get their predicate from @llvm.get.active.lane.mask, also generated by the vectoriser.
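The assumption for vectoriser-produced loops can be spelled out numerically: their vector trip count is exactly ceil(ElementCount / VectorWidth), so the condition above holds and every executed iteration still has at least one active lane. An illustrative sketch, not the pass's code:

```python
def ceil_div(a, b):
    return (a + b - 1) // b

VW = 4
for ec in range(1, 200):
    tc = ceil_div(ec, VW)          # vector trip count as the vectoriser
                                   # would emit it
    assert ceil_div(ec, VW) >= tc  # the safety condition
    # The element count left for the last iteration never drops to 0 or
    # below, so the "Iteration 3" all-false case is never executed.
    assert ec - (tc - 1) * VW >= 1
```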

efriedma added inline comments.May 29 2020, 11:06 AM
llvm/lib/Target/ARM/MVETailPredication.cpp
680

It's not safe to assume get_active_lane_mask was produced by the LLVM vectorizer. I mean, the ACLE has a vctp intrinsic; it's not that much of a stretch that we could expose get_active_lane_mask to users at some point.

And even if it was produced by the LLVM vectorizer, other passes that can modify the loop structure run between the vectorizer and the MVETailPredication pass.

And even if you're looking at the unmodified vectorizer output, you still need to verify the "L" is actually the loop that was vectorized.

In summary, I don't think it's a good idea to make assumptions beyond what LangRef actually promises. It's still a lot easier to pattern-match than it would be without the intrinsic.

SjoerdMeijer marked an inline comment as done.May 29 2020, 11:41 AM
SjoerdMeijer added inline comments.
llvm/lib/Target/ARM/MVETailPredication.cpp
680

Ok, cheers, got it.

It's still a lot easier to pattern-match than it would be without the intrinsic.

Yep, will have a go at this next week.

I am not entirely done yet, but this adds IsSafeActiveMask, which performs checks on the induction variable and the backedge-taken count that are arguments to @llvm.get.active.lane.mask; tests for this have been added to test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll. I have also updated all other tests to the new situation, i.e. manually added the @llvm.get.active.lane.mask instead of the icmp.

The approach for the overflow check is to use SCEV and query whether the loop entry is protected by a conditional check on BTC + 1 < 0. In other words, if the scalar trip count overflows and becomes negative, we shouldn't enter the loop or create the trip-count expression BTC + 1, as it won't be valid. As I said, not entirely done yet, but I wanted to share this after our overflow discussion while I fix up these things:

  • the vctp can be cloned into the exit block, and it looks like I am missing that at the moment.
  • I am always creating a new num.elements = BTC + 1 expression, but that value might already exist and could be reused, hopefully reducing some of the codegen changes we see in some of the tests.
efriedma added inline comments.Jun 2 2020, 5:22 PM
llvm/lib/Target/ARM/MVETailPredication.cpp
665

I think you want llvm::cannotBeMaxInLoop?

680

(It looks like the current version doesn't try to address the potential overflow in the subtraction.)

689

You also need to check that the AddRec is associated with the right loop.

This should include everything now. Main additions are:

  • check for potential overflow in the subtraction.
  • check that the induction/addRec is associated with the right loop.

Updated test

This should include everything now.

:-)
If @llvm.get.active.lane.mask can't be lowered to a VCTP (e.g. overflow) I guess that means we will have to revert it to an icmp, in order to avoid a backend isel match error. This shouldn't happen yet, but will add this.

I'd like to see a few negative testcases, where we can't transform the llvm.get.active.lane.mask.

llvm/lib/Target/ARM/MVETailPredication.cpp
660

getNumElements() is fine on a FixedVectorType.

692

!isKnownNonNegative?

711

I was expecting something more like IVExpr->getLoop() == L. L might not be the innermost loop.

I was just about to upload a new diff when I noticed your review. Many thanks again.

This includes 2 new functions:

  • getNumElements(): this looks in the preheader to see whether the number of elements value is already present, to avoid recreating it.
  • RevertActiveLaneMask(): if it is not safe to lower @llvm.get.active.lane.mask to a VCTP, we recreate the icmp.

About testing:

I'd like to see a few negative testcases, where we can't transform the llvm.get.active.lane.mask.

I have put most negative tests in test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll:

  • @overflow: this checks for overflow if BTC is a constant UINT_MAX
  • @IV_not_an_induction
  • @IV_wrong_step
  • @IV_step_not_constant
  • @outerloop_phi
  • @overflow_in_sub

And there is also one in test/CodeGen/Thumb2/LowOverheadLoops/tail-reduce.ll:

  • @reduction_not_guarded: this checks for overflow if BTC is a runtime variable not guarded by a loop check.

These negative tests include checks that we recreate the icmp; the pass shouldn't emit @llvm.get.active.lane.mask or the VCTP.

I have already addressed the 2 minor comments, but TODO: I still need to look into the question about !isKnownNonNegative, as that doesn't seem to work for me (most tail-predication tests start failing, rejected because of overflow); I need to check whether I am not properly constructing that SCEV expression, or whether it is something else.

SjoerdMeijer marked an inline comment as done.Jun 5 2020, 12:11 PM
SjoerdMeijer added inline comments.
llvm/lib/Target/ARM/MVETailPredication.cpp
692

I played with SCEV today, and tried to use isKnownNonNegative (and similar helpers). These SCEV helpers don't seem to provide the required information: they are not able to find precise enough value ranges to tell us the values are non-negative, because isKnownNonNegative and friends don't have the context of the loop. Our loops usually look like this; they have this or a similar loop guard:

%cmp = icmp sgt %N, 0
br i1 %cmp, label %vector.preheader, label %exit

For this example, %N is our ElementCount. When we construct our overflow check:

ceil(ElementCount / VectorWidth) >= TripCount

and query SCEV, it doesn't have the context that %N > 0, resulting in a negative lower bound, and thus isKnownNonNegative returns false. Looking into how I could add more context to SCEV, I checked getSCEVAtScope, hoping it would be more context sensitive, and some others too. PredicatedScalarEvolution looked promising; I think it is designed for exactly this (I haven't used it yet), but the LoopUtils helpers isKnownNegativeInLoop and cannotBeMaxInLoop provide this with a convenient interface. These LoopUtils helpers were actually contributed by @samparker after a similar experience (which he might be able to confirm here).

Long story short, it looks like helpers isKnownNegativeInLoop and cannotBeMaxInLoop are actually the right choice here (also confirmed after further debugging and tracing the tests that I mentioned previously).

efriedma added inline comments.Jun 5 2020, 1:12 PM
llvm/lib/Target/ARM/MVETailPredication.cpp
692

The description of isKnownNegativeInLoop says "Returns true if we can prove that \p S is defined and always negative in loop L." If it returns false, we have proven nothing, so you can't use it like this.

I wasn't trying to imply you shouldn't use isKnownNonNegativeInLoop, if that's appropriate.

SjoerdMeijer updated this revision to Diff 269250.EditedJun 8 2020, 9:03 AM

Short story:

isKnownNonNegativeInLoop is unfortunately not able to give an answer for this expression, and as a result most/all loops would be rejected. I have added FIXMEs, and am using isKnownNegativeInLoop as that is at least able to catch some cases (the test cases with constant values), which is probably better than nothing. I have tried several SCEV helpers, but none of them seem to support this expression. I think teaching SCEV about this expression is a separate issue. @efriedma, @samparker: please let me know what you think, and what you think the order of events should be.

Longer story:

I wasn't trying to imply you shouldn't use isKnownNonNegativeInLoop, if that's appropriate.

Thanks for confirming. I indeed got confused, briefly went down the wrong track, but rediscovered isKnownNonNegativeInLoop and experimented further with that.

While evaluating whether this expression is non-negative:

((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount

and dumping KnownNonNegative information for the intermediate expressions, I see that SCEV is able to determine KnownNonNegative for all intermediate expressions, except the last one:

BTC: (-1 + %N)
BTC KnownNonNegative: 1
elemcount: %N
elemcount + vlen-1: (3 + %N)
KnownNonNegative: 1
Ceil: ((3 + %N) /u 4)
Ceil KnownNonNegative: 1
TripCount: (1 + ((-4 + (4 * ((3 + %N)/u 4))<nuw>) /u 4))<nuw><nsw>
TripCount KnownNonNegative: 1
ECMinusTC: (-1 + (-1 * ((-4 + (4 * ((3 + %N) /u 4))<nuw>) /u 4))<nsw> + ((3 + %N) /u 4))
KnownNonNegative: 0

When I request signed integer ranges for rounded element count (Ceil) and the trip count (TC) I see this:

Range Ceil: [0,1073741824)
Range TC: [1,1073741825)

And that looks very sensible and promising. I wanted to add support for this here, but then discovered a case that worked slightly differently; it needs some more thinking and investigation, and is probably best added as a helper to LoopUtils/SCEV somewhere. I have traced SCEV and its decision making, and roughly see where it rejects this, but need to investigate further.

Maybe instead of querying SCEV about ((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount, it would be easier for SCEV to reason about (ElementCount + (VectorWidth - 1)) - TripCount * VectorWidth?

Maybe instead of querying SCEV about ((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount, it would be easier for SCEV to reason about (ElementCount + (VectorWidth - 1)) - TripCount * VectorWidth?

Thanks! I might have tried this (I have tried many different expressions, with/without overflow flags, etc.), but will double-check tomorrow.

I was actually just uploading an approach using integer ranges, checking that:

Range((ElementCount + (VectorWidth - 1)) / VectorWidth) - Range(TC) == {}

If this set difference is the empty set, we know overflow doesn't happen.
This seems to work for all cases, i.e. when values are runtime values or constants, and it is a simple check.
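A toy interval model of this set-difference check; the intervals below are made up for illustration and are not the ranges SCEV reported earlier in the thread:

```python
def range_diff_is_empty(a, b):
    # a and b are half-open intervals [lo, hi); the difference a \ b is
    # the empty set exactly when a is contained in b.
    (lo_a, hi_a), (lo_b, hi_b) = a, b
    return lo_b <= lo_a and hi_a <= hi_b

# If every value Ceil = (ElementCount + (VW-1)) / VW can take is also a
# value the trip count can take, the difference is empty.
print(range_diff_is_empty((1, 1 << 30), (1, (1 << 30) + 1)))  # True
print(range_diff_is_empty((0, 1 << 30), (1, (1 << 30) + 1)))  # False
```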

Maybe instead of querying SCEV about (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount, it would be easier for SCEV to reason about ((ElementCount + (VectorWidth - 1)) - TripCount * VectorWidth?

No luck with that one either: isKnownNonNegativeInLoop is not able to prove that for this expression.

How about the current implementation, and just looking at the signed ranges?

efriedma added inline comments.Jun 9 2020, 1:29 PM
llvm/lib/Target/ARM/MVETailPredication.cpp
685

Can ElementCount + (VW-1) overflow? Do we need to check for that?

712

The general idea here makes sense. The precise way you're implementing it seems a little strange; it's fine if TripCount is smaller than BTC, I think.

745

Is it really legal for the induction variable to be stepping in either direction?

766

Not sure we can safely assume "I" is an add instruction.

SjoerdMeijer marked 2 inline comments as done.Jun 10 2020, 8:57 AM
SjoerdMeijer added inline comments.
llvm/lib/Target/ARM/MVETailPredication.cpp
685

We are not generating code for ElementCount + (VW-1), so that one is fine. We do want to know about overflow for Ceil, so I will add a check for that.

745

It's definitely a case we want to support. In D77635, the vectoriser was taught to create a vector induction variable when a primary induction is initially absent, which is the case for decrementing loops, i.e. a step value of -1. We have a test case for counting-down loops in test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll with function @foo4.

As this is produced by the vectoriser, I didn't see a problem with it, but I will give it some more thought as to whether we need further checks. If it helps to get a first version in, I can remove this and address it in a follow-up.

Addressed the other comments, simplified the value range check.

efriedma added inline comments.Jun 10 2020, 11:50 AM
llvm/lib/Target/ARM/MVETailPredication.cpp
685

Not sure I understand; even if we aren't generating code, we're using it as input to the safety check. Does the math there work correctly even if it overflows?

691

Ceil is the result of a UDiv; it trivially can't be negative.

llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll
253 ↗(On Diff #269869)

I don't understand how this loop is supposed to work. %index is zero in the first iteration, and UINT_MAX-3 in the second iteration.

SjoerdMeijer marked 3 inline comments as done.Jun 10 2020, 12:33 PM
SjoerdMeijer added inline comments.
llvm/lib/Target/ARM/MVETailPredication.cpp
685

The Ceil expression doesn't have the non-wrapping flags. Therefore, my understanding is that this ceiling calculation is done in a higher bit range, so no information is lost.

691

Ah yeah, that of course doesn't make any sense, I am removing it.

llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-pred-const.ll
253 ↗(On Diff #269869)

Yep, thanks for catching this; it doesn't make sense, some sort of copy-paste mistake.

  • Removed the overflow check on the Ceil expression, the udiv.
  • Removed recognising the -Step case and the corresponding test. That is not produced by the vectoriser, so we don't need to recognise it; I am guessing this is no longer produced since the vectoriser now understands decrementing loops.
efriedma added inline comments.Jun 10 2020, 1:18 PM
llvm/lib/Target/ARM/MVETailPredication.cpp
685

Therefore, my understanding is, that this ceiling calculation is done in a higher bit range

SCEV math is modular math; it happens in the width of SCEV::getType(). (So Add, Mul, and AddRec can overflow.) If you want wider math, you need to explicitly zero-extend.
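The difference can be sketched in a few lines, modelling i32 as 32-bit modular arithmetic:

```python
UINT_MAX = 0xFFFFFFFF

def add_i32(a, b):
    # i32 SCEV addition: modular in the width of SCEV::getType().
    return (a + b) & UINT_MAX

ec, vw = UINT_MAX - 1, 4
# In i32 the rounding term ElementCount + (VW - 1) wraps...
print(add_i32(ec, vw - 1))  # 1
# ...whereas zero-extending to i64 first keeps the true value.
print(ec + (vw - 1))        # 4294967297
```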

SjoerdMeijer marked an inline comment as done.Jun 10 2020, 1:55 PM
SjoerdMeijer added inline comments.
llvm/lib/Target/ARM/MVETailPredication.cpp
685

Ahhhh, thanks for explaining.

This is a real puzzle. I think I am going to solve this differently then, because I am afraid we won't be able to put any meaningful bound on ElementCount + (VW-1) (I have seen this already but will double-check). I think I am going to use the TripCount (TC) for this, which usually looks like this:

(1 + ((-4 + (4 * ((3 + %N) /u 4))<nuw>) /u 4))<nuw><nsw>

For which we are able to find useful value ranges like this:

TC: [1,1073741825)

Because TC uses %N, and %N is also used in ElementCount + (VW-1), I think that means that if:

upperbound(TC) <= UINT_MAX - VectorWidth

then we are okay.
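This proposed bound is easy to sketch; the helper name below is hypothetical, not the pass's code:

```python
UINT_MAX = 0xFFFFFFFF

def rounding_cannot_wrap(tc_upper_bound, vector_width):
    # If the trip count's upper bound leaves VectorWidth of headroom
    # below UINT_MAX, ElementCount + (VW - 1) cannot wrap in 32 bits.
    return tc_upper_bound <= UINT_MAX - vector_width

print(rounding_cannot_wrap(1073741825, 4))  # True
print(rounding_cannot_wrap(UINT_MAX, 4))    # False
```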

Added overflow check for the rounding expression + test case.

some minor tweaks:

  • added an option to force tail-predication,
  • removed the unreachable, and generate the splat BTC in the preheader if it doesn't exist.

Sorry for being a bit impatient, but was wondering if this is okay as an initial commit?
As there are several moving parts involved here, an initial commit and a first in-tree version would be convenient to iterate on, for example:

  • I want to have a look at codegen to see if we can improve it,
  • I have added a test case @Correlation that we want to support. It's currently rejected because of possible overflow; that's why I have added an option to force it, but I would like to have a closer look at this again.
efriedma accepted this revision.Jun 16 2020, 7:07 PM

I don't think the overflow check (step 2.2) in IsSafeActiveMask is right. The way it's comparing ranges doesn't seem sound: the "range" is conservative, so it's possible the actual value of the trip count at runtime is only one of the values in the range. Comparing the overlap on the ranges produces a result that doesn't really correspond to what you're looking for. Not sure if I'm explaining that clearly.

If it's really too hard to come up with the relevant proofs, we might want to revise the definition of llvm.get.active.lane.mask. I'm not sure what the revision looks like. We could say that it never produces an all-false vector; instead, it produces poison in that case. Or we could change the arguments somehow to make them easier to reason about.

That said, I'm okay with committing this as-is with the understanding that the pass will stay disabled until we resolve the issues with the overflow check. It looks substantially like what I expect the final form to be. LGTM

This revision is now accepted and ready to land.Jun 16 2020, 7:07 PM

Many thanks @efriedma and @samparker for all your help, really appreciated, will also mention that in the commit message.

And many thanks for your thoughts on the overflow behaviour. It looks like I will need to explore a few different strategies. Easiest would be if SCEV just understood our expressions. When I debugged SCEV, while I didn't yet fully get to the bottom of it, my impression is that SCEV (its different helpers) does a lot of pattern matching, tries different strategies, and finally compares the expression against the loop-guard expression. So I don't know yet how easy or feasible it is to fit this analysis in. That's why I tend to prefer revising the definition of llvm.get.active.lane.mask, because that looks easier. I appreciate that is somewhat working around SCEV limitations, which may or may not be a strong argument.

This revision was automatically updated to reflect the committed changes.