This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen][SVE] Use whilelo instruction when lowering @llvm.get.active.lane.mask
ClosedPublic

Authored by david-arm on Nov 24 2021, 8:42 AM.

Details

Summary

In most common cases the @llvm.get.active.lane.mask intrinsic maps directly
to the SVE whilelo instruction, which already takes overflow into account.
However, currently in SelectionDAGBuilder::visitIntrinsicCall we always lower
this immediately to a generic sequence of instructions that explicitly
take overflow into account. This makes it very difficult to then later
transform back into a single whilelo instruction. Therefore, this patch
introduces a new TLI function called shouldExpandGetActiveLaneMask that asks if
we should lower/expand this to a sequence of generic ISD nodes, or instead
just leave it as an intrinsic for the target to lower.

You can see the significant improvement in code quality for some of the
tests in this file:

CodeGen/AArch64/active_lane_mask.ll

Diff Detail

Event Timeline

david-arm created this revision.Nov 24 2021, 8:42 AM
david-arm requested review of this revision.Nov 24 2021, 8:42 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 24 2021, 8:42 AM

Just wanted to check how this works for SVE.
For MVE, we have to work quite hard to lower this to MVE specific intrinsic. In MVETailPredication.cpp we have a function IsSafeActiveMask that performs the following checks:

// The active lane intrinsic has this form:
//
//    @llvm.get.active.lane.mask(IV, TC)
//
// Here we perform checks that this intrinsic behaves as expected,
// which means:
//
// 1) Check that the TripCount (TC) belongs to this loop (originally).
// 2) The element count (TC) needs to be sufficiently large that the decrement
//    of element counter doesn't overflow, which means that we need to prove:
//        ceil(ElementCount / VectorWidth) >= TripCount
//    by rounding up ElementCount up:
//        ((ElementCount + (VectorWidth - 1)) / VectorWidth
//    and evaluate if expression isKnownNonNegative:
//        (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount
// 3) The IV must be an induction phi with an increment equal to the
//    vector width.

All these checks are performed because we transform the loop into a different form (a tail-predicated one ) and new expressions are introduced that should not overflow. The rationale was that it wouldn't be difficult to imagine that a user-facing intrinsic will be introduced one day, in which we need these checks to see if things behave as expected (@efriedma can correct me if I'm wrong here). Please note that currently this intrinsic is emitted only by the loop vectoriser, and these issues should not occur.

Now the question is, we don't perform any of these checks, but do we need them?
The sanity check that the IV belongs to a loop would be nice, but that seems problematic here because in SelectionDAG we don't have the Loop view.
The overflow check that we do for MVE and the element count expression and decrement that we introduce does not seem to be applicable for SVE, is that right? That would be good news because that was the hardest part for MVE.

llvm/include/llvm/CodeGen/TargetLowering.h
430

... or should remain as an intrinsic.

Perhaps clarify that intrinsic is a target specific intrinsic in that case.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1509

I was expecting return false here, now I am a bit confused...

Hi @SjoerdMeijer, so after looking into the intrinsic a bit more and talking with @paulwalker-arm, we realised that LoopVectorize.cpp isn't doing any specific overflow checks, which means that when lowering the intrinsic we have to start worrying about overflow. Fortunately for SVE, the whilelo instruction actually takes care of overflow already, i.e. there is a 1:1 mapping between the intrinsic and the instruction for many cases. This means that we don't really need to do any of the checks that MVE does for overflow because there is an already efficient way of doing this for SVE. Also, according to the documentation the whilelo instruction takes a start and end value and increments a counter from start to end based on the number of elements requested, i.e. <vscale x 16> elements for a <vscale x 16 x i1> type. So we know exactly which types are safe to map to the instruction and which are not. For some types like <vscale x 32 x i1> (which are illegal and have no mapping) we may also benefit from some of the same checks we do for MVE if we want to improve code quality.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1509

Perhaps the name of the TLI function could be better. What I mean by lowerGetActiveLaneMask is lower to generic ISD nodes (UADDO, etc.), and do the overflow checks explicitly. So returning true here means we let SelectionDAGBuilder lower the intrinsic at that point. If we return false here we're basically asking to leave it as an intrinsic.

I think the main difference between SVE and MVE is that for MVE we have to set up some state before entering the loop, and perhaps the actual instruction that generates the predicate uses that state? Whereas for SVE we don't have that problem.

Hi @SjoerdMeijer, so after looking into the intrinsic a bit more and talking with @paulwalker-arm, we realised that LoopVectorize.cpp isn't doing any specific overflow checks, which means that when lowering the intrinsic we have to start worrying about overflow. Fortunately for SVE, the whilelo instruction actually takes care of overflow already, i.e. there is a 1:1 mapping between the intrinsic and the instruction for many cases. This means that we don't really need to do any of the checks that MVE does for overflow because there is an already efficient way of doing this for SVE. Also, according to the documentation the whilelo instruction takes a start and end value and increments a counter from start to end based on the number of elements requested, i.e. <vscale x 16> elements for a <vscale x 16 x i1> type. So we know exactly which types are safe to map to the instruction and which are not. For some types like <vscale x 32 x i1> (which are illegal and have no mapping) we may also benefit from some of the same checks we do for MVE if we want to improve code quality.

Okay, perhaps I should have been clearer with "overflow checks" and the LoopVectorizer. Basically what I wanted to say is that a vectorized loop/body provides us with a few guarantees, i.e. we know the loop executes at least once, and the loop body processes at least VectorWidth elements. If we could assume this in the MVE lowering pass, then we wouldn't have to do all these overflow and sanity checks. We can't make these assume because the intrinsic could be emitted in the LV, but also somewhere else where we don't have these guarantees.

I think the main difference between SVE and MVE is that for MVE we have to set up some state before entering the loop, and perhaps the actual instruction that generates the predicate uses that state? Whereas for SVE we don't have that problem.

Yep, and I would need to refresh my memory and read the reference manual, but I am almost certain that for MVE we would get wrong results if overflow occurs in the VCTP instructions.

I think the main difference between SVE and MVE is that for MVE we have to set up some state before entering the loop, and perhaps the actual instruction that generates the predicate uses that state? Whereas for SVE we don't have that problem.

Yep, and I would need to refresh my memory and read the reference manual, but I am almost certain that for MVE we would get wrong results if overflow occurs in the VCTP instructions.

Could you add Metadata to the intrinsic to state the assumptions?

Matt added a subscriber: Matt.Nov 25 2021, 8:52 AM
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
7112

Probably bike shedding but I feel if there was an ISD node for this intrinsic then the code this function prevents would likely sit in a function called expandGetActiveLaneMask so perhaps a better name is shouldExpandGetActiveLaneMask?

I also think that given such a name the documentation can be very lightweight given expansion is a well established concept within the code generator.

I think the main difference between SVE and MVE is that for MVE we have to set up some state before entering the loop, and perhaps the actual instruction that generates the predicate uses that state? Whereas for SVE we don't have that problem.

Yep, and I would need to refresh my memory and read the reference manual, but I am almost certain that for MVE we would get wrong results if overflow occurs in the VCTP instructions.

Could you add Metadata to the intrinsic to state the assumptions?

Something along those lines would probably be possible, but defeats the purpose of the intrinsic, which is about communicating information from the middle-end to the back-end. If we need an intrinsic and also meta-data to achieve this, then we should probably revise the definition of the intrinsic. But it looks like we don't need to do that at the moment, and we are also good for SVE.

I think the main difference between SVE and MVE is that for MVE we have to set up some state before entering the loop, and perhaps the actual instruction that generates the predicate uses that state? Whereas for SVE we don't have that problem.

Yep, and I would need to refresh my memory and read the reference manual, but I am almost certain that for MVE we would get wrong results if overflow occurs in the VCTP instructions.

Could you add Metadata to the intrinsic to state the assumptions?

Something along those lines would probably be possible, but defeats the purpose of the intrinsic, which is about communicating information from the middle-end to the back-end. If we need an intrinsic and also meta-data to achieve this, then we should probably revise the definition of the intrinsic. But it looks like we don't need to do that at the moment, and we are also good for SVE.

Totally up to you. Either state in langref that is does not overflow. Or have a different intrinsic. `llvm.get.active.lane.mask_without_overflow'.

david-arm updated this revision to Diff 389987.Nov 26 2021, 3:54 AM
david-arm edited the summary of this revision. (Show Details)
  • Changed TLI interface to shouldExpandGetActiveLaneMask and updated the function comment.
paulwalker-arm accepted this revision.Nov 26 2021, 4:03 AM
This revision is now accepted and ready to land.Nov 26 2021, 4:03 AM