This is an archive of the discontinued LLVM Phabricator instance.

[LangRef] Revise semantics of get.active.lane.mask
ClosedPublic

Authored by SjoerdMeijer on Aug 18 2020, 9:26 AM.

Details

Summary

A first version of get.active.lane.mask was committed in rG7fb8a40e5220. One of the main purposes and uses of this intrinsic is to communicate information to the back-end, but its current definition and semantics make this actually very difficult. The intrinsic is defined as:

@llvm.get.active.lane.mask(%IV, %BTC)

where %BTC is the Backedge-Taken Count (variable names are different in the LangRef spec). This allows to implicitly communicate the loop tripcount, which can be reconstructed by calculating BTC + 1. But it has been very difficult to prove that calculating BTC + 1 is safe and doesn't overflow. We need complicated range and SCEV analysis, and thus the problem is that this intrinsic isn't really doing what it was supposed to solve. Examples of the overflow checks that are required in the (ARM) back-end are D79175 and D86074, which aren't even complete/correct yet.

To solve this problem, I am looking at alternative definitions/semantics for get.active.lane.mask to avoid all the complicated overflow analysis.

One obvious alternative is not to communicate the BTC but the loop tripcount instead. Now using LangRef's variable names, this means changing the current semantics from:

icmp ule (%base + i), %n

to:

icmp ule (%base + i), %n - 1

where %n > 0, and corresponds to the loop tripcount. The intrinsic signature remains the same.

I have marked this as Work-In-Progress (WIP) as I am looking for early feedback on this while I prototype and plumb this new semantics through the middle-end and back-end, and make sure I haven't missed anything.

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Aug 18 2020, 9:26 AM
Herald added a project: Restricted Project. · View Herald Transcript
SjoerdMeijer requested review of this revision.Aug 18 2020, 9:26 AM

I think this makes sense, assuming you're comfortable with the current code in the ARM backend for proving that %n - %base doesn't overflow.

Just for complete clarity in my mind of the problem: 'i' defined as the maximum vector width for the loop? So we would expect a loop body like this:

loop:
  %base = phi i32 [ 0, %entry ], [ %base.next, %loop ]
  %count = phi i32 [ 0, %entry ], [ %count.next, %loop ]
  %mask = get.active.lane.mask(i32 %base, i32 %n)
  %base.next = add i32 %base, %vector.width
  %count.next = add nuw i32 %count, 1
  %cmp = icmp ne i32 %count.next, %vector.trip.count
  br i1 %cmp, label %loop, label %exit

And so now we need to prove that (%base + (%vector.trip.count * %vector.width)) doesn't overflow for (%vector.trip.count - 1) iterations? If so, that sounds like a task for AddRecExpr::evaluateAtIteration?

icmp ule (%base + i), %n - 1

How about icmp ult (%base + i), %n?

Thanks @efriedma and @samparker !

I think this makes sense, assuming you're comfortable with the current code in the ARM backend for proving that %n - %base doesn't overflow.

The overflow check for BTC + 1 was the main show stopper at the moment, so getting rid of that is a big step forward which I think we should progress anyway. Slightly orthogonal or different to that is the next checks we need to do. You mentioned "proving that %n - %base doesn't overflow", which I think refers to this part/checks in the ARM backend:

  1. The element count needs to be sufficiently large that the decrement of element counter doesn't overflow,

where the element count is %n in our examples used here. I am not entirely happy with the checks currently for this because as you remarked, they are conservative but safe. Since I haven't seen it rejecting cases that we want to support, I am overall not too unhappy with this. Sam's comments related to this are also very useful and interesting:

Just for complete clarity in my mind of the problem: 'i' defined as the maximum vector width for the loop?

Yep.

So we would expect a loop body like this:

loop:
%base = phi i32 [ 0, %entry ], [ %base.next, %loop ]
%count = phi i32 [ 0, %entry ], [ %count.next, %loop ]
%mask = get.active.lane.mask(i32 %base, i32 %n)
%base.next = add i32 %base, %vector.width
%count.next = add nuw i32 %count, 1
%cmp = icmp ne i32 %count.next, %vector.trip.count
br i1 %cmp, label %loop, label %exit

Yep. That's exactly right I think. Depending on where we are in the pipeline, we don't need %count, but that doesn't change anything about this observation:

And so now we need to prove that (%base + (%vector.trip.count * %vector.width)) doesn't overflow for (%vector.trip.count - 1) iterations? If so, that sounds like a task for AddRecExpr::evaluateAtIteration?

which looks like the overflow problem formulated differently (than what we currently have), and might be easier to analyse.

I am now first going to prototype replacing the BTC with the tripcount to avoid the overflow checks for BTC+1 and will put the different patches up for review if I haven't found any new surprises.
After that, I will progress the %n - %base checks, and see if we can improve that using Sam's suggestions.

SjoerdMeijer retitled this revision from [LangRef] WIP: Revise semantics of get.active.lane.mask to [LangRef] Revise semantics of get.active.lane.mask.Aug 20 2020, 10:59 AM
SjoerdMeijer added a comment.EditedAug 20 2020, 11:08 AM

I have removed the WIP tag because I think this is behaving as expected in the patches that I adapted to this new behaviour (D86304, D86302, D86301, and D86303).

icmp ule (%base + i), %n - 1

How about icmp ult (%base + i), %n?

I have kept the %n - 1 for now, because it makes a bit more explicit here that we pass in the tripcount with %n, but that that the comparison is done with the backedge-taken count %n-1.
But in the legalizer patch, D86302, I do expand it to: icmp ult (%base + i), %n

But if you're going to expand it like that, then why not just state those semantics in the language ref..? I think the description is rather verbose and a bit confusing, and I'm already quite familiar with what it's supposed to do! By just using %n, you should be able to skip the references to BTC and make it 2x easier to understand.

Okidoki, now with that change.

samparker accepted this revision.Aug 24 2020, 11:54 PM

Cheers!

This revision is now accepted and ready to land.Aug 24 2020, 11:54 PM
samparker added inline comments.Aug 25 2020, 1:04 AM
llvm/docs/LangRef.rst
16936–16937

bah, I missed the 'ule' and there's an original typo: imcp

Thanks for catching that.
(weird, thought I had fixed that, but cheers)

lkcl added a subscriber: lkcl.Aug 28 2020, 7:16 AM

the description in LLVMRef.rst reminds me very much of "data-dependent fail on first". in particular, that on "overflow" the m[i] values are set to False.

given especially that the example given is for LOAD it may be worthwhile reviewing to ensure that ISAs with ffirst mask capability can be used, here.

in particular, one characteristic of ffirst on LOAD operations is that any page fault exception caused by anything other than the very first element (n=0) is IGNORED and Vector Length explicitly truncated by the hardware to the element *prior* to the exception.

if however the page fault occurs at element 0 then the exception must be raised as for any scalar operation.

Vector LOADs therefore implicitly provide information about which elements were actually successfully retrieved from memory, and parallel vector processing may take place without needing to perform any extraneous checking.

supporting this in llvm.get.active.lane.mask would result in some extremely compact assembler.

Hi Luke, thanks for sharing your thoughts. I agree with your analysis. The in-tree vector extension that I am aware of that supports first faulting loads is Arm's SVE. While I work on Arm's MVE, I hope and think this is useful for SVE (and other targets) too, i.e. I think ffirst mask capability can be used. But since the devil is in the details here, an implementation would need to prove this. Hopefully that happens soon.

lkcl added a comment.Sep 9 2020, 10:53 AM

Hi Luke, thanks for sharing your thoughts. I agree with your analysis. The in-tree vector extension that I am aware of that supports first faulting loads is Arm's SVE. While I work on Arm's MVE, I hope and think this is useful for SVE (and other targets) too, i.e. I think ffirst mask capability can be used. But since the devil is in the details here, an implementation would need to prove this. Hopefully that happens soon.

:)

i am aware that RVV has fail-on-first, and we are adding it to SimpleV as well.

SV will also have _data_ dependent fail-on-first and given that we are extending PowerISA this will likely be done through vectorisation of Condition Registers.

i.e. if a particular CR bit (which is set based on the result of the instruction is zero, +ve or -ve) is set this is taken to be the "fail" of "fail on first".

it means that ffirst and masking applies to data rather than just LD/STs.

one thing to watch out for and definitely clarify: is the mask set on *all* elements?

ffirst very specifically modifies the Vector Length and this is radically different from plain predicate masking.

predicate masking masks any elements anywhere in the vector but does not modify the vector length.