This is an archive of the discontinued LLVM Phabricator instance.


Differential D86147

[LangRef] Revise semantics of get.active.lane.mask
Closed · Public

Authored by SjoerdMeijer on Aug 18 2020, 9:26 AM.

Details

Reviewers

efriedma
simoll
vkmr
rogfer01
samparker
fhahn
rkruppe
jdoerfert

Commits

rG2002bb487898: [LangRef] Revise semantics of intrinsic get.active.lane.mask

Summary

A first version of get.active.lane.mask was committed in rG7fb8a40e5220. One of the main purposes and uses of this intrinsic is to communicate information to the back-end, but its current definition and semantics make this actually very difficult. The intrinsic is defined as:

@llvm.get.active.lane.mask(%IV, %BTC)

where %BTC is the Backedge-Taken Count (variable names are different in the LangRef spec). This implicitly communicates the loop tripcount, which can be reconstructed by calculating BTC + 1. But it has been very difficult to prove that calculating BTC + 1 is safe and doesn't overflow. We need complicated range and SCEV analysis, so the intrinsic isn't really solving the problem it was meant to solve. Examples of the overflow checks that are required in the (ARM) back-end are D79175 and D86074, which aren't even complete/correct yet.
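The overflow concern can be illustrated with a small sketch. It assumes a 32-bit induction type (the summary does not fix a width) and models wrapping arithmetic with a bit mask; `trip_count_from_btc` is a hypothetical helper, not an LLVM API.

```python
BITS = 32                 # assumed width of the induction variable type
MASK = (1 << BITS) - 1    # i32 arithmetic wraps modulo 2**32


def trip_count_from_btc(btc: int) -> int:
    """Reconstruct the tripcount as BTC + 1 in wrapping i32 arithmetic."""
    return (btc + 1) & MASK


# Ordinary case: the reconstruction is exact.
print(trip_count_from_btc(99))

# Problem case: a BTC of UINT32_MAX wraps to a tripcount of 0.
print(trip_count_from_btc(0xFFFFFFFF))
```

A back-end that only sees the BTC has to rule out the second case before it can use BTC + 1, which is where the range and SCEV analysis comes in.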

To solve this problem, I am looking at alternative definitions/semantics for get.active.lane.mask to avoid all the complicated overflow analysis.

One obvious alternative is not to communicate the BTC but the loop tripcount instead. Now using LangRef's variable names, this means changing the current semantics from:

icmp ule (%base + i), %n

to:

icmp ule (%base + i), %n - 1

where %n > 0, and corresponds to the loop tripcount. The intrinsic signature remains the same.
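As a rough illustration of the proposed change, the sketch below evaluates both comparisons lane by lane using Python's unbounded integers, so no wrapping occurs. The function names and the lane count of 4 are assumptions made for the example only.

```python
def mask_old(base: int, btc: int, lanes: int) -> list:
    """Current semantics: the second argument is the backedge-taken count."""
    return [(base + i) <= btc for i in range(lanes)]


def mask_new(base: int, n: int, lanes: int) -> list:
    """Proposed semantics: the second argument is the tripcount, %n > 0."""
    assert n > 0, "the proposal requires %n > 0"
    return [(base + i) <= n - 1 for i in range(lanes)]


btc = 9          # scalar loop takes the backedge 9 times
n = btc + 1      # so it runs 10 iterations

# Last vector iteration of a 4-wide loop: lanes 8 and 9 are active.
print(mask_old(8, btc, 4))
print(mask_new(8, n, 4))
```

Both calls produce the same mask. The difference is that with the tripcount passed in directly, the consumer no longer has to compute BTC + 1 itself.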

I have marked this as Work-In-Progress (WIP) as I am looking for early feedback on this while I prototype and plumb these new semantics through the middle-end and back-end, and make sure I haven't missed anything.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

SjoerdMeijer created this revision. · Aug 18 2020, 9:26 AM

Herald added a reviewer: jdoerfert. · Aug 18 2020, 9:26 AM

Herald added a project: Restricted Project.

Herald added subscribers: javed.absar, kristof.beyls.

SjoerdMeijer requested review of this revision. · Aug 18 2020, 9:26 AM

I think this makes sense, assuming you're comfortable with the current code in the ARM backend for proving that %n - %base doesn't overflow.

Just for complete clarity in my mind of the problem: 'i' defined as the maximum vector width for the loop? So we would expect a loop body like this:

loop:
  %base = phi i32 [ 0, %entry ], [ %base.next, %loop ]
  %count = phi i32 [ 0, %entry ], [ %count.next, %loop ]
  %mask = get.active.lane.mask(i32 %base, i32 %n)
  %base.next = add i32 %base, %vector.width
  %count.next = add nuw i32 %count, 1
  %cmp = icmp ne i32 %count.next, %vector.trip.count
  br i1 %cmp, label %loop, label %exit

And so now we need to prove that (%base + (%vector.trip.count * %vector.width)) doesn't overflow for (%vector.trip.count - 1) iterations? If so, that sounds like a task for AddRecExpr::evaluateAtIteration?

icmp ule (%base + i), %n - 1

How about icmp ult (%base + i), %n?
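The loop sketched above, and the suggested `ult` form, can be tried out with a small simulation. It assumes `%vector.trip.count` is the tripcount divided by the vector width, rounded up, which the thread does not spell out; all names here are hypothetical and the arithmetic is done in unbounded integers.

```python
def active_lane_mask(base: int, n: int, width: int, strict: bool = True) -> list:
    """Per-lane comparison: `ult %n` if strict, else `ule %n - 1`."""
    if strict:
        return [(base + i) < n for i in range(width)]
    return [(base + i) <= n - 1 for i in range(width)]


def simulate_loop(n: int, width: int) -> list:
    """Mirror the loop body: compute the mask, bump %base, count iterations."""
    vector_trip_count = (n + width - 1) // width   # assumed: ceil(n / width)
    base, count, masks = 0, 0, []
    while True:
        masks.append(active_lane_mask(base, n, width))
        base += width          # %base.next = add i32 %base, %vector.width
        count += 1             # %count.next = add nuw i32 %count, 1
        if count == vector_trip_count:
            break
    return masks


for mask in simulate_loop(10, 4):
    print(mask)
```

With a tripcount of 10 and a width of 4, only the third iteration has inactive lanes. For any `n > 0` the `strict=False` variant gives the same masks, which is the equivalence the `ult` suggestion relies on.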

Thanks @efriedma and @samparker !

In D86147#2224773, @efriedma wrote:

I think this makes sense, assuming you're comfortable with the current code in the ARM backend for proving that %n - %base doesn't overflow.

The overflow check for BTC + 1 was the main show stopper at the moment, so getting rid of that is a big step forward which I think we should progress anyway. Slightly orthogonal to that are the next checks we need to do. You mentioned "proving that %n - %base doesn't overflow", which I think refers to these checks in the ARM backend:

The element count needs to be sufficiently large that the decrement of element counter doesn't overflow,

where the element count is %n in our examples used here. I am not entirely happy with the checks currently for this because as you remarked, they are conservative but safe. Since I haven't seen it rejecting cases that we want to support, I am overall not too unhappy with this. Sam's comments related to this are also very useful and interesting:

Just for complete clarity in my mind of the problem: 'i' defined as the maximum vector width for the loop?

Yep.

So we would expect a loop body like this:

loop:
  %base = phi i32 [ 0, %entry ], [ %base.next, %loop ]
  %count = phi i32 [ 0, %entry ], [ %count.next, %loop ]
  %mask = get.active.lane.mask(i32 %base, i32 %n)
  %base.next = add i32 %base, %vector.width
  %count.next = add nuw i32 %count, 1
  %cmp = icmp ne i32 %count.next, %vector.trip.count
  br i1 %cmp, label %loop, label %exit

Yep. That's exactly right I think. Depending on where we are in the pipeline, we don't need %count, but that doesn't change anything about this observation:

And so now we need to prove that (%base + (%vector.trip.count * %vector.width)) doesn't overflow for (%vector.trip.count - 1) iterations? If so, that sounds like a task for AddRecExpr::evaluateAtIteration?

which looks like the overflow problem formulated differently (than what we currently have), and might be easier to analyse.

I am now first going to prototype replacing the BTC with the tripcount to avoid the overflow checks for BTC+1 and will put the different patches up for review if I haven't found any new surprises.
After that, I will progress the %n - %base checks, and see if we can improve that using Sam's suggestions.

SjoerdMeijer mentioned this in D86074: [ARM][MVE] Tail-predication: check get.active.lane.mask's TC value. · Aug 19 2020, 2:53 AM

SjoerdMeijer mentioned this in D86301: [Verifier] Additional check for get.active.lane.mask. · Aug 20 2020, 9:15 AM

SjoerdMeijer mentioned this in D86302: [SelectionDAG] Modify legalizing intrinsic get.active.lane.mask. · Aug 20 2020, 9:22 AM

SjoerdMeijer mentioned this in D86303: [ARM][MVE] Tail-predication: remove the BTC + 1 overflow checks. · Aug 20 2020, 9:33 AM

SjoerdMeijer mentioned this in D86304: [LV] get.active.lane.mask consuming/emitting TripCount instead of backedge-taken count. · Aug 20 2020, 9:42 AM

SjoerdMeijer retitled this revision from [LangRef] WIP: Revise semantics of get.active.lane.mask to [LangRef] Revise semantics of get.active.lane.mask. · Aug 20 2020, 10:59 AM

I have removed the WIP tag because I think this is behaving as expected in the patches that I adapted to this new behaviour (D86304, D86302, D86301, and D86303).

icmp ule (%base + i), %n - 1
How about icmp ult (%base + i), %n?

I have kept the %n - 1 for now, because it makes it a bit more explicit here that we pass in the tripcount with %n, but that the comparison is done with the backedge-taken count %n - 1.
But in the legalizer patch, D86302, I do expand it to: icmp ult (%base + i), %n

But if you're going to expand it like that, then why not just state those semantics in the language ref..? I think the description is rather verbose and a bit confusing, and I'm already quite familiar with what it's supposed to do! By just using %n, you should be able to skip the references to BTC and make it 2x easier to understand.

Okidoki, now with that change.

Cheers!

This revision is now accepted and ready to land. · Aug 24 2020, 11:54 PM

samparker added inline comments. · Aug 25 2020, 1:04 AM

llvm/docs/LangRef.rst
16936–16937	bah, I missed the 'ule' and there's an original typo: imcp

Thanks for catching that.
(weird, thought I had fixed that, but cheers)

SjoerdMeijer mentioned this in rGae366479e8c6: [LV] get.active.lane.mask consuming tripcount instead of backedge-taken count. · Aug 25 2020, 5:52 AM

SjoerdMeijer mentioned this in rGc352e7fbda2f: [ARM][MVE] Tail-predication: remove the BTC + 1 overflow checks. · Aug 25 2020, 6:38 AM

SjoerdMeijer mentioned this in rG39522b1e1042: [SelectionDAG] Legalize intrinsic get.active.lane.mask. · Aug 25 2020, 7:01 AM

SjoerdMeijer mentioned this in rG8d5f64c4edbc: [Verifier] Additional check for intrinsic get.active.lane.mask. · Aug 25 2020, 7:45 AM

Closed by commit rG2002bb487898: [LangRef] Revise semantics of intrinsic get.active.lane.mask (authored by SjoerdMeijer). · Aug 25 2020, 8:24 AM

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer added a commit: rG2002bb487898: [LangRef] Revise semantics of intrinsic get.active.lane.mask.

the description in LangRef.rst reminds me very much of "data-dependent fail on first". in particular, that on "overflow" the m[i] values are set to False.

given especially that the example given is for LOAD it may be worthwhile reviewing to ensure that ISAs with ffirst mask capability can be used, here.

in particular, one characteristic of ffirst on LOAD operations is that any page fault exception caused by anything other than the very first element (n=0) is IGNORED and Vector Length explicitly truncated by the hardware to the element *prior* to the exception.

if however the page fault occurs at element 0 then the exception must be raised as for any scalar operation.

Vector LOADs therefore implicitly provide information about which elements were actually successfully retrieved from memory, and parallel vector processing may take place without needing to perform any extraneous checking.

supporting this in llvm.get.active.lane.mask would result in some extremely compact assembler.

Hi Luke, thanks for sharing your thoughts. I agree with your analysis. The in-tree vector extension that I am aware of that supports first faulting loads is Arm's SVE. While I work on Arm's MVE, I hope and think this is useful for SVE (and other targets) too, i.e. I think ffirst mask capability can be used. But since the devil is in the details here, an implementation would need to prove this. Hopefully that happens soon.

In D86147#2258555, @SjoerdMeijer wrote:

Hi Luke, thanks for sharing your thoughts. I agree with your analysis. The in-tree vector extension that I am aware of that supports first faulting loads is Arm's SVE. While I work on Arm's MVE, I hope and think this is useful for SVE (and other targets) too, i.e. I think ffirst mask capability can be used. But since the devil is in the details here, an implementation would need to prove this. Hopefully that happens soon.

i am aware that RVV has fail-on-first, and we are adding it to SimpleV as well.

SV will also have _data_ dependent fail-on-first and given that we are extending PowerISA this will likely be done through vectorisation of Condition Registers.

i.e. if a particular CR bit (which is set based on whether the result of the instruction is zero, +ve or -ve) is set, this is taken to be the "fail" of "fail on first".

it means that ffirst and masking applies to data rather than just LD/STs.

one thing to watch out for and definitely clarify: is the mask set on *all* elements?

ffirst very specifically modifies the Vector Length and this is radically different from plain predicate masking.

predicate masking masks any elements anywhere in the vector but does not modify the vector length.
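The distinction drawn above can be sketched as a toy model. It is not the behaviour of any specific ISA (SVE, RVV, or SimpleV); the function names, the `None` placeholder for inactive lanes, and the use of `MemoryError` to stand in for a page fault are all assumptions made for illustration.

```python
def predicated_load(memory: list, mask: list) -> list:
    """Plain predication: any lane may be inactive; vector length is unchanged."""
    return [v if m else None for v, m in zip(memory, mask)]


def fail_on_first_load(memory: list, faults: list, vl: int):
    """ffirst-style load: returns (loaded elements, new vector length).

    A fault on element 0 is raised as for a scalar load. A fault on a later
    element is ignored and the vector length is truncated to the elements
    before it.
    """
    for i in range(vl):
        if faults[i]:
            if i == 0:
                raise MemoryError("fault on first element")
            return memory[:i], i
    return memory[:vl], vl


data = [10, 20, 30, 40]

# Predication can leave a hole in the middle; the length stays 4.
print(predicated_load(data, [True, False, True, True]))

# A fault on element 2 truncates the vector length to 2.
print(fail_on_first_load(data, [False, False, True, False], 4))
```

In this model the result of a fail-on-first load is always a prefix of the vector, whereas a predicate mask can deactivate arbitrary lanes.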

Diff 287677

llvm/docs/LangRef.rst

 Semantics:
 """"""""""

 The '``llvm.get.active.lane.mask.*``' intrinsics are semantically equivalent
 to:

 ::

-      %m[i] = icmp ule (%base + i), %n
+      %m[i] = icmp ult (%base + i), %n

 where ``%m`` is a vector (mask) of active/inactive lanes with its elements
 indexed by ``i``, and ``%base``, ``%n`` are the two arguments to
-``llvm.get.active.lane.mask.*``, ``%imcp`` is an integer compare and ``ule``
-the unsigned less-than-equal comparison operator. Overflow cannot occur in
-``(%base + i)`` and its comparison against ``%n`` as it is performed in integer
-numbers and not in machine numbers. The above is equivalent to:
+``llvm.get.active.lane.mask.*``, ``%icmp`` is an integer compare and ``ult``
    samparker (inline comment): bah, I missed the 'ule' and there's an original typo: imcp
+the unsigned less-than comparison operator. Overflow cannot occur in
+``(%base + i)`` and its comparison against ``%n`` with ``%n > 0``, as it is
+performed in integer numbers and not in machine numbers. The above is
+equivalent to:

 ::

       %m = @llvm.get.active.lane.mask(%base, %n)

-This can, for example, be emitted by the loop vectorizer. Then, ``%base`` is
-the first element of the vector induction variable (VIV), and ``%n`` is the
-Back-edge Taken Count (BTC). Thus, these intrinsics perform an element-wise
-less than or equal comparison of VIV with BTC, producing a mask of true/false
-values representing active/inactive vector lanes, except if the VIV overflows
-in which case they return false in the lanes where the VIV overflows. The
-arguments are scalar types to accommodate scalable vector types, for which it is
-unknown what the type of the step vector needs to be that enumerate its
+This can, for example, be emitted by the loop vectorizer in which case
+``%base`` is the first element of the vector induction variable (VIV) and
+``%n`` is the loop tripcount. Thus, these intrinsics perform an element-wise
+less than comparison of VIV with the loop tripcount, producing a mask of
+true/false values representing active/inactive vector lanes, except if the VIV
+overflows in which case they return false in the lanes where the VIV overflows.
+The arguments are scalar types to accommodate scalable vector types, for which
+it is unknown what the type of the step vector needs to be that enumerate its
 lanes without overflow.

 This mask ``%m`` can e.g. be used in masked load/store instructions. These
 intrinsics provide a hint to the backend. I.e., for a vector loop, the
 back-edge taken count of the original scalar loop is explicit as the second
 argument.

