
[VP,Integer,#1] Vector-predicated integer intrinsics
Closed, Public

Authored by simoll on Nov 6 2019, 6:19 AM.

Details

Summary

This patch adds IR intrinsics for vector-predicated integer arithmetic.

It is subpatch #1 of the integer slice of LLVM-VP.
LLVM-VP is a larger effort to bring native vector predication to LLVM.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.
sdesmalen added inline comments. Nov 18 2019, 10:26 AM
llvm/docs/LangRef.rst
15335

My problem with passthru is that it encourages people to specify arbitrary values on masked-off lanes and to do so often. Yes, for some combination of target architecture, instruction, passthru value and type it might be a perfect match for one machine instruction. But, for most combinations you will have to emit an extra select anyway.

The most common passthru values in practice will be one of the source operands or zero/undef though. These are supported by SVE/SVE2 as a first-class feature (and I could imagine other/future vector architectures having special support for that as well). Other values can indeed be lowered to an explicit select instruction.

But there is more: for things like coalescing predicated registers and vector register spilling, it is helpful to have undef-on-masked-off and a separate select instruction that can be re-scheduled.

At that point the intrinsic will already be lowered to individual instructions, so I don't think the definition of the intrinsic has much of an impact on that.

llvm/docs/LangRef.rst
15335

Yes, you're right. Having an implicit undef and explicit merge is good.

simoll marked an inline comment as done. Nov 18 2019, 10:54 AM
simoll added inline comments.
llvm/docs/LangRef.rst
15335

Suggesting that other architectures can just turn off the feature by setting the %evl parameter to -1 doesn't change the fact that the code generator still needs to handle the IR if %evl is not -1.

Other targets do not need to consider %evl because there will be an ExpandVectorPredication pass that folds the EVL into the vector mask. Backends can request through TTI that the %evl be folded away and ignore that parameter from there on.

An early version of that pass is included in the VP reference patch. The TTI changes and the expansion pass are planned for the next patch after this one.
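As a rough illustration of what such an expansion could do (a hypothetical sketch in Python, not LLVM's actual implementation), folding the EVL into the mask means computing a new mask that is false for every lane at or past the EVL:

```python
def fold_evl_into_mask(mask, evl):
    """Fold an explicit vector length into a lane mask.

    A negative evl means "EVL disabled": the mask is returned unchanged.
    Otherwise every lane with index >= evl is forced to False, so the
    result can be used together with evl = -1 from there on.
    """
    if evl < 0:
        return list(mask)
    return [m and (i < evl) for i, m in enumerate(mask)]
```

After this fold, a backend without native EVL support only ever sees the combined mask and a constant -1 EVL argument.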

SjoerdMeijer added inline comments. Nov 22 2019, 7:54 AM
llvm/docs/LangRef.rst
15335

I don't think this discussion reached consensus, and I missed the nuances here earlier / changed my mind on it.

The prior art on this also uses an explicit passthru, and it looks more generic than patching that up with a select later; you can always pass an undef in?

I also read the description of %evl on line 14664 above again, and thought the definition could be tightened a bit, i.e. specify an upper bound, also just for clarity:

The explicit vector length parameter always has the type i32 and is a signed integer value. The explicit vector length is only effective if it is non-negative. Results are only computed for enabled lanes. A lane is enabled if the mask at that position is true and, if effective, where the lane position is below the explicit vector length.

And then started drafting something to see if I could describe the proposed behaviour:

The explicit vector length parameter always has the type i32 and is a signed integer value. The explicit vector length (%evl) is only effective if it is non-negative, and when that is the case, the effective vector length is used to calculate a mask, and its value is:

0 <= evl <= W,   where W is the vector length.

This creates a mask, %EVLmask, with all lanes 0 <= i < evl set to True, and all other lanes evl <= i < W set to False.

A new mask %M is calculated with an element-wise AND from %mask and %EVLmask:

%M = %mask AND %EVLmask

A vector operation <opcode> on vectors A and B calculates:

A <opcode> B =  {  A[i] <opcode> B[i]   if M[i] = T, and
                {  undef otherwise

and if I'm not mistaken, we are now discussing whether undef here should stay undef or be a passthru value.
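The draft semantics above can also be written out as a scalar reference model (a hypothetical Python sketch; None stands in for undef):

```python
def vp_add(a, b, mask, evl):
    """Reference semantics of a vector-predicated integer add.

    A lane is enabled when its mask bit is true and, if evl is
    non-negative, its index is below evl. Disabled lanes are undef,
    modeled here as None.
    """
    w = len(a)  # W, the number of lanes
    enabled = [mask[i] and (evl < 0 or i < evl) for i in range(w)]
    return [a[i] + b[i] if enabled[i] else None for i in range(w)]
```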

15348

nit: perhaps more consistent %avl -> %evl ?

rkruppe added inline comments. Nov 24 2019, 7:34 AM
llvm/docs/LangRef.rst
15335

Sjoerd's summary matches my understanding of the question here.

I believe so far nobody spelled out that there's actually a material difference between a passthru argument and a separate select. The mask that is readily available for the select (the %mask passed to the VP instruction) isn't the same as the "enabled lanes" computed from mask and EVL (%M in Sjoerd's exposition above), which is extremely undesirable to materialize on machines with a native equivalent of EVL (such as SX-Aurora and RISC-V).

In particular, if %mask is computed by another VP instruction, then lanes of %mask past the relevant EVL are undef, not false. That means a select like select %mask, %v, zeroinitializer does not actually make the disabled lanes zero, past EVL they are again undef (assuming %v and %mask were computed by VP instructions with the same EVL). In contrast, a passthru value would (presumably, I didn't see anyone object) apply to all disabled lanes.

I don't know how much this matters, since usually consumers of the vector will also only work up to the same EVL. But there are exceptions, such as the code pattern for vectorizing (associative or -ffast-math) reductions for RISC-V V. Briefly put, that pattern is to accumulate a vector of partial (e.g.) sums in the loop, preserving lanes past the current EVL (which may be smaller than the EVL in some previous iterations) so they'll still participate in the final reduction after the loop. This can be expressed perfectly with passthru.
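For illustration, the loop pattern described here can be modeled in scalar Python (a hypothetical sketch, with the accumulator acting as the passthru value):

```python
def stripmined_sum(data, w=4):
    """Sketch of the described reduction pattern: partial sums are kept
    in a vector accumulator; lanes at or past the current EVL are
    preserved (passthru = accumulator) so shorter final iterations do
    not clobber partial sums from earlier, longer iterations.
    """
    acc = [0] * w
    for start in range(0, len(data), w):
        chunk = data[start:start + w]
        evl = len(chunk)  # current EVL, may be < w in the last iteration
        # vp.add with passthru = acc: enabled lanes updated, rest preserved
        acc = [acc[i] + chunk[i] if i < evl else acc[i] for i in range(w)]
    return sum(acc)  # final reduction across all lanes after the loop
```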

Without passthru, that kind of code seems to require juggling two EVLs (the actual current EVL for this iteration, and the maximum EVL across all iterations) as well as materializing what Sjoerd called %EVLmask and %M above. This seems much harder for a backend to pattern-match and turn into acceptable code than the select %m, vp.something(%a, %b, %m, -1), zeroinitializer pattern.

I previously felt that passthru would be nice to have for backend maintainers (including myself) but perhaps not worth the duplication of IR functionality (having two ways to do selects). However, given the differences I just described, I don't think "just use select" is workable.

That doesn't necessarily mean we need passthru arguments on every VP instruction, I could also imagine adding a vp.passthru(%v, %passthru, %mask, %evl) intrinsic that takes from the first operand for all enabled lanes and the second for the other lanes. But I feel like once the floodgates are opened for doing merging "differently", it's kind of a toss-up whether it's simpler to create a special intrinsic for it or make it part of the general recipe for VP instructions.
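The semantics of the suggested vp.passthru intrinsic would be straightforward to specify; a hypothetical scalar model in Python:

```python
def vp_passthru(v, passthru, mask, evl):
    """Select v[i] for enabled lanes (mask bit true and, if evl is
    non-negative, i < evl) and passthru[i] for every disabled lane."""
    return [v[i] if mask[i] and (evl < 0 or i < evl) else passthru[i]
            for i in range(len(v))]
```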

simoll marked 2 inline comments as done. Nov 25 2019, 2:41 AM

Moved the discussion to the RFC.

llvm/docs/LangRef.rst
15335
A <opcode> B =  {  A[i] <opcode> B[i]   if M[i] = T, and
                {  undef otherwise

That's an accurate description of %evl semantics.

15348

Yes. I'll change that to %evl.

simoll planned changes to this revision. Jan 9 2020, 5:02 AM
  • Clarify documentation on preserved lanes
  • explicit vlen arg is either negative or (new requirement) less than or equal to the number of lanes of the operation.

Just wanted to check your plans because the status is "planned changes to this revision".
I think I am ready to LGTM this, if someone else LGTMs this too, e.g. @rkruppe, but will wait in case I missed something or you're planning more changes.

simoll updated this revision to Diff 241737. Edited Jan 31 2020, 8:15 AM
  • Comments
  • Changed phrasing of the %mask && %evl function along @SjoerdMeijer's sketch. This covers the planned changes (%evl <= W).
  • NoSync change should probably go into a separate commit.
simoll marked 18 inline comments as done. Jan 31 2020, 8:19 AM

Unit tests: fail. 62371 tests passed, 3 failed and 839 were skipped.

failed: LLVM.CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll
failed: LLVM.CodeGen/AMDGPU/smrd.ll
failed: libc++.std/thread/thread_mutex/thread_mutex_requirements/thread_mutex_requirements_mutex/thread_mutex_class/try_lock.pass.cpp


I know it's very late for me to be bringing this up, but I have a couple of broad questions -- one minor, one less minor.

First, when we introduced the constrained FP intrinsics Chandler suggested inserting ".experimental" in the names because of the likelihood that something would need to change before we were done and the "experimental" modifier would make the auto-upgrader handling slightly less ugly. That same reasoning seems like it would apply here.

Second, I'm not 100% comfortable with how the explicit vector length argument is represented. Yes, I know, I should have brought this up about a year ago. I don't have any objections to the existence of this argument. I understand that we need some way to handle it. I'm just wondering if we can handle it in some way that makes it more like an optional argument. I'm not entirely clear where this value comes from, but it seems like whatever generated it must know that we're generating SVE code, right? My concern is that for targets that don't do SVE kinds of things we still have this entirely redundant, if not meaningless, operand to carry around. What would you think of this being provided as an operand bundle when it is needed?

I'm not entirely clear where this value comes from, but it seems like whatever generated it must know that we're generating SVE code, right?

This is not architecture specific, and thus this is not assuming SVE. In the case of SVE, the vector length is unknown at compile time (it is a multiple of something), and constant at run-time. In the RISC-V vector extension, I believe it can be changed at any point. Thus, the way to obtain this value is by reading/writing a status register, or something similar, but relying on querying the architecture features. And whether it is constant at runtime, or can be changed at any point, this API seems to cover the different cases.

You're mixing up different meanings of "vector length" here:

  • The number of elements in a scalable vector is a constant multiple of vscale which is unknown at compile time but constant during any given program execution, for both SVE and RISC-V V (though apparently not for the ISA Jacob et al. are working on, which raises all sorts of questions outside the scope of VP intrinsics so let's not get into that now)
  • The %evl parameter of VP intrinsics is a runtime value from 0 to (number of elements in the vector, be it scalable or not) which serves a similar purpose as the <W x i1> mask though it is tailored for a separate and complementary hardware feature. Both EVL and mask are for predication, i.e., not doing computations on some lanes of the vector depending on runtime conditions -- but those lanes still exist.

Although the EVL argument is not architecture specific in that it has the same semantics on all targets and can be legalized on all targets, Andrew is right that in practice, one would not generate vector code using the EVL argument to perform stripmining without knowing that the target architecture has efficient hardware support for it. Instead, that argument would be fixed to a constant -1 if the code does not want/need this concept. This is not too different from many other vector operations (e.g., shufflevector can express arbitrary shuffles, but many architectures support only a subset of those efficiently, and vectorization cost models take that into account). The real question is, do targets without native support for EVL have anything to gain from making the argument optional? I don't think so:

  • It does not make textual LLVM IR significantly more noisy: for example, a boring old 8xi32 masked store is call void @llvm.vp.store.v8i32.p0v8i32(<8 x i32> %val, <8 x i32>* %ptr, <8 x i1> %mask, i32 -1) and all you'd be saving by making the EVL argument optional is the i32 -1 part.
  • It is one additional operand per instruction, but given that almost all VP operations have 3+ arguments even without the EVL argument, getting rid of it seems like a pretty minor optimization (we can revisit this if VP operations become widely used and turn out to significantly affect memory consumption, but in that scenario we can make even greater gains by making them first-class instructions).
  • If this extra argument complicates code that uses IRBuilder or pattern matching for purposes where it should always be -1, we can have overloads that (construct/expect) a constant -1 as EVL argument. For example, builder.CreateVectorFAdd(lhs, rhs, mask) forwarding to builder.CreateVectorFAdd(lhs, rhs, mask, builder.getInt32(-1)).
  • It has no impact on code generation, since the canonical way to indicate that the "EVL feature" is not used by an instruction (the EVL argument being constant -1) is trivial to identify.

So I think that making this argument optional accomplishes nothing, it only makes support for targets that do care (RISC-V V, SX-Aurora VE) more complicated.

simoll added a comment.Feb 3 2020, 3:10 AM

I know it's very late for me to be bringing this up, but I have a couple of broad questions -- one minor, one less minor.

First, when we introduced the constrained FP intrinsics Chandler suggested inserting ".experimental" in the names because of the likelihood that something would need to change before we were done and the "experimental" modifier would make the auto-upgrader handling slightly less ugly. That same reasoning seems like it would apply here.

I don't mind changing the name to 'llvm.experimental.vp.add', etc . The mid to long-term goal is first-class VP instructions in any case.

Second, I'm not 100% comfortable with how the explicit vector length argument is represented. Yes, I know, I should have brought this up about a year ago. I don't have any objections to the existence of this argument. I understand that we need some way to handle it. I'm just wondering if we can handle it in some way that makes it more like an optional argument. I'm not entirely clear where this value comes from, but it seems like whatever generated it must know that we're generating SVE code, right? My concern is that for targets that don't do SVE kinds of things we still have this entirely redundant, if not meaningless, operand to carry around. What would you think of this being provided as an operand bundle when it is needed?

Operand bundles are not the same as optional parameters. Operand bundles pessimistically imply side-effects whereas a plain i32 argument is innocuous to optimizations:

https://llvm.org/docs/LangRef.html#operand-bundles

  • Calls and invokes with operand bundles have unknown read / write effect on the heap on entry and exit (even if the call target is readnone or readonly), unless they’re overridden with callsite specific attributes.

That is reasonable for constrained fp intrinsics, which always have such a dependence. Many VP intrinsics however do not have any side effects or only inspect/modify memory through their pointer arguments, etc.

<snip>

  • It has no impact on code generation, since the canonical way to indicate that the "EVL feature" is not used by an instruction (the EVL argument being constant -1) is trivial to identify.

So I think that making this argument optional accomplishes nothing, it only makes support for targets that do care (RISC-V V, SX-Aurora VE) more complicated.

Yep. Thanks for clarifying this. Once LLVM supports optional parameters without adverse effects, we can revisit this discussion; until then I'd strongly prefer the EVL to be an explicit parameter.

Ah, okay, I misunderstood. Thanks for clarifying.

I'm not entirely clear where this value comes from, but it seems like whatever generated it must know that we're generating SVE code, right?

This is not architecture specific, and thus this is not assuming SVE. In the case of SVE, the vector length is unknown at compile time (it is a multiple of something), and constant at run-time. In the RISC-V vector extension, I believe it can be changed at any point. Thus, the way to obtain this value is by reading/writing a status register, or something similar, but relying on querying the architecture features. And whether it is constant at runtime, or can be changed at any point, this API seems to cover the different cases.

Alright, I was reaching for a more general concept that I obviously don't know the term for. What I meant to say was just that whatever generated the argument must know something specific about the target architecture.

Operand bundles are not the same as optional parameters. Operand bundles pessimistically imply side-effects whereas a plain i32 argument is innocuous to optimizations:

https://llvm.org/docs/LangRef.html#operand-bundles

  • Calls and invokes with operand bundles have unknown read / write effect on the heap on entry and exit (even if the call target is readnone or readonly), unless they’re overridden with callsite specific attributes.

That is reasonable for constrained fp intrinsics, which always have such a dependence. Many VP intrinsics however do not have any side effects or only inspect/modify memory through their pointer arguments, etc.

We could add a callsite attribute when the intrinsic is created. We have that information about the intrinsic, right?

I'll admit that there may be no practical difference between what you've proposed and what I have in mind (in the abstract sense that assumes it's even possible to implement this in the way that I'm imagining). Mostly, I'm trying to explore what I perceive to be a bad smell in the IR. Having a parameter that is irrelevant for some targets just doesn't seem right. Introducing these intrinsics with "experimental" in the name would make me feel a bit better about that, even though again it has no practical effect.

The real question is, do targets without native support for EVL have anything to gain from making the argument optional?

I'm not sure I agree that is the real question. What is gained by not having the floating point constraint arguments always present even when they aren't needed? We have values that describe the default mode, so we could use those when we want the default mode. I think we all agree that they shouldn't be there when we don't need them, yet I would argue that those arguments are more relevant when "not needed" than the evl argument is.

I'm imagining walking someone through the IR, explaining what each instruction means. I'd be a bit embarrassed when I get to a vp intrinsic and say, "Ignore the last argument. We don't use it for this target."

But like I said above, this is much more a vague indication to me that we haven't got this right yet than a concrete problem.

andrew.w.kaylor added inline comments. Feb 3 2020, 5:26 PM
llvm/docs/LangRef.rst
15302

Nothing in this description makes it clear that the value should be negative for targets that don't support the explicit vector length argument.

What should happen if this value isn't negative for a target that doesn't support it?

15306

What happens if the value isn't in this range? Based on the comments below it seems that the effective mask created should be as if %evl were equal to W. If it is a non-constant value will the generated code somehow enforce that behavior?

simoll marked 2 inline comments as done. Feb 4 2020, 2:16 AM

To me all of this points in the direction that we'd want a standard mechanism for optional parameters. We have three options for this right now:
a) Come up with a novel, clean-slate scheme for optional parameters with default values (e.g. constrained fp default fp env values, i32 -1 for VP intrinsics). Some examples:

; all optional parameters set (eg RISC-V V or SX-Aurora)
%x= llvm.fp.add(%a, %b) mask(%mask), evl(%evl), fpenv(fpround.tonearest, fpexcept.strict)

; constrained fp SIMD intrinsic (eg AVX512)
%y= llvm.fp.add(%a, %b) mask(%mask), fpenv(fpround.tonearest, fpexcept.strict)

; unpredicated, default fp env fp add (isomorphic to the fadd instruction)
%z= llvm.fp.add(%a, %b)

This is the solution that I'd want to see in LLVM to solve the optional parameter problem once and for all for constrained fp *and* VP (we should look out for other potential applications and start an RFC).

b) OpBundles (imply side-effects unless overridden with callsite attributes) - clunky, could form the starting point for a) though.
c) A more exotic solution: add an extra opaque parameter and provide different intrinsics to produce the opaque value (I'm not aware of this being used anywhere right now). This is similar to the idea that @sdesmalen brought up (https://reviews.llvm.org/D69891#inline-632851), however, with the produced mask having an opaque type such that it behaves basically like an optional parameter tuple, i.e.:

declare opaquety  @llvm.vp.mask(<8 x i1>)
declare opaquety @llvm.vp.evlmask(<8 x i1>, i32 %evl)
declare <8 x float> @llvm.vp.fadd(%a, %b, opaquety mask)

For constrained fp, this could look as follows

declare opaquety @llvm.constrainedfp.fpenv(metadata, metadata)
declare opaquety @llvm.constrained.fadd(%a, %b, opaquety fpenv)

Note that we could allow chaining of these intrinsics (e.g. add a passthru opaquety parameter to the defining intrinsics):

opaquety %fpenv = llvm.constrainedfp.fpenv(opaquety undef, fpround.tonearest, fpexcept.strict)
opaquety %opt_params = llvm.vp.evlmask(%fpenv, <8 x i1> %mask, i32 %evl)
%c = llvm.fadd(%a, %b, %opt_params)

BUT before any of these solutions are adopted, I'd rather we model the optional parameters as standard, explicit parameters with default values. We can pursue the general solution to the optional parameter problem in parallel.
What do you think?

llvm/docs/LangRef.rst
15302

Nothing in this description makes it clear that the value should be negative for targets that don't support the explicit vector length argument.

Good point. I'll add that to the description.

What should happen if this value isn't negative for a target that doesn't support it?

There is a lowering pass that folds %evl into the mask, just as for reduction intrinsics.

15306

We are discussing the case %evl >= W in the main RFC, right now https://reviews.llvm.org/D57504#1854330 .

I like your first proposal for optional parameters. It "feels right" to me, and I agree that it is an improvement on operand bundles. Obviously it would take some time to build consensus for something like that as a new IR construct. I can live with the explicit parameter for now if we can agree on rules for how it is defined to behave on targets where it does not apply.

simoll planned changes to this revision. Feb 12 2020, 12:43 AM
simoll added a subscriber: chandlerc.

Ok. This is a summary of the requested changes and I'd like to get your go-ahead before "committing" them to the patch:

  1. Define what 'W' is (that is the vector length that %evl is compared against) - for static vector types this would be the number of elements of the vector, for scalable types W == MVL and the target is responsible for defining MVL.
  2. Define what happens when %evl > W - since we do not know now what is best, I propose we define this as UB. This leaves us some room to revisit that decision should it turn out that some form of defined behavior would be more reasonable.
  3. Mention that %evl is discouraged for non-VL targets - TTI tells people whether VL is supported or not for every target. If used nevertheless, the ExpandVectorPredicationPass will fold %evl into the mask. "not-using-%evl" means setting %evl = W or %evl = -1.
  4. %evl and %mask are parameters - I will start an RFC on llvm-dev towards optional function parameters (as sketched here under variant a: https://reviews.llvm.org/D69891#1856580).
  5. Rename to llvm.experimental.vp.* - Inserting experimental is the preferred way to introduce new intrinsics until they are stable (for technical reasons brought up by @chandlerc - https://reviews.llvm.org/D69891#1852795 ).

Ok. This is a summary of the requested changes and I'd like to get your go-ahead before "committing" them to the patch:

  1. Define what 'W' is (that is the vector length that %evl is compared against) - for static vector types this would be the number of elements of the vector, for scalable types W == MVL and the target is responsible for defining MVL.

I object. The only sensible definition for the upper bound on %evl is the number of elements in the vector type in question (because %evl counts vector elements). We already have a concept of "number of elements in a vector type" that generalizes across all vector types (fixed-width and scalable), and it is quite target-independent: <vscale x $N x $elemtype> (where $N is a constant) has vscale * $N elements. The only target-dependent part is in the positive integer vscale, and its behavior is quite restricted. All of this is already stated in the LangRef (in the section on vector types), and there is no reason for VP intrinsics to repeat it, let alone deviate from it by e.g. introducing new terminology (MVL) or making provisions for more target-dependent behavior than actually exists. The VP instructions should just define their behavior in terms of the number of elements in the vector.

+1
I was complicating things. MVL == number of vector elements and the whole discussion about what MVL is and how it is read/set really is part of the scalable type/vscale discussion.

  1. Rename to llvm.experimental.vp.* - Inserting experimental is the preferred way to introduce new intrinsics until they are stable (for technical reasons brought up by @chandlerc - https://reviews.llvm.org/D69891#1852795 ).

I feel like I should mention that I don't know how Chandler feels about the use of "experimental" for these intrinsics. I wasn't trying to claim his approval of my suggestion, merely explaining that I thought the reasoning that led to the current naming of the constrained FP intrinsics probably applied here as well. After I made that suggestion @craig.topper pointed out to me that the "experimental" qualifier has a tendency to never go away. See, for example, the vector add/reduce intrinsics. All this is to say that my suggestion is just a suggestion and I could be convinced to drop it if that is the consensus.

  1. Rename to llvm.experimental.vp.* - Inserting experimental is the preferred way to introduce new intrinsics until they are stable (for technical reasons brought up by @chandlerc - https://reviews.llvm.org/D69891#1852795 ).

I feel like I should mention that I don't know how Chandler feels about the use of "experimental" for these intrinsics. I wasn't trying to claim his approval of my suggestion, merely explaining that I thought the reasoning that led to the current naming of the constrained FP intrinsics probably applied here as well.

Well, to set this straight, I didn't mean to imply Chandler's approval. I just want to document what motivates the experimental prefix.

After I made that suggestion @craig.topper pointed out to me that the "experimental" qualifier has a tendency to never go away. See, for example, vector add/reduce intrinsics. All this is to say that my suggestion is just a suggestion and I could be convinced to drop it if that is the consensus.

How about we stick with llvm.vp.fadd and go for llvm.vp.fadd.v2, etc when/if the intrinsics are updated?

vkmr added a subscriber: vkmr.Feb 17 2020, 8:02 AM

After I made that suggestion @craig.topper pointed out to me that the "experimental" qualifier has a tendency to never go away. See, for example, vector add/reduce intrinsics. All this is to say that my suggestion is just a suggestion and I could be convinced to drop it if that is the consensus.

How about we stick with llvm.vp.fadd and go for llvm.vp.fadd.v2, etc when/if the intrinsics are updated?

I'm OK with that.

simoll updated this revision to Diff 246212.Feb 24 2020, 7:49 AM
simoll edited the summary of this revision. (Show Details)
  • added section noting that %evl is discouraged for non-evl targets
  • The VP intrinsic has UB if the effective EVL is out of bounds.
  • Use i32 -1 as special value for %evl (disabling %evl for this VP call).
efriedma added inline comments.
llvm/docs/LangRef.rst
15297

I assume for scalable vectors, the range of legal values for %evl is the runtime vector length? Maybe worth noting explicitly.

15310

I'm not happy with distinguishing between %evl equal to -1, vs. the IR constant -1. I mean, we can, but there are complications for other passes like, for example, constant hoisting. We have immarg for arguments that are *always* immediates, but we don't have a way to mark an argument that's *sometimes* an immediate.

I don't have any great suggestions here... but one idea is to add a second type overload to the intrinsics: i32 if the evl is present, {} if the evl is absent. So @llvm.vp.add.v4i32.i32 is an add with an evl parameter, and @llvm.vp.add.v4i32.sl_s is an add without an evl parameter. I think that expresses the semantics correctly, but maybe it's not so readable.
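For the record, a hedged sketch of what those two hypothetical overloads would look like as declarations (neither exists in the patch, which keeps a single mandatory i32 %evl operand):

```llvm
; Overload carrying an %evl operand (i32 mangles as .i32):
declare <4 x i32> @llvm.vp.add.v4i32.i32(<4 x i32>, <4 x i32>, <4 x i1>, i32)
; Overload without an effective %evl, using the empty struct type {}
; (which mangles as .sl_s):
declare <4 x i32> @llvm.vp.add.v4i32.sl_s(<4 x i32>, <4 x i32>, <4 x i1>, {})
```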

15330

"discouraged"

simoll marked 3 inline comments as done.Feb 24 2020, 5:15 PM

Thanks for the review!
I'll update shortly with an explicit mention of scalable vector types in the langref.

llvm/docs/LangRef.rst
15310

I'm not happy with distinguishing between %evl equal to -1, vs. the IR constant -1. I mean, we can, but there are complications for other passes like, for example, constant hoisting. We have immarg for arguments that are *always* immediates, but we don't have a way to mark an argument that's *sometimes* an immediate.

We could make TTI::getIntImmCostIntrin return 0 whenever i32 -1 is passed for the evl parameter of VP intrinsics.

/// Return the expected cost of materialization for the given integer
/// immediate of the specified type for a given instruction. The cost can be
/// zero if the immediate can be folded into the specified instruction.
int getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
                        Type *Ty) const;

In any case, the special handling of i32 -1 is really a makeshift solution. Mid to long term, we want to make evl (and the mask) argument optional parameters (https://lists.llvm.org/pipermail/llvm-dev/2020-February/139110.html).

one idea is to add a second type overload to the intrinsics: i32 if the evl is present, {} if the evl is absent. So @llvm.vp.add.v4i32.i32 is an add with an evl parameter, and @llvm.vp.add.v4i32.sl_s is an add without an evl parameter.

This might solve the representation issue in the IR. However, doing so would break the 1:1 correspondence between instruction opcodes and VP intrinsics, which is a nice property to have when optimizing VP code (e.g. for the generalized pattern matching logic in the VP reference patch).

efriedma added inline comments.Feb 24 2020, 5:35 PM
llvm/docs/LangRef.rst
15310

If you have a plan for this, that's fine.

simoll updated this revision to Diff 246352.Feb 24 2020, 5:42 PM
  • new LangRef wording
  • updated `VPIntrinsic::canIgnoreVectorLengthParam` to be in line with the i32 -1 constant being the off switch for evl.
simoll updated this revision to Diff 246444.Feb 25 2020, 7:03 AM

nfc: spelling, rebased

llvm/lib/IR/IntrinsicInst.cpp
278

Why are you doing this and not just "less than zero"?

simoll marked an inline comment as done.Feb 25 2020, 10:06 PM
simoll added inline comments.
llvm/lib/IR/IntrinsicInst.cpp
278

Because with the new LangRef wording %evl is unsigned and the i32 -1 constant is handled as a special case.

andrew.w.kaylor added inline comments.Mar 4 2020, 3:53 PM
llvm/lib/IR/IntrinsicInst.cpp
278

I see that you also discussed this with Eli. I don't like the fact that constant -1 and a value that is 0xFFFFFFFF at runtime could have different behaviors. There's no telling when the constant folder might find a path to convert a value to a constant. I understand that the intrinsics aren't supposed to be called with values outside the range [0, W] and have undefined behavior if they are, but having an exception for a constant does not seem right.

As a completely contrived example, suppose I had IR like this:

bb1:
  br i1 %flag, label %bb2, label %bb3

bb2:
  %t1 = call <8 x i32> @llvm.vp.add.v8i32(<8 x i32> %i0, <8 x i32> %i1, <8 x i1> %m, i32 %n)
  br label %bb4

bb3:
  %t2 = call <8 x i32> @llvm.vp.add.v8i32(<8 x i32> %i0, <8 x i32> %i1, <8 x i1> %m, i32 -1)
  br label %bb4

bb4:
  %t3 = phi <8 x i32> [%t1, %bb2], [%t2, %bb3]
  ...

Now some pass is going to transform that into:

bb1:
  %evl = select i1 %flag, i32 %n, i32 -1
  %t3 = call <8 x i32> @llvm.vp.add.v8i32(<8 x i32> %i0, <8 x i32> %i1, <8 x i1> %m, i32 %evl)
  ...

And now the -1 case has undefined behavior.

simoll marked 4 inline comments as done.Mar 5 2020, 12:46 AM
simoll added inline comments.
llvm/lib/IR/IntrinsicInst.cpp
278

That's a problem, yes.
We could do away with the -1 special case entirely. Then, to disable evl, it has to be set to the number of vector elements. For a <W x ty> vector that's just the constant W; for scalable vector types it would be the appropriate vscale constant expression.
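A hedged sketch of what a "disabled" %evl would look like under that scheme (the llvm.vscale intrinsic is used here as one way to materialize the scalable element count; the comment above speaks of a vscale constant expression; operand names illustrative):

```llvm
; Fixed-width <8 x i32>: pass the element count, the constant 8.
%r = call <8 x i32> @llvm.vp.add.v8i32(<8 x i32> %a, <8 x i32> %b,
                                       <8 x i1> %m, i32 8)
; Scalable <vscale x 4 x i32>: materialize vscale * 4 at runtime.
%vs = call i32 @llvm.vscale.i32()
%w  = mul i32 %vs, 4
%rs = call <vscale x 4 x i32> @llvm.vp.add.nxv4i32(<vscale x 4 x i32> %sa,
                                                   <vscale x 4 x i32> %sb,
                                                   <vscale x 4 x i1> %sm,
                                                   i32 %w)
```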

simoll updated this revision to Diff 249153.Mar 9 2020, 10:26 AM
  • Rebased
  • removed i32 -1 special case for EVL in VP intrinsics - 0 <= %evl <= W is the only valid (== defined) setting now.

Ping. The i32 -1 special case is gone and canIgnoreVectorLengthParam works for both regular and scalable vector types. Are there any more comments or can this go in?

I'm satisfied with the functionality, but I'm not sure about the intrinsics having undefined behavior outside the [0, W] range. The way you've implemented it, the behavior seems predictable: if the evl argument is outside that range, it is ignored. Applying an unsigned value greater than W through the "%mask AND %EVLmask" computation also has this effect. Why not just make that the defined behavior?
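To spell out the "%mask AND %EVLmask" interpretation, a hedged IR sketch for <8 x i32> (value names illustrative): lane i stays active iff %mask[i] is set and i < %evl.

```llvm
; Build %EVLmask: lane i is active iff i < %evl.
%evl.ins   = insertelement <8 x i32> undef, i32 %evl, i32 0
%evl.splat = shufflevector <8 x i32> %evl.ins, <8 x i32> undef,
                           <8 x i32> zeroinitializer
%evl.mask  = icmp ult <8 x i32> <i32 0, i32 1, i32 2, i32 3,
                                 i32 4, i32 5, i32 6, i32 7>, %evl.splat
; Fold %evl into the mask; the vector-length operand then carries the full
; element count (8), i.e. it is effectively disabled.
%new.mask  = and <8 x i1> %evl.mask, %mask
%res = call <8 x i32> @llvm.vp.add.v8i32(<8 x i32> %a, <8 x i32> %b,
                                         <8 x i1> %new.mask, i32 8)
```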

llvm/lib/IR/IntrinsicInst.cpp
294

You can get here in one step with 'this->getModule()'

simoll marked an inline comment as done.Mar 12 2020, 1:48 AM

I'm satisfied with the functionality, but I'm not sure about the intrinsics having undefined behavior outside the [0, W] range. The way you've implemented it, it seems like the behavior would be predictable. If the evl argument is outside that range, it is ignored.

To directly lower VP to NEC SX-Aurora, %evl strictly needs to be within the range 0 to W or you get a hardware exception. Defining any behavior outside of that range thus implies additional instructions to clamp %evl to its bounds or to guard the VP op. Clearly we do not want that. At the same time, leaving the behavior undefined outside of that range does not hamper AVX512 code generation in any way.

Applying an unsigned value greater than W using the "%mask AND %EVLmask" also has this effect.

Semantically, yes. The difference is in the code generation.

[applying.. greater than W] Why not just make that the defined behavior?

When %evl is lowered to a mask there is still a risk of overflow in the comparison when the underlying vector type is widened (consider an operation on <256 x i8> elements with %evl == 258: when that operation is widened to <512 x i8>, you need to do something about that %EVLmask or you'll get spurious active bits in the upper half). If that is UB to begin with, you do not need to consider it in the %EVLmask computation. So even non-AVL targets benefit from the strictly defined range for %evl.
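A hedged sketch of the clamp that a defined out-of-range behavior would force (the llvm.umin intrinsic is used purely for illustration):

```llvm
; If %evl > W were defined to act like %evl == W, a target widening
; <256 x i8> to <512 x i8> would have to clamp before building %EVLmask,
; paying extra instructions on every VP op:
%clamped = call i32 @llvm.umin.i32(i32 %evl, i32 256)
; With %evl > 256 being UB instead, the legalizer may assume %evl <= 256
; and compare the widened lane indices against %evl directly.
```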

simoll updated this revision to Diff 249876.Mar 12 2020, 3:09 AM

NFC

  • Rebased
  • getModule()
andrew.w.kaylor accepted this revision.Mar 12 2020, 10:51 AM

OK. Since the behavior for out-of-range evl is target-dependent, undefined makes sense.

I don't know if you were waiting for input from anyone else, but this looks good to me.

This revision is now accepted and ready to land.Mar 12 2020, 10:51 AM

OK. Since the behavior for out-of-range evl is target-dependent, undefined makes sense.

I don't know if you were waiting for input from anyone else, but this looks good to me.

Great :) Thanks to everybody involved for reviewing!
We've iterated on this patch for a while and the last major update was weeks ago, giving people enough time to react to it. I'll commit this today.

The next patch will introduce the ExpandVectorPredicationPass, which folds %evl into the mask if requested through TLI.

fhahn added a comment.Mar 13 2020, 2:42 AM

OK. Since the behavior for out-of-range evl is target-dependent, undefined makes sense.

I don't know if you were waiting for input from anyone else, but this looks good to me.

Great :) Thanks to everybody involved for reviewing!
We've iterated on this patch for a while and the last major update was weeks ago, giving people enough time to react to it. I'll commit this today.

Given the scale of the change, it might be good to let people on llvm-dev know that this patch is ready to land and is about to be committed (and wait till early next week in case there's additional feedback).

OK. Since the behavior for out-of-range evl is target-dependent, undefined makes sense.

I don't know if you were waiting for input from anyone else, but this looks good to me.

Great :) Thanks to everybody involved for reviewing!
We've iterated on this patch for a while and the last major update was weeks ago, giving people enough time to react to it. I'll commit this today.

Given the scale of the change, it might be good to let people on llvm-dev know that this patch is ready to land and is about to be committed (and wait till early next week in case there's additional feedback).

Ok, let's wait until next week to give people more time to react. I will also update the reference patch in the main RFC noting that the integer VP patch was accepted.
However, I don't think it is necessary to bring this particular patch up on llvm-dev - there were several llvm-dev mails regarding VP in the past and we should have all interested parties as subscribers to the main RFC. Future patches are a completely different story - e.g. the VP-expansion pass and follow-up patches should be announced on llvm-dev to make sure everybody is aware of them.

simoll updated this revision to Diff 250173.Mar 13 2020, 3:57 AM
  • Rebased
  • Fixed an off-by-one error in the LangRef (the lane at position i with i == %evl is False).
This revision was automatically updated to reflect the committed changes.
reames added a subscriber: reames.Mar 19 2020, 10:22 AM

Reading through the LangRef change post commit I was struck by this wording: "The VP intrinsic has undefined behavior if `%evl > W`."

This is a very strong interpretation and doesn't match what we do for arithmetic. One consequence of that difference is that hoisting a VP intrinsic out of a loop will not be legal unless we can prove the EVL is in bounds. You probably want some variant of poison-semantics propagation instead. Unless, of course, real hardware happens to fault on an out-of-bounds EVL, in which case the current semantic is what you want. :)

Otherwise, nice work.

Reading through the LangRef change post commit I was struck by this wording: "The VP intrinsic has undefined behavior if `%evl > W`."

This is a very strong interpretation and doesn't match what we do for arithmetic. One consequence of that difference is that hoisting a VP intrinsic out of a loop will not be legal unless we can prove the EVL is in bounds. You probably want some variant of poison-semantics propagation instead. Unless, of course, real hardware happens to fault on an out-of-bounds EVL, in which case the current semantic is what you want. :)

Otherwise, nice work.

I brought this up in D57504 earlier, and @simoll said that some real hardware indeed faults in this case:

The VE target strictly requires VL <= MVL or you'll get a hardware exception.

lkcl added a subscriber: lkcl.Mar 20 2020, 3:26 AM