This is an archive of the discontinued LLVM Phabricator instance.

[VP,Integer,#1] Vector-predicated integer intrinsics
ClosedPublic

Authored by simoll on Nov 6 2019, 6:19 AM.

Details

Summary

This patch adds IR intrinsics for vector-predicated integer arithmetic.

It is subpatch #1 of the integer slice of LLVM-VP.
LLVM-VP is a larger effort to bring native vector predication to LLVM.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.
simoll planned changes to this revision.Nov 11 2019, 10:52 AM
simoll marked 2 inline comments as done.
simoll updated this revision to Diff 228884.Nov 12 2019, 6:45 AM

Changes

  • VPIntrinsics.def macro file
  • NFC cleanup
jdoerfert added inline comments.
llvm/include/llvm/IR/Intrinsics.td
1230

I asked @sstefan1 to add nosync and nofree to the td file and other places needed to use them. That should make your changes more concise.


I would have argued we could even do the #inline trick here, or maybe some .td magic, to avoid this replication, but I will not force anything.

llvm/include/llvm/IR/VPIntrinsics.def
2

Nit: copy&paste

28

To be honest, I would have made the Opcode for vp intrinsics of existing functions VP_EXISTING (or similar) and used a single macro definition here, assuming all vp intrinsics so far have a unique corresponding scalar version.
These are two unrelated points you can consider (I won't force either).

(If you decide to do both changes, you can even get away by only defining the existing opcode in the intrinsic "macro definition").

90

Nit: I'd run clang format on these files as well. Should minimize diffs in the long run.

llvm/lib/IR/IntrinsicInst.cpp
201

Does it make sense to have this corner case in here or should we just make it an Optional<int64_t> and call it a day?

You can also remove the lambda if you move the code, unless there is a reason to keep it.

220

Default clauses make it harder to find all switches that might need to be adjusted. If there is no strong reason to have them we should not.

250

Leftover?

263

I doubt this is a sensible default. I doubt one should return anything for a non-vp intrinsic.

294

Nit: braces.

310

Nit: clang format your changes (or patches) (clang-format.py for vim has the formatdiff option)

llvm/unittests/IR/VPIntrinsicTest.cpp
21

Nit: using namespace llvm; is probably less confusing here

jdoerfert added inline comments.Nov 12 2019, 10:48 PM
llvm/lib/IR/IntrinsicInst.cpp
220

Forget this one.

simoll planned changes to this revision.Nov 13 2019, 12:32 AM
simoll marked 12 inline comments as done.

Thanks for the comments so far :)

llvm/include/llvm/IR/Intrinsics.td
1230

I would have argued we could even do the #inline trick here, or maybe some .td magic, to avoid this replication, but I will not force anything.

I agree but refactoring Intrinsics.td is out of scope for this patch.

llvm/include/llvm/IR/VPIntrinsics.def
28

assuming all vp intrinsics so far have a unique corresponding scalar version.

They don't. That's why it's separate. There will be more cases for constrained fp, reduction and memory ops.

llvm/lib/IR/IntrinsicInst.cpp
201

Actually, the excessive vector length case should return None to imply that the static vector length could not be inferred.

Btw, later patches will reuse the lambda.

263

If there is no functional IR opcode this is still a call to an intrinsic function. It's natural to return Call in this case.

310

This particular function is clang-formatted already. The existing code in IntrinsicInst.cpp is not. I'll do a patch clang-formatting the entire file after this one.

simoll updated this revision to Diff 229029.EditedNov 13 2019, 1:53 AM
simoll marked an inline comment as done.

Changes

  • Cleanup, addressed nitpicks
  • Assume vlen cannot be ignored for the corner case of an excessive static vector length (safe default; pass -1 to enable all lanes in those cases, should they ever occur).
  • Made integer intrinsics nosync. Will consider an explicit nofree for VP intrinsics with memory args if the intrinsic isn't readnone already.
samparker added inline comments.Nov 13 2019, 2:08 AM
llvm/lib/IR/IntrinsicInst.cpp
218

Ah, sorry, I had missed this was for constant lengths!

sstefan1 added inline comments.Nov 13 2019, 2:24 AM
llvm/include/llvm/IR/Intrinsics.td
1230

I kind of started some of this in D65377 and the approach I will try and propose is an opt-out list.

If I understood correctly I should try to do that part first (and propose). @jdoerfert correct me if I'm wrong.

Hopefully I'll start this today.

jdoerfert added inline comments.Nov 13 2019, 8:28 AM
llvm/include/llvm/IR/Intrinsics.td
1230

@simoll Agreed. This is out of scope.
@sstefan1 I think your plan sounds good.

llvm/include/llvm/IR/VPIntrinsics.def
2

now too long and wrapped :(
No need for llvm/, which is confusing as it lives in llvm/IR.
Also "Instructions" might be a bit too broad? Idk, maybe: "vector intrinsic descriptions"

28

I see. Though, even if later intrinsics do not correspond to scalar instructions, the changes would still be possible. I admit, at this point it is a question of whether you like:

REGISTER_VP_INTRINSIC(vp_add, 2, 3)
HANDLE_VP_TO_OC(vp_add, Add)
REGISTER_VP_INTRINSIC(vp_red, 2, 3)

better or worse than

REGISTER_VP_FROM_SCALAR_INTRINSIC(Add, 2, 3)
REGISTER_VP_INTRINSIC(Red, 2, 3)

with the appropriate definition of the above:

#define REGISTER_VP_FROM_SCALAR_INTRINSIC(OC, MP, VP) \
  REGISTER_VP_INTRINSIC(OC, MP, VP) \
  HANDLE_VP_TO_OC(vp_##OC, OC)
llvm/lib/IR/IntrinsicInst.cpp
201

Fine with me.

263

not really convinced but OK.

simoll planned changes to this revision.Nov 13 2019, 9:01 AM
simoll marked 14 inline comments as done.
simoll added inline comments.
llvm/include/llvm/IR/VPIntrinsics.def
2
//===-- IR/VPInstruction.def - Describes llvm.vp.* Intrinsics -*- C++ -*-===//
28

I prefer the explicit form

REGISTER_VP_INTRINSIC(vp_add, 2, 3)
HANDLE_VP_TO_OC(vp_add, Add)
  • The explicit form gives you much nicer diffs.
  • You don't have to read a meta macro to understand what is going on.
  • This file will undergo changes with the next patches. I wouldn't spend too much time compacting it before it has "settled down".
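A language-neutral sketch of what the two macro families end up encoding (macro and intrinsic names are from the review; the table contents beyond vp_add, and the lookup helper, are illustrative only):

```python
# REGISTER_VP_INTRINSIC(name, maskpos, vlenpos): records at which argument
# positions a VP intrinsic carries its mask and explicit vector length.
VP_INTRINSICS = {
    "vp_add": (2, 3),
    "vp_red": (2, 3),  # hypothetical reduction; has no scalar opcode
}

# HANDLE_VP_TO_OC(vp_name, opcode): kept as a separate, explicit mapping
# because not every VP intrinsic has a corresponding scalar opcode.
VP_TO_OPCODE = {
    "vp_add": "Add",
}

def scalar_opcode(vp_name):
    """Return the functional scalar opcode for a VP intrinsic, or None
    for intrinsics (e.g. reductions) without a scalar counterpart."""
    return VP_TO_OPCODE.get(vp_name)

print(scalar_opcode("vp_add"))  # Add
print(scalar_opcode("vp_red"))  # None
```

This mirrors why the explicit two-macro form was preferred: the registration table and the opcode mapping stay independent, so entries without a scalar opcode need no special-case macro.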
simoll updated this revision to Diff 229116.Nov 13 2019, 9:05 AM
simoll marked 2 inline comments as done.

Finally fixed the IR/VPIntrinsics.def header line.

simoll updated this revision to Diff 229118.Nov 13 2019, 9:07 AM

VPInstruction.def -> VPIntrinsics.def -.-

I think I'm fine with this. Anyone else?

llvm/include/llvm/IR/VPIntrinsics.def
28

OK.

+1 from me! Thanks for doing this!

llvm/docs/LangRef.rst
15301

Nit: instead of talking about the MSB being zero, is it simpler to say that it is effective if it is a positive number (if that's what you mean)?

15371

Nit: please ignore if you disagree, but perhaps shorter/simpler is:

The first two operands and the result have the same vector of integer type.

->

The operands and the result are integer vector types.

(this applies to all/most descriptions)

llvm/lib/IR/IntrinsicInst.cpp
184

Nit: don't think this lambda adds anything, it's just called once.

simoll planned changes to this revision.Nov 14 2019, 1:29 AM
simoll marked 4 inline comments as done.

+1 from me! Thanks for doing this!

Thanks for reviewing! I'll upload a final update for the LangRef.

llvm/docs/LangRef.rst
15301

Good point. I'll rewrite that part referring to the vlen as a signed integer.

15371

I'd like to shorten that. However, there are always mask and vlen operands so we cannot say that all operands have an integer vector type.

llvm/lib/IR/IntrinsicInst.cpp
184

It'll be called twice in the final version (see reference patch).

simoll updated this revision to Diff 229249.Nov 14 2019, 2:11 AM
simoll marked an inline comment as done.

LangRef rewording: explicit vlen is a signed integer.

Thanks for your efforts to move this forward @simoll!

llvm/docs/LangRef.rst
15297

I expect this to also work for scalable vectors, so maybe add a case here for <vscale x W x T> as well?

15299

nit: bit -> i1

15301

Thanks, I wondered something similar.

I had actually expected the %evl to only make sense in the context of scalable vectors?

Is it worth expressing what it means if %evl is larger than the size of the vector?

15303

nit: 80 char limit (also for other places in this file)

15335

Have you considered adding an extra argument to specify the value of the false lanes, similar to how this is currently done for llvm.masked.load?
By passing undef, it would have similar behaviour as the current definition, yet it would remove the need to add explicit select statements for handling merging- or zeroing predication if an instruction supports it. For SVE for example, most instructions have either merging or zeroing predication and the intrinsics expose the merging/zeroing/undef predication directly in the C/C++ intrinsic API. I think it would be really nice to have that capability represented in the vp intrinsics as well.

simoll planned changes to this revision.Nov 14 2019, 7:00 AM
simoll marked 5 inline comments as done.
  • Adding scalable type example declarations
  • reformatting langref for character limit.
llvm/docs/LangRef.rst
15301

I had actually expected the %evl to only make sense in the context of scalable vectors?

EVL does not require scalable vectors. We'll use it with the <256 x double> type for example.

15335

Yes, I considered that. However, as you suggested you can express this without loss with a select and I'd like to keep the intrinsics simple.

simoll updated this revision to Diff 229306.Nov 14 2019, 7:06 AM

Thanks for your feedback!

Formatted LangRef for char limit

jdoerfert added inline comments.Nov 14 2019, 8:48 AM
llvm/docs/LangRef.rst
15297

If that support is added later, I would add the wording later too.

15335

FWIW, I like the idea of simple intrinsics and explicit selection.

sdesmalen added inline comments.Nov 14 2019, 9:38 AM
llvm/docs/LangRef.rst
15297

@simoll is anything special needed to support scalable vectors in this context?

15335

If we want to keep the vp intrinsics simple, shouldn't we reconsider the need for %evl as an explicit parameter?

That value can be folded into the predicate using a special intrinsic that enables the lanes upto %evl, e.g.

<256 x i1> llvm.vp.enable.lanes(<256 x i1> %p, i32 %evl)

If any operation needs %evl as an explicit parameter, it would be a matter of having a pattern that matches the enable.lanes intrinsic, similar to how merging/zeroing predication can be implemented by matching the selects.

llvm/include/llvm/IR/IntrinsicInst.h
233

Can this be renamed to getVectorLength() and have it return ElementCount instead? (at which point we can drop the Optional)

simoll updated this revision to Diff 229490.Nov 15 2019, 2:57 AM
simoll marked 4 inline comments as done.
  • ElementCount getStaticVectorLength()
  • unit test for canIgnoreVectorLength on scalable examples.
simoll added inline comments.Nov 15 2019, 3:00 AM
llvm/docs/LangRef.rst
15297

@simoll is anything special needed to support scalable vectors in this context?

The intrinsics support scalable vectors out of the box. I've added VP intrinsics with scalable types to the individual operations below.

15335

If any operation needs %evl as an explicit parameter, it would be a matter of having a pattern that matches the enable.lanes intrinsic, similar to how merging/zeroing predication can be implemented by matching the selects.

The SX-Aurora and RISC-V V extensions natively support an explicit vector length parameter just as they support predicated vector instructions. Mask and EVL should be treated the same: if there is a vector mask parameter there should also be an EVL parameter.

llvm/include/llvm/IR/IntrinsicInst.h
233

Sure. That actually makes it simpler.

sdesmalen added inline comments.Nov 15 2019, 5:49 AM
llvm/docs/LangRef.rst
15335

I understand that certain instructions take an explicit vector length parameter, but that doesn't really explain why an arch-independent intrinsic needs to expose this. Is there any reason that modelling this using something like llvm.vp.enable.lanes won't work?

llvm/include/llvm/IR/IntrinsicInst.h
233

Awesome, thanks!

simoll marked an inline comment as done.Nov 15 2019, 7:07 AM
simoll added inline comments.
llvm/docs/LangRef.rst
15335

That's already two architectures that do support EVL on all vector instructions, and it is vital to model it. EVL has performance implications (smaller EVL -> lower latency; no need to compute and store a mask where EVL can be used). For SVE/MVE, you can simply turn the feature off by passing i32 -1.

From the perspective of RVV and SX-Aurora, saying we do not need EVL is equivalent to saying we do not need vector predication.

cameron.mcinally added inline comments.
llvm/docs/LangRef.rst
15335

Yes, I considered that. However, as you suggested you can express this without loss with a select and I'd like to keep the intrinsics simple.

Having an explicit passthru is a good idea. Keeping the select glued to the intrinsics may be difficult. For example:

select %mask, %r, undef

It would be easy for a select peep to replace this with %r, since the masked bits are undefined. But for a situation like a trapping instruction, we wouldn't want the masked elements to execute.

simoll marked an inline comment as done.Nov 18 2019, 9:27 AM
simoll added inline comments.
llvm/docs/LangRef.rst
15335

select %mask, %r, undef

This is the current behavior - the values on masked-off lanes (by EVL or the mask) are undef.

My problem with passthru is that it encourages people to specify arbitrary values on masked-off lanes and to do so often. Yes, for some combination of target architecture, instruction, passthru value and type it might be a perfect match for one machine instruction. But for most combinations you will have to emit an extra select anyway.

Besides, if the select is explicit and can be folded into another operation to simplify it, then that is actually a good thing.
But there is more: for things like coalescing predicated registers and vector register spilling, it is helpful to have undef-on-masked-off and a separate select instruction that can be re-scheduled.

sdesmalen added inline comments.Nov 18 2019, 10:26 AM
llvm/docs/LangRef.rst
15335

If we'd use a select instruction to implement the zeroing/merging predication, the issue with the current set of intrinsics is that we don't know how many lanes to select because the predicate mask is no longer the determining factor to limit which lanes are the valid lanes. Suggesting that other architectures can just turn off the feature by setting the %evl parameter to -1 doesn't change the fact that the code generator still needs to handle the IR if %evl is not -1. It would be valid IR and there is nothing that guarantees LLVM won't generate the generic intrinsics with a value that is not -1. This is not really special to select, but common to other operations that take a mask as well. For our case it means we can't implement zeroing/merging predication with this set of intrinsics using select.

If we want to solve the select issue and also keep the intrinsics simple, my suggestion was to combine the explicit vector length with the mask using an explicit intrinsic like @llvm.vp.enable.lanes. Because this is an explicit intrinsic, the code-generator can simply extract the %evl parameter and pass that directly to the instructions for RVV/SXA. This is what happens for many other intrinsics in LLVM already, like masked.load/masked.gather that support only a single addressing mode, where it is up to the code-generator to pick apart the value into operands that are suited for a more optimal load instruction.

Without having heard your thoughts on this suggestion, I would have to guess that your reservation is the possibility of LLVM hoisting/separating the logic that merges predicate mask and %evl value in some way. That would mean having to do some tricks (think CodeGenPrep) to keep the values together and recognizable for CodeGen. And that's the exact same thing we would like to avoid for supporting merging/zeroing predication, hence the suggestion for the explicit passthru parameter.

sdesmalen added inline comments.Nov 18 2019, 10:26 AM
llvm/docs/LangRef.rst
15335

My problem with passthru is that it encourages people to specify arbitrary values on masked-off lanes and to do so often. Yes, for some combination of target architecture, instruction, passthru value and type it might be a perfect match for one machine instruction. But for most combinations you will have to emit an extra select anyway.

The most common passthru values in practice will be one of the source operands or zero/undef though. These are supported by SVE/SVE2 as a first-class feature (and I could imagine other/future vector architectures having special support for that as well). Other values can indeed be lowered to an explicit select instruction.

But there is more, for things like coalescing predicated registers and vector register spilling, it is helpful to have undef-on-masked-off and a separate select instruction that can be re-scheduled.

At that point the intrinsic will already be lowered to individual instructions, so I don't think the definition of the intrinsic has much of an impact on that.

llvm/docs/LangRef.rst
15335

Yes, you're right. Having an implicit undef and explicit merge is good.

simoll marked an inline comment as done.Nov 18 2019, 10:54 AM
simoll added inline comments.
llvm/docs/LangRef.rst
15335

Suggesting that other architectures can just turn off the feature by setting the %evl parameter to -1 doesn't change the fact that the code generator still needs to handle the IR if %evl is not -1

Other targets do not need to consider %evl because there will be an ExpandVectorPredication pass that folds the EVL into the vector mask. Backends can request through TTI that the %evl be folded away and ignore that parameter from there on.

An early version of that pass is included in the VP reference patch. The TTI changes and the expansion pass are planned for the next patch after this one.
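The folding that such a pass would perform can be sketched with a small language-neutral model (the function name and the list-of-bools mask encoding are illustrative, not the pass's actual interface):

```python
def fold_evl_into_mask(mask, evl):
    """Fold an explicit vector length into the mask: a lane stays enabled
    only if its mask bit is set and, when evl is non-negative, its index
    is below evl. A negative evl means "all lanes", matching the i32 -1
    convention of the intrinsics."""
    if evl < 0:
        return list(mask)
    return [m and (i < evl) for i, m in enumerate(mask)]

# After folding, a backend without native EVL support can treat the VP
# intrinsic as a purely mask-predicated operation and ignore %evl.
print(fold_evl_into_mask([True, True, True, True], 2))   # [True, True, False, False]
print(fold_evl_into_mask([True, False, True, True], -1)) # [True, False, True, True]
```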

SjoerdMeijer added inline comments.Nov 22 2019, 7:54 AM
llvm/docs/LangRef.rst
15335

Don't think this discussion reached consensus, and I missed the nuances here earlier / changed my mind on it.

The prior art on this also uses an explicit passthru, and it looks more generic than patching that up with a select later; you can always pass an undef in?

I also read the description of %evl on line 14664 above again, and thought the definition could be tightened a bit, i.e. specify an upper bound, also just for clarity:

The explicit vector length parameter always has the type i32 and is a signed integer value. The explicit vector length is only effective if it is non-negative. Results are only computed for enabled lanes. A lane is enabled if the mask at that position is true and, if effective, where the lane position is below the explicit vector length.

And then started drafting something to see if I could describe the proposed behaviour:

The explicit vector length parameter always has the type i32 and is a signed integer value. The explicit vector length (%evl) is only effective if it is non-negative, and when that is the case, the effective vector length is used to calculate a mask, and its value is:

0 <= evl <= W,   where W is the vector length.

This creates a mask, %EVLmask, with all elements 0 <= i < evl set to True, and all other lanes evl <= i < W to False.

A new mask %M is calculated with an element-wise AND from %mask and %EVLmask:

M = mask AND EVLmask

A vector operation <opcode> on vectors A and B calculates:

A <opcode> B =  {  A[i] <opcode> B[i]   if M[i] = T, and
                {  undef otherwise

and if I'm not mistaken we are now discussing if undef here should be undef or a passthru
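The sketch above can be made concrete with a small executable model (Python purely for illustration; None stands in for undef):

```python
def vp_binop(op, a, b, mask, evl):
    """Model of a VP binary op: lane i is computed iff mask[i] is true
    and (evl < 0 or i < evl); all other lanes are undef (None here)."""
    w = len(a)
    evlmask = [evl < 0 or i < evl for i in range(w)]
    m = [mask[i] and evlmask[i] for i in range(w)]  # M = mask AND EVLmask
    return [op(a[i], b[i]) if m[i] else None for i in range(w)]

# %evl = 2 limits computation to the first two lanes; the mask further
# disables lane 0.
r = vp_binop(lambda x, y: x + y, [1, 2, 3, 4], [10, 20, 30, 40],
             [False, True, True, True], 2)
print(r)  # [None, 22, None, None]
```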

15348

nit: perhaps more consistent %avl -> %evl ?

rkruppe added inline comments.Nov 24 2019, 7:34 AM
llvm/docs/LangRef.rst
15335

Sjoerd's summary matches my understanding of the question here.

I believe so far nobody spelled out that there's actually a material difference between a passthru argument and a separate select. The mask that is readily available for the select (the %mask passed to the VP instruction) isn't the same as the "enabled lanes" computed from mask and EVL (%M in Sjoerd's exposition above), which is extremely undesirable to materialize on machines with a native equivalent of EVL (such as SX-Aurora and RISC-V).

In particular, if %mask is computed by another VP instruction, then lanes of %mask past the relevant EVL are undef, not false. That means a select like select %mask, %v, zeroinitializer does not actually make the disabled lanes zero; past EVL they are again undef (assuming %v and %mask were computed by VP instructions with the same EVL). In contrast, a passthru value would (presumably, I didn't see anyone object) apply to all disabled lanes.

I don't know how much this matters, since usually consumers of the vector will also only work up to the same EVL. But there are exceptions, such as the code pattern for vectorizing (associative or -ffast-math) reductions for RISC-V V. Briefly put, that pattern is to accumulate a vector of partial (e.g.) sums in the loop, preserving lanes past the current EVL (which may be smaller than the EVL in some previous iterations) so they'll still participate in the final reduction after the loop. This can be expressed perfectly with passthru.

Without passthru, that kind of code seems to require juggling two EVLs (the actual current EVL for this iteration, and the maximum EVL across all iterations) as well as materializing what Sjoerd called %EVLmask and %M above. This seems much harder for a backend to pattern-match and turn into acceptable code than the select %m, vp.something(%a, %b, %m, -1), zeroinitializer pattern.
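The reduction pattern described here can be sketched as follows (hypothetical helper name; a passthru-style merge lets lanes at or past the current EVL keep their partial sums from earlier, wider iterations):

```python
def vp_add_with_passthru(acc, x, evl):
    """Lanes below evl accumulate; lanes at or past evl keep their old
    value (the passthru), preserving partial sums from earlier iterations."""
    return [a + v if i < evl else a for i, (a, v) in enumerate(zip(acc, x))]

# Stripmined sum of 6 elements with vector width 4: the first iteration
# processes 4 lanes, the last (shorter) one only 2, yet lanes 2 and 3
# keep their partial sums for the final reduction.
data = [1, 2, 3, 4, 5, 6]
acc = [0, 0, 0, 0]
acc = vp_add_with_passthru(acc, data[0:4], 4)       # [1, 2, 3, 4]
acc = vp_add_with_passthru(acc, data[4:6] + [0, 0], 2)  # [6, 8, 3, 4]
print(sum(acc))  # 21
```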

I previously felt that passthru would be nice to have for backend maintainers (including myself) but perhaps not worth the duplication of IR functionality (having two ways to do selects). However, given the differences I just described, I don't think "just use select" is workable.

That doesn't necessarily mean we need passthru arguments on every VP instruction, I could also imagine adding a vp.passthru(%v, %passthru, %mask, %evl) intrinsic that takes from the first operand for all enabled lanes and the second for the other lanes. But I feel like once the floodgates are opened for doing merging "differently", it's kind of a toss-up whether it's simpler to create a special intrinsic for it or make it part of the general recipe for VP instructions.

simoll marked 2 inline comments as done.Nov 25 2019, 2:41 AM

Moved the discussion to the RFC.

llvm/docs/LangRef.rst
15335
A <opcode> B =  {  A[i] <opcode> B[i]   if M[i] = T, and
                {  undef otherwise

That's an accurate description of %evl semantics.

15348

Yes. I'll change that to %evl.

simoll planned changes to this revision.Jan 9 2020, 5:02 AM
  • Clarify documentation on preserved lanes
  • explicit vlen arg is either negative or (new requirement) less than or equal to the number of lanes of the operation.

Just wanted to check your plans because the status is "planned changes to this revision".
I think I am ready to LGTM this, if someone else LGTM this too, e.g. @rkruppe, but will wait if I missed something or if you're planning more changes.

simoll updated this revision to Diff 241737.EditedJan 31 2020, 8:15 AM
  • Comments
  • Changed phrasing of the %mask && %evl function along @SjoerdMeijer's sketch. This covers the planned changes (%evl <= W).
  • NoSync change should probably go into a separate commit.
simoll marked 18 inline comments as done.Jan 31 2020, 8:19 AM

Unit tests: fail. 62371 tests passed, 3 failed and 839 were skipped.

failed: LLVM.CodeGen/AMDGPU/llvm.amdgcn.s.buffer.load.ll
failed: LLVM.CodeGen/AMDGPU/smrd.ll
failed: libc++.std/thread/thread_mutex/thread_mutex_requirements/thread_mutex_requirements_mutex/thread_mutex_class/try_lock.pass.cpp

clang-tidy: fail. clang-tidy found 0 errors and 18 warnings. 0 of them are added as review comments below (why?).

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.


I know it's very late for me to be bringing this up, but I have a couple of broad questions -- one minor, one less minor.

First, when we introduced the constrained FP intrinsics Chandler suggested inserting ".experimental" in the names because of the likelihood that something would need to change before we were done and the "experimental" modifier would make the auto-upgrader handling slightly less ugly. That same reasoning seems like it would apply here.

Second, I'm not 100% comfortable with how the explicit vector length argument is represented. Yes, I know, I should have brought this up about a year ago. I don't have any objections to the existence of this argument. I understand that we need some way to handle it. I'm just wondering if we can handle it in some way that makes it more like an optional argument. I'm not entirely clear where this value comes from, but it seems like whatever generated it must know that we're generating SVE code, right? My concern is that for targets that don't do SVE kinds of things we still have this entirely redundant, if not meaningless, operand to carry around. What would you think of this being provided as an operand bundle when it is needed?

I'm not entirely clear where this value comes from, but it seems like whatever generated it must know that we're generating SVE code, right?

This is not architecture specific, and thus this is not assuming SVE. In the case of SVE, the vector length is unknown at compile time (it is a multiple of something), and constant at run-time. In the RISC-V vector extension, I believe it can be changed at any point. Thus, the way to obtain this value is by reading/writing a status register, or something similar, but relying on querying the architecture features. And whether it is constant at runtime, or can be changed at any point, this API seems to cover the different cases.

I'm not entirely clear where this value comes from, but it seems like whatever generated it must know that we're generating SVE code, right?

This is not architecture specific, and thus this is not assuming SVE. In the case of SVE, the vector length is unknown at compile time (it is a multiple of something), and constant at run-time. In the RISC-V vector extension, I believe it can be changed at any point. Thus, the way to obtain this value is by reading/writing a status register, or something similar, but relying on querying the architecture features. And whether it is constant at runtime, or can be changed at any point, this API seems to cover the different cases.

You're mixing up different meanings of "vector length" here:

  • The number of elements in a scalable vector is a constant multiple of vscale which is unknown at compile time but constant during any given program execution, for both SVE and RISC-V V (though apparently not for the ISA Jacob et al. are working on, which raises all sorts of questions outside the scope of VP intrinsics so let's not get into that now)
  • The %evl parameter of VP intrinsics is a runtime value from 0 to (number of elements in the vector, be it scalable or not) which serves a similar purpose as the <W x i1> mask though it is tailored for a separate and complementary hardware feature. Both EVL and mask are for predication, i.e., not doing computations on some lanes of the vector depending on runtime conditions -- but those lanes still exist.

Although the EVL argument is not architecture specific in that it has the same semantics on all targets and can be legalized on all targets, Andrew is right that in practice, one would not generate vector code using the EVL argument to perform stripmining without knowing that the target architecture has efficient hardware support for it. Instead, that argument would be fixed to a constant -1 if the code does not want/need this concept. This is not too different from many other vector operations (e.g., shufflevector can express arbitrary shuffles, but many architectures support only a subset of those efficiently, and vectorization cost models take that into account). The real question is, do targets without native support for EVL have anything to gain from making the argument optional? I don't think so:

  • It does not make textual LLVM IR significantly more noisy: for example, a boring old 8xi32 masked store is call void @llvm.vp.store.v8i32.p0v8i32(<8 x i32> %val, <8 x i32>* %ptr, <8 x i1> %mask, i32 -1) and all you'd be saving by making the EVL argument optional is the i32 -1 part.
  • It is one additional operand per instruction, but given that almost all VP operations have 3+ arguments even without the EVL argument, getting rid of it seems like a pretty minor optimization (we can revisit this if VP operations become widely used and turn out to significantly affect memory consumption, but in that scenario we can make even greater gains by making them first-class instructions).
  • If this extra argument complicates code that uses IRBuilder or pattern matching for purposes where it should always be -1, we can have overloads that (construct/expect) a constant -1 as EVL argument. For example, builder.CreateVectorFAdd(lhs, rhs, mask) forwarding to builder.CreateVectorFAdd(lhs, rhs, mask, builder.getInt32(-1)).
  • It has no impact on code generation, since the canonical way to indicate that the "EVL feature" is not used by an instruction (the EVL argument being constant -1) is trivial to identify.

So I think that making this argument optional accomplishes nothing, it only makes support for targets that do care (RISC-V V, SX-Aurora VE) more complicated.

simoll added a comment.Feb 3 2020, 3:10 AM

I know it's very late for me to be bringing this up, but I have a couple of broad questions -- one minor, one less minor.

First, when we introduced the constrained FP intrinsics Chandler suggested inserting ".experimental" in the names because of the likelihood that something would need to change before we were done and the "experimental" modifier would make the auto-upgrader handling slightly less ugly. That same reasoning seems like it would apply here.

I don't mind changing the name to 'llvm.experimental.vp.add', etc. The mid- to long-term goal is first-class VP instructions in any case.

Second, I'm not 100% comfortable with how the explicit vector length argument is represented. Yes, I know, I should have brought this up about a year ago. I don't have any objections to the existence of this argument. I understand that we need some way to handle it. I'm just wondering if we can handle it in some way that makes it more like an optional argument. I'm not entirely clear where this value comes from, but it seems like whatever generated it must know that we're generating SVE code, right? My concern is that for targets that don't do SVE kinds of things we still have this entirely redundant, if not meaningless, operand to carry around. What would you think of this being provided as an operand bundle when it is needed?

Operand bundles are not the same as optional parameters. Operand bundles pessimistically imply side-effects whereas a plain i32 argument is innocuous to optimizations:

https://llvm.org/docs/LangRef.html#operand-bundles

  • Calls and invokes with operand bundles have unknown read / write effect on the heap on entry and exit (even if the call target is readnone or readonly), unless they’re overridden with callsite specific attributes.

That is reasonable for constrained fp intrinsics, which always have such a dependence. Many VP intrinsics however do not have any side effects or only inspect/modify memory through their pointer arguments, etc.

<snip>

  • It has no impact on code generation, since the canonical way to indicate that the "EVL feature" is not used by an instruction (the EVL argument being constant -1) is trivial to identify.

So I think that making this argument optional accomplishes nothing; it only makes support for targets that do care (RISC-V V, SX-Aurora VE) more complicated.

Yep. Thanks for clarifying this. Once LLVM supports optional parameters without adverse effects, we can revisit this discussion; until then I'd strongly prefer the EVL to be an explicit parameter.

Ah, okay, I misunderstood. Thanks for clarifying.

I'm not entirely clear where this value comes from, but it seems like whatever generated it must know that we're generating SVE code, right?

This is not architecture-specific, and thus this is not assuming SVE. In the case of SVE, the vector length is unknown at compile time (it is a multiple of something) and constant at run-time. In the RISC-V vector extension, I believe it can be changed at any point. Thus, the way to obtain this value is by reading/writing a status register or something similar, i.e. by querying the architecture's features. And whether it is constant at runtime or can be changed at any point, this API seems to cover the different cases.

Alright, I was reaching for a more general concept that I obviously don't know the term for. What I meant to say was just that whatever generated the argument must know something specific about the target architecture.

Operand bundles are not the same as optional parameters. Operand bundles pessimistically imply side-effects, whereas a plain i32 argument is innocuous to optimizations:

https://llvm.org/docs/LangRef.html#operand-bundles

  • Calls and invokes with operand bundles have unknown read / write effect on the heap on entry and exit (even if the call target is readnone or readonly), unless they’re overridden with callsite specific attributes.

That is reasonable for constrained fp intrinsics, which always have such a dependence. Many VP intrinsics however do not have any side effects or only inspect/modify memory through their pointer arguments, etc.

We could add a callsite attribute when the intrinsic is created. We have that information about the intrinsic, right?

I'll admit that there may be no practical difference between what you've proposed and what I have in mind (in the abstract sense that assumes it's even possible to implement this in the way that I'm imagining). Mostly, I'm trying to explore what I perceive to be a bad smell in the IR. Having a parameter that is irrelevant for some targets just doesn't seem right. Introducing these intrinsics with "experimental" in the name would make me feel a bit better about that, even though again it has no practical effect.

The real question is, do targets without native support for EVL have anything to gain from making the argument optional?

I'm not sure I agree that is the real question. What is gained by not having the floating point constraint arguments always present even when they aren't needed? We have values that describe the default mode, so we could use those when we want the default mode. I think we all agree that they shouldn't be there when we don't need them, yet I would argue that those arguments are more relevant when "not needed" than the evl argument is.

I'm imagining walking someone through the IR, explaining what each instruction means. I'd be a bit embarrassed when I get to a vp intrinsic and say, "Ignore the last argument. We don't use it for this target."

But like I said above, this is much more a vague indication to me that we haven't got this right yet than a concrete problem.

andrew.w.kaylor added inline comments.Feb 3 2020, 5:26 PM
llvm/docs/LangRef.rst
15302

Nothing in this description makes it clear that the value should be negative for targets that don't support the explicit vector length argument.

What should happen if this value isn't negative for a target that doesn't support it?

15306

What happens if the value isn't in this range? Based on the comments below it seems that the effective mask created should be as if %evl were equal to W. If it is a non-constant value will the generated code somehow enforce that behavior?
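For reference, the effective-mask semantics discussed in this comment can be modeled in plain C++ (an illustrative sketch, not LLVM code; the name `vpAddModel` and the zero-fill of inactive lanes are assumptions made here for determinism, since the intrinsic leaves those lanes undefined):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative model of a vector-predicated add: lane i of the result is
// A[i] + B[i] iff Mask[i] is set and i < EVL, i.e. the effective mask is
// Mask[i] AND (i < EVL). Lanes failing the combined predicate are
// undefined per the intrinsic; this model fills them with 0 to stay
// deterministic.
std::vector<int32_t> vpAddModel(const std::vector<int32_t> &A,
                                const std::vector<int32_t> &B,
                                const std::vector<bool> &Mask,
                                uint32_t EVL) {
  assert(A.size() == B.size() && A.size() == Mask.size());
  std::vector<int32_t> R(A.size(), 0);
  for (std::size_t I = 0; I < A.size(); ++I)
    if (Mask[I] && I < EVL)
      R[I] = A[I] + B[I];
  return R;
}
```

With a 4-element vector, an all-true mask, and EVL = 2, only lanes 0 and 1 are computed.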

pengfei added a subscriber: pengfei.Feb 3 2020, 7:07 PM
simoll marked 2 inline comments as done.Feb 4 2020, 2:16 AM


To me all of this points in the direction that we'd want a standard mechanism for optional parameters. We have three options for this right now:
a) Come up with a novel, clean-slate scheme for optional parameters with default values (e.g. the constrained-fp default fp-env values, or i32 -1 for VP intrinsics). Some examples:

; all optional parameters set (eg RISC-V V or SX-Aurora)
%x = llvm.fp.add(%a, %b) mask(%mask), evl(%evl), fpenv(fpround.tonearest, fpexcept.strict)

; constrained fp SIMD intrinsic (eg AVX512)
%y = llvm.fp.add(%a, %b) mask(%mask), fpenv(fpround.tonearest, fpexcept.strict)

; unpredicated, default fp env fp add (isomorphic to the fadd instruction)
%z = llvm.fp.add(%a, %b)

This is the solution I'd want to see in LLVM to solve the optional-parameter problem once and for all for constrained fp *and* VP (we should look out for other potential applications and start an RFC).

b) OpBundles (imply side-effects unless overridden with callsite attributes) - clunky, though they could form the starting point for a).
c) A more exotic solution: add an extra opaque parameter and provide different intrinsics to produce the opaque value (I'm not aware of this being used anywhere right now). This is similar to the idea that @sdesmalen brought up (https://reviews.llvm.org/D69891#inline-632851), however with the produced mask having an opaque type such that it behaves basically like an optional parameter tuple, i.e.:

declare opaquety @llvm.vp.mask(<8 x i1>)
declare opaquety @llvm.vp.evlmask(<8 x i1>, i32 %evl)
declare <8 x float> @llvm.vp.fadd(%a, %b, opaquety mask)

For constrained fp, this could look as follows

declare opaquety @llvm.constrainedfp.fpenv(metadata, metadata)
declare opaquety @llvm.constrained.fadd(%a, %b, opaquety fpenv)

Note that we could allow chaining of these intrinsics (e.g. add a passthru opaquety parameter to the defining intrinsics):

opaquety %fpenv = llvm.constrainedfp.fpenv(opaquety undef, fpround.tonearest, fpexcept.strict)
opaquety %opt_params = llvm.vp.evlmask(%fpenv, <8 x i1> %mask, i32 %evl)
%c = llvm.fadd(%a, %b, %opt_params)

But before any of these solutions is adopted, I'd rather we model the optional parameters as standard, explicit parameters with default values. We can pursue the general solution to the optional-parameter problem in parallel.
What do you think?

llvm/docs/LangRef.rst
15302

Nothing in this description makes it clear that the value should be negative for targets that don't support the explicit vector length argument.

Good point. I'll add that to the description.

What should happen if this value isn't negative for a target that doesn't support it?

There is a lowering pass that folds %evl into the mask, just as for reduction intrinsics.
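For illustration, the fold that such a lowering pass performs can be sketched as a plain C++ model of the lane predicate (a hypothetical sketch, not the actual pass; the name `foldEVLIntoMask` is invented here):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of folding %evl into the mask: after expansion the operation
// uses NewMask[i] = Mask[i] AND (i < EVL) and no longer carries a
// separate vector-length argument.
std::vector<bool> foldEVLIntoMask(const std::vector<bool> &Mask,
                                  uint32_t EVL) {
  std::vector<bool> NewMask(Mask.size(), false);
  for (std::size_t I = 0; I < Mask.size(); ++I)
    NewMask[I] = Mask[I] && I < EVL;
  return NewMask;
}
```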

15306

We are discussing the case %evl >= W in the main RFC, right now https://reviews.llvm.org/D57504#1854330 .

I like your first proposal for optional parameters. It "feels right" to me, and I agree that it is an improvement on operand bundles. Obviously it would take some time to build consensus for something like that as a new IR construct. I can live with the explicit parameter for now if we can agree on rules for how it is defined to behave on targets where it does not apply.

simoll planned changes to this revision.Feb 12 2020, 12:43 AM
simoll added a subscriber: chandlerc.

Ok. This is a summary of the requested changes and i'd like to get your go ahead before "committing" them to the patch:

  1. Define what 'W' is (that is the vector length that %evl is compared against) - for static vector types this would be the number of elements of the vector, for scalable types W == MVL and the target is responsible for defining MVL.
  2. Define what happens when %evl > W - since we do not know now what is best, i propose we define this as UB. This leaves us some room to revisit that decision should it turn out that some form of defined behavior would be more reasonable.
  3. Mention that %evl is discouraged for non-VL targets - TTI tells people whether VL is supported or not for every target. If used nevertheless, the ExpandVectorPredicationPass will fold %evl into the mask. "not-using-%evl" means setting %evl = W or %evl = -1.
  4. %evl and %mask are parameters - i will start an RFC on llvm-dev towards optional function parameters (as sketched here under variant a): https://reviews.llvm.org/D69891#1856580).
  5. Rename to llvm.experimental.vp.* - Inserting experimental is the preferred way to introduce new intrinsics until they are stable (for technical reasons brought up by @chandlerc - https://reviews.llvm.org/D69891#1852795 ).

Ok. This is a summary of the requested changes and i'd like to get your go ahead before "committing" them to the patch:

  1. Define what 'W' is (that is the vector length that %evl is compared against) - for static vector types this would be the number of elements of the vector, for scalable types W == MVL and the target is responsible for defining MVL.

I object. The only sensible definition for the upper bound on %evl is the number of elements in the vector type in question (because %evl counts vector elements). We already have a concept of "number of elements in a vector type" that generalizes across all vector types (fixed-width and scalable), and it is quite target-independent: <vscale x $N x $elemtype> (where $N is a constant) has vscale * $N elements. The only target-dependent part is in the positive integer vscale, and its behavior is quite restricted. All of this is already stated in the LangRef (in the section on vector types), and there is no reason for VP intrinsics to repeat it, let alone deviate from it by e.g. introducing new terminology (MVL) or making provisions for more target-dependent behavior than actually exists. The VP instructions should just define their behavior in terms of the number of elements in the vector.


+1
I was complicating things. MVL == number of vector elements and the whole discussion about what MVL is and how it is read/set really is part of the scalable type/vscale discussion.

  1. Rename to llvm.experimental.vp.* - Inserting experimental is the preferred way to introduce new intrinsics until they are stable (for technical reasons brought up by @chandlerc - https://reviews.llvm.org/D69891#1852795 ).

I feel like I should mention that I don't know how Chandler feels about the use of "experimental" for these intrinsics. I wasn't trying to claim his approval of my suggestion, merely explaining that I thought the reasoning that led to the current naming of the constrained FP intrinsics probably applied here as well. After I made that suggestion @craig.topper pointed out to me that the "experimental" qualifier has a tendency to never go away. See, for example, vector add/reduce intrinsics. All this is to say that my suggestion is just a suggestion and I could be convinced to drop it if that is the consensus.


Well, to set this straight, I didn't mean to imply Chandler's approval. I just want to document what motivates the experimental prefix.


How about we stick with llvm.vp.fadd and go for llvm.vp.fadd.v2, etc when/if the intrinsics are updated?

vkmr added a subscriber: vkmr.Feb 17 2020, 8:02 AM


How about we stick with llvm.vp.fadd and go for llvm.vp.fadd.v2, etc when/if the intrinsics are updated?

I'm OK with that.

simoll updated this revision to Diff 246212.Feb 24 2020, 7:49 AM
simoll edited the summary of this revision. (Show Details)
  • added section that %evl is discouraged for non-evl targets
  • The VP intrinsic has UB if the effective EVL is out of bounds.
  • Use i32 -1 as special value for %evl (disabling %evl for this VP call).
efriedma added inline comments.
llvm/docs/LangRef.rst
15297

I assume for scalable vectors, the range of legal values for %evl is the runtime vector length? Maybe worth noting explicitly.

15310

I'm not happy with distinguishing between %evl equal to -1, vs. the IR constant -1. I mean, we can, but there are complications for other passes like, for example, constant hoisting. We have immarg for arguments that are *always* immediates, but we don't have a way to mark an argument that's *sometimes* an immediate.

I don't have any great suggestions here... but one idea is to add a second type overload to the intrinsics: i32 if the evl is present, {} if the evl is absent. So @llvm.vp.add.v4i32.i32 is an add with an evl parameter, and @llvm.vp.add.v4i32.sl_s is an add without an evl parameter. I think that expresses the semantics correctly, but maybe it's not so readable.

15330

"discouraged"

simoll marked 3 inline comments as done.Feb 24 2020, 5:15 PM

Thanks for the review!
I'll update shortly with an explicit mention of scalable vector types in the langref.

llvm/docs/LangRef.rst
15310

I'm not happy with distinguishing between %evl equal to -1, vs. the IR constant -1. I mean, we can, but there are complications for other passes like, for example, constant hoisting. We have immarg for arguments that are *always* immediates, but we don't have a way to mark an argument that's *sometimes* an immediate.

We could make TTI::getIntImmCostIntrin return 0 whenever i32 -1 is passed in VP intrinsics for evl.

/// Return the expected cost of materialization for the given integer
/// immediate of the specified type for a given instruction. The cost can be
/// zero if the immediate can be folded into the specified instruction.
int getIntImmCostIntrin(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
                        Type *Ty) const;

In any case, the special handling of i32 -1 is really a makeshift solution. Mid to long term, we want to make evl (and the mask) argument optional parameters (https://lists.llvm.org/pipermail/llvm-dev/2020-February/139110.html).

one idea is to add a second type overload to the intrinsics: i32 if the evl is present, {} if the evl is absent. So @llvm.vp.add.v4i32.i32 is an add with an evl parameter, and @llvm.vp.add.v4i32.sl_s is an add without an evl parameter.

This might solve the representation issue in the IR. However, doing so would break the 1:1 correspondence between instruction opcodes and VP intrinsics, which is a nice property to have when optimizing VP code (e.g. for the generalized pattern-matching logic in the VP reference patch).

efriedma added inline comments.Feb 24 2020, 5:35 PM
llvm/docs/LangRef.rst
15310

If you have a plan for this, that's fine.

simoll updated this revision to Diff 246352.Feb 24 2020, 5:42 PM
  • new LangRef wording
  • updated `VPIntrinsic::canIgnoreVectorLengthParam` to be in line with the i32 -1 constant being the off switch for evl.
simoll updated this revision to Diff 246444.Feb 25 2020, 7:03 AM

nfc: spelling, rebased

llvm/lib/IR/IntrinsicInst.cpp
278

Why are you doing this and not just "less than zero"?

simoll marked an inline comment as done.Feb 25 2020, 10:06 PM
simoll added inline comments.
llvm/lib/IR/IntrinsicInst.cpp
278

Because with the new LangRef wording %evl is unsigned and the i32 -1 constant is handled as a special case.

andrew.w.kaylor added inline comments.Mar 4 2020, 3:53 PM
llvm/lib/IR/IntrinsicInst.cpp
278

I see that you also discussed this with Eli. I don't like the fact that constant -1 and a value that is 0xFFFFFFFF at runtime could have different behaviors. There's no telling when the constant folder might find a path to convert a value to a constant. I understand that the intrinsics aren't supposed to be called with values outside the range of [0, W] and have undefined behavior if they are, but having an exception for a constant doesn't seem right.

As a completely contrived example, suppose I had IR like this:

bb1:
  br i1 %flag, label %bb2, label %bb3

bb2:
  %t1 = call <8 x i32> @llvm.vp.add.v8i32(<8 x i32> %i0, <8 x i32> %i1, <8 x i1> %m, i32 %n)
  br label %bb4

bb3:
  %t2 = call <8 x i32> @llvm.vp.add.v8i32(<8 x i32> %i0, <8 x i32> %i1, <8 x i1> %m, i32 -1)
  br label %bb4

bb4:
  %t3 = phi <8 x i32> [%t1, %bb2], [%t2, %bb3]
  ...

Now some pass is going to transform that into:

bb1:
  %evl = select i1 %flag, i32 %n, i32 -1
  %t3 = call <8 x i32> @llvm.vp.add.v8i32(<8 x i32> %i0, <8 x i32> %i1, <8 x i1> %m, i32 %evl)
  ...

And now the -1 case has undefined behavior.

simoll marked 4 inline comments as done.Mar 5 2020, 12:46 AM
simoll added inline comments.
llvm/lib/IR/IntrinsicInst.cpp
278

That's a problem, yes.
We could do away with the -1 special case entirely. Then, to disable evl, it has to be set to the number of vector elements. For a <W x ty> vector that's just the constant W; for scalable vector types it would be the appropriate vscale constant expression.

simoll updated this revision to Diff 249153.Mar 9 2020, 10:26 AM
  • Rebased
  • removed i32 -1 special case for EVL in VP intrinsics - 0 <= %evl < W is the only valid (== defined) setting now.

Ping. The i32 -1 special case is gone and canIgnoreVectorLengthParam works for both regular and scalable vector types. Are there any more comments or can this go in?

I'm satisfied with the functionality, but I'm not sure about the intrinsics having undefined behavior outside the [0, W] range. The way you've implemented it, it seems like the behavior would be predictable: if the evl argument is outside that range, it is ignored. Applying an unsigned value greater than W using the "%mask AND %EVLmask" approach also has this effect. Why not just make that the defined behavior?

llvm/lib/IR/IntrinsicInst.cpp
294

You can get here in one step with 'this->getModule()'

simoll marked an inline comment as done.Mar 12 2020, 1:48 AM

I'm satisfied with the functionality, but I'm not sure about the intrinsics having undefined behavior outside the [0, W] range. The way you've implemented it, it seems like the behavior would be predictable. If the evl argument is outside that range, it is ignored.

To directly lower VP to NEC SX-Aurora, %evl strictly needs to be within the range 0 to W or you get a hardware exception. Defining any behavior outside of that range thus implies additional instructions to restrict %evl to its bounds or to guard the VP op. Clearly we do not want that. At the same time, un-defining the behavior outside of that range does not hamper AVX512 code generation in any way.

Applying an unsigned value greater than W using the "%mask AND %EVLmask" also has this effect.

Semantically, yes. The difference is in the code generation.

[applying.. greater than W] Why not just make that the defined behavior?

When %evl is lowered to a mask there is still a risk of overflow in the comparison when the underlying vector type is widened (consider an operation on <256 x i8> elements with %evl == 258: when that operation is widened to <512 x i8> you need to do something about that %EVLmask or you'll get spurious active bits in the upper half). If that is UB to begin with, you do not need to consider it in the %EVLmask computation. So even non-AVL targets benefit from the strictly defined range for %evl.
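The widening hazard described above can be made concrete with a small numeric check (an illustrative sketch; `evlMask` is a name invented here, and the lane counts mirror the <256 x i8> to <512 x i8> example):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// An EVL mask computed as (i < EVL) over the widened lane count: with
// EVL == 258 on an op widened from 256 to 512 lanes, lanes 256 and 257
// become spuriously active. Guaranteeing EVL <= the original width (the
// UB rule) removes the need to clamp here.
std::vector<bool> evlMask(std::size_t Lanes, uint32_t EVL) {
  std::vector<bool> M(Lanes, false);
  for (std::size_t I = 0; I < Lanes; ++I)
    M[I] = I < EVL;
  return M;
}
```

With evlMask(512, 258), lanes 256 and 257 are active even though they do not exist in the original 256-lane operation.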

simoll updated this revision to Diff 249876.Mar 12 2020, 3:09 AM

NFC

  • Rebased
  • getModule()
andrew.w.kaylor accepted this revision.Mar 12 2020, 10:51 AM

OK. Since the behavior for out-of-range evl is target-dependent, undefined makes sense.

I don't know if you were waiting for input from anyone else, but this looks good to me.

This revision is now accepted and ready to land.Mar 12 2020, 10:51 AM

OK. Since the behavior for out-of-range evl is target-dependent, undefined makes sense.

I don't know if you were waiting for input from anyone else, but this looks good to me.

Great :) Thanks to everybody involved for reviewing!
We've iterated on this patch for a while and the last major update was weeks ago, giving people enough time to react to it. I'll commit this today.

The next patch will introduce the ExpandVectorPredicationPass, which folds %evl into the mask if requested through TLI.

fhahn added a comment.Mar 13 2020, 2:42 AM

OK. Since the behavior for out-of-range evl is target-dependent, undefined makes sense.

I don't know if you were waiting for input from anyone else, but this looks good to me.

Great :) Thanks to everybody involved for reviewing!
We've iterated on this patch for a while and the last major update was weeks ago, giving people enough time to react to it. I'll commit this today.

Given the scale of the change, it might be good to let people on llvm-dev know that this patch is ready to land and is about to be committed (and wait till early next week in case there's additional feedback).


Ok, let's wait until next week to give people more time to react. I will also update the reference patch in the main RFC noting that the integer VP patch was accepted.
However, I don't think it is necessary to bring this particular patch up on llvm-dev - there were several llvm-dev mails regarding VP in the past and we should have all interested parties as subscribers to the main RFC. Future patches are a completely different story - e.g. the VP-expansion pass and follow-up patches should be announced on llvm-dev to make sure everybody is aware of them.

simoll updated this revision to Diff 250173.Mar 13 2020, 3:57 AM
  • Rebased
  • Fixed off-by-one error in the LangRef (the lane at position i with i == %evl is False).
This revision was automatically updated to reflect the committed changes.
reames added a subscriber: reames.Mar 19 2020, 10:22 AM

Reading through the LangRef change post commit I was struck by this wording: "The VP intrinsic has undefined behavior if `%evl > W`."

This is a very strong interpretation and doesn't match what we do for arithmetic. One consequence of that difference is that hoisting a vp intrinsic out of a loop will not be legal unless we can prove the EVL is in bounds. You probably want some variant of a poison semantic propagation instead. Unless of course, real hardware happens to fault on out of bounds EVL in which case the current semantic is what you want. :)

Otherwise, nice work.


I brought this up in D57504 earlier, and @simoll said that some real hardware indeed faults in this case:

The VE target strictly requires VL <= MVL or you'll get a hardware exception.

lkcl added a subscriber: lkcl.Mar 20 2020, 3:26 AM