
[VP,Integer,#1] Vector-predicated integer intrinsics
Needs Review · Public

Authored by simoll on Nov 6 2019, 6:19 AM.

Details

Summary

This patch adds IR intrinsics for vector-predicated integer arithmetic.

It is subpatch #1 of the integer slice of LLVM-VP.
LLVM-VP is a larger effort to bring native vector-predication to LLVM.
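
For reference, each intrinsic takes a vector mask and an explicit vector length (EVL) operand in addition to the data operands. A minimal sketch of a declaration and call, based on the patch (operand values are placeholders):

declare <256 x i64> @llvm.vp.add.v256i64(<256 x i64>, <256 x i64>, <256 x i1>, i32)

%r = call <256 x i64> @llvm.vp.add.v256i64(<256 x i64> %a, <256 x i64> %b, <256 x i1> %mask, i32 %evl)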

Event Timeline

jdoerfert added inline comments. Tue, Nov 12, 7:01 PM
llvm/lib/IR/IntrinsicInst.cpp
306

I doubt this is a sensible default. I doubt one should return anything for a non-vp intrinsic.

337

Nit: braces.

llvm/unittests/IR/VPIntrinsicTest.cpp
21

Nit: using namespace llvm; is probably less confusing here

jdoerfert added inline comments. Tue, Nov 12, 10:48 PM
llvm/lib/IR/IntrinsicInst.cpp
263

Forget this one.

simoll planned changes to this revision. Wed, Nov 13, 12:32 AM
simoll marked 12 inline comments as done.

Thanks for the comments so far :)

llvm/include/llvm/IR/Intrinsics.td
1176

I would have argued we could even do the #include trick here, or maybe some .td magic, to avoid this replication, but I will not force anything.

I agree but refactoring Intrinsics.td is out of scope for this patch.

llvm/include/llvm/IR/VPIntrinsics.def
28

assuming all vp intrinsics so far have a unique corresponding scalar version.

They don't. That's why it's separate. There will be more cases for constrained fp, reduction and memory ops.

llvm/lib/IR/IntrinsicInst.cpp
244

Actually, the excessive vector length case should return None to imply that the static vector length could not be inferred.

Btw, later patches will reuse the lambda.

306

If there is no functional IR opcode this is still a call to an intrinsic function. It's natural to return Call in this case.

353

This particular function is clang-formatted already. The existing code in IntrinsicInst.cpp is not. I'll do a patch clang-formatting the entire file after this one.

simoll updated this revision to Diff 229029. (Edited) Wed, Nov 13, 1:53 AM
simoll marked an inline comment as done.

Changes

  • Cleanup, addressed nitpicks
  • Assume vlen cannot be ignored for the corner case of an excessive static vector length (safe default; pass -1 to enable all lanes in those cases, should they ever occur).
  • Made integer intrinsics nosync. Will consider an explicit nofree for VP intrinsics with memory args if the intrinsic isn't readnone already.
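
For illustration, the all-lanes-enabled case from the last bullet would look like this in IR (a hypothetical example; with a vector length of -1, only %mask disables lanes):

%r = call <8 x i32> @llvm.vp.add.v8i32(<8 x i32> %a, <8 x i32> %b, <8 x i1> %mask, i32 -1)
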
samparker added inline comments. Wed, Nov 13, 2:08 AM
llvm/lib/IR/IntrinsicInst.cpp
216

Ah, sorry, I had missed this was for constant lengths!

sstefan1 added inline comments. Wed, Nov 13, 2:24 AM
llvm/include/llvm/IR/Intrinsics.td
1176

I kind of started some of this in D65377 and the approach I will try and propose is an opt-out list.

If I understood correctly I should try to do that part first (and propose). @jdoerfert correct me if I'm wrong.

Hopefully I'll start this today.

jdoerfert added inline comments. Wed, Nov 13, 8:28 AM
llvm/include/llvm/IR/Intrinsics.td
1176

@simoll Agreed. This is out of scope.
@sstefan1 I think your plan sounds good.

llvm/include/llvm/IR/VPIntrinsics.def
2

now too long and wrapped :(
No need for llvm/, which is confusing as it lives in llvm/IR.
Also "Instructions" might be a bit to broad? Idk, maybe: "vector intrinsic descriptions"

28

I see. Though, even if later intrinsics do not correspond to scalar instructions, the changes would still be possible. I admit, at this point it is a question of whether you like:

REGISTER_VP_INTRINSIC(vp_add, 2, 3)
HANDLE_VP_TO_OC(vp_add, Add)
REGISTER_VP_INTRINSIC(vp_red, 2, 3)

better or worse than

REGISTER_VP_FROM_SCALAR_INTRINSIC(Add, 2, 3)
REGISTER_VP_INTRINSIC(Red, 2, 3)

with the appropriate definition of the above:

#define REGISTER_VP_FROM_SCALAR_INTRINSIC(OC, MP, VP) \
  REGISTER_VP_INTRINSIC(OC, MP, VP) \
  HANDLE_VP_TO_OC(vp_##OC, OC)
llvm/lib/IR/IntrinsicInst.cpp
244

Fine with me.

306

not really convinced but OK.

simoll planned changes to this revision. Wed, Nov 13, 9:01 AM
simoll marked 14 inline comments as done.
simoll added inline comments.
llvm/include/llvm/IR/VPIntrinsics.def
2
//===-- IR/VPInstruction.def - Describes llvm.vp.* Intrinsics -*- C++ -*-===//
28

I prefer the explicit form

REGISTER_VP_INTRINSIC(vp_add, 2, 3)
HANDLE_VP_TO_OC(vp_add, Add)
  • The explicit form gives you much nicer diffs.
  • You don't have to read a meta macro to understand what is going on.
  • This file will undergo changes with the next patches. I wouldn't spend too much time compacting it before it has "settled down".
simoll updated this revision to Diff 229116. Wed, Nov 13, 9:05 AM
simoll marked 2 inline comments as done.

Finally fixed the IR/VPIntrinsics.def header line.

simoll updated this revision to Diff 229118. Wed, Nov 13, 9:07 AM

VPInstruction.def -> VPIntrinsics.def -.-

I think I'm fine with this. Anyone else?

llvm/include/llvm/IR/VPIntrinsics.def
28

OK.

+1 from me! Thanks for doing this!

llvm/docs/LangRef.rst
14668

Nit: instead of talking about the MSB being zero, is it simpler to say that it is effective if it is a positive number (if that's what you mean)?

14738

Nit: please ignore if you disagree, but perhaps shorter/simpler is:

The first two operands and the result have the same vector of integer type.

->

The operands and the result are integer vector types.

(this applies to all/most descriptions)

llvm/lib/IR/IntrinsicInst.cpp
227

Nit: don't think this lambda adds anything, it's just called once.

simoll planned changes to this revision. Thu, Nov 14, 1:29 AM
simoll marked 4 inline comments as done.

+1 from me! Thanks for doing this!

Thanks for reviewing! I'll upload a final update for the LangRef.

llvm/docs/LangRef.rst
14668

Good point. I'll rewrite that part referring to the vlen as a signed integer.

14738

I'd like to shorten that. However, there are always mask and vlen operands so we cannot say that all operands have an integer vector type.

llvm/lib/IR/IntrinsicInst.cpp
227

It'll be called twice in the final version (see reference patch).

simoll updated this revision to Diff 229249. Thu, Nov 14, 2:11 AM
simoll marked an inline comment as done.

LangRef rewording: explicit vlen is a signed integer.

Thanks for your efforts to move this forward @simoll!

llvm/docs/LangRef.rst
14664

I expect this to also work for scalable vectors, so maybe add a case here for <vscale x W x T> as well?

14666

nit: bit -> i1

14668

Thanks, I wondered something similar.

I had actually expected the %evl to only make sense in the context of scalable vectors?

Is it worth expressing what it means if %evl is larger than the size of the vector?

14670

nit: 80 char limit (also for other places in this file)

14702

Have you considered adding an extra argument to specify the value of the false lanes, similar to how this is currently done for llvm.masked.load?
By passing undef, it would have similar behaviour to the current definition, yet it would remove the need to add explicit select statements for handling merging or zeroing predication if an instruction supports it. For SVE, for example, most instructions have either merging or zeroing predication, and the intrinsics expose the merging/zeroing/undef predication directly in the C/C++ intrinsic API. I think it would be really nice to have that capability represented in the vp intrinsics as well.
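
For comparison, llvm.masked.load carries such a passthru as its final operand; a rough sketch (types and names are illustrative):

declare <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>*, i32, <8 x i1>, <8 x i32>)
%v = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* %p, i32 4, <8 x i1> %mask, <8 x i32> %passthru)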

simoll planned changes to this revision. Thu, Nov 14, 7:00 AM
simoll marked 5 inline comments as done.
  • Adding scalable type example declarations
  • Reformatting LangRef for character limit
llvm/docs/LangRef.rst
14668

I had actually expected the %evl to only make sense in the context of scalable vectors?

EVL does not require scalable vectors. We'll use it with the <256 x double> type for example.

14702

Yes, I considered that. However, as you suggested, you can express this without loss with a select, and I'd like to keep the intrinsics simple.

simoll updated this revision to Diff 229306. Thu, Nov 14, 7:06 AM

Thanks for your feedback!

Formatted LangRef for char limit

jdoerfert added inline comments. Thu, Nov 14, 8:48 AM
llvm/docs/LangRef.rst
14664

If that support is added later, I would add the wording later too.

14702

FWIW, I like the idea of simple intrinsics and explicit selection.

sdesmalen added inline comments. Thu, Nov 14, 9:38 AM
llvm/docs/LangRef.rst
14664

@simoll is anything special needed to support scalable vectors in this context?

14702

If we want to keep the vp intrinsics simple, shouldn't we reconsider the need for %evl as an explicit parameter?

That value can be folded into the predicate using a special intrinsic that enables the lanes upto %evl, e.g.

<256 x i1> llvm.vp.enable.lanes(<256 x i1> %p, i32 %evl)

If any operation needs %evl as an explicit parameter, it would be a matter of having a pattern that matches the enable.lanes intrinsic, similar to how merging/zeroing predication can be implemented by matching the selects.
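
A sketch of how that proposal would compose with the patch's intrinsics (llvm.vp.enable.lanes is the hypothetical intrinsic suggested above, not part of this patch):

%m = call <256 x i1> @llvm.vp.enable.lanes(<256 x i1> %p, i32 %evl)
%r = call <256 x i64> @llvm.vp.add.v256i64(<256 x i64> %a, <256 x i64> %b, <256 x i1> %m, i32 -1)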

llvm/include/llvm/IR/IntrinsicInst.h
232

Can this be renamed to getVectorLength() and have it return ElementCount instead? (at which point we can drop the Optional)

simoll updated this revision to Diff 229490. Fri, Nov 15, 2:57 AM
simoll marked 4 inline comments as done.
  • ElementCount getStaticVectorLength()
  • Unit test for canIgnoreVectorLength on scalable examples
simoll added inline comments. Fri, Nov 15, 3:00 AM
llvm/docs/LangRef.rst
14664

@simoll is anything special needed to support scalable vectors in this context?

The intrinsics support scalable vectors out of the box. I've added VP intrinsics with scalable types to the individual operations below.

14702

If any operation needs %evl as an explicit parameter, it would be a matter of having a pattern that matches the enable.lanes intrinsic, similar to how merging/zeroing predication can be implemented by matching the selects.

The SX-Aurora and RISC-V V extensions natively support an explicit vector length parameter just as they support vector predication. Mask and EVL should be treated the same: if there is a vector mask parameter, there should also be an EVL parameter.

llvm/include/llvm/IR/IntrinsicInst.h
232

Sure. That actually makes it simpler.

sdesmalen added inline comments. Fri, Nov 15, 5:49 AM
llvm/docs/LangRef.rst
14702

I understand that certain instructions take an explicit vector length parameter, but that doesn't really explain why an arch-independent intrinsic needs to expose this. Is there any reason that modelling this using something like llvm.vp.enable.lanes won't work?

llvm/include/llvm/IR/IntrinsicInst.h
232

Awesome, thanks!

simoll marked an inline comment as done. Fri, Nov 15, 7:07 AM
simoll added inline comments.
llvm/docs/LangRef.rst
14702

That's already two architectures that do support EVL on all vector instructions, and it is vital to model it. EVL has performance implications (smaller EVL -> lower latency; no need to compute and store a mask where EVL can be used). For SVE/MVE, you can simply turn the feature off by passing i32 -1.

From the perspective of RVV and SX-Aurora, saying we do not need EVL is equivalent to saying we do not need vector predication.

cameron.mcinally added inline comments.
llvm/docs/LangRef.rst
14702

Yes, I considered that. However, as you suggested, you can express this without loss with a select, and I'd like to keep the intrinsics simple.

Having an explicit passthru is a good idea. Keeping the select glued to the intrinsics may be difficult. For example:

select %mask, %r, undef

It would be easy for a select peep to replace this with %r, since the masked bits are undefined. But for a situation like a trapping instruction, we wouldn't want the masked elements to execute.

simoll marked an inline comment as done. Mon, Nov 18, 9:27 AM
simoll added inline comments.
llvm/docs/LangRef.rst
14702

> select %mask, %r, undef

This is the current behavior: the values on masked-off lanes (by EVL or the mask) are undef.

My problem with passthru is that it encourages people to specify arbitrary values on masked-off lanes and to do so often. Yes, for some combination of target architecture, instruction, passthru value and type it might be a perfect match for one machine instruction. But for most combinations you will have to emit an extra select anyway.

Besides, if the select is explicit and can be folded into another operation to simplify it, then that is actually a good thing.
But there is more: for things like coalescing predicated registers and vector register spilling, it is helpful to have undef-on-masked-off and a separate select instruction that can be re-scheduled.
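
For illustration, the separate select described here is a single instruction on the intrinsic's result (a sketch; %passthru is whatever merge value the front-end wants):

%r = call <256 x i64> @llvm.vp.add.v256i64(<256 x i64> %a, <256 x i64> %b, <256 x i1> %mask, i32 %evl)
%merged = select <256 x i1> %mask, <256 x i64> %r, <256 x i64> %passthru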

sdesmalen added inline comments. Mon, Nov 18, 10:26 AM
llvm/docs/LangRef.rst
14702

If we'd use a select instruction to implement the zeroing/merging predication, the issue with the current set of intrinsics is that we don't know how many lanes to select because the predicate mask is no longer the determining factor to limit which lanes are the valid lanes. Suggesting that other architectures can just turn off the feature by setting the %evl parameter to -1 doesn't change the fact that the code generator still needs to handle the IR if %evl is not -1. It would be valid IR and there is nothing that guarantees LLVM won't generate the generic intrinsics with a value that is not -1. This is not really special to select, but common to other operations that take a mask as well. For our case it means we can't implement zeroing/merging predication with this set of intrinsics using select.

If we want to solve the select issue and also keep the intrinsics simple, my suggestion was to combine the explicit vector length with the mask using an explicit intrinsic like @llvm.vp.enable.lanes. Because this is an explicit intrinsic, the code-generator can simply extract the %evl parameter and pass that directly to the instructions for RVV/SXA. This is what happens for many other intrinsics in LLVM already, like masked.load/masked.gather that support only a single addressing mode, where it is up to the code-generator to pick apart the value into operands that are suited for a more optimal load instruction.

Without having heard your thoughts on this suggestion, I would have to guess that your reservation is the possibility of LLVM hoisting/separating the logic that merges predicate mask and %evl value in some way. That would mean having to do some tricks (think CodeGenPrep) to keep the values together and recognizable for CodeGen. And that's the exact same thing we would like to avoid for supporting merging/zeroing predication, hence the suggestion for the explicit passthru parameter.

14702

My problem with passthru is that it encourages people to specify arbitrary values on masked-off lanes and to do so often. Yes, for some combination of target architecture, instruction, passthru value and type it might be a perfect match for one machine instruction. But for most combinations you will have to emit an extra select anyway.

The most common passthru values in practice will be one of the source operands or zero/undef though. These are supported by SVE/SVE2 as a first-class feature (and I could imagine other/future vector architectures having special support for that as well). Other values can indeed be lowered to an explicit select instruction.

But there is more: for things like coalescing predicated registers and vector register spilling, it is helpful to have undef-on-masked-off and a separate select instruction that can be re-scheduled.

At that point the intrinsic will already be lowered to individual instructions, so I don't think the definition of the intrinsic has much of an impact on that.

llvm/docs/LangRef.rst
14702

Yes, you're right. Having an implicit undef and explicit merge is good.

simoll marked an inline comment as done. Mon, Nov 18, 10:54 AM
simoll added inline comments.
llvm/docs/LangRef.rst
14702

Suggesting that other architectures can just turn off the feature by setting the %evl parameter to -1 doesn't change the fact that the code generator still needs to handle the IR if %evl is not -1

Other targets do not need to consider %evl because there will be an ExpandVectorPredication pass that folds the EVL into the vector mask. Backends can request through TTI that the %evl be folded away and ignore that parameter from there on.

An early version of that pass is included in the VP reference patch. The TTI changes and the expansion pass are planned for the next patch after this one.
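
A minimal sketch of such a fold for a fixed-width type (the actual expansion in the reference patch may differ; note that icmp ult also handles %evl == -1, since -1 compares as the maximum unsigned value):

; splat %evl, compare against the lane indices, then AND with %mask
%evl.ins = insertelement <4 x i32> undef, i32 %evl, i32 0
%evl.splat = shufflevector <4 x i32> %evl.ins, <4 x i32> undef, <4 x i32> zeroinitializer
%evl.mask = icmp ult <4 x i32> <i32 0, i32 1, i32 2, i32 3>, %evl.splat
%new.mask = and <4 x i1> %evl.mask, %mask
%r = call <4 x i32> @llvm.vp.add.v4i32(<4 x i32> %a, <4 x i32> %b, <4 x i1> %new.mask, i32 -1)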

SjoerdMeijer added inline comments. Fri, Nov 22, 7:54 AM
llvm/docs/LangRef.rst
14702

Don't think this discussion reached consensus, and I missed the nuances here earlier / changed my mind on it.

The prior art on this also uses an explicit passthru, and it looks more generic than patching that up with a select later; you can always pass an undef in?

I also read the description of %evl on line 14664 above again, and thought the definition could be tightened a bit, i.e. specify an upper bound, also just for clarity:

The explicit vector length parameter always has the type i32 and is a signed integer value. The explicit vector length is only effective if it is non-negative. Results are only computed for enabled lanes. A lane is enabled if the mask at that position is true and, if effective, where the lane position is below the explicit vector length.

And then started drafting something to see if I could describe the proposed behaviour:

The explicit vector length parameter always has the type i32 and is a signed integer value. The explicit vector length (%evl) is only effective if it is non-negative, and when that is the case, the effective vector length is used to calculate a mask, and its value is:

0 <= evl <= W,   where W is the vector length.

This creates a mask, %EVLmask, with all elements 0 <= i < evl set to True, and all other lanes evl <= i < W set to False.

A new mask %M is calculated with an element-wise AND from %mask and %EVLmask:

%M = %mask AND %EVLmask

A vector operation <opcode> on vectors A and B calculates:

A <opcode> B =  {  A[i] <opcode> B[i]   if M[i] = T, and
                {  undef otherwise

and if I'm not mistaken, we are now discussing whether undef here should remain undef or be a passthru
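
A worked instance of that definition (assuming the drafted semantics; values chosen for illustration), with W = 4 and %evl = 2:

%r = call <4 x i32> @llvm.vp.add.v4i32(<4 x i32> <i32 1, i32 2, i32 3, i32 4>,
                                       <4 x i32> <i32 5, i32 6, i32 7, i32 8>,
                                       <4 x i1> <i1 1, i1 0, i1 1, i1 1>, i32 2)
; lane 0: 1 + 5 = 6  (mask on, below evl)
; lane 1: undef      (mask off)
; lane 2: undef      (mask on, but at/above evl)
; lane 3: undef      (above evl)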

14715

nit: perhaps more consistent %avl -> %evl ?

rkruppe added inline comments. Sun, Nov 24, 7:34 AM
llvm/docs/LangRef.rst
14702

Sjoerd's summary matches my understanding of the question here.

I believe so far nobody spelled out that there's actually a material difference between a passthru argument and a separate select. The mask that is readily available for the select (the %mask passed to the VP instruction) isn't the same as the "enabled lanes" computed from mask and EVL (%M in Sjoerd's exposition above), which is extremely undesirable to materialize on machines with a native equivalent of EVL (such as SX-Aurora and RISC-V).

In particular, if %mask is computed by another VP instruction, then lanes of %mask past the relevant EVL are undef, not false. That means a select like select %mask, %v, zeroinitializer does not actually make the disabled lanes zero; past EVL they are again undef (assuming %v and %mask were computed by VP instructions with the same EVL). In contrast, a passthru value would (presumably, I didn't see anyone object) apply to all disabled lanes.

I don't know how much this matters, since usually consumers of the vector will also only work up to the same EVL. But there are exceptions, such as the code pattern for vectorizing (associative or -ffast-math) reductions for RISC-V V. Briefly put, that pattern is to accumulate a vector of partial (e.g.) sums in the loop, preserving lanes past the current EVL (which may be smaller than the EVL in some previous iterations) so they'll still participate in the final reduction after the loop. This can be expressed perfectly with passthru.

Without passthru, that kind of code seems to require juggling two EVLs (the actual current EVL for this iteration, and the maximum EVL across all iterations) as well as materializing what Sjoerd called %EVLmask and %M above. This seems much harder for a backend to pattern-match and turn into acceptable code than the select %m, vp.something(%a, %b, %m, -1), zeroinitializer pattern.
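
For illustration, the no-passthru variant of that reduction pattern needs the combined mask materialized so the previous partial sums survive on disabled lanes (a sketch with hypothetical names; %enabled stands for %mask AND'ed with the materialized EVL mask):

%sum.new = call <4 x i32> @llvm.vp.add.v4i32(<4 x i32> %sum, <4 x i32> %x, <4 x i1> %mask, i32 %evl)
%sum.next = select <4 x i1> %enabled, <4 x i32> %sum.new, <4 x i32> %sum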

I previously felt that passthru would be nice to have for backend maintainers (including myself) but perhaps not worth the duplication of IR functionality (having two ways to do selects). However, given the differences I just described, I don't think "just use select" is workable.

That doesn't necessarily mean we need passthru arguments on every VP instruction, I could also imagine adding a vp.passthru(%v, %passthru, %mask, %evl) intrinsic that takes from the first operand for all enabled lanes and the second for the other lanes. But I feel like once the floodgates are opened for doing merging "differently", it's kind of a toss-up whether it's simpler to create a special intrinsic for it or make it part of the general recipe for VP instructions.

simoll marked 2 inline comments as done. Mon, Nov 25, 2:41 AM

Moved the discussion to the RFC.

llvm/docs/LangRef.rst
14702
A <opcode> B =  {  A[i] <opcode> B[i]   if M[i] = T, and
                {  undef otherwise

That's an accurate description of %evl semantics.

14715

Yes. I'll change that to %evl.