
RFC: Prototype & Roadmap for vector predication in LLVM
Needs Review · Public

Authored by simoll on Jan 31 2019, 3:12 AM.

Details

Summary

Vector Predication Roadmap

This proposal defines a roadmap towards native vector predication in LLVM, specifically for vector instructions with a mask and/or an explicit vector length.
LLVM currently has no target-independent means to model predicated vector instructions for modern SIMD ISAs such as AVX512, ARM SVE, the RISC-V V extension and NEC SX-Aurora.
Only some predicated vector operations, such as masked loads and stores, are available through intrinsics [MaskedIR]_.
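
To make the masked-load semantics concrete, here is a small scalar reference model in the spirit of llvm.masked.load (a sketch for illustration, not LLVM code): lanes whose mask bit is clear are never read from memory and instead take the corresponding passthru value.

```cpp
#include <array>
#include <cstddef>

// Scalar reference model of a masked load: inactive lanes are not
// dereferenced at all; they take the passthru value instead.
template <std::size_t N>
std::array<int, N> masked_load(const int *ptr, const std::array<bool, N> &mask,
                               const std::array<int, N> &passthru) {
  std::array<int, N> result{};
  for (std::size_t i = 0; i < N; ++i)
    result[i] = mask[i] ? ptr[i] : passthru[i];
  return result;
}
```

The key property is on the memory side: a masked-off lane may point at unmapped memory without faulting, which a plain load+select cannot guarantee.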

Please use docs/Proposals/VectorPredication.rst to comment on the summary.

Vector Predication intrinsics

The prototype in this patch demonstrates the following concepts:

  • Predicated vector intrinsics with an explicit mask and vector length parameter on IR level.
  • First-class predicated SDNodes on ISel level. Mask and vector length are value operands.
  • An incremental strategy to generalize PatternMatch/InstCombine/InstSimplify and DAGCombiner to work on both regular instructions and VP intrinsics.
  • DAGCombiner example: FMA fusion.
  • InstCombine/InstSimplify example: FSub pattern re-writes.
  • Early experiments on the LNT test suite (Clang static release, O3 -ffast-math) indicate that compile time on non-VP IR is not affected by the API abstractions in PatternMatch, etc.
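
The two predication operands can be summarized with a scalar reference model (an illustrative sketch; the exact prototype signatures and the policy for disabled lanes live in the patch, not here): a lane produces a defined result only when its index is below the explicit vector length and its mask bit is set.

```cpp
#include <array>
#include <cstddef>

// Reference model of a vp.fdiv-style operation with a mask and an
// explicit vector length (evl). Disabled lanes keep a neutral 0.0 in
// this model; in IR their content would be unspecified.
template <std::size_t N>
std::array<double, N> vp_fdiv(const std::array<double, N> &a,
                              const std::array<double, N> &b,
                              const std::array<bool, N> &mask,
                              std::size_t evl) {
  std::array<double, N> result{};
  for (std::size_t i = 0; i < N && i < evl; ++i)
    if (mask[i])
      result[i] = a[i] / b[i]; // the divide never executes on disabled lanes
  return result;
}
```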

Roadmap

Drawing from the prototype, we propose the following roadmap towards native vector predication in LLVM:

1. IR-level VP intrinsics

  • There is a consensus on the semantics/instruction set of VP intrinsics.
  • VP intrinsics and attributes are available on IR level.
  • TTI has capability flags for VP (`supportsVP()`?, `haveActiveVectorLength()`?).

Result: VP usable for IR-level vectorizers (LV, VPlan, RegionVectorizer), potential integration in Clang with builtins.

2. CodeGen support

  • VP intrinsics translate to first-class SDNodes (`llvm.vp.fdiv.* -> vp_fdiv`).
  • VP legalization (legalize explicit vector length to mask (AVX512), legalize VP SDNodes to pre-existing ones (SSE, NEON)).

Result: Backend development based on VP SDNodes.
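
The AVX512-style legalization mentioned above is mechanical: a target with masking but no explicit vector length can fold the EVL into the mask, because "lane i is active" means (i < evl) && mask[i]. A sketch (illustrative only):

```cpp
#include <array>
#include <cstddef>

// Fold an explicit vector length into the mask: after this, the EVL can
// be treated as the full vector width and only the mask remains.
template <std::size_t N>
std::array<bool, N> fold_evl_into_mask(const std::array<bool, N> &mask,
                                       std::size_t evl) {
  std::array<bool, N> folded{};
  for (std::size_t i = 0; i < N; ++i)
    folded[i] = (i < evl) && mask[i];
  return folded;
}
```

Going the other way (legalizing VP nodes for SSE/NEON) means dropping the predication and re-introducing it as selects, which is only valid when the unpredicated operation cannot trap.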

3. Lift InstSimplify/InstCombine/DAGCombiner to VP

  • Introduce PredicatedInstruction, PredicatedBinaryOperator, etc. helper classes that match standard vector IR and VP intrinsics.
  • Add a matcher context to PatternMatch and context-aware IR Builder APIs.
  • Incrementally lift DAGCombiner to work on VP SDNodes as well as on regular vector instructions.
  • Incrementally lift InstCombine/InstSimplify to operate on VP as well as regular IR instructions.

Result: Optimization of VP intrinsics on par with standard vector instructions.
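
The matcher-context idea can be sketched with toy types (all names here are illustrative stand-ins, not the actual PatternMatch API): the same pattern code is instantiated once with a context that accepts any operation and once with a context that additionally requires the mask/EVL operands to line up.

```cpp
// Toy stand-in for an instruction: an opcode plus an id for its mask.
struct Op {
  int opcode;
  int mask_id;
};

// Context for plain IR: nothing extra to check.
struct EmptyContext {
  bool accept(const Op &) const { return true; }
};

// Context for predicated IR: every matched op must carry the same
// predicate as the root of the pattern.
struct PredicatedContext {
  int mask_id;
  bool accept(const Op &op) const { return op.mask_id == mask_id; }
};

// A pattern templatized over the context, as in PatternMatch.h: match an
// "add" (opcode 1 in this toy encoding) only if the context agrees.
template <typename Ctx>
bool match_add(const Op &op, const Ctx &ctx) {
  return op.opcode == 1 && ctx.accept(op);
}
```

The same rewrite code then works on both regular instructions and VP intrinsics; only the context instantiation differs.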

4. Deprecate llvm.masked.* / llvm.experimental.reduce.*

  • Modernize llvm.masked.* / llvm.experimental.reduce.* by translating them to VP.
  • DCE transitional APIs.

Result: VP has superseded earlier vector intrinsics.

5. Predicated IR Instructions

  • Vector instructions have an optional mask and vector length parameter. These lower to VP SDNodes (from Stage 2).
  • Phase out VP intrinsics, only keeping those that are not equivalent to vectorized scalar instructions (reductions, shuffles, etc.).
  • InstCombine/InstSimplify expect predication in regular Instructions (Stage (3) has laid the groundwork).

Result: Native vector predication in IR.

References

.. [MaskedIR] llvm.masked.* intrinsics, https://llvm.org/docs/LangRef.html#masked-vector-load-and-store-intrinsics
.. [EvlRFC] Explicit Vector Length RFC, https://reviews.llvm.org/D53613

Event Timeline

simoll retitled this revision from RFC: EVL Prototype & Roadmap for vector predication in LLVM to RFC: Prototype & Roadmap for vector predication in LLVM. Feb 13 2019, 7:39 AM
simoll edited the summary of this revision. (Show Details)

Renamed EVL to VP

programmerjake resigned from this revision. Feb 13 2019, 11:54 AM
rengolin edited reviewers, added: mkuper, rkruppe, fhahn; removed: programmerjake. Feb 14 2019, 2:57 AM
rengolin added subscribers: Ayal, hsaito.
chill added a subscriber: chill. Mar 16 2019, 2:01 AM
simoll updated this revision to Diff 191252. Mar 19 2019, 12:10 AM
  • re-based onto master
simoll updated this revision to Diff 195366. Apr 16 2019, 6:39 AM

Updates

  • added constrained fp intrinsics (IR level only).
  • initial support for mapping llvm.experimental.constrained.* intrinsics to llvm.vp.constrained.*.


Do we really need both vp.fadd() and vp.constrained.fadd()? Can't we just use the latter with rmInvalid/ebInvalid? That should prevent vp.constrained.fadd from losing optimizations w/o good reasons.
Do we have enough upside in having both?


Do we really need both vp.fadd() and vp.constrained.fadd()? Can't we just use the latter with rmInvalid/ebInvalid? That should prevent vp.constrained.fadd from losing optimizations w/o good reasons.

According to the LLVM langref, "fpexcept.ignore" seems to be the right option for exceptions whereas there is no "round.permissive" option for the rounding behavior. Abusing rmInvalid/ebInvalid seems hacky.

Do we have enough upside in having both?

I see no harm in having both since we already add the infrastructure in LLVM-VP to abstract away from specific instructions and/or intrinsics. Once (if ever) exception, rounding mode become available for native instructions (or can be an optional tag-on like fast-math flags), we can deprecate all constrained intrinsics and use llvm.vp.fdiv, etc or native instructions instead.


There is an indirect harm in adding more intrinsics with partially-redundant semantics: writing transformations and analyses requires logic that handles both forms. I recommend having fewer intrinsics where we can have fewer intrinsics.

Do we really need both vp.fadd() and vp.constrained.fadd()? Can't we just use the latter with rmInvalid/ebInvalid? That should prevent vp.constrained.fadd from losing optimizations w/o good reasons.

According to the LLVM langref, "fpexcept.ignore" seems to be the right option for exceptions whereas there is no "round.permissive" option for the rounding behavior. Abusing rmInvalid/ebInvalid seems hacky.

Then, please propose one more rounding mode, like round.permissive or round.any.


Yep. If one additional generally-useful rounding mode gets rid of several partially redundant intrinsics, that would be a good trade-off.

According to the LLVM langref, "fpexcept.ignore" seems to be the right option for exceptions whereas there is no "round.permissive" option for the rounding behavior. Abusing rmInvalid/ebInvalid seems hacky.

If you use "round.tonearest" that will get you the same semantics as the non-constrained version. The optimizer assumes round-to-nearest by default.

kpn added a subscriber: kpn. Apr 17 2019, 12:27 PM

Would it make sense to also update docs/AddingConstrainedIntrinsics.rst please?

simoll planned changes to this revision. Apr 18 2019, 1:33 AM

Thanks for your feedback!

Planned

  • Make the llvm.vp.constrained.* versions the only fp ops in vp. Encode default fp semantics by passing fpexcept.ignore and round.tonearest.
  • Update docs/AddingConstrainedIntrinsics.rst to account for the fact that llvm.experimental.constrained.* is no longer the only namespace for constrained intrinsics.
In D57504#1470705, @kpn wrote:

Would it make sense to also update docs/AddingConstrainedIntrinsics.rst please?

Sure. I don't think we should match (in the API) an llvm.vp.constrained.* intrinsic as a ConstrainedFPIntrinsic, though.
Conceptually, an llvm.vp.constrained.* intrinsic is both a VPIntrinsic and a ConstrainedFPIntrinsic. If the latter is used to transform them, ignoring the mask and vector length arguments along the way, we will see breakage (in the future, once there are transforms for constrained fp).

vkmr added a subscriber: vkmr. May 7 2019, 7:55 AM
simoll added a comment. Aug 7 2019, 2:09 AM

This is a "Keepalive" message - I will get back working on LLVM-VP in October.

Nice. By the way, another motivation could be std::simd: the overflow intrinsics, exposed as builtins, would allow us to provide a fast implementation of the masked variants of <simd>.

simoll added a comment. Oct 7 2019, 4:45 AM

Picking this up again. I will begin by changing the VP intrinsics as outlined before, with one deviation from the earlier plan:

  • There will be no llvm.vp.constrained.*, just llvm.vp.*, and all FP intrinsics will have an exception-mode and rounding-mode parameter.

This work was mentioned on the SVE discussion about predication, adding arm folks, just in case.

<<same mail sent to llvm-dev>>

Who is interested in a round table on vector predication at the '19 US DevMtg and/or would like to help organize one? There were some proposals for related round tables on the mailing list but not all of them have a time slot yet (VPlan, SVE, complex math, constrained fp, ..). I am eyeing the Wednesday, 11:55 slot, so please let me know if there is a schedule conflict I am not aware of.

Potential Topics:

  • Intersection with constrained-fp intrinsics and backend support (also complex arith).
  • Design of predicated reduction intrinsics (intersection with llvm.experimental.reduce[.v2].*).
  • Compatibility with SVE LLVM extension.
  • <Your topic here>
shawnl added a subscriber: shawnl. Oct 24 2019, 6:37 PM

Are predicated vector instructions not just a special case of DemandedBits? Why can't we leave out the .vp. intrinsics and just generate the predicate with DemandedBits? That way you do a predicated vector operation like so (in Zig). As the example makes clear, this optimization would have to be guaranteed in order for the generated code to be correct (as the predicate avoids a divide-by-zero error).

var notzero = v != 0;
if (std.vector.any(notzero)) {
    v = std.vector.select(5 / v, v, notzero);
}


What you describe is a workaround but not a solution for predicated SIMD in LLVM.
This approach may seem natural for SIMD ISAs, such as x86 SSE and ARM NEON, that do not have predication.
However, it is a bad fit for SIMD instruction sets that do support predicated SIMD (AVX512, ARM SVE, RISC-V V, NEC SX-Aurora).

As it turns out, it is more robust to have predicated instructions right in LLVM IR and convert them to the instruction+select pattern for SSE and friends than going the other way round.
This is what LLVM-VP proposes.
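
The difference can be made concrete with a scalar sketch (illustrative only): with a trapping operation such as integer division, the predicate must suppress the operation itself on inactive lanes; a real predicated divide never touches them, whereas a compute-then-select emulation would have to fix up the divisor beforehand.

```cpp
#include <array>
#include <cstddef>

// Reference model of the guarded divide from the Zig example above:
// active lanes (nonzero divisor) compute num/den, inactive lanes pass
// the old value through. The divide is never executed on a zero lane.
template <std::size_t N>
std::array<int, N> guarded_div(const std::array<int, N> &num,
                               const std::array<int, N> &den) {
  std::array<int, N> result{};
  for (std::size_t i = 0; i < N; ++i) {
    bool active = den[i] != 0; // the predicate
    if (active)
      result[i] = num[i] / den[i]; // only runs where the predicate holds
    else
      result[i] = den[i]; // select: keep the original lane value
  }
  return result;
}
```

Keeping this "operation is guarded" property attached to a separate select through the whole optimizer pipeline is exactly what turns out to be fragile; a predicated instruction carries it by construction.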

DevMtg Summary

  • There will be a separate RFC for the generalized pattern rewriting logic in LLVM-VP (see PatternMatch.h). We do this because it is useful for other efforts as well, e.g. to make the existing pattern rewrites in InstSimplify/InstCombine and DAGCombiner also work for constrained fp (@uweigand) and complex arithmetic (@greened). This may actually speed things up since we can pursue VP and generalized pattern matching in parallel.
  • @nhaehnle brought up that the LLVM-VP intrinsics should be convenient and natural to work with. The convenience wrappers (PredicatedInstruction, PredicatedBinaryOperator) and pattern rewrite generalizations already achieve this to a large extent. Specifically, there should be no "holes" when it comes to handling the intrinsics (e.g. it should not be necessary to resort to lower-level APIs (VPIntrinsic) when dealing with predicated SIMD). (To take something actionable from this, I think there should be an IRBuilder<>::CreateVectorFAdd(A, B, Mask, AVL, InsertPt), returning a PredicatedBinaryOperator, which may be either an FAdd instruction (or constrained fp..) or a llvm.vp.fadd intrinsic, depending on the fp environment, mask parameter, and vector length parameter.)
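
The CreateVectorFAdd idea from the second bullet could look roughly like the following toy sketch. All types and names here are hypothetical stand-ins (the real IRBuilder API is much richer); the point is only the dispatch: one entry point that emits a plain op when no predication is requested and a vp-style op otherwise.

```cpp
#include <memory>
#include <string>

// Toy stand-in for an IR value/instruction.
struct Value {
  std::string desc;
};

struct Builder {
  // If there is no mask and the vector length covers the whole vector,
  // an ordinary fadd suffices; otherwise emit a predicated vp.fadd.
  std::shared_ptr<Value> createVectorFAdd(const Value &a, const Value &b,
                                          const Value *mask, int evl,
                                          int vectorWidth) {
    if (!mask && evl == vectorWidth)
      return std::make_shared<Value>(Value{"fadd " + a.desc + ", " + b.desc});
    return std::make_shared<Value>(
        Value{"vp.fadd " + a.desc + ", " + b.desc + " (predicated)"});
  }
};
```

Callers never have to know which form was chosen, which is how the "no holes" requirement would be met.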


+1 on what Simon said.

There are lots of peephole rewrites like:

select ?, X, undef -> X

If we optimize away the select, we could end up incorrectly trapping on the no longer masked bits of X. This would be bad for the constrained intrinsics.

But also in the general case, it's very hard to keep a select glued to an operation through opt and llc.

+1 on what Simon said.

+1.

DevMtg Summary

  • There will be a separate RFC for the generalized pattern rewriting logic in LLVM-VP (see PatternMatch.h). We do this because it is useful for other efforts as well, eg to make the existing pattern rewrites in InstSimplify/Combine, DAGCombiner work also for constrained fp (@uweigand ) and complex arithmetic (@greened) . This may actually speedup things since we can pursue VP and generalized pattern match in parallel.

I'd like to rant a little bit to see if anyone agrees with my probably unpopular opinion...

Code explosion is the symptom, not the sickness. It's caused by using experimental intrinsics. Experimental intrinsics are a detriment to progress. They end up creating a ton more work and are designed to be inevitably replaced.

I do understand the desire for these intrinsics by some -- i.e. devs that don't care about these new features aren't impacted. I think that's short sighted though. Hiding complexity behind utility functions will be painful when debugging tough problems. And updating existing code to use pattern matchers is a lot of churn -- probably more churn than making the constructs first-class citizens.

IMHO, we'd be better off baking these new features into LLVM right from the start. These 3 topics are fairly significant features. It would be hard to argue that any one will go out of style in the foreseeable future...


Actually, the idea behind the generalized pattern code is to offer a way to gradually transition from intrinsics to native instruction support without disturbing transformations. The pattern matcher is templatized to match the intrinsics first (through utility classes). When the transition to native IR support is complete, one template instantiation of the pattern rewriter gets dropped and the code bloat is undone. E.g. in the case of VP, eventually PatternMatch will only ever be instantiated for the PredicatedContext and no longer for the special case of the EmptyContext. However, initially (in this patch) the pattern matcher is still instantiated for both kinds of context.
We can use the same mechanism to lift existing optimizations to complex arithmetic intrinsics. In that case, the matcher context would require that all constituent operations are complex number operators. The builder consuming the context will emit complex operations.

I do understand the desire for these intrinsics by some -- i.e. devs that don't care about these new features aren't impacted. I think that's short sighted though. Hiding complexity behind utility functions will be painful when debugging tough problems. And updating existing code to use pattern matchers is a lot of churn -- probably more churn than making the constructs first-class citizens.

Sure. You know, it's tempting to just duplicate all opcodes (Opcodes v2) and redesign them (all of them..) to support all of this from the start: a) masking (also for scalar ops), b) an active vector length, c) constrained fp.
If you want the existing transformations to work with Opcodes v2, you'd need exactly the same pattern generalizations, btw. In the end, whether it's native instructions or intrinsics does not matter that much.

IMHO, we'd be better off baking these new features into LLVM right from the start. These 3 topics are fairly significant features. It would be hard to argue that any one will go out of style in the foreseeable future...

I'd say LLVM is long past the starting line. If we just turn on predication on regular IR instructions, many existing instruction transformations will break. You'd need one monster commit that does the switch and fixes all these transformations at the same time.

Code explosion is the symptom, not the sickness. It's caused by using experimental intrinsics. Experimental intrinsics are a detriment to progress. They end up creating a ton more work and are designed to be inevitably replaced.

I think this is a big hammer argument for a nuanced topic.

We have used experimental intrinsics for a large number of disparate concepts, from exception handling to fuzzy vector extensions, and then after the semantics was defined and accepted, we baked the concepts into IR.

This is a proven track, and predication is a very similar example to past experiences, I see no contradiction here.

IMHO, we'd be better off baking these new features into LLVM right from the start. These 3 topics are fairly significant features. It would be hard to argue that any one will go out of style in the foreseeable future...

The risk of getting it wrong and having to re-bake into IR is high. We've done that with exception handling before and it wasn't pretty.

Predication is already in native IR form, albeit complex and error prone. The nuances across targets are too many to have a simple implementation working for everyone, and having a concrete implementation of the idea in intrinsic form may help clear up the issues before we stick to anything.

It's quite possible, and I really hope, that only a few targets will actually implement them, and that will be enough, so intrinsics will be short lived. Meanwhile, previous IR patterns will still match, so nothing is lost.

Of course, as with any intrinsic, it's quite possible that it will "just work" and people will give up half-way through. But history has shown that more often than not, these group efforts finish with a reasonable implementation, better than what we had before.

cheers,
--renato

+1 to what Renato said, I like this direction!
FWIW: we are working on Arm's M-profile Vector Extension (MVE), another vector extension for which this is very useful.

Code explosion is the symptom, not the sickness. It's caused by using experimental intrinsics. Experimental intrinsics are a detriment to progress. They end up creating a ton more work and are designed to be inevitably replaced.

I think this is a big hammer argument for a nuanced topic.

That's fair. But we can talk specifics too. We already have a lot of this functional and optimized in Clang. E.g.:

#include <stdio.h>

#pragma STDC FENV_ACCESS ON

void foo(double a[], double b[]) {
  double res[8];
  for(int i = 0; i < 8; i++)
    if (b[i] != 0.0)
      res[i] = a[i] / b[i];

  printf("%f\n", res[0]);
}
vmovupd (%rsi), %zmm0           #  test.c:8:9
vxorpd  %xmm1, %xmm1, %xmm1     #  test.c:8:14
vcmpneqpd       %zmm1, %zmm0, %k1 #  test.c:8:14
vmovupd (%rdi), %zmm1 {%k1} {z} #  test.c:9:16
vdivpd  %zmm0, %zmm1, %zmm0 {%k1} #  test.c:9:21
vmovupd %zmm0, (%rsp) {%k1}     #  test.c:9:14
vmovsd  (%rsp), %xmm0           #  test.c:11:18

That said, there's a large amount of technical debt from carrying these changes locally. I'd like to get out of that debt. That's why I'd like to avoid the experimental intrinsics detour.

I will also note that we care about a limited number of targets. So to be fair, take that into consideration.

We have used experimental intrinsics for a large number of disparate concepts, from exception handling to fuzzy vector extensions, and then after the semantics was defined and accepted, we baked the concepts into IR.

This is a proven track, and predication is a very similar example to past experiences, I see no contradiction here.

That's a fair argument too. I wasn't monitoring those projects, so I don't know the specifics.

Predication, Complex, and FPEnv require a massive amount of intrinsics to work though. Pretty much duplicating every operator (and target specific intrinsic for FPEnv). And probably some others I've forgotten. That seems like an unreasonable amount of intrinsics to me. But if others with experience in experimental intrinsics think it's manageable, I can't really argue.

IMHO, we'd be better off baking these new features into LLVM right from the start. These 3 topics are fairly significant features. It would be hard to argue that any one will go out of style in the foreseeable future...

The risk of getting it wrong and having to re-bake into IR is high. We've done that with exception handling before and it wasn't pretty.

Predication is already in native IR form, albeit complex and error prone. The nuances across targets are too many to have a simple implementation working for everyone, and having a concrete implementation of the idea in intrinsic form may help clear up the issues before we stick to anything.

It's quite possible, and I really hope, that only a few targets will actually implement them, and that will be enough, so intrinsics will be short lived. Meanwhile, previous IR patterns will still match, so nothing is lost.

Of course, as with any intrinsic, it's quite possible that it will "just work" and people will give up half-way through. But history has shown that more often than not, these group efforts finish with a reasonable implementation, better than what we had before.

Another good argument. And this isn't really the hill I want to die on. But it just seems silly to me to implement something twice: Occam's razor. We'll have to work the kinks out somewhere -- so why not push directly to the goal...

But it just seems silly to me to implement something twice: Occam's razor. We'll have to work the kinks out somewhere -- so why not push directly to the goal...

I see where you're coming from, but hindsight is 20/20. Implementing something twice, when the first one is a prototype means you can make a lot of mistakes on the first iteration.

If the cost of changing the IR outweighs the prototyping costs (it usually does), then the overall cost is lower, even if for a longer period.

The current proposals are interlinked, so I don't think there will be combinatorial explosion, or even multiplication of intrinsics. I hope that we'll figure out the best way to represent that into IR sooner because of that.

This is not the first time that we try to get those into IR proper, either. All previous times we started with "change the IR" approach and could never get into agreement.

Intrinsics give us the prototype route: low implementation cost, low impact, easy to clean up later. It does add clutter in between, but that impact can also be limited to one or two targets of the willing sub-communities.

LLVM is a very fast moving target, stopping the world to get the IR "right" doesn't work.

A good example to look for is the scalable vector IR changes that have gone through multiple attempts and are going on for many years and still not complete...

These things take time, rushing it usually backfires. :)


Ha, yeah. All good points. I'll let this drop...

simoll updated this revision to Diff 226913. Oct 29 2019, 9:34 AM
Updates
  • Fixed several intrinsic attributes.
  • All fp intrinsics are constrained (identically to the llvm.constrained.* ones). They behave like regular fp ops if fpexcept.ignore is passed.
  • Bitcode verifier test.
Observations
  • When using fpexcept.ignore, the fp callsites should have the readnone attribute set on them to override the inaccessiblememonly of the intrinsic declarations. That way DCE still works.
  • The rules for constrained fp (strictfp on the function definition, only constrained fp in that function) apply only if there is a single fp op with exceptions in the function. That is, strictfp is not necessary when all fp ops have fpexcept.ignore.
  • When the exception behavior is not fpexcept.ignore, the fp op of the intrinsic is not revealed (getFunctionalOpcode(..) returns Call in that case).
  • (FIXME) NoCapture does not work on vectors of pointers.
Next steps

As mentioned earlier, generalized pattern matching will be part of a separate RFC (although its still included in this reference implementation).
I'd like to discuss the actual intrinsic signatures next. For that I will upload a new minimal patch for integer intrinsic support.

Hi Simon, I went through the code for the first time, and this is a first round of proper nitpicks from my side. Please ignore if you want to focus on the bigger picture at this point in the discussion, but these are just some things I noticed. General nitpick is that you should run clang-format as there are quite a few coding style issues: indentation, indentation of arguments, exceeding 80 columns, placement of * and & in arguments and return values, etc. And find some more nitpicks inlined.

llvm/include/llvm/CodeGen/ISDOpcodes.h
489

I was unfamiliar with this one... I think I know what it does, and how it is different from VP_SELECT, but for clarity, can you define what integer pivot is?

492

typo: hether

1134

just spell out 'otherwise' here, and also below.

llvm/include/llvm/CodeGen/SelectionDAGNodes.h
727

Perhaps outdated comment? Should it be something along the lines of 'vector predicated node' instead of 'explicit vector length node'?

1492

indentation of || off by 1?

2357

VP_LOAD and VP_STORE?

2386

same?

2426

".. does a truncation before store" sounds a bit odd. Since 'truncating store' is a well-known term, and you explain what it is for ints/floats below, I think it suffices to say "Return true if this is a truncating store. For integers ..."

    1. Observations
  • When using fpexcept.ignore, the fp callsites should have the readnone attribute set on them to override the inaccessiblememonly of the intrinsic declarations. That way DCE still works.

Wouldn't that allow the call to be moved relative to other calls? Specifically, we need to make sure intrinsics aren't moved relative to calls that change the rounding mode. The "inaccessiblememonly" attribute is meant to model both the reading of control modes and the possible setting of status flags or raising of exceptions.

  • The rules for constrained fp (strictfp on the function definition, only constrained fp in that function) apply only if there is a single fp op with exceptions in the function. That is strictfp is not necessary when all fp ops have fpexcept.ignore.

I don't think this is right. Even if there are no constrained FP operations in the function we might have math library calls for which the strictfp attribute is needed to prevent libcall simplification and constant folding that might violate the rounding mode.


I see. Since we need to model the default fp environment with these intrinsics (and this is our priority), let me make the following suggestion: VP intrinsics will have a rounding mode and exception behavior argument from the start, but the only allowed values are "round.tonearest" and "fpexcept.ignore". Once we have a solution for the general case implemented for constrained fp, we will unlock that feature for LLVM VP as well.
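
Under that suggestion, a VP fp intrinsic might look like the following sketch (intrinsic name and operand order are assumptions; only the default-environment metadata values would be accepted initially):

```llvm
; hypothetical: a VP fadd carrying constrained-fp metadata operands,
; restricted to the default fp environment for now
%r = call <8 x float> @llvm.vp.fadd.v8f32(
         <8 x float> %x, <8 x float> %y,
         metadata !"round.tonearest", metadata !"fpexcept.ignore",
         <8 x i1> %mask, i32 %evl)
```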

simoll marked 8 inline comments as done.Oct 30 2019, 2:06 AM

Hi Simon, I went through the code for the first time, and this is a first round of proper nitpicks from my side. Please ignore if you want to focus on the bigger picture at this point in the discussion, but these are just some things I noticed. General nitpick is that you should run clang-format as there are quite a few coding style issues: indentation, indentation of arguments, exceeding 80 columns, placement of * and & in arguments and return values, etc. And find some more nitpicks inlined.

Hi Sjoerd, thanks for your comments! I've fixed the inline nitpicks right away. I'll do a style pass for the actual commits.

Cheers. Just curious, what are your next steps? People can correct me if I'm wrong, but my impression is that with the RFC, this prototype, and the discussion at the US LLVM dev conference, there is consensus and people are on board with the general idea and direction. There are still some discussions on e.g. the (constrained) FP part, but would it now be time to split this up into separate commits, for example an INT and an FP part (if that makes sense), so that they can be progressed separately?

simoll added a comment.EditedNov 4 2019, 4:30 AM

Yes, I hope that's where things are right now ;-) I am planning to go by functional slices. Each slice comes with IR-level intrinsics, TTI support, basic lowering to standard IR, SelectionDAG support, and tests.

I am preparing the first patchset for integer support atm.

Slices:

  • Integer slice.
  • Memory slice.
  • Reduction slice.
  • FP (with unconstrained metadata args) slice.

Standalone patch:

  • Mask, VectorLength and Passthru attributes (in preparation of vector function calls).

Pending discussion/separate RFC:

  • Constrained FP (being able to fully optimize constrained fp intrinsics in the default fp env).
  • Generalized pattern match (aka optimizing VP).
simoll added a comment.Nov 6 2019, 1:13 AM

D69552: Move floating point related entities to namespace level contains the fp enum changes required for LLVM-VP. Referencing the patch here.

simoll updated this revision to Diff 228052.Nov 6 2019, 6:27 AM

Fixed attribute placements, signatures, more tests, ..
This is in sync with the subpatch #1 of the integer slice (https://reviews.llvm.org/D69891).

simoll added a comment.Nov 6 2019, 6:29 AM

Integer slice patches

#1 IR-level support: https://reviews.llvm.org/D69891
#2 TTI & Legalization: <stay tuned>
#3 ISel patch: <stay tuned>

I'll update this comment as we go to keep track of the integer slice.

simoll updated this revision to Diff 228885.Nov 12 2019, 6:48 AM

Changes

  • VPIntrinsics.def file.
  • Pass vlen i32 -1 to enable all lanes with scalable vector types.
  • Various NFC fixes.
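
For scalable types, passing -1 as the vector length enables all lanes; a sketch using the prototype's intrinsic naming:

```llvm
; %evl = -1 enables all lanes, the natural choice for scalable vectors
; whose lane count is unknown at compile time (sketch; values assumed)
%r = call <vscale x 4 x i32> @llvm.vp.add.nxv4i32(
         <vscale x 4 x i32> %a, <vscale x 4 x i32> %b,
         <vscale x 4 x i1> %m, i32 -1)
```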

Moving the discussion from the integer patch to the main RFC, as this is about the general design of VP intrinsics. It's about having a passthru operand (as in llvm.masked.load) and whether %evl should be a parameter of the intrinsics or modelled differently.

@SjoerdMeijer https://reviews.llvm.org/D69891#inline-636845
and if I'm not mistaken we are now discussing if undef here should be undef or a passthru

@rkruppe https://reviews.llvm.org/D69891#inline-637215
I previously felt that passthru would be nice to have for backend maintainers (including myself) but perhaps not worth the duplication of IR functionality (having two ways to do selects). However, given the differences I just described, I don't think "just use select" is workable.

Ok. I do agree that having passthru simplifies isel for certain architectures (and legal combinations of passthru value, type, and operations...), but:
VP intrinsics aren't target intrinsics: they are not supposed to be a way to directly program any specific ISA in the way of a macroassembler, as you would with llvm.x86.* or llvm.arm.mve.* or any other. Rather, think of them as regular IR instructions. Pretend that anything we propose in VP intrinsics will end up as a feature of first-class LLVM instructions. Based on that, I figured that one VP intrinsic should match one IR instruction plus predication, nothing more.

  • If we had predicated IR instructions, would we want them to have a passthru operand?
  • The prototype shows that defining VP intrinsics with undef-on-masked-out makes it straightforward to generalize InstSimplify/InstCombine/DAGCombiner so that they can optimize VP intrinsics. If you add a passthru operand, then logically VP intrinsics start to behave like two instructions: that could be made to work, but it would be messier as you'd have to peek through selects, etc.
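
Without a passthru operand, merging semantics are expressed as a separate select that backends can pattern-match back into a merging instruction; a sketch with assumed names:

```llvm
; undef-on-masked-out VP operation ...
%v = call <8 x i32> @llvm.vp.add.v8i32(<8 x i32> %a, <8 x i32> %b,
                                       <8 x i1> %m, i32 %evl)
; ... followed by an explicit merge with the passthru value
%merged = select <8 x i1> %m, <8 x i32> %v, <8 x i32> %passthru
```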

@sdesmalen https://reviews.llvm.org/D69891#1750287
If we want to solve the select issue and also keep the intrinsics simple, my suggestion was to combine the explicit vector length with the mask using an explicit intrinsic like @llvm.vp.enable.lanes. Because this is an explicit intrinsic, the code-generator can simply extract the %evl parameter and pass that directly to the instructions for RVV/SXA. This is what happens for many other intrinsics in LLVM already, like masked.load/masked.gather that support only a single addressing mode, where it is up to the code-generator to pick apart the value into operands that are suited for a more optimal load instruction.

Without having heard your thoughts on this suggestion, I would have to guess that your reservation is the possibility of LLVM hoisting/separating the logic that merges predicate mask and %evl value in some way. That would mean having to do some tricks (think CodeGenPrep) to keep the values together and recognizable for CodeGen. And that's the exact same thing we would like to avoid for supporting merging/zeroing predication, hence the suggestion for the explicit passthru parameter.

That's not quite the same:
%evl is mapped to a hardware register on SX-Aurora. We cannot simply reconstitute the %evl from any given mask; if %evl is obscured, all operations that depend on it become less efficient because we need to default to the full vector length. Now, if the select is separated from the VP intrinsic, you simply emit one select instruction, and it should be possible to hoist it back and merge it with the VP intrinsic in most cases (you probably want an optimization that does that anyway, because there will be code with explicit selects even with passthru). Besides, if the select is folded with an instruction that is subsequently simpler, then that's actually an argument in favor of explicit selects: passthru makes this implicit.

  • If we had predicated IR instructions, would we want them to have a passthru operand?

I think that would probably be a reasonable clean-slate IR design, though I am not at all sure if it would be better. I wasn't specifically advocating for passthru operands, though. I agree that not having passthru and performing the same function in two operations can be readily pattern-matched by backends at modest effort and failing to match it has low cost. My main point was just that the existing select instruction is not sufficient as the second operation, for essentially the same reason why the VP intrinsics have an EVL argument instead of just the mask. Creating a VP equivalent of select (as already sketched in the other thread) resolves that concern just as well.

simoll marked an inline comment as done.Tue, Dec 3, 4:18 AM
simoll added inline comments.
llvm/docs/LangRef.rst
15283–15300

@rkruppe
[..] My main point was just that the existing select instruction is not sufficient as the second operation, for essentially the same reason why the VP intrinsics have an EVL argument instead of just the mask. Creating a VP equivalent of select (as already sketched in the other thread) resolves that concern just as well.

I agree. The prototype has defined such an llvm.vp.select from the get-go.

rkruppe added inline comments.Tue, Dec 3, 10:06 AM
llvm/docs/LangRef.rst
15283–15300

Oops, missed that / forgot about it. Sorry for the noise.

Is there a reason why it's not in the "integer slice" patch? It's not integer-specific, but it seems to fit even less into the other slices.

simoll marked an inline comment as done.Mon, Dec 9, 12:52 AM
simoll added inline comments.
llvm/docs/LangRef.rst
15283–15300

For one, I wanted to keep the integer patch concise. Also, having played around with this for a while now, I think that the signature of vp.select should be:

llvm.vp.select(<W x i1> %m, %onTrue, %onFalse, i32 %threshold, i32 vlen %evl)

meaning that values from %onTrue are selected where %m is true and the lane index is below %threshold; %onFalse is selected otherwise. Lane indices greater than or equal to %evl are undef, as ever. In short: there is just one "merge" operation and no separate vp.compose anymore.
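
In IR, that proposed signature would read roughly as follows (sketch; the per-lane semantics from above are spelled out in the comments):

```llvm
; lane i:  i >= %evl                 -> undef
;          %m[i] and i < %threshold  -> %onTrue[i]
;          otherwise                 -> %onFalse[i]
%r = call <8 x i32> @llvm.vp.select.v8i32(
         <8 x i1> %m, <8 x i32> %onTrue, <8 x i32> %onFalse,
         i32 %threshold, i32 %evl)
```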