
RFC: Prototype & Roadmap for vector predication in LLVM
Changes Planned · Public

Authored by simoll on Jan 31 2019, 3:12 AM.

Details

Summary

Vector Predication Roadmap

This proposal defines a roadmap towards native vector predication in LLVM, specifically for vector instructions with a mask and/or an explicit vector length.
LLVM currently has no target-independent means to model predicated vector instructions for modern SIMD ISAs such as AVX512, ARM SVE, the RISC-V V extension, and NEC SX-Aurora.
Only some predicated vector operations, such as masked loads and stores, are available through intrinsics [MaskedIR]_.

Please use docs/Proposals/VectorPredication.rst to comment on the summary.

Vector Predication intrinsics

The prototype in this patch demonstrates the following concepts:

  • Predicated vector intrinsics with an explicit mask and vector length parameter at the IR level (see the sketch after this list).
  • First-class predicated SDNodes at the ISel level. Mask and vector length are value operands.
  • An incremental strategy to generalize PatternMatch/InstCombine/InstSimplify and DAGCombiner to work on both regular instructions and VP intrinsics.
  • DAGCombiner example: FMA fusion.
  • InstCombine/InstSimplify example: FSub pattern rewrites.
  • Early experiments on the LNT test suite (Clang static release, O3 -ffast-math) indicate that compile time on non-VP IR is not affected by the API abstractions in PatternMatch, etc.
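
As a rough illustration of the first bullet above, a predicated floating-point add under this proposal could be spelled as in the following sketch. The intrinsic name, mangling and operand order are placeholders that follow the naming used elsewhere in this RFC, not a settled interface:

  ; %mask disables individual lanes; lanes at positions >= %evl are disabled as well
  %res = call <8 x double> @llvm.vp.fadd.v8f64(<8 x double> %a, <8 x double> %b,
                                               <8 x i1> %mask, i32 %evl)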

Roadmap

Drawing from the prototype, we propose the following roadmap towards native vector predication in LLVM:

1. IR-level VP intrinsics

  • There is consensus on the semantics/instruction set of VP intrinsics.
  • VP intrinsics and attributes are available at the IR level.
  • TTI has capability flags for VP (tentatively `supportsVP()`, `haveActiveVectorLength()`).

Result: VP usable for IR-level vectorizers (LV, VPlan, RegionVectorizer), potential integration in Clang with builtins.

2. CodeGen support

  • VP intrinsics translate to first-class SDNodes (`llvm.vp.fdiv.* -> vp_fdiv`).
  • VP legalization: legalize the explicit vector length into the mask on targets without a vector length register (AVX512), and legalize VP SDNodes to pre-existing nodes on targets without predication (SSE, NEON); a conceptual sketch follows below.

Result: Backend development based on VP SDNodes.
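
To make the legalization bullet above concrete, one conceivable way to lower the explicit vector length on a mask-only target such as AVX512 is to fold it into the mask operand. The IR below is a hand-written conceptual sketch of that rewrite (value names and the elided constants are placeholders), not actual legalization output:

  ; before: only the first %evl lanes are active
  %r = call <16 x float> @llvm.vp.fdiv.v16f32(<16 x float> %x, <16 x float> %y,
                                              <16 x i1> %m, i32 %evl)

  ; after: the vector length is folded into the mask and reset to "all lanes"
  %evl.splat = ... %evl splat across <16 x i32> ...
  %evl.mask  = icmp ult <16 x i32> <i32 0, i32 1, ..., i32 15>, %evl.splat
  %m.new     = and <16 x i1> %m, %evl.mask
  %r = call <16 x float> @llvm.vp.fdiv.v16f32(<16 x float> %x, <16 x float> %y,
                                              <16 x i1> %m.new, i32 16)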

3. Lift InstSimplify/InstCombine/DAGCombiner to VP

  • Introduce PredicatedInstruction, PredicatedBinaryOperator, ... helper classes that match both standard vector IR and VP intrinsics.
  • Add a matcher context to PatternMatch and context-aware IR Builder APIs.
  • Incrementally lift DAGCombiner to work on VP SDNodes as well as on regular vector instructions (e.g. the FMA fusion sketched below).
  • Incrementally lift InstCombine/InstSimplify to operate on VP as well as on regular IR instructions.

Result: Optimization of VP intrinsics on par with standard vector instructions.
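
As a sanity check for this stage, the prototype's FMA-fusion example can be pictured as the same combine applied to both forms; on the VP side the rewrite only fires when both operations agree on the mask and vector length (and, as usual, the intermediate result has no other uses). Intrinsic names and types are illustrative:

  %t = call <8 x float> @llvm.vp.fmul.v8f32(<8 x float> %a, <8 x float> %b, <8 x i1> %m, i32 %L)
  %r = call <8 x float> @llvm.vp.fadd.v8f32(<8 x float> %t, <8 x float> %c, <8 x i1> %m, i32 %L)

  ; --> under the usual contraction rules, folds to:

  %r = call <8 x float> @llvm.vp.fma.v8f32(<8 x float> %a, <8 x float> %b, <8 x float> %c,
                                           <8 x i1> %m, i32 %L)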

4. Deprecate llvm.masked.* / llvm.experimental.reduce.*

  • Modernize llvm.masked.* / llvm.experimental.reduce.* by translating them to VP.
  • DCE the transitional APIs.

Result: VP has superseded earlier vector intrinsics.

5. Predicated IR Instructions

  • Vector instructions have an optional mask and vector length parameter. These lower to VP SDNodes (from Stage 2); a purely hypothetical syntax sketch follows below.
  • Phase out VP intrinsics, keeping only those that are not equivalent to vectorized scalar instructions (reduce, shuffles, ...).
  • InstCombine/InstSimplify expect predication in regular Instructions (Stage 3 has laid the groundwork).

Result: Native vector predication in IR.
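
Stage 5 deliberately leaves the concrete instruction syntax open. Purely as a strawman, and not as part of this proposal's text, a natively predicated instruction could eventually read something like:

  ; hypothetical spelling only -- the RFC does not commit to any particular syntax here
  %r = fdiv <8 x double> %a, %b, mask(<8 x i1> %m), vlen(i32 %evl)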

References

.. [MaskedIR] llvm.masked.* intrinsics, https://llvm.org/docs/LangRef.html#masked-vector-load-and-store-intrinsics
.. [EvlRFC] Explicit Vector Length RFC, https://reviews.llvm.org/D53613


Event Timeline

simoll created this revision. Jan 31 2019, 3:12 AM
greened added inline comments.
docs/Proposals/VectorPredication.rst
20

This document seems to overload "EVL" to mean both predication and an explicit vector length. Of course predication can be used to simulate the effects of a vector length value on targets that don't have a way to specify an explicit vector length.

Can we keep the two concepts distinct in this document? It's confusing to see "EVL" when discussing predication. In particular, everything in SelectionDAG uses the "EVL" name to represent both concepts. Can we come up with something more accurate?

grosser added a subscriber: grosser. Feb 1 2019, 1:56 AM
Herald added a project: Restricted Project. Feb 1 2019, 1:56 AM
grosser removed a subscriber: grosser. Feb 1 2019, 1:56 AM
grosser added a subscriber: grosser.
simoll marked an inline comment as done. Feb 1 2019, 2:02 AM
simoll added inline comments.
docs/Proposals/VectorPredication.rst
20

"EVL" is really just the working title of the whole extension. It is my understanding that the explicit vector length parameter is just part of the predicate and not dinstinct from it (eg the conceptual predicate is a composite of the bit mask and the vector length).

How about naming the extension "VP" for "vector predication" (vp_fadd, llvm.vp.*)..?

programmerjake requested changes to this revision. Feb 1 2019, 2:28 AM
programmerjake added inline comments.
include/llvm/IR/Intrinsics.td
1147

We will need to change the mask parameter length to allow for mask lengths that are a divisor of the main vector length.
See http://lists.llvm.org/pipermail/llvm-dev/2019-February/129845.html

This revision now requires changes to proceed. Feb 1 2019, 2:28 AM
programmerjake added a comment (edited). Feb 1 2019, 5:06 AM

We will also need to adjust gather/scatter and possibly other load/store kinds to allow the address vector length to be a divisor of the main vector length (similar to mask vector length). I didn't check if there are intrinsics for strided load/store, those will need to be changed too, to allow, for example, storing <scalable 3 x float> to var.v in:

struct S
{
    float v[3];
    // random other stuff
};
S var[N];
simoll marked an inline comment as done. Feb 1 2019, 5:59 AM

We will also need to adjust gather/scatter and possibly other load/store kinds to allow the address vector length to be a divisor of the main vector length (similar to mask vector length). I didn't check if there are intrinsics for strided load/store, those will need to be changed too, to allow, for example, storing <scalable 3 x float> to var.v in:

.. and as a side effect evl_load/evl_store are subsumed by evl_gather/evl_scatter:

evl.load(%p, %M, %L) ==  evl.gather(<1 x double*> %p, <256 x i1>..) ==  evl.gather(double* %p, <256 x i1> %M, i32 %L)

Nice :)

Regarding strided memory accesses, I was hoping the stride could be pattern-matched in the backend.

include/llvm/IR/Intrinsics.td
1147

Can we make the vector length operate at the granularity of the mask?

In your case [1] that would mean that the AVL refers to multiples of the short element vector (e.g. <3 x float>).

[1] http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html

programmerjake added inline comments. Feb 1 2019, 10:45 AM
include/llvm/IR/Intrinsics.td
1147

I was initially assuming that the vector length would be in the granularity of the mask.
That would work for my ISA extension. I think that would also work for the RISC-V V extension, though I would have to double-check or get someone who's working on it to confirm. I don't think that would work without needing to multiply the vector length on AVX512, assuming a shift is used to generate the mask. I have no clue for ARM SVE or other architectures.

simoll marked an inline comment as done. Feb 1 2019, 11:27 AM
simoll added inline comments.
include/llvm/IR/Intrinsics.td
1147

So we are on the same page here.

What I actually had in mind is how this would interact with scalable vectors, e.g.:

<scalable 2 x float> evl.fsub(<scalable 2 x float> %x, <scalable 2 x float> %y, <scalable 2 x i1> %M, i32 %L)

In that case, the vector length should refer to packets of two elements. That would be a perfect match for NEC SX-Aurora, where the AVL always refers to 64-bit elements (e.g. there is a packed float mode).

greened added inline comments. Feb 4 2019, 10:52 AM
docs/Proposals/VectorPredication.rst
20

Sure, that sounds fine.

rkruppe added inline comments. Feb 4 2019, 12:58 PM
include/llvm/IR/Intrinsics.td
1147

That definitely wouldn't work for RISC-V V, as its vector length register counts in elements, not bigger packets. For example, (in the latest public version of the spec at the moment, v0.6-draft), <scalable 4 x i8> is a natural type for a vector of 8-bit integers. You might use it in a loop that doesn't need 16- or 32-bit elements, and operations on it have to interpret the active vector length as being in terms of 8 bit elements to match the hardware, not in terms of 32 bit elements.

Moreover, it seems incongruent with the scalable vector type proposal to treat vlen as being in terms of vscale rather than in terms of vector elements. <scalable n x T> is simply an (n * vscale)-element vector and that the vscale factor is not known at compile time is inconsequential for numbering or interpreting the lanes (e.g., lane indices for shuffles or element inserts/extracts go from 0 to (n * vscale) - 1). In fact, I believe it is currently the case that scalable vectors can be legalized by picking some constant for vscale (e.g., 1) and simply replacing every <scalable n x T> with <(CONST_VSCALE * n) x T> and every call to llvm.vscale() with that constant.

I don't think it would be a good match for SVE or other "predication only" architectures either: as Jacob pointed out for the case of AVX-512, it seems to require an extra multiplication/shift to generate the mask corresponding to the vector length. This is probably secondary, but it feels like another hint that this line of thought is not exactly a smooth, natural extension.

simoll marked an inline comment as done. Feb 4 2019, 1:38 PM
simoll added inline comments.
include/llvm/IR/Intrinsics.td
1147

That definitely wouldn't work for RISC-V V, as its vector length register counts in elements, not bigger packets. For example, (in the latest public version of the spec at the moment, v0.6-draft), <scalable 4 x i8> is a natural type for a vector of 8-bit integers. You might use it in a loop that doesn't need 16- or 32-bit elements, and operations on it have to interpret the active vector length as being in terms of 8 bit elements to match the hardware, not in terms of 32 bit elements.

Why couldn't you use <scalable 1 x i8> then?

Moreover, it seems incongruent with the scalable vector type proposal to treat vlen as being in terms of vscale rather than in terms of vector elements. <scalable n x T> is simply an (n * vscale)-element vector and that the vscale factor is not known at compile time is inconsequential for numbering or interpreting the lanes (e.g., lane indices for shuffles or element inserts/extracts go from 0 to (n * vscale) - 1). In fact, I believe it is currently the case that scalable vectors can be legalized by picking some constant for vscale (e.g., 1) and simply replacing every <scalable n x T> with <(CONST_VSCALE * n) x T> and every call to llvm.vscale() with that constant.

Instead, llvm.vscale() would be replaced by a constant CONST_VSCALE times another constant: vscale. That does not seem like a substantial difference to me.

I don't think it would be a good match for SVE or other "predication only" architectures either: as Jacob pointed out for the case of AVX-512, it seems to require an extra multiplication/shift to generate the mask corresponding to the vector length. This is probably secondary, but it feels like another hint that this line of thought is not exactly a smooth, natural extension.

You would only ever use the full vector length as the vlen parameter when you generate EVL for architectures like AVX512 or SVE in the first place.

Yes, lowering it otherwise may involve a shift (or adding a constant vector) in the worst case. However, all of this happens on the legalization code path, which is not expected to yield fast code but something that is correct and somewhat reasonable. We already legalize things like llvm.masked.gather on SSE (and it ain't pretty).

We will also need to adjust gather/scatter and possibly other load/store kinds to allow the address vector length to be a divisor of the main vector length (similar to mask vector length). I didn't check if there are intrinsics for strided load/store, those will need to be changed too, to allow, for example, storing <scalable 3 x float> to var.v in:

.. and as a side effect evl_load/evl_store are subsumed by evl_gather/evl_scatter:

evl.load(%p, %M, %L) ==  evl.gather(<1 x double*> %p, <256 x i1>..) ==  evl.gather(double* %p, <256 x i1> %M, i32 %L)

This seems shaky. When generalized to scalable vector types, it means a load of a scalable vector would be evl.gather(<1 x double*> %p, <scalable n x i1>), which mixes fixed and scaled vector sizes. While it's no big deal to test the divisibility, allowing "mixed scalability" increases the surface area of the feature and not in a direction that seems desirable. For example, it strongly suggests permitting evl.add(<scalable n x i32>, <scalable n x i32>, <n x i1>, ...) where each mask bit controls vscale many lanes -- quite unnatural, and not something that seems likely to ever be put into hardware.

And what for? I see technical disadvantages (less readable IR, needing more finicky pattern matching in the backend, more complexity in IR passes that work better on loads than on general gathers) and few if any technical advantages. It's a little conceptual simplification, but only at the level of abstraction where you don't care about uniformity.

include/llvm/IR/Intrinsics.td
1147

Why couldn't you use <scalable 1 x i8> then?

Each vector register holds a multiple of 32 bits (on that particular target), so <scalable 4 x i8> is just the truth :) It's also important to be able to express the difference between "stuffing the vector register with as many elements as will fit" (here, <scalable 4 x i8>) versus having only half (<scalable 2 x i8>) or a quarter (<scalable 1 x i8>) as many elements because your vectorization factor is limited by larger element types elsewhere in the code -- in mixed-precision code you'll want to do either, depending on how you vectorize. The distinction is also important for vector function ABIs, e.g. you might have both vsin16s(<scalable 1 x f16>) and vsin16d(<scalable 2 x f16>).

Additionally, I want to be able to actually use the full vector register without implementing a dynamically changing vscale. Not just because I'm lazy, but also because the architecture has changed enough that the motivation for it has lessened, so maybe that will not be upstreamed (or only later).

Instead llvm.scale() would be replaced by a constant CONST_VSCALE times another constant: vscale. This does not seem a substantial difference to me.

My point isn't that legalization becomes difficult, it's that scalable vectors are not intended as "a sequence of fixed-size vectors" but rather ordinary vectors whose length happens to be a bit more complex than a compile time constant. A vlen that is in units of vscale is thus unnatural and clashes with every other operation on scalable vectors. If we were talking about a family of intrinsics specifically targeted at the "vector of float4s" use case, that would be inherent and good, but we're not.

It's unfortunate that this clashes with how SX-Aurora's packed operations work, I did not know that.

You would only ever use the full vector length as vlen parameter when you generate EVL for architectures like AVX512, SVE in the first place.

Certainly, that's why I say it's secondary and mostly a hint that something is amiss with "the current thinking". In fact, I am by now inclined to propose that Jacob and collaborators start out by expressing their architecture's operations with target-specific intrinsics that also use the attributes introduced here (especially since none of the typical vectorizers are equipped to generate the sort of code they want from scalar code using e.g. float4 types). Alternatively, use a dynamic vector length of <the length their architecture wants> * <how many elements each of the short vectors has> and fix it up in the backend.

simoll marked an inline comment as done. Feb 5 2019, 2:22 AM
simoll added inline comments.
include/llvm/IR/Intrinsics.td
1147

Each vector register holds a multiple of 32 bit (on that particular target), so <scalable 4 x i8> is just the truth :) It's also important to be able to express the difference between "stuffing the vector register with as many elements as will fit" (here, <scalable 4 x i8>) versus having only half (<scalable 2 x i8>) or a quarter (<scalable 1 x i8>) as many elements because your vectorization factor is limited by larger elements types elsewhere in the code -- in mixed precision code you'll want to do either depending on how you vectorize. The distinction is also important for vector function ABIs, e.g. you might have both vsin16s(<scalable 1 x f16>) and vsin16d(<scalable 2 x f16>).

Makes sense. However, if VL is element-grained at the IR level then there need to be functions in TTI to query the native VL grain size for the target (and per scalable type). E.g. for SX-Aurora in packed float mode the grain size is <2 x float>, and so you might want to generate a remainder loop in that case (unpredicated vector body + predicated vector body for the last iteration).

My point isn't that legalization becomes difficult, it's that scalable vectors are not intended as "a sequence of fixed-size vectors" but rather ordinary vectors whose length happens to be a bit more complex than a compile time constant

Quote from "[llvm-dev] [RFC] Supporting ARM's SVE in LLVM", Graham Hunter:

To represent a vector of unknown length a scaling property is added to the `VectorType` class whose element count becomes an unknown multiple of a known minimum element count

<snip>

A similar rule applies to vector floating point MVTs but those types whose static component is less than 128bits (MVT::nx2f32) are also mapped directly to SVE data registers but in a form whereby elements are effectively interleaved with enough undefined elements to fulfil the 128bit requirement.

I think the sub-vector interpretation is actually the more natural reading of SVE, considering that sub-vectors are padded/interleaved to fit a native sub-register size (128bit on SVE, 64bit on SX-Aurora and 32bit on RVV (in general, or just the RVV implementation you are working on?)). Each sub-vector in the full scalable type is offset by a multiple of that size, so a scalable type is an array of padded sub-vectors.

simoll added a comment. Feb 5 2019, 2:36 AM

We will also need to adjust gather/scatter and possibly other load/store kinds to allow the address vector length to be a divisor of the main vector length (similar to mask vector length). I didn't check if there are intrinsics for strided load/store, those will need to be changed too, to allow, for example, storing <scalable 3 x float> to var.v in:

.. and as a side effect evl_load/evl_store are subsumed by evl_gather/evl_scatter:

evl.load(%p, %M, %L) ==  evl.gather(<1 x double*> %p, <256 x i1>..) ==  evl.gather(double* %p, <256 x i1> %M, i32 %L)

This seems shaky. When generalized to scalable vector types, it means a load of a scalable vector would be evl.gather(<1 x double*> %p, <scalable n x i1>), which mixes fixed and scaled vector sizes. While it's no big deal to test the divisibility, allowing "mixed scalability" increases the surface area of the feature and not in a direction that seems desirable. For example, it strongly suggests permitting evl.add(<scalable n x i32>, <scalable n x i32>, <n x i1>, ...) where each mask bit controls vscale many lanes -- quite unnatural, and not something that seems likely to ever be put into hardware.

Mixing vector types and scalable vector types is illegal and is not what I was suggesting. Rather, a scalar pointer would be passed to convey a consecutive load/store from a single address.

And what for? I see technical disadvantages (less readable IR, needing more finicky pattern matching in the backend, more complexity in IR passes that work better on loads than on general gathers) and few if any technical advantages. It's a little conceptual simplification, but only at the level of abstraction where you don't care about uniformity.

Less readable IR: the address computation would become simpler, e.g. there is no need to synthesize a consecutive constant only to have it pattern-matched and subsumed in the backend (e.g. <0, 1, 2, ..., 15, 0, 1, 2, 3, ..., 15, 0, ...>).
Finicky pattern matching: it is trivial to legalize this by expanding it into a more standard gather/scatter, or splitting it into consecutive memory accesses. We can even keep the EVL_LOAD, EVL_STORE SDNodes in the backend, so you wouldn't even realize that an llvm.evl.gather was used for a consecutive load at the IR level.
More complexity in IR passes for standard loads: we are already using intrinsics here and I'd rather advocate pushing for a solution that makes general passes work on predicated gather/scatter (which is not the case atm, afaik).

But again, we can leave this out of the first version and keep discussing as an extension.

This seems shaky. When generalized to scalable vector types, it means a load of a scalable vector would be evl.gather(<1 x double*> %p, <scalable n x i1>), which mixes fixed and scaled vector sizes. While it's no big deal to test the divisibility, allowing "mixed scalability" increases the surface area of the feature and not in a direction that seems desirable. For example, it strongly suggests permitting evl.add(<scalable n x i32>, <scalable n x i32>, <n x i1>, ...) where each mask bit controls vscale many lanes -- quite unnatural, and not something that seems likely to ever be put into hardware.

Mixing vector types and scalable vector types is illegal and is not what i was suggesting. Rather, a scalar pointer would be passed to convey a consecutive load/store from a single address.

Ok, sorry, I misunderstood the proposal then. That seems reasonable, although I remain unsure about the benefits.

And what for? I see technical disadvantages (less readable IR, needing more finicky pattern matching in the backend, more complexity in IR passes that work better on loads than on general gathers) and few if any technical advantages. It's a little conceptual simplification, but only at the level of abstraction where you don't care about uniformity.

Less readable IR: the address computation would become simpler, eg there is no need to synthesize a consecutive constant only to have it pattern-matched and subsumed in the backend (eg <0, 1, 2, ..., 15, 0, 1, 2, 3.., 15, 0, ....>).

I don't really follow, here we were only talking about subsuming unit-stride loads under gathers (and likewise for stores/scatters), right? But anyway, I was mostly worried about a dummy 1-element vector and the extra type parameter(s) on the intrinsic, which isn't an issue with what you actually proposed.

Finicky pattern matching: it is trivial to legalize this by expanding it into a more standard gather/scatter, or splitting it into consecutive memory accesses.. we can even keep the EVL_LOAD, EVL_STORE SDNodes in the backend so you woudn't even realize that an llvm.evl.gather was used for a consecutive load on IR level.

I'm not talking about legalizing a unit-stride access into a gather/scatter (when would that be necessary? everyone who has scatter/gather also has unit-stride), but about recognizing that the "gather" or "scatter" is actually unit strided. Having separate SDNodes would solve that by handling it once for all targets in SelectionDAG construction/combines.

More complexity on IR passes for standard loads: We are already using intrinsics here and i'd rather advocate to push for a solution that makes general passes work on predicated gather/scatter (which is not the case atm, afaik).

I expect that quite a few optimizations and analyses that work for plain old loads don't work for gathers, or have to work substantially harder to work on gathers, maybe even with reduced effectiveness. I do want scatters and gathers to be optimized well, too, but I fear we'll instead end up with a pile of "do one thing if this gather is a unit-stride access, else do something different [or nothing]".

How does this plan interact with the later stages of the roadmap? At stage 5, a {Load,Store}Inst with vlen and mask is a unit-stride access, and gathers are left out in the rain, unless you first generalize load and store instructions to general gathers and scatters (which seems a bit radical).

But again, we can leave this out of the first version and keep discussing as an extension.

Yeah, this is a question of what's the best canonical form for unit-stride memory accesses; that can be debated when those actually exist in tree.

include/llvm/IR/Intrinsics.td
1147

Makes sense. However, if VL is element-grained on IR-level then there need be functions in TTI to query the native VL grain size for the target (and per scalable type). Eg for SX-Aurora in packed float mode the grain size is <2 x float> and so you might want to generate a remainder loop in that case (unpredicated vector body + predicated vector body for the last iteration).

Yes, this difference affects how you should vectorize, so it needs to be in TTI in some form.

I think the sub-vector interpretation is actually the more natural reading of SVE, considering that sub-vectors are padded/interleaved to fit a native sub-register size (128bit on SVE, 64bit on SX-Aurora and 32bit on RVV (in general, or just the RVV implementation you are working on?)). Each sub-vector in the full scalable type is offset by a multiple of that size, so a scalable type is an array of padded sub-vectors.

This behavior does not privilege or require a concept of "subvectors" (just the total number of elements/bits having a certain factor). It can be framed that way if you insist, but can just as naturally be framed as interleaving the whole vector or by "padding each element". The latter is an especially useful perspective for code handling mixed element types because there it's important that each 32 bit float lines up with the corresponding elements of vectors with larger and smaller elements.

PS: for integers the "pad each element" perspective is particularly strong, because e.g. <scalable 1 x i8> is literally just promoted to <scalable 1 x i32> (on RVV, different numbers on other architectures). No undef "interleaving" with anything, just straight up sign extension.

32bit on RVV (in general, or just the RVV implementation you are working on?)

It depends on the machine and target triple. At minimum it's the largest supported element type, and some environments ("platform specs" in RISC-V jargon) could demand more from supported hardware/guarantee more to software. The relevant parameters are called ELEN and VLEN in the spec.

lkcl added a subscriber: lkcl. Feb 5 2019, 6:51 PM

Certainly, that's why I say it's secondary and mostly a hint that something is amiss with "the current thinking". In fact, I am by now inclined to propose that Jacob and collaborators start out by expressing their architecture's operations with target-specific intrinsics that also use the attributes introduced here (especially since none of the typical vectorizers are equipped to generate the sort of code they want from scalar code using e.g. float4 types).

just so you know: we're not using a "typical vectoriser", we are writing our own, and it is very very specifically designed to accommodate the float3 and float4 datatypes. it is a SPIR-V to LLVM-IR compiler (used by both OpenCL and 3D Shaders compatible with the Vulkan API). to reiterate: we are *not* expecting to vector-augment or even use clang-llvm, rustlang-llvm or other vectoriser, we are *directly* translating SPIR-V IR into LLVM IR.

so LLVM IR having these vectorisation features is important (for the Kazan Project), yet, bizarrely, front-end support for vectorisation is *not* important (for the Kazan Project).

Alternatively, use a dynamic vector length of <the length their architecture wants> * <how many elements each of the short vectors has> and fix it up in the backend.

remember that both RVV and SV support the exact same vscale (variable-length multiples of elements) concept.

so if EVL does not support vscale, it would be mandatory for *both* RVV *and* SV to carry out sub-optimal backend fixups. that would mean in turn that both RVV and SV would be forced to create a copy of the predicate mask, multiplied out by the vscale length, and other associated significantly sub-optimal decisions.

whereas, for a SIMD-style architecture that doesn't *have* predication at all, they're *already* going to have to use gather/scatter, they're *already* sub-optimal, and taking vscale into account in the loop is a trivial backend modification.

effectively, the finest-grained features need to be exposed to the top-level API, whatever they are, otherwise architectures that have such features need to "dumb down" by way of back-end fixups, and the opportunity for LLVM to be a leading all-inclusive compiler technology is lost.

now, at some point, we might ask ourselves "what the heck, why is everyone having to suffer just because of one particularly obtuse flexible feature that other architectures do not have??" which is a not very nice but quite reasonable way to put it... and the answer to that is: do we *really* want to encourage ISA architectural designers to keep things as they've always been [primarily SIMD]?

simoll added a comment. Feb 6 2019, 4:56 AM

This seems shaky. When generalized to scalable vector types, it means a load of a scalable vector would be evl.gather(<1 x double*> %p, <scalable n x i1>), which mixes fixed and scaled vector sizes. While it's no big deal to test the divisibility, allowing "mixed scalability" increases the surface area of the feature and not in a direction that seems desirable. For example, it strongly suggests permitting evl.add(<scalable n x i32>, <scalable n x i32>, <n x i1>, ...) where each mask bit controls vscale many lanes -- quite unnatural, and not something that seems likely to ever be put into hardware.

Mixing vector types and scalable vector types is illegal and is not what i was suggesting. Rather, a scalar pointer would be passed to convey a consecutive load/store from a single address.

Ok, sorry, I misunderstood the proposal then. That seems reasonable, although I remain unsure about the benefits.

And what for? I see technical disadvantages (less readable IR, needing more finicky pattern matching in the backend, more complexity in IR passes that work better on loads than on general gathers) and few if any technical advantages. It's a little conceptual simplification, but only at the level of abstraction where you don't care about uniformity.

Less readable IR: the address computation would become simpler, eg there is no need to synthesize a consecutive constant only to have it pattern-matched and subsumed in the backend (eg <0, 1, 2, ..., 15, 0, 1, 2, 3.., 15, 0, ....>).

I don't really follow, here we were only talking about subsuming unit-stride loads under gathers (and likewise for stores/scatters), right? But anyway, I was mostly worried about a dummy 1-element vector and the extra type parameter(s) on the intrinsic, which isn't an issue with what you actually proposed.

Finicky pattern matching: it is trivial to legalize this by expanding it into a more standard gather/scatter, or splitting it into consecutive memory accesses.. we can even keep the EVL_LOAD, EVL_STORE SDNodes in the backend so you woudn't even realize that an llvm.evl.gather was used for a consecutive load on IR level.

I'm not talking about legalizing a unit-stride access into a gather/scatter (when would that be necessary? everyone who has scatter/gather also has unit-stride), but about recognizing that the "gather" or "scatter" is actually unit strided. Having separate SDNodes would solve that by handling it once for all targets in SelectionDAG construction/combines.

More complexity on IR passes for standard loads: We are already using intrinsics here and i'd rather advocate to push for a solution that makes general passes work on predicated gather/scatter (which is not the case atm, afaik).

I expect that quite a few optimizations and analyses that work for plain old loads don't work for gathers, or have to work substantially harder to work on gathers, maybe even with reduced effectiveness. I do want scatters and gathers to be optimized well, too, but I fear we'll instead end up with a pile of "do one thing if this gather is a unit-stride access, else do something different [or nothing]".

How does this plane interact with the later stages of the roadmap? At stage 5, a {Load,Store}Inst with vlen and mask is a unit-stride access, and gathers are left out in the rain, unless you first generalize load and store instructions to general gathers and scatters (which seems a bit radical).

But again, we can leave this out of the first version and keep discussing as an extension.

Yeah, this is a question of what's the best canonical form for unit-stride memory accesses, that can be debated when the those actually exist in tree.

I was referring to the generalized gather, for example:

evl.gather.nxv4f32(<scalable 1 x float*> %Ptr, <scalable 4 x i1> %Mask, ...)

This would load a sub-vector of <4 x float> size from each element pointer in the %Ptr vector. Consecutive (unit-strided) loads just happen to be a corner case of the generalized gather. In particular, the remark about emitting verbose address computation code just to pattern-match it later should be seen in that light; e.g. see https://lists.llvm.org/pipermail/llvm-dev/2019-February/129942.html.

Certainly, that's why I say it's secondary and mostly a hint that something is amiss with "the current thinking". In fact, I am by now inclined to propose that Jacob and collaborators start out by expressing their architecture's operations with target-specific intrinsics that also use the attributes introduced here (especially since none of the typical vectorizers are equipped to generate the sort of code they want from scalar code using e.g. float4 types).

just so you know: we're not using a "typical vectoriser", we are writing our own, and it is very very specifically designed to accommodate the float3 and float4 datatypes. it is a SPIR-V to LLVM-IR compiler (used by both OpenCL and 3D Shaders compatible with the Vulkan API). to reiterate: we are *not* expecting to vector-augment or even use clang-llvm, rustlang-llvm or other vectoriser, we are *directly* translating SPIR-V IR into LLVM IR.

Err.. re-vectorizing float3/float4 code will mostly concern the vectorizer backend ("widening phase"); all other stages should accept vectors as "scalar" data types. I am talking about RV here, of course (https://github.com/cdl-saarland/rv) ;-)

so LLVM IR having these vectorisation features is important (for the Kazan Project), yet, bizarrely, front-end support for vectorisation is *not* important (for the Kazan Project).

Alternatively, use a dynamic vector length of <the length their architecture wants> * <how many elements each of the short vectors has> and fix it up in the backend.

remember that both RVV and SV support the exact same vscale (variable-length multiples of elements) concept.

We've already agreed that the unit of the vlen parameter is just the element type (that is the sub-vector element type in the vector-of-subvector interpretation of scalable types).

simoll planned changes to this revision. Feb 6 2019, 1:37 PM

Planned (Accepted)

Open (Required)

  • TTI Queries:
    • What is the native scalable sub-vector size (128bit on SVE, 64bit on SX-Aurora, ...)?
    • What is the native vlen granularity for this target and element size (e.g. 64-bit grain for SX-Aurora, element-wise for RVV/SV)?

Open (Future Work)

Food for thought

  • Use of higher-order functions to encode reductions, e.g.:
%reduced = call float @llvm.vp.reduce.nxv2f32(@llvm.vp.fadd.f32, %Mask, %Len, float 0.0, <scalable 2x f32> %DataVector)

This means fewer intrinsics, and the passed-in function could also be user-defined; compare the per-operator alternative sketched below.
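
For comparison, the non-higher-order alternative would presumably need one intrinsic per reduction operator; the following spelling is only a guess at what that could look like, not a settled interface:

  %reduced = call float @llvm.vp.reduce.fadd.nxv2f32(float 0.0, <scalable 2 x float> %DataVector,
                                                     <scalable 2 x i1> %Mask, i32 %Len)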

lkcl added a comment. Feb 6 2019, 7:23 PM

remember that both RVV and SV support the exact same vscale (variable-length multiples of elements) concept.

We've already agreed that the unit of the vlen parameter is just the element type (that is the sub-vector element type in the vector-of-subvector interpretation of scalable types).

appreciated, simoll, apologies for giving the false impression that there was disagreement with that: i remained silent previously, as i agreed with the conclusion / logic, that vlen should match the full total number of elements even on sub-vectors.

i *believe* (robin, can you clarify / confirm?) that robin may have been disagreeing on predicate masks, i.e. i *believe* that robin may be requesting that predicate masks *also* match the elements one-for-one. in SV, as jacob expressed: we definitely feel that an option for predicates to be on the *sub-vectors* i.e. len(predicate_mask) == VL/VSCALE (*not* just VL) would result in much more optimal SV-assembler being generated.

We've already agreed that the unit of the vlen parameter is just the element type (that is the sub-vector element type in the vector-of-subvector interpretation of scalable types).

appreciated, simoll, apologies for giving the false impression that there was disagreement with that: i remained silent previously, as i agreed with the conclusion / logic, that vlen should match the full total number elements even on sub-vectors.

i *believe* (robin, can you clarify / confirm?) that robin may have been disagreeing on predicate masks, i.e. i *believe* that robin may be requesting that predicate masks *also* match the elements one-for-one. in SV, as jacob expressed: we definitely feel that an option for predicates to be on the *sub-vectors* i.e. len(predicate_mask) == VL/VSCALE (*not* just VL) would result in much more optimal SV-assembler being generated.

For completeness, I just want to add the suggestion I made on the mailing list that subvector types could be expressed as <scalable 1 x <3 x float>>, where the vector element type is itself a vector. Masks operating in terms of subvectors would be <scalable 1 x <3 x i1>>. These are really matrix types. :) Intrinsics could be provided for both the traditional flat vectors and the vector-as-vector-element types.

ARM's SVE proposal interprets <scalable n x type> as a flat vector and I think it would be confusing to change the interpretation of that, even if it is target-dependent. Separate concepts should have separate types.

jbhateja removed a subscriber: jbhateja. Feb 8 2019, 11:18 PM
jbhateja added a subscriber: jbhateja.

Documenting this here: Constrained EVL intrinsics will be necessary for trapping fp ops (https://lists.llvm.org/pipermail/llvm-dev/2019-January/129806.html).

simoll updated this revision to Diff 186665. Feb 13 2019, 7:39 AM
simoll retitled this revision from RFC: EVL Prototype & Roadmap for vector predication in LLVM to RFC: Prototype & Roadmap for vector predication in LLVM.
simoll edited the summary of this revision.

Renamed EVL to VP

programmerjake resigned from this revision. Feb 13 2019, 11:54 AM
rengolin edited reviewers, added: mkuper, rkruppe, fhahn; removed: programmerjake. Feb 14 2019, 2:57 AM
rengolin added subscribers: Ayal, hsaito.
chill added a subscriber: chill. Mar 16 2019, 2:01 AM
simoll updated this revision to Diff 191252. Mar 19 2019, 12:10 AM
  • re-based onto master
simoll updated this revision to Diff 195366. Apr 16 2019, 6:39 AM

Updates

  • added constrained fp intrinsics (IR level only).
  • initial support for mapping llvm.experimental.constrained.* intrinsics to llvm.vp.constrained.*.

Cross references

    1. Updates
  • added constrained fp intrinsics (IR level only).
  • initial support for mapping llvm.experimental.constrained.* intrinsics to llvm.vp.constrained.*.

Do we really need both vp.fadd() and vp.constrained.fadd()? Can't we just use the latter with rmInvalid/ebInvalid? That should prevent vp.constrained.fadd from losing optimizations w/o good reasons.
Do we have enough upside in having both?

    1. Updates
  • added constrained fp intrinsics (IR level only).
  • initial support for mapping llvm.experimental.constrained.* intrinsics to llvm.vp.constrained.*.

Do we really need both vp.fadd() and vp.constrained.fadd()? Can't we just use the latter with rmInvalid/ebInvalid? That should prevent vp.constrained.fadd from losing optimizations w/o good reasons.

According to the LLVM langref, "fpexcept.ignore" seems to be the right option for exceptions whereas there is no "round.permissive" option for the rounding behavior. Abusing rmInvalid/ebInvalid seems hacky.

Do we have enough upside in having both?

I see no harm in having both since we already add the infrastructure in LLVM-VP to abstract away from specific instructions and/or intrinsics. Once (if ever) exception behavior and rounding modes become available for native instructions (or can be an optional tag-on like fast-math flags), we can deprecate all constrained intrinsics and use llvm.vp.fdiv, etc., or native instructions instead.

    1. Updates
  • added constrained fp intrinsics (IR level only).
  • initial support for mapping llvm.experimental.constrained.* intrinsics to llvm.vp.constrained.*.

Do we really need both vp.fadd() and vp.constrained.fadd()? Can't we just use the latter with rmInvalid/ebInvalid? That should prevent vp.constrained.fadd from losing optimizations w/o good reasons.

According to the LLVM langref, "fpexcept.ignore" seems to be the right option for exceptions whereas there is no "round.permissive" option for the rounding behavior. Abusing rmInvalid/ebInvalid seems hacky.

Do we have enough upside in having both?

I see no harm in having both since we already add the infrastructure in LLVM-VP to abstract away from specific instructions and/or intrinsics. Once (if ever) exception, rounding mode become available for native instructions (or can be an optional tag-on like fast-math flags), we can deprecate all constrained intrinsics and use llvm.vp.fdiv, etc or native instructions instead.

There is an indirect harm in adding more intrinsics with partially-redundant semantics: writing transformations and analyses requires logic that handles both forms. I recommend having fewer intrinsics where we can have fewer intrinsics.

Do we really need both vp.fadd() and vp.constrained.fadd()? Can't we just use the latter with rmInvalid/ebInvalid? That should prevent vp.constrained.fadd from losing optimizations w/o good reasons.

According to the LLVM langref, "fpexcept.ignore" seems to be the right option for exceptions whereas there is no "round.permissive" option for the rounding behavior. Abusing rmInvalid/ebInvalid seems hacky.

Then, please propose one more rounding mode, like round.permissive or round.any.

    1. Updates
  • added constrained fp intrinsics (IR level only).
  • initial support for mapping llvm.experimental.constrained.* intrinsics to llvm.vp.constrained.*.

Do we really need both vp.fadd() and vp.constrained.fadd()? Can't we just use the latter with rmInvalid/ebInvalid? That should prevent vp.constrained.fadd from losing optimizations w/o good reasons.

According to the LLVM langref, "fpexcept.ignore" seems to be the right option for exceptions whereas there is no "round.permissive" option for the rounding behavior. Abusing rmInvalid/ebInvalid seems hacky.

Do we have enough upside in having both?

I see no harm in having both since we already add the infrastructure in LLVM-VP to abstract away from specific instructions and/or intrinsics. Once (if ever) exception, rounding mode become available for native instructions (or can be an optional tag-on like fast-math flags), we can deprecate all constrained intrinsics and use llvm.vp.fdiv, etc or native instructions instead.

There is an indirect harm in adding more intrinsics with partially-redundant semantics: writing transformations and analyses requires logic that handles both forms. I recommend having fewer intrinsics where we can have fewer intrinsics.

Yep. If one additional generally-useful rounding mode gets rid of several partially redundant intrinsics, that would be a good trade-off.

According to the LLVM langref, "fpexcept.ignore" seems to be the right option for exceptions whereas there is no "round.permissive" option for the rounding behavior. Abusing rmInvalid/ebInvalid seems hacky.

If you use "round.tonearest" that will get you the same semantics as the non-constrained version. The optimizer assumes round-to-nearest by default.

kpn added a subscriber: kpn. Apr 17 2019, 12:27 PM

Would it make sense to also update docs/AddingConstrainedIntrinsics.rst please?

simoll planned changes to this revision. Apr 18 2019, 1:33 AM

Thanks for your feedback!

Planned

  • Make the llvm.vp.constrained.* versions the only fp ops in VP. Encode default fp semantics by passing fpexcept.ignore and round.tonearest (see the sketch below).
  • Update docs/AddingConstrainedIntrinsics.rst to account for the fact that llvm.experimental.constrained.* is no longer the only namespace for constrained intrinsics.
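
For illustration, folding the default fp environment into the constrained form might look like the sketch below; the metadata spellings mirror the existing llvm.experimental.constrained.* intrinsics, while the llvm.vp.constrained.* name and operand order are assumptions of this sketch:

  %r = call <8 x double> @llvm.vp.constrained.fdiv.v8f64(
           <8 x double> %a, <8 x double> %b,
           metadata !"round.tonearest", metadata !"fpexcept.ignore",
           <8 x i1> %mask, i32 %evl)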
In D57504#1470705, @kpn wrote:

Would it make sense to also update docs/AddingConstrainedIntrinsics.rst please?

Sure. I don't think we should match (in the API) an llvm.vp.constrained.* intrinsic as a ConstrainedFPIntrinsic, though.
Conceptually, an llvm.vp.constrained.* intrinsic is indeed both a VPIntrinsic and a ConstrainedFPIntrinsic. But if the latter interface is used to transform them, ignoring the mask and vector length arguments along the way, we'll see breakage (in the future, once there are transforms for constrained fp).

vkmr added a subscriber: vkmr. May 7 2019, 7:55 AM