Page MenuHomePhabricator

RFC: Prototype & Roadmap for vector predication in LLVM
Changes PlannedPublic

Authored by simoll on Jan 31 2019, 3:12 AM.

Details

Summary

Vector Predication Roadmap

This proposal defines a roadmap towards native vector predication in LLVM, specifically for vector instructions with a mask and/or an explicit vector length.
LLVM currently has no target-independent means to model predicated vector instructions for modern SIMD ISAs such as AVX512, ARM SVE, the RISC-V V extension and NEC SX-Aurora.
Only some predicated vector operations, such as masked loads and stores are available through intrinsics [MaskedIR]_.

Please use docs/Proposals/VectorPredication.rst to comment on the summary.

Vector Predication intrinsics

The prototype in this patch demonstrates the following concepts:

  • Predicated vector intrinsics with an explicit mask and vector length parameter on IR level.
  • First-class predicated SDNodes on ISel level. Mask and vector length are value operands.
  • An incremental strategy to generalize PatternMatch/InstCombine/InstSimplify and DAGCombiner to work on both regular instructions and VP intrinsics.
  • DAGCombiner example: FMA fusion.
  • InstCombine/InstSimplify example: FSub pattern re-writes.
  • Early experiments on the LNT test suite (Clang static release, O3 -ffast-math) indicate that compile time on non-VP IR is not affected by the API abstractions in PatternMatch, etc.

Roadmap

Drawing from the prototype, we propose the following roadmap towards native vector predication in LLVM:

1. IR-level VP intrinsics

  • There is a consensus on the semantics/instruction set of VP intrinsics.
  • VP intrinsics and attributes are available on IR level.
  • TTI has capability flags for VP (`supportsVP()?, haveActiveVectorLength()`?).

Result: VP usable for IR-level vectorizers (LV, VPlan, RegionVectorizer), potential integration in Clang with builtins.

2. CodeGen support

  • VP intrinsics translate to first-class SDNodes (`llvm.vp.fdiv.* -> vp_fdiv`).
  • VP legalization (legalize explicit vector length to mask (AVX512), legalize VP SDNodes to pre-existing ones (SSE, NEON)).

Result: Backend development based on VP SDNodes.

3. Lift InstSimplify/InstCombine/DAGCombiner to VP

  • Introduce PredicatedInstruction, PredicatedBinaryOperator, .. helper classes that match standard vector IR and VP intrinsics.
  • Add a matcher context to PatternMatch and context-aware IR Builder APIs.
  • Incrementally lift DAGCombiner to work on VP SDNodes as well as on regular vector instructions.
  • Incrementally lift InstCombine/InstSimplify to operate on VP as well as regular IR instructions.

Result: Optimization of VP intrinsics on par with standard vector instructions.

4. Deprecate llvm.masked.* / llvm.experimental.reduce.*

  • Modernize llvm.masked.* / llvm.experimental.reduce* by translating to VP.
  • DCE transitional APIs.

Result: VP has superseded earlier vector intrinsics.

5. Predicated IR Instructions

  • Vector instructions have an optional mask and vector length parameter. These lower to VP SDNodes (from Stage 2).
  • Phase out VP intrinsics, only keeping those that are not equivalent to vectorized scalar instructions (reduce, shuffles, ..).
  • InstCombine/InstSimplify expect predication in regular Instructions (Stage (3) has laid the groundwork).

Result: Native vector predication in IR.

References

.. [MaskedIR] llvm.masked.* intrinsics, https://llvm.org/docs/LangRef.html#masked-vector-load-and-store-intrinsics
.. [EvlRFC] Explicit Vector Length RFC, https://reviews.llvm.org/D53613

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
SjoerdMeijer added inline comments.Oct 29 2019, 1:54 PM
llvm/include/llvm/CodeGen/SelectionDAGNodes.h
2357

VP_LOAD and VP_STORE?

2386

same?

2426

.. does a truncation before store sounds a bit odd. Since 'truncating store' is a well known term, and that you explain what it is for ints/floats below, I think it suffices to say "Return true if this is truncating store. For intergers ..."

    1. Observations
  • When using fpexcept.ignore, the fp callsites should have the readnone attribute set on them to override the inaccessiblememonly of the intrinsic declarations. That way DCE still works.

Wouldn't that allow the call to be moved relative to other calls? Specifically, we need to make sure intrinsics aren't moved relative to calls that change the rounding mode. The "inaccessiblememonly" attribute is meant to model both the reading of control modes and the possible setting of status flags or raising of exceptions"

  • The rules for constrained fp (strictfp on the function definition, only constrained fp in that function) apply only if there is a single fp op with exceptions in the function. That is strictfp is not necessary when all fp ops have fpexcept.ignore.

I don't think this is right. Even if there are no constrained FP operations in the function we might have math library calls for which the strictfp attribute is needed to prevent libcall simplification and constant folding that might violate the rounding mode.

    1. Observations
  • When using fpexcept.ignore, the fp callsites should have the readnone attribute set on them to override the inaccessiblememonly of the intrinsic declarations. That way DCE still works.

Wouldn't that allow the call to be moved relative to other calls? Specifically, we need to make sure intrinsics aren't moved relative to calls that change the rounding mode. The "inaccessiblememonly" attribute is meant to model both the reading of control modes and the possible setting of status flags or raising of exceptions"

  • The rules for constrained fp (strictfp on the function definition, only constrained fp in that function) apply only if there is a single fp op with exceptions in the function. That is strictfp is not necessary when all fp ops have fpexcept.ignore.

I don't think this is right. Even if there are no constrained FP operations in the function we might have math library calls for which the strictfp attribute is needed to prevent libcall simplification and constant folding that might violate the rounding mode.

I see. Since we need to model the default fp environment with these intrinsics (and this is our priority), let me make the following suggestion: VP intrinsics will have a rounding mode and exception behavior argument from the start but the only allowed values are "round.tonearest" and "fpexcept.ignore". Once we have a solution for the general case implemented for constraint fp, we will unlock that feature also for LLVM VP.

simoll marked 8 inline comments as done.Oct 30 2019, 2:06 AM

Hi Simon, I went through the code for the first time, and this is a first round of proper nitpicks from my side. Please ignore if you want to focus on the bigger picture at this point in the discussion, but these are just some things I noticed. General nitpick is that you should run clang-format as there are quite a few coding style issues: indentation, indentation of arguments, exceeding 80 columns, placement of * and & in arguments and return values, etc. And find some more nitpicks inlined.

Hi Sjoerd, thanks for you comments! i've fixed the inline nitpicks right away. I'll do a style pass for the actual commits.

Hi Simon, I went through the code for the first time, and this is a first round of proper nitpicks from my side. Please ignore if you want to focus on the bigger picture at this point in the discussion, but these are just some things I noticed. General nitpick is that you should run clang-format as there are quite a few coding style issues: indentation, indentation of arguments, exceeding 80 columns, placement of * and & in arguments and return values, etc. And find some more nitpicks inlined.

Hi Sjoerd, thanks for you comments! i've fixed the inline nitpicks right away. I'll do a style pass for the actual commits.

Cheers. Just curious, what are your next steps? People can correct me if I'm wrong, but my impression is that with the RFC, this prototype, the discussion at the US LLVM dev conference, there is consensus and people are on-board with the general idea and direction. There are still some discussions on e.g. the (constrained) FP part, but would it now be the time split this for example up in an separate commits, like an INT and FP part (if that makes sense), so that they can be (separately) progressed?

simoll added a comment.EditedNov 4 2019, 4:30 AM

Hi Simon, I went through the code for the first time, and this is a first round of proper nitpicks from my side. Please ignore if you want to focus on the bigger picture at this point in the discussion, but these are just some things I noticed. General nitpick is that you should run clang-format as there are quite a few coding style issues: indentation, indentation of arguments, exceeding 80 columns, placement of * and & in arguments and return values, etc. And find some more nitpicks inlined.

Hi Sjoerd, thanks for you comments! i've fixed the inline nitpicks right away. I'll do a style pass for the actual commits.

Cheers. Just curious, what are your next steps? People can correct me if I'm wrong, but my impression is that with the RFC, this prototype, the discussion at the US LLVM dev conference, there is consensus and people are on-board with the general idea and direction. There are still some discussions on e.g. the (constrained) FP part, but would it now be the time split this for example up in an separate commits, like an INT and FP part (if that makes sense), so that they can be (separately) progressed?

Yes, i hope that's where things are right now ;-) I am planning to go by functional slices. Each slice comes with IR-level intrinsics, TTI support, basic lowering to standard IR, Selection DAG support and tests.

I am preparing the first patchset for integer support atm.

Slices:

  • Integer slice.
  • Memory slice.
  • Reduction slice.
  • FP (with unconstrained metadata args) slice.

Standalone patch:

  • Mask, VectorLength and Passthru attributes (in preparation of vector function calls).

Pending discussion/separate RFC:

  • Constrained FP (being able to fully optimize constrained fp intrinsics in the default fp env).
  • Generalized pattern match (aka optimizing VP).
simoll added a comment.Nov 6 2019, 1:13 AM

D69552: Move floating point related entities to namespace level contains the fp enum changes required for LLVM-VP. Referencing the patch here.

simoll updated this revision to Diff 228052.Nov 6 2019, 6:27 AM

Fixed attribute placements, signatures, more tests, ..
This is in sync with the subpatch #1 of the integer slice (https://reviews.llvm.org/D69891).

simoll added a comment.Nov 6 2019, 6:29 AM

Integer slice patches

#1 IR-level support: https://reviews.llvm.org/D69891
#2 TTI & Legalization: <stay tuned>
#3 ISel patch: <stay tuned>

I'll update this comment as we go to keep track of the integer slice.

simoll updated this revision to Diff 228885.Nov 12 2019, 6:48 AM

Changes

  • VPIntrinsics.def file.
  • Pass vlen i32 -1 to enable all lanes with scalable vector types.
  • Various NFC fixes.

Moving the discussion from the integer patch alley to the main RFC as this is about the general design of VP intrinsics.. it's about having a passthru operand (as in llvm.masked.load) and whether %evl should be a parameter of the intrinsics or modelled differently.

@SjoerdMeijer https://reviews.llvm.org/D69891#inline-636845
and if I'm not mistaken we are now discussing if undef here should be undef or a passthru

@rkruppe https://reviews.llvm.org/D69891#inline-637215
I previously felt that passthru would be nice to have for backend maintainers (including myself) but perhaps not worth the duplication of IR functionality (having two ways to do selects). However, given the differences I just described, I don't think "just use select" is workable.

Ok. I do agree that having passthru simplifies isel for certain architectures (and legal combinations of passthru value, type, and operations..) but:
VP intrinsics aren't target intrinsics: they are not supposed to be a way to directly program any specific ISA in the way of a macroassembler, like you would do with llvm.x86.* or llvm.arm.mve.* or any other. Rather, think of them as regular IR instructions. Pretend that anything we propose in VP intrinsics will end up as a feature of a first-class LLVM instructions. Based on that i figured that one VP intrinsics should match one IR instructions plus predication, nothing more.

  • If we had predicated IR instructions, would we want them to have a passthru operand?
  • The prototype shows that defining VP intrinsics with undef-on-masked-out makes it straightforward to generalize InstSimplify/InstCombine/DAGCombiner such that they can optimize VP intrinsics. If you add a passthru operand then logically VP intrinsics start to behave like two instructions: that could be made work but it would be messier as you'd have to peek through selects, etc.

@sdesmalen https://reviews.llvm.org/D69891#1750287
If we want to solve the select issue and also keep the intrinsics simple, my suggestion was to combine the explicit vector length with the mask using an explicit intrinsic like @llvm.vp.enable.lanes. Because this is an explicit intrinsic, the code-generator can simply extract the %evl parameter and pass that directly to the instructions for RVV/SXA. This is what happens for many other intrinsics in LLVM already, like masked.load/masked.gather that support only a single addressing mode, where it is up to the code-generator to pick apart the value into operands that are suited for a more optimal load instruction.

Without having heard your thoughts on this suggestion, I would have to guess that your reservation is the possibility of LLVM hoisting/separating the logic that merges predicate mask and %evl value in some way. That would mean having to do some tricks (think CodeGenPrep) to keep the values together and recognizable for CodeGen. And that's the exact same thing we would like to avoid for supporting merging/zeroing predication, hence the suggestion for the explicit passthru parameter.

That's not quite the same:
%evl is mapped to a hardware register on SX-Aurora. We cannot simply reconstitute the %evl from any given mask, if %evl is obscured it makes all operations that depend on it less efficient because we need to default to the full vector length. Now, if the select is separated from the VP intrinsic, you simply emit one select instruction (and it should be possible to hoist it back and merge it with the VP intrinsic in most cases (.. and you probably want an optimization that does that in anyway because there will be code with explicit selects even with passthru)). Besides, if the select is folded with an instruction that is subsequently simpler then that's actually an argument in favor of explicit selects: passthru makes this implicit.

Ok. I do agree that having passthru simplifies isel for certain architectures (and legal combinations of passthru value, type, and operations..) but:
VP intrinsics aren't target intrinsics: they are not supposed to be a way to directly program any specific ISA in the way of a macroassembler, like you would do with llvm.x86.* or llvm.arm.mve.* or any other. Rather, think of them as regular IR instructions. Pretend that anything we propose in VP intrinsics will end up as a feature of a first-class LLVM instructions. Based on that i figured that one VP intrinsics should match one IR instructions plus predication, nothing more.

  • If we had predicated IR instructions, would we want them to have a passthru operand?

I think that would probably be a reasonable clean-slate IR design, though I am not at all sure if it would be better. I wasn't specifically advocating for passthru operands, though. I agree that not having passthru and performing the same function in two operations can be readily pattern-matched by backends at modest effort and failing to match it has low cost. My main point was just that the existing select instruction is not sufficient as the second operation, for essentially the same reason why the VP intrinsics have an EVL argument instead of just the mask. Creating a VP equivalent of select (as already sketched in the other thread) resolves that concern just as well.

simoll marked an inline comment as done.Dec 3 2019, 4:18 AM
simoll added inline comments.
llvm/docs/LangRef.rst
15194–15211

@rkruppe
[..] My main point was just that the existing select instruction is not sufficient as the second operation, for essentially the same reason why the VP intrinsics have an EVL argument instead of just the mask. Creating a VP equivalent of select (as already sketched in the other thread) resolves that concern just as well.

I agree. The prototype has defined such an llvm.vp.select from the get-go.

rkruppe added inline comments.Dec 3 2019, 10:06 AM
llvm/docs/LangRef.rst
15194–15211

Oops, missed that / forgot about it. Sorry for the noise.

Is there a reason why it's not in the "integer slice" patch? It's not integer-specific, but it seems to fit even less into the other slices.

simoll marked an inline comment as done.Dec 9 2019, 12:52 AM
simoll added inline comments.
llvm/docs/LangRef.rst
15194–15211

I wanted to keep the integer patch concise for one. Also, having played around with this for a while now, i think that the signature of vp.select should be:

llvm.vp.select(<W x i1> %m, %onTrue, %onFalse, i32 %threshold, i32 vlen %evl)

meaning that values from %onTrue are selected where %m is true and the lane index is below %threshold. %onFalse is selected otherwise. Lane indices greater-equal %evl are undef as ever. In short: there is just one "merge" operation and no more separate vp.compose.

Matt added a subscriber: Matt.Dec 18 2019, 12:57 PM
simoll planned changes to this revision.Jan 9 2020, 4:59 AM
  • Add pivot/threshold argument to llvm.vp.select and remove llvm.vp.compose
  • Clarify documentation on preserved lanes
  • explicit vlen arg is either negative or (new requirement) less-equal-than the number of lanes of the operation.

(not sure if I should continue here or in D69891, will try here first)

Sorry for dipping out of this discussion. I.e. after our "passthru discussion", I wanted to do more homework to make sure a "separate select" would work for us, if we wouldn't miss anything, but then other work happened and never got round to this. But I am still very interested, so dipping back in :-/

If we had predicated IR instructions, would we want them to have a passthru operand?

I think that would probably be a reasonable clean-slate IR design, though I am not at all sure if it would be better.

One of the problems I had that I found it difficult to see all consequences, and answers the sort of questions asked above (also because I haven't yet spend enough time on this). For example, being explicit in IR is in general a good thing to do? So yes, why not a passthru? But then I had the same question as Robin, not sure it would be better. The other thing I would mention again is that I think convenience is a pretty strong argument too, if this is most convenient for at least two / three other architectures, then why not? But then you could argue that it is simple to patch up with a select, and we're going in circles... At least the concern Robin brought up about the select seems to be addressed with the vp.select.

(not sure if I should continue here or in D69891, will try here first)

Yep, the RFC is the right place for conceptual discussions.

Sorry for dipping out of this discussion. I.e. after our "passthru discussion", I wanted to do more homework to make sure a "separate select" would work for us, if we wouldn't miss anything, but then other work happened and never got round to this. But I am still very interested, so dipping back in :-/

Welcome back :)

If we had predicated IR instructions, would we want them to have a passthru operand?

I think that would probably be a reasonable clean-slate IR design, though I am not at all sure if it would be better.

One of the problems I had that I found it difficult to see all consequences, and answers the sort of questions asked above (also because I haven't yet spend enough time on this). For example, being explicit in IR is in general a good thing to do? So yes, why not a passthru? But then I had the same question as Robin, not sure it would be better. The other thing I would mention again is that I think convenience is a pretty strong argument too, if this is most convenient for at least two / three other architectures, then why not? But then you could argue that it is simple to patch up with a select, and we're going in circles... At least the concern Robin brought up about the select seems to be addressed with the vp.select.

Couldn't agree more. I guess we just do now know at this point.. how about we move the discussion away from "which would be better?" to "if we decide for A now and later strongly realize that B would have been the right call.. how bad a u-turn would that be?"

Changes required going from passthru to select:

  • IR: modernize VP with passthru to intrinsic+select
  • Nothing more.. since we already had to implement the select+intrinsic matching logic anyway to fuse explicit selects into passthru operands.
  • Dead code: all the logic for dealing with the passthru operand: PatternMatch for passthru (instcombine, instsimplify, known bits..), etc

Changes required going from select to passthru:

  • IR: modernize and pass 'undef' as passthru
  • Implement that pass from the other scenario that folds select into passthru (and all the additional logic for dealing with passthru).
  • Dead code: none

My point here is that no matter how we decide: explicit selects and vp intrinsics will co-exist and have to be folded/optimized. However, in the explicit-select scenario we do not have to teach LLVM about passthru operands (PatternMatch -> InstCombine, ...).
Btw, I guess that https://reviews.llvm.org/D71432 shows that op+select folding can be cleanly implemented in isel and that's also in line with my experiments for the VE target.
Regarding convenience: the IRBuilder could have, eg, a ::CreatePredicatedFAdd with an explicit (optional) passthru operand.. resulting in a VP op + select.

Couldn't agree more. I guess we just do now know at this point.. how about we move the discussion away from "which would be better?" to "if we decide for A now and later strongly realize that B would have been the right call.. how bad a u-turn would that be?"

Changes required going from passthru to select:

  • IR: modernize VP with passthru to intrinsic+select
  • Nothing more.. since we already had to implement the select+intrinsic matching logic anyway to fuse explicit selects into passthru operands.
  • Dead code: all the logic for dealing with the passthru operand: PatternMatch for passthru (instcombine, instsimplify, known bits..), etc

Changes required going from select to passthru:

  • IR: modernize and pass 'undef' as passthru
  • Implement that pass from the other scenario that folds select into passthru (and all the additional logic for dealing with passthru).
  • Dead code: none

My point here is that no matter how we decide: explicit selects and vp intrinsics will co-exist and have to be folded/optimized. However, in the explicit-select scenario we do not have to teach LLVM about passthru operands (PatternMatch -> InstCombine, ...).
Btw, I guess that https://reviews.llvm.org/D71432 shows that op+select folding can be cleanly implemented in isel and that's also in line with my experiments for the VE target.
Regarding convenience: the IRBuilder could have, eg, a ::CreatePredicatedFAdd with an explicit (optional) passthru operand.. resulting in a VP op + select.

Thanks for summarising this. Fair enough, I think this sounds like a (good) plan.
I will continue in D69891, and will leave a comment there.

Btw, I guess that https://reviews.llvm.org/D71432 shows that op+select folding can be cleanly implemented in isel and that's also in line with my experiments for the VE target.

This needs a caveat. Keeping the select glued to the operation takes some careful effort. Especially in the undef passthru case, there are a bunch of peeps that will incorrectly fold away the select. E.g. this transform from InstSimplify:

if (isa<UndefValue>(FalseVal))   // select ?, X, undef -> X
  return TrueVal;

The VP intrinsics will certainly be immune to these, but if the plan is to eventually replace the VP select intrinsics with IR selects, then this problem will need to be solved. Just a heads up...

Btw, I guess that https://reviews.llvm.org/D71432 shows that op+select folding can be cleanly implemented in isel and that's also in line with my experiments for the VE target.

This needs a caveat. Keeping the select glued to the operation takes some careful effort. Especially in the undef passthru case, there are a bunch of peeps that will incorrectly fold away the select. E.g. this transform from InstSimplify:

if (isa<UndefValue>(FalseVal))   // select ?, X, undef -> X
  return TrueVal;

The VP intrinsics will certainly be immune to these, but if the plan is to eventually replace the VP select intrinsics with IR selects, then this problem will need to be solved. Just a heads up...

As Eli argued in that patch, IR like select %m, (constrained.fadd %a, %b), %passthru is not expressing a predicated vector add, and must not be selected as such. The IR semantics are unambiguously: first a full vector add is performed (with all exceptions etc. that entails, or possible UB in related cases like integer division) and then some of the resulting lanes are replaced with values from %passthru. To predicate the fadd itself, a dedicated operation/intrinsic is needed. LLVM IR does not currently (and should not) change the meaning of the regular unpredicated operations based on (some? any?) uses of the value being a select. The only thing a select (or vp.select) can do is alter the lanes of a vector after it has been computed, it cannot travel back in time to change how it was computed.

VP intrinsics are the aforementioned predicated operations: in certain lanes, no computation (which might raising FP exceptions, have UB, etc.) happens and the resulting vector has some "default value" instead. The present discussion about whether to include a %passthru argument is just about how this default value is determined. But this does not change that the operation itself is predicated, it just affects how you express e.g. the patterns that map to SVE's zeroing and merging predication.

As Eli argued in that patch, IR like select %m, (constrained.fadd %a, %b), %passthru is not expressing a predicated vector add, and must not be selected as such. The IR semantics are unambiguously: first a full vector add is performed (with all exceptions etc. that entails, or possible UB in related cases like integer division) and then some of the resulting lanes are replaced with values from %passthru. To predicate the fadd itself, a dedicated operation/intrinsic is needed. LLVM IR does not currently (and should not) change the meaning of the regular unpredicated operations based on (some? any?) uses of the value being a select. The only thing a select (or vp.select) can do is alter the lanes of a vector after it has been computed, it cannot travel back in time to change how it was computed.

VP intrinsics are the aforementioned predicated operations: in certain lanes, no computation (which might raising FP exceptions, have UB, etc.) happens and the resulting vector has some "default value" instead. The present discussion about whether to include a %passthru argument is just about how this default value is determined. But this does not change that the operation itself is predicated, it just affects how you express e.g. the patterns that map to SVE's zeroing and merging predication.

Understood. I now see that we already discussed this here in October.

Your current argument sounds like it argues for explicit passthrus. E.g.:

select %m, (vp.fadd %m, %a, %b), %zeroinitializer

On SVE, this would become something like:

movprfx z0.s, p0/z, z0.s
fadd z0.s, p0/m, z0.s, z1.s

Isn't that traveling back in time to change how the inactive elements are defined? To be true to the IR. we'd want something like:

fadd z0.s, p0/m, z0.s, z1.s
sel z0s, p0/m, z0.s, <zero_vector>

How do we justify that this case is different than the op+select->predicated_op case? Are we assuming the implicit undef on the VP intrinsic allows for it?

I'm not sure what problem you think there might be? Both code sequences do the same thing (same side effects, same final result) as the input IR they matched, right? So that's what justifies them both as valid outputs and the choice is just a matter of codegen quality. You don't even need to appeal to the vp.fadd producing undef in disabled lanes, because in the final result those lanes are zero anyway and that's all that matters. This doesn't seem fundamentally more tricky than any other isel pattern that matches multiple IR instructions to produce a more efficient combined instruction. For example, if the ARM backend selects add i32 %a, (shl i32 %b, 4) as add r0, r0, r1, lsl #4, it never materializes shl %b, 4 (not into a register, at least) but the end result is still correct.

I'm not sure what problem you think there might be? Both code sequences do the same thing (same side effects, same final result) as the input IR they matched, right?

Ah, right. That side effects are the difference. Thanks for reminding me.

So that's what justifies them both as valid outputs and the choice is just a matter of codegen quality. You don't even need to appeal to the vp.fadd producing undef in disabled lanes, because in the final result those lanes are zero anyway and that's all that matters. This doesn't seem fundamentally more tricky than any other isel pattern that matches multiple IR instructions to produce a more efficient combined instruction. For example, if the ARM backend selects add i32 %a, (shl i32 %b, 4) as add r0, r0, r1, lsl #4, it never materializes shl %b, 4 (not into a register, at least) but the end result is still correct.

Yeah, this was what I was hung up on. I didn't see the difference between something like not materializing a dead instruction and masking an inactive element. But, yeah. the side effects would not be the same.

Btw, I guess that https://reviews.llvm.org/D71432 shows that op+select folding can be cleanly implemented in isel and that's also in line with my experiments for the VE target.

This needs a caveat. Keeping the select glued to the operation takes some careful effort. Especially in the undef passthru case, there are a bunch of peeps that will incorrectly fold away the select. E.g. this transform from InstSimplify:

if (isa<UndefValue>(FalseVal))   // select ?, X, undef -> X
  return TrueVal;

The VP intrinsics will certainly be immune to these, but if the plan is to eventually replace the VP select intrinsics with IR selects, then this problem will need to be solved. Just a heads up...

@hsaito and I had a discussion about this earlier today. I had the same concern, that optimizations after the vectorizer might do something to decouple the vp.select from the vp.{operation}, which could lead to the code generator not being able to create a masked operation with passthru on targets that support that and thus potentially invalidate the cost model assumptions that the vectorizer made when it generated the predicated operation. Hideki convinced me that the additional freedom from explicit dependencies gained by not having a passthru argument as part of the predicated operation was likely to be more beneficial than tight coupling. If we ever do find this to be a problem, we can do something to make the intervening optimizations less aggressive with this sort of pattern.

I also talked briefly with @craig.topper about the X86 codegen handling of this, and his off the cuff reaction was to think that we probably won't have any problem generating the desired passthru+masked instructions from separated vp.select operations.

llvm/docs/Proposals/VectorPredication.rst
3

Is there any reason that some form of this document can't be committed now? We have at least enough support to claim this as a community wide proposal, right?

(This was gonna be an inline comment on D69891, but it's more of a general conceptual issue, so I decided to move it here.)

Right now, LangRef changes in D69891 describe the restriction on the EVL value as this:

The explicit vector length (%evl) is only effective if it is non-negative, and when that is the case, its value is in the range:

0 <= %evl <= W,   where W is the vector length.

The restriction is good, but this wording doesn't specify what happens when %evl is not in that range. Some sort of undefined behavior, I assume, but this must be explicitly stated, especially since there are many ways in which it could be undefined. I don't recall previous discussion of this detail and I don't know what you have in mind, but some possibilities I see:

  1. The instruction has capital-UB undefined behavior. This gives the greatest flexibility to backends (e.g., allows generation of code that traps if %evl is too large) but I don't know of any architecture that needs this much flexibility and it constrains IR optimizations (code hoisting etc.) the most.
  2. The instruction returns poison (i.e., all result lanes are poison) and all lanes are (potentially, non-deterministically) enabled regardless of the mask parameter. This is less restrictive for IR optimizations (e.g., integer vp.add can unconditionally be speculated) but still allows backends to unconditionally use SETVL-style "stripmining" instructions that are not generally consistent (across architectures) w.r.t. which lanes become active when a vector length greater than the hardware vector length is requested.
  3. %EVLmask is undef, that's all. As consequence, lanes disabled by the %mask argument definitely stay disabled, but for other lanes (where the mask has a 1 or an undef) it's non-deterministic whether they are active. As far as I can see, this has pretty much the same implications for IR optimizations and backends (excluding hypothetical pathological architectures) but is less of a special case to specify and directly captures the diversity of hardware behavior that (presumably) motivates this restriction on EVL.

Off the cuff, I would suggest the last option.

(This was gonna be an inline comment on D69891, but it's more of a general conceptual issue, so I decided to move it here.)

Right now, LangRef changes in D69891 describe the restriction on the EVL value as this:

The explicit vector length (%evl) is only effective if it is non-negative, and when that is the case, its value is in the range:

0 <= %evl <= W,   where W is the vector length.

The restriction is good, but this wording doesn't specify what happens when %evl is not in that range. Some sort of undefined behavior, I assume, but this must be explicitly stated, especially since there are many ways in which it could be undefined. I don't recall previous discussion of this detail and I don't know what you have in mind, but some possibilities I see:

  1. The instruction has capital-UB undefined behavior. This gives the greatest flexibility to backends (e.g., allows generation of code that traps if %evl is too large) but I don't know of any architecture that needs this much flexibility and it constrains IR optimizations (code hoisting etc.) the most.
  2. The instruction returns poison (i.e., all result lanes are poison) and all lanes are (potentially, non-deterministically) enabled regardless of the mask parameter. This is less restrictive for IR optimizations (e.g., integer vp.add can unconditionally be speculated) but still allows backends to unconditionally use SETVL-style "stripmining" instructions that are not generally consistent (across architectures) w.r.t. which lanes become active when a vector length greater than the hardware vector length is requested.
  3. %EVLmask is undef, that's all. As consequence, lanes disabled by the %mask argument definitely stay disabled, but for other lanes (where the mask has a 1 or an undef) it's non-deterministic whether they are active. As far as I can see, this has pretty much the same implications for IR optimizations and backends (excluding hypothetical pathological architectures) but is less of a special case to specify and directly captures the diversity of hardware behavior that (presumably) motivates this restriction on EVL.

Off the cuff, I would suggest the last option.

We (Libre-SoC, provisionally renamed from Libre-RISCV) are currently building a processor that supports variable-length vector operations by having each operation specify the starting register in a flat register file, then relying on VL telling it how many elements to operate on, which, when divided by the number of elements per register, directly translates to the number of registers to operate on. So, if VL is out of bounds, the instructions can overwrite registers past the end of the range assigned by the register allocator and/or trap. This would probably force use of option #1 above, at least for our processor. Our ISA design is still incomplete, so we might add (or already have) a mechanism allowing use of option #2 or #3 if there is a sufficient reason (will have to see what the rest of Libre-SoC think).

We (Libre-SoC, provisionally renamed from Libre-RISCV) are currently building a processor that supports variable-length vector operations by having each operation specify the starting register in a flat register file, then relying on VL telling it how many elements to operate on, which, when divided by the number of elements per register, directly translates to the number of registers to operate on. So, if VL is out of bounds, the instructions can overwrite registers past the end of the range assigned by the register allocator and/or trap. This would probably force use of option #1 above, at least for our processor. Our ISA design is still incomplete, so we might add (or already have) a mechanism allowing use of option #2 or #3 if there is a sufficient reason (will have to see what the rest of Libre-SoC think).

Presumably you have an efficient way to somehow force the VL into the intended range to support strip-mining of loops? The exact strategy doesn't matter, anything that avoids VL being "out of bounds" should make the other options work just fine. (Assuming there aren't other, larger problems with mapping VP operations to your ISA.)

We (Libre-SoC, provisionally renamed from Libre-RISCV) are currently building a processor that supports variable-length vector operations by having each operation specify the starting register in a flat register file, then relying on VL telling it how many elements to operate on, which, when divided by the number of elements per register, directly translates to the number of registers to operate on. So, if VL is out of bounds, the instructions can overwrite registers past the end of the range assigned by the register allocator and/or trap. This would probably force use of option #1 above, at least for our processor. Our ISA design is still incomplete, so we might add (or already have) a mechanism allowing use of option #2 or #3 if there is a sufficient reason (will have to see what the rest of Libre-SoC think).

Presumably you have an efficient way to somehow force the VL into the intended range to support strip-mining of loops? The exact strategy doesn't matter, anything that avoids VL being "out of bounds" should make the other options work just fine. (Assuming there aren't other, larger problems with mapping VP operations to your ISA.)

Yes, we do (setvl has a immediate for max VL, which needs to be calculated by the register allocator or similar), though it can be bypassed by writing directly to the VL register.

So, in that case, we should be able to use option #2 or #3, as long as the compiler doesn't write to VL by any means other than setvl.

simoll marked an inline comment as done.Feb 3 2020, 3:35 AM

(This was gonna be an inline comment on D69891, but it's more of a general conceptual issue, so I decided to move it here.)

Right now, LangRef changes in D69891 describe the restriction on the EVL value as this:

The explicit vector length (%evl) is only effective if it is non-negative, and when that is the case, its value is in the range:

0 <= %evl <= W,   where W is the vector length.

The restriction is good, but this wording doesn't specify what happens when %evl is not in that range. Some sort of undefined behavior, I assume, but this must be explicitly stated, especially since there are many ways in which it could be undefined. I don't recall previous discussion of this detail and I don't know what you have in mind, but some possibilities I see:

  1. The instruction has capital-UB undefined behavior. This gives the greatest flexibility to backends (e.g., allows generation of code that traps if %evl is too large) but I don't know of any architecture that needs this much flexibility and it constrains IR optimizations (code hoisting etc.) the most.

Exactly. The VE target strictly requires VL <= MVL or you'll get a hardware exception. Enforcing strict UB here means VP-users have to explicitly drop instructions that keep the VL within bounds. This means that we can optimize the VL computation code and that it can be factored into cost calculations, etc. With Options 2 & 3 this would happen only very late in the backend when most scalar optimizations are already done.
Besides, this still allows you to speculate as long as MVL (as in the UB-causing bound for VL) does not go below VL... could you explain under which circumstance MVL would go below VL by hoisting? This is definitely not the case for static VL targets (x86) and also not for VE.

TODO:
  • Define behavior for %evl > W
  • Amend that W is target specific.
llvm/docs/Proposals/VectorPredication.rst
3

I think so. I'll put the proposal doc up for review.

Exactly. The VE target strictly requires VL <= MVL or you'll get a hardware exception. Enforcing strict UB here means VP-users have to explicitly drop instructions that keep the VL within bounds. This means that we can optimize the VL computation code and that it can be factored into cost calculations, etc. With Options 2 & 3 this would happen only very late in the backend when most scalar optimizations are already done.

I think I'm lost here. Which thing is VL and which is MVL in this scenario?

Also, the talk about how various hardware treats the relative values of VL and MVL concerns me if either of these is supposed to be the width of the vector passed to this intrinsic. My understanding is that we're supposed to be able to generate vectors of any width we want in IR and the type legalization is responsible for mapping that to vector sizes that are legal for the target. So what does the target requirement mean here?

simoll added a comment.Feb 4 2020, 2:27 AM

Exactly. The VE target strictly requires VL <= MVL or you'll get a hardware exception. Enforcing strict UB here means VP-users have to explicitly drop instructions that keep the VL within bounds. This means that we can optimize the VL computation code and that it can be factored into cost calculations, etc. With Options 2 & 3 this would happen only very late in the backend when most scalar optimizations are already done.

I think I'm lost here. Which thing is VL and which is MVL in this scenario?

VL == %evl
MVL == W
Sorry for the vector speak :)

Also, the talk about how various hardware treats the relative values of VL and MVL concerns me if either of these is supposed to be the width of the vector passed to this intrinsic. My understanding is that we're supposed to be able to generate vectors of any width we want in IR and the type legalization is responsible for mapping that to vector sizes that are legal for the target. So what does the target requirement mean here?

I agree that, in the end, the semantics will be based solely on IR-types. However, what that semantics should look like for the %evl > W case depends on the way targets can handle this to make sure that whatever we specify on IR-level is at least reasonable for all targets.

From what I recall, the plan is to implement this by using fixed-size vector types combined with VL-based ops. MVL would be the size of those vector types.

Quoting all of lkcl's email so it ends up in Phabricator:

On Tue, Feb 4, 2020 at 3:48 AM @lkcl wrote:

Exactly. The VE target strictly requires VL <= MVL or you'll get a
hardware exception. Enforcing strict UB here means VP-users have to
explicitly drop instructions that keep the VL within bounds. This means
that we can optimize the VL computation code and that it can be factored
into cost calculations, etc. With Options 2 & 3 this would happen only
very late in the backend when most scalar optimizations are already
done.

I think I'm lost here. Which thing is VL and which is MVL in this
scenario?

VL == %evl
MVL == W
Sorry for the vector speak :)

ah.  right.  that bit of information was important, simon :)   without
clarification, i assumed W was the "required vector length at the
program loop level", whoops..

I agree that, in the end, the semantics will be based solely on IR-types.
However, what that semantics should look like for the %evl > W case
depends on the way targets can handle this to make sure that whatever we
specify on IR-level is at least reasonable for all targets.

okaaay, riight, so the purpose of the discussion is, e.g., to work out
how to represent things like for-loops in the strcpy example here, is
that right?

https://www.sigarch.org/simd-instructions-considered-harmful/

so %evl > W (i.e. %evl > MVL) in RVV, it is the very effort of trying
to *set* %evl to the loop length, this is retried *in every loop*.
and the implementation (in hardware) very very specifically -
unbeknownst to the programmer (and to the IR writer) - hard-limits
%evl *to* MVL.

to be clear: although the programmer *tries* to set %evl > MVL, this
*never happens*: %evl will *always* be actually set to <= MVL.

it's quite clever.

it is really really important - a critical part of the design of RVV
loops - that the programmer (or LLVM compiler developer in this case)
*not* even know or make any assumptions about what MVL will be.  some
hardware will actually have MVL equal to 1.  some really unbelievably
powerful and stupidly expensive hardware might have MVL equal to 65536
(yes really, 65536 wide vector ALUs) and the critical thing is, the
assembly code *does not care*.  it still works perfectly on both,
despite the fact that you have no idea, really, what value MVL is
going to be.

SimpleV is different in that you absolutely must explicitly declare,
as part of any assembly loops (or any other instructions), precisely
and exactly how large MVL is to be.  this is because it is an
"allocation of the number of scalar registers - from the *scalar*
regfile - to be used for the vector operation".

thus, for SimpleV, we do actually need a way in LLVM to represent
(set) MVL, because it is quite literally an "explicit reservation of a
certain size and number of registers".

think of it as a way to say "hey y'know these upcoming SIMD
instructions? yeah, we need to set them to all be of length 8 for this
set.  then, like, next we need to set all the upcoming SIMD
instructions to 16, y'ken".  actually they're not SIMD they're
vector-ops but you get the idea.

this we do with an *extra* parameter to the SV.SETVL instruction
https://libre-riscv.org/simple_v_extension/appendix/#index8h1

SV.SETVL a2, t4, 8 # MVL==8

now, *if* we have a way to set MVL (through LLVM-IR), we can *also*
use that for doing saving/restoring of entire scalar register files
with a single instruction, as well as use it for function call
register stack save/restore.

basically when we have control over MVL through LLVM-IR, we get a
"LD.MULTI" and "ST.MULTI" instruction "for free" as an accidental
side-benefit.

SV.SETMVL #32    ; tells the hardware that vector operations are to
use 32 *scalar* regs
SV.LD a0, f0, #8     ; loads registers f0 thru f31 from the address at (a0+8)

for SIMD systems such as x86 and ARM, the only way to keep loops as
simple as RVV and SV, you'd need an instruction which, when you got to
the last run through the loop, then whilst %evl would be set to some
fixed-width-at-the-SIMD-boundary, some predicate mask was set up
*instead*... and thus despite the SIMD operation still being 4 (or 8,
or 16), the elements at the end were left alone (masked out)

without such an instruction (one which sets up the predicate bitmask
as not being all 1s on the last loop) you'd have to have a sequence of
instructions that effectively do the same job, and those instructions
will, clearly, impact performance due to them being executed on each
and every loop.

this is, unless the above is expressly supported in a single
instruction (one equivalent to SETVL
which sets up the predicate mask on the last loop) i am sorry to have
to use this particular phrase, a dog's dinner approach when compared
to variable-run vectorisation, and it's why i keep warning that
attempting to add support for fixed-power-of-two-%evl in this proposal
is not a good idea.

even if you _do_ have such an instruction (or a really really short
sequence that's equivalent and does not impact the length of the loop
too badly), the fact that the assembly code has to use 16 wide SIMD if
you want to do high-performance but then if you have short loops you
are wasting ALU resources but if you use 4 wide SIMD to stop wasting
ALU resources you can't do high-performance, you are screwed both
coming and going, and, ultimately, have to resort to stripmining to
properly solve it, and at that point we're *definitely* outside of the
scope of this proposal [as i understand it].

l.

From what I recall, the plan is to implement this by using fixed-size vector types combined with VL-based ops. MVL would be the size of those vector types.

To be clear, I'm referring specifically to LLVM IR for SimpleV, not for other targets.

OK. I was picturing MVL as some sort of maximum supported by the hardware in some sense or context. I think(?) I've got it now.

So let me ask about how you're picturing this working on targets that don't support these non-fixed vector lengths. The comments from lkcl have me concerned that we're going to be asked to emulate this behavior, which is possible I suppose but probably not the best choice performance wise. Consider this call:

%sum = call <8 x double> @llvm.vp.fadd.f64(<8 x double> %x,<8 x double> %y, <8 x i1> %mask, i32 4)

Frankly, I'd hope never to see such a thing. We talked about using -1 for the %evl argument for targets that don't support variable vector length (is that the right phrase?), but what are we supposed to do if something else is used?

Disregarding the %evl argument for the moment, the x86 type legalizer might lower this as a masked <8 x double> fadd, or it might lower it as two <4 x double> fadd operations, or it might scalarize it entirely. Even if the target hardware supports 512-bit vectors we might choose to lower it as two <4 x double> fadds. Or we might not. The backend currently considers itself to have the freedom to do anything that meets the semantics of the intrinsic. So that brings up the question of whether we will be expected to honor the %evl argument. In this case, it would be fairly trivial to do so. However, the possibility raises a concern about what the code that generated this IR was trying to do and whether it is a reasonable thing to have done for x86 backends.

Basically, I want to actively discourage front ends and optimizations from using the %evl argument in cases where it won't be optimal.

simoll added a comment.EditedFeb 6 2020, 12:18 AM

OK. I was picturing MVL as some sort of maximum supported by the hardware in some sense or context. I think(?) I've got it now.

So let me ask about how you're picturing this working on targets that don't support these non-fixed vector lengths. The comments from lkcl have me concerned that we're going to be asked to emulate this behavior, which is possible I suppose but probably not the best choice performance wise. Consider this call:

%sum = call <8 x double> @llvm.vp.fadd.f64(<8 x double> %x,<8 x double> %y, <8 x i1> %mask, i32 4)

Frankly, I'd hope never to see such a thing. We talked about using -1 for the %evl argument for targets that don't support variable vector length (is that the right phrase?), but what are we supposed to do if something else is used?

For targets that do not support %evl they can say so through TTI and the ExpandVectorPredicationPass will convert it into:

%mask.vl = icmp ult <8 x i1> <0,1,2,3,4,5,6,7>, ("splat' <8 x i32> 4)
%mask.new = and <8 x i1> %mask, %mask.vl
%sum = call <8 x double> @llvm.vp.fadd.f64(<8 x double> %x,<8 x double> %y, <8 x i1> %mask.new, i32 -1)

Basically, %evl never hits the X86 backend and can be ignored. The expansion pass implements one, unified, legalization strategy for all non-VL targets, achieving predictable behavior across targets.

Disregarding the %evl argument for the moment, the x86 type legalizer might lower this as a masked <8 x double> fadd, or it might lower it as two <4 x double> fadd operations, or it might scalarize it entirely. Even if the target hardware supports 512-bit vectors we might choose to lower it as two <4 x double> fadds. Or we might not. The backend currently considers itself to have the freedom to do anything that meets the semantics of the intrinsic. So that brings up the question of whether we will be expected to honor the %evl argument. In this case, it would be fairly trivial to do so. However, the possibility raises a concern about what the code that generated this IR was trying to do and whether it is a reasonable thing to have done for x86 backends.

I see two sources for VP intrinsics in code:
1.) Hand-written intrinsic code (if we expose VP as C intrinsics in Clang and/or somebody directly implements say a math library in VP, ..)
We do not claim performance portability for VP code. If your actual target is AVX512 and you use VP intrinsics, do not use the %evl parameter (or know how the expansion pass is going to lower it and exploit that).

2.) Optimization passes and (vectorizing) frontends
Vectorizers/frontends should query TTI to decide whether they should be using %evl.
For VL targets, the loop vectorizer could use %evl to implement tail loop predication (as in the DAXPY example https://www.sigarch.org/simd-instructions-considered-harmful/ , linked by @lkcl).
For non-VL targets, you should make the iteration mask the root mask of all other predicates in the loop and set %evl to -1.

Basically, I want to actively discourage front ends and optimizations from using the %evl argument in cases where it won't be optimal.

TTI would tell front ends and optimizations that %evl is a no-go for your target. Is this enough discouragement?

Basically, I want to actively discourage front ends and optimizations from using the %evl argument in cases where it won't be optimal.

TTI would tell front ends and optimizations that %evl is a no-go for your target. Is this enough discouragement?

In theory, yes. In practice, it will depend on how optimizations make use of that information. Your explanation of how the ExpandVectorPredicationPass will make this palatable to the backend worries me a little, because it essentially means that optimizations don't have to care that the target doesn't support this feature. They can generate IR that uses it and EVPP will smooth over it. Obviously, we could handle this on a case-by-case basis as it comes up. As you say, TTI will provide sufficient information for passes to make the decision.

2.) Optimization passes and (vectorizing) frontends
Vectorizers/frontends should query TTI to decide whether they should be using %evl.
For VL targets, the loop vectorizer could use %evl to implement tail loop predication (as in the DAXPY example https://www.sigarch.org/simd-instructions-considered-harmful/ , linked by @lkcl).
For non-VL targets, you should make the iteration mask the root mask of all other predicates in the loop and set %evl to -1.

FWIW this is the approach we plan to use at BSC to vectorize using RISC-V extension. We're currently adding mask information to VPlan recipes that when executed should emit VPred operations with masking. Our plan includes a vplan→vplan transformation that would express the "root" mask as a "set vector length" operation.

lkcl added a comment.Feb 6 2020, 5:07 PM

TTI would tell front ends and optimizations that %evl is a no-go for your target. Is this enough discouragement?

In theory, yes. In practice, it will depend on how optimizations make use of that information. Your explanation of how the ExpandVectorPredicationPass will make this palatable to the backend worries me a little, because it essentially means that optimizations don't have to care that the target doesn't support this feature. They can generate IR that uses it and EVPP will smooth over it. Obviously, we could handle this on a case-by-case basis as it comes up. As you say, TTI will provide sufficient information for passes to make the decision.

ok so it is starting to sink in what is being proposed: a *mainstream* pass in llvm that *always* puts in vector predication, and then various backends, depending on hardware capability, will either have passes that turn that mandatory vector predication into scalar loops, or SIMD / SIMT (getting rid of %evl in the process), or, in the case of Cray-inspired hardware, calling SETVL assembly code.

if that's accurate, then wow that's quite bold and has a lot of advantages.

i have a suggestion. for SimpleV we.definitely need to have an explicit way to specify MVL. this because it is literally specifying precisely how many scalar registers are to be allocated for a vector op.

however for SIMD (ARM, x86, other) i have a suspicion that being able to "hint" the best size of SIMD instruction width to use is probably a good idea.

if a SIMD width hint is available it happens to be synonymous with SimpleV's (hard) requirent to be able to specify MVL.

a scalar system would ignore both %evl and %mvl (or better mpvl - max partition vector length) i.e passes woule eliminate them.

a SIMD system would use %mpvl to choose the best SIMD opcodes for the job, the passes would subdivide work into such chunks then generate the suitablr cornercase last loop as well, *ignoring* %evl in the process.

SimpleV would use both to generate opcodes, coordinating with the regfile allocator, correctly and efficiently.

i have a suggestion. for SimpleV we.definitely need to have an explicit way to specify MVL. this because it is literally specifying precisely how many scalar registers are to be allocated for a vector op.

Would it work for you if we leave the definition of MVL for scalable types to the targets?

This would allow you (and ARM MVE/SVE , RISC-V V) to have their own mechanism for setting/querying MVL.
Besides, i think that defining MVL is out of the scope of this RFC given the diversity of scalable vector ISAs right now.. again a point we could revisit should all scalable vector ISAs someday agree on one way to define MVL.

The up-to-date list of planned changes (also for this patch) is here: https://reviews.llvm.org/D69891#1871485

lkcl added a comment.Feb 12 2020, 5:23 AM

i have a suggestion. for SimpleV we.definitely need to have an explicit way to specify MVL. this because it is literally specifying precisely how many scalar registers are to be allocated for a vector op.

Would it work for you if we leave the definition of MVL for scalable types to the targets?

mmm... honestly? probably not. however we can get away with either inline assembler (for a very limited subset of requirements) or just going "y'know what, let's just set MVL hard-coded to default to 4 or 8 for all loops", for now, as best matched to the (planned) maximum internal register read/write ports for our first chip.

This would allow you (and ARM MVE/SVE , RISC-V V) to have their own mechanism for setting/querying MVL.

and x86-for-hinting-the-SIMD-length. [for anyone who may be under the impression that RVV does not need the concept of MVL: see the sub-extension which fits the vector regfile onto the scalar (FP) regfile. if the FP regfile is to be used and useful at the same time, then there needs to be a way to explicity define how much of the FP regfile is to be allocated (to* RVV, and that in turn means being able to define the number of "lanes" to actually be used... which is, funnily enough, exactly what *setting* MVL. N(Lanes) == MVL. MVL == N(Lanes) ].

Besides, i think that defining MVL is out of the scope of this RFC given the diversity of scalable vector ISAs right now..

this is cool and exciting.

again a point we could revisit should all scalable vector ISAs someday agree on one way to define MVL.

yes, as a separate proposal.

i have a suggestion. for SimpleV we.definitely need to have an explicit way to specify MVL. this because it is literally specifying precisely how many scalar registers are to be allocated for a vector op.

Would it work for you if we leave the definition of MVL for scalable types to the targets?

mmm... honestly? probably not. however we can get away with either inline assembler (for a very limited subset of requirements) or just going "y'know what, let's just set MVL hard-coded to default to 4 or 8 for all loops", for now, as best matched to the (planned) maximum internal register read/write ports for our first chip.

I think i wasn't clear: what i meant to say is that we will not decide how MVL is defined/queried/set in the scope of this RFC... potentially leading to the situation that every target comes with its own set of target intrinsics to do so.

This would allow you (and ARM MVE/SVE , RISC-V V) to have their own mechanism for setting/querying MVL.

and x86-for-hinting-the-SIMD-length.

For x86 with scalable types, yes. For "classic" SIMD types MVL == W of <W x type>

<snip> [for anyone who may be under the impression that RVV does not need the concept of MVL: see the sub-extension which fits the vector regfile onto the scalar (FP) regfile. if the FP regfile is to be used and useful at the same time, then tere needs to be a way to explicity define how much of the FP regfile is to be allocated (to* RVV, and that in turn means being able to define the number of "lanes" to actually be used... which is, funnily enough, exactly what *setting* MVL. N(Lanes) == MVL. MVL == N(Lanes) ].

Besides, i think that defining MVL is out of the scope of this RFC given the diversity of scalable vector ISAs right now..

this is cool and exciting.

Yep, and we wouldn't get near the level of support for this RFC otherwise.

again a point we could revisit should all scalable vector ISAs someday agree on one way to define MVL.

yes, as a separate proposal.

+1

Exactly. The VE target strictly requires VL <= MVL or you'll get a hardware exception. Enforcing strict UB here means VP-users have to explicitly drop instructions that keep the VL within bounds. This means that we can optimize the VL computation code and that it can be factored into cost calculations, etc. With Options 2 & 3 this would happen only very late in the backend when most scalar optimizations are already done.

Ok, I didn't realize VE's SETVL works like that. In that case we don't have much of a choice, unfortunately.

Besides, this still allows you to speculate as long as MVL (as in the UB-causing bound for VL) does not go below VL... could you explain under which circumstance MVL would go below VL by hoisting? This is definitely not the case for static VL targets (x86) and also not for VE.

Of course, for lots of IR that we care about in practice, it will be quite simple to see that hoisting is safe, e.g. because:

  • %evl it is a constant -1
  • %evl is computed in a way that can be recognized to produce a small enough value (typical strip-mined loops)
  • there are earlier unconditional VP operations with the same EVL value (most vectorized functions)

But you need some such analysis, and must not hoist when those tricks all fail, because there's no general guarantee that the condition you're hoisting out of is independent from "%evl > element count?". A trivial (if pathological) example of this is when the condition never true in any execution and the EVL value is larger than W. A more real-world example, if you insist, comes from one proposed way to port hand-crafted fixed-width SIMD algorithms to RVV: check at runtime whether vector registers are at least as large as required by the SIMD algorithm, if so set the VL register to a constant and execute vector code, otherwise fall back to another implementation. This might mean having vp.foo(..., i32 4) instructions guarded by a runtime check that effectively determines whether that 4 is a legal value, and hoisting the computation out of the condition introduces UB in the executions where it isn't.

Whether this would lead to any end-to-end miscompilations is another question, but that's not a good excuse to implement known-incorrect optimizations.

lkcl added a comment.Feb 12 2020, 8:19 AM

I think i wasn't clear: what i meant to say is that we will not decide how MVL is defined/queried/set in the scope of this RFC... potentially leading to the situation that every target comes with its own set of target intrinsics to do so.

ah yes got you.

This would allow you (and ARM MVE/SVE , RISC-V V) to have their own mechanism for setting/querying MVL.

and x86-for-hinting-the-SIMD-length.

For x86 with scalable types, yes. For "classic" SIMD types MVL == W of <W x type>

mmm... i don't believe that's a wise choice / decision / assumption. i am partly-guessing-and-making-architectural-assumptions here: imagine that the (very-well-informed) programmer knows how the pipelines of a particular processor work (and i do mean very well), they know that there are a couple of separate pipelines, one which handles e.g. NxFP32, one which handles MxFP64, but that if you issue SIMD instructions of width N=Mx2, it will result in a "blockage" (stall) and under-utilisation.

*however*... if you issue *half* the workload (i.e. MVL == W/2) for the FP32 instructions interleaved with "full" workload (MVL==W for the FP64 ops), *then*, because of the way that the architecture works the two suites of instructions *will* go to the separate pipelines, *will* get done in parallel, because you're not overloading the exact same 64-bit-wide pipeline entrypoint if you'd done... you get what i'm trying to say?

i think what i'm trying to say works better for MMX (the instructions which shared the FP regfile with SIMD instructions, is that right? or is it SSE?) - there you definitely want control over how much of the regfile is allocated to SIMD and how much remains actual for scalar-FP usage, and if MVL == W as a hard-coded assumption, with no "hint", you could end up taking up far more of the FP regfile for SIMD MMX than is efficient / effective.

however... if the compiler could be *explicitly* told, "hey i want you to use only W/2 or W/4 worth of the FP regfile for SIMD operations please, and to automatically create a 2x or 4x loop that makes up for it *as if* you had done a full MVL==W single SIMD instruction", then it becomes possible to create a balance there which will not hammer the L1/L2 cache with LD/ST operations, consuming far more power than necessary, because the SIMD instructions completely dominate the entirety of the FP regfile.

we quickly learned from 3D workloads that they are very computationally-intensive and fit a "LD, massive-amounts-of-SIMD-processing, ST" pattern with *very* little in the way of overlaps. consequently, if the compiler generates:

  • LD
  • half-the-processing-because-there's-not-enough-registers
  • ST-some-temps
  • do-some-more-processing
  • LD-out-of-temps, do-a-bit-more-processing
  • ST

this is horribly, horribly power-inefficient.

so being able to balance the workload, keep things entirely in the regfile even if it means using half-wide (or quarter-wide) SIMD ops and the loops taking twice or 4 times longer in order to avoid the spill into temporary LD/STs, this is far more important than trying to make "individual" SIMD operations (ones that consume far too much of the regfile and result in LD/ST "spill") as wide as possible.

again, however: i'm raising this not to suggest that it be part of *this* RFC, i'm just document it to make sure it's not forgotten, for later.

Besides, i think that defining MVL is out of the scope of this RFC given the diversity of scalable vector ISAs right now..

this is cool and exciting.

Yep, and we wouldn't get near the level of support for this RFC otherwise.

yehyeh.

I think i wasn't clear: what i meant to say is that we will not decide how MVL is defined/queried/set in the scope of this RFC... potentially leading to the situation that every target comes with its own set of target intrinsics to do so.

ah yes got you.

This would allow you (and ARM MVE/SVE , RISC-V V) to have their own mechanism for setting/querying MVL.

and x86-for-hinting-the-SIMD-length.

For x86 with scalable types, yes. For "classic" SIMD types MVL == W of <W x type>

mmm... i don't believe that's a wise choice / decision / assumption. i am partly-guessing-and-making-architectural-assumptions here: imagine that the (very-well-informed) programmer knows how the pipelines of a particular processor work (and i do mean very well), they know that there are a couple of separate pipelines, one which handles e.g. NxFP32, one which handles MxFP64, but that if you issue SIMD instructions of width N=Mx2, it will result in a "blockage" (stall) and under-utilisation.

*however*... if you issue *half* the workload (i.e. MVL == W/2) for the FP32 instructions interleaved with "full" workload (MVL==W for the FP64 ops), *then*, because of the way that the architecture works the two suites of instructions *will* go to the separate pipelines, *will* get done in parallel, because you're not overloading the exact same 64-bit-wide pipeline entrypoint if you'd done... you get what i'm trying to say?

i think what i'm trying to say works better for MMX (the instructions which shared the FP regfile with SIMD instructions, is that right? or is it SSE?) - there you definitely want control over how much of the regfile is allocated to SIMD and how much remains actual for scalar-FP usage, and if MVL == W as a hard-coded assumption, with no "hint", you could end up taking up far more of the FP regfile for SIMD MMX than is efficient / effective.

MMX does use the X87 FP register file, but they can't coexist at the same. The first use of MMX marks the X87 register stack as occupied. I can't remember if it alters the data or not. An explicit emms instruction has to be done at the end of the MMX code to erase the MMX data and make the registers usable for X87 again.

lkcl added a comment.Feb 12 2020, 8:37 AM

But you need some such analysis, and must not hoist when those tricks all fail, because there's no general guarantee that the condition you're hoisting out of is independent from "%evl > element count?". A trivial (if pathological) example of this is when the condition never true in any execution and the EVL value is larger than W. A more real-world example, if you insist, comes from one proposed way to port hand-crafted fixed-width SIMD algorithms to RVV: check at runtime whether vector registers are at least as large as required by the SIMD algorithm, if so set the VL register to a constant and execute vector code,

ah... ah... you can't. at least, the last version of the RVV spec that i read (7?) still explicity states, "regardless of what *you* want VL to be set to, the *hardware* gets to decide exactly what value *actually* goes into the VL CSR".

the only guarantee that you have is that you will find that if you set VL to a non-zero value, you will find that, when you read it immediately after setting, it will be non-zero.

this specifically *does not matter* on RVV (sigh: when RVV is not done on top of the FP regfile, and there is a separate vector regfile), because the vector regfile is specifically designed to refer to *vectors*... not to invididual elements.

for SimpleV, because we designed it right from the start to sit on top of the int and fp regfiles, what VL is set to *really does matter*, because it defines precisely and exactly how many of the scalar registers are to be used *as* "vector elements".

thus, for RVV, when converting SIMD assembly patterns to RVV, you absolutely *must* use the "loop pattern" described in https://www.sigarch.org/simd-instructions-considered-harmful/

if you try to hard-code-set VL to anything specific, this has the (unintended) side-effect of destroying the entire paradigm on which RVV is based, namely that you are not *supposed* to know the actual hardware vector "lane" size... at all. so, if you had really minimalist hardware which only *had* one actual "Lane", then if you tried to explicitly set VL=4, that hardware is absolutely hosed, as it is literally unable to support, at the hardware level, the three extra lanes requested/demanded.

this is why you have to "ask" for a VL, and the instruction will put the *actual* number of elements that VL got set to into a destination register, because you need to subtract that number of (processed) elements from the loop.

of course, with the idea of dropping RVV on top of the FP regfile that goes somewhat out the window. however i'm not... welcome, shall we say... in the RV WG participation, so you'd need to take this up with them, directly. and try not to mention my name too much because they're quite likely to sabotage things (to everyone's detriment) just because i was the one that came up with the insights. *shakes head*...

lkcl added a comment.Feb 12 2020, 9:02 AM

MMX does use the X87 FP register file, but they can't coexist at the same. The first use of MMX marks the X87 register stack as occupied. I can't remember if it alters the data or not. An explicit emms instruction has to be done at the end of the MMX code to erase the MMX data and make the registers usable for X87 again.

craig, thank you for correcting me. that makes a lot of sense as i can just imagine the x87 designers going "argh, how are we going to avoid a pipeline clash / mess, here" :)

you get the principle i am sure, even though MMX is not a suitable example.

ah... ah... you can't. at least, the last version of the RVV spec that i read (7?) still explicity states, "regardless of what *you* want VL to be set to, the *hardware* gets to decide exactly what value *actually* goes into the VL CSR".

the only guarantee that you have is that you will find that if you set VL to a non-zero value, you will find that, when you read it immediately after setting, it will be non-zero.

I don't know where you have gotten this idea, it has never been true for as long as I can recall. While RVV implementations have some freedom in how they set VL, there are also lots of rules governing their behavior. Most relevantly, since October 2018 (spec version 0.5-draft), programs requesting something less than or equal to the maximum VL will get exactly that number as VL, no something smaller. And even before that change, there were long-standing significant restrictions on how VL is determined beyond what you claim (see the linked commit).

Furthermore, even if what you said was true, it would not make the scheme I described invalid. VL does not change without the program deliberately executing one of a few instructions that change VL (this is already necessary for any strip-mined loop to work at all). Thus, after executing a SETVL it's enough to inspect the resulting VL to know whether it's safe to execute code that assumes a particular value of VL. More freedom in how VL is determined by the processor just means more possibilities for unnecessarily hitting the fallback path, but that only impacts performance rather than correctness.

lkcl added a comment.Feb 14 2020, 10:33 AM

ah... ah... you can't. at least, the last version of the RVV spec that i read (7?) still explicity states, "regardless of what *you* want VL to be set to, the *hardware* gets to decide exactly what value *actually* goes into the VL CSR".

the only guarantee that you have is that you will find that if you set VL to a non-zero value, you will find that, when you read it immediately after setting, it will be non-zero.

I don't know where you have gotten this idea, it has never been true for as long as I can recall. While RVV implementations have some freedom in how they set VL, there are also lots of rules governing their behavior. Most relevantly, since October 2018 (spec version 0.5-draft), programs requesting something less than or equal to the maximum VL will get exactly that number as VL, no something smaller. And even before that change, there were long-standing significant restrictions on how VL is determined beyond what you claim (see the linked commit).

remember, with the exclusion from discussion due to the anti-trust practices of the RISC-V Foundation, everyone on the "outside" of the RVV working group process has to "reverse-engineer" what the hell is going on. so please do be patient if i make mistakes, as i am not really very happy spending our sponsor's and donor's time (and money) extracting information from the RVV WG in this way (and shouldn't have to).

Furthermore, even if what you said was true, it would not make the scheme I described invalid.

if you are describing replacing a SIMD loop with a *single* instruction, prefixed with a "SETVL", then my understanding is that yes, it would be... *on some hardware*. if the intention is never to be fully-compatible with *all* RVV-compatible hardware, then that's fine.

think it through: imagine some hardware that has only one "lane". that hardware will ONLY have an *absolute* maximum value for MVL: one.

therefore, if you try to set VL to anything greater than 1, it will *only* permit VL to be set to 1.

the variable nature of MVL on a per-implementor basis has caused other problems as well, particularly in the element-offset (VSLIDE?) instructions. it's been a contentious issue.

VL does not change without the program deliberately executing one of a few instructions that change VL (this is already necessary for any strip-mined loop to work at all). Thus, after executing a SETVL it's enough to inspect the resulting VL to know whether it's safe to execute code that assumes a particular value of VL.

ahhh, okaay, right. i get it. so, you'd have:

SETVL a5, 4 # a5 is the dest reg where VL gets stored
if (a5 != 4)
{

go to fallback loop

}

More freedom in how VL is determined by the processor just means more possibilities for unnecessarily hitting the fallback path, but that only impacts performance rather than correctness.

i would argue that even the check itself - having the fallback path at all - impacts performance (and increases code size).

this is why, in SimpleV, we make it mandatory that even if the underlying hardware does not have a large number of lanes, the implementation *must* provide "virtual" hardware - in effect a hardware for-loop. one other processor which does exactly this is the Broadcomm VideoCore IV. it gives the *impression* of having a 16-wide FP32 SIMD capability, whereas in fact it only has a 4x FP32 operation and the hardware delays for 4 additional cycles, pushing 4 *sets* of 4x FP32 into the (one) 4-wide FP32 pipeline.

MMX does use the X87 FP register file, but they can't coexist at the same. The first use of MMX marks the X87 register stack as occupied. I can't remember if it alters the data or not. An explicit emms instruction has to be done at the end of the MMX code to erase the MMX data and make the registers usable for X87 again.

craig, thank you for correcting me. that makes a lot of sense as i can just imagine the x87 designers going "argh, how are we going to avoid a pipeline clash / mess, here" :)

you get the principle i am sure, even though MMX is not a suitable example.

I don't know about Craig, but I'm not sure I do get the principle. For any given target we have a known maximum vector width (as in total number of bits, not number of elements) that is discoverable through TargetTransformInfo. We also have a "preferred" vector width that gets a default value based on the target architecture, but can be overridden by a command line option and may change what TargetTransformInfo tells you. However, the IR is not bound by these. The optimizer and any front end can generate whatever vectors they like. If some wacky optimization wants to create a <23 x float> vector, that's legal IR. However, when it gets to the backend, the type legalizer is going to do something to break it down into chunks that can be consumed by the processor. To get nicely optimized code, there needs to be cooperation between the optimizer and the backend.

This is why I mentioned before that the discussion of architecture specific details in the context of defining the semantics of the IR is making me nervous. LLVM IR is designed to be target-independent. The VP semantics need to respect that.

That's not to say we can ignore target-specific details. We have two distinct lanes though -- (1) the semantics of the IR, and (2) the mechanisms by which the target details can be discovered so that pre-codegen components can tune the IR for a specific target. We need to make sure the IR semantics are rich enough to represent the details of all targets we intend to support, but the details of the target shouldn't be visible in the IR semantics. Maybe I'm preaching to the choir here. I just want to make sure we're all on the same page. Perhaps this would be cleared up if I had a better understanding of what you were saying.

lkcl added a comment.Feb 14 2020, 2:29 PM

Perhaps this would be cleared up if I had a better understanding of what you were saying.

appreciated. if it's ok, can we schedule that for when it's part of a (new) proposal?

Perhaps this would be cleared up if I had a better understanding of what you were saying.

appreciated. if it's ok, can we schedule that for when it's part of a (new) proposal?

Sure.

simoll updated this revision to Diff 246280.Feb 24 2020, 12:34 PM
  • rebased
  • various fixes
  • includes llangref rephrasing and atest changes to VP integer patch

FYI, the test failures you are seeing here are due to the generalized pattern matching doing a better job at matching the fsub idiom for fneg. The required test changes are included in https://reviews.llvm.org/D75467 .

simoll updated this revision to Diff 250207.EditedMar 13 2020, 7:59 AM
  • Rebased
  • %evl <= W or UB ensues
  • fixed LangRef wording

The VP integer intrinsic patch has been accepted. I'll wait until next week with the commit to leave time for comments.

FYI, the VP-integer intrinsics & langref patch is in. Next up: expansion to standard SIMD IR. I'll announce the next patch also on llvm-dev when it's on phabricator.

simoll planned changes to this revision.Aug 28 2020, 5:16 AM
mdchen added a subscriber: mdchen.Tue, Oct 13, 4:46 AM
Kazhuu added a subscriber: Kazhuu.Mon, Oct 19, 7:16 AM