Page MenuHomePhabricator

[SVE][IR] Scalable Vector IR Type
ClosedPublic

Authored by huntergr on Apr 26 2017, 4:12 AM.

Details

Summary
  • Adds a 'scalable' flag to VectorType
  • Adds an 'ElementCount' class to VectorType to pass (possibly scalable) vector lengths, with overloaded operators.
  • Modifies existing helper functions to use ElementCount
  • Adds support for serializing/deserializing to/from both textual and bitcode IR formats
  • Extends the verifier to reject global variables of scalable types
  • Adds unit tests
  • Updates documentation

See the latest version of the RFC here: http://lists.llvm.org/pipermail/llvm-dev/2018-July/124396.html

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
  • Clarified that the runtime multiple is constant across all scalable vector types, even if the constant value isn't known at compile time.
  • Removed extra whitespace.

LGTM now, thanks for all the hard work! I won't approve, as this is a larger discussion, but I'm happy with this patch, and the native support as is.

hfinkel added inline comments.Apr 5 2019, 11:18 AM
docs/LangRef.rst
2749

I definitely agree that we should not deal with changing the vscale during program execution.

I think that the model is: There is an underlying vector length. vscale = round(vector length in bits / primitive size in bits). Can we specify it like that? We do also need to define what the rounding is. What does <scalable 4 x i3> do? Or is it not allowed?

hsaito added inline comments.Apr 5 2019, 4:01 PM
docs/LangRef.rst
2749

I definitely agree that we should not deal with changing the vscale during program execution.

I agree that this will make things a lot simpler than allowing it to change per function or in a middle of a function.

However, I don't quite agree that changing vscale per function is an orthogonal issue. What are we going to do when function foo() with vscale=X calls function bar() with vscale=Y using a scalable vector parameter?

Having said that, since I don't expect the discussions to converge anytime soon if we talk about vscale changing within a compilation unit, I agree we should move forward with vscale not changing within a compilation unit (we say program execution, but compiler's visibility is always limited to compilation unit). It should be sufficient to say that if multiple compilation unit with different vscale are linked, unspecified behavior will result.

@hfinkel, I think the model is "#elements * elementtype" fits in one or more "units of vector" and then apply vscale to it. I don't think scalable vector needs to fit one physical register of HW. Vector type legalization should kick-in. @huntergr, please correct me if my mental model is wrong.

rengolin added inline comments.Apr 6 2019, 3:01 AM
docs/LangRef.rst
2749

However, I don't quite agree that changing vscale per function is an orthogonal issue.

I didn't mean the implementation, but the discussion.

I think a per-function vscale implementation will be very different from the current one, no matter which course we take now. It won't matter much if we have native or intrinsic implementation, we'll still need function attributes and teach the optimisation passes, etc.

Having said that, since I don't expect the discussions to converge anytime soon if we talk about vscale changing within a compilation unit

If the scope it the compilation unit, then we'd need it to be fixed on the target string, or we won't be able to link two units together. I think even this discussion is too soon, and we should push the scope to the whole program.

Any change in vscale throughout the program should be undefined, or we'd have to encode the necessary logic in the compiler, which is the biggest worry I see from the feedback. So far, the benefits of doing so are on edge cases and the actual costs are unknown (but very likely large).

In my view, this is *definitely* not a subject we should raise right now and restricting the current implementation to whole-program scope is the only way we can go forward for now in any sensible way.

hfinkel added inline comments.Apr 7 2019, 7:36 AM
docs/LangRef.rst
2749

I didn't mean the implementation, but the discussion.

As I've said in previous thread, I don't believe that we can sensibly model a changing vscale without some SSA dependence, and that will require significant changes to the overall scheme.

restricting the current implementation to whole-program scope is the only way we can go forward for now in any sensible way.

+1

I think the model is "#elements * elementtype" fits in one or more "units of vector" and then apply vscale to it. I don't think scalable vector needs to fit one physical register of HW. Vector type legalization should kick-in.

Indeed, I believe you're correct. We need to account for this in the definition too.

hfinkel added inline comments.Apr 7 2019, 7:39 AM
docs/LangRef.rst
2749

Indeed, I believe you're correct. We need to account for this in the definition too.

Either by having a model that includes legalization, or by restricting the size of the base vector type?

huntergr marked an inline comment as done.Apr 9 2019, 12:46 AM
huntergr added inline comments.
docs/LangRef.rst
2749

We perform legalization for scalable vectors with the same mechanisms fixed-length vectors do (splitting for too large, promoting/extending for too small). Should this be documented in this description (it isn't for fixed vectors), or is there a better place in the docs for that explanation?

(A side note; for 'unpacked' float vector types (e.g. <scalable 2 x float>) we do declare them as legal for SVE then generate predicates to mask off the unused lanes in AArch64 specific code. Since there are more predicated architectures being added to the codebase, perhaps this could be generalized as a new legalization mechanism for fp vector types)

hfinkel added inline comments.Apr 9 2019, 1:42 AM
docs/LangRef.rst
2749

We perform legalization for scalable vectors with the same mechanisms fixed-length vectors do (splitting for too large, promoting/extending for too small).

What defines too big (what size is used for splitting)? If <scalable 8 x double> fits in the vector register depends on the runtime vector size, no?

Should this be documented in this description

No, unless it's part of the IR-level model.

What we need here is a model defined, at the IR level, that explains why:

  1. I can add two <scalable 4 x float> vectors together.
  2. I cannot add a <scalable 4 x float> to a <scalable 2 x float>
  3. I can sext a <scalable 4 x i32> to a <scalable 4 x i64>, and this can be bitcast to a <scalable 8 x i32>.

Also we should address happens to vectors with an odd number of lanes or of a non-power-of-two-sized primitive types (both of which are defined at the IR level).

huntergr added inline comments.Apr 9 2019, 5:56 AM
docs/LangRef.rst
2749

What defines too big (what size is used for splitting)? If <scalable 8 x double> fits in the vector register depends on the runtime vector size, no?

No. For SVE, the legal types are those where the minimum size is equal to 128 bits, since that's the minimum size for hardware registers (and the granularity of increments of register size). So an operation using <scalable 8 x double> values would need to be split into 4 <scalable 2 x double> operations for SVE during legalization. (I'm ignoring the predicated unpacked float forms for a moment)

I wonder if the change in syntax from <n x 8 x double> to <scalable 8 x double> makes that less obvious.

The basis of the model can be grounded in '1' being a valid value for vscale, which effectively makes the types equivalent to fixed length vectors. I'll try coming up with a description based on that.

joelkevinjones added a comment.EditedApr 11 2019, 7:19 PM

I think this is a coherent set of changes. Given the current top of trunk, this expands support from just assembly/disassembly of machine instructions to include LLVM IR, right? Such being the case, I think this patch should go in. I have some ideas on how to structure passes so SV IR supporting optimizations can be added incrementally. If anyone thinks such a proposal would help, let me know.

I think this is a coherent set of changes. Given the current top of trunk, this expands support from just assembly/disassembly of machine instructions to include LLVM IR, right? Such being the case, I think this patch should go in. I have some ideas on how to structure passes so SV IR supporting optimizations can be added incrementally. If anyone thinks such a proposal would help, let me know.

I think there is one more thing we still have to do. Does scalable vector type apply to all Instructions where non-scalable vector is allowed? If the answer is no, we need to identify which ones are not allowed to take scalable vector type operand/result. Some of the Instructions are not plain element-wise operation. Do we have agreed upon semantics for all those that are allowed?

If we are allowing just element-wise Instructions, we should explicitly say that in LangRef, warn at LLVM-DEV mailing list that new scalable vector types are coming, wait a little to let last minute screams to happen, assure them by saying scalable vector on element-wise Instructions won't cause any mess beyond non-scalable vector, and then commit. This would be the quickest route, and it still enables other interesting follow-up patches to be reviewed/committed. I think we are ready enough to do this if we choose to take this route.

If we are going for more than element-wise Instructions, we need to have well defined and agreed semantics for each of those, and that should be part of the LangRef for each such Instruction.
Also, have we thought about Intrinsics? Can all Intrinsics that take/return non-scalable vector handle scalable vector?

We can certainly let element-wise stuff to go in first and then extend to non-element-wise stuff later. Any thoughts in this regard?

If we are going for more than element-wise Instructions, we need to have well defined and agreed semantics for each of those, and that should be part of the LangRef for each such Instruction.

Agreed.

Also, have we thought about Intrinsics? Can all Intrinsics that take/return non-scalable vector handle scalable vector?

Same for intrinsics. If any of the vector ones apply to scalable vectors (ex. reduce), it needs to be documented.

We can certainly let element-wise stuff to go in first and then extend to non-element-wise stuff later. Any thoughts in this regard?

Precisely, we should go in baby-steps, with each step making sure we don't touch/break *anything* that isn't scalable, updating LangRef as we go.

Doing a single review on everything would be painful and counter-productive.

I think this is a coherent set of changes. Given the current top of trunk, this expands support from just assembly/disassembly of machine instructions to include LLVM IR, right? Such being the case, I think this patch should go in. I have some ideas on how to structure passes so SV IR supporting optimizations can be added incrementally. If anyone thinks such a proposal would help, let me know.

I think there is one more thing we still have to do. Does scalable vector type apply to all Instructions where non-scalable vector is allowed? If the answer is no, we need to identify which ones are not allowed to take scalable vector type operand/result. Some of the Instructions are not plain element-wise operation. Do we have agreed upon semantics for all those that are allowed?

I don't recall this being an issue, but I agree, if there are instructions that currently take vector types that don't take scalable vectors, that certainly needs to be documented.

If we are allowing just element-wise Instructions, we should explicitly say that in LangRef, warn at LLVM-DEV mailing list that new scalable vector types are coming, wait a little to let last minute screams to happen, assure them by saying scalable vector on element-wise Instructions won't cause any mess beyond non-scalable vector, and then commit.

I think that there's been plenty of traffic on this subject on llvm-dev. We do send warnings for potentially-downstream-disruptive changes, as a general best practice. I'm not really sure if this falls into this category, but I'm sure a note to llvm-dev will be sent to let everyone know when this lands, if nothing else, as an FYI to review other related patches.

This would be the quickest route, and it still enables other interesting follow-up patches to be reviewed/committed. I think we are ready enough to do this if we choose to take this route.

If we are going for more than element-wise Instructions, we need to have well defined and agreed semantics for each of those, and that should be part of the LangRef for each such Instruction.
Also, have we thought about Intrinsics? Can all Intrinsics that take/return non-scalable vector handle scalable vector?

FWIW, I took a quick look through the intrinsics and nothing jumped out at me as an intrinsic that currently has vector types on its interface that shouldn't take scalable vectors. So, if there things which don't work, we should certainly note that.

We can certainly let element-wise stuff to go in first and then extend to non-element-wise stuff later. Any thoughts in this regard?

We should indeed review patches in small/medium-sized units (so long as the changes are testable).

PkmX added a subscriber: PkmX.Apr 14 2019, 11:18 PM

I think this is a coherent set of changes. Given the current top of trunk, this expands support from just assembly/disassembly of machine instructions to include LLVM IR, right? Such being the case, I think this patch should go in. I have some ideas on how to structure passes so SV IR supporting optimizations can be added incrementally. If anyone thinks such a proposal would help, let me know.

I think there is one more thing we still have to do. Does scalable vector type apply to all Instructions where non-scalable vector is allowed? If the answer is no, we need to identify which ones are not allowed to take scalable vector type operand/result. Some of the Instructions are not plain element-wise operation. Do we have agreed upon semantics for all those that are allowed?

The main difference is for 'shufflevector'. For a splat it's simple, since you just use a zeroinitializer mask. For anything else, though, you currently need a constant vector with immediate values; this obviously won't work if you don't know how many elements there are.

Downstream, we solve this by allowing shufflevector masks to be ConstantExprs, and then using 'vscale' and 'stepvector' as symbolic Constant nodes. With those and a few arithmetic and logical ops, we can synthesize the usual set of shuffles (reverse, top/bottom half, odd/even, interleaves, zips, etc). Would also work for fixed-length vectors. There's been some pushback on introducing them as symbolic constants though, and the initial demo patch set has them as intrinsics.

So if we wanted to keep them as intrinsics for now, I think we have one of three options:

  1. Leave discussion on more complicated shuffles until later, and only use scalable autovectorization on loops which don't need anything more than splats.
  2. Introduce additional intrinsics for the other shuffle variants as needed
  3. Allow shufflevector to accept arbitrary masks so that intrinsics can be used (though possibly only if the output vector is scalable).

The discussion at the EuroLLVM roundtable was leaning towards option 1, with an action on me to provide a set of canonical shuffle examples using vscale and stepvector for community consideration.

So if we wanted to keep them as intrinsics for now, I think we have one of three options:

  1. Leave discussion on more complicated shuffles until later, and only use scalable autovectorization on loops which don't need anything more than splats.

Given the current state, this is the easiest path.

  1. Introduce additional intrinsics for the other shuffle variants as needed
  2. Allow shufflevector to accept arbitrary masks so that intrinsics can be used (though possibly only if the output vector is scalable).

This warrants a larger discussion, which would hinder the current progress.

So if we wanted to keep them as intrinsics for now, I think we have one of three options:

  1. Leave discussion on more complicated shuffles until later, and only use scalable autovectorization on loops which don't need anything more than splats.

Given the current state, this is the easiest path.

I agree, although this is an important part of the model, so we should start having this discussion in parallel (sooner rather than later). I had been under the impression that a set of intrinsics were being proposed for this, but extending shufflevector is also an option worth considering. If these are first-class types, then having first-class instruction support is probably the right path. This deserves it's own RFC.

  1. Introduce additional intrinsics for the other shuffle variants as needed
  2. Allow shufflevector to accept arbitrary masks so that intrinsics can be used (though possibly only if the output vector is scalable).

This warrants a larger discussion, which would hinder the current progress.

I agree. We should have a separate RFC on this.

So if we wanted to keep them as intrinsics for now, I think we have one of three options:

  1. Leave discussion on more complicated shuffles until later, and only use scalable autovectorization on loops which don't need anything more than splats.

Given the current state, this is the easiest path.

I agree, although this is an important part of the model, so we should start having this discussion in parallel (sooner rather than later). I had been under the impression that a set of intrinsics were being proposed for this, but extending shufflevector is also an option worth considering. If these are first-class types, then having first-class instruction support is probably the right path. This deserves it's own RFC.

  1. Introduce additional intrinsics for the other shuffle variants as needed
  2. Allow shufflevector to accept arbitrary masks so that intrinsics can be used (though possibly only if the output vector is scalable).

This warrants a larger discussion, which would hinder the current progress.

I agree. We should have a separate RFC on this.

I agree on both points.

I think this is a coherent set of changes. Given the current top of trunk, this expands support from just assembly/disassembly of machine instructions to include LLVM IR, right? Such being the case, I think this patch should go in. I have some ideas on how to structure passes so SV IR supporting optimizations can be added incrementally. If anyone thinks such a proposal would help, let me know.

I think there is one more thing we still have to do. Does scalable vector type apply to all Instructions where non-scalable vector is allowed? If the answer is no, we need to identify which ones are not allowed to take scalable vector type operand/result. Some of the Instructions are not plain element-wise operation. Do we have agreed upon semantics for all those that are allowed?

The main difference is for 'shufflevector'. For a splat it's simple, since you just use a zeroinitializer mask. For anything else, though, you currently need a constant vector with immediate values; this obviously won't work if you don't know how many elements there are.

We need to clarify on insertelement/extractelement. Maybe already done in some other patches, but that clarification should be part of this patch.
Is the "length of val" under the semantics "scalable * n" in <scalable n x ElemTy>, right? Or is it still n?

Going with length of val being "scalable * n", if there is an optimization that takes advantages of poison value being returned by evaluating "idx >= n" at compile time, we need to disable that for scalable vector. Also, if any verifier is checking whether constant idx value is less than n, we need to disable that for scalable vector.

So if we wanted to keep them as intrinsics for now, I think we have one of three options:

  1. Leave discussion on more complicated shuffles until later, and only use scalable autovectorization on loops which don't need anything more than splats.

Given the current state, this is the easiest path.

I agree, although this is an important part of the model, so we should start having this discussion in parallel (sooner rather than later). I had been under the impression that a set of intrinsics were being proposed for this, but extending shufflevector is also an option worth considering. If these are first-class types, then having first-class instruction support is probably the right path. This deserves it's own RFC.

That's pretty much what LLVM-VP is about: https://reviews.llvm.org/D57504

We are proposing compress/expand and lane shift as intrinsics.
I suggest you add any shuffle intrinsics to the same namespace to avoid fragmentation.

IMHO, the redesigned reduction intrinsics should go there as well (https://reviews.llvm.org/D60261 and/or https://reviews.llvm.org/D60262).

  1. Introduce additional intrinsics for the other shuffle variants as needed
  2. Allow shufflevector to accept arbitrary masks so that intrinsics can be used (though possibly only if the output vector is scalable).

Even if shuffle masks don't need to be constants anymore, it will be awkward to encode compress/expand without additional intrinsics (you would need a prefix sum vector over the mask bits or similar).

This warrants a larger discussion, which would hinder the current progress.

I agree. We should have a separate RFC on this.

+1

greened added a comment.EditedApr 17 2019, 1:53 PM

We need to clarify on insertelement/extractelement. Maybe already done in some other patches, but that clarification should be part of this patch.
Is the "length of val" under the semantics "scalable * n" in <scalable n x ElemTy>, right? Or is it still n?

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n. I assume for vectors of length > n one would have to use shufflevector or something similar (using vscale and stepvector as Graham mentioned) to get the elements you want to the lower n positions.

Either way, the semantics and any restrictions certainly need to be documented.

Going with length of val being "scalable * n", if there is an optimization that takes advantages of poison value being returned by evaluating "idx >= n" at compile time, we need to disable that for scalable vector. Also, if any verifier is checking whether constant idx value is less than n, we need to disable that for scalable vector.

If the index is still restricted by n then this shouldn't be an issue.

docs/LangRef.rst
2753

Just want to double-check: there is nothing about scalable vectors that assumes all vector types have the same bit width, correct? Can <scalable 1 x float> have a different bit width from <scalable 1 x double>?

We need to clarify on insertelement/extractelement. Maybe already done in some other patches, but that clarification should be part of this patch.
Is the "length of val" under the semantics "scalable * n" in <scalable n x ElemTy>, right? Or is it still n?

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n.

But you know at runtime... there has to be a way to determine, at runtime, vscale. And the index doesn't need to be a constant. I'm not sure that we need to restrict non-constant n to only values valid for vscale == 1.

troyj added a subscriber: troyj.Apr 17 2019, 6:47 PM
PkmX added inline comments.Apr 17 2019, 8:57 PM
docs/LangRef.rst
2753

I believe the intention is that <scalable 1 x double> should have twice many bits as <scalable 1 x float>, or the same many bits as <scalable 2 x float>.

That's pretty much what LLVM-VP is about: https://reviews.llvm.org/D57504

We are proposing compress/expand and lane shift as intrinsics.
I suggest you add any shuffle intrinsics to the same namespace to avoid fragmentation.

I'm hoping we can use an extended shufflevector for this still, but if we do end up going down the extra intrinsics route then I'll certainly use the same namespace.

IMHO, the redesigned reduction intrinsics should go there as well (https://reviews.llvm.org/D60261 and/or https://reviews.llvm.org/D60262).

  1. Introduce additional intrinsics for the other shuffle variants as needed
  2. Allow shufflevector to accept arbitrary masks so that intrinsics can be used (though possibly only if the output vector is scalable).

Even if shuffle masks don't need to be constants anymore, it will be awkward to encode compress/expand without additional intrinsics (you would need a prefix sum vector over the mask bits or similar).

Agreed, and it also extends to predicated inserts and extracts -- using SVE's 'lasta/lastb' instructions in a vectorized loop tail, for example -- they don't map well to the current instructions, so would need intrinsics.

This warrants a larger discussion, which would hinder the current progress.

I agree. We should have a separate RFC on this.

+1

Agreed. Handling shuffles was part of the original RFC, but that RFC was far too large to discuss all at once. I've started reworking that section based on current feedback, and I'll send it out once ready.

We need to clarify on insertelement/extractelement. Maybe already done in some other patches, but that clarification should be part of this patch.
Is the "length of val" under the semantics "scalable * n" in <scalable n x ElemTy>, right? Or is it still n?

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n. I assume for vectors of length > n one would have to use shufflevector or something similar (using vscale and stepvector as Graham mentioned) to get the elements you want to the lower n positions.

Either way, the semantics and any restrictions certainly need to be documented.

For now, I suspect keeping the limit as 'n' is sufficient, as we only really need to insert into the first element to be able to generate a splat. If we need more later we can discuss it then.

Going with length of val being "scalable * n", if there is an optimization that takes advantages of poison value being returned by evaluating "idx >= n" at compile time, we need to disable that for scalable vector. Also, if any verifier is checking whether constant idx value is less than n, we need to disable that for scalable vector.

If the index is still restricted by n then this shouldn't be an issue.

Agreed.

huntergr added inline comments.Apr 18 2019, 2:45 AM
docs/LangRef.rst
2753

That's correct. I think that a clearer syntax might be <vscale x 1 x float> to indicate that the number of elements is being multiplied by the same vscale term for all scalable vector types.

The intent is that we should be able to reason about the relative sizes of different scalable vector types based on the element size and minimum number of lanes alone.

We need to clarify on insertelement/extractelement. Maybe already done in some other patches, but that clarification should be part of this patch.
Is the "length of val" under the semantics "scalable * n" in <scalable n x ElemTy>, right? Or is it still n?

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n.

But you know at runtime... there has to be a way to determine, at runtime, vscale. And the index doesn't need to be a constant. I'm not sure that we need to restrict non-constant n to only values valid for vscale == 1.

Yes; in our downstream compiler we allow inserts and extracts at arbitrary offsets in IR (though we admittedly haven't found too much use for it) using vscale (as a constant, though obtaining the Value via an intrinsic will still work) to give us the required element index.

e.g. (vscale * n) - 1 for the last element.

I don't think we need to support this initially, though.

The SVE ISA doesn't allow arbitrary indices for inserts or extracts, but we can generate an appropriate predicate quite easily in the backend.

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n.

But you know at runtime... there has to be a way to determine, at runtime, vscale. And the index doesn't need to be a constant. I'm not sure that we need to restrict non-constant n to only values valid for vscale == 1.

Good point. 100% agree. I was only considering the constant case.

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n.

But you know at runtime... there has to be a way to determine, at runtime, vscale. And the index doesn't need to be a constant. I'm not sure that we need to restrict non-constant n to only values valid for vscale == 1.

Good point. 100% agree. I was only considering the constant case.

Ok, so do we have agreement that constant literal indices should be limited to 0..n-1 for now, but non-constant indices can potentially exceed n so that expressions featuring vscale can be used?

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n.

But you know at runtime... there has to be a way to determine, at runtime, vscale. And the index doesn't need to be a constant. I'm not sure that we need to restrict non-constant n to only values valid for vscale == 1.

Good point. 100% agree. I was only considering the constant case.

Ok, so do we have agreement that constant literal indices should be limited to 0..n-1 for now, but non-constant indices can potentially exceed n so that expressions featuring vscale can be used?

I want to say yes here, but that leaves a bit of oddness. It's true that nobody can really police on true variable values. However, some variable values can become constant through optimization.
What's the behavior when >=n constant is exposed?

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n.

But you know at runtime... there has to be a way to determine, at runtime, vscale. And the index doesn't need to be a constant. I'm not sure that we need to restrict non-constant n to only values valid for vscale == 1.

Good point. 100% agree. I was only considering the constant case.

Ok, so do we have agreement that constant literal indices should be limited to 0..n-1 for now, but non-constant indices can potentially exceed n so that expressions featuring vscale can be used?

I want to say yes here, but that leaves a bit of oddness. It's true that nobody can really police on true variable values. However, some variable values can become constant through optimization.
What's the behavior when >=n constant is exposed?

Exactly. Non-constant values can become constant. Constant values can be guarded by vscale-dependent runtime guards (both hand-written and compiler generated). My preference is to leave this not restricted to vscale == 1 values, but rather allow all values that can be supported at runtime, and have it be UB if, at runtime, the relevant index is not available.

Exactly. Non-constant values can become constant. Constant values can be guarded by vscale-dependent runtime guards (both hand-written and compiler generated). My preference is to leave this not restricted to vscale == 1 values, but rather allow all values that can be supported at runtime, and have it be UB if, at runtime, the relevant index is not available.

+1

Exactly. Non-constant values can become constant. Constant values can be guarded by vscale-dependent runtime guards (both hand-written and compiler generated). My preference is to leave this not restricted to vscale == 1 values, but rather allow all values that can be supported at runtime, and have it be UB if, at runtime, the relevant index is not available.

+1

That means a need for a warning to general developers: If there is a check for constant_index >= n to see if the result is a poison value, that code has to be changed so that it applies to non-scalable vector only. Hopefully not too many instances of that. I'm fine with this as long as the communication to the rest of LLVM community is clear about it.

lewahlig removed a subscriber: lewahlig.
lewahlig added a subscriber: lewahlig.

Exactly. Non-constant values can become constant. Constant values can be guarded by vscale-dependent runtime guards (both hand-written and compiler generated). My preference is to leave this not restricted to vscale == 1 values, but rather allow all values that can be supported at runtime, and have it be UB if, at runtime, the relevant index is not available.

+1

That means a need for a warning to general developers: If there is a check for constant_index >= n to see if the result is a poison value, that code has to be changed so that it applies to non-scalable vector only. Hopefully not too many instances of that. I'm fine with this as long as the communication to the rest of LLVM community is clear about it.

Ok; I'll update the patch for that.

I think this works out fine, since this behaviour matches our downstream implementation, but I can summarize the extra semantic decisions in this review in a post to llvm-dev as well.

huntergr updated this revision to Diff 197301.Apr 30 2019, 5:11 AM

Noted the change in semantics for extractelement and insertelement in the langref.

Noted the change in semantics for extractelement and insertelement in the langref.

Why implementation defined and not UB for the case where the index exceeds the runtime length? How do you intend to define this for SVE?

greened added inline comments.Apr 30 2019, 12:30 PM
docs/LangRef.rst
2753

+1 for <vscale x 1 x float>.

Noted the change in semantics for extractelement and insertelement in the langref.

Why implementation defined and not UB for the case where the index exceeds the runtime length? How do you intend to define this for SVE?

SVE uses a predicate for indexed inserts and extracts. We generate that predicate by comparing a splat of the index against a stepvector (0,1,2,3...); if the index is out of range then the predicate will be all false.

For a mov (insert), that results in an unmodified vector.

For a lastb (extract), that extracts the last lane in the vector if no predicate bits are true.

I don't know if RVV or SX-Aurora have similarly defined semantics. If the preference is to make it UB, that's fine.

Why implementation defined and not UB for the case where the index exceeds the runtime length? How do you intend to define this for SVE?

SVE uses a predicate for indexed inserts and extracts. We generate that predicate by comparing a splat of the index against a stepvector (0,1,2,3...); if the index is out of range then the predicate will be all false.

For a mov (insert), that results in an unmodified vector.

For a lastb (extract), that extracts the last lane in the vector if no predicate bits are true.

I don't know if RVV or SX-Aurora have similarly defined semantics. If the preference is to make it UB, that's fine.

I think this should be UB, or at least return poison. However, making it UB has the unfortunate side effect of making it illegal to hoist these operations out of conditionals (which currently isn't the case), so maybe poison is better.

For me the main reason for making is UB (aside from generally being conservative) is that element insert/extract can be usefully legalized by putting the vector into a stack slot and accessing the affected element with scalar loads/stores -- in fact, that's the default legalization strategy in trunk. But in that setting an out-of-bounds lane index would become a wild read/write (= clear-cut UB), unless the legalization code jumps through extra hoops to add a bounds check or clamp the lane index into range. This wouldn't affect SVE or RVV since they would implement custom lowering anyway, but for other targets (especially if we eventually generalize insert/extract to accept variable lane positions and do so for all vector types for consistency) it could become an unnecessary headache. Though if we make insertelement with out-of-range lane index return poison instead of being UB, then at minimum clamping in insertelement is necessary to avoid the wild write.

Aside: your proposed semantics for SVE would break the property extractelt(insertelt(vec, i, elt), i) = elt which instcombine (and others) most likely assumes. If we make the insertelt return poison (or even just undef, FWIW), that issue goes away.

Why implementation defined and not UB for the case where the index exceeds the runtime length? How do you intend to define this for SVE?

SVE uses a predicate for indexed inserts and extracts. We generate that predicate by comparing a splat of the index against a stepvector (0,1,2,3...); if the index is out of range then the predicate will be all false.

For a mov (insert), that results in an unmodified vector.

For a lastb (extract), that extracts the last lane in the vector if no predicate bits are true.

I don't know if RVV or SX-Aurora have similarly defined semantics. If the preference is to make it UB, that's fine.

I think this should be UB, or at least return poison. However, making it UB has the unfortunate side effect of making it illegal to hoist these operations out of conditionals (which currently isn't the case), so maybe poison is better.

For me the main reason for making is UB (aside from generally being conservative) is that element insert/extract can be usefully legalized by putting the vector into a stack slot and accessing the affected element with scalar loads/stores -- in fact, that's the default legalization strategy in trunk. But in that setting an out-of-bounds lane index would become a wild read/write (= clear-cut UB), unless the legalization code jumps through extra hoops to add a bounds check or clamp the lane index into range. This wouldn't affect SVE or RVV since they would implement custom lowering anyway, but for other targets (especially if we eventually generalize insert/extract to accept variable lane positions and do so for all vector types for consistency) it could become an unnecessary headache. Though if we make insertelement with out-of-range lane index return poison instead of being UB, then at minimum clamping in insertelement is necessary to avoid the wild write.

Aside: your proposed semantics for SVE would break the property extractelt(insertelt(vec, i, elt), i) = elt which instcombine (and others) most likely assumes. If we make the insertelt return poison (or even just undef, FWIW), that issue goes away.

I think that making the return be a poison value is best.

Why implementation defined and not UB for the case where the index exceeds the runtime length? How do you intend to define this for SVE?

SVE uses a predicate for indexed inserts and extracts. We generate that predicate by comparing a splat of the index against a stepvector (0,1,2,3...); if the index is out of range then the predicate will be all false.

For a mov (insert), that results in an unmodified vector.

For a lastb (extract), that extracts the last lane in the vector if no predicate bits are true.

I don't know if RVV or SX-Aurora have similarly defined semantics. If the preference is to make it UB, that's fine.

I think this should be UB, or at least return poison. However, making it UB has the unfortunate side effect of making it illegal to hoist these operations out of conditionals (which currently isn't the case), so maybe poison is better.

For me the main reason for making is UB (aside from generally being conservative) is that element insert/extract can be usefully legalized by putting the vector into a stack slot and accessing the affected element with scalar loads/stores -- in fact, that's the default legalization strategy in trunk. But in that setting an out-of-bounds lane index would become a wild read/write (= clear-cut UB), unless the legalization code jumps through extra hoops to add a bounds check or clamp the lane index into range. This wouldn't affect SVE or RVV since they would implement custom lowering anyway, but for other targets (especially if we eventually generalize insert/extract to accept variable lane positions and do so for all vector types for consistency) it could become an unnecessary headache. Though if we make insertelement with out-of-range lane index return poison instead of being UB, then at minimum clamping in insertelement is necessary to avoid the wild write.

Aside: your proposed semantics for SVE would break the property extractelt(insertelt(vec, i, elt), i) = elt which instcombine (and others) most likely assumes. If we make the insertelt return poison (or even just undef, FWIW), that issue goes away.

I think that making the return be a poison value is best.

Ok, will update again. Thanks.

Sorry to be late to the party, but I have a quick question:

Ok, so do we have agreement that constant literal indices should be limited to 0..n-1 for now, but non-constant indices can potentially exceed n so that expressions featuring vscale can be used?

What if we know the width is fixed on our target machine? Let's say it's fixed at 512b. So a full width scalable double vector would be:

<vscale x 2 x double>

Since our width is fixed, we know that vscale=4 here and there are 8 elements in this vector.

I'd like to be able to do a really fast insert at say index 3. I.e. insert_element(<vscale x 2 x double> res, double elt, int 3). Would that insert be as fast with the vscale method as it would be for an index < n-1?

Sorry to be late to the party, but I have a quick question:

Ok, so do we have agreement that constant literal indices should be limited to 0..n-1 for now, but non-constant indices can potentially exceed n so that expressions featuring vscale can be used?

What if we know the width is fixed on our target machine? Let's say it's fixed at 512b. So a full width scalable double vector would be:

<vscale x 2 x double>

Since our width is fixed, we know that vscale=4 here and there are 8 elements in this vector.

I'd like to be able to do a really fast insert at say index 3. I.e. insert_element(<vscale x 2 x double> res, double elt, int 3). Would that insert be as fast with the vscale method as it would be for an index < n-1?

If you know the exact size of your vectors at compile time, then I believe fixed length vector types should be used, at least at the IR level.

The performance of insert/extract is target dependent. For SVE you almost always need a predicate for a single element insert or extract, so you're not going to gain anything by knowing the size ahead of time.

If you know the exact size of your vectors at compile time, then I believe fixed length vector types should be used, at least at the IR level.

Ah, ok. So let's say we're targeting SVE and I know my vector length is 512b. I was imagining building a vector like this:

%r1 = insertelement <scalable 2 x double> undef, double %x, i32 0  
%r2 = insertelement <scalable 2 x double > %r1, double %x1, i32 1
%r3 = insertelement <scalable 2 x double > %r2, double %x2, i32 2
%r4 = insertelement <scalable 2 x double > %r3, double %x3, i32 3
%r5 = insertelement <scalable 2 x double > %r4, double %x4, i32 4
%r6 = insertelement <scalable 2 x double > %r5, double %x5, i32 5
%r7 = insertelement <scalable 2 x double > %r6, double %x6, i32 6
%r8 = insertelement <scalable 2 x double > %r7, double %x7, i32 7

But it sounds like I should rather just build a <8 x double> vector instead. Would that <8 x double> vector legalize to SVE vector instructions? Or would it be split into four 128b NEON vectors?

The performance of insert/extract is target dependent. For SVE you almost always need a predicate for a single element insert or extract, so you're not going to gain anything by knowing the size ahead of time.

So back to my scalable vector example above, it sounds like even if the above IR was valid, we'd still have to produce predicates at the hardware instruction level. So maybe my concern is moot...

If you know the exact size of your vectors at compile time, then I believe fixed length vector types should be used, at least at the IR level.

Ah, ok. So let's say we're targeting SVE and I know my vector length is 512b. I was imagining building a vector like this:

%r1 = insertelement <scalable 2 x double> undef, double %x, i32 0  
%r2 = insertelement <scalable 2 x double > %r1, double %x1, i32 1
%r3 = insertelement <scalable 2 x double > %r2, double %x2, i32 2
%r4 = insertelement <scalable 2 x double > %r3, double %x3, i32 3
%r5 = insertelement <scalable 2 x double > %r4, double %x4, i32 4
%r6 = insertelement <scalable 2 x double > %r5, double %x5, i32 5
%r7 = insertelement <scalable 2 x double > %r6, double %x6, i32 6
%r8 = insertelement <scalable 2 x double > %r7, double %x7, i32 7

So you could do that (judging by the rest of the discussion), but you'd have poison values (effectively) if you exceeded the runtime length. I guess if you're treating scalable vectors as pseudo-fixed-length then you don't care about using the same binary on different hardware.

But it sounds like I should rather just build a <8 x double> vector instead. Would that <8 x double> vector legalize to SVE vector instructions? Or would it be split into four 128b NEON vectors?

In the current code, you'd get 4 NEON vectors. In future we'd implement fixed length SVE support as well (for SLP autovec without introducing extra predicate generation/branching), but the recommended method of loop autovec for SVE would be VLA. For your own work you'd be able to use fixed-length types then, but we're still figuring out the design (to minimize the number of ISel patterns).

The performance of insert/extract is target dependent. For SVE you almost always need a predicate for a single element insert or extract, so you're not going to gain anything by knowing the size ahead of time.

So back to my scalable vector example above, it sounds like even if the above IR was valid, we'd still have to produce predicates at the hardware instruction level. So maybe my concern is moot...

Yes, though if the index is known to be within a certain range (5bit signed immediate), then you can skip generating a splat and just compare against the stepvector directly; so for your 512b example cpu, you'd be able to generate one fewer instruction when indexing 32b or 64b elements. If you're building up an entire vector (as in your IR), the insr instruction will shift all elements along by one and insert into the first lane, so no predicates would be needed -- but we would need to pattern match and optimize for this case.

If you know the exact size of your vectors at compile time, then I believe fixed length vector types should be used, at least at the IR level.

Ah, ok. So let's say we're targeting SVE and I know my vector length is 512b. I was imagining building a vector like this:

%r1 = insertelement <scalable 2 x double> undef, double %x, i32 0  
%r2 = insertelement <scalable 2 x double > %r1, double %x1, i32 1
%r3 = insertelement <scalable 2 x double > %r2, double %x2, i32 2
%r4 = insertelement <scalable 2 x double > %r3, double %x3, i32 3
%r5 = insertelement <scalable 2 x double > %r4, double %x4, i32 4
%r6 = insertelement <scalable 2 x double > %r5, double %x5, i32 5
%r7 = insertelement <scalable 2 x double > %r6, double %x6, i32 6
%r8 = insertelement <scalable 2 x double > %r7, double %x7, i32 7

So you could do that (judging by the rest of the discussion), but you'd have poison values (effectively) if you exceeded the runtime length. I guess if you're treating scalable vectors as pseudo-fixed-length then you don't care about using the same binary on different hardware.

But it sounds like I should rather just build a <8 x double> vector instead. Would that <8 x double> vector legalize to SVE vector instructions? Or would it be split into four 128b NEON vectors?

In the current code, you'd get 4 NEON vectors. In future we'd implement fixed length SVE support as well (for SLP autovec without introducing extra predicate generation/branching), but the recommended method of loop autovec for SVE would be VLA. For your own work you'd be able to use fixed-length types then, but we're still figuring out the design (to minimize the number of ISel patterns).

That's great.

The performance of insert/extract is target dependent. For SVE you almost always need a predicate for a single element insert or extract, so you're not going to gain anything by knowing the size ahead of time.

So back to my scalable vector example above, it sounds like even if the above IR was valid, we'd still have to produce predicates at the hardware instruction level. So maybe my concern is moot...

Yes, though if the index is known to be within a certain range (5bit signed immediate), then you can skip generating a splat and just compare against the stepvector directly; so for your 512b example cpu, you'd be able to generate one fewer instruction when indexing 32b or 64b elements. If you're building up an entire vector (as in your IR), the insr instruction will shift all elements along by one and insert into the first lane, so no predicates would be needed -- but we would need to pattern match and optimize for this case.

Ok. Thanks for the clarification.

What's the status of this? It seems like discussion has died down a bit. I think Graham's idea to change from <scalable 2 x float> to <vscale x 2 x float> will make the IR more readable/understandable but it's not a show-stopper for me.

Are there any other outstanding issues to address before this lands?

What's the status of this? It seems like discussion has died down a bit. I think Graham's idea to change from <scalable 2 x float> to <vscale x 2 x float> will make the IR more readable/understandable but it's not a show-stopper for me.

Are there any other outstanding issues to address before this lands?

I think @huntergr still needs to update the insertelement/extractelement doc. For me, the last remaining thing is the definition of shufflevector behavior that his last revision did not address. Do we allow scalable vector? Do we allow anything other than zeroinitializer mask? I'm fine not allowing for the time being or restricting to zeroinitializer mask for the time being. We still need to write it down since generally supporting shufflevector of scalable vectors is outside the scope of this patch.

What's the status of this? It seems like discussion has died down a bit. I think Graham's idea to change from <scalable 2 x float> to <vscale x 2 x float> will make the IR more readable/understandable but it's not a show-stopper for me.

Are there any other outstanding issues to address before this lands?

I think @huntergr still needs to update the insertelement/extractelement doc. For me, the last remaining thing is the definition of shufflevector behavior that his last revision did not address. Do we allow scalable vector? Do we allow anything other than zeroinitializer mask? I'm fine not allowing for the time being or restricting to zeroinitializer mask for the time being. We still need to write it down since generally supporting shufflevector of scalable vectors is outside the scope of this patch.

+1 - Once we have these documentation updates, I think that we'll be good to go.

huntergr updated this revision to Diff 198986.May 10 2019, 12:05 AM
  • Changed (extract|insert)element semantics to return poison values at runtime if the index exceeds hardware vector length.
  • Changed shufflevector semantics to note that only zeroinitializer and undef can be used as mask values for scalable vectors for now.

(Sorry about the delay in updating, I've been out at a conference this week)

What's the status of this? It seems like discussion has died down a bit. I think Graham's idea to change from <scalable 2 x float> to <vscale x 2 x float> will make the IR more readable/understandable but it's not a show-stopper for me.

While I don't want to hold up this patch, a name change to the type will be difficult to get through once the type is in and the terminology has settled, so it may be worth getting it right from the start. Personally I think <vscale x 16 x i8> is more explicit (and therefore easier to understand for those new to the type) than <scalable 16 x i8>. And now that we all have a better understanding and clear definition of scalable types, I wonder if <n x 16 x i8> instead of <vscale x 16 x i8> is worth considering again, since it is much easier to read in tests.

Other than that, great to see this discussion on scalable types converging and so close to agreement!

I wonder if `<n x 16 x i8>` instead of `<vscale x 16 x i8>` is worth considering again, since it is much easier to read in tests.

I kind of like <n x 16 x i8>. It's concise. I don't think vscale really adds a lot of value, but it does eat up characters on a line.

I know very well how annoying it can be to read and write (and say) the scalable prefix all the time and wish for something shorter sometimes, but I also prefer <vscale x ...> for the reasons Sander gave. I'll add that <vscale x 4 x i32> feels a bit lighter than <scalable 4 x i32> even though it's the same number of characters (maybe because there's more whitespace?).

The <n x ...> syntax is shorter but doesn't have the mnemonic aspect and also clashes with the pre-existing use of "n" as a metavariable standing for some fixed vector length (as in, <N x i1> for example), so I'd rather have <scalable ...> if those are the two options.

greened added a comment.EditedMay 13 2019, 11:47 AM

I know very well how annoying it can be to read and write (and say) the scalable prefix all the time and wish for something shorter sometimes, but I also prefer <vscale x ...> for the reasons Sander gave. I'll add that <vscale x 4 x i32> feels a bit lighter than <scalable 4 x i32> even though it's the same number of characters (maybe because there's more whitespace?).

The <n x ...> syntax is shorter but doesn't have the mnemonic aspect and also clashes with the pre-existing use of "n" as a metavariable standing for some fixed vector length (as in, <N x i1> for example), so I'd rather have <scalable ...> if those are the two options.

+1. I like vscale over n because it directly references a term documented in the IR with this patch.

It seems like there's enough support for changing to <vscale x as the prefix, so I'll revise the patch.

huntergr updated this revision to Diff 200725.May 22 2019, 7:00 AM
  • Changed textual IR format to <vscale x n x <ty>>
hfinkel added inline comments.May 22 2019, 4:25 PM
docs/LangRef.rst
8190

I believe that all you need to say is:

For a scalable vector, if the value of ``idx`` exceeds the runtime length of the vector, the result is a poison value.

This part about the "result in the IR" is confusing, and is just stating a basic consistency fact about LLVM's type system. I recommend removing that.

huntergr updated this revision to Diff 200933.May 23 2019, 5:09 AM
  • Simplified wording for extract/insertelement semantics
huntergr marked an inline comment as done.May 23 2019, 5:10 AM
hfinkel accepted this revision.May 23 2019, 8:40 AM
  • Simplified wording for extract/insertelement semantics

Thanks. LGTM.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptMay 29 2019, 5:20 AM
Herald added a subscriber: kristina. · View Herald Transcript

Hi, this slowed down thinlto links 3-4x, and other things probably too. PR42210 has details. I've reverted this for now in r362913.

Hi, this slowed down thinlto links 3-4x, and other things probably too. PR42210 has details. I've reverted this for now in r362913.

Ok, thanks. I'll investigate -- I guess having a persistent map for the duration of verification might resolve it.

@huntergr do you have an account on bugzilla? I couldn't CC you on that bug.

@huntergr do you have an account on bugzilla? I couldn't CC you on that bug.

No. I'll sign up for one.

Ka-Ka added a subscriber: Ka-Ka.Jun 11 2019, 6:46 AM