This is an archive of the discontinued LLVM Phabricator instance.

[SVE][IR] Scalable Vector IR Type
ClosedPublic

Authored by huntergr on Apr 26 2017, 4:12 AM.

Details

Summary
  • Adds a 'scalable' flag to VectorType
  • Adds an 'ElementCount' class to VectorType to pass (possibly scalable) vector lengths, with overloaded operators.
  • Modifies existing helper functions to use ElementCount
  • Adds support for serializing/deserializing to/from both textual and bitcode IR formats
  • Extends the verifier to reject global variables of scalable types
  • Adds unit tests
  • Updates documentation

See the latest version of the RFC here: http://lists.llvm.org/pipermail/llvm-dev/2018-July/124396.html

Diff Detail

Repository
rL LLVM

Event Timeline

huntergr created this revision.Apr 26 2017, 4:12 AM
rengolin edited edge metadata.

Hi Graham, thanks for this work.

I think you have addressed all points in the previous review and I'm happy with the patch. It's very concise and simple.

We should let it simmer for a while to make sure no one else has any comment, as this is a core change in IR, and we should get as much eyes on it as possible.

I'm adding a few more people as reviewers and I hope the CC'd people can also have a look.

cheers,
--renato

lib/IR/Type.cpp
623 ↗(On Diff #96705)

why not:

: SequentialType(VectorTyID, ElType, EC.Min), Scalable(EC.Scalable) { }
huntergr updated this revision to Diff 97411.May 2 2017, 2:49 AM

Improved constructor based on Renato's suggestion. Thanks.

huntergr marked an inline comment as done.May 2 2017, 2:50 AM
sanjoy added a subscriber: sanjoy.May 7 2017, 3:44 PM
huntergr updated this revision to Diff 121117.Nov 1 2017, 5:20 AM

Changed textual IR format to match Chris's suggestion from the mailing list.

I have also changed vscale and stepvector to be intrinsics, but as that makes those patches just an addition to Intrinsics.td I won't post them yet.

Rough plan is to create a patch series to get minimal legalization and codegen working after enough of Sander de Smalen's MC patches have been committed; hopefully that will provide more context.

Alongside that I'll be looking at all the places the size of a value is queried to see if there's any problems using a pair of scaled and unscaled byte sizes.

Updated RFC including a list of patches for simple codegen using this extension has been posted: http://lists.llvm.org/pipermail/llvm-dev/2018-June/123780.html

I think the discussion on the mailing list has reached agreement on this type being suitable for both SVE and RVV; any review comments on the code or tests?

I just realized the updated RFC doesn't touch on the issue at all, but I think it's safe to say we won't support globals of scalable vector type? Those seems impossible to implement in a sensible way for RISC-V, and if my memory and quick skim-reading is correct, it isn't part of the SVE C language extensions either. If that's correct, I'd expect the verifier to reject global variables whose type is a scalable vector.

huntergr updated this revision to Diff 155640.Jul 16 2018, 3:24 AM
huntergr retitled this revision from [SVE] Scalable Vector IR Type to [SVE][IR] Scalable Vector IR Type.
huntergr edited the summary of this revision. (Show Details)

Indeed, we shouldn't allow scalable vectors to be globals. I've added a check for that in the verifier, plus unit tests and a small update to the langref. Thanks.

fhahn added a subscriber: fhahn.Jul 16 2018, 4:22 AM

Thank you! I took another look and found two nits, sorry for not pointing them out earlier.

Other than those nits I'd say "LGTM", but note that I currently have neither commit access nor even any accepted upstream patches, so a LGTM from me probably isn't enough to commit.

include/llvm/IR/DerivedTypes.h
498 ↗(On Diff #155640)

Nit: This restriction on the range of NumElements is very reasonable, but we should make it a proper invariant of the type and enforce it in VectorType::get rather than post-hoc in some accessors.

lib/IR/LLVMContextImpl.h
1311 ↗(On Diff #155640)

Nit: have you considered std::pair<bool, unsigned> instead of manually bit-packing it into 64 bits? DenseMap should support nested pairs, the size should be the same (except if unsigned is 64 bit, which I don't believe we support and which is extremely niche anyway), and it would simplify VectorType::get a little bit.

huntergr added inline comments.Jul 19 2018, 8:00 AM
include/llvm/IR/DerivedTypes.h
498 ↗(On Diff #155640)

VectorType::get already has this restriction, at least on most platforms -- the argument to it is 'unsigned', which is usually 32 bits. The ElementCount struct also uses 'unsigned'. It may be worth changing it to an explicit uint32_t.

I think VectorType originally stored the number of elements as a 32 bit field so was consistent with the interface, but at some point the different SequentialType variants were changed to unify with a single 64 bit size field in the parent class.

I will create an updated patch to check that we don't overflow when using methods like getDoubleElementsVectorType; good catch.

rkruppe added inline comments.Jul 20 2018, 4:34 AM
include/llvm/IR/DerivedTypes.h
498 ↗(On Diff #155640)

Oh, right, I was under the mistaken impression that VectorType::get took uint64_t. So the potential overflow in getDoubeElementsVectorType is pre-existing, but if you want to take the time to harden against it now, that's great.

I don't think changing unsigned to uint32_t is worth the churn, at least not in this patch.

huntergr added inline comments.Jul 26 2018, 4:55 AM
lib/IR/LLVMContextImpl.h
1311 ↗(On Diff #155640)

So I tried this, and couldn't compile it -- there's no implementation of getHashValue, getEmptyKey, getTombstoneKey, etc. for the nested pair.

In our downstream compiler this is actually implemented as a dense map of a 3-element tuple against the Type*, for which we have implemented the appropriate extensions to DenseMapInfo.

If that approach is preferred to bitpacking, I'll make a separate patch to implement the DenseMap extensions.

rkruppe added inline comments.Aug 31 2018, 10:29 AM
lib/IR/LLVMContextImpl.h
1311 ↗(On Diff #155640)

I finally got a chance to look into the error you're seeing and it turns out the root cause is not nested pairs but a missing implementation of DenseMapInfo for bool.

We could add that implementation, but in the future other code may also want to hash an ElementCount, e.g. VPlan may migrate the vectorization factor VF from unsigned to ElementCount and there are some DenseMaps with VF as key. With this in mind I'm leaning towards implementing DenseMapInfo<VectorType::ElementCount>, using the bit fiddling that's currently open-coded here. What do you think?

huntergr added inline comments.Sep 4 2018, 2:38 AM
lib/IR/LLVMContextImpl.h
1311 ↗(On Diff #155640)

Ah, I spotted the same bug you did (missing implementation for bool), but it seems I hadn't submitted the comment I wrote. I tried it with nesting a pair of unsigned ints and that worked, but making it work directly with ElementCount seems a nicer idea, thanks.

huntergr updated this revision to Diff 172469.Nov 2 2018, 5:21 PM
  • Added checks in verifier to prevent scalable vectors being included in structs or arrays
  • Changed lookup to use ElementCount
  • Moved ElementCount to a new file
  • More unit tests
huntergr marked an inline comment as done.Nov 2 2018, 5:23 PM
greened added a comment.EditedMar 8 2019, 7:31 AM

This all LGTM after addressing the various comments.

simoll added a subscriber: simoll.Mar 8 2019, 9:00 AM
hsaito added a subscriber: hsaito.Mar 8 2019, 2:27 PM
rovka added a subscriber: rovka.Mar 20 2019, 9:28 AM

Seems fine in general, just some nits.

include/llvm/IR/DerivedTypes.h
413 ↗(On Diff #172469)

Nit: Punctuation (comments should end with .)

477 ↗(On Diff #172469)

Nit: Punctuation.

490 ↗(On Diff #172469)

Nit: I'd like to see a similar comment in SequentialType::getNumElements.

include/llvm/Support/ScalableSize.h
26 ↗(On Diff #172469)

Nit: Punctuation and capitalization (If [...])

lib/IR/Verifier.cpp
311 ↗(On Diff #172469)

Nitpick: I would call this "containsScalableVectorValue", to make it clear that it doesn't just look at the top level type.

656 ↗(On Diff #172469)

Could do an early return here instead of aggregating the result.

4797 ↗(On Diff #172469)

Nitpick 1: This comment is going to become stale as soon as someone comes up with a non-scalable type they'd like to check.
Nitpick 2: Any reason why this is called here and not in Verifier::verify?

unittests/IR/VectorTypesTest.cpp
34 ↗(On Diff #172469)

I'd also check the number of elements and element size here, just to cover the ElementCount operators fully.

69 ↗(On Diff #172469)

Ditto.

greened added inline comments.Mar 20 2019, 11:10 AM
docs/LangRef.rst
678 ↗(On Diff #172469)

Add a similar comment about scalable vector types in aggregates?

include/llvm/Support/ScalableSize.h
6 ↗(On Diff #172469)

Needs updated license.

17 ↗(On Diff #172469)

Should be LLVM_SUPPORT_SCALABLESIZE_H.

huntergr updated this revision to Diff 191733.Mar 21 2019, 10:20 AM

Rebased, incorporated fixes from reviews.

Thanks for taking a look, Diana and David.

huntergr marked 12 inline comments as done.Mar 21 2019, 10:24 AM
sebpop accepted this revision.Mar 29 2019, 11:50 AM
sebpop added a subscriber: sebpop.

I am accepting the scalable vector types based on the comments in
http://lists.llvm.org/pipermail/llvm-dev/2019-March/131137.html

Let's move forward with SVE support in LLVM. Thanks!

This revision is now accepted and ready to land.Mar 29 2019, 11:50 AM

I am accepting the scalable vector types based on the comments in
http://lists.llvm.org/pipermail/llvm-dev/2019-March/131137.html

Let's move forward with SVE support in LLVM. Thanks!

Hi Sebastian,

Thanks!

I'll wait a couple more days for last minute feedback before committing.

I am accepting the scalable vector types based on the comments in
http://lists.llvm.org/pipermail/llvm-dev/2019-March/131137.html

Let's move forward with SVE support in LLVM. Thanks!

Hi Sebastian,

Thanks!

I'll wait a couple more days for last minute feedback before committing.

I'd advise caution here, it's really significant/impactful change, and a single sign-off is a bit worrying.
In particular, even regardless of the feature itself, has the implementation itself been reviewed?

rovka added a comment.Apr 3 2019, 1:47 AM

I'd advise caution here, it's really significant/impactful change, and a single sign-off is a bit worrying.

I agree that this is a significant change and I can understand why people are a bit nervous about merging it. Would it help if we had more middle-end patches reviewed before committing, so people could have a better understanding of the impact? Off the top of my head, the rework of D35137 would be interesting, and also constant handling.

Can you also tell us what the plan is regarding all the places in the optimizer that may need updating to handle the new vectors? AFAICT, code that deals only with fixed-width vectors shouldn't be affected, but I would imagine a lot of passes would break IR using scalable vectors (e.g. by building fixed-width vectors instead of scalable ones).

In particular, even regardless of the feature itself, has the implementation itself been reviewed?

I can't speak for Sebastian, but there have been a few pairs of eyes on the implementation itself, as you can see from the comments.

sebpop added a comment.Apr 3 2019, 7:18 AM

I'd advise caution here, it's really significant/impactful change, and a single sign-off is a bit worrying.

There were several people who voted for a scalable vector type in the llvm-dev thread that I referred to
http://lists.llvm.org/pipermail/llvm-dev/2019-March/131137.html
Maybe those people can also sign-off to accept this change.

I am somewhat against extending the IR with the SV type
and I recognize that as we stand today adding the SV type is a good way to get ARM-SVE support in LLVM.

I agree that this is a significant change and I can understand why people are a bit nervous about merging it. Would it help if we had more middle-end patches reviewed before committing, so people could have a better understanding of the impact? Off the top of my head, the rework of D35137 would be interesting, and also constant handling.

Can you also tell us what the plan is regarding all the places in the optimizer that may need updating to handle the new vectors?

A patch series has been posted for review, see section 7. in
http://lists.llvm.org/pipermail/llvm-dev/2018-June/123780.html

In particular, even regardless of the feature itself, has the implementation itself been reviewed?

I can't speak for Sebastian, but there have been a few pairs of eyes on the implementation itself, as you can see from the comments.

Yes, I have reviewed the current patch and I have looked at the patch series.

rovka added a comment.Apr 4 2019, 2:03 AM

I'd advise caution here, it's really significant/impactful change, and a single sign-off is a bit worrying.

There were several people who voted for a scalable vector type in the llvm-dev thread that I referred to
http://lists.llvm.org/pipermail/llvm-dev/2019-March/131137.html
Maybe those people can also sign-off to accept this change.

I am somewhat against extending the IR with the SV type
and I recognize that as we stand today adding the SV type is a good way to get ARM-SVE support in LLVM.

I think this patch is a good start and I don't mind having it merged as is, I was just suggesting a way forward since it seems people are still hesitating. While there seems to be some consensus on moving forward with native types, I'm sure many of the details are still somewhat fuzzy. In my opinion having more patches reviewed would make those details clear, and it would thus make it easier to get more sign-offs on this patch as well.

I agree that this is a significant change and I can understand why people are a bit nervous about merging it. Would it help if we had more middle-end patches reviewed before committing, so people could have a better understanding of the impact? Off the top of my head, the rework of D35137 would be interesting, and also constant handling.

Can you also tell us what the plan is regarding all the places in the optimizer that may need updating to handle the new vectors?

A patch series has been posted for review, see section 7. in
http://lists.llvm.org/pipermail/llvm-dev/2018-June/123780.html

My understanding is that Graham is reworking some of those. Some were only RFC to begin with, some have been abandoned since, and in any case there has been some back-and-forth on the mailing list since then. I think now would be a good time to rebase/update all the patches that are still relevant so people can see the current version of things.

Also, my question regarding updating the optimizer still stands, I don't think I've seen patches related to updating the existing passes to take the 'scalable' property into account. While in some places it's likely to have no impact, I'd be very surprised if you could pass an IR file with scalable types through the optimizer and not get a soup of fixed-width and scalable vectors out the other end (or the corresponding assertions about mismatched types). I'd be happy to help deal with that churn, by the way :) that's rather why I was asking about the plan.

hfinkel added inline comments.Apr 4 2019, 9:12 AM
docs/LangRef.rst
2749 ↗(On Diff #191733)

This doesn't seem strong enough. We need the unknown multiple to be the *same* for any given type (at least within a given function). We also need a relationship between vectors of different underlying types (so that zext/sext/etc. make sense). Otherwise, you can't even sensibly add them together (for example). I realize that it says something about an unknown vector length above, but we need to translate that statement into semantics that make sense for the vectors themselves.

lib/IR/Verifier.cpp
4927 ↗(On Diff #191733)

Remove unneeded whitespace change.

rengolin added inline comments.Apr 5 2019, 1:52 AM
docs/LangRef.rst
2749 ↗(On Diff #191733)

It's not that simple. Both SVE and RISC-V can have vector multiplier changes in the middle of a function (via system register or similar). Neither of them *want* that to be the norm, but IIRC, RISC-V doesn't want it to change inside a function and SVE wants it to be the same for the whole program.

I totally agree with you that leaving it open is a huge can of worms, and wanting a per-function change would probably need new annotation on functions, which if ever done, should be orthogonal to this change (or would lead us into madness).

I second your proposal that we fix the semantics in LLVM, for now, that the "unknown width" is the same throughout the program and that the existing relationship between fixed vectors extends to scalable vectors.

If you look at the changes in this patch series, it assumes that behaviour already, by getting new vector types of half-size with double-elements and so on.

IFF RISC-V wants to extend the logic to be per-function, then we will need to do a much more extensive analysis on the passes, especially around inlining and function calls. I strongly propose we don't look at it right now and fix the semantics as proposed above.

In my analysis, with that semantics, I don't see a big impact on any existing non-scalable optimisations. With vectorisation passes being run at the end of the pipeline, even for scalable code, most of the existing pipeline will still be relevant, too.

I'd advise caution here, it's really significant/impactful change, and a single sign-off is a bit worrying.
In particular, even regardless of the feature itself, has the implementation itself been reviewed?

Noted; I'll hold off for now.

I have scheduled a roundtable at EuroLLVM next week, so if interested people are attending we can perhaps make more progress there.

I think this patch is a good start and I don't mind having it merged as is, I was just suggesting a way forward since it seems people are still hesitating. While there seems to be some consensus on moving forward with native types, I'm sure many of the details are still somewhat fuzzy. In my opinion having more patches reviewed would make those details clear, and it would thus make it easier to get more sign-offs on this patch as well.

Ok. I'll see if I can get some time to work on updating other core parts for review.

I agree that this is a significant change and I can understand why people are a bit nervous about merging it. Would it help if we had more middle-end patches reviewed before committing, so people could have a better understanding of the impact? Off the top of my head, the rework of D35137 would be interesting, and also constant handling.

Can you also tell us what the plan is regarding all the places in the optimizer that may need updating to handle the new vectors?

A patch series has been posted for review, see section 7. in
http://lists.llvm.org/pipermail/llvm-dev/2018-June/123780.html

My understanding is that Graham is reworking some of those. Some were only RFC to begin with, some have been abandoned since, and in any case there has been some back-and-forth on the mailing list since then. I think now would be a good time to rebase/update all the patches that are still relevant so people can see the current version of things.

Indeed. We are considering a github repo with a more complete implementation split into sensible patches (unlike the current repo with a single megapatch that covers far more than just SVE support). It will take some time to prepare that, though.

Also, my question regarding updating the optimizer still stands, I don't think I've seen patches related to updating the existing passes to take the 'scalable' property into account. While in some places it's likely to have no impact, I'd be very surprised if you could pass an IR file with scalable types through the optimizer and not get a soup of fixed-width and scalable vectors out the other end (or the corresponding assertions about mismatched types). I'd be happy to help deal with that churn, by the way :) that's rather why I was asking about the plan.

There aren't that many changes (imo), but I understand that you'd like to see for yourself what they are. I'll see what the opinion is on the repo vs. lots of phabricator patches at EuroLLVM.

rkruppe added inline comments.Apr 5 2019, 2:41 AM
docs/LangRef.rst
2749 ↗(On Diff #191733)

So first of all, I agree that this patch does (and should) only implement a single "constant" (at runtime) vscale value. The current wording here in LangRef is ambiguous about this, it doesn't make clear at which scope the "unknown integer multiple" is fixed. It should be made clear that this factor is the same for all vector types and does not change while the program executes (and, once the vscale intrinsic is added, this section here should also point at it).

Second, since you mentioned it I should say: the RISC-V vector extension has now changed to a point where (at least in its standard incarnation, without further extensions on top of it) there is no need for vscale to change during runtime, let alone between individual functions. All the changes to the "vector register size" now happen in such a way that it's easy to express the longer vectors in IR by just using different vector types with higher ElementCount::Min, e.g. <scalable 8 x double> instead of <scalable 1 x double>. So from the RVV side, there's no need any more for vscale to vary e.g. function-by-function.

I don't know whether the SVE folks want to take a shot at it regardless, but in past discussion it sounded like the vscale-per-function model wasn't a good fit for the programming model they envisioned, so maybe they'd come up with a different solution if and when they tackle that problem.

huntergr added inline comments.Apr 5 2019, 2:43 AM
docs/LangRef.rst
2749 ↗(On Diff #191733)

For SVE at least, we can consider changing the vector length during execution to be undefined behaviour. The compiler is not expected to be able to handle it.

For RVV, given the new restrictions on how vlmul is handled, I think they won't need to change the multiple at runtime either -- just increase the the minimum number of lanes. I'm hoping to discuss this with Robin at EuroLLVM, assuming time permits.

I'll come up with some stricter wording.

huntergr updated this revision to Diff 193859.Apr 5 2019, 5:46 AM
  • Clarified that the runtime multiple is constant across all scalable vector types, even if the constant value isn't known at compile time.
  • Removed extra whitespace.
huntergr marked 2 inline comments as done.Apr 5 2019, 5:48 AM
  • Clarified that the runtime multiple is constant across all scalable vector types, even if the constant value isn't known at compile time.
  • Removed extra whitespace.

LGTM now, thanks for all the hard work! I won't approve, as this is a larger discussion, but I'm happy with this patch, and the native support as is.

hfinkel added inline comments.Apr 5 2019, 11:18 AM
docs/LangRef.rst
2749 ↗(On Diff #191733)

I definitely agree that we should not deal with changing the vscale during program execution.

I think that the model is: There is an underlying vector length. vscale = round(vector length in bits / primitive size in bits). Can we specify it like that? We do also need to define what the rounding is. What does <scalable 4 x i3> do? Or is it not allowed?

hsaito added inline comments.Apr 5 2019, 4:01 PM
docs/LangRef.rst
2749 ↗(On Diff #191733)

I definitely agree that we should not deal with changing the vscale during program execution.

I agree that this will make things a lot simpler than allowing it to change per function or in a middle of a function.

However, I don't quite agree that changing vscale per function is an orthogonal issue. What are we going to do when function foo() with vscale=X calls function bar() with vscale=Y using a scalable vector parameter?

Having said that, since I don't expect the discussions to converge anytime soon if we talk about vscale changing within a compilation unit, I agree we should move forward with vscale not changing within a compilation unit (we say program execution, but compiler's visibility is always limited to compilation unit). It should be sufficient to say that if multiple compilation unit with different vscale are linked, unspecified behavior will result.

@hfinkel, I think the model is "#elements * elementtype" fits in one or more "units of vector" and then apply vscale to it. I don't think scalable vector needs to fit one physical register of HW. Vector type legalization should kick-in. @huntergr, please correct me if my mental model is wrong.

rengolin added inline comments.Apr 6 2019, 3:01 AM
docs/LangRef.rst
2749 ↗(On Diff #191733)

However, I don't quite agree that changing vscale per function is an orthogonal issue.

I didn't mean the implementation, but the discussion.

I think a per-function vscale implementation will be very different from the current one, no matter which course we take now. It won't matter much if we have native or intrinsic implementation, we'll still need function attributes and teach the optimisation passes, etc.

Having said that, since I don't expect the discussions to converge anytime soon if we talk about vscale changing within a compilation unit

If the scope it the compilation unit, then we'd need it to be fixed on the target string, or we won't be able to link two units together. I think even this discussion is too soon, and we should push the scope to the whole program.

Any change in vscale throughout the program should be undefined, or we'd have to encode the necessary logic in the compiler, which is the biggest worry I see from the feedback. So far, the benefits of doing so are on edge cases and the actual costs are unknown (but very likely large).

In my view, this is *definitely* not a subject we should raise right now and restricting the current implementation to whole-program scope is the only way we can go forward for now in any sensible way.

hfinkel added inline comments.Apr 7 2019, 7:36 AM
docs/LangRef.rst
2749 ↗(On Diff #191733)

I didn't mean the implementation, but the discussion.

As I've said in previous thread, I don't believe that we can sensibly model a changing vscale without some SSA dependence, and that will require significant changes to the overall scheme.

restricting the current implementation to whole-program scope is the only way we can go forward for now in any sensible way.

+1

I think the model is "#elements * elementtype" fits in one or more "units of vector" and then apply vscale to it. I don't think scalable vector needs to fit one physical register of HW. Vector type legalization should kick-in.

Indeed, I believe you're correct. We need to account for this in the definition too.

hfinkel added inline comments.Apr 7 2019, 7:39 AM
docs/LangRef.rst
2749 ↗(On Diff #191733)

Indeed, I believe you're correct. We need to account for this in the definition too.

Either by having a model that includes legalization, or by restricting the size of the base vector type?

huntergr marked an inline comment as done.Apr 9 2019, 12:46 AM
huntergr added inline comments.
docs/LangRef.rst
2749 ↗(On Diff #191733)

We perform legalization for scalable vectors with the same mechanisms fixed-length vectors do (splitting for too large, promoting/extending for too small). Should this be documented in this description (it isn't for fixed vectors), or is there a better place in the docs for that explanation?

(A side note; for 'unpacked' float vector types (e.g. <scalable 2 x float>) we do declare them as legal for SVE then generate predicates to mask off the unused lanes in AArch64 specific code. Since there are more predicated architectures being added to the codebase, perhaps this could be generalized as a new legalization mechanism for fp vector types)

hfinkel added inline comments.Apr 9 2019, 1:42 AM
docs/LangRef.rst
2749 ↗(On Diff #191733)

We perform legalization for scalable vectors with the same mechanisms fixed-length vectors do (splitting for too large, promoting/extending for too small).

What defines too big (what size is used for splitting)? If <scalable 8 x double> fits in the vector register depends on the runtime vector size, no?

Should this be documented in this description

No, unless it's part of the IR-level model.

What we need here is a model defined, at the IR level, that explains why:

  1. I can add two <scalable 4 x float> vectors together.
  2. I cannot add a <scalable 4 x float> to a <scalable 2 x float>
  3. I can sext a <scalable 4 x i32> to a <scalable 4 x i64>, and this can be bitcast to a <scalable 8 x i32>.

Also we should address happens to vectors with an odd number of lanes or of a non-power-of-two-sized primitive types (both of which are defined at the IR level).

huntergr added inline comments.Apr 9 2019, 5:56 AM
docs/LangRef.rst
2749 ↗(On Diff #191733)

What defines too big (what size is used for splitting)? If <scalable 8 x double> fits in the vector register depends on the runtime vector size, no?

No. For SVE, the legal types are those where the minimum size is equal to 128 bits, since that's the minimum size for hardware registers (and the granularity of increments of register size). So an operation using <scalable 8 x double> values would need to be split into 4 <scalable 2 x double> operations for SVE during legalization. (I'm ignoring the predicated unpacked float forms for a moment)

I wonder if the change in syntax from <n x 8 x double> to <scalable 8 x double> makes that less obvious.

The basis of the model can be grounded in '1' being a valid value for vscale, which effectively makes the types equivalent to fixed length vectors. I'll try coming up with a description based on that.

joelkevinjones added a comment.EditedApr 11 2019, 7:19 PM

I think this is a coherent set of changes. Given the current top of trunk, this expands support from just assembly/disassembly of machine instructions to include LLVM IR, right? Such being the case, I think this patch should go in. I have some ideas on how to structure passes so SV IR supporting optimizations can be added incrementally. If anyone thinks such a proposal would help, let me know.

I think this is a coherent set of changes. Given the current top of trunk, this expands support from just assembly/disassembly of machine instructions to include LLVM IR, right? Such being the case, I think this patch should go in. I have some ideas on how to structure passes so SV IR supporting optimizations can be added incrementally. If anyone thinks such a proposal would help, let me know.

I think there is one more thing we still have to do. Does scalable vector type apply to all Instructions where non-scalable vector is allowed? If the answer is no, we need to identify which ones are not allowed to take scalable vector type operand/result. Some of the Instructions are not plain element-wise operation. Do we have agreed upon semantics for all those that are allowed?

If we are allowing just element-wise Instructions, we should explicitly say that in LangRef, warn at LLVM-DEV mailing list that new scalable vector types are coming, wait a little to let last minute screams to happen, assure them by saying scalable vector on element-wise Instructions won't cause any mess beyond non-scalable vector, and then commit. This would be the quickest route, and it still enables other interesting follow-up patches to be reviewed/committed. I think we are ready enough to do this if we choose to take this route.

If we are going for more than element-wise Instructions, we need to have well defined and agreed semantics for each of those, and that should be part of the LangRef for each such Instruction.
Also, have we thought about Intrinsics? Can all Intrinsics that take/return non-scalable vector handle scalable vector?

We can certainly let element-wise stuff to go in first and then extend to non-element-wise stuff later. Any thoughts in this regard?

If we are going for more than element-wise Instructions, we need to have well defined and agreed semantics for each of those, and that should be part of the LangRef for each such Instruction.

Agreed.

Also, have we thought about Intrinsics? Can all Intrinsics that take/return non-scalable vector handle scalable vector?

Same for intrinsics. If any of the vector ones apply to scalable vectors (ex. reduce), it needs to be documented.

We can certainly let element-wise stuff to go in first and then extend to non-element-wise stuff later. Any thoughts in this regard?

Precisely, we should go in baby-steps, with each step making sure we don't touch/break *anything* that isn't scalable, updating LangRef as we go.

Doing a single review on everything would be painful and counter-productive.

I think this is a coherent set of changes. Given the current top of trunk, this expands support from just assembly/disassembly of machine instructions to include LLVM IR, right? Such being the case, I think this patch should go in. I have some ideas on how to structure passes so SV IR supporting optimizations can be added incrementally. If anyone thinks such a proposal would help, let me know.

I think there is one more thing we still have to do. Does scalable vector type apply to all Instructions where non-scalable vector is allowed? If the answer is no, we need to identify which ones are not allowed to take scalable vector type operand/result. Some of the Instructions are not plain element-wise operation. Do we have agreed upon semantics for all those that are allowed?

I don't recall this being an issue, but I agree, if there are instructions that currently take vector types that don't take scalable vectors, that certainly needs to be documented.

If we are allowing just element-wise Instructions, we should explicitly say that in LangRef, warn at LLVM-DEV mailing list that new scalable vector types are coming, wait a little to let last minute screams to happen, assure them by saying scalable vector on element-wise Instructions won't cause any mess beyond non-scalable vector, and then commit.

I think that there's been plenty of traffic on this subject on llvm-dev. We do send warnings for potentially-downstream-disruptive changes, as a general best practice. I'm not really sure if this falls into this category, but I'm sure a note to llvm-dev will be sent to let everyone know when this lands, if nothing else, as an FYI to review other related patches.

This would be the quickest route, and it still enables other interesting follow-up patches to be reviewed/committed. I think we are ready enough to do this if we choose to take this route.

If we are going for more than element-wise Instructions, we need to have well defined and agreed semantics for each of those, and that should be part of the LangRef for each such Instruction.
Also, have we thought about Intrinsics? Can all Intrinsics that take/return non-scalable vector handle scalable vector?

FWIW, I took a quick look through the intrinsics and nothing jumped out at me as an intrinsic that currently has vector types on its interface that shouldn't take scalable vectors. So, if there things which don't work, we should certainly note that.

We can certainly let element-wise stuff to go in first and then extend to non-element-wise stuff later. Any thoughts in this regard?

We should indeed review patches in small/medium-sized units (so long as the changes are testable).

PkmX added a subscriber: PkmX.Apr 14 2019, 11:18 PM

I think this is a coherent set of changes. Given the current top of trunk, this expands support from just assembly/disassembly of machine instructions to include LLVM IR, right? Such being the case, I think this patch should go in. I have some ideas on how to structure passes so SV IR supporting optimizations can be added incrementally. If anyone thinks such a proposal would help, let me know.

I think there is one more thing we still have to do. Does scalable vector type apply to all Instructions where non-scalable vector is allowed? If the answer is no, we need to identify which ones are not allowed to take scalable vector type operand/result. Some of the Instructions are not plain element-wise operation. Do we have agreed upon semantics for all those that are allowed?

The main difference is for 'shufflevector'. For a splat it's simple, since you just use a zeroinitializer mask. For anything else, though, you currently need a constant vector with immediate values; this obviously won't work if you don't know how many elements there are.

Downstream, we solve this by allowing shufflevector masks to be ConstantExprs, and then using 'vscale' and 'stepvector' as symbolic Constant nodes. With those and a few arithmetic and logical ops, we can synthesize the usual set of shuffles (reverse, top/bottom half, odd/even, interleaves, zips, etc). Would also work for fixed-length vectors. There's been some pushback on introducing them as symbolic constants though, and the initial demo patch set has them as intrinsics.

So if we wanted to keep them as intrinsics for now, I think we have one of three options:

  1. Leave discussion on more complicated shuffles until later, and only use scalable autovectorization on loops which don't need anything more than splats.
  2. Introduce additional intrinsics for the other shuffle variants as needed
  3. Allow shufflevector to accept arbitrary masks so that intrinsics can be used (though possibly only if the output vector is scalable).

The discussion at the EuroLLVM roundtable was leaning towards option 1, with an action on me to provide a set of canonical shuffle examples using vscale and stepvector for community consideration.

So if we wanted to keep them as intrinsics for now, I think we have one of three options:

  1. Leave discussion on more complicated shuffles until later, and only use scalable autovectorization on loops which don't need anything more than splats.

Given the current state, this is the easiest path.

  1. Introduce additional intrinsics for the other shuffle variants as needed
  2. Allow shufflevector to accept arbitrary masks so that intrinsics can be used (though possibly only if the output vector is scalable).

This warrants a larger discussion, which would hinder the current progress.

So if we wanted to keep them as intrinsics for now, I think we have one of three options:

  1. Leave discussion on more complicated shuffles until later, and only use scalable autovectorization on loops which don't need anything more than splats.

Given the current state, this is the easiest path.

I agree, although this is an important part of the model, so we should start having this discussion in parallel (sooner rather than later). I had been under the impression that a set of intrinsics were being proposed for this, but extending shufflevector is also an option worth considering. If these are first-class types, then having first-class instruction support is probably the right path. This deserves it's own RFC.

  1. Introduce additional intrinsics for the other shuffle variants as needed
  2. Allow shufflevector to accept arbitrary masks so that intrinsics can be used (though possibly only if the output vector is scalable).

This warrants a larger discussion, which would hinder the current progress.

I agree. We should have a separate RFC on this.

So if we wanted to keep them as intrinsics for now, I think we have one of three options:

  1. Leave discussion on more complicated shuffles until later, and only use scalable autovectorization on loops which don't need anything more than splats.

Given the current state, this is the easiest path.

I agree, although this is an important part of the model, so we should start having this discussion in parallel (sooner rather than later). I had been under the impression that a set of intrinsics were being proposed for this, but extending shufflevector is also an option worth considering. If these are first-class types, then having first-class instruction support is probably the right path. This deserves it's own RFC.

  1. Introduce additional intrinsics for the other shuffle variants as needed
  2. Allow shufflevector to accept arbitrary masks so that intrinsics can be used (though possibly only if the output vector is scalable).

This warrants a larger discussion, which would hinder the current progress.

I agree. We should have a separate RFC on this.

I agree on both points.

I think this is a coherent set of changes. Given the current top of trunk, this expands support from just assembly/disassembly of machine instructions to include LLVM IR, right? Such being the case, I think this patch should go in. I have some ideas on how to structure passes so SV IR supporting optimizations can be added incrementally. If anyone thinks such a proposal would help, let me know.

I think there is one more thing we still have to do. Does scalable vector type apply to all Instructions where non-scalable vector is allowed? If the answer is no, we need to identify which ones are not allowed to take scalable vector type operand/result. Some of the Instructions are not plain element-wise operation. Do we have agreed upon semantics for all those that are allowed?

The main difference is for 'shufflevector'. For a splat it's simple, since you just use a zeroinitializer mask. For anything else, though, you currently need a constant vector with immediate values; this obviously won't work if you don't know how many elements there are.

We need to clarify on insertelement/extractelement. Maybe already done in some other patches, but that clarification should be part of this patch.
Is the "length of val" under the semantics "scalable * n" in <scalable n x ElemTy>, right? Or is it still n?

Going with length of val being "scalable * n", if there is an optimization that takes advantages of poison value being returned by evaluating "idx >= n" at compile time, we need to disable that for scalable vector. Also, if any verifier is checking whether constant idx value is less than n, we need to disable that for scalable vector.

So if we wanted to keep them as intrinsics for now, I think we have one of three options:

  1. Leave discussion on more complicated shuffles until later, and only use scalable autovectorization on loops which don't need anything more than splats.

Given the current state, this is the easiest path.

I agree, although this is an important part of the model, so we should start having this discussion in parallel (sooner rather than later). I had been under the impression that a set of intrinsics were being proposed for this, but extending shufflevector is also an option worth considering. If these are first-class types, then having first-class instruction support is probably the right path. This deserves it's own RFC.

That's pretty much what LLVM-VP is about: https://reviews.llvm.org/D57504

We are proposing compress/expand and lane shift as intrinsics.
I suggest you add any shuffle intrinsics to the same namespace to avoid fragmentation.

IMHO, the redesigned reduction intrinsics should go there as well (https://reviews.llvm.org/D60261 and/or https://reviews.llvm.org/D60262).

  1. Introduce additional intrinsics for the other shuffle variants as needed
  2. Allow shufflevector to accept arbitrary masks so that intrinsics can be used (though possibly only if the output vector is scalable).

Even if shuffle masks don't need to be constants anymore, it will be awkward to encode compress/expand without additional intrinsics (you would need a prefix sum vector over the mask bits or similar).

This warrants a larger discussion, which would hinder the current progress.

I agree. We should have a separate RFC on this.

+1

greened added a comment.EditedApr 17 2019, 1:53 PM

We need to clarify on insertelement/extractelement. Maybe already done in some other patches, but that clarification should be part of this patch.
Is the "length of val" under the semantics "scalable * n" in <scalable n x ElemTy>, right? Or is it still n?

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n. I assume for vectors of length > n one would have to use shufflevector or something similar (using vscale and stepvector as Graham mentioned) to get the elements you want to the lower n positions.

Either way, the semantics and any restrictions certainly need to be documented.

Going with length of val being "scalable * n", if there is an optimization that takes advantages of poison value being returned by evaluating "idx >= n" at compile time, we need to disable that for scalable vector. Also, if any verifier is checking whether constant idx value is less than n, we need to disable that for scalable vector.

If the index is still restricted by n then this shouldn't be an issue.

docs/LangRef.rst
2755 ↗(On Diff #193859)

Just want to double-check: there is nothing about scalable vectors that assumes all vector types have the same bit width, correct? Can <scalable 1 x float> have a different bit width from <scalable 1 x double>?

We need to clarify on insertelement/extractelement. Maybe already done in some other patches, but that clarification should be part of this patch.
Is the "length of val" under the semantics "scalable * n" in <scalable n x ElemTy>, right? Or is it still n?

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n.

But you know at runtime... there has to be a way to determine, at runtime, vscale. And the index doesn't need to be a constant. I'm not sure that we need to restrict non-constant n to only values valid for vscale == 1.

troyj added a subscriber: troyj.Apr 17 2019, 6:47 PM
PkmX added inline comments.Apr 17 2019, 8:57 PM
docs/LangRef.rst
2755 ↗(On Diff #193859)

I believe the intention is that <scalable 1 x double> should have twice many bits as <scalable 1 x float>, or the same many bits as <scalable 2 x float>.

That's pretty much what LLVM-VP is about: https://reviews.llvm.org/D57504

We are proposing compress/expand and lane shift as intrinsics.
I suggest you add any shuffle intrinsics to the same namespace to avoid fragmentation.

I'm hoping we can use an extended shufflevector for this still, but if we do end up going down the extra intrinsics route then I'll certainly use the same namespace.

IMHO, the redesigned reduction intrinsics should go there as well (https://reviews.llvm.org/D60261 and/or https://reviews.llvm.org/D60262).

  1. Introduce additional intrinsics for the other shuffle variants as needed
  2. Allow shufflevector to accept arbitrary masks so that intrinsics can be used (though possibly only if the output vector is scalable).

Even if shuffle masks don't need to be constants anymore, it will be awkward to encode compress/expand without additional intrinsics (you would need a prefix sum vector over the mask bits or similar).

Agreed, and it also extends to predicated inserts and extracts -- using SVE's 'lasta/lastb' instructions in a vectorized loop tail, for example -- they don't map well to the current instructions, so would need intrinsics.

This warrants a larger discussion, which would hinder the current progress.

I agree. We should have a separate RFC on this.

+1

Agreed. Handling shuffles was part of the original RFC, but that RFC was far too large to discuss all at once. I've started reworking that section based on current feedback, and I'll send it out once ready.

We need to clarify on insertelement/extractelement. Maybe already done in some other patches, but that clarification should be part of this patch.
Is the "length of val" under the semantics "scalable * n" in <scalable n x ElemTy>, right? Or is it still n?

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n. I assume for vectors of length > n one would have to use shufflevector or something similar (using vscale and stepvector as Graham mentioned) to get the elements you want to the lower n positions.

Either way, the semantics and any restrictions certainly need to be documented.

For now, I suspect keeping the limit as 'n' is sufficient, as we only really need to insert into the first element to be able to generate a splat. If we need more later we can discuss it then.

Going with length of val being "scalable * n", if there is an optimization that takes advantages of poison value being returned by evaluating "idx >= n" at compile time, we need to disable that for scalable vector. Also, if any verifier is checking whether constant idx value is less than n, we need to disable that for scalable vector.

If the index is still restricted by n then this shouldn't be an issue.

Agreed.

huntergr added inline comments.Apr 18 2019, 2:45 AM
docs/LangRef.rst
2755 ↗(On Diff #193859)

That's correct. I think that a clearer syntax might be <vscale x 1 x float> to indicate that the number of elements is being multiplied by the same vscale term for all scalable vector types.

The intent is that we should be able to reason about the relative sizes of different scalable vector types based on the element size and minimum number of lanes alone.

We need to clarify on insertelement/extractelement. Maybe already done in some other patches, but that clarification should be part of this patch.
Is the "length of val" under the semantics "scalable * n" in <scalable n x ElemTy>, right? Or is it still n?

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n.

But you know at runtime... there has to be a way to determine, at runtime, vscale. And the index doesn't need to be a constant. I'm not sure that we need to restrict non-constant n to only values valid for vscale == 1.

Yes; in our downstream compiler we allow inserts and extracts at arbitrary offsets in IR (though we admittedly haven't found too much use for it) using vscale (as a constant, though obtaining the Value via an intrinsic will still work) to give us the required element index.

e.g. (vscale * n) - 1 for the last element.

I don't think we need to support this initially, though.

The SVE ISA doesn't allow arbitrary indices for inserts or extracts, but we can generate an appropriate predicate quite easily in the backend.

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n.

But you know at runtime... there has to be a way to determine, at runtime, vscale. And the index doesn't need to be a constant. I'm not sure that we need to restrict non-constant n to only values valid for vscale == 1.

Good point. 100% agree. I was only considering the constant case.

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n.

But you know at runtime... there has to be a way to determine, at runtime, vscale. And the index doesn't need to be a constant. I'm not sure that we need to restrict non-constant n to only values valid for vscale == 1.

Good point. 100% agree. I was only considering the constant case.

Ok, so do we have agreement that constant literal indices should be limited to 0..n-1 for now, but non-constant indices can potentially exceed n so that expressions featuring vscale can be used?

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n.

But you know at runtime... there has to be a way to determine, at runtime, vscale. And the index doesn't need to be a constant. I'm not sure that we need to restrict non-constant n to only values valid for vscale == 1.

Good point. 100% agree. I was only considering the constant case.

Ok, so do we have agreement that constant literal indices should be limited to 0..n-1 for now, but non-constant indices can potentially exceed n so that expressions featuring vscale can be used?

I want to say yes here, but that leaves a bit of oddness. It's true that nobody can really police on true variable values. However, some variable values can become constant through optimization.
What's the behavior when >=n constant is exposed?

I am not sure how it could be anything but n. If you don't know how long the vector is, you can't correctly generate an index beyond n.

But you know at runtime... there has to be a way to determine, at runtime, vscale. And the index doesn't need to be a constant. I'm not sure that we need to restrict non-constant n to only values valid for vscale == 1.

Good point. 100% agree. I was only considering the constant case.

Ok, so do we have agreement that constant literal indices should be limited to 0..n-1 for now, but non-constant indices can potentially exceed n so that expressions featuring vscale can be used?

I want to say yes here, but that leaves a bit of oddness. It's true that nobody can really police on true variable values. However, some variable values can become constant through optimization.
What's the behavior when >=n constant is exposed?

Exactly. Non-constant values can become constant. Constant values can be guarded by vscale-dependent runtime guards (both hand-written and compiler generated). My preference is to leave this not restricted to vscale == 1 values, but rather allow all values that can be supported at runtime, and have it be UB if, at runtime, the relevant index is not available.

Exactly. Non-constant values can become constant. Constant values can be guarded by vscale-dependent runtime guards (both hand-written and compiler generated). My preference is to leave this not restricted to vscale == 1 values, but rather allow all values that can be supported at runtime, and have it be UB if, at runtime, the relevant index is not available.

+1

Exactly. Non-constant values can become constant. Constant values can be guarded by vscale-dependent runtime guards (both hand-written and compiler generated). My preference is to leave this not restricted to vscale == 1 values, but rather allow all values that can be supported at runtime, and have it be UB if, at runtime, the relevant index is not available.

+1

That means a need for a warning to general developers: If there is a check for constant_index >= n to see if the result is a poison value, that code has to be changed so that it applies to non-scalable vector only. Hopefully not too many instances of that. I'm fine with this as long as the communication to the rest of LLVM community is clear about it.

lewahlig removed a subscriber: lewahlig.
lewahlig added a subscriber: lewahlig.

Exactly. Non-constant values can become constant. Constant values can be guarded by vscale-dependent runtime guards (both hand-written and compiler generated). My preference is to leave this not restricted to vscale == 1 values, but rather allow all values that can be supported at runtime, and have it be UB if, at runtime, the relevant index is not available.

+1

That means a need for a warning to general developers: If there is a check for constant_index >= n to see if the result is a poison value, that code has to be changed so that it applies to non-scalable vector only. Hopefully not too many instances of that. I'm fine with this as long as the communication to the rest of LLVM community is clear about it.

Ok; I'll update the patch for that.

I think this works out fine, since this behaviour matches our downstream implementation, but I can summarize the extra semantic decisions in this review in a post to llvm-dev as well.

huntergr updated this revision to Diff 197301.Apr 30 2019, 5:11 AM

Noted the change in semantics for extractelement and insertelement in the langref.

Noted the change in semantics for extractelement and insertelement in the langref.

Why implementation defined and not UB for the case where the index exceeds the runtime length? How do you intend to define this for SVE?

greened added inline comments.Apr 30 2019, 12:30 PM
docs/LangRef.rst
2755 ↗(On Diff #193859)

+1 for <vscale x 1 x float>.

Noted the change in semantics for extractelement and insertelement in the langref.

Why implementation defined and not UB for the case where the index exceeds the runtime length? How do you intend to define this for SVE?

SVE uses a predicate for indexed inserts and extracts. We generate that predicate by comparing a splat of the index against a stepvector (0,1,2,3...); if the index is out of range then the predicate will be all false.

For a mov (insert), that results in an unmodified vector.

For a lastb (extract), that extracts the last lane in the vector if no predicate bits are true.

I don't know if RVV or SX-Aurora have similarly defined semantics. If the preference is to make it UB, that's fine.

Why implementation defined and not UB for the case where the index exceeds the runtime length? How do you intend to define this for SVE?

SVE uses a predicate for indexed inserts and extracts. We generate that predicate by comparing a splat of the index against a stepvector (0,1,2,3...); if the index is out of range then the predicate will be all false.

For a mov (insert), that results in an unmodified vector.

For a lastb (extract), that extracts the last lane in the vector if no predicate bits are true.

I don't know if RVV or SX-Aurora have similarly defined semantics. If the preference is to make it UB, that's fine.

I think this should be UB, or at least return poison. However, making it UB has the unfortunate side effect of making it illegal to hoist these operations out of conditionals (which currently isn't the case), so maybe poison is better.

For me the main reason for making is UB (aside from generally being conservative) is that element insert/extract can be usefully legalized by putting the vector into a stack slot and accessing the affected element with scalar loads/stores -- in fact, that's the default legalization strategy in trunk. But in that setting an out-of-bounds lane index would become a wild read/write (= clear-cut UB), unless the legalization code jumps through extra hoops to add a bounds check or clamp the lane index into range. This wouldn't affect SVE or RVV since they would implement custom lowering anyway, but for other targets (especially if we eventually generalize insert/extract to accept variable lane positions and do so for all vector types for consistency) it could become an unnecessary headache. Though if we make insertelement with out-of-range lane index return poison instead of being UB, then at minimum clamping in insertelement is necessary to avoid the wild write.

Aside: your proposed semantics for SVE would break the property extractelt(insertelt(vec, i, elt), i) = elt which instcombine (and others) most likely assumes. If we make the insertelt return poison (or even just undef, FWIW), that issue goes away.

Why implementation defined and not UB for the case where the index exceeds the runtime length? How do you intend to define this for SVE?

SVE uses a predicate for indexed inserts and extracts. We generate that predicate by comparing a splat of the index against a stepvector (0,1,2,3...); if the index is out of range then the predicate will be all false.

For a mov (insert), that results in an unmodified vector.

For a lastb (extract), that extracts the last lane in the vector if no predicate bits are true.

I don't know if RVV or SX-Aurora have similarly defined semantics. If the preference is to make it UB, that's fine.

I think this should be UB, or at least return poison. However, making it UB has the unfortunate side effect of making it illegal to hoist these operations out of conditionals (which currently isn't the case), so maybe poison is better.

For me the main reason for making is UB (aside from generally being conservative) is that element insert/extract can be usefully legalized by putting the vector into a stack slot and accessing the affected element with scalar loads/stores -- in fact, that's the default legalization strategy in trunk. But in that setting an out-of-bounds lane index would become a wild read/write (= clear-cut UB), unless the legalization code jumps through extra hoops to add a bounds check or clamp the lane index into range. This wouldn't affect SVE or RVV since they would implement custom lowering anyway, but for other targets (especially if we eventually generalize insert/extract to accept variable lane positions and do so for all vector types for consistency) it could become an unnecessary headache. Though if we make insertelement with out-of-range lane index return poison instead of being UB, then at minimum clamping in insertelement is necessary to avoid the wild write.

Aside: your proposed semantics for SVE would break the property extractelt(insertelt(vec, i, elt), i) = elt which instcombine (and others) most likely assumes. If we make the insertelt return poison (or even just undef, FWIW), that issue goes away.

I think that making the return be a poison value is best.

Why implementation defined and not UB for the case where the index exceeds the runtime length? How do you intend to define this for SVE?

SVE uses a predicate for indexed inserts and extracts. We generate that predicate by comparing a splat of the index against a stepvector (0,1,2,3...); if the index is out of range then the predicate will be all false.

For a mov (insert), that results in an unmodified vector.

For a lastb (extract), that extracts the last lane in the vector if no predicate bits are true.

I don't know if RVV or SX-Aurora have similarly defined semantics. If the preference is to make it UB, that's fine.

I think this should be UB, or at least return poison. However, making it UB has the unfortunate side effect of making it illegal to hoist these operations out of conditionals (which currently isn't the case), so maybe poison is better.

For me the main reason for making is UB (aside from generally being conservative) is that element insert/extract can be usefully legalized by putting the vector into a stack slot and accessing the affected element with scalar loads/stores -- in fact, that's the default legalization strategy in trunk. But in that setting an out-of-bounds lane index would become a wild read/write (= clear-cut UB), unless the legalization code jumps through extra hoops to add a bounds check or clamp the lane index into range. This wouldn't affect SVE or RVV since they would implement custom lowering anyway, but for other targets (especially if we eventually generalize insert/extract to accept variable lane positions and do so for all vector types for consistency) it could become an unnecessary headache. Though if we make insertelement with out-of-range lane index return poison instead of being UB, then at minimum clamping in insertelement is necessary to avoid the wild write.

Aside: your proposed semantics for SVE would break the property extractelt(insertelt(vec, i, elt), i) = elt which instcombine (and others) most likely assumes. If we make the insertelt return poison (or even just undef, FWIW), that issue goes away.

I think that making the return be a poison value is best.

Ok, will update again. Thanks.

Sorry to be late to the party, but I have a quick question:

Ok, so do we have agreement that constant literal indices should be limited to 0..n-1 for now, but non-constant indices can potentially exceed n so that expressions featuring vscale can be used?

What if we know the width is fixed on our target machine? Let's say it's fixed at 512b. So a full width scalable double vector would be:

<vscale x 2 x double>

Since our width is fixed, we know that vscale=4 here and there are 8 elements in this vector.

I'd like to be able to do a really fast insert at say index 3. I.e. insert_element(<vscale x 2 x double> res, double elt, int 3). Would that insert be as fast with the vscale method as it would be for an index < n-1?

Sorry to be late to the party, but I have a quick question:

Ok, so do we have agreement that constant literal indices should be limited to 0..n-1 for now, but non-constant indices can potentially exceed n so that expressions featuring vscale can be used?

What if we know the width is fixed on our target machine? Let's say it's fixed at 512b. So a full width scalable double vector would be:

<vscale x 2 x double>

Since our width is fixed, we know that vscale=4 here and there are 8 elements in this vector.

I'd like to be able to do a really fast insert at say index 3. I.e. insert_element(<vscale x 2 x double> res, double elt, int 3). Would that insert be as fast with the vscale method as it would be for an index < n-1?

If you know the exact size of your vectors at compile time, then I believe fixed length vector types should be used, at least at the IR level.

The performance of insert/extract is target dependent. For SVE you almost always need a predicate for a single element insert or extract, so you're not going to gain anything by knowing the size ahead of time.

If you know the exact size of your vectors at compile time, then I believe fixed length vector types should be used, at least at the IR level.

Ah, ok. So let's say we're targeting SVE and I know my vector length is 512b. I was imagining building a vector like this:

%r1 = insertelement <scalable 2 x double> undef, double %x, i32 0  
%r2 = insertelement <scalable 2 x double > %r1, double %x1, i32 1
%r3 = insertelement <scalable 2 x double > %r2, double %x2, i32 2
%r4 = insertelement <scalable 2 x double > %r3, double %x3, i32 3
%r5 = insertelement <scalable 2 x double > %r4, double %x4, i32 4
%r6 = insertelement <scalable 2 x double > %r5, double %x5, i32 5
%r7 = insertelement <scalable 2 x double > %r6, double %x6, i32 6
%r8 = insertelement <scalable 2 x double > %r7, double %x7, i32 7

But it sounds like I should rather just build a <8 x double> vector instead. Would that <8 x double> vector legalize to SVE vector instructions? Or would it be split into four 128b NEON vectors?

The performance of insert/extract is target dependent. For SVE you almost always need a predicate for a single element insert or extract, so you're not going to gain anything by knowing the size ahead of time.

So back to my scalable vector example above, it sounds like even if the above IR was valid, we'd still have to produce predicates at the hardware instruction level. So maybe my concern is moot...

If you know the exact size of your vectors at compile time, then I believe fixed length vector types should be used, at least at the IR level.

Ah, ok. So let's say we're targeting SVE and I know my vector length is 512b. I was imagining building a vector like this:

%r1 = insertelement <scalable 2 x double> undef, double %x, i32 0  
%r2 = insertelement <scalable 2 x double > %r1, double %x1, i32 1
%r3 = insertelement <scalable 2 x double > %r2, double %x2, i32 2
%r4 = insertelement <scalable 2 x double > %r3, double %x3, i32 3
%r5 = insertelement <scalable 2 x double > %r4, double %x4, i32 4
%r6 = insertelement <scalable 2 x double > %r5, double %x5, i32 5
%r7 = insertelement <scalable 2 x double > %r6, double %x6, i32 6
%r8 = insertelement <scalable 2 x double > %r7, double %x7, i32 7

So you could do that (judging by the rest of the discussion), but you'd have poison values (effectively) if you exceeded the runtime length. I guess if you're treating scalable vectors as pseudo-fixed-length then you don't care about using the same binary on different hardware.

But it sounds like I should rather just build a <8 x double> vector instead. Would that <8 x double> vector legalize to SVE vector instructions? Or would it be split into four 128b NEON vectors?

In the current code, you'd get 4 NEON vectors. In future we'd implement fixed length SVE support as well (for SLP autovec without introducing extra predicate generation/branching), but the recommended method of loop autovec for SVE would be VLA. For your own work you'd be able to use fixed-length types then, but we're still figuring out the design (to minimize the number of ISel patterns).

The performance of insert/extract is target dependent. For SVE you almost always need a predicate for a single element insert or extract, so you're not going to gain anything by knowing the size ahead of time.

So back to my scalable vector example above, it sounds like even if the above IR was valid, we'd still have to produce predicates at the hardware instruction level. So maybe my concern is moot...

Yes, though if the index is known to be within a certain range (5bit signed immediate), then you can skip generating a splat and just compare against the stepvector directly; so for your 512b example cpu, you'd be able to generate one fewer instruction when indexing 32b or 64b elements. If you're building up an entire vector (as in your IR), the insr instruction will shift all elements along by one and insert into the first lane, so no predicates would be needed -- but we would need to pattern match and optimize for this case.

If you know the exact size of your vectors at compile time, then I believe fixed length vector types should be used, at least at the IR level.

Ah, ok. So let's say we're targeting SVE and I know my vector length is 512b. I was imagining building a vector like this:

%r1 = insertelement <scalable 2 x double> undef, double %x, i32 0  
%r2 = insertelement <scalable 2 x double > %r1, double %x1, i32 1
%r3 = insertelement <scalable 2 x double > %r2, double %x2, i32 2
%r4 = insertelement <scalable 2 x double > %r3, double %x3, i32 3
%r5 = insertelement <scalable 2 x double > %r4, double %x4, i32 4
%r6 = insertelement <scalable 2 x double > %r5, double %x5, i32 5
%r7 = insertelement <scalable 2 x double > %r6, double %x6, i32 6
%r8 = insertelement <scalable 2 x double > %r7, double %x7, i32 7

So you could do that (judging by the rest of the discussion), but you'd have poison values (effectively) if you exceeded the runtime length. I guess if you're treating scalable vectors as pseudo-fixed-length then you don't care about using the same binary on different hardware.

But it sounds like I should rather just build a <8 x double> vector instead. Would that <8 x double> vector legalize to SVE vector instructions? Or would it be split into four 128b NEON vectors?

In the current code, you'd get 4 NEON vectors. In future we'd implement fixed length SVE support as well (for SLP autovec without introducing extra predicate generation/branching), but the recommended method of loop autovec for SVE would be VLA. For your own work you'd be able to use fixed-length types then, but we're still figuring out the design (to minimize the number of ISel patterns).

That's great.

The performance of insert/extract is target dependent. For SVE you almost always need a predicate for a single element insert or extract, so you're not going to gain anything by knowing the size ahead of time.

So back to my scalable vector example above, it sounds like even if the above IR was valid, we'd still have to produce predicates at the hardware instruction level. So maybe my concern is moot...

Yes, though if the index is known to be within a certain range (5bit signed immediate), then you can skip generating a splat and just compare against the stepvector directly; so for your 512b example cpu, you'd be able to generate one fewer instruction when indexing 32b or 64b elements. If you're building up an entire vector (as in your IR), the insr instruction will shift all elements along by one and insert into the first lane, so no predicates would be needed -- but we would need to pattern match and optimize for this case.

Ok. Thanks for the clarification.

What's the status of this? It seems like discussion has died down a bit. I think Graham's idea to change from <scalable 2 x float> to <vscale x 2 x float> will make the IR more readable/understandable but it's not a show-stopper for me.

Are there any other outstanding issues to address before this lands?

What's the status of this? It seems like discussion has died down a bit. I think Graham's idea to change from <scalable 2 x float> to <vscale x 2 x float> will make the IR more readable/understandable but it's not a show-stopper for me.

Are there any other outstanding issues to address before this lands?

I think @huntergr still needs to update the insertelement/extractelement doc. For me, the last remaining thing is the definition of shufflevector behavior that his last revision did not address. Do we allow scalable vector? Do we allow anything other than zeroinitializer mask? I'm fine not allowing for the time being or restricting to zeroinitializer mask for the time being. We still need to write it down since generally supporting shufflevector of scalable vectors is outside the scope of this patch.

What's the status of this? It seems like discussion has died down a bit. I think Graham's idea to change from <scalable 2 x float> to <vscale x 2 x float> will make the IR more readable/understandable but it's not a show-stopper for me.

Are there any other outstanding issues to address before this lands?

I think @huntergr still needs to update the insertelement/extractelement doc. For me, the last remaining thing is the definition of shufflevector behavior that his last revision did not address. Do we allow scalable vector? Do we allow anything other than zeroinitializer mask? I'm fine not allowing for the time being or restricting to zeroinitializer mask for the time being. We still need to write it down since generally supporting shufflevector of scalable vectors is outside the scope of this patch.

+1 - Once we have these documentation updates, I think that we'll be good to go.

huntergr updated this revision to Diff 198986.May 10 2019, 12:05 AM
  • Changed (extract|insert)element semantics to return poison values at runtime if the index exceeds hardware vector length.
  • Changed shufflevector semantics to note that only zeroinitializer and undef can be used as mask values for scalable vectors for now.

(Sorry about the delay in updating, I've been out at a conference this week)

What's the status of this? It seems like discussion has died down a bit. I think Graham's idea to change from <scalable 2 x float> to <vscale x 2 x float> will make the IR more readable/understandable but it's not a show-stopper for me.

While I don't want to hold up this patch, a name change to the type will be difficult to get through once the type is in and the terminology has settled, so it may be worth getting it right from the start. Personally I think <vscale x 16 x i8> is more explicit (and therefore easier to understand for those new to the type) than <scalable 16 x i8>. And now that we all have a better understanding and clear definition of scalable types, I wonder if <n x 16 x i8> instead of <vscale x 16 x i8> is worth considering again, since it is much easier to read in tests.

Other than that, great to see this discussion on scalable types converging and so close to agreement!

I wonder if <n x 16 x i8> instead of <vscale x 16 x i8> is worth considering again, since it is much easier to read in tests.

I kind of like <n x 16 x i8>. It's concise. I don't think vscale really adds a lot of value, but it does eat up characters on a line.

I know very well how annoying it can be to read and write (and say) the scalable prefix all the time and wish for something shorter sometimes, but I also prefer <vscale x ...> for the reasons Sander gave. I'll add that <vscale x 4 x i32> feels a bit lighter than <scalable 4 x i32> even though it's the same number of characters (maybe because there's more whitespace?).

The <n x ...> syntax is shorter but doesn't have the mnemonic aspect and also clashes with the pre-existing use of "n" as a metavariable standing for some fixed vector length (as in, <N x i1> for example), so I'd rather have <scalable ...> if those are the two options.

greened added a comment.EditedMay 13 2019, 11:47 AM

I know very well how annoying it can be to read and write (and say) the scalable prefix all the time and wish for something shorter sometimes, but I also prefer <vscale x ...> for the reasons Sander gave. I'll add that <vscale x 4 x i32> feels a bit lighter than <scalable 4 x i32> even though it's the same number of characters (maybe because there's more whitespace?).

The <n x ...> syntax is shorter but doesn't have the mnemonic aspect and also clashes with the pre-existing use of "n" as a metavariable standing for some fixed vector length (as in, <N x i1> for example), so I'd rather have <scalable ...> if those are the two options.

+1. I like vscale over n because it directly references a term documented in the IR with this patch.

It seems like there's enough support for changing to <vscale x as the prefix, so I'll revise the patch.

huntergr updated this revision to Diff 200725.May 22 2019, 7:00 AM
  • Changed textual IR format to <vscale x n x <ty>>
hfinkel added inline comments.May 22 2019, 4:25 PM
docs/LangRef.rst
8222 ↗(On Diff #200725)

I believe that all you need to say is:

For a scalable vector, if the value of ``idx`` exceeds the runtime length of the vector, the result is a poison value.

This part about the "result in the IR" is confusing, and is just stating a basic consistency fact about LLVM's type system. I recommend removing that.

huntergr updated this revision to Diff 200933.May 23 2019, 5:09 AM
  • Simplified wording for extract/insertelement semantics
huntergr marked an inline comment as done.May 23 2019, 5:10 AM
hfinkel accepted this revision.May 23 2019, 8:40 AM
  • Simplified wording for extract/insertelement semantics

Thanks. LGTM.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptMay 29 2019, 5:20 AM
Herald added a subscriber: kristina. · View Herald Transcript
thakis added a subscriber: thakis.Jun 9 2019, 12:24 PM

Hi, this slowed down thinlto links 3-4x, and other things probably too. PR42210 has details. I've reverted this for now in r362913.

Hi, this slowed down thinlto links 3-4x, and other things probably too. PR42210 has details. I've reverted this for now in r362913.

Ok, thanks. I'll investigate -- I guess having a persistent map for the duration of verification might resolve it.

@huntergr do you have an account on bugzilla? I couldn't CC you on that bug.

@huntergr do you have an account on bugzilla? I couldn't CC you on that bug.

No. I'll sign up for one.

Ka-Ka added a subscriber: Ka-Ka.Jun 11 2019, 6:46 AM
Herald added a project: Restricted Project. · View Herald Transcript