Page MenuHomePhabricator

[IR] Introduce llvm.experimental.vector.splice intrinsic
Needs ReviewPublic

Authored by c-rhodes on Jan 14 2021, 12:18 PM.

Details

Summary

This patch introduces a new intrinsic @llvm.experimental.vector.splice
that constructs a vector of the same type as the two input vectors,
based on a immediate where the sign of the immediate distinguishes two
variants. A positive immediate specifies an index into the first vector
and a negative immediate specifies the number of trailing elements to
extract from the first vector.

For example:

@llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, 1) ==> <B, C, D, E>  ; index
@llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, -3) ==> <B, C, D, E> ; trailing element count

These intrinsics support both fixed and scalable vectors, where the
former is lowered to a shufflevector to maintain existing behaviour,
although while marked as experimental the recommended way to express
this operation for fixed-width vectors is to use shufflevector. For
scalable vectors where it is not possible to express a shufflevector
mask for this operation, a new ISD node has been implemented.

This is one of the named shufflevector intrinsics proposed on the
mailing-list in the RFC at [1].

Patch by Paul Walker and Cullen Rhodes.

[1] https://lists.llvm.org/pipermail/llvm-dev/2020-November/146864.html

Diff Detail

Event Timeline

c-rhodes created this revision.Jan 14 2021, 12:18 PM
c-rhodes requested review of this revision.Jan 14 2021, 12:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 14 2021, 12:18 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1109

Is this change required for the splice patch? I wonder if it should be broken out to its own commit.

c-rhodes added inline comments.Jan 15 2021, 4:19 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1109

Is this change required for the splice patch? I wonder if it should be broken out to its own commit.

This fixes a selection error I hit when splicing predicates. The promotion of predicates uses a truncate which gets lowered to a setcc that crashes on operands of type VT, e.g.:

  t17: nxv2i64 = and t29, t30
  t31: nxv2i64 = AArch64ISD::DUP Constant:i64<0>
t19: nxv2i1 = setcc t17, t31, setne:ch

I can split this into a separate patch.

In D94444, @paulwalker-arm proposed a more generic extract vector intrinsic that accepts an index and stride. Now I'm wondering if we should just have a generic scalable shuffle vector intrinsic to handle all these operations under one intrinsic.

That idea doesn't need to hold up this Diff, but it might be something to consider...

craig.topper added inline comments.Jan 15 2021, 4:33 PM
llvm/docs/LangRef.rst
16236

Does this mention that %trailing.elt must be a constant?

llvm/include/llvm/IR/Intrinsics.td
1657

Should this have DefaultAttrs?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1274

Can you use setOperationPromotedToType here?

In D94444, @paulwalker-arm proposed a more generic extract vector intrinsic that accepts an index and stride. Now I'm wondering if we should just have a generic scalable shuffle vector intrinsic to handle all these operations under one intrinsic.

That idea doesn't need to hold up this Diff, but it might be something to consider...

I don't believe that can work for the same reasons having a single LLVM instruction does not work. Each shuffle has different requirements. Some require one operand whilst other require two and some require additional scalar operands when others do not.

paulwalker-arm added inline comments.Jan 16 2021, 4:18 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.h
903 ↗(On Diff #316728)

It looks like you're missing the implementation of this function and some associated isel patterns, which explains the poor code generation for the SVE _1 test variants. Or are you planning to add those under a second patch, in which case this function declaration should be removed from this patch.

c-rhodes added inline comments.Jan 16 2021, 4:27 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.h
903 ↗(On Diff #316728)

It looks like you're missing the implementation of this function and some associated isel patterns, which explains the poor code generation for the SVE _1 test variants. Or are you planning to add those under a second patch, in which case this function declaration should be removed from this patch.

The plan is to upstream those patterns separately, they were initially part of this patch but they depended on some changes we have downstream to use the SIMD variant of INSR when the scalar argument comes from a vector extract. It looks like I missed this declaration, I'll remove it.

Matt added a subscriber: Matt.Jan 19 2021, 9:14 AM
c-rhodes updated this revision to Diff 317860.Jan 20 2021, 6:25 AM

Changes:

  • Remove unused function declaration for LowerVECTOR_SPLICE.
  • Clarify trailing.elts must be an integer constant in LangRef.
  • Use DefaultAttrs for intrinsic.
  • Use setOperationPromotedToType.
c-rhodes marked 3 inline comments as done.Jan 20 2021, 6:39 AM
c-rhodes added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1109

Is this change required for the splice patch? I wonder if it should be broken out to its own commit.

This fixes a selection error I hit when splicing predicates. The promotion of predicates uses a truncate which gets lowered to a setcc that crashes on operands of type VT, e.g.:

  t17: nxv2i64 = and t29, t30
  t31: nxv2i64 = AArch64ISD::DUP Constant:i64<0>
t19: nxv2i1 = setcc t17, t31, setne:ch

I can split this into a separate patch.

I tried splitting this out but I couldn't figure out how to test it so I've left it in this patch. What I found was the setcc being generated by the splice occurs after result type legalization, so without the above a setcc returning an nxv2i1 and taking two nxv2i64 vectors is considered legal at this point in selection. With the above it falls into SelectionDAGLegalize::LegalizeOp which considers the legality based on the operand VTs, so setcc then gets custom lowered to the target-specific predicated variant AArch64ISD::SETCC_MERGE_ZERO that can be selected.

1274

Can you use setOperationPromotedToType here?

That's nice, wasn't aware of that thanks.

fhahn added a comment.Jan 20 2021, 6:41 AM

Are there plans to use this for fixed vectors as well? If not, can this be restricted to scalable vectors only? It seems like for fixed vectors, preferring the shufflevector versions would be beneficial, to avoid having to update all transforms lookin at shuffles.

Otherwise, this should probably have some tests for platforms other than AArch64 and also support in GlobalISel, which is the default on AArch64 with -O0 IIRC (or do the transform to shuffles as an IR transform).

sdesmalen added inline comments.Jan 20 2021, 7:15 AM
llvm/docs/LangRef.rst
16266

I expect that we want to support two flavours of the splice intrinsic:

// in flavour1 an immediate of '3' translates to the number of trailing elements,
// e.g. start index of VL - 3 == 'B'.
experimental.vector.splice.flavour1(<A,B,C,D>, <E,F,G,H>, 3) ==> <B, C, D, E>

// in flavour2 an immediate of '1' translates to the start index 1 == 'B'.
experimental.vector.splice.flavour2(<A,B,C,D>, <E,F,G,H>, 1) ==> <B, C, D, E>

This patch implements only flavour1, and because trailing.elts should be an immediate it is not possible to express flavour2. Do we want to add flavour2 as well at some point? And if so, what would it be named?

Alternatively, it seems possible to express both with the current splice intrinsic where the sign of the immediate distinguishes the two flavours, e.g. a start index of -3 would 'wrap' by VL to index 1, and thus have the same meaning as 3 in flavour1, whereas a positive immediate of 1 would mean index 1, as in flavour 2).

c-rhodes marked an inline comment as done.Jan 20 2021, 7:18 AM

Are there plans to use this for fixed vectors as well? If not, can this be restricted to scalable vectors only? It seems like for fixed vectors, preferring the shufflevector versions would be beneficial, to avoid having to update all transforms lookin at shuffles.

Otherwise, this should probably have some tests for platforms other than AArch64 and also support in GlobalISel, which is the default on AArch64 with -O0 IIRC (or do the transform to shuffles as an IR transform).

The named shufflevector intrinsics accept both fixed and scalable vectors but for fixed they map to existing shufflevector. That's a good point thanks for raising that, converting to a shufflevector as an IR transform like the insert/extract do is probably the right thing to do here.

llvm/docs/LangRef.rst
16266

Overloading the usage based on the sign of the immediate sounds like a worth upgrade to me.

16268–16269

This restriction does not make sense to me. The logical requirement is for the trailing elements to be less than or equal to the full runtime-time vector length. Then I'd say that counts beyond this range result in poison.

fhahn added a comment.Jan 21 2021, 8:38 AM

Are there plans to use this for fixed vectors as well? If not, can this be restricted to scalable vectors only? It seems like for fixed vectors, preferring the shufflevector versions would be beneficial, to avoid having to update all transforms lookin at shuffles.

Otherwise, this should probably have some tests for platforms other than AArch64 and also support in GlobalISel, which is the default on AArch64 with -O0 IIRC (or do the transform to shuffles as an IR transform).

The named shufflevector intrinsics accept both fixed and scalable vectors but for fixed they map to existing shufflevector.

As specified here yes, but I was wondering if they actually need to support fixed-width vectors? Is there a reason to use the intrinsics with fixed vectors instead of shuffles?

vkmr added a subscriber: vkmr.Jan 27 2021, 2:27 AM

FWIW, we have a similar intrinsic in our downstream compiler for RISC-V. We call it experimental.vector.slideleftfill and is same as flavour 2 of this (except the "offset" need not be an immediate) as suggested by @sdesmalen. We have a separate predicated version (experimental.vector.vp.slideleftfill) too which takes additional arguments for the explicit vector length of the two vectors.

llvm/docs/LangRef.rst
16267

Why is it required for the trainling.elts to be an immediate? I imagine there could be a scenario where this value is unknown at compile-time, for instance a recurrence where the order of recurrence is determined at runtime.

paulwalker-arm added inline comments.Jan 27 2021, 4:04 AM
llvm/docs/LangRef.rst
16267

At this stage we don't want to muddy the waters by introducing new shuffle behaviours to LLVM, but rather just extend the existing shuffle requirements to cover scalable vectors. Today shufflevector only supports a literal mask and thus at this stage we are enforcing this same requirement to the intrinsic variants.

c-rhodes updated this revision to Diff 324619.Thu, Feb 18, 7:13 AM
c-rhodes edited the summary of this revision. (Show Details)

Changes:

  • As proposed by @sdesmalen, the intrinsic now supports two variants based on the sign of the immediate, where a negative immediate is a trailing element count and a positive immediate an index.
  • Following the discussion on the reverse intrinsic (D94883) around whether named shufflevector intrinsics should support fixed vectors, I've opted for the same approach to make it explicit in the LangRef that whilst these instructions are experimental shufflevector should be used for fixed-width vectors. The changes to InstCombineCalls to map this intrinsic to shufflevector as an IR transform have been dropped.
  • ExpandVectorSpliceThroughStack has been moved to TLI under expandVectorSplice which is reused by DAGTypeLegalizer::SplitVecRes_VECTOR_SPLICE.
c-rhodes marked 2 inline comments as done.Thu, Feb 18, 7:18 AM

Are there plans to use this for fixed vectors as well? If not, can this be restricted to scalable vectors only? It seems like for fixed vectors, preferring the shufflevector versions would be beneficial, to avoid having to update all transforms lookin at shuffles.

Otherwise, this should probably have some tests for platforms other than AArch64 and also support in GlobalISel, which is the default on AArch64 with -O0 IIRC (or do the transform to shuffles as an IR transform).

The named shufflevector intrinsics accept both fixed and scalable vectors but for fixed they map to existing shufflevector.

As specified here yes, but I was wondering if they actually need to support fixed-width vectors? Is there a reason to use the intrinsics with fixed vectors instead of shuffles?

I noticed there was a similar discussion on D94883, as mentioned I've took the same approach and made it explicit in the LangRef that shufflevector should be used instead for this intrinsic whilst it's experimental.

llvm/docs/LangRef.rst
16268–16269

This restriction does not make sense to me. The logical requirement is for the trailing elements to be less than or equal to the full runtime-time vector length. Then I'd say that counts beyond this range result in poison.

My mistake, you're right it should be the full runtime time VL rather than minimum number of elements. LangRef has been updated.

Just a few nits, will have a closer look at the patch later.

llvm/docs/LangRef.rst
16278–16284

nit:
The signed immediate, modulo the number of elements in the vector, is the index into the first vector from which to extract the result value.
This means conceptually that for a positive immediate, a vector is extracted from concat(%vec1, %vec2) starting at index imm, whereas for a negative immediate, it extracts imm trailing elements from the first vector, and the remaining elements from %vec2.

16287

nit: insert comma.

16305

It's better to say that if the immediate value is outside this range, the result is a poison value. That leaves it up to the implementation how to handle it.

16305

where VL is the runtime vector length of the source/result vector.

sdesmalen added inline comments.Fri, Feb 19, 6:44 AM
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
5524–5531

is there a way to split the load using existing code?

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
10911

just Imm % NumElts; ?

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
8657

You'll need to do a similar clamping for Imm > 0 (or actually, Imm > MinKnownVL) , to avoid loading beyond the allocated stack object.

llvm/test/CodeGen/AArch64/named-vector-shuffles-neon.ll
20

nit: not sure if there is a lot of value to do this for each possible element-type, if you do two (i8 and some other type) that may be sufficient.

c-rhodes updated this revision to Diff 324983.Fri, Feb 19, 7:49 AM

Address comments

c-rhodes marked 6 inline comments as done.Fri, Feb 19, 7:51 AM
c-rhodes added inline comments.
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
5524–5531

is there a way to split the load using existing code?

Splitting the full load from expandVectorSplice with SplitVecRes_LOAD seems to work.

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
10911

just Imm % NumElts; ?

I tried that at first but found the behaviour isn't correct for negative immediates, for example a trailing element count of -15 and 16 elements -15 % 16 = -15, so end up with this mask: [-15,-14,-13,-12,-11,-10,-9,-8,-7,-6,-5,-4,-3,-2,-1,0]. From what I read the sign of the modulo result is implementation defined if one or both of the operands are are negative.

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
8657

You'll need to do a similar clamping for Imm > 0 (or actually, Imm > MinKnownVL) , to avoid loading beyond the allocated stack object.

For the index getVectorElementPointer does the clamping via clampDynamicVectorIndex

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
940

This doesn't look great to me because it ties the hands of expandVectorSplice, whose name does not suggest it must return a load. If you prefer not to explicitly create the split loads then it seems more natural, and future proof, if the result of expandVectorSplice is split using a pair of EXTRACT_SUBVECTOR operations.

bin.cheng-ali added inline comments.
llvm/docs/LangRef.rst
16285

One nit. Given `imm` is negative here, should this be?
"it extracts `-imm` trailing elements from the first vector, ..."

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
10911

Then (NumElts + Imm) % NumElts ?

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
8623

Sorry for may stupid question, this stores two vector register and load in the middle, does Endingness matters here? How does LLVM make sure correct sequence of elements is loaded? Thanks in advance.

paulwalker-arm added inline comments.Mon, Feb 22, 3:27 AM
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
8623

There's no such thing as a stupid question :)

These are load/store operations on vector types so endianness only plays a role in how each element is stored and not in the order of the elements. That's to say, element N will be at the same location in memory regardless of endianness, however, the bytes that make up element N will be laid out differently. The index used to splice the vectors is also element based, which means there are no partial element accesses and thus no endianness issues.

c-rhodes updated this revision to Diff 325438.Mon, Feb 22, 7:08 AM
c-rhodes marked an inline comment as done.

Address comments

c-rhodes marked 3 inline comments as done.Mon, Feb 22, 7:11 AM
c-rhodes added inline comments.Mon, Feb 22, 7:17 AM
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
940

This doesn't look great to me because it ties the hands of expandVectorSplice, whose name does not suggest it must return a load. If you prefer not to explicitly create the split loads then it seems more natural, and future proof, if the result of expandVectorSplice is split using a pair of EXTRACT_SUBVECTOR operations.

Fair point, I've used EXTRACT_SUBVECTOR instead. I should point out warnings were being generated for the nxv16f32 splitvec tests by the getVectorNumElements call in DAGTypeLegalizer::SplitVecRes_EXTRACT_SUBVECTOR. To remove the warnings I changed it to getVectorMinNumElements.

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
8623

Thanks very much for explanation. I assume this is defined in Arm architecture reference manual, however, I noticed there is supplement rule for SVE as:
For unpredicated SVE vector register loads and stores, the vector is treated as containing byte elements that
are transferred in increasing element number order without any endianness conversion.

IIUC, this rule would apply here? Of course no endianness issue either way.

Just a few more nits.

llvm/docs/LangRef.rst
16307

nit: s/, for/. For/

llvm/include/llvm/CodeGen/ISDOpcodes.h
565

nit: s/T is [..] and//

565

nit: s/out-of-bounds/not within the range [-VL, VL)/

566–567

Please remove this restriction for scalable vectors. The implementation (by clamping to avoid a runtime crash) should not dictate the specification. It is sufficient to say that if IMM is out of range that the result value is undefined.

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
8644

nit: because the Imm < 0 block is a lot bigger, maybe you can rewrite it as:

if (Imm >= 0) {
  // Load back the required element.
  StackPtr = getVectorElementPointer(DAG, StackPtr, VT, Node->getOperand(2));
  // Load the spliced result
  return DAG.getLoad(VT, DL, StoreV2, StackPtr,
                     MachinePointerInfo::getUnknownStack(MF));
}
// Handle Imm < 0 case here.
8657

Thanks for confirming. Can you just add a comment that clarifies this?