This is an archive of the discontinued LLVM Phabricator instance.

[Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types.
ClosedPublic

Authored by dtemirbulatov on Jun 5 2023, 3:29 PM.

Details

Summary

Here we enable some of shuffle lowering with TBL instruction with SVE and SVE2 for indexing for one register and TBL version for SVE2 while indexing to both registers.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Herald added a project: Restricted Project. · View Herald TranscriptJun 5 2023, 3:29 PM
dtemirbulatov requested review of this revision.Jun 5 2023, 3:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 5 2023, 3:29 PM

Hi @dtemirbulatov, I only had a glance so haven't reviewed this properly yet, but already left you a few comments and questions.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26054

This could do with a code comment describing why we don't do this when we don't know the exact SVE size.

I think there are several cases to distinguish here:

  • Elements are selected from either the first source vector, or from the second, but not both. This can be implemented with SVE and when the SVE size isn't known.
  • Elements are selected from both source vectors.
    • If we only have SVE, then we can mimic this with multiple (single-source) TBL instructions and selects, but I suspect there's little value in handling this, given that SVE2 is the prevalent feature going forward.
    • If we don't know the size, then we need to do more work to update the indices for the TBL. For example:
<2 x i64> %x, <2 x i64> %y, <4 x i32> <0, 1, 2, 3>

if SVE VL = 256, then we'd basically insert the <2 x i64> vector value into a conceptual <4 x i64> register where only the first two lanes are relevant, so the offsets must be: <0, 1, 4, 5>
if SVE VL = 512, then we'd basically insert the <2 x i64> vector value into a conceptual <8 x i64> register where only the first two lanes are relevant, so the offsets must be: <0, 1, 8, 9>

So we'd somehow need to do something like this:

<0, 1, 2, 3> + <0, 0, VL, VL>
26055

nit: tab?

26057

It's not really clear to me what Swap and IsUndefOrZero are supposed to do. I would have expected it to look at the indices rather than the source vector operands. Can you add a comment explaining this logic?

26100–26111

nit: what about this?

llvm/test/CodeGen/AArch64/sve-fixed-length-permute-rev.ll
396–397

Seems like these changes can be committed separately.

llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle.ll
945 ↗(On Diff #528595)

This needs some check lines for the constant.

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
392 ↗(On Diff #528595)

Can you make it somehow clear form the name of the function, or by inlining the attribute here (e.g. @shuffle_v4i16_tbl2(ptr %a, ptr %b) "target-features"="+sve2" {) that this relates to SVE2?

Matt added a subscriber: Matt.Jun 6 2023, 10:20 AM
dtemirbulatov retitled this revision from WIP: [Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowring for fixed vector types. to WIP: [Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types..Jun 30 2023, 4:59 AM
dtemirbulatov edited the summary of this revision. (Show Details)
dtemirbulatov marked 2 inline comments as done.Jul 18 2023, 4:57 AM
dtemirbulatov added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26057

Ok, I will comment for those variables. Also, for example. If the mask is refers to only one operand then second operand would be already undefined at the time of lowering. I think we don't have to check the mask here.

dtemirbulatov marked 5 inline comments as done.

Rebased addressed comments.

dtemirbulatov retitled this revision from WIP: [Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types. to [Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types..Jul 18 2023, 12:33 PM

Hi Dinar,
I re-wrote your code in the Lowering to something I could understand better.
I will still spend some time trying to understand the mask logic and see if the new scalable vector is correct.
But I believe it is good to share my comments for now

+  // Avoid producing TBL and TBL2 instructions if we don't know
+  // SVE register size or minimal is unequal to maximum size.
+  if (!MaxSVESize || MinSVESize != MaxSVESize)
+    return SDValue();

-  return SDValue();
+  // Element type is ???
+  unsigned BitsPerElt = VT.getVectorElementType().getSizeInBits();
+  if (BitsPerElt != 8 && BitsPerElt != 16 && BitsPerElt != 32 &&
+      BitsPerElt != 64)
+    return SDValue();
+
+  bool Swap = false;
+  if (Op1.isUndef() || isZerosVector(Op1.getNode())) {
+    // Swapped shuffle operands, but we have to recalculate indexes afterwards.
+    std::swap(Op1, Op2);
+    Swap = true;
+  }
+
+
+  bool IsUndefOrZero = Op2.isUndef() || isZerosVector(Op2.getNode());
+  unsigned NumElts = VT.getVectorNumElements();
+  unsigned IndexLen = MinSVESize / BitsPerElt;
+  unsigned FillElements = IndexLen - NumElts;
+
+  // Set the Mask with the correct indexes
+  SmallVector<SDValue, 8> TBLMask;
+  for (int Val : ShuffleMask) {
+    unsigned Offset = Val;
+    if (Swap)
+      Offset = Offset < NumElts ? Offset + NumElts : Offset - NumElts;
+    // why we need to set the mask to 255 when one of the vectors is zero/undef?
+    else if(Offset >= NumElts){
+       if(IsUndefOrZero)
+           Offset = 255;
+       else if (IndexLen > NumElts)
+           Offset += IndexLen - NumElts;
+    }
+    TBLMask.push_back(DAG.getConstant(Offset, DL, MVT::i64));
+  }
+  // Filling the rest of the mask with 255
+  for (unsigned i = 0; i < FillElements; ++i)
+    TBLMask.push_back(DAG.getConstant(255, DL, MVT::i64));
+
+  // Building a scalable vector for the suffle. So we can use the sve instructions
+  EVT MaskEltType = EVT::getIntegerVT(*DAG.getContext(), BitsPerElt);
+  EVT MaskType = EVT::getVectorVT(*DAG.getContext(), MaskEltType, IndexLen);
+  EVT MaskContainerVT = getContainerForFixedLengthVector(DAG, MaskType);
+  SDValue VecMask =
+      DAG.getBuildVector(MaskType, DL, ArrayRef(TBLMask.data(), IndexLen));
+  SDValue SVEMask = convertToScalableVector(DAG, MaskContainerVT, VecMask);
+
+  SDValue Shuffle;
+  if (IsUndefOrZero || Swap) {
+    Shuffle = convertFromScalableVector(
+        DAG, VT,
+        DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,
+                    DAG.getConstant(Intrinsic::aarch64_sve_tbl, DL, MVT::i32),
+                    Op1, SVEMask));
+  } else {
+    if (Subtarget->hasSVE2())
+      Shuffle = convertFromScalableVector(
+          DAG, VT,
+          DAG.getNode(
+              ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,
+              DAG.getConstant(Intrinsic::aarch64_sve_tbl2, DL, MVT::i32), Op1,
+              Op2, SVEMask));
+    else
+      return SDValue();
+  }
+  return DAG.getNode(ISD::BITCAST, DL, Op.getValueType(), Shuffle);
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26071–26072

Why do we need to test this? What other value it can have?
I thought the only possible values were this 8/16/32/64. Is it possible to have 128, 1024? Are these values we are avoiding?

26084

For me this tests is a bit odd.
FillElements is unsigned and not a bool. So are you checking if IndexLen> NumElts?
Is it too difficult to put and example for each case with the fixed mask. So we can see how the mask was and how it has changed.
if (Swap)
for Offset < NumElts ->mask = <0,0,7,7> -> <3,3,0,0>
for Offset > NumElts ->mask = <0,0,7,7>

Maybe this will help to understand if there is any hidden problem with these ifs.

I re-wrote that to something like:

// Set the Mask with the correct indexes
SmallVector<SDValue, 8> TBLMask;
for (int Val : ShuffleMask) {
  unsigned Offset = Val;
  if (Swap)
    Offset = Offset < NumElts ? Offset + NumElts : Offset - NumElts;
  // why we need to set the mask to 255 when one of the vectors is zero/undef?
  else if(Offset >= NumElts){
     if(IsUndefOrZero)
         Offset = 255;
     else if (IndexLen > NumElts)
         Offset += IndexLen - NumElts;
  }
  TBLMask.push_back(DAG.getConstant(Offset, DL, MVT::i64));
}
26088

I would suggest add a comment here.
You are populating the rest of the mask here with 255, why?
// Filling up the remaining positions of the mask with 255 because ...

26091

Are you here changing a fixed vector to a scalable vector to use tbl?

dtemirbulatov marked 3 inline comments as done.

Addressed comments, rebase.

dtemirbulatov added inline comments.Aug 1 2023, 2:02 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26054

I will address run-time index construction in the following patch.

26071–26072

yes, The element type chack was already done in useSVEForFixedLengthVectorVT() function.

26091

Right, as those instructions accept only scalable type operand and result.

dtemirbulatov marked 2 inline comments as done.Aug 1 2023, 3:54 AM
dtemirbulatov added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26084

Thanks for finding this, I think now that "IndexLen> NumElts" logic/transform was not correct.

sdesmalen added inline comments.Aug 1 2023, 7:49 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26057

There is an implicit assumption that a DAGCombine (or perhaps an InstCombine) has rewritten the second operand to be undef. It seems safer to look at the values in the Mask instead.

26073

I think you need to call this Op2IsUndefOrZero

26085–26089

What should happen if the vector type is <256 x i8> and IsUndefOrZero is false ?

26092–26093

I'm curious, why is that better?

Also, as my comment above suggests, for <256 x i8> vectors, 255 is just the last element of the first data-vector. There is also some test-coverage missing here.

26103

I think this condition can be removed? (if it has swapped, then the second vector must be zero or undef?)

26108

What is the type of VecMask here? How can the type of VecMask here (single-vector TBL instruction) be the same as for the case of the SVE2 two-vector TBL instruction?

26109–26110

nit:

} else if (Subtarget->hasSVE2()) {
   ...
} else
  return SDValue();

?

dtemirbulatov marked 4 inline comments as done.Aug 7 2023, 9:15 AM
dtemirbulatov added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26092–26093

I think that if we leave uninitialized remaining out of scope indexes then they will end up with 0 value and means extra work since 0 element almost always means 0 lane and with 255 value probably hardware could understand this is out of scope value, except <256 x i8> case which a it rarely than valid 0 index. The last available value for index here is 255 then it flips to 0.

26103

No, IsUndefOrZero tells us that only operand 0 is used. I will add a test and rename the variable.

26108

This is not correct, According to specification index size is similar for TBL, TBL2.

dtemirbulatov marked 6 inline comments as done.

Rebased, fixed remarks, added ability to avoid to produce tbl, tbl2 instruction if index is not representable index value more than 255.

Corrected sve-streaming-mode-fixed-length-vector-shuffle.ll test by removed unused parameters from functions.

sdesmalen added inline comments.Aug 14 2023, 7:02 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26054

It probably makes sense to move this functionality into its own function.

26067

You seem to have removed checks to ask if Op1 is poison/undef or zeroinitializer.

The way you've written the code now, Op1IsUndefOrZero should be renamed instead to Op1IsUnused. (same for Op2IsUnused)

26074

Is the condition && (Val < NumElts * 2) needed? This seems already covered by the if (NumElts > IndexLen) check above on line 25844.

26077

It's probably easier to update the ShuffleMask here. That simplifies the logic below as it avoids having to constantly distinguish between the Op*IsUndefOrZero cases.

26081

Hardcoding a check for '128' doesn't seem right to me. It all depends on what can be represented in the element type of the instruction, i.e. for i8's the max value is 255. That means when targeting vscale_range(16, 16), then 255 is a valid index into the first source vector.

26082

It doesn't seem right to look at the number of indices in the mask. I think rather it should be about the maximum index value in the mask?

26092–26093

While it may be rare, it must be correct for all cases. If the LLVM IR instruction requests an element from a zeroinitializer vector is selected, then it must make sure to do so. That means that 255 may not always be the correct value. Maybe it should just return SDValue() for that case, or explicitly materialize a zeroinitializer vector to select the value from.

26099

While not wrong, I don't really like '255' because it is a hard-coded value. For i8 elements, it may well be a valid index. For i16 or larger elements, there's no need to hard-code 255. Maybe just use '(1 << elementSize) - 1'?

26109

For a shufflevector that selects IndexLen elements from Op1/Op2, this code seems to be filling the first VL - IndexLen elements with some initializer (255 in this case). It seems odd to me that this loop starts at 0 and I doubt this has the desired effect.

I'm not sure to what extend it matters what value we use for 'undef' lanes. Can we just initialize the TBLMask with all numeric_limits<type>::max index value?

26124–26126

Please write this as:

} else if (Subtarget->hasSVE2()) {
  ...
} else
  return SDValue();
dtemirbulatov marked 5 inline comments as done.

Addressed comments.

dtemirbulatov added inline comments.Aug 17 2023, 6:03 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26074

yes, we need this check to understand that we just electing from operand 2 and operand 1 is unused.

26077

hmm, ShuffleMask is readonly variable.

26081

yes and for second source operand the index starts from 256, so that is what we are checking here is we have to address to both operands then it is going to be 255 + index and we already established that both operands involved.

26082

We already checked all values in the mask.

dtemirbulatov marked 2 inline comments as done.

Resolved remained remarks, added support for half sized mask, N-sized mask. if just minimum SVE known then allow SVE TBL with just one operand transform.

dtemirbulatov marked an inline comment as done.Aug 18 2023, 7:15 AM
dtemirbulatov added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26085–26089

I think

dtemirbulatov marked an inline comment as not done.Aug 18 2023, 7:17 AM
dtemirbulatov added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26085–26089

I think we allowing this lowering to only one operand supported with SVE1 TBL.

dtemirbulatov marked an inline comment as done.Aug 18 2023, 7:17 AM
sdesmalen added inline comments.Aug 21 2023, 8:27 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26054

Comment seems unaddressed.

26077

You can create a new ShuffleMask variable, which is writable.

26082

I think you can make this into an assert that NumElts <= IndexLen, because at the point of lowering the operation, the types will have been legalised in such a way that the number of lanes of the mask >= number of lanes of the source operand.

26083

There are two things not right about this:

  1. The contents of Op1 and Op2 shouldn't matter for the lowering of the shuffle, so there is no reason to bail out here. (If you remove this, does any test fail?)
  2. isZerosVector will never return true, because Op1 and Op2 are INSERT_SUBVECTOR nodes because they are assigned with convertToScalableVector(), and INSERT_SUBVECTOR is not considered in isZerosVector.
26095

What is the meaning of 'Op1IsUnused' after swapping Op1 and Op2? Should that be Op2IsUnused now?

To me it seems better to update the ShuffleMask directly and remove the need for Op*IsUnused altogether

26100

It's normally better to bail out early, e.g.

if (condition) {
  // lots of code
}

return SDValue();

->

if (!condition)
 return SDValue
// lots of code
26104

This condition and the comment above make little sense to me. The issue to solve is instead about the maximum *index* value in the shuffle mask. If the maximum value being indexed cannot be represented in an i8, then we cannot use TBL2 to lower this shuffle vector.

For example:

define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) {
  %op0 = load <256 x i8>, ptr %ptr0
  %op1 = load <256 x i8>, ptr %ptr1
  ; Cannot use TBL2 instruction because any offset > 255 cannot be represented in an i8!
  %res = shufflevector <256 x i8> %op0, <256 x i8> %op1, <16 x i32> <i32 0,i32 1,i32 2,i32 3,i32 4, i32 5, i32 6, i32 7, i32 256,i32 257, i32 258, i32 259, i32 260, i32 261, i32 262, i32 263>
  ret <16 x i8> %res
}

However, the code in the patch seems to accept this without an issue and generates wrong code with the wrong indices.

.LCPI0_0:
        .byte   0                               // 0x0
        .byte   1                               // 0x1
        .byte   2                               // 0x2
        .byte   3                               // 0x3
        .byte   4                               // 0x4
        .byte   5                               // 0x5
        .byte   6                               // 0x6
        .byte   7                               // 0x7
        .byte   16                              // 0x10
        .byte   33                              // 0x21
        .byte   34                              // 0x22
        .byte   35                              // 0x23
        .byte   36                              // 0x24
        .byte   37                              // 0x25
        .byte   38                              // 0x26
        .byte   39                              // 0x27
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .text
        .globl  foo
        .p2align        2
        .type   foo,@function

foo:                                    // @foo
        .cfi_startproc
// %bb.0:
        adrp    x8, .LCPI0_0
        add     x8, x8, :lo12:.LCPI0_0
        ptrue   p0.b
        ldr     q1, [x1]
        ldr     q0, [x0]
        ld1b    { z2.b }, p0/z, [x8]
        tbl     z0.b, { z0.b, z1.b }, z2.b
                                        // kill: def $q0 killed $q0 killed $z0
        ret
26105

Please check this condition earlier up for an early bail out! (at same place where it checks MinSVESize)

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
605 ↗(On Diff #551498)

Why doesn't this function use the TBL instruction? I don't see anything that stops it from being to use TBL?

dtemirbulatov marked an inline comment as done.Aug 22 2023, 3:35 PM
dtemirbulatov added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26104

I think this is correct, here is full DAG dump:
SelectionDAG has 16 nodes:

t0: ch,glue = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
t18: v16i8,ch = load<(load (s128) from %ir.ptr0, align 256)> t0, t2, undef:i64
  t4: i64,ch = CopyFromReg t0, Register:i64 %1
t16: v16i8,ch = load<(load (s128) from %ir.ptr1, align 256)> t0, t4, undef:i64
  t11: v16i8 = vector_shuffle<0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23> t18, t16
t13: ch,glue = CopyToReg t0, Register:v16i8 $q0, t11
t22: nxv16i8 = insert_subvector undef:nxv16i8, t18, Constant:i64<0>
t23: nxv16i8 = insert_subvector undef:nxv16i8, t16, Constant:i64<0>
t14: ch = AArch64ISD::RET_GLUE t13, Register:v16i8 $q0, t13:1
26105

No, Most of those checks related to support TBL2 both opperand used and at the place where MinSVESize check we don't know content of the mask, thus could not tell that both operand would be used or not.

dtemirbulatov marked 5 inline comments as done.Aug 23 2023, 12:39 AM
dtemirbulatov added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26083

Sorry, I left isZerosVector() check by accident.

26095

replaced both variables with a single one 'IsSingleOp'.

26105

And for a single operand case it is enough to know just MinSVESize.

dtemirbulatov marked 2 inline comments as done.Aug 23 2023, 12:40 AM

Resolved comments.

sdesmalen added inline comments.Aug 23 2023, 3:14 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25834–25835

nit: You don't need to pass MinSVESize and MaxSVESize, you can easily get this again from the Subtarget.

25885

This swap code seems to be unnecessary, because SelectionDAGBuilder already does this for you.

You can see that if you compile:

define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) vscale_range(16, 16) {
  %op0 = load <16 x i8>, ptr %ptr0
  %op1 = load <16 x i8>, ptr %ptr1
  ; Takes only elements from second vector
  %res = shufflevector <16 x i8> %op0, <16 x i8> %op1, <16 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 31, i32 31>
  ret <16 x i8> %res
}

Which results in:

t0: ch,glue = EntryToken
      t4: i64,ch = CopyFromReg t0, Register:i64 %1
    t8: v16i8,ch = load<(load (s128) from %ir.ptr1)> t0, t4, undef:i64
  t10: v16i8 = vector_shuffle<0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,15> t8, undef:v16i8
t12: ch,glue = CopyToReg t0, Register:v16i8 $q0, t10
t13: ch = AArch64ISD::RET_GLUE t12, Register:v16i8 $q0, t12:1

That means you don't need to modify the Shufflemask at all, and you can call ShuffleVectorInst:: isSingleSourceMask(ShuffleMask) to calculate IsSingleOp. If isSingleSourceMask returns true, then you can assume it will select elements from the first source operand. That will simplify this patch quite a bit!

26104

It seems SelectionDAGBuilder has extracted a subregister from the two <256 x i8> vectors, hence why the offsets are valid. This clearly wasn't the right example. The following case can't be transformed like that:

define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) vscale_range(16, 16) {
  %op0 = load <256 x i8>, ptr %ptr0
  %op1 = load <256 x i8>, ptr %ptr1
  ; Cannot use TBL2 instruction because any offset > 255 cannot be represented in an i8!
  %res = shufflevector <256 x i8> %op0, <256 x i8> %op1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 252, i32 253, i32 254, i32 255, i32 256, i32 257, i32 258, i32 259, i32 508, i32 509, i32 510, i32 511>
  ret <16 x i8> %res
}

It's only because SelectionDAG inserts the mask into a <256 x i8> vector that your code happens to discard this case (because that will fail the check MaxOffset <= NumElts * 2). I think I'd still prefer an explicit check that the maximum index value in the ShuffleMask doesn't exceed the maximum value of the element type.

dtemirbulatov marked 3 inline comments as done.

Resolved comments.

dtemirbulatov marked an inline comment as done.Aug 23 2023, 5:30 AM
sdesmalen added inline comments.Aug 23 2023, 9:04 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25855–25868

If you define MaxOffset as (1 << BitsPerElt) - 1, you can remove the switch statement?

25870–25875

Please move this code up after

if (!MinSVESize)
  return SDValue();
25879

You're using unsigned here, but ShuffleMask is actually a vector of ints. undef elements are represented as index (int) -1. When you're comparing this with MaxOffset for example, things get a little confusing and I think this is hiding some bugs.

Can you instead make all the ShuffleMask values use signed integers?

dtemirbulatov marked 3 inline comments as done.

Resolved comments.

sdesmalen added inline comments.Aug 24 2023, 2:44 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25849

Please rename this to something more descriptive, e.g. ElementsPerVector

25852

Seems more generic to write:

uint64_t MaxOffset = APInt(BitsPerElt, -1, false).getZExtValue();
25858

What is the point of this variable?

In my previous comment I suggested that the calculation should be done on signed values because the ShuffleMask has signed values . Doing:

for (int Val : ShuffleMask) {
  unsigned Offset = Val;
  // rest of code using Offset instead of Val
}

does not solve that problem.

25859

I don't think you need to distinguish IsSingleOp here, because if IsSingleOp == true, then Offset > NumElts, so the code below already covers it.

25869–25870

You'll need to update the real offset before checking whether Offset >= MaxOffset. For example, for vscale_range(16, 16) the index into the next vector would always be at least 256, which cannot be represented in an i8.

Moving this code above the previous condition fixes that.

25878

nit: you can simplify this to -1

25894–25895

Combine this into an else if

sdesmalen added inline comments.Aug 24 2023, 2:53 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25849

Please rename this to something more descriptive, e.g. ElementsPerVector

Perhaps ElementsPerVectorReg is more appropriate (given that there is already a NumElts variable).

dtemirbulatov added inline comments.Aug 24 2023, 5:49 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25858

hmm, not expecting anytging harmful here, according to the specification : "Non-negative elements in the mask represent an index."

dtemirbulatov marked 10 inline comments as done.

Addressed comments. Also, I found error in my last change where I was comparing "Offset > ElementsPerVectorReg", but It whould be ">=", Fixed.

Thanks for the latest round of changes. I have a few more comments, but this seems to be moving in the right direction.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25858

hmm, not expecting anything harmful here, according to the specification : "Non-negative elements in the mask represent an index.

That doesn't mean the value cannot be negative. I prefer the code to be explicit about what happens when the value is negative, rather than this being implicit by using an unsigned variable.

ShuffleMask is an array of signed integers. When you then iterate over this array of signed integers, it makes sense to make that variable signed as well.

25884

nit: you can sink the convertFromScalableVector out of the if/else conditions, and moving it to the return.

llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle.ll
963 ↗(On Diff #554219)

Missing CHECK lines for this symbol.

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
378 ↗(On Diff #554219)

You have added these tests to sve-streaming-mode-fixed-length-vector-shuffle.ll, but the sve-streaming-mode-* tests should not contain any tests with specific vscale_range(..) attributes, because the point is more to test that no NEON is being used and streaming(-compatible) SVE is used instead.

668 ↗(On Diff #554219)

What is unacceptable about the form?

dtemirbulatov marked 2 inline comments as done.

Resolved comments.

dtemirbulatov added inline comments.Sep 4 2023, 6:27 AM
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
668 ↗(On Diff #554219)

We could see in the result that three tbl instructions were used instead of one, because the result vector is not reprsentable on the hardware.

dtemirbulatov marked 2 inline comments as done.Sep 4 2023, 6:29 AM

Just a few minor nits, otherwise looks good to me.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25857

nit: This should be called Index instead of Offset.

25865

If you pull out this condition to the top of the loop and return early for negative offsets, then in the rest of the loop you can assume Offset is unsigned and remove some of the (int) casts you have in there.

25870–25871

nit: this statement is a bit unclear to me. Why is it better?

Do we want to say "Choosing an out-of-range index leads to the lane being zeroed".
Note that this is only relevant for i16+ types and for i8 types when the runtime VL < 2048.

25883

nit: unnecessary curly braces.

25891

Just to be safe, could you add an else branch here with llvm_unreachable("Cannot lower shuffle without SVE2 TBL"); ?

26049–26050

nit: odd line-break.

SjoerdMeijer added inline comments.
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
4 ↗(On Diff #555721)

Drive by comment: can we add a RUN for 256 bits? It might be interesting to see the codegen for that too because there is a 256 bit SVE implementation and we have seen some inefficiencies for these type of things.

dtemirbulatov marked 7 inline comments as done.

Resolved comments.

sdesmalen added inline comments.Sep 22 2023, 12:32 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25858

When one of the indices is poison, it will encode that in the ShuffleMask with -1. Bailing out may be too conservative, you could for example choose to select the first element instead.

With your patch at the moment:

define <4 x float> @foo(ptr %a, ptr %b) nounwind vscale_range(1, 1){
  %op1 = load <4 x float>, ptr %a
  %op2 = load <4 x float>, ptr %b
  %1 = shufflevector <4 x float> %op1, <4 x float> %op2, <4 x i32> <i32 0, i32 3, i32 5, i32 poison>
  ret <4 x float> %1
}

results in:

foo:                                    // @foo
        ldr     q0, [x1]
        ldr     q1, [x0]
        mov     z0.s, z0.s[1]
        mov     z2.s, z1.s[3]
        str     s1, [sp, #-16]!
        stp     s2, s0, [sp, #4]
        ldr     q0, [sp], #16
        ret

whereas:

define <4 x float> @foo(ptr %a, ptr %b) nounwind vscale_range(1, 1){
  %op1 = load <4 x float>, ptr %a
  %op2 = load <4 x float>, ptr %b
  %1 = shufflevector <4 x float> %op1, <4 x float> %op2, <4 x i32> <i32 0, i32 3, i32 5, i32 1>
  ret <4 x float> %1
}

results in:

foo:                                    // @foo
        adrp    x8, .LCPI0_0
        ldr     q0, [x0]
        ldr     q1, [x1]
        ldr     q2, [x8, :lo12:.LCPI0_0]
        tbl     z0.s, { z0.s, z1.s }, z2.s
        ret

It would also be good to have a test for this case (one of the shufflemask lanes being poison)

25873

nit: s/world perfrom/would perform/ ?

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
4 ↗(On Diff #556773)

This line is exactly the same as on line 2?

8 ↗(On Diff #556773)

I would actually prefer to keep these tests kind of the same as they were, and for any new tests to be moved to a new test file, e.g. sve-fixed-length-vector-shuffle-tbl.ll.

Where:

  • The test has RUN lines for both +sve and for +sve2
  • We test with Min == Max, such that it uses TBL/TBL2 as your patch currently implements
  • We test without a Max (which means Max is unbounded, and implies your current patch won't use TBL, although that could change in the future)
  • There are a few tests for boundary cases using vscale_range(16, 16) and i8 vectors.

I'd recommend using the same size for the vectors in tests (now your test is using <4 x i8>, <8 x i8> and <16 x i8>, but I don't think that makes any difference based on the code you've added. Maybe you can just stick with <16 x i8> instead. If you go for 128-bit vectors, you'll need to add the -force-streaming-compatible flag to force the use of SVE instructions over NEON.

dtemirbulatov marked 4 inline comments as done.

Addressed remarks.

sdesmalen added inline comments.Oct 2 2023, 5:27 AM
llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll
3

Can you remove this RUN line? This RUN line is identical to the last RUN line.

4

Can you stick with one vector width (e.g. min==max==128)? I don't think the line below (min==max==256) is testing anything that the current RUN line isn't already testing when -force-stremaing-compatible is enabled.

111

Please rename this to something like shuffle_index_indices_from_op1 to make it clear the mask takes all the lanes from %op1.

212

Please rename this to something like shuffle_index_indices_from_op2 to make it clear the mask takes all the lanes from %op2.

346

Please rename this to something like shuffle_index_indices_from_both_ops to make it clear the mask takes all the lanes from both %op1 and %op2.

350

Please remove the acceptable and _unacceptable from the names. If something cannot use the TBL instruction, then prepend the name with negative_.

381

As I said here, one of the indices being poison should not be a reason not to use vector instructions. This case can be handled by the algorithm.

385

Please inline "target-features"="+sve2" vscale_range(16,16) here rather than using #0, as that makes the test a bit easier to read.

385

Can you add a comment that clarifies why this function doesn't use TBL?

447

This test is named _unacceptable_form, yet it uses TBL without issues. So this is not an "unacceptable form"? :)

703

How is this test different from shuffle_index_size_acceptable_op_both, other than it having a slightly different mask?

834

How is this test different from shuffle_index_size_acceptable_op_both, other than it having a slightly different mask and a wider return type?

890

I don't think this is a good test, because this test is relying on a DAGCombine that swaps %op1 and %op2. I think you can just remove it.

dtemirbulatov marked 12 inline comments as done.

Resolved remarks.

sdesmalen added inline comments.Oct 2 2023, 8:31 AM
llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll
119

This test is missing check lines for the shuffle mask, could you please add it?

178

I'm a bit puzzled, why did you add this test?

202

nit: I meant prepend do the start of the name, e.g. @negative_test_shuffle_index_size_op_both_maxhw

dtemirbulatov added inline comments.Oct 2 2023, 9:02 AM
llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll
3

No, it is not here we targeting sve, not sve2.

119

No, for the poison value it generates different value for different conditions (hardware vector size, etc), but for this test always in range of offset for operand 1, I am not sure it is consistant.

sdesmalen added inline comments.Oct 2 2023, 1:31 PM
llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll
119

I don't think there's a benefit to choosing one index over another, so what if you change the code to always choose lane 0 if the index < 0? That way, you can add the CHECK lines.

dtemirbulatov marked 3 inline comments as done.

Addressed comments.

sdesmalen accepted this revision.Oct 3 2023, 5:05 AM

LGTM, thanks!

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25859

nit: I don't know if this can ever happen, but maybe it's more safe to write if (Index < 0) if any other values can be used to represent poison.

This revision is now accepted and ready to land.Oct 3 2023, 5:05 AM
This revision was landed with ongoing or failed builds.Oct 3 2023, 8:20 AM
This revision was automatically updated to reflect the committed changes.