This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Target/AArch64/
-
Target/
-
AArch64/
74/78
AArch64ISelLowering.cpp
-
test/CodeGen/AArch64/
-
CodeGen/
-
AArch64/
1/1
sve-fixed-length-permute-rev.ll
17/19
sve-fixed-length-vector-shuffle-tbl.ll

Differential D152205

[Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types.
ClosedPublic

Authored by dtemirbulatov on Jun 5 2023, 3:29 PM.

Download Raw Diff

Details

Reviewers

sdesmalen
david-arm
kmclaughlin
hassnaa-arm
MattDevereau
efriedma
CarolineConcatto

Commits

rG8232ab76d0ba: [AArch64][SVE][SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector…

Summary

Here we enable some of shuffle lowering with TBL instruction with SVE and SVE2 for indexing for one register and TBL version for SVE2 while indexing to both registers.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Herald added a project: Restricted Project. · View Herald TranscriptJun 5 2023, 3:29 PM

dtemirbulatov requested review of this revision.Jun 5 2023, 3:29 PM

Herald added a project: Restricted Project. · View Herald TranscriptJun 5 2023, 3:29 PM

Harbormaster completed remote builds in B236749: Diff 528595.Jun 5 2023, 5:48 PM

Hi @dtemirbulatov, I only had a glance so haven't reviewed this properly yet, but already left you a few comments and questions.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26054	This could do with a code comment describing why we don't do this when we don't know the exact SVE size. I think there are several cases to distinguish here: Elements are selected from either the first source vector, or from the second, but not both. This can be implemented with SVE and when the SVE size isn't known. Elements are selected from both source vectors. If we only have SVE, then we can mimic this with multiple (single-source) TBL instructions and selects, but I suspect there's little value in handling this, given that SVE2 is the prevalent feature going forward. If we don't know the size, then we need to do more work to update the indices for the TBL. For example: <2 x i64> %x, <2 x i64> %y, <4 x i32> <0, 1, 2, 3> if SVE VL = 256, then we'd basically insert the <2 x i64> vector value into a conceptual <4 x i64> register where only the first two lanes are relevant, so the offsets must be: <0, 1, 4, 5> if SVE VL = 512, then we'd basically insert the <2 x i64> vector value into a conceptual <8 x i64> register where only the first two lanes are relevant, so the offsets must be: <0, 1, 8, 9> So we'd somehow need to do something like this: <0, 1, 2, 3> + <0, 0, VL, VL>
26055	nit: tab?
26057	It's not really clear to me what `Swap` and `IsUndefOrZero` are supposed to do. I would have expected it to look at the indices rather than the source vector operands. Can you add a comment explaining this logic?
26100–26111	nit: what about this?
llvm/test/CodeGen/AArch64/sve-fixed-length-permute-rev.ll
396–397	Seems like these changes can be committed separately.
llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle.ll
945 ↗	(On Diff #528595)	This needs some check lines for the constant.
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
392 ↗	(On Diff #528595)	Can you make it somehow clear form the name of the function, or by inlining the attribute here (e.g. `@shuffle_v4i16_tbl2(ptr %a, ptr %b) "target-features"="+sve2" {`) that this relates to SVE2?

Matt added a subscriber: Matt.Jun 6 2023, 10:20 AM

dtemirbulatov retitled this revision from WIP: [Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowring for fixed vector types. to WIP: [Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types..Jun 30 2023, 4:59 AM

dtemirbulatov edited the summary of this revision. (Show Details)

dtemirbulatov marked 2 inline comments as done.Jul 18 2023, 4:57 AM

dtemirbulatov added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26057	Ok, I will comment for those variables. Also, for example. If the mask is refers to only one operand then second operand would be already undefined at the time of lowering. I think we don't have to check the mask here.

Rebased addressed comments.

dtemirbulatov retitled this revision from WIP: [Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types. to [Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types..Jul 18 2023, 12:33 PM

Harbormaster completed remote builds in B246315: Diff 541676.Jul 18 2023, 6:33 PM

dtemirbulatov added a reviewer: CarolineConcatto.Jul 24 2023, 7:31 AM

Hi Dinar,
I re-wrote your code in the Lowering to something I could understand better.
I will still spend some time trying to understand the mask logic and see if the new scalable vector is correct.
But I believe it is good to share my comments for now

+  // Avoid producing TBL and TBL2 instructions if we don't know
+  // SVE register size or minimal is unequal to maximum size.
+  if (!MaxSVESize || MinSVESize != MaxSVESize)
+    return SDValue();

-  return SDValue();
+  // Element type is ???
+  unsigned BitsPerElt = VT.getVectorElementType().getSizeInBits();
+  if (BitsPerElt != 8 && BitsPerElt != 16 && BitsPerElt != 32 &&
+      BitsPerElt != 64)
+    return SDValue();
+
+  bool Swap = false;
+  if (Op1.isUndef() || isZerosVector(Op1.getNode())) {
+    // Swapped shuffle operands, but we have to recalculate indexes afterwards.
+    std::swap(Op1, Op2);
+    Swap = true;
+  }
+
+
+  bool IsUndefOrZero = Op2.isUndef() || isZerosVector(Op2.getNode());
+  unsigned NumElts = VT.getVectorNumElements();
+  unsigned IndexLen = MinSVESize / BitsPerElt;
+  unsigned FillElements = IndexLen - NumElts;
+
+  // Set the Mask with the correct indexes
+  SmallVector<SDValue, 8> TBLMask;
+  for (int Val : ShuffleMask) {
+    unsigned Offset = Val;
+    if (Swap)
+      Offset = Offset < NumElts ? Offset + NumElts : Offset - NumElts;
+    // why we need to set the mask to 255 when one of the vectors is zero/undef?
+    else if(Offset >= NumElts){
+       if(IsUndefOrZero)
+           Offset = 255;
+       else if (IndexLen > NumElts)
+           Offset += IndexLen - NumElts;
+    }
+    TBLMask.push_back(DAG.getConstant(Offset, DL, MVT::i64));
+  }
+  // Filling the rest of the mask with 255
+  for (unsigned i = 0; i < FillElements; ++i)
+    TBLMask.push_back(DAG.getConstant(255, DL, MVT::i64));
+
+  // Building a scalable vector for the suffle. So we can use the sve instructions
+  EVT MaskEltType = EVT::getIntegerVT(*DAG.getContext(), BitsPerElt);
+  EVT MaskType = EVT::getVectorVT(*DAG.getContext(), MaskEltType, IndexLen);
+  EVT MaskContainerVT = getContainerForFixedLengthVector(DAG, MaskType);
+  SDValue VecMask =
+      DAG.getBuildVector(MaskType, DL, ArrayRef(TBLMask.data(), IndexLen));
+  SDValue SVEMask = convertToScalableVector(DAG, MaskContainerVT, VecMask);
+
+  SDValue Shuffle;
+  if (IsUndefOrZero || Swap) {
+    Shuffle = convertFromScalableVector(
+        DAG, VT,
+        DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,
+                    DAG.getConstant(Intrinsic::aarch64_sve_tbl, DL, MVT::i32),
+                    Op1, SVEMask));
+  } else {
+    if (Subtarget->hasSVE2())
+      Shuffle = convertFromScalableVector(
+          DAG, VT,
+          DAG.getNode(
+              ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,
+              DAG.getConstant(Intrinsic::aarch64_sve_tbl2, DL, MVT::i32), Op1,
+              Op2, SVEMask));
+    else
+      return SDValue();
+  }
+  return DAG.getNode(ISD::BITCAST, DL, Op.getValueType(), Shuffle);

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26071–26072	Why do we need to test this? What other value it can have? I thought the only possible values were this 8/16/32/64. Is it possible to have 128, 1024? Are these values we are avoiding?
26084	For me this tests is a bit odd. FillElements is unsigned and not a bool. So are you checking if IndexLen> NumElts? Is it too difficult to put and example for each case with the fixed mask. So we can see how the mask was and how it has changed. if (Swap) for Offset < NumElts ->mask = <0,0,7,7> -> <3,3,0,0> for Offset > NumElts ->mask = <0,0,7,7> Maybe this will help to understand if there is any hidden problem with these ifs. I re-wrote that to something like: // Set the Mask with the correct indexes SmallVector<SDValue, 8> TBLMask; for (int Val : ShuffleMask) { unsigned Offset = Val; if (Swap) Offset = Offset < NumElts ? Offset + NumElts : Offset - NumElts; // why we need to set the mask to 255 when one of the vectors is zero/undef? else if(Offset >= NumElts){ if(IsUndefOrZero) Offset = 255; else if (IndexLen > NumElts) Offset += IndexLen - NumElts; } TBLMask.push_back(DAG.getConstant(Offset, DL, MVT::i64)); }
26088	I would suggest add a comment here. You are populating the rest of the mask here with 255, why? // Filling up the remaining positions of the mask with 255 because ...
26091	Are you here changing a fixed vector to a scalable vector to use tbl?

Addressed comments, rebase.

Harbormaster completed remote builds in B249327: Diff 545824.Jul 31 2023, 5:27 PM

dtemirbulatov added inline comments.Aug 1 2023, 2:02 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26054	I will address run-time index construction in the following patch.
26071–26072	yes, The element type chack was already done in useSVEForFixedLengthVectorVT() function.
26091	Right, as those instructions accept only scalable type operand and result.

dtemirbulatov marked 2 inline comments as done.Aug 1 2023, 3:54 AM

dtemirbulatov added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26084	Thanks for finding this, I think now that "IndexLen> NumElts" logic/transform was not correct.

sdesmalen added inline comments.Aug 1 2023, 7:49 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26057	There is an implicit assumption that a DAGCombine (or perhaps an InstCombine) has rewritten the second operand to be `undef`. It seems safer to look at the values in the Mask instead.
26073	I think you need to call this `Op2IsUndefOrZero`
26085–26089	What should happen if the vector type is `<256 x i8>` and IsUndefOrZero is `false` ?
26092–26093	I'm curious, why is that better? Also, as my comment above suggests, for `<256 x i8>` vectors, `255` is just the last element of the first data-vector. There is also some test-coverage missing here.
26103	I think this condition can be removed? (if it has swapped, then the second vector must be zero or undef?)
26108	What is the type of VecMask here? How can the type of VecMask here (single-vector TBL instruction) be the same as for the case of the SVE2 two-vector TBL instruction?
26109–26110	nit: } else if (Subtarget->hasSVE2()) { ... } else return SDValue(); ?

dtemirbulatov marked 4 inline comments as done.Aug 7 2023, 9:15 AM

dtemirbulatov added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26092–26093	I think that if we leave uninitialized remaining out of scope indexes then they will end up with 0 value and means extra work since 0 element almost always means 0 lane and with 255 value probably hardware could understand this is out of scope value, except <256 x i8> case which a it rarely than valid 0 index. The last available value for index here is 255 then it flips to 0.
26103	No, IsUndefOrZero tells us that only operand 0 is used. I will add a test and rename the variable.
26108	This is not correct, According to specification index size is similar for TBL, TBL2.

Rebased, fixed remarks, added ability to avoid to produce tbl, tbl2 instruction if index is not representable index value more than 255.

Harbormaster completed remote builds in B251093: Diff 548205.Aug 8 2023, 10:36 AM

Corrected sve-streaming-mode-fixed-length-vector-shuffle.ll test by removed unused parameters from functions.

Harbormaster completed remote builds in B251471: Diff 548731.Aug 9 2023, 5:06 PM

sdesmalen added inline comments.Aug 14 2023, 7:02 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26054	It probably makes sense to move this functionality into its own function.
26067	You seem to have removed checks to ask if Op1 is `poison`/`undef` or `zeroinitializer`. The way you've written the code now, `Op1IsUndefOrZero` should be renamed instead to `Op1IsUnused`. (same for Op2IsUnused)
26074	Is the condition `&& (Val < NumElts * 2)` needed? This seems already covered by the `if (NumElts > IndexLen)` check above on line 25844.
26077	It's probably easier to update the ShuffleMask here. That simplifies the logic below as it avoids having to constantly distinguish between the `Op*IsUndefOrZero` cases.
26081	Hardcoding a check for '128' doesn't seem right to me. It all depends on what can be represented in the element type of the instruction, i.e. for i8's the max value is 255. That means when targeting `vscale_range(16, 16)`, then 255 is a valid index into the first source vector.
26082	It doesn't seem right to look at the number of indices in the mask. I think rather it should be about the maximum index value in the mask?
26092–26093	While it may be rare, it must be correct for all cases. If the LLVM IR instruction requests an element from a `zeroinitializer` vector is selected, then it must make sure to do so. That means that 255 may not always be the correct value. Maybe it should just return `SDValue()` for that case, or explicitly materialize a `zeroinitializer` vector to select the value from.
26099	While not wrong, I don't really like '255' because it is a hard-coded value. For i8 elements, it may well be a valid index. For i16 or larger elements, there's no need to hard-code 255. Maybe just use '(1 << elementSize) - 1'?
26109	For a shufflevector that selects `IndexLen` elements from Op1/Op2, this code seems to be filling the first `VL - IndexLen` elements with some initializer (255 in this case). It seems odd to me that this loop starts at 0 and I doubt this has the desired effect. I'm not sure to what extend it matters what value we use for 'undef' lanes. Can we just initialize the TBLMask with all numeric_limits<type>::max index value?
26124–26126	Please write this as: } else if (Subtarget->hasSVE2()) { ... } else return SDValue();

Addressed comments.

Herald added a subscriber: sunshaoce. · View Herald TranscriptAug 17 2023, 6:02 AM

dtemirbulatov added inline comments.Aug 17 2023, 6:03 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26074	yes, we need this check to understand that we just electing from operand 2 and operand 1 is unused.
26077	hmm, ShuffleMask is readonly variable.
26081	yes and for second source operand the index starts from 256, so that is what we are checking here is we have to address to both operands then it is going to be 255 + index and we already established that both operands involved.
26082	We already checked all values in the mask.

Harbormaster completed remote builds in B253201: Diff 551105.Aug 17 2023, 6:54 AM

Resolved remained remarks, added support for half sized mask, N-sized mask. if just minimum SVE known then allow SVE TBL with just one operand transform.

dtemirbulatov marked an inline comment as done.Aug 18 2023, 7:15 AM

dtemirbulatov added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26085–26089	I think

dtemirbulatov marked an inline comment as not done.Aug 18 2023, 7:17 AM

dtemirbulatov added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26085–26089	I think we allowing this lowering to only one operand supported with SVE1 TBL.

dtemirbulatov marked an inline comment as done.Aug 18 2023, 7:17 AM

Harbormaster completed remote builds in B253483: Diff 551498.Aug 18 2023, 8:26 AM

sdesmalen added inline comments.Aug 21 2023, 8:27 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26054	Comment seems unaddressed.
26077	You can create a new ShuffleMask variable, which is writable.
26082	I think you can make this into an assert that NumElts <= IndexLen, because at the point of lowering the operation, the types will have been legalised in such a way that the number of lanes of the mask >= number of lanes of the source operand.
26083	There are two things not right about this: The contents of `Op1` and `Op2` shouldn't matter for the lowering of the shuffle, so there is no reason to bail out here. (If you remove this, does any test fail?) `isZerosVector` will never return true, because `Op1` and `Op2` are `INSERT_SUBVECTOR` nodes because they are assigned with `convertToScalableVector()`, and `INSERT_SUBVECTOR` is not considered in `isZerosVector`.
26095	What is the meaning of 'Op1IsUnused' after swapping Op1 and Op2? Should that be Op2IsUnused now? To me it seems better to update the ShuffleMask directly and remove the need for `Op*IsUnused` altogether
26100	It's normally better to bail out early, e.g. if (condition) { // lots of code } return SDValue(); -> if (!condition) return SDValue // lots of code
26104	This condition and the comment above make little sense to me. The issue to solve is instead about the maximum index value in the shuffle mask. If the maximum value being indexed cannot be represented in an i8, then we cannot use TBL2 to lower this shuffle vector. For example: define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) { %op0 = load <256 x i8>, ptr %ptr0 %op1 = load <256 x i8>, ptr %ptr1 ; Cannot use TBL2 instruction because any offset > 255 cannot be represented in an i8! %res = shufflevector <256 x i8> %op0, <256 x i8> %op1, <16 x i32> <i32 0,i32 1,i32 2,i32 3,i32 4, i32 5, i32 6, i32 7, i32 256,i32 257, i32 258, i32 259, i32 260, i32 261, i32 262, i32 263> ret <16 x i8> %res } However, the code in the patch seems to accept this without an issue and generates wrong code with the wrong indices. .LCPI0_0: .byte 0 // 0x0 .byte 1 // 0x1 .byte 2 // 0x2 .byte 3 // 0x3 .byte 4 // 0x4 .byte 5 // 0x5 .byte 6 // 0x6 .byte 7 // 0x7 .byte 16 // 0x10 .byte 33 // 0x21 .byte 34 // 0x22 .byte 35 // 0x23 .byte 36 // 0x24 .byte 37 // 0x25 .byte 38 // 0x26 .byte 39 // 0x27 .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .byte 255 // 0xff .text .globl foo .p2align 2 .type foo,@function foo: // @foo .cfi_startproc // %bb.0: adrp x8, .LCPI0_0 add x8, x8, :lo12:.LCPI0_0 ptrue p0.b ldr q1, [x1] ldr q0, [x0] ld1b { z2.b }, p0/z, [x8] tbl z0.b, { z0.b, z1.b }, z2.b // kill: def $q0 killed $q0 killed $z0 ret
26105	Please check this condition earlier up for an early bail out! (at same place where it checks MinSVESize)
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
605 ↗	(On Diff #551498)	Why doesn't this function use the TBL instruction? I don't see anything that stops it from being to use TBL?

dtemirbulatov marked an inline comment as done.Aug 22 2023, 3:35 PM

dtemirbulatov added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

26104

I think this is correct, here is full DAG dump:
SelectionDAG has 16 nodes:

t0: ch,glue = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
t18: v16i8,ch = load<(load (s128) from %ir.ptr0, align 256)> t0, t2, undef:i64
  t4: i64,ch = CopyFromReg t0, Register:i64 %1
t16: v16i8,ch = load<(load (s128) from %ir.ptr1, align 256)> t0, t4, undef:i64
  t11: v16i8 = vector_shuffle<0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23> t18, t16
t13: ch,glue = CopyToReg t0, Register:v16i8 $q0, t11
t22: nxv16i8 = insert_subvector undef:nxv16i8, t18, Constant:i64<0>
t23: nxv16i8 = insert_subvector undef:nxv16i8, t16, Constant:i64<0>
t14: ch = AArch64ISD::RET_GLUE t13, Register:v16i8 $q0, t13:1

26105

No, Most of those checks related to support TBL2 both opperand used and at the place where MinSVESize check we don't know content of the mask, thus could not tell that both operand would be used or not.

dtemirbulatov marked 5 inline comments as done.Aug 23 2023, 12:39 AM

dtemirbulatov added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
26083	Sorry, I left isZerosVector() check by accident.
26095	replaced both variables with a single one 'IsSingleOp'.
26105	And for a single operand case it is enough to know just MinSVESize.

dtemirbulatov updated this revision to Diff 552613.Aug 23 2023, 12:40 AM

Resolved comments.

Harbormaster completed remote builds in B254267: Diff 552613.Aug 23 2023, 1:55 AM

sdesmalen added inline comments.Aug 23 2023, 3:14 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25834–25835	nit: You don't need to pass MinSVESize and MaxSVESize, you can easily get this again from the Subtarget.
25885	This swap code seems to be unnecessary, because SelectionDAGBuilder already does this for you. You can see that if you compile: define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) vscale_range(16, 16) { %op0 = load <16 x i8>, ptr %ptr0 %op1 = load <16 x i8>, ptr %ptr1 ; Takes only elements from second vector %res = shufflevector <16 x i8> %op0, <16 x i8> %op1, <16 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 31, i32 31> ret <16 x i8> %res } Which results in: t0: ch,glue = EntryToken t4: i64,ch = CopyFromReg t0, Register:i64 %1 t8: v16i8,ch = load<(load (s128) from %ir.ptr1)> t0, t4, undef:i64 t10: v16i8 = vector_shuffle<0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,15> t8, undef:v16i8 t12: ch,glue = CopyToReg t0, Register:v16i8 $q0, t10 t13: ch = AArch64ISD::RET_GLUE t12, Register:v16i8 $q0, t12:1 That means you don't need to modify the Shufflemask at all, and you can call `ShuffleVectorInst:: isSingleSourceMask(ShuffleMask)` to calculate `IsSingleOp`. If `isSingleSourceMask` returns `true`, then you can assume it will select elements from the first source operand. That will simplify this patch quite a bit!
26104	It seems SelectionDAGBuilder has extracted a subregister from the two <256 x i8> vectors, hence why the offsets are valid. This clearly wasn't the right example. The following case can't be transformed like that: define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) vscale_range(16, 16) { %op0 = load <256 x i8>, ptr %ptr0 %op1 = load <256 x i8>, ptr %ptr1 ; Cannot use TBL2 instruction because any offset > 255 cannot be represented in an i8! %res = shufflevector <256 x i8> %op0, <256 x i8> %op1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 252, i32 253, i32 254, i32 255, i32 256, i32 257, i32 258, i32 259, i32 508, i32 509, i32 510, i32 511> ret <16 x i8> %res } It's only because SelectionDAG inserts the mask into a `<256 x i8>` vector that your code happens to discard this case (because that will fail the check `MaxOffset <= NumElts * 2`). I think I'd still prefer an explicit check that the maximum index value in the ShuffleMask doesn't exceed the maximum value of the element type.

Resolved comments.

dtemirbulatov marked an inline comment as done.Aug 23 2023, 5:30 AM

Harbormaster completed remote builds in B254314: Diff 552674.Aug 23 2023, 6:52 AM

sdesmalen added inline comments.Aug 23 2023, 9:04 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25855–25868	If you define `MaxOffset` as `(1 << BitsPerElt) - 1`, you can remove the switch statement?
25870–25875	Please move this code up after if (!MinSVESize) return SDValue();
25879	You're using `unsigned` here, but ShuffleMask is actually a vector of `int`s. `undef` elements are represented as index `(int) -1`. When you're comparing this with MaxOffset for example, things get a little confusing and I think this is hiding some bugs. Can you instead make all the ShuffleMask values use signed integers?

Resolved comments.

Harbormaster completed remote builds in B254487: Diff 552915.Aug 23 2023, 6:06 PM

sdesmalen added inline comments.Aug 24 2023, 2:44 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25849	Please rename this to something more descriptive, e.g. `ElementsPerVector`
25852	Seems more generic to write: uint64_t MaxOffset = APInt(BitsPerElt, -1, false).getZExtValue();
25858	What is the point of this variable? In my previous comment I suggested that the calculation should be done on signed values because the ShuffleMask has signed values . Doing: for (int Val : ShuffleMask) { unsigned Offset = Val; // rest of code using Offset instead of Val } does not solve that problem.
25859	I don't think you need to distinguish `IsSingleOp` here, because if `IsSingleOp == true`, then `Offset > NumElts`, so the code below already covers it.
25869–25870	You'll need to update the real offset before checking whether Offset >= MaxOffset. For example, for vscale_range(16, 16) the index into the next vector would always be at least 256, which cannot be represented in an i8. Moving this code above the previous condition fixes that.
25878	nit: you can simplify this to `-1`
25894–25895	Combine this into an `else if`

sdesmalen added inline comments.Aug 24 2023, 2:53 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25849	Please rename this to something more descriptive, e.g. `ElementsPerVector` Perhaps `ElementsPerVectorReg` is more appropriate (given that there is already a `NumElts` variable).

dtemirbulatov added inline comments.Aug 24 2023, 5:49 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25858	hmm, not expecting anytging harmful here, according to the specification : "Non-negative elements in the mask represent an index."

Addressed comments. Also, I found error in my last change where I was comparing "Offset > ElementsPerVectorReg", but It whould be ">=", Fixed.

Harbormaster completed remote builds in B255443: Diff 554219.Aug 29 2023, 3:01 AM

Thanks for the latest round of changes. I have a few more comments, but this seems to be moving in the right direction.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25858	hmm, not expecting anything harmful here, according to the specification : "Non-negative elements in the mask represent an index. That doesn't mean the value cannot be negative. I prefer the code to be explicit about what happens when the value is negative, rather than this being implicit by using an `unsigned` variable. `ShuffleMask` is an array of signed integers. When you then iterate over this array of signed integers, it makes sense to make that variable signed as well.
25884	nit: you can sink the `convertFromScalableVector` out of the if/else conditions, and moving it to the return.
llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle.ll
963 ↗	(On Diff #554219)	Missing CHECK lines for this symbol.
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
378 ↗	(On Diff #554219)	You have added these tests to `sve-streaming-mode-fixed-length-vector-shuffle.ll`, but the sve-streaming-mode-* tests should not contain any tests with specific `vscale_range(..)` attributes, because the point is more to test that no NEON is being used and streaming(-compatible) SVE is used instead.
668 ↗	(On Diff #554219)	What is unacceptable about the form?

Resolved comments.

dtemirbulatov added inline comments.Sep 4 2023, 6:27 AM

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
668 ↗	(On Diff #554219)	We could see in the result that three tbl instructions were used instead of one, because the result vector is not reprsentable on the hardware.

dtemirbulatov marked 2 inline comments as done.Sep 4 2023, 6:29 AM

Ping.

Just a few minor nits, otherwise looks good to me.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25857	nit: This should be called `Index` instead of `Offset`.
25865	If you pull out this condition to the top of the loop and return early for negative offsets, then in the rest of the loop you can assume Offset is `unsigned` and remove some of the `(int)` casts you have in there.
25870–25871	nit: this statement is a bit unclear to me. Why is it better? Do we want to say "Choosing an out-of-range index leads to the lane being zeroed". Note that this is only relevant for i16+ types and for i8 types when the runtime VL < 2048.
25883	nit: unnecessary curly braces.
25891	Just to be safe, could you add an `else` branch here with `llvm_unreachable("Cannot lower shuffle without SVE2 TBL");` ?
26049–26050	nit: odd line-break.

SjoerdMeijer added a subscriber: SjoerdMeijer.Sep 12 2023, 3:37 AM

SjoerdMeijer added inline comments.

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
4 ↗	(On Diff #555721)	Drive by comment: can we add a RUN for 256 bits? It might be interesting to see the codegen for that too because there is a 256 bit SVE implementation and we have seen some inefficiencies for these type of things.

Resolved comments.

Harbormaster completed remote builds in B257217: Diff 556773.Sep 14 2023, 5:11 AM

sdesmalen added inline comments.Sep 22 2023, 12:32 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25858	When one of the indices is `poison`, it will encode that in the ShuffleMask with `-1`. Bailing out may be too conservative, you could for example choose to select the first element instead. With your patch at the moment: define <4 x float> @foo(ptr %a, ptr %b) nounwind vscale_range(1, 1){ %op1 = load <4 x float>, ptr %a %op2 = load <4 x float>, ptr %b %1 = shufflevector <4 x float> %op1, <4 x float> %op2, <4 x i32> <i32 0, i32 3, i32 5, i32 poison> ret <4 x float> %1 } results in: foo: // @foo ldr q0, [x1] ldr q1, [x0] mov z0.s, z0.s[1] mov z2.s, z1.s[3] str s1, [sp, #-16]! stp s2, s0, [sp, #4] ldr q0, [sp], #16 ret whereas: define <4 x float> @foo(ptr %a, ptr %b) nounwind vscale_range(1, 1){ %op1 = load <4 x float>, ptr %a %op2 = load <4 x float>, ptr %b %1 = shufflevector <4 x float> %op1, <4 x float> %op2, <4 x i32> <i32 0, i32 3, i32 5, i32 1> ret <4 x float> %1 } results in: foo: // @foo adrp x8, .LCPI0_0 ldr q0, [x0] ldr q1, [x1] ldr q2, [x8, :lo12:.LCPI0_0] tbl z0.s, { z0.s, z1.s }, z2.s ret It would also be good to have a test for this case (one of the shufflemask lanes being `poison`)
25873	nit: s/world perfrom/would perform/ ?
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
4 ↗	(On Diff #556773)	This line is exactly the same as on line 2?
8 ↗	(On Diff #556773)	I would actually prefer to keep these tests kind of the same as they were, and for any new tests to be moved to a new test file, e.g. `sve-fixed-length-vector-shuffle-tbl.ll`. Where: The test has RUN lines for both `+sve` and for `+sve2` We test with Min == Max, such that it uses TBL/TBL2 as your patch currently implements We test without a Max (which means Max is unbounded, and implies your current patch won't use TBL, although that could change in the future) There are a few tests for boundary cases using vscale_range(16, 16) and i8 vectors. I'd recommend using the same size for the vectors in tests (now your test is using `<4 x i8>`, `<8 x i8>` and `<16 x i8>`, but I don't think that makes any difference based on the code you've added. Maybe you can just stick with `<16 x i8>` instead. If you go for 128-bit vectors, you'll need to add the `-force-streaming-compatible` flag to force the use of SVE instructions over NEON.

Addressed remarks.

sdesmalen added inline comments.Oct 2 2023, 5:27 AM

llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll
3	Can you remove this RUN line? This RUN line is identical to the last RUN line.
4	Can you stick with one vector width (e.g. min==max==128)? I don't think the line below (min==max==256) is testing anything that the current RUN line isn't already testing when `-force-stremaing-compatible` is enabled.
111	Please rename this to something like `shuffle_index_indices_from_op1` to make it clear the mask takes all the lanes from `%op1`.
212	Please rename this to something like `shuffle_index_indices_from_op2` to make it clear the mask takes all the lanes from `%op2`.
346	Please rename this to something like `shuffle_index_indices_from_both_ops` to make it clear the mask takes all the lanes from both `%op1` and `%op2`.
350	Please remove the `acceptable` and `_unacceptable` from the names. If something cannot use the TBL instruction, then prepend the name with `negative_`.
381	As I said here, one of the indices being `poison` should not be a reason not to use vector instructions. This case can be handled by the algorithm.
385	Please inline `"target-features"="+sve2" vscale_range(16,16)` here rather than using `#0`, as that makes the test a bit easier to read.
385	Can you add a comment that clarifies why this function doesn't use TBL?
447	This test is named `_unacceptable_form`, yet it uses TBL without issues. So this is not an "unacceptable form"? :)
703	How is this test different from `shuffle_index_size_acceptable_op_both`, other than it having a slightly different mask?
834	How is this test different from `shuffle_index_size_acceptable_op_both`, other than it having a slightly different mask and a wider return type?
890	I don't think this is a good test, because this test is relying on a DAGCombine that swaps `%op1` and `%op2`. I think you can just remove it.

Resolved remarks.

Harbormaster completed remote builds in B257720: Diff 557533.Oct 2 2023, 8:16 AM

sdesmalen added inline comments.Oct 2 2023, 8:31 AM

llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll
119	This test is missing check lines for the shuffle mask, could you please add it?
178	I'm a bit puzzled, why did you add this test?
202	nit: I meant prepend do the start of the name, e.g. `@negative_test_shuffle_index_size_op_both_maxhw`

dtemirbulatov added inline comments.Oct 2 2023, 9:02 AM

llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll
3	No, it is not here we targeting sve, not sve2.
119	No, for the poison value it generates different value for different conditions (hardware vector size, etc), but for this test always in range of offset for operand 1, I am not sure it is consistant.

sdesmalen added inline comments.Oct 2 2023, 1:31 PM

llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll
119	I don't think there's a benefit to choosing one index over another, so what if you change the code to always choose lane `0` if the index < 0? That way, you can add the CHECK lines.

Addressed comments.

Harbormaster completed remote builds in B257737: Diff 557554.Oct 3 2023, 4:44 AM

LGTM, thanks!

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
25859	nit: I don't know if this can ever happen, but maybe it's more safe to write `if (Index < 0)` if any other values can be used to represent `poison`.

This revision is now accepted and ready to land.Oct 3 2023, 5:05 AM

This revision was landed with ongoing or failed builds.Oct 3 2023, 8:20 AM

Closed by commit rG8232ab76d0ba: [AArch64][SVE][SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector… (authored by dtemirbulatov). · Explain Why

This revision was automatically updated to reflect the committed changes.

dtemirbulatov added a commit: rG8232ab76d0ba: [AArch64][SVE][SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector….

Revision Contents

Path

Size

llvm/

lib/

Target/

AArch64/

AArch64ISelLowering.cpp

76 lines

test/

CodeGen/

AArch64/

sve-fixed-length-permute-rev.ll

131 lines

sve-fixed-length-vector-shuffle-tbl.ll

254 lines

Diff 557563

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 25,824 Lines • ▼ Show 20 Lines if (VT.bitsGT(SrcVT)) {

Val = convertToScalableVector(DAG, ContainerSrcVT, Val); Val = convertToScalableVector(DAG, ContainerSrcVT, Val);

Val = DAG.getNode(Opcode, DL, CvtVT, Pg, Val, DAG.getUNDEF(CvtVT)); Val = DAG.getNode(Opcode, DL, CvtVT, Pg, Val, DAG.getUNDEF(CvtVT));

Val = convertFromScalableVector(DAG, SrcVT.changeTypeToInteger(), Val); Val = convertFromScalableVector(DAG, SrcVT.changeTypeToInteger(), Val);

return DAG.getNode(ISD::TRUNCATE, DL, VT, Val); return DAG.getNode(ISD::TRUNCATE, DL, VT, Val);

} }

static SDValue GenerateFixedLengthSVETBL(SDValue Op, SDValue Op1, SDValue Op2,

ArrayRef<int> ShuffleMask, EVT VT,

EVT ContainerVT, SelectionDAG &DAG) {

sdesmalenUnsubmitted

Done

nit: You don't need to pass MinSVESize and MaxSVESize, you can easily get this again from the Subtarget.

sdesmalen: nit: You don't need to pass MinSVESize and MaxSVESize, you can easily get this again from the…

auto &Subtarget = DAG.getSubtarget<AArch64Subtarget>();

SDLoc DL(Op);

unsigned MinSVESize = Subtarget.getMinSVEVectorSizeInBits();

unsigned MaxSVESize = Subtarget.getMaxSVEVectorSizeInBits();

bool IsSingleOp = ShuffleVectorInst::isSingleSourceMask(ShuffleMask);

// Ignore two operands if no SVE2 or all index numbers couldn't

// be represented.

if (!IsSingleOp && (!Subtarget.hasSVE2() || MinSVESize != MaxSVESize))

return SDValue();

EVT VTOp1 = Op.getOperand(0).getValueType();

unsigned BitsPerElt = VTOp1.getVectorElementType().getSizeInBits();

unsigned IndexLen = MinSVESize / BitsPerElt;

sdesmalenUnsubmitted

Done

Please rename this to something more descriptive, e.g. ElementsPerVector

sdesmalen: Please rename this to something more descriptive, e.g. `ElementsPerVector`

sdesmalenUnsubmitted

Done

Please rename this to something more descriptive, e.g. ElementsPerVector

Perhaps ElementsPerVectorReg is more appropriate (given that there is already a NumElts variable).

sdesmalen: > Please rename this to something more descriptive, e.g. `ElementsPerVector` Perhaps…

unsigned ElementsPerVectorReg = VTOp1.getVectorNumElements();

unsigned MaskSize = ShuffleMask.size();

uint64_t MaxOffset = APInt(BitsPerElt, -1, false).getZExtValue();

sdesmalenUnsubmitted

Done

Seems more generic to write:

uint64_t MaxOffset = APInt(BitsPerElt, -1, false).getZExtValue();

sdesmalen: Seems more generic to write: uint64_t MaxOffset = APInt(BitsPerElt, -1, false).getZExtValue…

assert(ElementsPerVectorReg <= IndexLen && MaskSize <= IndexLen &&

"Incorrectly legalised shuffle operation");

SmallVector<SDValue, 8> TBLMask;

for (int Index : ShuffleMask) {

sdesmalenUnsubmitted

Done

nit: This should be called Index instead of Offset.

sdesmalen: nit: This should be called `Index` instead of `Offset`.

// Handling poison index value.

sdesmalenUnsubmitted

Done

What is the point of this variable?

In my previous comment I suggested that the calculation should be done on signed values because the ShuffleMask has signed values . Doing:

for (int Val : ShuffleMask) {
  unsigned Offset = Val;
  // rest of code using Offset instead of Val
}

does not solve that problem.

sdesmalen: What is the point of this variable? In my previous comment I suggested that the calculation…

dtemirbulatovAuthorUnsubmitted

Done

hmm, not expecting anytging harmful here, according to the specification : "Non-negative elements in the mask represent an index."

dtemirbulatov: hmm, not expecting anytging harmful here, according to the specification : "Non-negative…

sdesmalenUnsubmitted

Done

hmm, not expecting anything harmful here, according to the specification : "Non-negative elements in the mask represent an index.

That doesn't mean the value cannot be negative. I prefer the code to be explicit about what happens when the value is negative, rather than this being implicit by using an unsigned variable.

ShuffleMask is an array of signed integers. When you then iterate over this array of signed integers, it makes sense to make that variable signed as well.

sdesmalen: > hmm, not expecting anything harmful here, according to the specification : "Non-negative…

sdesmalenUnsubmitted

Done

When one of the indices is poison, it will encode that in the ShuffleMask with -1. Bailing out may be too conservative, you could for example choose to select the first element instead.

With your patch at the moment:

define <4 x float> @foo(ptr %a, ptr %b) nounwind vscale_range(1, 1){
  %op1 = load <4 x float>, ptr %a
  %op2 = load <4 x float>, ptr %b
  %1 = shufflevector <4 x float> %op1, <4 x float> %op2, <4 x i32> <i32 0, i32 3, i32 5, i32 poison>
  ret <4 x float> %1
}

results in:

foo:                                    // @foo
        ldr     q0, [x1]
        ldr     q1, [x0]
        mov     z0.s, z0.s[1]
        mov     z2.s, z1.s[3]
        str     s1, [sp, #-16]!
        stp     s2, s0, [sp, #4]
        ldr     q0, [sp], #16
        ret

whereas:

define <4 x float> @foo(ptr %a, ptr %b) nounwind vscale_range(1, 1){
  %op1 = load <4 x float>, ptr %a
  %op2 = load <4 x float>, ptr %b
  %1 = shufflevector <4 x float> %op1, <4 x float> %op2, <4 x i32> <i32 0, i32 3, i32 5, i32 1>
  ret <4 x float> %1
}

results in:

foo:                                    // @foo
        adrp    x8, .LCPI0_0
        ldr     q0, [x0]
        ldr     q1, [x1]
        ldr     q2, [x8, :lo12:.LCPI0_0]
        tbl     z0.s, { z0.s, z1.s }, z2.s
        ret

It would also be good to have a test for this case (one of the shufflemask lanes being poison)

sdesmalen: When one of the indices is `poison`, it will encode that in the ShuffleMask with `-1`. Bailing…

if (Index < 0)

sdesmalenUnsubmitted

Done

I don't think you need to distinguish IsSingleOp here, because if IsSingleOp == true, then Offset > NumElts, so the code below already covers it.

sdesmalen: I don't think you need to distinguish `IsSingleOp` here, because if `IsSingleOp == true`, then…

sdesmalenUnsubmitted

Not Done

nit: I don't know if this can ever happen, but maybe it's more safe to write if (Index < 0) if any other values can be used to represent poison.

sdesmalen: nit: I don't know if this can ever happen, but maybe it's more safe to write `if (Index < 0)`…

Index = 0;

// If we refer to the second operand then we have to add elements

// number in hardware register minus number of elements in a type.

if ((unsigned)Index >= ElementsPerVectorReg)

Index += IndexLen - ElementsPerVectorReg;

// For 8-bit elements and 1024-bit SVE registers and MaxOffset equals

sdesmalenUnsubmitted

Done

If you pull out this condition to the top of the loop and return early for negative offsets, then in the rest of the loop you can assume Offset is unsigned and remove some of the (int) casts you have in there.

sdesmalen: If you pull out this condition to the top of the loop and return early for negative offsets…

// to 255, this might point to the last element of in the second operand

// of the shufflevector, thus we are rejecting this transform.

if ((unsigned)Index >= MaxOffset)

sdesmalenUnsubmitted

Done

If you define MaxOffset as (1 << BitsPerElt) - 1, you can remove the switch statement?

sdesmalen: If you define `MaxOffset` as `(1 << BitsPerElt) - 1`, you can remove the switch statement?

return SDValue();

TBLMask.push_back(DAG.getConstant(Index, DL, MVT::i64));

sdesmalenUnsubmitted

Done

You'll need to update the real offset before checking whether Offset >= MaxOffset. For example, for vscale_range(16, 16) the index into the next vector would always be at least 256, which cannot be represented in an i8.

Moving this code above the previous condition fixes that.

sdesmalen: You'll need to update the real offset before checking whether Offset >= MaxOffset. For example…

}

sdesmalenUnsubmitted

Done

nit: this statement is a bit unclear to me. Why is it better?

Do we want to say "Choosing an out-of-range index leads to the lane being zeroed".
Note that this is only relevant for i16+ types and for i8 types when the runtime VL < 2048.

sdesmalen: nit: this statement is a bit unclear to me. Why is it better? Do we want to say "Choosing an…

// Choosing an out-of-range index leads to the lane being zeroed vs zero

sdesmalenUnsubmitted

Done

nit: s/world perfrom/would perform/ ?

sdesmalen: nit: s/world perfrom/would perform/ ?

// value where it would perform first lane duplication for out of

// index elements. For i8 elements an out-of-range index could be a valid

sdesmalenUnsubmitted

Done

Please move this code up after

if (!MinSVESize)
  return SDValue();

sdesmalen: Please move this code up after if (!MinSVESize) return SDValue();

// for 2048-bit vector register size.

for (unsigned i = 0; i < IndexLen - ElementsPerVectorReg; ++i)

TBLMask.push_back(DAG.getConstant((int)MaxOffset, DL, MVT::i64));

sdesmalenUnsubmitted

Done

nit: you can simplify this to -1

sdesmalen: nit: you can simplify this to `-1`

sdesmalenUnsubmitted

Done

You're using unsigned here, but ShuffleMask is actually a vector of ints. undef elements are represented as index (int) -1. When you're comparing this with MaxOffset for example, things get a little confusing and I think this is hiding some bugs.

Can you instead make all the ShuffleMask values use signed integers?

sdesmalen: You're using `unsigned` here, but ShuffleMask is actually a vector of `int`s. `undef` elements…

EVT MaskEltType = EVT::getIntegerVT(*DAG.getContext(), BitsPerElt);

EVT MaskType = EVT::getVectorVT(*DAG.getContext(), MaskEltType, IndexLen);

EVT MaskContainerVT = getContainerForFixedLengthVector(DAG, MaskType);

SDValue VecMask =

sdesmalenUnsubmitted

Done

nit: unnecessary curly braces.

sdesmalen: nit: unnecessary curly braces.

DAG.getBuildVector(MaskType, DL, ArrayRef(TBLMask.data(), IndexLen));

sdesmalenUnsubmitted

Done

nit: you can sink the convertFromScalableVector out of the if/else conditions, and moving it to the return.

sdesmalen: nit: you can sink the `convertFromScalableVector` out of the if/else conditions, and moving it…

SDValue SVEMask = convertToScalableVector(DAG, MaskContainerVT, VecMask);

sdesmalenUnsubmitted

Done

This swap code seems to be unnecessary, because SelectionDAGBuilder already does this for you.

You can see that if you compile:

define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) vscale_range(16, 16) {
  %op0 = load <16 x i8>, ptr %ptr0
  %op1 = load <16 x i8>, ptr %ptr1
  ; Takes only elements from second vector
  %res = shufflevector <16 x i8> %op0, <16 x i8> %op1, <16 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 31, i32 31>
  ret <16 x i8> %res
}

Which results in:

t0: ch,glue = EntryToken
      t4: i64,ch = CopyFromReg t0, Register:i64 %1
    t8: v16i8,ch = load<(load (s128) from %ir.ptr1)> t0, t4, undef:i64
  t10: v16i8 = vector_shuffle<0,1,2,3,4,5,6,7,8,9,10,11,12,13,15,15> t8, undef:v16i8
t12: ch,glue = CopyToReg t0, Register:v16i8 $q0, t10
t13: ch = AArch64ISD::RET_GLUE t12, Register:v16i8 $q0, t12:1

That means you don't need to modify the Shufflemask at all, and you can call ShuffleVectorInst:: isSingleSourceMask(ShuffleMask) to calculate IsSingleOp. If isSingleSourceMask returns true, then you can assume it will select elements from the first source operand. That will simplify this patch quite a bit!

sdesmalen: This swap code seems to be unnecessary, because SelectionDAGBuilder already does this for you.

SDValue Shuffle;

if (IsSingleOp)

Shuffle =

DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,

DAG.getConstant(Intrinsic::aarch64_sve_tbl, DL, MVT::i32),

sdesmalenUnsubmitted

Done

Just to be safe, could you add an else branch here with llvm_unreachable("Cannot lower shuffle without SVE2 TBL"); ?

sdesmalen: Just to be safe, could you add an `else` branch here with `llvm_unreachable("Cannot lower…

Op1, SVEMask);

else if (Subtarget.hasSVE2())

Shuffle =

DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,

sdesmalenUnsubmitted

Done

Combine this into an else if

sdesmalen: Combine this into an `else if`

DAG.getConstant(Intrinsic::aarch64_sve_tbl2, DL, MVT::i32),

Op1, Op2, SVEMask);

else

llvm_unreachable("Cannot lower shuffle without SVE2 TBL");

Shuffle = convertFromScalableVector(DAG, VT, Shuffle);

return DAG.getNode(ISD::BITCAST, DL, Op.getValueType(), Shuffle);

}

SDValue AArch64TargetLowering::LowerFixedLengthVECTOR_SHUFFLEToSVE( SDValue AArch64TargetLowering::LowerFixedLengthVECTOR_SHUFFLEToSVE(

SDValue Op, SelectionDAG &DAG) const { SDValue Op, SelectionDAG &DAG) const {

EVT VT = Op.getValueType(); EVT VT = Op.getValueType();

assert(VT.isFixedLengthVector() && "Expected fixed length vector type!"); assert(VT.isFixedLengthVector() && "Expected fixed length vector type!");

auto *SVN = cast<ShuffleVectorSDNode>(Op.getNode()); auto *SVN = cast<ShuffleVectorSDNode>(Op.getNode());

auto ShuffleMask = SVN->getMask(); auto ShuffleMask = SVN->getMask();

▲ Show 20 Lines • Show All 129 Lines • ▼ Show 20 Lines if (MinSVESize == MaxSVESize && MaxSVESize == VT.getSizeInBits()) {

if (isUZP_v_undef_Mask(ShuffleMask, VT, WhichResult)) { if (isUZP_v_undef_Mask(ShuffleMask, VT, WhichResult)) {

unsigned Opc = (WhichResult == 0) ? AArch64ISD::UZP1 : AArch64ISD::UZP2; unsigned Opc = (WhichResult == 0) ? AArch64ISD::UZP1 : AArch64ISD::UZP2;

return convertFromScalableVector( return convertFromScalableVector(

DAG, VT, DAG.getNode(Opc, DL, ContainerVT, Op1, Op1)); DAG, VT, DAG.getNode(Opc, DL, ContainerVT, Op1, Op1));

} }

// Avoid producing TBL instruction if we don't know SVE register minimal size.

if (MinSVESize)

sdesmalenUnsubmitted

Done

nit: odd line-break.

sdesmalen: nit: odd line-break.

return GenerateFixedLengthSVETBL(Op, Op1, Op2, ShuffleMask, VT, ContainerVT,

DAG);

return SDValue(); return SDValue();

sdesmalenUnsubmitted

Done

This could do with a code comment describing why we don't do this when we don't know the exact SVE size.

I think there are several cases to distinguish here:

Elements are selected from either the first source vector, or from the second, but not both. This can be implemented with SVE and when the SVE size isn't known.
Elements are selected from both source vectors.
- If we only have SVE, then we can mimic this with multiple (single-source) TBL instructions and selects, but I suspect there's little value in handling this, given that SVE2 is the prevalent feature going forward.
- If we don't know the size, then we need to do more work to update the indices for the TBL. For example:

<2 x i64> %x, <2 x i64> %y, <4 x i32> <0, 1, 2, 3>

if SVE VL = 256, then we'd basically insert the <2 x i64> vector value into a conceptual <4 x i64> register where only the first two lanes are relevant, so the offsets must be: <0, 1, 4, 5>
if SVE VL = 512, then we'd basically insert the <2 x i64> vector value into a conceptual <8 x i64> register where only the first two lanes are relevant, so the offsets must be: <0, 1, 8, 9>

So we'd somehow need to do something like this:

<0, 1, 2, 3> + <0, 0, VL, VL>

sdesmalen: This could do with a code comment describing why we don't do this when we don't know the exact…

dtemirbulatovAuthorUnsubmitted

Done

I will address run-time index construction in the following patch.

dtemirbulatov: I will address run-time index construction in the following patch.

sdesmalenUnsubmitted

Done

It probably makes sense to move this functionality into its own function.

sdesmalen: It probably makes sense to move this functionality into its own function.

sdesmalenUnsubmitted

Not Done

Comment seems unaddressed.

sdesmalen: Comment seems unaddressed.

} }

sdesmalenUnsubmitted

Done

nit: tab?

sdesmalen: nit: tab?

SDValue AArch64TargetLowering::getSVESafeBitCast(EVT VT, SDValue Op, SDValue AArch64TargetLowering::getSVESafeBitCast(EVT VT, SDValue Op,

sdesmalenUnsubmitted

Done

It's not really clear to me what Swap and IsUndefOrZero are supposed to do. I would have expected it to look at the indices rather than the source vector operands. Can you add a comment explaining this logic?

sdesmalen: It's not really clear to me what `Swap` and `IsUndefOrZero` are supposed to do. I would have…

dtemirbulatovAuthorUnsubmitted

Done

Ok, I will comment for those variables. Also, for example. If the mask is refers to only one operand then second operand would be already undefined at the time of lowering. I think we don't have to check the mask here.

dtemirbulatov: Ok, I will comment for those variables. Also, for example. If the mask is refers to only one…

sdesmalenUnsubmitted

Done

There is an implicit assumption that a DAGCombine (or perhaps an InstCombine) has rewritten the second operand to be undef. It seems safer to look at the values in the Mask instead.

sdesmalen: There is an implicit assumption that a DAGCombine (or perhaps an InstCombine) has rewritten the…

SelectionDAG &DAG) const { SelectionDAG &DAG) const {

SDLoc DL(Op); SDLoc DL(Op);

EVT InVT = Op.getValueType(); EVT InVT = Op.getValueType();

assert(VT.isScalableVector() && isTypeLegal(VT) && assert(VT.isScalableVector() && isTypeLegal(VT) &&

InVT.isScalableVector() && isTypeLegal(InVT) && InVT.isScalableVector() && isTypeLegal(InVT) &&

"Only expect to cast between legal scalable vector types!"); "Only expect to cast between legal scalable vector types!");

assert(VT.getVectorElementType() != MVT::i1 && assert(VT.getVectorElementType() != MVT::i1 &&

InVT.getVectorElementType() != MVT::i1 && InVT.getVectorElementType() != MVT::i1 &&

"For predicate bitcasts, use getSVEPredicateBitCast"); "For predicate bitcasts, use getSVEPredicateBitCast");

sdesmalenUnsubmitted

Done

You seem to have removed checks to ask if Op1 is poison/undef or zeroinitializer.

The way you've written the code now, Op1IsUndefOrZero should be renamed instead to Op1IsUnused. (same for Op2IsUnused)

sdesmalen: You seem to have removed checks to ask if Op1 is `poison`/`undef` or `zeroinitializer`. The…

if (InVT == VT) if (InVT == VT)

return Op; return Op;

EVT PackedVT = getPackedSVEVectorVT(VT.getVectorElementType()); EVT PackedVT = getPackedSVEVectorVT(VT.getVectorElementType());

CarolineConcattoUnsubmitted

Done

Why do we need to test this? What other value it can have?
I thought the only possible values were this 8/16/32/64. Is it possible to have 128, 1024? Are these values we are avoiding?

CarolineConcatto: Why do we need to test this? What other value it can have? I thought the only possible values…

dtemirbulatovAuthorUnsubmitted

Done

yes, The element type chack was already done in useSVEForFixedLengthVectorVT() function.

dtemirbulatov: yes, The element type chack was already done in useSVEForFixedLengthVectorVT() function.

EVT PackedInVT = getPackedSVEVectorVT(InVT.getVectorElementType()); EVT PackedInVT = getPackedSVEVectorVT(InVT.getVectorElementType());

sdesmalenUnsubmitted

Done

I think you need to call this Op2IsUndefOrZero

sdesmalen: I think you need to call this `Op2IsUndefOrZero`

sdesmalenUnsubmitted

Done

Is the condition && (Val < NumElts * 2) needed? This seems already covered by the if (NumElts > IndexLen) check above on line 25844.

sdesmalen: Is the condition `&& (Val < NumElts * 2)` needed? This seems already covered by the `if…

dtemirbulatovAuthorUnsubmitted

Done

yes, we need this check to understand that we just electing from operand 2 and operand 1 is unused.

dtemirbulatov: yes, we need this check to understand that we just electing from operand 2 and operand 1 is…

// Safe bitcasting between unpacked vector types of different element counts // Safe bitcasting between unpacked vector types of different element counts

// is currently unsupported because the following is missing the necessary // is currently unsupported because the following is missing the necessary

// work to ensure the result's elements live where they're supposed to within // work to ensure the result's elements live where they're supposed to within

sdesmalenUnsubmitted

Done

It's probably easier to update the ShuffleMask here. That simplifies the logic below as it avoids having to constantly distinguish between the Op*IsUndefOrZero cases.

sdesmalen: It's probably easier to update the ShuffleMask here. That simplifies the logic below as it…

dtemirbulatovAuthorUnsubmitted

Done

hmm, ShuffleMask is readonly variable.

dtemirbulatov: hmm, ShuffleMask is readonly variable.

sdesmalenUnsubmitted

Done

You can create a new ShuffleMask variable, which is writable.

sdesmalen: You can create a new ShuffleMask variable, which is writable.

// an SVE register. // an SVE register.

// 01234567 // 01234567

// e.g. nxv2i32 = XX??XX?? // e.g. nxv2i32 = XX??XX??

// nxv4f16 = X?X?X?X? // nxv4f16 = X?X?X?X?

sdesmalenUnsubmitted

Done

Hardcoding a check for '128' doesn't seem right to me. It all depends on what can be represented in the element type of the instruction, i.e. for i8's the max value is 255. That means when targeting vscale_range(16, 16), then 255 is a valid index into the first source vector.

sdesmalen: Hardcoding a check for '128' doesn't seem right to me. It all depends on what can be…

dtemirbulatovAuthorUnsubmitted

Done

yes and for second source operand the index starts from 256, so that is what we are checking here is we have to address to both operands then it is going to be 255 + index and we already established that both operands involved.

dtemirbulatov: yes and for second source operand the index starts from 256, so that is what we are checking…

assert((VT.getVectorElementCount() == InVT.getVectorElementCount() || assert((VT.getVectorElementCount() == InVT.getVectorElementCount() ||

sdesmalenUnsubmitted

Done

It doesn't seem right to look at the number of indices in the mask. I think rather it should be about the maximum index value in the mask?

sdesmalen: It doesn't seem right to look at the //number of indices// in the mask. I think rather it…

dtemirbulatovAuthorUnsubmitted

Done

We already checked all values in the mask.

dtemirbulatov: We already checked all values in the mask.

sdesmalenUnsubmitted

Done

I think you can make this into an assert that NumElts <= IndexLen, because at the point of lowering the operation, the types will have been legalised in such a way that the number of lanes of the mask >= number of lanes of the source operand.

sdesmalen: I think you can make this into an assert that NumElts <= IndexLen, because at the point of…

VT == PackedVT || InVT == PackedInVT) && VT == PackedVT || InVT == PackedInVT) &&

sdesmalenUnsubmitted

Done

There are two things not right about this:

The contents of Op1 and Op2 shouldn't matter for the lowering of the shuffle, so there is no reason to bail out here. (If you remove this, does any test fail?)
isZerosVector will never return true, because Op1 and Op2 are INSERT_SUBVECTOR nodes because they are assigned with convertToScalableVector(), and INSERT_SUBVECTOR is not considered in isZerosVector.

sdesmalen: There are two things not right about this: 1. The **contents** of `Op1` and `Op2` shouldn't…

dtemirbulatovAuthorUnsubmitted

Done

Sorry, I left isZerosVector() check by accident.

dtemirbulatov: Sorry, I left isZerosVector() check by accident.

"Unexpected bitcast!"); "Unexpected bitcast!");

CarolineConcattoUnsubmitted

Done

For me this tests is a bit odd.
FillElements is unsigned and not a bool. So are you checking if IndexLen> NumElts?
Is it too difficult to put and example for each case with the fixed mask. So we can see how the mask was and how it has changed.
if (Swap)
for Offset < NumElts ->mask = <0,0,7,7> -> <3,3,0,0>
for Offset > NumElts ->mask = <0,0,7,7>

Maybe this will help to understand if there is any hidden problem with these ifs.

I re-wrote that to something like:

// Set the Mask with the correct indexes
SmallVector<SDValue, 8> TBLMask;
for (int Val : ShuffleMask) {
  unsigned Offset = Val;
  if (Swap)
    Offset = Offset < NumElts ? Offset + NumElts : Offset - NumElts;
  // why we need to set the mask to 255 when one of the vectors is zero/undef?
  else if(Offset >= NumElts){
     if(IsUndefOrZero)
         Offset = 255;
     else if (IndexLen > NumElts)
         Offset += IndexLen - NumElts;
  }
  TBLMask.push_back(DAG.getConstant(Offset, DL, MVT::i64));
}

CarolineConcatto: For me this tests is a bit odd. FillElements is unsigned and not a bool. So are you checking if…

dtemirbulatovAuthorUnsubmitted

Done

Thanks for finding this, I think now that "IndexLen> NumElts" logic/transform was not correct.

dtemirbulatov: Thanks for finding this, I think now that "IndexLen> NumElts" logic/transform was not correct.

// Pack input if required. // Pack input if required.

if (InVT != PackedInVT) if (InVT != PackedInVT)

Op = DAG.getNode(AArch64ISD::REINTERPRET_CAST, DL, PackedInVT, Op); Op = DAG.getNode(AArch64ISD::REINTERPRET_CAST, DL, PackedInVT, Op);

CarolineConcattoUnsubmitted

Done

I would suggest add a comment here.
You are populating the rest of the mask here with 255, why?
// Filling up the remaining positions of the mask with 255 because ...

CarolineConcatto: I would suggest add a comment here. You are populating the rest of the mask here with 255, why?

sdesmalenUnsubmitted

Done

What should happen if the vector type is <256 x i8> and IsUndefOrZero is false ?

sdesmalen: What should happen if the vector type is `<256 x i8>` and IsUndefOrZero is `false` ?

dtemirbulatovAuthorUnsubmitted

Not Done

I think

dtemirbulatov: I think

dtemirbulatovAuthorUnsubmitted

Done

I think we allowing this lowering to only one operand supported with SVE1 TBL.

dtemirbulatov: I think we allowing this lowering to only one operand supported with SVE1 TBL.

Op = DAG.getNode(ISD::BITCAST, DL, PackedVT, Op); Op = DAG.getNode(ISD::BITCAST, DL, PackedVT, Op);

CarolineConcattoUnsubmitted

Done

Are you here changing a fixed vector to a scalable vector to use tbl?

CarolineConcatto: Are you here changing a fixed vector to a scalable vector to use tbl?

dtemirbulatovAuthorUnsubmitted

Done

Right, as those instructions accept only scalable type operand and result.

dtemirbulatov: Right, as those instructions accept only scalable type operand and result.

// Unpack result if required. // Unpack result if required.

if (VT != PackedVT) if (VT != PackedVT)

sdesmalenUnsubmitted

Done

I'm curious, why is that better?

Also, as my comment above suggests, for <256 x i8> vectors, 255 is just the last element of the first data-vector. There is also some test-coverage missing here.

sdesmalen: I'm curious, why is that better? Also, as my comment above suggests, for `<256 x i8>` vectors…

dtemirbulatovAuthorUnsubmitted

Done

I think that if we leave uninitialized remaining out of scope indexes then they will end up with 0 value and means extra work since 0 element almost always means 0 lane and with 255 value probably hardware could understand this is out of scope value, except <256 x i8> case which a it rarely than valid 0 index. The last available value for index here is 255 then it flips to 0.

dtemirbulatov: I think that if we leave uninitialized remaining out of scope indexes then they will end up…

sdesmalenUnsubmitted

Done

While it may be rare, it must be correct for all cases. If the LLVM IR instruction requests an element from a zeroinitializer vector is selected, then it must make sure to do so. That means that 255 may not always be the correct value. Maybe it should just return SDValue() for that case, or explicitly materialize a zeroinitializer vector to select the value from.

sdesmalen: While it may be rare, it must be correct for all cases. If the LLVM IR instruction requests an…

Op = DAG.getNode(AArch64ISD::REINTERPRET_CAST, DL, VT, Op); Op = DAG.getNode(AArch64ISD::REINTERPRET_CAST, DL, VT, Op);

sdesmalenUnsubmitted

Done

What is the meaning of 'Op1IsUnused' after swapping Op1 and Op2? Should that be Op2IsUnused now?

To me it seems better to update the ShuffleMask directly and remove the need for Op*IsUnused altogether

sdesmalen: What is the meaning of 'Op1IsUnused' after swapping Op1 and Op2? Should that be Op2IsUnused now?

dtemirbulatovAuthorUnsubmitted

Done

replaced both variables with a single one 'IsSingleOp'.

dtemirbulatov: replaced both variables with a single one 'IsSingleOp'.

return Op; return Op;

} }

bool AArch64TargetLowering::isAllActivePredicate(SelectionDAG &DAG, bool AArch64TargetLowering::isAllActivePredicate(SelectionDAG &DAG,

sdesmalenUnsubmitted

Done

While not wrong, I don't really like '255' because it is a hard-coded value. For i8 elements, it may well be a valid index. For i16 or larger elements, there's no need to hard-code 255. Maybe just use '(1 << elementSize) - 1'?

sdesmalen: While not wrong, I don't really like '255' because it is a hard-coded value. For i8 elements…

SDValue N) const { SDValue N) const {

sdesmalenUnsubmitted

Done

It's normally better to bail out early, e.g.

if (condition) {
  // lots of code
}

return SDValue();

if (!condition)
 return SDValue
// lots of code

sdesmalen: It's normally better to bail out early, e.g. if (condition) { // lots of code }…

return ::isAllActivePredicate(DAG, N); return ::isAllActivePredicate(DAG, N);

} }

sdesmalenUnsubmitted

Done

I think this condition can be removed? (if it has swapped, then the second vector must be zero or undef?)

sdesmalen: I think this condition can be removed? (if it has swapped, then the second vector must be zero…

dtemirbulatovAuthorUnsubmitted

Done

No, IsUndefOrZero tells us that only operand 0 is used. I will add a test and rename the variable.

dtemirbulatov: No, IsUndefOrZero tells us that only operand 0 is used. I will add a test and rename the…

EVT AArch64TargetLowering::getPromotedVTForPredicate(EVT VT) const { EVT AArch64TargetLowering::getPromotedVTForPredicate(EVT VT) const {

sdesmalenUnsubmitted

Not Done

This condition and the comment above make little sense to me. The issue to solve is instead about the maximum *index* value in the shuffle mask. If the maximum value being indexed cannot be represented in an i8, then we cannot use TBL2 to lower this shuffle vector.

For example:

define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) {
  %op0 = load <256 x i8>, ptr %ptr0
  %op1 = load <256 x i8>, ptr %ptr1
  ; Cannot use TBL2 instruction because any offset > 255 cannot be represented in an i8!
  %res = shufflevector <256 x i8> %op0, <256 x i8> %op1, <16 x i32> <i32 0,i32 1,i32 2,i32 3,i32 4, i32 5, i32 6, i32 7, i32 256,i32 257, i32 258, i32 259, i32 260, i32 261, i32 262, i32 263>
  ret <16 x i8> %res
}

However, the code in the patch seems to accept this without an issue and generates wrong code with the wrong indices.

.LCPI0_0:
        .byte   0                               // 0x0
        .byte   1                               // 0x1
        .byte   2                               // 0x2
        .byte   3                               // 0x3
        .byte   4                               // 0x4
        .byte   5                               // 0x5
        .byte   6                               // 0x6
        .byte   7                               // 0x7
        .byte   16                              // 0x10
        .byte   33                              // 0x21
        .byte   34                              // 0x22
        .byte   35                              // 0x23
        .byte   36                              // 0x24
        .byte   37                              // 0x25
        .byte   38                              // 0x26
        .byte   39                              // 0x27
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .byte   255                             // 0xff
        .text
        .globl  foo
        .p2align        2
        .type   foo,@function

foo:                                    // @foo
        .cfi_startproc
// %bb.0:
        adrp    x8, .LCPI0_0
        add     x8, x8, :lo12:.LCPI0_0
        ptrue   p0.b
        ldr     q1, [x1]
        ldr     q0, [x0]
        ld1b    { z2.b }, p0/z, [x8]
        tbl     z0.b, { z0.b, z1.b }, z2.b
                                        // kill: def $q0 killed $q0 killed $z0
        ret

sdesmalen: This condition and the comment above make little sense to me. The issue to solve is instead…

dtemirbulatovAuthorUnsubmitted

Done

I think this is correct, here is full DAG dump:
SelectionDAG has 16 nodes:

t0: ch,glue = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
t18: v16i8,ch = load<(load (s128) from %ir.ptr0, align 256)> t0, t2, undef:i64
  t4: i64,ch = CopyFromReg t0, Register:i64 %1
t16: v16i8,ch = load<(load (s128) from %ir.ptr1, align 256)> t0, t4, undef:i64
  t11: v16i8 = vector_shuffle<0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23> t18, t16
t13: ch,glue = CopyToReg t0, Register:v16i8 $q0, t11
t22: nxv16i8 = insert_subvector undef:nxv16i8, t18, Constant:i64<0>
t23: nxv16i8 = insert_subvector undef:nxv16i8, t16, Constant:i64<0>
t14: ch = AArch64ISD::RET_GLUE t13, Register:v16i8 $q0, t13:1

dtemirbulatov: I think this is correct, here is full DAG dump: SelectionDAG has 16 nodes: t0: ch,glue =…

sdesmalenUnsubmitted

Done

It seems SelectionDAGBuilder has extracted a subregister from the two <256 x i8> vectors, hence why the offsets are valid. This clearly wasn't the right example. The following case can't be transformed like that:

define <16 x i8> @foo(ptr %ptr0, ptr %ptr1) vscale_range(16, 16) {
  %op0 = load <256 x i8>, ptr %ptr0
  %op1 = load <256 x i8>, ptr %ptr1
  ; Cannot use TBL2 instruction because any offset > 255 cannot be represented in an i8!
  %res = shufflevector <256 x i8> %op0, <256 x i8> %op1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 252, i32 253, i32 254, i32 255, i32 256, i32 257, i32 258, i32 259, i32 508, i32 509, i32 510, i32 511>
  ret <16 x i8> %res
}

It's only because SelectionDAG inserts the mask into a <256 x i8> vector that your code happens to discard this case (because that will fail the check MaxOffset <= NumElts * 2). I think I'd still prefer an explicit check that the maximum index value in the ShuffleMask doesn't exceed the maximum value of the element type.

sdesmalen: It seems SelectionDAGBuilder has extracted a subregister from the two <256 x i8> vectors, hence…

return ::getPromotedVTForPredicate(VT); return ::getPromotedVTForPredicate(VT);

sdesmalenUnsubmitted

Done

Please check this condition earlier up for an early bail out! (at same place where it checks MinSVESize)

sdesmalen: Please check this condition earlier up for an early bail out! (at same place where it checks…

dtemirbulatovAuthorUnsubmitted

Done

dtemirbulatov: No, Most of those checks related to support TBL2 both opperand used and at the place where…

dtemirbulatovAuthorUnsubmitted

Done

And for a single operand case it is enough to know just MinSVESize.

dtemirbulatov: And for a single operand case it is enough to know just MinSVESize.

} }

bool AArch64TargetLowering::SimplifyDemandedBitsForTargetNode( bool AArch64TargetLowering::SimplifyDemandedBitsForTargetNode(

sdesmalenUnsubmitted

Done

What is the type of VecMask here? How can the type of VecMask here (single-vector TBL instruction) be the same as for the case of the SVE2 two-vector TBL instruction?

sdesmalen: What is the type of VecMask here? How can the type of VecMask here (single-vector TBL…

dtemirbulatovAuthorUnsubmitted

Done

This is not correct, According to specification index size is similar for TBL, TBL2.

dtemirbulatov: This is not correct, According to specification index size is similar for TBL, TBL2.

SDValue Op, const APInt &OriginalDemandedBits, SDValue Op, const APInt &OriginalDemandedBits,

sdesmalenUnsubmitted

Done

For a shufflevector that selects IndexLen elements from Op1/Op2, this code seems to be filling the first VL - IndexLen elements with some initializer (255 in this case). It seems odd to me that this loop starts at 0 and I doubt this has the desired effect.

I'm not sure to what extend it matters what value we use for 'undef' lanes. Can we just initialize the TBLMask with all numeric_limits<type>::max index value?

sdesmalen: For a shufflevector that selects `IndexLen` elements from Op1/Op2, this code seems to be…

const APInt &OriginalDemandedElts, KnownBits &Known, TargetLoweringOpt &TLO, const APInt &OriginalDemandedElts, KnownBits &Known, TargetLoweringOpt &TLO,

sdesmalenUnsubmitted

Done

nit:

} else if (Subtarget->hasSVE2()) {
   ...
} else
  return SDValue();

sdesmalen: nit: } else if (Subtarget->hasSVE2()) { ... } else return SDValue(); ?

unsigned Depth) const { unsigned Depth) const {

sdesmalenUnsubmitted

Done

Op1, SVEMask));

- } else {

- if (Subtarget->hasSVE2())

- Shuffle = convertFromScalableVector(

- DAG, VT,

- DAG.getNode(

- ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,

- DAG.getConstant(Intrinsic::aarch64_sve_tbl2, DL, MVT::i32), Op1,

- Op2, SVEMask));

- else

- return SDValue();

+ } else if (Subtarget->hasSVE2()) {

+ Shuffle = convertFromScalableVector(

+ DAG, VT,

+ DAG.getNode(

+ ISD::INTRINSIC_WO_CHAIN, DL, ContainerVT,

+ DAG.getConstant(Intrinsic::aarch64_sve_tbl2, DL, MVT::i32), Op1,

+ Op2, SVEMask));

}

+ if (!Shuffle)

+ return SDValue();

return DAG.getNode(ISD::BITCAST, DL, Op.getValueType(), Shuffle);

}

SDValue AArch64TargetLowering::getSVESafeBitCast(EVT VT, SDValue Op,

nit: what about this?

sdesmalen: nit: what about this?

unsigned Opc = Op.getOpcode(); unsigned Opc = Op.getOpcode();

switch (Opc) { switch (Opc) {

case AArch64ISD::VSHL: { case AArch64ISD::VSHL: {

// Match (VSHL (VLSHR Val X) X) // Match (VSHL (VLSHR Val X) X)

SDValue ShiftL = Op; SDValue ShiftL = Op;

SDValue ShiftR = Op->getOperand(0); SDValue ShiftR = Op->getOperand(0);

if (ShiftR->getOpcode() != AArch64ISD::VLSHR) if (ShiftR->getOpcode() != AArch64ISD::VLSHR)

return false; return false;

if (!ShiftL.hasOneUse() || !ShiftR.hasOneUse()) if (!ShiftL.hasOneUse() || !ShiftR.hasOneUse())

return false; return false;

unsigned ShiftLBits = ShiftL->getConstantOperandVal(1); unsigned ShiftLBits = ShiftL->getConstantOperandVal(1);

unsigned ShiftRBits = ShiftR->getConstantOperandVal(1); unsigned ShiftRBits = ShiftR->getConstantOperandVal(1);

sdesmalenUnsubmitted

Done

Please write this as:

} else if (Subtarget->hasSVE2()) {
  ...
} else
  return SDValue();

sdesmalen: Please write this as: } else if (Subtarget->hasSVE2()) { ... } else return SDValue…

// Other cases can be handled as well, but this is not // Other cases can be handled as well, but this is not

// implemented. // implemented.

if (ShiftRBits != ShiftLBits) if (ShiftRBits != ShiftLBits)

return false; return false;

unsigned ScalarSize = Op.getScalarValueSizeInBits(); unsigned ScalarSize = Op.getScalarValueSizeInBits();

assert(ScalarSize > ShiftLBits && "Invalid shift imm"); assert(ScalarSize > ShiftLBits && "Invalid shift imm");

▲ Show 20 Lines • Show All 187 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/sve-fixed-length-permute-rev.ll

Show First 20 Lines • Show All 188 Lines • ▼ Show 20 Lines	; VBITS_GE_512-NEXT: ret
store <32 x i16> %tmp2, ptr %a		store <32 x i16> %tmp2, ptr %a
ret void		ret void
}		}

; Only support to reverse bytes / halfwords / words within elements		; Only support to reverse bytes / halfwords / words within elements
define void @test_rev_elts_fail(ptr %a) #1 {		define void @test_rev_elts_fail(ptr %a) #1 {
; CHECK-LABEL: test_rev_elts_fail:		; CHECK-LABEL: test_rev_elts_fail:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
; CHECK-NEXT: sub x9, sp, #48
; CHECK-NEXT: mov x29, sp
; CHECK-NEXT: and sp, x9, #0xffffffffffffffe0
; CHECK-NEXT: .cfi_def_cfa w29, 16
; CHECK-NEXT: .cfi_offset w30, -8
; CHECK-NEXT: .cfi_offset w29, -16
; CHECK-NEXT: ptrue p0.d		; CHECK-NEXT: ptrue p0.d
		; CHECK-NEXT: adrp x8, .LCPI11_0
		; CHECK-NEXT: add x8, x8, :lo12:.LCPI11_0
; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]		; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
; CHECK-NEXT: mov z1.d, z0.d[2]		; CHECK-NEXT: ld1d { z1.d }, p0/z, [x8]
; CHECK-NEXT: mov z2.d, z0.d[3]		; CHECK-NEXT: tbl z0.d, { z0.d }, z1.d
; CHECK-NEXT: mov x9, v0.d[1]
; CHECK-NEXT: fmov x8, d1
; CHECK-NEXT: fmov x10, d2
; CHECK-NEXT: stp x10, x8, [sp, #16]
; CHECK-NEXT: fmov x8, d0
; CHECK-NEXT: stp x9, x8, [sp]
; CHECK-NEXT: mov x8, sp
; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
; CHECK-NEXT: st1d { z0.d }, p0, [x0]		; CHECK-NEXT: st1d { z0.d }, p0, [x0]
; CHECK-NEXT: mov sp, x29
; CHECK-NEXT: ldp x29, x30, [sp], #16 // 16-byte Folded Reload
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%tmp1 = load <4 x i64>, ptr %a		%tmp1 = load <4 x i64>, ptr %a
%tmp2 = shufflevector <4 x i64> %tmp1, <4 x i64> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		%tmp2 = shufflevector <4 x i64> %tmp1, <4 x i64> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
store <4 x i64> %tmp2, ptr %a		store <4 x i64> %tmp2, ptr %a
ret void		ret void
}		}

; This is the same test as above, but with sve2p1 it can use the REVD instruction to reverse		; This is the same test as above, but with sve2p1 it can use the REVD instruction to reverse
Show All 28 Lines
}		}

; REV instruction will reverse the order of all elements in the vector.		; REV instruction will reverse the order of all elements in the vector.
; When the vector length and the target register size are inconsistent,		; When the vector length and the target register size are inconsistent,
; the correctness of generated REV instruction for shuffle pattern cannot be guaranteed.		; the correctness of generated REV instruction for shuffle pattern cannot be guaranteed.

; sve-vector-bits-min=256, sve-vector-bits-max is not set, REV inst can't be generated.		; sve-vector-bits-min=256, sve-vector-bits-max is not set, REV inst can't be generated.
define void @test_revv8i32(ptr %a) #0 {		define void @test_revv8i32(ptr %a) #0 {
; CHECK-LABEL: test_revv8i32:		; VBITS_GE_256-LABEL: test_revv8i32:
; CHECK: // %bb.0:		; VBITS_GE_256: // %bb.0:
; CHECK-NEXT: stp x29, x30, [sp, #-16]! // 16-byte Folded Spill		; VBITS_GE_256-NEXT: ptrue p0.s, vl8
; CHECK-NEXT: sub x9, sp, #48		; VBITS_GE_256-NEXT: index z0.s, #7, #-1
; CHECK-NEXT: mov x29, sp		; VBITS_GE_256-NEXT: ld1w { z1.s }, p0/z, [x0]
; CHECK-NEXT: and sp, x9, #0xffffffffffffffe0		; VBITS_GE_256-NEXT: tbl z0.s, { z1.s }, z0.s
; CHECK-NEXT: .cfi_def_cfa w29, 16		; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x0]
; CHECK-NEXT: .cfi_offset w30, -8		; VBITS_GE_256-NEXT: ret
; CHECK-NEXT: .cfi_offset w29, -16		;
; CHECK-NEXT: ptrue p0.s, vl8		; VBITS_GE_512-LABEL: test_revv8i32:
; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]		; VBITS_GE_512: // %bb.0:
; CHECK-NEXT: mov w8, v0.s[1]		; VBITS_GE_512-NEXT: ptrue p0.s, vl8
; CHECK-NEXT: mov w9, v0.s[2]		; VBITS_GE_512-NEXT: adrp x8, .LCPI14_0
; CHECK-NEXT: mov w10, v0.s[3]		; VBITS_GE_512-NEXT: add x8, x8, :lo12:.LCPI14_0
; CHECK-NEXT: fmov w11, s0		; VBITS_GE_512-NEXT: ptrue p1.s, vl16
; CHECK-NEXT: mov z1.s, z0.s[4]		; VBITS_GE_512-NEXT: ld1w { z0.s }, p0/z, [x0]
; CHECK-NEXT: mov z2.s, z0.s[5]		; VBITS_GE_512-NEXT: ld1w { z1.s }, p1/z, [x8]
; CHECK-NEXT: mov z3.s, z0.s[6]		; VBITS_GE_512-NEXT: tbl z0.s, { z0.s }, z1.s
; CHECK-NEXT: mov z0.s, z0.s[7]		; VBITS_GE_512-NEXT: st1w { z0.s }, p0, [x0]
; CHECK-NEXT: stp w8, w11, [sp, #24]		; VBITS_GE_512-NEXT: ret
; CHECK-NEXT: fmov w8, s1
; CHECK-NEXT: stp w10, w9, [sp, #16]
; CHECK-NEXT: fmov w9, s2
; CHECK-NEXT: stp w9, w8, [sp, #8]
; CHECK-NEXT: fmov w8, s3
; CHECK-NEXT: fmov w9, s0
; CHECK-NEXT: stp w9, w8, [sp]
; CHECK-NEXT: mov x8, sp
; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
; CHECK-NEXT: st1w { z0.s }, p0, [x0]
; CHECK-NEXT: mov sp, x29
; CHECK-NEXT: ldp x29, x30, [sp], #16 // 16-byte Folded Reload
; CHECK-NEXT: ret
%tmp1 = load <8 x i32>, ptr %a		%tmp1 = load <8 x i32>, ptr %a
%tmp2 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>		%tmp2 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
store <8 x i32> %tmp2, ptr %a		store <8 x i32> %tmp2, ptr %a
ret void		ret void
}		}

; REV pattern for v32i8 shuffle with vscale_range(2,2)		; REV pattern for v32i8 shuffle with vscale_range(2,2)
define void @test_revv32i8_vl256(ptr %a) #1 {		define void @test_revv32i8_vl256(ptr %a) #1 {
▲ Show 20 Lines • Show All 70 Lines • ▼ Show 20 Lines	; CHECK-NEXT: ret
store <8 x i32> %tmp3, ptr %a		store <8 x i32> %tmp3, ptr %a
ret void		ret void
}		}

; Illegal REV pattern.		; Illegal REV pattern.
define void @test_rev_fail(ptr %a) #1 {		define void @test_rev_fail(ptr %a) #1 {
; CHECK-LABEL: test_rev_fail:		; CHECK-LABEL: test_rev_fail:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
; CHECK-NEXT: sub x9, sp, #48
; CHECK-NEXT: mov x29, sp
; CHECK-NEXT: and sp, x9, #0xffffffffffffffe0
; CHECK-NEXT: .cfi_def_cfa w29, 16
; CHECK-NEXT: .cfi_offset w30, -8
; CHECK-NEXT: .cfi_offset w29, -16
; CHECK-NEXT: ptrue p0.h		; CHECK-NEXT: ptrue p0.h
		; CHECK-NEXT: adrp x8, .LCPI20_0
		; CHECK-NEXT: add x8, x8, :lo12:.LCPI20_0
; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]		; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
; CHECK-NEXT: mov z1.h, z0.h[8]		; CHECK-NEXT: ld1h { z1.h }, p0/z, [x8]
; CHECK-NEXT: fmov w8, s0		; CHECK-NEXT: tbl z0.h, { z0.h }, z1.h
; CHECK-NEXT: mov z2.h, z0.h[9]
; CHECK-NEXT: mov z3.h, z0.h[10]
; CHECK-NEXT: mov z4.h, z0.h[11]
; CHECK-NEXT: strh w8, [sp, #14]
; CHECK-NEXT: fmov w8, s1
; CHECK-NEXT: mov z1.h, z0.h[12]
; CHECK-NEXT: fmov w9, s2
; CHECK-NEXT: mov z2.h, z0.h[13]
; CHECK-NEXT: strh w8, [sp, #30]
; CHECK-NEXT: fmov w8, s3
; CHECK-NEXT: mov z3.h, z0.h[14]
; CHECK-NEXT: strh w9, [sp, #28]
; CHECK-NEXT: fmov w9, s4
; CHECK-NEXT: mov z4.h, z0.h[15]
; CHECK-NEXT: fmov w10, s2
; CHECK-NEXT: strh w8, [sp, #26]
; CHECK-NEXT: fmov w8, s1
; CHECK-NEXT: fmov w11, s3
; CHECK-NEXT: strh w9, [sp, #24]
; CHECK-NEXT: umov w9, v0.h[1]
; CHECK-NEXT: fmov w12, s4
; CHECK-NEXT: strh w10, [sp, #20]
; CHECK-NEXT: umov w10, v0.h[3]
; CHECK-NEXT: strh w8, [sp, #22]
; CHECK-NEXT: umov w8, v0.h[2]
; CHECK-NEXT: strh w11, [sp, #18]
; CHECK-NEXT: umov w11, v0.h[4]
; CHECK-NEXT: strh w12, [sp, #16]
; CHECK-NEXT: umov w12, v0.h[5]
; CHECK-NEXT: strh w9, [sp, #12]
; CHECK-NEXT: umov w9, v0.h[6]
; CHECK-NEXT: strh w8, [sp, #10]
; CHECK-NEXT: umov w8, v0.h[7]
; CHECK-NEXT: strh w10, [sp, #8]
; CHECK-NEXT: strh w11, [sp, #6]
; CHECK-NEXT: strh w12, [sp, #4]
; CHECK-NEXT: strh w9, [sp, #2]
; CHECK-NEXT: strh w8, [sp]
; CHECK-NEXT: mov x8, sp
; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
; CHECK-NEXT: st1h { z0.h }, p0, [x0]		; CHECK-NEXT: st1h { z0.h }, p0, [x0]
; CHECK-NEXT: mov sp, x29
; CHECK-NEXT: ldp x29, x30, [sp], #16 // 16-byte Folded Reload
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%tmp1 = load <16 x i16>, ptr %a		%tmp1 = load <16 x i16>, ptr %a
%tmp2 = shufflevector <16 x i16> %tmp1, <16 x i16> undef, <16 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8>		%tmp2 = shufflevector <16 x i16> %tmp1, <16 x i16> undef, <16 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0, i32 15, i32 14, i32 13, i32 12, i32 11, i32 10, i32 9, i32 8>
store <16 x i16> %tmp2, ptr %a		store <16 x i16> %tmp2, ptr %a
ret void		ret void
}		}

; Don't use SVE for 128-bit shuffle with two inputs		; Don't use SVE for 128-bit shuffle with two inputs
Show All 19 Lines
; CHECK-NEXT: orr x9, x8, #0xe		; CHECK-NEXT: orr x9, x8, #0xe
; CHECK-NEXT: st1 { v1.h }[4], [x9]		; CHECK-NEXT: st1 { v1.h }[4], [x9]
; CHECK-NEXT: orr x9, x8, #0xc		; CHECK-NEXT: orr x9, x8, #0xc
; CHECK-NEXT: st1 { v1.h }[5], [x9]		; CHECK-NEXT: st1 { v1.h }[5], [x9]
; CHECK-NEXT: orr x9, x8, #0x8		; CHECK-NEXT: orr x9, x8, #0x8
; CHECK-NEXT: st1 { v0.h }[5], [x10]		; CHECK-NEXT: st1 { v0.h }[5], [x10]
; CHECK-NEXT: orr x10, x8, #0x10		; CHECK-NEXT: orr x10, x8, #0x10
; CHECK-NEXT: st1 { v1.h }[7], [x9]		; CHECK-NEXT: st1 { v1.h }[7], [x9]
; CHECK-NEXT: orr x9, x8, #0x4		; CHECK-NEXT: orr x9, x8, #0x4
; CHECK-NEXT: st1 { v0.h }[3], [x10]		; CHECK-NEXT: st1 { v0.h }[3], [x10]
		sdesmalenUnsubmitted Done Reply Inline Actions Seems like these changes can be committed separately. sdesmalen: Seems like these changes can be committed separately.
; CHECK-NEXT: mov w10, #26 // =0x1a		; CHECK-NEXT: mov w10, #26 // =0x1a
; CHECK-NEXT: st1 { v1.h }[1], [x9]		; CHECK-NEXT: st1 { v1.h }[1], [x9]
; CHECK-NEXT: orr x9, x8, #0x2		; CHECK-NEXT: orr x9, x8, #0x2
; CHECK-NEXT: st1 { v1.h }[2], [x9]		; CHECK-NEXT: st1 { v1.h }[2], [x9]
; CHECK-NEXT: orr x9, x8, x10		; CHECK-NEXT: orr x9, x8, x10
; CHECK-NEXT: mov w10, #20 // =0x14		; CHECK-NEXT: mov w10, #20 // =0x14
; CHECK-NEXT: st1 { v0.h }[6], [x9]		; CHECK-NEXT: st1 { v0.h }[6], [x9]
; CHECK-NEXT: orr x9, x8, x10		; CHECK-NEXT: orr x9, x8, x10
Show All 25 Lines

llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mattr=+sve2 -force-streaming-compatible-sve -aarch64-sve-vector-bits-min=128 -aarch64-sve-vector-bits-max=128 < %s \| FileCheck %s -check-prefixes=CHECK,SVE2_128
				; RUN: llc -mattr=+sve2 -force-streaming-compatible-sve -aarch64-sve-vector-bits-min=128 < %s \| FileCheck %s -check-prefixes=CHECK,SVE2_128_NOMAX
				sdesmalenUnsubmitted Not Done Reply Inline Actions Can you remove this RUN line? This RUN line is identical to the last RUN line. sdesmalen: Can you remove this RUN line? This RUN line is identical to the last RUN line.
				dtemirbulatovAuthorUnsubmitted Done Reply Inline Actions No, it is not here we targeting sve, not sve2. dtemirbulatov: No, it is not here we targeting sve, not sve2.

				sdesmalenUnsubmitted Done Reply Inline Actions Can you stick with one vector width (e.g. min==max==128)? I don't think the line below (min==max==256) is testing anything that the current RUN line isn't already testing when `-force-stremaing-compatible` is enabled. sdesmalen: Can you stick with one vector width (e.g. min==max==128)? I don't think the line below…
				target triple = "aarch64-unknown-linux-gnu"

				; SVE2_128: .LCPI0_0:
				; SVE2_128-NEXT: .byte 0 // 0x0
				; SVE2_128-NEXT: .byte 7 // 0x7
				; SVE2_128-NEXT: .byte 2 // 0x2
				; SVE2_128-NEXT: .byte 3 // 0x3
				; SVE2_128-NEXT: .byte 4 // 0x4
				; SVE2_128-NEXT: .byte 5 // 0x5
				; SVE2_128-NEXT: .byte 6 // 0x6
				; SVE2_128-NEXT: .byte 7 // 0x7
				; SVE2_128-NEXT: .byte 255 // 0xff
				; SVE2_128-NEXT: .byte 255 // 0xff
				define <8 x i8> @shuffle_index_indices_from_op1(ptr %a, ptr %b) {
				; CHECK-LABEL: shuffle_index_indices_from_op1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI0_0
				; CHECK-NEXT: ldr d0, [x0]
				; CHECK-NEXT: ldr q1, [x8, :lo12:.LCPI0_0]
				; CHECK-NEXT: tbl z0.b, { z0.b }, z1.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%op1 = load <8 x i8>, ptr %a
				%op2 = load <8 x i8>, ptr %b
				%1 = shufflevector <8 x i8> %op1, <8 x i8> %op2, <8 x i32> <i32 0, i32 7, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				ret <8 x i8> %1
				}

				; SVE2_128: .LCPI1_0:
				; SVE2_128-NEXT: .byte 0 // 0x0
				; SVE2_128-NEXT: .byte 1 // 0x1
				; SVE2_128-NEXT: .byte 1 // 0x1
				; SVE2_128-NEXT: .byte 3 // 0x3
				; SVE2_128-NEXT: .byte 4 // 0x4
				; SVE2_128-NEXT: .byte 7 // 0x7
				; SVE2_128-NEXT: .byte 6 // 0x6
				; SVE2_128-NEXT: .byte 7 // 0x7
				; SVE2_128-NEXT: .byte 255 // 0xff
				; SVE2_128-NEXT: .byte 255 // 0xff
				define <8 x i8> @shuffle_index_indices_from_op2(ptr %a, ptr %b) {
				; CHECK-LABEL: shuffle_index_indices_from_op2:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI1_0
				; CHECK-NEXT: ldr d0, [x1]
				; CHECK-NEXT: ldr q1, [x8, :lo12:.LCPI1_0]
				; CHECK-NEXT: tbl z0.b, { z0.b }, z1.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%op1 = load <8 x i8>, ptr %a
				%op2 = load <8 x i8>, ptr %b
				%1 = shufflevector <8 x i8> %op1, <8 x i8> %op2, <8 x i32> <i32 8, i32 9, i32 9, i32 11, i32 12, i32 15, i32 14, i32 15>
				ret <8 x i8> %1
				}

				; SVE2_128: .LCPI2_0:
				; SVE2_128-NEXT: .byte 1 // 0x1
				; SVE2_128-NEXT: .byte 17 // 0x11
				; SVE2_128-NEXT: .byte 18 // 0x12
				; SVE2_128-NEXT: .byte 19 // 0x13
				; SVE2_128-NEXT: .byte 20 // 0x14
				; SVE2_128-NEXT: .byte 20 // 0x14
				; SVE2_128-NEXT: .byte 22 // 0x16
				; SVE2_128-NEXT: .byte 23 // 0x17
				; SVE2_128-NEXT: .byte 255 // 0xff
				; SVE2_128-NEXT: .byte 255 // 0xff
				define <8 x i8> @shuffle_index_indices_from_both_ops(ptr %a, ptr %b) {
				; SVE2_128-LABEL: shuffle_index_indices_from_both_ops:
				; SVE2_128: // %bb.0:
				; SVE2_128-NEXT: adrp x8, .LCPI2_0
				; SVE2_128-NEXT: ldr d0, [x0]
				; SVE2_128-NEXT: ldr d1, [x1]
				; SVE2_128-NEXT: ldr q2, [x8, :lo12:.LCPI2_0]
				; SVE2_128-NEXT: tbl z0.b, { z0.b, z1.b }, z2.b
				; SVE2_128-NEXT: // kill: def $d0 killed $d0 killed $z0
				; SVE2_128-NEXT: ret
				;
				; SVE2_128_NOMAX-LABEL: shuffle_index_indices_from_both_ops:
				; SVE2_128_NOMAX: // %bb.0:
				; SVE2_128_NOMAX-NEXT: sub sp, sp, #16
				; SVE2_128_NOMAX-NEXT: .cfi_def_cfa_offset 16
				; SVE2_128_NOMAX-NEXT: ldr d0, [x1]
				; SVE2_128_NOMAX-NEXT: mov z1.b, z0.b[7]
				; SVE2_128_NOMAX-NEXT: mov z2.b, z0.b[6]
				; SVE2_128_NOMAX-NEXT: mov z3.b, z0.b[4]
				; SVE2_128_NOMAX-NEXT: fmov w8, s1
				; SVE2_128_NOMAX-NEXT: ldr d1, [x0]
				; SVE2_128_NOMAX-NEXT: fmov w9, s2
				; SVE2_128_NOMAX-NEXT: mov z2.b, z0.b[3]
				; SVE2_128_NOMAX-NEXT: mov z1.b, z1.b[1]
				; SVE2_128_NOMAX-NEXT: strb w8, [sp, #15]
				; SVE2_128_NOMAX-NEXT: fmov w8, s3
				; SVE2_128_NOMAX-NEXT: mov z3.b, z0.b[2]
				; SVE2_128_NOMAX-NEXT: strb w9, [sp, #14]
				; SVE2_128_NOMAX-NEXT: mov z0.b, z0.b[1]
				; SVE2_128_NOMAX-NEXT: fmov w9, s2
				; SVE2_128_NOMAX-NEXT: strb w8, [sp, #13]
				; SVE2_128_NOMAX-NEXT: strb w8, [sp, #12]
				; SVE2_128_NOMAX-NEXT: fmov w8, s3
				; SVE2_128_NOMAX-NEXT: strb w9, [sp, #11]
				; SVE2_128_NOMAX-NEXT: fmov w9, s0
				; SVE2_128_NOMAX-NEXT: strb w8, [sp, #10]
				; SVE2_128_NOMAX-NEXT: fmov w8, s1
				; SVE2_128_NOMAX-NEXT: strb w9, [sp, #9]
				; SVE2_128_NOMAX-NEXT: strb w8, [sp, #8]
				; SVE2_128_NOMAX-NEXT: ldr d0, [sp, #8]
				; SVE2_128_NOMAX-NEXT: add sp, sp, #16
				; SVE2_128_NOMAX-NEXT: ret
				sdesmalenUnsubmitted Done Reply Inline Actions Please rename this to something like `shuffle_index_indices_from_op1` to make it clear the mask takes all the lanes from `%op1`. sdesmalen: Please rename this to something like ` shuffle_index_indices_from_op1` to make it clear the…
				%op1 = load <8 x i8>, ptr %a
				%op2 = load <8 x i8>, ptr %b
				%1 = shufflevector <8 x i8> %op1, <8 x i8> %op2, <8 x i32> <i32 1, i32 9, i32 10, i32 11, i32 12, i32 12, i32 14, i32 15>
				ret <8 x i8> %1
				}

				; SVE2_128: .LCPI3_0:
				; SVE2_128-NEXT: .byte 1 // 0x1
				sdesmalenUnsubmitted Done Reply Inline Actions This test is missing check lines for the shuffle mask, could you please add it? sdesmalen: This test is missing check lines for the shuffle mask, could you please add it?
				dtemirbulatovAuthorUnsubmitted Done Reply Inline Actions No, for the poison value it generates different value for different conditions (hardware vector size, etc), but for this test always in range of offset for operand 1, I am not sure it is consistant. dtemirbulatov: No, for the poison value it generates different value for different conditions (hardware vector…
				sdesmalenUnsubmitted Done Reply Inline Actions I don't think there's a benefit to choosing one index over another, so what if you change the code to always choose lane `0` if the index < 0? That way, you can add the CHECK lines. sdesmalen: I don't think there's a benefit to choosing one index over another, so what if you change the…
				; SVE2_128-NEXT: .byte 17 // 0x11
				; SVE2_128-NEXT: .byte 18 // 0x12
				; SVE2_128-NEXT: .byte 19 // 0x13
				; SVE2_128-NEXT: .byte 20 // 0x14
				; SVE2_128-NEXT: .byte 20 // 0x14
				; SVE2_128-NEXT: .byte 22 // 0x16
				; SVE2_128-NEXT: .byte 0 // 0x0
				; SVE2_128-NEXT: .byte 255 // 0xff
				; SVE2_128-NEXT: .byte 255 // 0xff
				define <8 x i8> @shuffle_index_poison_value(ptr %a, ptr %b) {
				; SVE2_128-LABEL: shuffle_index_poison_value:
				; SVE2_128: // %bb.0:
				; SVE2_128-NEXT: adrp x8, .LCPI3_0
				; SVE2_128-NEXT: ldr d0, [x0]
				; SVE2_128-NEXT: ldr d1, [x1]
				; SVE2_128-NEXT: ldr q2, [x8, :lo12:.LCPI3_0]
				; SVE2_128-NEXT: tbl z0.b, { z0.b, z1.b }, z2.b
				; SVE2_128-NEXT: // kill: def $d0 killed $d0 killed $z0
				; SVE2_128-NEXT: ret
				;
				; SVE2_128_NOMAX-LABEL: shuffle_index_poison_value:
				; SVE2_128_NOMAX: // %bb.0:
				; SVE2_128_NOMAX-NEXT: sub sp, sp, #16
				; SVE2_128_NOMAX-NEXT: .cfi_def_cfa_offset 16
				; SVE2_128_NOMAX-NEXT: ldr d0, [x1]
				; SVE2_128_NOMAX-NEXT: ldr d3, [x0]
				; SVE2_128_NOMAX-NEXT: mov z1.b, z0.b[6]
				; SVE2_128_NOMAX-NEXT: mov z2.b, z0.b[4]
				; SVE2_128_NOMAX-NEXT: fmov w8, s1
				; SVE2_128_NOMAX-NEXT: mov z1.b, z0.b[3]
				; SVE2_128_NOMAX-NEXT: fmov w9, s2
				; SVE2_128_NOMAX-NEXT: mov z2.b, z0.b[2]
				; SVE2_128_NOMAX-NEXT: mov z0.b, z0.b[1]
				; SVE2_128_NOMAX-NEXT: strb w8, [sp, #14]
				; SVE2_128_NOMAX-NEXT: fmov w8, s1
				; SVE2_128_NOMAX-NEXT: mov z1.b, z3.b[1]
				; SVE2_128_NOMAX-NEXT: strb w9, [sp, #13]
				; SVE2_128_NOMAX-NEXT: strb w9, [sp, #12]
				; SVE2_128_NOMAX-NEXT: fmov w9, s2
				; SVE2_128_NOMAX-NEXT: strb w8, [sp, #11]
				; SVE2_128_NOMAX-NEXT: fmov w8, s0
				; SVE2_128_NOMAX-NEXT: strb w9, [sp, #10]
				; SVE2_128_NOMAX-NEXT: fmov w9, s1
				; SVE2_128_NOMAX-NEXT: strb w8, [sp, #9]
				; SVE2_128_NOMAX-NEXT: strb w9, [sp, #8]
				; SVE2_128_NOMAX-NEXT: ldr d0, [sp, #8]
				; SVE2_128_NOMAX-NEXT: add sp, sp, #16
				; SVE2_128_NOMAX-NEXT: ret
				%op1 = load <8 x i8>, ptr %a
				%op2 = load <8 x i8>, ptr %b
				%1 = shufflevector <8 x i8> %op1, <8 x i8> %op2, <8 x i32> <i32 1, i32 9, i32 10, i32 11, i32 12, i32 12, i32 14, i32 poison>
				ret <8 x i8> %1
				}

				define <8 x i8> @shuffle_op1_poison(ptr %a, ptr %b) {
				; CHECK-LABEL: shuffle_op1_poison:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI4_0
				; CHECK-NEXT: ldr d0, [x1]
				sdesmalenUnsubmitted Not Done Reply Inline Actions I'm a bit puzzled, why did you add this test? sdesmalen: I'm a bit puzzled, why did you add this test?
				; CHECK-NEXT: ldr q1, [x8, :lo12:.LCPI4_0]
				; CHECK-NEXT: tbl z0.b, { z0.b }, z1.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%op2 = load <8 x i8>, ptr %b
				%1 = shufflevector <8 x i8> poison, <8 x i8> %op2, <8 x i32> <i32 1, i32 9, i32 10, i32 11, i32 12, i32 12, i32 14, i32 15>
				ret <8 x i8> %1
				}

				; In this function, we could not represent indexes for the second operand
				; because for i8 type, the maximum constant in the mask is 256.
				define <8 x i8> @negative_test_shuffle_index_size_op_both_maxhw(ptr %a, ptr %b) "target-features"="+sve2" vscale_range(16,16) {
				; CHECK-LABEL: negative_test_shuffle_index_size_op_both_maxhw:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: ldr d0, [x1]
				; CHECK-NEXT: mov z1.b, z0.b[7]
				; CHECK-NEXT: mov z2.b, z0.b[6]
				; CHECK-NEXT: mov z3.b, z0.b[4]
				; CHECK-NEXT: fmov w8, s1
				; CHECK-NEXT: ldr d1, [x0]
				; CHECK-NEXT: fmov w9, s2
				; CHECK-NEXT: mov z2.b, z0.b[3]
				sdesmalenUnsubmitted Done Reply Inline Actions nit: I meant prepend do the start of the name, e.g. `@negative_test_shuffle_index_size_op_both_maxhw` sdesmalen: nit: I meant prepend do the start of the name, e.g.
				; CHECK-NEXT: mov z1.b, z1.b[1]
				; CHECK-NEXT: strb w8, [sp, #15]
				; CHECK-NEXT: fmov w8, s3
				; CHECK-NEXT: mov z3.b, z0.b[2]
				; CHECK-NEXT: strb w9, [sp, #14]
				; CHECK-NEXT: mov z0.b, z0.b[1]
				; CHECK-NEXT: fmov w9, s2
				; CHECK-NEXT: strb w8, [sp, #13]
				; CHECK-NEXT: strb w8, [sp, #12]
				; CHECK-NEXT: fmov w8, s3
				sdesmalenUnsubmitted Done Reply Inline Actions Please rename this to something like `shuffle_index_indices_from_op2` to make it clear the mask takes all the lanes from `%op2`. sdesmalen: Please rename this to something like `shuffle_index_indices_from_op2` to make it clear the…
				; CHECK-NEXT: strb w9, [sp, #11]
				; CHECK-NEXT: fmov w9, s0
				; CHECK-NEXT: strb w8, [sp, #10]
				; CHECK-NEXT: fmov w8, s1
				; CHECK-NEXT: strb w9, [sp, #9]
				; CHECK-NEXT: strb w8, [sp, #8]
				; CHECK-NEXT: ldr d0, [sp, #8]
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%op1 = load <8 x i8>, ptr %a
				%op2 = load <8 x i8>, ptr %b
				%1 = shufflevector <8 x i8> %op1, <8 x i8> %op2, <8 x i32> <i32 1, i32 9, i32 10, i32 11, i32 12, i32 12, i32 14, i32 15>
				ret <8 x i8> %1
				}

				; CHECK: .LCPI6_0:
				; CHECK-NEXT: .byte 0 // 0x0
				; CHECK-NEXT: .byte 7 // 0x7
				; CHECK-NEXT: .byte 2 // 0x2
				; CHECK-NEXT: .byte 3 // 0x3
				; CHECK-NEXT: .byte 4 // 0x4
				; CHECK-NEXT: .byte 5 // 0x5
				; CHECK-NEXT: .byte 6 // 0x6
				; CHECK-NEXT: .byte 7 // 0x7
				; CHECK-NEXT: .byte 255 // 0xff
				; CHECK-NEXT: .byte 255 // 0xff
				define <8 x i8> @shuffle_index_size_op1_maxhw(ptr %a, ptr %b) "target-features"="+sve2" vscale_range(16,16) {
				; CHECK-LABEL: shuffle_index_size_op1_maxhw:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: adrp x8, .LCPI6_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI6_0
				; CHECK-NEXT: ldr d0, [x0]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x8]
				; CHECK-NEXT: tbl z0.b, { z0.b }, z1.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%op1 = load <8 x i8>, ptr %a
				%op2 = load <8 x i8>, ptr %b
				%1 = shufflevector <8 x i8> %op1, <8 x i8> %op2, <8 x i32> <i32 0, i32 7, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				ret <8 x i8> %1
				}
				sdesmalenUnsubmitted Done Reply Inline Actions Please rename this to something like `shuffle_index_indices_from_both_ops` to make it clear the mask takes all the lanes from both `%op1` and `%op2`. sdesmalen: Please rename this to something like `shuffle_index_indices_from_both_ops` to make it clear…
				sdesmalenUnsubmitted Done Reply Inline Actions As I said here, one of the indices being `poison` should not be a reason not to use vector instructions. This case can be handled by the algorithm. sdesmalen: As I said [[https://reviews.llvm.org/D152205?id=556773#inline-1548136\|here]], one of the…
				sdesmalenUnsubmitted Done Reply Inline Actions Please remove the `acceptable` and `_unacceptable` from the names. If something cannot use the TBL instruction, then prepend the name with `negative_`. sdesmalen: Please remove the `acceptable` and `_unacceptable` from the names. If something cannot use the…
				sdesmalenUnsubmitted Done Reply Inline Actions Please inline `"target-features"="+sve2" vscale_range(16,16)` here rather than using `#0`, as that makes the test a bit easier to read. sdesmalen: Please inline `"target-features"="+sve2" vscale_range(16,16)` here rather than using `#0`, as…
				sdesmalenUnsubmitted Done Reply Inline Actions This test is named `_unacceptable_form`, yet it uses TBL without issues. So this is not an "unacceptable form"? :) sdesmalen: This test is named `_unacceptable_form`, yet it uses TBL without issues. So this is not an…
				sdesmalenUnsubmitted Done Reply Inline Actions How is this test different from `shuffle_index_size_acceptable_op_both`, other than it having a slightly different mask? sdesmalen: How is this test different from `shuffle_index_size_acceptable_op_both`, other than it having a…
				sdesmalenUnsubmitted Done Reply Inline Actions How is this test different from `shuffle_index_size_acceptable_op_both`, other than it having a slightly different mask and a wider return type? sdesmalen: How is this test different from `shuffle_index_size_acceptable_op_both`, other than it having a…
				sdesmalenUnsubmitted Done Reply Inline Actions I don't think this is a good test, because this test is relying on a DAGCombine that swaps `%op1` and `%op2`. I think you can just remove it. sdesmalen: I don't think this is a good test, because this test is relying on a DAGCombine that swaps…
				sdesmalenUnsubmitted Done Reply Inline Actions Can you add a comment that clarifies why this function doesn't use TBL? sdesmalen: Can you add a comment that clarifies why this function doesn't use TBL?

This is an archive of the discontinued LLVM Phabricator instance.

[Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types.ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 557563

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

llvm/test/CodeGen/AArch64/sve-fixed-length-permute-rev.ll

llvm/test/CodeGen/AArch64/sve-fixed-length-vector-shuffle-tbl.ll

[Aarch64][SVE]SVE2] Enable tbl, tbl2 for shuffle lowering for fixed vector types.
ClosedPublic