This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Handle vector with two different values
ClosedPublic

Authored by jaykang10 on Apr 14 2023, 8:37 AM.

Details

Summary

gcc generates fewer instructions than llvm for the intrinsic example below.

#include <arm_neon.h>

uint8x16_t foo(uint8_t *a, uint8_t *b) {
    return vcombine_u8(vld1_dup_u8(a), vld1_dup_u8(b));
} 

gcc output
foo:
	ld1r	{v0.8b}, [x0]
	ld1r	{v1.8b}, [x1]
	ins	v0.d[1], v1.d[0]
	ret

llvm output
foo:                                    // @foo
        ldrb    w8, [x0]
        fmov    s0, w8
        mov     v0.b[1], w8
        mov     v0.b[2], w8
        mov     v0.b[3], w8
        mov     v0.b[4], w8
        mov     v0.b[5], w8
        mov     v0.b[6], w8
        mov     v0.b[7], w8
        ldrb    w8, [x1]
        mov     v0.b[8], w8
        mov     v0.b[9], w8
        mov     v0.b[10], w8
        mov     v0.b[11], w8
        mov     v0.b[12], w8
        mov     v0.b[13], w8
        mov     v0.b[14], w8
        mov     v0.b[15], w8
        ret

If a vector has two different values and can be split into two subvectors of the same length, generate two DUPs and a CONCAT_VECTORS from them.
For example,

 t22: v16i8 = BUILD_VECTOR t23, t23, t23, t23, t23, t23, t23, t23,
                           t24, t24, t24, t24, t24, t24, t24, t24
==>
   t26: v8i8 = AArch64ISD::DUP t23
   t28: v8i8 = AArch64ISD::DUP t24
 t29: v16i8 = concat_vectors t26, t28
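The effect of this rewrite can be modeled in plain C. This is only a scalar sketch of the DAG semantics; the helper names are illustrative, not from LLVM:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of the original BUILD_VECTOR: lanes 0..7 hold one value,
 * lanes 8..15 hold another. */
void build_vector_two_values(uint8_t out[16], uint8_t t23, uint8_t t24) {
    for (int i = 0; i < 8; ++i)
        out[i] = t23;
    for (int i = 8; i < 16; ++i)
        out[i] = t24;
}

/* Scalar model of the rewritten form: two 8-lane DUPs concatenated. */
void dup_dup_concat(uint8_t out[16], uint8_t t23, uint8_t t24) {
    uint8_t lo[8], hi[8];
    memset(lo, t23, sizeof lo);  /* v8i8 AArch64ISD::DUP t23 */
    memset(hi, t24, sizeof hi);  /* v8i8 AArch64ISD::DUP t24 */
    memcpy(out, lo, 8);          /* concat_vectors lo, hi */
    memcpy(out + 8, hi, 8);
}

int vectors_equal(const uint8_t a[16], const uint8_t b[16]) {
    return memcmp(a, b, 16) == 0;
}
```

Both routines produce the same 16 lanes, which is what makes the DAG rewrite legal.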

With this patch, llvm generates below output.

foo:                                  // @foo
	ld1r	{ v1.8b }, [x1]
	ld1r	{ v0.8b }, [x0]
	mov	v0.d[1], v1.d[0]
	ret

Diff Detail

Event Timeline

jaykang10 created this revision. Apr 14 2023, 8:37 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 14 2023, 8:37 AM
jaykang10 requested review of this revision.Apr 14 2023, 8:37 AM

Could you add some extra test cases too? For example, for the foo case in the summary and for cases that are close but not correct, like BUILD_VECTOR t23, t23, t23, t23, t23, t23, t23, t24, t24, t24, t24, t24, t24, t24, t24, t23. You should be able to construct tests with a bunch of insertelements.

jaykang10 added a comment (edited). Apr 14 2023, 9:54 AM

Could you add some extra test cases too? For example, for the foo case in the summary and for cases that are close but not correct, like BUILD_VECTOR t23, t23, t23, t23, t23, t23, t23, t24, t24, t24, t24, t24, t24, t24, t24, t23. You should be able to construct tests with a bunch of insertelements.

Ah! You are right!
It looks like the condition does not cover all cases. I thought some cases would be handled by the earlier conditions.
Let me update the code and add the tests next week.
Thanks!

jaykang10 updated this revision to Diff 513721. Apr 14 2023, 1:06 PM

@dmgreen I have added some tests with precommit.
Let me add more next week.

jaykang10 updated this revision to Diff 513732. Apr 14 2023, 1:36 PM

Instead of special-casing this specific pattern, maybe it makes sense to generalize to all BUILD_VECTORs with two values? Even for a completely arbitrary shuffle, you can still lower that to 5 instructions: VSELECT mask, DUP op1, DUP op2.
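As a scalar sketch, the suggested VSELECT-of-two-DUPs lowering computes the following; the helper name and the byte-per-lane mask representation are illustrative, not LLVM API:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of "VSELECT mask, DUP op1, DUP op2": each output lane takes
 * op1 where the mask lane is set and op2 otherwise. Because the DUPs splat
 * op1 and op2 to every lane, the mask alone decides the shuffle pattern. */
void vselect_two_dups(uint8_t out[16], const uint8_t mask[16],
                      uint8_t op1, uint8_t op2) {
    for (int i = 0; i < 16; ++i)
        out[i] = mask[i] ? op1 : op2;
}
```

Any arbitrary arrangement of the two values is reachable this way, which is why the generalization covers every two-value BUILD_VECTOR.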

Instead of special-casing this specific pattern, maybe it makes sense to generalize to all BUILD_VECTORs with two values? Even for a completely arbitrary shuffle, you can still lower that to 5 instructions: VSELECT mask, DUP op1, DUP op2.

Thanks for the comment, @efriedma. That's a good point.
Let me try to add the VSELECT with two DUPs.

@efriedma It seems we need a BUILD_VECTOR or something like it for the VSELECT's mask, whose type is vNi1, with a legal vector type... I am not sure how we can generate it efficiently...
If you had ideas for the mask, please let me know.

A 16xi8 BUILD_VECTOR of constants should lower to something reasonable (worst case, a constant-pool load).

A 16xi8 BUILD_VECTOR of constants should lower to something reasonable (worst case, a constant-pool load).

I have seen the constant pool and load, and I was not sure whether it is good enough...

Here is a patch that reproduces the constant pool and load from the intrinsic example... Maybe my implementation is wrong...

diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index fb41dfd8f245..089e51a5b981 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -12236,6 +12236,8 @@ SDValue AArch64TargetLowering::LowerBUILD_VECTOR(SDValue Op,
   unsigned NumUndefLanes = 0;
   SDValue Value;
   SDValue ConstantValue;
+  SmallPtrSet<SDNode *, 16> DifferentValueSet;
+  SmallVector<SDValue, 16> BitMaskVec;
   for (unsigned i = 0; i < NumElts; ++i) {
     SDValue V = Op.getOperand(i);
     if (V.getOpcode() != ISD::EXTRACT_VECTOR_ELT)
@@ -12263,6 +12265,15 @@ SDValue AArch64TargetLowering::LowerBUILD_VECTOR(SDValue Op,
       usesOnlyOneValue = false;
       ++NumDifferentLanes;
     }
+
+    // Keep different values on vector.
+    DifferentValueSet.insert(V.getNode());
+    // Keep the lanes of the first value with bitmask. The bitmask will be valid
+    // only if the DifferentValueSet's size is 2.
+    if (V == Value)
+      BitMaskVec.push_back(DAG.getTargetConstant(1, dl, MVT::i8));
+    else
+      BitMaskVec.push_back(DAG.getTargetConstant(0, dl, MVT::i8));
   }
 
   if (!Value.getNode()) {
@@ -12454,6 +12465,22 @@ SDValue AArch64TargetLowering::LowerBUILD_VECTOR(SDValue Op,
       return Shuffle;
   }
 
+  // If vector consists of two different values, generate two DUPs and VSELECT.
+  if (DifferentValueSet.size() == 2) {
+    SmallVector<SDValue, 2> Vals;
+    for (auto *Val : DifferentValueSet)
+      Vals.push_back(SDValue(Val, 0));
+    SmallVector<SDValue, 8> Ops1(NumElts, Vals[0]);
+    SmallVector<SDValue, 8> Ops2(NumElts, Vals[1]);
+    SDValue DUP1 = LowerBUILD_VECTOR(DAG.getBuildVector(VT, dl, Ops1), DAG);
+    SDValue DUP2 = LowerBUILD_VECTOR(DAG.getBuildVector(VT, dl, Ops2), DAG);
+    SDValue VCOND = DAG.getBuildVector(VT, dl, BitMaskVec);
+    if (SDValue Res = LowerBUILD_VECTOR(VCOND, DAG))
+      VCOND = Res;
+    SDValue VSELECT = DAG.getNode(ISD::VSELECT, dl, VT, VCOND, DUP1, DUP2);
+    return VSELECT;
+  }
+
   if (PreferDUPAndInsert) {
     // First, build a constant vector with the common element.
     SmallVector<SDValue, 8> Ops(NumElts, Value);

Something along those lines, yes.

I haven't thought through how to optimize the cases where some non-VSELECT shuffle is optimal. We don't have any existing code to handle splat operands to shuffles. I guess to start, you could just check for specific patterns before creating the VSELECT. Alternatively, might make sense to make a VECTOR_SHUFFLE, then teach shuffle lowering to handle the relevant patterns.

I have seen the constant pool and load, and I was not sure whether it is good enough...

Better than a long sequence of MOVs. And if there's a loop, the load may get hoisted out.

I haven't thought through how to optimize the cases where some non-VSELECT shuffle is optimal. We don't have any existing code to handle splat operands to shuffles. I guess to start, you could just check for specific patterns before creating the VSELECT. Alternatively, might make sense to make a VECTOR_SHUFFLE, then teach shuffle lowering to handle the relevant patterns.

Thanks for good comment.
It looks like it is not easy to generate the vector mask for vselect and vector_shuffle simply, because it needs a build_vector again...
Let me check the specific patterns + (vselect or vector_shuffle) more.

I have seen the constant pool and load, and I was not sure whether it is good enough...

Better than a long sequence of MOVs. And if there's a loop, the load may get hoisted out.

I agree with you.

jaykang10 added a comment (edited). Apr 18 2023, 2:34 AM

Um... it seems the vector_shuffle output is better than the vselect one, as shown below, even though it generates a constant pool. The vector_shuffle is lowered to tbl. Let me try to use vector_shuffle.
Additionally, the AArch64 target expands vselect, so we would need to expand it manually or add tablegen patterns for it...

Optimized legalized selection DAG: %bb.0 'test3:entry'
SelectionDAG has 20 nodes:
  t0: ch,glue = EntryToken
          t2: i64,ch = CopyFromReg t0, Register:i64 %0
        t23: i32,ch = load<(load (s8) from %ir.a), anyext from i8> t0, t2, undef:i64
      t37: v16i8 = AArch64ISD::DUP t23
          t4: i64,ch = CopyFromReg t0, Register:i64 %1
        t24: i32,ch = load<(load (s8) from %ir.b), anyext from i8> t0, t4, undef:i64
      t36: v16i8 = AArch64ISD::DUP t24
          t40: i64 = AArch64ISD::ADRP TargetConstantPool:i64<<16 x i8> <i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 16, i8 16, i8 16, i8 16, i8 16, i8 16, i8 16, i8 16>> 0 [TF=1]
        t41: i64 = AArch64ISD::ADDlow t40, TargetConstantPool:i64<<16 x i8> <i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 16, i8 16, i8 16, i8 16, i8 16, i8 16, i8 16, i8 16>> 0 [TF=34]
      t35: v16i8,ch = load<(load (s128) from constant-pool)> t0, t41, undef:i64
    t33: v16i8 = llvm.aarch64.neon.tbl2 Constant:i32<610>, t37, t36, t35  --> tbl2 is generated.
  t16: ch,glue = CopyToReg t0, Register:v16i8 $q0, t33
  t17: ch = AArch64ISD::RET_GLUE t16, Register:v16i8 $q0, t16:1

Assembly output
test3:                                  // @test3
        .cfi_startproc
// %bb.0:                               // %entry
        adrp    x8, .LCPI0_0
        ld1r    { v1.16b }, [x1]
        ld1r    { v0.16b }, [x0]
        ldr     q2, [x8, :lo12:.LCPI0_0]
        tbl     v0.16b, { v0.16b, v1.16b }, v2.16b
        ret
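For reference, the tbl instruction above can be modeled in scalar C: each byte of the mask register indexes into the 32-byte concatenation of the two source registers, and an out-of-range index produces 0 on AArch64. This is a sketch with an illustrative helper name, not LLVM code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of "tbl vD.16b, { v0.16b, v1.16b }, vM.16b". */
void tbl2(uint8_t out[16], const uint8_t v0[16], const uint8_t v1[16],
          const uint8_t idx[16]) {
    for (int i = 0; i < 16; ++i) {
        uint8_t j = idx[i];
        if (j < 16)
            out[i] = v0[j];        /* index into the first register */
        else if (j < 32)
            out[i] = v1[j - 16];   /* index into the second register */
        else
            out[i] = 0;            /* out-of-range lanes read as zero */
    }
}
```

With the constant-pool mask <0 x8, 16 x8> and the two ld1r splats as inputs, this reproduces the two-value vector from the summary.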
jaykang10 updated this revision to Diff 514863. Apr 19 2023, 1:59 AM

@efriedma I have updated this patch with vector_shuffle and have checked the impact.
As you can see, the vector_shuffle generates more instructions for vectors with a small number of lanes, so it could be good to enable this transformation only for vectors with more than 8 lanes.
What do you think about it?

jaykang10 updated this revision to Diff 515718. Apr 21 2023, 6:47 AM

Do you have some examples of the impact? I don't see any test changes that involve the shuffle codepath.

Do you have some examples of the impact? I don't see any test changes that involve the shuffle codepath.

Yep, you can see the test changes here: https://reviews.llvm.org/differential/diff/514863/

You should probably add more testcases for coverage. Some 8 and 16-element cases using shuffles. Maybe a case of a 4-element vector (a,a,b,b) which can be built using dup+dup+concat.

Some future areas to look at (not necessarily in this patch; I don't want to keep expanding the scope of this forever):

mov+ins+ins+ins is never the optimal sequence for creating a 4-element vector with multiple elements with the same value; you can always use either dup+ins or dup+ins+ins. Maybe worth trying to optimize.

It looks like the generic vector_shuffle code is generating suboptimal code in some cases because it doesn't realize the shuffle inputs are actually splats, so it can mess with the indices.
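The dup+ins point above can be sketched in scalar C for the (a, a, a, b) case; the helper below is illustrative only, not the patch's code:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of dup+ins for a 4-element vector (a, a, a, b):
 * splat the common value, then overwrite the one differing lane.
 * This is two instructions instead of mov+ins+ins+ins. */
void dup_ins(uint32_t out[4], uint32_t a, uint32_t b) {
    for (int i = 0; i < 4; ++i)
        out[i] = a;   /* dup  v0.4s, w0   */
    out[3] = b;       /* mov  v0.s[3], w1 */
}
```

A vector with two repeats of each value, (a, a, b, b), would analogously take dup+ins+ins, still shorter than the four-instruction sequence.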

You should probably add more testcases for coverage. Some 8 and 16-element cases using shuffles. Maybe a case of a 4-element vector (a,a,b,b) which can be built using dup+dup+concat.

Some future areas to look at (not necessarily in this patch; I don't want to keep expanding the scope of this forever):

mov+ins+ins+ins is never the optimal sequence for creating a 4-element vector with multiple elements with the same value; you can always use either dup+ins or dup+ins+ins. Maybe worth trying to optimize.

It looks like the generic vector_shuffle code is generating suboptimal code in some cases because it doesn't realize the shuffle inputs are actually splats, so it can mess with the indices.

Thanks for the good comments!
Let me add and check more test cases.

jaykang10 updated this revision to Diff 516367. Apr 24 2023, 5:37 AM

It looks like the generic vector_shuffle code is generating suboptimal code in some cases because it doesn't realize the shuffle inputs are actually splats, so it can mess with the indices.

@efriedma I agree with you.
Let me check the splat inputs on vector_shuffle. It could be a different patch.

jaykang10 updated this revision to Diff 516709. Apr 25 2023, 1:56 AM
efriedma added inline comments. Apr 25 2023, 10:40 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12358

The name "MaskVec" is a bit confusing in this context...

Maybe it would be easier to understand if you just construct it later, in the if (DifferentValueMap.size() == 2 && NumUndefLanes == 0) codepath?

jaykang10 added inline comments. Apr 26 2023, 1:25 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12358

The name "MaskVec" is a bit confusing in this context...

Maybe it would be easier to understand if you just construct it later, in the if (DifferentValueMap.size() == 2 && NumUndefLanes == 0) codepath?

Sorry for the confusion.
Let me add more comments and update the code.

jaykang10 updated this revision to Diff 517102. Apr 26 2023, 2:12 AM

Any more comments please?

dmgreen accepted this revision. May 5 2023, 5:00 AM

I think the code looks OK. You may want to add some extra test cases that generate tbl though to show more cases.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12662

Would it help if this mask is <0,1,2,3,12,13,14,15> or <0,1,2,3,8,9,10,11>? I'm not sure it would help at the moment, but this case with i8's could use 's' lane inserts to avoid the tbl. It wouldn't help in general though.

This revision is now accepted and ready to land. May 5 2023, 5:00 AM
jaykang10 added inline comments. May 5 2023, 6:42 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12662

Thanks for the comment.
This code handles only vectors with two different values, so the cases with the masks <0,1,2,3,12,13,14,15> and <0,1,2,3,8,9,10,11> will not reach this code.

This revision was landed with ongoing or failed builds. May 5 2023, 7:04 AM
This revision was automatically updated to reflect the committed changes.
dmgreen added inline comments. May 5 2023, 7:08 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12662

The idea is that the two original values have been dup'd to all elements of DUP1 and DUP2. So the value in lane 0 should be the same as in lanes 1..7, and the value in lane 8 (lane 0 of DUP2) will be the same as in lanes 9..15. So we can choose any element in those vectors. And if we pick <0,1,2,3> there is a chance to convert that to an 'S' reg lane move without requiring the tbl. It looks like that might already happen if the lanes were sequential.

Having said that, if the type is a float then the value should already be in lane 0, and the DUP's become unnecessary. It should be able to just use a tbl directly.
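The observation that the mask lanes can be renumbered freely within each half (because both tbl inputs are splats) can be checked with a scalar sketch; the helper below is ours, not from the patch, and assumes all mask indices are in range:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if two tbl masks select identical bytes when the 32-byte
 * table is the concatenation of two splats (every lane of the first
 * half is 'a', every lane of the second half is 'b'). */
int masks_equivalent_on_splats(const uint8_t m1[16], const uint8_t m2[16],
                               uint8_t a, uint8_t b) {
    uint8_t tab[32];
    memset(tab, a, 16);       /* DUP1: every lane is a */
    memset(tab + 16, b, 16);  /* DUP2: every lane is b */
    for (int i = 0; i < 16; ++i)
        if (tab[m1[i] & 31] != tab[m2[i] & 31])
            return 0;
    return 1;
}
```

So a mask like <7,7,7,7,24,...> selects the same bytes as <0,1,2,3,16,...>; only which half each index falls in matters, which is what allows picking sequential indices that a later combine could fold into a lane move.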

jaykang10 added inline comments. May 5 2023, 7:24 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12662

Ah, I understand what you said now.
Yep, we can choose the elements for the mask because the values of the lanes are the same.
Let me check which mask can be lowered well in LowerVECTOR_SHUFFLE.
Thanks for the good comment.