GCC generates fewer instructions than LLVM for the intrinsic example below.
#include <arm_neon.h>

uint8x16_t foo(uint8_t *a, uint8_t *b) {
  return vcombine_u8(vld1_dup_u8(a), vld1_dup_u8(b));
}

gcc output:
foo:
        ld1r    {v0.8b}, [x0]
        ld1r    {v1.8b}, [x1]
        ins     v0.d[1], v1.d[0]
        ret

llvm output:
foo:                                    // @foo
        ldrb    w8, [x0]
        fmov    s0, w8
        mov     v0.b[1], w8
        mov     v0.b[2], w8
        mov     v0.b[3], w8
        mov     v0.b[4], w8
        mov     v0.b[5], w8
        mov     v0.b[6], w8
        mov     v0.b[7], w8
        ldrb    w8, [x1]
        mov     v0.b[8], w8
        mov     v0.b[9], w8
        mov     v0.b[10], w8
        mov     v0.b[11], w8
        mov     v0.b[12], w8
        mov     v0.b[13], w8
        mov     v0.b[14], w8
        mov     v0.b[15], w8
        ret
If a vector holds two different values and can be split into two sub-vectors of the same length, generate a DUP for each value and a CONCAT_VECTORS of the two results.
For example,
t22: v16i8 = BUILD_VECTOR t23, t23, t23, t23, t23, t23, t23, t23,
                          t24, t24, t24, t24, t24, t24, t24, t24
==>
t26: v8i8 = AArch64ISD::DUP t23
t28: v8i8 = AArch64ISD::DUP t24
t29: v16i8 = concat_vectors t26, t28
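For illustration, a minimal sketch of how such a lowering could be expressed next to the existing BUILD_VECTOR handling in AArch64ISelLowering.cpp. The helper name tryLowerToTwoDUPs and its exact placement are assumptions, not the code of the actual patch:

// Sketch only: detect the "low half = one value, high half = another
// value" BUILD_VECTOR pattern and lower it to DUP + CONCAT_VECTORS.
static SDValue tryLowerToTwoDUPs(SDValue Op, SelectionDAG &DAG) {
  EVT VT = Op.getValueType();
  unsigned NumElts = VT.getVectorNumElements();
  SDLoc DL(Op);

  SDValue LoVal = Op.getOperand(0);
  SDValue HiVal = Op.getOperand(NumElts / 2);

  // Bail out unless every low lane is LoVal and every high lane is HiVal.
  for (unsigned i = 0; i != NumElts / 2; ++i)
    if (Op.getOperand(i) != LoVal ||
        Op.getOperand(NumElts / 2 + i) != HiVal)
      return SDValue();

  EVT HalfVT = VT.getHalfNumVectorElementsVT(*DAG.getContext());
  SDValue Lo = DAG.getNode(AArch64ISD::DUP, DL, HalfVT, LoVal);
  SDValue Hi = DAG.getNode(AArch64ISD::DUP, DL, HalfVT, HiVal);
  return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, Lo, Hi);
}

Each DUP selects to an ld1r when its operand is a scalar load, and the CONCAT_VECTORS of two 64-bit halves becomes the ins of the high half, matching the GCC output above.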
With this patch, LLVM generates the output below.
foo:                                    // @foo
        ld1r    { v1.8b }, [x1]
        ld1r    { v0.8b }, [x0]
        mov     v0.d[1], v1.d[0]
        ret
The name "MaskVec" is a bit confusing in this context...
Maybe it would be easier to understand if you just construct it later, in the if (DifferentValueMap.size() == 2 && NumUndefLanes == 0) codepath?
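Roughly, the suggestion looks like the sketch below. DifferentValueMap, NumUndefLanes, and MaskVec are names from the patch under review, but their types and the mask contents shown here are assumptions made only to illustrate moving the construction into the guarded codepath:

// Assumed shape of the suggestion: build the per-lane vector only once
// we know the two-DUP path is actually taken.
if (DifferentValueMap.size() == 2 && NumUndefLanes == 0) {
  SmallVector<unsigned, 16> MaskVec;  // hypothetical element type
  for (unsigned i = 0; i != NumElts; ++i)
    MaskVec.push_back(i < NumElts / 2 ? 0 : 1);
  // ... use MaskVec to verify the low-half/high-half split and emit
  //     DUP + CONCAT_VECTORS ...
}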