This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Remove copy instruction between uaddlv and dup
Closed, Public

Authored by jaykang10 on Aug 31 2023, 4:12 AM.

Details

Summary

For the example below, gcc generates fewer instructions than llvm.

#include <arm_neon.h>

uint8x8_t bar(uint8x8_t a) {
    return vrshrn_n_u16(vdupq_n_u16(vaddlv_u8(a)), 3);
}

gcc output
bar:
        uaddlv  h0, v0.8b
        dup     v0.8h, v0.h[0]
        rshrn   v0.8b, v0.8h, 3
        ret

llvm output
bar:
        uaddlv  h0, v0.8b
        fmov    w8, s0
        dup     v0.8h, w8
        rshrn   v0.8b, v0.8h, #3
        ret

There is a copy instruction between the GPR and FPR register files. We could change the scalar dup to a vector dup to remove the copy instruction, as below.

def : Pat<(v8i16 (AArch64dup (i32 (int_aarch64_neon_uaddlv (v8i8 V64:$Rn))))),
          (v8i16 (DUPv8i16lane
            (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), (UADDLVv8i8v V64:$Rn), hsub),
            (i64 0)))>;

With the above pattern, llvm generates the output below.

bar:                                    // @bar
        uaddlv  h0, v0.8b
        dup     v0.8h, v0.h[0]
        rshrn   v0.8b, v0.8h, #3
        ret

The pattern could be too specific to this example. If you have another idea for generalizing this case, please let me know.

Diff Detail

Event Timeline

jaykang10 created this revision. · Aug 31 2023, 4:12 AM
Herald added a project: Restricted Project. · View Herald Transcript · Aug 31 2023, 4:12 AM
jaykang10 requested review of this revision. · Aug 31 2023, 4:12 AM

Ideally, we'd lower the intrinsic to some operation that returns its result in a vector register. Given the limitations of SelectionDAG, that means introducing an opcode that produces a <2 x i32> or something like that. So instead of "(AArch64dup (int_aarch64_neon_uaddlv))", we'd end up with something more like "(AArch64dup (extract_element (AArch64uaddlv)))", and existing patterns would naturally do the right thing.

Otherwise, I think we end up needing way too many patterns to cover every operation that could possibly use the result of a uaddlv in a vector register.

> Ideally, we'd lower the intrinsic to some operation that returns its result in a vector register. Given the limitations of SelectionDAG, that means introducing an opcode that produces a <2 x i32> or something like that. So instead of "(AArch64dup (int_aarch64_neon_uaddlv))", we'd end up with something more like "(AArch64dup (extract_element (AArch64uaddlv)))", and existing patterns would naturally do the right thing.
>
> Otherwise, I think we end up needing way too many patterns to cover every operation that could possibly use the result of a uaddlv in a vector register.

Thanks for your kind comment.
Even if we add a custom SDNode with a vector-typed result for uaddlv, we would still need a copy instruction between the different register classes, because AArch64dup is the scalar form, which takes a scalar input. We would need to change the scalar dup to the vector dup as well as the uaddlv. That is why I added the pattern...
I am not sure how we can generalize changing the uaddlv and its user instruction to the vector forms...

We have a DAGCombine to transform dup(extract_element) to duplane, so with my suggestion the actual isel input would be "(AArch64duplane16 (AArch64uaddlv))", which is exactly the instruction sequence produced by your pattern.

Yep, it looks like there are patterns for dup(extract_element) --> duplane.

multiclass DUPWithTruncPats<ValueType ResVT, ValueType Src64VT,
                            ValueType Src128VT, ValueType ScalVT,
                            Instruction DUP, SDNodeXForm IdxXFORM> {
  def : Pat<(ResVT (AArch64dup (ScalVT (vector_extract (Src128VT V128:$Rn),
                                                     imm:$idx)))),
            (DUP V128:$Rn, (IdxXFORM imm:$idx))>;

  def : Pat<(ResVT (AArch64dup (ScalVT (vector_extract (Src64VT V64:$Rn),
                                                     imm:$idx)))),
            (DUP (SUBREG_TO_REG (i64 0), V64:$Rn, dsub), (IdxXFORM imm:$idx))>;
}

defm : DUPWithTruncPats<v8i8,   v4i16, v8i16, i32, DUPv8i8lane,  VecIndex_x2>;
defm : DUPWithTruncPats<v8i8,   v2i32, v4i32, i32, DUPv8i8lane,  VecIndex_x4>;
defm : DUPWithTruncPats<v4i16,  v2i32, v4i32, i32, DUPv4i16lane, VecIndex_x2>;

Let me add a custom SDNode AArch64uaddlv for uaddlv and patterns for it.
Thanks for checking it.
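
Concretely, the lowering side could look roughly like the sketch below in AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN (only a sketch of the shape; the v4i32 result type and the exact type guard are assumptions at this point in the discussion):

case Intrinsic::aarch64_neon_uaddlv: {
  EVT OpVT = Op.getOperand(1).getValueType();
  EVT ResVT = Op.getValueType();
  if (ResVT == MVT::i32 && (OpVT == MVT::v8i8 || OpVT == MVT::v16i8)) {
    // Produce the reduction as a vector-typed node so the value stays in a
    // SIMD/FP register, then extract lane 0 as the scalar i32 result.
    SDValue UADDLV =
        DAG.getNode(AArch64ISD::UADDLV, dl, MVT::v4i32, Op.getOperand(1));
    SDValue EXTRACT_VEC_ELT =
        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i32, UADDLV,
                    DAG.getConstant(0, dl, MVT::i64));
    return EXTRACT_VEC_ELT;
  }
  return SDValue();
}

With that shape, the TableGen side only needs patterns that select the vector-typed AArch64uaddlv node, and the existing dup(extract_element) --> duplane handling should do the rest.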

jaykang10 updated this revision to Diff 555720. Edited · Sep 4 2023, 6:04 AM

Following @efriedma's comment, I added a custom SDNode for uaddlv and patterns for it.

@efriedma If you see anything wrong with this update, please let me know.

efriedma accepted this revision. · Sep 4 2023, 9:02 PM

LGTM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5322

This could be extended to i16 uaddlv as well, but we can leave that for a followup, I guess.

This revision is now accepted and ready to land. · Sep 4 2023, 9:02 PM
dmgreen added inline comments. · Sep 5 2023, 1:16 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5325

An MVT::v8i16 with an extract might be a more natural representation for UADDLV that produces an h register.
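
For illustration only, that alternative would shape the lowering like this (a hypothetical sketch, not what the patch does; EXTRACT_VECTOR_ELT is allowed to widen the i16 element to i32):

// Model the h-register result of uaddlv on v8i8 as lane 0 of a v8i16, then
// extract lane 0 with an implicit widening to i32.
SDValue UADDLV =
    DAG.getNode(AArch64ISD::UADDLV, dl, MVT::v8i16, Op.getOperand(1));
SDValue Elt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i32, UADDLV,
                          DAG.getConstant(0, dl, MVT::i64));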

8716

Can you change this to generate a UADDLV directly?
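
A rough sketch of what that could mean, assuming (this is an assumption, not stated in the review) that the code at this line is where the ctpop lowering currently builds the uaddlv via ISD::INTRINSIC_WO_CHAIN; variable names are placeholders:

// Instead of wrapping the intrinsic ID in an INTRINSIC_WO_CHAIN node, e.g.
//   SDValue UaddLV = DAG.getNode(
//       ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32,
//       DAG.getConstant(Intrinsic::aarch64_neon_uaddlv, DL, MVT::i32), CtPop);
// emit the new node directly and extract lane 0:
SDValue UaddLV = DAG.getNode(AArch64ISD::UADDLV, DL, MVT::v4i32, CtPop);
SDValue Res = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i32, UaddLV,
                          DAG.getConstant(0, DL, MVT::i64));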

llvm/lib/Target/AArch64/AArch64InstrInfo.td
331

I think this can be the same as SDT_AArch64uaddlp

jaykang10 added inline comments. · Sep 5 2023, 5:39 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5322

Yep, it seems there is no pattern for dup(extract_element) --> duplane with v8i16.
Let's handle that type in another patch.

5325

The uaddlv intrinsic's result type for v8i8 and v16i8 sources is i32 rather than i16, so we need to return an i32 as the extract_vector_elt's result.

8716

Maybe we could use UADDLV here.
Let's check it in another patch.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
331

Let me use SDT_AArch64uaddlp.

jaykang10 updated this revision to Diff 555852. · Sep 5 2023, 6:09 AM
This revision was landed with ongoing or failed builds. · Sep 5 2023, 6:45 AM
This revision was automatically updated to reflect the committed changes.
jaykang10 added inline comments. · Sep 5 2023, 8:50 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5322

I have tried the patch below.

diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 2bb8e4324306..87c836905659 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -5327,7 +5327,8 @@ SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
   case Intrinsic::aarch64_neon_uaddlv: {
     EVT OpVT = Op.getOperand(1).getValueType();
     EVT ResVT = Op.getValueType();
-    if (ResVT == MVT::i32 && (OpVT == MVT::v8i8 || OpVT == MVT::v16i8)) {
+    if (ResVT == MVT::i32 &&
+        (OpVT == MVT::v8i8 || OpVT == MVT::v16i8 || OpVT == MVT::v8i16)) {
       // In order to avoid insert_subvector, used v4i32 than v2i32.
       SDValue UADDLV =
           DAG.getNode(AArch64ISD::UADDLV, dl, MVT::v4i32, Op.getOperand(1));
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index 4a1f46f2576a..658b22d312fb 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -6067,6 +6067,8 @@ defm : DUPWithTruncPats<v16i8,  v4i16, v8i16, i32, DUPv16i8lane, VecIndex_x2>;
 defm : DUPWithTruncPats<v16i8,  v2i32, v4i32, i32, DUPv16i8lane, VecIndex_x4>;
 defm : DUPWithTruncPats<v8i16,  v2i32, v4i32, i32, DUPv8i16lane, VecIndex_x2>;
 
+defm : DUPWithTruncPats<v4i32,  v2i32, v4i32, i32, DUPv8i16lane, VecIndex_x2>;
+
 multiclass DUPWithTrunci64Pats<ValueType ResVT, Instruction DUP,
                                SDNodeXForm IdxXFORM> {
   def : Pat<(ResVT (AArch64dup (i32 (trunc (extractelt (v2i64 V128:$Rn),
@@ -6462,12 +6464,21 @@ def : Pat<(i32 (int_aarch64_neon_uaddlv (v8i16 (AArch64uaddlp (v16i8 V128:$op)))
             (v8i16 (SUBREG_TO_REG (i64 0), (UADDLVv16i8v V128:$op), hsub)),
             ssub))>;
 
+def : Pat<(i32 (vector_extract
+            (v4i32 (AArch64uaddlv (v8i16 (AArch64uaddlp (v16i8 V128:$op))))), (i64 0))),
+          (i32 (EXTRACT_SUBREG
+            (v8i16 (SUBREG_TO_REG (i64 0), (UADDLVv16i8v V128:$op), hsub)),
+            ssub))>;
+
 def : Pat<(v4i32 (AArch64uaddlv (v8i8 V64:$Rn))),
           (v4i32 (SUBREG_TO_REG (i64 0), (UADDLVv8i8v V64:$Rn), hsub))>;
 
 def : Pat<(v4i32 (AArch64uaddlv (v16i8 V128:$Rn))),
           (v4i32 (SUBREG_TO_REG (i64 0), (UADDLVv16i8v V128:$Rn), hsub))>;
 
+def : Pat<(v4i32 (AArch64uaddlv (v8i16 V128:$Rn))),
+          (v4i32 (SUBREG_TO_REG (i64 0), (UADDLVv8i16v V128:$Rn), ssub))>;
+
 // Patterns for across-vector intrinsics, that have a node equivalent, that
 // returns a vector (with only the low lane defined) instead of a scalar.
 // In effect, opNode is the same as (scalar_to_vector (IntNode)).
diff --git a/llvm/test/CodeGen/AArch64/neon-addlv.ll b/llvm/test/CodeGen/AArch64/neon-addlv.ll
index 0f5a19c7a0f3..0769adce87d3 100644
--- a/llvm/test/CodeGen/AArch64/neon-addlv.ll
+++ b/llvm/test/CodeGen/AArch64/neon-addlv.ll
@@ -178,8 +178,8 @@ entry:
   ret i32 %0
 }

-define dso_local <8 x i8> @bar(<8 x i8> noundef %a) local_unnamed_addr #0 {
-; CHECK-LABEL: bar:
+define dso_local <8 x i8> @uaddlv_v8i8(<8 x i8> %a) {
+; CHECK-LABEL: uaddlv_v8i8:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    uaddlv h0, v0.8b
 ; CHECK-NEXT:    dup v0.8h, v0.h[0]
@@ -194,4 +194,22 @@ entry:
   ret <8 x i8> %vrshrn_n2
 }
 
+define dso_local <8 x i16> @uaddlv_v8i16(<8 x i16> %a) {
+; CHECK-LABEL: uaddlv_v8i16:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    uaddlv s0, v0.8h
+; CHECK-NEXT:    dup v1.8h, v0.h[0]
+; CHECK-NEXT:    rshrn v0.4h, v1.4s, #3
+; CHECK-NEXT:    rshrn2 v0.8h, v1.4s, #3
+; CHECK-NEXT:    ret
+entry:
+  %vaddlv.i = tail call i32 @llvm.aarch64.neon.uaddlv.i32.v8i16(<8 x i16> %a)
+  %vecinit.i = insertelement <8 x i32> undef, i32 %vaddlv.i, i64 0
+  %vecinit7.i = shufflevector <8 x i32> %vecinit.i, <8 x i32> poison, <8 x i32> zeroinitializer
+  %vrshrn_n2 = tail call <8 x i16> @llvm.aarch64.neon.rshrn.v8i16(<8 x i32> %vecinit7.i, i32 3)
+  ret <8 x i16> %vrshrn_n2
+}
+
 declare <8 x i8> @llvm.aarch64.neon.rshrn.v8i8(<8 x i16>, i32)
+declare <8 x i16> @llvm.aarch64.neon.rshrn.v8i16(<8 x i32>, i32)
+declare i32 @llvm.aarch64.neon.uaddlv.i32.v8i16(<8 x i16>)
diff --git a/llvm/test/CodeGen/AArch64/uaddlv-vaddlp-combine.ll b/llvm/test/CodeGen/AArch64/uaddlv-vaddlp-combine.ll
index 8b48635b6694..e6b253b258f1 100644
--- a/llvm/test/CodeGen/AArch64/uaddlv-vaddlp-combine.ll
+++ b/llvm/test/CodeGen/AArch64/uaddlv-vaddlp-combine.ll
@@ -17,7 +17,8 @@ define i32 @uaddlv_uaddlp_v8i16(<8 x i16> %0) {
 define i16 @uaddlv_uaddlp_v16i8(<16 x i8> %0) {
 ; CHECK-LABEL: uaddlv_uaddlp_v16i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    uaddlv h0, v0.16b
+; CHECK-NEXT:    uaddlp v0.8h, v0.16b
+; CHECK-NEXT:    uaddlv s0, v0.8h
 ; CHECK-NEXT:    fmov w0, s0
 ; CHECK-NEXT:    ret
   %2 = tail call <8 x i16> @llvm.aarch64.neon.uaddlp.v8i16.v16i8(<16 x i8> %0)

As you can see, there is a regression on uaddlv_uaddlp_v16i8, even though I added a pattern to cover it, because the first pattern is matched earlier than the second one.

first pattern
+defm : DUPWithTruncPats<v4i32,  v2i32, v4i32, i32, DUPv8i16lane, VecIndex_x2>;

second pattern
+def : Pat<(i32 (vector_extract
+            (v4i32 (AArch64uaddlv (v8i16 (AArch64uaddlp (v16i8 V128:$op))))), (i64 0))),
+          (i32 (EXTRACT_SUBREG
+            (v8i16 (SUBREG_TO_REG (i64 0), (UADDLVv16i8v V128:$op), hsub)),
+            ssub))>;
+

I think it could be OK to keep the uaddlv intrinsic rather than the uaddlv SDNode for the v8i16 type...

re: the big-endian stuff I mentioned on the other ticket... it looks like it isn't a regression, but my concern is the code generated for ctpop_i32 for a big-endian target. uaddlv v16i8 produces a result in h0 (element 0 of an 8 x i16), but we then access it as s0 (element 0 of a 4 x i32) without a bitcast. So I think the bits end up in the wrong place?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5322

Not sure I understand the issue here; more specific patterns (more nodes in the pattern definition) should win, no matter the order in the file, so you should be able to fix this with the right patterns.

> re: the big-endian stuff I mentioned on the other ticket... it looks like it isn't a regression, but my concern is the code generated for ctpop_i32 for a big-endian target. uaddlv v16i8 produces a result in h0 (element 0 of an 8 x i16), but we then access it as s0 (element 0 of a 4 x i32) without a bitcast. So I think the bits end up in the wrong place?

I think it's the other way around (hopefully I have it the right way around, BE can be confusing). A bitcast would swap the lane indices (it acts as a store and a load). Otherwise lane 0 is the lowest lane in both LLVM IR and the NEON registers.

> re: the big-endian stuff I mentioned on the other ticket... it looks like it isn't a regression, but my concern is the code generated for ctpop_i32 for a big-endian target. uaddlv v16i8 produces a result in h0 (element 0 of an 8 x i16), but we then access it as s0 (element 0 of a 4 x i32) without a bitcast. So I think the bits end up in the wrong place?
>
> I think it's the other way around (hopefully I have it the right way around, BE can be confusing). A bitcast would swap the lane indices (it acts as a store and a load). Otherwise lane 0 is the lowest lane in both LLVM IR and the NEON registers.

To be sure, I would like to check one thing. As far as I understand, endianness affects the order of elements in memory, so we need a rev instruction after a load and before a store. After the rev instruction, we do not need to care about endianness any more. Is that correct, or am I wrong? Are there other rules for big-endian on AArch64?
For the big-endian output of ctpop_i32, I can see a rev instruction because AArch64TargetLowering::LowerCTPOP_PARITY generates a bitcast from i64 to v8i8. Does it also need to be changed to NVCAST? It seems we need to be careful with `bitcast`, which causes a rev instruction for big-endian...

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5322

I may have made a mistake. Let me check it again.

> To be sure, I would like to check one thing. As far as I understand, endianness affects the order of elements in memory, so we need a rev instruction after a load and before a store. After the rev instruction, we do not need to care about endianness any more. Is that correct, or am I wrong? Are there other rules for big-endian on AArch64?

bitcasts are defined as store+load, so they can change the lane order. NVCast acts upon the representation in the vector, so it keeps the lanes in the same order. Vector function arguments are also passed in a particular order that sometimes needs to be considered (they often need a rev).

> For the big-endian output of ctpop_i32, I can see a rev instruction because AArch64TargetLowering::LowerCTPOP_PARITY generates a bitcast from i64 to v8i8. Does it also need to be changed to NVCAST? It seems we need to be careful with `bitcast`, which causes a rev instruction for big-endian...

I think for this specific case it does not actually matter. Because the rev feeds into a cnt and an addlv on the individual i8 elements, and the addlv is performing a (commutative) reduction, it doesn't matter if the lanes get reversed. We still sum up the same values. So it could be either a BITCAST or an NVCAST and both should work (although I'm not sure an NVCAST between i64 and vectors is defined).