This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Remove copy instruction between uaddlv and dup
Closed, Public

Authored by jaykang10 on Aug 31 2023, 4:12 AM.

Details

Summary

For the example below, gcc generates fewer instructions than llvm.

#include <arm_neon.h>

uint8x8_t bar(uint8x8_t a) {
    return vrshrn_n_u16(vdupq_n_u16(vaddlv_u8(a)), 3);
}

gcc output
bar:
        uaddlv  h0, v0.8b
        dup     v0.8h, v0.h[0]
        rshrn   v0.8b, v0.8h, 3
        ret

llvm output
bar:
        uaddlv  h0, v0.8b
        fmov    w8, s0
        dup     v0.8h, w8
        rshrn   v0.8b, v0.8h, #3
        ret

There is a copy instruction between the GPR and FPR register files. We could change the scalar dup to a vector dup to remove the copy instruction, as below.

def : Pat<(v8i16 (AArch64dup (i32 (int_aarch64_neon_uaddlv (v8i8 V64:$Rn))))),
          (v8i16 (DUPv8i16lane
            (INSERT_SUBREG (v8i16 (IMPLICIT_DEF)), (UADDLVv8i8v V64:$Rn), hsub),
            (i64 0)))>;

With the above pattern, llvm generates the output below.

bar:                                    // @bar
        uaddlv  h0, v0.8b
        dup     v0.8h, v0.h[0]
        rshrn   v0.8b, v0.8h, #3
        ret

The pattern could be too specific to this example. If you have another idea for generalizing this case, please let me know.

Diff Detail

Event Timeline

jaykang10 created this revision. · Aug 31 2023, 4:12 AM
Herald added a project: Restricted Project. · View Herald Transcript · Aug 31 2023, 4:12 AM
jaykang10 requested review of this revision. · Aug 31 2023, 4:12 AM

Ideally, we'd lower the intrinsic to some operation that returns its result in a vector register. Given the limitations of SelectionDAG, that means introducing an opcode that produces a <2 x i32> or something like that. So instead of "(AArch64dup (int_aarch64_neon_uaddlv))", we'd end up with something more like "(AArch64dup (extract_element (AArch64uaddlv)))", and existing patterns would naturally do the right thing.

Otherwise, I think we end up needing way too many patterns to cover every operation that could possibly use the result of a uaddlv in a vector register.

> Ideally, we'd lower the intrinsic to some operation that returns its result in a vector register. Given the limitations of SelectionDAG, that means introducing an opcode that produces a <2 x i32> or something like that. So instead of "(AArch64dup (int_aarch64_neon_uaddlv))", we'd end up with something more like "(AArch64dup (extract_element (AArch64uaddlv)))", and existing patterns would naturally do the right thing.
>
> Otherwise, I think we end up needing way too many patterns to cover every operation that could possibly use the result of a uaddlv in a vector register.

Thanks for your kind comment.
Even if we add a custom SDNode with a vector-typed result for uaddlv, we would still need a copy instruction between the different register classes, because AArch64dup is the scalar form, which takes a scalar input. We would need to change the scalar dup to the vector dup as well as the uaddlv. That is why I added the pattern...
I am not sure how we can generalize changing the uaddlv and its user instruction to the vector forms...

We have a DAGCombine to transform dup(extract_element) to duplane, so with my suggestion the actual isel input would be "(AArch64duplane16 (AArch64uaddlv))", which is exactly the instruction sequence produced by your pattern.

Yep, it looks like there are patterns for dup(extract_element) --> duplane.

multiclass DUPWithTruncPats<ValueType ResVT, ValueType Src64VT,
                            ValueType Src128VT, ValueType ScalVT,
                            Instruction DUP, SDNodeXForm IdxXFORM> {
  def : Pat<(ResVT (AArch64dup (ScalVT (vector_extract (Src128VT V128:$Rn),
                                                     imm:$idx)))),
            (DUP V128:$Rn, (IdxXFORM imm:$idx))>;

  def : Pat<(ResVT (AArch64dup (ScalVT (vector_extract (Src64VT V64:$Rn),
                                                     imm:$idx)))),
            (DUP (SUBREG_TO_REG (i64 0), V64:$Rn, dsub), (IdxXFORM imm:$idx))>;
}

defm : DUPWithTruncPats<v8i8,   v4i16, v8i16, i32, DUPv8i8lane,  VecIndex_x2>;
defm : DUPWithTruncPats<v8i8,   v2i32, v4i32, i32, DUPv8i8lane,  VecIndex_x4>;
defm : DUPWithTruncPats<v4i16,  v2i32, v4i32, i32, DUPv4i16lane, VecIndex_x2>;

Let me add a custom SDNode AArch64uaddlv for uaddlv and patterns for it.
Thanks for checking it.
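
Concretely, the lowering side could look roughly like the sketch below in AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN (only a sketch of the shape; the v4i32 result type and the exact type guard are assumptions at this point in the discussion):

case Intrinsic::aarch64_neon_uaddlv: {
  EVT OpVT = Op.getOperand(1).getValueType();
  EVT ResVT = Op.getValueType();
  if (ResVT == MVT::i32 && (OpVT == MVT::v8i8 || OpVT == MVT::v16i8)) {
    // Produce the reduction as a vector-typed node so the value stays in a
    // SIMD/FP register, then extract lane 0 as the scalar i32 result.
    SDValue UADDLV =
        DAG.getNode(AArch64ISD::UADDLV, dl, MVT::v4i32, Op.getOperand(1));
    SDValue EXTRACT_VEC_ELT =
        DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i32, UADDLV,
                    DAG.getConstant(0, dl, MVT::i64));
    return EXTRACT_VEC_ELT;
  }
  return SDValue();
}

With that shape, the TableGen side only needs patterns that select the vector-typed AArch64uaddlv node, and the existing dup(extract_element) --> duplane handling should do the rest.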

jaykang10 updated this revision to Diff 555720. Edited · Sep 4 2023, 6:04 AM

Following @efriedma's comment, I added a custom SDNode for uaddlv and patterns for it.

@efriedma If you see anything wrong with this update, please let me know.

efriedma accepted this revision. · Sep 4 2023, 9:02 PM

LGTM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5322

This could be extended to i16 uaddlv as well, but we can leave that for a followup, I guess.

This revision is now accepted and ready to land. · Sep 4 2023, 9:02 PM
dmgreen added inline comments. · Sep 5 2023, 1:16 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5325

An MVT::v8i16 with an extract might be a more natural representation for UADDLV that produces an h register.
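
For illustration only, that alternative would shape the lowering like this (a hypothetical sketch, not what the patch does; EXTRACT_VECTOR_ELT is allowed to widen the i16 element to i32):

// Model the h-register result of uaddlv on v8i8 as lane 0 of a v8i16, then
// extract lane 0 with an implicit widening to i32.
SDValue UADDLV =
    DAG.getNode(AArch64ISD::UADDLV, dl, MVT::v8i16, Op.getOperand(1));
SDValue Elt = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, dl, MVT::i32, UADDLV,
                          DAG.getConstant(0, dl, MVT::i64));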

8716

Can you change this to generate a UADDLV directly?
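
A rough sketch of what that could mean, assuming (this is an assumption, not stated in the review) that the code at this line is where the ctpop lowering currently builds the uaddlv via ISD::INTRINSIC_WO_CHAIN; variable names are placeholders:

// Instead of wrapping the intrinsic ID in an INTRINSIC_WO_CHAIN node, e.g.
//   SDValue UaddLV = DAG.getNode(
//       ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32,
//       DAG.getConstant(Intrinsic::aarch64_neon_uaddlv, DL, MVT::i32), CtPop);
// emit the new node directly and extract lane 0:
SDValue UaddLV = DAG.getNode(AArch64ISD::UADDLV, DL, MVT::v4i32, CtPop);
SDValue Res = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i32, UaddLV,
                          DAG.getConstant(0, DL, MVT::i64));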

llvm/lib/Target/AArch64/AArch64InstrInfo.td
331

I think this can be the same as SDT_AArch64uaddlp

jaykang10 added inline comments. · Sep 5 2023, 5:39 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5322

Yep, it seems there is no pattern for dup(extract_element) --> duplane with v8i16.
Let's handle that type in another patch.

5325

The uaddlv intrinsic's result type for v8i8 and v16i8 sources is i32 rather than i16, so we need to return an i32 as the extract_vector_elt's result.

8716

Maybe we could use UADDLV here.
Let's check it in another patch.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
331

Let me use SDT_AArch64uaddlp.

jaykang10 updated this revision to Diff 555852. · Sep 5 2023, 6:09 AM
This revision was landed with ongoing or failed builds. · Sep 5 2023, 6:45 AM
This revision was automatically updated to reflect the committed changes.
jaykang10 added inline comments. · Sep 5 2023, 8:50 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5322

I have tried the patch below.

diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 2bb8e4324306..87c836905659 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -5327,7 +5327,8 @@ SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
   case Intrinsic::aarch64_neon_uaddlv: {
     EVT OpVT = Op.getOperand(1).getValueType();
     EVT ResVT = Op.getValueType();
-    if (ResVT == MVT::i32 && (OpVT == MVT::v8i8 || OpVT == MVT::v16i8)) {
+    if (ResVT == MVT::i32 &&
+        (OpVT == MVT::v8i8 || OpVT == MVT::v16i8 || OpVT == MVT::v8i16)) {
       // In order to avoid insert_subvector, used v4i32 than v2i32.
       SDValue UADDLV =
           DAG.getNode(AArch64ISD::UADDLV, dl, MVT::v4i32, Op.getOperand(1));
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index 4a1f46f2576a..658b22d312fb 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -6067,6 +6067,8 @@ defm : DUPWithTruncPats<v16i8,  v4i16, v8i16, i32, DUPv16i8lane, VecIndex_x2>;
 defm : DUPWithTruncPats<v16i8,  v2i32, v4i32, i32, DUPv16i8lane, VecIndex_x4>;
 defm : DUPWithTruncPats<v8i16,  v2i32, v4i32, i32, DUPv8i16lane, VecIndex_x2>;
 
+defm : DUPWithTruncPats<v4i32,  v2i32, v4i32, i32, DUPv8i16lane, VecIndex_x2>;
+
 multiclass DUPWithTrunci64Pats<ValueType ResVT, Instruction DUP,
                                SDNodeXForm IdxXFORM> {
   def : Pat<(ResVT (AArch64dup (i32 (trunc (extractelt (v2i64 V128:$Rn),
@@ -6462,12 +6464,21 @@ def : Pat<(i32 (int_aarch64_neon_uaddlv (v8i16 (AArch64uaddlp (v16i8 V128:$op)))
             (v8i16 (SUBREG_TO_REG (i64 0), (UADDLVv16i8v V128:$op), hsub)),
             ssub))>;
 
+def : Pat<(i32 (vector_extract
+            (v4i32 (AArch64uaddlv (v8i16 (AArch64uaddlp (v16i8 V128:$op))))), (i64 0))),
+          (i32 (EXTRACT_SUBREG
+            (v8i16 (SUBREG_TO_REG (i64 0), (UADDLVv16i8v V128:$op), hsub)),
+            ssub))>;
+
 def : Pat<(v4i32 (AArch64uaddlv (v8i8 V64:$Rn))),
           (v4i32 (SUBREG_TO_REG (i64 0), (UADDLVv8i8v V64:$Rn), hsub))>;
 
 def : Pat<(v4i32 (AArch64uaddlv (v16i8 V128:$Rn))),
           (v4i32 (SUBREG_TO_REG (i64 0), (UADDLVv16i8v V128:$Rn), hsub))>;
 
+def : Pat<(v4i32 (AArch64uaddlv (v8i16 V128:$Rn))),
+          (v4i32 (SUBREG_TO_REG (i64 0), (UADDLVv8i16v V128:$Rn), ssub))>;
+
 // Patterns for across-vector intrinsics, that have a node equivalent, that
 // returns a vector (with only the low lane defined) instead of a scalar.
 // In effect, opNode is the same as (scalar_to_vector (IntNode)).
diff --git a/llvm/test/CodeGen/AArch64/neon-addlv.ll b/llvm/test/CodeGen/AArch64/neon-addlv.ll
index 0f5a19c7a0f3..0769adce87d3 100644
--- a/llvm/test/CodeGen/AArch64/neon-addlv.ll
+++ b/llvm/test/CodeGen/AArch64/neon-addlv.ll
@@ -178,8 +178,8 @@ entry:
   ret i32 %0
 }

-define dso_local <8 x i8> @bar(<8 x i8> noundef %a) local_unnamed_addr #0 {
-; CHECK-LABEL: bar:
+define dso_local <8 x i8> @uaddlv_v8i8(<8 x i8> %a) {
+; CHECK-LABEL: uaddlv_v8i8:
 ; CHECK:       // %bb.0: // %entry
 ; CHECK-NEXT:    uaddlv h0, v0.8b
 ; CHECK-NEXT:    dup v0.8h, v0.h[0]
@@ -194,4 +194,22 @@ entry:
   ret <8 x i8> %vrshrn_n2
 }
 
+define dso_local <8 x i16> @uaddlv_v8i16(<8 x i16> %a) {
+; CHECK-LABEL: uaddlv_v8i16:
+; CHECK:       // %bb.0: // %entry
+; CHECK-NEXT:    uaddlv s0, v0.8h
+; CHECK-NEXT:    dup v1.8h, v0.h[0]
+; CHECK-NEXT:    rshrn v0.4h, v1.4s, #3
+; CHECK-NEXT:    rshrn2 v0.8h, v1.4s, #3
+; CHECK-NEXT:    ret
+entry:
+  %vaddlv.i = tail call i32 @llvm.aarch64.neon.uaddlv.i32.v8i16(<8 x i16> %a)
+  %vecinit.i = insertelement <8 x i32> undef, i32 %vaddlv.i, i64 0
+  %vecinit7.i = shufflevector <8 x i32> %vecinit.i, <8 x i32> poison, <8 x i32> zeroinitializer
+  %vrshrn_n2 = tail call <8 x i16> @llvm.aarch64.neon.rshrn.v8i16(<8 x i32> %vecinit7.i, i32 3)
+  ret <8 x i16> %vrshrn_n2
+}
+
 declare <8 x i8> @llvm.aarch64.neon.rshrn.v8i8(<8 x i16>, i32)
+declare <8 x i16> @llvm.aarch64.neon.rshrn.v8i16(<8 x i32>, i32)
+declare i32 @llvm.aarch64.neon.uaddlv.i32.v8i16(<8 x i16>)
diff --git a/llvm/test/CodeGen/AArch64/uaddlv-vaddlp-combine.ll b/llvm/test/CodeGen/AArch64/uaddlv-vaddlp-combine.ll
index 8b48635b6694..e6b253b258f1 100644
--- a/llvm/test/CodeGen/AArch64/uaddlv-vaddlp-combine.ll
+++ b/llvm/test/CodeGen/AArch64/uaddlv-vaddlp-combine.ll
@@ -17,7 +17,8 @@ define i32 @uaddlv_uaddlp_v8i16(<8 x i16> %0) {
 define i16 @uaddlv_uaddlp_v16i8(<16 x i8> %0) {
 ; CHECK-LABEL: uaddlv_uaddlp_v16i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    uaddlv h0, v0.16b
+; CHECK-NEXT:    uaddlp v0.8h, v0.16b
+; CHECK-NEXT:    uaddlv s0, v0.8h
 ; CHECK-NEXT:    fmov w0, s0
 ; CHECK-NEXT:    ret
   %2 = tail call <8 x i16> @llvm.aarch64.neon.uaddlp.v8i16.v16i8(<16 x i8> %0)

As you can see, there is a regression on uaddlv_uaddlp_v16i8, even though I added a pattern to cover it, because the first pattern is matched earlier than the second one.

first pattern
+defm : DUPWithTruncPats<v4i32,  v2i32, v4i32, i32, DUPv8i16lane, VecIndex_x2>;

second pattern
+def : Pat<(i32 (vector_extract
+            (v4i32 (AArch64uaddlv (v8i16 (AArch64uaddlp (v16i8 V128:$op))))), (i64 0))),
+          (i32 (EXTRACT_SUBREG
+            (v8i16 (SUBREG_TO_REG (i64 0), (UADDLVv16i8v V128:$op), hsub)),
+            ssub))>;
+

I think it could be OK to keep the uaddlv intrinsic rather than the uaddlv SDNode for the v8i16 type...

re: the big-endian stuff I mentioned on the other ticket... it looks like it isn't a regression, but my concern is the code generated for ctpop_i32 for a big-endian target. uaddlv v16i8 produces a result in h0 (element 0 of an 8 x i16), but we then access it as s0 (element 0 of a 4 x i32) without a bitcast. So I think the bits end up in the wrong place?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5322

Not sure I understand the issue here; more specific patterns (more nodes in the pattern definition) should win, no matter the order in the file, so you should be able to fix this with the right patterns.

> re: the big-endian stuff I mentioned on the other ticket... it looks like it isn't a regression, but my concern is the code generated for ctpop_i32 for a big-endian target. uaddlv v16i8 produces a result in h0 (element 0 of an 8 x i16), but we then access it as s0 (element 0 of a 4 x i32) without a bitcast. So I think the bits end up in the wrong place?

I think it's the other way around (hopefully I have it the right way around, BE can be confusing). A bitcast would swap the lane indices (it acts as a store and a load). Otherwise lane 0 is the lowest lane in both LLVM IR and the NEON registers.

> re: the big-endian stuff I mentioned on the other ticket... it looks like it isn't a regression, but my concern is the code generated for ctpop_i32 for a big-endian target. uaddlv v16i8 produces a result in h0 (element 0 of an 8 x i16), but we then access it as s0 (element 0 of a 4 x i32) without a bitcast. So I think the bits end up in the wrong place?
>
> I think it's the other way around (hopefully I have it the right way around, BE can be confusing). A bitcast would swap the lane indices (it acts as a store and a load). Otherwise lane 0 is the lowest lane in both LLVM IR and the NEON registers.

To be sure, I would like to check one thing. As far as I understand, endianness affects the order of elements in memory, so we need a rev instruction after a load and before a store. After the rev instruction, we do not need to care about endianness any more. Is that correct, or am I wrong? Are there other rules for big-endian on AArch64?
For the big-endian output of ctpop_i32, I can see a rev instruction because AArch64TargetLowering::LowerCTPOP_PARITY generates a bitcast from i64 to v8i8. Does it also need to be changed to NVCAST? It seems we need to be careful with `bitcast`, which causes a rev instruction for big-endian...

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
5322

I may have made a mistake. Let me check it again.

> To be sure, I would like to check one thing. As far as I understand, endianness affects the order of elements in memory, so we need a rev instruction after a load and before a store. After the rev instruction, we do not need to care about endianness any more. Is that correct, or am I wrong? Are there other rules for big-endian on AArch64?

bitcasts are defined as store+load, so they can change the lane order. NVCast acts upon the representation in the vector, so it keeps the lanes in the same order. Vector function arguments are also passed in a particular order that sometimes needs to be considered (they often need a rev).

> For the big-endian output of ctpop_i32, I can see a rev instruction because AArch64TargetLowering::LowerCTPOP_PARITY generates a bitcast from i64 to v8i8. Does it also need to be changed to NVCAST? It seems we need to be careful with `bitcast`, which causes a rev instruction for big-endian...

I think for this specific case it does not actually matter. Because the rev feeds into a cnt and an addlv on the individual i8 elements, and the addlv is performing a (commutative) reduction, it doesn't matter if the lanes get reversed. We still sum up the same values. So it could be either a BITCAST or an NVCAST and both should work (although I'm not sure an NVCAST between i64 and vectors is defined).