This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Remove AND and FMOV between uaddlv and urshl
Needs Review · Public

Authored by jaykang10 on Apr 13 2023, 7:55 AM.

Details

Summary

gcc generates fewer instructions than llvm for the intrinsic example below. The example was mentioned in https://reviews.llvm.org/D148134.

#include <arm_neon.h>

uint8x8_t test1(uint8x8_t a) {
    return vdup_n_u8(vrshrd_n_u64(vaddlv_u8(a), 3));
}

gcc output
test1:
	uaddlv	h0, v0.8b
	umov	w0, v0.h[0]
	fmov	d0, x0
	urshr	d0, d0, 3
	dup	v0.8b, v0.b[0]
	ret

llvm output
test1:                                  // @test1
	uaddlv	h0, v0.8b
	fmov	w8, s0
	and	w8, w8, #0xffff
	fmov	d0, x8
	urshr	d0, d0, #3
	fmov	x8, d0
	dup	v0.8b, w8
	ret

With this patch's tablegen pattern, llvm generates the output below.

test1:                                  // @test1
	uaddlv	h0, v0.8b
	urshr	d0, d0, #3
	fmov	x8, d0
	dup	v0.8b, w8
	ret

Diff Detail

Event Timeline

jaykang10 created this revision. · Apr 13 2023, 7:55 AM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 13 2023, 7:55 AM
jaykang10 requested review of this revision. · Apr 13 2023, 7:55 AM

This feels a bit too specific to the exact instructions here, as opposed to the general case. We could change how i64 shifts are represented in the DAG, using v1i64 instead to show that they operate on neon registers. The and 0xffff could be removed by teaching it that the uaddlv node only produces zeros in the upper bits (in AArch64TargetLowering::computeKnownBitsForTargetNode). That doesn't solve everything. The representation of aarch64.neon.uaddlv might need to change too, perhaps to produce a v8i16, and something might need to recognize that the upper lanes are zero. That is the part that I'm less sure how it would work.
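
For reference, a minimal sketch of what that knownbits teaching could look like, inside the ISD::INTRINSIC_WO_CHAIN handling of AArch64TargetLowering::computeKnownBitsForTargetNode (the case placement and the 16-bit bound are assumptions for illustration, not code from this patch):

case ISD::INTRINSIC_WO_CHAIN: {
  unsigned IntNo = Op.getConstantOperandVal(0);
  switch (IntNo) {
  default:
    break;
  case Intrinsic::aarch64_neon_uaddlv: {
    // uaddlv of v8i8/v16i8 sums 8-bit lanes, so the result fits in 16 bits.
    // When the intrinsic returns i32, bits 16 and above are known zero,
    // which is enough for the combiner to drop the `and w8, w8, #0xffff`.
    EVT SrcVT = Op.getOperand(1).getValueType();
    if (SrcVT == MVT::v8i8 || SrcVT == MVT::v16i8) {
      unsigned BitWidth = Known.getBitWidth();
      Known.Zero |= APInt::getHighBitsSet(BitWidth, BitWidth - 16);
    }
    break;
  }
  }
  break;
}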

I agree with you. This pattern targets too specific a case...
The fundamental issue is that clang generates the function definition of vaddlv_u8 as below, and llvm supports that code sequence.

define internal fastcc i16 @vaddlv_u8(<8 x i8> noundef %__p0) unnamed_addr #2 {  
entry:
  %vaddlv = tail call i32 @llvm.aarch64.neon.uaddlv.i32.v8i8(<8 x i8> %__p0)
  %0 = trunc i32 %vaddlv to i16
  ret i16 %0
}

If clang generated llvm.aarch64.neon.uaddlv.i16.v8i8 or llvm.aarch64.neon.uaddlv.f16.v8i8 rather than llvm.aarch64.neon.uaddlv.i32.v8i8, and llvm supported it, we would not see the and (see the sketch below).
The uaddlv also has a similar issue: its output register class is an FPR, but the intrinsic uses an integer output type. To support that, llvm has specific tablegen patterns.
If possible, I did not want to change the existing patterns and code while keeping the current intrinsic definition in clang...
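
For illustration, if clang emitted the i16 overload mentioned above, the vaddlv_u8 definition would look roughly like this (a hypothetical sketch of the IR shape, not what clang emits today):

define internal fastcc i16 @vaddlv_u8(<8 x i8> noundef %__p0) unnamed_addr #2 {
entry:
  %vaddlv = tail call i16 @llvm.aarch64.neon.uaddlv.i16.v8i8(<8 x i8> %__p0)
  ret i16 %vaddlv
}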

I think the code generated by clang should be fine, for the most part. Intrinsics often produce an i32 (as opposed to an i16) because it is a legal type, so the nodes become easier to legalize. That doesn't mean we always need to represent it the same way in the DAG. We could convert aarch64.neon.uaddlv to an AArch64ISD::UADDLV node and have it produce different input/output types. I will try to put the shift patch I mentioned to you up for review; it has a problem with combining adds into ssra at the moment, though.
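
A very rough sketch of that direction (the AArch64ISD::UADDLV node name and its v8i16 result type are assumptions taken from the comment above, not in-tree code):

static SDValue lowerNeonUADDLV(SDNode *N, SelectionDAG &DAG) {
  // N is the ISD::INTRINSIC_WO_CHAIN node for llvm.aarch64.neon.uaddlv;
  // operand 0 is the intrinsic ID, operand 1 is the source vector.
  SDLoc DL(N);
  SDValue Src = N->getOperand(1);
  if (Src.getValueType() != MVT::v8i8)
    return SDValue();
  // Hypothetical target node: keep the 16-bit sum in lane 0 of a NEON
  // register instead of returning it as a scalar i32 in a GPR.
  SDValue UADDLV = DAG.getNode(AArch64ISD::UADDLV, DL, MVT::v8i16, Src);
  // Recreate the intrinsic's original i32 result from lane 0 so existing
  // users still type-check; patterns that want the value on an FPR can
  // fold this extract away later.
  return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i32, UADDLV,
                     DAG.getConstant(0, DL, MVT::i64));
}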

Yep, thanks!
Additionally, in the future, it could be good to use the f16 type, which is a legal type, for the intrinsics with an f16 result. I guess there was no f16 type support in clang and llvm when the intrinsics were implemented.

jaykang10 added a comment. · Edited · Apr 14 2023, 12:50 AM

Additionally, for an MI peephole opt, the MIR code sequence between uaddlv and urshr is a bit long, as below, and there could be different sequences. I think your patch in SelectionDAG would be good.

%1:fpr16 = UADDLVv8i8v %0:fpr64
%3:fpr128 = IMPLICIT_DEF
%2:fpr128 = INSERT_SUBREG %3:fpr128(tied-def 0), killed %1:fpr16, %subreg.hsub
%4:gpr32 = COPY %2.ssub:fpr128
%5:gpr32common = ANDWri killed %4:gpr32, 15
%7:gpr64all = SUBREG_TO_REG 0, %5:gpr32common, %subreg.sub_32
%9:fpr64 = COPY %7:gpr64all
%8:fpr64 = URSHRd killed %9:fpr64, 3

I also tried a DAGCombiner and TableGen pattern, as below, but it was almost the same as this review's patch.

diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 08ba05407888..e4aa3aee7bb7 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -15371,6 +15371,44 @@ static SDValue performUADDVCombine(SDNode *N, SelectionDAG &DAG) {
   return SDValue();
 }

+static SDValue performURSHR_ICombine(SDNode *N, SelectionDAG &DAG) {
+  // We are expecting below pattern.
+  //
+  //        t4: i32 = llvm.aarch64.neon.uaddlv TargetConstant:i64<618>, t2
+  //      t6: i32 = and t4, Constant:i32<65535>
+  //    t7: i64 = zero_extend t6
+  //  t21: i64 = AArch64ISD::URSHR_I t7, Constant:i32<3>
+  //
+  // We can remove `and` as below.
+  //
+  //      t4: i32 = llvm.aarch64.neon.uaddlv TargetConstant:i64<618>, t2
+  //    t7: i64 = zero_extend t4
+  //  t21: i64 = AArch64ISD::URSHR_I t7, Constant:i32<3>
+
+  // Try to detect above pattern.
+  SDValue ZExt = N->getOperand(0);
+  if (ZExt.getOpcode() != ISD::ZERO_EXTEND)
+    return SDValue();
+
+  SDValue AND = ZExt->getOperand(0);
+  if (AND.getOpcode() != ISD::AND)
+    return SDValue();
+
+  SDValue UADDLV = AND->getOperand(0);
+  unsigned IID = getIntrinsicID(UADDLV.getNode());
+  if (IID != Intrinsic::aarch64_neon_uaddlv)
+    return SDValue();
+
+  // We have detected above pattern. Let's create nodes without `and`.
+  SDValue NewZExt = DAG.getNode(ISD::ZERO_EXTEND, SDLoc(ZExt.getNode()),
+                                ZExt->getValueType(0), UADDLV);
+  SDValue NewURSHR_I =
+      DAG.getNode(AArch64ISD::URSHR_I, SDLoc(N), N->getValueType(0), NewZExt,
+                  N->getOperand(1));
+
+  return NewURSHR_I;
+}
+
 static SDValue performXorCombine(SDNode *N, SelectionDAG &DAG,
                                  TargetLowering::DAGCombinerInfo &DCI,
                                  const AArch64Subtarget *Subtarget) {
@@ -21787,6 +21825,8 @@ SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,
     return performVecReduceAddCombine(N, DCI.DAG, Subtarget);
   case AArch64ISD::UADDV:
     return performUADDVCombine(N, DAG);
+  case AArch64ISD::URSHR_I:
+    return performURSHR_ICombine(N, DAG);
   case AArch64ISD::SMULL:
   case AArch64ISD::UMULL:
   case AArch64ISD::PMULL:
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index 4162da5f5f3c..6d3b4989d820 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -6995,6 +6995,14 @@ defm USRA     : SIMDScalarRShiftDTied<   1, 0b00010, "usra",
     TriOpFrag<(add_and_or_is_add node:$LHS,
                    (AArch64vlshr node:$MHS, node:$RHS))>>;

+def : Pat<(i64 (AArch64urshri (i64 (zext (i32 (int_aarch64_neon_uaddlv (v8i8 V64:$Rn))))),
+                              (i32 vecshiftR64:$imm))),
+          (i64 (URSHRd
+            (EXTRACT_SUBREG
+              (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
+                (UADDLVv8i8v V64:$Rn), hsub), dsub),
+             vecshiftR64:$imm))>;
+
 //----------------------------------------------------------------------------
 // AdvSIMD vector shift instructions
 //----------------------------------------------------------------------------