This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Remove AND and FMOV between uaddlv and urshl
Needs Review · Public

Authored by jaykang10 on Apr 13 2023, 7:55 AM.

Details

Summary

gcc generates fewer instructions than llvm for the intrinsic example below. The example was mentioned in https://reviews.llvm.org/D148134.

#include <arm_neon.h>

uint8x8_t test1(uint8x8_t a) {
    return vdup_n_u8(vrshrd_n_u64(vaddlv_u8(a), 3));
}

gcc output
test1:
	uaddlv	h0, v0.8b
	umov	w0, v0.h[0]
	fmov	d0, x0
	urshr	d0, d0, 3
	dup	v0.8b, v0.b[0]
	ret

llvm output
test1:                                  // @test1
	uaddlv	h0, v0.8b
	fmov	w8, s0
	and	w8, w8, #0xffff
	fmov	d0, x8
	urshr	d0, d0, #3
	fmov	x8, d0
	dup	v0.8b, w8
	ret

With this patch's tablegen pattern, llvm generates the output below.

test1:                                  // @test1
	uaddlv	h0, v0.8b
	urshr	d0, d0, #3
	fmov	x8, d0
	dup	v0.8b, w8
	ret

Diff Detail

Event Timeline

jaykang10 created this revision. · Apr 13 2023, 7:55 AM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 13 2023, 7:55 AM
jaykang10 requested review of this revision. · Apr 13 2023, 7:55 AM

This feels a bit too specific to the exact instructions here, as opposed to the general case. We could change how i64 shifts are represented in the DAG, using v1i64 instead to show that they operate on neon registers. The and 0xffff could be removed by teaching it that the uaddlv node only produces zeros in the upper bits (in AArch64TargetLowering::computeKnownBitsForTargetNode). That doesn't solve everything. The representation of aarch64.neon.uaddlv might need to change too, perhaps to produce a v8i16, and something might need to recognize that the upper lanes are zero. That is the part that I'm less sure how it would work.
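
For reference, a minimal sketch of what that knownbits teaching could look like, inside the ISD::INTRINSIC_WO_CHAIN handling of AArch64TargetLowering::computeKnownBitsForTargetNode (the case placement and the 16-bit bound are assumptions for illustration, not code from this patch):

case ISD::INTRINSIC_WO_CHAIN: {
  unsigned IntNo = Op.getConstantOperandVal(0);
  switch (IntNo) {
  default:
    break;
  case Intrinsic::aarch64_neon_uaddlv: {
    // uaddlv of v8i8/v16i8 sums 8-bit lanes, so the result fits in 16 bits.
    // When the intrinsic returns i32, bits 16 and above are known zero,
    // which is enough for the combiner to drop the `and w8, w8, #0xffff`.
    EVT SrcVT = Op.getOperand(1).getValueType();
    if (SrcVT == MVT::v8i8 || SrcVT == MVT::v16i8) {
      unsigned BitWidth = Known.getBitWidth();
      Known.Zero |= APInt::getHighBitsSet(BitWidth, BitWidth - 16);
    }
    break;
  }
  }
  break;
}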

I agree with you. This pattern targets too specific a case...
The fundamental issue is that clang generates the function definition of vaddlv_u8 as below, and llvm supports that code sequence.

define internal fastcc i16 @vaddlv_u8(<8 x i8> noundef %__p0) unnamed_addr #2 {  
entry:
  %vaddlv = tail call i32 @llvm.aarch64.neon.uaddlv.i32.v8i8(<8 x i8> %__p0)
  %0 = trunc i32 %vaddlv to i16
  ret i16 %0
}

If clang generated llvm.aarch64.neon.uaddlv.i16.v8i8 or llvm.aarch64.neon.uaddlv.f16.v8i8 rather than llvm.aarch64.neon.uaddlv.i32.v8i8, and llvm supported it, we would not see the and (see the sketch below).
The uaddlv also has a similar issue: its output register class is an FPR, but the intrinsic uses an integer output type. To support that, llvm has specific tablegen patterns.
If possible, I did not want to change the existing patterns and code while keeping the current intrinsic definition in clang...
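
For illustration, if clang emitted the i16 overload mentioned above, the vaddlv_u8 definition would look roughly like this (a hypothetical sketch of the IR shape, not what clang emits today):

define internal fastcc i16 @vaddlv_u8(<8 x i8> noundef %__p0) unnamed_addr #2 {
entry:
  %vaddlv = tail call i16 @llvm.aarch64.neon.uaddlv.i16.v8i8(<8 x i8> %__p0)
  ret i16 %vaddlv
}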

I think the code generated by clang should be fine, for the most part. Intrinsics often produce an i32 (as opposed to an i16) because it is a legal type, so the nodes become easier to legalize. That doesn't mean we always need to represent it the same way in the DAG. We could convert aarch64.neon.uaddlv to an AArch64ISD::UADDLV node and have it produce different input/output types. I will try to put the shift patch I mentioned to you up for review; it has a problem with combining adds into ssra at the moment, though.
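
A very rough sketch of that direction (the AArch64ISD::UADDLV node name and its v8i16 result type are assumptions taken from the comment above, not in-tree code):

static SDValue lowerNeonUADDLV(SDNode *N, SelectionDAG &DAG) {
  // N is the ISD::INTRINSIC_WO_CHAIN node for llvm.aarch64.neon.uaddlv;
  // operand 0 is the intrinsic ID, operand 1 is the source vector.
  SDLoc DL(N);
  SDValue Src = N->getOperand(1);
  if (Src.getValueType() != MVT::v8i8)
    return SDValue();
  // Hypothetical target node: keep the 16-bit sum in lane 0 of a NEON
  // register instead of returning it as a scalar i32 in a GPR.
  SDValue UADDLV = DAG.getNode(AArch64ISD::UADDLV, DL, MVT::v8i16, Src);
  // Recreate the intrinsic's original i32 result from lane 0 so existing
  // users still type-check; patterns that want the value on an FPR can
  // fold this extract away later.
  return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::i32, UADDLV,
                     DAG.getConstant(0, DL, MVT::i64));
}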

Yep, thanks!
Additionally, in the future, it could be good to use the f16 type, which is a legal type, for the intrinsics with an f16 result. I guess there was no f16 type support in clang and llvm when the intrinsics were implemented.

jaykang10 added a comment. · Edited · Apr 14 2023, 12:50 AM

Additionally, for an MI peephole opt, the MIR code sequence between uaddlv and urshr is a bit long, as below, and there could be different sequences. I think your patch in SelectionDAG would be good.

%1:fpr16 = UADDLVv8i8v %0:fpr64
%3:fpr128 = IMPLICIT_DEF
%2:fpr128 = INSERT_SUBREG %3:fpr128(tied-def 0), killed %1:fpr16, %subreg.hsub
%4:gpr32 = COPY %2.ssub:fpr128
%5:gpr32common = ANDWri killed %4:gpr32, 15
%7:gpr64all = SUBREG_TO_REG 0, %5:gpr32common, %subreg.sub_32
%9:fpr64 = COPY %7:gpr64all
%8:fpr64 = URSHRd killed %9:fpr64, 3

I also tried a DAGCombiner and TableGen pattern, as below, but it was almost the same as this review's patch.

diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 08ba05407888..e4aa3aee7bb7 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -15371,6 +15371,44 @@ static SDValue performUADDVCombine(SDNode *N, SelectionDAG &DAG) {
   return SDValue();
 }

+static SDValue performURSHR_ICombine(SDNode *N, SelectionDAG &DAG) {
+  // We are expecting below pattern.
+  //
+  //        t4: i32 = llvm.aarch64.neon.uaddlv TargetConstant:i64<618>, t2
+  //      t6: i32 = and t4, Constant:i32<65535>
+  //    t7: i64 = zero_extend t6
+  //  t21: i64 = AArch64ISD::URSHR_I t7, Constant:i32<3>
+  //
+  // We can remove `and` as below.
+  //
+  //      t4: i32 = llvm.aarch64.neon.uaddlv TargetConstant:i64<618>, t2
+  //    t7: i64 = zero_extend t4
+  //  t21: i64 = AArch64ISD::URSHR_I t7, Constant:i32<3>
+
+  // Try to detect above pattern.
+  SDValue ZExt = N->getOperand(0);
+  if (ZExt.getOpcode() != ISD::ZERO_EXTEND)
+    return SDValue();
+
+  SDValue AND = ZExt->getOperand(0);
+  if (AND.getOpcode() != ISD::AND)
+    return SDValue();
+
+  SDValue UADDLV = AND->getOperand(0);
+  unsigned IID = getIntrinsicID(UADDLV.getNode());
+  if (IID != Intrinsic::aarch64_neon_uaddlv)
+    return SDValue();
+
+  // We have detected above pattern. Let's create nodes without `and`.
+  SDValue NewZExt = DAG.getNode(ISD::ZERO_EXTEND, SDLoc(ZExt.getNode()),
+                                ZExt->getValueType(0), UADDLV);
+  SDValue NewURSHR_I =
+      DAG.getNode(AArch64ISD::URSHR_I, SDLoc(N), N->getValueType(0), NewZExt,
+                  N->getOperand(1));
+
+  return NewURSHR_I;
+}
+
 static SDValue performXorCombine(SDNode *N, SelectionDAG &DAG,
                                  TargetLowering::DAGCombinerInfo &DCI,
                                  const AArch64Subtarget *Subtarget) {
@@ -21787,6 +21825,8 @@ SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,
     return performVecReduceAddCombine(N, DCI.DAG, Subtarget);
   case AArch64ISD::UADDV:
     return performUADDVCombine(N, DAG);
+  case AArch64ISD::URSHR_I:
+    return performURSHR_ICombine(N, DAG);
   case AArch64ISD::SMULL:
   case AArch64ISD::UMULL:
   case AArch64ISD::PMULL:
diff --git a/llvm/lib/Target/AArch64/AArch64InstrInfo.td b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
index 4162da5f5f3c..6d3b4989d820 100644
--- a/llvm/lib/Target/AArch64/AArch64InstrInfo.td
+++ b/llvm/lib/Target/AArch64/AArch64InstrInfo.td
@@ -6995,6 +6995,14 @@ defm USRA     : SIMDScalarRShiftDTied<   1, 0b00010, "usra",
     TriOpFrag<(add_and_or_is_add node:$LHS,
                    (AArch64vlshr node:$MHS, node:$RHS))>>;

+def : Pat<(i64 (AArch64urshri (i64 (zext (i32 (int_aarch64_neon_uaddlv (v8i8 V64:$Rn))))),
+                              (i32 vecshiftR64:$imm))),
+          (i64 (URSHRd
+            (EXTRACT_SUBREG
+              (INSERT_SUBREG (v16i8 (IMPLICIT_DEF)),
+                (UADDLVv8i8v V64:$Rn), hsub), dsub),
+             vecshiftR64:$imm))>;
+
 //----------------------------------------------------------------------------
 // AdvSIMD vector shift instructions
 //----------------------------------------------------------------------------