Details

Reviewers

dmgreen
efriedma
t.p.northover

Commits

rG5474d49f1f5c: [AArch64] Remove copy instruction between uaddlv and urshr

Summary

GCC generates fewer instructions than LLVM for the example below.

#include <arm_neon.h>

uint8x8_t foo(uint8x8_t a) {
    return vdup_n_u8(vrshrd_n_u64(vaddlv_u8(a), 3));
}

gcc output
foo:
        uaddlv  h0, v0.8b
        urshr   d0, d0, 3
        dup     v0.8b, v0.b[0]
        ret

llvm output
foo:
        uaddlv  h0, v0.8b
        fmov    w8, s0
        fmov    d0, x8
        urshr   d0, d0, #3
        dup     v0.8b, v0.b[0]
        ret

The LLVM output contains copy instructions between GPR and FPR (the two fmov instructions). We could remove them with the pattern below.

def : Pat<(v1i64 (scalar_to_vector (i64 (zext (i32 (int_aarch64_neon_uaddlv (v8i8 V64:$Rn))))))),
          (INSERT_SUBREG (v1i64 (IMPLICIT_DEF)), (UADDLVv8i8v V64:$Rn), hsub)>;

With the above pattern, LLVM generates the output below.

foo:                                    // @foo
        uaddlv  h0, v0.8b
        urshr   d0, d0, #3
        dup     v0.8b, v0.b[0]
        ret

The pattern may be too specific to this example. If you have other ideas for generalizing this case, please let me know.
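As a rough illustration of why the two fmov copies are redundant, the following scalar model mimics the three instructions of the shorter sequence. It is only a sketch, not code from the patch: the function names `uaddlv_v8i8`, `urshr_u64`, and `foo_model` are hypothetical. It relies on the fact that a scalar write to h0 clears the remaining bits of the vector register, so reading the same register as d0 already yields the zero-extended sum.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of UADDLV h0, v0.8b: widen each byte and add
// across lanes. The write to h0 zeroes the rest of the register, so the
// value seen through d0 is the zero-extended sum.
constexpr uint64_t uaddlv_v8i8(const std::array<uint8_t, 8> &a) {
  uint64_t sum = 0;
  for (std::size_t i = 0; i < 8; ++i)
    sum += a[i];
  return sum; // at most 8 * 255 = 2040, which fits in 16 bits
}

// Hypothetical scalar model of URSHR d0, d0, #shift (unsigned rounding shift
// right). Assumes 1 <= shift <= 63 and that x is small enough for the
// rounding addition not to overflow, which holds for the sums above.
constexpr uint64_t urshr_u64(uint64_t x, unsigned shift) {
  return (x + (uint64_t{1} << (shift - 1))) >> shift;
}

// Model of foo(): DUP v0.8b, v0.b[0] broadcasts the low byte of the result.
constexpr std::array<uint8_t, 8> foo_model(const std::array<uint8_t, 8> &a) {
  const uint8_t b = static_cast<uint8_t>(urshr_u64(uaddlv_v8i8(a), 3));
  return {{b, b, b, b, b, b, b, b}};
}
```

Because the value never needs to leave the vector register file in this model, the round trip through w8/x8 in the original LLVM output adds nothing.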

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jaykang10 created this revision. Aug 31 2023, 3:58 AM

Herald added a project: Restricted Project. Aug 31 2023, 3:58 AM

Herald added subscribers: hiraditya, kristof.beyls.

jaykang10 requested review of this revision. Aug 31 2023, 3:58 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B255970: Diff 554964. Aug 31 2023, 4:43 AM

Eli reminded us recently that there is a v16i8 variant of uaddlv that should be handled as well. Maybe the i16->i32 variants of uaddlv too, for matching the i32->i64 zext.

You may find that if this can be done in the DAG (as in D159267), the pattern is unneeded, though. The transform might be simplified in the DAG before it gets to selection.

Updated pattern using UADDLV SDNode

Do we need to be concerned at all about big-endian here? (Actually, also for D159267.) This is basically bitcasting from <2 x i32> to <1 x i64>.

We might want to consider teaching DAGCombine to optimize this sequence to an ISD::BITCAST instead of pattern-matching it late. Might unblock other optimizations? Maybe there aren't really any other optimizations we can do on a uaddlv, though.

Thanks for the comment.

> Do we need to be concerned at all about big-endian here? (Actually, also for D159267.) This is basically bitcasting from <2 x i32> to <1 x i64>.

I am not sure. It should be fine because the compiler adds rev instructions where they are needed for big-endian, but I could be wrong.
If you are concerned about something for big-endian, please let me know.

> We might want to consider teaching DAGCombine to optimize this sequence to an ISD::BITCAST instead of pattern-matching it late. Might unblock other optimizations? Maybe there aren't really any other optimizations we can do on a uaddlv, though.

Let me try to detect the sequence and generate a BITCAST with DAGCombine.

Following @efriedma's comment, folded the SDNode sequence into a BITCAST in DAGCombine.

Harbormaster completed remote builds in B256793: Diff 556143. Sep 7 2023, 8:13 AM

LGTM

In D159265#4640515, @jaykang10 wrote:

> Thanks for the comment.
>
>> Do we need to be concerned at all about big-endian here? (Actually, also for D159267.) This is basically bitcasting from <2 x i32> to <1 x i64>.
>
> I am not sure. It should be fine because the compiler adds rev instructions where they are needed for big-endian, but I could be wrong.
> If you are concerned about something for big-endian, please let me know.

BITCAST itself should be fine; I meant, if we use a substitute sequence, we still need a REV, but there isn't any code to generate it.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
23159	I think we should be able to generalize this to other operations that only produce a result in the low element, but I guess we can leave that for a followup.

This revision is now accepted and ready to land. Sep 7 2023, 10:29 AM

dmgreen requested changes to this revision. Sep 7 2023, 10:55 PM

dmgreen added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
23169	I'm pretty sure this needs to be an AArch64ISD::NVCAST, not a BITCAST. The BITCAST will swap the 0th and 1st lanes into the i64; we need to keep them in order.
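The lane-order concern can be sketched with a small model. This is an illustration only, not LLVM code, and the helper names `nvcast_v2i32_to_v1i64` and `bitcast_v2i32_to_v1i64` are hypothetical. It assumes the usual semantics: ISD::BITCAST behaves as if the value were stored to memory and reloaded with the new type, while AArch64ISD::NVCAST reinterprets the register bits unchanged.

```cpp
#include <cstdint>

// Hypothetical model of AArch64ISD::NVCAST from v2i32 to v1i64: the register
// bits are left untouched, and lane 0 occupies the least significant bits.
constexpr uint64_t nvcast_v2i32_to_v1i64(uint32_t lane0, uint32_t lane1) {
  return (static_cast<uint64_t>(lane1) << 32) | lane0;
}

// Hypothetical model of ISD::BITCAST from v2i32 to v1i64, treated as a store
// followed by a load. On big-endian, element 0 is stored at the lowest
// address, so it becomes the most significant half of the reloaded i64.
constexpr uint64_t bitcast_v2i32_to_v1i64(uint32_t lane0, uint32_t lane1,
                                          bool bigEndian) {
  return bigEndian ? (static_cast<uint64_t>(lane0) << 32) | lane1
                   : nvcast_v2i32_to_v1i64(lane0, lane1);
}
```

With the UADDLV sum in lane 0 and zero in lane 1, the NVCAST model returns the sum itself on either endianness, whereas the big-endian BITCAST model returns the sum shifted into the upper 32 bits, which is not the zero-extended value the original sequence computed.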

This revision now requires changes to proceed. Sep 7 2023, 10:55 PM

jaykang10 added inline comments. Sep 8 2023, 1:33 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
23169	Ah, I did not know we do not need a `rev` instruction here. Let me change BITCAST to NVCAST. Thanks for letting me know.

Following @dmgreen's comment, changed BITCAST to NVCAST.

jaykang10 added inline comments. Sep 8 2023, 1:45 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
23159	Yep, let's look at that in another patch.

Harbormaster completed remote builds in B256843: Diff 556236. Sep 8 2023, 2:50 AM

Thanks. It would be good to add other ops too if we can. Otherwise LGTM.

This revision is now accepted and ready to land. Sep 10 2023, 3:48 AM

This revision was landed with ongoing or failed builds. Sep 11 2023, 1:07 AM

Closed by commit rG5474d49f1f5c: [AArch64] Remove copy instruction between uaddlv and urshr (authored by jaykang10).

This revision was automatically updated to reflect the committed changes.

jaykang10 added a commit: rG5474d49f1f5c: [AArch64] Remove copy instruction between uaddlv and urshr.

Diff 556398

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[...] #undef LCALLNAME5
   setTargetDAGCombine(ISD::GlobalAddress);

   setTargetDAGCombine(ISD::CTLZ);

   setTargetDAGCombine(ISD::VECREDUCE_AND);
   setTargetDAGCombine(ISD::VECREDUCE_OR);
   setTargetDAGCombine(ISD::VECREDUCE_XOR);

+  setTargetDAGCombine(ISD::SCALAR_TO_VECTOR);
+
   // In case of strict alignment, avoid an excessive number of byte wide stores.
   MaxStoresPerMemsetOptSize = 8;
   MaxStoresPerMemset =
       Subtarget->requiresStrictAlign() ? MaxStoresPerMemsetOptSize : 32;

   MaxGluedStoresPerMemcpy = 4;
   MaxStoresPerMemcpyOptSize = 4;
   MaxStoresPerMemcpy =
[...] if (SDValue Val =
     return Val;

   if (SDValue Val = tryCombineMULLWithUZP1(N, DCI, DAG))
     return Val;

   return SDValue();
 }

+static SDValue
+performScalarToVectorCombine(SDNode *N, TargetLowering::DAGCombinerInfo &DCI,
+                             SelectionDAG &DAG) {
+  // Let's do below transform.
+  //
+  // t34: v4i32 = AArch64ISD::UADDLV t2
+  // t35: i32 = extract_vector_elt t34, Constant:i64<0>
+  // t7: i64 = zero_extend t35
+  // t20: v1i64 = scalar_to_vector t7
+  // ==>
+  // t34: v4i32 = AArch64ISD::UADDLV t2
+  // t39: v2i32 = extract_subvector t34, Constant:i64<0>
+  // t40: v1i64 = AArch64ISD::NVCAST t39
+  if (DCI.isBeforeLegalizeOps())
+    return SDValue();
+
+  EVT VT = N->getValueType(0);
+  if (VT != MVT::v1i64)
+    return SDValue();
+
+  SDValue ZEXT = N->getOperand(0);
+  if (ZEXT.getOpcode() != ISD::ZERO_EXTEND || ZEXT.getValueType() != MVT::i64)
+    return SDValue();
+
+  SDValue EXTRACT_VEC_ELT = ZEXT.getOperand(0);
+  if (EXTRACT_VEC_ELT.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
+      EXTRACT_VEC_ELT.getValueType() != MVT::i32)
+    return SDValue();
+
+  if (!isNullConstant(EXTRACT_VEC_ELT.getOperand(1)))
+    return SDValue();
+
+  SDValue UADDLV = EXTRACT_VEC_ELT.getOperand(0);
+  if (UADDLV.getOpcode() != AArch64ISD::UADDLV ||
+      UADDLV.getValueType() != MVT::v4i32 ||
+      UADDLV.getOperand(0).getValueType() != MVT::v8i8)
+    return SDValue();
+
+  // Let's generate new sequence with AArch64ISD::NVCAST.
+  SDLoc DL(N);
+  SDValue EXTRACT_SUBVEC =
+      DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v2i32, UADDLV,
+                  DAG.getConstant(0, DL, MVT::i64));
+  SDValue NVCAST =
+      DAG.getNode(AArch64ISD::NVCAST, DL, MVT::v1i64, EXTRACT_SUBVEC);
+
+  return NVCAST;
+}

 SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,
                                                  DAGCombinerInfo &DCI) const {
   SelectionDAG &DAG = DCI.DAG;
   switch (N->getOpcode()) {
   default:
     LLVM_DEBUG(dbgs() << "Custom combining: skipping\n");
     break;
   case ISD::VECREDUCE_AND:
[...] case ISD::INTRINSIC_W_CHAIN:
     default:
       break;
     }
     break;
   case ISD::GlobalAddress:
     return performGlobalAddressCombine(N, DAG, Subtarget, getTargetMachine());
   case ISD::CTLZ:
     return performCTLZCombine(N, DAG, Subtarget);
+  case ISD::SCALAR_TO_VECTOR:
+    return performScalarToVectorCombine(N, DCI, DAG);
   }
   return SDValue();
 }

 // Check if the return value is used as only a return value, as otherwise
 // we can't perform a tail-call. In particular, we need to check for
 // target ISD nodes that are returns and any other "odd" constructs
 // that the generic analysis code won't necessarily catch.
[...]

llvm/test/CodeGen/AArch64/neon-addlv.ll

[...]
 ; CHECK-NEXT: fmov w0, s0
 ; CHECK-NEXT: ret
 entry:
   %vaddlv.i = tail call i32 @llvm.aarch64.neon.uaddlv.i32.v16i8(<16 x i8> %a)
   %0 = and i32 %vaddlv.i, 65535
   ret i32 %0
 }

-define dso_local <8 x i8> @bar(<8 x i8> noundef %a) local_unnamed_addr #0 {
-; CHECK-LABEL: bar:
+define dso_local <8 x i8> @uaddlv_v8i8_dup(<8 x i8> %a) {
+; CHECK-LABEL: uaddlv_v8i8_dup:
 ; CHECK: // %bb.0: // %entry
 ; CHECK-NEXT: uaddlv h0, v0.8b
 ; CHECK-NEXT: dup v0.8h, v0.h[0]
 ; CHECK-NEXT: rshrn v0.8b, v0.8h, #3
 ; CHECK-NEXT: ret
 entry:
   %vaddlv.i = tail call i32 @llvm.aarch64.neon.uaddlv.i32.v8i8(<8 x i8> %a)
   %0 = trunc i32 %vaddlv.i to i16
   %vecinit.i = insertelement <8 x i16> undef, i16 %0, i64 0
   %vecinit7.i = shufflevector <8 x i16> %vecinit.i, <8 x i16> poison, <8 x i32> zeroinitializer
   %vrshrn_n2 = tail call <8 x i8> @llvm.aarch64.neon.rshrn.v8i8(<8 x i16> %vecinit7.i, i32 3)
   ret <8 x i8> %vrshrn_n2
 }

 declare <8 x i8> @llvm.aarch64.neon.rshrn.v8i8(<8 x i16>, i32)
+
+declare i64 @llvm.aarch64.neon.urshl.i64(i64, i64)
+
+define <8 x i8> @uaddlv_v8i8_urshr(<8 x i8> %a) {
+; CHECK-LABEL: uaddlv_v8i8_urshr:
+; CHECK: // %bb.0: // %entry
+; CHECK-NEXT: uaddlv h0, v0.8b
+; CHECK-NEXT: urshr d0, d0, #3
+; CHECK-NEXT: dup v0.8b, v0.b[0]
+; CHECK-NEXT: ret
+entry:
+  %vaddlv.i = tail call i32 @llvm.aarch64.neon.uaddlv.i32.v8i8(<8 x i8> %a)
+  %0 = and i32 %vaddlv.i, 65535
+  %conv = zext i32 %0 to i64
+  %vrshr_n = tail call i64 @llvm.aarch64.neon.urshl.i64(i64 %conv, i64 -3)
+  %conv1 = trunc i64 %vrshr_n to i8
+  %vecinit.i = insertelement <8 x i8> undef, i8 %conv1, i64 0
+  %vecinit7.i = shufflevector <8 x i8> %vecinit.i, <8 x i8> poison, <8 x i32> zeroinitializer
+  ret <8 x i8> %vecinit7.i
+}

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Remove copy instruction between uaddlv and urshr
ClosedPublic
