I don't think this is right. The high bits do in fact have to be zero; otherwise they affect the result of the uaddlv.
If you add a pattern to optimize "insertelement <2 x i32> zeroinitializer, i32 %x, i32 0", you should be able to leverage that for ctpop lowering.
llvm/test/CodeGen/AArch64/arm64-popcnt.ll:43

> This doesn't appear to be equivalent.
llvm/test/CodeGen/AArch64/arm64-popcnt.ll:43

Thanks. I find that the `%4:fpr32 = COPY %1.ssub:fpr128` is eliminated by the SIMPLE REGISTER COALESCING pass with this change, but I'm not sure whether that elimination is correct?

Before coalescing:

```
# *** IR Dump After Live Interval Analysis (liveintervals) ***:
# Machine code for function cnt32_advsimd_1: NoPHIs, TracksLiveness
Function Live Ins: $d0 in %0

0B    bb.0 (%ir-block.0):
        liveins: $d0
16B     %0:fpr64 = COPY $d0
32B     undef %1.dsub:fpr128 = COPY %0:fpr64
48B     %4:fpr32 = COPY %1.ssub:fpr128
64B     %5:fpr64 = SUBREG_TO_REG 0, %4:fpr32, %subreg.ssub
80B     %6:fpr64 = CNTv8i8 %5:fpr64
96B     %7:fpr16 = UADDLVv8i8v %6:fpr64
112B    undef %8.hsub:fpr128 = COPY %7:fpr16
128B    %10:gpr32all = COPY %8.ssub:fpr128
144B    $w0 = COPY %10:gpr32all
160B    RET_ReallyLR implicit killed $w0
```

After coalescing:

```
Function Live Ins: $d0 in %0

0B    bb.0 (%ir-block.0):
        liveins: $d0
16B     undef %1.dsub:fpr128 = COPY $d0
80B     %6:fpr64 = CNTv8i8 %1.dsub:fpr128
96B     undef %8.hsub:fpr128 = UADDLVv8i8v %6:fpr64
128B    %10:gpr32all = COPY %8.ssub:fpr128
144B    $w0 = COPY %10:gpr32all
160B    RET_ReallyLR implicit killed $w0
```
Could it just use a FMOVWSr?
```
def : Pat<(v8i8 (bitconvert (i64 (zext GPR32:$Rn)))),
          (SUBREG_TO_REG (i32 0), (f32 (FMOVWSr GPR32:$Rn)), ssub)>;
```
That way we know the top bits will be zero from the FMOVWSr, and so the SUBREG_TO_REG will correctly assert the top bits are zero.
Updated the COPY_TO_REGCLASS to an FMOVWSr to avoid the elimination in the SIMPLE REGISTER COALESCING pass.