This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Use neon instructions for i64/i128 ISD::PARITY calculation
ClosedPublic

Authored by RKSimon on Jul 21 2022, 4:04 AM.

Details

Reviewers

dmgreen
efriedma
fhahn
david-arm
deadalnix

Commits

rG939cf9b1bea4: [AArch64] Use neon instructions for i64/i128 ISD::PARITY calculation

Summary

As noticed on D129765 and reported on Issue #56531 - aarch64 targets can use the neon ctpop + add-reduce instructions to speed up scalar ctpop instructions, but we fail to do this for parity calculations.

I'm not sure where the cutoff should be, but i64 (+ i128 special case) shows a definite reduction in instruction count. i32 is about the same (not sure if scalar <-> neon transfers are particularly costly?), and sub-i32 promotion looks to be a definite regression compared to parity expansion optimized for those widths.
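
To make the two strategies in the summary concrete, here is a minimal scalar C++ sketch; it is not code from the patch. `parity_xor_fold` mirrors the shift-and-XOR expansion used by the previous i64 codegen, while `parity_via_popcount` takes the low bit of a population count, which is the value the neon ctpop + add-reduce sequence yields after masking. All function names are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Parity by XOR folding: each step halves the number of bits that still
// matter, so after six steps bit 0 holds the XOR of all 64 input bits.
inline std::uint64_t parity_xor_fold(std::uint64_t x) {
  x ^= x >> 32;
  x ^= x >> 16;
  x ^= x >> 8;
  x ^= x >> 4;
  x ^= x >> 2;
  x ^= x >> 1;
  return x & 1;
}

// Portable population count (Kernighan's loop): clears the lowest set bit
// on every iteration and counts the iterations.
inline unsigned popcount64(std::uint64_t x) {
  unsigned n = 0;
  while (x != 0) {
    x &= x - 1;
    ++n;
  }
  return n;
}

// Parity as the low bit of the population count.
inline std::uint64_t parity_via_popcount(std::uint64_t x) {
  return popcount64(x) & 1u;
}

// Hypothetical self-check helper: both formulations must agree.
inline void check_parity_agreement(std::uint64_t x) {
  assert(parity_xor_fold(x) == parity_via_popcount(x));
}
```

Both functions return 1 when the input has an odd number of set bits and 0 otherwise; the difference the review is concerned with is only which instruction sequence each form maps to.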

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

RKSimon created this revision. Jul 21 2022, 4:04 AM

Herald added a project: Restricted Project. Jul 21 2022, 4:04 AM

Herald added subscribers: hiraditya, kristof.beyls.

RKSimon requested review of this revision. Jul 21 2022, 4:04 AM

Harbormaster completed remote builds in B176711: Diff 446422. Jul 21 2022, 4:45 AM

RKSimon edited the summary of this revision. Jul 22 2022, 7:07 AM

RKSimon mentioned this in D129765: [DAG] SimplifyDemandedBits - don't early-out for multiple use values. Jul 22 2022, 7:19 AM

It took me a bit of grave digging to figure out the motivation behind PARITY, but this seems to do the right thing.

That being said, shouldn't this be the default strategy for lowering PARITY, rather than a special case for AArch64?

This revision is now accepted and ready to land. Jul 22 2022, 8:01 AM

I've been trying to add up latencies to see which of the two sequences is better. I think you are right about the i32 case: it is better to avoid the FPR register moves.

The code changes look good to me. I was just not sure which is better for i64: the chain of eor's, or moving to float regs to use a cnt. It will depend on the CPU, but an eor is either a quick 1-cycle instruction, which is hard to beat with neon instructions, or it is a 2-cycle instruction, and the cnt, addv and fmov's will have longer latencies.

I ended up having to get a simulator out to measure the differences. Whilst it is slower on some CPUs, it seems to be quicker in more cases and by more of a margin. So it looks OK to me.
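
The "adding up latencies" step described above can be sketched as a simple dependency-chain model. This is only an illustration under stated assumptions: the `Latencies` struct and both function names are hypothetical, the instruction counts are taken from the `parity_64` CHECK lines in this diff (six dependent `eor`s plus an `and` before the change; `fmov`, `cnt`, `uaddlv`, `fmov`, `and` after it), and the per-instruction cycle counts must be supplied by the caller from the optimization guide of the CPU in question. The model ignores throughput, issue width and surrounding code, which is presumably why a simulator was needed for the real comparison.

```cpp
#include <cstdint>

// Caller-supplied per-instruction latencies in cycles (hypothetical struct;
// the values are CPU-specific and are NOT provided by this review).
struct Latencies {
  unsigned eor;          // scalar eor with shifted operand
  unsigned and_imm;      // scalar and with immediate
  unsigned fmov_to_fpr;  // fmov d0, x0
  unsigned cnt;          // cnt v0.8b, v0.8b
  unsigned uaddlv;       // uaddlv h0, v0.8b
  unsigned fmov_to_gpr;  // fmov w8, s0
};

// Old i64 sequence: six eor's, each depending on the previous one,
// followed by a final and.
constexpr unsigned scalar_parity64_latency(const Latencies &l) {
  return 6 * l.eor + l.and_imm;
}

// New i64 sequence: fmov -> cnt -> uaddlv -> fmov -> and, fully serial.
constexpr unsigned neon_parity64_latency(const Latencies &l) {
  return l.fmov_to_fpr + l.cnt + l.uaddlv + l.fmov_to_gpr + l.and_imm;
}
```

With a 1-cycle `eor` the scalar chain is short and hard to beat, while a slower `eor` makes the six-deep chain proportionally more expensive, which matches the trade-off described in the comment.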

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7790	Formatting.

Thanks!

This revision was landed with ongoing or failed builds. Jul 22 2022, 9:36 AM

Closed by commit rG939cf9b1bea4: [AArch64] Use neon instructions for i64/i128 ISD::PARITY calculation (authored by RKSimon).

This revision was automatically updated to reflect the committed changes.

RKSimon added a commit: rG939cf9b1bea4: [AArch64] Use neon instructions for i64/i128 ISD::PARITY calculation.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.h	2 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	23 lines
llvm/test/CodeGen/AArch64/parity.ll	36 lines

Diff 446422

llvm/lib/Target/AArch64/AArch64ISelLowering.h

[...]	private:
SDValue LowerVECTOR_SPLICE(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerVECTOR_SPLICE(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerEXTRACT_SUBVECTOR(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerEXTRACT_SUBVECTOR(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerINSERT_SUBVECTOR(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerINSERT_SUBVECTOR(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerDIV(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerDIV(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerMUL(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerMUL(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerVectorSRA_SRL_SHL(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerVectorSRA_SRL_SHL(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerShiftParts(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerShiftParts(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerVSETCC(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerVSETCC(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerCTPOP(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerCTPOP_PARITY(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerCTTZ(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerCTTZ(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerBitreverse(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerBitreverse(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerMinMax(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerMinMax(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerFCOPYSIGN(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerFCOPYSIGN(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerFP_EXTEND(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerFP_EXTEND(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerFP_ROUND(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerVectorFP_TO_INT(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerVectorFP_TO_INT(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerVectorFP_TO_INT_SAT(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerVectorFP_TO_INT_SAT(SDValue Op, SelectionDAG &DAG) const;
[...]

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[...]	AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM,
// AArch64 doesn't have {U|S}MUL_LOHI.		// AArch64 doesn't have {U|S}MUL_LOHI.
setOperationAction(ISD::UMUL_LOHI, MVT::i64, Expand);		setOperationAction(ISD::UMUL_LOHI, MVT::i64, Expand);
setOperationAction(ISD::SMUL_LOHI, MVT::i64, Expand);		setOperationAction(ISD::SMUL_LOHI, MVT::i64, Expand);

setOperationAction(ISD::CTPOP, MVT::i32, Custom);		setOperationAction(ISD::CTPOP, MVT::i32, Custom);
setOperationAction(ISD::CTPOP, MVT::i64, Custom);		setOperationAction(ISD::CTPOP, MVT::i64, Custom);
setOperationAction(ISD::CTPOP, MVT::i128, Custom);		setOperationAction(ISD::CTPOP, MVT::i128, Custom);

		setOperationAction(ISD::PARITY, MVT::i64, Custom);
		setOperationAction(ISD::PARITY, MVT::i128, Custom);

setOperationAction(ISD::ABS, MVT::i32, Custom);		setOperationAction(ISD::ABS, MVT::i32, Custom);
setOperationAction(ISD::ABS, MVT::i64, Custom);		setOperationAction(ISD::ABS, MVT::i64, Custom);

setOperationAction(ISD::SDIVREM, MVT::i32, Expand);		setOperationAction(ISD::SDIVREM, MVT::i32, Expand);
setOperationAction(ISD::SDIVREM, MVT::i64, Expand);		setOperationAction(ISD::SDIVREM, MVT::i64, Expand);
for (MVT VT : MVT::fixedlen_vector_valuetypes()) {		for (MVT VT : MVT::fixedlen_vector_valuetypes()) {
setOperationAction(ISD::SDIVREM, VT, Expand);		setOperationAction(ISD::SDIVREM, VT, Expand);
setOperationAction(ISD::UDIVREM, VT, Expand);		setOperationAction(ISD::UDIVREM, VT, Expand);
[...]	SDValue AArch64TargetLowering::LowerOperation(SDValue Op,
case ISD::SRL:		case ISD::SRL:
case ISD::SHL:		case ISD::SHL:
return LowerVectorSRA_SRL_SHL(Op, DAG);		return LowerVectorSRA_SRL_SHL(Op, DAG);
case ISD::SHL_PARTS:		case ISD::SHL_PARTS:
case ISD::SRL_PARTS:		case ISD::SRL_PARTS:
case ISD::SRA_PARTS:		case ISD::SRA_PARTS:
return LowerShiftParts(Op, DAG);		return LowerShiftParts(Op, DAG);
case ISD::CTPOP:		case ISD::CTPOP:
return LowerCTPOP(Op, DAG);		case ISD::PARITY:
		return LowerCTPOP_PARITY(Op, DAG);
case ISD::FCOPYSIGN:		case ISD::FCOPYSIGN:
return LowerFCOPYSIGN(Op, DAG);		return LowerFCOPYSIGN(Op, DAG);
case ISD::OR:		case ISD::OR:
return LowerVectorOR(Op, DAG);		return LowerVectorOR(Op, DAG);
case ISD::XOR:		case ISD::XOR:
return LowerXOR(Op, DAG);		return LowerXOR(Op, DAG);
case ISD::PREFETCH:		case ISD::PREFETCH:
return LowerPREFETCH(Op, DAG);		return LowerPREFETCH(Op, DAG);
[...]	SDValue AArch64TargetLowering::LowerFCOPYSIGN(SDValue Op,
if (VT == MVT::f32)		if (VT == MVT::f32)
return DAG.getTargetExtractSubreg(AArch64::ssub, DL, VT, BSP);		return DAG.getTargetExtractSubreg(AArch64::ssub, DL, VT, BSP);
if (VT == MVT::f64)		if (VT == MVT::f64)
return DAG.getTargetExtractSubreg(AArch64::dsub, DL, VT, BSP);		return DAG.getTargetExtractSubreg(AArch64::dsub, DL, VT, BSP);

return BitCast(VT, BSP, DAG);		return BitCast(VT, BSP, DAG);
}		}

SDValue AArch64TargetLowering::LowerCTPOP(SDValue Op, SelectionDAG &DAG) const {		SDValue AArch64TargetLowering::LowerCTPOP_PARITY(SDValue Op, SelectionDAG &DAG) const {
		dmgreen (inline comment): Formatting.
if (DAG.getMachineFunction().getFunction().hasFnAttribute(		if (DAG.getMachineFunction().getFunction().hasFnAttribute(
Attribute::NoImplicitFloat))		Attribute::NoImplicitFloat))
return SDValue();		return SDValue();

if (!Subtarget->hasNEON())		if (!Subtarget->hasNEON())
return SDValue();		return SDValue();

		bool IsParity = Op.getOpcode() == ISD::PARITY;

// While there is no integer popcount instruction, it can		// While there is no integer popcount instruction, it can
// be more efficiently lowered to the following sequence that uses		// be more efficiently lowered to the following sequence that uses
// AdvSIMD registers/instructions as long as the copies to/from		// AdvSIMD registers/instructions as long as the copies to/from
// the AdvSIMD registers are cheap.		// the AdvSIMD registers are cheap.
// FMOV D0, X0 // copy 64-bit int to vector, high bits zero'd		// FMOV D0, X0 // copy 64-bit int to vector, high bits zero'd
// CNT V0.8B, V0.8B // 8xbyte pop-counts		// CNT V0.8B, V0.8B // 8xbyte pop-counts
// ADDV B0, V0.8B // sum 8xbyte pop-counts		// ADDV B0, V0.8B // sum 8xbyte pop-counts
// UMOV X0, V0.B[0] // copy byte result back to integer reg		// UMOV X0, V0.B[0] // copy byte result back to integer reg
SDValue Val = Op.getOperand(0);		SDValue Val = Op.getOperand(0);
SDLoc DL(Op);		SDLoc DL(Op);
EVT VT = Op.getValueType();		EVT VT = Op.getValueType();

if (VT == MVT::i32 || VT == MVT::i64) {		if (VT == MVT::i32 || VT == MVT::i64) {
if (VT == MVT::i32)		if (VT == MVT::i32)
Val = DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::i64, Val);		Val = DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::i64, Val);
Val = DAG.getNode(ISD::BITCAST, DL, MVT::v8i8, Val);		Val = DAG.getNode(ISD::BITCAST, DL, MVT::v8i8, Val);

SDValue CtPop = DAG.getNode(ISD::CTPOP, DL, MVT::v8i8, Val);		SDValue CtPop = DAG.getNode(ISD::CTPOP, DL, MVT::v8i8, Val);
SDValue UaddLV = DAG.getNode(		SDValue UaddLV = DAG.getNode(
ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32,		ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32,
DAG.getConstant(Intrinsic::aarch64_neon_uaddlv, DL, MVT::i32), CtPop);		DAG.getConstant(Intrinsic::aarch64_neon_uaddlv, DL, MVT::i32), CtPop);

		if (IsParity)
		UaddLV = DAG.getNode(ISD::AND, DL, MVT::i32, UaddLV,
		DAG.getConstant(1, DL, MVT::i32));

if (VT == MVT::i64)		if (VT == MVT::i64)
UaddLV = DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::i64, UaddLV);		UaddLV = DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::i64, UaddLV);
return UaddLV;		return UaddLV;
} else if (VT == MVT::i128) {		} else if (VT == MVT::i128) {
Val = DAG.getNode(ISD::BITCAST, DL, MVT::v16i8, Val);		Val = DAG.getNode(ISD::BITCAST, DL, MVT::v16i8, Val);

SDValue CtPop = DAG.getNode(ISD::CTPOP, DL, MVT::v16i8, Val);		SDValue CtPop = DAG.getNode(ISD::CTPOP, DL, MVT::v16i8, Val);
SDValue UaddLV = DAG.getNode(		SDValue UaddLV = DAG.getNode(
ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32,		ISD::INTRINSIC_WO_CHAIN, DL, MVT::i32,
DAG.getConstant(Intrinsic::aarch64_neon_uaddlv, DL, MVT::i32), CtPop);		DAG.getConstant(Intrinsic::aarch64_neon_uaddlv, DL, MVT::i32), CtPop);

		if (IsParity)
		UaddLV = DAG.getNode(ISD::AND, DL, MVT::i32, UaddLV,
		DAG.getConstant(1, DL, MVT::i32));

return DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::i128, UaddLV);		return DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::i128, UaddLV);
}		}

		assert(!IsParity && "ISD::PARITY of vector types not supported");

if (VT.isScalableVector() || useSVEForFixedLengthVectorVT(VT))		if (VT.isScalableVector() || useSVEForFixedLengthVectorVT(VT))
return LowerToPredicatedOp(Op, DAG, AArch64ISD::CTPOP_MERGE_PASSTHRU);		return LowerToPredicatedOp(Op, DAG, AArch64ISD::CTPOP_MERGE_PASSTHRU);

assert((VT == MVT::v1i64 || VT == MVT::v2i64 || VT == MVT::v2i32 ||		assert((VT == MVT::v1i64 || VT == MVT::v2i64 || VT == MVT::v2i32 ||
VT == MVT::v4i32 || VT == MVT::v4i16 || VT == MVT::v8i16) &&		VT == MVT::v4i32 || VT == MVT::v4i16 || VT == MVT::v8i16) &&
"Unexpected type for custom ctpop lowering");		"Unexpected type for custom ctpop lowering");

EVT VT8Bit = VT.is64BitVector() ? MVT::v8i8 : MVT::v16i8;		EVT VT8Bit = VT.is64BitVector() ? MVT::v8i8 : MVT::v16i8;
[...]	case ISD::VECREDUCE_UMIN:
Results.push_back(LowerVECREDUCE(SDValue(N, 0), DAG));		Results.push_back(LowerVECREDUCE(SDValue(N, 0), DAG));
return;		return;
case ISD::ADD:		case ISD::ADD:
case ISD::FADD:		case ISD::FADD:
ReplaceAddWithADDP(N, Results, DAG, Subtarget);		ReplaceAddWithADDP(N, Results, DAG, Subtarget);
return;		return;

case ISD::CTPOP:		case ISD::CTPOP:
if (SDValue Result = LowerCTPOP(SDValue(N, 0), DAG))		case ISD::PARITY:
		if (SDValue Result = LowerCTPOP_PARITY(SDValue(N, 0), DAG))
Results.push_back(Result);		Results.push_back(Result);
return;		return;
case AArch64ISD::SADDV:		case AArch64ISD::SADDV:
ReplaceReductionResults(N, Results, DAG, ISD::ADD, AArch64ISD::SADDV);		ReplaceReductionResults(N, Results, DAG, ISD::ADD, AArch64ISD::SADDV);
return;		return;
case AArch64ISD::UADDV:		case AArch64ISD::UADDV:
ReplaceReductionResults(N, Results, DAG, ISD::ADD, AArch64ISD::UADDV);		ReplaceReductionResults(N, Results, DAG, ISD::ADD, AArch64ISD::UADDV);
return;		return;
[...]
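
As a reading aid for the scalar paths of `LowerCTPOP_PARITY` above, the following standalone C++ sketch models what the emitted nodes compute: a per-byte population count (the v8i8 / v16i8 `ISD::CTPOP`), an unsigned sum across the byte lanes (`aarch64_neon_uaddlv`), and, for parity only, an `AND` with 1 applied to the 32-bit sum before the final zero-extension. It is a software model, not LLVM code; every function name in it is hypothetical, and the i128 input is represented as two 64-bit halves purely for portability.

```cpp
#include <cstdint>

// Models one byte lane of CNT: population count of a single byte.
inline unsigned cnt_byte(std::uint8_t b) {
  unsigned n = 0;
  for (; b != 0; b = static_cast<std::uint8_t>(b & (b - 1)))
    ++n;
  return n;
}

// Models CNT on eight byte lanes followed by UADDLV: the sum of the
// eight per-byte counts, at most 64.
inline std::uint32_t ctpop64_model(std::uint64_t x) {
  std::uint32_t sum = 0;
  for (int lane = 0; lane < 8; ++lane)
    sum += cnt_byte(static_cast<std::uint8_t>(x >> (8 * lane)));
  return sum;
}

// i64 parity path: mask the 32-bit sum with 1, then zero-extend to 64 bits.
inline std::uint64_t parity64_model(std::uint64_t x) {
  return static_cast<std::uint64_t>(ctpop64_model(x) & 1u);
}

// i128 parity path: sixteen byte lanes (here two 64-bit halves), summed
// to at most 128, then masked with 1. Only the low 64 bits of the
// zero-extended result are returned; the high half is always zero.
inline std::uint64_t parity128_model(std::uint64_t lo, std::uint64_t hi) {
  std::uint32_t sum = ctpop64_model(lo) + ctpop64_model(hi);
  return static_cast<std::uint64_t>(sum & 1u);
}
```

Masking before the zero-extension matches the order used in the diff, where the `ISD::AND` is created on the `MVT::i32` value; the high result register being known zero corresponds to the `mov x1, xzr` in the `parity_128` test.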

llvm/test/CodeGen/AArch64/parity.ll

[...]	; CHECK-NEXT: ret
%1 = tail call i32 @llvm.ctpop.i32(i32 %x)		%1 = tail call i32 @llvm.ctpop.i32(i32 %x)
%2 = and i32 %1, 1		%2 = and i32 %1, 1
ret i32 %2		ret i32 %2
}		}

define i64 @parity_64(i64 %x) {		define i64 @parity_64(i64 %x) {
; CHECK-LABEL: parity_64:		; CHECK-LABEL: parity_64:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: eor x8, x0, x0, lsr #32		; CHECK-NEXT: fmov d0, x0
; CHECK-NEXT: eor x8, x8, x8, lsr #16		; CHECK-NEXT: cnt v0.8b, v0.8b
; CHECK-NEXT: eor x8, x8, x8, lsr #8		; CHECK-NEXT: uaddlv h0, v0.8b
; CHECK-NEXT: eor x8, x8, x8, lsr #4		; CHECK-NEXT: fmov w8, s0
; CHECK-NEXT: eor x8, x8, x8, lsr #2		; CHECK-NEXT: and w0, w8, #0x1
; CHECK-NEXT: eor w8, w8, w8, lsr #1
; CHECK-NEXT: and x0, x8, #0x1
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%1 = tail call i64 @llvm.ctpop.i64(i64 %x)		%1 = tail call i64 @llvm.ctpop.i64(i64 %x)
%2 = and i64 %1, 1		%2 = and i64 %1, 1
ret i64 %2		ret i64 %2
}		}

define i128 @parity_128(i128 %x) {		define i128 @parity_128(i128 %x) {
; CHECK-LABEL: parity_128:		; CHECK-LABEL: parity_128:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: eor x8, x0, x1		; CHECK-NEXT: fmov d0, x0
		; CHECK-NEXT: mov v0.d[1], x1
; CHECK-NEXT: mov x1, xzr		; CHECK-NEXT: mov x1, xzr
; CHECK-NEXT: eor x8, x8, x8, lsr #32		; CHECK-NEXT: cnt v0.16b, v0.16b
; CHECK-NEXT: eor x8, x8, x8, lsr #16		; CHECK-NEXT: uaddlv h0, v0.16b
; CHECK-NEXT: eor x8, x8, x8, lsr #8		; CHECK-NEXT: fmov w8, s0
; CHECK-NEXT: eor x8, x8, x8, lsr #4		; CHECK-NEXT: and w0, w8, #0x1
; CHECK-NEXT: eor x8, x8, x8, lsr #2
; CHECK-NEXT: eor w8, w8, w8, lsr #1
; CHECK-NEXT: and x0, x8, #0x1
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%1 = tail call i128 @llvm.ctpop.i128(i128 %x)		%1 = tail call i128 @llvm.ctpop.i128(i128 %x)
%2 = and i128 %1, 1		%2 = and i128 %1, 1
ret i128 %2		ret i128 %2
}		}

define i32 @parity_64_trunc(i64 %x) {		define i32 @parity_64_trunc(i64 %x) {
; CHECK-LABEL: parity_64_trunc:		; CHECK-LABEL: parity_64_trunc:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: eor x8, x0, x0, lsr #32		; CHECK-NEXT: fmov d0, x0
; CHECK-NEXT: eor x8, x8, x8, lsr #16		; CHECK-NEXT: cnt v0.8b, v0.8b
; CHECK-NEXT: eor x8, x8, x8, lsr #8		; CHECK-NEXT: uaddlv h0, v0.8b
; CHECK-NEXT: eor x8, x8, x8, lsr #4		; CHECK-NEXT: fmov w8, s0
; CHECK-NEXT: eor x8, x8, x8, lsr #2
; CHECK-NEXT: eor w8, w8, w8, lsr #1
; CHECK-NEXT: and w0, w8, #0x1		; CHECK-NEXT: and w0, w8, #0x1
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%1 = tail call i64 @llvm.ctpop.i64(i64 %x)		%1 = tail call i64 @llvm.ctpop.i64(i64 %x)
%2 = trunc i64 %1 to i32		%2 = trunc i64 %1 to i32
%3 = and i32 %2, 1		%3 = and i32 %2, 1
ret i32 %3		ret i32 %3
}		}

[...]