This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Custom Lower MULLH{S,U} for v16i8, v8i16, and v4i32
Closed, Public

Authored by zatrazz on Apr 24 2018, 6:44 AM.


Details

Reviewers

fhahn
javed.absar
huntergr
SjoerdMeijer
t.p.northover
echristo
evandro
rengolin

Commits

rGa57ef17ab643: [AArch64] Custom Lower MULLH{S,U} for v16i8, v8i16, and v4i32
rL331522: [AArch64] Custom Lower MULLH{S,U} for v16i8, v8i16, and v4i32

Summary

This patch adds a custom lowering for ISD::MULH{S,U}, which is used by the
divide-by-constant optimization (DAGCombiner::BuildSDIV and DAGCombiner::BuildUDIV).

New patterns for smull and umull are added, so AArch64ISD::{S,U}MULL
can be correctly lowered to smull2 and umull2.

Diff Detail

Repository: rL LLVM

Event Timeline

zatrazz created this revision. Apr 24 2018, 6:44 AM

Herald added a subscriber: kristof.beyls. Apr 24 2018, 6:44 AM

SjoerdMeijer added inline comments. Apr 27 2018, 1:47 AM

lib/Target/AArch64/AArch64ISelLowering.cpp
2557 ↗	(On Diff #143729)	Do you need to pass the 'sign' boolean? Can you not look at the opcode and check for ISD::MULHS or ISD::MULHU?
2727 ↗	(On Diff #143729)	If you check the opcode inside LowerMULH, this can be simplified a bit, both these cases can fallthrough and call LowerMULH.
test/CodeGen/AArch64/arm64-neon-mul-div-cte.ll
3 ↗	(On Diff #143729)	nit: this is mul16xi8
14 ↗	(On Diff #143729)	nit: this is mul8xi16

zatrazz added inline comments. Apr 30 2018, 5:39 AM

lib/Target/AArch64/AArch64ISelLowering.cpp
2557 ↗	(On Diff #143729)	Yes, you can derive the correct ISD opcode from the SDValue opcode; I will change it.
test/CodeGen/AArch64/arm64-neon-mul-div-cte.ll
3 ↗	(On Diff #143729)	Ack.
14 ↗	(On Diff #143729)	Ack.

Update patch from previous comments.

Sorry, I needed to get up to speed here, including on this Hacker's Delight trick.
I think it would be good to add some comments to LowerMULH explaining what we are trying to do here.

I think I have some more comments, mainly about the tests:

you're not checking the magic number (e.g. in the first test "movi v1.16b, #57"), but I think that is quite relevant here.
you're only using constant vectors with value 9. I think it would be good to vary here.
I think you should also define the operands of the smull2 and smull instructions with regexps, and use them.
nit: some functions still have inconsistent names, e.g. function "umul8xi16" tests "udiv <16 x i8>"
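
For context on the magic number mentioned in the list above: the divide-by-constant transform replaces x / 9 with a multiply-high by a precomputed constant followed by a shift. The following is a minimal scalar sketch of one unsigned 8-bit lane. The constant 57 is taken from the reviewer's example; the unsigned interpretation, the extra shift by 1, and the helper names (mulhu8, udiv9, udiv9MatchesDivision) are assumptions made for illustration and are not LLVM APIs.

```cpp
#include <cstdint>

// Scalar model of ISD::MULHU on one 8-bit lane: the high 8 bits of the
// 16-bit product.
inline uint8_t mulhu8(uint8_t a, uint8_t b) {
  return static_cast<uint8_t>((static_cast<uint16_t>(a) * b) >> 8);
}

// Hypothetical helper: x / 9 via multiply-high. 57 == ceil(2^9 / 9), so
// (x * 57) >> 9 equals x / 9 for every 8-bit x. The shift by 9 is split into
// the implicit >> 8 of mulhu8 and one extra >> 1.
inline uint8_t udiv9(uint8_t x) {
  return static_cast<uint8_t>(mulhu8(x, 57) >> 1);
}

// Exhaustively compare the multiply-high form against built-in division.
inline bool udiv9MatchesDivision() {
  for (unsigned x = 0; x <= 255; ++x)
    if (udiv9(static_cast<uint8_t>(x)) != x / 9)
      return false;
  return true;
}
```

Because the result depends on both the constant and the shift, a test that checks only the multiply instructions would not notice a wrong magic number, which is presumably why the reviewer asks for it to be matched.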

In D46009#1083782, @SjoerdMeijer wrote:

Sorry, I needed to get up to speed here, including on this Hacker's Delight trick.
I think it would be good to add some comments to LowerMULH explaining what we are trying to do here.

Ack.

I think I have some more comments, mainly about the tests:

you're not checking the magic number (e.g. in the first test "movi v1.16b, #57"), but I think that is quite relevant here.

Right, but the change itself is to correctly lower the multiply-high part of the divide-by-constant optimization. I have adjusted the next patch version to match the expected constant as well.

you're only using constant vectors with value 9. I think it would be good to vary here.

Ack.

I think you should also define the operands of the smull2 and smull instructions with regexps, and use them.

Ack.

nit: some functions still have inconsistent names, e.g. function "umul8xi16" tests "udiv <16 x i8>"

Ack, I have changed to div*.

Update patch from previous comments.

Thanks. This looks OK to me.

This revision is now accepted and ready to land. May 3 2018, 1:07 AM

Closed by commit rL331522: [AArch64] Custom Lower MULLH{S,U} for v16i8, v8i16, and v4i32 (authored by azanella). May 4 2018, 7:37 AM

This revision was automatically updated to reflect the committed changes.

efriedma mentioned this in D69831: [AArch64] Re-add patterns for (s/u)mull2. Nov 4 2019, 3:58 PM

efriedma mentioned this in rG35cf9a1fc5d2: [AArch64] Re-add patterns for (s/u)mull2. Nov 6 2019, 12:30 PM

Revision Contents

Path	Size
llvm/trunk/lib/Target/AArch64/AArch64ISelLowering.cpp	72 lines
llvm/trunk/lib/Target/AArch64/AArch64InstrInfo.td	19 lines
llvm/trunk/test/CodeGen/AArch64/neon-idiv.ll	16 lines

Diff 145192

llvm/trunk/lib/Target/AArch64/AArch64ISelLowering.cpp


@@ if (Subtarget->hasNEON()) {
     setOperationAction(ISD::ANY_EXTEND, MVT::v4i32, Legal);
     setTruncStoreAction(MVT::v2i32, MVT::v2i16, Expand);
     // Likewise, narrowing and extending vector loads/stores aren't handled
     // directly.
     for (MVT VT : MVT::vector_valuetypes()) {
       setOperationAction(ISD::SIGN_EXTEND_INREG, VT, Expand);

-      setOperationAction(ISD::MULHS, VT, Expand);
-      setOperationAction(ISD::SMUL_LOHI, VT, Expand);
-      setOperationAction(ISD::MULHU, VT, Expand);
+      if (VT == MVT::v16i8 || VT == MVT::v8i16 || VT == MVT::v4i32) {
+        setOperationAction(ISD::MULHS, VT, Custom);
+        setOperationAction(ISD::MULHU, VT, Custom);
+      } else {
+        setOperationAction(ISD::MULHS, VT, Expand);
+        setOperationAction(ISD::MULHU, VT, Expand);
+      }
+      setOperationAction(ISD::SMUL_LOHI, VT, Expand);
       setOperationAction(ISD::UMUL_LOHI, VT, Expand);

       setOperationAction(ISD::BSWAP, VT, Expand);

       for (MVT InnerVT : MVT::vector_valuetypes()) {
         setTruncStoreAction(VT, InnerVT, Expand);
         setLoadExtAction(ISD::SEXTLOAD, VT, InnerVT, Expand);
         setLoadExtAction(ISD::ZEXTLOAD, VT, InnerVT, Expand);
@@ static SDValue LowerMUL(SDValue Op, SelectionDAG &DAG) {
   EVT Op1VT = Op1.getValueType();
   return DAG.getNode(N0->getOpcode(), DL, VT,
                      DAG.getNode(NewOpc, DL, VT,
                                DAG.getNode(ISD::BITCAST, DL, Op1VT, N00), Op1),
                      DAG.getNode(NewOpc, DL, VT,
                                DAG.getNode(ISD::BITCAST, DL, Op1VT, N01), Op1));
 }

+// Lower vector multiply high (ISD::MULHS and ISD::MULHU).
+static SDValue LowerMULH(SDValue Op, SelectionDAG &DAG) {
+  // Multiplications are only custom-lowered for 128-bit vectors so that
+  // {S,U}MULL{2} can be detected. Otherwise v2i64 multiplications are not
+  // legal.
+  EVT VT = Op.getValueType();
+  assert(VT.is128BitVector() && VT.isInteger() &&
+         "unexpected type for custom-lowering ISD::MULH{U,S}");
+
+  SDValue V0 = Op.getOperand(0);
+  SDValue V1 = Op.getOperand(1);
+
+  SDLoc DL(Op);
+
+  EVT ExtractVT = VT.getHalfNumVectorElementsVT(*DAG.getContext());
+
+  // We turn (V0 mulhs/mulhu V1) to:
+  //
+  // (uzp2 (smull (extract_subvector (ExtractVT V128:V0, (i64 0)),
+  //              (extract_subvector (ExtractVT V128:V1, (i64 0))))),
+  //       (smull (extract_subvector (ExtractVT V128:V0, (i64 VMull2Idx)),
+  //              (extract_subvector (ExtractVT V128:V1, (i64 VMull2Idx))))))
+  //
+  // Where ExtractVT is a subvector with half the number of elements, and
+  // VMull2Idx is the index of the middle element (the high part).
+  //
+  // The high-part extract and multiply will be matched against
+  // {S,U}MULL{v16i8_v8i16,v8i16_v4i32,v4i32_v2i64}, which in turn will
+  // issue a {s,u}mull2 instruction.
+  //
+  // This basically multiplies the lower subvector with '{s,u}mull', the high
+  // subvector with '{s,u}mull2', and shuffles the high parts of both results
+  // into the resulting vector.
+  unsigned Mull2VectorIdx = VT.getVectorNumElements() / 2;
+  SDValue VMullIdx = DAG.getConstant(0, DL, MVT::i64);
+  SDValue VMull2Idx = DAG.getConstant(Mull2VectorIdx, DL, MVT::i64);
+
+  SDValue VMullV0 =
+      DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ExtractVT, V0, VMullIdx);
+  SDValue VMullV1 =
+      DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ExtractVT, V1, VMullIdx);
+
+  SDValue VMull2V0 =
+      DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ExtractVT, V0, VMull2Idx);
+  SDValue VMull2V1 =
+      DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ExtractVT, V1, VMull2Idx);
+
+  unsigned MullOpc = Op.getOpcode() == ISD::MULHS ? AArch64ISD::SMULL
+                                                  : AArch64ISD::UMULL;
+
+  EVT MullVT = ExtractVT.widenIntegerVectorElementType(*DAG.getContext());
+  SDValue Mull = DAG.getNode(MullOpc, DL, MullVT, VMullV0, VMullV1);
+  SDValue Mull2 = DAG.getNode(MullOpc, DL, MullVT, VMull2V0, VMull2V1);
+
+  Mull = DAG.getNode(ISD::BITCAST, DL, VT, Mull);
+  Mull2 = DAG.getNode(ISD::BITCAST, DL, VT, Mull2);
+
+  return DAG.getNode(AArch64ISD::UZP2, DL, VT, Mull, Mull2);
+}

 SDValue AArch64TargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
                                                        SelectionDAG &DAG) const {
   unsigned IntNo = cast<ConstantSDNode>(Op.getOperand(0))->getZExtValue();
   SDLoc dl(Op);
   switch (IntNo) {
   default: return SDValue(); // Don't custom lower most intrinsics.
   case Intrinsic::thread_pointer: {
     EVT PtrVT = getPointerTy(DAG.getDataLayout());
@@ case ISD::UINT_TO_FP:
     return LowerINT_TO_FP(Op, DAG);
   case ISD::FP_TO_SINT:
   case ISD::FP_TO_UINT:
     return LowerFP_TO_INT(Op, DAG);
   case ISD::FSINCOS:
     return LowerFSINCOS(Op, DAG);
   case ISD::MUL:
     return LowerMUL(Op, DAG);
+  case ISD::MULHS:
+  case ISD::MULHU:
+    return LowerMULH(Op, DAG);
   case ISD::INTRINSIC_WO_CHAIN:
     return LowerINTRINSIC_WO_CHAIN(Op, DAG);
   case ISD::VECREDUCE_ADD:
   case ISD::VECREDUCE_SMAX:
   case ISD::VECREDUCE_SMIN:
   case ISD::VECREDUCE_UMAX:
   case ISD::VECREDUCE_UMIN:
   case ISD::VECREDUCE_FMAX:
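
The data flow that LowerMULH builds can be modelled in scalar code. The following is a rough sketch for the signed v4i32 case only, assuming little-endian lane order for the bitcast and an arithmetic right shift for negative values (guaranteed since C++20). All type and function names here are hypothetical stand-ins for the DAG nodes, not LLVM or NEON APIs.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using V4i32 = std::array<int32_t, 4>;
using V2i64 = std::array<int64_t, 2>;

// Models smull: widening multiply of the low two lanes.
inline V2i64 smullLow(const V4i32 &a, const V4i32 &b) {
  return {int64_t{a[0]} * b[0], int64_t{a[1]} * b[1]};
}

// Models smull2: widening multiply of the high two lanes.
inline V2i64 smullHigh(const V4i32 &a, const V4i32 &b) {
  return {int64_t{a[2]} * b[2], int64_t{a[3]} * b[3]};
}

// Models the v2i64 -> v4i32 bitcast (little-endian lanes assumed): each
// 64-bit product becomes the pair {low half, high half}.
inline V4i32 bitcastToV4i32(const V2i64 &v) {
  V4i32 r{};
  for (std::size_t i = 0; i < 2; ++i) {
    const uint64_t bits = static_cast<uint64_t>(v[i]);
    r[2 * i] = static_cast<int32_t>(static_cast<uint32_t>(bits));
    r[2 * i + 1] = static_cast<int32_t>(static_cast<uint32_t>(bits >> 32));
  }
  return r;
}

// Models uzp2: the odd-numbered lanes of the first operand, then of the second.
inline V4i32 uzp2(const V4i32 &n, const V4i32 &m) {
  return {n[1], n[3], m[1], m[3]};
}

// The lowered sequence: uzp2(bitcast(smull), bitcast(smull2)).
inline V4i32 mulhsLowered(const V4i32 &a, const V4i32 &b) {
  return uzp2(bitcastToV4i32(smullLow(a, b)), bitcastToV4i32(smullHigh(a, b)));
}

// Reference: per-lane high 32 bits of the 64-bit signed product.
inline V4i32 mulhsReference(const V4i32 &a, const V4i32 &b) {
  V4i32 r{};
  for (std::size_t i = 0; i < 4; ++i)
    r[i] = static_cast<int32_t>((int64_t{a[i]} * b[i]) >> 32);
  return r;
}
```

After the bitcast, the high half of every wide product sits in an odd-numbered lane, so a single uzp2 gathers all four high halves in the original lane order.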

llvm/trunk/lib/Target/AArch64/AArch64InstrInfo.td


@@ def : Pat<(v2i64 (opnode (v2i32 V64:$Rn), (v2i32 V64:$Rm))),
             (INST2S V64:$Rn, V64:$Rm)>;
 }

 defm : Neon_mul_widen_patterns<AArch64smull, SMULLv8i8_v8i16,
           SMULLv4i16_v4i32, SMULLv2i32_v2i64>;
 defm : Neon_mul_widen_patterns<AArch64umull, UMULLv8i8_v8i16,
           UMULLv4i16_v4i32, UMULLv2i32_v2i64>;

+// Patterns for smull2/umull2.
+multiclass Neon_mul_high_patterns<SDPatternOperator opnode,
+    Instruction INST8B, Instruction INST4H, Instruction INST2S> {
+  def : Pat<(v8i16 (opnode (extract_high_v16i8 V128:$Rn),
+                           (extract_high_v16i8 V128:$Rm))),
+            (INST8B V128:$Rn, V128:$Rm)>;
+  def : Pat<(v4i32 (opnode (extract_high_v8i16 V128:$Rn),
+                           (extract_high_v8i16 V128:$Rm))),
+            (INST4H V128:$Rn, V128:$Rm)>;
+  def : Pat<(v2i64 (opnode (extract_high_v4i32 V128:$Rn),
+                           (extract_high_v4i32 V128:$Rm))),
+            (INST2S V128:$Rn, V128:$Rm)>;
+}
+
+defm : Neon_mul_high_patterns<AArch64smull, SMULLv16i8_v8i16,
+          SMULLv8i16_v4i32, SMULLv4i32_v2i64>;
+defm : Neon_mul_high_patterns<AArch64umull, UMULLv16i8_v8i16,
+          UMULLv8i16_v4i32, UMULLv4i32_v2i64>;
+
 // Additional patterns for SMLAL/SMLSL and UMLAL/UMLSL
 multiclass Neon_mulacc_widen_patterns<SDPatternOperator opnode,
     Instruction INST8B, Instruction INST4H, Instruction INST2S> {
   def : Pat<(v8i16 (opnode (v8i16 V128:$Rd), (v8i8 V64:$Rn), (v8i8 V64:$Rm))),
             (INST8B V128:$Rd, V64:$Rn, V64:$Rm)>;
   def : Pat<(v4i32 (opnode (v4i32 V128:$Rd), (v4i16 V64:$Rn), (v4i16 V64:$Rm))),
             (INST4H V128:$Rd, V64:$Rn, V64:$Rm)>;
   def : Pat<(v2i64 (opnode (v2i64 V128:$Rd), (v2i32 V64:$Rn), (v2i32 V64:$Rm))),

llvm/trunk/test/CodeGen/AArch64/neon-idiv.ll

 ; RUN: llc -mtriple=aarch64-none-linux-gnu < %s -mattr=+neon | FileCheck %s

 define <4 x i32> @test1(<4 x i32> %a) {
   %rem = srem <4 x i32> %a, <i32 7, i32 7, i32 7, i32 7>
   ret <4 x i32> %rem
-; CHECK-LABEL: test1
-; FIXME: Can we lower this more efficiently?
-; CHECK: mul
-; CHECK: mul
-; CHECK: mul
-; CHECK: mul
+; For a constant C, X%C is simplified to X-X/C*C. The X/C division is lowered
+; to MULHS because the division is rewritten as a multiplication by a magic
+; number (TargetLowering::BuildSDIV).
+; CHECK-LABEL: test1:
+; CHECK: smull2 [[SMULL2:(v[0-9]+)]].2d, {{v[0-9]+}}.4s, {{v[0-9]+}}.4s
+; CHECK: smull [[SMULL:(v[0-9]+)]].2d, {{v[0-9]+}}.2s, {{v[0-9]+}}.2s
+; CHECK: uzp2 [[UZP2:(v[0-9]+).4s]], [[SMULL]].4s, [[SMULL2]].4s
+; CHECK: add [[ADD:(v[0-9]+.4s)]], [[UZP2]], v0.4s
+; CHECK: sshr [[SSHR:(v[0-9]+.4s)]], [[ADD]], #2
 }
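
The add and sshr #2 that the test matches after uzp2 are the fix-up steps of the signed divide-by-7 sequence. The following is a minimal scalar sketch of one 32-bit lane, assuming the commonly cited magic constant for signed 32-bit division by 7 (0x92492493 with a post-shift of 2) and an arithmetic right shift for negative values. The constant is not shown in the diff, and the helper names are hypothetical.

```cpp
#include <cstdint>

// Scalar model of ISD::MULHS on one 32-bit lane.
inline int32_t mulhs32(int32_t a, int32_t b) {
  return static_cast<int32_t>((int64_t{a} * b) >> 32);
}

// Hypothetical helper: x / 7 (truncating) via multiply-high.
inline int32_t sdiv7(int32_t x) {
  const int32_t magic = -1840700269; // 0x92492493 reinterpreted as signed
  // Multiply-high, then add the dividend because the magic value is negative.
  int32_t q = mulhs32(x, magic) + x;
  q >>= 2; // post-shift (the "sshr #2" in the test)
  // Add the sign bit so that negative quotients round toward zero.
  q += static_cast<int32_t>(static_cast<uint32_t>(q) >> 31);
  return q;
}

// X % 7 computed as X - (X / 7) * 7, as the test comment describes.
inline int32_t srem7(int32_t x) { return x - sdiv7(x) * 7; }

// Compare against the built-in operators over an inclusive range.
inline bool srem7MatchesBuiltin(int64_t lo, int64_t hi) {
  for (int64_t v = lo; v <= hi; ++v) {
    const int32_t x = static_cast<int32_t>(v);
    if (sdiv7(x) != x / 7 || srem7(x) != x % 7)
      return false;
  }
  return true;
}
```

Only the multiply-high step of this sequence is affected by the patch; the surrounding add, shift, and final multiply-subtract come from the generic divide-by-constant expansion.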
