This is an archive of the discontinued LLVM Phabricator instance.

Differential D100487

[AArch64][SVE] Lower MULHU/MULHS nodes to umulh/smulh instructions
ClosedPublic

Authored by bsmith on Apr 14 2021, 8:28 AM.

Details

Reviewers

joechrisellis
peterwaller-arm
paulwalker-arm
SanderSpies
efriedma

Commits

rGb8b075d8d744: [AArch64][SVE] Lower MULHU/MULHS nodes to umulh/smulh instructions

Summary

Mark MULHS/MULHU nodes as legal for both scalable and fixed SVE types,
and lower them to the appropriate SVE instructions.

Additionally, now that the MULH nodes are legal, integer divides can be
expanded into a more performant code sequence.
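
As a rough illustration of the divide expansion the summary refers to, here is a scalar C++ sketch of an unsigned divide by the constant 3 rewritten as a multiply-high followed by a shift. The helper names `mulhu32`, `udiv3` and `check_udiv3` are hypothetical and do not appear in the patch; the assumption is that the vector expansion applies the same idea to each lane once MULHU is available.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned multiply-high: the upper 32 bits of the 64-bit product.
// Scalar analogue of an ISD::MULHU node on one i32 lane.
constexpr uint32_t mulhu32(uint32_t a, uint32_t b) {
  return static_cast<uint32_t>((static_cast<uint64_t>(a) * b) >> 32);
}

// x / 3 without a divide instruction. 0xAAAAAAAB == ceil(2^33 / 3), so
// floor(x * 0xAAAAAAAB / 2^33) equals x / 3 for every 32-bit x.
constexpr uint32_t udiv3(uint32_t x) {
  return mulhu32(x, 0xAAAAAAABu) >> 1;
}

// Spot checks against the ordinary division operator.
inline void check_udiv3() {
  assert(udiv3(0) == 0);
  assert(udiv3(2) == 0);
  assert(udiv3(9) == 3);
  assert(udiv3(0xFFFFFFFFu) == 0xFFFFFFFFu / 3);
}
```

The multiply-high plus shift replaces a comparatively slow divide, which is why making the MULH nodes available can change the code generated for divides by constants.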

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

bsmith created this revision. Apr 14 2021, 8:28 AM

Herald added a reviewer: efriedma. Apr 14 2021, 8:28 AM

Herald added subscribers: psnobl, hiraditya, kristof.beyls, tschuett.

bsmith requested review of this revision. Apr 14 2021, 8:28 AM

Herald added a project: Restricted Project. Apr 14 2021, 8:28 AM

Herald added a subscriber: llvm-commits.

paulwalker-arm added inline comments. Apr 15 2021, 7:58 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.h
92–93	As a general rule we use the common node name suffixed with `_PRED`, so in this case these should be `MULHS_PRED` and `MULHU_PRED`.
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
192	Just a note, because of my other comment, to say that this is fine: here we typically prefer the AArch64 names, as you've used. It would be nicer still if you maintained the existing alphabetical ordering :)
llvm/test/CodeGen/AArch64/sve-fixed-length-int-mulh.ll
5	Should this be `VBITS_EQ_2048`?
67	Given you're using `EQ_256` I doubt the `#min` logic means anything. That said, why not `VBITS_GE_256`? Shouldn't the code be the same for larger vectors?
393	As there's no NEON instruction for i64-based vectors, I'm wondering if it's worth using SVE for this case as well, much like we do for ISD::MUL.

Rename ISD nodes to match common names
Preserve some alphabetical ordering
Fix 2048 vector width tests
Update fixed width tests to test vector widths smaller than the given value
Allow generation of SVE mulh instructions for neon sized i64 vectors
Add test to check divide expansion that can now happen
Add minsize attribute to some SVE divide tests to prevent divide expansion from happening

Harbormaster completed remote builds in B99150: Diff 338074. Apr 16 2021, 6:58 AM

Matt added a subscriber: Matt. Apr 17 2021, 9:15 AM

paulwalker-arm accepted this revision. Apr 19 2021, 4:06 AM

paulwalker-arm added inline comments.

llvm/test/CodeGen/AArch64/sve-fixed-length-int-mulh.ll
3–6	Please add `RUN` lines for all the supported vector lengths. Also, just in case somebody wonders why, can you add a comment saying there's no validation for splitting vector operations because the necessary MULH DAG combine does not apply to illegally typed operations?
292	Weirdly, SVE is not being used here. Is the output different when SVE is disabled?
780	Same comment as smulh_v4i32.

This revision is now accepted and ready to land. Apr 19 2021, 4:06 AM

bsmith added inline comments. Apr 20 2021, 5:42 AM

llvm/test/CodeGen/AArch64/sve-fixed-length-int-mulh.ll
3–6	There already is a comment as such, on line 12 (23 in the new patch)
292	No, it's the same without SVE enabled. NEON has patterns to match 128-bit mulh nodes (but not 64-bit as above), as it can use the smull2+smull pattern below. Perhaps we should still fall back to SVE instead of this sequence? (Or I could just fix the comment.)

paulwalker-arm added inline comments. Apr 20 2021, 5:59 AM

llvm/test/CodeGen/AArch64/sve-fixed-length-int-mulh.ll
3–6	That must have been where "my" idea came from :)
292	Thanks. In which case fixing the comment works for me.
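
For readers following this thread, the operation being matched can be modelled in scalar C++ as: widen both operands to twice the element width, multiply, and keep the upper half. This is a sketch that mirrors the sext/mul/lshr/trunc pattern the tests build in IR, shown for a single i8 lane only; the names `mulhs8`, `mulhu8` and `check_mulh8` are hypothetical and are not part of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Signed multiply-high for one i8 lane (models MULHS).
inline int8_t mulhs8(int8_t a, int8_t b) {
  // Sign-extend to 16 bits and multiply; the product always fits in i16.
  int16_t wide = static_cast<int16_t>(static_cast<int16_t>(a) *
                                      static_cast<int16_t>(b));
  // Logical shift on the bit pattern, then truncate back to 8 bits.
  // (The narrowing conversion is modular on two's-complement targets.)
  return static_cast<int8_t>(static_cast<uint16_t>(wide) >> 8);
}

// Unsigned multiply-high for one i8 lane (models MULHU).
inline uint8_t mulhu8(uint8_t a, uint8_t b) {
  uint16_t wide = static_cast<uint16_t>(static_cast<uint16_t>(a) * b);
  return static_cast<uint8_t>(wide >> 8);
}

// Spot checks at the edges of the i8 range.
inline void check_mulh8() {
  assert(mulhs8(-128, -128) == 64);  // 16384 >> 8
  assert(mulhs8(127, 127) == 63);    // 16129 >> 8
  assert(mulhs8(-1, 1) == -1);       // high byte of 0xFFFF
  assert(mulhu8(255, 255) == 254);   // 65025 >> 8
}
```

An instruction such as smulh or umulh produces this upper half directly, so the widened product never has to be materialised.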

Closed by commit rGb8b075d8d744: [AArch64][SVE] Lower MULHU/MULHS nodes to umulh/smulh instructions (authored by bsmith). Apr 20 2021, 7:18 AM

This revision was automatically updated to reflect the committed changes.

bsmith added a commit: rGb8b075d8d744: [AArch64][SVE] Lower MULHU/MULHS nodes to umulh/smulh instructions.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.h	2 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	12 lines
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td	8 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-int-mulh.ll	1011 lines
llvm/test/CodeGen/AArch64/sve-int-mulh-pred.ll	140 lines
llvm/test/CodeGen/AArch64/sve2-int-mulh.ll	132 lines

Diff 337466

llvm/lib/Target/AArch64/AArch64ISelLowering.h

enum NodeType : unsigned {
FMA_PRED,		FMA_PRED,
FMAXNM_PRED,		FMAXNM_PRED,
FMINNM_PRED,		FMINNM_PRED,
FMAX_PRED,		FMAX_PRED,
FMIN_PRED,		FMIN_PRED,
FMUL_PRED,		FMUL_PRED,
FSUB_PRED,		FSUB_PRED,
MUL_PRED,		MUL_PRED,
		SMULH_PRED,
		UMULH_PRED,
SDIV_PRED,		SDIV_PRED,
SHL_PRED,		SHL_PRED,
SMAX_PRED,		SMAX_PRED,
SMIN_PRED,		SMIN_PRED,
SRA_PRED,		SRA_PRED,
SRL_PRED,		SRL_PRED,
SUB_PRED,		SUB_PRED,
UDIV_PRED,		UDIV_PRED,

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

for (auto VT : {MVT::nxv16i8, MVT::nxv8i16, MVT::nxv4i32, MVT::nxv2i64}) {
setOperationAction(ISD::INSERT_SUBVECTOR, VT, Custom);		setOperationAction(ISD::INSERT_SUBVECTOR, VT, Custom);
setOperationAction(ISD::UINT_TO_FP, VT, Custom);		setOperationAction(ISD::UINT_TO_FP, VT, Custom);
setOperationAction(ISD::SINT_TO_FP, VT, Custom);		setOperationAction(ISD::SINT_TO_FP, VT, Custom);
setOperationAction(ISD::FP_TO_UINT, VT, Custom);		setOperationAction(ISD::FP_TO_UINT, VT, Custom);
setOperationAction(ISD::FP_TO_SINT, VT, Custom);		setOperationAction(ISD::FP_TO_SINT, VT, Custom);
setOperationAction(ISD::MGATHER, VT, Custom);		setOperationAction(ISD::MGATHER, VT, Custom);
setOperationAction(ISD::MSCATTER, VT, Custom);		setOperationAction(ISD::MSCATTER, VT, Custom);
setOperationAction(ISD::MUL, VT, Custom);		setOperationAction(ISD::MUL, VT, Custom);
		setOperationAction(ISD::MULHU, VT, Custom);
		setOperationAction(ISD::MULHS, VT, Custom);
setOperationAction(ISD::SPLAT_VECTOR, VT, Custom);		setOperationAction(ISD::SPLAT_VECTOR, VT, Custom);
setOperationAction(ISD::SELECT, VT, Custom);		setOperationAction(ISD::SELECT, VT, Custom);
setOperationAction(ISD::SETCC, VT, Custom);		setOperationAction(ISD::SETCC, VT, Custom);
setOperationAction(ISD::SDIV, VT, Custom);		setOperationAction(ISD::SDIV, VT, Custom);
setOperationAction(ISD::UDIV, VT, Custom);		setOperationAction(ISD::UDIV, VT, Custom);
setOperationAction(ISD::SMIN, VT, Custom);		setOperationAction(ISD::SMIN, VT, Custom);
setOperationAction(ISD::UMIN, VT, Custom);		setOperationAction(ISD::UMIN, VT, Custom);
setOperationAction(ISD::SMAX, VT, Custom);		setOperationAction(ISD::SMAX, VT, Custom);
setOperationAction(ISD::UMAX, VT, Custom);		setOperationAction(ISD::UMAX, VT, Custom);
setOperationAction(ISD::SHL, VT, Custom);		setOperationAction(ISD::SHL, VT, Custom);
setOperationAction(ISD::SRL, VT, Custom);		setOperationAction(ISD::SRL, VT, Custom);
setOperationAction(ISD::SRA, VT, Custom);		setOperationAction(ISD::SRA, VT, Custom);
setOperationAction(ISD::ABS, VT, Custom);		setOperationAction(ISD::ABS, VT, Custom);
setOperationAction(ISD::VECREDUCE_ADD, VT, Custom);		setOperationAction(ISD::VECREDUCE_ADD, VT, Custom);
setOperationAction(ISD::VECREDUCE_AND, VT, Custom);		setOperationAction(ISD::VECREDUCE_AND, VT, Custom);
setOperationAction(ISD::VECREDUCE_OR, VT, Custom);		setOperationAction(ISD::VECREDUCE_OR, VT, Custom);
setOperationAction(ISD::VECREDUCE_XOR, VT, Custom);		setOperationAction(ISD::VECREDUCE_XOR, VT, Custom);
setOperationAction(ISD::VECREDUCE_UMIN, VT, Custom);		setOperationAction(ISD::VECREDUCE_UMIN, VT, Custom);
setOperationAction(ISD::VECREDUCE_UMAX, VT, Custom);		setOperationAction(ISD::VECREDUCE_UMAX, VT, Custom);
setOperationAction(ISD::VECREDUCE_SMIN, VT, Custom);		setOperationAction(ISD::VECREDUCE_SMIN, VT, Custom);
setOperationAction(ISD::VECREDUCE_SMAX, VT, Custom);		setOperationAction(ISD::VECREDUCE_SMAX, VT, Custom);
setOperationAction(ISD::STEP_VECTOR, VT, Custom);		setOperationAction(ISD::STEP_VECTOR, VT, Custom);

setOperationAction(ISD::MULHU, VT, Expand);
setOperationAction(ISD::MULHS, VT, Expand);
setOperationAction(ISD::UMUL_LOHI, VT, Expand);		setOperationAction(ISD::UMUL_LOHI, VT, Expand);
setOperationAction(ISD::SMUL_LOHI, VT, Expand);		setOperationAction(ISD::SMUL_LOHI, VT, Expand);
}		}

// Illegal unpacked integer vector types.		// Illegal unpacked integer vector types.
for (auto VT : {MVT::nxv8i8, MVT::nxv4i16, MVT::nxv2i32}) {		for (auto VT : {MVT::nxv8i8, MVT::nxv4i16, MVT::nxv2i32}) {
setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);		setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);
setOperationAction(ISD::INSERT_SUBVECTOR, VT, Custom);		setOperationAction(ISD::INSERT_SUBVECTOR, VT, Custom);

void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT) {
setOperationAction(ISD::FRINT, VT, Custom);		setOperationAction(ISD::FRINT, VT, Custom);
setOperationAction(ISD::FROUND, VT, Custom);		setOperationAction(ISD::FROUND, VT, Custom);
setOperationAction(ISD::FROUNDEVEN, VT, Custom);		setOperationAction(ISD::FROUNDEVEN, VT, Custom);
setOperationAction(ISD::FSQRT, VT, Custom);		setOperationAction(ISD::FSQRT, VT, Custom);
setOperationAction(ISD::FSUB, VT, Custom);		setOperationAction(ISD::FSUB, VT, Custom);
setOperationAction(ISD::FTRUNC, VT, Custom);		setOperationAction(ISD::FTRUNC, VT, Custom);
setOperationAction(ISD::LOAD, VT, Custom);		setOperationAction(ISD::LOAD, VT, Custom);
setOperationAction(ISD::MUL, VT, Custom);		setOperationAction(ISD::MUL, VT, Custom);
		setOperationAction(ISD::MULHU, VT, Custom);
		setOperationAction(ISD::MULHS, VT, Custom);
setOperationAction(ISD::OR, VT, Custom);		setOperationAction(ISD::OR, VT, Custom);
setOperationAction(ISD::SDIV, VT, Custom);		setOperationAction(ISD::SDIV, VT, Custom);
setOperationAction(ISD::SELECT, VT, Custom);		setOperationAction(ISD::SELECT, VT, Custom);
setOperationAction(ISD::SETCC, VT, Custom);		setOperationAction(ISD::SETCC, VT, Custom);
setOperationAction(ISD::SHL, VT, Custom);		setOperationAction(ISD::SHL, VT, Custom);
setOperationAction(ISD::SIGN_EXTEND, VT, Custom);		setOperationAction(ISD::SIGN_EXTEND, VT, Custom);
setOperationAction(ISD::SIGN_EXTEND_INREG, VT, Custom);		setOperationAction(ISD::SIGN_EXTEND_INREG, VT, Custom);
setOperationAction(ISD::SMAX, VT, Custom);		setOperationAction(ISD::SMAX, VT, Custom);

case AArch64ISD::FIRST_NUMBER:
MAKE_CASE(AArch64ISD::FCSEL)		MAKE_CASE(AArch64ISD::FCSEL)
MAKE_CASE(AArch64ISD::CSINV)		MAKE_CASE(AArch64ISD::CSINV)
MAKE_CASE(AArch64ISD::CSNEG)		MAKE_CASE(AArch64ISD::CSNEG)
MAKE_CASE(AArch64ISD::CSINC)		MAKE_CASE(AArch64ISD::CSINC)
MAKE_CASE(AArch64ISD::THREAD_POINTER)		MAKE_CASE(AArch64ISD::THREAD_POINTER)
MAKE_CASE(AArch64ISD::TLSDESC_CALLSEQ)		MAKE_CASE(AArch64ISD::TLSDESC_CALLSEQ)
MAKE_CASE(AArch64ISD::ADD_PRED)		MAKE_CASE(AArch64ISD::ADD_PRED)
MAKE_CASE(AArch64ISD::MUL_PRED)		MAKE_CASE(AArch64ISD::MUL_PRED)
		MAKE_CASE(AArch64ISD::SMULH_PRED)
		MAKE_CASE(AArch64ISD::UMULH_PRED)
MAKE_CASE(AArch64ISD::SDIV_PRED)		MAKE_CASE(AArch64ISD::SDIV_PRED)
MAKE_CASE(AArch64ISD::SHL_PRED)		MAKE_CASE(AArch64ISD::SHL_PRED)
MAKE_CASE(AArch64ISD::SMAX_PRED)		MAKE_CASE(AArch64ISD::SMAX_PRED)
MAKE_CASE(AArch64ISD::SMIN_PRED)		MAKE_CASE(AArch64ISD::SMIN_PRED)
MAKE_CASE(AArch64ISD::SRA_PRED)		MAKE_CASE(AArch64ISD::SRA_PRED)
MAKE_CASE(AArch64ISD::SRL_PRED)		MAKE_CASE(AArch64ISD::SRL_PRED)
MAKE_CASE(AArch64ISD::SUB_PRED)		MAKE_CASE(AArch64ISD::SUB_PRED)
MAKE_CASE(AArch64ISD::UDIV_PRED)		MAKE_CASE(AArch64ISD::UDIV_PRED)

SDValue AArch64TargetLowering::LowerOperation(SDValue Op,
case ISD::FSINCOS:		case ISD::FSINCOS:
return LowerFSINCOS(Op, DAG);		return LowerFSINCOS(Op, DAG);
case ISD::FLT_ROUNDS_:		case ISD::FLT_ROUNDS_:
return LowerFLT_ROUNDS_(Op, DAG);		return LowerFLT_ROUNDS_(Op, DAG);
case ISD::SET_ROUNDING:		case ISD::SET_ROUNDING:
return LowerSET_ROUNDING(Op, DAG);		return LowerSET_ROUNDING(Op, DAG);
case ISD::MUL:		case ISD::MUL:
return LowerMUL(Op, DAG);		return LowerMUL(Op, DAG);
		case ISD::MULHU:
		return LowerToPredicatedOp(Op, DAG, AArch64ISD::UMULH_PRED);
		case ISD::MULHS:
		return LowerToPredicatedOp(Op, DAG, AArch64ISD::SMULH_PRED);
case ISD::INTRINSIC_WO_CHAIN:		case ISD::INTRINSIC_WO_CHAIN:
return LowerINTRINSIC_WO_CHAIN(Op, DAG);		return LowerINTRINSIC_WO_CHAIN(Op, DAG);
case ISD::STORE:		case ISD::STORE:
return LowerSTORE(Op, DAG);		return LowerSTORE(Op, DAG);
case ISD::MGATHER:		case ISD::MGATHER:
return LowerMGATHER(Op, DAG);		return LowerMGATHER(Op, DAG);
case ISD::MSCATTER:		case ISD::MSCATTER:
return LowerMSCATTER(Op, DAG);		return LowerMSCATTER(Op, DAG);

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td

def AArch64fminnm_p : SDNode<"AArch64ISD::FMINNM_PRED", SDT_AArch64Arith>;		def AArch64fminnm_p : SDNode<"AArch64ISD::FMINNM_PRED", SDT_AArch64Arith>;
def AArch64fmax_p : SDNode<"AArch64ISD::FMAX_PRED", SDT_AArch64Arith>;		def AArch64fmax_p : SDNode<"AArch64ISD::FMAX_PRED", SDT_AArch64Arith>;
def AArch64fmin_p : SDNode<"AArch64ISD::FMIN_PRED", SDT_AArch64Arith>;		def AArch64fmin_p : SDNode<"AArch64ISD::FMIN_PRED", SDT_AArch64Arith>;
def AArch64fmul_p : SDNode<"AArch64ISD::FMUL_PRED", SDT_AArch64Arith>;		def AArch64fmul_p : SDNode<"AArch64ISD::FMUL_PRED", SDT_AArch64Arith>;
def AArch64fsub_p : SDNode<"AArch64ISD::FSUB_PRED", SDT_AArch64Arith>;		def AArch64fsub_p : SDNode<"AArch64ISD::FSUB_PRED", SDT_AArch64Arith>;
def AArch64lsl_p : SDNode<"AArch64ISD::SHL_PRED", SDT_AArch64Arith>;		def AArch64lsl_p : SDNode<"AArch64ISD::SHL_PRED", SDT_AArch64Arith>;
def AArch64lsr_p : SDNode<"AArch64ISD::SRL_PRED", SDT_AArch64Arith>;		def AArch64lsr_p : SDNode<"AArch64ISD::SRL_PRED", SDT_AArch64Arith>;
def AArch64mul_p : SDNode<"AArch64ISD::MUL_PRED", SDT_AArch64Arith>;		def AArch64mul_p : SDNode<"AArch64ISD::MUL_PRED", SDT_AArch64Arith>;
		def AArch64smulh_p : SDNode<"AArch64ISD::SMULH_PRED", SDT_AArch64Arith>;
		def AArch64umulh_p : SDNode<"AArch64ISD::UMULH_PRED", SDT_AArch64Arith>;
def AArch64sdiv_p : SDNode<"AArch64ISD::SDIV_PRED", SDT_AArch64Arith>;		def AArch64sdiv_p : SDNode<"AArch64ISD::SDIV_PRED", SDT_AArch64Arith>;
def AArch64smax_p : SDNode<"AArch64ISD::SMAX_PRED", SDT_AArch64Arith>;		def AArch64smax_p : SDNode<"AArch64ISD::SMAX_PRED", SDT_AArch64Arith>;
def AArch64smin_p : SDNode<"AArch64ISD::SMIN_PRED", SDT_AArch64Arith>;		def AArch64smin_p : SDNode<"AArch64ISD::SMIN_PRED", SDT_AArch64Arith>;
def AArch64sub_p : SDNode<"AArch64ISD::SUB_PRED", SDT_AArch64Arith>;		def AArch64sub_p : SDNode<"AArch64ISD::SUB_PRED", SDT_AArch64Arith>;
def AArch64udiv_p : SDNode<"AArch64ISD::UDIV_PRED", SDT_AArch64Arith>;		def AArch64udiv_p : SDNode<"AArch64ISD::UDIV_PRED", SDT_AArch64Arith>;
def AArch64umax_p : SDNode<"AArch64ISD::UMAX_PRED", SDT_AArch64Arith>;		def AArch64umax_p : SDNode<"AArch64ISD::UMAX_PRED", SDT_AArch64Arith>;
def AArch64umin_p : SDNode<"AArch64ISD::UMIN_PRED", SDT_AArch64Arith>;		def AArch64umin_p : SDNode<"AArch64ISD::UMIN_PRED", SDT_AArch64Arith>;

let Predicates = [HasSVE] in {
defm UMIN_ZI : sve_int_arith_imm1_unsigned<0b11, "umin", AArch64umin_p>;		defm UMIN_ZI : sve_int_arith_imm1_unsigned<0b11, "umin", AArch64umin_p>;

defm MUL_ZI : sve_int_arith_imm2<"mul", AArch64mul_p>;		defm MUL_ZI : sve_int_arith_imm2<"mul", AArch64mul_p>;
defm MUL_ZPmZ : sve_int_bin_pred_arit_2<0b000, "mul", "MUL_ZPZZ", int_aarch64_sve_mul, DestructiveBinaryComm>;		defm MUL_ZPmZ : sve_int_bin_pred_arit_2<0b000, "mul", "MUL_ZPZZ", int_aarch64_sve_mul, DestructiveBinaryComm>;
defm SMULH_ZPmZ : sve_int_bin_pred_arit_2<0b010, "smulh", "SMULH_ZPZZ", int_aarch64_sve_smulh, DestructiveBinaryComm>;		defm SMULH_ZPmZ : sve_int_bin_pred_arit_2<0b010, "smulh", "SMULH_ZPZZ", int_aarch64_sve_smulh, DestructiveBinaryComm>;
defm UMULH_ZPmZ : sve_int_bin_pred_arit_2<0b011, "umulh", "UMULH_ZPZZ", int_aarch64_sve_umulh, DestructiveBinaryComm>;		defm UMULH_ZPmZ : sve_int_bin_pred_arit_2<0b011, "umulh", "UMULH_ZPZZ", int_aarch64_sve_umulh, DestructiveBinaryComm>;

defm MUL_ZPZZ : sve_int_bin_pred_bhsd<AArch64mul_p>;		defm MUL_ZPZZ : sve_int_bin_pred_bhsd<AArch64mul_p>;
		defm SMULH_ZPZZ : sve_int_bin_pred_bhsd<AArch64smulh_p>;
		defm UMULH_ZPZZ : sve_int_bin_pred_bhsd<AArch64umulh_p>;

defm SDIV_ZPmZ : sve_int_bin_pred_arit_2_div<0b100, "sdiv", "SDIV_ZPZZ", int_aarch64_sve_sdiv, DestructiveBinaryCommWithRev, "SDIVR_ZPmZ">;		defm SDIV_ZPmZ : sve_int_bin_pred_arit_2_div<0b100, "sdiv", "SDIV_ZPZZ", int_aarch64_sve_sdiv, DestructiveBinaryCommWithRev, "SDIVR_ZPmZ">;
defm UDIV_ZPmZ : sve_int_bin_pred_arit_2_div<0b101, "udiv", "UDIV_ZPZZ", int_aarch64_sve_udiv, DestructiveBinaryCommWithRev, "UDIVR_ZPmZ">;		defm UDIV_ZPmZ : sve_int_bin_pred_arit_2_div<0b101, "udiv", "UDIV_ZPZZ", int_aarch64_sve_udiv, DestructiveBinaryCommWithRev, "UDIVR_ZPmZ">;
defm SDIVR_ZPmZ : sve_int_bin_pred_arit_2_div<0b110, "sdivr", "SDIVR_ZPZZ", int_aarch64_sve_sdivr, DestructiveBinaryCommWithRev, "SDIV_ZPmZ", /*isReverseInstr*/ 1>;		defm SDIVR_ZPmZ : sve_int_bin_pred_arit_2_div<0b110, "sdivr", "SDIVR_ZPZZ", int_aarch64_sve_sdivr, DestructiveBinaryCommWithRev, "SDIV_ZPmZ", /*isReverseInstr*/ 1>;
defm UDIVR_ZPmZ : sve_int_bin_pred_arit_2_div<0b111, "udivr", "UDIVR_ZPZZ", int_aarch64_sve_udivr, DestructiveBinaryCommWithRev, "UDIV_ZPmZ", /*isReverseInstr*/ 1>;		defm UDIVR_ZPmZ : sve_int_bin_pred_arit_2_div<0b111, "udivr", "UDIVR_ZPZZ", int_aarch64_sve_udivr, DestructiveBinaryCommWithRev, "UDIV_ZPmZ", /*isReverseInstr*/ 1>;

defm SDIV_ZPZZ : sve_int_bin_pred_sd<AArch64sdiv_p>;		defm SDIV_ZPZZ : sve_int_bin_pred_sd<AArch64sdiv_p>;
defm UDIV_ZPZZ : sve_int_bin_pred_sd<AArch64udiv_p>;		defm UDIV_ZPZZ : sve_int_bin_pred_sd<AArch64udiv_p>;

let Predicates = [HasSVE2] in {
defm SQRDMULH_ZZZI : sve2_int_mul_by_indexed_elem<0b1101, "sqrdmulh", int_aarch64_sve_sqrdmulh_lane>;		defm SQRDMULH_ZZZI : sve2_int_mul_by_indexed_elem<0b1101, "sqrdmulh", int_aarch64_sve_sqrdmulh_lane>;

// SVE2 signed saturating doubling multiply high (unpredicated)		// SVE2 signed saturating doubling multiply high (unpredicated)
defm SQDMULH_ZZZ : sve2_int_mul<0b100, "sqdmulh", int_aarch64_sve_sqdmulh>;		defm SQDMULH_ZZZ : sve2_int_mul<0b100, "sqdmulh", int_aarch64_sve_sqdmulh>;
defm SQRDMULH_ZZZ : sve2_int_mul<0b101, "sqrdmulh", int_aarch64_sve_sqrdmulh>;		defm SQRDMULH_ZZZ : sve2_int_mul<0b101, "sqrdmulh", int_aarch64_sve_sqrdmulh>;

// SVE2 integer multiply vectors (unpredicated)		// SVE2 integer multiply vectors (unpredicated)
defm MUL_ZZZ : sve2_int_mul<0b000, "mul", null_frag, AArch64mul_p>;		defm MUL_ZZZ : sve2_int_mul<0b000, "mul", null_frag, AArch64mul_p>;
defm SMULH_ZZZ : sve2_int_mul<0b010, "smulh", null_frag>;		defm SMULH_ZZZ : sve2_int_mul<0b010, "smulh", null_frag, AArch64smulh_p>;
defm UMULH_ZZZ : sve2_int_mul<0b011, "umulh", null_frag>;		defm UMULH_ZZZ : sve2_int_mul<0b011, "umulh", null_frag, AArch64umulh_p>;
defm PMUL_ZZZ : sve2_int_mul_single<0b001, "pmul", int_aarch64_sve_pmul>;		defm PMUL_ZZZ : sve2_int_mul_single<0b001, "pmul", int_aarch64_sve_pmul>;

// Add patterns for unpredicated version of smulh and umulh.		// Add patterns for unpredicated version of smulh and umulh.
def : Pat<(nxv16i8 (int_aarch64_sve_smulh (nxv16i1 (AArch64ptrue 31)), nxv16i8:$Op1, nxv16i8:$Op2)),		def : Pat<(nxv16i8 (int_aarch64_sve_smulh (nxv16i1 (AArch64ptrue 31)), nxv16i8:$Op1, nxv16i8:$Op2)),
(SMULH_ZZZ_B $Op1, $Op2)>;		(SMULH_ZZZ_B $Op1, $Op2)>;
def : Pat<(nxv8i16 (int_aarch64_sve_smulh (nxv8i1 (AArch64ptrue 31)), nxv8i16:$Op1, nxv8i16:$Op2)),		def : Pat<(nxv8i16 (int_aarch64_sve_smulh (nxv8i1 (AArch64ptrue 31)), nxv8i16:$Op1, nxv8i16:$Op2)),
(SMULH_ZZZ_H $Op1, $Op2)>;		(SMULH_ZZZ_H $Op1, $Op2)>;
def : Pat<(nxv4i32 (int_aarch64_sve_smulh (nxv4i1 (AArch64ptrue 31)), nxv4i32:$Op1, nxv4i32:$Op2)),		def : Pat<(nxv4i32 (int_aarch64_sve_smulh (nxv4i1 (AArch64ptrue 31)), nxv4i32:$Op1, nxv4i32:$Op2)),

llvm/test/CodeGen/AArch64/sve-fixed-length-int-mulh.ll

This file was added.

				; RUN: llc -aarch64-sve-vector-bits-min=128 < %s | FileCheck %s -D#VBYTES=16 -check-prefix=NO_SVE
				; RUN: llc -aarch64-sve-vector-bits-min=256 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK,VBITS_EQ_256
				; RUN: llc -aarch64-sve-vector-bits-min=512 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_EQ_512
				; RUN: llc -aarch64-sve-vector-bits-min=1024 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_EQ_1024
				; RUN: llc -aarch64-sve-vector-bits-min=2048 < %s | FileCheck %s -D#VBYTES=256 -check-prefixes=CHECK,VBITS_EQ_1024

				; VBYTES represents the useful byte size of a vector register from the code
				; generator's point of view. It is clamped to power-of-2 values because
				; only power-of-2 vector lengths are considered legal, regardless of the
				; user specified vector length.

				; This test only tests the legal types for a given vector width, as mulh nodes
				; do not get generated for non-legal types.

				target triple = "aarch64-unknown-linux-gnu"

				; Don't use SVE when its registers are no bigger than NEON.
				; NO_SVE-NOT: ptrue

				;
				; SMULH
				;

				; Don't use SVE for 64-bit vectors.
				define <8 x i8> @smulh_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: smulh_v8i8:
				; CHECK: smull v0.8h, v0.8b, v1.8b
				; CHECK: ushr v1.8h, v0.8h, #8
				; CHECK: umov w8, v1.h[0]
				; CHECK: fmov s0, w8
				; CHECK: umov w8, v1.h[1]
				; CHECK: mov v0.b[1], w8
				; CHECK: umov w8, v1.h[2]
				; CHECK: mov v0.b[2], w8
				; CHECK: umov w8, v1.h[3]
				; CHECK: mov v0.b[3], w8
				; CHECK: ret
				%insert = insertelement <8 x i16> undef, i16 8, i64 0
				%splat = shufflevector <8 x i16> %insert, <8 x i16> undef, <8 x i32> zeroinitializer
				%1 = sext <8 x i8> %op1 to <8 x i16>
				%2 = sext <8 x i8> %op2 to <8 x i16>
				%mul = mul <8 x i16> %1, %2
				%shr = lshr <8 x i16> %mul, %splat
				%res = trunc <8 x i16> %shr to <8 x i8>
				ret <8 x i8> %res
				}

				; Don't use SVE for 128-bit vectors.
				define <16 x i8> @smulh_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: smulh_v16i8:
				; CHECK: smull2 v2.8h, v0.16b, v1.16b
				; CHECK: smull v0.8h, v0.8b, v1.8b
				; CHECK: uzp2 v0.16b, v0.16b, v2.16b
				; CHECK: ret
				%insert = insertelement <16 x i16> undef, i16 8, i64 0
				%splat = shufflevector <16 x i16> %insert, <16 x i16> undef, <16 x i32> zeroinitializer
				%1 = sext <16 x i8> %op1 to <16 x i16>
				%2 = sext <16 x i8> %op2 to <16 x i16>
				%mul = mul <16 x i16> %1, %2
				%shr = lshr <16 x i16> %mul, %splat
				%res = trunc <16 x i16> %shr to <16 x i8>
				ret <16 x i8> %res
				}

				define void @smulh_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: smulh_v32i8:
				; VBITS_EQ_256: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,32)]]
				; VBITS_EQ_256-DAG: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_EQ_256-DAG: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_EQ_256: smulh [[RES:z[0-9]+]].b, [[PG]]/m, [[OP1]].b, [[OP2]].b
				; VBITS_EQ_256: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_EQ_256: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%insert = insertelement <32 x i16> undef, i16 8, i64 0
				%splat = shufflevector <32 x i16> %insert, <32 x i16> undef, <32 x i32> zeroinitializer
				%1 = sext <32 x i8> %op1 to <32 x i16>
				%2 = sext <32 x i8> %op2 to <32 x i16>
				%mul = mul <32 x i16> %1, %2
				%shr = lshr <32 x i16> %mul, %splat
				%res = trunc <32 x i16> %shr to <32 x i8>
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define void @smulh_v64i8(<64 x i8>* %a, <64 x i8>* %b) #0 {
				; CHECK-LABEL: smulh_v64i8:
				; VBITS_EQ_512: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,64)]]
				; VBITS_EQ_512-DAG: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_EQ_512-DAG: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_EQ_512: smulh [[RES:z[0-9]+]].b, [[PG]]/m, [[OP1]].b, [[OP2]].b
				; VBITS_EQ_512: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_EQ_512: ret
				%op1 = load <64 x i8>, <64 x i8>* %a
				%op2 = load <64 x i8>, <64 x i8>* %b
				%insert = insertelement <64 x i16> undef, i16 8, i64 0
				%splat = shufflevector <64 x i16> %insert, <64 x i16> undef, <64 x i32> zeroinitializer
				%1 = sext <64 x i8> %op1 to <64 x i16>
				%2 = sext <64 x i8> %op2 to <64 x i16>
				%mul = mul <64 x i16> %1, %2
				%shr = lshr <64 x i16> %mul, %splat
				%res = trunc <64 x i16> %shr to <64 x i8>
				store <64 x i8> %res, <64 x i8>* %a
				ret void
				}

				define void @smulh_v128i8(<128 x i8>* %a, <128 x i8>* %b) #0 {
				; CHECK-LABEL: smulh_v128i8:
				; VBITS_EQ_1024: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,128)]]
				; VBITS_EQ_1024-DAG: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_EQ_1024-DAG: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_EQ_1024: smulh [[RES:z[0-9]+]].b, [[PG]]/m, [[OP1]].b, [[OP2]].b
				; VBITS_EQ_1024: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_EQ_1024: ret
				%op1 = load <128 x i8>, <128 x i8>* %a
				%op2 = load <128 x i8>, <128 x i8>* %b
				%insert = insertelement <128 x i16> undef, i16 8, i64 0
				%splat = shufflevector <128 x i16> %insert, <128 x i16> undef, <128 x i32> zeroinitializer
				%1 = sext <128 x i8> %op1 to <128 x i16>
				%2 = sext <128 x i8> %op2 to <128 x i16>
				%mul = mul <128 x i16> %1, %2
				%shr = lshr <128 x i16> %mul, %splat
				%res = trunc <128 x i16> %shr to <128 x i8>
				store <128 x i8> %res, <128 x i8>* %a
				ret void
				}

				define void @smulh_v256i8(<256 x i8>* %a, <256 x i8>* %b) #0 {
				; CHECK-LABEL: smulh_v256i8:
				; VBITS_EQ_2048: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,256)]]
				; VBITS_EQ_2048-DAG: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_EQ_2048-DAG: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_EQ_2048: mul [[RES:z[0-9]+]].b, [[PG]]/m, [[OP1]].b, [[OP2]].b
				; VBITS_EQ_2048: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_EQ_2048: ret
				%op1 = load <256 x i8>, <256 x i8>* %a
				%op2 = load <256 x i8>, <256 x i8>* %b
				%insert = insertelement <256 x i16> undef, i16 8, i64 0
				%splat = shufflevector <256 x i16> %insert, <256 x i16> undef, <256 x i32> zeroinitializer
				%1 = sext <256 x i8> %op1 to <256 x i16>
				%2 = sext <256 x i8> %op2 to <256 x i16>
				%mul = mul <256 x i16> %1, %2
				%shr = lshr <256 x i16> %mul, %splat
				%res = trunc <256 x i16> %shr to <256 x i8>
				store <256 x i8> %res, <256 x i8>* %a
				ret void
				}

				; Don't use SVE for 64-bit vectors.
				define <4 x i16> @smulh_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: smulh_v4i16:
				; CHECK: smull v0.4s, v0.4h, v1.4h
				; CHECK: ushr v0.4s, v0.4s, #16
				; CHECK: mov w8, v0.s[1]
				; CHECK: mov w9, v0.s[2]
				; CHECK: mov w10, v0.s[3]
				; CHECK: mov v0.h[1], w8
				; CHECK: mov v0.h[2], w9
				; CHECK: mov v0.h[3], w10
				; CHECK: ret
				%insert = insertelement <4 x i32> undef, i32 16, i64 0
				%splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
				%1 = sext <4 x i16> %op1 to <4 x i32>
				%2 = sext <4 x i16> %op2 to <4 x i32>
				%mul = mul <4 x i32> %1, %2
				%shr = lshr <4 x i32> %mul, %splat
				%res = trunc <4 x i32> %shr to <4 x i16>
				ret <4 x i16> %res
				}

				; Don't use SVE for 128-bit vectors.
				define <8 x i16> @smulh_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: smulh_v8i16:
				; CHECK: smull2 v2.4s, v0.8h, v1.8h
				; CHECK: smull v0.4s, v0.4h, v1.4h
				; CHECK: uzp2 v0.8h, v0.8h, v2.8h
				; CHECK: ret
				%insert = insertelement <8 x i32> undef, i32 16, i64 0
				%splat = shufflevector <8 x i32> %insert, <8 x i32> undef, <8 x i32> zeroinitializer
				%1 = sext <8 x i16> %op1 to <8 x i32>
				%2 = sext <8 x i16> %op2 to <8 x i32>
				%mul = mul <8 x i32> %1, %2
				%shr = lshr <8 x i32> %mul, %splat
				%res = trunc <8 x i32> %shr to <8 x i16>
				ret <8 x i16> %res
				}

				define void @smulh_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: smulh_v16i16:
				; VBITS_EQ_256: ptrue [[PG:p[0-9]+]].h, vl[[#min(VBYTES,16)]]
				; VBITS_EQ_256-DAG: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_EQ_256-DAG: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_EQ_256: smulh [[RES:z[0-9]+]].h, [[PG]]/m, [[OP1]].h, [[OP2]].h
				; VBITS_EQ_256: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_EQ_256: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%insert = insertelement <16 x i32> undef, i32 16, i64 0
				%splat = shufflevector <16 x i32> %insert, <16 x i32> undef, <16 x i32> zeroinitializer
				%1 = sext <16 x i16> %op1 to <16 x i32>
				%2 = sext <16 x i16> %op2 to <16 x i32>
				%mul = mul <16 x i32> %1, %2
				%shr = lshr <16 x i32> %mul, %splat
				%res = trunc <16 x i32> %shr to <16 x i16>
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define void @smulh_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; CHECK-LABEL: smulh_v32i16:
				; VBITS_EQ_512: ptrue [[PG:p[0-9]+]].h, vl[[#min(VBYTES,32)]]
				; VBITS_EQ_512-DAG: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_EQ_512-DAG: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_EQ_512: smulh [[RES:z[0-9]+]].h, [[PG]]/m, [[OP1]].h, [[OP2]].h
				; VBITS_EQ_512: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_EQ_512: ret
				%op1 = load <32 x i16>, <32 x i16>* %a
				%op2 = load <32 x i16>, <32 x i16>* %b
				%insert = insertelement <32 x i32> undef, i32 16, i64 0
				%splat = shufflevector <32 x i32> %insert, <32 x i32> undef, <32 x i32> zeroinitializer
				%1 = sext <32 x i16> %op1 to <32 x i32>
				%2 = sext <32 x i16> %op2 to <32 x i32>
				%mul = mul <32 x i32> %1, %2
				%shr = lshr <32 x i32> %mul, %splat
				%res = trunc <32 x i32> %shr to <32 x i16>
				store <32 x i16> %res, <32 x i16>* %a
				ret void
				}

				define void @smulh_v64i16(<64 x i16>* %a, <64 x i16>* %b) #0 {
				; CHECK-LABEL: smulh_v64i16:
				; VBITS_EQ_1024: ptrue [[PG:p[0-9]+]].h, vl[[#min(VBYTES,64)]]
				; VBITS_EQ_1024-DAG: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_EQ_1024-DAG: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_EQ_1024: smulh [[RES:z[0-9]+]].h, [[PG]]/m, [[OP1]].h, [[OP2]].h
				; VBITS_EQ_1024: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_EQ_1024: ret
				%op1 = load <64 x i16>, <64 x i16>* %a
				%op2 = load <64 x i16>, <64 x i16>* %b
				%insert = insertelement <64 x i32> undef, i32 16, i64 0
				%splat = shufflevector <64 x i32> %insert, <64 x i32> undef, <64 x i32> zeroinitializer
				%1 = sext <64 x i16> %op1 to <64 x i32>
				%2 = sext <64 x i16> %op2 to <64 x i32>
				%mul = mul <64 x i32> %1, %2
				%shr = lshr <64 x i32> %mul, %splat
				%res = trunc <64 x i32> %shr to <64 x i16>
				store <64 x i16> %res, <64 x i16>* %a
				ret void
				}

				define void @smulh_v128i16(<128 x i16>* %a, <128 x i16>* %b) #0 {
				; CHECK-LABEL: smulh_v128i16:
				; VBITS_EQ_2048: ptrue [[PG:p[0-9]+]].h, vl[[#min(VBYTES,128)]]
				; VBITS_EQ_2048-DAG: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_EQ_2048-DAG: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_EQ_2048: mul [[RES:z[0-9]+]].h, [[PG]]/m, [[OP1]].h, [[OP2]].h
				; VBITS_EQ_2048: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_EQ_2048: ret
				%op1 = load <128 x i16>, <128 x i16>* %a
				%op2 = load <128 x i16>, <128 x i16>* %b
				%insert = insertelement <128 x i32> undef, i32 16, i64 0
				%splat = shufflevector <128 x i32> %insert, <128 x i32> undef, <128 x i32> zeroinitializer
				%1 = sext <128 x i16> %op1 to <128 x i32>
				%2 = sext <128 x i16> %op2 to <128 x i32>
				%mul = mul <128 x i32> %1, %2
				%shr = lshr <128 x i32> %mul, %splat
				%res = trunc <128 x i32> %shr to <128 x i16>
				store <128 x i16> %res, <128 x i16>* %a
				ret void
				}

				; Vector i64 multiplications are not legal for NEON so use SVE when available.
				define <2 x i32> @smulh_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: smulh_v2i32:
				; CHECK: sshll v0.2d, v0.2s, #0
				; CHECK: sshll v1.2d, v1.2s, #0
				; CHECK: ptrue p0.d, vl2
				; CHECK: mul z0.d, p0/m, z0.d, z1.d
				; CHECK: shrn v0.2s, v0.2d, #32
				; CHECK: ret
				%insert = insertelement <2 x i64> undef, i64 32, i64 0
				%splat = shufflevector <2 x i64> %insert, <2 x i64> undef, <2 x i32> zeroinitializer
				%1 = sext <2 x i32> %op1 to <2 x i64>
				%2 = sext <2 x i32> %op2 to <2 x i64>
				%mul = mul <2 x i64> %1, %2
				%shr = lshr <2 x i64> %mul, %splat
				%res = trunc <2 x i64> %shr to <2 x i32>
				ret <2 x i32> %res
				}

				; Vector i64 multiplications are not legal for NEON so use SVE when available.
				define <4 x i32> @smulh_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				paulwalker-arm: Weirdly, SVE is not being used here. Is the output different when SVE is disabled?
				bsmith (author): No, it's the same without SVE enabled. NEON has patterns to match 128-bit mulh nodes (but not 64-bit as above), as it can use the smull2+smull pattern below. Perhaps we should still fall back to SVE instead of this sequence? (Or I just fix the comment.)
				paulwalker-arm: Thanks. In which case fixing the comment works for me.
				; CHECK-LABEL: smulh_v4i32:
				; CHECK: smull2 v2.2d, v0.4s, v1.4s
				; CHECK: smull v0.2d, v0.2s, v1.2s
				; CHECK: uzp2 v0.4s, v0.4s, v2.4s
				; CHECK: ret
				%insert = insertelement <4 x i64> undef, i64 32, i64 0
				%splat = shufflevector <4 x i64> %insert, <4 x i64> undef, <4 x i32> zeroinitializer
				%1 = sext <4 x i32> %op1 to <4 x i64>
				%2 = sext <4 x i32> %op2 to <4 x i64>
				%mul = mul <4 x i64> %1, %2
				%shr = lshr <4 x i64> %mul, %splat
				%res = trunc <4 x i64> %shr to <4 x i32>
				ret <4 x i32> %res
				}
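
The smull2 + smull + uzp2 sequence that the inline comments on smulh_v4i32 refer to can be modelled in scalar code. The following is a minimal sketch, not part of the patch: the helper names (`mulhs32`, `mulhu32`, `mulhs_v4i32`) are hypothetical and only illustrate the sext/mul/lshr/trunc idiom used by these tests, assuming little-endian lane layout and two's complement narrowing.

```cpp
#include <array>
#include <cstdint>

// One lane of a signed multiply-high: widen both operands, multiply,
// and keep the upper 32 bits. This mirrors sext -> mul -> lshr 32 -> trunc.
inline int32_t mulhs32(int32_t a, int32_t b) {
    const int64_t wide = static_cast<int64_t>(a) * static_cast<int64_t>(b);
    // Shift as unsigned (like lshr), then narrow; the narrowing relies on
    // two's complement conversion, which is guaranteed from C++20 onwards.
    return static_cast<int32_t>(static_cast<uint32_t>(static_cast<uint64_t>(wide) >> 32));
}

// Unsigned counterpart: zext -> mul -> lshr 32 -> trunc.
inline uint32_t mulhu32(uint32_t a, uint32_t b) {
    return static_cast<uint32_t>((static_cast<uint64_t>(a) * b) >> 32);
}

// Hypothetical model of the 128-bit NEON sequence checked in smulh_v4i32.
inline std::array<int32_t, 4> mulhs_v4i32(const std::array<int32_t, 4>& a,
                                          const std::array<int32_t, 4>& b) {
    // smull: widening multiply of the low two lanes.
    const std::array<int64_t, 2> lo{static_cast<int64_t>(a[0]) * b[0],
                                    static_cast<int64_t>(a[1]) * b[1]};
    // smull2: widening multiply of the high two lanes.
    const std::array<int64_t, 2> hi{static_cast<int64_t>(a[2]) * b[2],
                                    static_cast<int64_t>(a[3]) * b[3]};
    // uzp2 on the .4s view keeps the odd-numbered 32-bit elements of both
    // inputs, i.e. the upper half of each 64-bit product.
    auto high32 = [](int64_t v) {
        return static_cast<int32_t>(static_cast<uint32_t>(static_cast<uint64_t>(v) >> 32));
    };
    return {high32(lo[0]), high32(lo[1]), high32(hi[0]), high32(hi[1])};
}
```

In this model the result of `mulhs_v4i32` matches applying `mulhs32` lane by lane, which is why the NEON sequence needs no explicit shift: selecting the odd 32-bit elements performs the shift and truncation in one step.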

				define void @smulh_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: smulh_v8i32:
				; VBITS_EQ_256: ptrue [[PG:p[0-9]+]].s, vl[[#min(VBYTES,8)]]
				; VBITS_EQ_256-DAG: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_EQ_256-DAG: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_EQ_256: smulh [[RES:z[0-9]+]].s, [[PG]]/m, [[OP1]].s, [[OP2]].s
				; VBITS_EQ_256: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_EQ_256: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%insert = insertelement <8 x i64> undef, i64 32, i64 0
				%splat = shufflevector <8 x i64> %insert, <8 x i64> undef, <8 x i32> zeroinitializer
				%1 = sext <8 x i32> %op1 to <8 x i64>
				%2 = sext <8 x i32> %op2 to <8 x i64>
				%mul = mul <8 x i64> %1, %2
				%shr = lshr <8 x i64> %mul, %splat
				%res = trunc <8 x i64> %shr to <8 x i32>
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define void @smulh_v16i32(<16 x i32>* %a, <16 x i32>* %b) #0 {
				; CHECK-LABEL: smulh_v16i32:
				; VBITS_EQ_512: ptrue [[PG:p[0-9]+]].s, vl[[#min(VBYTES,16)]]
				; VBITS_EQ_512-DAG: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_EQ_512-DAG: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_EQ_512: smulh [[RES:z[0-9]+]].s, [[PG]]/m, [[OP1]].s, [[OP2]].s
				; VBITS_EQ_512: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_EQ_512: ret
				%op1 = load <16 x i32>, <16 x i32>* %a
				%op2 = load <16 x i32>, <16 x i32>* %b
				%insert = insertelement <16 x i64> undef, i64 32, i64 0
				%splat = shufflevector <16 x i64> %insert, <16 x i64> undef, <16 x i32> zeroinitializer
				%1 = sext <16 x i32> %op1 to <16 x i64>
				%2 = sext <16 x i32> %op2 to <16 x i64>
				%mul = mul <16 x i64> %1, %2
				%shr = lshr <16 x i64> %mul, %splat
				%res = trunc <16 x i64> %shr to <16 x i32>
				store <16 x i32> %res, <16 x i32>* %a
				ret void
				}

				define void @smulh_v32i32(<32 x i32>* %a, <32 x i32>* %b) #0 {
				; CHECK-LABEL: smulh_v32i32:
				; VBITS_EQ_1024: ptrue [[PG:p[0-9]+]].s, vl[[#min(VBYTES,32)]]
				; VBITS_EQ_1024-DAG: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_EQ_1024-DAG: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_EQ_1024: smulh [[RES:z[0-9]+]].s, [[PG]]/m, [[OP1]].s, [[OP2]].s
				; VBITS_EQ_1024: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_EQ_1024: ret
				%op1 = load <32 x i32>, <32 x i32>* %a
				%op2 = load <32 x i32>, <32 x i32>* %b
				%insert = insertelement <32 x i64> undef, i64 32, i64 0
				%splat = shufflevector <32 x i64> %insert, <32 x i64> undef, <32 x i32> zeroinitializer
				%1 = sext <32 x i32> %op1 to <32 x i64>
				%2 = sext <32 x i32> %op2 to <32 x i64>
				%mul = mul <32 x i64> %1, %2
				%shr = lshr <32 x i64> %mul, %splat
				%res = trunc <32 x i64> %shr to <32 x i32>
				store <32 x i32> %res, <32 x i32>* %a
				ret void
				}

				define void @smulh_v64i32(<64 x i32>* %a, <64 x i32>* %b) #0 {
				; CHECK-LABEL: smulh_v64i32:
				; VBITS_EQ_2048: ptrue [[PG:p[0-9]+]].s, vl[[#min(VBYTES,64)]]
				; VBITS_EQ_2048-DAG: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_EQ_2048-DAG: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_EQ_2048: mul [[RES:z[0-9]+]].s, [[PG]]/m, [[OP1]].s, [[OP2]].s
				; VBITS_EQ_2048: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_EQ_2048: ret
				%op1 = load <64 x i32>, <64 x i32>* %a
				%op2 = load <64 x i32>, <64 x i32>* %b
				%insert = insertelement <64 x i64> undef, i64 32, i64 0
				%splat = shufflevector <64 x i64> %insert, <64 x i64> undef, <64 x i32> zeroinitializer
				%1 = sext <64 x i32> %op1 to <64 x i64>
				%2 = sext <64 x i32> %op2 to <64 x i64>
				%mul = mul <64 x i64> %1, %2
				%shr = lshr <64 x i64> %mul, %splat
				%res = trunc <64 x i64> %shr to <64 x i32>
				store <64 x i32> %res, <64 x i32>* %a
				ret void
				}

				; Don't use SVE for 64-bit vectors.
				define <1 x i64> @smulh_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				paulwalker-arm: As there's no NEON instruction for i64-based vectors, I'm wondering if it's worth using SVE for this case as well, much like we do for ISD::MUL.
				; CHECK-LABEL: smulh_v1i64:
				; CHECK: fmov x8, d0
				; CHECK: fmov x9, d1
				; CHECK: smulh x8, x8, x9
				; CHECK: fmov d0, x8
				; CHECK: ret
				%insert = insertelement <1 x i128> undef, i128 64, i128 0
				%splat = shufflevector <1 x i128> %insert, <1 x i128> undef, <1 x i32> zeroinitializer
				%1 = sext <1 x i64> %op1 to <1 x i128>
				%2 = sext <1 x i64> %op2 to <1 x i128>
				%mul = mul <1 x i128> %1, %2
				%shr = lshr <1 x i128> %mul, %splat
				%res = trunc <1 x i128> %shr to <1 x i64>
				ret <1 x i64> %res
				}

				; Don't use SVE for 64-bit vectors.
				define <2 x i64> @smulh_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: smulh_v2i64:
				; CHECK: mov x8, v0.d[1]
				; CHECK: fmov x9, d0
				; CHECK: mov x10, v1.d[1]
				; CHECK: fmov x11, d1
				; CHECK: smulh x9, x9, x11
				; CHECK: smulh x8, x8, x10
				; CHECK: fmov d0, x9
				; CHECK: fmov d1, x8
				; CHECK: ret
				%insert = insertelement <2 x i128> undef, i128 64, i128 0
				%splat = shufflevector <2 x i128> %insert, <2 x i128> undef, <2 x i32> zeroinitializer
				%1 = sext <2 x i64> %op1 to <2 x i128>
				%2 = sext <2 x i64> %op2 to <2 x i128>
				%mul = mul <2 x i128> %1, %2
				%shr = lshr <2 x i128> %mul, %splat
				%res = trunc <2 x i128> %shr to <2 x i64>
				ret <2 x i64> %res
				}

				define void @smulh_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: smulh_v4i64:
				; VBITS_EQ_256: ptrue [[PG:p[0-9]+]].d, vl[[#min(VBYTES,4)]]
				; VBITS_EQ_256-DAG: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_EQ_256-DAG: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_EQ_256: smulh [[RES:z[0-9]+]].d, [[PG]]/m, [[OP1]].d, [[OP2]].d
				; VBITS_EQ_256: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_EQ_256: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%insert = insertelement <4 x i128> undef, i128 64, i128 0
				%splat = shufflevector <4 x i128> %insert, <4 x i128> undef, <4 x i32> zeroinitializer
				%1 = sext <4 x i64> %op1 to <4 x i128>
				%2 = sext <4 x i64> %op2 to <4 x i128>
				%mul = mul <4 x i128> %1, %2
				%shr = lshr <4 x i128> %mul, %splat
				%res = trunc <4 x i128> %shr to <4 x i64>
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				define void @smulh_v8i64(<8 x i64>* %a, <8 x i64>* %b) #0 {
				; CHECK-LABEL: smulh_v8i64:
				; VBITS_EQ_512: ptrue [[PG:p[0-9]+]].d, vl[[#min(VBYTES,8)]]
				; VBITS_EQ_512-DAG: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_EQ_512-DAG: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_EQ_512: smulh [[RES:z[0-9]+]].d, [[PG]]/m, [[OP1]].d, [[OP2]].d
				; VBITS_EQ_512: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_EQ_512: ret
				%op1 = load <8 x i64>, <8 x i64>* %a
				%op2 = load <8 x i64>, <8 x i64>* %b
				%insert = insertelement <8 x i128> undef, i128 64, i128 0
				%splat = shufflevector <8 x i128> %insert, <8 x i128> undef, <8 x i32> zeroinitializer
				%1 = sext <8 x i64> %op1 to <8 x i128>
				%2 = sext <8 x i64> %op2 to <8 x i128>
				%mul = mul <8 x i128> %1, %2
				%shr = lshr <8 x i128> %mul, %splat
				%res = trunc <8 x i128> %shr to <8 x i64>
				store <8 x i64> %res, <8 x i64>* %a
				ret void
				}

				define void @smulh_v16i64(<16 x i64>* %a, <16 x i64>* %b) #0 {
				; CHECK-LABEL: smulh_v16i64:
				; VBITS_EQ_1024: ptrue [[PG:p[0-9]+]].d, vl[[#min(VBYTES,16)]]
				; VBITS_EQ_1024-DAG: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_EQ_1024-DAG: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_EQ_1024: smulh [[RES:z[0-9]+]].d, [[PG]]/m, [[OP1]].d, [[OP2]].d
				; VBITS_EQ_1024: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_EQ_1024: ret
				%op1 = load <16 x i64>, <16 x i64>* %a
				%op2 = load <16 x i64>, <16 x i64>* %b
				%insert = insertelement <16 x i128> undef, i128 64, i128 0
				%splat = shufflevector <16 x i128> %insert, <16 x i128> undef, <16 x i32> zeroinitializer
				%1 = sext <16 x i64> %op1 to <16 x i128>
				%2 = sext <16 x i64> %op2 to <16 x i128>
				%mul = mul <16 x i128> %1, %2
				%shr = lshr <16 x i128> %mul, %splat
				%res = trunc <16 x i128> %shr to <16 x i64>
				store <16 x i64> %res, <16 x i64>* %a
				ret void
				}

				define void @smulh_v32i64(<32 x i64>* %a, <32 x i64>* %b) #0 {
				; CHECK-LABEL: smulh_v32i64:
				; VBITS_EQ_2048: ptrue [[PG:p[0-9]+]].d, vl[[#min(VBYTES,32)]]
				; VBITS_EQ_2048-DAG: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_EQ_2048-DAG: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_EQ_2048: mul [[RES:z[0-9]+]].d, [[PG]]/m, [[OP1]].d, [[OP2]].d
				; VBITS_EQ_2048: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_EQ_2048: ret
				%op1 = load <32 x i64>, <32 x i64>* %a
				%op2 = load <32 x i64>, <32 x i64>* %b
				%insert = insertelement <32 x i128> undef, i128 64, i128 0
				%splat = shufflevector <32 x i128> %insert, <32 x i128> undef, <32 x i32> zeroinitializer
				%1 = sext <32 x i64> %op1 to <32 x i128>
				%2 = sext <32 x i64> %op2 to <32 x i128>
				%mul = mul <32 x i128> %1, %2
				%shr = lshr <32 x i128> %mul, %splat
				%res = trunc <32 x i128> %shr to <32 x i64>
				store <32 x i64> %res, <32 x i64>* %a
				ret void
				}

				;
				; UMULH
				;

				; Don't use SVE for 64-bit vectors.
				define <8 x i8> @umulh_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: umulh_v8i8:
				; CHECK: umull v0.8h, v0.8b, v1.8b
				; CHECK: ushr v1.8h, v0.8h, #8
				; CHECK: umov w8, v1.h[0]
				; CHECK: fmov s0, w8
				; CHECK: umov w8, v1.h[1]
				; CHECK: mov v0.b[1], w8
				; CHECK: umov w8, v1.h[2]
				; CHECK: mov v0.b[2], w8
				; CHECK: umov w8, v1.h[3]
				; CHECK: mov v0.b[3], w8
				; CHECK: ret
				%insert = insertelement <8 x i16> undef, i16 8, i64 0
				%splat = shufflevector <8 x i16> %insert, <8 x i16> undef, <8 x i32> zeroinitializer
				%1 = zext <8 x i8> %op1 to <8 x i16>
				%2 = zext <8 x i8> %op2 to <8 x i16>
				%mul = mul <8 x i16> %1, %2
				%shr = lshr <8 x i16> %mul, %splat
				%res = trunc <8 x i16> %shr to <8 x i8>
				ret <8 x i8> %res
				}

				; Don't use SVE for 128-bit vectors.
				define <16 x i8> @umulh_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: umulh_v16i8:
				; CHECK: umull2 v2.8h, v0.16b, v1.16b
				; CHECK: umull v0.8h, v0.8b, v1.8b
				; CHECK: uzp2 v0.16b, v0.16b, v2.16b
				; CHECK: ret
				%insert = insertelement <16 x i16> undef, i16 8, i64 0
				%splat = shufflevector <16 x i16> %insert, <16 x i16> undef, <16 x i32> zeroinitializer
				%1 = zext <16 x i8> %op1 to <16 x i16>
				%2 = zext <16 x i8> %op2 to <16 x i16>
				%mul = mul <16 x i16> %1, %2
				%shr = lshr <16 x i16> %mul, %splat
				%res = trunc <16 x i16> %shr to <16 x i8>
				ret <16 x i8> %res
				}

				define void @umulh_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: umulh_v32i8:
				; VBITS_EQ_256: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,32)]]
				; VBITS_EQ_256-DAG: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_EQ_256-DAG: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_EQ_256: umulh [[RES:z[0-9]+]].b, [[PG]]/m, [[OP1]].b, [[OP2]].b
				; VBITS_EQ_256: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_EQ_256: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%insert = insertelement <32 x i16> undef, i16 8, i64 0
				%splat = shufflevector <32 x i16> %insert, <32 x i16> undef, <32 x i32> zeroinitializer
				%1 = zext <32 x i8> %op1 to <32 x i16>
				%2 = zext <32 x i8> %op2 to <32 x i16>
				%mul = mul <32 x i16> %1, %2
				%shr = lshr <32 x i16> %mul, %splat
				%res = trunc <32 x i16> %shr to <32 x i8>
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define void @umulh_v64i8(<64 x i8>* %a, <64 x i8>* %b) #0 {
				; CHECK-LABEL: umulh_v64i8:
				; VBITS_EQ_512: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,64)]]
				; VBITS_EQ_512-DAG: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_EQ_512-DAG: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_EQ_512: umulh [[RES:z[0-9]+]].b, [[PG]]/m, [[OP1]].b, [[OP2]].b
				; VBITS_EQ_512: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_EQ_512: ret
				%op1 = load <64 x i8>, <64 x i8>* %a
				%op2 = load <64 x i8>, <64 x i8>* %b
				%insert = insertelement <64 x i16> undef, i16 8, i64 0
				%splat = shufflevector <64 x i16> %insert, <64 x i16> undef, <64 x i32> zeroinitializer
				%1 = zext <64 x i8> %op1 to <64 x i16>
				%2 = zext <64 x i8> %op2 to <64 x i16>
				%mul = mul <64 x i16> %1, %2
				%shr = lshr <64 x i16> %mul, %splat
				%res = trunc <64 x i16> %shr to <64 x i8>
				store <64 x i8> %res, <64 x i8>* %a
				ret void
				}

				define void @umulh_v128i8(<128 x i8>* %a, <128 x i8>* %b) #0 {
				; CHECK-LABEL: umulh_v128i8:
				; VBITS_EQ_1024: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,128)]]
				; VBITS_EQ_1024-DAG: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_EQ_1024-DAG: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_EQ_1024: umulh [[RES:z[0-9]+]].b, [[PG]]/m, [[OP1]].b, [[OP2]].b
				; VBITS_EQ_1024: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_EQ_1024: ret
				%op1 = load <128 x i8>, <128 x i8>* %a
				%op2 = load <128 x i8>, <128 x i8>* %b
				%insert = insertelement <128 x i16> undef, i16 8, i64 0
				%splat = shufflevector <128 x i16> %insert, <128 x i16> undef, <128 x i32> zeroinitializer
				%1 = zext <128 x i8> %op1 to <128 x i16>
				%2 = zext <128 x i8> %op2 to <128 x i16>
				%mul = mul <128 x i16> %1, %2
				%shr = lshr <128 x i16> %mul, %splat
				%res = trunc <128 x i16> %shr to <128 x i8>
				store <128 x i8> %res, <128 x i8>* %a
				ret void
				}

				define void @umulh_v256i8(<256 x i8>* %a, <256 x i8>* %b) #0 {
				; CHECK-LABEL: umulh_v256i8:
				; VBITS_EQ_2048: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,256)]]
				; VBITS_EQ_2048-DAG: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_EQ_2048-DAG: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_EQ_2048: mul [[RES:z[0-9]+]].b, [[PG]]/m, [[OP1]].b, [[OP2]].b
				; VBITS_EQ_2048: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_EQ_2048: ret
				%op1 = load <256 x i8>, <256 x i8>* %a
				%op2 = load <256 x i8>, <256 x i8>* %b
				%insert = insertelement <256 x i16> undef, i16 8, i64 0
				%splat = shufflevector <256 x i16> %insert, <256 x i16> undef, <256 x i32> zeroinitializer
				%1 = zext <256 x i8> %op1 to <256 x i16>
				%2 = zext <256 x i8> %op2 to <256 x i16>
				%mul = mul <256 x i16> %1, %2
				%shr = lshr <256 x i16> %mul, %splat
				%res = trunc <256 x i16> %shr to <256 x i8>
				store <256 x i8> %res, <256 x i8>* %a
				ret void
				}

				; Don't use SVE for 64-bit vectors.
				define <4 x i16> @umulh_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: umulh_v4i16:
				; CHECK: umull v0.4s, v0.4h, v1.4h
				; CHECK: ushr v0.4s, v0.4s, #16
				; CHECK: mov w8, v0.s[1]
				; CHECK: mov w9, v0.s[2]
				; CHECK: mov w10, v0.s[3]
				; CHECK: mov v0.h[1], w8
				; CHECK: mov v0.h[2], w9
				; CHECK: mov v0.h[3], w10
				; CHECK: ret
				%insert = insertelement <4 x i32> undef, i32 16, i64 0
				%splat = shufflevector <4 x i32> %insert, <4 x i32> undef, <4 x i32> zeroinitializer
				%1 = zext <4 x i16> %op1 to <4 x i32>
				%2 = zext <4 x i16> %op2 to <4 x i32>
				%mul = mul <4 x i32> %1, %2
				%shr = lshr <4 x i32> %mul, %splat
				%res = trunc <4 x i32> %shr to <4 x i16>
				ret <4 x i16> %res
				}

				; Don't use SVE for 128-bit vectors.
				define <8 x i16> @umulh_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: umulh_v8i16:
				; CHECK: umull2 v2.4s, v0.8h, v1.8h
				; CHECK: umull v0.4s, v0.4h, v1.4h
				; CHECK: uzp2 v0.8h, v0.8h, v2.8h
				; CHECK: ret
				%insert = insertelement <8 x i32> undef, i32 16, i64 0
				%splat = shufflevector <8 x i32> %insert, <8 x i32> undef, <8 x i32> zeroinitializer
				%1 = zext <8 x i16> %op1 to <8 x i32>
				%2 = zext <8 x i16> %op2 to <8 x i32>
				%mul = mul <8 x i32> %1, %2
				%shr = lshr <8 x i32> %mul, %splat
				%res = trunc <8 x i32> %shr to <8 x i16>
				ret <8 x i16> %res
				}

				define void @umulh_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: umulh_v16i16:
				; VBITS_EQ_256: ptrue [[PG:p[0-9]+]].h, vl[[#min(VBYTES,16)]]
				; VBITS_EQ_256-DAG: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_EQ_256-DAG: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_EQ_256: umulh [[RES:z[0-9]+]].h, [[PG]]/m, [[OP1]].h, [[OP2]].h
				; VBITS_EQ_256: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_EQ_256: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%insert = insertelement <16 x i32> undef, i32 16, i64 0
				%splat = shufflevector <16 x i32> %insert, <16 x i32> undef, <16 x i32> zeroinitializer
				%1 = zext <16 x i16> %op1 to <16 x i32>
				%2 = zext <16 x i16> %op2 to <16 x i32>
				%mul = mul <16 x i32> %1, %2
				%shr = lshr <16 x i32> %mul, %splat
				%res = trunc <16 x i32> %shr to <16 x i16>
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define void @umulh_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; CHECK-LABEL: umulh_v32i16:
				; VBITS_EQ_512: ptrue [[PG:p[0-9]+]].h, vl[[#min(VBYTES,32)]]
				; VBITS_EQ_512-DAG: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_EQ_512-DAG: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_EQ_512: umulh [[RES:z[0-9]+]].h, [[PG]]/m, [[OP1]].h, [[OP2]].h
				; VBITS_EQ_512: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_EQ_512: ret
				%op1 = load <32 x i16>, <32 x i16>* %a
				%op2 = load <32 x i16>, <32 x i16>* %b
				%insert = insertelement <32 x i32> undef, i32 16, i64 0
				%splat = shufflevector <32 x i32> %insert, <32 x i32> undef, <32 x i32> zeroinitializer
				%1 = zext <32 x i16> %op1 to <32 x i32>
				%2 = zext <32 x i16> %op2 to <32 x i32>
				%mul = mul <32 x i32> %1, %2
				%shr = lshr <32 x i32> %mul, %splat
				%res = trunc <32 x i32> %shr to <32 x i16>
				store <32 x i16> %res, <32 x i16>* %a
				ret void
				}

				define void @umulh_v64i16(<64 x i16>* %a, <64 x i16>* %b) #0 {
				; CHECK-LABEL: umulh_v64i16:
				; VBITS_EQ_1024: ptrue [[PG:p[0-9]+]].h, vl[[#min(VBYTES,64)]]
				; VBITS_EQ_1024-DAG: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_EQ_1024-DAG: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_EQ_1024: umulh [[RES:z[0-9]+]].h, [[PG]]/m, [[OP1]].h, [[OP2]].h
				; VBITS_EQ_1024: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_EQ_1024: ret
				%op1 = load <64 x i16>, <64 x i16>* %a
				%op2 = load <64 x i16>, <64 x i16>* %b
				%insert = insertelement <64 x i32> undef, i32 16, i64 0
				%splat = shufflevector <64 x i32> %insert, <64 x i32> undef, <64 x i32> zeroinitializer
				%1 = zext <64 x i16> %op1 to <64 x i32>
				%2 = zext <64 x i16> %op2 to <64 x i32>
				%mul = mul <64 x i32> %1, %2
				%shr = lshr <64 x i32> %mul, %splat
				%res = trunc <64 x i32> %shr to <64 x i16>
				store <64 x i16> %res, <64 x i16>* %a
				ret void
				}

				define void @umulh_v128i16(<128 x i16>* %a, <128 x i16>* %b) #0 {
				; CHECK-LABEL: umulh_v128i16:
				; VBITS_EQ_2048: ptrue [[PG:p[0-9]+]].h, vl[[#min(VBYTES,128)]]
				; VBITS_EQ_2048-DAG: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_EQ_2048-DAG: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_EQ_2048: mul [[RES:z[0-9]+]].h, [[PG]]/m, [[OP1]].h, [[OP2]].h
				; VBITS_EQ_2048: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_EQ_2048: ret
				%op1 = load <128 x i16>, <128 x i16>* %a
				%op2 = load <128 x i16>, <128 x i16>* %b
				%insert = insertelement <128 x i32> undef, i32 16, i64 0
				%splat = shufflevector <128 x i32> %insert, <128 x i32> undef, <128 x i32> zeroinitializer
				%1 = zext <128 x i16> %op1 to <128 x i32>
				%2 = zext <128 x i16> %op2 to <128 x i32>
				%mul = mul <128 x i32> %1, %2
				%shr = lshr <128 x i32> %mul, %splat
				%res = trunc <128 x i32> %shr to <128 x i16>
				store <128 x i16> %res, <128 x i16>* %a
				ret void
				}

				; Vector i64 multiplications are not legal for NEON so use SVE when available.
				define <2 x i32> @umulh_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: umulh_v2i32:
				; CHECK: ushll v0.2d, v0.2s, #0
				; CHECK: ushll v1.2d, v1.2s, #0
				; CHECK: ptrue p0.d, vl2
				; CHECK: mul z0.d, p0/m, z0.d, z1.d
				; CHECK: shrn v0.2s, v0.2d, #32
				; CHECK: ret
				%insert = insertelement <2 x i64> undef, i64 32, i64 0
				%splat = shufflevector <2 x i64> %insert, <2 x i64> undef, <2 x i32> zeroinitializer
				%1 = zext <2 x i32> %op1 to <2 x i64>
				%2 = zext <2 x i32> %op2 to <2 x i64>
				paulwalker-arm: Same comment as smulh_v4i32.
				%mul = mul <2 x i64> %1, %2
				%shr = lshr <2 x i64> %mul, %splat
				%res = trunc <2 x i64> %shr to <2 x i32>
				ret <2 x i32> %res
				}

				; Vector i64 multiplications are not legal for NEON so use SVE when available.
				define <4 x i32> @umulh_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: umulh_v4i32:
				; CHECK: umull2 v2.2d, v0.4s, v1.4s
				; CHECK: umull v0.2d, v0.2s, v1.2s
				; CHECK: uzp2 v0.4s, v0.4s, v2.4s
				; CHECK: ret
				%insert = insertelement <4 x i64> undef, i64 32, i64 0
				%splat = shufflevector <4 x i64> %insert, <4 x i64> undef, <4 x i32> zeroinitializer
				%1 = zext <4 x i32> %op1 to <4 x i64>
				%2 = zext <4 x i32> %op2 to <4 x i64>
				%mul = mul <4 x i64> %1, %2
				%shr = lshr <4 x i64> %mul, %splat
				%res = trunc <4 x i64> %shr to <4 x i32>
				ret <4 x i32> %res
				}

				define void @umulh_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: umulh_v8i32:
				; VBITS_EQ_256: ptrue [[PG:p[0-9]+]].s, vl[[#min(VBYTES,8)]]
				; VBITS_EQ_256-DAG: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_EQ_256-DAG: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_EQ_256: umulh [[RES:z[0-9]+]].s, [[PG]]/m, [[OP1]].s, [[OP2]].s
				; VBITS_EQ_256: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_EQ_256: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%insert = insertelement <8 x i64> undef, i64 32, i64 0
				%splat = shufflevector <8 x i64> %insert, <8 x i64> undef, <8 x i32> zeroinitializer
				%1 = zext <8 x i32> %op1 to <8 x i64>
				%2 = zext <8 x i32> %op2 to <8 x i64>
				%mul = mul <8 x i64> %1, %2
				%shr = lshr <8 x i64> %mul, %splat
				%res = trunc <8 x i64> %shr to <8 x i32>
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define void @umulh_v16i32(<16 x i32>* %a, <16 x i32>* %b) #0 {
				; CHECK-LABEL: umulh_v16i32:
				; VBITS_EQ_512: ptrue [[PG:p[0-9]+]].s, vl[[#min(VBYTES,16)]]
				; VBITS_EQ_512-DAG: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_EQ_512-DAG: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_EQ_512: umulh [[RES:z[0-9]+]].s, [[PG]]/m, [[OP1]].s, [[OP2]].s
				; VBITS_EQ_512: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_EQ_512: ret
				%op1 = load <16 x i32>, <16 x i32>* %a
				%op2 = load <16 x i32>, <16 x i32>* %b
				%insert = insertelement <16 x i64> undef, i64 32, i64 0
				%splat = shufflevector <16 x i64> %insert, <16 x i64> undef, <16 x i32> zeroinitializer
				%1 = zext <16 x i32> %op1 to <16 x i64>
				%2 = zext <16 x i32> %op2 to <16 x i64>
				%mul = mul <16 x i64> %1, %2
				%shr = lshr <16 x i64> %mul, %splat
				%res = trunc <16 x i64> %shr to <16 x i32>
				store <16 x i32> %res, <16 x i32>* %a
				ret void
				}

				define void @umulh_v32i32(<32 x i32>* %a, <32 x i32>* %b) #0 {
				; CHECK-LABEL: umulh_v32i32:
				; VBITS_EQ_1024: ptrue [[PG:p[0-9]+]].s, vl[[#min(VBYTES,32)]]
				; VBITS_EQ_1024-DAG: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_EQ_1024-DAG: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_EQ_1024: umulh [[RES:z[0-9]+]].s, [[PG]]/m, [[OP1]].s, [[OP2]].s
				; VBITS_EQ_1024: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_EQ_1024: ret
				%op1 = load <32 x i32>, <32 x i32>* %a
				%op2 = load <32 x i32>, <32 x i32>* %b
				%insert = insertelement <32 x i64> undef, i64 32, i64 0
				%splat = shufflevector <32 x i64> %insert, <32 x i64> undef, <32 x i32> zeroinitializer
				%1 = zext <32 x i32> %op1 to <32 x i64>
				%2 = zext <32 x i32> %op2 to <32 x i64>
				%mul = mul <32 x i64> %1, %2
				%shr = lshr <32 x i64> %mul, %splat
				%res = trunc <32 x i64> %shr to <32 x i32>
				store <32 x i32> %res, <32 x i32>* %a
				ret void
				}

				define void @umulh_v64i32(<64 x i32>* %a, <64 x i32>* %b) #0 {
				; CHECK-LABEL: umulh_v64i32:
				; VBITS_EQ_2048: ptrue [[PG:p[0-9]+]].s, vl[[#min(VBYTES,64)]]
				; VBITS_EQ_2048-DAG: ld1w { [[OP1:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_EQ_2048-DAG: ld1w { [[OP2:z[0-9]+]].s }, [[PG]]/z, [x1]
				; VBITS_EQ_2048: mul [[RES:z[0-9]+]].s, [[PG]]/m, [[OP1]].s, [[OP2]].s
				; VBITS_EQ_2048: st1w { [[RES]].s }, [[PG]], [x0]
				; VBITS_EQ_2048: ret
				%op1 = load <64 x i32>, <64 x i32>* %a
				%op2 = load <64 x i32>, <64 x i32>* %b
				%insert = insertelement <64 x i64> undef, i64 32, i64 0
				%splat = shufflevector <64 x i64> %insert, <64 x i64> undef, <64 x i32> zeroinitializer
				%1 = zext <64 x i32> %op1 to <64 x i64>
				%2 = zext <64 x i32> %op2 to <64 x i64>
				%mul = mul <64 x i64> %1, %2
				%shr = lshr <64 x i64> %mul, %splat
				%res = trunc <64 x i64> %shr to <64 x i32>
				store <64 x i32> %res, <64 x i32>* %a
				ret void
				}

				; Don't use SVE for 64-bit vectors.
				define <1 x i64> @umulh_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				; CHECK-LABEL: umulh_v1i64:
				; CHECK: fmov x8, d0
				; CHECK: fmov x9, d1
				; CHECK: umulh x8, x8, x9
				; CHECK: fmov d0, x8
				; CHECK: ret
				%insert = insertelement <1 x i128> undef, i128 64, i128 0
				%splat = shufflevector <1 x i128> %insert, <1 x i128> undef, <1 x i32> zeroinitializer
				%1 = zext <1 x i64> %op1 to <1 x i128>
				%2 = zext <1 x i64> %op2 to <1 x i128>
				%mul = mul <1 x i128> %1, %2
				%shr = lshr <1 x i128> %mul, %splat
				%res = trunc <1 x i128> %shr to <1 x i64>
				ret <1 x i64> %res
				}

				; Don't use SVE for 64-bit vectors.
				define <2 x i64> @umulh_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: umulh_v2i64:
				; CHECK: mov x8, v0.d[1]
				; CHECK: fmov x9, d0
				; CHECK: mov x10, v1.d[1]
				; CHECK: fmov x11, d1
				; CHECK: umulh x9, x9, x11
				; CHECK: umulh x8, x8, x10
				; CHECK: fmov d0, x9
				; CHECK: fmov d1, x8
				; CHECK: ret
				%insert = insertelement <2 x i128> undef, i128 64, i128 0
				%splat = shufflevector <2 x i128> %insert, <2 x i128> undef, <2 x i32> zeroinitializer
				%1 = zext <2 x i64> %op1 to <2 x i128>
				%2 = zext <2 x i64> %op2 to <2 x i128>
				%mul = mul <2 x i128> %1, %2
				%shr = lshr <2 x i128> %mul, %splat
				%res = trunc <2 x i128> %shr to <2 x i64>
				ret <2 x i64> %res
				}

				define void @umulh_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: umulh_v4i64:
				; VBITS_EQ_256: ptrue [[PG:p[0-9]+]].d, vl[[#min(VBYTES,4)]]
				; VBITS_EQ_256-DAG: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_EQ_256-DAG: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_EQ_256: umulh [[RES:z[0-9]+]].d, [[PG]]/m, [[OP1]].d, [[OP2]].d
				; VBITS_EQ_256: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_EQ_256: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%insert = insertelement <4 x i128> undef, i128 64, i128 0
				%splat = shufflevector <4 x i128> %insert, <4 x i128> undef, <4 x i32> zeroinitializer
				%1 = zext <4 x i64> %op1 to <4 x i128>
				%2 = zext <4 x i64> %op2 to <4 x i128>
				%mul = mul <4 x i128> %1, %2
				%shr = lshr <4 x i128> %mul, %splat
				%res = trunc <4 x i128> %shr to <4 x i64>
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				define void @umulh_v8i64(<8 x i64>* %a, <8 x i64>* %b) #0 {
				; CHECK-LABEL: umulh_v8i64:
				; VBITS_EQ_512: ptrue [[PG:p[0-9]+]].d, vl[[#min(VBYTES,8)]]
				; VBITS_EQ_512-DAG: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_EQ_512-DAG: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_EQ_512: umulh [[RES:z[0-9]+]].d, [[PG]]/m, [[OP1]].d, [[OP2]].d
				; VBITS_EQ_512: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_EQ_512: ret
				%op1 = load <8 x i64>, <8 x i64>* %a
				%op2 = load <8 x i64>, <8 x i64>* %b
				%insert = insertelement <8 x i128> undef, i128 64, i128 0
				%splat = shufflevector <8 x i128> %insert, <8 x i128> undef, <8 x i32> zeroinitializer
				%1 = zext <8 x i64> %op1 to <8 x i128>
				%2 = zext <8 x i64> %op2 to <8 x i128>
				%mul = mul <8 x i128> %1, %2
				%shr = lshr <8 x i128> %mul, %splat
				%res = trunc <8 x i128> %shr to <8 x i64>
				store <8 x i64> %res, <8 x i64>* %a
				ret void
				}

				define void @umulh_v16i64(<16 x i64>* %a, <16 x i64>* %b) #0 {
				; CHECK-LABEL: umulh_v16i64:
				; VBITS_EQ_1024: ptrue [[PG:p[0-9]+]].d, vl[[#min(VBYTES,16)]]
				; VBITS_EQ_1024-DAG: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_EQ_1024-DAG: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_EQ_1024: umulh [[RES:z[0-9]+]].d, [[PG]]/m, [[OP1]].d, [[OP2]].d
				; VBITS_EQ_1024: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_EQ_1024: ret
				%op1 = load <16 x i64>, <16 x i64>* %a
				%op2 = load <16 x i64>, <16 x i64>* %b
				%insert = insertelement <16 x i128> undef, i128 64, i128 0
				%splat = shufflevector <16 x i128> %insert, <16 x i128> undef, <16 x i32> zeroinitializer
				%1 = zext <16 x i64> %op1 to <16 x i128>
				%2 = zext <16 x i64> %op2 to <16 x i128>
				%mul = mul <16 x i128> %1, %2
				%shr = lshr <16 x i128> %mul, %splat
				%res = trunc <16 x i128> %shr to <16 x i64>
				store <16 x i64> %res, <16 x i64>* %a
				ret void
				}

				define void @umulh_v32i64(<32 x i64>* %a, <32 x i64>* %b) #0 {
				; CHECK-LABEL: umulh_v32i64:
				; VBITS_EQ_2048: ptrue [[PG:p[0-9]+]].d, vl[[#min(VBYTES,32)]]
				; VBITS_EQ_2048-DAG: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_EQ_2048-DAG: ld1d { [[OP2:z[0-9]+]].d }, [[PG]]/z, [x1]
				; VBITS_EQ_2048: umulh [[RES:z[0-9]+]].d, [[PG]]/m, [[OP1]].d, [[OP2]].d
				; VBITS_EQ_2048: st1d { [[RES]].d }, [[PG]], [x0]
				; VBITS_EQ_2048: ret
				%op1 = load <32 x i64>, <32 x i64>* %a
				%op2 = load <32 x i64>, <32 x i64>* %b
				%insert = insertelement <32 x i128> undef, i128 64, i128 0
				%splat = shufflevector <32 x i128> %insert, <32 x i128> undef, <32 x i32> zeroinitializer
				%1 = zext <32 x i64> %op1 to <32 x i128>
				%2 = zext <32 x i64> %op2 to <32 x i128>
				%mul = mul <32 x i128> %1, %2
				%shr = lshr <32 x i128> %mul, %splat
				%res = trunc <32 x i128> %shr to <32 x i64>
				store <32 x i64> %res, <32 x i64>* %a
				ret void
				}
				attributes #0 = { "target-features"="+sve" }

llvm/test/CodeGen/AArch64/sve-int-mulh-pred.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s

				;
				; SMULH
				;

				define <vscale x 16 x i8> @smulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
				; CHECK-LABEL: smulh_i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: smulh z0.b, p0/m, z0.b, z1.b
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 16 x i16> undef, i16 8, i64 0
				%splat = shufflevector <vscale x 16 x i16> %insert, <vscale x 16 x i16> undef, <vscale x 16 x i32> zeroinitializer
				%1 = sext <vscale x 16 x i8> %a to <vscale x 16 x i16>
				%2 = sext <vscale x 16 x i8> %b to <vscale x 16 x i16>
				%mul = mul <vscale x 16 x i16> %1, %2
				%shr = lshr <vscale x 16 x i16> %mul, %splat
				%tr = trunc <vscale x 16 x i16> %shr to <vscale x 16 x i8>
				ret <vscale x 16 x i8> %tr
				}

				define <vscale x 8 x i16> @smulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
				; CHECK-LABEL: smulh_i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: smulh z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 8 x i32> undef, i32 16, i64 0
				%splat = shufflevector <vscale x 8 x i32> %insert, <vscale x 8 x i32> undef, <vscale x 8 x i32> zeroinitializer
				%1 = sext <vscale x 8 x i16> %a to <vscale x 8 x i32>
				%2 = sext <vscale x 8 x i16> %b to <vscale x 8 x i32>
				%mul = mul <vscale x 8 x i32> %1, %2
				%shr = lshr <vscale x 8 x i32> %mul, %splat
				%tr = trunc <vscale x 8 x i32> %shr to <vscale x 8 x i16>
				ret <vscale x 8 x i16> %tr
				}

				define <vscale x 4 x i32> @smulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
				; CHECK-LABEL: smulh_i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: smulh z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 4 x i64> undef, i64 32, i64 0
				%splat = shufflevector <vscale x 4 x i64> %insert, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%1 = sext <vscale x 4 x i32> %a to <vscale x 4 x i64>
				%2 = sext <vscale x 4 x i32> %b to <vscale x 4 x i64>
				%mul = mul <vscale x 4 x i64> %1, %2
				%shr = lshr <vscale x 4 x i64> %mul, %splat
				%tr = trunc <vscale x 4 x i64> %shr to <vscale x 4 x i32>
				ret <vscale x 4 x i32> %tr
				}

				define <vscale x 2 x i64> @smulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
				; CHECK-LABEL: smulh_i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: smulh z0.d, p0/m, z0.d, z1.d
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 2 x i128> undef, i128 64, i64 0
				%splat = shufflevector <vscale x 2 x i128> %insert, <vscale x 2 x i128> undef, <vscale x 2 x i32> zeroinitializer
				%1 = sext <vscale x 2 x i64> %a to <vscale x 2 x i128>
				%2 = sext <vscale x 2 x i64> %b to <vscale x 2 x i128>
				%mul = mul <vscale x 2 x i128> %1, %2
				%shr = lshr <vscale x 2 x i128> %mul, %splat
				%tr = trunc <vscale x 2 x i128> %shr to <vscale x 2 x i64>
				ret <vscale x 2 x i64> %tr
				}

				;
				; UMULH
				;

				define <vscale x 16 x i8> @umulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
				; CHECK-LABEL: umulh_i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: umulh z0.b, p0/m, z0.b, z1.b
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 16 x i16> undef, i16 8, i64 0
				%splat = shufflevector <vscale x 16 x i16> %insert, <vscale x 16 x i16> undef, <vscale x 16 x i32> zeroinitializer
				%1 = zext <vscale x 16 x i8> %a to <vscale x 16 x i16>
				%2 = zext <vscale x 16 x i8> %b to <vscale x 16 x i16>
				%mul = mul <vscale x 16 x i16> %1, %2
				%shr = lshr <vscale x 16 x i16> %mul, %splat
				%tr = trunc <vscale x 16 x i16> %shr to <vscale x 16 x i8>
				ret <vscale x 16 x i8> %tr
				}

				define <vscale x 8 x i16> @umulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
				; CHECK-LABEL: umulh_i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: umulh z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 8 x i32> undef, i32 16, i64 0
				%splat = shufflevector <vscale x 8 x i32> %insert, <vscale x 8 x i32> undef, <vscale x 8 x i32> zeroinitializer
				%1 = zext <vscale x 8 x i16> %a to <vscale x 8 x i32>
				%2 = zext <vscale x 8 x i16> %b to <vscale x 8 x i32>
				%mul = mul <vscale x 8 x i32> %1, %2
				%shr = lshr <vscale x 8 x i32> %mul, %splat
				%tr = trunc <vscale x 8 x i32> %shr to <vscale x 8 x i16>
				ret <vscale x 8 x i16> %tr
				}

				define <vscale x 4 x i32> @umulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
				; CHECK-LABEL: umulh_i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: umulh z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 4 x i64> undef, i64 32, i64 0
				%splat = shufflevector <vscale x 4 x i64> %insert, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%1 = zext <vscale x 4 x i32> %a to <vscale x 4 x i64>
				%2 = zext <vscale x 4 x i32> %b to <vscale x 4 x i64>
				%mul = mul <vscale x 4 x i64> %1, %2
				%shr = lshr <vscale x 4 x i64> %mul, %splat
				%tr = trunc <vscale x 4 x i64> %shr to <vscale x 4 x i32>
				ret <vscale x 4 x i32> %tr
				}

				define <vscale x 2 x i64> @umulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
				; CHECK-LABEL: umulh_i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: umulh z0.d, p0/m, z0.d, z1.d
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 2 x i128> undef, i128 64, i64 0
				%splat = shufflevector <vscale x 2 x i128> %insert, <vscale x 2 x i128> undef, <vscale x 2 x i32> zeroinitializer
				%1 = zext <vscale x 2 x i64> %a to <vscale x 2 x i128>
				%2 = zext <vscale x 2 x i64> %b to <vscale x 2 x i128>
				%mul = mul <vscale x 2 x i128> %1, %2
				%shr = lshr <vscale x 2 x i128> %mul, %splat
				%tr = trunc <vscale x 2 x i128> %shr to <vscale x 2 x i64>
				ret <vscale x 2 x i64> %tr
				}

				attributes #0 = { "target-features"="+sve" }

llvm/test/CodeGen/AArch64/sve2-int-mulh.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=aarch64-linux-gnu < %s | FileCheck %s

				;
				; SMULH
				;

				define <vscale x 16 x i8> @smulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
				; CHECK-LABEL: smulh_i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: smulh z0.b, z0.b, z1.b
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 16 x i16> undef, i16 8, i64 0
				%splat = shufflevector <vscale x 16 x i16> %insert, <vscale x 16 x i16> undef, <vscale x 16 x i32> zeroinitializer
				%1 = sext <vscale x 16 x i8> %a to <vscale x 16 x i16>
				%2 = sext <vscale x 16 x i8> %b to <vscale x 16 x i16>
				%mul = mul <vscale x 16 x i16> %1, %2
				%shr = lshr <vscale x 16 x i16> %mul, %splat
				%tr = trunc <vscale x 16 x i16> %shr to <vscale x 16 x i8>
				ret <vscale x 16 x i8> %tr
				}

				define <vscale x 8 x i16> @smulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
				; CHECK-LABEL: smulh_i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: smulh z0.h, z0.h, z1.h
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 8 x i32> undef, i32 16, i64 0
				%splat = shufflevector <vscale x 8 x i32> %insert, <vscale x 8 x i32> undef, <vscale x 8 x i32> zeroinitializer
				%1 = sext <vscale x 8 x i16> %a to <vscale x 8 x i32>
				%2 = sext <vscale x 8 x i16> %b to <vscale x 8 x i32>
				%mul = mul <vscale x 8 x i32> %1, %2
				%shr = lshr <vscale x 8 x i32> %mul, %splat
				%tr = trunc <vscale x 8 x i32> %shr to <vscale x 8 x i16>
				ret <vscale x 8 x i16> %tr
				}

				define <vscale x 4 x i32> @smulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
				; CHECK-LABEL: smulh_i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: smulh z0.s, z0.s, z1.s
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 4 x i64> undef, i64 32, i64 0
				%splat = shufflevector <vscale x 4 x i64> %insert, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%1 = sext <vscale x 4 x i32> %a to <vscale x 4 x i64>
				%2 = sext <vscale x 4 x i32> %b to <vscale x 4 x i64>
				%mul = mul <vscale x 4 x i64> %1, %2
				%shr = lshr <vscale x 4 x i64> %mul, %splat
				%tr = trunc <vscale x 4 x i64> %shr to <vscale x 4 x i32>
				ret <vscale x 4 x i32> %tr
				}

				define <vscale x 2 x i64> @smulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
				; CHECK-LABEL: smulh_i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: smulh z0.d, z0.d, z1.d
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 2 x i128> undef, i128 64, i64 0
				%splat = shufflevector <vscale x 2 x i128> %insert, <vscale x 2 x i128> undef, <vscale x 2 x i32> zeroinitializer
				%1 = sext <vscale x 2 x i64> %a to <vscale x 2 x i128>
				%2 = sext <vscale x 2 x i64> %b to <vscale x 2 x i128>
				%mul = mul <vscale x 2 x i128> %1, %2
				%shr = lshr <vscale x 2 x i128> %mul, %splat
				%tr = trunc <vscale x 2 x i128> %shr to <vscale x 2 x i64>
				ret <vscale x 2 x i64> %tr
				}

				;
				; UMULH
				;

				define <vscale x 16 x i8> @umulh_i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
				; CHECK-LABEL: umulh_i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: umulh z0.b, z0.b, z1.b
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 16 x i16> undef, i16 8, i64 0
				%splat = shufflevector <vscale x 16 x i16> %insert, <vscale x 16 x i16> undef, <vscale x 16 x i32> zeroinitializer
				%1 = zext <vscale x 16 x i8> %a to <vscale x 16 x i16>
				%2 = zext <vscale x 16 x i8> %b to <vscale x 16 x i16>
				%mul = mul <vscale x 16 x i16> %1, %2
				%shr = lshr <vscale x 16 x i16> %mul, %splat
				%tr = trunc <vscale x 16 x i16> %shr to <vscale x 16 x i8>
				ret <vscale x 16 x i8> %tr
				}

				define <vscale x 8 x i16> @umulh_i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
				; CHECK-LABEL: umulh_i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: umulh z0.h, z0.h, z1.h
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 8 x i32> undef, i32 16, i64 0
				%splat = shufflevector <vscale x 8 x i32> %insert, <vscale x 8 x i32> undef, <vscale x 8 x i32> zeroinitializer
				%1 = zext <vscale x 8 x i16> %a to <vscale x 8 x i32>
				%2 = zext <vscale x 8 x i16> %b to <vscale x 8 x i32>
				%mul = mul <vscale x 8 x i32> %1, %2
				%shr = lshr <vscale x 8 x i32> %mul, %splat
				%tr = trunc <vscale x 8 x i32> %shr to <vscale x 8 x i16>
				ret <vscale x 8 x i16> %tr
				}

				define <vscale x 4 x i32> @umulh_i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
				; CHECK-LABEL: umulh_i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: umulh z0.s, z0.s, z1.s
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 4 x i64> undef, i64 32, i64 0
				%splat = shufflevector <vscale x 4 x i64> %insert, <vscale x 4 x i64> undef, <vscale x 4 x i32> zeroinitializer
				%1 = zext <vscale x 4 x i32> %a to <vscale x 4 x i64>
				%2 = zext <vscale x 4 x i32> %b to <vscale x 4 x i64>
				%mul = mul <vscale x 4 x i64> %1, %2
				%shr = lshr <vscale x 4 x i64> %mul, %splat
				%tr = trunc <vscale x 4 x i64> %shr to <vscale x 4 x i32>
				ret <vscale x 4 x i32> %tr
				}

				define <vscale x 2 x i64> @umulh_i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
				; CHECK-LABEL: umulh_i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: umulh z0.d, z0.d, z1.d
				; CHECK-NEXT: ret
				%insert = insertelement <vscale x 2 x i128> undef, i128 64, i64 0
				%splat = shufflevector <vscale x 2 x i128> %insert, <vscale x 2 x i128> undef, <vscale x 2 x i32> zeroinitializer
				%1 = zext <vscale x 2 x i64> %a to <vscale x 2 x i128>
				%2 = zext <vscale x 2 x i64> %b to <vscale x 2 x i128>
				%mul = mul <vscale x 2 x i128> %1, %2
				%shr = lshr <vscale x 2 x i128> %mul, %splat
				%tr = trunc <vscale x 2 x i128> %shr to <vscale x 2 x i64>
				ret <vscale x 2 x i64> %tr
				}

				attributes #0 = { "target-features"="+sve2" }
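
The tests in these files all build the same IR idiom: extend both operands to twice the element width (`zext` for the unsigned case, `sext` for the signed case), multiply, `lshr` by the element width, and `trunc` back. As a rough illustration only (not part of the patch), the scalar model below mirrors that sequence for a single element. The helper name `mulh` and its parameters are hypothetical. It also suggests why the signed tests can use `lshr` rather than `ashr`: the bits that differ between the two shifts are discarded by the truncation.

```python
def mulh(a: int, b: int, bits: int, signed: bool) -> int:
    """Model of the IR idiom for one element (hypothetical helper).

    Returns the high `bits` bits of the 2*bits-wide product as an
    unsigned bit pattern.
    """
    mask = (1 << bits) - 1
    wide_mask = (1 << (2 * bits)) - 1
    a &= mask
    b &= mask
    if signed:
        # sext: reinterpret the bit pattern as a two's complement value
        if a >> (bits - 1):
            a -= 1 << bits
        if b >> (bits - 1):
            b -= 1 << bits
    # mul in the 2*bits-wide type (wraps modulo 2**(2*bits))
    prod = (a * b) & wide_mask
    # lshr by the element width, then trunc to the element width
    return (prod >> bits) & mask


if __name__ == "__main__":
    print(hex(mulh(0xFF, 0xFF, 8, signed=False)))  # unsigned 255 * 255
    print(hex(mulh(0xFF, 0xFF, 8, signed=True)))   # signed -1 * -1
    print(hex(mulh(0x80, 0x02, 8, signed=True)))   # signed -128 * 2
```

For 8-bit elements, the unsigned product 255 * 255 is 0xFE01, so the high half is 0xFE, while the signed product -1 * -1 is 0x0001, so the high half is 0.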