This is an archive of the discontinued LLVM Phabricator instance.


Differential D118802

[AArch64][CodeGen] Always use SVE (when enabled) to lower 64-bit vector multiplies
Closed, Public

Authored by david-arm on Feb 2 2022, 8:28 AM.

Details

Reviewers

sdesmalen
paulwalker-arm
peterwaller-arm
kmclaughlin

Commits

rGeabae1b01756: [AArch64][CodeGen] Always use SVE (when enabled) to lower 64-bit vector…

Summary

This patch adds custom lowering support for ISD::MUL with v1i64 and v2i64
types when SVE is enabled, regardless of the minimum SVE vector length. We
do this because NEON has no 64-bit vector multiply instructions, whereas SVE
does, so we want to use the SVE instructions whenever they are available.

I've updated the 128-bit min SVE vector bits tests here:

CodeGen/AArch64/sve-fixed-length-int-arith.ll
CodeGen/AArch64/sve-fixed-length-int-mulh.ll
CodeGen/AArch64/sve-fixed-length-int-rem.ll

Diff Detail

Unit Tests: Failed

Time	Test
60,090 ms	x64 debian > AddressSanitizer-x86_64-linux.TestCases::scariness_score_test.cpp
60,230 ms	x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics::vloxseg.c
60,170 ms	x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics::vluxseg.c
60,190 ms	x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-overloaded::vloxseg.c
60,310 ms	x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics-overloaded::vluxseg.c

(11 tests failed in total.)

Event Timeline

david-arm created this revision. · Feb 2 2022, 8:28 AM

Herald added subscribers: ctetreau, hiraditya, kristof.beyls, tschuett. · Feb 2 2022, 8:28 AM

david-arm requested review of this revision. · Feb 2 2022, 8:28 AM

Herald added a project: Restricted Project. · Feb 2 2022, 8:28 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B147155: Diff 405290. · Feb 2 2022, 10:34 AM

sdesmalen added inline comments. · Feb 3 2022, 2:00 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
3951–3952	I find this quite confusing, because I would expect `useSVEForFixedLengthVectorVT` to return `true` when OverrideNEON is specified and SVE is enabled. Personally I'd rather see you reintroduce the enum you originally added in https://reviews.llvm.org/D117871?id=401931#inline-1127609 so that we can progressively migrate away from `OnlyIfSVEGreaterThan128Bits` and replace it with `Always`, at which point we can do away with the function altogether. At the very least it will be easier to search for cases to fix. @paulwalker-arm do you have any strong opinions here?
llvm/test/CodeGen/AArch64/sve-fixed-length-int-mulh.ll
214	These check lines seem unnecessary because the output is the same. I wonder if they can be removed, or otherwise have a CHECK-NEON prefix, where the first RUN line has `--check-prefixes=CHECK,CHECK-NEON` and the latter has `--check-prefixes=CHECK-NEON,VBITS_EQ_128`?

paulwalker-arm added inline comments. · Feb 3 2022, 5:22 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
3951–3952	I'm still against having such an enum. Originally `useSVEForFixedLengthVectorVT` served the specific purpose of returning true when a VT required SVE, or rather SVE with a known minimum length, in order to be considered legal. It gets used during operation legalisation as a way to categorise types because `isScalableVector()`/`isFixedLengthVector()` is not sufficient and listing the VTs explicitly doesn't work because of the variable nature of the minimum SVE vector length. The addition of `OverrideNEON` has always been a bit of a hack, because it gives the impression `useSVEForFixedLengthVectorVT`'s result depends on some underlying operation, which it does not. I accepted this because it did provide a little more protection during initial bring up. After D117871 that protection has gone, which is fine because it had outlived its usefulness. To my mind this now means `useSVEForFixedLengthVectorVT` no longer has any need for its `OverrideNEON` parameter and thus it can be completely removed. That said, `useSVEForFixedLengthVectorVT` is convenient because, as you say, it makes the code easier to read and easier to search for the places where SVE is used for fixed length vectors. So my counter proposal is to keep `OverrideNEON` as a bool but change how it's used within `useSVEForFixedLengthVectorVT` to match https://reviews.llvm.org/D117871?id=401931#inline-1127609 (i.e. so it's checked before the `Subtarget->useSVEForFixedLengthVectors` ban hammer). The extra restriction I'm going to impose is that the `setOperationAction` calls should be protected by the same check used when lowering the operations. If this is agreeable, please can you make the `useSVEForFixedLengthVectors` change in isolation as an NFC patch?
Note that you'll need to make a similar change to `LowerToScalableOp` as you did for `LowerToPredicatedOp`, which is fine given the functions have very similar jobs. Then there's `isVectorLoadExtDesirable`, which is going to get a little messier as it's now the only case that only wants to override NEON when wide vectors are enabled. We can probably clean this up later if we decide to use SVE for NEON loads and stores as well.

paulwalker-arm added inline comments. · Feb 3 2022, 5:42 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
3951–3952	I fear I may have missed a key piece of information regarding my NFC patch suggestion. The part of the `useSVEForFixedLengthVectorVT` functionality you want to change over time is the way the NEON override is controlled via a runtime switch. So my suggestion is to mark those places explicitly using `useSVEForFixedLengthVectors`. That way, over time we'll just remove those calls.

david-arm added inline comments. · Feb 3 2022, 8:05 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
3951–3952	Hi @paulwalker-arm, sorry, I've read your comments but I'll be honest: I'm still not really sure what you're asking me to do here. Do you still want me to create another patch that changes `useSVEForFixedLengthVectorVT` to always use SVE if `OverrideNEON` is true? That will be a large patch because it causes lots of test failures. Or are you asking me to simply use `useSVEForFixedLengthVectors` instead of `useSVEForFixedLengthVectorVT` in this patch?

david-arm added inline comments. · Feb 3 2022, 8:12 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
3951–3952	Ah, maybe what you mean is remove the call to `useSVEForFixedLengthVectors` from `useSVEForFixedLengthVectorVT`, then add it explicitly in the place where we call `useSVEForFixedLengthVectorVT`, i.e. transform:

    if (ST->useSVEForFixedLengthVectorVT())
      LowerToPredicatedOp();

to:

    if (ST->useSVEForFixedLengthVectors() && ST->useSVEForFixedLengthVectorVT())
      LowerToPredicatedOp();
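
The transformation described in this comment can be illustrated with a small self-contained model. This is only a sketch under stated assumptions, not LLVM code: `ToySubtarget`, `typeFitsSVE`, and the 256-bit threshold used for the global switch are hypothetical stand-ins chosen for illustration. It shows that hoisting the global check out of the per-type predicate and into the call site leaves the overall decision unchanged, which is what an NFC patch would need to guarantee.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical stand-in for the subtarget; not the real AArch64Subtarget.
struct ToySubtarget {
  bool HasSVE;
  unsigned MinSVEBits;

  // Assumed global switch: use SVE for fixed-length vectors only when the
  // guaranteed register width exceeds NEON's 128 bits.
  bool useSVEForFixedLengthVectors() const {
    return HasSVE && MinSVEBits >= 256;
  }
};

// Hypothetical per-type rule: the type is wider than NEON and fits within
// the guaranteed SVE register width.
inline bool typeFitsSVE(const ToySubtarget &ST, unsigned VTBits) {
  return VTBits > 128 && VTBits <= ST.MinSVEBits;
}

// Before: the per-type predicate consults the global switch itself.
inline bool useSVEForVTBefore(const ToySubtarget &ST, unsigned VTBits) {
  if (!ST.useSVEForFixedLengthVectors())
    return false;
  return typeFitsSVE(ST, VTBits);
}

// After: the predicate only looks at the type...
inline bool useSVEForVTAfter(const ToySubtarget &ST, unsigned VTBits) {
  return typeFitsSVE(ST, VTBits);
}

// ...and the call site spells out the global switch explicitly.
inline bool callSiteAfter(const ToySubtarget &ST, unsigned VTBits) {
  return ST.useSVEForFixedLengthVectors() && useSVEForVTAfter(ST, VTBits);
}

// Returns true when both forms agree for every tested configuration.
inline bool refactorIsNFC() {
  for (bool HasSVE : {false, true})
    for (unsigned MinBits : {128u, 256u, 512u, 2048u})
      for (unsigned VTBits : {64u, 128u, 256u, 512u, 1024u, 2048u}) {
        ToySubtarget ST{HasSVE, MinBits};
        if (useSVEForVTBefore(ST, VTBits) != callSiteAfter(ST, VTBits))
          return false;
      }
  return true;
}
```

Making the global check visible at each call site is what later allows individual call sites to drop it one at a time.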

david-arm mentioned this in D118917: [AArch64][NFC] Remove call to useSVEForFixedLengthVectors in useSVEForFixedLengthVectorVT. · Feb 3 2022, 8:58 AM

Rebased off D118917

david-arm added a parent revision: D118917: [AArch64][NFC] Remove call to useSVEForFixedLengthVectors in useSVEForFixedLengthVectorVT. · Feb 3 2022, 9:07 AM

Harbormaster completed remote builds in B147423: Diff 405676. · Feb 3 2022, 10:24 AM

paulwalker-arm added inline comments. · Feb 3 2022, 5:23 PM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
3951–3952	Sorry for the confusion Dave. I've created D118957 to show what I meant. If this works for you then feel free to either take the changes into D118917 or just rebase onto D118957 and I'll land my patch once accepted.

david-arm edited parent revisions, added: D118957: [NFC][SVE] Change useSVEForFixedLengthVectorVT to allow unconditional SVE usage for NEON sized vectors.; removed: D118917: [AArch64][NFC] Remove call to useSVEForFixedLengthVectors in useSVEForFixedLengthVectorVT. · Feb 4 2022, 12:55 AM

david-arm updated this revision to Diff 405885. · Feb 4 2022, 12:58 AM

Harbormaster completed remote builds in B147574: Diff 405885. · Feb 4 2022, 1:47 AM

For sure the test CHECK line structure will need to be revisited but that's likely better done once some of the interdependencies like MULH are finished and we can be sure the code quality for >=128bit vectors is consistently good across all target vector lengths.

This revision is now accepted and ready to land. · Feb 7 2022, 9:43 AM

This revision was landed with ongoing or failed builds. · Feb 8 2022, 7:38 AM

Closed by commit rGeabae1b01756: [AArch64][CodeGen] Always use SVE (when enabled) to lower 64-bit vector… (authored by david-arm).

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rGeabae1b01756: [AArch64][CodeGen] Always use SVE (when enabled) to lower 64-bit vector….

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	9 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-int-arith.ll	19 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-int-mulh.ll	150 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-int-rem.ll	30 lines

Diff 405290

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[... 1,332 lines elided; enclosing scope: if (Subtarget->hasSVE()) { ...]

     // NEON doesn't support integer divides, but SVE does
     for (auto VT : {MVT::v8i8, MVT::v16i8, MVT::v4i16, MVT::v8i16, MVT::v2i32,
                     MVT::v4i32, MVT::v1i64, MVT::v2i64}) {
       setOperationAction(ISD::SDIV, VT, Custom);
       setOperationAction(ISD::UDIV, VT, Custom);
     }

+    // NEON doesn't support 64-bit vector integer muls, but SVE does.
+    setOperationAction(ISD::MUL, MVT::v1i64, Custom);
+    setOperationAction(ISD::MUL, MVT::v2i64, Custom);
+
     // NOTE: Currently this has to happen after computeRegisterProperties rather
     // than the preferred option of combining it with the addRegisterClass call.
     if (Subtarget->useSVEForFixedLengthVectors()) {
       for (MVT VT : MVT::integer_fixedlen_vector_valuetypes())
         if (useSVEForFixedLengthVectorVT(VT))
           addTypeForFixedLengthSVE(VT);
       for (MVT VT : MVT::fp_fixedlen_vector_valuetypes())
         if (useSVEForFixedLengthVectorVT(VT))
[... 10 lines elided; enclosing scope: if (Subtarget->useSVEForFixedLengthVectors()) { ...]
       for (auto VT : {MVT::v8f16, MVT::v4f32})
         setOperationAction(ISD::FP_ROUND, VT, Custom);

       // These operations are not supported on NEON but SVE can do them.
       setOperationAction(ISD::BITREVERSE, MVT::v1i64, Custom);
       setOperationAction(ISD::CTLZ, MVT::v1i64, Custom);
       setOperationAction(ISD::CTLZ, MVT::v2i64, Custom);
       setOperationAction(ISD::CTTZ, MVT::v1i64, Custom);
-      setOperationAction(ISD::MUL, MVT::v1i64, Custom);
-      setOperationAction(ISD::MUL, MVT::v2i64, Custom);
       setOperationAction(ISD::MULHS, MVT::v1i64, Custom);
       setOperationAction(ISD::MULHS, MVT::v2i64, Custom);
       setOperationAction(ISD::MULHU, MVT::v1i64, Custom);
       setOperationAction(ISD::MULHU, MVT::v2i64, Custom);
       setOperationAction(ISD::SMAX, MVT::v1i64, Custom);
       setOperationAction(ISD::SMAX, MVT::v2i64, Custom);
       setOperationAction(ISD::SMIN, MVT::v1i64, Custom);
       setOperationAction(ISD::SMIN, MVT::v2i64, Custom);
[... 2,564 lines elided ...]
 }

 SDValue AArch64TargetLowering::LowerMUL(SDValue Op, SelectionDAG &DAG) const {
   EVT VT = Op.getValueType();

   // If SVE is available then i64 vector multiplications can also be made legal.
   bool OverrideNEON = VT == MVT::v2i64 || VT == MVT::v1i64;

-  if (VT.isScalableVector() || useSVEForFixedLengthVectorVT(VT, OverrideNEON))
+  if (VT.isScalableVector() || (OverrideNEON && Subtarget->hasSVE()) ||
+      useSVEForFixedLengthVectorVT(VT, false))
     return LowerToPredicatedOp(Op, DAG, AArch64ISD::MUL_PRED);

   // Multiplications are only custom-lowered for 128-bit vectors so that
   // VMULL can be detected. Otherwise v2i64 multiplications are not legal.
   assert(VT.is128BitVector() && VT.isInteger() &&
          "unexpected type for custom-lowering ISD::MUL");
   SDNode *N0 = Op.getOperand(0).getNode();
   SDNode *N1 = Op.getOperand(1).getNode();
[... 16,042 lines elided ...]
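
The new condition in `LowerMUL` can be read as a three-way disjunction. The following is a minimal standalone sketch of that decision, assuming a simplified type model: `ToyVT`, `ToySubtarget`, `needsSVEForWidth`, and the `MulLowering` enum are hypothetical names introduced here, and the width rule inside `needsSVEForWidth` is an assumption about what the per-type predicate checks, not a copy of the LLVM implementation.

```cpp
#include <cassert>

// Hypothetical, simplified vector type description.
struct ToyVT {
  unsigned NumElts;
  unsigned EltBits;
  bool Scalable;
  unsigned bits() const { return NumElts * EltBits; }
};

// Hypothetical subtarget description.
struct ToySubtarget {
  bool HasSVE;
  unsigned MinSVEBits;
};

// Assumed rule: a fixed-length type needs SVE when it is wider than NEON's
// 128 bits and fits in the guaranteed minimum SVE register width.
inline bool needsSVEForWidth(const ToySubtarget &ST, const ToyVT &VT) {
  return ST.HasSVE && ST.MinSVEBits >= 256 && !VT.Scalable &&
         VT.bits() > 128 && VT.bits() <= ST.MinSVEBits;
}

enum class MulLowering { SVEPredicated, NEONPath };

// Mirrors the shape of the patched condition: scalable types, v1i64/v2i64
// with any SVE, or wide fixed-length types all take the predicated SVE path.
inline MulLowering lowerMul(const ToySubtarget &ST, const ToyVT &VT) {
  bool OverrideNEON =
      !VT.Scalable && VT.EltBits == 64 && (VT.NumElts == 1 || VT.NumElts == 2);
  if (VT.Scalable || (OverrideNEON && ST.HasSVE) || needsSVEForWidth(ST, VT))
    return MulLowering::SVEPredicated;
  return MulLowering::NEONPath;
}
```

In this model a `v2i64` multiply takes the SVE path even when the minimum SVE width is only 128 bits, while a `v4i32` multiply under the same subtarget stays on the NEON path.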

llvm/test/CodeGen/AArch64/sve-fixed-length-int-arith.ll

-; RUN: llc -aarch64-sve-vector-bits-min=128 < %s | FileCheck %s -D#VBYTES=16 -check-prefix=NO_SVE
+; RUN: llc -aarch64-sve-vector-bits-min=128 < %s | FileCheck %s -D#VBYTES=16 -check-prefix=VBITS_EQ_128
 ; RUN: llc -aarch64-sve-vector-bits-min=256 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK,VBITS_LE_1024,VBITS_LE_512,VBITS_LE_256
 ; RUN: llc -aarch64-sve-vector-bits-min=384 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK,VBITS_LE_1024,VBITS_LE_512,VBITS_LE_256
 ; RUN: llc -aarch64-sve-vector-bits-min=512 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_LE_1024,VBITS_LE_512
 ; RUN: llc -aarch64-sve-vector-bits-min=640 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_LE_1024,VBITS_LE_512
 ; RUN: llc -aarch64-sve-vector-bits-min=768 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_LE_1024,VBITS_LE_512
 ; RUN: llc -aarch64-sve-vector-bits-min=896 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_LE_1024,VBITS_LE_512
 ; RUN: llc -aarch64-sve-vector-bits-min=1024 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_LE_1024
 ; RUN: llc -aarch64-sve-vector-bits-min=1152 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_LE_1024
 ; RUN: llc -aarch64-sve-vector-bits-min=1280 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_LE_1024
 ; RUN: llc -aarch64-sve-vector-bits-min=1408 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_LE_1024
 ; RUN: llc -aarch64-sve-vector-bits-min=1536 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_LE_1024
 ; RUN: llc -aarch64-sve-vector-bits-min=1664 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_LE_1024
 ; RUN: llc -aarch64-sve-vector-bits-min=1792 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_LE_1024
 ; RUN: llc -aarch64-sve-vector-bits-min=1920 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_LE_1024
 ; RUN: llc -aarch64-sve-vector-bits-min=2048 < %s | FileCheck %s -D#VBYTES=256 -check-prefixes=CHECK

 ; VBYTES represents the useful byte size of a vector register from the code
 ; generator's point of view. It is clamped to power-of-2 values because
 ; only power-of-2 vector lengths are considered legal, regardless of the
 ; user specified vector length.

 target triple = "aarch64-unknown-linux-gnu"

-; Don't use SVE when its registers are no bigger than NEON.
-; NO_SVE-NOT: ptrue
-
 ;
 ; ADD
 ;

 ; Don't use SVE for 64-bit vectors.
 define <8 x i8> @add_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
 ; CHECK-LABEL: add_v8i8:
 ; CHECK: add v0.8b, v0.8b, v1.8b
[... 616 lines elided ...]
 ; CHECK: ret
   %op1 = load <64 x i32>, <64 x i32>* %a
   %op2 = load <64 x i32>, <64 x i32>* %b
   %res = mul <64 x i32> %op1, %op2
   store <64 x i32> %res, <64 x i32>* %a
   ret void
 }

-; Vector i64 multiplications are not legal for NEON so use SVE when available.
 define <1 x i64> @mul_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
 ; CHECK-LABEL: mul_v1i64:
 ; CHECK: ptrue [[PG:p[0-9]+]].d, vl1
 ; CHECK: mul z0.d, [[PG]]/m, z0.d, z1.d
 ; CHECK: ret
+
+; VBITS_EQ_128-LABEL: mul_v1i64:
+; VBITS_EQ_128: ptrue p0.d, vl1
+; VBITS_EQ_128: mul z0.d, p0/m, z0.d, z1.d
+; VBITS_EQ_128: ret
+
   %res = mul <1 x i64> %op1, %op2
   ret <1 x i64> %res
 }

-; Vector i64 multiplications are not legal for NEON so use SVE when available.
 define <2 x i64> @mul_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
 ; CHECK-LABEL: mul_v2i64:
 ; CHECK: ptrue [[PG:p[0-9]+]].d, vl2
 ; CHECK: mul z0.d, [[PG]]/m, z0.d, z1.d
 ; CHECK: ret
+
+; VBITS_EQ_128-LABEL: mul_v2i64:
+; VBITS_EQ_128: ptrue p0.d, vl2
+; VBITS_EQ_128: mul z0.d, p0/m, z0.d, z1.d
+; VBITS_EQ_128: ret
+
   %res = mul <2 x i64> %op1, %op2
   ret <2 x i64> %res
 }

 define void @mul_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
 ; CHECK-LABEL: mul_v4i64:
 ; CHECK: ptrue [[PG:p[0-9]+]].d, vl[[#min(div(VBYTES,8),4)]]
 ; CHECK-DAG: ld1d { [[OP1:z[0-9]+]].d }, [[PG]]/z, [x0]
[... 683 lines elided ...]
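
The relationship between the `-aarch64-sve-vector-bits-min` values, the `VBYTES` definitions on the RUN lines, and the `vl[[#min(div(VBYTES,8),4)]]` expression in `mul_v4i64` can be worked through in a few lines. This is a sketch only: `floorPow2`, `vbytesForMinBits`, and `ptrueVL` are hypothetical helper names, and the code simply restates the arithmetic described in the test's `VBYTES` comment rather than anything taken from the compiler.

```cpp
#include <algorithm>
#include <cassert>

// Largest power of two that is <= Bits (Bits must be non-zero).
inline unsigned floorPow2(unsigned Bits) {
  unsigned P = 1;
  while (P * 2 <= Bits)
    P *= 2;
  return P;
}

// VBYTES as the RUN lines define it: the minimum vector length in bits,
// clamped down to a power of two, expressed in bytes.
inline unsigned vbytesForMinBits(unsigned MinBits) {
  return floorPow2(MinBits) / 8;
}

// Element count used for the ptrue pattern: the number of elements that fit
// in one register, capped at the number of elements in the fixed-length type.
// This matches the shape of min(div(VBYTES, EltBytes), NumElts).
inline unsigned ptrueVL(unsigned VBytes, unsigned EltBytes, unsigned NumElts) {
  return std::min(VBytes / EltBytes, NumElts);
}
```

For example, a minimum of 384 bits clamps to 256 bits, so `VBYTES` is 32 and a `<4 x i64>` operation uses `vl4`.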

llvm/test/CodeGen/AArch64/sve-fixed-length-int-mulh.ll

-; RUN: llc -aarch64-sve-vector-bits-min=128 < %s | FileCheck %s -D#VBYTES=16 -check-prefix=NO_SVE
+; RUN: llc -aarch64-sve-vector-bits-min=128 < %s | FileCheck %s -D#VBYTES=16 -check-prefix=VBITS_EQ_128
 ; RUN: llc -aarch64-sve-vector-bits-min=256 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK
 ; RUN: llc -aarch64-sve-vector-bits-min=384 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK
 ; RUN: llc -aarch64-sve-vector-bits-min=512 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
 ; RUN: llc -aarch64-sve-vector-bits-min=640 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
 ; RUN: llc -aarch64-sve-vector-bits-min=768 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
 ; RUN: llc -aarch64-sve-vector-bits-min=896 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
 ; RUN: llc -aarch64-sve-vector-bits-min=1024 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
 ; RUN: llc -aarch64-sve-vector-bits-min=1152 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
[... 10 lines elided ...]
 ; only power-of-2 vector lengths are considered legal, regardless of the
 ; user specified vector length.

 ; This test only tests the legal types for a given vector width, as mulh nodes
 ; do not get generated for non-legal types.

 target triple = "aarch64-unknown-linux-gnu"

-; Don't use SVE when its registers are no bigger than NEON.
-; NO_SVE-NOT: ptrue
-
 ;
 ; SMULH
 ;

-; Don't use SVE for 64-bit vectors.
+; FIXME: The codegen for the >=256 bits case can be improved.
 define <8 x i8> @smulh_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
 ; CHECK-LABEL: smulh_v8i8:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: smull v0.8h, v0.8b, v1.8b
 ; CHECK-NEXT: ushr v1.8h, v0.8h, #8
 ; CHECK-NEXT: umov w8, v1.h[0]
 ; CHECK-NEXT: umov w9, v1.h[1]
 ; CHECK-NEXT: fmov s0, w8
 ; CHECK-NEXT: umov w8, v1.h[2]
 ; CHECK-NEXT: mov v0.b[1], w9
 ; CHECK-NEXT: mov v0.b[2], w8
 ; CHECK-NEXT: umov w8, v1.h[3]
 ; CHECK-NEXT: mov v0.b[3], w8
 ; CHECK-NEXT: umov w8, v1.h[4]
 ; CHECK-NEXT: mov v0.b[4], w8
 ; CHECK-NEXT: umov w8, v1.h[5]
 ; CHECK-NEXT: mov v0.b[5], w8
 ; CHECK-NEXT: umov w8, v1.h[6]
 ; CHECK-NEXT: mov v0.b[6], w8
 ; CHECK-NEXT: umov w8, v1.h[7]
 ; CHECK-NEXT: mov v0.b[7], w8
 ; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0
 ; CHECK-NEXT: ret
+
+; VBITS_EQ_128-LABEL: smulh_v8i8:
+; VBITS_EQ_128: smull v0.8h, v0.8b, v1.8b
+; VBITS_EQ_128-NEXT: shrn v0.8b, v0.8h, #8
+; VBITS_EQ_128-NEXT: ret
+
   %insert = insertelement <8 x i16> undef, i16 8, i64 0
   %splat = shufflevector <8 x i16> %insert, <8 x i16> undef, <8 x i32> zeroinitializer
   %1 = sext <8 x i8> %op1 to <8 x i16>
   %2 = sext <8 x i8> %op2 to <8 x i16>
   %mul = mul <8 x i16> %1, %2
   %shr = lshr <8 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
   %res = trunc <8 x i16> %shr to <8 x i8>
   ret <8 x i8> %res
 }

-; Don't use SVE for 128-bit vectors.
 define <16 x i8> @smulh_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
 ; CHECK-LABEL: smulh_v16i8:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: smull2 v2.8h, v0.16b, v1.16b
 ; CHECK-NEXT: smull v0.8h, v0.8b, v1.8b
 ; CHECK-NEXT: uzp2 v0.16b, v0.16b, v2.16b
 ; CHECK-NEXT: ret
+
+; VBITS_EQ_128-LABEL: smulh_v16i8:
+; VBITS_EQ_128: smull2 v2.8h, v0.16b, v1.16b
+; VBITS_EQ_128-NEXT: smull v0.8h, v0.8b, v1.8b
+; VBITS_EQ_128-NEXT: uzp2 v0.16b, v0.16b, v2.16b
+; VBITS_EQ_128-NEXT: ret
+
   %1 = sext <16 x i8> %op1 to <16 x i16>
   %2 = sext <16 x i8> %op2 to <16 x i16>
   %mul = mul <16 x i16> %1, %2
   %shr = lshr <16 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
   %res = trunc <16 x i16> %shr to <16 x i8>
   ret <16 x i8> %res
 }

[... 75 lines not shown ...]
		; VBITS_GE_2048-NEXT: ret
%2 = sext <256 x i8> %op2 to <256 x i16>		%2 = sext <256 x i8> %op2 to <256 x i16>
%mul = mul <256 x i16> %1, %2		%mul = mul <256 x i16> %1, %2
%shr = lshr <256 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>		%shr = lshr <256 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 
8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%res = trunc <256 x i16> %shr to <256 x i8>		%res = trunc <256 x i16> %shr to <256 x i8>
store <256 x i8> %res, <256 x i8>* %a		store <256 x i8> %res, <256 x i8>* %a
ret void		ret void
}		}

; Don't use SVE for 64-bit vectors.		; FIXME: The codegen for the >=256 bits case can be improved.
define <4 x i16> @smulh_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {		define <4 x i16> @smulh_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
; CHECK-LABEL: smulh_v4i16:		; CHECK-LABEL: smulh_v4i16:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: smull v0.4s, v0.4h, v1.4h		; CHECK-NEXT: smull v0.4s, v0.4h, v1.4h
; CHECK-NEXT: ushr v1.4s, v0.4s, #16		; CHECK-NEXT: ushr v1.4s, v0.4s, #16
; CHECK-NEXT: mov w8, v1.s[1]		; CHECK-NEXT: mov w8, v1.s[1]
; CHECK-NEXT: mov w9, v1.s[2]		; CHECK-NEXT: mov w9, v1.s[2]
; CHECK-NEXT: mov v0.16b, v1.16b		; CHECK-NEXT: mov v0.16b, v1.16b
; CHECK-NEXT: mov v0.h[1], w8		; CHECK-NEXT: mov v0.h[1], w8
; CHECK-NEXT: mov w8, v1.s[3]		; CHECK-NEXT: mov w8, v1.s[3]
; CHECK-NEXT: mov v0.h[2], w9		; CHECK-NEXT: mov v0.h[2], w9
; CHECK-NEXT: mov v0.h[3], w8		; CHECK-NEXT: mov v0.h[3], w8
; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0		; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: smulh_v4i16:
		; VBITS_EQ_128: smull v0.4s, v0.4h, v1.4h
		; VBITS_EQ_128-NEXT: shrn v0.4h, v0.4s, #16
		; VBITS_EQ_128-NEXT: ret

%1 = sext <4 x i16> %op1 to <4 x i32>		%1 = sext <4 x i16> %op1 to <4 x i32>
%2 = sext <4 x i16> %op2 to <4 x i32>		%2 = sext <4 x i16> %op2 to <4 x i32>
%mul = mul <4 x i32> %1, %2		%mul = mul <4 x i32> %1, %2
%shr = lshr <4 x i32> %mul, <i32 16, i32 16, i32 16, i32 16>		%shr = lshr <4 x i32> %mul, <i32 16, i32 16, i32 16, i32 16>
%res = trunc <4 x i32> %shr to <4 x i16>		%res = trunc <4 x i32> %shr to <4 x i16>
ret <4 x i16> %res		ret <4 x i16> %res
}		}

; Don't use SVE for 128-bit vectors.
define <8 x i16> @smulh_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {		define <8 x i16> @smulh_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
; CHECK-LABEL: smulh_v8i16:		; CHECK-LABEL: smulh_v8i16:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: smull2 v2.4s, v0.8h, v1.8h		; CHECK-NEXT: smull2 v2.4s, v0.8h, v1.8h
; CHECK-NEXT: smull v0.4s, v0.4h, v1.4h		; CHECK-NEXT: smull v0.4s, v0.4h, v1.4h
; CHECK-NEXT: uzp2 v0.8h, v0.8h, v2.8h		; CHECK-NEXT: uzp2 v0.8h, v0.8h, v2.8h
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: smulh_v8i16:
		[Inline comment] sdesmalen: These check lines seem unnecessary because the output is the same. I wonder if they can be removed, or otherwise have a CHECK-NEON prefix, where the first RUN line has `--check-prefixes=CHECK,CHECK-NEON` and the latter has `--check-prefixes=CHECK-NEON,VBITS_EQ_128`?
		; VBITS_EQ_128: smull2 v2.4s, v0.8h, v1.8h
		; VBITS_EQ_128-NEXT: smull v0.4s, v0.4h, v1.4h
		; VBITS_EQ_128-NEXT: uzp2 v0.8h, v0.8h, v2.8h
		; VBITS_EQ_128-NEXT: ret

%1 = sext <8 x i16> %op1 to <8 x i32>		%1 = sext <8 x i16> %op1 to <8 x i32>
%2 = sext <8 x i16> %op2 to <8 x i32>		%2 = sext <8 x i16> %op2 to <8 x i32>
%mul = mul <8 x i32> %1, %2		%mul = mul <8 x i32> %1, %2
%shr = lshr <8 x i32> %mul, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>		%shr = lshr <8 x i32> %mul, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
%res = trunc <8 x i32> %shr to <8 x i16>		%res = trunc <8 x i32> %shr to <8 x i16>
ret <8 x i16> %res		ret <8 x i16> %res
}		}
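
The prefix arrangement suggested in the inline comment on smulh_v8i16 could look roughly like the excerpt below. This is only a sketch: the prefixes come from the reviewer's comment, but the llc options on the RUN lines are placeholders and are not copied from the test file. It also assumes both configurations emit the NEON sequence shown in the diff. The rest of the file would still carry CHECK and VBITS_EQ_128 lines for functions whose output differs.

```llvm
; Assumed RUN lines: the prefixes follow the reviewer's suggestion, the llc
; options are illustrative only.
; RUN: llc -aarch64-sve-vector-bits-min=256 < %s | FileCheck %s --check-prefixes=CHECK,CHECK-NEON
; RUN: llc -aarch64-sve-vector-bits-min=128 < %s | FileCheck %s --check-prefixes=CHECK-NEON,VBITS_EQ_128

target triple = "aarch64-unknown-linux-gnu"

; The expected output is checked once under the shared CHECK-NEON prefix
; instead of being duplicated under CHECK and VBITS_EQ_128.
define <8 x i16> @smulh_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
; CHECK-NEON-LABEL: smulh_v8i16:
; CHECK-NEON:       smull2 v2.4s, v0.8h, v1.8h
; CHECK-NEON-NEXT:  smull v0.4s, v0.4h, v1.4h
; CHECK-NEON-NEXT:  uzp2 v0.8h, v0.8h, v2.8h
; CHECK-NEON-NEXT:  ret
  %1 = sext <8 x i16> %op1 to <8 x i32>
  %2 = sext <8 x i16> %op2 to <8 x i32>
  %mul = mul <8 x i32> %1, %2
  %shr = lshr <8 x i32> %mul, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
  %res = trunc <8 x i32> %shr to <8 x i16>
  ret <8 x i16> %res
}

attributes #0 = { "target-features"="+sve" }
```

With this layout, a change to the NEON lowering would need updating in one place only. The cost is one extra prefix on each RUN line.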

[... 82 lines not shown ...]
; CHECK-LABEL: smulh_v2i32:		; CHECK-LABEL: smulh_v2i32:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: sshll v0.2d, v0.2s, #0		; CHECK-NEXT: sshll v0.2d, v0.2s, #0
; CHECK-NEXT: sshll v1.2d, v1.2s, #0		; CHECK-NEXT: sshll v1.2d, v1.2s, #0
; CHECK-NEXT: ptrue p0.d, vl2		; CHECK-NEXT: ptrue p0.d, vl2
; CHECK-NEXT: mul z0.d, p0/m, z0.d, z1.d		; CHECK-NEXT: mul z0.d, p0/m, z0.d, z1.d
; CHECK-NEXT: shrn v0.2s, v0.2d, #32		; CHECK-NEXT: shrn v0.2s, v0.2d, #32
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: smulh_v2i32:
		; VBITS_EQ_128: sshll v0.2d, v0.2s, #0
		; VBITS_EQ_128-NEXT: sshll v1.2d, v1.2s, #0
		; VBITS_EQ_128-NEXT: ptrue p0.d, vl2
		; VBITS_EQ_128-NEXT: mul z0.d, p0/m, z0.d, z1.d
		; VBITS_EQ_128-NEXT: shrn v0.2s, v0.2d, #32
		; VBITS_EQ_128-NEXT: ret

%1 = sext <2 x i32> %op1 to <2 x i64>		%1 = sext <2 x i32> %op1 to <2 x i64>
%2 = sext <2 x i32> %op2 to <2 x i64>		%2 = sext <2 x i32> %op2 to <2 x i64>
%mul = mul <2 x i64> %1, %2		%mul = mul <2 x i64> %1, %2
%shr = lshr <2 x i64> %mul, <i64 32, i64 32>		%shr = lshr <2 x i64> %mul, <i64 32, i64 32>
%res = trunc <2 x i64> %shr to <2 x i32>		%res = trunc <2 x i64> %shr to <2 x i32>
ret <2 x i32> %res		ret <2 x i32> %res
}		}

; Don't use SVE for 128-bit vectors.
define <4 x i32> @smulh_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {		define <4 x i32> @smulh_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
; CHECK-LABEL: smulh_v4i32:		; CHECK-LABEL: smulh_v4i32:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: smull2 v2.2d, v0.4s, v1.4s		; CHECK-NEXT: smull2 v2.2d, v0.4s, v1.4s
; CHECK-NEXT: smull v0.2d, v0.2s, v1.2s		; CHECK-NEXT: smull v0.2d, v0.2s, v1.2s
; CHECK-NEXT: uzp2 v0.4s, v0.4s, v2.4s		; CHECK-NEXT: uzp2 v0.4s, v0.4s, v2.4s
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: smulh_v4i32:
		; VBITS_EQ_128: smull2 v2.2d, v0.4s, v1.4s
		; VBITS_EQ_128-NEXT: smull v0.2d, v0.2s, v1.2s
		; VBITS_EQ_128-NEXT: uzp2 v0.4s, v0.4s, v2.4s
		; VBITS_EQ_128-NEXT: ret

%1 = sext <4 x i32> %op1 to <4 x i64>		%1 = sext <4 x i32> %op1 to <4 x i64>
%2 = sext <4 x i32> %op2 to <4 x i64>		%2 = sext <4 x i32> %op2 to <4 x i64>
%mul = mul <4 x i64> %1, %2		%mul = mul <4 x i64> %1, %2
%shr = lshr <4 x i64> %mul, <i64 32, i64 32, i64 32, i64 32>		%shr = lshr <4 x i64> %mul, <i64 32, i64 32, i64 32, i64 32>
%res = trunc <4 x i64> %shr to <4 x i32>		%res = trunc <4 x i64> %shr to <4 x i32>
ret <4 x i32> %res		ret <4 x i32> %res
}		}

[... 72 lines not shown ...]
		; VBITS_GE_2048-NEXT: ret
%2 = sext <64 x i32> %op2 to <64 x i64>		%2 = sext <64 x i32> %op2 to <64 x i64>
%mul = mul <64 x i64> %1, %2		%mul = mul <64 x i64> %1, %2
%shr = lshr <64 x i64> %mul, <i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32>		%shr = lshr <64 x i64> %mul, <i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32>
%res = trunc <64 x i64> %shr to <64 x i32>		%res = trunc <64 x i64> %shr to <64 x i32>
store <64 x i32> %res, <64 x i32>* %a		store <64 x i32> %res, <64 x i32>* %a
ret void		ret void
}		}

; Vector i64 multiplications are not legal for NEON so use SVE when available.
define <1 x i64> @smulh_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {		define <1 x i64> @smulh_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
; CHECK-LABEL: smulh_v1i64:		; CHECK-LABEL: smulh_v1i64:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0		; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
; CHECK-NEXT: ptrue p0.d, vl1		; CHECK-NEXT: ptrue p0.d, vl1
; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1		; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
; CHECK-NEXT: smulh z0.d, p0/m, z0.d, z1.d		; CHECK-NEXT: smulh z0.d, p0/m, z0.d, z1.d
; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0		; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: smulh_v1i64:
		; VBITS_EQ_128: fmov x8, d0
		; VBITS_EQ_128-NEXT: fmov x9, d1
		; VBITS_EQ_128-NEXT: smulh x8, x8, x9
		; VBITS_EQ_128-NEXT: fmov d0, x8
		; VBITS_EQ_128-NEXT: ret

%insert = insertelement <1 x i128> undef, i128 64, i128 0		%insert = insertelement <1 x i128> undef, i128 64, i128 0
%splat = shufflevector <1 x i128> %insert, <1 x i128> undef, <1 x i32> zeroinitializer		%splat = shufflevector <1 x i128> %insert, <1 x i128> undef, <1 x i32> zeroinitializer
%1 = sext <1 x i64> %op1 to <1 x i128>		%1 = sext <1 x i64> %op1 to <1 x i128>
%2 = sext <1 x i64> %op2 to <1 x i128>		%2 = sext <1 x i64> %op2 to <1 x i128>
%mul = mul <1 x i128> %1, %2		%mul = mul <1 x i128> %1, %2
%shr = lshr <1 x i128> %mul, %splat		%shr = lshr <1 x i128> %mul, %splat
%res = trunc <1 x i128> %shr to <1 x i64>		%res = trunc <1 x i128> %shr to <1 x i64>
ret <1 x i64> %res		ret <1 x i64> %res
}		}

; Vector i64 multiplications are not legal for NEON so use SVE when available.
define <2 x i64> @smulh_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {		define <2 x i64> @smulh_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
; CHECK-LABEL: smulh_v2i64:		; CHECK-LABEL: smulh_v2i64:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0		; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
; CHECK-NEXT: ptrue p0.d, vl2		; CHECK-NEXT: ptrue p0.d, vl2
; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1		; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
; CHECK-NEXT: smulh z0.d, p0/m, z0.d, z1.d		; CHECK-NEXT: smulh z0.d, p0/m, z0.d, z1.d
; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0		; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: smulh_v2i64:
		; VBITS_EQ_128: mov x8, v0.d[1]
		; VBITS_EQ_128-NEXT: fmov x10, d0
		; VBITS_EQ_128-NEXT: mov x9, v1.d[1]
		; VBITS_EQ_128-NEXT: fmov x11, d1
		; VBITS_EQ_128-NEXT: smulh x10, x10, x11
		; VBITS_EQ_128-NEXT: smulh x8, x8, x9
		; VBITS_EQ_128-NEXT: fmov d0, x10
		; VBITS_EQ_128-NEXT: fmov d1, x8
		; VBITS_EQ_128-NEXT: mov v0.d[1], v1.d[0]
		; VBITS_EQ_128-NEXT: ret

%1 = sext <2 x i64> %op1 to <2 x i128>		%1 = sext <2 x i64> %op1 to <2 x i128>
%2 = sext <2 x i64> %op2 to <2 x i128>		%2 = sext <2 x i64> %op2 to <2 x i128>
%mul = mul <2 x i128> %1, %2		%mul = mul <2 x i128> %1, %2
%shr = lshr <2 x i128> %mul, <i128 64, i128 64>		%shr = lshr <2 x i128> %mul, <i128 64, i128 64>
%res = trunc <2 x i128> %shr to <2 x i64>		%res = trunc <2 x i128> %shr to <2 x i64>
ret <2 x i64> %res		ret <2 x i64> %res
}		}

[... 76 lines not shown ...]
		; VBITS_GE_2048-NEXT: ret
store <32 x i64> %res, <32 x i64>* %a		store <32 x i64> %res, <32 x i64>* %a
ret void		ret void
}		}

;		;
; UMULH		; UMULH
;		;

; Don't use SVE for 64-bit vectors.		; FIXME: The codegen for the >=256 bits case can be improved.
define <8 x i8> @umulh_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {		define <8 x i8> @umulh_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
; CHECK-LABEL: umulh_v8i8:		; CHECK-LABEL: umulh_v8i8:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: umull v0.8h, v0.8b, v1.8b		; CHECK-NEXT: umull v0.8h, v0.8b, v1.8b
; CHECK-NEXT: ushr v1.8h, v0.8h, #8		; CHECK-NEXT: ushr v1.8h, v0.8h, #8
; CHECK-NEXT: umov w8, v1.h[0]		; CHECK-NEXT: umov w8, v1.h[0]
; CHECK-NEXT: umov w9, v1.h[1]		; CHECK-NEXT: umov w9, v1.h[1]
; CHECK-NEXT: fmov s0, w8		; CHECK-NEXT: fmov s0, w8
; CHECK-NEXT: umov w8, v1.h[2]		; CHECK-NEXT: umov w8, v1.h[2]
; CHECK-NEXT: mov v0.b[1], w9		; CHECK-NEXT: mov v0.b[1], w9
; CHECK-NEXT: mov v0.b[2], w8		; CHECK-NEXT: mov v0.b[2], w8
; CHECK-NEXT: umov w8, v1.h[3]		; CHECK-NEXT: umov w8, v1.h[3]
; CHECK-NEXT: mov v0.b[3], w8		; CHECK-NEXT: mov v0.b[3], w8
; CHECK-NEXT: umov w8, v1.h[4]		; CHECK-NEXT: umov w8, v1.h[4]
; CHECK-NEXT: mov v0.b[4], w8		; CHECK-NEXT: mov v0.b[4], w8
; CHECK-NEXT: umov w8, v1.h[5]		; CHECK-NEXT: umov w8, v1.h[5]
; CHECK-NEXT: mov v0.b[5], w8		; CHECK-NEXT: mov v0.b[5], w8
; CHECK-NEXT: umov w8, v1.h[6]		; CHECK-NEXT: umov w8, v1.h[6]
; CHECK-NEXT: mov v0.b[6], w8		; CHECK-NEXT: mov v0.b[6], w8
; CHECK-NEXT: umov w8, v1.h[7]		; CHECK-NEXT: umov w8, v1.h[7]
; CHECK-NEXT: mov v0.b[7], w8		; CHECK-NEXT: mov v0.b[7], w8
; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0		; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: umulh_v8i8:
		; VBITS_EQ_128: umull v0.8h, v0.8b, v1.8b
		; VBITS_EQ_128-NEXT: shrn v0.8b, v0.8h, #8
		; VBITS_EQ_128-NEXT: ret

%1 = zext <8 x i8> %op1 to <8 x i16>		%1 = zext <8 x i8> %op1 to <8 x i16>
%2 = zext <8 x i8> %op2 to <8 x i16>		%2 = zext <8 x i8> %op2 to <8 x i16>
%mul = mul <8 x i16> %1, %2		%mul = mul <8 x i16> %1, %2
%shr = lshr <8 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>		%shr = lshr <8 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%res = trunc <8 x i16> %shr to <8 x i8>		%res = trunc <8 x i16> %shr to <8 x i8>
ret <8 x i8> %res		ret <8 x i8> %res
}		}

; Don't use SVE for 128-bit vectors.
define <16 x i8> @umulh_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {		define <16 x i8> @umulh_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
; CHECK-LABEL: umulh_v16i8:		; CHECK-LABEL: umulh_v16i8:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: umull2 v2.8h, v0.16b, v1.16b		; CHECK-NEXT: umull2 v2.8h, v0.16b, v1.16b
; CHECK-NEXT: umull v0.8h, v0.8b, v1.8b		; CHECK-NEXT: umull v0.8h, v0.8b, v1.8b
; CHECK-NEXT: uzp2 v0.16b, v0.16b, v2.16b		; CHECK-NEXT: uzp2 v0.16b, v0.16b, v2.16b
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: umulh_v16i8:
		; VBITS_EQ_128: umull2 v2.8h, v0.16b, v1.16b
		; VBITS_EQ_128-NEXT: umull v0.8h, v0.8b, v1.8b
		; VBITS_EQ_128-NEXT: uzp2 v0.16b, v0.16b, v2.16b
		; VBITS_EQ_128-NEXT: ret

%1 = zext <16 x i8> %op1 to <16 x i16>		%1 = zext <16 x i8> %op1 to <16 x i16>
%2 = zext <16 x i8> %op2 to <16 x i16>		%2 = zext <16 x i8> %op2 to <16 x i16>
%mul = mul <16 x i16> %1, %2		%mul = mul <16 x i16> %1, %2
%shr = lshr <16 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>		%shr = lshr <16 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%res = trunc <16 x i16> %shr to <16 x i8>		%res = trunc <16 x i16> %shr to <16 x i8>
ret <16 x i8> %res		ret <16 x i8> %res
}		}

[... 75 lines not shown ...]
		; VBITS_GE_2048-NEXT: ret
%2 = zext <256 x i8> %op2 to <256 x i16>		%2 = zext <256 x i8> %op2 to <256 x i16>
%mul = mul <256 x i16> %1, %2		%mul = mul <256 x i16> %1, %2
%shr = lshr <256 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>		%shr = lshr <256 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 
8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
%res = trunc <256 x i16> %shr to <256 x i8>		%res = trunc <256 x i16> %shr to <256 x i8>
store <256 x i8> %res, <256 x i8>* %a		store <256 x i8> %res, <256 x i8>* %a
ret void		ret void
}		}

; Don't use SVE for 64-bit vectors.		; FIXME: The codegen for the >=256 bits case can be improved.
define <4 x i16> @umulh_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {		define <4 x i16> @umulh_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
; CHECK-LABEL: umulh_v4i16:		; CHECK-LABEL: umulh_v4i16:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: umull v0.4s, v0.4h, v1.4h		; CHECK-NEXT: umull v0.4s, v0.4h, v1.4h
; CHECK-NEXT: ushr v1.4s, v0.4s, #16		; CHECK-NEXT: ushr v1.4s, v0.4s, #16
; CHECK-NEXT: mov w8, v1.s[1]		; CHECK-NEXT: mov w8, v1.s[1]
; CHECK-NEXT: mov w9, v1.s[2]		; CHECK-NEXT: mov w9, v1.s[2]
; CHECK-NEXT: mov v0.16b, v1.16b		; CHECK-NEXT: mov v0.16b, v1.16b
; CHECK-NEXT: mov v0.h[1], w8		; CHECK-NEXT: mov v0.h[1], w8
; CHECK-NEXT: mov w8, v1.s[3]		; CHECK-NEXT: mov w8, v1.s[3]
; CHECK-NEXT: mov v0.h[2], w9		; CHECK-NEXT: mov v0.h[2], w9
; CHECK-NEXT: mov v0.h[3], w8		; CHECK-NEXT: mov v0.h[3], w8
; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0		; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: umulh_v4i16:
		; VBITS_EQ_128: umull v0.4s, v0.4h, v1.4h
		; VBITS_EQ_128-NEXT: shrn v0.4h, v0.4s, #16
		; VBITS_EQ_128-NEXT: ret

%1 = zext <4 x i16> %op1 to <4 x i32>		%1 = zext <4 x i16> %op1 to <4 x i32>
%2 = zext <4 x i16> %op2 to <4 x i32>		%2 = zext <4 x i16> %op2 to <4 x i32>
%mul = mul <4 x i32> %1, %2		%mul = mul <4 x i32> %1, %2
%shr = lshr <4 x i32> %mul, <i32 16, i32 16, i32 16, i32 16>		%shr = lshr <4 x i32> %mul, <i32 16, i32 16, i32 16, i32 16>
%res = trunc <4 x i32> %shr to <4 x i16>		%res = trunc <4 x i32> %shr to <4 x i16>
ret <4 x i16> %res		ret <4 x i16> %res
}		}

; Don't use SVE for 128-bit vectors.
define <8 x i16> @umulh_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {		define <8 x i16> @umulh_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
; CHECK-LABEL: umulh_v8i16:		; CHECK-LABEL: umulh_v8i16:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: umull2 v2.4s, v0.8h, v1.8h		; CHECK-NEXT: umull2 v2.4s, v0.8h, v1.8h
; CHECK-NEXT: umull v0.4s, v0.4h, v1.4h		; CHECK-NEXT: umull v0.4s, v0.4h, v1.4h
; CHECK-NEXT: uzp2 v0.8h, v0.8h, v2.8h		; CHECK-NEXT: uzp2 v0.8h, v0.8h, v2.8h
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: umulh_v8i16:
		; VBITS_EQ_128: umull2 v2.4s, v0.8h, v1.8h
		; VBITS_EQ_128-NEXT: umull v0.4s, v0.4h, v1.4h
		; VBITS_EQ_128-NEXT: uzp2 v0.8h, v0.8h, v2.8h
		; VBITS_EQ_128-NEXT: ret

%1 = zext <8 x i16> %op1 to <8 x i32>		%1 = zext <8 x i16> %op1 to <8 x i32>
%2 = zext <8 x i16> %op2 to <8 x i32>		%2 = zext <8 x i16> %op2 to <8 x i32>
%mul = mul <8 x i32> %1, %2		%mul = mul <8 x i32> %1, %2
%shr = lshr <8 x i32> %mul, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>		%shr = lshr <8 x i32> %mul, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
%res = trunc <8 x i32> %shr to <8 x i16>		%res = trunc <8 x i32> %shr to <8 x i16>
ret <8 x i16> %res		ret <8 x i16> %res
}		}

[... 72 lines not shown ...]
		; VBITS_GE_2048-NEXT: ret
%2 = zext <128 x i16> %op2 to <128 x i32>		%2 = zext <128 x i16> %op2 to <128 x i32>
%mul = mul <128 x i32> %1, %2		%mul = mul <128 x i32> %1, %2
%shr = lshr <128 x i32> %mul, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>		%shr = lshr <128 x i32> %mul, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, 
i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
%res = trunc <128 x i32> %shr to <128 x i16>		%res = trunc <128 x i32> %shr to <128 x i16>
store <128 x i16> %res, <128 x i16>* %a		store <128 x i16> %res, <128 x i16>* %a
ret void		ret void
}		}

; Vector i64 multiplications are not legal for NEON so use SVE when available.
define <2 x i32> @umulh_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {		define <2 x i32> @umulh_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
; CHECK-LABEL: umulh_v2i32:		; CHECK-LABEL: umulh_v2i32:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: ushll v0.2d, v0.2s, #0		; CHECK-NEXT: ushll v0.2d, v0.2s, #0
; CHECK-NEXT: ushll v1.2d, v1.2s, #0		; CHECK-NEXT: ushll v1.2d, v1.2s, #0
; CHECK-NEXT: ptrue p0.d, vl2		; CHECK-NEXT: ptrue p0.d, vl2
; CHECK-NEXT: mul z0.d, p0/m, z0.d, z1.d		; CHECK-NEXT: mul z0.d, p0/m, z0.d, z1.d
; CHECK-NEXT: shrn v0.2s, v0.2d, #32		; CHECK-NEXT: shrn v0.2s, v0.2d, #32
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: umulh_v2i32:
		; VBITS_EQ_128: ushll v0.2d, v0.2s, #0
		; VBITS_EQ_128-NEXT: ushll v1.2d, v1.2s, #0
		; VBITS_EQ_128-NEXT: ptrue p0.d, vl2
		; VBITS_EQ_128-NEXT: mul z0.d, p0/m, z0.d, z1.d
		; VBITS_EQ_128-NEXT: shrn v0.2s, v0.2d, #32
		; VBITS_EQ_128-NEXT: ret

%1 = zext <2 x i32> %op1 to <2 x i64>		%1 = zext <2 x i32> %op1 to <2 x i64>
%2 = zext <2 x i32> %op2 to <2 x i64>		%2 = zext <2 x i32> %op2 to <2 x i64>
%mul = mul <2 x i64> %1, %2		%mul = mul <2 x i64> %1, %2
%shr = lshr <2 x i64> %mul, <i64 32, i64 32>		%shr = lshr <2 x i64> %mul, <i64 32, i64 32>
%res = trunc <2 x i64> %shr to <2 x i32>		%res = trunc <2 x i64> %shr to <2 x i32>
ret <2 x i32> %res		ret <2 x i32> %res
}		}

; Don't use SVE for 128-bit vectors.
define <4 x i32> @umulh_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {		define <4 x i32> @umulh_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
; CHECK-LABEL: umulh_v4i32:		; CHECK-LABEL: umulh_v4i32:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: umull2 v2.2d, v0.4s, v1.4s		; CHECK-NEXT: umull2 v2.2d, v0.4s, v1.4s
; CHECK-NEXT: umull v0.2d, v0.2s, v1.2s		; CHECK-NEXT: umull v0.2d, v0.2s, v1.2s
; CHECK-NEXT: uzp2 v0.4s, v0.4s, v2.4s		; CHECK-NEXT: uzp2 v0.4s, v0.4s, v2.4s
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: umulh_v4i32:
		; VBITS_EQ_128: umull2 v2.2d, v0.4s, v1.4s
		; VBITS_EQ_128-NEXT: umull v0.2d, v0.2s, v1.2s
		; VBITS_EQ_128-NEXT: uzp2 v0.4s, v0.4s, v2.4s
		; VBITS_EQ_128-NEXT: ret

%1 = zext <4 x i32> %op1 to <4 x i64>		%1 = zext <4 x i32> %op1 to <4 x i64>
%2 = zext <4 x i32> %op2 to <4 x i64>		%2 = zext <4 x i32> %op2 to <4 x i64>
%mul = mul <4 x i64> %1, %2		%mul = mul <4 x i64> %1, %2
%shr = lshr <4 x i64> %mul, <i64 32, i64 32, i64 32, i64 32>		%shr = lshr <4 x i64> %mul, <i64 32, i64 32, i64 32, i64 32>
%res = trunc <4 x i64> %shr to <4 x i32>		%res = trunc <4 x i64> %shr to <4 x i32>
ret <4 x i32> %res		ret <4 x i32> %res
}		}

[74 lines not shown]
		; VBITS_GE_2048-NEXT: ret
%2 = zext <64 x i32> %op2 to <64 x i64>		%2 = zext <64 x i32> %op2 to <64 x i64>
%mul = mul <64 x i64> %1, %2		%mul = mul <64 x i64> %1, %2
%shr = lshr <64 x i64> %mul, <i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32>		%shr = lshr <64 x i64> %mul, <i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32>
%res = trunc <64 x i64> %shr to <64 x i32>		%res = trunc <64 x i64> %shr to <64 x i32>
store <64 x i32> %res, <64 x i32>* %a		store <64 x i32> %res, <64 x i32>* %a
ret void		ret void
}		}

; Vector i64 multiplications are not legal for NEON so use SVE when available.
define <1 x i64> @umulh_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {		define <1 x i64> @umulh_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
; CHECK-LABEL: umulh_v1i64:		; CHECK-LABEL: umulh_v1i64:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0		; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
; CHECK-NEXT: ptrue p0.d, vl1		; CHECK-NEXT: ptrue p0.d, vl1
; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1		; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
; CHECK-NEXT: umulh z0.d, p0/m, z0.d, z1.d		; CHECK-NEXT: umulh z0.d, p0/m, z0.d, z1.d
; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0		; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: umulh_v1i64:
		; VBITS_EQ_128: fmov x8, d0
		; VBITS_EQ_128-NEXT: fmov x9, d1
		; VBITS_EQ_128-NEXT: umulh x8, x8, x9
		; VBITS_EQ_128-NEXT: fmov d0, x8
		; VBITS_EQ_128-NEXT: ret

%1 = zext <1 x i64> %op1 to <1 x i128>		%1 = zext <1 x i64> %op1 to <1 x i128>
%2 = zext <1 x i64> %op2 to <1 x i128>		%2 = zext <1 x i64> %op2 to <1 x i128>
%mul = mul <1 x i128> %1, %2		%mul = mul <1 x i128> %1, %2
%shr = lshr <1 x i128> %mul, <i128 64>		%shr = lshr <1 x i128> %mul, <i128 64>
%res = trunc <1 x i128> %shr to <1 x i64>		%res = trunc <1 x i128> %shr to <1 x i64>
ret <1 x i64> %res		ret <1 x i64> %res
}		}

; Vector i64 multiplications are not legal for NEON so use SVE when available.
define <2 x i64> @umulh_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {		define <2 x i64> @umulh_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
; CHECK-LABEL: umulh_v2i64:		; CHECK-LABEL: umulh_v2i64:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0		; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
; CHECK-NEXT: ptrue p0.d, vl2		; CHECK-NEXT: ptrue p0.d, vl2
; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1		; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
; CHECK-NEXT: umulh z0.d, p0/m, z0.d, z1.d		; CHECK-NEXT: umulh z0.d, p0/m, z0.d, z1.d
; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0		; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
; CHECK-NEXT: ret		; CHECK-NEXT: ret

		; VBITS_EQ_128-LABEL: umulh_v2i64:
		; VBITS_EQ_128: mov x8, v0.d[1]
		; VBITS_EQ_128-NEXT: fmov x10, d0
		; VBITS_EQ_128-NEXT: mov x9, v1.d[1]
		; VBITS_EQ_128-NEXT: fmov x11, d1
		; VBITS_EQ_128-NEXT: umulh x10, x10, x11
		; VBITS_EQ_128-NEXT: umulh x8, x8, x9
		; VBITS_EQ_128-NEXT: fmov d0, x10
		; VBITS_EQ_128-NEXT: fmov d1, x8
		; VBITS_EQ_128-NEXT: mov v0.d[1], v1.d[0]
		; VBITS_EQ_128-NEXT: ret

%1 = zext <2 x i64> %op1 to <2 x i128>		%1 = zext <2 x i64> %op1 to <2 x i128>
%2 = zext <2 x i64> %op2 to <2 x i128>		%2 = zext <2 x i64> %op2 to <2 x i128>
%mul = mul <2 x i128> %1, %2		%mul = mul <2 x i128> %1, %2
%shr = lshr <2 x i128> %mul, <i128 64, i128 64>		%shr = lshr <2 x i128> %mul, <i128 64, i128 64>
%res = trunc <2 x i128> %shr to <2 x i64>		%res = trunc <2 x i128> %shr to <2 x i64>
ret <2 x i64> %res		ret <2 x i64> %res
}		}

[80 lines not shown]

llvm/test/CodeGen/AArch64/sve-fixed-length-int-rem.ll

	[691 lines not shown]
	; CHECK-NEXT: mul z1.d, [[PG]]/m, [[OP2]].d, [[DIV]].d			; CHECK-NEXT: mul z1.d, [[PG]]/m, [[OP2]].d, [[DIV]].d
	; CHECK-NEXT: sub d0, d0, d1			; CHECK-NEXT: sub d0, d0, d1
	; CHECK-NEXT: ret			; CHECK-NEXT: ret

	; VBITS_EQ_128-LABEL: srem_v1i64:			; VBITS_EQ_128-LABEL: srem_v1i64:
	; VBITS_EQ_128: ptrue p0.d, vl1			; VBITS_EQ_128: ptrue p0.d, vl1
	; VBITS_EQ_128-NEXT: movprfx z2, z0			; VBITS_EQ_128-NEXT: movprfx z2, z0
	; VBITS_EQ_128-NEXT: sdiv z2.d, p0/m, z2.d, z1.d			; VBITS_EQ_128-NEXT: sdiv z2.d, p0/m, z2.d, z1.d
	; VBITS_EQ_128-NEXT: fmov x8, d2			; VBITS_EQ_128-NEXT: mul z1.d, p0/m, z1.d, z2.d
	; VBITS_EQ_128-NEXT: fmov x9, d1
	; VBITS_EQ_128-NEXT: mul x8, x8, x9
	; VBITS_EQ_128-NEXT: fmov d1, x8
	; VBITS_EQ_128-NEXT: sub d0, d0, d1			; VBITS_EQ_128-NEXT: sub d0, d0, d1
	; VBITS_EQ_128-NEXT: ret			; VBITS_EQ_128-NEXT: ret

	%res = srem <1 x i64> %op1, %op2			%res = srem <1 x i64> %op1, %op2
	ret <1 x i64> %res			ret <1 x i64> %res
	}			}

	; Vector i64 sdiv are not legal for NEON so use SVE when available.			; Vector i64 sdiv are not legal for NEON so use SVE when available.
	; FIXME: We should be able to improve the codegen for the 128 bits case here.			; FIXME: We should be able to improve the codegen for the 128 bits case here.
	define <2 x i64> @srem_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {			define <2 x i64> @srem_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
	; CHECK-LABEL: srem_v2i64:			; CHECK-LABEL: srem_v2i64:
	; CHECK: ptrue [[PG:p[0-9]+]].d, vl2			; CHECK: ptrue [[PG:p[0-9]+]].d, vl2
	; CHECK-NEXT: movprfx [[PFX:z[0-9]+]], [[OP1]]			; CHECK-NEXT: movprfx [[PFX:z[0-9]+]], [[OP1]]
	; CHECK-NEXT: sdiv [[DIV:z[0-9]+]].d, [[PG]]/m, [[PFX]].d, z1.d			; CHECK-NEXT: sdiv [[DIV:z[0-9]+]].d, [[PG]]/m, [[PFX]].d, z1.d
	; CHECK-NEXT: mul z1.d, [[PG]]/m, [[OP2]].d, [[DIV]].d			; CHECK-NEXT: mul z1.d, [[PG]]/m, [[OP2]].d, [[DIV]].d
	; CHECK-NEXT: sub v0.2d, v0.2d, v1.2d			; CHECK-NEXT: sub v0.2d, v0.2d, v1.2d
	; CHECK-NEXT: ret			; CHECK-NEXT: ret

	; VBITS_EQ_128-LABEL: srem_v2i64:			; VBITS_EQ_128-LABEL: srem_v2i64:
	; VBITS_EQ_128: ptrue p0.d, vl2			; VBITS_EQ_128: ptrue p0.d, vl2
	; VBITS_EQ_128-NEXT: movprfx z2, z0			; VBITS_EQ_128-NEXT: movprfx z2, z0
	; VBITS_EQ_128-NEXT: sdiv z2.d, p0/m, z2.d, z1.d			; VBITS_EQ_128-NEXT: sdiv z2.d, p0/m, z2.d, z1.d
	; VBITS_EQ_128-NEXT: fmov x9, d2			; VBITS_EQ_128-NEXT: mul z1.d, p0/m, z1.d, z2.d
	; VBITS_EQ_128-NEXT: fmov x10, d1
	; VBITS_EQ_128-NEXT: mov x8, v2.d[1]
	; VBITS_EQ_128-NEXT: mov x11, v1.d[1]
	; VBITS_EQ_128-NEXT: mul x9, x9, x10
	; VBITS_EQ_128-NEXT: mul x8, x8, x11
	; VBITS_EQ_128-NEXT: fmov d1, x9
	; VBITS_EQ_128-NEXT: mov v1.d[1], x8
	; VBITS_EQ_128-NEXT: sub v0.2d, v0.2d, v1.2d			; VBITS_EQ_128-NEXT: sub v0.2d, v0.2d, v1.2d
	; VBITS_EQ_128-NEXT: ret			; VBITS_EQ_128-NEXT: ret

	%res = srem <2 x i64> %op1, %op2			%res = srem <2 x i64> %op1, %op2
	ret <2 x i64> %res			ret <2 x i64> %res
	}			}
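
The CHECK lines for srem_v2i64 above expand the remainder into three steps: sdiv, mul, sub. The following is a minimal C++ sketch of that identity, rem = op1 - (op1 / op2) * op2, per lane. It is not part of the patch; the helper name and the precondition asserts are illustrative assumptions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the patch): models the sdiv/mul/sub
// expansion of srem on <2 x i64>, one lane at a time.
std::array<int64_t, 2> srem_v2i64(std::array<int64_t, 2> op1,
                                  std::array<int64_t, 2> op2) {
  std::array<int64_t, 2> res{};
  for (std::size_t i = 0; i < res.size(); ++i) {
    // Assumed preconditions: division by zero and INT64_MIN / -1 are
    // undefined in C++, so this sketch rejects them up front.
    assert(op2[i] != 0);
    assert(!(op1[i] == INT64_MIN && op2[i] == -1));
    int64_t div = op1[i] / op2[i];  // sdiv: truncates toward zero
    int64_t mul = op2[i] * div;     // mul: the 64-bit vector multiply
    res[i] = op1[i] - mul;          // sub: what is left is the remainder
  }
  return res;
}
```

The middle step is the 64-bit vector multiply that the old VBITS_EQ_128 output scalarised through general-purpose registers and that the new output performs with a single predicated SVE `mul`.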

	define void @srem_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {			define void @srem_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
	[740 lines not shown]
	; CHECK-NEXT: mul z1.d, [[PG]]/m, [[OP2]].d, [[DIV]].d			; CHECK-NEXT: mul z1.d, [[PG]]/m, [[OP2]].d, [[DIV]].d
	; CHECK-NEXT: sub d0, d0, d1			; CHECK-NEXT: sub d0, d0, d1
	; CHECK-NEXT: ret			; CHECK-NEXT: ret

	; VBITS_EQ_128-LABEL: urem_v1i64:			; VBITS_EQ_128-LABEL: urem_v1i64:
	; VBITS_EQ_128: ptrue p0.d, vl1			; VBITS_EQ_128: ptrue p0.d, vl1
	; VBITS_EQ_128-NEXT: movprfx z2, z0			; VBITS_EQ_128-NEXT: movprfx z2, z0
	; VBITS_EQ_128-NEXT: udiv z2.d, p0/m, z2.d, z1.d			; VBITS_EQ_128-NEXT: udiv z2.d, p0/m, z2.d, z1.d
	; VBITS_EQ_128-NEXT: fmov x8, d2			; VBITS_EQ_128-NEXT: mul z1.d, p0/m, z1.d, z2.d
	; VBITS_EQ_128-NEXT: fmov x9, d1
	; VBITS_EQ_128-NEXT: mul x8, x8, x9
	; VBITS_EQ_128-NEXT: fmov d1, x8
	; VBITS_EQ_128-NEXT: sub d0, d0, d1			; VBITS_EQ_128-NEXT: sub d0, d0, d1
	; VBITS_EQ_128-NEXT: ret			; VBITS_EQ_128-NEXT: ret

	%res = urem <1 x i64> %op1, %op2			%res = urem <1 x i64> %op1, %op2
	ret <1 x i64> %res			ret <1 x i64> %res
	}			}

	; Vector i64 udiv are not legal for NEON so use SVE when available.			; Vector i64 udiv are not legal for NEON so use SVE when available.
	; FIXME: We should be able to improve the codegen for the 128 bits case here.			; FIXME: We should be able to improve the codegen for the 128 bits case here.
	define <2 x i64> @urem_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {			define <2 x i64> @urem_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
	; CHECK-LABEL: urem_v2i64:			; CHECK-LABEL: urem_v2i64:
	; CHECK: ptrue [[PG:p[0-9]+]].d, vl2			; CHECK: ptrue [[PG:p[0-9]+]].d, vl2
	; CHECK-NEXT: movprfx [[PFX:z[0-9]+]], [[OP1]]			; CHECK-NEXT: movprfx [[PFX:z[0-9]+]], [[OP1]]
	; CHECK-NEXT: udiv [[DIV:z[0-9]+]].d, [[PG]]/m, [[PFX]].d, z1.d			; CHECK-NEXT: udiv [[DIV:z[0-9]+]].d, [[PG]]/m, [[PFX]].d, z1.d
	; CHECK-NEXT: mul z1.d, [[PG]]/m, [[OP2]].d, [[DIV]].d			; CHECK-NEXT: mul z1.d, [[PG]]/m, [[OP2]].d, [[DIV]].d
	; CHECK-NEXT: sub v0.2d, v0.2d, v1.2d			; CHECK-NEXT: sub v0.2d, v0.2d, v1.2d
	; CHECK-NEXT: ret			; CHECK-NEXT: ret

	; VBITS_EQ_128-LABEL: urem_v2i64:			; VBITS_EQ_128-LABEL: urem_v2i64:
	; VBITS_EQ_128: ptrue p0.d, vl2			; VBITS_EQ_128: ptrue p0.d, vl2
	; VBITS_EQ_128-NEXT: movprfx z2, z0			; VBITS_EQ_128-NEXT: movprfx z2, z0
	; VBITS_EQ_128-NEXT: udiv z2.d, p0/m, z2.d, z1.d			; VBITS_EQ_128-NEXT: udiv z2.d, p0/m, z2.d, z1.d
	; VBITS_EQ_128-NEXT: fmov x9, d2			; VBITS_EQ_128-NEXT: mul z1.d, p0/m, z1.d, z2.d
	; VBITS_EQ_128-NEXT: fmov x10, d1
	; VBITS_EQ_128-NEXT: mov x8, v2.d[1]
	; VBITS_EQ_128-NEXT: mov x11, v1.d[1]
	; VBITS_EQ_128-NEXT: mul x9, x9, x10
	; VBITS_EQ_128-NEXT: mul x8, x8, x11
	; VBITS_EQ_128-NEXT: fmov d1, x9
	; VBITS_EQ_128-NEXT: mov v1.d[1], x8
	; VBITS_EQ_128-NEXT: sub v0.2d, v0.2d, v1.2d			; VBITS_EQ_128-NEXT: sub v0.2d, v0.2d, v1.2d
	; VBITS_EQ_128-NEXT: ret			; VBITS_EQ_128-NEXT: ret

	%res = urem <2 x i64> %op1, %op2			%res = urem <2 x i64> %op1, %op2
	ret <2 x i64> %res			ret <2 x i64> %res
	}			}

	define void @urem_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {			define void @urem_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
	; CHECK-LABEL: urem_v4i64:			; CHECK-LABEL: urem_v4i64:
	[71 lines not shown]