This is an archive of the discontinued LLVM Phabricator instance.

Differential D120018

[AArch64] Alter mull shuffle(ext(..)) combine to work on buildvectors
ClosedPublic

Authored by dmgreen on Feb 17 2022, 1:09 AM.

Details

Reviewers

NickGuy
SjoerdMeijer
fhahn
jaykang10
samtebbs

Commits

rG774b57154691: [AArch64] Alter mull shuffle(ext(..)) combine to work on buildvectors
rG9fc1a0dcb79a: [AArch64] Alter mull shuffle(ext(..)) combine to work on buildvectors

Summary

We have a combine that converts mul(dup(ext(..)), ..) into mul(ext(dup(..)), ..) to allow more uses of the smull and umull instructions. Currently it looks for a vector insert feeding a shufflevector to detect the element that we can convert to a vector extend. Not all cases will have a shufflevector/insert element, though.

This started as an extension of the recognition to buildvectors (with elements that may be individually extended). The new method seems to cover all the cases that the old method captured, though, as the shuffle will eventually be lowered to buildvectors, so the old method has been removed to keep the code a little simpler. The new code detects legal build_vector(ext(a), ext(b), ..) nodes and converts them to ext(build_vector(a, b, ..)), provided all the extends and types match up.
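
As a rough illustration of the rewrite described in the summary, the following standalone C++ sketch models a build_vector whose operands are each a scalar extend, and hoists the extend outside the vector when every operand agrees on extend kind and source width. It does not use LLVM at all: `ExtKind`, `Scalar`, `NarrowVec`, and `hoistExtends` are hypothetical names invented for this example, and the real combine works on SelectionDAG nodes rather than plain structs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-ins for DAG nodes; none of these names exist in LLVM.
enum class ExtKind { None, Sign, Zero };

struct Scalar {
  ExtKind Ext;      // how the scalar was extended, if at all
  unsigned SrcBits; // width of the value before the extend
  int64_t Value;    // the pre-extend value
};

struct NarrowVec {
  ExtKind Ext;               // the single vector-wide extend to apply
  unsigned EltBits;          // element width before the extend
  std::vector<int64_t> Elts; // pre-extend elements
};

// Rewrites build_vector(ext(a), ext(b), ..) as ext(build_vector(a, b, ..)).
// Returns false, leaving Out untouched, if the operands do not all agree.
bool hoistExtends(const std::vector<Scalar> &Ops, unsigned ResultEltBits,
                  NarrowVec &Out) {
  assert(ResultEltBits % 2 == 0 && "expected an even result element width");
  if (Ops.empty() || Ops[0].Ext == ExtKind::None)
    return false;
  // Only an extend that exactly doubles the width is accepted here.
  if (Ops[0].SrcBits != ResultEltBits / 2)
    return false;
  NarrowVec Tmp{Ops[0].Ext, Ops[0].SrcBits, {}};
  for (const Scalar &Op : Ops) {
    if (Op.Ext != Tmp.Ext || Op.SrcBits != Tmp.EltBits)
      return false;
    Tmp.Elts.push_back(Op.Value);
  }
  Out = Tmp;
  return true;
}
```

In this model, a vector of two sign-extended 8-bit values with a 16-bit result is hoisted, while mixing a sign extend with a zero extend, or asking for a 32-bit result from 8-bit sources, is rejected.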

Found whilst looking at D96522 / D118979.

Event Timeline

dmgreen created this revision. Feb 17 2022, 1:09 AM

Herald added subscribers: hiraditya, kristof.beyls. Feb 17 2022, 1:09 AM

dmgreen requested review of this revision. Feb 17 2022, 1:09 AM

Herald added a project: Restricted Project. Feb 17 2022, 1:09 AM

Harbormaster completed remote builds in B150157: Diff 409514. Feb 17 2022, 1:09 AM

Looks good overall, with potential wins in both performance and codesize. Just 1 minor thing (and a nitpick about a comment).

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13512	Do we want to keep this comment (or replace it, rather than remove it)?
13542–13543	Is this correct? The previous behaviour used `PreExtendType` directly as an operand to `DAG.getAnyExtOrTrunc` (which kept it aligned with `PreExtendVT`), but this clamps it to a smallest type of i32.

dmgreen added inline comments. Feb 17 2022, 5:50 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13512	Oh yeah I can add this back in. I previously had both performCommonVectorExtendCombine and performBuildVectorExtendCombine methods being called, but removing the shuffle version had no effect on any of the tests we have. The methods are still quite similar in places.
13542–13543	It needs to be a legal type, so we can't just create i8 or i16 out of nowhere. The BUILD_VECTOR can be made up of elements that are implicitly truncated:
	/// BUILD_VECTOR(ELT0, ELT1, ELT2, ELT3,...) - Return a fixed-width vector
	/// with the specified, possibly variable, elements. The types of the
	/// operands must match the vector element type, except that integer types
	/// are allowed to be larger than the element type, in which case the
	/// operands are implicitly truncated. The types of the operands must all
	/// be the same.
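
The implicit truncation quoted in this reply can be pictured with a small standalone C++ model. It is only a sketch under stated assumptions: `buildVectorI8` and `anyExtend` are hypothetical helpers written for this example (they are not LLVM APIs), i32 stands in for the legal scalar type, and the "junk" upper bits stand in for the unspecified bits that an any-extend leaves behind.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Models a BUILD_VECTOR with i32 operands and i8 elements: each operand is
// implicitly truncated to the element width. Hypothetical helper, not LLVM.
std::vector<uint8_t> buildVectorI8(const std::vector<uint32_t> &Ops) {
  std::vector<uint8_t> Elts;
  for (uint32_t Op : Ops)
    Elts.push_back(static_cast<uint8_t>(Op)); // keep only the low 8 bits
  return Elts;
}

// Models an any-extend from i8 to i32. The upper 24 bits are unspecified,
// so this toy fills them with caller-chosen junk.
uint32_t anyExtend(uint8_t V, uint32_t Junk) {
  return (Junk << 8) | V;
}

// Usage: junk in the upper bits does not survive the implicit truncation.
void checkTruncationDropsJunk() {
  std::vector<uint8_t> V = buildVectorI8({anyExtend(0x80, 0xABCDEF)});
  assert(V.size() == 1);
  assert(V[0] == 0x80);
}
```

This is why, in the model at least, widening the scalar operands to a legal type does not change the bytes that end up in the narrow vector.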

Add back comment

Harbormaster completed remote builds in B150214: Diff 409615. Feb 17 2022, 5:53 AM

I wasn't aware of the implicit truncation. In that case, LGTM

This revision is now accepted and ready to land. Feb 17 2022, 5:53 AM

This revision was landed with ongoing or failed builds. Feb 21 2022, 7:44 AM

Closed by commit rG9fc1a0dcb79a: [AArch64] Alter mull shuffle(ext(..)) combine to work on buildvectors (authored by dmgreen).

This revision was automatically updated to reflect the committed changes.

dmgreen added a commit: rG9fc1a0dcb79a: [AArch64] Alter mull shuffle(ext(..)) combine to work on buildvectors.

FYI, this seems to have injected a build failure in Halide, but unfortunately I don't have a repro case yet (it's only been isolated in some proprietary code). Will work on providing one.

rnk added a reverting change: rGecb27004ecbc: Revert "[AArch64] Alter mull shuffle(ext(..)) combine to work on buildvectors". Feb 22 2022, 10:31 AM

OK thanks. Let me know. Hopefully a reproducer can be relatively small, given what this patch altered.

Reduced test case:

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-eabi"

define void @foo() local_unnamed_addr {
  %tmp = xor <16 x i1> zeroinitializer, <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>
  %tmp6 = load <8 x i16>, <8 x i16>* null, align 2
  %tmp7 = sub <8 x i16> zeroinitializer, %tmp6
  %tmp8 = shufflevector <8 x i16> %tmp7, <8 x i16> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
  %tmp9 = icmp slt i64 0, undef
  %tmp10 = zext i1 %tmp9 to i16
  %tmp11 = insertelement <16 x i16> undef, i16 %tmp10, i64 0
  %tmp12 = shufflevector <16 x i16> %tmp11, <16 x i16> undef, <16 x i32> zeroinitializer
  %tmp13 = mul nuw <16 x i16> %tmp8, %tmp12
  %tmp14 = icmp ne <16 x i16> %tmp13, zeroinitializer
  %tmp15 = and <16 x i1> %tmp14, %tmp
  %tmp16 = sext <16 x i1> %tmp15 to <16 x i8>
  %tmp17 = bitcast i8* undef to <16 x i8>*
  store <16 x i8> %tmp16, <16 x i8>* %tmp17, align 1
  ret void
}

$ llc t.ll
Value type is non-standard value, Other.
UNREACHABLE executed at llvm-project/llvm/include/llvm/Support/MachineValueType.h:869!

Oh it's an i1 type. I see.

Thanks for the reproducer.

Because the fix is pretty simple, I will recommit this with a fix for i1 types. Please let me know if there are any further problems.
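
The thread does not show the fix itself. The crash message above complains about a value type of Other, and the helper visible in the diff returns MVT::Other for extends it does not recognise, so one plausible shape for a guard is to reject that sentinel before asking for its size. The standalone C++ sketch below illustrates only that idea; it is an assumption, not the committed change, and `ToyVT`, `sizeInBits`, and `isHalfWidth` are hypothetical names that merely imitate the LLVM types.

```cpp
#include <cassert>

// Toy value type. Other is a sentinel with no defined size, loosely modelled
// on MVT::Other. These names are illustrative only.
enum class ToyVT { Other, I8, I16, I32 };

// Size query that is only valid for sized types.
unsigned sizeInBits(ToyVT T) {
  assert(T != ToyVT::Other && "size query on a sentinel type");
  switch (T) {
  case ToyVT::I8:
    return 8;
  case ToyVT::I16:
    return 16;
  case ToyVT::I32:
    return 32;
  default:
    return 0;
  }
}

// Accept a pre-extend type only if it is sized and exactly half the result
// width. The sentinel is rejected before its size is ever queried.
bool isHalfWidth(ToyVT PreExtend, unsigned ResultBits) {
  if (PreExtend == ToyVT::Other)
    return false;
  return sizeInBits(PreExtend) == ResultBits / 2;
}
```

With the early return in place, `isHalfWidth(ToyVT::Other, 16)` simply reports false instead of reaching the assertion in `sizeInBits`.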

dmgreen added a commit: rG774b57154691: [AArch64] Alter mull shuffle(ext(..)) combine to work on buildvectors. Feb 22 2022, 3:37 PM

dmgreen mentioned this in D123012: [AArch64] Alter mull buildvectors(ext(..)) combine to work on shuffles. Apr 4 2022, 12:10 AM

dmgreen mentioned this in rG3b9833597e81: [AArch64] Alter mull buildvectors(ext(..)) combine to work on shuffles. Apr 4 2022, 3:08 PM

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	74 lines
llvm/test/CodeGen/AArch64/aarch64-dup-ext.ll	6 lines
llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll	55 lines

Diff 409514

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[13,502 lines not shown; enclosing context: case ISD::AND: {]

return MVT::Other;		return MVT::Other;
}		}
default:		default:
return MVT::Other;		return MVT::Other;
}		}
}		}

/// Combines a dup(sext/zext) node pattern into sext/zext(dup)		static SDValue performBuildVectorExtendCombine(SDValue BV, SelectionDAG &DAG) {
/// making use of the vector SExt/ZExt rather than the scalar SExt/ZExt		EVT VT = BV.getValueType();
static SDValue performCommonVectorExtendCombine(SDValue VectorShuffle,		if (BV.getOpcode() != ISD::BUILD_VECTOR)
SelectionDAG &DAG) {
ShuffleVectorSDNode *ShuffleNode =
dyn_cast<ShuffleVectorSDNode>(VectorShuffle.getNode());
if (!ShuffleNode)
return SDValue();

// Ensuring the mask is zero before continuing
if (!ShuffleNode->isSplat() || ShuffleNode->getSplatIndex() != 0)
return SDValue();

SDValue InsertVectorElt = VectorShuffle.getOperand(0);

if (InsertVectorElt.getOpcode() != ISD::INSERT_VECTOR_ELT)
return SDValue();

SDValue InsertLane = InsertVectorElt.getOperand(2);
ConstantSDNode *Constant = dyn_cast<ConstantSDNode>(InsertLane.getNode());
// Ensures the insert is inserting into lane 0
if (!Constant || Constant->getZExtValue() != 0)
return SDValue();		return SDValue();

SDValue Extend = InsertVectorElt.getOperand(1);		// Use the first item in the buildvector to get the size of the extend, and
		// make sure it looks valid.
		SDValue Extend = BV->getOperand(0);
unsigned ExtendOpcode = Extend.getOpcode();		unsigned ExtendOpcode = Extend.getOpcode();

bool IsSExt = ExtendOpcode == ISD::SIGN_EXTEND ||		bool IsSExt = ExtendOpcode == ISD::SIGN_EXTEND ||
ExtendOpcode == ISD::SIGN_EXTEND_INREG ||		ExtendOpcode == ISD::SIGN_EXTEND_INREG ||
ExtendOpcode == ISD::AssertSext;		ExtendOpcode == ISD::AssertSext;
if (!IsSExt && ExtendOpcode != ISD::ZERO_EXTEND &&		if (!IsSExt && ExtendOpcode != ISD::ZERO_EXTEND &&
ExtendOpcode != ISD::AssertZext && ExtendOpcode != ISD::AND)		ExtendOpcode != ISD::AssertZext && ExtendOpcode != ISD::AND)
return SDValue();		return SDValue();

// Restrict valid pre-extend data type		// Restrict valid pre-extend data type
EVT PreExtendType = calculatePreExtendType(Extend);		EVT PreExtendType = calculatePreExtendType(Extend);
if (PreExtendType != MVT::i8 && PreExtendType != MVT::i16 &&		if (PreExtendType.getSizeInBits() != VT.getScalarSizeInBits() / 2)
PreExtendType != MVT::i32)
return SDValue();		return SDValue();

EVT TargetType = VectorShuffle.getValueType();		// Make sure all other operands are equally extended
EVT PreExtendVT = TargetType.changeVectorElementType(PreExtendType);		for (SDValue Op : drop_begin(BV->ops())) {
if (TargetType.getScalarSizeInBits() != PreExtendVT.getScalarSizeInBits() * 2)		unsigned Opc = Op.getOpcode();
		bool OpcIsSExt = Opc == ISD::SIGN_EXTEND || Opc == ISD::SIGN_EXTEND_INREG ||
		Opc == ISD::AssertSext;
		if (OpcIsSExt != IsSExt || calculatePreExtendType(Op) != PreExtendType)
return SDValue();		return SDValue();
		}

SDLoc DL(VectorShuffle);		EVT PreExtendVT = VT.changeVectorElementType(PreExtendType);
		EVT PreExtendLegalType =
SDValue InsertVectorNode = DAG.getNode(		PreExtendType.getScalarSizeInBits() < 32 ? MVT::i32 : PreExtendType;
InsertVectorElt.getOpcode(), DL, PreExtendVT, DAG.getUNDEF(PreExtendVT),		SDLoc DL(BV);
DAG.getAnyExtOrTrunc(Extend.getOperand(0), DL, PreExtendType),		SmallVector<SDValue, 8> NewOps;
DAG.getConstant(0, DL, MVT::i64));		for (SDValue Op : BV->ops())
		NewOps.push_back(
std::vector<int> ShuffleMask(TargetType.getVectorNumElements());		DAG.getAnyExtOrTrunc(Op.getOperand(0), DL, PreExtendLegalType));
		SDValue NBV = DAG.getNode(ISD::BUILD_VECTOR, DL, PreExtendVT, NewOps);
SDValue VectorShuffleNode =		return DAG.getNode(IsSExt ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND, DL, VT, NBV);
DAG.getVectorShuffle(PreExtendVT, DL, InsertVectorNode,
DAG.getUNDEF(PreExtendVT), ShuffleMask);

return DAG.getNode(IsSExt ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND, DL,
TargetType, VectorShuffleNode);
}		}

/// Combines a mul(dup(sext/zext)) node pattern into mul(sext/zext(dup))		/// Combines a mul(dup(sext/zext)) node pattern into mul(sext/zext(dup))
/// making use of the vector SExt/ZExt rather than the scalar SExt/ZExt		/// making use of the vector SExt/ZExt rather than the scalar SExt/ZExt
static SDValue performMulVectorExtendCombine(SDNode *Mul, SelectionDAG &DAG) {		static SDValue performMulVectorExtendCombine(SDNode *Mul, SelectionDAG &DAG) {
// If the value type isn't a vector, none of the operands are going to be dups		// If the value type isn't a vector, none of the operands are going to be dups
EVT VT = Mul->getValueType(0);		EVT VT = Mul->getValueType(0);
if (VT != MVT::v8i16 && VT != MVT::v4i32 && VT != MVT::v2i64)		if (VT != MVT::v8i16 && VT != MVT::v4i32 && VT != MVT::v2i64)
return SDValue();		return SDValue();

SDValue Op0 = performCommonVectorExtendCombine(Mul->getOperand(0), DAG);		SDValue Op0 = performBuildVectorExtendCombine(Mul->getOperand(0), DAG);
SDValue Op1 = performCommonVectorExtendCombine(Mul->getOperand(1), DAG);		SDValue Op1 = performBuildVectorExtendCombine(Mul->getOperand(1), DAG);

// Neither operands have been changed, don't make any further changes		// Neither operands have been changed, don't make any further changes
if (!Op0 && !Op1)		if (!Op0 && !Op1)
return SDValue();		return SDValue();

SDLoc DL(Mul);		SDLoc DL(Mul);
return DAG.getNode(Mul->getOpcode(), DL, VT, Op0 ? Op0 : Mul->getOperand(0),		return DAG.getNode(Mul->getOpcode(), DL, VT, Op0 ? Op0 : Mul->getOperand(0),
Op1 ? Op1 : Mul->getOperand(1));		Op1 ? Op1 : Mul->getOperand(1));
[6,782 lines not shown]

llvm/test/CodeGen/AArch64/aarch64-dup-ext.ll

	[150 lines not shown]
	; dupzext_v2i16_v2i32			; dupzext_v2i16_v2i32
	; dupzext_v2i16_v2i64			; dupzext_v2i16_v2i64

	; Unsupported states			; Unsupported states

	define <8 x i16> @nonsplat_shuffleinsert(i8 %src, <8 x i8> %b) {			define <8 x i16> @nonsplat_shuffleinsert(i8 %src, <8 x i8> %b) {
	; CHECK-LABEL: nonsplat_shuffleinsert:			; CHECK-LABEL: nonsplat_shuffleinsert:
	; CHECK: // %bb.0: // %entry			; CHECK: // %bb.0: // %entry
	; CHECK-NEXT: sxtb w8, w0			; CHECK-NEXT: dup v1.8b, w0
	; CHECK-NEXT: sshll v0.8h, v0.8b, #0			; CHECK-NEXT: smull v0.8h, v1.8b, v0.8b
	; CHECK-NEXT: dup v1.8h, w8
	; CHECK-NEXT: mul v0.8h, v1.8h, v0.8h
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	entry:			entry:
	%in = sext i8 %src to i16			%in = sext i8 %src to i16
	%ext.b = sext <8 x i8> %b to <8 x i16>			%ext.b = sext <8 x i8> %b to <8 x i16>
	%broadcast.splatinsert = insertelement <8 x i16> undef, i16 %in, i16 1			%broadcast.splatinsert = insertelement <8 x i16> undef, i16 %in, i16 1
	%broadcast.splat = shufflevector <8 x i16> %broadcast.splatinsert, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1>			%broadcast.splat = shufflevector <8 x i16> %broadcast.splatinsert, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1>
	%out = mul nsw <8 x i16> %broadcast.splat, %ext.b			%out = mul nsw <8 x i16> %broadcast.splat, %ext.b
	ret <8 x i16> %out			ret <8 x i16> %out
	[15 lines not shown]
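
The change in the CHECK lines above, from sxtb/sshll/dup/mul to dup/smull, rests on the two sequences computing the same 16-bit products. The standalone C++ sketch below models both sequences with plain arrays to make that equivalence concrete. It is an illustration only: `extendThenMul` and `dupThenWideningMul` are hypothetical names, the code does not use NEON intrinsics, and it relies on the fact that a product of two 8-bit signed values always fits in 16 bits.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using V8i8 = std::array<int8_t, 8>;
using V8i16 = std::array<int16_t, 8>;

// Old sequence: sign-extend the scalar, splat it at 16 bits, sign-extend the
// vector operand, then multiply at 16 bits.
V8i16 extendThenMul(int8_t Src, const V8i8 &B) {
  const int16_t Wide = Src; // models sxtb followed by a 16-bit dup
  V8i16 Out{};
  for (int I = 0; I < 8; ++I)
    Out[I] = static_cast<int16_t>(Wide * static_cast<int16_t>(B[I]));
  return Out;
}

// New sequence: splat at 8 bits, then a widening multiply (models smull).
V8i16 dupThenWideningMul(int8_t Src, const V8i8 &B) {
  V8i8 Splat;
  Splat.fill(Src);
  V8i16 Out{};
  for (int I = 0; I < 8; ++I)
    Out[I] = static_cast<int16_t>(static_cast<int>(Splat[I]) *
                                  static_cast<int>(B[I]));
  return Out;
}

// Usage: the two sequences agree, including at the extreme values.
void checkSequencesAgree() {
  const V8i8 B = {{-128, -1, 0, 1, 2, 64, 127, -77}};
  assert(extendThenMul(-128, B) == dupThenWideningMul(-128, B));
  assert(extendThenMul(127, B) == dupThenWideningMul(127, B));
}
```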

llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll

	[195 lines not shown]
	; CHECK-NEXT: mov w9, w3			; CHECK-NEXT: mov w9, w3
	; CHECK-NEXT: cmp w3, #15			; CHECK-NEXT: cmp w3, #15
	; CHECK-NEXT: b.hi .LBB3_3			; CHECK-NEXT: b.hi .LBB3_3
	; CHECK-NEXT: // %bb.2:			; CHECK-NEXT: // %bb.2:
	; CHECK-NEXT: mov x10, xzr			; CHECK-NEXT: mov x10, xzr
	; CHECK-NEXT: b .LBB3_6			; CHECK-NEXT: b .LBB3_6
	; CHECK-NEXT: .LBB3_3: // %vector.ph			; CHECK-NEXT: .LBB3_3: // %vector.ph
	; CHECK-NEXT: and x10, x9, #0xfffffff0			; CHECK-NEXT: and x10, x9, #0xfffffff0
				; CHECK-NEXT: dup v0.4h, w8
	; CHECK-NEXT: add x11, x2, #32			; CHECK-NEXT: add x11, x2, #32
	; CHECK-NEXT: add x12, x0, #16			; CHECK-NEXT: add x12, x0, #16
	; CHECK-NEXT: mov x13, x10			; CHECK-NEXT: mov x13, x10
	; CHECK-NEXT: dup v0.4s, w8			; CHECK-NEXT: dup v1.8h, w8
	; CHECK-NEXT: .LBB3_4: // %vector.body			; CHECK-NEXT: .LBB3_4: // %vector.body
	; CHECK-NEXT: // =>This Inner Loop Header: Depth=1			; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: ldp q1, q2, [x12, #-16]			; CHECK-NEXT: ldp q2, q3, [x12, #-16]
	; CHECK-NEXT: subs x13, x13, #16			; CHECK-NEXT: subs x13, x13, #16
	; CHECK-NEXT: add x12, x12, #32			; CHECK-NEXT: add x12, x12, #32
	; CHECK-NEXT: sshll2 v3.4s, v1.8h, #0			; CHECK-NEXT: smull2 v4.4s, v1.8h, v2.8h
	; CHECK-NEXT: sshll v1.4s, v1.4h, #0			; CHECK-NEXT: smull v2.4s, v0.4h, v2.4h
	; CHECK-NEXT: sshll2 v4.4s, v2.8h, #0			; CHECK-NEXT: smull2 v5.4s, v1.8h, v3.8h
	; CHECK-NEXT: sshll v2.4s, v2.4h, #0			; CHECK-NEXT: smull v3.4s, v0.4h, v3.4h
	; CHECK-NEXT: mul v3.4s, v0.4s, v3.4s			; CHECK-NEXT: stp q2, q4, [x11, #-32]
	; CHECK-NEXT: mul v1.4s, v0.4s, v1.4s			; CHECK-NEXT: stp q3, q5, [x11], #64
	; CHECK-NEXT: mul v4.4s, v0.4s, v4.4s
	; CHECK-NEXT: mul v2.4s, v0.4s, v2.4s
	; CHECK-NEXT: stp q1, q3, [x11, #-32]
	; CHECK-NEXT: stp q2, q4, [x11], #64
	; CHECK-NEXT: b.ne .LBB3_4			; CHECK-NEXT: b.ne .LBB3_4
	; CHECK-NEXT: // %bb.5: // %middle.block			; CHECK-NEXT: // %bb.5: // %middle.block
	; CHECK-NEXT: cmp x10, x9			; CHECK-NEXT: cmp x10, x9
	; CHECK-NEXT: b.eq .LBB3_8			; CHECK-NEXT: b.eq .LBB3_8
	; CHECK-NEXT: .LBB3_6: // %for.body.preheader1			; CHECK-NEXT: .LBB3_6: // %for.body.preheader1
	; CHECK-NEXT: sub x9, x9, x10			; CHECK-NEXT: sub x9, x9, x10
	; CHECK-NEXT: add x11, x2, x10, lsl #2			; CHECK-NEXT: add x11, x2, x10, lsl #2
	; CHECK-NEXT: add x10, x0, x10, lsl #1			; CHECK-NEXT: add x10, x0, x10, lsl #1
	[81 lines not shown]
	; CHECK-NEXT: mov w9, w3			; CHECK-NEXT: mov w9, w3
	; CHECK-NEXT: cmp w3, #15			; CHECK-NEXT: cmp w3, #15
	; CHECK-NEXT: b.hi .LBB4_3			; CHECK-NEXT: b.hi .LBB4_3
	; CHECK-NEXT: // %bb.2:			; CHECK-NEXT: // %bb.2:
	; CHECK-NEXT: mov x10, xzr			; CHECK-NEXT: mov x10, xzr
	; CHECK-NEXT: b .LBB4_6			; CHECK-NEXT: b .LBB4_6
	; CHECK-NEXT: .LBB4_3: // %vector.ph			; CHECK-NEXT: .LBB4_3: // %vector.ph
	; CHECK-NEXT: and x10, x9, #0xfffffff0			; CHECK-NEXT: and x10, x9, #0xfffffff0
				; CHECK-NEXT: dup v0.4h, w8
	; CHECK-NEXT: add x11, x2, #32			; CHECK-NEXT: add x11, x2, #32
	; CHECK-NEXT: add x12, x0, #16			; CHECK-NEXT: add x12, x0, #16
	; CHECK-NEXT: mov x13, x10			; CHECK-NEXT: mov x13, x10
	; CHECK-NEXT: dup v0.4s, w8			; CHECK-NEXT: dup v1.8h, w8
	; CHECK-NEXT: .LBB4_4: // %vector.body			; CHECK-NEXT: .LBB4_4: // %vector.body
	; CHECK-NEXT: // =>This Inner Loop Header: Depth=1			; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: ldp q1, q2, [x12, #-16]			; CHECK-NEXT: ldp q2, q3, [x12, #-16]
	; CHECK-NEXT: subs x13, x13, #16			; CHECK-NEXT: subs x13, x13, #16
	; CHECK-NEXT: add x12, x12, #32			; CHECK-NEXT: add x12, x12, #32
	; CHECK-NEXT: ushll2 v3.4s, v1.8h, #0			; CHECK-NEXT: umull2 v4.4s, v1.8h, v2.8h
	; CHECK-NEXT: ushll v1.4s, v1.4h, #0			; CHECK-NEXT: umull v2.4s, v0.4h, v2.4h
	; CHECK-NEXT: ushll2 v4.4s, v2.8h, #0			; CHECK-NEXT: umull2 v5.4s, v1.8h, v3.8h
	; CHECK-NEXT: ushll v2.4s, v2.4h, #0			; CHECK-NEXT: umull v3.4s, v0.4h, v3.4h
	; CHECK-NEXT: mul v3.4s, v0.4s, v3.4s			; CHECK-NEXT: stp q2, q4, [x11, #-32]
	; CHECK-NEXT: mul v1.4s, v0.4s, v1.4s			; CHECK-NEXT: stp q3, q5, [x11], #64
	; CHECK-NEXT: mul v4.4s, v0.4s, v4.4s
	; CHECK-NEXT: mul v2.4s, v0.4s, v2.4s
	; CHECK-NEXT: stp q1, q3, [x11, #-32]
	; CHECK-NEXT: stp q2, q4, [x11], #64
	; CHECK-NEXT: b.ne .LBB4_4			; CHECK-NEXT: b.ne .LBB4_4
	; CHECK-NEXT: // %bb.5: // %middle.block			; CHECK-NEXT: // %bb.5: // %middle.block
	; CHECK-NEXT: cmp x10, x9			; CHECK-NEXT: cmp x10, x9
	; CHECK-NEXT: b.eq .LBB4_8			; CHECK-NEXT: b.eq .LBB4_8
	; CHECK-NEXT: .LBB4_6: // %for.body.preheader1			; CHECK-NEXT: .LBB4_6: // %for.body.preheader1
	; CHECK-NEXT: sub x9, x9, x10			; CHECK-NEXT: sub x9, x9, x10
	; CHECK-NEXT: add x11, x2, x10, lsl #2			; CHECK-NEXT: add x11, x2, x10, lsl #2
	; CHECK-NEXT: add x10, x0, x10, lsl #1			; CHECK-NEXT: add x10, x0, x10, lsl #1
	[83 lines not shown]
	; CHECK-NEXT: // %bb.2:			; CHECK-NEXT: // %bb.2:
	; CHECK-NEXT: mov x11, xzr			; CHECK-NEXT: mov x11, xzr
	; CHECK-NEXT: mov w8, wzr			; CHECK-NEXT: mov w8, wzr
	; CHECK-NEXT: b .LBB5_7			; CHECK-NEXT: b .LBB5_7
	; CHECK-NEXT: .LBB5_3:			; CHECK-NEXT: .LBB5_3:
	; CHECK-NEXT: mov w0, wzr			; CHECK-NEXT: mov w0, wzr
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	; CHECK-NEXT: .LBB5_4: // %vector.ph			; CHECK-NEXT: .LBB5_4: // %vector.ph
				; CHECK-NEXT: dup v1.8b, w9
	; CHECK-NEXT: and x11, x10, #0xfffffff0			; CHECK-NEXT: and x11, x10, #0xfffffff0
	; CHECK-NEXT: add x8, x0, #8
	; CHECK-NEXT: movi v0.2d, #0000000000000000			; CHECK-NEXT: movi v0.2d, #0000000000000000
				; CHECK-NEXT: add x8, x0, #8
				; CHECK-NEXT: movi v2.2d, #0000000000000000
	; CHECK-NEXT: mov x12, x11			; CHECK-NEXT: mov x12, x11
	; CHECK-NEXT: movi v1.2d, #0000000000000000			; CHECK-NEXT: sshll v1.8h, v1.8b, #0
	; CHECK-NEXT: dup v2.8h, w9
	; CHECK-NEXT: .LBB5_5: // %vector.body			; CHECK-NEXT: .LBB5_5: // %vector.body
	; CHECK-NEXT: // =>This Inner Loop Header: Depth=1			; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: ldp d3, d4, [x8, #-8]			; CHECK-NEXT: ldp d3, d4, [x8, #-8]
	; CHECK-NEXT: subs x12, x12, #16			; CHECK-NEXT: subs x12, x12, #16
	; CHECK-NEXT: add x8, x8, #16			; CHECK-NEXT: add x8, x8, #16
	; CHECK-NEXT: ushll v3.8h, v3.8b, #0			; CHECK-NEXT: ushll v3.8h, v3.8b, #0
	; CHECK-NEXT: ushll v4.8h, v4.8b, #0			; CHECK-NEXT: ushll v4.8h, v4.8b, #0
	; CHECK-NEXT: mla v0.8h, v2.8h, v3.8h			; CHECK-NEXT: mla v0.8h, v1.8h, v3.8h
	; CHECK-NEXT: mla v1.8h, v2.8h, v4.8h			; CHECK-NEXT: mla v2.8h, v1.8h, v4.8h
	; CHECK-NEXT: b.ne .LBB5_5			; CHECK-NEXT: b.ne .LBB5_5
	; CHECK-NEXT: // %bb.6: // %middle.block			; CHECK-NEXT: // %bb.6: // %middle.block
	; CHECK-NEXT: add v0.8h, v1.8h, v0.8h			; CHECK-NEXT: add v0.8h, v2.8h, v0.8h
	; CHECK-NEXT: cmp x11, x10			; CHECK-NEXT: cmp x11, x10
	; CHECK-NEXT: addv h0, v0.8h			; CHECK-NEXT: addv h0, v0.8h
	; CHECK-NEXT: fmov w8, s0			; CHECK-NEXT: fmov w8, s0
	; CHECK-NEXT: b.eq .LBB5_9			; CHECK-NEXT: b.eq .LBB5_9
	; CHECK-NEXT: .LBB5_7: // %for.body.preheader1			; CHECK-NEXT: .LBB5_7: // %for.body.preheader1
	; CHECK-NEXT: sub x10, x10, x11			; CHECK-NEXT: sub x10, x10, x11
	; CHECK-NEXT: add x11, x0, x11			; CHECK-NEXT: add x11, x0, x11
	; CHECK-NEXT: .LBB5_8: // %for.body			; CHECK-NEXT: .LBB5_8: // %for.body
	[75 lines not shown]