This is an archive of the discontinued LLVM Phabricator instance.


Differential D86114

[SVE] Lower fixed length vXi8/vXi16 SDIV to scalable
ClosedPublic

Authored by cameron.mcinally on Aug 17 2020, 3:01 PM.

Details

Reviewers

paulwalker-arm
efriedma
david-arm
rengolin

Commits

rGac6395946060: [SVE] Lower fixed length vXi8/vXi16 SDIV to scalable

Summary

There are no nxv16i8/nxv8i16 SDIV instructions, so these fixed width operations must be promoted to nxv4i32. This required adding an OverrideNEON flag to LowerToScalableOp(...), so that we can make use of the existing scalable lowering for the smaller vectors (e.g. v8i8).
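As a rough illustration of what the OverrideNEON flag changes, the sketch below models the decision as a standalone predicate. It is a simplified, hypothetical stand-in, not the actual useSVEForFixedLengthVectorVT implementation: the name useSveForFixedVector and its integer parameters are assumptions made for this example. In this model, NEON-sized vectors (64 or 128 bits) go to SVE only when the override is requested, and SVE is not used at all when its registers are no bigger than NEON's.

```cpp
#include <cassert>

// Hypothetical model (not LLVM's code): should a fixed-length vector of
// `vecBits` bits be lowered via SVE, given the guaranteed minimum SVE
// register size `minSveBits`?
bool useSveForFixedVector(unsigned vecBits, unsigned minSveBits,
                          bool overrideNeon) {
  assert(vecBits != 0 && "vector must have a non-zero size");
  // SVE registers no bigger than NEON: keep everything on NEON.
  if (minSveBits <= 128)
    return false;
  // NEON-sized vectors stay on NEON unless the caller overrides that.
  if (vecBits <= 128)
    return overrideNeon && (vecBits == 64 || vecBits == 128);
  // Assumption of this model: wider vectors use SVE when they fit in one
  // register.
  return vecBits <= minSveBits;
}
```

With a 256-bit minimum, a v8i8 operand (64 bits) would only be routed to the SVE lowering when the override is set, which is the behaviour the summary relies on for the small SDIV cases.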

Also notice that a new test file was needed. The existing SDIV tests live in sve-fixed-length-int-arith.ll, which uses the legalization-style FileCheck macros. But the complicated lowering of these operations really needs the "end result"-style macros. If this patch is accepted, I'll move the vXi32/vXi64 tests to this new file as well.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

cameron.mcinally created this revision. Aug 17 2020, 3:01 PM

Herald added a reviewer: rengolin. Aug 17 2020, 3:01 PM

Herald added a project: Restricted Project.

Herald added subscribers: llvm-commits, psnobl, hiraditya, tschuett.

cameron.mcinally requested review of this revision. Aug 17 2020, 3:01 PM

Also notice that a new test file was needed. The existing SDIV tests live in sve-fixed-length-int-arith.ll, which uses the legalization-style FileCheck macros. But the complicated lowering of these operations really needs the "end result"-style macros. If this patch is accepted, I'll move the vXi32/vXi64 tests to this new file as well.

I don't see any obvious difference in the RUN lines?

Harbormaster completed remote builds in B68682: Diff 286152. Aug 17 2020, 3:47 PM

In D86114#2222692, @efriedma wrote:

I don't see any obvious difference in the RUN lines?

It's the VBITS_* differences. The VBITS_LE_* prefixes are better for checking the legalizations (really splitting), when the current VL > the fixed VL. The VBITS_GE_* prefixes are better for ignoring the legalizations, i.e. the current VL <= the fixed VL.

For the i8/i16 SDIV cases, the lowerings are pretty long, so it's nicer to check the cases where legalization was boring.
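To make the VBYTES values concrete: judging only from the RUN lines in this revision, VBYTES appears to be the largest power-of-two register size that does not exceed -aarch64-sve-vector-bits-min, expressed in bytes, and the ptrue element count is checked as min(VBYTES, N). The sketch below encodes that reading. The helper names are hypothetical and the rule is inferred from the test file, not taken from the compiler.

```cpp
#include <algorithm>
#include <cassert>

// Largest power of two that is <= bits (bits must be non-zero).
unsigned floorPowerOfTwo(unsigned bits) {
  assert(bits != 0 && "bits must be non-zero");
  unsigned p = 1;
  while (p <= bits / 2)
    p *= 2;
  return p;
}

// VBYTES as the RUN lines appear to define it, e.g. 384 bits -> 32 bytes.
unsigned vbytesFor(unsigned minVectorBits) {
  return floorPowerOfTwo(minVectorBits) / 8;
}

// Mirrors the FileCheck expression vl[[#min(VBYTES,N)]].
unsigned ptrueElementCount(unsigned vbytes, unsigned n) {
  return std::min(vbytes, n);
}
```

For example, vbytesFor(640) yields 64, matching the -D#VBYTES=64 passed on the 640-bit RUN line.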

paulwalker-arm added inline comments. Aug 18 2020, 8:57 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8953	I just wanted to double check that you are aware this is going to result in i8/i16 fixed length vector divides being different to the i32/i64 ones. The latter being predicated with the former cases not (or rather using an "all true" predicate). Given divides are rarely cheap I prefer the predicated route but I guess there's no reason to be consistent at this stage.

cameron.mcinally added inline comments. Aug 18 2020, 9:19 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8953	Oh, for the smaller vectors that are passed in regs? Yeah, I didn't catch that. Any clue why that's happening? Is it because there are no loads/stores to lower?

cameron.mcinally marked an inline comment as not done. Aug 18 2020, 9:22 AM

cameron.mcinally added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8953	Oh, I see what you're saying now. The second predicate register is being generated all1s. That's surprising. Ultimately, both the i32/i64 and i8/i16 cases should be going through `return LowerToPredicatedOp(Op, DAG, PredOpcode);`. I'll see if I can find the difference.

Ok, I see it now. We have to explicitly call LowerToPredicatedOp(...) with the fixed types still intact, so that getPredicateForVector(...) will generate the correct predicate. Will update...

paulwalker-arm added inline comments. Aug 18 2020, 9:29 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8953	I see you're finding the answers quicker than I can type them :) For what it's worth I've nothing against the current patch since it is simple and works, I just wanted to ensure you were aware what was going on.

Updated DIV lowering to make use of correct fixed length predicates.

The lowering is somewhat convoluted, so I'd like to hear if anyone has a better solution. I've attempted to extend the operands and truncate the results as scalable, but leave the DIVs fixed until they reach vXi32. This allows the intermediate fixed DIVs to be further lowered.

There's probably some room for code reuse with other instructions that do not have i8/i16 vector support, but I don't have a good view of what those instructions look like right now (if they exist).
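The extend, divide, truncate sequence described in this update can be modelled lane by lane. The following is only a sketch under stated assumptions: lanes are held in a std::vector rather than SVE registers, the fixed-length vector is assumed to fill the register exactly, and all names (Lanes, unpackLo, unpackHi, truncConcat, sdivLanes) are hypothetical. It shows the recursion implied by the text: i8 lanes are split and widened to i16, those are split and widened again to i32 where the divide happens, and the results are narrowed and rejoined on the way back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One value per lane, always stored sign-extended to 32 bits.
using Lanes = std::vector<int32_t>;

// Models SUNPKLO / SUNPKHI: take the low or high half of the lanes. Values
// are already sign-extended, so only the lane selection is modelled.
Lanes unpackLo(const Lanes &v) {
  return Lanes(v.begin(), v.begin() + v.size() / 2);
}
Lanes unpackHi(const Lanes &v) {
  return Lanes(v.begin() + v.size() / 2, v.end());
}

// Truncates a value to `bits` (8 or 16) and sign-extends it back.
int32_t truncateTo(int32_t x, unsigned bits) {
  return bits == 8 ? static_cast<int32_t>(static_cast<int8_t>(x))
                   : static_cast<int32_t>(static_cast<int16_t>(x));
}

// Models the truncating concatenation done with UZP1: narrow both halves to
// `bits` and place the low half first.
Lanes truncConcat(const Lanes &lo, const Lanes &hi, unsigned bits) {
  Lanes out;
  for (int32_t x : lo) out.push_back(truncateTo(x, bits));
  for (int32_t x : hi) out.push_back(truncateTo(x, bits));
  return out;
}

// Signed divide of lanes that are `bits` wide (8, 16 or 32). Divisors must be
// non-zero; this model does not define division by zero.
Lanes sdivLanes(const Lanes &a, const Lanes &b, unsigned bits) {
  assert(a.size() == b.size() && "operands must have the same lane count");
  if (bits == 32) {
    Lanes out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] / b[i];
    return out;
  }
  assert((bits == 8 || bits == 16) && a.size() % 2 == 0);
  Lanes lo = sdivLanes(unpackLo(a), unpackLo(b), bits * 2);
  Lanes hi = sdivLanes(unpackHi(a), unpackHi(b), bits * 2);
  return truncConcat(lo, hi, bits);
}
```

Calling sdivLanes on eight i8 lanes performs four 32-bit divides on two lanes each, which corresponds to the four sdiv/sdivr instructions checked in the sdiv_v8i8 test.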

cameron.mcinally marked 2 inline comments as done. Aug 19 2020, 10:57 AM

LGTM.

The end result is a bit uglier than I would have hoped for, but I can't think of any particularly better way given the constraints.

This revision is now accepted and ready to land. Aug 19 2020, 1:01 PM

In D86114#2226955, @efriedma wrote:

The end result is a bit uglier than I would have hoped for, but I can't think of any particularly better way given the constraints.

Agreed. I'll wait for Paul to have a look to see if he has any tricks.

We also need a peep for the smaller vectors, where the extend can fit in one register.

In D86114#2227054, @cameron.mcinally wrote:

In D86114#2226955, @efriedma wrote:

The end result is a bit uglier than I would have hoped for, but I can't think of any particularly better way given the constraints.

Agreed. I'll wait for Paul to have a look to see if he has any tricks.

I don't think there's a very different approach because you really want custom type legalisation for a legal type, which would be nice but doesn't exist. That said, the custom lowering for i8/i16 doesn't have to be SVE specific, as there's the alternative approach of using normal ISD nodes to do the widening so that only the final SDIV lowering is SVE specific. I'm thinking of the following:

// Split each operand into its low and high halves.
std::tie(Op0Lo, Op0Hi) = DAG.SplitVector(Op.getOperand(0), DL);
std::tie(Op1Lo, Op1Hi) = DAG.SplitVector(Op.getOperand(1), DL);

// Widen every half. A signed divide needs sign-extended operands; Signed is
// assumed to be computed from the opcode, as in LowerDIV.
unsigned ExtOpcode = Signed ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND;
Op0Lo = DAG.getNode(ExtOpcode, DL, FixedWidenedVT, Op0Lo);
Op1Lo = DAG.getNode(ExtOpcode, DL, FixedWidenedVT, Op1Lo);
Op0Hi = DAG.getNode(ExtOpcode, DL, FixedWidenedVT, Op0Hi);
Op1Hi = DAG.getNode(ExtOpcode, DL, FixedWidenedVT, Op1Hi);

// Divide the matching halves: operand 0 by operand 1.
ResultLo = DAG.getNode(Op.getOpcode(), DL, FixedWidenedVT, Op0Lo, Op1Lo);
ResultHi = DAG.getNode(Op.getOpcode(), DL, FixedWidenedVT, Op0Hi, Op1Hi);

// Narrow the results back to the original element type and rejoin them.
ResultLo = DAG.getNode(ISD::TRUNCATE, DL, HalfVT, ResultLo);
ResultHi = DAG.getNode(ISD::TRUNCATE, DL, HalfVT, ResultHi);

return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, ResultLo, ResultHi);

Which I think looks nicer and might automatically benefit from optimising the "split" away when the resulting expanded operands can fit within a single register. The big downside to this is that bigger than NEON fixed length EXTRACT_SUBVECTOR and CONCAT_VECTORS support doesn't exist so it'll either not work today or result in terrible code.

Based on this I'm happy to take the above as a refactoring exercise (which I'm happy to do if you want) and have the patch landed in its current form.
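When operands are widened with generic extend nodes, as in the sketch above, the kind of extension matters for a signed divide. The scalar example below is an illustration only (the function names are hypothetical): it widens two i8 values to 32 bits, divides, and truncates the quotient, once with sign extension and once with zero extension. The two agree for non-negative inputs and diverge as soon as an operand is negative.

```cpp
#include <cassert>
#include <cstdint>

// Widen with sign extension, divide in 32 bits, truncate the quotient.
// The divisor must be non-zero.
int8_t sdivViaSignExtend(int8_t a, int8_t b) {
  assert(b != 0 && "division by zero is not modelled");
  int32_t wideA = a;  // implicit conversion sign-extends
  int32_t wideB = b;
  return static_cast<int8_t>(wideA / wideB);
}

// Same sequence, but widening with zero extension: the bit pattern of a
// negative i8 is reinterpreted as a value in [128, 255] before the divide.
int8_t sdivViaZeroExtend(int8_t a, int8_t b) {
  assert(b != 0 && "division by zero is not modelled");
  int32_t wideA = static_cast<uint8_t>(a);
  int32_t wideB = static_cast<uint8_t>(b);
  return static_cast<int8_t>(wideA / wideB);
}
```

For -6 / 3, the sign-extending version produces -2, while the zero-extending version computes 250 / 3 and produces 83.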

In D86114#2228977, @paulwalker-arm wrote:

That said the custom lowering for i8/i16 doesn't have to be SVE specific as there's the alternative approach of using normal ISD nodes to do the widening so that only the final SDIV lowering is SVE specific.

I tried that at first, but the NEON assembly was pretty ugly. I may have made a mistake there, but if not, it was all scalarized. The SVE lowering is uglier, but the generated code seems cleaner.

Extending fixed width vectors
=======================

sdiv_v8i8:
	.cfi_startproc
	umov	w11, v1.b[1]
	umov	w9, v1.b[2]
	umov	w13, v0.b[1]
	fmov	s2, w11
	umov	w10, v1.b[3]
	umov	w8, v1.b[4]
	umov	w12, v0.b[2]
	umov	w14, v0.b[3]
	fmov	s3, w13
	umov	w11, v0.b[4]
	zip1	v1.8b, v1.8b, v0.8b
	zip1	v0.8b, v0.8b, v0.8b
	mov	v2.h[1], w9
	shl	v1.4h, v1.4h, #8
	shl	v0.4h, v0.4h, #8
	mov	v3.h[1], w12
	mov	v2.h[2], w10
	sshr	v1.4h, v1.4h, #8
	sshr	v0.4h, v0.4h, #8
	mov	v3.h[2], w14
	mov	v2.h[3], w8
	umov	w13, v1.h[0]
	umov	w8, v0.h[0]
	mov	v3.h[3], w11
	shl	v2.4h, v2.4h, #8
	umov	w9, v1.h[1]
	umov	w10, v1.h[2]
	umov	w12, v0.h[1]
	umov	w11, v0.h[2]
	fmov	s0, w13
	fmov	s1, w8
	shl	v3.4h, v3.4h, #8
	sshr	v2.4h, v2.4h, #8
	mov	v0.s[1], w9
	fmov	s4, w9
	mov	v1.s[1], w12
	umov	w8, v2.h[1]
	umov	w9, v2.h[2]
	umov	w13, v2.h[0]
	fmov	s2, w12
	sshr	v3.4h, v3.4h, #8
	mov	v4.s[1], w10
	mov	v2.s[1], w11
	umov	w12, v3.h[0]
	shl	v0.2s, v0.2s, #16
	shl	v1.2s, v1.2s, #16
	umov	w10, v3.h[1]
	umov	w11, v3.h[2]
	fmov	s3, w13
	sshr	v0.2s, v0.2s, #16
	fmov	s5, w12
	sshr	v1.2s, v1.2s, #16
	ptrue	p0.s, vl2
	shl	v4.2s, v4.2s, #16
	shl	v2.2s, v2.2s, #16
	mov	v3.s[1], w8
	fmov	s6, w8
	sshr	v4.2s, v4.2s, #16
	sdivr	z0.s, p0/m, z0.s, z1.s
	sshr	v1.2s, v2.2s, #16
	mov	v5.s[1], w10
	fmov	s2, w10
	sdiv	z1.s, p0/m, z1.s, z4.s
	mov	v6.s[1], w9
	mov	v2.s[1], w11
	shl	v3.2s, v3.2s, #16
	shl	v4.2s, v5.2s, #16
	shl	v5.2s, v6.2s, #16
	shl	v2.2s, v2.2s, #16
	sshr	v3.2s, v3.2s, #16
	sshr	v4.2s, v4.2s, #16
	sdivr	z3.s, p0/m, z3.s, z4.s
	sshr	v4.2s, v5.2s, #16
	sshr	v2.2s, v2.2s, #16
	sdiv	z2.s, p0/m, z2.s, z4.s
	uzp1	v2.4h, v3.4h, v2.4h
	uzp1	v0.4h, v0.4h, v1.4h
	uzp1	v0.8b, v0.8b, v2.8b
	ret

Oh, but to be fair, I didn't use DAG.SplitVector(Op.getOperand(0), DL), so that may avoid some of the ugly expanding.

The expanding is because we don't yet attempt to lower the subvector and concat_vector operations. As I say, I think today the correct move is to take this patch and then see what the future holds when we have full support for subvec and concat.

Closed by commit rGac6395946060: [SVE] Lower fixed length vXi8/vXi16 SDIV to scalable (authored by cameron.mcinally). Aug 20 2020, 11:47 AM

This revision was automatically updated to reflect the committed changes.

cameron.mcinally added a commit: rGac6395946060: [SVE] Lower fixed length vXi8/vXi16 SDIV to scalable.

cameron.mcinally mentioned this in D86316: [SVE] Lower fixed length UDIV to scalable. Aug 20 2020, 1:14 PM

cameron.mcinally mentioned this in rG36dbb8fc972f: [SVE] Lower fixed length UDIV to scalable. Aug 21 2020, 7:01 AM

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.h	2 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	70 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-int-div.ll	337 lines

Diff 286869

llvm/lib/Target/AArch64/AArch64ISelLowering.h

... (lines elided; context: private:)
   SDValue LowerATOMIC_LOAD_AND(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerDYNAMIC_STACKALLOC(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerWindowsDYNAMIC_STACKALLOC(SDValue Op, SDValue Chain,
                                          SDValue &Size,
                                          SelectionDAG &DAG) const;
   SDValue LowerSVEStructLoad(unsigned Intrinsic, ArrayRef<SDValue> LoadOps,
                              EVT VT, SelectionDAG &DAG, const SDLoc &DL) const;

+  SDValue LowerFixedLengthVectorIntDivideToSVE(SDValue Op,
+                                               SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorIntExtendToSVE(SDValue Op,
                                                SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorLoadToSVE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorSetccToSVE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorStoreToSVE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorTruncateToSVE(SDValue Op,
                                               SelectionDAG &DAG) const;

... (lines elided)

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

... (lines elided; context: if (useSVEForFixedLengthVectors()) {)
     for (auto VT : {MVT::v16i8, MVT::v8i16, MVT::v4i32})
       setOperationAction(ISD::TRUNCATE, VT, Custom);
     for (auto VT : {MVT::v8f16, MVT::v4f32})
       setOperationAction(ISD::FP_ROUND, VT, Expand);

     // These operations are not supported on NEON but SVE can do them.
     setOperationAction(ISD::MUL, MVT::v1i64, Custom);
     setOperationAction(ISD::MUL, MVT::v2i64, Custom);
+    setOperationAction(ISD::SDIV, MVT::v8i8, Custom);
+    setOperationAction(ISD::SDIV, MVT::v16i8, Custom);
+    setOperationAction(ISD::SDIV, MVT::v4i16, Custom);
+    setOperationAction(ISD::SDIV, MVT::v8i16, Custom);
     setOperationAction(ISD::SDIV, MVT::v2i32, Custom);
     setOperationAction(ISD::SDIV, MVT::v4i32, Custom);
     setOperationAction(ISD::SDIV, MVT::v1i64, Custom);
     setOperationAction(ISD::SDIV, MVT::v2i64, Custom);
     setOperationAction(ISD::SMAX, MVT::v1i64, Custom);
     setOperationAction(ISD::SMAX, MVT::v2i64, Custom);
     setOperationAction(ISD::SMIN, MVT::v1i64, Custom);
     setOperationAction(ISD::SMIN, MVT::v2i64, Custom);
... (lines elided; context: void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT) {)
   setOperationAction(ISD::FMA, VT, Custom);
   setOperationAction(ISD::FMAXNUM, VT, Custom);
   setOperationAction(ISD::FMINNUM, VT, Custom);
   setOperationAction(ISD::FMUL, VT, Custom);
   setOperationAction(ISD::FSUB, VT, Custom);
   setOperationAction(ISD::LOAD, VT, Custom);
   setOperationAction(ISD::MUL, VT, Custom);
   setOperationAction(ISD::OR, VT, Custom);
+  setOperationAction(ISD::SDIV, VT, Custom);
   setOperationAction(ISD::SETCC, VT, Custom);
   setOperationAction(ISD::SHL, VT, Custom);
   setOperationAction(ISD::SIGN_EXTEND, VT, Custom);
   setOperationAction(ISD::SIGN_EXTEND_INREG, VT, Custom);
   setOperationAction(ISD::SMAX, VT, Custom);
   setOperationAction(ISD::SMIN, VT, Custom);
   setOperationAction(ISD::SPLAT_VECTOR, VT, Custom);
   setOperationAction(ISD::SRA, VT, Custom);
   setOperationAction(ISD::SRL, VT, Custom);
   setOperationAction(ISD::STORE, VT, Custom);
   setOperationAction(ISD::SUB, VT, Custom);
   setOperationAction(ISD::TRUNCATE, VT, Custom);
   setOperationAction(ISD::UMAX, VT, Custom);
   setOperationAction(ISD::UMIN, VT, Custom);
   setOperationAction(ISD::XOR, VT, Custom);
   setOperationAction(ISD::ZERO_EXTEND, VT, Custom);
-
-  if (VT.getVectorElementType() == MVT::i32 ||
-      VT.getVectorElementType() == MVT::i64)
-    setOperationAction(ISD::SDIV, VT, Custom);
 }

 void AArch64TargetLowering::addDRTypeForNEON(MVT VT) {
   addRegisterClass(VT, &AArch64::FPR64RegClass);
   addTypeForNEON(VT, MVT::v2i32);
 }

 void AArch64TargetLowering::addQRTypeForNEON(MVT VT) {
... (lines elided; context: SDValue AArch64TargetLowering::LowerINSERT_SUBVECTOR(SDValue Op,)
   if (Idx == 0 && isPackedVectorType(InVT, DAG) && Op.getOperand(0).isUndef())
     return Op;

   return SDValue();
 }

 SDValue AArch64TargetLowering::LowerDIV(SDValue Op, SelectionDAG &DAG) const {
   EVT VT = Op.getValueType();

+  if (useSVEForFixedLengthVectorVT(VT, /*OverrideNEON=*/true))
+    return LowerFixedLengthVectorIntDivideToSVE(Op, DAG);
+
+  assert(VT.isScalableVector() && "Expected a scalable vector.");
+
   bool Signed = Op.getOpcode() == ISD::SDIV;
   unsigned PredOpcode = Signed ? AArch64ISD::SDIV_PRED : AArch64ISD::UDIV_PRED;

-  if (useSVEForFixedLengthVectorVT(Op.getValueType(), /*OverrideNEON=*/true) &&
-      (VT.getVectorElementType() == MVT::i32 ||
-       VT.getVectorElementType() == MVT::i64))
-    return LowerToPredicatedOp(Op, DAG, PredOpcode, /*OverrideNEON=*/true);
-
   if (VT == MVT::nxv4i32 || VT == MVT::nxv2i64)
     return LowerToPredicatedOp(Op, DAG, PredOpcode);

   // SVE doesn't have i8 and i16 DIV operations; widen them to 32-bit
   // operations, and truncate the result.
   EVT WidenedVT;
   if (VT == MVT::nxv16i8)
     WidenedVT = MVT::nxv8i16;
   else if (VT == MVT::nxv8i16)
     WidenedVT = MVT::nxv4i32;
   else
     llvm_unreachable("Unexpected Custom DIV operation");

   SDLoc dl(Op);
   unsigned UnpkLo = Signed ? AArch64ISD::SUNPKLO : AArch64ISD::UUNPKLO;
... (lines elided; context: SDValue AArch64TargetLowering::LowerFixedLengthVectorStoreToSVE()
   auto NewValue = convertToScalableVector(DAG, ContainerVT, Store->getValue());
   return DAG.getMaskedStore(
       Store->getChain(), DL, NewValue, Store->getBasePtr(), Store->getOffset(),
       getPredicateForFixedLengthVector(DAG, DL, VT), Store->getMemoryVT(),
       Store->getMemOperand(), Store->getAddressingMode(),
       Store->isTruncatingStore());
 }

+SDValue AArch64TargetLowering::LowerFixedLengthVectorIntDivideToSVE(
+    SDValue Op, SelectionDAG &DAG) const {
+  SDLoc dl(Op);
+  EVT VT = Op.getValueType();
+  EVT EltVT = VT.getVectorElementType();
+
+  bool Signed = Op.getOpcode() == ISD::SDIV;
+  unsigned PredOpcode = Signed ? AArch64ISD::SDIV_PRED : AArch64ISD::UDIV_PRED;
+
+  // Scalable vector i32/i64 DIV is supported.
+  if (EltVT == MVT::i32 || EltVT == MVT::i64)
+    return LowerToPredicatedOp(Op, DAG, PredOpcode, /*OverrideNEON=*/true);
+
+  // Scalable vector i8/i16 DIV is not supported. Promote it to i32.
+  EVT ContainerVT = getContainerForFixedLengthVector(DAG, VT);
+  EVT HalfVT = VT.getHalfNumVectorElementsVT(*DAG.getContext());
+  EVT FixedWidenedVT = HalfVT.widenIntegerVectorElementType(*DAG.getContext());
+  EVT ScalableWidenedVT = getContainerForFixedLengthVector(DAG, FixedWidenedVT);
+
+  // Convert the operands to scalable vectors.
+  SDValue Op0 = convertToScalableVector(DAG, ContainerVT, Op.getOperand(0));
+  SDValue Op1 = convertToScalableVector(DAG, ContainerVT, Op.getOperand(1));
+
+  // Extend the scalable operands.
+  unsigned UnpkLo = Signed ? AArch64ISD::SUNPKLO : AArch64ISD::UUNPKLO;
+  unsigned UnpkHi = Signed ? AArch64ISD::SUNPKHI : AArch64ISD::UUNPKHI;
+  SDValue Op0Lo = DAG.getNode(UnpkLo, dl, ScalableWidenedVT, Op0);
+  SDValue Op1Lo = DAG.getNode(UnpkLo, dl, ScalableWidenedVT, Op1);
+  SDValue Op0Hi = DAG.getNode(UnpkHi, dl, ScalableWidenedVT, Op0);
+  SDValue Op1Hi = DAG.getNode(UnpkHi, dl, ScalableWidenedVT, Op1);
+
+  // Convert back to fixed vectors so the DIV can be further lowered.
+  Op0Lo = convertFromScalableVector(DAG, FixedWidenedVT, Op0Lo);
+  Op1Lo = convertFromScalableVector(DAG, FixedWidenedVT, Op1Lo);
+  Op0Hi = convertFromScalableVector(DAG, FixedWidenedVT, Op0Hi);
+  Op1Hi = convertFromScalableVector(DAG, FixedWidenedVT, Op1Hi);
+  SDValue ResultLo = DAG.getNode(Op.getOpcode(), dl, FixedWidenedVT,
+                                 Op0Lo, Op1Lo);
+  SDValue ResultHi = DAG.getNode(Op.getOpcode(), dl, FixedWidenedVT,
+                                 Op0Hi, Op1Hi);
+
+  // Convert again to scalable vectors to truncate.
+  ResultLo = convertToScalableVector(DAG, ScalableWidenedVT, ResultLo);
+  ResultHi = convertToScalableVector(DAG, ScalableWidenedVT, ResultHi);
+  SDValue ScalableResult = DAG.getNode(AArch64ISD::UZP1, dl, ContainerVT,
+                                       ResultLo, ResultHi);
+
+  return convertFromScalableVector(DAG, VT, ScalableResult);
+}
+
 SDValue AArch64TargetLowering::LowerFixedLengthVectorIntExtendToSVE(
     SDValue Op, SelectionDAG &DAG) const {
   EVT VT = Op.getValueType();
   assert(VT.isFixedLengthVector() && "Expected fixed length vector type!");

   SDLoc DL(Op);
   SDValue Val = Op.getOperand(0);
   EVT ContainerVT = getContainerForFixedLengthVector(DAG, Val.getValueType());
... (lines elided)

llvm/test/CodeGen/AArch64/sve-fixed-length-int-div.ll

This file was added.

				; RUN: llc -aarch64-sve-vector-bits-min=128 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=16 -check-prefix=NO_SVE
				; RUN: llc -aarch64-sve-vector-bits-min=256 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK,VBITS_EQ_256
				; RUN: llc -aarch64-sve-vector-bits-min=384 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK
				; RUN: llc -aarch64-sve-vector-bits-min=512 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=640 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=768 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=896 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=1024 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1152 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1280 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1408 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1536 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1664 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1792 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1920 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=2048 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=256 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024,VBITS_GE_2048

				target triple = "aarch64-unknown-linux-gnu"

				; Don't use SVE when its registers are no bigger than NEON.
				; NO_SVE-NOT: ptrue

				;
				; SDIV
				;

				; Vector vXi8 sdiv are not legal for NEON so use SVE when available.
				define <8 x i8> @sdiv_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: sdiv_v8i8:
				; CHECK: sunpkhi [[OP2_HI:z[0-9]+]].h, z1.b
				; CHECK-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].h, z0.b
				; CHECK-NEXT: ptrue [[PG:p[0-9]+]].s, vl[[#min(VBYTES,2)]]
				; CHECK-NEXT: sunpklo [[OP2_LO:z[0-9]+]].h, z1.b
				; CHECK-NEXT: sunpklo [[OP1_LO:z[0-9]+]].h, z0.b
				; CHECK-NEXT: sunpkhi [[OP2_HI_HI:z[0-9]+]].s, [[OP2_HI]].h
				; CHECK-NEXT: sunpkhi [[OP1_HI_HI:z[0-9]+]].s, [[OP1_HI]].h
				; CHECK-NEXT: sunpklo [[OP2_HI_LO:z[0-9]+]].s, [[OP2_HI]].h
				; CHECK-NEXT: sunpklo [[OP1_HI_LO:z[0-9]+]].s, [[OP1_HI]].h
				; CHECK-NEXT: sdivr [[RES_HI_HI:z[0-9]+]].s, [[PG]]/m, [[OP2_HI_HI]].s, [[OP1_HI_HI]].s
				; CHECK-NEXT: sunpkhi [[OP2_LO_HI:z[0-9]+]].s, [[OP2_LO]].h
				; CHECK-NEXT: sdivr [[RES_HI_LO:z[0-9]+]].s, [[PG]]/m, [[OP2_HI_LO]].s, [[OP1_HI_LO]].s
				; CHECK-NEXT: sunpkhi [[OP1_LO_HI:z[0-9]+]].s, z0.h
				; CHECK-NEXT: sunpklo [[OP2_LO_LO:z[0-9]+]].s, z1.h
				; CHECK-NEXT: sunpklo [[OP1_LO_LO:z[0-9]+]].s, z0.h
				; CHECK-NEXT: sdiv [[RES_LO_HI:z[0-9]+]].s, [[PG]]/m, [[OP1_LO_HI]].s, [[OP2_LO_HI]].s
				; CHECK-NEXT: sdiv [[RES_LO_LO:z[0-9]+]].s, [[PG]]/m, [[OP1_LO_LO]].s, [[OP2_LO_LO]].s
				; CHECK-NEXT: uzp1 [[RES_HI:z[0-9]+]].h, [[RES_HI_LO]].h, [[RES_HI_HI]].h
				; CHECK-NEXT: uzp1 [[RES_LO:z[0-9]+]].h, [[RES_LO_LO]].h, [[RES_LO_HI]].h
				; CHECK-NEXT: uzp1 z0.b, [[RES_LO]].b, [[RES_HI]].b
				; CHECK: ret
				%res = sdiv <8 x i8> %op1, %op2
				ret <8 x i8> %res
				}

				define <16 x i8> @sdiv_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: sdiv_v16i8:
				; CHECK: sunpkhi [[OP2_HI:z[0-9]+]].h, z1.b
				; CHECK-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].h, z0.b
				; CHECK-NEXT: ptrue [[PG:p[0-9]+]].s, vl[[#min(VBYTES,4)]]
				; CHECK-NEXT: sunpklo [[OP2_LO:z[0-9]+]].h, z1.b
				; CHECK-NEXT: sunpklo [[OP1_LO:z[0-9]+]].h, z0.b
				; CHECK-NEXT: sunpkhi [[OP2_HI_HI:z[0-9]+]].s, [[OP2_HI]].h
				; CHECK-NEXT: sunpkhi [[OP1_HI_HI:z[0-9]+]].s, [[OP1_HI]].h
				; CHECK-NEXT: sunpklo [[OP2_HI_LO:z[0-9]+]].s, [[OP2_HI]].h
				; CHECK-NEXT: sunpklo [[OP1_HI_LO:z[0-9]+]].s, [[OP1_HI]].h
				; CHECK-NEXT: sdivr [[RES_HI_HI:z[0-9]+]].s, [[PG]]/m, [[OP2_HI_HI]].s, [[OP1_HI_HI]].s
				; CHECK-NEXT: sunpkhi [[OP2_LO_HI:z[0-9]+]].s, [[OP2_LO]].h
				; CHECK-NEXT: sdivr [[RES_HI_LO:z[0-9]+]].s, [[PG]]/m, [[OP2_HI_LO]].s, [[OP1_HI_LO]].s
				; CHECK-NEXT: sunpkhi [[OP1_LO_HI:z[0-9]+]].s, z0.h
				; CHECK-NEXT: sunpklo [[OP2_LO_LO:z[0-9]+]].s, z1.h
				; CHECK-NEXT: sunpklo [[OP1_LO_LO:z[0-9]+]].s, z0.h
				; CHECK-NEXT: sdiv [[RES_LO_HI:z[0-9]+]].s, [[PG]]/m, [[OP1_LO_HI]].s, [[OP2_LO_HI]].s
				; CHECK-NEXT: sdiv [[RES_LO_LO:z[0-9]+]].s, [[PG]]/m, [[OP1_LO_LO]].s, [[OP2_LO_LO]].s
				; CHECK-NEXT: uzp1 [[RES_HI:z[0-9]+]].h, [[RES_HI_LO]].h, [[RES_HI_HI]].h
				; CHECK-NEXT: uzp1 [[RES_LO:z[0-9]+]].h, [[RES_LO_LO]].h, [[RES_LO_HI]].h
				; CHECK-NEXT: uzp1 z0.b, [[RES_LO]].b, [[RES_HI]].b
				; CHECK: ret
				%res = sdiv <16 x i8> %op1, %op2
				ret <16 x i8> %res
				}

				define void @sdiv_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: sdiv_v32i8:
				; VBITS_GE_512: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,32)]]
				; VBITS_GE_512-NEXT: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_GE_256: ptrue [[PG1:p[0-9]+]].s, vl[[#min(VBYTES,8)]]
				; VBITS_GE_256-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_256-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_256-NEXT: sunpklo [[OP2_LO:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_256-NEXT: sunpklo [[OP1_LO:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_256-NEXT: sunpkhi [[OP2_HI_HI:z[0-9]+]].s, [[OP2_HI]].h
				; VBITS_GE_256-NEXT: sunpkhi [[OP1_HI_HI:z[0-9]+]].s, [[OP1_HI]].h
				; VBITS_GE_256-NEXT: sunpklo [[OP2_HI_LO:z[0-9]+]].s, [[OP2_HI]].h
				; VBITS_GE_256-NEXT: sunpklo [[OP1_HI_LO:z[0-9]+]].s, [[OP1_HI]].h
				; VBITS_GE_256-NEXT: sdivr [[RES_HI_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_HI]].s, [[OP1_HI_HI]].s
				; VBITS_GE_256-NEXT: sunpkhi [[OP2_LO_HI:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_256-NEXT: sdivr [[RES_HI_LO:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_LO]].s, [[OP1_HI_LO]].s
				; VBITS_GE_256-NEXT: sunpkhi [[OP1_LO_HI:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_256-NEXT: sunpklo [[OP2_LO_LO:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_256-NEXT: sunpklo [[OP1_LO_LO:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_256-NEXT: sdiv [[RES_LO_HI:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_HI]].s, [[OP2_LO_HI]].s
				; VBITS_GE_256-NEXT: sdiv [[RES_LO_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_LO]].s, [[OP2_LO_LO]].s
				; VBITS_GE_256-NEXT: uzp1 [[RES_HI:z[0-9]+]].h, [[RES_HI_LO]].h, [[RES_HI_HI]].h
				; VBITS_GE_256-NEXT: uzp1 [[RES_LO:z[0-9]+]].h, [[RES_LO_LO]].h, [[RES_LO_HI]].h
				; VBITS_GE_256-NEXT: uzp1 [[RES:z[0-9]+]].b, [[RES_LO]].b, [[RES_HI]].b
				; VBITS_GE_256-NEXT: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_GE_256-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%res = sdiv <32 x i8> %op1, %op2
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define void @sdiv_v64i8(<64 x i8>* %a, <64 x i8>* %b) #0 {
				; CHECK-LABEL: sdiv_v64i8:
				; VBITS_GE_512: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,64)]]
				; VBITS_GE_512-NEXT: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_GE_512-NEXT: ptrue [[PG1:p[0-9]+]].s, vl[[#min(VBYTES,16)]]
				; VBITS_GE_512-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_512-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_512-NEXT: sunpklo [[OP2_LO:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_512-NEXT: sunpklo [[OP1_LO:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_512-NEXT: sunpkhi [[OP2_HI_HI:z[0-9]+]].s, [[OP2_HI]].h
				; VBITS_GE_512-NEXT: sunpkhi [[OP1_HI_HI:z[0-9]+]].s, [[OP1_HI]].h
				; VBITS_GE_512-NEXT: sunpklo [[OP2_HI_LO:z[0-9]+]].s, [[OP2_HI]].h
				; VBITS_GE_512-NEXT: sunpklo [[OP1_HI_LO:z[0-9]+]].s, [[OP1_HI]].h
				; VBITS_GE_512-NEXT: sdivr [[RES_HI_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_HI]].s, [[OP1_HI_HI]].s
				; VBITS_GE_512-NEXT: sunpkhi [[OP2_LO_HI:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_512-NEXT: sdivr [[RES_HI_LO:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_LO]].s, [[OP1_HI_LO]].s
				; VBITS_GE_512-NEXT: sunpkhi [[OP1_LO_HI:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_512-NEXT: sunpklo [[OP2_LO_LO:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_512-NEXT: sunpklo [[OP1_LO_LO:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_512-NEXT: sdiv [[RES_LO_HI:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_HI]].s, [[OP2_LO_HI]].s
				; VBITS_GE_512-NEXT: sdiv [[RES_LO_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_LO]].s, [[OP2_LO_LO]].s
				; VBITS_GE_512-NEXT: uzp1 [[RES_HI:z[0-9]+]].h, [[RES_HI_LO]].h, [[RES_HI_HI]].h
				; VBITS_GE_512-NEXT: uzp1 [[RES_LO:z[0-9]+]].h, [[RES_LO_LO]].h, [[RES_LO_HI]].h
				; VBITS_GE_512-NEXT: uzp1 [[RES:z[0-9]+]].b, [[RES_LO]].b, [[RES_HI]].b
				; VBITS_GE_512-NEXT: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_GE_512-NEXT: ret
				%op1 = load <64 x i8>, <64 x i8>* %a
				%op2 = load <64 x i8>, <64 x i8>* %b
				%res = sdiv <64 x i8> %op1, %op2
				store <64 x i8> %res, <64 x i8>* %a
				ret void
				}

				define void @sdiv_v128i8(<128 x i8>* %a, <128 x i8>* %b) #0 {
				; CHECK-LABEL: sdiv_v128i8:
				; VBITS_GE_1024: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,128)]]
				; VBITS_GE_1024-NEXT: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_GE_1024-NEXT: ptrue [[PG1:p[0-9]+]].s, vl[[#min(VBYTES,32)]]
				; VBITS_GE_1024-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_1024-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_1024-NEXT: sunpklo [[OP2_LO:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_1024-NEXT: sunpklo [[OP1_LO:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_1024-NEXT: sunpkhi [[OP2_HI_HI:z[0-9]+]].s, [[OP2_HI]].h
				; VBITS_GE_1024-NEXT: sunpkhi [[OP1_HI_HI:z[0-9]+]].s, [[OP1_HI]].h
				; VBITS_GE_1024-NEXT: sunpklo [[OP2_HI_LO:z[0-9]+]].s, [[OP2_HI]].h
				; VBITS_GE_1024-NEXT: sunpklo [[OP1_HI_LO:z[0-9]+]].s, [[OP1_HI]].h
				; VBITS_GE_1024-NEXT: sdivr [[RES_HI_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_HI]].s, [[OP1_HI_HI]].s
				; VBITS_GE_1024-NEXT: sunpkhi [[OP2_LO_HI:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_1024-NEXT: sdivr [[RES_HI_LO:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_LO]].s, [[OP1_HI_LO]].s
				; VBITS_GE_1024-NEXT: sunpkhi [[OP1_LO_HI:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_1024-NEXT: sunpklo [[OP2_LO_LO:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_1024-NEXT: sunpklo [[OP1_LO_LO:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_1024-NEXT: sdiv [[RES_LO_HI:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_HI]].s, [[OP2_LO_HI]].s
				; VBITS_GE_1024-NEXT: sdiv [[RES_LO_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_LO]].s, [[OP2_LO_LO]].s
				; VBITS_GE_1024-NEXT: uzp1 [[RES_HI:z[0-9]+]].h, [[RES_HI_LO]].h, [[RES_HI_HI]].h
				; VBITS_GE_1024-NEXT: uzp1 [[RES_LO:z[0-9]+]].h, [[RES_LO_LO]].h, [[RES_LO_HI]].h
				; VBITS_GE_1024-NEXT: uzp1 [[RES:z[0-9]+]].b, [[RES_LO]].b, [[RES_HI]].b
				; VBITS_GE_1024-NEXT: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_GE_1024-NEXT: ret
				%op1 = load <128 x i8>, <128 x i8>* %a
				%op2 = load <128 x i8>, <128 x i8>* %b
				%res = sdiv <128 x i8> %op1, %op2
				store <128 x i8> %res, <128 x i8>* %a
				ret void
				}

				define void @sdiv_v256i8(<256 x i8>* %a, <256 x i8>* %b) #0 {
				; CHECK-LABEL: sdiv_v256i8:
				; VBITS_GE_2048: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,256)]]
				; VBITS_GE_2048-NEXT: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_GE_2048-NEXT: ptrue [[PG1:p[0-9]+]].s, vl[[#min(VBYTES,64)]]
				; VBITS_GE_2048-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_2048-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_2048-NEXT: sunpklo [[OP2_LO:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_2048-NEXT: sunpklo [[OP1_LO:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_2048-NEXT: sunpkhi [[OP2_HI_HI:z[0-9]+]].s, [[OP2_HI]].h
				; VBITS_GE_2048-NEXT: sunpkhi [[OP1_HI_HI:z[0-9]+]].s, [[OP1_HI]].h
				; VBITS_GE_2048-NEXT: sunpklo [[OP2_HI_LO:z[0-9]+]].s, [[OP2_HI]].h
				; VBITS_GE_2048-NEXT: sunpklo [[OP1_HI_LO:z[0-9]+]].s, [[OP1_HI]].h
				; VBITS_GE_2048-NEXT: sdivr [[RES_HI_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_HI]].s, [[OP1_HI_HI]].s
				; VBITS_GE_2048-NEXT: sunpkhi [[OP2_LO_HI:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_2048-NEXT: sdivr [[RES_HI_LO:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_LO]].s, [[OP1_HI_LO]].s
				; VBITS_GE_2048-NEXT: sunpkhi [[OP1_LO_HI:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_2048-NEXT: sunpklo [[OP2_LO_LO:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_2048-NEXT: sunpklo [[OP1_LO_LO:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_2048-NEXT: sdiv [[RES_LO_HI:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_HI]].s, [[OP2_LO_HI]].s
				; VBITS_GE_2048-NEXT: sdiv [[RES_LO_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_LO]].s, [[OP2_LO_LO]].s
				; VBITS_GE_2048-NEXT: uzp1 [[RES_HI:z[0-9]+]].h, [[RES_HI_LO]].h, [[RES_HI_HI]].h
				; VBITS_GE_2048-NEXT: uzp1 [[RES_LO:z[0-9]+]].h, [[RES_LO_LO]].h, [[RES_LO_HI]].h
				; VBITS_GE_2048-NEXT: uzp1 [[RES:z[0-9]+]].b, [[RES_LO]].b, [[RES_HI]].b
				; VBITS_GE_2048-NEXT: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_GE_2048-NEXT: ret
				%op1 = load <256 x i8>, <256 x i8>* %a
				%op2 = load <256 x i8>, <256 x i8>* %b
				%res = sdiv <256 x i8> %op1, %op2
				store <256 x i8> %res, <256 x i8>* %a
				ret void
				}

				; Vector vXi16 sdiv operations are not legal for NEON, so use SVE when available.
				define <4 x i16> @sdiv_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: sdiv_v4i16:
				; CHECK: sunpkhi [[OP2_HI:z[0-9]+]].s, z1.h
				; CHECK-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].s, z0.h
				; CHECK-NEXT: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,2),2)]]
				; CHECK-NEXT: sunpklo [[OP2_LO:z[0-9]+]].s, z1.h
				; CHECK-NEXT: sunpklo [[OP1_LO:z[0-9]+]].s, z0.h
				; CHECK-NEXT: sdivr [[RES_HI:z[0-9]+]].s, [[PG]]/m, [[OP2_HI]].s, [[OP1_HI]].s
				; CHECK-NEXT: sdiv [[RES_LO:z[0-9]+]].s, [[PG]]/m, [[OP1_LO]].s, [[OP2_LO]].s
				; CHECK-NEXT: uzp1 [[RES:z[0-9]+]].h, [[RES_LO]].h, [[RES_HI]].h
				; CHECK-NEXT: ret
				%res = sdiv <4 x i16> %op1, %op2
				ret <4 x i16> %res
				}

				define <8 x i16> @sdiv_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: sdiv_v8i16:
				; CHECK: sunpkhi [[OP2_HI:z[0-9]+]].s, z1.h
				; CHECK-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].s, z0.h
				; CHECK-NEXT: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,2),4)]]
				; CHECK-NEXT: sunpklo [[OP2_LO:z[0-9]+]].s, z1.h
				; CHECK-NEXT: sunpklo [[OP1_LO:z[0-9]+]].s, z0.h
				; CHECK-NEXT: sdivr [[RES_HI:z[0-9]+]].s, [[PG]]/m, [[OP2_HI]].s, [[OP1_HI]].s
				; CHECK-NEXT: sdiv [[RES_LO:z[0-9]+]].s, [[PG]]/m, [[OP1_LO]].s, [[OP2_LO]].s
				; CHECK-NEXT: uzp1 [[RES:z[0-9]+]].h, [[RES_LO]].h, [[RES_HI]].h
				; CHECK-NEXT: ret
				%res = sdiv <8 x i16> %op1, %op2
				ret <8 x i16> %res
				}

				define void @sdiv_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: sdiv_v16i16:
				; VBITS_GE_256: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),16)]]
				; VBITS_GE_256-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_256-NEXT: ptrue [[PG1:p[0-9]+]].s, vl[[#min(div(VBYTES,2),8)]]
				; VBITS_GE_256-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_256-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_256-NEXT: sunpklo [[OP2_LO:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_256-NEXT: sunpklo [[OP1_LO:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_256-NEXT: sdivr [[RES_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI]].s, [[OP1_HI]].s
				; VBITS_GE_256-NEXT: sdiv [[RES_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO]].s, [[OP2_LO]].s
				; VBITS_GE_256-NEXT: uzp1 [[RES:z[0-9]+]].h, [[RES_LO]].h, [[RES_HI]].h
				; VBITS_GE_256-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_256-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%res = sdiv <16 x i16> %op1, %op2
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define void @sdiv_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; CHECK-LABEL: sdiv_v32i16:
				; VBITS_GE_512: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),32)]]
				; VBITS_GE_512-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_512-NEXT: ptrue [[PG1:p[0-9]+]].s, vl[[#min(div(VBYTES,2),16)]]
				; VBITS_GE_512-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_512-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_512-NEXT: sunpklo [[OP2_LO:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_512-NEXT: sunpklo [[OP1_LO:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_512-NEXT: sdivr [[RES_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI]].s, [[OP1_HI]].s
				; VBITS_GE_512-NEXT: sdiv [[RES_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO]].s, [[OP2_LO]].s
				; VBITS_GE_512-NEXT: uzp1 [[RES:z[0-9]+]].h, [[RES_LO]].h, [[RES_HI]].h
				; VBITS_GE_512-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_512-NEXT: ret
				%op1 = load <32 x i16>, <32 x i16>* %a
				%op2 = load <32 x i16>, <32 x i16>* %b
				%res = sdiv <32 x i16> %op1, %op2
				store <32 x i16> %res, <32 x i16>* %a
				ret void
				}

				define void @sdiv_v64i16(<64 x i16>* %a, <64 x i16>* %b) #0 {
				; CHECK-LABEL: sdiv_v64i16:
				; VBITS_GE_1024: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),64)]]
				; VBITS_GE_1024-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_1024-NEXT: ptrue [[PG1:p[0-9]+]].s, vl[[#min(div(VBYTES,2),32)]]
				; VBITS_GE_1024-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_1024-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_1024-NEXT: sunpklo [[OP2_LO:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_1024-NEXT: sunpklo [[OP1_LO:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_1024-NEXT: sdivr [[RES_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI]].s, [[OP1_HI]].s
				; VBITS_GE_1024-NEXT: sdiv [[RES_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO]].s, [[OP2_LO]].s
				; VBITS_GE_1024-NEXT: uzp1 [[RES:z[0-9]+]].h, [[RES_LO]].h, [[RES_HI]].h
				; VBITS_GE_1024-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_1024-NEXT: ret
				%op1 = load <64 x i16>, <64 x i16>* %a
				%op2 = load <64 x i16>, <64 x i16>* %b
				%res = sdiv <64 x i16> %op1, %op2
				store <64 x i16> %res, <64 x i16>* %a
				ret void
				}

				define void @sdiv_v128i16(<128 x i16>* %a, <128 x i16>* %b) #0 {
				; CHECK-LABEL: sdiv_v128i16:
				; VBITS_GE_2048: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),128)]]
				; VBITS_GE_2048-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_2048-NEXT: ptrue [[PG1:p[0-9]+]].s, vl[[#min(div(VBYTES,2),64)]]
				; VBITS_GE_2048-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_2048-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_2048-NEXT: sunpklo [[OP2_LO:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_2048-NEXT: sunpklo [[OP1_LO:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_2048-NEXT: sdivr [[RES_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI]].s, [[OP1_HI]].s
				; VBITS_GE_2048-NEXT: sdiv [[RES_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO]].s, [[OP2_LO]].s
				; VBITS_GE_2048-NEXT: uzp1 [[RES:z[0-9]+]].h, [[RES_LO]].h, [[RES_HI]].h
				; VBITS_GE_2048-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_2048-NEXT: ret
				%op1 = load <128 x i16>, <128 x i16>* %a
				%op2 = load <128 x i16>, <128 x i16>* %b
				%res = sdiv <128 x i16> %op1, %op2
				store <128 x i16> %res, <128 x i16>* %a
				ret void
				}

				attributes #0 = { "target-features"="+sve" }