This is an archive of the discontinued LLVM Phabricator instance.

Differential D86114

[SVE] Lower fixed length vXi8/vXi16 SDIV to scalable
ClosedPublic

Authored by cameron.mcinally on Aug 17 2020, 3:01 PM.

Details

Reviewers

paulwalker-arm
efriedma
david-arm
rengolin

Commits

rGac6395946060: [SVE] Lower fixed length vXi8/vXi16 SDIV to scalable

Summary

There are no nxv16i8/nxv8i16 SDIV instructions, so these fixed width operations must be promoted to nxv4i32. This required adding an OverrideNEON flag to LowerToScalableOp(...), so that we can make use of the existing scalable lowering for the smaller vectors (e.g. v8i8).
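As a rough illustration of why promoting to 32-bit lanes preserves the result, here is a scalar model of a single lane. It is plain C++, not LLVM code, and the helper name `sdivViaI32` is hypothetical: each i8 operand is sign-extended to 32 bits, divided there, and the quotient is truncated back to 8 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of one lane: widen i8 to i32, divide, truncate.
// Division by zero is excluded here, as it is undefined for sdiv in LLVM IR.
int8_t sdivViaI32(int8_t a, int8_t b) {
  assert(b != 0 && "division by zero is undefined");
  int32_t wideA = a;             // sign extension to 32 bits
  int32_t wideB = b;             // sign extension to 32 bits
  int32_t wideQ = wideA / wideB; // rounds toward zero, like sdiv
  return static_cast<int8_t>(wideQ); // keep the low 8 bits
}
```

For example, `sdivViaI32(-7, 2)` yields `-3`, the same value a native 8-bit signed divide would produce.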

Also notice that a new test file was needed. The existing SDIV tests live in sve-fixed-length-int-arith.ll, which uses the legalization-style FileCheck macros. But the complicated lowering of these operations really needs the "end result"-style macros. If this patch is accepted, I'll move the vXi32/vXi64 tests to this new file as well.

Event Timeline

cameron.mcinally created this revision. Aug 17 2020, 3:01 PM

Herald added a reviewer: rengolin. Aug 17 2020, 3:01 PM

Herald added a project: Restricted Project.

Herald added subscribers: llvm-commits, psnobl, hiraditya, tschuett.

cameron.mcinally requested review of this revision. Aug 17 2020, 3:01 PM

Also notice that a new test file was needed. The existing SDIV tests live in sve-fixed-length-int-arith.ll, which uses the legalization-style FileCheck macros. But the complicated lowering of these operations really needs the "end result"-style macros. If this patch is accepted, I'll move the vXi32/vXi64 tests to this new file as well.

I don't see any obvious difference in the RUN lines?

Harbormaster completed remote builds in B68682: Diff 286152. Aug 17 2020, 3:47 PM

In D86114#2222692, @efriedma wrote:

I don't see any obvious difference in the RUN lines?

It's the VBITS_* differences. The VBITS_LE_* prefixes are better for checking the legalizations (really splitting), when the current VL > the fixed VL. The VBITS_GE_* prefixes are better for ignoring the legalizations, i.e. the current VL <= the fixed VL.

For the i8/i16 SDIV cases, the lowerings are pretty long, so it's nicer to check the cases where legalization was boring.
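The way the RUN lines of sve-fixed-length-int-div.ll pair each `-aarch64-sve-vector-bits-min` value with a `VBYTES` definition and a set of check prefixes can be summarised with a small sketch. The helper names `vbytesFor` and `prefixesFor` are hypothetical and only mirror the table of RUN lines in this revision; they are not part of LLVM or FileCheck.

```cpp
#include <cassert>
#include <initializer_list>
#include <string>
#include <vector>

// Hypothetical helper: VBYTES in the RUN lines is the largest power-of-two
// register size, in bytes, that fits in the requested minimum vector width.
unsigned vbytesFor(unsigned minBits) {
  assert(minBits >= 128 && minBits % 128 == 0 && "expected a multiple of 128");
  unsigned bits = 128;
  while (bits * 2 <= minBits)
    bits *= 2;
  return bits / 8;
}

// Hypothetical helper: check prefixes enabled for a given minimum width,
// following the RUN lines of the new test file.
std::vector<std::string> prefixesFor(unsigned minBits) {
  if (minBits < 256)
    return {"NO_SVE"};
  std::vector<std::string> prefixes = {"CHECK"};
  if (minBits == 256)
    prefixes.push_back("VBITS_EQ_256");
  for (unsigned threshold : {512u, 1024u, 2048u})
    if (minBits >= threshold)
      prefixes.push_back("VBITS_GE_" + std::to_string(threshold));
  return prefixes;
}
```

With this model, a 640-bit minimum maps to `VBYTES=64` and the prefixes `CHECK,VBITS_GE_512`, so only the checks for vectors that fit in a single register without splitting are active.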

paulwalker-arm added inline comments. Aug 18 2020, 8:57 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8945	I just wanted to double check that you are aware this is going to result in i8/i16 fixed length vector divides being different to the i32/i64 ones. The latter being predicated with the former cases not (or rather using an "all true" predicate). Given divides are rarely cheap I prefer the predicated route but I guess there's no reason to be consistent at this stage.

cameron.mcinally added inline comments. Aug 18 2020, 9:19 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8945	Oh, for the smaller vectors that are passed in regs? Yeah, I didn't catch that. Any clue why that's happening? Is it because there are no loads/stores to lower?

cameron.mcinally marked an inline comment as not done. Aug 18 2020, 9:22 AM

cameron.mcinally added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8945	Oh, I see what you're saying now. The second predicate register is being generated all1s. That's surprising. Ultimately, both the i32/i64 and i8/i16 cases should be going through `return LowerToPredicatedOp(Op, DAG, PredOpcode);`. I'll see if I can find the difference.

Ok, I see it now. We have to explicitly call LowerToPredicatedOp(...) with the fixed types still intact, so that getPredicateForVector(...) will generate the correct predicate. Will update...
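To make the difference between the two predicates concrete, here is a toy model of a merging predicated divide. It is plain C++, not LLVM code or SVE intrinsics, and the names `makePredicate` and `predicatedSDiv` are hypothetical. The assumption is that a fixed-length vector of N elements occupies the low N lanes of a wider register: a `vl`-style predicate activates only those N lanes, whereas an all-true predicate would also operate on the remaining lanes, whose contents are not meaningful.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a "ptrue pN.s, vl<N>" predicate: the first
// activeLanes lanes are true, the rest are false.
std::vector<bool> makePredicate(std::size_t totalLanes,
                                std::size_t activeLanes) {
  assert(activeLanes <= totalLanes);
  std::vector<bool> pred(totalLanes, false);
  for (std::size_t i = 0; i < activeLanes; ++i)
    pred[i] = true;
  return pred;
}

// Merging predicated divide: active lanes receive a[i] / b[i]; inactive
// lanes keep a[i] and their divisor is never inspected.
std::vector<int32_t> predicatedSDiv(const std::vector<bool> &pred,
                                    const std::vector<int32_t> &a,
                                    const std::vector<int32_t> &b) {
  assert(pred.size() == a.size() && a.size() == b.size());
  std::vector<int32_t> result = a;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (!pred[i])
      continue;
    // The assert only keeps this C++ model well defined.
    assert(b[i] != 0 && "active lane divides by zero in this model");
    result[i] = a[i] / b[i];
  }
  return result;
}
```

With `makePredicate(4, 2)`, only lanes 0 and 1 are divided; lanes 2 and 3 pass through untouched even if their divisors happen to be zero.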

paulwalker-arm added inline comments. Aug 18 2020, 9:29 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8945	I see you're finding the answers quicker than I can type them :) For what it's worth I've nothing against the current patch since it is simple and works, I just wanted to ensure you were aware what was going on.

Updated DIV lowering to make use of correct fixed length predicates.

The lowering is somewhat convoluted, so I'd like to hear if anyone has a better solution. I've attempted to extend the operands and truncate the results as scalable, but leave the DIVs fixed until they reach vXi32. This allows the intermediate fixed DIVs to be further lowered.

There's probably some room for code reuse with other instructions that do not have i8/i16 vector support, but I don't have a good view of what those instructions look like right now (if they exist).

cameron.mcinally marked 2 inline comments as done. Aug 19 2020, 10:57 AM

LGTM.

The end result is a bit uglier than I would have hoped for, but I can't think of any particularly better way given the constraints.

This revision is now accepted and ready to land. Aug 19 2020, 1:01 PM

In D86114#2226955, @efriedma wrote:

The end result is a bit uglier than I would have hoped for, but I can't think of any particularly better way given the constraints.

Agreed. I'll wait for Paul to have a look to see if he has any tricks.

We also need a peep for the smaller vectors, where the extend can fit in one register.

In D86114#2227054, @cameron.mcinally wrote:

In D86114#2226955, @efriedma wrote:

The end result is a bit uglier than I would have hoped for, but I can't think of any particularly better way given the constraints.

Agreed. I'll wait for Paul to have a look to see if he has any tricks.

I don't think there's a very different approach because you really want custom type legalisation for a legal type, which would be nice but doesn't exist. That said, the custom lowering for i8/i16 doesn't have to be SVE specific, as there's the alternative approach of using normal ISD nodes to do the widening so that only the final SDIV lowering is SVE specific. I'm thinking of the following:

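// Sign-extend for SDIV, zero-extend for UDIV.
unsigned ExtendOpcode =
    Op.getOpcode() == ISD::SDIV ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND;

// Split each operand into its low and high halves.
SDValue Op0Lo, Op0Hi, Op1Lo, Op1Hi;
std::tie(Op0Lo, Op0Hi) = DAG.SplitVector(Op.getOperand(0), DL);
std::tie(Op1Lo, Op1Hi) = DAG.SplitVector(Op.getOperand(1), DL);

// Widen every half to the wider element type.
Op0Lo = DAG.getNode(ExtendOpcode, DL, FixedWidenedVT, Op0Lo);
Op1Lo = DAG.getNode(ExtendOpcode, DL, FixedWidenedVT, Op1Lo);
Op0Hi = DAG.getNode(ExtendOpcode, DL, FixedWidenedVT, Op0Hi);
Op1Hi = DAG.getNode(ExtendOpcode, DL, FixedWidenedVT, Op1Hi);

// Divide matching halves: lo by lo, hi by hi.
SDValue ResultLo = DAG.getNode(Op.getOpcode(), DL, FixedWidenedVT, Op0Lo, Op1Lo);
SDValue ResultHi = DAG.getNode(Op.getOpcode(), DL, FixedWidenedVT, Op0Hi, Op1Hi);

// Narrow the quotients and glue the halves back together.
ResultLo = DAG.getNode(ISD::TRUNCATE, DL, HalfVT, ResultLo);
ResultHi = DAG.getNode(ISD::TRUNCATE, DL, HalfVT, ResultHi);

return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, ResultLo, ResultHi);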

Which I think looks nicer and might automatically benefit from optimising the "split" away when the resulting expanded operands can fit within a single register. The big downside to this is that bigger-than-NEON fixed-length EXTRACT_SUBVECTOR and CONCAT_VECTORS support doesn't exist, so it'll either not work today or result in terrible code.
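The split, extend, divide, truncate, and concatenate sequence described above can be modelled on ordinary arrays. The sketch below is plain C++ rather than SelectionDAG code, the function names `divideHalf` and `sdivSplitWiden` are hypothetical, and it assumes i8 lanes widened to i16. It uses sign extension because the operation being modelled is a signed divide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of one half: sign-extend i8 lanes to i16, divide on the
// wide lanes, and truncate each quotient back to i8.
static std::vector<int8_t> divideHalf(const std::vector<int8_t> &a,
                                      const std::vector<int8_t> &b,
                                      std::size_t begin, std::size_t end) {
  std::vector<int8_t> out;
  for (std::size_t i = begin; i < end; ++i) {
    assert(b[i] != 0 && "division by zero is undefined");
    int16_t wideA = a[i];                                // sign extend
    int16_t wideB = b[i];                                // sign extend
    int16_t wideQ = static_cast<int16_t>(wideA / wideB); // divide wide lanes
    out.push_back(static_cast<int8_t>(wideQ));           // truncate
  }
  return out;
}

// Split both operands in half, process the low and high halves separately,
// then concatenate the two narrow results.
std::vector<int8_t> sdivSplitWiden(const std::vector<int8_t> &a,
                                   const std::vector<int8_t> &b) {
  assert(a.size() == b.size() && a.size() % 2 == 0);
  std::size_t half = a.size() / 2;
  std::vector<int8_t> lo = divideHalf(a, b, 0, half);
  std::vector<int8_t> hi = divideHalf(a, b, half, a.size());
  lo.insert(lo.end(), hi.begin(), hi.end()); // concatenate lo and hi
  return lo;
}
```

Each half is handled independently here, which is what would let a real lowering drop the split when the widened operands already fit in one register.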

Based on this I'm happy to take the above as a refactoring exercise (which I'm happy to do if you want) and have the patch landed in its current form.

In D86114#2228977, @paulwalker-arm wrote:

That said the custom lowering for i8/i16 doesn't have to be SVE specific as there's the alternative approach of using normal ISD nodes to do the widening so that only the final SDIV lowering is SVE specific.

I tried that at first, but the NEON assembly was pretty ugly. I may have made a mistake there, but if not, it was all scalarized. The SVE lowering is uglier, but the generated code seems cleaner.

Extending fixed width vectors
=======================

sdiv_v8i8:
	.cfi_startproc
	umov	w11, v1.b[1]
	umov	w9, v1.b[2]
	umov	w13, v0.b[1]
	fmov	s2, w11
	umov	w10, v1.b[3]
	umov	w8, v1.b[4]
	umov	w12, v0.b[2]
	umov	w14, v0.b[3]
	fmov	s3, w13
	umov	w11, v0.b[4]
	zip1	v1.8b, v1.8b, v0.8b
	zip1	v0.8b, v0.8b, v0.8b
	mov	v2.h[1], w9
	shl	v1.4h, v1.4h, #8
	shl	v0.4h, v0.4h, #8
	mov	v3.h[1], w12
	mov	v2.h[2], w10
	sshr	v1.4h, v1.4h, #8
	sshr	v0.4h, v0.4h, #8
	mov	v3.h[2], w14
	mov	v2.h[3], w8
	umov	w13, v1.h[0]
	umov	w8, v0.h[0]
	mov	v3.h[3], w11
	shl	v2.4h, v2.4h, #8
	umov	w9, v1.h[1]
	umov	w10, v1.h[2]
	umov	w12, v0.h[1]
	umov	w11, v0.h[2]
	fmov	s0, w13
	fmov	s1, w8
	shl	v3.4h, v3.4h, #8
	sshr	v2.4h, v2.4h, #8
	mov	v0.s[1], w9
	fmov	s4, w9
	mov	v1.s[1], w12
	umov	w8, v2.h[1]
	umov	w9, v2.h[2]
	umov	w13, v2.h[0]
	fmov	s2, w12
	sshr	v3.4h, v3.4h, #8
	mov	v4.s[1], w10
	mov	v2.s[1], w11
	umov	w12, v3.h[0]
	shl	v0.2s, v0.2s, #16
	shl	v1.2s, v1.2s, #16
	umov	w10, v3.h[1]
	umov	w11, v3.h[2]
	fmov	s3, w13
	sshr	v0.2s, v0.2s, #16
	fmov	s5, w12
	sshr	v1.2s, v1.2s, #16
	ptrue	p0.s, vl2
	shl	v4.2s, v4.2s, #16
	shl	v2.2s, v2.2s, #16
	mov	v3.s[1], w8
	fmov	s6, w8
	sshr	v4.2s, v4.2s, #16
	sdivr	z0.s, p0/m, z0.s, z1.s
	sshr	v1.2s, v2.2s, #16
	mov	v5.s[1], w10
	fmov	s2, w10
	sdiv	z1.s, p0/m, z1.s, z4.s
	mov	v6.s[1], w9
	mov	v2.s[1], w11
	shl	v3.2s, v3.2s, #16
	shl	v4.2s, v5.2s, #16
	shl	v5.2s, v6.2s, #16
	shl	v2.2s, v2.2s, #16
	sshr	v3.2s, v3.2s, #16
	sshr	v4.2s, v4.2s, #16
	sdivr	z3.s, p0/m, z3.s, z4.s
	sshr	v4.2s, v5.2s, #16
	sshr	v2.2s, v2.2s, #16
	sdiv	z2.s, p0/m, z2.s, z4.s
	uzp1	v2.4h, v3.4h, v2.4h
	uzp1	v0.4h, v0.4h, v1.4h
	uzp1	v0.8b, v0.8b, v2.8b
	ret

Oh, but to be fair, I didn't use DAG.SplitVector(Op.getOperand(0), DL);. So that may avoid some of the ugly expanding.

The expanding is because we don't yet attempt to lower the subvector and concat_vector operations. As I say, I think today the correct move is to take this patch and then see what the future holds when we have full support for subvec and concat.

Closed by commit rGac6395946060: [SVE] Lower fixed length vXi8/vXi16 SDIV to scalable (authored by cameron.mcinally). Aug 20 2020, 11:47 AM

This revision was automatically updated to reflect the committed changes.

cameron.mcinally added a commit: rGac6395946060: [SVE] Lower fixed length vXi8/vXi16 SDIV to scalable.

cameron.mcinally mentioned this in D86316: [SVE] Lower fixed length UDIV to scalable. Aug 20 2020, 1:14 PM

cameron.mcinally mentioned this in rG36dbb8fc972f: [SVE] Lower fixed length UDIV to scalable. Aug 21 2020, 7:01 AM

Revision Contents

Path                                                    Size
llvm/lib/Target/AArch64/AArch64ISelLowering.h           3 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp         29 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-int-div.ll   321 lines

Diff 286152

llvm/lib/Target/AArch64/AArch64ISelLowering.h

@@ 864 unchanged lines hidden @@ private:
   SDValue LowerEXTRACT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerSCALAR_TO_VECTOR(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerSPLAT_VECTOR(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerDUPQLane(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerToPredicatedOp(SDValue Op, SelectionDAG &DAG, unsigned NewOp,
                               bool OverrideNEON = false) const;
-  SDValue LowerToScalableOp(SDValue Op, SelectionDAG &DAG) const;
+  SDValue LowerToScalableOp(SDValue Op, SelectionDAG &DAG,
+                            bool OverrideNEON = false) const;
   SDValue LowerEXTRACT_SUBVECTOR(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerINSERT_SUBVECTOR(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerDIV(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerMUL(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerVectorSRA_SRL_SHL(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerShiftLeftParts(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerShiftRightParts(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerVSETCC(SDValue Op, SelectionDAG &DAG) const;
@@ 112 unchanged lines hidden @@

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

@@ 991 unchanged lines hidden @@ if (useSVEForFixedLengthVectors()) {
     for (auto VT : {MVT::v16i8, MVT::v8i16, MVT::v4i32})
       setOperationAction(ISD::TRUNCATE, VT, Custom);
     for (auto VT : {MVT::v8f16, MVT::v4f32})
       setOperationAction(ISD::FP_ROUND, VT, Expand);

     // These operations are not supported on NEON but SVE can do them.
     setOperationAction(ISD::MUL, MVT::v1i64, Custom);
     setOperationAction(ISD::MUL, MVT::v2i64, Custom);
+    setOperationAction(ISD::SDIV, MVT::v8i8, Custom);
+    setOperationAction(ISD::SDIV, MVT::v16i8, Custom);
+    setOperationAction(ISD::SDIV, MVT::v4i16, Custom);
+    setOperationAction(ISD::SDIV, MVT::v8i16, Custom);
     setOperationAction(ISD::SDIV, MVT::v2i32, Custom);
     setOperationAction(ISD::SDIV, MVT::v4i32, Custom);
     setOperationAction(ISD::SDIV, MVT::v1i64, Custom);
     setOperationAction(ISD::SDIV, MVT::v2i64, Custom);
     setOperationAction(ISD::SMAX, MVT::v1i64, Custom);
     setOperationAction(ISD::SMAX, MVT::v2i64, Custom);
     setOperationAction(ISD::SMIN, MVT::v1i64, Custom);
     setOperationAction(ISD::SMIN, MVT::v2i64, Custom);
@@ 105 unchanged lines hidden @@ void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT) {
   setOperationAction(ISD::FMA, VT, Custom);
   setOperationAction(ISD::FMAXNUM, VT, Custom);
   setOperationAction(ISD::FMINNUM, VT, Custom);
   setOperationAction(ISD::FMUL, VT, Custom);
   setOperationAction(ISD::FSUB, VT, Custom);
   setOperationAction(ISD::LOAD, VT, Custom);
   setOperationAction(ISD::MUL, VT, Custom);
   setOperationAction(ISD::OR, VT, Custom);
+  setOperationAction(ISD::SDIV, VT, Custom);
   setOperationAction(ISD::SETCC, VT, Custom);
   setOperationAction(ISD::SHL, VT, Custom);
   setOperationAction(ISD::SIGN_EXTEND, VT, Custom);
   setOperationAction(ISD::SIGN_EXTEND_INREG, VT, Custom);
   setOperationAction(ISD::SMAX, VT, Custom);
   setOperationAction(ISD::SMIN, VT, Custom);
   setOperationAction(ISD::SRA, VT, Custom);
   setOperationAction(ISD::SRL, VT, Custom);
   setOperationAction(ISD::STORE, VT, Custom);
   setOperationAction(ISD::SUB, VT, Custom);
   setOperationAction(ISD::TRUNCATE, VT, Custom);
   setOperationAction(ISD::UMAX, VT, Custom);
   setOperationAction(ISD::UMIN, VT, Custom);
   setOperationAction(ISD::XOR, VT, Custom);
   setOperationAction(ISD::ZERO_EXTEND, VT, Custom);
-
-  if (VT.getVectorElementType() == MVT::i32 ||
-      VT.getVectorElementType() == MVT::i64)
-    setOperationAction(ISD::SDIV, VT, Custom);
 }

 void AArch64TargetLowering::addDRTypeForNEON(MVT VT) {
   addRegisterClass(VT, &AArch64::FPR64RegClass);
   addTypeForNEON(VT, MVT::v2i32);
 }

 void AArch64TargetLowering::addQRTypeForNEON(MVT VT) {
@@ 7,781 unchanged lines hidden @@ SDValue AArch64TargetLowering::LowerINSERT_SUBVECTOR(SDValue Op,
   return SDValue();
 }

 SDValue AArch64TargetLowering::LowerDIV(SDValue Op, SelectionDAG &DAG) const {
   EVT VT = Op.getValueType();
   bool Signed = Op.getOpcode() == ISD::SDIV;
   unsigned PredOpcode = Signed ? AArch64ISD::SDIV_PRED : AArch64ISD::UDIV_PRED;

-  if (useSVEForFixedLengthVectorVT(Op.getValueType(), /*OverrideNEON=*/true) &&
-      (VT.getVectorElementType() == MVT::i32 ||
-       VT.getVectorElementType() == MVT::i64))
-    return LowerToPredicatedOp(Op, DAG, PredOpcode, /*OverrideNEON=*/true);
+  if (useSVEForFixedLengthVectorVT(Op.getValueType(), /*OverrideNEON=*/true)) {
+    // Scalable vector of i32/i64 DIV is supported.
+    if (VT.getVectorElementType() == MVT::i32 ||
+        VT.getVectorElementType() == MVT::i64)
+      return LowerToPredicatedOp(Op, DAG, PredOpcode, /*OverrideNEON=*/true);
+
+    // Convert vector of i8/i16 DIV to scalable to allow usual promotion.
+    return LowerToScalableOp(Op, DAG, /*OverrideNEON=*/true);
+  }

   if (VT == MVT::nxv4i32 || VT == MVT::nxv2i64)
     return LowerToPredicatedOp(Op, DAG, PredOpcode);

   // SVE doesn't have i8 and i16 DIV operations; widen them to 32-bit
   // operations, and truncate the result.
   EVT WidenedVT;
   if (VT == MVT::nxv16i8)
     WidenedVT = MVT::nxv8i16;
@@ 6,520 unchanged lines hidden @@ if (isMergePassthruOpcode(NewOp))
     Operands.push_back(DAG.getUNDEF(VT));

   return DAG.getNode(NewOp, DL, VT, Operands);
 }

 // If a fixed length vector operation has no side effects when applied to
 // undefined elements, we can safely use scalable vectors to perform the same
 // operation without needing to worry about predication.
 SDValue AArch64TargetLowering::LowerToScalableOp(SDValue Op,
-                                                 SelectionDAG &DAG) const {
+                                                 SelectionDAG &DAG,
+                                                 bool OverrideNEON) const {
   EVT VT = Op.getValueType();
-  assert(useSVEForFixedLengthVectorVT(VT) &&
+  assert(useSVEForFixedLengthVectorVT(VT, OverrideNEON) &&
          "Only expected to lower fixed length vector operation!");
   EVT ContainerVT = getContainerForFixedLengthVector(DAG, VT);

   // Create list of operands by converting existing ones to scalable types.
   SmallVector<SDValue, 4> Ops;
   for (const SDValue &V : Op->op_values()) {
-    assert(useSVEForFixedLengthVectorVT(V.getValueType()) &&
+    assert(useSVEForFixedLengthVectorVT(V.getValueType(), OverrideNEON) &&
            "Only fixed length vectors are supported!");
     Ops.push_back(convertToScalableVector(DAG, ContainerVT, V));
   }

   auto ScalableRes = DAG.getNode(Op.getOpcode(), SDLoc(Op), ContainerVT, Ops);
   return convertFromScalableVector(DAG, VT, ScalableRes);
 }

@@ 27 unchanged lines hidden @@

llvm/test/CodeGen/AArch64/sve-fixed-length-int-div.ll

This file was added.

				; RUN: llc -aarch64-sve-vector-bits-min=128 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=16 -check-prefix=NO_SVE
				; RUN: llc -aarch64-sve-vector-bits-min=256 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK,VBITS_EQ_256
				; RUN: llc -aarch64-sve-vector-bits-min=384 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK
				; RUN: llc -aarch64-sve-vector-bits-min=512 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=640 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=768 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=896 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=1024 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1152 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1280 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1408 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1536 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1664 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1792 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1920 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=2048 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=256 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024,VBITS_GE_2048

				target triple = "aarch64-unknown-linux-gnu"

				; Don't use SVE when its registers are no bigger than NEON.
				; NO_SVE-NOT: ptrue

				;
				; SDIV
				;

				; Vector vXi8 sdiv are not legal for NEON so use SVE when available.
				define <8 x i8> @sdiv_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: sdiv_v8i8:
				; CHECK: sunpkhi [[OP2_HI:z[0-9]+]].h, z1.b
				; CHECK-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].h, z0.b
				; CHECK-NEXT: ptrue [[PG:p[0-9]+]].s
				; CHECK-NEXT: sunpklo [[OP2_LO:z[0-9]+]].h, z1.b
				; CHECK-NEXT: sunpklo [[OP1_LO:z[0-9]+]].h, z0.b
				; CHECK-NEXT: sunpkhi [[OP2_HI_HI:z[0-9]+]].s, [[OP2_HI]].h
				; CHECK-NEXT: sunpkhi [[OP1_HI_HI:z[0-9]+]].s, [[OP1_HI]].h
				; CHECK-NEXT: sunpklo [[OP2_HI_LO:z[0-9]+]].s, [[OP2_HI]].h
				; CHECK-NEXT: sunpklo [[OP1_HI_LO:z[0-9]+]].s, [[OP1_HI]].h
				; CHECK-NEXT: sdivr [[RES_HI_HI:z[0-9]+]].s, [[PG]]/m, [[OP2_HI_HI]].s, [[OP1_HI_HI]].s
				; CHECK-NEXT: sunpkhi [[OP2_LO_HI:z[0-9]+]].s, [[OP2_LO]].h
				; CHECK-NEXT: sdivr [[RES_HI_LO:z[0-9]+]].s, [[PG]]/m, [[OP2_HI_LO]].s, [[OP1_HI_LO]].s
				; CHECK-NEXT: sunpkhi [[OP1_LO_HI:z[0-9]+]].s, z0.h
				; CHECK-NEXT: sunpklo [[OP2_LO_LO:z[0-9]+]].s, z1.h
				; CHECK-NEXT: sunpklo [[OP1_LO_LO:z[0-9]+]].s, z0.h
				; CHECK-NEXT: sdiv [[RES_LO_HI:z[0-9]+]].s, [[PG]]/m, [[OP1_LO_HI]].s, [[OP2_LO_HI]].s
				; CHECK-NEXT: sdiv [[RES_LO_LO:z[0-9]+]].s, [[PG]]/m, [[OP1_LO_LO]].s, [[OP2_LO_LO]].s
				; CHECK-NEXT: uzp1 [[RES_HI:z[0-9]+]].h, [[RES_HI_LO]].h, [[RES_HI_HI]].h
				; CHECK-NEXT: uzp1 [[RES_LO:z[0-9]+]].h, [[RES_LO_LO]].h, [[RES_LO_HI]].h
				; CHECK-NEXT: uzp1 z0.b, [[RES_LO]].b, [[RES_HI]].b
				; CHECK: ret
				%res = sdiv <8 x i8> %op1, %op2
				ret <8 x i8> %res
				}

				define <16 x i8> @sdiv_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: sdiv_v16i8:
				; CHECK: sunpkhi [[OP2_HI:z[0-9]+]].h, z1.b
				; CHECK-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].h, z0.b
				; CHECK-NEXT: ptrue [[PG:p[0-9]+]].s
				; CHECK-NEXT: sunpklo [[OP2_LO:z[0-9]+]].h, z1.b
				; CHECK-NEXT: sunpklo [[OP1_LO:z[0-9]+]].h, z0.b
				; CHECK-NEXT: sunpkhi [[OP2_HI_HI:z[0-9]+]].s, [[OP2_HI]].h
				; CHECK-NEXT: sunpkhi [[OP1_HI_HI:z[0-9]+]].s, [[OP1_HI]].h
				; CHECK-NEXT: sunpklo [[OP2_HI_LO:z[0-9]+]].s, [[OP2_HI]].h
				; CHECK-NEXT: sunpklo [[OP1_HI_LO:z[0-9]+]].s, [[OP1_HI]].h
				; CHECK-NEXT: sdivr [[RES_HI_HI:z[0-9]+]].s, [[PG]]/m, [[OP2_HI_HI]].s, [[OP1_HI_HI]].s
				; CHECK-NEXT: sunpkhi [[OP2_LO_HI:z[0-9]+]].s, [[OP2_LO]].h
				; CHECK-NEXT: sdivr [[RES_HI_LO:z[0-9]+]].s, [[PG]]/m, [[OP2_HI_LO]].s, [[OP1_HI_LO]].s
				; CHECK-NEXT: sunpkhi [[OP1_LO_HI:z[0-9]+]].s, z0.h
				; CHECK-NEXT: sunpklo [[OP2_LO_LO:z[0-9]+]].s, z1.h
				; CHECK-NEXT: sunpklo [[OP1_LO_LO:z[0-9]+]].s, z0.h
				; CHECK-NEXT: sdiv [[RES_LO_HI:z[0-9]+]].s, [[PG]]/m, [[OP1_LO_HI]].s, [[OP2_LO_HI]].s
				; CHECK-NEXT: sdiv [[RES_LO_LO:z[0-9]+]].s, [[PG]]/m, [[OP1_LO_LO]].s, [[OP2_LO_LO]].s
				; CHECK-NEXT: uzp1 [[RES_HI:z[0-9]+]].h, [[RES_HI_LO]].h, [[RES_HI_HI]].h
				; CHECK-NEXT: uzp1 [[RES_LO:z[0-9]+]].h, [[RES_LO_LO]].h, [[RES_LO_HI]].h
				; CHECK-NEXT: uzp1 z0.b, [[RES_LO]].b, [[RES_HI]].b
				; CHECK: ret
				%res = sdiv <16 x i8> %op1, %op2
				ret <16 x i8> %res
				}

				define void @sdiv_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: sdiv_v32i8:
				; VBITS_GE_256: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,32)]]
				; VBITS_GE_256-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].s, z0.h
				; VBITS_GE_256-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].s, z1.h
				; VBITS_GE_256-NEXT: sunpklo [[OP2_LO:z[0-9]+]].s, z1.h
				; VBITS_GE_256-NEXT: sunpklo [[OP1_LO:z[0-9]+]].s, z0.h
				; VBITS_GE_256-NEXT: sdivr [[RES_HI:z[0-9]+]].s, [[PG]]/m, [[OP2_HI]].s, [[OP1_HI]].s
				; VBITS_GE_256-NEXT: sdiv [[RES_LO:z[0-9]+]].s, [[PG]]/m, [[OP1_LO]].s, [[OP2_LO]].s
				; VBITS_GE_256-NEXT: uzp1 [[RES:z[0-9]+]].h, [[RES_LO]].h, [[RES_HI]].h
				; VBITS_GE_256-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%res = sdiv <32 x i8> %op1, %op2
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define void @sdiv_v64i8(<64 x i8>* %a, <64 x i8>* %b) #0 {
				; CHECK-LABEL: sdiv_v64i8:
				; VBITS_GE_512: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,64)]]
				; VBITS_GE_512-NEXT: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_GE_512-NEXT: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_512-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_512-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_512-NEXT: sunpklo [[OP2_LO:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_512-NEXT: sunpklo [[OP1_LO:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_512-NEXT: sunpkhi [[OP2_HI_HI:z[0-9]]].s, [[OP2_HI]].h
				; VBITS_GE_512-NEXT: sunpkhi [[OP1_HI_HI:z[0-9]]].s, [[OP1_HI]].h
				; VBITS_GE_512-NEXT: sunpklo [[OP2_HI_LO:z[0-9]+]].s, [[OP2_HI]].h
				; VBITS_GE_512-NEXT: sunpklo [[OP1_HI_LO:z[0-9]+]].s, [[OP1_HI]].h
				; VBITS_GE_512-NEXT: sdivr [[RES_HI_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_HI]].s, [[OP1_HI_HI]].s
				; VBITS_GE_512-NEXT: sunpkhi [[OP2_LO_HI:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_512-NEXT: sdivr [[RES_HI_LO:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_LO]].s, [[OP1_HI_LO]].s
				; VBITS_GE_512-NEXT: sunpkhi [[OP1_LO_HI:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_512-NEXT: sunpklo [[OP2_LO_LO:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_512-NEXT: sunpklo [[OP1_LO_LO:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_512-NEXT: sdiv [[RES_LO_HI:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_HI]].s, [[OP2_LO_HI]].s
				; VBITS_GE_512-NEXT: sdiv [[RES_LO_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_LO]].s, [[OP2_LO_LO]].s
				; VBITS_GE_512-NEXT: uzp1 [[RES_HI:z[0-9]+]].h, [[RES_HI_LO]].h, [[RES_HI_HI]].h
				; VBITS_GE_512-NEXT: uzp1 [[RES_LO:z[0-9]+]].h, [[RES_LO_LO]].h, [[RES_LO_HI]].h
				; VBITS_GE_512-NEXT: uzp1 [[RES:z[0-9]+]].b, [[RES_LO]].b, [[RES_HI]].b
				; VBITS_GE_512-NEXT: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_GE_512-NEXT: ret
				%op1 = load <64 x i8>, <64 x i8>* %a
				%op2 = load <64 x i8>, <64 x i8>* %b
				%res = sdiv <64 x i8> %op1, %op2
				store <64 x i8> %res, <64 x i8>* %a
				ret void
				}

				define void @sdiv_v128i8(<128 x i8>* %a, <128 x i8>* %b) #0 {
				; CHECK-LABEL: sdiv_v128i8:
				; VBITS_GE_1024: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,128)]]
				; VBITS_GE_1024-NEXT: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_GE_1024-NEXT: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_1024-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_1024-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_1024-NEXT: sunpklo [[OP2_LO:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_1024-NEXT: sunpklo [[OP1_LO:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_1024-NEXT: sunpkhi [[OP2_HI_HI:z[0-9]+]].s, [[OP2_HI]].h
				; VBITS_GE_1024-NEXT: sunpkhi [[OP1_HI_HI:z[0-9]+]].s, [[OP1_HI]].h
				; VBITS_GE_1024-NEXT: sunpklo [[OP2_HI_LO:z[0-9]+]].s, [[OP2_HI]].h
				; VBITS_GE_1024-NEXT: sunpklo [[OP1_HI_LO:z[0-9]+]].s, [[OP1_HI]].h
				; VBITS_GE_1024-NEXT: sdivr [[RES_HI_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_HI]].s, [[OP1_HI_HI]].s
				; VBITS_GE_1024-NEXT: sunpkhi [[OP2_LO_HI:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_1024-NEXT: sdivr [[RES_HI_LO:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_LO]].s, [[OP1_HI_LO]].s
				; VBITS_GE_1024-NEXT: sunpkhi [[OP1_LO_HI:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_1024-NEXT: sunpklo [[OP2_LO_LO:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_1024-NEXT: sunpklo [[OP1_LO_LO:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_1024-NEXT: sdiv [[RES_LO_HI:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_HI]].s, [[OP2_LO_HI]].s
				; VBITS_GE_1024-NEXT: sdiv [[RES_LO_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_LO]].s, [[OP2_LO_LO]].s
				; VBITS_GE_1024-NEXT: uzp1 [[RES_HI:z[0-9]+]].h, [[RES_HI_LO]].h, [[RES_HI_HI]].h
				; VBITS_GE_1024-NEXT: uzp1 [[RES_LO:z[0-9]+]].h, [[RES_LO_LO]].h, [[RES_LO_HI]].h
				; VBITS_GE_1024-NEXT: uzp1 [[RES:z[0-9]+]].b, [[RES_LO]].b, [[RES_HI]].b
				; VBITS_GE_1024-NEXT: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_GE_1024-NEXT: ret
				%op1 = load <128 x i8>, <128 x i8>* %a
				%op2 = load <128 x i8>, <128 x i8>* %b
				%res = sdiv <128 x i8> %op1, %op2
				store <128 x i8> %res, <128 x i8>* %a
				ret void
				}

				define void @sdiv_v256i8(<256 x i8>* %a, <256 x i8>* %b) #0 {
				; CHECK-LABEL: sdiv_v256i8:
				; VBITS_GE_2048: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,256)]]
				; VBITS_GE_2048-NEXT: ld1b { [[OP1:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: ld1b { [[OP2:z[0-9]+]].b }, [[PG]]/z, [x1]
				; VBITS_GE_2048-NEXT: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_2048-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_2048-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_2048-NEXT: sunpklo [[OP2_LO:z[0-9]+]].h, [[OP2]].b
				; VBITS_GE_2048-NEXT: sunpklo [[OP1_LO:z[0-9]+]].h, [[OP1]].b
				; VBITS_GE_2048-NEXT: sunpkhi [[OP2_HI_HI:z[0-9]+]].s, [[OP2_HI]].h
				; VBITS_GE_2048-NEXT: sunpkhi [[OP1_HI_HI:z[0-9]+]].s, [[OP1_HI]].h
				; VBITS_GE_2048-NEXT: sunpklo [[OP2_HI_LO:z[0-9]+]].s, [[OP2_HI]].h
				; VBITS_GE_2048-NEXT: sunpklo [[OP1_HI_LO:z[0-9]+]].s, [[OP1_HI]].h
				; VBITS_GE_2048-NEXT: sdivr [[RES_HI_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_HI]].s, [[OP1_HI_HI]].s
				; VBITS_GE_2048-NEXT: sunpkhi [[OP2_LO_HI:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_2048-NEXT: sdivr [[RES_HI_LO:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI_LO]].s, [[OP1_HI_LO]].s
				; VBITS_GE_2048-NEXT: sunpkhi [[OP1_LO_HI:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_2048-NEXT: sunpklo [[OP2_LO_LO:z[0-9]+]].s, [[OP2_LO]].h
				; VBITS_GE_2048-NEXT: sunpklo [[OP1_LO_LO:z[0-9]+]].s, [[OP1_LO]].h
				; VBITS_GE_2048-NEXT: sdiv [[RES_LO_HI:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_HI]].s, [[OP2_LO_HI]].s
				; VBITS_GE_2048-NEXT: sdiv [[RES_LO_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO_LO]].s, [[OP2_LO_LO]].s
				; VBITS_GE_2048-NEXT: uzp1 [[RES_HI:z[0-9]+]].h, [[RES_HI_LO]].h, [[RES_HI_HI]].h
				; VBITS_GE_2048-NEXT: uzp1 [[RES_LO:z[0-9]+]].h, [[RES_LO_LO]].h, [[RES_LO_HI]].h
				; VBITS_GE_2048-NEXT: uzp1 [[RES:z[0-9]+]].b, [[RES_LO]].b, [[RES_HI]].b
				; VBITS_GE_2048-NEXT: st1b { [[RES]].b }, [[PG]], [x0]
				; VBITS_GE_2048-NEXT: ret
				%op1 = load <256 x i8>, <256 x i8>* %a
				%op2 = load <256 x i8>, <256 x i8>* %b
				%res = sdiv <256 x i8> %op1, %op2
				store <256 x i8> %res, <256 x i8>* %a
				ret void
				}

				; Vector vXi16 sdiv operations are not legal for NEON, so use SVE when available.
				define <4 x i16> @sdiv_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: sdiv_v4i16:
				; CHECK: sunpkhi [[OP2_HI:z[0-9]+]].s, z1.h
				; CHECK-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].s, z0.h
				; CHECK-NEXT: ptrue [[PG:p[0-9]+]].s
				; CHECK-NEXT: sunpklo [[OP2_LO:z[0-9]+]].s, z1.h
				; CHECK-NEXT: sunpklo [[OP1_LO:z[0-9]+]].s, z0.h
				; CHECK-NEXT: sdivr [[RES_HI:z[0-9]+]].s, [[PG]]/m, [[OP2_HI]].s, [[OP1_HI]].s
				; CHECK-NEXT: sdiv [[RES_LO:z[0-9]+]].s, [[PG]]/m, [[OP1_LO]].s, [[OP2_LO]].s
				; CHECK-NEXT: uzp1 [[RES:z[0-9]+]].h, [[RES_LO]].h, [[RES_HI]].h
				; CHECK-NEXT: ret
				%res = sdiv <4 x i16> %op1, %op2
				ret <4 x i16> %res
				}

				define <8 x i16> @sdiv_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: sdiv_v8i16:
				; CHECK: sunpkhi [[OP2_HI:z[0-9]+]].s, z1.h
				; CHECK-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].s, z0.h
				; CHECK-NEXT: ptrue [[PG:p[0-9]+]].s
				; CHECK-NEXT: sunpklo [[OP2_LO:z[0-9]+]].s, z1.h
				; CHECK-NEXT: sunpklo [[OP1_LO:z[0-9]+]].s, z0.h
				; CHECK-NEXT: sdivr [[RES_HI:z[0-9]+]].s, [[PG]]/m, [[OP2_HI]].s, [[OP1_HI]].s
				; CHECK-NEXT: sdiv [[RES_LO:z[0-9]+]].s, [[PG]]/m, [[OP1_LO]].s, [[OP2_LO]].s
				; CHECK-NEXT: uzp1 [[RES:z[0-9]+]].h, [[RES_LO]].h, [[RES_HI]].h
				; CHECK-NEXT: ret
				%res = sdiv <8 x i16> %op1, %op2
				ret <8 x i16> %res
				}

				define void @sdiv_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: sdiv_v16i16:
				; VBITS_GE_256: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),16)]]
				; VBITS_GE_256-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_256-NEXT: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_256-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_256-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_256-NEXT: sunpklo [[OP2_LO:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_256-NEXT: sunpklo [[OP1_LO:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_256-NEXT: sdivr [[RES_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI]].s, [[OP1_HI]].s
				; VBITS_GE_256-NEXT: sdiv [[RES_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO]].s, [[OP2_LO]].s
				; VBITS_GE_256-NEXT: uzp1 [[RES:z[0-9]+]].h, [[RES_LO]].h, [[RES_HI]].h
				; VBITS_GE_256-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_256-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%res = sdiv <16 x i16> %op1, %op2
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define void @sdiv_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; CHECK-LABEL: sdiv_v32i16:
				; VBITS_GE_512: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),32)]]
				; VBITS_GE_512-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_512-NEXT: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_512-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_512-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_512-NEXT: sunpklo [[OP2_LO:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_512-NEXT: sunpklo [[OP1_LO:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_512-NEXT: sdivr [[RES_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI]].s, [[OP1_HI]].s
				; VBITS_GE_512-NEXT: sdiv [[RES_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO]].s, [[OP2_LO]].s
				; VBITS_GE_512-NEXT: uzp1 [[RES:z[0-9]+]].h, [[RES_LO]].h, [[RES_HI]].h
				; VBITS_GE_512-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_512-NEXT: ret
				%op1 = load <32 x i16>, <32 x i16>* %a
				%op2 = load <32 x i16>, <32 x i16>* %b
				%res = sdiv <32 x i16> %op1, %op2
				store <32 x i16> %res, <32 x i16>* %a
				ret void
				}

				define void @sdiv_v64i16(<64 x i16>* %a, <64 x i16>* %b) #0 {
				; CHECK-LABEL: sdiv_v64i16:
				; VBITS_GE_1024: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),64)]]
				; VBITS_GE_1024-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_1024-NEXT: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_1024-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_1024-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_1024-NEXT: sunpklo [[OP2_LO:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_1024-NEXT: sunpklo [[OP1_LO:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_1024-NEXT: sdivr [[RES_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI]].s, [[OP1_HI]].s
				; VBITS_GE_1024-NEXT: sdiv [[RES_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO]].s, [[OP2_LO]].s
				; VBITS_GE_1024-NEXT: uzp1 [[RES:z[0-9]+]].h, [[RES_LO]].h, [[RES_HI]].h
				; VBITS_GE_1024-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_1024-NEXT: ret
				%op1 = load <64 x i16>, <64 x i16>* %a
				%op2 = load <64 x i16>, <64 x i16>* %b
				%res = sdiv <64 x i16> %op1, %op2
				store <64 x i16> %res, <64 x i16>* %a
				ret void
				}

				define void @sdiv_v128i16(<128 x i16>* %a, <128 x i16>* %b) #0 {
				; CHECK-LABEL: sdiv_v128i16:
				; VBITS_GE_2048: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),128)]]
				; VBITS_GE_2048-NEXT: ld1h { [[OP1:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: ld1h { [[OP2:z[0-9]+]].h }, [[PG]]/z, [x1]
				; VBITS_GE_2048-NEXT: ptrue [[PG1:p[0-9]+]].s
				; VBITS_GE_2048-NEXT: sunpkhi [[OP1_HI:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_2048-NEXT: sunpkhi [[OP2_HI:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_2048-NEXT: sunpklo [[OP2_LO:z[0-9]+]].s, [[OP2]].h
				; VBITS_GE_2048-NEXT: sunpklo [[OP1_LO:z[0-9]+]].s, [[OP1]].h
				; VBITS_GE_2048-NEXT: sdivr [[RES_HI:z[0-9]+]].s, [[PG1]]/m, [[OP2_HI]].s, [[OP1_HI]].s
				; VBITS_GE_2048-NEXT: sdiv [[RES_LO:z[0-9]+]].s, [[PG1]]/m, [[OP1_LO]].s, [[OP2_LO]].s
				; VBITS_GE_2048-NEXT: uzp1 [[RES:z[0-9]+]].h, [[RES_LO]].h, [[RES_HI]].h
				; VBITS_GE_2048-NEXT: st1h { [[RES]].h }, [[PG]], [x0]
				; VBITS_GE_2048-NEXT: ret
				%op1 = load <128 x i16>, <128 x i16>* %a
				%op2 = load <128 x i16>, <128 x i16>* %b
				%res = sdiv <128 x i16> %op1, %op2
				store <128 x i16> %res, <128 x i16>* %a
				ret void
				}

				attributes #0 = { "target-features"="+sve" }