This is an archive of the discontinued LLVM Phabricator instance.

Differential D87796

[SVE][WIP] Lower fixed length VECREDUCE_ADD to Scalable
Closed Public

Authored by cameron.mcinally on Sep 16 2020, 2:15 PM.

Details

Reviewers

dancgr
paulwalker-arm
sdesmalen
eli.friedman
rengolin
efriedma

Commits

rGdb40a7434429: [SVE] Lower fixed length ISD::VECREDUCE_ADD to Scalable

Summary

This isn't really ready for review, but more like an RFC...

There are a number of VECREDUCE_* nodes that can be lowered to SVE instructions. The integer instructions are:

UADDV, SADDV, SMAXV, SMINV, UMAXV, and UMINV

UADDV/SADDV are a little different from the rest in that they always return an i64 result, whereas the other instructions return a result of the vector element type.
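
To make the width difference concrete, here is a toy scalar model in standalone C++ (not LLVM code). The names `uaddvModel` and `vecreduceAddModel` are hypothetical, and 8-bit lanes are an assumption chosen for illustration. The first function accumulates the zero-extended active lanes in 64 bits, as UADDV is described above; the second wraps in the element type, as a generic add reduction would.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of a predicated UADDV over 8-bit lanes (hypothetical helper):
// every active lane is zero-extended and accumulated into a 64-bit sum.
std::uint64_t uaddvModel(const std::vector<std::uint8_t> &lanes,
                         const std::vector<bool> &active) {
  std::uint64_t sum = 0;
  for (std::size_t i = 0; i < lanes.size(); ++i)
    if (active[i])
      sum += lanes[i];
  return sum;
}

// Toy model of an add reduction performed in the element type
// (hypothetical helper): the sum wraps modulo 2^8.
std::uint8_t vecreduceAddModel(const std::vector<std::uint8_t> &lanes) {
  std::uint8_t acc = 0;
  for (std::uint8_t v : lanes)
    acc = static_cast<std::uint8_t>(acc + v);
  return acc;
}
```

In this model, truncating the 64-bit sum to 8 bits gives the same value as the wrapping sum, which is why a truncate after the wide reduction should be enough to recover the element-typed result.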

That is fine, but the TableGen patterns aren't straightforward. The patterns for UADDV/SADDV return a scalar i64. The other patterns return a NEON vector register, from which the scalar result is later extracted. I'd like to understand why that 'insert-then-extract' decision was made. Here is the code in question:

class SVE_2_Op_Pat_Reduce_To_Neon<ValueType vtd, SDPatternOperator op, ValueType vt1,
                   ValueType vt2, Instruction inst, SubRegIndex sub>
: Pat<(vtd (op vt1:$Op1, vt2:$Op2)),
      (INSERT_SUBREG (vtd (IMPLICIT_DEF)), (inst $Op1, $Op2), sub)>;

And then the lowering code extracts it back out:

static SDValue getReductionSDNode(unsigned Op, SDLoc DL, SDValue ScalarOp,
                                  SelectionDAG &DAG) {
  SDValue VecOp = ScalarOp.getOperand(0);
  auto Rdx = DAG.getNode(Op, DL, VecOp.getSimpleValueType(), VecOp);
  return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, ScalarOp.getValueType(), Rdx,
                     DAG.getConstant(0, DL, MVT::i64));
}

What was the motivation for that design?

In the current state, we'll probably need an unfortunate amount of code duplication, or special cases in the lowering code, for SADDV/UADDV. I'd like to avoid that if possible...
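
As a rough illustration of how the special case might be confined to one decision rather than duplicated code, the standalone C++ sketch below picks the reduction's result width from the opcode and reports whether a truncate back to the element type is needed. `ReduceOpc`, `resultBits` and `needsTruncate` are hypothetical names for this sketch only, not LLVM API.

```cpp
// Hypothetical enumeration of the integer reductions listed in the summary.
enum class ReduceOpc { UADDV, SADDV, SMAXV, SMINV, UMAXV, UMINV };

// Width in bits of the value each reduction produces: UADDV/SADDV always
// produce 64 bits, the others produce the vector element width.
constexpr unsigned resultBits(ReduceOpc opc, unsigned eltBits) {
  return (opc == ReduceOpc::UADDV || opc == ReduceOpc::SADDV) ? 64u : eltBits;
}

// A truncate is only needed when the produced width differs from the
// element width that the generic reduction node is expected to return.
constexpr bool needsTruncate(ReduceOpc opc, unsigned eltBits) {
  return resultBits(opc, eltBits) != eltBits;
}
```

Under these assumptions only the add reductions on elements narrower than 64 bits take the extra step; every other combination falls through unchanged.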

Diff Detail

Event Timeline

cameron.mcinally created this revision. Sep 16 2020, 2:15 PM

Herald added a reviewer: rengolin. Sep 16 2020, 2:15 PM

Herald added a reviewer: efriedma.

Herald added a project: Restricted Project.

Herald added subscribers: llvm-commits, psnobl, hiraditya, tschuett.

cameron.mcinally requested review of this revision. Sep 16 2020, 2:15 PM

I think the NEON code is trying to avoid producing an extra instruction if the result ends up getting used by a vector operation. SelectionDAG technically allows instructions that produce an i64 result to return it in a floating-point register, but it doesn't work very well in practice: the operations that consume it will force it back into a GPR. This is one of the issues GlobalISel's RegBankSelect solves.

It should be possible to make the patterns for SVE do the same thing, I think?

Harbormaster completed remote builds in B71930: Diff 292324. Sep 16 2020, 3:05 PM

Oh, I hadn't realised we were handling reductions this way upstream (although it does match an old design we had downstream). This is certainly not the expected behaviour, so I'll get it fixed. The expectation is for the SVE reduction ISD nodes to reflect the underlying instructions' behaviour, which is that they set the whole vector register. The reason is that we don't want the element extraction to be done during isel, because it introduces needless vpr-gpr transitions, and there are also use cases that make use of the implicit zeroing of the upper lanes.
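
A toy model of that behaviour, in standalone C++ rather than LLVM code: the reduction writes a whole register, with the result in lane 0 and the remaining lanes zeroed, and extracting the scalar is a separate, later step. `VecReg`, `reduceAddToReg` and `extractLane0` are hypothetical names, and the two-lane 64-bit register layout is an assumption for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a vector register viewed as two 64-bit lanes.
using VecReg = std::array<std::uint64_t, 2>;

// Model of a reduction that sets the whole register: lane 0 receives the
// sum and every other lane is left at zero.
template <std::size_t N>
VecReg reduceAddToReg(const std::array<std::uint32_t, N> &lanes) {
  VecReg out{}; // value-initialised, so all lanes start at zero
  for (std::uint32_t v : lanes)
    out[0] += v;
  return out;
}

// Separate step: read the scalar out of lane 0 only when it is needed.
inline std::uint64_t extractLane0(const VecReg &r) { return r[0]; }
```

Keeping the two steps apart means a consumer that wants the vector form, including the zeroed upper lanes, can use the register directly without a round trip through a scalar.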

This is certainly not the expected behaviour so I'll get it fixed.

Ok, thanks. Since you have downstream changes dependent on this, I'll wait for your lead...

D87843 fixes the issue. Its dependence on D87842 is purely to maintain current code quality.

Thanks for the patches, @paulwalker-arm. Those helped a lot.

Here's the updated Diff for VECREDUCE_ADD ready for review. Will add the other nodes if this is acceptable...

paulwalker-arm added inline comments. Sep 22 2020, 3:00 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15924	Using Op here is confusing because that is normally what is used to reference the passed in SDValue. Existing code typically uses `Opc` or `Opcode` when naming ISD node variables.
15928	VT is normally used to mean the node's result value type. Perhaps InVT or SrcVT?
15931	This is the result VT of the reduction, so you could use getPackedSVEVectorVT, passing in the element type of SrcVT. Something like:
	EVT RdxVT = getPackedSVEVectorVT(Op == AArch64ISD::UADDV_PRED ? MVT::i64 : SrcVT.getVectorElementType());
Or you can move the code lower and use ContainerVT instead of VT, e.g.:
	EVT RdxVT = (Op == AArch64ISD::UADDV_PRED) ? MVT::nxv2i64 : ContainerVT;
15939–15940	I don't think you need convertFromScalableVector here as you can just extract the scalar result directly from the scalable vector result of the reduction.
15950–15951	It would be better to protect the truncate with something like `Rdx.ElementType != Op.getValueType()`. I say this because I don't currently see a reason why this function cannot be used for floating-point reductions other than this unprotected truncate.
llvm/test/CodeGen/AArch64/sve-fixed-length-int-reduce.ll
47	Is `#min()` required here? Same goes for the other tests.
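
On the `#min()` question, the arithmetic behind the check lines can be sketched in standalone C++. Both helpers are hypothetical. `vbytesForMinBits` encodes an observation about the test's RUN lines, namely that each `-D#VBYTES` value is the largest power-of-two byte count that fits in the `-aarch64-sve-vector-bits-min` value; `expectedVL` mirrors an expression such as `min(div(VBYTES,2),32)`.

```cpp
#include <algorithm>

// Largest power-of-two number of bytes that fits in minBits bits
// (hypothetical helper; matches the VBYTES values used in the RUN lines).
inline unsigned vbytesForMinBits(unsigned minBits) {
  unsigned bytes = 1;
  while (bytes * 2 <= minBits / 8)
    bytes *= 2;
  return bytes;
}

// Value of min(div(VBYTES, eltBytes), numElts), the vl operand the check
// lines compute (hypothetical helper).
inline unsigned expectedVL(unsigned vbytes, unsigned eltBytes,
                           unsigned numElts) {
  return std::min(vbytes / eltBytes, numElts);
}
```

Whenever a check prefix already guarantees registers wide enough for the whole fixed-length vector, the first argument to `min` is never smaller than the element count, which suggests the plain element count would check the same thing.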

A few other cleanliness opportunities popped up during the refactor. How about something like this?

paulwalker-arm accepted this revision. Sep 23 2020, 3:41 AM

paulwalker-arm added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.h
927	Opcode

This revision is now accepted and ready to land. Sep 23 2020, 3:41 AM

Closed by commit rGdb40a7434429: [SVE] Lower fixed length ISD::VECREDUCE_ADD to Scalable (authored by cameron.mcinally). Sep 23 2020, 7:08 AM

This revision was automatically updated to reflect the committed changes.

cameron.mcinally added a commit: rGdb40a7434429: [SVE] Lower fixed length ISD::VECREDUCE_ADD to Scalable.

cameron.mcinally mentioned this in D88444: [SVE] Lower fixed length VECREDUCE_[FMAX|FMIN] to Scalable. Sep 29 2020, 9:24 AM

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.h	2 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	36 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-int-reduce.ll	319 lines

Diff 293264

llvm/lib/Target/AArch64/AArch64ISelLowering.h

[918 lines not shown; context: private:]
   SDValue LowerSVEStructLoad(unsigned Intrinsic, ArrayRef<SDValue> LoadOps,
                              EVT VT, SelectionDAG &DAG, const SDLoc &DL) const;

   SDValue LowerFixedLengthVectorIntDivideToSVE(SDValue Op,
                                                SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorIntExtendToSVE(SDValue Op,
                                                SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorLoadToSVE(SDValue Op, SelectionDAG &DAG) const;
+  SDValue LowerFixedLengthReductionToSVE(unsigned Op, SDValue ScalarOp,
+                                         SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorSelectToSVE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorSetccToSVE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorStoreToSVE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFixedLengthVectorTruncateToSVE(SDValue Op,
                                               SelectionDAG &DAG) const;

   SDValue BuildSDIVPow2(SDNode *N, const APInt &Divisor, SelectionDAG &DAG,
                         SmallVectorImpl<SDNode *> &Created) const override;
[77 lines not shown]

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[1,189 lines not shown; context: void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT) {]
   setOperationAction(ISD::SRA, VT, Custom);
   setOperationAction(ISD::SRL, VT, Custom);
   setOperationAction(ISD::STORE, VT, Custom);
   setOperationAction(ISD::SUB, VT, Custom);
   setOperationAction(ISD::TRUNCATE, VT, Custom);
   setOperationAction(ISD::UDIV, VT, Custom);
   setOperationAction(ISD::UMAX, VT, Custom);
   setOperationAction(ISD::UMIN, VT, Custom);
+  setOperationAction(ISD::VECREDUCE_ADD, VT, Custom);
   setOperationAction(ISD::VSELECT, VT, Custom);
   setOperationAction(ISD::XOR, VT, Custom);
   setOperationAction(ISD::ZERO_EXTEND, VT, Custom);
 }

 void AArch64TargetLowering::addDRTypeForNEON(MVT VT) {
   addRegisterClass(VT, &AArch64::FPR64RegClass);
   addTypeForNEON(VT, MVT::v2i32);
[8,376 lines not shown; context: static SDValue getReductionSDNode(unsigned Op, SDLoc DL, SDValue ScalarOp,]
   SDValue VecOp = ScalarOp.getOperand(0);
   auto Rdx = DAG.getNode(Op, DL, VecOp.getSimpleValueType(), VecOp);
   return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, ScalarOp.getValueType(), Rdx,
                      DAG.getConstant(0, DL, MVT::i64));
 }

 SDValue AArch64TargetLowering::LowerVECREDUCE(SDValue Op,
                                               SelectionDAG &DAG) const {
+  SDValue VecOp = Op.getOperand(0);
+
   SDLoc dl(Op);
   switch (Op.getOpcode()) {
   case ISD::VECREDUCE_ADD:
+    if (useSVEForFixedLengthVectorVT(VecOp.getValueType()))
+      return LowerFixedLengthReductionToSVE(AArch64ISD::UADDV_PRED, Op, DAG);
     return getReductionSDNode(AArch64ISD::UADDV, dl, Op, DAG);
   case ISD::VECREDUCE_SMAX:
     return getReductionSDNode(AArch64ISD::SMAXV, dl, Op, DAG);
   case ISD::VECREDUCE_SMIN:
     return getReductionSDNode(AArch64ISD::SMINV, dl, Op, DAG);
   case ISD::VECREDUCE_UMAX:
     return getReductionSDNode(AArch64ISD::UMAXV, dl, Op, DAG);
   case ISD::VECREDUCE_UMIN:
[6,310 lines not shown; context: assert(useSVEForFixedLengthVectorVT(V.getValueType()) &&]
            "Only fixed length vectors are supported!");
     Ops.push_back(convertToScalableVector(DAG, ContainerVT, V));
   }

   auto ScalableRes = DAG.getNode(Op.getOpcode(), SDLoc(Op), ContainerVT, Ops);
   return convertFromScalableVector(DAG, VT, ScalableRes);
 }

+SDValue AArch64TargetLowering::LowerFixedLengthReductionToSVE(unsigned Op,
+    SDValue ScalarOp, SelectionDAG &DAG) const {
+  SDLoc DL(ScalarOp);
+  SDValue VecOp = ScalarOp.getOperand(0);
+  EVT VT = VecOp.getValueType();
+
+  // UADDV always returns a vector of i64 result.
+  EVT ResVT = (Op == AArch64ISD::UADDV_PRED) ? MVT::v2i64 : VT;
+
+  SDValue Pg = getPredicateForVector(DAG, DL, VT);
+  EVT ContainerVT = getContainerForFixedLengthVector(DAG, VT);
+  EVT ResContainerVT = getContainerForFixedLengthVector(DAG, ResVT);
+
+  // Generate the scalable operation.
+  VecOp = convertToScalableVector(DAG, ContainerVT, VecOp);
+  SDValue Rdx = DAG.getNode(Op, DL, ResContainerVT, Pg, VecOp);
+  SDValue Res = convertFromScalableVector(DAG, ResVT, Rdx);
+
+  // The reduction patterns return a vector result, so extract
+  // the scalar.
+  SDValue Ext = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL,
+                            ResVT.getVectorElementType(),
+                            Res, DAG.getConstant(0, DL, MVT::i64));
+
+  // This is needed for UADDV, since it returns an i64 result. The
+  // VEC_REDUCE nodes expect an element size result.
+  // FIXME: This should be a no-op trunc on the other nodes. Confirm
+  // and remove this comment.
+  return DAG.getNode(ISD::TRUNCATE, DL, ScalarOp.getValueType(), Ext);
+}
+
 SDValue
 AArch64TargetLowering::LowerFixedLengthVectorSelectToSVE(SDValue Op,
                                                          SelectionDAG &DAG) const {
   EVT VT = Op.getValueType();
   SDLoc DL(Op);

   EVT InVT = Op.getOperand(1).getValueType();
   EVT ContainerVT = getContainerForFixedLengthVector(DAG, InVT);
[44 lines not shown]

llvm/test/CodeGen/AArch64/sve-fixed-length-int-reduce.ll

This file was added.

				; RUN: llc -aarch64-sve-vector-bits-min=128 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=16 -check-prefix=NO_SVE
				; RUN: llc -aarch64-sve-vector-bits-min=256 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK,VBITS_EQ_256
				; RUN: llc -aarch64-sve-vector-bits-min=384 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK
				; RUN: llc -aarch64-sve-vector-bits-min=512 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=640 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=768 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=896 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=1024 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1152 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1280 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1408 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1536 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1664 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1792 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1920 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=2048 -asm-verbose=0 < %s \| FileCheck %s -D#VBYTES=256 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024,VBITS_GE_2048

				target triple = "aarch64-unknown-linux-gnu"

				; Don't use SVE when its registers are no bigger than NEON.
				; NO_SVE-NOT: ptrue

				;
				; UADDV
				;

				; Don't use SVE for 64-bit vectors.
				define i8 @uaddv_v8i8(<8 x i8> %a) #0 {
				; CHECK-LABEL: uaddv_v8i8:
				; CHECK: addv b0, v0.8b
				; CHECK: ret
				%res = call i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8> %a)
				ret i8 %res
				}

				; Don't use SVE for 128-bit vectors.
				define i8 @uaddv_v16i8(<16 x i8> %a) #0 {
				; CHECK-LABEL: uaddv_v16i8:
				; CHECK: addv b0, v0.16b
				; CHECK: ret
				%res = call i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8> %a)
				ret i8 %res
				}

				define i8 @uaddv_v32i8(<32 x i8>* %a) #0 {
				; CHECK-LABEL: uaddv_v32i8:
				; VBITS_GE_256: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,32)]]
				; VBITS_GE_256-DAG: ld1b { [[OP:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].b
				; VBITS_GE_256-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_256-NEXT: ret
				%op = load <32 x i8>, <32 x i8>* %a
				%res = call i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8> %op)
				ret i8 %res
				}

				define i8 @uaddv_v64i8(<64 x i8>* %a) #0 {
				; CHECK-LABEL: uaddv_v64i8:
				; VBITS_GE_512: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,64)]]
				; VBITS_GE_512-DAG: ld1b { [[OP:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].b
				; VBITS_GE_512-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_512-NEXT: ret
				%op = load <64 x i8>, <64 x i8>* %a
				%res = call i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8> %op)
				ret i8 %res
				}

				define i8 @uaddv_v128i8(<128 x i8>* %a) #0 {
				; CHECK-LABEL: uaddv_v128i8:
				; VBITS_GE_1024: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,128)]]
				; VBITS_GE_1024-DAG: ld1b { [[OP:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].b
				; VBITS_GE_1024-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_1024-NEXT: ret
				%op = load <128 x i8>, <128 x i8>* %a
				%res = call i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x i8> %op)
				ret i8 %res
				}

				define i8 @uaddv_v256i8(<256 x i8>* %a) #0 {
				; CHECK-LABEL: uaddv_v256i8:
				; VBITS_GE_2048: ptrue [[PG:p[0-9]+]].b, vl[[#min(VBYTES,256)]]
				; VBITS_GE_2048-DAG: ld1b { [[OP:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].b
				; VBITS_GE_2048-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_2048-NEXT: ret
				%op = load <256 x i8>, <256 x i8>* %a
				%res = call i8 @llvm.experimental.vector.reduce.add.v256i8(<256 x i8> %op)
				ret i8 %res
				}

				; Don't use SVE for 64-bit vectors.
				define i16 @uaddv_v4i16(<4 x i16> %a) #0 {
				; CHECK-LABEL: uaddv_v4i16:
				; CHECK: addv h0, v0.4h
				; CHECK: ret
				%res = call i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16> %a)
				ret i16 %res
				}

				; Don't use SVE for 128-bit vectors.
				define i16 @uaddv_v8i16(<8 x i16> %a) #0 {
				; CHECK-LABEL: uaddv_v8i16:
				; CHECK: addv h0, v0.8h
				; CHECK: ret
				%res = call i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16> %a)
				ret i16 %res
				}

				define i16 @uaddv_v16i16(<16 x i16>* %a) #0 {
				; CHECK-LABEL: uaddv_v16i16:
				; VBITS_GE_256: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),16)]]
				; VBITS_GE_256-DAG: ld1h { [[OP:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].h
				; VBITS_GE_256-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_256-NEXT: ret
				%op = load <16 x i16>, <16 x i16>* %a
				%res = call i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16> %op)
				ret i16 %res
				}

				define i16 @uaddv_v32i16(<32 x i16>* %a) #0 {
				; CHECK-LABEL: uaddv_v32i16:
				; VBITS_GE_512: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),32)]]
				; VBITS_GE_512-DAG: ld1h { [[OP:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].h
				; VBITS_GE_512-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_512-NEXT: ret
				%op = load <32 x i16>, <32 x i16>* %a
				%res = call i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16> %op)
				ret i16 %res
				}

				define i16 @uaddv_v64i16(<64 x i16>* %a) #0 {
				; CHECK-LABEL: uaddv_v64i16:
				; VBITS_GE_1024: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),64)]]
				; VBITS_GE_1024-DAG: ld1h { [[OP:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].h
				; VBITS_GE_1024-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_1024-NEXT: ret
				%op = load <64 x i16>, <64 x i16>* %a
				%res = call i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x i16> %op)
				ret i16 %res
				}

				define i16 @uaddv_v128i16(<128 x i16>* %a) #0 {
				; CHECK-LABEL: uaddv_v128i16:
				; VBITS_GE_2048: ptrue [[PG:p[0-9]+]].h, vl[[#min(div(VBYTES,2),128)]]
				; VBITS_GE_2048-DAG: ld1h { [[OP:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].h
				; VBITS_GE_2048-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_2048-NEXT: ret
				%op = load <128 x i16>, <128 x i16>* %a
				%res = call i16 @llvm.experimental.vector.reduce.add.v128i16(<128 x i16> %op)
				ret i16 %res
				}

				; Don't use SVE for 64-bit vectors.
				define i32 @uaddv_v2i32(<2 x i32> %a) #0 {
				; CHECK-LABEL: uaddv_v2i32:
				; CHECK: addp v0.2s, v0.2s
				; CHECK: ret
				%res = call i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32> %a)
				ret i32 %res
				}

				; Don't use SVE for 128-bit vectors.
				define i32 @uaddv_v4i32(<4 x i32> %a) #0 {
				; CHECK-LABEL: uaddv_v4i32:
				; CHECK: addv s0, v0.4s
				; CHECK: ret
				%res = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> %a)
				ret i32 %res
				}

				define i32 @uaddv_v8i32(<8 x i32>* %a) #0 {
				; CHECK-LABEL: uaddv_v8i32:
				; VBITS_GE_256: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,4),8)]]
				; VBITS_GE_256-DAG: ld1w { [[OP:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].s
				; VBITS_GE_256-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_256-NEXT: ret
				%op = load <8 x i32>, <8 x i32>* %a
				%res = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> %op)
				ret i32 %res
				}

				define i32 @uaddv_v16i32(<16 x i32>* %a) #0 {
				; CHECK-LABEL: uaddv_v16i32:
				; VBITS_GE_512: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,4),16)]]
				; VBITS_GE_512-DAG: ld1w { [[OP:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].s
				; VBITS_GE_512-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_512-NEXT: ret
				%op = load <16 x i32>, <16 x i32>* %a
				%res = call i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32> %op)
				ret i32 %res
				}

				define i32 @uaddv_v32i32(<32 x i32>* %a) #0 {
				; CHECK-LABEL: uaddv_v32i32:
				; VBITS_GE_1024: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,4),32)]]
				; VBITS_GE_1024-DAG: ld1w { [[OP:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].s
				; VBITS_GE_1024-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_1024-NEXT: ret
				%op = load <32 x i32>, <32 x i32>* %a
				%res = call i32 @llvm.experimental.vector.reduce.add.v32i32(<32 x i32> %op)
				ret i32 %res
				}

				define i32 @uaddv_v64i32(<64 x i32>* %a) #0 {
				; CHECK-LABEL: uaddv_v64i32:
				; VBITS_GE_2048: ptrue [[PG:p[0-9]+]].s, vl[[#min(div(VBYTES,4),64)]]
				; VBITS_GE_2048-DAG: ld1w { [[OP:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].s
				; VBITS_GE_2048-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_2048-NEXT: ret
				%op = load <64 x i32>, <64 x i32>* %a
				%res = call i32 @llvm.experimental.vector.reduce.add.v64i32(<64 x i32> %op)
				ret i32 %res
				}

				; Nothing to do for 64-bit vectors.
				define i64 @uaddv_v1i64(<1 x i64> %a) #0 {
				; CHECK-LABEL: uaddv_v1i64:
				; CHECK: fmov x0, d0
				; CHECK: ret
				%res = call i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x i64> %a)
				ret i64 %res
				}

				; Don't use SVE for 128-bit vectors.
				define i64 @uaddv_v2i64(<2 x i64> %a) #0 {
				; CHECK-LABEL: uaddv_v2i64:
				; CHECK: addp d0, v0.2d
				; CHECK: ret
				%res = call i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x i64> %a)
				ret i64 %res
				}

				define i64 @uaddv_v4i64(<4 x i64>* %a) #0 {
				; CHECK-LABEL: uaddv_v4i64:
				; VBITS_GE_256: ptrue [[PG:p[0-9]+]].d, vl[[#min(div(VBYTES,8),4)]]
				; VBITS_GE_256-DAG: ld1d { [[OP:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].d
				; VBITS_GE_256-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_256-NEXT: ret
				%op = load <4 x i64>, <4 x i64>* %a
				%res = call i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64> %op)
				ret i64 %res
				}

				define i64 @uaddv_v8i64(<8 x i64>* %a) #0 {
				; CHECK-LABEL: uaddv_v8i64:
				; VBITS_GE_512: ptrue [[PG:p[0-9]+]].d, vl[[#min(div(VBYTES,8),8)]]
				; VBITS_GE_512-DAG: ld1d { [[OP:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].d
				; VBITS_GE_512-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_512-NEXT: ret
				%op = load <8 x i64>, <8 x i64>* %a
				%res = call i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x i64> %op)
				ret i64 %res
				}

				define i64 @uaddv_v16i64(<16 x i64>* %a) #0 {
				; CHECK-LABEL: uaddv_v16i64:
				; VBITS_GE_1024: ptrue [[PG:p[0-9]+]].d, vl[[#min(div(VBYTES,8),16)]]
				; VBITS_GE_1024-DAG: ld1d { [[OP:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].d
				; VBITS_GE_1024-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_1024-NEXT: ret
				%op = load <16 x i64>, <16 x i64>* %a
				%res = call i64 @llvm.experimental.vector.reduce.add.v16i64(<16 x i64> %op)
				ret i64 %res
				}

				define i64 @uaddv_v32i64(<32 x i64>* %a) #0 {
				; CHECK-LABEL: uaddv_v32i64:
				; VBITS_GE_2048: ptrue [[PG:p[0-9]+]].d, vl[[#min(div(VBYTES,8),32)]]
				; VBITS_GE_2048-DAG: ld1d { [[OP:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: uaddv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].d
				; VBITS_GE_2048-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_2048-NEXT: ret
				%op = load <32 x i64>, <32 x i64>* %a
				%res = call i64 @llvm.experimental.vector.reduce.add.v32i64(<32 x i64> %op)
				ret i64 %res
				}

				attributes #0 = { "target-features"="+sve" }

				declare i8 @llvm.experimental.vector.reduce.add.v8i8(<8 x i8>)
				declare i8 @llvm.experimental.vector.reduce.add.v16i8(<16 x i8>)
				declare i8 @llvm.experimental.vector.reduce.add.v32i8(<32 x i8>)
				declare i8 @llvm.experimental.vector.reduce.add.v64i8(<64 x i8>)
				declare i8 @llvm.experimental.vector.reduce.add.v128i8(<128 x i8>)
				declare i8 @llvm.experimental.vector.reduce.add.v256i8(<256 x i8>)

				declare i16 @llvm.experimental.vector.reduce.add.v4i16(<4 x i16>)
				declare i16 @llvm.experimental.vector.reduce.add.v8i16(<8 x i16>)
				declare i16 @llvm.experimental.vector.reduce.add.v16i16(<16 x i16>)
				declare i16 @llvm.experimental.vector.reduce.add.v32i16(<32 x i16>)
				declare i16 @llvm.experimental.vector.reduce.add.v64i16(<64 x i16>)
				declare i16 @llvm.experimental.vector.reduce.add.v128i16(<128 x i16>)

				declare i32 @llvm.experimental.vector.reduce.add.v2i32(<2 x i32>)
				declare i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32>)
				declare i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32>)
				declare i32 @llvm.experimental.vector.reduce.add.v16i32(<16 x i32>)
				declare i32 @llvm.experimental.vector.reduce.add.v32i32(<32 x i32>)
				declare i32 @llvm.experimental.vector.reduce.add.v64i32(<64 x i32>)

				declare i64 @llvm.experimental.vector.reduce.add.v1i64(<1 x i64>)
				declare i64 @llvm.experimental.vector.reduce.add.v2i64(<2 x i64>)
				declare i64 @llvm.experimental.vector.reduce.add.v4i64(<4 x i64>)
				declare i64 @llvm.experimental.vector.reduce.add.v8i64(<8 x i64>)
				declare i64 @llvm.experimental.vector.reduce.add.v16i64(<16 x i64>)
				declare i64 @llvm.experimental.vector.reduce.add.v32i64(<32 x i64>)