This is an archive of the discontinued LLVM Phabricator instance.

Differential D88707

[SVE] Lower fixed length VECREDUCE_AND operation
Closed, Public

Authored by cameron.mcinally on Oct 1 2020, 3:33 PM.

Details

Reviewers

paulwalker-arm
kmclaughlin
dancgr
efriedma
rengolin

Commits

rG9642ded8ba64: [SVE] Lower fixed length VECREDUCE_AND operation

Summary

Some things to consider:

1. NEON currently lowers the 64-bit and 128-bit vector AND reductions to a tree-based algorithm, if I'm not mistaken. Do we want to go with the NEON lowerings or SVE instructions for those? The i8 and i16 cases should probably use SVE for code clarity. But the v2 vectors may be short enough that NEON is a win. They'll expand to something like:

   ext v1.16b, v0.16b, v0.16b, #8
   and v0.8b, v0.8b, v1.8b

   Are we okay with tuning these later? Or should I do a study now?

2. If we choose SVE instructions for #1, the OverrideNEON flag is getting bulky again. We might want to consider refactoring that, since we'll need to add more cases for OR and XOR.

3. I named the new test file sve-fixed-length-log-reduce.ll. The "log" follows the existing AND tests in /AArch64, but is a little misleading since they're bitwise operations, not logical. Any suggestions on alternative names?
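As a rough illustration of the trade-off raised in the first point, the sketch below models a tree-style AND reduction (fold the upper half of the lanes onto the lower half until one lane remains) next to a plain left-to-right fold. The function names and the power-of-two restriction are assumptions made for this example. It is a scalar model only, not the NEON or SVE lowering itself, and it says nothing about which choice is faster.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference: fold the lanes left to right.
std::uint32_t reduceAndLinear(const std::vector<std::uint32_t> &Lanes) {
  std::uint32_t Acc = ~std::uint32_t{0}; // all-ones is the identity for AND
  for (std::uint32_t Lane : Lanes)
    Acc &= Lane;
  return Acc;
}

// Tree-style reduction: AND the upper half onto the lower half, then repeat
// on the surviving half. The EXT + AND pair quoted in the summary corresponds
// to the first such halving step.
// Assumption for this sketch: the lane count is a non-zero power of two.
std::uint32_t reduceAndTree(std::vector<std::uint32_t> Lanes) {
  assert(!Lanes.empty() && (Lanes.size() & (Lanes.size() - 1)) == 0);
  for (std::size_t Half = Lanes.size() / 2; Half >= 1; Half /= 2)
    for (std::size_t I = 0; I < Half; ++I)
      Lanes[I] &= Lanes[I + Half];
  return Lanes[0];
}
```

For the lanes {0xFF0F, 0x0FFF, 0xFFF3, 0x3F0F}, both functions return 0x0F03. The tree form needs log2(N) halving steps, which is why very short vectors such as v2 may not benefit from a dedicated reduction instruction.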

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

cameron.mcinally created this revision. Oct 1 2020, 3:33 PM

Herald added a reviewer: rengolin. Oct 1 2020, 3:33 PM

Herald added a project: Restricted Project.

Herald added subscribers: llvm-commits, psnobl, hiraditya and 2 others.

cameron.mcinally requested review of this revision. Oct 1 2020, 3:33 PM

Harbormaster completed remote builds in B73717: Diff 295679. Oct 1 2020, 3:44 PM

Re: OverrideNEON: I wouldn't get too hung up on this. The original intention was to reduce the chances of going down broken code paths and to avoid surprising people with a complete change of code generation output (a.k.a. Operation "get something that's usable quickly"). Once wide vector support is driven by function attributes and we start adding proper v#i1 support, we'll be forced to break away from NEON and OverrideNEON will become a distant memory.
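For readers following the OverrideNEON discussion, the condition this patch computes in LowerVECREDUCE can be modeled in isolation. This is a minimal sketch: the enums `ReduceOp` and `ElemTy` and the function `overrideNEON` are hypothetical stand-ins invented for this example, not LLVM's ISD opcodes or MVT types.

```cpp
#include <cassert>

// Hypothetical stand-ins for the ISD::VECREDUCE_* opcodes and the vector
// element types; these are not the LLVM definitions.
enum class ReduceOp { Add, And, SMax, SMin, UMax, UMin };
enum class ElemTy { I8, I16, I32, I64 };

// Mirrors the boolean in the patched LowerVECREDUCE:
//  - AND reductions always override NEON (SVE is used for 64/128-bit inputs);
//  - every other reduction except ADD overrides NEON only for i64 elements.
bool overrideNEON(ReduceOp Op, ElemTy Elt) {
  return Op == ReduceOp::And ||
         (Op != ReduceOp::Add && Elt == ElemTy::I64);
}
```

Under this model `overrideNEON(ReduceOp::And, ElemTy::I8)` is true while `overrideNEON(ReduceOp::Add, ElemTy::I64)` is false. Adding OR and XOR would mean extending the first clause, which is the growth the summary is concerned about.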

In D88707#2308382, @paulwalker-arm wrote:

we start adding proper v#i1 support

Glad to hear this. Knowing a vector is a mask will go a long way during lowering...

paulwalker-arm accepted this revision. Oct 5 2020, 4:56 AM

paulwalker-arm added inline comments.

llvm/test/CodeGen/AArch64/sve-fixed-length-log-reduce.ll
Line 68: Can you add VBITS_EQ_256 check lines to the VBITS_GE_512-related tests to ensure sensible type legalisation? See sve-fixed-length-int-minmax.ll for an example. It looks like I missed this for the other reduction tests.

This revision is now accepted and ready to land. Oct 5 2020, 4:56 AM

This revision was landed with ongoing or failed builds. Oct 5 2020, 9:34 AM

Closed by commit rG9642ded8ba64: [SVE] Lower fixed length VECREDUCE_AND operation (authored by cameron.mcinally).

This revision was automatically updated to reflect the committed changes.

cameron.mcinally added a commit: rG9642ded8ba64: [SVE] Lower fixed length VECREDUCE_AND operation.

Committed. It looks like the legalisations seem reasonable. Something like:

; VBITS_EQ_256-DAG: and [[AND:z[0-9]+]].d, [[LO]].d, [[HI]].d
; VBITS_EQ_256-DAG: andv h[[REDUCE:[0-9]+]], [[PG]], [[AND]].h

How are the legalisation tests usually handled? Are they done once for a class of instructions? Or should I go back to add CHECKs for the other reductions too? @kmclaughlin
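The legalised sequence quoted above (load LO and HI, lane-wise `and`, then a single `andv`) can be modeled with a small scalar sketch. It assumes a vector exactly twice the register width, as in the v32i16 case at 256 bits. The names `andv` and `andvSplit` are chosen for this example only and are not LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// AND-reduce every lane of one "register" (models a single ANDV).
std::uint16_t andv(const std::vector<std::uint16_t> &Reg) {
  std::uint16_t Acc = 0xFFFF; // all-ones is the identity for AND
  for (std::uint16_t Lane : Reg)
    Acc = static_cast<std::uint16_t>(Acc & Lane);
  return Acc;
}

// Models the split seen in the VBITS_EQ_256 checks: the wide vector is
// handled as LO and HI halves, combined lane-wise, then reduced once.
// Assumption for this sketch: the lane count is even.
std::uint16_t andvSplit(const std::vector<std::uint16_t> &Wide) {
  assert(Wide.size() % 2 == 0);
  const std::size_t Half = Wide.size() / 2;
  std::vector<std::uint16_t> Lo(Wide.begin(), Wide.begin() + Half);
  std::vector<std::uint16_t> Hi(Wide.begin() + Half, Wide.end());
  for (std::size_t I = 0; I < Half; ++I)
    Lo[I] = static_cast<std::uint16_t>(Lo[I] & Hi[I]);
  return andv(Lo);
}
```

Because AND is associative and commutative, `andvSplit` returns the same value as `andv` over the full vector, which is why one lane-wise `and` followed by one `andv` is a sensible legalisation.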

In D88707#2312081, @cameron.mcinally wrote:
Committed. It looks like the legalisations seem reasonable. Something like:
; VBITS_EQ_256-DAG: and [[AND:z[0-9]+]].d, [[LO]].d, [[HI]].d
; VBITS_EQ_256-DAG: andv h[[REDUCE:[0-9]+]], [[PG]], [[AND]].h
How are the legalisation tests usually handled? Are they done once for a class of instructions? Or should I go back to add CHECKs for the other reductions too? @kmclaughlin

In general I've tried to always add VBITS_EQ_256 check lines to any test that also has VBITS_GE_512 check lines.
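To make the prefix convention concrete, the sketch below reproduces the mapping used by the RUN lines of sve-fixed-length-log-reduce.ll from a `-aarch64-sve-vector-bits-min` value to its `VBYTES` value and FileCheck prefixes. The helper names `vbytesFor` and `prefixesFor` are hypothetical, and the mapping is read off this one test file rather than taken from any LLVM utility.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Largest power-of-two width not exceeding MinBits, expressed in bytes.
// Matches the VBYTES values in the RUN lines (e.g. 384 -> 32, 640 -> 64).
unsigned vbytesFor(unsigned MinBits) {
  assert(MinBits >= 128 && MinBits <= 2048 && MinBits % 128 == 0);
  unsigned Pow2 = 128;
  while (Pow2 * 2 <= MinBits)
    Pow2 *= 2;
  return Pow2 / 8;
}

// FileCheck prefixes enabled for a given minimum vector width, as listed in
// the RUN lines of this test file.
std::vector<std::string> prefixesFor(unsigned MinBits) {
  assert(MinBits >= 128 && MinBits <= 2048 && MinBits % 128 == 0);
  if (MinBits == 128)
    return {"NO_SVE"}; // SVE registers no bigger than NEON
  std::vector<std::string> Prefixes = {"CHECK"};
  if (MinBits == 256)
    Prefixes.push_back("VBITS_EQ_256"); // legalisation checks only
  if (MinBits >= 512)
    Prefixes.push_back("VBITS_GE_512");
  if (MinBits >= 1024)
    Prefixes.push_back("VBITS_GE_1024");
  if (MinBits >= 2048)
    Prefixes.push_back("VBITS_GE_2048");
  return Prefixes;
}
```

Since `VBITS_EQ_256` is enabled only for the 256-bit run, check lines under that prefix exercise exactly the case where a 512-bit vector has to be split, which is the legalisation coverage requested in the review.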

Legalisation tests added for sve-fixed-length-int-reduce.ll in:

6bec45e2558566e10be71280a3e2c1b144f1b236

Legalisation tests added for sve-fixed-length-fp-reduce.ll in:

365ef499d60

Can you update the tests to use the new non-experimental intrinsic name?

In D88707#2335496, @aemerson wrote:

Can you update the tests to use the new non-experimental intrinsic name?

Sure. I'll land that today....

Revision Contents

Path                                                        Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp             16 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-log-reduce.ll    374 lines

Diff 296202

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

@@ ... @@ if (useSVEForFixedLengthVectors()) {
       setOperationAction(ISD::UDIV, MVT::v2i32, Custom);
       setOperationAction(ISD::UDIV, MVT::v4i32, Custom);
       setOperationAction(ISD::UDIV, MVT::v1i64, Custom);
       setOperationAction(ISD::UDIV, MVT::v2i64, Custom);
       setOperationAction(ISD::UMAX, MVT::v1i64, Custom);
       setOperationAction(ISD::UMAX, MVT::v2i64, Custom);
       setOperationAction(ISD::UMIN, MVT::v1i64, Custom);
       setOperationAction(ISD::UMIN, MVT::v2i64, Custom);
+      setOperationAction(ISD::VECREDUCE_AND, MVT::v8i8, Custom);
+      setOperationAction(ISD::VECREDUCE_AND, MVT::v16i8, Custom);
+      setOperationAction(ISD::VECREDUCE_AND, MVT::v4i16, Custom);
+      setOperationAction(ISD::VECREDUCE_AND, MVT::v8i16, Custom);
+      setOperationAction(ISD::VECREDUCE_AND, MVT::v2i32, Custom);
+      setOperationAction(ISD::VECREDUCE_AND, MVT::v4i32, Custom);
+      setOperationAction(ISD::VECREDUCE_AND, MVT::v2i64, Custom);
       setOperationAction(ISD::VECREDUCE_SMAX, MVT::v2i64, Custom);
       setOperationAction(ISD::VECREDUCE_SMIN, MVT::v2i64, Custom);
       setOperationAction(ISD::VECREDUCE_UMAX, MVT::v2i64, Custom);
       setOperationAction(ISD::VECREDUCE_UMIN, MVT::v2i64, Custom);
     }
   }

   PredictableSelectIsExpensive = Subtarget->predictableSelectIsExpensive();
@@ ... @@ void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT) {
   setOperationAction(ISD::SRL, VT, Custom);
   setOperationAction(ISD::STORE, VT, Custom);
   setOperationAction(ISD::SUB, VT, Custom);
   setOperationAction(ISD::TRUNCATE, VT, Custom);
   setOperationAction(ISD::UDIV, VT, Custom);
   setOperationAction(ISD::UMAX, VT, Custom);
   setOperationAction(ISD::UMIN, VT, Custom);
   setOperationAction(ISD::VECREDUCE_ADD, VT, Custom);
+  setOperationAction(ISD::VECREDUCE_AND, VT, Custom);
   setOperationAction(ISD::VECREDUCE_FMAX, VT, Custom);
   setOperationAction(ISD::VECREDUCE_FMIN, VT, Custom);
   setOperationAction(ISD::VECREDUCE_SMAX, VT, Custom);
   setOperationAction(ISD::VECREDUCE_SMIN, VT, Custom);
   setOperationAction(ISD::VECREDUCE_UMAX, VT, Custom);
   setOperationAction(ISD::VECREDUCE_UMIN, VT, Custom);
   setOperationAction(ISD::VSELECT, VT, Custom);
   setOperationAction(ISD::XOR, VT, Custom);
@@ ... @@ case ISD::FLT_ROUNDS_:
     return LowerFLT_ROUNDS_(Op, DAG);
   case ISD::MUL:
     return LowerMUL(Op, DAG);
   case ISD::INTRINSIC_WO_CHAIN:
     return LowerINTRINSIC_WO_CHAIN(Op, DAG);
   case ISD::STORE:
     return LowerSTORE(Op, DAG);
   case ISD::VECREDUCE_ADD:
+  case ISD::VECREDUCE_AND:
   case ISD::VECREDUCE_SMAX:
   case ISD::VECREDUCE_SMIN:
   case ISD::VECREDUCE_UMAX:
   case ISD::VECREDUCE_UMIN:
   case ISD::VECREDUCE_FMAX:
   case ISD::VECREDUCE_FMIN:
     return LowerVECREDUCE(Op, DAG);
   case ISD::ATOMIC_LOAD_SUB:
@@ ... @@
 }

 SDValue AArch64TargetLowering::LowerVECREDUCE(SDValue Op,
                                               SelectionDAG &DAG) const {
   SDValue Src = Op.getOperand(0);

   // Try to lower fixed length reductions to SVE.
   EVT SrcVT = Src.getValueType();
-  bool OverrideNEON = Op.getOpcode() != ISD::VECREDUCE_ADD &&
-                      SrcVT.getVectorElementType() == MVT::i64;
+  bool OverrideNEON = Op.getOpcode() == ISD::VECREDUCE_AND ||
+                      (Op.getOpcode() != ISD::VECREDUCE_ADD &&
+                       SrcVT.getVectorElementType() == MVT::i64);
   if (useSVEForFixedLengthVectorVT(SrcVT, OverrideNEON)) {
     switch (Op.getOpcode()) {
     case ISD::VECREDUCE_ADD:
       return LowerFixedLengthReductionToSVE(AArch64ISD::UADDV_PRED, Op, DAG);
+    case ISD::VECREDUCE_AND:
+      return LowerFixedLengthReductionToSVE(AArch64ISD::ANDV_PRED, Op, DAG);
     case ISD::VECREDUCE_SMAX:
       return LowerFixedLengthReductionToSVE(AArch64ISD::SMAXV_PRED, Op, DAG);
     case ISD::VECREDUCE_SMIN:
       return LowerFixedLengthReductionToSVE(AArch64ISD::SMINV_PRED, Op, DAG);
     case ISD::VECREDUCE_UMAX:
       return LowerFixedLengthReductionToSVE(AArch64ISD::UMAXV_PRED, Op, DAG);
     case ISD::VECREDUCE_UMIN:
       return LowerFixedLengthReductionToSVE(AArch64ISD::UMINV_PRED, Op, DAG);

llvm/test/CodeGen/AArch64/sve-fixed-length-log-reduce.ll

This file was added.

				; RUN: llc -aarch64-sve-vector-bits-min=128 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=16 -check-prefix=NO_SVE
				; RUN: llc -aarch64-sve-vector-bits-min=256 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK,VBITS_EQ_256
				; RUN: llc -aarch64-sve-vector-bits-min=384 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=32 -check-prefixes=CHECK
				; RUN: llc -aarch64-sve-vector-bits-min=512 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=640 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=768 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=896 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=64 -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=1024 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1152 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1280 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1408 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1536 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1664 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1792 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=1920 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=128 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024
				; RUN: llc -aarch64-sve-vector-bits-min=2048 -asm-verbose=0 < %s | FileCheck %s -D#VBYTES=256 -check-prefixes=CHECK,VBITS_GE_512,VBITS_GE_1024,VBITS_GE_2048

				target triple = "aarch64-unknown-linux-gnu"

				; Don't use SVE when its registers are no bigger than NEON.
				; NO_SVE-NOT: ptrue

				;
				; ANDV
				;

				; No single instruction NEON ANDV support. Use SVE.
				define i8 @andv_v8i8(<8 x i8> %a) #0 {
				; CHECK-LABEL: andv_v8i8:
				; CHECK: ptrue [[PG:p[0-9]+]].b, vl8
				; CHECK: andv b[[REDUCE:[0-9]+]], [[PG]], z0.b
				; CHECK: fmov w0, s[[REDUCE]]
				; CHECK: ret
				%res = call i8 @llvm.experimental.vector.reduce.and.v8i8(<8 x i8> %a)
				ret i8 %res
				}

				; No single instruction NEON ANDV support. Use SVE.
				define i8 @andv_v16i8(<16 x i8> %a) #0 {
				; CHECK-LABEL: andv_v16i8:
				; CHECK: ptrue [[PG:p[0-9]+]].b, vl16
				; CHECK: andv b[[REDUCE:[0-9]+]], [[PG]], z0.b
				; CHECK: fmov w0, s[[REDUCE]]
				; CHECK: ret
				%res = call i8 @llvm.experimental.vector.reduce.and.v16i8(<16 x i8> %a)
				ret i8 %res
				}

				define i8 @andv_v32i8(<32 x i8>* %a) #0 {
				; CHECK-LABEL: andv_v32i8:
				; VBITS_GE_256: ptrue [[PG:p[0-9]+]].b, vl32
				; VBITS_GE_256-NEXT: ld1b { [[OP:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: andv b[[REDUCE:[0-9]+]], [[PG]], [[OP]].b
				; VBITS_GE_256-NEXT: fmov w0, s[[REDUCE]]
				; VBITS_GE_256-NEXT: ret
				%op = load <32 x i8>, <32 x i8>* %a
				%res = call i8 @llvm.experimental.vector.reduce.and.v32i8(<32 x i8> %op)
				ret i8 %res
				}

				define i8 @andv_v64i8(<64 x i8>* %a) #0 {
				; CHECK-LABEL: andv_v64i8:
				; VBITS_GE_512: ptrue [[PG:p[0-9]+]].b, vl64
				; VBITS_GE_512-NEXT: ld1b { [[OP:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: andv b[[REDUCE:[0-9]+]], [[PG]], [[OP]].b
				; VBITS_GE_512-NEXT: fmov w0, s[[REDUCE]]
				; VBITS_GE_512-NEXT: ret

				; Ensure sensible type legalisation.
				; VBITS_EQ_256-DAG: ptrue [[PG:p[0-9]+]].b, vl32
				; VBITS_EQ_256-DAG: mov w[[A_HI:[0-9]+]], #32
				; VBITS_EQ_256-DAG: ld1b { [[LO:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_EQ_256-DAG: ld1b { [[HI:z[0-9]+]].b }, [[PG]]/z, [x0, x[[A_HI]]]
				; VBITS_EQ_256-DAG: and [[AND:z[0-9]+]].d, [[LO]].d, [[HI]].d
				; VBITS_EQ_256-DAG: andv b[[REDUCE:[0-9]+]], [[PG]], [[AND]].b
				; VBITS_EQ_256-NEXT: fmov w0, s[[REDUCE]]
				; VBITS_EQ_256-NEXT: ret

				%op = load <64 x i8>, <64 x i8>* %a
				%res = call i8 @llvm.experimental.vector.reduce.and.v64i8(<64 x i8> %op)
				ret i8 %res
				}

				define i8 @andv_v128i8(<128 x i8>* %a) #0 {
				; CHECK-LABEL: andv_v128i8:
				; VBITS_GE_1024: ptrue [[PG:p[0-9]+]].b, vl128
				; VBITS_GE_1024-NEXT: ld1b { [[OP:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: andv b[[REDUCE:[0-9]+]], [[PG]], [[OP]].b
				; VBITS_GE_1024-NEXT: fmov w0, s[[REDUCE]]
				; VBITS_GE_1024-NEXT: ret
				%op = load <128 x i8>, <128 x i8>* %a
				%res = call i8 @llvm.experimental.vector.reduce.and.v128i8(<128 x i8> %op)
				ret i8 %res
				}

				define i8 @andv_v256i8(<256 x i8>* %a) #0 {
				; CHECK-LABEL: andv_v256i8:
				; VBITS_GE_2048: ptrue [[PG:p[0-9]+]].b, vl256
				; VBITS_GE_2048-NEXT: ld1b { [[OP:z[0-9]+]].b }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: andv b[[REDUCE:[0-9]+]], [[PG]], [[OP]].b
				; VBITS_GE_2048-NEXT: fmov w0, s[[REDUCE]]
				; VBITS_GE_2048-NEXT: ret
				%op = load <256 x i8>, <256 x i8>* %a
				%res = call i8 @llvm.experimental.vector.reduce.and.v256i8(<256 x i8> %op)
				ret i8 %res
				}

				; No single instruction NEON ANDV support. Use SVE.
				define i16 @andv_v4i16(<4 x i16> %a) #0 {
				; CHECK-LABEL: andv_v4i16:
				; CHECK: ptrue [[PG:p[0-9]+]].h, vl4
				; CHECK: andv h[[REDUCE:[0-9]+]], [[PG]], z0.h
				; CHECK: fmov w0, s[[REDUCE]]
				; CHECK: ret
				%res = call i16 @llvm.experimental.vector.reduce.and.v4i16(<4 x i16> %a)
				ret i16 %res
				}

				; No single instruction NEON ANDV support. Use SVE.
				define i16 @andv_v8i16(<8 x i16> %a) #0 {
				; CHECK-LABEL: andv_v8i16:
				; CHECK: ptrue [[PG:p[0-9]+]].h, vl8
				; CHECK: andv h[[REDUCE:[0-9]+]], [[PG]], z0.h
				; CHECK: fmov w0, s[[REDUCE]]
				; CHECK: ret
				%res = call i16 @llvm.experimental.vector.reduce.and.v8i16(<8 x i16> %a)
				ret i16 %res
				}

				define i16 @andv_v16i16(<16 x i16>* %a) #0 {
				; CHECK-LABEL: andv_v16i16:
				; VBITS_GE_256: ptrue [[PG:p[0-9]+]].h, vl16
				; VBITS_GE_256-NEXT: ld1h { [[OP:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: andv h[[REDUCE:[0-9]+]], [[PG]], [[OP]].h
				; VBITS_GE_256-NEXT: fmov w0, s[[REDUCE]]
				; VBITS_GE_256-NEXT: ret
				%op = load <16 x i16>, <16 x i16>* %a
				%res = call i16 @llvm.experimental.vector.reduce.and.v16i16(<16 x i16> %op)
				ret i16 %res
				}

				define i16 @andv_v32i16(<32 x i16>* %a) #0 {
				; CHECK-LABEL: andv_v32i16:
				; VBITS_GE_512: ptrue [[PG:p[0-9]+]].h, vl32
				; VBITS_GE_512-NEXT: ld1h { [[OP:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: andv h[[REDUCE:[0-9]+]], [[PG]], [[OP]].h
				; VBITS_GE_512-NEXT: fmov w0, s[[REDUCE]]
				; VBITS_GE_512-NEXT: ret

				; Ensure sensible type legalisation.
				; VBITS_EQ_256-DAG: ptrue [[PG:p[0-9]+]].h, vl16
				; VBITS_EQ_256-DAG: add x[[A_HI:[0-9]+]], x0, #32
				; VBITS_EQ_256-DAG: ld1h { [[LO:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_EQ_256-DAG: ld1h { [[HI:z[0-9]+]].h }, [[PG]]/z, [x[[A_HI]]]
				; VBITS_EQ_256-DAG: and [[AND:z[0-9]+]].d, [[LO]].d, [[HI]].d
				; VBITS_EQ_256-DAG: andv h[[REDUCE:[0-9]+]], [[PG]], [[AND]].h
				; VBITS_EQ_256-NEXT: fmov w0, s[[REDUCE]]
				; VBITS_EQ_256-NEXT: ret
				%op = load <32 x i16>, <32 x i16>* %a
				%res = call i16 @llvm.experimental.vector.reduce.and.v32i16(<32 x i16> %op)
				ret i16 %res
				}

				define i16 @andv_v64i16(<64 x i16>* %a) #0 {
				; CHECK-LABEL: andv_v64i16:
				; VBITS_GE_1024: ptrue [[PG:p[0-9]+]].h, vl64
				; VBITS_GE_1024-NEXT: ld1h { [[OP:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: andv h[[REDUCE:[0-9]+]], [[PG]], [[OP]].h
				; VBITS_GE_1024-NEXT: fmov w0, s[[REDUCE]]
				; VBITS_GE_1024-NEXT: ret
				%op = load <64 x i16>, <64 x i16>* %a
				%res = call i16 @llvm.experimental.vector.reduce.and.v64i16(<64 x i16> %op)
				ret i16 %res
				}

				define i16 @andv_v128i16(<128 x i16>* %a) #0 {
				; CHECK-LABEL: andv_v128i16:
				; VBITS_GE_2048: ptrue [[PG:p[0-9]+]].h, vl128
				; VBITS_GE_2048-NEXT: ld1h { [[OP:z[0-9]+]].h }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: andv h[[REDUCE:[0-9]+]], [[PG]], [[OP]].h
				; VBITS_GE_2048-NEXT: fmov w0, s[[REDUCE]]
				; VBITS_GE_2048-NEXT: ret
				%op = load <128 x i16>, <128 x i16>* %a
				%res = call i16 @llvm.experimental.vector.reduce.and.v128i16(<128 x i16> %op)
				ret i16 %res
				}

				; No single instruction NEON ANDV support. Use SVE.
				define i32 @andv_v2i32(<2 x i32> %a) #0 {
				; CHECK-LABEL: andv_v2i32:
				; CHECK: ptrue [[PG:p[0-9]+]].s, vl2
				; CHECK: andv [[REDUCE:s[0-9]+]], [[PG]], z0.s
				; CHECK: fmov w0, [[REDUCE]]
				; CHECK: ret
				%res = call i32 @llvm.experimental.vector.reduce.and.v2i32(<2 x i32> %a)
				ret i32 %res
				}

				; No single instruction NEON ANDV support. Use SVE.
				define i32 @andv_v4i32(<4 x i32> %a) #0 {
				; CHECK-LABEL: andv_v4i32:
				; CHECK: ptrue [[PG:p[0-9]+]].s, vl4
				; CHECK: andv [[REDUCE:s[0-9]+]], [[PG]], z0.s
				; CHECK: fmov w0, [[REDUCE]]
				; CHECK: ret
				%res = call i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> %a)
				ret i32 %res
				}

				define i32 @andv_v8i32(<8 x i32>* %a) #0 {
				; CHECK-LABEL: andv_v8i32:
				; VBITS_GE_256: ptrue [[PG:p[0-9]+]].s, vl8
				; VBITS_GE_256-NEXT: ld1w { [[OP:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: andv [[REDUCE:s[0-9]+]], [[PG]], [[OP]].s
				; VBITS_GE_256-NEXT: fmov w0, [[REDUCE]]
				; VBITS_GE_256-NEXT: ret
				%op = load <8 x i32>, <8 x i32>* %a
				%res = call i32 @llvm.experimental.vector.reduce.and.v8i32(<8 x i32> %op)
				ret i32 %res
				}

				define i32 @andv_v16i32(<16 x i32>* %a) #0 {
				; CHECK-LABEL: andv_v16i32:
				; VBITS_GE_512: ptrue [[PG:p[0-9]+]].s, vl16
				; VBITS_GE_512-NEXT: ld1w { [[OP:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: andv [[REDUCE:s[0-9]+]], [[PG]], [[OP]].s
				; VBITS_GE_512-NEXT: fmov w0, [[REDUCE]]
				; VBITS_GE_512-NEXT: ret

				; Ensure sensible type legalisation.
				; VBITS_EQ_256-DAG: ptrue [[PG:p[0-9]+]].s, vl8
				; VBITS_EQ_256-DAG: add x[[A_HI:[0-9]+]], x0, #32
				; VBITS_EQ_256-DAG: ld1w { [[LO:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_EQ_256-DAG: ld1w { [[HI:z[0-9]+]].s }, [[PG]]/z, [x[[A_HI]]]
				; VBITS_EQ_256-DAG: and [[AND:z[0-9]+]].d, [[LO]].d, [[HI]].d
				; VBITS_EQ_256-DAG: andv [[REDUCE:s[0-9]+]], [[PG]], [[AND]].s
				; VBITS_EQ_256-NEXT: fmov w0, [[REDUCE]]
				; VBITS_EQ_256-NEXT: ret
				%op = load <16 x i32>, <16 x i32>* %a
				%res = call i32 @llvm.experimental.vector.reduce.and.v16i32(<16 x i32> %op)
				ret i32 %res
				}

				define i32 @andv_v32i32(<32 x i32>* %a) #0 {
				; CHECK-LABEL: andv_v32i32:
				; VBITS_GE_1024: ptrue [[PG:p[0-9]+]].s, vl32
				; VBITS_GE_1024-NEXT: ld1w { [[OP:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: andv [[REDUCE:s[0-9]+]], [[PG]], [[OP]].s
				; VBITS_GE_1024-NEXT: fmov w0, [[REDUCE]]
				; VBITS_GE_1024-NEXT: ret
				%op = load <32 x i32>, <32 x i32>* %a
				%res = call i32 @llvm.experimental.vector.reduce.and.v32i32(<32 x i32> %op)
				ret i32 %res
				}

				define i32 @andv_v64i32(<64 x i32>* %a) #0 {
				; CHECK-LABEL: andv_v64i32:
				; VBITS_GE_2048: ptrue [[PG:p[0-9]+]].s, vl64
				; VBITS_GE_2048-NEXT: ld1w { [[OP:z[0-9]+]].s }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: andv [[REDUCE:s[0-9]+]], [[PG]], [[OP]].s
				; VBITS_GE_2048-NEXT: fmov w0, [[REDUCE]]
				; VBITS_GE_2048-NEXT: ret
				%op = load <64 x i32>, <64 x i32>* %a
				%res = call i32 @llvm.experimental.vector.reduce.and.v64i32(<64 x i32> %op)
				ret i32 %res
				}

				; Nothing to do for single element vectors.
				define i64 @andv_v1i64(<1 x i64> %a) #0 {
				; CHECK-LABEL: andv_v1i64:
				; CHECK: fmov x0, d0
				; CHECK: ret
				%res = call i64 @llvm.experimental.vector.reduce.and.v1i64(<1 x i64> %a)
				ret i64 %res
				}

				; Use SVE for 128-bit vectors
				define i64 @andv_v2i64(<2 x i64> %a) #0 {
				; CHECK-LABEL: andv_v2i64:
				; CHECK: ptrue [[PG:p[0-9]+]].d, vl2
				; CHECK: andv [[REDUCE:d[0-9]+]], [[PG]], z0.d
				; CHECK: fmov x0, [[REDUCE]]
				; CHECK: ret
				%res = call i64 @llvm.experimental.vector.reduce.and.v2i64(<2 x i64> %a)
				ret i64 %res
				}

				define i64 @andv_v4i64(<4 x i64>* %a) #0 {
				; CHECK-LABEL: andv_v4i64:
				; VBITS_GE_256: ptrue [[PG:p[0-9]+]].d, vl4
				; VBITS_GE_256-NEXT: ld1d { [[OP:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_256-NEXT: andv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].d
				; VBITS_GE_256-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_256-NEXT: ret
				%op = load <4 x i64>, <4 x i64>* %a
				%res = call i64 @llvm.experimental.vector.reduce.and.v4i64(<4 x i64> %op)
				ret i64 %res
				}

				define i64 @andv_v8i64(<8 x i64>* %a) #0 {
				; CHECK-LABEL: andv_v8i64:
				; VBITS_GE_512: ptrue [[PG:p[0-9]+]].d, vl8
				; VBITS_GE_512-NEXT: ld1d { [[OP:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_512-NEXT: andv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].d
				; VBITS_GE_512-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_512-NEXT: ret

				; Ensure sensible type legalisation.
				; VBITS_EQ_256-DAG: ptrue [[PG:p[0-9]+]].d, vl4
				; VBITS_EQ_256-DAG: add x[[A_HI:[0-9]+]], x0, #32
				; VBITS_EQ_256-DAG: ld1d { [[LO:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_EQ_256-DAG: ld1d { [[HI:z[0-9]+]].d }, [[PG]]/z, [x[[A_HI]]]
				; VBITS_EQ_256-DAG: and [[AND:z[0-9]+]].d, [[LO]].d, [[HI]].d
				; VBITS_EQ_256-DAG: andv [[REDUCE:d[0-9]+]], [[PG]], [[AND]].d
				; VBITS_EQ_256-NEXT: fmov x0, [[REDUCE]]
				; VBITS_EQ_256-NEXT: ret
				%op = load <8 x i64>, <8 x i64>* %a
				%res = call i64 @llvm.experimental.vector.reduce.and.v8i64(<8 x i64> %op)
				ret i64 %res
				}

				define i64 @andv_v16i64(<16 x i64>* %a) #0 {
				; CHECK-LABEL: andv_v16i64:
				; VBITS_GE_1024: ptrue [[PG:p[0-9]+]].d, vl16
				; VBITS_GE_1024-NEXT: ld1d { [[OP:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_1024-NEXT: andv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].d
				; VBITS_GE_1024-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_1024-NEXT: ret
				%op = load <16 x i64>, <16 x i64>* %a
				%res = call i64 @llvm.experimental.vector.reduce.and.v16i64(<16 x i64> %op)
				ret i64 %res
				}

				define i64 @andv_v32i64(<32 x i64>* %a) #0 {
				; CHECK-LABEL: andv_v32i64:
				; VBITS_GE_2048: ptrue [[PG:p[0-9]+]].d, vl32
				; VBITS_GE_2048-NEXT: ld1d { [[OP:z[0-9]+]].d }, [[PG]]/z, [x0]
				; VBITS_GE_2048-NEXT: andv [[REDUCE:d[0-9]+]], [[PG]], [[OP]].d
				; VBITS_GE_2048-NEXT: fmov x0, [[REDUCE]]
				; VBITS_GE_2048-NEXT: ret
				%op = load <32 x i64>, <32 x i64>* %a
				%res = call i64 @llvm.experimental.vector.reduce.and.v32i64(<32 x i64> %op)
				ret i64 %res
				}

				attributes #0 = { "target-features"="+sve" }

				declare i8 @llvm.experimental.vector.reduce.and.v8i8(<8 x i8>)
				declare i8 @llvm.experimental.vector.reduce.and.v16i8(<16 x i8>)
				declare i8 @llvm.experimental.vector.reduce.and.v32i8(<32 x i8>)
				declare i8 @llvm.experimental.vector.reduce.and.v64i8(<64 x i8>)
				declare i8 @llvm.experimental.vector.reduce.and.v128i8(<128 x i8>)
				declare i8 @llvm.experimental.vector.reduce.and.v256i8(<256 x i8>)

				declare i16 @llvm.experimental.vector.reduce.and.v4i16(<4 x i16>)
				declare i16 @llvm.experimental.vector.reduce.and.v8i16(<8 x i16>)
				declare i16 @llvm.experimental.vector.reduce.and.v16i16(<16 x i16>)
				declare i16 @llvm.experimental.vector.reduce.and.v32i16(<32 x i16>)
				declare i16 @llvm.experimental.vector.reduce.and.v64i16(<64 x i16>)
				declare i16 @llvm.experimental.vector.reduce.and.v128i16(<128 x i16>)

				declare i32 @llvm.experimental.vector.reduce.and.v2i32(<2 x i32>)
				declare i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32>)
				declare i32 @llvm.experimental.vector.reduce.and.v8i32(<8 x i32>)
				declare i32 @llvm.experimental.vector.reduce.and.v16i32(<16 x i32>)
				declare i32 @llvm.experimental.vector.reduce.and.v32i32(<32 x i32>)
				declare i32 @llvm.experimental.vector.reduce.and.v64i32(<64 x i32>)

				declare i64 @llvm.experimental.vector.reduce.and.v1i64(<1 x i64>)
				declare i64 @llvm.experimental.vector.reduce.and.v2i64(<2 x i64>)
				declare i64 @llvm.experimental.vector.reduce.and.v4i64(<4 x i64>)
				declare i64 @llvm.experimental.vector.reduce.and.v8i64(<8 x i64>)
				declare i64 @llvm.experimental.vector.reduce.and.v16i64(<16 x i64>)
				declare i64 @llvm.experimental.vector.reduce.and.v32i64(<32 x i64>)