This is an archive of the discontinued LLVM Phabricator instance.

Differential D85737

[ARM][MVE] tail-predication: overflow checks for backedge taken count
Closed, Public

Authored by SjoerdMeijer on Aug 11 2020, 8:19 AM.

Details

Reviewers

samparker
efriedma
dmgreen

Commits

rG6716e7868ec3: [ARM][MVE] tail-predication: overflow checks for backedge taken count.

Summary

This picks up the work on the overflow checks for get.active.lane.mask, which ensure that it is safe to insert the VCTP intrinsic that enables tail-predication. For a 2d auto-correlation kernel and its inner loop j:

M = Size - i;
for (j = 0; j < M; j++)
  Sum += Input[j] * Input[j+i];

For this inner loop, the SCEV backedge taken count (BTC) expression is:

{(-1 + (sext i16 %Size to i32)),+,-1}<nw><%for.body>

and cannotBeMaxInLoop from LoopUtils could not calculate a bound on this, so "BTC cannot be max" could not be established. Overflow therefore had to be assumed in the loop trip count expression that uses the BTC, and tail-predication had to be forced (with an option) for this case.

This change solves that by using ScalarEvolution's helper getConstantMaxBackedgeTakenCount, which can determine the range of the BTC and thus that the transformation is safe, so we no longer need to force tail-predication, as reflected in the changed test cases.
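As a rough illustration of why "BTC cannot be max" matters, the following standalone sketch models the two predicates in plain C++. It is not LLVM code: `activeLaneMask`, `vctp`, `elementsRemaining` and the fixed width `kLanes` are hypothetical names, the lane mask follows the ULE semantics quoted in this revision, and the element count is assumed to be formed as BTC + 1 in wrapping 32-bit arithmetic.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned kLanes = 4; // assumed vector width (e.g. 4 x i32)

// Model of get.active.lane.mask(IV, BTC) as described in this revision:
// lane i is active iff (IV + i) <= BTC, compared as unsigned (ULE).
std::array<bool, kLanes> activeLaneMask(uint32_t IV, uint32_t BTC) {
  // This simplified model does not handle IV + i wrapping around.
  assert(IV <= UINT32_MAX - (kLanes - 1) && "IV + lane must not wrap");
  std::array<bool, kLanes> Mask{};
  for (unsigned i = 0; i < kLanes; ++i)
    Mask[i] = (IV + i) <= BTC;
  return Mask;
}

// Model of VCTP: lane i is active iff i < ElementsRemaining.
std::array<bool, kLanes> vctp(uint32_t ElementsRemaining) {
  std::array<bool, kLanes> Mask{};
  for (unsigned i = 0; i < kLanes; ++i)
    Mask[i] = i < ElementsRemaining;
  return Mask;
}

// Element count a VCTP would need at induction value IV, assuming it is
// derived from the trip count BTC + 1. The addition wraps to 0 when
// BTC == UINT32_MAX, which is the overflow the checks guard against.
uint32_t elementsRemaining(uint32_t IV, uint32_t BTC) {
  return (BTC + 1u) - IV;
}
```

Under these assumptions, with BTC = 9 and IV = 8 both models enable only the first two lanes. With BTC = UINT32_MAX and IV = 0 the lane mask is all-true, while BTC + 1 wraps to 0 and the VCTP model is all-false, so the two forms are no longer interchangeable.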

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

SjoerdMeijer created this revision. Aug 11 2020, 8:19 AM

Herald added a project: Restricted Project. Aug 11 2020, 8:19 AM

Herald added subscribers: danielkiss, javed.absar, hiraditya, kristof.beyls.

SjoerdMeijer requested review of this revision. Aug 11 2020, 8:19 AM

It's nice that we have some utility that computes what you need, even if it's not the obvious one.

llvm/lib/Target/ARM/MVETailPredication.cpp
374	I'm forgetting, do we have a check somewhere that BackedgeTakenCount is actually the backedge-taken count of the loop?
383	cast<>.

This revision is now accepted and ready to land. Aug 11 2020, 1:53 PM

Thanks for reviewing!

It's nice that we have some utility that computes what you need, even if it's not the obvious one.

Yes, that's why I was not entirely happy. Think I have seen a few non-obvious things in the scev helpers, and a bug too hidden by other things, but one step at a time here...

llvm/lib/Target/ARM/MVETailPredication.cpp
374	Good point. I thought we had one, but we have similar checks for the IV, which is done in step 3) below on line 464. Will address this in a follow up.

Closed by commit rG6716e7868ec3: [ARM][MVE] tail-predication: overflow checks for backedge taken count. (authored by SjoerdMeijer). Aug 12 2020, 1:36 AM

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer added a commit: rG6716e7868ec3: [ARM][MVE] tail-predication: overflow checks for backedge taken count.

efriedma added inline comments. Aug 17 2020, 1:21 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
374	I think I messed up reviewing this. Took me a bit of time to remember what's going on here. `SE->getConstantMaxBackedgeTakenCount()` is the backedge-taken count of the loop. BTC is the number of elements the loop processes, minus one. If you want to ensure BTC + 1 doesn't overflow, getConstantMaxBackedgeTakenCount() doesn't actually help, unless you prove some connection between the two values. The code currently in IsSafeActiveMask does not try to prove that connection.
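To illustrate the distinction drawn in the comment above, here is a small standalone sketch of the two quantities. The helper names are hypothetical, and it assumes a vector loop that processes `VectorWidth` elements per iteration, so that its own backedge-taken count is the number of vector iterations minus one.

```cpp
#include <cassert>
#include <cstdint>

// Value used as the second operand of get.active.lane.mask in this
// revision: the number of elements processed, minus one.
uint32_t elementBTC(uint32_t NumElements) {
  assert(NumElements > 0 && "sketch assumes a non-empty loop");
  return NumElements - 1;
}

// Backedge-taken count of the vector loop itself (assumed model):
// ceil(NumElements / VectorWidth) vector iterations, minus one.
uint32_t vectorLoopBTC(uint32_t NumElements, uint32_t VectorWidth) {
  assert(NumElements > 0 && VectorWidth > 0);
  uint32_t Iterations = NumElements / VectorWidth +
                        (NumElements % VectorWidth != 0 ? 1u : 0u);
  return Iterations - 1;
}
```

For 10 elements and a vector width of 4, the element-based value is 9 while the vector loop takes its backedge twice. In this model the loop's count stays far below the 32-bit maximum even when the element-based value is close to it, which is why a bound on one does not by itself say anything about BTC + 1 for the other.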

efriedma mentioned this in D86074: [ARM][MVE] Tail-predication: check get.active.lane.mask's TC value. Aug 17 2020, 1:24 PM

efriedma added inline comments. Aug 17 2020, 1:38 PM

llvm/lib/Target/ARM/MVETailPredication.cpp
374	On a higher-level note, I think this work has shown that trying to force this to work is really complicated, and it might make sense to change the vectorizer to generate something that's easier to analyze.

SjoerdMeijer added inline comments. Aug 18 2020, 2:38 AM

llvm/lib/Target/ARM/MVETailPredication.cpp
374	Ah, sorry about this. Don't know what I was thinking...somehow had myself convinced, but it's obviously not really what we need. I fully agree about your high-level remark. This whole exercise has shown the limitations of our current approach, i.e. the complex analysis required, which we can hopefully avoid by changing the intrinsic/semantics. So, I am going to pursue that direction, and will soon propose something for this.

Revision Contents

Path: Size
llvm/lib/Target/ARM/MVETailPredication.cpp: 24 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-reduce.ll: 33 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/varying-outer-2d-reduction.ll: 301 lines

Diff 285005

llvm/lib/Target/ARM/MVETailPredication.cpp

[356 lines not shown]
// (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount		// (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount
// 3) The IV must be an induction phi with an increment equal to the		// 3) The IV must be an induction phi with an increment equal to the
// vector width.		// vector width.
bool MVETailPredication::IsSafeActiveMask(IntrinsicInst *ActiveLaneMask,		bool MVETailPredication::IsSafeActiveMask(IntrinsicInst *ActiveLaneMask,
Value *TripCount, FixedVectorType *VecTy) {		Value *TripCount, FixedVectorType *VecTy) {
bool ForceTailPredication =		bool ForceTailPredication =
EnableTailPredication == TailPredication::ForceEnabledNoReductions ||		EnableTailPredication == TailPredication::ForceEnabledNoReductions ||
EnableTailPredication == TailPredication::ForceEnabled;		EnableTailPredication == TailPredication::ForceEnabled;

// 1) Test whether entry to the loop is protected by a conditional		// 1) Test whether entry to the loop is protected by a conditional
// BTC + 1 < 0. In other words, if the scalar trip count overflows,		// BTC + 1 < 0. In other words, if the scalar trip count overflows,
// becomes negative, we shouldn't enter the loop and creating		// becomes negative, we shouldn't enter the loop and creating
// tripcount expression BTC + 1 is not safe. So, check that BTC		// tripcount expression BTC + 1 is not safe. So, check that BTC
// isn't max. This is evaluated in unsigned, because the semantics		// isn't max. This is evaluated in unsigned, because the semantics
// of @get.active.lane.mask is a ULE comparison.		// of @get.active.lane.mask is a ULE comparison.

int VectorWidth = VecTy->getNumElements();
auto *BackedgeTakenCount = ActiveLaneMask->getOperand(1);		auto *BackedgeTakenCount = ActiveLaneMask->getOperand(1);
auto *BTC = SE->getSCEV(BackedgeTakenCount);		auto *BTC = SE->getSCEV(BackedgeTakenCount);
		auto *MaxBTC = SE->getConstantMaxBackedgeTakenCount(L);

		if (isa<SCEVCouldNotCompute>(MaxBTC)) {
		LLVM_DEBUG(dbgs() << "ARM TP: Can't compute SCEV BTC expression: ";
		BTC->dump());
		return false;
		}

if (!llvm::cannotBeMaxInLoop(BTC, L, SE, false /*Signed*/) &&		APInt MaxInt = APInt(BTC->getType()->getScalarSizeInBits(), ~0);
		if (cast<SCEVConstant>(MaxBTC)->getAPInt().eq(MaxInt) &&
!ForceTailPredication) {		!ForceTailPredication) {
LLVM_DEBUG(dbgs() << "ARM TP: Overflow possible, BTC can be max: ";		LLVM_DEBUG(dbgs() << "ARM TP: Overflow possible, BTC can be int max: ";
BTC->dump());		BTC->dump());
return false;		return false;
}		}

// 2) Prove that the sub expression is non-negative, i.e. it doesn't overflow:		// 2) Prove that the sub expression is non-negative, i.e. it doesn't overflow:
//		//
// (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount		// (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount
//		//
// 2.1) First prove overflow can't happen in:		// 2.1) First prove overflow can't happen in:
//		//
// ElementCount + (VectorWidth - 1)		// ElementCount + (VectorWidth - 1)
//		//
// Because of a lack of context, it is difficult to get a useful bounds on		// Because of a lack of context, it is difficult to get a useful bounds on
// this expression. But since ElementCount uses the same variables as the		// this expression. But since ElementCount uses the same variables as the
// TripCount (TC), for which we can find meaningful value ranges, we use that		// TripCount (TC), for which we can find meaningful value ranges, we use that
// instead and assert that:		// instead and assert that:
//		//
// upperbound(TC) <= UINT_MAX - VectorWidth		// upperbound(TC) <= UINT_MAX - VectorWidth
//		//
auto *TC = SE->getSCEV(TripCount);		auto *TC = SE->getSCEV(TripCount);
unsigned SizeInBits = TripCount->getType()->getScalarSizeInBits();		unsigned SizeInBits = TripCount->getType()->getScalarSizeInBits();
		int VectorWidth = VecTy->getNumElements();
auto Diff = APInt(SizeInBits, ~0) - APInt(SizeInBits, VectorWidth);		auto Diff = APInt(SizeInBits, ~0) - APInt(SizeInBits, VectorWidth);
uint64_t MaxMinusVW = Diff.getZExtValue();		uint64_t MaxMinusVW = Diff.getZExtValue();
uint64_t UpperboundTC = SE->getSignedRange(TC).getUpper().getZExtValue();		uint64_t UpperboundTC = SE->getSignedRange(TC).getUpper().getZExtValue();

if (UpperboundTC > MaxMinusVW && !ForceTailPredication) {		if (UpperboundTC > MaxMinusVW && !ForceTailPredication) {
LLVM_DEBUG(dbgs() << "ARM TP: Overflow possible in tripcount rounding:\n";		LLVM_DEBUG(dbgs() << "ARM TP: Overflow possible in tripcount rounding:\n";
dbgs() << "upperbound(TC) <= UINT_MAX - VectorWidth\n";		dbgs() << "upperbound(TC) <= UINT_MAX - VectorWidth\n";
dbgs() << UpperboundTC << " <= " << MaxMinusVW << "== false\n";);		dbgs() << UpperboundTC << " <= " << MaxMinusVW << " == false\n";);
return false;		return false;
}		}

// 2.2) Make sure overflow doesn't happen in final expression:		// 2.2) Make sure overflow doesn't happen in final expression:
// (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount,		// (((ElementCount + (VectorWidth - 1)) / VectorWidth) - TripCount,
// To do this, compare the full ranges of these subexpressions:		// To do this, compare the full ranges of these subexpressions:
//		//
// Range(Ceil) <= Range(TC)		// Range(Ceil) <= Range(TC)
[32 lines not shown; context: auto ZeroRange =]
ConstantRange(APInt(TripCount->getType()->getScalarSizeInBits(), 0));		ConstantRange(APInt(TripCount->getType()->getScalarSizeInBits(), 0));
RangeTC = RangeTC.unionWith(ZeroRange);		RangeTC = RangeTC.unionWith(ZeroRange);
}		}
if (!RangeTC.contains(RangeCeil) && !ForceTailPredication) {		if (!RangeTC.contains(RangeCeil) && !ForceTailPredication) {
LLVM_DEBUG(dbgs() << "ARM TP: Overflow possible in sub\n");		LLVM_DEBUG(dbgs() << "ARM TP: Overflow possible in sub\n");
return false;		return false;
}		}

// 3) Find out if IV is an induction phi. Note that We can't use Loop		// 3) Find out if IV is an induction phi. Note that we can't use Loop
// helpers here to get the induction variable, because the hardware loop is		// helpers here to get the induction variable, because the hardware loop is
// no longer in loopsimplify form, and also the hwloop intrinsic use a		// no longer in loopsimplify form, and also the hwloop intrinsic uses a
// different counter. Using SCEV, we check that the induction is of the		// different counter. Using SCEV, we check that the induction is of the
// form i = i + 4, where the increment must be equal to the VectorWidth.		// form i = i + 4, where the increment must be equal to the VectorWidth.
auto *IV = ActiveLaneMask->getOperand(0);		auto *IV = ActiveLaneMask->getOperand(0);
auto *IVExpr = SE->getSCEV(IV);		auto *IVExpr = SE->getSCEV(IV);
auto *AddExpr = dyn_cast<SCEVAddRecExpr>(IVExpr);		auto *AddExpr = dyn_cast<SCEVAddRecExpr>(IVExpr);
if (!AddExpr) {		if (!AddExpr) {
LLVM_DEBUG(dbgs() << "ARM TP: induction not an add expr: "; IVExpr->dump());		LLVM_DEBUG(dbgs() << "ARM TP: induction not an add expr: "; IVExpr->dump());
return false;		return false;
}		}
[152 lines not shown]
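Check 2.1 in the hunk above asserts upperbound(TC) <= UINT_MAX - VectorWidth so that rounding the trip count up to a multiple of the vector width cannot wrap. A minimal standalone sketch of that arithmetic follows, using plain `uint32_t` instead of `APInt` and SCEV ranges; the function names are hypothetical and the upper bound is simply passed in as a number.

```cpp
#include <cassert>
#include <cstdint>

// True when TC + (VectorWidth - 1) cannot wrap in 32 bits for any
// TC <= UpperBoundTC, mirroring the comparison against
// UINT_MAX - VectorWidth in the comment for step 2.1.
bool roundingCannotOverflow(uint32_t UpperBoundTC, uint32_t VectorWidth) {
  assert(VectorWidth > 0 && "vector width must be non-zero");
  uint32_t MaxMinusVW = UINT32_MAX - VectorWidth;
  return UpperBoundTC <= MaxMinusVW;
}

// The rounded-up division the check protects. The addition is done in
// wrapping unsigned arithmetic, so an unchecked TC can give a wrong result.
uint32_t ceilDiv(uint32_t TC, uint32_t VectorWidth) {
  assert(VectorWidth > 0 && "vector width must be non-zero");
  return (TC + (VectorWidth - 1)) / VectorWidth;
}
```

With a bound of 65535 and a width of 4 the check passes, and `ceilDiv(10, 4)` is 3 as expected. With a bound of `UINT32_MAX - 2` the check fails, and indeed `ceilDiv(UINT32_MAX - 2, 4)` wraps to 0 in this sketch.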

llvm/test/CodeGen/Thumb2/LowOverheadLoops/tail-reduce.ll

; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -tail-predication=enabled -mattr=+mve %s -S -o - | FileCheck %s		; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -tail-predication=enabled -mattr=+mve %s -S -o - | FileCheck %s
; RUN: opt -mtriple=thumbv8.1m.main -mve-tail-predication -tail-predication=force-enabled \
; RUN: -mattr=+mve %s -S -o - | FileCheck %s --check-prefix=FORCE

; CHECK-LABEL: reduction_i32		; CHECK-LABEL: reduction_i32
; CHECK: phi i32 [ 0, %vector.ph ]		; CHECK: phi i32 [ 0, %vector.ph ]
; CHECK: phi <8 x i16> [ zeroinitializer, %vector.ph ]		; CHECK: phi <8 x i16> [ zeroinitializer, %vector.ph ]
; CHECK: phi i32		; CHECK: phi i32
; CHECK: [[PHI:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[ELEMS:%[^ ]+]], %vector.body ]		; CHECK: [[PHI:%[^ ]+]] = phi i32 [ %N, %vector.ph ], [ [[ELEMS:%[^ ]+]], %vector.body ]
; CHECK: [[VCTP:%[^ ]+]] = call <8 x i1> @llvm.arm.mve.vctp16(i32 [[PHI]])		; CHECK: [[VCTP:%[^ ]+]] = call <8 x i1> @llvm.arm.mve.vctp16(i32 [[PHI]])
; CHECK: [[ELEMS]] = sub i32 [[PHI]], 8		; CHECK: [[ELEMS]] = sub i32 [[PHI]], 8
[119 lines not shown; context: middle.block: ; preds = %vector.body]
%tmp9 = extractelement <8 x i16> %bin.rdx8, i32 0		%tmp9 = extractelement <8 x i16> %bin.rdx8, i32 0
ret i16 %tmp9		ret i16 %tmp9

for.cond.cleanup:		for.cond.cleanup:
%res.0 = phi i16 [ 0, %entry ]		%res.0 = phi i16 [ 0, %entry ]
ret i16 %res.0		ret i16 %res.0
}		}

; The vector loop is not guarded with an entry check (N == 0).		; The vector loop is not guarded with an entry check (N == 0). Check that
; This means we can't calculate a precise range for the backedge count in		; despite this we can still calculate a precise enough range for the
; @llvm.get.active.lane.mask, and are assuming overflow can happen and thus		; backedge count to safely insert a vctp here.
; we can't insert the VCTP here.
;		;
; CHECK-LABEL: @reduction_not_guarded		; CHECK-LABEL: @reduction_not_guarded
;		;
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NOT: @llvm.arm.mve.vctp		; CHECK: @llvm.arm.mve.vctp
; CHECK: @llvm.get.active.lane.mask.v8i1.i32		; CHECK-NOT: @llvm.get.active.lane.mask.v8i1.i32
; CHECK: ret		; CHECK: ret
;		;
define i16 @reduction_not_guarded(i16* nocapture readonly %A, i16 %B, i32 %N) local_unnamed_addr {		define i16 @reduction_not_guarded(i16* nocapture readonly %A, i16 %B, i32 %N) local_unnamed_addr {
entry:		entry:
%tmp = add i32 %N, -1		%tmp = add i32 %N, -1
%n.rnd.up = add nuw nsw i32 %tmp, 8		%n.rnd.up = add nuw nsw i32 %tmp, 8
%n.vec = and i32 %n.rnd.up, -8		%n.vec = and i32 %n.rnd.up, -8
%broadcast.splatinsert1 = insertelement <8 x i32> undef, i32 %tmp, i32 0		%broadcast.splatinsert1 = insertelement <8 x i32> undef, i32 %tmp, i32 0
[34 lines not shown; context: middle.block: ; preds = %vector.body]
%rdx.shuf5 = shufflevector <8 x i16> %bin.rdx, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>		%rdx.shuf5 = shufflevector <8 x i16> %bin.rdx, <8 x i16> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%bin.rdx6 = add <8 x i16> %rdx.shuf5, %bin.rdx		%bin.rdx6 = add <8 x i16> %rdx.shuf5, %bin.rdx
%rdx.shuf7 = shufflevector <8 x i16> %bin.rdx6, <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>		%rdx.shuf7 = shufflevector <8 x i16> %bin.rdx6, <8 x i16> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%bin.rdx8 = add <8 x i16> %rdx.shuf7, %bin.rdx6		%bin.rdx8 = add <8 x i16> %rdx.shuf7, %bin.rdx6
%tmp9 = extractelement <8 x i16> %bin.rdx8, i32 0		%tmp9 = extractelement <8 x i16> %bin.rdx8, i32 0
ret i16 %tmp9		ret i16 %tmp9
}		}

; Without forcing tail-predication, we bail because overflow analysis says:
;
; overflow possible in: {(-1 + (sext i16 %Size to i32)),+,-1}<nw><%for.body>
;
; CHECK-LABEL: @Correlation		; CHECK-LABEL: @Correlation
;
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NOT: @llvm.arm.mve.vctp		; CHECK: @llvm.arm.mve.vctp
; CHECK: %active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %index, i32 %trip.count.minus.1)		; CHECK-NOT: %active.lane.mask = call <4 x i1> @llvm.get.active.lane.mask
;
; FORCE-LABEL: @Correlation
; FORCE: vector.ph: ; preds = %for.body
; FORCE: %trip.count.minus.1 = add i32 %{{.*}}, -1
; FORCE: call void @llvm.set.loop.iterations.i32(i32 %{{.*}})
; FORCE: br label %vector.body
; FORCE: vector.body: ; preds = %vector.body, %vector.ph
; FORCE: %[[VCTP:.*]] = call <4 x i1> @llvm.arm.mve.vctp32(i32 %{{.*}})
; FORCE: call <4 x i16> @llvm.masked.load.v4i16.p0v4i16({{.*}}, <4 x i1> %[[VCTP]]{{.*}}
;		;
define dso_local void @Correlation(i16* nocapture readonly %Input, i16* nocapture %Output, i16 signext %Size, i16 signext %N, i16 signext %Scale) local_unnamed_addr #0 {		define dso_local void @Correlation(i16* nocapture readonly %Input, i16* nocapture %Output, i16 signext %Size, i16 signext %N, i16 signext %Scale) local_unnamed_addr #0 {
entry:		entry:
%conv = sext i16 %N to i32		%conv = sext i16 %N to i32
%cmp36 = icmp sgt i16 %N, 0		%cmp36 = icmp sgt i16 %N, 0
br i1 %cmp36, label %for.body.lr.ph, label %for.end17		br i1 %cmp36, label %for.body.lr.ph, label %for.end17

for.body.lr.ph:		for.body.lr.ph:
[79 lines not shown]

llvm/test/CodeGen/Thumb2/LowOverheadLoops/varying-outer-2d-reduction.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	;			;
	; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -tail-predication=enabled %s -o - | \			; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -tail-predication=enabled %s -o - | \
	; RUN: FileCheck %s --check-prefix=ENABLED			; RUN: FileCheck %s --check-prefix=ENABLED
	;			;
				; Forcing tail-predication should not be necessary here, so we check the same
				; ENABLED label as the run above:
	; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -tail-predication=force-enabled %s -o - | \			; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -tail-predication=force-enabled %s -o - | \
	; RUN: FileCheck %s --check-prefix=FORCE			; RUN: FileCheck %s --check-prefix=ENABLED
	;			;
	; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -tail-predication=enabled-no-reductions %s -o - | \			; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -tail-predication=enabled-no-reductions %s -o - | \
	; RUN: FileCheck %s --check-prefix=NOREDUCTIONS			; RUN: FileCheck %s --check-prefix=NOREDUCTIONS
	;			;
	; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -tail-predication=force-enabled-no-reductions %s -o - | \			; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve -tail-predication=force-enabled-no-reductions %s -o - | \
	; RUN: FileCheck %s --check-prefix=FORCENOREDUCTIONS			; RUN: FileCheck %s --check-prefix=NOREDUCTIONS

	define dso_local void @varying_outer_2d_reduction(i16* nocapture readonly %Input, i16* nocapture %Output, i16 signext %Size, i16 signext %N, i16 signext %Scale) local_unnamed_addr {			define dso_local void @varying_outer_2d_reduction(i16* nocapture readonly %Input, i16* nocapture %Output, i16 signext %Size, i16 signext %N, i16 signext %Scale) local_unnamed_addr {
	; ENABLED-LABEL: varying_outer_2d_reduction:			; ENABLED-LABEL: varying_outer_2d_reduction:
	; ENABLED: @ %bb.0: @ %entry			; ENABLED: @ %bb.0: @ %entry
	; ENABLED-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}			; ENABLED-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, lr}
	; ENABLED-NEXT: sub sp, #8			; ENABLED-NEXT: sub sp, #4
	; ENABLED-NEXT: cmp r3, #1			; ENABLED-NEXT: cmp r3, #1
	; ENABLED-NEXT: str r0, [sp, #4] @ 4-byte Spill			; ENABLED-NEXT: str r0, [sp] @ 4-byte Spill
	; ENABLED-NEXT: blt .LBB0_8			; ENABLED-NEXT: blt .LBB0_8
	; ENABLED-NEXT: @ %bb.1: @ %for.body.lr.ph			; ENABLED-NEXT: @ %bb.1: @ %for.body.lr.ph
	; ENABLED-NEXT: ldr r0, [sp, #44]			; ENABLED-NEXT: ldr r0, [sp, #36]
	; ENABLED-NEXT: adr r7, .LCPI0_0			; ENABLED-NEXT: add.w r12, r2, #3
	; ENABLED-NEXT: ldr.w r10, [sp, #4] @ 4-byte Reload			; ENABLED-NEXT: ldr.w r10, [sp] @ 4-byte Reload
	; ENABLED-NEXT: add.w r9, r2, #3			; ENABLED-NEXT: movs r6, #0
	; ENABLED-NEXT: vldrw.u32 q0, [r7]			; ENABLED-NEXT: mov r9, r12
	; ENABLED-NEXT: mov.w r11, #0
	; ENABLED-NEXT: uxth r0, r0			; ENABLED-NEXT: uxth r0, r0
	; ENABLED-NEXT: rsbs r5, r0, #0			; ENABLED-NEXT: rsbs r5, r0, #0
	; ENABLED-NEXT: str.w r9, [sp] @ 4-byte Spill
	; ENABLED-NEXT: b .LBB0_4			; ENABLED-NEXT: b .LBB0_4
	; ENABLED-NEXT: .LBB0_2: @ in Loop: Header=BB0_4 Depth=1			; ENABLED-NEXT: .LBB0_2: @ in Loop: Header=BB0_4 Depth=1
	; ENABLED-NEXT: movs r0, #0			; ENABLED-NEXT: movs r0, #0
	; ENABLED-NEXT: .LBB0_3: @ %for.end			; ENABLED-NEXT: .LBB0_3: @ %for.end
	; ENABLED-NEXT: @ in Loop: Header=BB0_4 Depth=1			; ENABLED-NEXT: @ in Loop: Header=BB0_4 Depth=1
	; ENABLED-NEXT: lsrs r0, r0, #16			; ENABLED-NEXT: lsrs r0, r0, #16
	; ENABLED-NEXT: sub.w r9, r9, #1			; ENABLED-NEXT: sub.w r9, r9, #1
	; ENABLED-NEXT: strh.w r0, [r1, r11, lsl #1]			; ENABLED-NEXT: strh.w r0, [r1, r6, lsl #1]
	; ENABLED-NEXT: add.w r11, r11, #1			; ENABLED-NEXT: adds r6, #1
	; ENABLED-NEXT: add.w r10, r10, #2			; ENABLED-NEXT: add.w r10, r10, #2
	; ENABLED-NEXT: cmp r11, r3			; ENABLED-NEXT: cmp r6, r3
	; ENABLED-NEXT: beq .LBB0_8			; ENABLED-NEXT: beq .LBB0_8
	; ENABLED-NEXT: .LBB0_4: @ %for.body			; ENABLED-NEXT: .LBB0_4: @ %for.body
	; ENABLED-NEXT: @ =>This Loop Header: Depth=1			; ENABLED-NEXT: @ =>This Loop Header: Depth=1
	; ENABLED-NEXT: @ Child Loop BB0_6 Depth 2			; ENABLED-NEXT: @ Child Loop BB0_6 Depth 2
	; ENABLED-NEXT: cmp r2, r11			; ENABLED-NEXT: cmp r2, r6
	; ENABLED-NEXT: ble .LBB0_2			; ENABLED-NEXT: ble .LBB0_2
	; ENABLED-NEXT: @ %bb.5: @ %vector.ph			; ENABLED-NEXT: @ %bb.5: @ %vector.ph
	; ENABLED-NEXT: @ in Loop: Header=BB0_4 Depth=1			; ENABLED-NEXT: @ in Loop: Header=BB0_4 Depth=1
	; ENABLED-NEXT: bic r7, r9, #3			; ENABLED-NEXT: bic r0, r9, #3
	; ENABLED-NEXT: movs r6, #1			; ENABLED-NEXT: movs r7, #1
	; ENABLED-NEXT: subs r7, #4			; ENABLED-NEXT: subs r0, #4
	; ENABLED-NEXT: sub.w r0, r2, r11			; ENABLED-NEXT: subs r4, r2, r6
	; ENABLED-NEXT: vmov.i32 q2, #0x0			; ENABLED-NEXT: vmov.i32 q0, #0x0
	; ENABLED-NEXT: add.w r8, r6, r7, lsr #2			; ENABLED-NEXT: add.w r8, r7, r0, lsr #2
	; ENABLED-NEXT: ldr r7, [sp] @ 4-byte Reload			; ENABLED-NEXT: mov r7, r10
	; ENABLED-NEXT: sub.w r4, r7, r11			; ENABLED-NEXT: dlstp.32 lr, r4
	; ENABLED-NEXT: movs r7, #0			; ENABLED-NEXT: ldr r0, [sp] @ 4-byte Reload
	; ENABLED-NEXT: bic r4, r4, #3
	; ENABLED-NEXT: subs r4, #4
	; ENABLED-NEXT: add.w r4, r6, r4, lsr #2
	; ENABLED-NEXT: subs r6, r0, #1
	; ENABLED-NEXT: dls lr, r4
	; ENABLED-NEXT: mov r4, r10
	; ENABLED-NEXT: ldr r0, [sp, #4] @ 4-byte Reload
	; ENABLED-NEXT: .LBB0_6: @ %vector.body			; ENABLED-NEXT: .LBB0_6: @ %vector.body
	; ENABLED-NEXT: @ Parent Loop BB0_4 Depth=1			; ENABLED-NEXT: @ Parent Loop BB0_4 Depth=1
	; ENABLED-NEXT: @ => This Inner Loop Header: Depth=2			; ENABLED-NEXT: @ => This Inner Loop Header: Depth=2
	; ENABLED-NEXT: vmov q1, q2			; ENABLED-NEXT: vldrh.s32 q1, [r0], #8
	; ENABLED-NEXT: vadd.i32 q2, q0, r7			; ENABLED-NEXT: vldrh.s32 q2, [r7], #8
	; ENABLED-NEXT: vdup.32 q3, r7
	; ENABLED-NEXT: mov lr, r8			; ENABLED-NEXT: mov lr, r8
	; ENABLED-NEXT: vcmp.u32 hi, q3, q2			; ENABLED-NEXT: vmul.i32 q1, q2, q1
	; ENABLED-NEXT: vdup.32 q3, r6
	; ENABLED-NEXT: vpnot
	; ENABLED-NEXT: sub.w r8, r8, #1			; ENABLED-NEXT: sub.w r8, r8, #1
	; ENABLED-NEXT: vpsttt			; ENABLED-NEXT: vshl.s32 q1, r5
	; ENABLED-NEXT: vcmpt.u32 cs, q3, q2			; ENABLED-NEXT: vadd.i32 q0, q1, q0
	; ENABLED-NEXT: vldrht.s32 q2, [r0], #8			; ENABLED-NEXT: letp lr, .LBB0_6
	; ENABLED-NEXT: vldrht.s32 q3, [r4], #8
	; ENABLED-NEXT: adds r7, #4
	; ENABLED-NEXT: vmul.i32 q2, q3, q2
	; ENABLED-NEXT: vshl.s32 q2, r5
	; ENABLED-NEXT: vadd.i32 q2, q2, q1
	; ENABLED-NEXT: le lr, .LBB0_6
	; ENABLED-NEXT: @ %bb.7: @ %middle.block			; ENABLED-NEXT: @ %bb.7: @ %middle.block
	; ENABLED-NEXT: @ in Loop: Header=BB0_4 Depth=1			; ENABLED-NEXT: @ in Loop: Header=BB0_4 Depth=1
	; ENABLED-NEXT: vpsel q1, q2, q1			; ENABLED-NEXT: vaddv.u32 r0, q0
	; ENABLED-NEXT: vaddv.u32 r0, q1
	; ENABLED-NEXT: b .LBB0_3			; ENABLED-NEXT: b .LBB0_3
	; ENABLED-NEXT: .LBB0_8: @ %for.end17			; ENABLED-NEXT: .LBB0_8: @ %for.end17
	; ENABLED-NEXT: add sp, #8			; ENABLED-NEXT: add sp, #4
	; ENABLED-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}			; ENABLED-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, pc}
	; ENABLED-NEXT: .p2align 4
	; ENABLED-NEXT: @ %bb.9:
	; ENABLED-NEXT: .LCPI0_0:
	; ENABLED-NEXT: .long 0 @ 0x0
	; ENABLED-NEXT: .long 1 @ 0x1
	; ENABLED-NEXT: .long 2 @ 0x2
	; ENABLED-NEXT: .long 3 @ 0x3
	;
	; FORCE-LABEL: varying_outer_2d_reduction:
	; FORCE: @ %bb.0: @ %entry
	; FORCE-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, lr}
	; FORCE-NEXT: sub sp, #4
	; FORCE-NEXT: cmp r3, #1
	; FORCE-NEXT: str r0, [sp] @ 4-byte Spill
	; FORCE-NEXT: blt .LBB0_8
	; FORCE-NEXT: @ %bb.1: @ %for.body.lr.ph
	; FORCE-NEXT: ldr r0, [sp, #36]
	; FORCE-NEXT: add.w r12, r2, #3
	; FORCE-NEXT: ldr.w r10, [sp] @ 4-byte Reload
	; FORCE-NEXT: movs r6, #0
	; FORCE-NEXT: mov r9, r12
	; FORCE-NEXT: uxth r0, r0
	; FORCE-NEXT: rsbs r5, r0, #0
	; FORCE-NEXT: b .LBB0_4
	; FORCE-NEXT: .LBB0_2: @ in Loop: Header=BB0_4 Depth=1
	; FORCE-NEXT: movs r0, #0
	; FORCE-NEXT: .LBB0_3: @ %for.end
	; FORCE-NEXT: @ in Loop: Header=BB0_4 Depth=1
	; FORCE-NEXT: lsrs r0, r0, #16
	; FORCE-NEXT: sub.w r9, r9, #1
	; FORCE-NEXT: strh.w r0, [r1, r6, lsl #1]
	; FORCE-NEXT: adds r6, #1
	; FORCE-NEXT: add.w r10, r10, #2
	; FORCE-NEXT: cmp r6, r3
	; FORCE-NEXT: beq .LBB0_8
	; FORCE-NEXT: .LBB0_4: @ %for.body
	; FORCE-NEXT: @ =>This Loop Header: Depth=1
	; FORCE-NEXT: @ Child Loop BB0_6 Depth 2
	; FORCE-NEXT: cmp r2, r6
	; FORCE-NEXT: ble .LBB0_2
	; FORCE-NEXT: @ %bb.5: @ %vector.ph
	; FORCE-NEXT: @ in Loop: Header=BB0_4 Depth=1
	; FORCE-NEXT: bic r0, r9, #3
	; FORCE-NEXT: movs r7, #1
	; FORCE-NEXT: subs r0, #4
	; FORCE-NEXT: subs r4, r2, r6
	; FORCE-NEXT: vmov.i32 q0, #0x0
	; FORCE-NEXT: add.w r8, r7, r0, lsr #2
	; FORCE-NEXT: mov r7, r10
	; FORCE-NEXT: dlstp.32 lr, r4
	; FORCE-NEXT: ldr r0, [sp] @ 4-byte Reload
	; FORCE-NEXT: .LBB0_6: @ %vector.body
	; FORCE-NEXT: @ Parent Loop BB0_4 Depth=1
	; FORCE-NEXT: @ => This Inner Loop Header: Depth=2
	; FORCE-NEXT: vldrh.s32 q1, [r0], #8
	; FORCE-NEXT: vldrh.s32 q2, [r7], #8
	; FORCE-NEXT: mov lr, r8
	; FORCE-NEXT: vmul.i32 q1, q2, q1
	; FORCE-NEXT: sub.w r8, r8, #1
	; FORCE-NEXT: vshl.s32 q1, r5
	; FORCE-NEXT: vadd.i32 q0, q1, q0
	; FORCE-NEXT: letp lr, .LBB0_6
	; FORCE-NEXT: @ %bb.7: @ %middle.block
	; FORCE-NEXT: @ in Loop: Header=BB0_4 Depth=1
	; FORCE-NEXT: vaddv.u32 r0, q0
	; FORCE-NEXT: b .LBB0_3
	; FORCE-NEXT: .LBB0_8: @ %for.end17
	; FORCE-NEXT: add sp, #4
	; FORCE-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, pc}
	;			;
	; NOREDUCTIONS-LABEL: varying_outer_2d_reduction:			; NOREDUCTIONS-LABEL: varying_outer_2d_reduction:
	; NOREDUCTIONS: @ %bb.0: @ %entry			; NOREDUCTIONS: @ %bb.0: @ %entry
	; NOREDUCTIONS-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, r11, lr}			; NOREDUCTIONS-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, lr}
	; NOREDUCTIONS-NEXT: sub sp, #8			; NOREDUCTIONS-NEXT: sub sp, #4
	; NOREDUCTIONS-NEXT: cmp r3, #1			; NOREDUCTIONS-NEXT: cmp r3, #1
	; NOREDUCTIONS-NEXT: str r0, [sp, #4] @ 4-byte Spill			; NOREDUCTIONS-NEXT: str r0, [sp] @ 4-byte Spill
	; NOREDUCTIONS-NEXT: blt .LBB0_8			; NOREDUCTIONS-NEXT: blt .LBB0_8
	; NOREDUCTIONS-NEXT: @ %bb.1: @ %for.body.lr.ph			; NOREDUCTIONS-NEXT: @ %bb.1: @ %for.body.lr.ph
	; NOREDUCTIONS-NEXT: ldr r0, [sp, #44]			; NOREDUCTIONS-NEXT: ldr r0, [sp, #36]
	; NOREDUCTIONS-NEXT: adr r7, .LCPI0_0			; NOREDUCTIONS-NEXT: add.w r12, r2, #3
	; NOREDUCTIONS-NEXT: ldr.w r10, [sp, #4] @ 4-byte Reload			; NOREDUCTIONS-NEXT: ldr.w r10, [sp] @ 4-byte Reload
	; NOREDUCTIONS-NEXT: add.w r9, r2, #3			; NOREDUCTIONS-NEXT: movs r6, #0
	; NOREDUCTIONS-NEXT: vldrw.u32 q0, [r7]			; NOREDUCTIONS-NEXT: mov r9, r12
	; NOREDUCTIONS-NEXT: mov.w r11, #0
	; NOREDUCTIONS-NEXT: uxth r0, r0			; NOREDUCTIONS-NEXT: uxth r0, r0
	; NOREDUCTIONS-NEXT: rsbs r5, r0, #0			; NOREDUCTIONS-NEXT: rsbs r5, r0, #0
	; NOREDUCTIONS-NEXT: str.w r9, [sp] @ 4-byte Spill
	; NOREDUCTIONS-NEXT: b .LBB0_4			; NOREDUCTIONS-NEXT: b .LBB0_4
	; NOREDUCTIONS-NEXT: .LBB0_2: @ in Loop: Header=BB0_4 Depth=1			; NOREDUCTIONS-NEXT: .LBB0_2: @ in Loop: Header=BB0_4 Depth=1
	; NOREDUCTIONS-NEXT: movs r0, #0			; NOREDUCTIONS-NEXT: movs r0, #0
	; NOREDUCTIONS-NEXT: .LBB0_3: @ %for.end			; NOREDUCTIONS-NEXT: .LBB0_3: @ %for.end
	; NOREDUCTIONS-NEXT: @ in Loop: Header=BB0_4 Depth=1			; NOREDUCTIONS-NEXT: @ in Loop: Header=BB0_4 Depth=1
	; NOREDUCTIONS-NEXT: lsrs r0, r0, #16			; NOREDUCTIONS-NEXT: lsrs r0, r0, #16
	; NOREDUCTIONS-NEXT: sub.w r9, r9, #1			; NOREDUCTIONS-NEXT: sub.w r9, r9, #1
	; NOREDUCTIONS-NEXT: strh.w r0, [r1, r11, lsl #1]			; NOREDUCTIONS-NEXT: strh.w r0, [r1, r6, lsl #1]
	; NOREDUCTIONS-NEXT: add.w r11, r11, #1			; NOREDUCTIONS-NEXT: adds r6, #1
	; NOREDUCTIONS-NEXT: add.w r10, r10, #2			; NOREDUCTIONS-NEXT: add.w r10, r10, #2
	; NOREDUCTIONS-NEXT: cmp r11, r3			; NOREDUCTIONS-NEXT: cmp r6, r3
	; NOREDUCTIONS-NEXT: beq .LBB0_8			; NOREDUCTIONS-NEXT: beq .LBB0_8
	; NOREDUCTIONS-NEXT: .LBB0_4: @ %for.body			; NOREDUCTIONS-NEXT: .LBB0_4: @ %for.body
	; NOREDUCTIONS-NEXT: @ =>This Loop Header: Depth=1			; NOREDUCTIONS-NEXT: @ =>This Loop Header: Depth=1
	; NOREDUCTIONS-NEXT: @ Child Loop BB0_6 Depth 2			; NOREDUCTIONS-NEXT: @ Child Loop BB0_6 Depth 2
	; NOREDUCTIONS-NEXT: cmp r2, r11			; NOREDUCTIONS-NEXT: cmp r2, r6
	; NOREDUCTIONS-NEXT: ble .LBB0_2			; NOREDUCTIONS-NEXT: ble .LBB0_2
	; NOREDUCTIONS-NEXT: @ %bb.5: @ %vector.ph			; NOREDUCTIONS-NEXT: @ %bb.5: @ %vector.ph
	; NOREDUCTIONS-NEXT: @ in Loop: Header=BB0_4 Depth=1			; NOREDUCTIONS-NEXT: @ in Loop: Header=BB0_4 Depth=1
	; NOREDUCTIONS-NEXT: bic r7, r9, #3			; NOREDUCTIONS-NEXT: bic r0, r9, #3
	; NOREDUCTIONS-NEXT: movs r6, #1			; NOREDUCTIONS-NEXT: movs r7, #1
	; NOREDUCTIONS-NEXT: subs r7, #4			; NOREDUCTIONS-NEXT: subs r0, #4
	; NOREDUCTIONS-NEXT: sub.w r0, r2, r11			; NOREDUCTIONS-NEXT: subs r4, r2, r6
	; NOREDUCTIONS-NEXT: vmov.i32 q2, #0x0			; NOREDUCTIONS-NEXT: vmov.i32 q0, #0x0
	; NOREDUCTIONS-NEXT: add.w r8, r6, r7, lsr #2			; NOREDUCTIONS-NEXT: add.w r8, r7, r0, lsr #2
	; NOREDUCTIONS-NEXT: ldr r7, [sp] @ 4-byte Reload			; NOREDUCTIONS-NEXT: mov r7, r10
	; NOREDUCTIONS-NEXT: sub.w r4, r7, r11			; NOREDUCTIONS-NEXT: dlstp.32 lr, r4
	; NOREDUCTIONS-NEXT: movs r7, #0			; NOREDUCTIONS-NEXT: ldr r0, [sp] @ 4-byte Reload
	; NOREDUCTIONS-NEXT: bic r4, r4, #3
	; NOREDUCTIONS-NEXT: subs r4, #4
	; NOREDUCTIONS-NEXT: add.w r4, r6, r4, lsr #2
	; NOREDUCTIONS-NEXT: subs r6, r0, #1
	; NOREDUCTIONS-NEXT: dls lr, r4
	; NOREDUCTIONS-NEXT: mov r4, r10
	; NOREDUCTIONS-NEXT: ldr r0, [sp, #4] @ 4-byte Reload
	; NOREDUCTIONS-NEXT: .LBB0_6: @ %vector.body			; NOREDUCTIONS-NEXT: .LBB0_6: @ %vector.body
	; NOREDUCTIONS-NEXT: @ Parent Loop BB0_4 Depth=1			; NOREDUCTIONS-NEXT: @ Parent Loop BB0_4 Depth=1
	; NOREDUCTIONS-NEXT: @ => This Inner Loop Header: Depth=2			; NOREDUCTIONS-NEXT: @ => This Inner Loop Header: Depth=2
	; NOREDUCTIONS-NEXT: vmov q1, q2			; NOREDUCTIONS-NEXT: vldrh.s32 q1, [r0], #8
	; NOREDUCTIONS-NEXT: vadd.i32 q2, q0, r7			; NOREDUCTIONS-NEXT: vldrh.s32 q2, [r7], #8
	; NOREDUCTIONS-NEXT: vdup.32 q3, r7
	; NOREDUCTIONS-NEXT: mov lr, r8			; NOREDUCTIONS-NEXT: mov lr, r8
	; NOREDUCTIONS-NEXT: vcmp.u32 hi, q3, q2			; NOREDUCTIONS-NEXT: vmul.i32 q1, q2, q1
	; NOREDUCTIONS-NEXT: vdup.32 q3, r6
	; NOREDUCTIONS-NEXT: vpnot
	; NOREDUCTIONS-NEXT: sub.w r8, r8, #1			; NOREDUCTIONS-NEXT: sub.w r8, r8, #1
	; NOREDUCTIONS-NEXT: vpsttt			; NOREDUCTIONS-NEXT: vshl.s32 q1, r5
	; NOREDUCTIONS-NEXT: vcmpt.u32 cs, q3, q2			; NOREDUCTIONS-NEXT: vadd.i32 q0, q1, q0
	; NOREDUCTIONS-NEXT: vldrht.s32 q2, [r0], #8			; NOREDUCTIONS-NEXT: letp lr, .LBB0_6
	; NOREDUCTIONS-NEXT: vldrht.s32 q3, [r4], #8
	; NOREDUCTIONS-NEXT: adds r7, #4
	; NOREDUCTIONS-NEXT: vmul.i32 q2, q3, q2
	; NOREDUCTIONS-NEXT: vshl.s32 q2, r5
	; NOREDUCTIONS-NEXT: vadd.i32 q2, q2, q1
	; NOREDUCTIONS-NEXT: le lr, .LBB0_6
	; NOREDUCTIONS-NEXT: @ %bb.7: @ %middle.block			; NOREDUCTIONS-NEXT: @ %bb.7: @ %middle.block
	; NOREDUCTIONS-NEXT: @ in Loop: Header=BB0_4 Depth=1			; NOREDUCTIONS-NEXT: @ in Loop: Header=BB0_4 Depth=1
	; NOREDUCTIONS-NEXT: vpsel q1, q2, q1			; NOREDUCTIONS-NEXT: vaddv.u32 r0, q0
	; NOREDUCTIONS-NEXT: vaddv.u32 r0, q1
	; NOREDUCTIONS-NEXT: b .LBB0_3			; NOREDUCTIONS-NEXT: b .LBB0_3
	; NOREDUCTIONS-NEXT: .LBB0_8: @ %for.end17			; NOREDUCTIONS-NEXT: .LBB0_8: @ %for.end17
	; NOREDUCTIONS-NEXT: add sp, #8			; NOREDUCTIONS-NEXT: add sp, #4
	; NOREDUCTIONS-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, r11, pc}			; NOREDUCTIONS-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, pc}
	; NOREDUCTIONS-NEXT: .p2align 4
	; NOREDUCTIONS-NEXT: @ %bb.9:
	; NOREDUCTIONS-NEXT: .LCPI0_0:
	; NOREDUCTIONS-NEXT: .long 0 @ 0x0
	; NOREDUCTIONS-NEXT: .long 1 @ 0x1
	; NOREDUCTIONS-NEXT: .long 2 @ 0x2
	; NOREDUCTIONS-NEXT: .long 3 @ 0x3
	;			;
	; FORCENOREDUCTIONS-LABEL: varying_outer_2d_reduction:
	; FORCENOREDUCTIONS: @ %bb.0: @ %entry
	; FORCENOREDUCTIONS-NEXT: push.w {r4, r5, r6, r7, r8, r9, r10, lr}
	; FORCENOREDUCTIONS-NEXT: sub sp, #4
	; FORCENOREDUCTIONS-NEXT: cmp r3, #1
	; FORCENOREDUCTIONS-NEXT: str r0, [sp] @ 4-byte Spill
	; FORCENOREDUCTIONS-NEXT: blt .LBB0_8
	; FORCENOREDUCTIONS-NEXT: @ %bb.1: @ %for.body.lr.ph
	; FORCENOREDUCTIONS-NEXT: ldr r0, [sp, #36]
	; FORCENOREDUCTIONS-NEXT: add.w r12, r2, #3
	; FORCENOREDUCTIONS-NEXT: ldr.w r10, [sp] @ 4-byte Reload
	; FORCENOREDUCTIONS-NEXT: movs r6, #0
	; FORCENOREDUCTIONS-NEXT: mov r9, r12
	; FORCENOREDUCTIONS-NEXT: uxth r0, r0
	; FORCENOREDUCTIONS-NEXT: rsbs r5, r0, #0
	; FORCENOREDUCTIONS-NEXT: b .LBB0_4
	; FORCENOREDUCTIONS-NEXT: .LBB0_2: @ in Loop: Header=BB0_4 Depth=1
	; FORCENOREDUCTIONS-NEXT: movs r0, #0
	; FORCENOREDUCTIONS-NEXT: .LBB0_3: @ %for.end
	; FORCENOREDUCTIONS-NEXT: @ in Loop: Header=BB0_4 Depth=1
	; FORCENOREDUCTIONS-NEXT: lsrs r0, r0, #16
	; FORCENOREDUCTIONS-NEXT: sub.w r9, r9, #1
	; FORCENOREDUCTIONS-NEXT: strh.w r0, [r1, r6, lsl #1]
	; FORCENOREDUCTIONS-NEXT: adds r6, #1
	; FORCENOREDUCTIONS-NEXT: add.w r10, r10, #2
	; FORCENOREDUCTIONS-NEXT: cmp r6, r3
	; FORCENOREDUCTIONS-NEXT: beq .LBB0_8
	; FORCENOREDUCTIONS-NEXT: .LBB0_4: @ %for.body
	; FORCENOREDUCTIONS-NEXT: @ =>This Loop Header: Depth=1
	; FORCENOREDUCTIONS-NEXT: @ Child Loop BB0_6 Depth 2
	; FORCENOREDUCTIONS-NEXT: cmp r2, r6
	; FORCENOREDUCTIONS-NEXT: ble .LBB0_2
	; FORCENOREDUCTIONS-NEXT: @ %bb.5: @ %vector.ph
	; FORCENOREDUCTIONS-NEXT: @ in Loop: Header=BB0_4 Depth=1
	; FORCENOREDUCTIONS-NEXT: bic r0, r9, #3
	; FORCENOREDUCTIONS-NEXT: movs r7, #1
	; FORCENOREDUCTIONS-NEXT: subs r0, #4
	; FORCENOREDUCTIONS-NEXT: subs r4, r2, r6
	; FORCENOREDUCTIONS-NEXT: vmov.i32 q0, #0x0
	; FORCENOREDUCTIONS-NEXT: add.w r8, r7, r0, lsr #2
	; FORCENOREDUCTIONS-NEXT: mov r7, r10
	; FORCENOREDUCTIONS-NEXT: dlstp.32 lr, r4
	; FORCENOREDUCTIONS-NEXT: ldr r0, [sp] @ 4-byte Reload
	; FORCENOREDUCTIONS-NEXT: .LBB0_6: @ %vector.body
	; FORCENOREDUCTIONS-NEXT: @ Parent Loop BB0_4 Depth=1
	; FORCENOREDUCTIONS-NEXT: @ => This Inner Loop Header: Depth=2
	; FORCENOREDUCTIONS-NEXT: vldrh.s32 q1, [r0], #8
	; FORCENOREDUCTIONS-NEXT: vldrh.s32 q2, [r7], #8
	; FORCENOREDUCTIONS-NEXT: mov lr, r8
	; FORCENOREDUCTIONS-NEXT: vmul.i32 q1, q2, q1
	; FORCENOREDUCTIONS-NEXT: sub.w r8, r8, #1
	; FORCENOREDUCTIONS-NEXT: vshl.s32 q1, r5
	; FORCENOREDUCTIONS-NEXT: vadd.i32 q0, q1, q0
	; FORCENOREDUCTIONS-NEXT: letp lr, .LBB0_6
	; FORCENOREDUCTIONS-NEXT: @ %bb.7: @ %middle.block
	; FORCENOREDUCTIONS-NEXT: @ in Loop: Header=BB0_4 Depth=1
	; FORCENOREDUCTIONS-NEXT: vaddv.u32 r0, q0
	; FORCENOREDUCTIONS-NEXT: b .LBB0_3
	; FORCENOREDUCTIONS-NEXT: .LBB0_8: @ %for.end17
	; FORCENOREDUCTIONS-NEXT: add sp, #4
	; FORCENOREDUCTIONS-NEXT: pop.w {r4, r5, r6, r7, r8, r9, r10, pc}
	entry:			entry:
	%conv = sext i16 %N to i32			%conv = sext i16 %N to i32
	%cmp36 = icmp sgt i16 %N, 0			%cmp36 = icmp sgt i16 %N, 0
	br i1 %cmp36, label %for.body.lr.ph, label %for.end17			br i1 %cmp36, label %for.body.lr.ph, label %for.end17

	for.body.lr.ph: ; preds = %entry			for.body.lr.ph: ; preds = %entry
	%conv2 = sext i16 %Size to i32			%conv2 = sext i16 %Size to i32
	%conv1032 = zext i16 %Scale to i32			%conv1032 = zext i16 %Scale to i32