This is an archive of the discontinued LLVM Phabricator instance.

Differential D84741

[LV] Allow tail folded reduction selects to remain in the loop
Closed, Public

Authored by dmgreen on Jul 28 2020, 3:35 AM.

Details

Reviewers

Ayal
SjoerdMeijer
gilr
fhahn

Commits

rG816097e4e5f3: [LV] Allow tail folded reduction selects to remain in the loop

Summary

The normal scheme for tail folding reductions is to use:

loop:
  p = phi(0, a)
  mask = ...
  x = masked_load(..., mask)
  a = add(x, p)
s = select(mask, a, p)

This means we need to keep the registers p and a alive out of the loop, plus the mask. On a target with predicated operations we can instead generate the phi as p = phi(0, s). This keeps the select in the loop, where select(m, add(a, b), c) can be folded to something like a vaddt c, a, b using the m predicate. This in turn allows us to tail predicate the entire loop.
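
As a rough illustration of the two schemes in the summary, here is a small scalar C++ model of a tail-folded add reduction. It is only a sketch and not vectorizer code: the width `VF`, the `Garbage` value standing in for undefined masked-off lanes, and all helper names are assumptions made for this example. Both variants should give the same sum; the difference is only whether the select feeds the phi (and so stays in the loop) or runs once after it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t VF = 4;                 // assumed vector width
constexpr std::uint32_t Garbage = 0xDEADBEEF; // stand-in for an undefined masked-off lane

using Vec = std::array<std::uint32_t, VF>;
using Mask = std::array<bool, VF>;

// mask = lanes that still fall inside the array.
Mask activeLanes(std::size_t base, std::size_t n) {
  Mask m{};
  for (std::size_t i = 0; i < VF; ++i)
    m[i] = base + i < n;
  return m;
}

// x = masked_load(..., mask); inactive lanes hold garbage on purpose.
Vec maskedLoad(const std::vector<std::uint32_t> &data, std::size_t base,
               const Mask &m) {
  Vec x{};
  for (std::size_t i = 0; i < VF; ++i)
    x[i] = m[i] ? data[base + i] : Garbage;
  return x;
}

Vec addLanes(const Vec &a, const Vec &b) {
  Vec r{};
  for (std::size_t i = 0; i < VF; ++i)
    r[i] = a[i] + b[i];
  return r;
}

// Lane-wise select(mask, a, b).
Vec selectLanes(const Mask &m, const Vec &a, const Vec &b) {
  Vec r{};
  for (std::size_t i = 0; i < VF; ++i)
    r[i] = m[i] ? a[i] : b[i];
  return r;
}

std::uint32_t reduceAdd(const Vec &v) {
  std::uint32_t s = 0;
  for (std::uint32_t x : v)
    s += x;
  return s;
}

// p = phi(0, a); the select runs once after the loop, so p, a and the
// mask all have to stay live out of the loop.
std::uint32_t sumSelectAfterLoop(const std::vector<std::uint32_t> &data) {
  Vec p{}, a{};
  Mask mask{};
  for (std::size_t base = 0; base < data.size(); base += VF) {
    p = a;
    mask = activeLanes(base, data.size());
    a = addLanes(maskedLoad(data, base, mask), p);
  }
  return reduceAdd(selectLanes(mask, a, p));
}

// p = phi(0, s); the select stays in the loop and only s is live out.
std::uint32_t sumSelectInLoop(const std::vector<std::uint32_t> &data) {
  Vec s{};
  for (std::size_t base = 0; base < data.size(); base += VF) {
    Vec p = s;
    Mask mask = activeLanes(base, data.size());
    Vec a = addLanes(maskedLoad(data, base, mask), p);
    s = selectLanes(mask, a, p);
  }
  return reduceAdd(s);
}
```

In the second variant each iteration computes select(mask, add(x, p), p), which is the pattern the summary expects a predicated target to fold into a single predicated add.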

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

dmgreen created this revision. Jul 28 2020, 3:35 AM

Herald added a project: Restricted Project. Jul 28 2020, 3:35 AM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls.

dmgreen requested review of this revision. Jul 28 2020, 3:35 AM

ping :)

SjoerdMeijer added inline comments. Aug 11 2020, 8:49 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
1261 ↗	(On Diff #281157)	typo: to to
1262 ↗	(On Diff #281157)	typo: e -> be
1264 ↗	(On Diff #281157)	Probably best to do the TTI changes separately. Just kind of echoing what people told me last time...but I guess it is more convenient in case of reverts of the LV part for example.
1264 ↗	(On Diff #281157)	Bikeshedding names, so ignore if you don't think it is a good fit. "InLoop" is used in helpers above. I was thinking if `preferInLoopReductionSelect` would be more consistent and clear. Then I was wondering the next thing.... can we not simply use `preferInLoopReduction`? Is that not more or less the same, or can it be the same?
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1560 ↗	(On Diff #281157)	Is this an unrelated change (the deletions)?

Thanks for taking a look. I will update this soon...

llvm/include/llvm/Analysis/TargetTransformInfo.h
1264 ↗	(On Diff #281157)	You read my mind. I had it called preferInLoopReductionSelect originally, but I changed it because it is really a different concept from the inloop reductions. Inloop reductions are about placing a vecreduce in the loop as opposed to after it. This patch is about being able to transform `add; select` into a predicated `add`, which can make the select more efficient in the loop. Hopefully that would make it useful on different architectures too, if they have the ability to predicate the add.
1264 ↗	(On Diff #281157)	Yeah. I was trying to avoid having to add an option for it, and it wouldn't do anything without a target hook. I'll add one though and move the TTI part out.
llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp
1560 ↗	(On Diff #281157)	The idea is that we can now handle any reductions in tail predicated loops, not just integers. But we still have an option to disable them. I have to check that they actually do OK all the time though. I know predicated VFMA was missing until now. I'm not sure if any others are needed...
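
To make the in-loop versus out-of-loop distinction from the reply above concrete, here is a minimal C++ sketch. It is not LLVM code: `Width`, `loadChunk` and the other names are made up for illustration, and lanes past the end simply read as 0 so that tail handling does not distract from the point. An out-of-loop reduction keeps a vector accumulator and reduces it once after the loop, while an in-loop reduction performs the horizontal reduce on every iteration into a scalar accumulator.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t Width = 4; // assumed vector width

using Lanes = std::array<std::uint32_t, Width>;

// Loads Width elements starting at base; lanes past the end read as 0,
// the identity for add.
Lanes loadChunk(const std::vector<std::uint32_t> &data, std::size_t base) {
  Lanes x{};
  for (std::size_t i = 0; i < Width && base + i < data.size(); ++i)
    x[i] = data[base + i];
  return x;
}

// Stand-in for a vector reduce: adds all lanes together.
std::uint32_t horizontalAdd(const Lanes &v) {
  std::uint32_t s = 0;
  for (std::uint32_t x : v)
    s += x;
  return s;
}

// Out-of-loop reduction: vector accumulator, one reduce after the loop.
std::uint32_t reduceAfterLoop(const std::vector<std::uint32_t> &data) {
  Lanes acc{};
  for (std::size_t base = 0; base < data.size(); base += Width) {
    Lanes x = loadChunk(data, base);
    for (std::size_t i = 0; i < Width; ++i)
      acc[i] += x[i];
  }
  return horizontalAdd(acc);
}

// In-loop reduction: scalar accumulator, one reduce per iteration.
std::uint32_t reduceInsideLoop(const std::vector<std::uint32_t> &data) {
  std::uint32_t acc = 0;
  for (std::size_t base = 0; base < data.size(); base += Width)
    acc += horizontalAdd(loadChunk(data, base));
  return acc;
}
```

The predicated select discussed in this review applies to the first shape: the accumulator stays a vector, and only the way masked-off lanes are handled changes.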

Now just the Vectorizer part and an option to test it with, plus a new test.

The rest is now in D85980

Thanks, this now looks like a small and good change, LGTM.

This revision is now accepted and ready to land. Aug 17 2020, 1:19 AM

Closed by commit rG816097e4e5f3: [LV] Allow tail folded reduction selects to remain in the loop (authored by dmgreen). Aug 20 2020, 6:31 AM

This revision was automatically updated to reflect the committed changes.

dmgreen added a commit: rG816097e4e5f3: [LV] Allow tail folded reduction selects to remain in the loop.

dmgreen mentioned this in rG2b69efded0dc: [ARM][LV] Add a preferPredicatedReductionSelect target hook. Aug 21 2020, 12:48 AM

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	17 lines
llvm/test/Transforms/LoopVectorize/reduction-predselect.ll	86 lines

Diff 286803

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[265 lines not shown; enclosing context: cl::desc("The maximum interleave count to use when interleaving a scalar "]
              "reduction in a nested loop."));

 static cl::opt<bool>
     PreferInLoopReductions("prefer-inloop-reductions", cl::init(false),
                            cl::Hidden,
                            cl::desc("Prefer in-loop vector reductions, "
                                     "overriding the targets preference."));

+static cl::opt<bool> PreferPredicatedReductionSelect(
+    "prefer-predicated-reduction-select", cl::init(false), cl::Hidden,
+    cl::desc(
+        "Prefer predicating a reduction operation over an after loop select."));
+
 cl::opt<bool> EnableVPlanNativePath(
     "enable-vplan-native-path", cl::init(false), cl::Hidden,
     cl::desc("Enable VPlan-native vectorization path with "
              "support for outer loop vectorization."));

 // FIXME: Remove this switch once we have divergence analysis. Currently we
 // assume divergent non-backedge branches when this switch is true.
 cl::opt<bool> EnableVPlanPredication(
[3,630 lines not shown; enclosing context: for (unsigned Part = 0; Part < UF; ++Part) {]
         if (isa<SelectInst>(U)) {
           assert(!Sel && "Reduction exit feeding two selects");
           Sel = U;
         } else
           assert(isa<PHINode>(U) && "Reduction exit must feed Phi's or select");
       }
       assert(Sel && "Reduction exit feeds no select");
       VectorLoopValueMap.resetVectorValue(LoopExitInst, Part, Sel);
+
+      // If the target can create a predicated operator for the reduction at no
+      // extra cost in the loop (for example a predicated vadd), it can be
+      // cheaper for the select to remain in the loop than be sunk out of it,
+      // and so use the select value for the phi instead of the old
+      // LoopExitValue.
+      RecurrenceDescriptor RdxDesc = Legal->getReductionVars()[Phi];
+      if (PreferPredicatedReductionSelect) {
+        auto *VecRdxPhi = cast<PHINode>(getOrCreateVectorValue(Phi, Part));
+        VecRdxPhi->setIncomingValueForBlock(
+            LI->getLoopFor(LoopVectorBody)->getLoopLatch(), Sel);
+      }
     }
   }

   // If the vector reduction can be performed in a smaller type, we truncate
   // then extend the loop exit value to enable InstCombine to evaluate the
   // entire expression in the smaller type.
   if (VF > 1 && Phi->getType() != RdxDesc.getRecurrenceType()) {
     assert(!IsInLoopReductionPhi && "Unexpected truncated inloop reduction!");
[4,494 lines not shown]

llvm/test/Transforms/LoopVectorize/reduction-predselect.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
-; RUN: opt < %s -loop-vectorize -force-vector-width=4 -force-vector-interleave=1 -prefer-predicate-over-epilog -force-reduction-intrinsics -dce -instcombine -S | FileCheck %s
+; RUN: opt < %s -loop-vectorize -force-vector-width=4 -force-vector-interleave=1 -prefer-predicate-over-epilog -prefer-predicated-reduction-select -force-reduction-intrinsics -dce -instcombine -S | FileCheck %s

 target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"

 define i32 @reduction_sum_single(i32* noalias nocapture %A) {
 ; CHECK-LABEL: @reduction_sum_single(
 ; CHECK: vector.body:
-; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ [[TMP24:%.*]], %pred.load.continue6 ]
-; CHECK: [[TMP24]] = add <4 x i32> [[VEC_PHI]], [[TMP23:%.*]]
+; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ [[TMP25:%.*]], %pred.load.continue6 ]
+; CHECK: [[TMP24:%.*]] = select <4 x i1> [[TMP0:%.*]], <4 x i32> [[TMP23:%.*]], <4 x i32> zeroinitializer
+; CHECK: [[TMP25]] = add <4 x i32> [[VEC_PHI]], [[TMP24]]
 ; CHECK: middle.block:
-; CHECK: [[TMP26:%.*]] = select <4 x i1> [[TMP0:%.*]], <4 x i32> [[TMP24]], <4 x i32> [[VEC_PHI]]
-; CHECK: [[TMP27:%.*]] = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> [[TMP26]])
+; CHECK: [[TMP27:%.*]] = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> [[TMP25]])
 ;
 entry:
   br label %.lr.ph

 .lr.ph: ; preds = %entry, %.lr.ph
   %indvars.iv = phi i32 [ %indvars.iv.next, %.lr.ph ], [ 0, %entry ]
   %sum.02 = phi i32 [ %l7, %.lr.ph ], [ 0, %entry ]
   %l2 = getelementptr inbounds i32, i32* %A, i32 %indvars.iv
   %l3 = load i32, i32* %l2, align 4
   %l7 = add i32 %sum.02, %l3
   %indvars.iv.next = add i32 %indvars.iv, 1
   %exitcond = icmp eq i32 %indvars.iv.next, 257
   br i1 %exitcond, label %._crit_edge, label %.lr.ph

 ._crit_edge: ; preds = %.lr.ph
   %sum.0.lcssa = phi i32 [ %l7, %.lr.ph ]
   ret i32 %sum.0.lcssa
 }

 define i32 @reduction_sum(i32* noalias nocapture %A, i32* noalias nocapture %B) {
 ; CHECK-LABEL: @reduction_sum(
-; CHECK: vector.body:
-; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ [[TMP46:%.*]], %pred.load.continue14 ]
+; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ [[TMP47:%.*]], %pred.load.continue14 ]
 ; CHECK: [[TMP44:%.*]] = add <4 x i32> [[VEC_PHI]], [[VEC_IND:%.*]]
 ; CHECK: [[TMP45:%.*]] = add <4 x i32> [[TMP44]], [[TMP23:%.*]]
-; CHECK: [[TMP46]] = add <4 x i32> [[TMP45]], [[TMP43:%.*]]
+; CHECK: [[TMP46:%.*]] = add <4 x i32> [[TMP45]], [[TMP43:%.*]]
+; CHECK: [[TMP47]] = select <4 x i1> [[TMP3:%.*]], <4 x i32> [[TMP46]], <4 x i32> [[VEC_PHI]]
 ; CHECK: middle.block:
-; CHECK: [[TMP48:%.*]] = select <4 x i1> [[TMP3:%.*]], <4 x i32> [[TMP46]], <4 x i32> [[VEC_PHI]]
-; CHECK: [[TMP49:%.*]] = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> [[TMP48]])
+; CHECK: [[TMP49:%.*]] = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> [[TMP47]])
 ;
 entry:
   br label %.lr.ph

 .lr.ph: ; preds = %entry, %.lr.ph
   %indvars.iv = phi i32 [ %indvars.iv.next, %.lr.ph ], [ 0, %entry ]
   %sum.02 = phi i32 [ %l9, %.lr.ph ], [ 0, %entry ]
   %l2 = getelementptr inbounds i32, i32* %A, i32 %indvars.iv
[10 lines not shown]
 ._crit_edge: ; preds = %.lr.ph
   %sum.0.lcssa = phi i32 [ %l9, %.lr.ph ]
   ret i32 %sum.0.lcssa
 }

 define i32 @reduction_prod(i32* noalias nocapture %A, i32* noalias nocapture %B) {
 ; CHECK-LABEL: @reduction_prod(
 ; CHECK: vector.body:
-; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ <i32 1, i32 1, i32 1, i32 1>, %vector.ph ], [ [[TMP45:%.*]], %pred.load.continue14 ]
+; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ <i32 1, i32 1, i32 1, i32 1>, %vector.ph ], [ [[TMP46:%.*]], %pred.load.continue14 ]
 ; CHECK: [[TMP44:%.*]] = mul <4 x i32> [[VEC_PHI]], [[TMP23:%.*]]
-; CHECK: [[TMP45]] = mul <4 x i32> [[TMP44]], [[TMP43:%.*]]
+; CHECK: [[TMP45:%.*]] = mul <4 x i32> [[TMP44]], [[TMP43:%.*]]
+; CHECK: [[TMP46]] = select <4 x i1> [[TMP3:%.*]], <4 x i32> [[TMP45]], <4 x i32> [[VEC_PHI]]
 ; CHECK: middle.block:
-; CHECK: [[TMP47:%.*]] = select <4 x i1> [[TMP3:%.*]], <4 x i32> [[TMP45]], <4 x i32> [[VEC_PHI]]
-; CHECK: [[TMP48:%.*]] = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> [[TMP47]])
+; CHECK: [[TMP48:%.*]] = call i32 @llvm.experimental.vector.reduce.mul.v4i32(<4 x i32> [[TMP46]])
 ;
 entry:
   br label %.lr.ph

 .lr.ph: ; preds = %entry, %.lr.ph
   %indvars.iv = phi i32 [ %indvars.iv.next, %.lr.ph ], [ 0, %entry ]
   %prod.02 = phi i32 [ %l9, %.lr.ph ], [ 1, %entry ]
   %l2 = getelementptr inbounds i32, i32* %A, i32 %indvars.iv
[9 lines not shown]
 ._crit_edge: ; preds = %.lr.ph
   %prod.0.lcssa = phi i32 [ %l9, %.lr.ph ]
   ret i32 %prod.0.lcssa
 }

 define i32 @reduction_and(i32* nocapture %A, i32* nocapture %B) {
 ; CHECK-LABEL: @reduction_and(
 ; CHECK: vector.body:
-; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ <i32 -1, i32 -1, i32 -1, i32 -1>, %vector.ph ], [ [[TMP45:%.*]], %pred.load.continue14 ]
-; CHECK: [[TMP44:%.*]] = and <4 x i32> [[VEC_PHI]], [[TMP42:%.*]]
-; CHECK: [[TMP45]] = and <4 x i32> [[TMP44]], [[TMP43]]
+; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ <i32 -1, i32 -1, i32 -1, i32 -1>, %vector.ph ], [ [[TMP46:%.*]], %pred.load.continue14 ]
+; CHECK: [[TMP44:%.*]] = and <4 x i32> [[VEC_PHI]], [[TMP23:%.*]]
+; CHECK: [[TMP45:%.*]] = and <4 x i32> [[TMP44]], [[TMP43:%.*]]
+; CHECK: [[TMP46]] = select <4 x i1> [[TMP3:%.*]], <4 x i32> [[TMP45]], <4 x i32> [[VEC_PHI]]
 ; CHECK: middle.block:
-; CHECK: [[TMP47:%.*]] = select <4 x i1> [[TMP3:%.*]], <4 x i32> [[TMP45]], <4 x i32> [[VEC_PHI]]
-; CHECK: [[TMP48:%.*]] = call i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> [[TMP47]])
+; CHECK: [[TMP48:%.*]] = call i32 @llvm.experimental.vector.reduce.and.v4i32(<4 x i32> [[TMP46]])
 ;
 entry:
   br label %for.body

 for.body: ; preds = %entry, %for.body
   %indvars.iv = phi i32 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
   %result.08 = phi i32 [ %and, %for.body ], [ -1, %entry ]
   %arrayidx = getelementptr inbounds i32, i32* %A, i32 %indvars.iv
[9 lines not shown]
 for.end: ; preds = %for.body, %entry
   %result.0.lcssa = phi i32 [ %and, %for.body ]
   ret i32 %result.0.lcssa
 }

 define i32 @reduction_or(i32* nocapture %A, i32* nocapture %B) {
 ; CHECK-LABEL: @reduction_or(
 ; CHECK: vector.body:
-; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ [[TMP45:%.*]], %pred.load.continue14 ]
-; CHECK: [[TMP45]] = or <4 x i32> [[TMP44:%.*]], [[VEC_PHI]]
+; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ [[TMP46:%.*]], %pred.load.continue14 ]
+; CHECK: [[TMP45:%.*]] = select <4 x i1> [[TMP3:%.*]], <4 x i32> [[TMP44:%.*]], <4 x i32> zeroinitializer
+; CHECK: [[TMP46]] = or <4 x i32> [[VEC_PHI]], [[TMP45]]
 ; CHECK: middle.block:
-; CHECK: [[TMP47:%.*]] = select <4 x i1> [[TMP3:%.*]], <4 x i32> [[TMP45]], <4 x i32> [[VEC_PHI]]
-; CHECK: [[TMP48:%.*]] = call i32 @llvm.experimental.vector.reduce.or.v4i32(<4 x i32> [[TMP47]])
+; CHECK: [[TMP48:%.*]] = call i32 @llvm.experimental.vector.reduce.or.v4i32(<4 x i32> [[TMP46]])
 ;
 entry:
   br label %for.body

 for.body: ; preds = %entry, %for.body
   %indvars.iv = phi i32 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
   %result.08 = phi i32 [ %or, %for.body ], [ 0, %entry ]
   %arrayidx = getelementptr inbounds i32, i32* %A, i32 %indvars.iv
   %l0 = load i32, i32* %arrayidx, align 4
   %arrayidx2 = getelementptr inbounds i32, i32* %B, i32 %indvars.iv
   %l1 = load i32, i32* %arrayidx2, align 4
   %add = add nsw i32 %l1, %l0
   %or = or i32 %add, %result.08
   %indvars.iv.next = add i32 %indvars.iv, 1
   %exitcond = icmp eq i32 %indvars.iv.next, 257
   br i1 %exitcond, label %for.end, label %for.body

 for.end: ; preds = %for.body, %entry
   %result.0.lcssa = phi i32 [ %or, %for.body ]
   ret i32 %result.0.lcssa
 }

 define i32 @reduction_xor(i32* nocapture %A, i32* nocapture %B) {
 ; CHECK-LABEL: @reduction_xor(
-; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ [[TMP45:%.*]], %pred.load.continue14 ]
-; CHECK: [[TMP45]] = xor <4 x i32> [[TMP44:%.*]], [[VEC_PHI]]
+; CHECK: vector.body:
+; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ [[TMP46:%.*]], %pred.load.continue14 ]
+; CHECK: [[TMP45:%.*]] = select <4 x i1> [[TMP3:%.*]], <4 x i32> [[TMP44:%.*]], <4 x i32> zeroinitializer
+; CHECK: [[TMP46]] = xor <4 x i32> [[VEC_PHI]], [[TMP45]]
 ; CHECK: middle.block:
-; CHECK: [[TMP47:%.*]] = select <4 x i1> [[TMP3:%.*]], <4 x i32> [[TMP45]], <4 x i32> [[VEC_PHI]]
-; CHECK: [[TMP48:%.*]] = call i32 @llvm.experimental.vector.reduce.xor.v4i32(<4 x i32> [[TMP47]])
+; CHECK: [[TMP48:%.*]] = call i32 @llvm.experimental.vector.reduce.xor.v4i32(<4 x i32> [[TMP46]])
 ;
 entry:
   br label %for.body

 for.body: ; preds = %entry, %for.body
   %indvars.iv = phi i32 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
   %result.08 = phi i32 [ %xor, %for.body ], [ 0, %entry ]
   %arrayidx = getelementptr inbounds i32, i32* %A, i32 %indvars.iv
[9 lines not shown]
 for.end: ; preds = %for.body, %entry
   %result.0.lcssa = phi i32 [ %xor, %for.body ]
   ret i32 %result.0.lcssa
 }

 define float @reduction_fadd(float* nocapture %A, float* nocapture %B) {
 ; CHECK-LABEL: @reduction_fadd(
 ; CHECK: vector.body:
-; CHECK: [[VEC_PHI:%.*]] = phi <4 x float> [ zeroinitializer, %vector.ph ], [ [[TMP45:%.*]], %pred.load.continue14 ]
+; CHECK: [[VEC_PHI:%.*]] = phi <4 x float> [ zeroinitializer, %vector.ph ], [ [[TMP46:%.*]], %pred.load.continue14 ]
 ; CHECK: [[TMP44:%.*]] = fadd fast <4 x float> [[VEC_PHI]], [[TMP23:%.*]]
-; CHECK: [[TMP45]] = fadd fast <4 x float> [[TMP44]], [[TMP43]]
+; CHECK: [[TMP45:%.*]] = fadd fast <4 x float> [[TMP44]], [[TMP43:%.*]]
+; CHECK: [[TMP46]] = select <4 x i1> [[TMP3:%.*]], <4 x float> [[TMP45]], <4 x float> [[VEC_PHI]]
 ; CHECK: middle.block:
-; CHECK: [[TMP47:%.*]] = select <4 x i1> [[TMP3:%.*]], <4 x float> [[TMP45]], <4 x float> [[VEC_PHI]]
-; CHECK: [[TMP48:%.*]] = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.000000e+00, <4 x float> [[TMP47]])
+; CHECK: [[TMP48:%.*]] = call fast float @llvm.experimental.vector.reduce.v2.fadd.f32.v4f32(float 0.000000e+00, <4 x float> [[TMP46]])
 ;
 entry:
   br label %for.body

 for.body: ; preds = %entry, %for.body
   %indvars.iv = phi i32 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
   %result.08 = phi float [ %fadd, %for.body ], [ 0.0, %entry ]
   %arrayidx = getelementptr inbounds float, float* %A, i32 %indvars.iv
[9 lines not shown]
 for.end: ; preds = %for.body, %entry
   %result.0.lcssa = phi float [ %fadd, %for.body ]
   ret float %result.0.lcssa
 }

 define float @reduction_fmul(float* nocapture %A, float* nocapture %B) {
 ; CHECK-LABEL: @reduction_fmul(
 ; CHECK: vector.body:
-; CHECK: [[VEC_PHI:%.*]] = phi <4 x float> [ <float 0.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %vector.ph ], [ [[TMP45:%.*]], %pred.load.continue14 ]
+; CHECK: [[VEC_PHI:%.*]] = phi <4 x float> [ <float 0.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %vector.ph ], [ [[TMP46:%.*]], %pred.load.continue14 ]
 ; CHECK: [[TMP44:%.*]] = fmul fast <4 x float> [[VEC_PHI]], [[TMP23:%.*]]
-; CHECK: [[TMP45]] = fmul fast <4 x float> [[TMP44]], [[TMP43]]
+; CHECK: [[TMP45:%.*]] = fmul fast <4 x float> [[TMP44]], [[TMP43:%.*]]
+; CHECK: [[TMP46]] = select <4 x i1> [[TMP3:%.*]], <4 x float> [[TMP45]], <4 x float> [[VEC_PHI]]
 ; CHECK: middle.block:
-; CHECK: [[TMP47:%.*]] = select <4 x i1> [[TMP3:%.*]], <4 x float> [[TMP45]], <4 x float> [[VEC_PHI]]
-; CHECK: [[TMP48:%.*]] = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float 1.000000e+00, <4 x float> [[TMP47]])
+; CHECK: [[TMP48:%.*]] = call fast float @llvm.experimental.vector.reduce.v2.fmul.f32.v4f32(float 1.000000e+00, <4 x float> [[TMP46]])
 ;
 entry:
   br label %for.body

 for.body: ; preds = %entry, %for.body
   %indvars.iv = phi i32 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
   %result.08 = phi float [ %fmul, %for.body ], [ 0.0, %entry ]
   %arrayidx = getelementptr inbounds float, float* %A, i32 %indvars.iv
[9 lines not shown]
 for.end: ; preds = %for.body, %entry
   %result.0.lcssa = phi float [ %fmul, %for.body ]
   ret float %result.0.lcssa
 }

 define i32 @reduction_min(i32* nocapture %A, i32* nocapture %B) {
 ; CHECK-LABEL: @reduction_min(
 ; CHECK: vector.body:
-; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ <i32 1000, i32 1000, i32 1000, i32 1000>, %vector.ph ], [ [[TMP25:%.*]], %pred.load.continue6 ]
+; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ <i32 1000, i32 1000, i32 1000, i32 1000>, %vector.ph ], [ [[TMP26:%.*]], %pred.load.continue6 ]
 ; CHECK: [[TMP24:%.*]] = icmp slt <4 x i32> [[VEC_PHI]], [[TMP23:%.*]]
-; CHECK: [[TMP25]] = select <4 x i1> [[TMP24]], <4 x i32> [[VEC_PHI]], <4 x i32> [[TMP23]]
+; CHECK: [[TMP25:%.*]] = select <4 x i1> [[TMP24]], <4 x i32> [[VEC_PHI]], <4 x i32> [[TMP23]]
+; CHECK: [[TMP26]] = select <4 x i1> [[TMP0:%.*]], <4 x i32> [[TMP25]], <4 x i32> [[VEC_PHI]]
 ; CHECK: middle.block:
-; CHECK: [[TMP27:%.*]] = select <4 x i1> [[TMP0:%.*]], <4 x i32> [[TMP25]], <4 x i32> [[VEC_PHI]]
-; CHECK: [[TMP28:%.*]] = call i32 @llvm.experimental.vector.reduce.smin.v4i32(<4 x i32> [[TMP27]])
+; CHECK: [[TMP28:%.*]] = call i32 @llvm.experimental.vector.reduce.smin.v4i32(<4 x i32> [[TMP26]])
 ;
 entry:
   br label %for.body

 for.body: ; preds = %entry, %for.body
   %indvars.iv = phi i32 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
   %result.08 = phi i32 [ %v0, %for.body ], [ 1000, %entry ]
   %arrayidx = getelementptr inbounds i32, i32* %A, i32 %indvars.iv
   %l0 = load i32, i32* %arrayidx, align 4
   %c0 = icmp slt i32 %result.08, %l0
   %v0 = select i1 %c0, i32 %result.08, i32 %l0
   %indvars.iv.next = add i32 %indvars.iv, 1
   %exitcond = icmp eq i32 %indvars.iv.next, 257
   br i1 %exitcond, label %for.end, label %for.body

 for.end: ; preds = %for.body, %entry
   %result.0.lcssa = phi i32 [ %v0, %for.body ]
   ret i32 %result.0.lcssa
 }

 define i32 @reduction_max(i32* nocapture %A, i32* nocapture %B) {
 ; CHECK-LABEL: @reduction_max(
 ; CHECK: vector.body:
-; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ <i32 1000, i32 1000, i32 1000, i32 1000>, %vector.ph ], [ [[TMP25:%.*]], %pred.load.continue6 ]
+; CHECK: [[VEC_PHI:%.*]] = phi <4 x i32> [ <i32 1000, i32 1000, i32 1000, i32 1000>, %vector.ph ], [ [[TMP26:%.*]], %pred.load.continue6 ]
 ; CHECK: [[TMP24:%.*]] = icmp ugt <4 x i32> [[VEC_PHI]], [[TMP23:%.*]]
-; CHECK: [[TMP25]] = select <4 x i1> [[TMP24]], <4 x i32> [[VEC_PHI]], <4 x i32> [[TMP23]]
+; CHECK: [[TMP25:%.*]] = select <4 x i1> [[TMP24]], <4 x i32> [[VEC_PHI]], <4 x i32> [[TMP23]]
+; CHECK: [[TMP26]] = select <4 x i1> [[TMP0:%.*]], <4 x i32> [[TMP25]], <4 x i32> [[VEC_PHI]]
 ; CHECK: middle.block:
-; CHECK: [[TMP27:%.*]] = select <4 x i1> [[TMP0:%.*]], <4 x i32> [[TMP25]], <4 x i32> [[VEC_PHI]]
-; CHECK: [[TMP28:%.*]] = call i32 @llvm.experimental.vector.reduce.umax.v4i32(<4 x i32> [[TMP27]])
+; CHECK: [[TMP28:%.*]] = call i32 @llvm.experimental.vector.reduce.umax.v4i32(<4 x i32> [[TMP26]])
 ;
 entry:
   br label %for.body

 for.body: ; preds = %entry, %for.body
   %indvars.iv = phi i32 [ %indvars.iv.next, %for.body ], [ 0, %entry ]
   %result.08 = phi i32 [ %v0, %for.body ], [ 1000, %entry ]
   %arrayidx = getelementptr inbounds i32, i32* %A, i32 %indvars.iv
[11 lines not shown]
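
In the updated CHECK lines for reduction_sum_single, reduction_or and reduction_xor, the in-loop select is applied to the loaded operand against zeroinitializer and the add/or/xor comes afterwards, rather than the select wrapping the result of the operation. This is presumably the work of the -instcombine pass in the RUN line, although the review does not say so. The C++ sketch below only checks, per lane, that the two shapes agree when 0 is the identity of the operation; the function names are hypothetical and nothing here is LLVM API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Shape emitted by the vectorizer: select(m, op(p, x), p).
template <typename Op>
std::uint32_t selectAfterOp(bool m, std::uint32_t p, std::uint32_t x, Op op) {
  return m ? op(p, x) : p;
}

// Shape seen in the CHECK lines: op(p, select(m, x, identity)).
template <typename Op>
std::uint32_t opAfterSelect(bool m, std::uint32_t p, std::uint32_t x, Op op,
                            std::uint32_t identity) {
  return op(p, m ? x : identity);
}

// Returns true if both shapes agree for add, or and xor (identity 0)
// on a handful of sample values, for both mask states.
bool foldsAgree() {
  const std::uint32_t samples[] = {0u, 1u, 7u, 0x80000000u, 0xFFFFFFFFu};
  const bool masks[] = {false, true};
  for (std::uint32_t p : samples) {
    for (std::uint32_t x : samples) {
      for (bool m : masks) {
        std::plus<std::uint32_t> add;
        std::bit_or<std::uint32_t> bitOr;
        std::bit_xor<std::uint32_t> bitXor;
        if (selectAfterOp(m, p, x, add) != opAfterSelect(m, p, x, add, 0u))
          return false;
        if (selectAfterOp(m, p, x, bitOr) != opAfterSelect(m, p, x, bitOr, 0u))
          return false;
        if (selectAfterOp(m, p, x, bitXor) != opAfterSelect(m, p, x, bitXor, 0u))
          return false;
      }
    }
  }
  return true;
}
```

The sketch covers only the three functions where the CHECK lines show this shape; the other tests in the file keep the select after the reduction operation.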