This is an archive of the discontinued LLVM Phabricator instance.

[LV] Add a new reduction pattern match
Closed · Public

Authored by rengolin on Jul 11 2018, 12:30 AM.

Details

Reviewers

mkuper
karthikthecool
TylerNowicki
mcrosier
t.p.northover
fhahn
RKSimon
dcaballe
hsaito
takahiro.miyoshi

Commits

rGcb19c8e3aafd: [LV] Add a new reduction pattern match
rL344172: [LV] Add a new reduction pattern match

Summary

Adding a new reduction pattern match for vectorizing code similar to TSVC s3111:

for (int i = 0; i < N; i++)
  if (a[i] > b)
    sum += a[i];

This patch adds support for fadd, fsub and fmul, as well as multiple
branches and different (but compatible) instructions (e.g. add+sub) in
different branches.
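
As a minimal sketch of the shape being matched (not code from the patch), the branchy loop and its if-converted equivalent can be written side by side in C++. The function names `sumIfBranch` and `sumIfSelect` are hypothetical and exist only for this illustration; the `?:` in the second function plays the role of the IR `select` whose operands are the updated value and the unchanged accumulator.

```cpp
#include <cstddef>
#include <vector>

// Branchy form, as in the summary: the accumulator is only updated
// when the guard holds.
float sumIfBranch(const std::vector<float> &a, float b) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i)
    if (a[i] > b)
      sum += a[i];
  return sum;
}

// If-converted form: the add is computed unconditionally and a select
// (the ?: below) keeps either the new value or the old accumulator.
float sumIfSelect(const std::vector<float> &a, float b) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    float add = sum + a[i];
    sum = (a[i] > b) ? add : sum;
  }
  return sum;
}
```

Both functions perform the same additions in the same order, so they return the same value for any input; only the control-flow shape differs.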

I have forwarded to trunk, added fsub and fmul functionality and
additional tests, but the credit goes to Takahiro, who did most of the
actual work.

Patch by Takahiro Miyoshi <takahiro.miyoshi@linaro.org>.

Event Timeline

takahiro.miyoshi created this revision. Jul 11 2018, 12:30 AM

Herald added a subscriber: llvm-commits. Jul 11 2018, 12:30 AM

Hi Takahiro,

The patch looks good, but I'm adding more people to have a closer look, as it has been a while since the last time I touched this code.

cheers,
--renato

Takahiro,

I'm not familiar with the Recurrence Descriptor code, but I suppose the following is considered RK_FloatAdd. If that's the case, we should be beefing up RK_FloatAdd rather than adding a new Kind. We can't keep adding a new kind every time we encounter a different pattern of reduction sum/product. The downside is possibly exposing a downstream bug, but that should only help generalize the reduction handling code. From a reduction analysis perspective, select (IF-converted) and phi (IF) should be the same thing. So, trying to handle this within FloatAdd/FloatMult should also help generalize the recurrence analysis code. That's how I look at the issue. Hope this helps.

Thanks,
Hideki

float foo(float *a, int n){
  float sum=0;
  for (int i=0;i<n;i++){
    if (a[i]>1.0){
      sum+=a[i];
    }
    else if (a[i]<3.0){
      sum+=2*a[i];
    }
  }
  return sum;
}

for.body: ; preds = %for.inc, %for.body.preheader
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.inc ]
  %sum.026 = phi float [ 0.000000e+00, %for.body.preheader ], [ %sum.1, %for.inc ]
  %arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
  %0 = load float, float* %arrayidx, align 4, !tbaa !2
  %cmp1 = fcmp ogt float %0, 1.000000e+00
  br i1 %cmp1, label %if.then, label %if.else

if.then: ; preds = %for.body
  %add = fadd fast float %0, %sum.026
  br label %for.inc

if.else: ; preds = %for.body
  %cmp8 = fcmp olt float %0, 3.000000e+00
  br i1 %cmp8, label %if.then10, label %for.inc

if.then10: ; preds = %if.else
  %mul = fmul fast float %0, 2.000000e+00
  %add13 = fadd fast float %mul, %sum.026
  br label %for.inc

for.inc: ; preds = %if.then, %if.then10, %if.else
  %sum.1 = phi float [ %add, %if.then ], [ %add13, %if.then10 ], [ %sum.026, %if.else ]
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body

Hi Hideki,

Thank you for your comments.

At first, as you said, I assumed the target loop was a FloatAdd pattern. However, it also looked somewhat similar to FloatMinMax.
Since the target loop combines aspects of FloatAdd and FloatMinMax, I added a new recurrence descriptor to express this.

I agree that adding a new kind for every different reduction pattern doesn't make sense.
So, I will try again to handle this within FloatAdd/FloatMult.

Best regards,
Takahiro

I modified my patch to use RK_FloatAdd instead of adding a new recurrence descriptor.
Also, the input IR for this target loop has already been converted to a select instruction, so I did not extend the if-conversion functionality.

takahiro.miyoshi updated this revision to Diff 159445. Aug 6 2018, 7:26 PM

takahiro.miyoshi updated this revision to Diff 159447. Aug 6 2018, 7:32 PM

In D49168#1190457, @takahiro.miyoshi wrote:

I modified my patch to use RK_FloatAdd instead of adding a new recurrence descriptor.
And, input IRs of this target loop is already converted into a select instruction, so I don't extend If-convert functionality.

Is this ready for another round of review? I took a quick look. I think this is the right direction to follow. Any specific reasons for restricting to FloatAdd? Should be the same for integers and SUB and MUL, as well, isn't it?
Please also add a negative test for "two use" cases but outside of the pattern you are looking for as well as "three use" negative test.

By any chance, did you try looking into creating a LIT test for the following conditional reduction where both IFs are converted to selects? I'm just curious about how much extension of your code would be needed to capture that.
If this is already caught, great. If it is low-hanging fruit, it's nice to extend a bit further (and try to see if that covers 3, 4, 5, ..., N cases).

if (cond1)
  sum+=...
if (cond2)
  sum+=...
[
if (cond3)
  sum+=...
if (cond4)
  sum+=...
...
if (condN)
  sum+=...
]

Thanks,
Hideki
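
To make the question above concrete, here is a hedged C++ sketch of the two-condition case, with each guarded update rewritten as a select so that the selects form a chain within one iteration. The names `sumTwoGuardsBranch` and `sumTwoGuardsSelect`, and the guards `lo` and `hi`, are hypothetical; the thread does not say whether the patch already recognizes this shape.

```cpp
#include <cstddef>
#include <vector>

// Source-level form: two independent guarded updates per iteration.
float sumTwoGuardsBranch(const std::vector<float> &a, float lo, float hi) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    if (a[i] > lo)
      sum += a[i];
    if (a[i] < hi)
      sum += a[i];
  }
  return sum;
}

// If-converted form: each guard becomes a select, and the second select
// consumes the result of the first. The last select in the chain is the
// value carried into the next iteration.
float sumTwoGuardsSelect(const std::vector<float> &a, float lo, float hi) {
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    float s1 = (a[i] > lo) ? sum + a[i] : sum; // first select
    float s2 = (a[i] < hi) ? s1 + a[i] : s1;   // last select in the chain
    sum = s2;
  }
  return sum;
}
```

Extending the chain to N guards adds one more select per guard, each taking the previous select as its "unchanged" operand.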

Ayal mentioned this in D50474: [LV] Vectorize header phis that feed from if-convertable latch phis. Aug 9 2018, 3:27 PM

Hi Hideki,

Takahiro is on leave, so I'm taking this work to make sure we don't delay much more.

I have added fsub and fmul functionality and a few tests (with multiple branches), including some that should not vectorise.

All the credit still goes to Takahiro.

cheers,
--renato

Herald added a subscriber: rkruppe. Oct 8 2018, 10:15 AM

rengolin updated this revision to Diff 168678. Oct 8 2018, 10:16 AM

rengolin edited the summary of this revision.

Thanks a lot, Renato. Will take a look quick.

Code looks good. Just a minor suggestion on the comment. Looking at the LIT test.

lib/Analysis/IVDescriptors.cpp
503	where the Instruction argument I is the last select in the chain.

rengolin added inline comments. Oct 8 2018, 2:05 PM

lib/Analysis/IVDescriptors.cpp
503	Good point! Probably better to also rename the argument and simplify the cast in the beginning of the function. I didn't want to change much, but I guess that's more cosmetic than anything. :)

LGTM. Please wait for a few days to give others time to respond if they'd like to.

test/Transforms/LoopVectorize/if-reduction.ll
2	My preference is to have vectorization/non-vectorization checked by itself and then have another RUN line to check the expected optimization by InstCombine. That way, we'll quickly know which part changed when the test fails. I don't insist, though.
584	Is this the correct check here?

This revision is now accepted and ready to land. Oct 8 2018, 2:28 PM

In D49168#1258185, @hsaito wrote:

LGTM. Please wait for a few days to give others time to respond if they'd like to.

Thanks Hideki!

I'll update with the review comments and wait a few days.

test/Transforms/LoopVectorize/if-reduction.ll
2	I see what you mean, will try to separate them. I'm not even sure the instcombine is necessary for the results we check, though.
584	ouch, no, regex left-over. Will fix.

hsaito added inline comments. Oct 8 2018, 3:12 PM

test/Transforms/LoopVectorize/if-reduction.ll
37	One comment somehow went missing. I suggest adding one more negative test, for example, storing %add to y[i]. Single use of %add should be checked, I think. If we find a bug there, that's an easy thing to remedy.

Changes to comment:

Improved comments on isConditionalRdxPattern
Removed instcombine pass from test
Added write negative test
Fix typo in CHECK line

rengolin marked 7 inline comments as done. Oct 9 2018, 3:09 AM

LGTM.

Closed by commit rL344172: [LV] Add a new reduction pattern match (authored by rengolin). Oct 10 2018, 11:51 AM

This revision was automatically updated to reflect the committed changes.

This patch doesn't correctly handle isFast(). By default, isRecurrenceInstr() should check I->isFast(), but for this pattern I is a Select, to which isFast() doesn't apply; the flag should instead be checked on the FAdd/FMul inside isConditionalRdxPattern().

It caused several of our internal applications to fail. The following is a simple reproduction.

static double bar(double* v) {
  double t = 0.0;
  for (int i=0; i<10000; i++) {
    double s = v[i];
    if (s > 0) {
      t += s;
    }
  }
  return t;
}

double foo(double* v)
{
  return bar(v);
}

clang++ -msse4.2 -c -O2 t9.cc -save-temps

In the generated code, the loop is wrongly vectorized.
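
The reason the fast-math check matters can be illustrated with a small, self-contained C++ sketch. This is an assumption-laden model, not the vectorizer's actual output: `sumPositiveScalar` follows the source order of the reproducer, while the hypothetical `sumPositiveFourLanes` imitates a 4-wide reduction by keeping four partial sums and combining them at the end. Floating-point addition is not associative, so the two orders can round differently, and the reordering is only permitted when the fadd itself allows reassociation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar order: a single accumulator, elements added strictly left to right.
double sumPositiveScalar(const std::vector<double> &v) {
  double t = 0.0;
  for (double s : v)
    if (s > 0)
      t += s;
  return t;
}

// Models a 4-wide vectorized reduction: four independent partial sums
// that are combined after the loop. Assumes the length is a multiple of 4
// (no remainder loop in this sketch).
double sumPositiveFourLanes(const std::vector<double> &v) {
  assert(v.size() % 4 == 0 && "sketch assumes no remainder loop");
  std::array<double, 4> lanes = {0.0, 0.0, 0.0, 0.0};
  for (std::size_t i = 0; i < v.size(); i += 4)
    for (std::size_t l = 0; l < 4; ++l) {
      double s = v[i + l];
      lanes[l] = (s > 0) ? lanes[l] + s : lanes[l]; // per-lane select
    }
  return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

With IEEE-754 doubles and round-to-nearest-even, the input `{1e16, 1, 1, 1, 1, 1, 1, 1}` separates the two: each scalar `+ 1` is absorbed by the large accumulator, whereas the lane version adds the small values together before they meet `1e16`.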

In D49168#1277837, @Carrot wrote:

This patch doesn't correctly handle isFast(). By default isRecurrenceInstr() should check I->isFast(), but for this pattern I is Select, isFast() doesn't apply to it, it should be checked against FAdd/FMul inside isConditionalRdxPattern().

Benjamin has said something similar, but I could not reproduce. I'll revert the patch and fix the issue. Thanks for the reproducer case!

--renato

Reverted in r345465. Will take a look and land again when fixed. Thanks!

Revision Contents

Path	Size
include/llvm/Analysis/IVDescriptors.h	7 lines
lib/Analysis/IVDescriptors.cpp	71 lines
test/Transforms/LoopVectorize/if-reduction.ll	666 lines

Diff 168768

include/llvm/Analysis/IVDescriptors.h

Context not available.

   /// Returns true if instruction I has multiple uses in Insts
   static bool hasMultipleUsesOf(Instruction *I,
-                                SmallPtrSetImpl<Instruction *> &Insts);
+                                SmallPtrSetImpl<Instruction *> &Insts,
+                                unsigned MaxNumUses);

   /// Returns true if all uses of the instruction I is within the Set.
   static bool areAllUsesIn(Instruction *I, SmallPtrSetImpl<Instruction *> &Set);
Context not available.
   /// or max(X, Y).
   static InstDesc isMinMaxSelectCmpPattern(Instruction *I, InstDesc &Prev);

+  /// Returns a struct describing if the instruction is a
+  /// Select(FCmp(X, Y), (Z = X op PHINode), PHINode) instruction pattern.
+  static InstDesc isConditionalRdxPattern(RecurrenceKind Kind, Instruction *I);
+
   /// Returns identity corresponding to the RecurrenceKind.
   static Constant *getRecurrenceIdentity(RecurrenceKind K, Type *Tp);

Context not available.

lib/Analysis/IVDescriptors.cpp

Context not available.
       return false;
     }

+    bool IsASelect = isa<SelectInst>(Cur);
+
+    // A conditional reduction operation must only have 2 or less uses in
+    // VisitedInsts.
+    if (IsASelect && (Kind == RK_FloatAdd || Kind == RK_FloatMult) &&
+        hasMultipleUsesOf(Cur, VisitedInsts, 2))
+      return false;
+
     // A reduction operation must only have one use of the reduction value.
-    if (!IsAPhi && Kind != RK_IntegerMinMax && Kind != RK_FloatMinMax &&
-        hasMultipleUsesOf(Cur, VisitedInsts))
+    if (!IsAPhi && !IsASelect && Kind != RK_IntegerMinMax &&
+        Kind != RK_FloatMinMax && hasMultipleUsesOf(Cur, VisitedInsts, 1))
       return false;

     // All inputs to a PHI node must be a reduction value.
Context not available.
       } else if (!isa<PHINode>(UI) &&
                  ((!isa<FCmpInst>(UI) && !isa<ICmpInst>(UI) &&
                    !isa<SelectInst>(UI)) ||
-                  !isMinMaxSelectCmpPattern(UI, IgnoredVal).isRecurrence()))
+                  (!isConditionalRdxPattern(Kind, UI).isRecurrence() &&
+                   !isMinMaxSelectCmpPattern(UI, IgnoredVal).isRecurrence())))
         return false;

       // Remember that we completed the cycle.
Context not available.
   return InstDesc(false, I);
 }

+/// Returns true if the select instruction has users in the compare-and-add
+/// reduction pattern below. The select instruction argument is the last one
+/// in the sequence.
+///
+/// %sum.1 = phi ...
+/// ...
+/// %cmp = fcmp pred %0, %CFP
+/// %add = fadd %0, %sum.1
+/// %sum.2 = select %cmp, %add, %sum.1
+RecurrenceDescriptor::InstDesc
+RecurrenceDescriptor::isConditionalRdxPattern(
+    RecurrenceKind Kind, Instruction *I) {
+  SelectInst *SI = dyn_cast<SelectInst>(I);
+  if (!SI)
+    return InstDesc(false, I);
+
+  CmpInst *CI = dyn_cast<CmpInst>(SI->getCondition());
+  // Only handle single use cases for now.
+  if (!CI || !CI->hasOneUse())
+    return InstDesc(false, I);
+
+  Value *TrueVal = SI->getTrueValue();
+  Value *FalseVal = SI->getFalseValue();
+  // Handle only when either of operands of select instruction is a PHI
+  // node for now.
+  if ((isa<PHINode>(TrueVal) && isa<PHINode>(FalseVal)) ||
+      (!isa<PHINode>(TrueVal) && !isa<PHINode>(FalseVal)))
+    return InstDesc(false, I);
+
+  Instruction *I1 =
+      isa<PHINode>(*TrueVal) ? dyn_cast<Instruction>(FalseVal)
+                             : dyn_cast<Instruction>(TrueVal);
+  if (!I1 || !I1->isBinaryOp())
+    return InstDesc(false, I);
+
+  Value *Op1, *Op2;
+  if (m_FAdd(m_Value(Op1), m_Value(Op2)).match(I1) ||
+      m_FSub(m_Value(Op1), m_Value(Op2)).match(I1))
+    return InstDesc(Kind == RK_FloatAdd, SI);
+
+  if (m_FMul(m_Value(Op1), m_Value(Op2)).match(I1))
+    return InstDesc(Kind == RK_FloatMult, SI);
+
+  return InstDesc(false, I);
+}
+
 RecurrenceDescriptor::InstDesc
 RecurrenceDescriptor::isRecurrenceInstr(Instruction *I, RecurrenceKind Kind,
                                         InstDesc &Prev, bool HasFunNoNaNAttr) {
Context not available.
   case Instruction::FSub:
   case Instruction::FAdd:
     return InstDesc(Kind == RK_FloatAdd, I, UAI);
+  case Instruction::Select:
+    if (Kind == RK_FloatAdd || Kind == RK_FloatMult)
+      return isConditionalRdxPattern(Kind, I);
+    LLVM_FALLTHROUGH;
   case Instruction::FCmp:
   case Instruction::ICmp:
-  case Instruction::Select:
     if (Kind != RK_IntegerMinMax &&
         (!HasFunNoNaNAttr || Kind != RK_FloatMinMax))
       return InstDesc(false, I);
Context not available.
 }

 bool RecurrenceDescriptor::hasMultipleUsesOf(
-    Instruction *I, SmallPtrSetImpl<Instruction *> &Insts) {
+    Instruction *I, SmallPtrSetImpl<Instruction *> &Insts,
+    unsigned MaxNumUses) {
   unsigned NumUses = 0;
   for (User::op_iterator Use = I->op_begin(), E = I->op_end(); Use != E;
        ++Use) {
     if (Insts.count(dyn_cast<Instruction>(*Use)))
       ++NumUses;
-    if (NumUses > 1)
+    if (NumUses > MaxNumUses)
       return true;
   }

Context not available.

test/Transforms/LoopVectorize/if-reduction.ll

This file was added.

				; RUN: opt -S -loop-vectorize -force-vector-width=4 -force-vector-interleave=1 < %s | FileCheck %s

				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"

				; Float pattern:
				; Check vectorization of reduction code which has an fadd instruction after
				; an fcmp instruction which compares an array element and 0.
				;
				; float fcmp_0_fadd_select1(float * restrict x, const int N) {
				; float sum = 0.
				; for (int i = 0; i < N; ++i)
				; if (x[i] > (float)0.)
				; sum += x[i];
				; return sum;
				; }

				; CHECK-LABEL: @fcmp_0_fadd_select1(
				; CHECK: %[[V1:.*]] = fcmp fast ogt <4 x float> %[[V0:.*]], zeroinitializer
				; CHECK: %[[V3:.*]] = fadd fast <4 x float> %[[V0]], %[[V2:.*]]
				; CHECK: select <4 x i1> %[[V1]], <4 x float> %[[V3]], <4 x float> %[[V2]]
				define float @fcmp_0_fadd_select1(float* noalias %x, i32 %N) nounwind readonly {
				entry:
				%cmp.1 = icmp sgt i32 %N, 0
				br i1 %cmp.1, label %for.header, label %for.end

				for.header: ; preds = %entry
				%zext = zext i32 %N to i64
				br label %for.body

				for.body: ; preds = %header, %for.body
				%indvars.iv = phi i64 [ 0, %for.header ], [ %indvars.iv.next, %for.body ]
				%sum.1 = phi float [ 0.000000e+00, %for.header ], [ %sum.2, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %x, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%cmp.2 = fcmp fast ogt float %0, 0.000000e+00
				%add = fadd fast float %0, %sum.1
				%sum.2 = select i1 %cmp.2, float %add, float %sum.1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %zext
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				%1 = phi float [ 0.000000e+00, %entry ], [ %sum.2, %for.body ]
				ret float %1
				}

				; Double pattern:
				; Check vectorization of reduction code which has an fadd instruction after
				; an fcmp instruction which compares an array element and 0.
				;
				; double fcmp_0_fadd_select2(double * restrict x, const int N) {
				; double sum = 0.
				; for (int i = 0; i < N; ++i)
				; if (x[i] > 0.)
				; sum += x[i];
				; return sum;
				; }

				; CHECK-LABEL: @fcmp_0_fadd_select2(
				; CHECK: %[[V1:.*]] = fcmp fast ogt <4 x double> %[[V0:.*]], zeroinitializer
				; CHECK: %[[V3:.*]] = fadd fast <4 x double> %[[V0]], %[[V2:.*]]
				; CHECK: select <4 x i1> %[[V1]], <4 x double> %[[V3]], <4 x double> %[[V2]]
				define double @fcmp_0_fadd_select2(double* noalias %x, i32 %N) nounwind readonly {
				entry:
				%cmp.1 = icmp sgt i32 %N, 0
				br i1 %cmp.1, label %for.header, label %for.end

				for.header: ; preds = %entry
				%zext = zext i32 %N to i64
				br label %for.body

				for.body: ; preds = %header, %for.body
				%indvars.iv = phi i64 [ 0, %for.header ], [ %indvars.iv.next, %for.body ]
				%sum.1 = phi double [ 0.000000e+00, %for.header ], [ %sum.2, %for.body ]
				%arrayidx = getelementptr inbounds double, double* %x, i64 %indvars.iv
				%0 = load double, double* %arrayidx, align 4
				%cmp.2 = fcmp fast ogt double %0, 0.000000e+00
				%add = fadd fast double %0, %sum.1
				%sum.2 = select i1 %cmp.2, double %add, double %sum.1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %zext
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				%1 = phi double [ 0.000000e+00, %entry ], [ %sum.2, %for.body ]
				ret double %1
				}

				; Float pattern:
				; Check vectorization of reduction code which has an fadd instruction after
				; an fcmp instruction which compares an array element and a floating-point
				; value.
				;
				; float fcmp_val_fadd_select1(float * restrict x, float y, const int N) {
				; float sum = 0.
				; for (int i = 0; i < N; ++i)
				; if (x[i] > y)
				; sum += x[i];
				; return sum;
				; }

				; CHECK-LABEL: @fcmp_val_fadd_select1(
				; CHECK: %[[V1:.*]] = fcmp fast ogt <4 x float> %[[V0:.*]], %broadcast.splat2
				; CHECK: %[[V3:.*]] = fadd fast <4 x float> %[[V0]], %[[V2:.*]]
				; CHECK: select <4 x i1> %[[V1]], <4 x float> %[[V3]], <4 x float> %[[V2]]
				define float @fcmp_val_fadd_select1(float* noalias %x, float %y, i32 %N) nounwind readonly {
				entry:
				%cmp.1 = icmp sgt i32 %N, 0
				br i1 %cmp.1, label %for.header, label %for.end

				for.header: ; preds = %entry
				%zext = zext i32 %N to i64
				br label %for.body

				for.body: ; preds = %header, %for.body
				%indvars.iv = phi i64 [ 0, %for.header ], [ %indvars.iv.next, %for.body ]
				%sum.1 = phi float [ 0.000000e+00, %for.header ], [ %sum.2, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %x, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%cmp.2 = fcmp fast ogt float %0, %y
				%add = fadd fast float %0, %sum.1
				%sum.2 = select i1 %cmp.2, float %add, float %sum.1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %zext
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				%1 = phi float [ 0.000000e+00, %entry ], [ %sum.2, %for.body ]
				ret float %1
				}

				; Double pattern:
				; Check vectorization of reduction code which has an fadd instruction after
				; an fcmp instruction which compares an array element and a floating-point
				; value.
				;
				; double fcmp_val_fadd_select2(double * restrict x, double y, const int N) {
				; double sum = 0.
				; for (int i = 0; i < N; ++i)
				; if (x[i] > y)
				; sum += x[i];
				; return sum;
				; }

				; CHECK-LABEL: @fcmp_val_fadd_select2(
				; CHECK: %[[V1:.*]] = fcmp fast ogt <4 x double> %[[V0:.*]], %broadcast.splat2
				; CHECK: %[[V3:.*]] = fadd fast <4 x double> %[[V0]], %[[V2:.*]]
				; CHECK: select <4 x i1> %[[V1]], <4 x double> %[[V3]], <4 x double> %[[V2]]
				define double @fcmp_val_fadd_select2(double* noalias %x, double %y, i32 %N) nounwind readonly {
				entry:
				%cmp.1 = icmp sgt i32 %N, 0
				br i1 %cmp.1, label %for.header, label %for.end

				for.header: ; preds = %entry
				%zext = zext i32 %N to i64
				br label %for.body

				for.body: ; preds = %header, %for.body
				%indvars.iv = phi i64 [ 0, %for.header ], [ %indvars.iv.next, %for.body ]
				%sum.1 = phi double [ 0.000000e+00, %for.header ], [ %sum.2, %for.body ]
				%arrayidx = getelementptr inbounds double, double* %x, i64 %indvars.iv
				%0 = load double, double* %arrayidx, align 4
				%cmp.2 = fcmp fast ogt double %0, %y
				%add = fadd fast double %0, %sum.1
				%sum.2 = select i1 %cmp.2, double %add, double %sum.1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %zext
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				%1 = phi double [ 0.000000e+00, %entry ], [ %sum.2, %for.body ]
				ret double %1
				}

				; Float pattern:
				; Check vectorization of reduction code which has an fadd instruction after
				; an fcmp instruction which compares an array element and another array
				; element.
				;
				; float fcmp_array_elm_fadd_select1(float * restrict x, float * restrict y,
				; const int N) {
				; float sum = 0.
				; for (int i = 0; i < N; ++i)
				; if (x[i] > y[i])
				; sum += x[i];
				; return sum;
				; }

				; CHECK-LABEL: @fcmp_array_elm_fadd_select1(
				; CHECK: %[[V2:.*]] = fcmp fast ogt <4 x float> %[[V0:.*]], %[[V1:.*]]
				; CHECK: %[[V4:.*]] = fadd fast <4 x float> %[[V0]], %[[V3:.*]]
				; CHECK: select <4 x i1> %[[V2]], <4 x float> %[[V4]], <4 x float> %[[V3]]
				define float @fcmp_array_elm_fadd_select1(float* noalias %x, float* noalias %y, i32 %N) nounwind readonly {
				entry:
				%cmp.1 = icmp sgt i32 %N, 0
				br i1 %cmp.1, label %for.header, label %for.end

				for.header: ; preds = %entry
				%zext = zext i32 %N to i64
				br label %for.body

				for.body: ; preds = %for.body, %for.header
				%indvars.iv = phi i64 [ 0, %for.header ], [ %indvars.iv.next, %for.body ]
				%sum.1 = phi float [ 0.000000e+00, %for.header ], [ %sum.2, %for.body ]
				%arrayidx.1 = getelementptr inbounds float, float* %x, i64 %indvars.iv
				%0 = load float, float* %arrayidx.1, align 4
				%arrayidx.2 = getelementptr inbounds float, float* %y, i64 %indvars.iv
				%1 = load float, float* %arrayidx.2, align 4
				%cmp.2 = fcmp fast ogt float %0, %1
				%add = fadd fast float %0, %sum.1
				%sum.2 = select i1 %cmp.2, float %add, float %sum.1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %zext
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				%2 = phi float [ 0.000000e+00, %entry ], [ %sum.2, %for.body ]
				ret float %2
				}

				; Double pattern:
				; Check vectorization of reduction code which has an fadd instruction after
				; an fcmp instruction which compares an array element and another array
				; element.
				;
				; double fcmp_array_elm_fadd_select2(double * restrict x, double * restrict y,
				; const int N) {
				; double sum = 0.
				; for (int i = 0; i < N; ++i)
				; if (x[i] > y[i])
				; sum += x[i];
				; return sum;
				; }

				; CHECK-LABEL: @fcmp_array_elm_fadd_select2(
				; CHECK: %[[V2:.*]] = fcmp fast ogt <4 x double> %[[V0:.*]], %[[V1:.*]]
				; CHECK: %[[V4:.*]] = fadd fast <4 x double> %[[V0]], %[[V3:.*]]
				; CHECK: select <4 x i1> %[[V2]], <4 x double> %[[V4]], <4 x double> %[[V3]]
				define double @fcmp_array_elm_fadd_select2(double* noalias %x, double* noalias %y, i32 %N) nounwind readonly {
				entry:
				%cmp.1 = icmp sgt i32 %N, 0
				br i1 %cmp.1, label %for.header, label %for.end

				for.header: ; preds = %entry
				%zext = zext i32 %N to i64
				br label %for.body

				for.body: ; preds = %for.body, %for.header
				%indvars.iv = phi i64 [ 0, %for.header ], [ %indvars.iv.next, %for.body ]
				%sum.1 = phi double [ 0.000000e+00, %for.header ], [ %sum.2, %for.body ]
				%arrayidx.1 = getelementptr inbounds double, double* %x, i64 %indvars.iv
				%0 = load double, double* %arrayidx.1, align 4
				%arrayidx.2 = getelementptr inbounds double, double* %y, i64 %indvars.iv
				%1 = load double, double* %arrayidx.2, align 4
				%cmp.2 = fcmp fast ogt double %0, %1
				%add = fadd fast double %0, %sum.1
				%sum.2 = select i1 %cmp.2, double %add, double %sum.1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %zext
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				%2 = phi double [ 0.000000e+00, %entry ], [ %sum.2, %for.body ]
				ret double %2
				}

				; Float pattern:
				; Check vectorization of reduction code which has an fsub instruction after
				; an fcmp instruction which compares an array element and 0.
				;
				; float fcmp_0_fsub_select1(float * restrict x, const int N) {
				; float sum = 0.
				; for (int i = 0; i < N; ++i)
				; if (x[i] > (float)0.)
				; sum -= x[i];
				; return sum;
				; }

				; CHECK-LABEL: @fcmp_0_fsub_select1(
				; CHECK: %[[V1:.*]] = fcmp ogt <4 x float> %[[V0:.*]], zeroinitializer
				; CHECK: %[[V3:.*]] = fsub <4 x float> %[[V2:.*]], %[[V0]]
				; CHECK: select <4 x i1> %[[V1]], <4 x float> %[[V3]], <4 x float> %[[V2]]
				define float @fcmp_0_fsub_select1(float* noalias %x, i32 %N) nounwind readonly {
				entry:
				%cmp.1 = icmp sgt i32 %N, 0
				br i1 %cmp.1, label %for.header, label %for.end

				for.header: ; preds = %entry
				%zext = zext i32 %N to i64
				br label %for.body

				for.body: ; preds = %for.body, %for.header
				%indvars.iv = phi i64 [ 0, %for.header ], [ %indvars.iv.next, %for.body ]
				%sum.1 = phi float [ 0.000000e+00, %for.header ], [ %sum.2, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %x, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%cmp.2 = fcmp ogt float %0, 0.000000e+00
				%sub = fsub float %sum.1, %0
				%sum.2 = select i1 %cmp.2, float %sub, float %sum.1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %zext
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				%1 = phi float [ 0.000000e+00, %entry ], [ %sum.2, %for.body ]
				ret float %1
				}

				; Double pattern:
				; Check vectorization of reduction code which has an fsub instruction after
				; an fcmp instruction which compares an array element and 0.
				;
				; double fcmp_0_fsub_select2(double * restrict x, const int N) {
				; double sum = 0.
				; for (int i = 0; i < N; ++i)
				; if (x[i] > 0.)
				; sum -= x[i];
				; return sum;
				; }

				; CHECK-LABEL: @fcmp_0_fsub_select2(
				; CHECK: %[[V1:.*]] = fcmp ogt <4 x double> %[[V0:.*]], zeroinitializer
				; CHECK: %[[V3:.*]] = fsub <4 x double> %[[V2:.*]], %[[V0]]
				; CHECK: select <4 x i1> %[[V1]], <4 x double> %[[V3]], <4 x double> %[[V2]]
				define double @fcmp_0_fsub_select2(double* noalias %x, i32 %N) nounwind readonly {
				entry:
				%cmp.1 = icmp sgt i32 %N, 0
				br i1 %cmp.1, label %for.header, label %for.end

				for.header: ; preds = %entry
				%zext = zext i32 %N to i64
				br label %for.body

				for.body: ; preds = %for.body, %for.header
				%indvars.iv = phi i64 [ 0, %for.header ], [ %indvars.iv.next, %for.body ]
				%sum.1 = phi double [ 0.000000e+00, %for.header ], [ %sum.2, %for.body ]
				%arrayidx = getelementptr inbounds double, double* %x, i64 %indvars.iv
				%0 = load double, double* %arrayidx, align 4
				%cmp.2 = fcmp ogt double %0, 0.000000e+00
				%sub = fsub double %sum.1, %0
				%sum.2 = select i1 %cmp.2, double %sub, double %sum.1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %zext
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				%1 = phi double [ 0.000000e+00, %entry ], [ %sum.2, %for.body ]
				ret double %1
				}

				; Float pattern:
				; Check vectorization of reduction code which has an fmul instruction after
				; an fcmp instruction which compares an array element and 0.
				;
				; float fcmp_0_fmult_select1(float * restrict x, const int N) {
				; float sum = 0.
				; for (int i = 0; i < N; ++i)
				; if (x[i] > (float)0.)
				; sum *= x[i];
				; return sum;
				; }

				; CHECK-LABEL: @fcmp_0_fmult_select1(
				; CHECK: %[[V1:.*]] = fcmp ogt <4 x float> %[[V0:.*]], zeroinitializer
				; CHECK: %[[V3:.*]] = fmul <4 x float> %[[V2:.*]], %[[V0]]
				; CHECK: select <4 x i1> %[[V1]], <4 x float> %[[V3]], <4 x float> %[[V2]]
				define float @fcmp_0_fmult_select1(float* noalias %x, i32 %N) nounwind readonly {
				entry:
				%cmp.1 = icmp sgt i32 %N, 0
				br i1 %cmp.1, label %for.header, label %for.end

				for.header: ; preds = %entry
				%zext = zext i32 %N to i64
				br label %for.body

				for.body: ; preds = %for.body, %for.header
				%indvars.iv = phi i64 [ 0, %for.header ], [ %indvars.iv.next, %for.body ]
				%sum.1 = phi float [ 0.000000e+00, %for.header ], [ %sum.2, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %x, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%cmp.2 = fcmp ogt float %0, 0.000000e+00
				%mult = fmul float %sum.1, %0
				%sum.2 = select i1 %cmp.2, float %mult, float %sum.1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %zext
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				%1 = phi float [ 0.000000e+00, %entry ], [ %sum.2, %for.body ]
				ret float %1
				}

				; Double pattern:
				; Check vectorization of reduction code which has an fmul instruction after
				; an fcmp instruction which compares an array element and 0.
				;
				; double fcmp_0_fmult_select2(double * restrict x, const int N) {
				; double sum = 0.;
				; for (int i = 0; i < N; ++i)
				; if (x[i] > 0.)
				; sum *= x[i];
				; return sum;
				; }

				; CHECK-LABEL: @fcmp_0_fmult_select2(
				; CHECK: %[[V1:.*]] = fcmp ogt <4 x double> %[[V0:.*]], zeroinitializer
				; CHECK: %[[V3:.*]] = fmul <4 x double> %[[V2:.*]], %[[V0]]
				; CHECK: select <4 x i1> %[[V1]], <4 x double> %[[V3]], <4 x double> %[[V2]]
				define double @fcmp_0_fmult_select2(double* noalias %x, i32 %N) nounwind readonly {
				entry:
				%cmp.1 = icmp sgt i32 %N, 0
				br i1 %cmp.1, label %for.header, label %for.end

				for.header: ; preds = %entry
				%zext = zext i32 %N to i64
				br label %for.body

				for.body: ; preds = %for.body, %for.header
				%indvars.iv = phi i64 [ 0, %for.header ], [ %indvars.iv.next, %for.body ]
				%sum.1 = phi double [ 0.000000e+00, %for.header ], [ %sum.2, %for.body ]
				%arrayidx = getelementptr inbounds double, double* %x, i64 %indvars.iv
				%0 = load double, double* %arrayidx, align 4
				%cmp.2 = fcmp ogt double %0, 0.000000e+00
				%mult = fmul double %sum.1, %0
				%sum.2 = select i1 %cmp.2, double %mult, double %sum.1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %zext
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				%1 = phi double [ 0.000000e+00, %entry ], [ %sum.2, %for.body ]
				ret double %1
				}

				; Float multi pattern
				; Check vectorisation of reduction code with a pair of selects to different
				; fadd patterns.
				;
				; float fcmp_multi(float *a, int n) {
				; float sum=0.0;
				; for (int i=0;i<n;i++) {
				; if (a[i]>1.0)
				; sum+=a[i];
				; else if (a[i]<3.0)
				; sum+=2*a[i];
				; else
				; sum+=3*a[i];
				; }
				; return sum;
				; }

				; CHECK-LABEL: @fcmp_multi(
				; CHECK: %[[C1:.*]] = fcmp ogt <4 x float> %[[V0:.*]], <float 1.000000e+00,
				; CHECK: %[[C2:.*]] = fcmp olt <4 x float> %[[V0]], <float 3.000000e+00,
				; CHECK-DAG: %[[M1:.*]] = fmul fast <4 x float> %[[V0]], <float 3.000000e+00,
				; CHECK-DAG: %[[M2:.*]] = fmul fast <4 x float> %[[V0]], <float 2.000000e+00,
				; CHECK: %[[C11:.*]] = xor <4 x i1> %[[C1]], <i1 true,
				; CHECK-DAG: %[[C12:.*]] = and <4 x i1> %[[C2]], %[[C11]]
				; CHECK-DAG: %[[C21:.*]] = xor <4 x i1> %[[C2]], <i1 true,
				; CHECK: %[[C22:.*]] = and <4 x i1> %[[C21]], %[[C11]]
				; CHECK: %[[S1:.*]] = select <4 x i1> %[[C22]], <4 x float> %[[M1]], <4 x float> %[[M2]]
				; CHECK: %[[S2:.*]] = select <4 x i1> %[[C1]], <4 x float> %[[V0]], <4 x float> %[[S1]]
				; CHECK: fadd fast <4 x float> %[[S2]],
				define float @fcmp_multi(float* nocapture readonly %a, i32 %n) nounwind readonly {
				entry:
				%cmp10 = icmp sgt i32 %n, 0
				br i1 %cmp10, label %for.body.preheader, label %for.end

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body: ; preds = %for.inc, %for.body.preheader
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.inc ]
				%sum.011 = phi float [ 0.000000e+00, %for.body.preheader ], [ %sum.1, %for.inc ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%cmp1 = fcmp ogt float %0, 1.000000e+00
				br i1 %cmp1, label %for.inc, label %if.else

				if.else: ; preds = %for.body
				%cmp8 = fcmp olt float %0, 3.000000e+00
				br i1 %cmp8, label %if.then10, label %if.else14

				if.then10: ; preds = %if.else
				%mul = fmul fast float %0, 2.000000e+00
				br label %for.inc

				if.else14: ; preds = %if.else
				%mul17 = fmul fast float %0, 3.000000e+00
				br label %for.inc

				for.inc: ; preds = %for.body, %if.else14, %if.then10
				%.pn = phi float [ %mul, %if.then10 ], [ %mul17, %if.else14 ], [ %0, %for.body ]
				%sum.1 = fadd fast float %.pn, %sum.011
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.inc, %entry
				%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %sum.1, %for.inc ]
				ret float %sum.0.lcssa
				}

				; Float fadd + fsub patterns
				; Check vectorisation of reduction code with a pair of selects to different
				; instructions { fadd, fsub } but equivalent (change in constant).
				;
				; float fcmp_fadd_fsub(float *a, int n) {
				; float sum=0.0;
				; for (int i=0;i<n;i++) {
				; if (a[i]>1.0)
				; sum+=a[i];
				; else if (a[i]<3.0)
				; sum-=a[i];
				; }
				; return sum;
				; }

				; CHECK-LABEL: @fcmp_fadd_fsub(
				; CHECK: %[[C1:.*]] = fcmp ogt <4 x float> %[[V0:.*]], <float 1.000000e+00,
				; CHECK: %[[C2:.*]] = fcmp olt <4 x float> %[[V0]], <float 3.000000e+00,
				; CHECK-DAG: %[[SUB:.*]] = fsub fast <4 x float>
				; CHECK-DAG: %[[ADD:.*]] = fadd fast <4 x float>
				; CHECK: %[[C11:.*]] = xor <4 x i1> %[[C1]], <i1 true,
				; CHECK-DAG: %[[C12:.*]] = and <4 x i1> %[[C2]], %[[C11]]
				; CHECK-DAG: %[[C21:.*]] = xor <4 x i1> %[[C2]], <i1 true,
				; CHECK: %[[C22:.*]] = and <4 x i1> %[[C21]], %[[C11]]
				; CHECK: %[[S1:.*]] = select <4 x i1> %[[C12]], <4 x float> %[[SUB]], <4 x float> %[[ADD]]
				; CHECK: %[[S2:.*]] = select <4 x i1> %[[C22]], {{.*}} <4 x float> %[[S1]]
				define float @fcmp_fadd_fsub(float* nocapture readonly %a, i32 %n) nounwind readonly {
				entry:
				%cmp9 = icmp sgt i32 %n, 0
				br i1 %cmp9, label %for.body.preheader, label %for.end

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body: ; preds = %for.inc, %for.body.preheader
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.inc ]
				%sum.010 = phi float [ 0.000000e+00, %for.body.preheader ], [ %sum.1, %for.inc ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%cmp1 = fcmp ogt float %0, 1.000000e+00
				br i1 %cmp1, label %if.then, label %if.else

				if.then: ; preds = %for.body
				%add = fadd fast float %0, %sum.010
				br label %for.inc

				if.else: ; preds = %for.body
				%cmp8 = fcmp olt float %0, 3.000000e+00
				br i1 %cmp8, label %if.then10, label %for.inc

				if.then10: ; preds = %if.else
				%sub = fsub fast float %sum.010, %0
				br label %for.inc

				for.inc: ; preds = %if.then, %if.then10, %if.else
				%sum.1 = phi float [ %add, %if.then ], [ %sub, %if.then10 ], [ %sum.010, %if.else ]
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.inc, %entry
				%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %sum.1, %for.inc ]
				ret float %sum.0.lcssa
				}

				; Float fadd + fmul patterns
				; Check lack of vectorisation of reduction code with a pair of non-compatible
				; instructions { fadd, fmul }.
				;
				; float fcmp_fadd_fmul(float *a, int n) {
				; float sum=0.0;
				; for (int i=0;i<n;i++) {
				; if (a[i]>1.0)
				; sum+=a[i];
				; else if (a[i]<3.0)
				; sum*=a[i];
				; }
				; return sum;
				; }
				hsaito: Is this the correct check here?
				rengolin: ouch, no, regex left-over. Will fix.

				; CHECK-LABEL: @fcmp_fadd_fmul(
				; CHECK-NOT: <4 x float>
				define float @fcmp_fadd_fmul(float* nocapture readonly %a, i32 %n) nounwind readonly {
				entry:
				%cmp9 = icmp sgt i32 %n, 0
				br i1 %cmp9, label %for.body.preheader, label %for.end

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %n to i64
				br label %for.body

				for.body: ; preds = %for.inc, %for.body.preheader
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.inc ]
				%sum.010 = phi float [ 0.000000e+00, %for.body.preheader ], [ %sum.1, %for.inc ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%cmp1 = fcmp ogt float %0, 1.000000e+00
				br i1 %cmp1, label %if.then, label %if.else

				if.then: ; preds = %for.body
				%add = fadd fast float %0, %sum.010
				br label %for.inc

				if.else: ; preds = %for.body
				%cmp8 = fcmp olt float %0, 3.000000e+00
				br i1 %cmp8, label %if.then10, label %for.inc

				if.then10: ; preds = %if.else
				%mul = fmul fast float %0, %sum.010
				br label %for.inc

				for.inc: ; preds = %if.then, %if.then10, %if.else
				%sum.1 = phi float [ %add, %if.then ], [ %mul, %if.then10 ], [ %sum.010, %if.else ]
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.inc, %entry
				%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %sum.1, %for.inc ]
				ret float %sum.0.lcssa
				}

				; Float fadd + store patterns
				; Check lack of vectorisation of reduction code with a store back, given it
				; has loop dependency on a[i].
				;
				; float fcmp_store_back(float a[], int LEN) {
				; float sum = 0.0;
				; for (int i = 0; i < LEN; i++) {
				; sum += a[i];
				; a[i] = sum;
				; }
				; return sum;
				; }

				; CHECK-LABEL: @fcmp_store_back(
				; CHECK-NOT: <4 x float>
				define float @fcmp_store_back(float* nocapture %a, i32 %LEN) nounwind {
				entry:
				%cmp7 = icmp sgt i32 %LEN, 0
				br i1 %cmp7, label %for.body.preheader, label %for.end

				for.body.preheader: ; preds = %entry
				%wide.trip.count = zext i32 %LEN to i64
				br label %for.body

				for.body: ; preds = %for.body, %for.body.preheader
				%indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
				%sum.08 = phi float [ 0.000000e+00, %for.body.preheader ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %indvars.iv
				%0 = load float, float* %arrayidx, align 4
				%add = fadd fast float %0, %sum.08
				store float %add, float* %arrayidx, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				%sum.0.lcssa = phi float [ 0.000000e+00, %entry ], [ %add, %for.body ]
				ret float %sum.0.lcssa
				}