This is an archive of the discontinued LLVM Phabricator instance.

Don't vectorize loops when everything will be scalarized
Closed Public

Authored by hfinkel on Mar 28 2016, 6:09 PM.

Details

Reviewers

anemet
nadav
jmolloy
congh

Commits

rG2e0ff2b244e0: [LoopVectorize] Don't vectorize loops when everything will be scalarized
rL264904: [LoopVectorize] Don't vectorize loops when everything will be scalarized

Summary

This patch prevents the loop vectorizer from vectorizing when all of the vector types it generates will be scalarized. I ran into this problem with the PPC's QPX vector ISA, which supports only floating-point vector types. The loop vectorizer will, however, happily vectorize loops with purely integer computation. Here's an example:

LV: The Smallest and Widest types: 32 / 32 bits.
LV: The Widest register is: 256 bits.
LV: Found an estimated cost of 0 for VF 1 For instruction:   %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 1 For instruction:   %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 1 For instruction:   %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 1 for VF 1 For instruction:   store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 1 For instruction:   %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 1 For instruction:   %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 1 For instruction:   br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Scalar loop costs: 3.
LV: Found an estimated cost of 0 for VF 2 For instruction:   %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 2 For instruction:   %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 2 For instruction:   %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 2 for VF 2 For instruction:   store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 2 For instruction:   %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 2 For instruction:   %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 2 For instruction:   br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Vector loop of width 2 costs: 2.
LV: Found an estimated cost of 0 for VF 4 For instruction:   %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 4 For instruction:   %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 4 For instruction:   %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 4 for VF 4 For instruction:   store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 4 For instruction:   %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 4 For instruction:   %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 4 For instruction:   br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Vector loop of width 4 costs: 1.
LV: Found an estimated cost of 0 for VF 8 For instruction:   %indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
LV: Found an estimated cost of 0 for VF 8 For instruction:   %arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
LV: Found an estimated cost of 0 for VF 8 For instruction:   %2 = trunc i64 %indvars.iv25 to i32
LV: Found an estimated cost of 8 for VF 8 For instruction:   store i32 %2, i32* %arrayidx, align 4
LV: Found an estimated cost of 1 for VF 8 For instruction:   %indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
LV: Found an estimated cost of 1 for VF 8 For instruction:   %exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
LV: Found an estimated cost of 0 for VF 8 For instruction:   br i1 %exitcond27, label %for.cond.cleanup, label %for.body
LV: Vector loop of width 8 costs: 1.
LV: Selecting VF: 8.
LV: The target has 32 registers
LV(REG): Calculating max register usage:
LV(REG): At #0 Interval # 0
LV(REG): At #1 Interval # 1
LV(REG): At #2 Interval # 2
LV(REG): At #4 Interval # 1
LV(REG): At #5 Interval # 1
LV(REG): VF = 8

The problem is that the cost model here is not wrong, exactly. Since all of these operations are scalarized, their cost (aside from the uniform ones) is indeed VF*(scalar cost), just as the model suggests. In fact, the larger the VF picked, the lower the relative overhead from the loop itself (and the induction-variable update and check), and so in a sense, picking the largest VF here is the right thing to do.
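The arithmetic behind the log above can be reproduced with a small self-contained sketch. It is not LLVM code: `loopCostForVF` and `perLaneCost` are hypothetical names, and the per-instruction costs are simply copied from the debug output (a scalarized store costing VF, plus one unit each for the induction-variable add and the exit compare).

```cpp
#include <cassert>

// Hypothetical stand-in for the per-instruction costs printed in the log.
// The store is scalarized, so it costs VF * 1. The induction-variable add and
// the exit compare cost 1 each at every VF. The phi, GEP, trunc and br are free.
unsigned loopCostForVF(unsigned VF) {
  assert(VF >= 1 && "VF must be at least 1");
  const unsigned StoreCost = VF * 1; // one scalar store per lane
  const unsigned IVAddCost = 1;      // induction-variable update
  const unsigned ExitCmpCost = 1;    // loop-exit check
  return StoreCost + IVAddCost + ExitCmpCost;
}

// Cost per original loop iteration, which is what the widths are compared on.
float perLaneCost(unsigned VF) {
  return loopCostForVF(VF) / static_cast<float>(VF);
}
```

Under these assumptions the per-iteration cost falls from 3 at VF 1 to 2, 1.5 and 1.25 at VF 2, 4 and 8, which the log prints truncated as 2, 1 and 1. The fixed overhead of 2 is spread over more lanes as VF grows, so the largest VF always looks cheapest.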

The problem is that vectorizing like this, where all of the vectors will be scalarized in the backend, isn't really vectorizing, but rather interleaving. By itself, this would be okay, but then the vectorizer itself also interleaves, and that's where the problem manifests itself. There aren't actually enough scalar registers to support the normal interleave factor multiplied by a factor of VF (8 in this example). In other words, the problem with this is that our register-pressure heuristic does not account for scalarization.

While we might want to improve our register-pressure heuristic, I don't think this is the right motivating case for that work. Here we have a more-basic problem: The job of the vectorizer is to vectorize things (interleaving aside), and if the IR it generates won't generate any actual vector code, then something is wrong. Thus, if every type looks like it will be scalarized (i.e. will be split into VF or more parts), then don't consider that VF.
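A minimal sketch of this rule, assuming each candidate VF yields a (cost, flag) pair where the flag says whether any instruction would remain a real vector operation. The names `selectWidth`, `CostFn`, `allScalarizedCost` and `legalVectorCost` are hypothetical, and the special case that starts forced vectorization at width 2 is left out for brevity.

```cpp
#include <cassert>
#include <utility>

// (cost, any-instruction-vectorized) pair.
using VectorizationCostTy = std::pair<unsigned, bool>;

// Hypothetical cost callback: returns the loop cost and the flag for a VF.
using CostFn = VectorizationCostTy (*)(unsigned VF);

// Pick the cheapest power-of-two width, skipping widths that would not
// produce any vector instructions unless vectorization is forced.
unsigned selectWidth(unsigned MaxVF, CostFn ExpectedCost,
                     bool ForceVectorization) {
  assert(MaxVF >= 1 && "need at least the scalar width");
  float Cost = static_cast<float>(ExpectedCost(1).first);
  unsigned Width = 1;
  for (unsigned i = 2; i <= MaxVF; i *= 2) {
    VectorizationCostTy C = ExpectedCost(i);
    float VectorCost = C.first / static_cast<float>(i);
    if (!C.second && !ForceVectorization)
      continue; // everything would be scalarized: not a real candidate
    if (VectorCost < Cost) {
      Cost = VectorCost;
      Width = i;
    }
  }
  return Width;
}

// Toy model of the integer loop above: the store costs VF, the loop overhead
// costs 2, and no instruction stays a vector.
VectorizationCostTy allScalarizedCost(unsigned VF) { return {VF + 2, false}; }

// Toy model of a target with legal vectors: constant cost, real vectors.
VectorizationCostTy legalVectorCost(unsigned VF) { return {3u, VF > 1}; }
```

With the all-scalarized model, `selectWidth(8, allScalarizedCost, false)` stays at width 1 even though width 8 has the lowest per-iteration cost, while the legal-vector model still selects width 8.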

Diff Detail

Repository: rL LLVM

Event Timeline

hfinkel updated this revision to Diff 51864. Mar 28 2016, 6:09 PM

hfinkel retitled this revision from to Don't vectorize loops when everything will be scalarized.

hfinkel updated this object.

hfinkel added reviewers: anemet, nadav, congh, jmolloy.

hfinkel added a subscriber: llvm-commits.

Herald added subscribers: mzolotukhin, mcrosier. Mar 28 2016, 6:09 PM

spatel added a subscriber: spatel. Mar 28 2016, 7:36 PM

LGTM.

I have two suggestions you may want to consider.

lib/Transforms/Vectorize/LoopVectorize.cpp
1535–1539 ↗	(On Diff #51864)	Assuming I understand the logic correctly, you may want to mention here that for ActuallyVectorized we currently use type legality. (I.e. we could still have a loop where we don't vectorize anything because none of the instructions are legal.)
5557–5558 ↗	(On Diff #51864)	It seems that the current logic requires ActuallyVectorized to be initialized to false by the caller. I think that it would be less error-prone if you initialized it to false here inside the API. What do you think?

This revision is now accepted and ready to land. Mar 28 2016, 9:24 PM

Hal, I am not sure I understand the problem. Is the problem register pressure or the fact that store <8 x i32> is more expensive than 8 times store i32?

This looks like a problem with the PPC cost model that does not take into account the cost of scalarization.

In D18537#386178, @nadav wrote:

Hal, I am not sure I understand the problem. Is the problem register pressure or the fact that store <8 x i32> is more expensive than 8 times store i32?

It is really just register pressure. Since there is no legal vector type of i32 in this configuration, everything is just scalarized (or perhaps I should say expanded to avoid overloading terminology here -- the point is that it is type legalization, not operation legalization).

This looks like a problem with the PPC cost model that does not take into account the cost of scalarization.

No, there is no scalarization cost, because nothing is ever a vector. In fact, if you take my test case and turn off interleaving, you get pretty nice-looking code (which is even interleaved in practice, because that's what type legalization gives us). However, between the vector expansion (type legalization) and the interleaving the target generally requests, the register pressure is too high.

Also, FWIW, Sanjay says that this patch also fixes PR26837 (which applies to SSE).

I don't like the approach of passing in address-of-bool as parameter argument, especially since you did not document the parameter (is it IN, is it OUT, etc).

Please change the getCost return value to return a struct, or std::pair that ties the cost and the bit that says if vectorization happened. Please declare a type for this pair:

using VectorizationCostTy = std::pair<unsigned, bool>;

Then, update the Doxygen comments of the functions that use it.

Also, the name "ActuallyVectorized" is not descriptive. Maybe "AnyInstructionVectorized" ?
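The difference between the two API shapes discussed here can be sketched as follows. This is an illustration only: `toyCostOutParam` and `toyCostPair` are hypothetical functions with a made-up cost, not the signatures from either revision of the patch.

```cpp
#include <cassert>
#include <utility>

// Cost plus a flag saying whether any instruction was really vectorized.
using VectorizationCostTy = std::pair<unsigned, bool>;

// Out-parameter style: the flag is only ever set, never cleared, so the
// result is correct only if the caller initialized it to false beforehand.
unsigned toyCostOutParam(unsigned VF, bool *AnyInstructionVectorized) {
  assert(AnyInstructionVectorized && "flag pointer must not be null");
  if (VF > 1)
    *AnyInstructionVectorized = true;
  return VF; // made-up cost
}

// Pair-return style: cost and flag travel together, and the callee is
// responsible for both values.
VectorizationCostTy toyCostPair(unsigned VF) {
  return VectorizationCostTy(VF, VF > 1);
}
```

If a caller passes a flag that is still true from an earlier query, `toyCostOutParam(1, &Flag)` leaves it true, whereas `toyCostPair(1).second` is always false. That is the initialization hazard raised in the earlier inline comment.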

In D18537#386214, @nadav wrote:

I don't like the approach of passing in address-of-bool as parameter argument, especially since you did not document the parameter (is it IN, is it OUT, etc).

Please change the getCost return value to return a struct, or std::pair that ties the cost and the bit that says if vectorization happened. Please declare a type for this pair:

using VectorizationCostTy = std::pair<unsigned, bool>;

Then, update the Doxygen comments of the functions that use it.

Also, the name "ActuallyVectorized" is not descriptive. Maybe "AnyInstructionVectorized" ?

Sure. Will do.

Address review comments.

In D18537#386482, @hfinkel wrote:

Address review comments.

Nadav, this version uses a std::pair (with an appropriately-commented typedef). What do you think of this version?

Hal, the code looks much better. I was thinking about this patch on the way to work this morning and I was wondering if you could mark the type <4 x i32> as legal, because you _can_ load it and store it to memory. Right?

Closed by commit rL264904: [LoopVectorize] Don't vectorize loops when everything will be scalarized (authored by hfinkel). Mar 30 2016, 12:42 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp	67 lines
llvm/trunk/test/Transforms/LoopVectorize/PowerPC/vectorize-only-for-real.ll	62 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/vectorize-only-for-real.ll	39 lines

Diff 52107

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

(1,526 lines not shown)	public:
};		};

/// \return Returns information about the register usages of the loop for the		/// \return Returns information about the register usages of the loop for the
/// given vectorization factors.		/// given vectorization factors.
SmallVector<RegisterUsage, 8>		SmallVector<RegisterUsage, 8>
calculateRegisterUsage(const SmallVector<unsigned, 8> &VFs);		calculateRegisterUsage(const SmallVector<unsigned, 8> &VFs);

private:		private:
		/// The vectorization cost is a combination of the cost itself and a boolean
		/// indicating whether any of the contributing operations will actually operate on
		/// vector values after type legalization in the backend. If this latter value is
		/// false, then all operations will be scalarized (i.e. no vectorization has
		/// actually taken place).
		typedef std::pair<unsigned, bool> VectorizationCostTy;

/// Returns the expected execution cost. The unit of the cost does		/// Returns the expected execution cost. The unit of the cost does
/// not matter because we use the 'cost' units to compare different		/// not matter because we use the 'cost' units to compare different
/// vector widths. The cost that is returned is not normalized by		/// vector widths. The cost that is returned is not normalized by
/// the factor width.		/// the factor width.
unsigned expectedCost(unsigned VF);		VectorizationCostTy expectedCost(unsigned VF);

/// Returns the execution time cost of an instruction for a given vector		/// Returns the execution time cost of an instruction for a given vector
/// width. Vector width of one means scalar.		/// width. Vector width of one means scalar.
unsigned getInstructionCost(Instruction *I, unsigned VF);		VectorizationCostTy getInstructionCost(Instruction *I, unsigned VF);

		/// The cost-computation logic from getInstructionCost which provides
		/// the vector type as an output parameter.
		unsigned getInstructionCost(Instruction *I, unsigned VF, Type *&VectorTy);

/// Returns whether the instruction is a load or store and will be a emitted		/// Returns whether the instruction is a load or store and will be a emitted
/// as a vector operation.		/// as a vector operation.
bool isConsecutiveLoadOrStore(Instruction *I);		bool isConsecutiveLoadOrStore(Instruction *I);

/// Report an analysis message to assist the user in diagnosing loops that are		/// Report an analysis message to assist the user in diagnosing loops that are
/// not vectorized. These are handled as LoopAccessReport rather than		/// not vectorized. These are handled as LoopAccessReport rather than
/// VectorizationReport because the << operator of VectorizationReport returns		/// VectorizationReport because the << operator of VectorizationReport returns
(3,588 lines not shown)	LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize) {
if (UserVF != 0) {		if (UserVF != 0) {
assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");		assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");
DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");		DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");

Factor.Width = UserVF;		Factor.Width = UserVF;
return Factor;		return Factor;
}		}

float Cost = expectedCost(1);		float Cost = expectedCost(1).first;
#ifndef NDEBUG		#ifndef NDEBUG
const float ScalarCost = Cost;		const float ScalarCost = Cost;
#endif /* NDEBUG */		#endif /* NDEBUG */
unsigned Width = 1;		unsigned Width = 1;
DEBUG(dbgs() << "LV: Scalar loop costs: " << (int)ScalarCost << ".\n");		DEBUG(dbgs() << "LV: Scalar loop costs: " << (int)ScalarCost << ".\n");

bool ForceVectorization = Hints->getForce() == LoopVectorizeHints::FK_Enabled;		bool ForceVectorization = Hints->getForce() == LoopVectorizeHints::FK_Enabled;
// Ignore scalar width, because the user explicitly wants vectorization.		// Ignore scalar width, because the user explicitly wants vectorization.
if (ForceVectorization && VF > 1) {		if (ForceVectorization && VF > 1) {
Width = 2;		Width = 2;
Cost = expectedCost(Width) / (float)Width;		Cost = expectedCost(Width).first / (float)Width;
}		}

for (unsigned i=2; i <= VF; i*=2) {		for (unsigned i=2; i <= VF; i*=2) {
// Notice that the vector loop needs to be executed less times, so		// Notice that the vector loop needs to be executed less times, so
// we need to divide the cost of the vector loops by the width of		// we need to divide the cost of the vector loops by the width of
// the vector elements.		// the vector elements.
float VectorCost = expectedCost(i) / (float)i;		VectorizationCostTy C = expectedCost(i);
		float VectorCost = C.first / (float)i;
DEBUG(dbgs() << "LV: Vector loop of width " << i << " costs: " <<		DEBUG(dbgs() << "LV: Vector loop of width " << i << " costs: " <<
(int)VectorCost << ".\n");		(int)VectorCost << ".\n");
		if (!C.second && !ForceVectorization) {
		DEBUG(dbgs() << "LV: Not considering vector loop of width " << i <<
		" because it will not generate any vector instructions.\n");
		continue;
		}
if (VectorCost < Cost) {		if (VectorCost < Cost) {
Cost = VectorCost;		Cost = VectorCost;
Width = i;		Width = i;
}		}
}		}

DEBUG(if (ForceVectorization && Width > 1 && Cost >= ScalarCost) dbgs()		DEBUG(if (ForceVectorization && Width > 1 && Cost >= ScalarCost) dbgs()
<< "LV: Vectorization seems to be not beneficial, "		<< "LV: Vectorization seems to be not beneficial, "
(131 lines not shown)	unsigned LoopVectorizationCostModel::selectInterleaveCount(bool OptForSize,
} else {		} else {
if (ForceTargetMaxVectorInterleaveFactor.getNumOccurrences() > 0)		if (ForceTargetMaxVectorInterleaveFactor.getNumOccurrences() > 0)
MaxInterleaveCount = ForceTargetMaxVectorInterleaveFactor;		MaxInterleaveCount = ForceTargetMaxVectorInterleaveFactor;
}		}

// If we did not calculate the cost for VF (because the user selected the VF)		// If we did not calculate the cost for VF (because the user selected the VF)
// then we calculate the cost of VF here.		// then we calculate the cost of VF here.
if (LoopCost == 0)		if (LoopCost == 0)
LoopCost = expectedCost(VF);		LoopCost = expectedCost(VF).first;

// Clamp the calculated IC to be between the 1 and the max interleave count		// Clamp the calculated IC to be between the 1 and the max interleave count
// that the target allows.		// that the target allows.
if (IC > MaxInterleaveCount)		if (IC > MaxInterleaveCount)
IC = MaxInterleaveCount;		IC = MaxInterleaveCount;
else if (IC < 1)		else if (IC < 1)
IC = 1;		IC = 1;

(210 lines not shown)	for (unsigned i = 0, e = VFs.size(); i < e; ++i) {
RU.LoopInvariantRegs = Invariant;		RU.LoopInvariantRegs = Invariant;
RU.MaxLocalUsers = MaxUsages[i];		RU.MaxLocalUsers = MaxUsages[i];
RUs[i] = RU;		RUs[i] = RU;
}		}

return RUs;		return RUs;
}		}

unsigned LoopVectorizationCostModel::expectedCost(unsigned VF) {		LoopVectorizationCostModel::VectorizationCostTy
unsigned Cost = 0;		LoopVectorizationCostModel::expectedCost(unsigned VF) {
		VectorizationCostTy Cost;

// For each block.		// For each block.
for (Loop::block_iterator bb = TheLoop->block_begin(),		for (Loop::block_iterator bb = TheLoop->block_begin(),
be = TheLoop->block_end(); bb != be; ++bb) {		be = TheLoop->block_end(); bb != be; ++bb) {
unsigned BlockCost = 0;		VectorizationCostTy BlockCost;
BasicBlock *BB = *bb;		BasicBlock *BB = *bb;

// For each instruction in the old loop.		// For each instruction in the old loop.
for (BasicBlock::iterator it = BB->begin(), e = BB->end(); it != e; ++it) {		for (BasicBlock::iterator it = BB->begin(), e = BB->end(); it != e; ++it) {
// Skip dbg intrinsics.		// Skip dbg intrinsics.
if (isa<DbgInfoIntrinsic>(it))		if (isa<DbgInfoIntrinsic>(it))
continue;		continue;

// Skip ignored values.		// Skip ignored values.
if (ValuesToIgnore.count(&*it))		if (ValuesToIgnore.count(&*it))
continue;		continue;

unsigned C = getInstructionCost(&*it, VF);		VectorizationCostTy C = getInstructionCost(&*it, VF);

// Check if we should override the cost.		// Check if we should override the cost.
if (ForceTargetInstructionCost.getNumOccurrences() > 0)		if (ForceTargetInstructionCost.getNumOccurrences() > 0)
C = ForceTargetInstructionCost;		C.first = ForceTargetInstructionCost;

BlockCost += C;		BlockCost.first += C.first;
DEBUG(dbgs() << "LV: Found an estimated cost of " << C << " for VF " <<		BlockCost.second |= C.second;
VF << " For instruction: " << *it << '\n');		DEBUG(dbgs() << "LV: Found an estimated cost of " << C.first <<
		" for VF " << VF << " For instruction: " << *it << '\n');
}		}

// We assume that if-converted blocks have a 50% chance of being executed.		// We assume that if-converted blocks have a 50% chance of being executed.
// When the code is scalar then some of the blocks are avoided due to CF.		// When the code is scalar then some of the blocks are avoided due to CF.
// When the code is vectorized we execute all code paths.		// When the code is vectorized we execute all code paths.
if (VF == 1 && Legal->blockNeedsPredication(*bb))		if (VF == 1 && Legal->blockNeedsPredication(*bb))
BlockCost /= 2;		BlockCost.first /= 2;

Cost += BlockCost;		Cost.first += BlockCost.first;
		Cost.second |= BlockCost.second;
}		}

return Cost;		return Cost;
}		}

/// \brief Check if the load/store instruction \p I may be translated into		/// \brief Check if the load/store instruction \p I may be translated into
/// gather/scatter during vectorization.		/// gather/scatter during vectorization.
///		///
(60 lines not shown)	static bool isLikelyComplexAddressComputation(Value *Ptr,
return StepVal > MaxMergeDistance;		return StepVal > MaxMergeDistance;
}		}

static bool isStrideMul(Instruction *I, LoopVectorizationLegality *Legal) {		static bool isStrideMul(Instruction *I, LoopVectorizationLegality *Legal) {
return Legal->hasStride(I->getOperand(0)) ||		return Legal->hasStride(I->getOperand(0)) ||
Legal->hasStride(I->getOperand(1));		Legal->hasStride(I->getOperand(1));
}		}

unsigned		LoopVectorizationCostModel::VectorizationCostTy
LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {		LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {
// If we know that this instruction will remain uniform, check the cost of		// If we know that this instruction will remain uniform, check the cost of
// the scalar version.		// the scalar version.
if (Legal->isUniformAfterVectorization(I))		if (Legal->isUniformAfterVectorization(I))
VF = 1;		VF = 1;

		Type *VectorTy;
		unsigned C = getInstructionCost(I, VF, VectorTy);

		bool TypeNotScalarized = VF > 1 && !VectorTy->isVoidTy() &&
		TTI.getNumberOfParts(VectorTy) < VF;
		return VectorizationCostTy(C, TypeNotScalarized);
		}

		unsigned
		LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF,
		Type *&VectorTy) {
Type *RetTy = I->getType();		Type *RetTy = I->getType();
if (VF > 1 && MinBWs.count(I))		if (VF > 1 && MinBWs.count(I))
RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]);		RetTy = IntegerType::get(RetTy->getContext(), MinBWs[I]);
Type *VectorTy = ToVectorTy(RetTy, VF);		VectorTy = ToVectorTy(RetTy, VF);

// TODO: We need to estimate the cost of intrinsic calls.		// TODO: We need to estimate the cost of intrinsic calls.
switch (I->getOpcode()) {		switch (I->getOpcode()) {
case Instruction::GetElementPtr:		case Instruction::GetElementPtr:
// We mark this instruction as zero-cost because the cost of GEPs in		// We mark this instruction as zero-cost because the cost of GEPs in
// vectorized code depends on whether the corresponding memory instruction		// vectorized code depends on whether the corresponding memory instruction
// is scalarized or not. Therefore, we handle GEPs with the memory		// is scalarized or not. Therefore, we handle GEPs with the memory
// instruction cost.		// instruction cost.
(427 lines not shown)
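The way the diff above derives the flag per instruction and folds it into the loop cost can be modelled in isolation. This is a sketch under stated assumptions, not the LLVM implementation: `ToyInst`, `toyInstructionCost` and `toyExpectedCost` are hypothetical, and `NumParts` stands in for what `TTI.getNumberOfParts(VectorTy)` would report.

```cpp
#include <cassert>
#include <utility>
#include <vector>

using VectorizationCostTy = std::pair<unsigned, bool>;

// Hypothetical description of one instruction at a fixed VF.
struct ToyInst {
  unsigned Cost;     // estimated cost at this VF
  unsigned NumParts; // stand-in for TTI.getNumberOfParts(VectorTy)
  bool IsVoid;       // stand-in for VectorTy->isVoidTy()
};

// A type counts as "not scalarized" when legalization splits it into fewer
// than VF parts, so at least one part still holds more than one element.
VectorizationCostTy toyInstructionCost(const ToyInst &I, unsigned VF) {
  assert(VF >= 1 && "VF must be at least 1");
  bool TypeNotScalarized = VF > 1 && !I.IsVoid && I.NumParts < VF;
  return VectorizationCostTy(I.Cost, TypeNotScalarized);
}

// Costs are summed. Flags are OR-ed, so one real vector operation is enough
// to mark the whole loop as actually vectorized.
VectorizationCostTy toyExpectedCost(const std::vector<ToyInst> &Insts,
                                    unsigned VF) {
  VectorizationCostTy Cost(0, false);
  for (const ToyInst &I : Insts) {
    VectorizationCostTy C = toyInstructionCost(I, VF);
    Cost.first += C.first;
    Cost.second |= C.second;
  }
  return Cost;
}
```

For example, at VF 4 a list in which every instruction has `NumParts == 4` yields a false flag, while replacing any one entry with `NumParts == 1` flips the flag to true without changing the summed cost.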

llvm/trunk/test/Transforms/LoopVectorize/PowerPC/vectorize-only-for-real.ll

				; RUN: opt -S -loop-vectorize < %s | FileCheck %s
				target datalayout = "E-m:e-i64:64-n32:64"
				target triple = "powerpc64-bgq-linux"

				; Function Attrs: nounwind
				define zeroext i32 @test() #0 {
				; CHECK-LABEL: @test
				; CHECK-NOT: x i32>

				entry:
				%a = alloca [1600 x i32], align 4
				%c = alloca [1600 x i32], align 4
				%0 = bitcast [1600 x i32]* %a to i8*
				call void @llvm.lifetime.start(i64 6400, i8* %0) #3
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				%1 = bitcast [1600 x i32]* %c to i8*
				call void @llvm.lifetime.start(i64 6400, i8* %1) #3
				%arraydecay = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 0
				%arraydecay1 = getelementptr inbounds [1600 x i32], [1600 x i32]* %c, i64 0, i64 0
				%call = call signext i32 @bar(i32* %arraydecay, i32* %arraydecay1) #3
				br label %for.body6

				for.body: ; preds = %for.body, %entry
				%indvars.iv25 = phi i64 [ 0, %entry ], [ %indvars.iv.next26, %for.body ]
				%arrayidx = getelementptr inbounds [1600 x i32], [1600 x i32]* %a, i64 0, i64 %indvars.iv25
				%2 = trunc i64 %indvars.iv25 to i32
				store i32 %2, i32* %arrayidx, align 4
				%indvars.iv.next26 = add nuw nsw i64 %indvars.iv25, 1
				%exitcond27 = icmp eq i64 %indvars.iv.next26, 1600
				br i1 %exitcond27, label %for.cond.cleanup, label %for.body

				for.cond.cleanup5: ; preds = %for.body6
				call void @llvm.lifetime.end(i64 6400, i8* nonnull %1) #3
				call void @llvm.lifetime.end(i64 6400, i8* %0) #3
				ret i32 %add

				for.body6: ; preds = %for.body6, %for.cond.cleanup
				%indvars.iv = phi i64 [ 0, %for.cond.cleanup ], [ %indvars.iv.next, %for.body6 ]
				%s.022 = phi i32 [ 0, %for.cond.cleanup ], [ %add, %for.body6 ]
				%arrayidx8 = getelementptr inbounds [1600 x i32], [1600 x i32]* %c, i64 0, i64 %indvars.iv
				%3 = load i32, i32* %arrayidx8, align 4
				%add = add i32 %3, %s.022
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 1600
				br i1 %exitcond, label %for.cond.cleanup5, label %for.body6
				}

				; Function Attrs: argmemonly nounwind
				declare void @llvm.lifetime.start(i64, i8* nocapture) #1

				; Function Attrs: argmemonly nounwind
				declare void @llvm.lifetime.end(i64, i8* nocapture) #1

				declare signext i32 @bar(i32*, i32*) #2

				attributes #0 = { nounwind "target-cpu"="a2q" "target-features"="+qpx,-altivec,-bpermd,-crypto,-direct-move,-extdiv,-power8-vector,-vsx" }
				attributes #1 = { argmemonly nounwind }
				attributes #2 = { "target-cpu"="a2q" "target-features"="+qpx,-altivec,-bpermd,-crypto,-direct-move,-extdiv,-power8-vector,-vsx" }
				attributes #3 = { nounwind }

llvm/trunk/test/Transforms/LoopVectorize/X86/vectorize-only-for-real.ll

				; RUN: opt -S -basicaa -loop-vectorize < %s | FileCheck %s
				target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-apple-macosx10.11.0"

				define i32 @accum(i32* nocapture readonly %x, i32 %N) #0 {
				entry:
				; CHECK-LABEL: @accum
				; CHECK-NOT: x i32>

				%cmp1 = icmp sgt i32 %N, 0
				br i1 %cmp1, label %for.inc.preheader, label %for.end

				for.inc.preheader:
				br label %for.inc

				for.inc:
				%indvars.iv = phi i64 [ %indvars.iv.next, %for.inc ], [ 0, %for.inc.preheader ]
				%sum.02 = phi i32 [ %add, %for.inc ], [ 0, %for.inc.preheader ]
				%arrayidx = getelementptr inbounds i32, i32* %x, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%add = add nsw i32 %0, %sum.02
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32
				%exitcond = icmp eq i32 %lftr.wideiv, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.inc

				for.end.loopexit:
				%add.lcssa = phi i32 [ %add, %for.inc ]
				br label %for.end

				for.end:
				%sum.0.lcssa = phi i32 [ 0, %entry ], [ %add.lcssa, %for.end.loopexit ]
				ret i32 %sum.0.lcssa

				; CHECK: ret i32
				}

				attributes #0 = { "target-cpu"="core2" "target-features"="+sse,-avx,-avx2,-sse2" }