This is an archive of the discontinued LLVM Phabricator instance.

[Loop Vectorizer] Support predication of div/rem
ClosedPublic

Authored by gilr on Jul 28 2016, 8:43 AM.


Details

Reviewers

anemet
jmolloy
mkuper

Commits

rG550148b2f662: [Loop Vectorizer] Support predication of div/rem
rL279620: [Loop Vectorizer] Support predication of div/rem

Summary

div/rem instructions in basic blocks that require predication currently prevent vectorization. This patch extends the existing mechanism for predicating stores to handle other instructions and leverages it to predicate divs and rems.

The generated vector extracts and inserts are now moved into the predicated block (reflected in the cost model for scalarization).
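As a rough illustration of what the summary describes, the sketch below models a loop with a conditional division and a hand-written "scalarized and predicated" equivalent. It is not the vectorizer's output: the vectorization factor `VF`, the function names, and the use of arrays to stand in for vector registers are all assumptions made for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Assumed vectorization factor, chosen only for illustration.
constexpr std::size_t VF = 4;

// The original scalar loop: the division sits in a conditional block, so it
// must not execute for lanes where the condition is false.
inline void scalarLoop(int *a, const int *b, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i)
    if (b[i] != 0)
      a[i] = a[i] / b[i];
}

// Hand-written model of a vector iteration in which the division is
// scalarized and each scalar copy is guarded by its lane of the mask.
inline void predicatedLoop(int *a, const int *b, std::size_t n) {
  assert(n % VF == 0 && "sketch handles only whole vector iterations");
  for (std::size_t i = 0; i < n; i += VF) {
    // Vector compare: one mask bit per lane.
    std::array<bool, VF> mask;
    for (std::size_t lane = 0; lane < VF; ++lane)
      mask[lane] = b[i + lane] != 0;

    // Placeholder result vector; lanes that are masked off keep a value that
    // is never observed (this plays the role of undef in the IR).
    std::array<int, VF> result{};
    for (std::size_t lane = 0; lane < VF; ++lane) {
      if (mask[lane]) {                // predicated block for this lane
        int dividend = a[i + lane];    // models an extractelement
        int divisor = b[i + lane];     // models an extractelement
        result[lane] = dividend / divisor; // scalar div + insertelement
      }
    }

    // Vector select: masked-off lanes keep the old value of a[].
    for (std::size_t lane = 0; lane < VF; ++lane)
      a[i + lane] = mask[lane] ? result[lane] : a[i + lane];
  }
}
```

Calling both functions on copies of the same input should leave the two arrays identical, since the division only ever runs for lanes whose divisor is non-zero.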

Diff Detail

Repository: rL LLVM

Event Timeline

gilr updated this revision to Diff 65934. Jul 28 2016, 8:43 AM

gilr retitled this revision from to [Loop Vectorizer] Support predication of div/rem.

gilr updated this object.

gilr added reviewers: mkuper, jmolloy.

gilr added subscribers: llvm-commits, Ayal, delena.

Herald added a subscriber: mzolotukhin. Jul 28 2016, 8:43 AM

anemet added a subscriber: anemet. Jul 28 2016, 9:58 AM

anemet added inline comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
3825–3834 ↗	(On Diff #65934)	Only a drive-by comment: please don't make vectorizeLoop any more unreadable than it already is. Please consider a prequel to this patch that moves store-predication into its own function.

You might want to consider special-casing division by a constant integer. For example, on x86, we can convert a 16-bit unsigned divide by a constant into a pmulhuw+psrlw.
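To make the constant-divisor suggestion concrete, here is a small scalar sketch of dividing a 16-bit unsigned value by 10 with a multiply-high followed by a shift, which is the shape of a pmulhuw+psrlw sequence. The divisor 10, the magic constant, and the helper names are assumptions picked for this example, not something taken from the patch.

```cpp
#include <cstdint>

// Scalar model of one pmulhuw lane: the high 16 bits of the 32-bit product
// of two unsigned 16-bit values.
inline std::uint16_t mulhi16(std::uint16_t a, std::uint16_t b) {
  return static_cast<std::uint16_t>((static_cast<std::uint32_t>(a) * b) >> 16);
}

// x / 10 for any 16-bit unsigned x, without a divide instruction.
// 0xCCCD is ceil(2^19 / 10); the multiply-high supplies a shift by 16 and
// the remaining shift by 3 models psrlw.
inline std::uint16_t divBy10(std::uint16_t x) {
  return static_cast<std::uint16_t>(mulhi16(x, 0xCCCD) >> 3);
}
```

Because no divide is executed at all, a sequence like this cannot trap, which is one reason a non-zero constant divisor does not need predication.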

mssimpso added a subscriber: mssimpso. Jul 28 2016, 11:49 AM

mkuper added inline comments. Jul 28 2016, 3:50 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
3327 ↗	(On Diff #65934)	I'm not entirely sure what "the cost of a phi" means, especially without a type. Also, I don't believe any in-tree target actually assigns a non-zero cost to PHIs right now. So if you actually want a meaningful cost here for, say, x86, you may want to look into the cost model as well...
3825 ↗	(On Diff #65934)	...that need predication? (Before, we really would predicate any store)
3825–3834 ↗	(On Diff #65934)	+1
3833 ↗	(On Diff #65934)	We don't run anything that will sink this later in the pipeline? (To be honest, even if we do, I'm not sure whether we should do the cleanup here or rely on a later pass, but I'm curious)
3835 ↗	(On Diff #65934)	Any reason not to use a range for over the operands?
3841 ↗	(On Diff #65934)	Do you know if we have an isOnlyUserOf helper? I know we have one for SDNode... but, I couldn't find one in IR.
3859 ↗	(On Diff #65934)	I->hasOneUse()? Or do you care specifically about the users() list? In any case, no need to compute std::distance.
3860 ↗	(On Diff #65934)	I->user_begin()?
4251 ↗	(On Diff #65934)	FRem looks out of place. Did you mean URem?
4254 ↗	(On Diff #65934)	To expand on what Eli said, if we have division by a non-zero constant, then: It may be efficiently lowered. Since the constant is non-zero, it doesn't need predication.
4264 ↗	(On Diff #65934)	And, correspondingly, this should probably be FRem.
5070 ↗	(On Diff #65934)	As long as you're touching this - remove this comment?

gilr added inline comments. Jul 31 2016, 8:19 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
3327 ↗	(On Diff #65934)	So IIRC I took it from CostModelAnalysis::getInstructionCost(), but I'll remove it if it makes no sense.
3833 ↗	(On Diff #65934)	Actually the vectorizer currently seems to expect inst-combine to do so (see the deleted VEC-IC case in if-pred-stores.ll), but IINM the cost model didn't reflect that. I agree, there's the general issue of generating efficient code here vs relying on later passes to clean up, which I think was also brought up here.
3835 ↗	(On Diff #65934)	No, will fix.
3841 ↗	(On Diff #65934)	Me either - anyone?
3859 ↗	(On Diff #65934)	Right, will fix.
3860 ↗	(On Diff #65934)	Right, will fix.
4251 ↗	(On Diff #65934)	Yes.
4264 ↗	(On Diff #65934)	Indeed.

mkuper added inline comments. Aug 1 2016, 2:54 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
3327 ↗	(On Diff #65934)	I'm not sure what makes sense here, to be honest. What cost, in the final generated code, are you trying to account for? An extra register copy?
4251 ↗	(On Diff #65934)	Too bad there wasn't a test that would have failed because of this. ;-)

Implemented several reviewer comments, notably avoiding predication when dividing by a non-zero constant.

Added may-divide-by-zero logic to cost model (was missing in previous patch); moved logic to its own helper function.
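The may-divide-by-zero rule mentioned in this update can be sketched in a few lines. The version below is a toy model that does not use LLVM's IR classes: the `Divisor` struct and `needsPredication` helper are hypothetical stand-ins introduced for this example, and only the name `mayDivideByZero` comes from the review discussion.

```cpp
#include <cstdint>

// Hypothetical stand-in for the divisor operand of a div/rem instruction.
// isConstant is true only when the divisor is a compile-time integer
// constant, in which case value holds it.
struct Divisor {
  bool isConstant;
  std::int64_t value;
};

// Conservative check: anything that is not a known non-zero constant is
// treated as possibly zero.
inline bool mayDivideByZero(const Divisor &d) {
  return !d.isConstant || d.value == 0;
}

// A div/rem needs predication only if it sits in a conditionally executed
// block and its divisor might be zero.
inline bool needsPredication(bool inConditionalBlock, const Divisor &d) {
  return inConditionalBlock && mayDivideByZero(d);
}
```

Under this model a conditional `x / 7` is left unpredicated, while a conditional `x / y` with unknown `y` is scalarized behind its predicate.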

More drive-by comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
3835–3844 ↗	(On Diff #66483)	Would be good to describe the high-level strategy of how we predicate these instructions, perhaps with an IR example.
3860–3861 ↗	(On Diff #66483)	I think that Twine knows how to concatenate string-like things. You only need the explicit ctor on the first one.
test/Transforms/LoopVectorize/if-pred-non-void.ll
18–51 ↗	(On Diff #66483)	I think we're pretty consistent about using uppercase for the named regexes. That helps readability.

gilr added inline comments. Aug 4 2016, 3:56 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
3331 ↗	(On Diff #66483)	I actually wasn't trying to model any specific cost in the generated code, just trying to be consistent about accounting for every generated instruction at IR level (and letting TTI decide their cost). So if PHIs have zero cost by definition/convention and should not be taken into account in cost models then I should just remove this. Otherwise, placing the call now makes sure we don't miss that cost if targets start modelling it. What say you?
3860–3861 ↗	(On Diff #66483)	Right, will fix.

mkuper added inline comments. Aug 4 2016, 9:28 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
3331 ↗	(On Diff #66483)	We're not even consistent about it in our different cost models - CostModelAnalysis::getInstructionCost() calls getCFInstrCost(), while the (admittedly old) BBVectorizer cost model just returns 0. But I get what you're saying. If you prefer to leave it, leave it, but I think it'd be nice to document the fact you don't currently expect a real cost here.

Merged with prequel patch r277595 (D23013).
Changed named registers in new lit test to uppercase.
Documented predication logic
Removed unnecessary Twine ctor calls.

LGTM, but please wait a bit for anemet, in case he also wants to review this in non-drive-by-mode. :-)

This revision is now accepted and ready to land. Aug 5 2016, 10:14 AM

anemet added inline comments. Aug 9 2016, 12:01 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
3385–3386 ↗	(On Diff #66949)	There is something wrong with this sentence.
3388 ↗	(On Diff #66949)	Predicate->Predicated Same in the other functions.
3416–3417 ↗	(On Diff #66949)	Explain in the comment how this one is different from the previous one.
3431–3432 ↗	(On Diff #66949)	Same here.
3433 ↗	(On Diff #66949)	Type* -> Type * Did you run clang-format on the diff?
4105 ↗	(On Diff #66949)	auto *
4108–4115 ↗	(On Diff #66949)	Is this any different than OpInst->hasOneUse?
4135–4138 ↗	(On Diff #66949)	Is there a test for the non-insertelt case?
4138 ↗	(On Diff #66949)	Why is the undef correct here?
4347 ↗	(On Diff #66949)	auto *

gilr added inline comments. Aug 10 2016, 3:53 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4108–4115 ↗	(On Diff #66949)	Yes, for Instructions that use the same value more than once (see Michael's comment).
4135–4138 ↗	(On Diff #66949)	No. We currently always create an insertelement on scalarization. I added support for the non-insertelement case for completeness, since predication is done separately from scalarization. IIUC this case will need to be supported if & when Matthew's patch is committed, but for now it's really FFU. I'll replace this case with an assertion for this patch and leave it to Matthew to resurrect in his patch as needed.
4138 ↗	(On Diff #66949)	We are re-introducing the original scalar conditional execution of an instruction here. This undef can reach either a select that will blend it out, or a Use dominated by this instruction's BB that is either predicated by (at least) the same predicate, which won't use the undef, or not predicated due to no side effects, where undef would be safe
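The distinction raised in the 4108–4115 thread, between a value having one use and a value having one user, can be shown with a toy use list. Nothing here is LLVM's real `Value`/`Use` API: the `Use` struct and both helpers are hypothetical and exist only to illustrate the point that one instruction may use the same value more than once.

```cpp
#include <vector>

// Hypothetical use record: the id of the instruction that uses the value.
struct Use {
  int user;
};

// True when the value appears in exactly one operand slot overall.
inline bool hasOneUse(const std::vector<Use> &uses) {
  return uses.size() == 1;
}

// True when every use belongs to the same instruction, however many operand
// slots of that instruction reference the value.
inline bool hasSingleUser(const std::vector<Use> &uses) {
  if (uses.empty())
    return false;
  for (const Use &u : uses)
    if (u.user != uses.front().user)
      return false;
  return true;
}
```

For something like `%m = mul i32 %x, %x`, the value `%x` has two uses but a single user, so a one-use check is the more conservative of the two.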

mssimpso added inline comments. Aug 10 2016, 5:06 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4135–4138 ↗	(On Diff #66949)	Gil, For the non-insert cases that might arise with my patch, I think it would be better to leave the case here (don't add an assert), but include a test with this patch that would break with the other patch. Does that make sense? That will keep the patches better self-contained and prevent us from having to revisit this. You will need a test where the div only feeds an instruction that will also be scalar (like a GEP). If this patch lands before the other one, it will be cleaner if in the other we just have to change the test. Presumably, that would involve replacing the PHI for the inserts with a PHI for the predicated instruction, since the inserts will no longer be there? If the other patch lands first, there will be no issue since you'll already have a test for the non-insert case.

Ayal added inline comments. Aug 10 2016, 6:33 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
3400 ↗	(On Diff #66949)	Suggest to first add the cost of the InsertElement and then the (optional) cost of the Phi, just to keep things in the same order they will be executed. Furthermore, the cost of Extract should precede that of Insert.
4346–4347 ↗	(On Diff #66949)	Suggest to rename Op2 to Divisor. It may be worthwhile to generalize and check isKnownNonZero(). Non-zero divisors that are not compile-time constants will not be converted into multiplication, so we will still end up scalarizing the division, but can do so w/o predication.

Also, store predication is currently behind a flag that defaults to false (-enable-cond-stores-vec). Since you're reusing the store predication logic, I'm wondering if the mayDivideByZero cases should be under the flag as well. What do you think?

Matt.

In D22918#511194, @mssimpso wrote:

Also, store predication is currently behind a flag that defaults to false (-enable-cond-stores-vec). Since you're reusing the store predication logic, I'm wondering if the mayDivideByZero cases should be under the flag as well. What do you think?

Matt.

The reason enable-cond-stores-vec defaults to false is the lack of cost modeling for predicated stores.
The change to getScalarizationOverhead() is supposed to solve this for divs. But I guess it depends on the real-world performance impact this patch has.
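The cost-modeling idea behind the getScalarizationOverhead() change can be sketched as a standalone function. This is a simplified model under stated assumptions: per-lane costs are uniform and supplied through a hypothetical `UnitCosts` struct rather than queried from TTI, and the function name is made up for the example.

```cpp
// Hypothetical per-instruction costs; in the vectorizer these would come
// from the target's cost model rather than from fixed numbers.
struct UnitCosts {
  unsigned insert;
  unsigned extract;
  unsigned phi;
};

// Overhead of moving numElements lanes in and/or out of a vector when an
// instruction is scalarized. When the instruction is also predicated, each
// insert is paired with a PHI, and the total is halved on the assumption
// that the predicated block runs about half the time.
inline unsigned scalarizationOverhead(unsigned numElements, bool insert,
                                      bool extract, bool predicated,
                                      const UnitCosts &costs) {
  unsigned cost = 0;
  for (unsigned i = 0; i < numElements; ++i) {
    if (extract)
      cost += costs.extract;
    if (insert) {
      cost += costs.insert;
      if (predicated)
        cost += costs.phi;
    }
  }
  if (predicated)
    cost /= 2;
  return cost;
}
```

With unit insert and extract costs and a zero-cost PHI, a four-lane result costs 4 to insert when unpredicated and 2 when predicated, which shows how the 50% assumption feeds into the comparison against the scalar loop.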

Per Matt's comment: code continues to support the non-insertelement case; added an FFU test that should fail once the vectorizer supports direct scalar-scalar use.
Implemented (hopefully) all other review comments
Ran clang-format again

mssimpso added inline comments. Aug 11 2016, 2:26 PM

test/Transforms/LoopVectorize/if-pred-non-void.ll
136–138 ↗	(On Diff #67673)	Thanks for adding the additional "future" test. I don't think it will exercise the non-insert case, though. I'm very sorry for not being more clear previously. Here, %rsd will always have to be inserted into a vector since it will be directly used by a select instruction, which will remain vectorized. I didn't think of this when I last commented. But I think if you add an additional instruction, this should produce the desired effect. Something like:

    if.then:
      %tmp = sdiv i32 %psd, %lsd
      %rsd = sdiv i32 %tmp, %lsd
      br label %if.end

When I ran the modified test with this patch and the scalar patch, the non-insert case was used for %tmp and the insert case was used for %rsd. This makes sense because %tmp is only used by %rsd (will be scalar), and %rsd will again feed the vector select.

gilr added inline comments. Aug 12 2016, 2:01 PM

test/Transforms/LoopVectorize/if-pred-non-void.ll
136–138 ↗	(On Diff #67673)	Argh, sorry about that. Your explanation was clear - just a hasty implementation on my side :( Yes, the second sdiv should go under the same condition - will fix.

Fixed FFU test case.

mssimpso added inline comments. Aug 15 2016, 11:39 AM

test/Transforms/LoopVectorize/if-pred-non-void.ll
137–139 ↗	(On Diff #67972)	The test looks good to me now. Thanks!

Adam,

Do you have any other comments? Am I good to go with this change?
Thanks!

Gil, I will look at this today.

anemet added inline comments. Aug 17 2016, 11:58 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4070–4096 ↗	(On Diff #67972)	I would also include the select instruction in this excerpt (omitting anything in between with a ...). Then you can explain in the initial comment that the value produced on the false branch is not used (the conditional execution is only reintroduced to avoid side-effects).
4076 ↗	(On Diff #67972)	"So for the first element of a scalarized instruction, e.g."
4111–4118 ↗	(On Diff #67972)	OK, is this difference relevant? If yes, add a helper either in this module or at a more global place and use it. You can probably also write this with std::all_of or something.
4141 ↗	(On Diff #67972)	"We are re-introducing the original scalar conditional execution of an instruction here. This undef can reach either" Makes sense, but we need a comment for this. I think the best is to explain this on the excerpt at the beginning, see my comment there.
test/Transforms/LoopVectorize/if-pred-non-void.ll
8–9 ↗	(On Diff #67972)	As a demo for how this works it would be actually good to include at least one of the second element sequences as well.
20–21 ↗	(On Diff #67972)	Can you please name and match this extract as well, it helps reading. Everywhere in these tests.
22–23 ↗	(On Diff #67972)	It would be also good to check the extractelements feeding the divs here.

gilr added inline comments. Aug 18 2016, 2:32 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
4111–4118 ↗	(On Diff #67972)	It shouldn't make a difference for the currently predicated instructions. I'll replace with hasOneUse with a comment to capture the possible (conservative) inaccuracy.

Implemented Adam's comments.

Ping

LGTM too with the comments addressed below. Thanks!

lib/Transforms/Vectorize/LoopVectorize.cpp
4079 ↗	(On Diff #68616)	Please remove the '; pred =' comments You may want to add a // ... after the for.body: label in all these loops. That is where the div operand would be loaded, etc.
4091 ↗	(On Diff #68616)	s/selected-out/if-converted using a select/
4101–4102 ↗	(On Diff #68616)	%33 and %34 are not used, please remove

Closed by commit rL279620: [Loop Vectorizer] Support predication of div/rem (authored by gilr). Aug 24 2016, 4:46 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp	307 lines
llvm/trunk/test/Transforms/LoopVectorize/if-pred-non-void.ll	173 lines
llvm/trunk/test/Transforms/LoopVectorize/if-pred-not-when-safe.ll	90 lines
llvm/trunk/test/Transforms/LoopVectorize/if-pred-stores.ll	31 lines

Diff 69097

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp


[... 380 lines not shown ...]	protected:
void fixFirstOrderRecurrence(PHINode *Phi);		void fixFirstOrderRecurrence(PHINode *Phi);

/// \brief The Loop exit block may have single value PHI nodes where the		/// \brief The Loop exit block may have single value PHI nodes where the
/// incoming value is 'Undef'. While vectorizing we only handled real values		/// incoming value is 'Undef'. While vectorizing we only handled real values
/// that were defined inside the loop. Here we fix the 'undef case'.		/// that were defined inside the loop. Here we fix the 'undef case'.
/// See PR14725.		/// See PR14725.
void fixLCSSAPHIs();		void fixLCSSAPHIs();

/// Predicate conditional stores on their respective conditions.		/// Predicate conditional instructions that require predication on their
void predicateStores();		/// respective conditions.
		void predicateInstructions();

/// Shrinks vector element sizes based on information in "MinBWs".		/// Shrinks vector element sizes based on information in "MinBWs".
void truncateToMinimalBitwidths();		void truncateToMinimalBitwidths();

/// A helper function that computes the predicate of the block BB, assuming		/// A helper function that computes the predicate of the block BB, assuming
/// that the header block of the loop is set to True. It returns the entry		/// that the header block of the loop is set to True. It returns the entry
/// mask for the block BB.		/// mask for the block BB.
VectorParts createBlockInMask(BasicBlock *BB);		VectorParts createBlockInMask(BasicBlock *BB);
[... 10 lines not shown ...]	protected:
void widenPHIInstruction(Instruction *PN, VectorParts &Entry, unsigned UF,		void widenPHIInstruction(Instruction *PN, VectorParts &Entry, unsigned UF,
unsigned VF, PhiVector *PV);		unsigned VF, PhiVector *PV);

/// Insert the new loop to the loop hierarchy and pass manager		/// Insert the new loop to the loop hierarchy and pass manager
/// and update the analysis passes.		/// and update the analysis passes.
void updateAnalysis();		void updateAnalysis();

/// This instruction is un-vectorizable. Implement it as a sequence		/// This instruction is un-vectorizable. Implement it as a sequence
/// of scalars. If \p IfPredicateStore is true we need to 'hide' each		/// of scalars. If \p IfPredicateInstr is true we need to 'hide' each
/// scalarized instruction behind an if block predicated on the control		/// scalarized instruction behind an if block predicated on the control
/// dependence of the instruction.		/// dependence of the instruction.
virtual void scalarizeInstruction(Instruction *Instr,		virtual void scalarizeInstruction(Instruction *Instr,
bool IfPredicateStore = false);		bool IfPredicateInstr = false);

/// Vectorize Load and Store instructions,		/// Vectorize Load and Store instructions,
virtual void vectorizeMemoryInstruction(Instruction *Instr);		virtual void vectorizeMemoryInstruction(Instruction *Instr);

/// Create a broadcast instruction. This method generates a broadcast		/// Create a broadcast instruction. This method generates a broadcast
/// instruction (shuffle) for loop invariant values and for the induction		/// instruction (shuffle) for loop invariant values and for the induction
/// value. If this is the induction variable then we extend it to N, N+1, ...		/// value. If this is the induction variable then we extend it to N, N+1, ...
/// this is needed because each iteration in the loop corresponds to a SIMD		/// this is needed because each iteration in the loop corresponds to a SIMD
[... 189 lines not shown ...]	protected:
/// ScalarIVMap maps induction variables from the original loop that are not		/// ScalarIVMap maps induction variables from the original loop that are not
/// vectorized to their scalar equivalents in the vector loop. Maintaining a		/// vectorized to their scalar equivalents in the vector loop. Maintaining a
/// separate map for scalarized induction variables allows us to avoid		/// separate map for scalarized induction variables allows us to avoid
/// unnecessary scalar-to-vector-to-scalar conversions.		/// unnecessary scalar-to-vector-to-scalar conversions.
DenseMap<Value *, SmallVector<Value *, 8>> ScalarIVMap;		DenseMap<Value *, SmallVector<Value *, 8>> ScalarIVMap;

/// Store instructions that should be predicated, as a pair		/// Store instructions that should be predicated, as a pair
/// <StoreInst, Predicate>		/// <StoreInst, Predicate>
SmallVector<std::pair<StoreInst *, Value *>, 4> PredicatedStores;		SmallVector<std::pair<Instruction *, Value *>, 4> PredicatedInstructions;
EdgeMaskCache MaskCache;		EdgeMaskCache MaskCache;
/// Trip count of the original loop.		/// Trip count of the original loop.
Value *TripCount;		Value *TripCount;
/// Trip count of the widened loop (TripCount - TripCount % (VF*UF))		/// Trip count of the widened loop (TripCount - TripCount % (VF*UF))
Value *VectorTripCount;		Value *VectorTripCount;

/// Map of scalar integer values to the smallest bitwidth they can be legally		/// Map of scalar integer values to the smallest bitwidth they can be legally
/// represented as. The vector equivalents of these values should be truncated		/// represented as. The vector equivalents of these values should be truncated
[... 13 lines not shown ...]	InnerLoopUnroller(Loop *OrigLoop, PredicatedScalarEvolution &PSE,
const TargetLibraryInfo *TLI,		const TargetLibraryInfo *TLI,
const TargetTransformInfo *TTI, AssumptionCache *AC,		const TargetTransformInfo *TTI, AssumptionCache *AC,
OptimizationRemarkEmitter *ORE, unsigned UnrollFactor)		OptimizationRemarkEmitter *ORE, unsigned UnrollFactor)
: InnerLoopVectorizer(OrigLoop, PSE, LI, DT, TLI, TTI, AC, ORE, 1,		: InnerLoopVectorizer(OrigLoop, PSE, LI, DT, TLI, TTI, AC, ORE, 1,
UnrollFactor) {}		UnrollFactor) {}

private:		private:
void scalarizeInstruction(Instruction *Instr,		void scalarizeInstruction(Instruction *Instr,
bool IfPredicateStore = false) override;		bool IfPredicateInstr = false) override;
void vectorizeMemoryInstruction(Instruction *Instr) override;		void vectorizeMemoryInstruction(Instruction *Instr) override;
Value *getBroadcastInstrs(Value *V) override;		Value *getBroadcastInstrs(Value *V) override;
Value *getStepVector(Value *Val, int StartIdx, Value *Step,		Value *getStepVector(Value *Val, int StartIdx, Value *Step,
Instruction::BinaryOps Opcode =		Instruction::BinaryOps Opcode =
Instruction::BinaryOpsEnd) override;		Instruction::BinaryOpsEnd) override;
Value *reverseVector(Value *Vec) override;		Value *reverseVector(Value *Vec) override;
};		};

[... 2,096 lines not shown ...]	if (CreateGatherScatter) {
NewLI = Builder.CreateAlignedLoad(VecPtr, Alignment, "wide.load");		NewLI = Builder.CreateAlignedLoad(VecPtr, Alignment, "wide.load");
Entry[Part] = Reverse ? reverseVector(NewLI) : NewLI;		Entry[Part] = Reverse ? reverseVector(NewLI) : NewLI;
}		}
addMetadata(NewLI, LI);		addMetadata(NewLI, LI);
}		}
}		}

void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr,		void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr,
bool IfPredicateStore) {		bool IfPredicateInstr) {
assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");		assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
		DEBUG(dbgs() << "LV: Scalarizing"
		<< (IfPredicateInstr ? " and predicating:" : ":") << *Instr
		<< '\n');
// Holds vector parameters or scalars, in case of uniform vals.		// Holds vector parameters or scalars, in case of uniform vals.
SmallVector<VectorParts, 4> Params;		SmallVector<VectorParts, 4> Params;

setDebugLocFromInst(Builder, Instr);		setDebugLocFromInst(Builder, Instr);

// Find all of the vectorized parameters.		// Find all of the vectorized parameters.
for (Value *SrcOp : Instr->operands()) {		for (Value *SrcOp : Instr->operands()) {
// If we are accessing the old induction variable, use the new one.		// If we are accessing the old induction variable, use the new one.
[... 27 lines not shown ...]	void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr,

Value *UndefVec =		Value *UndefVec =
IsVoidRetTy ? nullptr		IsVoidRetTy ? nullptr
: UndefValue::get(VectorType::get(Instr->getType(), VF));		: UndefValue::get(VectorType::get(Instr->getType(), VF));
// Create a new entry in the WidenMap and initialize it to Undef or Null.		// Create a new entry in the WidenMap and initialize it to Undef or Null.
VectorParts &VecResults = WidenMap.splat(Instr, UndefVec);		VectorParts &VecResults = WidenMap.splat(Instr, UndefVec);

VectorParts Cond;		VectorParts Cond;
if (IfPredicateStore) {		if (IfPredicateInstr) {
assert(Instr->getParent()->getSinglePredecessor() &&		assert(Instr->getParent()->getSinglePredecessor() &&
"Only support single predecessor blocks");		"Only support single predecessor blocks");
Cond = createEdgeMask(Instr->getParent()->getSinglePredecessor(),		Cond = createEdgeMask(Instr->getParent()->getSinglePredecessor(),
Instr->getParent());		Instr->getParent());
}		}

// For each vector unroll 'part':		// For each vector unroll 'part':
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
// For each scalar that we create:		// For each scalar that we create:
for (unsigned Width = 0; Width < VF; ++Width) {		for (unsigned Width = 0; Width < VF; ++Width) {

// Start if-block.		// Start if-block.
Value *Cmp = nullptr;		Value *Cmp = nullptr;
if (IfPredicateStore) {		if (IfPredicateInstr) {
Cmp = Builder.CreateExtractElement(Cond[Part], Builder.getInt32(Width));		Cmp = Builder.CreateExtractElement(Cond[Part], Builder.getInt32(Width));
Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cmp,		Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cmp,
ConstantInt::get(Cmp->getType(), 1));		ConstantInt::get(Cmp->getType(), 1));
}		}

Instruction *Cloned = Instr->clone();		Instruction *Cloned = Instr->clone();
if (!IsVoidRetTy)		if (!IsVoidRetTy)
Cloned->setName(Instr->getName() + ".cloned");		Cloned->setName(Instr->getName() + ".cloned");
[... 22 lines not shown ...]	for (unsigned Width = 0; Width < VF; ++Width) {
AC->registerAssumption(II);		AC->registerAssumption(II);

// If the original scalar returns a value we need to place it in a vector		// If the original scalar returns a value we need to place it in a vector
// so that future users will be able to use it.		// so that future users will be able to use it.
if (!IsVoidRetTy)		if (!IsVoidRetTy)
VecResults[Part] = Builder.CreateInsertElement(VecResults[Part], Cloned,		VecResults[Part] = Builder.CreateInsertElement(VecResults[Part], Cloned,
Builder.getInt32(Width));		Builder.getInt32(Width));
// End if-block.		// End if-block.
if (IfPredicateStore)		if (IfPredicateInstr)
PredicatedStores.push_back(		PredicatedInstructions.push_back(std::make_pair(Cloned, Cmp));
std::make_pair(cast<StoreInst>(Cloned), Cmp));
}		}
}		}
}		}

PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start,		PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start,
Value *End, Value *Step,		Value *End, Value *Step,
Instruction *DL) {		Instruction *DL) {
BasicBlock *Header = L->getHeader();		BasicBlock *Header = L->getHeader();
[... 514 lines not shown ...]	static Value *addFastMathFlag(Value *V) {
if (isa<FPMathOperator>(V)) {		if (isa<FPMathOperator>(V)) {
FastMathFlags Flags;		FastMathFlags Flags;
Flags.setUnsafeAlgebra();		Flags.setUnsafeAlgebra();
cast<Instruction>(V)->setFastMathFlags(Flags);		cast<Instruction>(V)->setFastMathFlags(Flags);
}		}
return V;		return V;
}		}

/// Estimate the overhead of scalarizing a value. Insert and Extract are set if		/// \brief Estimate the overhead of scalarizing a value based on its type.
/// the result needs to be inserted and/or extracted from vectors.		/// Insert and Extract are set if the result needs to be inserted and/or
		/// extracted from vectors.
		/// If the instruction is also to be predicated, add the cost of a PHI
		/// node to the insertion cost.
static unsigned getScalarizationOverhead(Type *Ty, bool Insert, bool Extract,		static unsigned getScalarizationOverhead(Type *Ty, bool Insert, bool Extract,
		bool Predicated,
const TargetTransformInfo &TTI) {		const TargetTransformInfo &TTI) {
if (Ty->isVoidTy())		if (Ty->isVoidTy())
return 0;		return 0;

assert(Ty->isVectorTy() && "Can only scalarize vectors");		assert(Ty->isVectorTy() && "Can only scalarize vectors");
unsigned Cost = 0;		unsigned Cost = 0;

for (unsigned I = 0, E = Ty->getVectorNumElements(); I < E; ++I) {		for (unsigned I = 0, E = Ty->getVectorNumElements(); I < E; ++I) {
if (Insert)
Cost += TTI.getVectorInstrCost(Instruction::InsertElement, Ty, I);
if (Extract)		if (Extract)
Cost += TTI.getVectorInstrCost(Instruction::ExtractElement, Ty, I);		Cost += TTI.getVectorInstrCost(Instruction::ExtractElement, Ty, I);
		if (Insert) {
		Cost += TTI.getVectorInstrCost(Instruction::InsertElement, Ty, I);
		if (Predicated)
		Cost += TTI.getCFInstrCost(Instruction::PHI);
		}
}		}

		// We assume that if-converted blocks have a 50% chance of being executed.
		// Predicated scalarized instructions are avoided due to the CF that bypasses
		// turned off lanes. The extracts and inserts will be sinked/hoisted to the
		// predicated basic-block and are subjected to the same assumption.
		if (Predicated)
		Cost /= 2;

return Cost;		return Cost;
}		}

		/// \brief Estimate the overhead of scalarizing an Instruction based on the
		/// types of its operands and return value.
		static unsigned getScalarizationOverhead(SmallVectorImpl<Type *> &OpTys,
		Type *RetTy, bool Predicated,
		const TargetTransformInfo &TTI) {
		unsigned ScalarizationCost =
		getScalarizationOverhead(RetTy, true, false, Predicated, TTI);

		for (Type *Ty : OpTys)
		ScalarizationCost +=
		getScalarizationOverhead(Ty, false, true, Predicated, TTI);

		return ScalarizationCost;
		}

		/// \brief Estimate the overhead of scalarizing an instruction. This is a
		/// convenience wrapper for the type-based getScalarizationOverhead API.
		static unsigned getScalarizationOverhead(Instruction *I, unsigned VF,
		bool Predicated,
		const TargetTransformInfo &TTI) {
		if (VF == 1)
		return 0;

		Type *RetTy = ToVectorTy(I->getType(), VF);

		SmallVector<Type *, 4> OpTys;
		unsigned OperandsNum = I->getNumOperands();
		for (unsigned OpInd = 0; OpInd < OperandsNum; ++OpInd)
		OpTys.push_back(ToVectorTy(I->getOperand(OpInd)->getType(), VF));

		return getScalarizationOverhead(OpTys, RetTy, Predicated, TTI);
		}

// Estimate cost of a call instruction CI if it were vectorized with factor VF.		// Estimate cost of a call instruction CI if it were vectorized with factor VF.
// Return the cost of the instruction, including scalarization overhead if it's		// Return the cost of the instruction, including scalarization overhead if it's
// needed. The flag NeedToScalarize shows if the call needs to be scalarized -		// needed. The flag NeedToScalarize shows if the call needs to be scalarized -
// i.e. either vector version isn't available, or is too expensive.		// i.e. either vector version isn't available, or is too expensive.
static unsigned getVectorCallCost(CallInst *CI, unsigned VF,		static unsigned getVectorCallCost(CallInst *CI, unsigned VF,
const TargetTransformInfo &TTI,		const TargetTransformInfo &TTI,
const TargetLibraryInfo *TLI,		const TargetLibraryInfo *TLI,
bool &NeedToScalarize) {		bool &NeedToScalarize) {
[... 14 lines not shown ...]	static unsigned getVectorCallCost(CallInst *CI, unsigned VF,

// Compute corresponding vector type for return value and arguments.		// Compute corresponding vector type for return value and arguments.
Type *RetTy = ToVectorTy(ScalarRetTy, VF);		Type *RetTy = ToVectorTy(ScalarRetTy, VF);
for (Type *ScalarTy : ScalarTys)		for (Type *ScalarTy : ScalarTys)
Tys.push_back(ToVectorTy(ScalarTy, VF));		Tys.push_back(ToVectorTy(ScalarTy, VF));

// Compute costs of unpacking argument values for the scalar calls and		// Compute costs of unpacking argument values for the scalar calls and
// packing the return values to a vector.		// packing the return values to a vector.
unsigned ScalarizationCost =		unsigned ScalarizationCost = getScalarizationOverhead(Tys, RetTy, false, TTI);
getScalarizationOverhead(RetTy, true, false, TTI);
for (Type *Ty : Tys)
ScalarizationCost += getScalarizationOverhead(Ty, false, true, TTI);

unsigned Cost = ScalarCallCost * VF + ScalarizationCost;		unsigned Cost = ScalarCallCost * VF + ScalarizationCost;

// If we can't emit a vector call for this function, then the currently found		// If we can't emit a vector call for this function, then the currently found
// cost is the cost we need to return.		// cost is the cost we need to return.
NeedToScalarize = true;		NeedToScalarize = true;
if (!TLI || !TLI->isFunctionVectorizable(FnName, VF) || CI->isNoBuiltin())		if (!TLI || !TLI->isFunctionVectorizable(FnName, VF) || CI->isNoBuiltin())
return Cost;		return Cost;
[... 403 lines elided ...]	for (PHINode *Phi : PHIsToFix) {
Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst);		Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst);
} // end of for each Phi in PHIsToFix.		} // end of for each Phi in PHIsToFix.

fixLCSSAPHIs();		fixLCSSAPHIs();

// Make sure DomTree is updated.		// Make sure DomTree is updated.
updateAnalysis();		updateAnalysis();

predicateStores();		predicateInstructions();

// Remove redundant induction instructions.		// Remove redundant induction instructions.
cse(LoopVectorBody);		cse(LoopVectorBody);
}		}

void InnerLoopVectorizer::fixFirstOrderRecurrence(PHINode *Phi) {		void InnerLoopVectorizer::fixFirstOrderRecurrence(PHINode *Phi) {

// This is the second phase of vectorizing first-order recurrences. An		// This is the second phase of vectorizing first-order recurrences. An
[... 150 lines elided ...]	for (Instruction &LEI : *LoopExitBlock) {
auto *LCSSAPhi = dyn_cast<PHINode>(&LEI);		auto *LCSSAPhi = dyn_cast<PHINode>(&LEI);
if (!LCSSAPhi)		if (!LCSSAPhi)
break;		break;
if (LCSSAPhi->getNumIncomingValues() == 1)		if (LCSSAPhi->getNumIncomingValues() == 1)
LCSSAPhi->addIncoming(UndefValue::get(LCSSAPhi->getType()),		LCSSAPhi->addIncoming(UndefValue::get(LCSSAPhi->getType()),
LoopMiddleBlock);		LoopMiddleBlock);
}		}
}		}

void InnerLoopVectorizer::predicateStores() {		void InnerLoopVectorizer::predicateInstructions() {
for (auto KV : PredicatedStores) {
		// For each instruction I marked for predication on value C, split I into its
		// own basic block to form an if-then construct over C.
		// Since I may be fed by extractelement and/or be feeding an insertelement
		// generated during scalarization we try to move such instructions into the
		// predicated basic block as well. For the insertelement this also means that
		// the PHI will be created for the resulting vector rather than for the
		// scalar instruction.
		// So for some predicated instruction, e.g. the conditional sdiv in:
		//
		// for.body:
		// ...
		// %add = add nsw i32 %mul, %0
		// %cmp5 = icmp sgt i32 %2, 7
		// br i1 %cmp5, label %if.then, label %if.end
		//
		// if.then:
		// %div = sdiv i32 %0, %1
		// br label %if.end
		//
		// if.end:
		// %x.0 = phi i32 [ %div, %if.then ], [ %add, %for.body ]
		//
		// the sdiv at this point is scalarized and if-converted using a select.
		// The inactive elements in the vector are not used, but the predicated
		// instruction is still executed for all vector elements, essentially:
		//
		// vector.body:
		// ...
		// %17 = add nsw <2 x i32> %16, %wide.load
		// %29 = extractelement <2 x i32> %wide.load, i32 0
		// %30 = extractelement <2 x i32> %wide.load51, i32 0
		// %31 = sdiv i32 %29, %30
		// %32 = insertelement <2 x i32> undef, i32 %31, i32 0
		// %35 = extractelement <2 x i32> %wide.load, i32 1
		// %36 = extractelement <2 x i32> %wide.load51, i32 1
		// %37 = sdiv i32 %35, %36
		// %38 = insertelement <2 x i32> %32, i32 %37, i32 1
		// %predphi = select <2 x i1> %26, <2 x i32> %38, <2 x i32> %17
		//
		// Predication will now re-introduce the original control flow to avoid false
		// side-effects by the sdiv instructions on the inactive elements, yielding
		// (after cleanup):
		//
		// vector.body:
		// ...
		// %5 = add nsw <2 x i32> %4, %wide.load
		// %8 = icmp sgt <2 x i32> %wide.load52, <i32 7, i32 7>
		// %9 = extractelement <2 x i1> %8, i32 0
		// br i1 %9, label %pred.sdiv.if, label %pred.sdiv.continue
		//
		// pred.sdiv.if:
		// %10 = extractelement <2 x i32> %wide.load, i32 0
		// %11 = extractelement <2 x i32> %wide.load51, i32 0
		// %12 = sdiv i32 %10, %11
		// %13 = insertelement <2 x i32> undef, i32 %12, i32 0
		// br label %pred.sdiv.continue
		//
		// pred.sdiv.continue:
		// %14 = phi <2 x i32> [ undef, %vector.body ], [ %13, %pred.sdiv.if ]
		// %15 = extractelement <2 x i1> %8, i32 1
		// br i1 %15, label %pred.sdiv.if54, label %pred.sdiv.continue55
		//
		// pred.sdiv.if54:
		// %16 = extractelement <2 x i32> %wide.load, i32 1
		// %17 = extractelement <2 x i32> %wide.load51, i32 1
		// %18 = sdiv i32 %16, %17
		// %19 = insertelement <2 x i32> %14, i32 %18, i32 1
		// br label %pred.sdiv.continue55
		//
		// pred.sdiv.continue55:
		// %20 = phi <2 x i32> [ %14, %pred.sdiv.continue ], [ %19, %pred.sdiv.if54 ]
		// %predphi = select <2 x i1> %8, <2 x i32> %20, <2 x i32> %5

		for (auto KV : PredicatedInstructions) {
BasicBlock::iterator I(KV.first);		BasicBlock::iterator I(KV.first);
auto *BB = SplitBlock(I->getParent(), &*std::next(I), DT, LI);		BasicBlock *Head = I->getParent();
		auto *BB = SplitBlock(Head, &*std::next(I), DT, LI);
auto *T = SplitBlockAndInsertIfThen(KV.second, &*I, /*Unreachable=*/false,		auto *T = SplitBlockAndInsertIfThen(KV.second, &*I, /*Unreachable=*/false,
/*BranchWeights=*/nullptr, DT, LI);		/*BranchWeights=*/nullptr, DT, LI);
I->moveBefore(T);		I->moveBefore(T);
I->getParent()->setName("pred.store.if");		// Try to move any extractelement we may have created for the predicated
BB->setName("pred.store.continue");		// instruction into the Then block.
		for (Use &Op : I->operands()) {
		auto *OpInst = dyn_cast<ExtractElementInst>(&*Op);
		if (OpInst && OpInst->hasOneUse()) // TODO: more accurately - hasOneUser()
		OpInst->moveBefore(&*I);
		}

		I->getParent()->setName(Twine("pred.") + I->getOpcodeName() + ".if");
		BB->setName(Twine("pred.") + I->getOpcodeName() + ".continue");

		// If the instruction is non-void create a Phi node at reconvergence point.
		if (!I->getType()->isVoidTy()) {
		Value *IncomingTrue = nullptr;
		Value *IncomingFalse = nullptr;

		if (I->hasOneUse() && isa<InsertElementInst>(*I->user_begin())) {
		// If the predicated instruction is feeding an insert-element, move it
		// into the Then block; Phi node will be created for the vector.
		InsertElementInst *IEI = cast<InsertElementInst>(*I->user_begin());
		IEI->moveBefore(T);
		IncomingTrue = IEI; // the new vector with the inserted element.
		IncomingFalse = IEI->getOperand(0); // the unmodified vector
		} else {
		// Phi node will be created for the scalar predicated instruction.
		IncomingTrue = &*I;
		IncomingFalse = UndefValue::get(I->getType());
		}

		BasicBlock *PostDom = I->getParent()->getSingleSuccessor();
		assert(PostDom && "Then block has multiple successors");
		PHINode *Phi =
		PHINode::Create(IncomingTrue->getType(), 2, "", &PostDom->front());
		IncomingTrue->replaceAllUsesWith(Phi);
		Phi->addIncoming(IncomingFalse, Head);
		Phi->addIncoming(IncomingTrue, I->getParent());
		}
}		}

DEBUG(DT->verifyDomTree());		DEBUG(DT->verifyDomTree());
}		}

InnerLoopVectorizer::VectorParts		InnerLoopVectorizer::VectorParts
InnerLoopVectorizer::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) {		InnerLoopVectorizer::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) {
assert(is_contained(predecessors(Dst), Src) && "Invalid edge");		assert(is_contained(predecessors(Dst), Src) && "Invalid edge");

// Look for cached value.		// Look for cached value.
[... 170 lines elided ...]	case InductionDescriptor::IK_FpInduction: {
for (unsigned part = 0; part < UF; ++part)		for (unsigned part = 0; part < UF; ++part)
Entry[part] = getStepVector(Broadcasted, VF * part, StepVal,		Entry[part] = getStepVector(Broadcasted, VF * part, StepVal,
II.getInductionOpcode());		II.getInductionOpcode());
return;		return;
}		}
}		}
}		}

		/// A helper function for checking whether an integer division-related
		/// instruction may divide by zero (in which case it must be predicated if
		/// executed conditionally in the scalar code).
		/// TODO: It may be worthwhile to generalize and check isKnownNonZero().
		/// Non-zero divisors that are non compile-time constants will not be
		/// converted into multiplication, so we will still end up scalarizing
		/// the division, but can do so w/o predication.
		static bool mayDivideByZero(Instruction &I) {
		assert((I.getOpcode() == Instruction::UDiv ||
		I.getOpcode() == Instruction::SDiv ||
		I.getOpcode() == Instruction::URem ||
		I.getOpcode() == Instruction::SRem) &&
		"Unexpected instruction");
		Value *Divisor = I.getOperand(1);
		auto *CInt = dyn_cast<ConstantInt>(Divisor);
		return !CInt || CInt->isZero();
		}

void InnerLoopVectorizer::vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV) {		void InnerLoopVectorizer::vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV) {
// For each instruction in the old loop.		// For each instruction in the old loop.
for (Instruction &I : *BB) {		for (Instruction &I : *BB) {
VectorParts &Entry = WidenMap.get(&I);		VectorParts &Entry = WidenMap.get(&I);

switch (I.getOpcode()) {		switch (I.getOpcode()) {
case Instruction::Br:		case Instruction::Br:
// Nothing to do for PHIs and BR, since we already took care of the		// Nothing to do for PHIs and BR, since we already took care of the
// loop control flow instructions.		// loop control flow instructions.
continue;		continue;
case Instruction::PHI: {		case Instruction::PHI: {
// Vectorize PHINodes.		// Vectorize PHINodes.
widenPHIInstruction(&I, Entry, UF, VF, PV);		widenPHIInstruction(&I, Entry, UF, VF, PV);
continue;		continue;
} // End of PHI.		} // End of PHI.

		case Instruction::UDiv:
		case Instruction::SDiv:
		case Instruction::SRem:
		case Instruction::URem:
		// Scalarize with predication if this instruction may divide by zero and
		// block execution is conditional, otherwise fallthrough.
		if (mayDivideByZero(I) && Legal->blockNeedsPredication(I.getParent())) {
		scalarizeInstruction(&I, true);
		continue;
		}
case Instruction::Add:		case Instruction::Add:
case Instruction::FAdd:		case Instruction::FAdd:
case Instruction::Sub:		case Instruction::Sub:
case Instruction::FSub:		case Instruction::FSub:
case Instruction::Mul:		case Instruction::Mul:
case Instruction::FMul:		case Instruction::FMul:
case Instruction::UDiv:
case Instruction::SDiv:
case Instruction::FDiv:		case Instruction::FDiv:
case Instruction::URem:
case Instruction::SRem:
case Instruction::FRem:		case Instruction::FRem:
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor: {		case Instruction::Xor: {
// Just widen binops.		// Just widen binops.
[... 877 lines elided ...]	if (I.mayWriteToMemory()) {
bool isSinglePredecessor = SI->getParent()->getSinglePredecessor();		bool isSinglePredecessor = SI->getParent()->getSinglePredecessor();

if (++NumPredStores > NumberOfStoresToPredicate || !isSafePtr ||		if (++NumPredStores > NumberOfStoresToPredicate || !isSafePtr ||
!isSinglePredecessor)		!isSinglePredecessor)
return false;		return false;
}		}
if (I.mayThrow())		if (I.mayThrow())
return false;		return false;

// The instructions below can trap.
switch (I.getOpcode()) {
default:
continue;
case Instruction::UDiv:
case Instruction::SDiv:
case Instruction::URem:
case Instruction::SRem:
return false;
}
}		}

return true;		return true;
}		}

void InterleavedAccessInfo::collectConstStrideAccesses(		void InterleavedAccessInfo::collectConstStrideAccesses(
MapVector<Instruction *, StrideDescriptor> &AccessStrideInfo,		MapVector<Instruction *, StrideDescriptor> &AccessStrideInfo,
const ValueToValueMap &Strides) {		const ValueToValueMap &Strides) {
[... 900 lines elided ...]	case Instruction::PHI: {
// First-order recurrences are replaced by vector shuffles inside the loop.		// First-order recurrences are replaced by vector shuffles inside the loop.
if (VF > 1 && Legal->isFirstOrderRecurrence(Phi))		if (VF > 1 && Legal->isFirstOrderRecurrence(Phi))
return TTI.getShuffleCost(TargetTransformInfo::SK_ExtractSubvector,		return TTI.getShuffleCost(TargetTransformInfo::SK_ExtractSubvector,
VectorTy, VF - 1, VectorTy);		VectorTy, VF - 1, VectorTy);

// TODO: IF-converted IFs become selects.		// TODO: IF-converted IFs become selects.
return 0;		return 0;
}		}
		case Instruction::UDiv:
		case Instruction::SDiv:
		case Instruction::URem:
		case Instruction::SRem:
		// We assume that if-converted blocks have a 50% chance of being executed.
		// Predicated scalarized instructions are avoided due to the CF that
		// bypasses turned off lanes. If we are not predicating, fallthrough.
		if (VF > 1 && mayDivideByZero(*I) &&
		Legal->blockNeedsPredication(I->getParent()))
		return VF * TTI.getArithmeticInstrCost(I->getOpcode(), RetTy) / 2 +
		getScalarizationOverhead(I, VF, true, TTI);
case Instruction::Add:		case Instruction::Add:
case Instruction::FAdd:		case Instruction::FAdd:
case Instruction::Sub:		case Instruction::Sub:
case Instruction::FSub:		case Instruction::FSub:
case Instruction::Mul:		case Instruction::Mul:
case Instruction::FMul:		case Instruction::FMul:
case Instruction::UDiv:
case Instruction::SDiv:
case Instruction::FDiv:		case Instruction::FDiv:
case Instruction::URem:
case Instruction::SRem:
case Instruction::FRem:		case Instruction::FRem:
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor: {		case Instruction::Xor: {
// Since we will replace the stride by 1 the multiplication should go away.		// Since we will replace the stride by 1 the multiplication should go away.
[... 219 lines elided ...]	unsigned LoopVectorizationCostModel::getInstructionCost(Instruction *I,
case Instruction::Call: {		case Instruction::Call: {
bool NeedToScalarize;		bool NeedToScalarize;
CallInst *CI = cast<CallInst>(I);		CallInst *CI = cast<CallInst>(I);
unsigned CallCost = getVectorCallCost(CI, VF, TTI, TLI, NeedToScalarize);		unsigned CallCost = getVectorCallCost(CI, VF, TTI, TLI, NeedToScalarize);
if (getVectorIntrinsicIDForCall(CI, TLI))		if (getVectorIntrinsicIDForCall(CI, TLI))
return std::min(CallCost, getVectorIntrinsicCost(CI, VF, TTI, TLI));		return std::min(CallCost, getVectorIntrinsicCost(CI, VF, TTI, TLI));
return CallCost;		return CallCost;
}		}
default: {		default:
// We are scalarizing the instruction. Return the cost of the scalar
// instruction, plus the cost of insert and extract into vector
// elements, times the vector width.
unsigned Cost = 0;

if (!RetTy->isVoidTy() && VF != 1) {
unsigned InsCost =
TTI.getVectorInstrCost(Instruction::InsertElement, VectorTy);
unsigned ExtCost =
TTI.getVectorInstrCost(Instruction::ExtractElement, VectorTy);

// The cost of inserting the results plus extracting each one of the
// operands.
Cost += VF * (InsCost + ExtCost * I->getNumOperands());
}

// The cost of executing VF copies of the scalar instruction. This opcode		// The cost of executing VF copies of the scalar instruction. This opcode
// is unknown. Assume that it is the same as 'mul'.		// is unknown. Assume that it is the same as 'mul'.
Cost += VF * TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy);		return VF * TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy) +
return Cost;		getScalarizationOverhead(I, VF, false, TTI);
}
} // end of switch.		} // end of switch.
}		}

char LoopVectorize::ID = 0;		char LoopVectorize::ID = 0;
static const char lv_name[] = "Loop Vectorization";		static const char lv_name[] = "Loop Vectorization";
INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false)		INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false)
INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)		INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
INITIALIZE_PASS_DEPENDENCY(BasicAAWrapperPass)		INITIALIZE_PASS_DEPENDENCY(BasicAAWrapperPass)
[... 41 lines elided ...]	void LoopVectorizationCostModel::collectValuesToIgnore() {
// Insert values known to be scalar into VecValuesToIgnore.		// Insert values known to be scalar into VecValuesToIgnore.
for (auto *BB : TheLoop->getBlocks())		for (auto *BB : TheLoop->getBlocks())
for (auto &I : *BB)		for (auto &I : *BB)
if (Legal->isScalarAfterVectorization(&I))		if (Legal->isScalarAfterVectorization(&I))
VecValuesToIgnore.insert(&I);		VecValuesToIgnore.insert(&I);
}		}

void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,		void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,
bool IfPredicateStore) {		bool IfPredicateInstr) {
assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");		assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
// Holds vector parameters or scalars, in case of uniform vals.		// Holds vector parameters or scalars, in case of uniform vals.
SmallVector<VectorParts, 4> Params;		SmallVector<VectorParts, 4> Params;

setDebugLocFromInst(Builder, Instr);		setDebugLocFromInst(Builder, Instr);

// Find all of the vectorized parameters.		// Find all of the vectorized parameters.
for (Value *SrcOp : Instr->operands()) {		for (Value *SrcOp : Instr->operands()) {
[... 26 lines elided ...]	void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,
// Does this instruction return a value ?		// Does this instruction return a value ?
bool IsVoidRetTy = Instr->getType()->isVoidTy();		bool IsVoidRetTy = Instr->getType()->isVoidTy();

Value *UndefVec = IsVoidRetTy ? nullptr : UndefValue::get(Instr->getType());		Value *UndefVec = IsVoidRetTy ? nullptr : UndefValue::get(Instr->getType());
// Create a new entry in the WidenMap and initialize it to Undef or Null.		// Create a new entry in the WidenMap and initialize it to Undef or Null.
VectorParts &VecResults = WidenMap.splat(Instr, UndefVec);		VectorParts &VecResults = WidenMap.splat(Instr, UndefVec);

VectorParts Cond;		VectorParts Cond;
if (IfPredicateStore) {		if (IfPredicateInstr) {
assert(Instr->getParent()->getSinglePredecessor() &&		assert(Instr->getParent()->getSinglePredecessor() &&
"Only support single predecessor blocks");		"Only support single predecessor blocks");
Cond = createEdgeMask(Instr->getParent()->getSinglePredecessor(),		Cond = createEdgeMask(Instr->getParent()->getSinglePredecessor(),
Instr->getParent());		Instr->getParent());
}		}

// For each vector unroll 'part':		// For each vector unroll 'part':
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
// For each scalar that we create:		// For each scalar that we create:

// Start an "if (pred) a[i] = ..." block.		// Start an "if (pred) a[i] = ..." block.
Value *Cmp = nullptr;		Value *Cmp = nullptr;
if (IfPredicateStore) {		if (IfPredicateInstr) {
if (Cond[Part]->getType()->isVectorTy())		if (Cond[Part]->getType()->isVectorTy())
Cond[Part] =		Cond[Part] =
Builder.CreateExtractElement(Cond[Part], Builder.getInt32(0));		Builder.CreateExtractElement(Cond[Part], Builder.getInt32(0));
Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cond[Part],		Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cond[Part],
ConstantInt::get(Cond[Part]->getType(), 1));		ConstantInt::get(Cond[Part]->getType(), 1));
}		}

Instruction *Cloned = Instr->clone();		Instruction *Cloned = Instr->clone();
[... 14 lines elided ...]	if (auto *II = dyn_cast<IntrinsicInst>(Cloned))
AC->registerAssumption(II);		AC->registerAssumption(II);

// If the original scalar returns a value we need to place it in a vector		// If the original scalar returns a value we need to place it in a vector
// so that future users will be able to use it.		// so that future users will be able to use it.
if (!IsVoidRetTy)		if (!IsVoidRetTy)
VecResults[Part] = Cloned;		VecResults[Part] = Cloned;

// End if-block.		// End if-block.
if (IfPredicateStore)		if (IfPredicateInstr)
PredicatedStores.push_back(std::make_pair(cast<StoreInst>(Cloned), Cmp));		PredicatedInstructions.push_back(std::make_pair(Cloned, Cmp));
}		}
}		}

void InnerLoopUnroller::vectorizeMemoryInstruction(Instruction *Instr) {		void InnerLoopUnroller::vectorizeMemoryInstruction(Instruction *Instr) {
auto *SI = dyn_cast<StoreInst>(Instr);		auto *SI = dyn_cast<StoreInst>(Instr);
bool IfPredicateStore = (SI && Legal->blockNeedsPredication(SI->getParent()));		bool IfPredicateInstr = (SI && Legal->blockNeedsPredication(SI->getParent()));

return scalarizeInstruction(Instr, IfPredicateStore);		return scalarizeInstruction(Instr, IfPredicateInstr);
}		}

Value *InnerLoopUnroller::reverseVector(Value *Vec) { return Vec; }		Value *InnerLoopUnroller::reverseVector(Value *Vec) { return Vec; }

Value *InnerLoopUnroller::getBroadcastInstrs(Value *V) { return V; }		Value *InnerLoopUnroller::getBroadcastInstrs(Value *V) { return V; }

Value *InnerLoopUnroller::getStepVector(Value *Val, int StartIdx, Value *Step,		Value *InnerLoopUnroller::getStepVector(Value *Val, int StartIdx, Value *Step,
Instruction::BinaryOps BinOp) {		Instruction::BinaryOps BinOp) {
[... 356 lines elided ...]
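
The cost expression that the patch adds to getInstructionCost for predicated div/rem can be worked through with a small standalone sketch. The helper name predicatedDivRemCost and its plain integer parameters are assumptions made for illustration only; in the patch the two inputs come from TTI.getArithmeticInstrCost and getScalarizationOverhead, whose actual values depend on the target.

```cpp
#include <cassert>

// Hypothetical helper (not part of the patch) mirroring the expression
//   VF * ScalarCost / 2 + Overhead
// ScalarCost stands in for TTI.getArithmeticInstrCost(I->getOpcode(), RetTy),
// Overhead for getScalarizationOverhead(I, VF, true, TTI).
unsigned predicatedDivRemCost(unsigned VF, unsigned ScalarCost,
                              unsigned Overhead) {
  assert(VF > 1 && "the patch only takes this path when VF > 1");
  // Integer arithmetic, evaluated left to right as in the patch. Halving the
  // product models the assumed 50% chance that the predicated block executes.
  return VF * ScalarCost / 2 + Overhead;
}
```

With invented inputs VF = 4, a scalar division cost of 20 and a scalarization overhead of 12, the sketch returns 4 * 20 / 2 + 12 = 52.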

llvm/trunk/test/Transforms/LoopVectorize/if-pred-non-void.ll

				; RUN: opt -S -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -verify-loop-info -simplifycfg < %s | FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				; Test predication of non-void instructions, specifically (i) that these
				; instructions permit vectorization and (ii) the creation of an insertelement
				; and a Phi node. We check the full 2-element sequence for the first
				; instruction; For the rest we'll just make sure they get predicated based
				; on the code generated for the first element.
				define void @test(i32* nocapture %asd, i32* nocapture %aud,
				i32* nocapture %asr, i32* nocapture %aur) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %if.end
				ret void

				; CHECK-LABEL: test
				; CHECK: vector.body:
				; CHECK: %[[SDEE:[a-zA-Z0-9]+]] = extractelement <2 x i1> %{{.*}}, i32 0
				; CHECK: %[[SDCC:[a-zA-Z0-9]+]] = icmp eq i1 %[[SDEE]], true
				; CHECK: br i1 %[[SDCC]], label %[[CSD:[a-zA-Z0-9.]+]], label %[[ESD:[a-zA-Z0-9.]+]]
				; CHECK: [[CSD]]:
				; CHECK: %[[SDA0:[a-zA-Z0-9]+]] = extractelement <2 x i32> %{{.*}}, i32 0
				; CHECK: %[[SDA1:[a-zA-Z0-9]+]] = extractelement <2 x i32> %{{.*}}, i32 0
				; CHECK: %[[SD0:[a-zA-Z0-9]+]] = sdiv i32 %[[SDA0]], %[[SDA1]]
				; CHECK: %[[SD1:[a-zA-Z0-9]+]] = insertelement <2 x i32> undef, i32 %[[SD0]], i32 0
				; CHECK: br label %[[ESD]]
				; CHECK: [[ESD]]:
				; CHECK: %[[SDR:[a-zA-Z0-9]+]] = phi <2 x i32> [ undef, %vector.body ], [ %[[SD1]], %[[CSD]] ]
				; CHECK: %[[SDEEH:[a-zA-Z0-9]+]] = extractelement <2 x i1> %{{.*}}, i32 1
				; CHECK: %[[SDCCH:[a-zA-Z0-9]+]] = icmp eq i1 %[[SDEEH]], true
				; CHECK: br i1 %[[SDCCH]], label %[[CSDH:[a-zA-Z0-9.]+]], label %[[ESDH:[a-zA-Z0-9.]+]]
				; CHECK: [[CSDH]]:
				; CHECK: %[[SDA0H:[a-zA-Z0-9]+]] = extractelement <2 x i32> %{{.*}}, i32 1
				; CHECK: %[[SDA1H:[a-zA-Z0-9]+]] = extractelement <2 x i32> %{{.*}}, i32 1
				; CHECK: %[[SD0H:[a-zA-Z0-9]+]] = sdiv i32 %[[SDA0H]], %[[SDA1H]]
				; CHECK: %[[SD1H:[a-zA-Z0-9]+]] = insertelement <2 x i32> %[[SDR]], i32 %[[SD0H]], i32 1
				; CHECK: br label %[[ESDH]]
				; CHECK: [[ESDH]]:
				; CHECK: %{{.*}} = phi <2 x i32> [ %[[SDR]], %[[ESD]] ], [ %[[SD1H]], %[[CSDH]] ]

				; CHECK: %[[UDEE:[a-zA-Z0-9]+]] = extractelement <2 x i1> %{{.*}}, i32 0
				; CHECK: %[[UDCC:[a-zA-Z0-9]+]] = icmp eq i1 %[[UDEE]], true
				; CHECK: br i1 %[[UDCC]], label %[[CUD:[a-zA-Z0-9.]+]], label %[[EUD:[a-zA-Z0-9.]+]]
				; CHECK: [[CUD]]:
				; CHECK: %[[UDA0:[a-zA-Z0-9]+]] = extractelement <2 x i32> %{{.*}}, i32 0
				; CHECK: %[[UDA1:[a-zA-Z0-9]+]] = extractelement <2 x i32> %{{.*}}, i32 0
				; CHECK: %[[UD0:[a-zA-Z0-9]+]] = udiv i32 %[[UDA0]], %[[UDA1]]
				; CHECK: %[[UD1:[a-zA-Z0-9]+]] = insertelement <2 x i32> undef, i32 %[[UD0]], i32 0
				; CHECK: br label %[[EUD]]
				; CHECK: [[EUD]]:
				; CHECK: %{{.*}} = phi <2 x i32> [ undef, %{{.*}} ], [ %[[UD1]], %[[CUD]] ]

				; CHECK: %[[SREE:[a-zA-Z0-9]+]] = extractelement <2 x i1> %{{.*}}, i32 0
				; CHECK: %[[SRCC:[a-zA-Z0-9]+]] = icmp eq i1 %[[SREE]], true
				; CHECK: br i1 %[[SRCC]], label %[[CSR:[a-zA-Z0-9.]+]], label %[[ESR:[a-zA-Z0-9.]+]]
				; CHECK: [[CSR]]:
				; CHECK: %[[SRA0:[a-zA-Z0-9]+]] = extractelement <2 x i32> %{{.*}}, i32 0
				; CHECK: %[[SRA1:[a-zA-Z0-9]+]] = extractelement <2 x i32> %{{.*}}, i32 0
				; CHECK: %[[SR0:[a-zA-Z0-9]+]] = srem i32 %[[SRA0]], %[[SRA1]]
				; CHECK: %[[SR1:[a-zA-Z0-9]+]] = insertelement <2 x i32> undef, i32 %[[SR0]], i32 0
				; CHECK: br label %[[ESR]]
				; CHECK: [[ESR]]:
				; CHECK: %{{.*}} = phi <2 x i32> [ undef, %{{.*}} ], [ %[[SR1]], %[[CSR]] ]

				; CHECK: %[[UREE:[a-zA-Z0-9]+]] = extractelement <2 x i1> %{{.*}}, i32 0
				; CHECK: %[[URCC:[a-zA-Z0-9]+]] = icmp eq i1 %[[UREE]], true
				; CHECK: br i1 %[[URCC]], label %[[CUR:[a-zA-Z0-9.]+]], label %[[EUR:[a-zA-Z0-9.]+]]
				; CHECK: [[CUR]]:
				; CHECK: %[[URA0:[a-zA-Z0-9]+]] = extractelement <2 x i32> %{{.*}}, i32 0
				; CHECK: %[[URA1:[a-zA-Z0-9]+]] = extractelement <2 x i32> %{{.*}}, i32 0
				; CHECK: %[[UR0:[a-zA-Z0-9]+]] = urem i32 %[[URA0]], %[[URA1]]
				; CHECK: %[[UR1:[a-zA-Z0-9]+]] = insertelement <2 x i32> undef, i32 %[[UR0]], i32 0
				; CHECK: br label %[[EUR]]
				; CHECK: [[EUR]]:
				; CHECK: %{{.*}} = phi <2 x i32> [ undef, %{{.*}} ], [ %[[UR1]], %[[CUR]] ]

				for.body: ; preds = %if.end, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %if.end ]
				%isd = getelementptr inbounds i32, i32* %asd, i64 %indvars.iv
				%iud = getelementptr inbounds i32, i32* %aud, i64 %indvars.iv
				%isr = getelementptr inbounds i32, i32* %asr, i64 %indvars.iv
				%iur = getelementptr inbounds i32, i32* %aur, i64 %indvars.iv
				%lsd = load i32, i32* %isd, align 4
				%lud = load i32, i32* %iud, align 4
				%lsr = load i32, i32* %isr, align 4
				%lur = load i32, i32* %iur, align 4
				%psd = add nsw i32 %lsd, 23
				%pud = add nsw i32 %lud, 24
				%psr = add nsw i32 %lsr, 25
				%pur = add nsw i32 %lur, 26
				%cmp1 = icmp slt i32 %lsd, 100
				br i1 %cmp1, label %if.then, label %if.end

				if.then: ; preds = %for.body
				%rsd = sdiv i32 %psd, %lsd
				%rud = udiv i32 %pud, %lud
				%rsr = srem i32 %psr, %lsr
				%rur = urem i32 %pur, %lur
				br label %if.end

				if.end: ; preds = %if.then, %for.body
				%ysd.0 = phi i32 [ %rsd, %if.then ], [ %psd, %for.body ]
				%yud.0 = phi i32 [ %rud, %if.then ], [ %pud, %for.body ]
				%ysr.0 = phi i32 [ %rsr, %if.then ], [ %psr, %for.body ]
				%yur.0 = phi i32 [ %rur, %if.then ], [ %pur, %for.body ]
				store i32 %ysd.0, i32* %isd, align 4
				store i32 %yud.0, i32* %iud, align 4
				store i32 %ysr.0, i32* %isr, align 4
				store i32 %yur.0, i32* %iur, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 128
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; Future-use test for predication under smarter scalar-scalar: this test will
				; fail when the vectorizer starts feeding scalarized values directly to their
				; scalar users, i.e. w/o generating redundant insertelement/extractelement
				; instructions. This case is already supported by the predication code (which
				; should generate a phi for the scalar predicated value rather than for the
				; insertelement), but cannot be tested yet.
				; If you got this test to fail, kindly fix the test by using the alternative
				; FFU sequence. This will make the test check how we handle this case from
				; now on.
				define void @test_scalar2scalar(i32* nocapture %asd, i32* nocapture %bsd) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %if.end
				ret void

				; CHECK-LABEL: test_scalar2scalar
				; CHECK: vector.body:
				; CHECK: br i1 %{{.*}}, label %[[THEN:[a-zA-Z0-9.]+]], label %[[FI:[a-zA-Z0-9.]+]]
				; CHECK: [[THEN]]:
				; CHECK: %[[PD:[a-zA-Z0-9]+]] = sdiv i32 %{{.*}}, %{{.*}}
				; CHECK: %[[PDV:[a-zA-Z0-9]+]] = insertelement <2 x i32> undef, i32 %[[PD]], i32 0
				; CHECK: br label %[[FI]]
				; CHECK: [[FI]]:
				; CHECK: %[[PH:[a-zA-Z0-9]+]] = phi <2 x i32> [ undef, %vector.body ], [ %[[PDV]], %[[THEN]] ]
				; FFU-LABEL: test_scalar2scalar
				; FFU: vector.body:
				; FFU: br i1 %{{.*}}, label %[[THEN:[a-zA-Z0-9.]+]], label %[[FI:[a-zA-Z0-9.]+]]
				; FFU: [[THEN]]:
				; FFU: %[[PD:[a-zA-Z0-9]+]] = sdiv i32 %{{.*}}, %{{.*}}
				; FFU: br label %[[FI]]
				; FFU: [[FI]]:
				; FFU: %{{.*}} = phi i32 [ undef, %vector.body ], [ %[[PD]], %[[THEN]] ]

				for.body: ; preds = %if.end, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %if.end ]
				%isd = getelementptr inbounds i32, i32* %asd, i64 %indvars.iv
				%lsd = load i32, i32* %isd, align 4
				%isd.b = getelementptr inbounds i32, i32* %bsd, i64 %indvars.iv
				%lsd.b = load i32, i32* %isd.b, align 4
				%psd = add nsw i32 %lsd, 23
				%cmp1 = icmp slt i32 %lsd, 100
				br i1 %cmp1, label %if.then, label %if.end

				if.then: ; preds = %for.body
				%sd1 = sdiv i32 %psd, %lsd
				%rsd = sdiv i32 %lsd.b, %sd1
				br label %if.end

				if.end: ; preds = %if.then, %for.body
				%ysd.0 = phi i32 [ %rsd, %if.then ], [ %psd, %for.body ]
				store i32 %ysd.0, i32* %isd, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 128
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}
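
The control flow that the CHECK lines in this test look for can be modelled in plain C++. This is only a sketch of the semantics, not vectorizer code: Vec, Mask and predicatedSDiv are hypothetical names, the width is fixed at 2 to match -force-vector-width=2, and the undef starting vector is represented by zeros so that the example is deterministic.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t VF = 2; // matches -force-vector-width=2 (assumption)

using Vec = std::array<std::int32_t, VF>;
using Mask = std::array<bool, VF>;

// Hypothetical model of the predicated, scalarized sdiv. Each lane divides
// only when its mask bit is set, so an inactive lane never executes a
// division by zero.
Vec predicatedSDiv(const Vec &Num, const Vec &Den, const Mask &Active,
                   const Vec &Fallback) {
  Vec Merged{}; // stands in for the initial undef vector
  for (std::size_t Lane = 0; Lane < VF; ++Lane) {
    if (Active[Lane]) {
      // pred.sdiv.if: extract both operands, divide, insert the result.
      assert(Den[Lane] != 0 && "active lanes must have a non-zero divisor");
      Merged[Lane] = Num[Lane] / Den[Lane];
    }
    // pred.sdiv.continue: the phi carries Merged forward either way.
  }
  // The final select picks the divided value on active lanes only.
  Vec Out{};
  for (std::size_t Lane = 0; Lane < VF; ++Lane)
    Out[Lane] = Active[Lane] ? Merged[Lane] : Fallback[Lane];
  return Out;
}
```

For example, with numerators {10, 7}, divisors {2, 0}, mask {true, false} and fallback {33, 44}, lane 0 computes 10 / 2 and lane 1 skips its division and takes the fallback value.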

llvm/trunk/test/Transforms/LoopVectorize/if-pred-not-when-safe.ll

				; RUN: opt -S -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -verify-loop-info -simplifycfg < %s | FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				; Test no-predication of instructions that are provably safe, e.g. dividing by
				; a non-zero constant.
				define void @test(i32* nocapture %asd, i32* nocapture %aud,
				i32* nocapture %asr, i32* nocapture %aur,
				i32* nocapture %asd0, i32* nocapture %aud0,
				i32* nocapture %asr0, i32* nocapture %aur0
				) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %if.end
				ret void

				; CHECK-LABEL: test
				; CHECK: vector.body:
				; CHECK: %{{.*}} = sdiv <2 x i32> %{{.*}}, <i32 11, i32 11>
				; CHECK: %{{.*}} = udiv <2 x i32> %{{.*}}, <i32 13, i32 13>
				; CHECK: %{{.*}} = srem <2 x i32> %{{.*}}, <i32 17, i32 17>
				; CHECK: %{{.*}} = urem <2 x i32> %{{.*}}, <i32 19, i32 19>
				; CHECK-NOT: %{{.*}} = sdiv <2 x i32> %{{.*}}, <i32 0, i32 0>
				; CHECK-NOT: %{{.*}} = udiv <2 x i32> %{{.*}}, <i32 0, i32 0>
				; CHECK-NOT: %{{.*}} = srem <2 x i32> %{{.*}}, <i32 0, i32 0>
				; CHECK-NOT: %{{.*}} = urem <2 x i32> %{{.*}}, <i32 0, i32 0>

				for.body: ; preds = %if.end, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %if.end ]
				%isd = getelementptr inbounds i32, i32* %asd, i64 %indvars.iv
				%iud = getelementptr inbounds i32, i32* %aud, i64 %indvars.iv
				%isr = getelementptr inbounds i32, i32* %asr, i64 %indvars.iv
				%iur = getelementptr inbounds i32, i32* %aur, i64 %indvars.iv
				%lsd = load i32, i32* %isd, align 4
				%lud = load i32, i32* %iud, align 4
				%lsr = load i32, i32* %isr, align 4
				%lur = load i32, i32* %iur, align 4
				%psd = add nsw i32 %lsd, 23
				%pud = add nsw i32 %lud, 24
				%psr = add nsw i32 %lsr, 25
				%pur = add nsw i32 %lur, 26
				%isd0 = getelementptr inbounds i32, i32* %asd0, i64 %indvars.iv
				%iud0 = getelementptr inbounds i32, i32* %aud0, i64 %indvars.iv
				%isr0 = getelementptr inbounds i32, i32* %asr0, i64 %indvars.iv
				%iur0 = getelementptr inbounds i32, i32* %aur0, i64 %indvars.iv
				%lsd0 = load i32, i32* %isd0, align 4
				%lud0 = load i32, i32* %iud0, align 4
				%lsr0 = load i32, i32* %isr0, align 4
				%lur0 = load i32, i32* %iur0, align 4
				%psd0 = add nsw i32 %lsd, 27
				%pud0 = add nsw i32 %lud, 28
				%psr0 = add nsw i32 %lsr, 29
				%pur0 = add nsw i32 %lur, 30
				%cmp1 = icmp slt i32 %lsd, 100
				br i1 %cmp1, label %if.then, label %if.end

				if.then: ; preds = %for.body
				%rsd = sdiv i32 %psd, 11
				%rud = udiv i32 %pud, 13
				%rsr = srem i32 %psr, 17
				%rur = urem i32 %pur, 19
				%rsd0 = sdiv i32 %psd0, 0
				%rud0 = udiv i32 %pud0, 0
				%rsr0 = srem i32 %psr0, 0
				%rur0 = urem i32 %pur0, 0
				br label %if.end

				if.end: ; preds = %if.then, %for.body
				%ysd.0 = phi i32 [ %rsd, %if.then ], [ %psd, %for.body ]
				%yud.0 = phi i32 [ %rud, %if.then ], [ %pud, %for.body ]
				%ysr.0 = phi i32 [ %rsr, %if.then ], [ %psr, %for.body ]
				%yur.0 = phi i32 [ %rur, %if.then ], [ %pur, %for.body ]
				%ysd0.0 = phi i32 [ %rsd0, %if.then ], [ %psd0, %for.body ]
				%yud0.0 = phi i32 [ %rud0, %if.then ], [ %pud0, %for.body ]
				%ysr0.0 = phi i32 [ %rsr0, %if.then ], [ %psr0, %for.body ]
				%yur0.0 = phi i32 [ %rur0, %if.then ], [ %pur0, %for.body ]
				store i32 %ysd.0, i32* %isd, align 4
				store i32 %yud.0, i32* %iud, align 4
				store i32 %ysr.0, i32* %isr, align 4
				store i32 %yur.0, i32* %iur, align 4
				store i32 %ysd0.0, i32* %isd0, align 4
				store i32 %yud0.0, i32* %iud0, align 4
				store i32 %ysr0.0, i32* %isr0, align 4
				store i32 %yur0.0, i32* %iur0, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 128
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

llvm/trunk/test/Transforms/LoopVectorize/if-pred-stores.ll

	; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=1 -force-vector-interleave=2 -loop-vectorize -verify-loop-info -simplifycfg < %s | FileCheck %s --check-prefix=UNROLL			; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=1 -force-vector-interleave=2 -loop-vectorize -verify-loop-info -simplifycfg < %s | FileCheck %s --check-prefix=UNROLL
	; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=1 -force-vector-interleave=2 -loop-vectorize -verify-loop-info < %s | FileCheck %s --check-prefix=UNROLL-NOSIMPLIFY			; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=1 -force-vector-interleave=2 -loop-vectorize -verify-loop-info < %s | FileCheck %s --check-prefix=UNROLL-NOSIMPLIFY
	; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -enable-cond-stores-vec -verify-loop-info -simplifycfg < %s | FileCheck %s --check-prefix=VEC			; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -enable-cond-stores-vec -verify-loop-info -simplifycfg < %s | FileCheck %s --check-prefix=VEC
	; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -enable-cond-stores-vec -verify-loop-info -simplifycfg -instcombine < %s | FileCheck %s --check-prefix=VEC-IC

	target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-apple-macosx10.9.0"			target triple = "x86_64-apple-macosx10.9.0"

	; Test predication of stores.			; Test predication of stores.
	define i32 @test(i32* nocapture %f) #0 {			define i32 @test(i32* nocapture %f) #0 {
	entry:			entry:
	br label %for.body			br label %for.body

	; VEC-LABEL: test			; VEC-LABEL: test
	; VEC: %[[v8:.+]] = icmp sgt <2 x i32> %{{.*}}, <i32 100, i32 100>			; VEC: %[[v8:.+]] = icmp sgt <2 x i32> %{{.*}}, <i32 100, i32 100>
	; VEC: %[[v9:.+]] = add nsw <2 x i32> %{{.*}}, <i32 20, i32 20>			; VEC: %[[v9:.+]] = add nsw <2 x i32> %{{.*}}, <i32 20, i32 20>
	; VEC: %[[v10:.+]] = and <2 x i1> %[[v8]], <i1 true, i1 true>			; VEC: %[[v10:.+]] = and <2 x i1> %[[v8]], <i1 true, i1 true>
	; VEC: %[[v11:.+]] = extractelement <2 x i1> %[[v10]], i32 0			; VEC: %[[v11:.+]] = extractelement <2 x i1> %[[v10]], i32 0
	; VEC: %[[v12:.+]] = icmp eq i1 %[[v11]], true			; VEC: %[[v12:.+]] = icmp eq i1 %[[v11]], true
	; VEC: %[[v13:.+]] = extractelement <2 x i32> %[[v9]], i32 0
	; VEC: %[[v14:.+]] = extractelement <2 x i32*> %{{.*}}, i32 0
	; VEC: br i1 %[[v12]], label %[[cond:.+]], label %[[else:.+]]			; VEC: br i1 %[[v12]], label %[[cond:.+]], label %[[else:.+]]
	;			;
	; VEC: [[cond]]:			; VEC: [[cond]]:
				; VEC: %[[v13:.+]] = extractelement <2 x i32> %[[v9]], i32 0
				; VEC: %[[v14:.+]] = extractelement <2 x i32*> %{{.*}}, i32 0
	; VEC: store i32 %[[v13]], i32* %[[v14]], align 4			; VEC: store i32 %[[v13]], i32* %[[v14]], align 4
	; VEC: br label %[[else:.+]]			; VEC: br label %[[else:.+]]
	;			;
	; VEC: [[else]]:			; VEC: [[else]]:
	; VEC: %[[v15:.+]] = extractelement <2 x i1> %[[v10]], i32 1			; VEC: %[[v15:.+]] = extractelement <2 x i1> %[[v10]], i32 1
	; VEC: %[[v16:.+]] = icmp eq i1 %[[v15]], true			; VEC: %[[v16:.+]] = icmp eq i1 %[[v15]], true
	; VEC: %[[v17:.+]] = extractelement <2 x i32> %[[v9]], i32 1
	; VEC: %[[v18:.+]] = extractelement <2 x i32*> %{{.+}} i32 1
	; VEC: br i1 %[[v16]], label %[[cond2:.+]], label %[[else2:.+]]			; VEC: br i1 %[[v16]], label %[[cond2:.+]], label %[[else2:.+]]
	;			;
	; VEC: [[cond2]]:			; VEC: [[cond2]]:
				; VEC: %[[v17:.+]] = extractelement <2 x i32> %[[v9]], i32 1
				; VEC: %[[v18:.+]] = extractelement <2 x i32*> %{{.+}} i32 1
	; VEC: store i32 %[[v17]], i32* %[[v18]], align 4			; VEC: store i32 %[[v17]], i32* %[[v18]], align 4
	; VEC: br label %[[else2:.+]]			; VEC: br label %[[else2:.+]]
	;			;
	; VEC: [[else2]]:			; VEC: [[else2]]:

	; VEC-IC-LABEL: test
	; VEC-IC: %[[v1:.+]] = icmp sgt <2 x i32> %{{.*}}, <i32 100, i32 100>
	; VEC-IC: %[[v2:.+]] = add nsw <2 x i32> %{{.*}}, <i32 20, i32 20>
	; VEC-IC: %[[v3:.+]] = extractelement <2 x i1> %[[v1]], i32 0
	; VEC-IC: br i1 %[[v3]], label %[[cond:.+]], label %[[else:.+]]
	;
	; VEC-IC: [[cond]]:
	; VEC-IC: %[[v4:.+]] = extractelement <2 x i32> %[[v2]], i32 0
	; VEC-IC: store i32 %[[v4]], i32* %{{.*}}, align 4
	; VEC-IC: br label %[[else:.+]]
	;
	; VEC-IC: [[else]]:
	; VEC-IC: %[[v5:.+]] = extractelement <2 x i1> %[[v1]], i32 1
	; VEC-IC: br i1 %[[v5]], label %[[cond2:.+]], label %[[else2:.+]]
	;
	; VEC-IC: [[cond2]]:
	; VEC-IC: %[[v6:.+]] = extractelement <2 x i32> %[[v2]], i32 1
	; VEC-IC: store i32 %[[v6]], i32* %{{.*}}, align 4
	; VEC-IC: br label %[[else2:.+]]
	;
	; VEC-IC: [[else2]]:

	; UNROLL-LABEL: test			; UNROLL-LABEL: test
	; UNROLL: vector.body:			; UNROLL: vector.body:
	; UNROLL: %[[IND:[a-zA-Z0-9]+]] = add i64 %{{.*}}, 0			; UNROLL: %[[IND:[a-zA-Z0-9]+]] = add i64 %{{.*}}, 0
	; UNROLL: %[[IND1:[a-zA-Z0-9]+]] = add i64 %{{.*}}, 1			; UNROLL: %[[IND1:[a-zA-Z0-9]+]] = add i64 %{{.*}}, 1
	; UNROLL: %[[v0:[a-zA-Z0-9]+]] = getelementptr inbounds i32, i32* %f, i64 %[[IND]]			; UNROLL: %[[v0:[a-zA-Z0-9]+]] = getelementptr inbounds i32, i32* %f, i64 %[[IND]]
	; UNROLL: %[[v1:[a-zA-Z0-9]+]] = getelementptr inbounds i32, i32* %f, i64 %[[IND1]]			; UNROLL: %[[v1:[a-zA-Z0-9]+]] = getelementptr inbounds i32, i32* %f, i64 %[[IND1]]
	; UNROLL: %[[v2:[a-zA-Z0-9]+]] = load i32, i32* %[[v0]], align 4			; UNROLL: %[[v2:[a-zA-Z0-9]+]] = load i32, i32* %[[v0]], align 4
	; UNROLL: %[[v3:[a-zA-Z0-9]+]] = load i32, i32* %[[v1]], align 4			; UNROLL: %[[v3:[a-zA-Z0-9]+]] = load i32, i32* %[[v1]], align 4