This is an archive of the discontinued LLVM Phabricator instance.

[LV] Scalarize operands of predicated instructions
Closed · Public

Authored by mssimpso on Oct 28 2016, 9:00 AM.


Details

Reviewers

anemet
mkuper
gilr

Commits

rG364da7e52704: [LV] Scalarize operands of predicated instructions
rL288909: [LV] Scalarize operands of predicated instructions

Summary

This patch attempts to scalarize the operand expressions of predicated instructions if they were conditionally executed in the original loop. After scalarization, the expressions will be sunk inside the blocks created for the predicated instructions. The transformation essentially performs un-if-conversion on the operands.
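As a rough illustration of un-if-conversion, the C++ sketch below models one vector iteration at VF = 4 for a hypothetical loop body `if (c[i] != 0) a[i] = (x[i] + 1) / c[i];`. The division has to stay predicated because it could trap. The function names and the lane-operation counters are assumptions made for illustration; this is not code the vectorizer emits. In the if-converted form the operand `x[i] + 1` is computed in every lane. In the scalarized form it is sunk into each lane's predicated block.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical vectorization factor for this illustration.
constexpr std::size_t VF = 4;
using Lanes = std::array<unsigned, VF>;

// Lane-level operation counts, used only to compare the two forms.
struct Counts {
  unsigned LaneAdds = 0;
  unsigned LaneDivs = 0;
};

// If-converted form: the operand (x + 1) is computed for all lanes, as a
// single vector add would be; only the trapping division is predicated.
Lanes ifConverted(const Lanes &X, const Lanes &C, Lanes A, Counts &N) {
  Lanes Sum{};
  for (std::size_t L = 0; L < VF; ++L) {
    Sum[L] = X[L] + 1; // runs in every lane, even masked-off ones
    ++N.LaneAdds;
  }
  for (std::size_t L = 0; L < VF; ++L)
    if (C[L] != 0) { // per-lane predicated block for the division
      A[L] = Sum[L] / C[L];
      ++N.LaneDivs;
    }
  return A;
}

// Scalarized form: the add is sunk into each lane's predicated block, so
// lanes whose predicate is false skip it entirely.
Lanes scalarized(const Lanes &X, const Lanes &C, Lanes A, Counts &N) {
  for (std::size_t L = 0; L < VF; ++L)
    if (C[L] != 0) {
      unsigned Sum = X[L] + 1; // sunk operand expression
      ++N.LaneAdds;
      A[L] = Sum / C[L];
      ++N.LaneDivs;
    }
  return A;
}
```

Both forms produce the same results. For example, with `c = {2, 0, 5, 0}` the scalarized form performs two lane-level adds instead of four, which is the kind of saving the cost model weighs against extract and insert overhead.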

The cost model has been updated to determine if scalarization is profitable. It compares the cost of a vectorized instruction, assuming it will be if-converted, to the cost of the scalarized instruction, assuming that the instructions corresponding to each vector lane will be sunk inside a predicated block, possibly avoiding execution. If it's more profitable to scalarize the entire expression tree feeding the predicated instruction, the expression will be scalarized; otherwise, it will be vectorized. We only consider the cost of the entire expression to accurately estimate the cost of the required insertelement and extractelement instructions.
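A minimal sketch of the comparison described above, under stated assumptions. The cost numbers are made up, `ExprNode` and `computeDiscount` are hypothetical, and the fixed 50% block probability reflects the review discussion (no PGO) rather than quoted vectorizer code. Each instruction's scalar cost is VF times its scalar cost, plus extractelement overhead for operands that stay vectorized, scaled by the assumed probability that the predicated block runs. The discount is vector cost minus scaled scalar cost, summed over the whole expression. A non-negative discount selects scalarization.

```cpp
#include <cassert>
#include <vector>

// Hypothetical cost inputs for one instruction in the expression tree
// feeding a predicated instruction, at a fixed vectorization factor.
struct ExprNode {
  int VectorCost;      // cost of the widened (if-converted) instruction
  int ScalarCost;      // cost of a single scalar copy
  int ExtractOverhead; // extractelement cost for operands left as vectors
};

// Assumed predicated-block probability of 0.5, stored as its reciprocal.
constexpr int ReciprocalPredBlockProb = 2;

// Sums VectorCost - ScaledScalarCost over the expression. A non-negative
// result means scalarizing and sinking the expression is expected to cost
// no more than vectorizing it. Integer arithmetic: the halving truncates.
int computeDiscount(const std::vector<ExprNode> &Expr, unsigned VF) {
  int Discount = 0;
  for (const ExprNode &N : Expr) {
    int Scalar = static_cast<int>(VF) * N.ScalarCost + N.ExtractOverhead;
    Scalar /= ReciprocalPredBlockProb; // each lane runs only half the time
    Discount += N.VectorCost - Scalar;
  }
  return Discount;
}

bool shouldScalarize(const std::vector<ExprNode> &Expr, unsigned VF) {
  return computeDiscount(Expr, VF) >= 0;
}
```

Summing over the whole expression, rather than deciding per instruction, is what lets the extract overhead at the expression's boundary be counted once against the combined savings.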

Diff Detail

Repository: rL LLVM

Event Timeline

mssimpso updated this revision to Diff 76204. Oct 28 2016, 9:00 AM

mssimpso retitled this revision from to [LV] Scalarize operands of predicated instructions.

mssimpso updated this object.

mssimpso added reviewers: anemet, mkuper, gilr.

mssimpso added subscribers: llvm-commits, mcrosier.

Herald added a subscriber: mzolotukhin. Oct 28 2016, 9:00 AM

dorit added a subscriber: dorit. Oct 30 2016, 3:01 PM

gilr added inline comments. Oct 31 2016, 2:40 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
6546 ↗	(On Diff #76204)	It seems the only instructions pushed into Worklist are PredInst and instructions from its basic block. Is the invariance check necessary?

mkuper added inline comments. Oct 31 2016, 5:03 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
1916 ↗	(On Diff #76204)	Can this actually happen, or should there be an assert here? This seems weird, because if we sometimes call this before we collect the instructions for the VF, and sometimes after, then we'll get different results?
4678 ↗	(On Diff #76204)	We have 5 places that call isScalarAfterVectorization(). Is this the only call site that cares about this? (If all of them should care, perhaps wrap in a helper function?)

mssimpso added inline comments. Nov 1 2016, 7:33 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
1916 ↗	(On Diff #76204)	You're right, this can be an assert. I'll update the patch. Thanks!
4678 ↗	(On Diff #76204)	You're asking if all places where we call isScalarAfterVectorization should also be concerned with isProfitableToScalarize? I think a helper makes sense in general, but this is the only case (at least currently) where I think it will make a difference. In needsScalarInduction and widenIntInduction, the IVs shouldn't be predicated so it shouldn't make a difference. And in collectValuesToIgnore, we haven't yet computed the instruction costs.
6546 ↗	(On Diff #76204)	Ah, that's right. Thanks! I think the invariance check is leftover from a previous implementation I had. I'll remove it.

Addressed comments from Gil and Michael. Thanks!

Removed unnecessary invariance check from computePredInstDiscount
Added assert to isProfitableToScalarize
Added call to collectInstsToScalarize for user-selected VFs (I realized we needed this with the new assert)
Made some auto types explicit
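The new assert in isProfitableToScalarize only holds if the per-VF map has been filled for every VF that code generation will query, which is why the user-selected-VF path also needs to call collectInstsToScalarize. Here is a minimal standalone sketch of that contract, assuming integer IDs in place of `Instruction *`; the class `ToyCostModel` is hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>

// Hypothetical stand-in for Instruction *: instructions are integer IDs.
using InstID = int;

class ToyCostModel {
  // VF -> instructions chosen for scalarization at that VF. A key that is
  // present with an empty set still means "this VF has been analyzed".
  std::map<unsigned, std::set<InstID>> InstsToScalarize;

public:
  // Records the chosen instructions for VF unless the loop is not being
  // vectorized (VF < 2) or VF was already analyzed.
  void collectInstsToScalarize(unsigned VF, const std::set<InstID> &Chosen) {
    if (VF < 2 || InstsToScalarize.count(VF))
      return;
    InstsToScalarize[VF] = Chosen;
  }

  // Must only be called for a VF that has been collected.
  bool isProfitableToScalarize(InstID I, unsigned VF) const {
    auto Scalars = InstsToScalarize.find(VF);
    assert(Scalars != InstsToScalarize.end() &&
           "VF not yet analyzed for scalarization profitability");
    return Scalars->second.count(I) != 0;
  }
};
```

In this toy, calling collectInstsToScalarize twice for the same VF is a no-op, mirroring the early return for an already-analyzed VF. Querying a VF that was never collected trips the assertion in a debug build.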

In D26083#585225, @mssimpso wrote:

Added call to collectInstsToScalarize for user-selected VFs (I realized we needed this with the new assert)

Yay, finally being pedantic and asking for an assert pays off! :-)

lib/Transforms/Vectorize/LoopVectorize.cpp
1962 ↗	(On Diff #76631)	Maybe make more explicit that the cost is the cost of the instruction if scalarized? ("instruction-cost pairs for each choice of vectorization factor" seems to imply the vectorized cost.)
1968 ↗	(On Diff #76631)	This can also be more explicit. "A negative returned value implies..."
6154 ↗	(On Diff #76631)	Could you add a test for this?
6570 ↗	(On Diff #76631)	"to to" -> "to"
6583 ↗	(On Diff #76631)	Just trying to understand if I got this right. Let's say we have:

    %a = add i32, ...
    %r = udiv i32 %x, %a
    %s = udiv i32 %y, %a
    %t = udiv i32 %z, %a

in the same block. Will we do the right thing evaluating the cost of scalarizing the add in a separate context for each udiv?
6624 ↗	(On Diff #76631)	Can we end up with a non-scalar type during the traversal? I'm thinking about something like a div that depends on an extractvalue from a struct.
4678 ↗	(On Diff #76204)	And in collectValuesToIgnore, we haven't yet computed the instruction costs. So, we get some imprecision here, right? If we end up scalarizing things that aren't in VecValuesToIgnore then we'll overestimate the register pressure? Or am I confused? Anyway, assuming I got it right - I'm not saying we need to fix this in this patch. The patch, as it is, is probably still a strict improvement. But a FIXME would be good.

gilr added inline comments. Nov 7 2016, 8:14 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
6583 ↗	(On Diff #76631)	Actually in such a case it seems the addition gets scalarized but cannot be sunk down because of the multiple uses. I also wonder if we calculate the cost correctly when one predicated instruction feeds another.
6605 ↗	(On Diff #76631)	The following loop causes the patch to assert while trying to scalarize a uniform value (the predicated load):

    void foo(int* restrict a, int b, int* restrict c) {
      for (int i = 0; i < 10000; ++i) {
        if (a[i] > 777) {
          int t2 = c[i];
          int t3 = t2 / b;
          a[i] += t3;
        }
      }
    }

I got this example to work by skipping I if any of its operands isUniformAfterVectorization().
6624 ↗	(On Diff #76631)	Wasn't the cost of insert-element accounted for by VectorCost?
4678 ↗	(On Diff #76204)	I think Michael's comment raises a more general question: should it matter why we scalarize? Put differently, once an instruction is marked for scalarization by legal or cost, shouldn't we treat it uniformly in the code? [if so, then the question "will the instruction be scalarized" would only be well-defined in the context of some VF]

Michael/Gil,

Thanks for all the comments! I'll update a new version of the patch shortly.

lib/Transforms/Vectorize/LoopVectorize.cpp
1962 ↗	(On Diff #76631)	Sounds good.
1968 ↗	(On Diff #76631)	Sounds good.
6154 ↗	(On Diff #76631)	Sure.
6570 ↗	(On Diff #76631)	Thanks!
6583 ↗	(On Diff #76631)	Yeah, Gil is right here in that the add in this example will not be sunk. Thanks, Gil! This is because our current predication method places each udiv in its own separate block. Long term I think we will want to keep the predicated instructions in the same block after vectorization if they were in the same block before vectorization. I can add some comments about this in the meantime. So we will need to check for the multi-context case, then. But this doesn't mean that we can't sink instructions with multiple uses. We should still be able to handle the following, for example:

    if:
      %a = add i32, ...
      %r = add i32 %a, %a
      %s = udiv i32 %x, %r
      br

I'll add some tests.
6605 ↗	(On Diff #76631)	Thanks, Gil! I'll fix this in the updated patch and add an appropriate test.
6624 ↗	(On Diff #76631)	For the cost of the insertelement for predicated instructions: that's right. We computed it in VectorCost. The insert overhead should be the same for both versions of the code: predication as-is vs. predication with scalarized operands. So we either need to subtract this overhead from VectorCost or add it to ScalarCost to make the calculations comparable. For the non-scalar type question: this may be possible (I'll need to investigate). I'll add a test either way. Thanks!
4678 ↗	(On Diff #76204)	That's right. There is some conservatism in collectValuesToIgnore. Currently, we place a value in VecValuesToIgnore only if we're sure (for all VFs) that the value will be scalar. But this doesn't mean that all other values will be vectorized. With this patch, we'll have a more precise determination of the known scalars after VF selection. For the question about scalarization by legality vs. the cost model, for code generation, it's not going to matter why we decide to scalarize. In legality, we know that for any (legal) VF that we select, a value will be scalarized. In the cost model, we decide for particular VFs if it's more profitable to scalarize. When we come to code generation, we know what that VF is. If the comment is that InnerLoopVectorizer shouldn't need to make a distinction between Legal and Cost scalarization, then we can create a helper function like Michael suggested for InnerLoopVectorizer to use. What do you think?
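The insertelement point about making VectorCost and ScalarCost comparable can be checked with a small worked example. The numbers are made up, the struct and functions are hypothetical, and block-probability scaling is deliberately left out. If the vector cost already includes the cost of inserting the predicated results into a vector, the scalar estimate has to account for that same overhead. It can either be subtracted from the vector cost or added to the scalar cost.

```cpp
#include <cassert>

// Hypothetical costs for one predicated instruction at some VF. Block
// probability scaling is deliberately ignored in this sketch.
struct PredInstCosts {
  int VectorCost;     // assumed to already include InsertOverhead
  int ScalarCost;     // VF scalar copies, with no insertelement cost
  int InsertOverhead; // cost of inserting the VF results into a vector
};

// Inconsistent comparison: the overhead is only charged to the vector side.
int naiveDiscount(const PredInstCosts &C) {
  return C.VectorCost - C.ScalarCost;
}

// Option 1: remove the shared overhead from the vector cost.
int discountSubtractFromVector(const PredInstCosts &C) {
  return (C.VectorCost - C.InsertOverhead) - C.ScalarCost;
}

// Option 2: charge the same overhead to the scalar version as well.
int discountAddToScalar(const PredInstCosts &C) {
  return C.VectorCost - (C.ScalarCost + C.InsertOverhead);
}
```

With `{VectorCost = 10, ScalarCost = 8, InsertOverhead = 3}`, the naive comparison gives a discount of 2 (favoring scalarization). Both consistent variants give -1 (favoring vectorization).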

mkuper added inline comments. Nov 8 2016, 11:10 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4678 ↗	(On Diff #76204)	That sounds good to me. I think eventually we'll probably need to keep a full "this will get scalarized" list per-VF instead of the current VecValuesToIgnore, but that's an issue for another patch. (By the way, do we really expect to see cases where scalarization is better for some VFs, but worse for others?)

mssimpso added inline comments. Nov 8 2016, 11:40 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4678 ↗	(On Diff #76204)	At least for this patch, scalarization tends to become less profitable as the VF increases. This is primarily because, without PGO, we always assume a 0.5 predicated block probability.
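To illustrate why a fixed 0.5 block probability can make scalarization less attractive at wider VFs, consider a hypothetical instruction whose widened form costs the same at every VF while each scalar copy costs 1. Both numbers are assumptions; real vector costs also change with VF. In this model the scaled scalar cost grows as VF / 2, so the discount shrinks as VF grows.

```cpp
#include <cassert>

// Toy model: the widened instruction costs VectorCost at every VF (an
// assumption), and each scalar copy costs ScalarCostPerLane. Without profile
// data the predicated block is assumed to execute half the time.
int discountForVF(int VectorCost, int ScalarCostPerLane, unsigned VF) {
  int ScaledScalarCost = static_cast<int>(VF) * ScalarCostPerLane / 2;
  return VectorCost - ScaledScalarCost;
}
```

With a vector cost of 2 and a per-lane scalar cost of 1, the discount is 1 at VF 2, 0 at VF 4, and -2 at VF 8. This toy model therefore scalarizes at VF 2 and 4 but not at VF 8.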

mkuper added inline comments. Nov 8 2016, 1:34 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
4678 ↗	(On Diff #76204)	Ah, ok, makes sense.

mssimpso added inline comments. Nov 11 2016, 10:38 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
6583 ↗	(On Diff #76631)	Michael/Gil, After thinking about the multi-context case a bit more (and not wanting to revisit this again later), I decided to take a stab at modifying code generation for the predicated instructions so that we don't place them in separate basic blocks if we don't have to. I'll upload this as a separate patch shortly. Matt.

mssimpso mentioned this in D26555: [LV] Keep predicated instructions in the same block. Nov 16 2016, 5:34 AM

Michael/Gil,

I'm wrapping up addressing your comments, but I have a question about testing. Since this change is cost-model driven, all tests should really live under a target-specific directory (e.g., test/Transforms/LoopVectorize/AArch64). That's easy enough to do, but what should we do about all the target-independent tests that currently exist in the top-level directory? With this change, we would be introducing target-specific behavior that may change how those tests are vectorized, depending on the default target.

I can think of two ways to avoid this: (1) disable scalarization when a user-specified vectorization factor is provided. That is, only perform the optimization when the cost model is actually run. Or (2) introduce a new flag (e.g., --enable-scalarization-with-predication), defaulting to true that we could then disable in the run line of all the current top-level tests.

What do you think?

I'm not a fan of (1) - I don't think specifying the vectorization factor manually should mean that you're uninterested in all other cost-model considerations.

Regarding (2), I'm not sure what's worse - knob proliferation, or not having target-independent tests. That is, one knob is fine, but I wouldn't want to add a knob for every cost-model-dependent decision.
Maybe we can have a catch-all knob that basically says "use the default TTI instead of the current target's"? Does that even make sense?

That should be doable. I'll give it a shot in a separate patch.

Addressed comments from Michael and Gil.

I think I've taken care of everything mentioned until now. Thanks again for the reviews! With my testing concerns addressed, I think we're fine with the code generation tests remaining target-independent.

Added a helper function in InnerLoopVectorizer combining Legal->isScalarAfterVectorization with Cost->isProfitableToScalarize. I replaced all uses of isScalarAfterVectorization in the vectorizer with the new helper function. I added a FIXME for the use of isScalarAfterVectorization in collectValuesToIgnore, like Michael suggested.
Added an assert for Michael's non-scalar type question. Non-scalar types aren't allowed (we already check for this in canVectorizeInstrs), but I don't think an additional assert will hurt.
Added a test requested by Michael for user-specified vectorization factors with no interleaving (AArch64/aarch64-predication.ll). The test is target-specific to ensure we perform the same optimization with and without manually specifying the vectorization factor.
Reorganized scalarization conditions in canBeScalarized, and added a check for operands that are uniform-after-vectorization. This fixes the crash Gil discovered (X86/x86-predication.ll). The test is target-specific because the crash depends on the generation of masked loads.
Added a check in canBeScalarized for the multi-context case. We now only consider instructions forming a single-use chain from the original predicated block that would otherwise be vectorized. I added a test case for the multi-context costs (AArch64/predication_costs.ll).
Updated comments.
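The scalarization conditions listed above can be sketched over a toy instruction representation. `ToyInst` and its fields are hypothetical stand-ins for the queries the patch uses (`hasOneUse`, the parent-block comparison, `isScalarAfterVectorization`, `isScalarWithPredication`, and `isUniformAfterVectorization`). The sketch follows the order of checks described in the patch comments but is not the patch's code.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified view of an instruction for this sketch.
struct ToyInst {
  int Block = 0;        // ID of the containing basic block
  unsigned NumUses = 0; // number of uses of the result
  bool ScalarAfterVectorization = false;
  bool ScalarWithPredication = false;
  bool UniformAfterVectorization = false;
  std::vector<const ToyInst *> Operands;
};

// Follows the order of the checks described for canBeScalarized.
bool canBeScalarized(const ToyInst &I, const ToyInst &PredInst) {
  // Only single-use chains in PredInst's block that would otherwise be
  // vectorized are considered.
  if (I.NumUses != 1 || I.Block != PredInst.Block ||
      I.ScalarAfterVectorization)
    return false;
  // Other predicated instructions are analyzed in their own context.
  if (I.ScalarWithPredication)
    return false;
  // Only lane zero of a uniform value is emitted, so a scalarized user would
  // reference lanes that do not exist.
  for (const ToyInst *Op : I.Operands)
    if (Op->UniformAfterVectorization)
      return false;
  return true;
}
```

Under these rules, an `add` feeding three `udiv`s (Michael's multi-context example) is rejected because it has more than one use. An instruction whose operand is uniform after vectorization is rejected as well, which corresponds to the masked-load crash Gil found.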

This LGTM, except some vague thoughts about slightly relaxing canBeScalarized().
If you think that's not a real concern, I'm ok with this going on as is.

lib/Transforms/Vectorize/LoopVectorize.cpp
6627 ↗	(On Diff #79755)	Please add an explanation for why we bail on Legal->isScalarAfterVectorization(I)
6641 ↗	(On Diff #79755)	Are you sure this is the right check? If I understand correctly, we fail not because the pointer is uniform, but because it's only uniform in the "we use a single (LLVM) value to represent it" sense, not the "the (abstract) value is the same for all lanes" sense. Otherwise it'd be safe to scalarize. I'm having a hard time thinking of a good example, though, because this is limited to instructions (so GV and param operands won't be affected), and uniform instructions tend to either be consecutive, loop-invariant (in which case they should be hoisted out by LICM before we hit the vectorizer), or uses of the scalar IV (in which case they won't be operands of vectorized instructions).

mssimpso added inline comments. Dec 1 2016, 9:25 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
6627 ↗	(On Diff #79755)	Sure. I don't think it's strictly necessary to bail here, but I couldn't think of an example where continuing to traverse the chain would be useful.
6641 ↗	(On Diff #79755)	That's exactly right. We can't scalarize because we use a single value to represent the uniform-after-vec instructions. It's the difference between uniform meaning "only lane zero will be used so the others aren't needed" vs. "all lanes can be used but they will all be equal". I think we could change this if we wanted to. In getScalarValue, we currently assert if we're trying to access a uniform-after-vec instruction for a Lane > 0. We could instead clone the Lane == 0 value and return that. What do you think?
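The two meanings of "uniform" discussed here can be contrasted with a toy value map. `ToyValueMap`, its `getScalarValue`, and the `CloneLaneZero` switch are hypothetical illustrations, not LLVM's getScalarValue. The default policy throws in place of the current assertion when lane > 0 of a uniform value is requested. The alternative policy returns the lane-zero value, which is only sound when the value really is identical in every lane.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical map from a value name to the scalar copies emitted for it.
struct ToyValueMap {
  std::map<std::string, std::vector<int>> Scalars;
  std::map<std::string, bool> Uniform; // true: only lane 0 was emitted
  bool CloneLaneZero = false;          // the suggested alternative policy

  int getScalarValue(const std::string &V, unsigned Lane) const {
    const std::vector<int> &Lanes = Scalars.at(V);
    auto It = Uniform.find(V);
    bool IsUniform = It != Uniform.end() && It->second;
    if (IsUniform && Lane > 0) {
      // Current policy: the exception stands in for the assertion.
      if (!CloneLaneZero)
        throw std::logic_error("no value emitted for lane > 0 of a uniform");
      // Alternative policy: reuse lane zero, which is only sound if every
      // lane of the value really is equal.
      return Lanes.at(0);
    }
    return Lanes.at(Lane);
  }
};
```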

mkuper accepted this revision. Dec 1 2016, 3:25 PM

mkuper edited edge metadata.

mkuper added inline comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
6641 ↗	(On Diff #79755)	I'm really ok with this going in as is, as long as it's clearly documented. We can fix this if we ever run into a case where it matters, and I don't want to introduce even more complexity here if it's unwarranted.

This revision is now accepted and ready to land. Dec 1 2016, 3:25 PM

mssimpso added inline comments. Dec 2 2016, 4:49 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
6641 ↗	(On Diff #79755)	Sounds good, thanks Michael. I'll add some comments to make this very clear.

gilr added inline comments. Dec 2 2016, 7:56 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
6688 ↗	(On Diff #79755)	Is there scalarization overhead if canBeScalarized(J) is false due to isScalarAfterVectorization(J)? More generally, shouldn't this line be under a !isScalarAfterVectorization(J) condition? We may have bailed out e.g. due to J being on a different basic block but J itself may be scalar (meaning we've "smoothed" a scalar-vector-scalar sequence). Is that correct?

mssimpso added inline comments. Dec 2 2016, 8:35 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
6688 ↗	(On Diff #79755)	That's right Gil. We also should check that J is actually contained in the loop before adding in the extract overhead. I had already added this to my working copy after uploading the last revision. But I'll upload a new revision with this change and Michael's comment suggestions for clarity. Thanks!

Addressed comments from Michael and Gil.

Added documentation requested by Michael.
Added a needsExtract condition for determining if we need to compute a scalarization overhead, as mentioned by Gil. I updated some costs in the AArch64/predication_costs.ll test to reflect this. For predicated stores, we know their GEP pointer operands will be scalar, so we no longer will calculate an extract cost for them.
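The needsExtract condition can be sketched as follows, based on the discussion with Gil. An operand that is not scalarized along with the expression only costs an extractelement if it is defined inside the loop and will itself be vectorized. `ToyOperand` and its fields are hypothetical stand-ins for the loop-containment check and the scalar-after-vectorization query.

```cpp
#include <cassert>
#include <vector>

// Hypothetical summary of an operand J that will not be scalarized along
// with the expression.
struct ToyOperand {
  bool InLoop;            // stands in for the loop containing J
  bool ScalarAfterVector; // J will already be scalar (e.g. a store's GEP)
};

// An extractelement is needed only for in-loop operands that will be
// vectorized; other operands already have scalar values available.
bool needsExtract(const ToyOperand &J) {
  return J.InLoop && !J.ScalarAfterVector;
}

// Total extract overhead: one extract per lane for each operand needing it.
int extractOverhead(const std::vector<ToyOperand> &Ops, unsigned VF,
                    int ExtractCostPerLane) {
  int Cost = 0;
  for (const ToyOperand &J : Ops)
    if (needsExtract(J))
      Cost += static_cast<int>(VF) * ExtractCostPerLane;
  return Cost;
}
```

For a predicated store whose GEP pointer operand is already scalar, this model charges no extract cost for the pointer, matching the updated AArch64/predication_costs.ll expectations described above.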

(Still LGTM :-) )

Did I address all of your comments, Gil? Thanks!

Yes you did, Matt - sorry for the delay. LGTM too :)

No problem - just wanted to make sure. Thanks for the review!

Closed by commit rL288909: [LV] Scalarize operands of predicated instructions (authored by mssimpso). Dec 7 2016, 7:13 AM

This revision was automatically updated to reflect the committed changes.

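Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp	217 lines
llvm/trunk/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll	63 lines
llvm/trunk/test/Transforms/LoopVectorize/AArch64/predication_costs.ll	148 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/x86-predication.ll	60 lines
llvm/trunk/test/Transforms/LoopVectorize/if-pred-non-void.ll	54 lines
llvm/trunk/test/Transforms/LoopVectorize/if-pred-stores.ll	11 lines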

Diff 80592

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp


@@ 512 unchanged lines not shown @@ protected:
   /// EntryVal's type.
   void createVectorIntInductionPHI(const InductionDescriptor &II,
                                    Instruction *EntryVal);
 
   /// Widen an integer induction variable \p IV. If \p Trunc is provided, the
   /// induction variable will first be truncated to the corresponding type.
   void widenIntInduction(PHINode *IV, TruncInst *Trunc = nullptr);
 
+  /// Returns true if an instruction \p I should be scalarized instead of
+  /// vectorized for the chosen vectorization factor.
+  bool shouldScalarizeInstruction(Instruction *I) const;
+
   /// Returns true if we should generate a scalar version of \p IV.
   bool needsScalarInduction(Instruction *IV) const;
 
   /// Return a constant reference to the VectorParts corresponding to \p V from
   /// the original loop. If the value has already been vectorized, the
   /// corresponding vector entry in VectorLoopValueMap is returned. If,
   /// however, the value has a scalar entry in VectorLoopValueMap, we construct
   /// new vector values on-demand by inserting the scalar values into vectors
@@ 1,373 unchanged lines not shown @@ public:
 
   /// \returns The smallest bitwidth each instruction can be represented with.
   /// The vector equivalents of these instructions should be truncated to this
   /// type.
   const MapVector<Instruction *, uint64_t> &getMinimalBitwidths() const {
     return MinBWs;
   }
 
+  /// \returns True if it is more profitable to scalarize instruction \p I for
+  /// vectorization factor \p VF.
+  bool isProfitableToScalarize(Instruction *I, unsigned VF) const {
+    auto Scalars = InstsToScalarize.find(VF);
+    assert(Scalars != InstsToScalarize.end() &&
+           "VF not yet analyzed for scalarization profitability");
+    return Scalars->second.count(I);
+  }
+
 private:
   /// The vectorization cost is a combination of the cost itself and a boolean
   /// indicating whether any of the contributing operations will actually
   /// operate on
   /// vector values after type legalization in the backend. If this latter value
   /// is
   /// false, then all operations will be scalarized (i.e. no vectorization has
   /// actually taken place).
@@ 26 unchanged lines not shown @@ return ::createMissedAnalysis(Hints->vectorizeAnalysisPassName(),
                                   RemarkName, TheLoop);
   }
 
   /// Map of scalar integer values to the smallest bitwidth they can be legally
   /// represented as. The vector equivalents of these values should be truncated
   /// to this type.
   MapVector<Instruction *, uint64_t> MinBWs;
 
+  /// A type representing the costs for instructions if they were to be
+  /// scalarized rather than vectorized. The entries are Instruction-Cost
+  /// pairs.
+  typedef DenseMap<Instruction *, unsigned> ScalarCostsTy;
+
+  /// A map holding scalar costs for different vectorization factors. The
+  /// presence of a cost for an instruction in the mapping indicates that the
+  /// instruction will be scalarized when vectorizing with the associated
+  /// vectorization factor. The entries are VF-ScalarCostTy pairs.
+  DenseMap<unsigned, ScalarCostsTy> InstsToScalarize;
+
+  /// Returns the expected difference in cost from scalarizing the expression
+  /// feeding a predicated instruction \p PredInst. The instructions to
+  /// scalarize and their scalar costs are collected in \p ScalarCosts. A
+  /// non-negative return value implies the expression will be scalarized.
+  /// Currently, only single-use chains are considered for scalarization.
+  int computePredInstDiscount(Instruction *PredInst, ScalarCostsTy &ScalarCosts,
+                              unsigned VF);
+
+  /// Collects the instructions to scalarize for each predicated instruction in
+  /// the loop.
+  void collectInstsToScalarize(unsigned VF);
+
 public:
   /// The loop that we evaluate.
   Loop *TheLoop;
   /// Predicated scalar evolution analysis.
   PredicatedScalarEvolution &PSE;
   /// Loop Info analysis.
   LoopInfo *LI;
   /// Vectorization legality.
@@ 218 unchanged lines not shown @@ void InnerLoopVectorizer::createVectorIntInductionPHI(
   auto *ICmp = cast<Instruction>(Br->getCondition());
   LastInduction->moveBefore(ICmp);
   LastInduction->setName("vec.ind.next");
 
   VecInd->addIncoming(SteppedStart, LoopVectorPreHeader);
   VecInd->addIncoming(LastInduction, LoopVectorLatch);
 }
 
+bool InnerLoopVectorizer::shouldScalarizeInstruction(Instruction *I) const {
+  return Legal->isScalarAfterVectorization(I) ||
+         Cost->isProfitableToScalarize(I, VF);
+}
+
 bool InnerLoopVectorizer::needsScalarInduction(Instruction *IV) const {
-  if (Legal->isScalarAfterVectorization(IV))
+  if (shouldScalarizeInstruction(IV))
     return true;
   auto isScalarInst = [&](User *U) -> bool {
     auto *I = cast<Instruction>(U);
-    return (OrigLoop->contains(I) && Legal->isScalarAfterVectorization(I));
+    return (OrigLoop->contains(I) && shouldScalarizeInstruction(I));
   };
   return any_of(IV->users(), isScalarInst);
 }
 
 void InnerLoopVectorizer::widenIntInduction(PHINode *IV, TruncInst *Trunc) {
 
   auto II = Legal->getInductionVars()->find(IV);
   assert(II != Legal->getInductionVars()->end() && "IV is not an induction");
@@ 24 unchanged lines not shown @@ void InnerLoopVectorizer::widenIntInduction(PHINode *IV, TruncInst *Trunc) {
   // get it now.
   if (ID.getConstIntStepValue())
     Step = ID.getConstIntStepValue();
 
   // Try to create a new independent vector induction variable. If we can't
   // create the phi node, we will splat the scalar induction variable in each
   // loop iteration.
   if (VF > 1 && IV->getType() == Induction->getType() && Step &&
-      !Legal->isScalarAfterVectorization(EntryVal)) {
+      !shouldScalarizeInstruction(EntryVal)) {
     createVectorIntInductionPHI(ID, EntryVal);
     VectorizedIV = true;
   }
 
   // If we haven't yet vectorized the induction variable, or if we will create
   // a scalar one, we need to define the scalar induction variable and step
   // values. If we were given a truncation type, truncate the canonical
   // induction variable and constant step. Otherwise, derive these values from
@@ 2,402 unchanged lines not shown @@ void InnerLoopVectorizer::vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV) {
   for (Instruction &I : *BB) {
 
     // If the instruction will become trivially dead when vectorized, we don't
     // need to generate it.
     if (DeadInstructions.count(&I))
       continue;
 
     // Scalarize instructions that should remain scalar after vectorization.
-    if (!(isa<BranchInst>(&I) || isa<PHINode>(&I) ||
+    if (VF > 1 &&
+        !(isa<BranchInst>(&I) || isa<PHINode>(&I) ||
           isa<DbgInfoIntrinsic>(&I)) &&
-        Legal->isScalarAfterVectorization(&I)) {
-      scalarizeInstruction(&I);
+        shouldScalarizeInstruction(&I)) {
+      scalarizeInstruction(&I, Legal->isScalarWithPredication(&I));
       continue;
     }
 
     switch (I.getOpcode()) {
     case Instruction::Br:
       // Nothing to do for PHIs and BR, since we already took care of the
       // loop control flow instructions.
       continue;
@@ 1,456 unchanged lines not shown @@ LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize) {
   }
 
   int UserVF = Hints->getWidth();
   if (UserVF != 0) {
     assert(isPowerOf2_32(UserVF) && "VF needs to be a power of two");
     DEBUG(dbgs() << "LV: Using user VF " << UserVF << ".\n");
 
     Factor.Width = UserVF;
+    collectInstsToScalarize(UserVF);
     return Factor;
   }
 
   float Cost = expectedCost(1).first;
 #ifndef NDEBUG
   const float ScalarCost = Cost;
 #endif /* NDEBUG */
   unsigned Width = 1;
▲ Show 20 Lines • Show All 390 Lines • ▼ Show 20 Lines	for (unsigned i = 0, e = VFs.size(); i < e; ++i) {
RU.LoopInvariantRegs = Invariant;		RU.LoopInvariantRegs = Invariant;
RU.MaxLocalUsers = MaxUsages[i];		RU.MaxLocalUsers = MaxUsages[i];
RUs[i] = RU;		RUs[i] = RU;
}		}

return RUs;		return RUs;
}		}

		void LoopVectorizationCostModel::collectInstsToScalarize(unsigned VF) {

		// If we aren't vectorizing the loop, or if we've already collected the
		// instructions to scalarize, there's nothing to do. Collection may already
		// have occurred if we have a user-selected VF and are now computing the
		// expected cost for interleaving.
		if (VF < 2 \|\| InstsToScalarize.count(VF))
		return;

		// Initialize a mapping for VF in InstsToScalalarize. If we find that it's
		// not profitable to scalarize any instructions, the presence of VF in the
		// map will indicate that we've analyzed it already.
		ScalarCostsTy &ScalarCostsVF = InstsToScalarize[VF];

		// Find all the instructions that are scalar with predication in the loop and
		// determine if it would be better to not if-convert the blocks they are in.
		// If so, we also record the instructions to scalarize.
		for (BasicBlock *BB : TheLoop->blocks()) {
		if (!Legal->blockNeedsPredication(BB))
		continue;
		for (Instruction &I : *BB)
		if (Legal->isScalarWithPredication(&I)) {
		ScalarCostsTy ScalarCosts;
		if (computePredInstDiscount(&I, ScalarCosts, VF) >= 0)
		ScalarCostsVF.insert(ScalarCosts.begin(), ScalarCosts.end());
		}
		}
		}

		int LoopVectorizationCostModel::computePredInstDiscount(
		Instruction PredInst, DenseMap<Instruction , unsigned> &ScalarCosts,
		unsigned VF) {

		assert(!Legal->isUniformAfterVectorization(PredInst) &&
		"Instruction marked uniform-after-vectorization will be predicated");

		// Initialize the discount to zero, meaning that the scalar version and the
		// vector version cost the same.
		int Discount = 0;

		// Holds instructions to analyze. The instructions we visit are mapped in
		// ScalarCosts. Those instructions are the ones that would be scalarized if
		// we find that the scalar version costs less.
		SmallVector<Instruction *, 8> Worklist;

		// Returns true if the given instruction can be scalarized.
		auto canBeScalarized = [&](Instruction *I) -> bool {

		// We only attempt to scalarize instructions forming a single-use chain
		// from the original predicated block that would otherwise be vectorized.
		// Although not strictly necessary, we give up on instructions we know will
		// already be scalar to avoid traversing chains that are unlikely to be
		// beneficial.
		if (!I->hasOneUse() || PredInst->getParent() != I->getParent() ||
		Legal->isScalarAfterVectorization(I))
		return false;

		// If the instruction is scalar with predication, it will be analyzed
		// separately. We ignore it within the context of PredInst.
		if (Legal->isScalarWithPredication(I))
		return false;

		// If any of the instruction's operands are uniform after vectorization,
		// the instruction cannot be scalarized. This prevents, for example, a
		// masked load from being scalarized.
		//
		// We assume we will only emit a value for lane zero of an instruction
		// marked uniform after vectorization, rather than VF identical values.
		// Thus, if we scalarize an instruction that uses a uniform, we would
		// create uses of values corresponding to the lanes we aren't emitting code
		// for. This behavior can be changed by allowing getScalarValue to clone
		// the lane zero values for uniforms rather than asserting.
		for (Use &U : I->operands())
		if (auto *J = dyn_cast<Instruction>(U.get()))
		if (Legal->isUniformAfterVectorization(J))
		return false;

		// Otherwise, we can scalarize the instruction.
		return true;
		};

		// Returns true if an operand that cannot be scalarized must be extracted
		// from a vector. We will account for this scalarization overhead below. Note
		// that the non-void predicated instructions are placed in their own blocks,
		// and their return values are inserted into vectors. Thus, an extract would
		// still be required.
		auto needsExtract = [&](Instruction *I) -> bool {
		return TheLoop->contains(I) && !Legal->isScalarAfterVectorization(I);
		};

		// Compute the expected cost discount from scalarizing the entire expression
		// feeding the predicated instruction. We currently only consider expressions
		// that are single-use instruction chains.
		Worklist.push_back(PredInst);
		while (!Worklist.empty()) {
		Instruction *I = Worklist.pop_back_val();

		// If we've already analyzed the instruction, there's nothing to do.
		if (ScalarCosts.count(I))
		continue;

		// Compute the cost of the vector instruction. Note that this cost already
		// includes the scalarization overhead of the predicated instruction.
		unsigned VectorCost = getInstructionCost(I, VF).first;

		// Compute the cost of the scalarized instruction. This cost is the cost of
		// the instruction as if it wasn't if-converted and instead remained in the
		// predicated block. We will scale this cost by block probability after
		// computing the scalarization overhead.
		unsigned ScalarCost = VF * getInstructionCost(I, 1).first;

		// Compute the scalarization overhead of needed insertelement instructions
		// and phi nodes.
		if (Legal->isScalarWithPredication(I) && !I->getType()->isVoidTy()) {
		ScalarCost += getScalarizationOverhead(ToVectorTy(I->getType(), VF), true,
		false, TTI);
		ScalarCost += VF * TTI.getCFInstrCost(Instruction::PHI);
		}

		// Compute the scalarization overhead of needed extractelement
		// instructions. For each of the instruction's operands, if the operand can
		// be scalarized, add it to the worklist; otherwise, account for the
		// overhead.
		for (Use &U : I->operands())
		if (auto *J = dyn_cast<Instruction>(U.get())) {
		assert(VectorType::isValidElementType(J->getType()) &&
		"Instruction has non-scalar type");
		if (canBeScalarized(J))
		Worklist.push_back(J);
		else if (needsExtract(J))
		ScalarCost += getScalarizationOverhead(ToVectorTy(J->getType(), VF),
		false, true, TTI);
		}

		// Scale the total scalar cost by block probability.
		ScalarCost /= getReciprocalPredBlockProb();

		// Compute the discount. A non-negative discount means the vector version
		// of the instruction costs more, and scalarizing would be beneficial.
		Discount += VectorCost - ScalarCost;
		ScalarCosts[I] = ScalarCost;
		}

		return Discount;
		}

LoopVectorizationCostModel::VectorizationCostTy		LoopVectorizationCostModel::VectorizationCostTy
LoopVectorizationCostModel::expectedCost(unsigned VF) {		LoopVectorizationCostModel::expectedCost(unsigned VF) {
VectorizationCostTy Cost;		VectorizationCostTy Cost;

		// Collect the instructions (and their associated costs) that will be more
		// profitable to scalarize.
		collectInstsToScalarize(VF);

// For each block.		// For each block.
for (BasicBlock *BB : TheLoop->blocks()) {		for (BasicBlock *BB : TheLoop->blocks()) {
VectorizationCostTy BlockCost;		VectorizationCostTy BlockCost;

// For each instruction in the old loop.		// For each instruction in the old loop.
for (Instruction &I : *BB) {		for (Instruction &I : *BB) {
// Skip dbg intrinsics.		// Skip dbg intrinsics.
if (isa<DbgInfoIntrinsic>(I))		if (isa<DbgInfoIntrinsic>(I))
[91 lines not shown]

LoopVectorizationCostModel::VectorizationCostTy		LoopVectorizationCostModel::VectorizationCostTy
LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {		LoopVectorizationCostModel::getInstructionCost(Instruction *I, unsigned VF) {
// If we know that this instruction will remain uniform, check the cost of		// If we know that this instruction will remain uniform, check the cost of
// the scalar version.		// the scalar version.
if (Legal->isUniformAfterVectorization(I))		if (Legal->isUniformAfterVectorization(I))
VF = 1;		VF = 1;

		if (VF > 1 && isProfitableToScalarize(I, VF))
		return VectorizationCostTy(InstsToScalarize[VF][I], false);

Type *VectorTy;		Type *VectorTy;
unsigned C = getInstructionCost(I, VF, VectorTy);		unsigned C = getInstructionCost(I, VF, VectorTy);

bool TypeNotScalarized =		bool TypeNotScalarized =
VF > 1 && !VectorTy->isVoidTy() && TTI.getNumberOfParts(VectorTy) < VF;		VF > 1 && !VectorTy->isVoidTy() && TTI.getNumberOfParts(VectorTy) < VF;
return VectorizationCostTy(C, TypeNotScalarized);		return VectorizationCostTy(C, TypeNotScalarized);
}		}

[350 lines not shown; enclosing scope: void LoopVectorizationCostModel::collectValuesToIgnore() {]
// Ignore type-promoting instructions we identified during reduction		// Ignore type-promoting instructions we identified during reduction
// detection.		// detection.
for (auto &Reduction : *Legal->getReductionVars()) {		for (auto &Reduction : *Legal->getReductionVars()) {
RecurrenceDescriptor &RedDes = Reduction.second;		RecurrenceDescriptor &RedDes = Reduction.second;
SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();		SmallPtrSetImpl<Instruction *> &Casts = RedDes.getCastInsts();
VecValuesToIgnore.insert(Casts.begin(), Casts.end());		VecValuesToIgnore.insert(Casts.begin(), Casts.end());
}		}

// Insert values known to be scalar into VecValuesToIgnore.		// Insert values known to be scalar into VecValuesToIgnore. This is a
		// conservative estimation of the values that will later be scalarized.
		//
		// FIXME: Even though an instruction is not scalar-after-vectorization, it may
		// still be scalarized. For example, we may find that scalarizing an
		// instruction is more profitable for a given vectorization factor. But at
		// this point, we haven't yet computed the vectorization factor.
for (auto *BB : TheLoop->getBlocks())		for (auto *BB : TheLoop->getBlocks())
for (auto &I : *BB)		for (auto &I : *BB)
if (Legal->isScalarAfterVectorization(&I))		if (Legal->isScalarAfterVectorization(&I))
VecValuesToIgnore.insert(&I);		VecValuesToIgnore.insert(&I);
}		}

void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,		void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,
bool IfPredicateInstr) {		bool IfPredicateInstr) {
[448 remaining lines not shown]
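The worklist walk in `computePredInstDiscount` can be modeled outside LLVM to see how a chain is costed. The following is a minimal C++ sketch, not LLVM code: the `Inst` record and its fields, `Overheads`, and `computeDiscount` are hypothetical stand-ins for the `Legal`/`TTI` queries. The fixed overheads (3 for a two-lane insert or extract sequence, free phis, reciprocal block probability 2) are assumptions chosen to match the AArch64 cost comments in `predication_costs.ll`, and the vector costs fed in by a caller are made up.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical, simplified model of one loop instruction. The fields stand in
// for the LoopVectorizationLegality/TTI queries used by the real code.
struct Inst {
  int Block;            // id of the parent basic block
  int NumUses;          // number of users (hasOneUse() <=> NumUses == 1)
  bool InLoop;          // TheLoop->contains(I)
  bool ScalarAfterVec;  // Legal->isScalarAfterVectorization(I)
  bool ScalarWithPred;  // Legal->isScalarWithPredication(I)
  bool UniformAfterVec; // Legal->isUniformAfterVectorization(I)
  bool NonVoid;         // !I->getType()->isVoidTy()
  unsigned VectorCost;  // stand-in for getInstructionCost(I, VF)
  unsigned ScalarCost;  // stand-in for getInstructionCost(I, 1)
  std::vector<int> Ops; // indices of operands that are instructions
};

// Assumed target constants for VF = 2, chosen to match the AArch64 test
// comments; the real overheads come from TTI and depend on type and VF.
struct Overheads {
  unsigned Insert = 3;                  // insertelement sequence for a result
  unsigned Extract = 3;                 // extractelement sequence per operand
  unsigned Phi = 0;                     // per-lane phi cost
  unsigned ReciprocalPredBlockProb = 2; // 50% block probability
};

// Mirrors the worklist walk: scaled scalar costs of the visited chain are
// recorded in ScalarCosts; the result is the sum of (vector - scalar) costs.
int computeDiscount(const std::vector<Inst> &F, int PredInst, unsigned VF,
                    const Overheads &O, std::map<int, unsigned> &ScalarCosts) {
  assert(!F[PredInst].UniformAfterVec && "uniform instruction predicated");

  auto canBeScalarized = [&](int J) {
    const Inst &I = F[J];
    if (I.NumUses != 1 || I.Block != F[PredInst].Block || I.ScalarAfterVec)
      return false;
    if (I.ScalarWithPred) // analyzed separately as its own PredInst
      return false;
    for (int Op : I.Ops)  // users of uniforms are not scalarized
      if (F[Op].UniformAfterVec)
        return false;
    return true;
  };
  auto needsExtract = [&](int J) {
    return F[J].InLoop && !F[J].ScalarAfterVec;
  };

  int Discount = 0;
  std::vector<int> Worklist{PredInst};
  while (!Worklist.empty()) {
    int Cur = Worklist.back();
    Worklist.pop_back();
    if (ScalarCosts.count(Cur))
      continue;
    const Inst &I = F[Cur];
    unsigned ScalarCost = VF * I.ScalarCost;
    if (I.ScalarWithPred && I.NonVoid)
      ScalarCost += O.Insert + VF * O.Phi; // insertelements and phis
    for (int Op : I.Ops) {
      if (canBeScalarized(Op))
        Worklist.push_back(Op);            // sink the operand as well
      else if (needsExtract(Op))
        ScalarCost += O.Extract;           // operand stays a vector
    }
    ScalarCost /= O.ReciprocalPredBlockProb;
    Discount += static_cast<int>(I.VectorCost) - static_cast<int>(ScalarCost);
    ScalarCosts[Cur] = ScalarCost;
  }
  return Discount;
}
```

Fed the `predicated_udiv_scalarized_operand` chain (load, add, udiv), the add is pulled into the worklist and receives a scaled cost of 2 while the udiv receives 4, matching the test comments. For the sdiv in `predication_multi_context`, the two-use add is rejected by `canBeScalarized` and is charged as an extract instead, giving 5.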

llvm/trunk/test/Transforms/LoopVectorize/AArch64/aarch64-predication.ll

				; RUN: opt < %s -loop-vectorize -simplifycfg -S | FileCheck %s
				; RUN: opt < %s -force-vector-width=2 -loop-vectorize -simplifycfg -S | FileCheck %s

				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				; CHECK-LABEL: predicated_udiv_scalarized_operand
				;
				; This test checks that we correctly compute the scalarized operands for a
				; user-specified vectorization factor when interleaving is disabled. We use the
				; "optsize" attribute to disable all interleaving calculations.
				;
				; CHECK: vector.body:
				; CHECK: %wide.load = load <2 x i64>, <2 x i64>* {{.*}}, align 4
				; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]]
				; CHECK: [[IF0]]:
				; CHECK: %[[T00:.+]] = extractelement <2 x i64> %wide.load, i32 0
				; CHECK: %[[T01:.+]] = extractelement <2 x i64> %wide.load, i32 0
				; CHECK: %[[T02:.+]] = add nsw i64 %[[T01]], %x
				; CHECK: %[[T03:.+]] = udiv i64 %[[T00]], %[[T02]]
				; CHECK: %[[T04:.+]] = insertelement <2 x i64> undef, i64 %[[T03]], i32 0
				; CHECK: br label %[[CONT0]]
				; CHECK: [[CONT0]]:
				; CHECK: %[[T05:.+]] = phi <2 x i64> [ undef, %vector.body ], [ %[[T04]], %[[IF0]] ]
				; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]]
				; CHECK: [[IF1]]:
				; CHECK: %[[T06:.+]] = extractelement <2 x i64> %wide.load, i32 1
				; CHECK: %[[T07:.+]] = extractelement <2 x i64> %wide.load, i32 1
				; CHECK: %[[T08:.+]] = add nsw i64 %[[T07]], %x
				; CHECK: %[[T09:.+]] = udiv i64 %[[T06]], %[[T08]]
				; CHECK: %[[T10:.+]] = insertelement <2 x i64> %[[T05]], i64 %[[T09]], i32 1
				; CHECK: br label %[[CONT1]]
				; CHECK: [[CONT1]]:
				; CHECK: phi <2 x i64> [ %[[T05]], %[[CONT0]] ], [ %[[T10]], %[[IF1]] ]
				; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body

				define i64 @predicated_udiv_scalarized_operand(i64* %a, i1 %c, i64 %x) optsize {
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
				%r = phi i64 [ 0, %entry ], [ %tmp6, %for.inc ]
				%tmp0 = getelementptr inbounds i64, i64* %a, i64 %i
				%tmp2 = load i64, i64* %tmp0, align 4
				br i1 %c, label %if.then, label %for.inc

				if.then:
				%tmp3 = add nsw i64 %tmp2, %x
				%tmp4 = udiv i64 %tmp2, %tmp3
				br label %for.inc

				for.inc:
				%tmp5 = phi i64 [ %tmp2, %for.body ], [ %tmp4, %if.then]
				%tmp6 = add i64 %r, %tmp5
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, 100
				br i1 %cond, label %for.body, label %for.end

				for.end:
				%tmp7 = phi i64 [ %tmp6, %for.inc ]
				ret i64 %tmp7
				}
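To make the un-if-conversion in `aarch64-predication.ll` concrete at the value level, here is a hedged C++ emulation of the `vector.body` that the CHECK lines describe for VF=2. Each lane extracts from `%wide.load`, runs the sunk scalar `add` and the `udiv` only when the predicate holds, and the if-converted phi becomes a per-lane select. The function names are hypothetical. Undef lanes are modeled as 0, the trip count is assumed to be a multiple of 2 (no scalar epilogue is modeled), and `add nsw` is modeled with unsigned arithmetic, which only matches when no overflow occurs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference semantics of the original loop in
// @predicated_udiv_scalarized_operand (hypothetical C++ rendering).
std::uint64_t scalarLoop(const std::vector<std::uint64_t> &A, bool C,
                         std::uint64_t X) {
  std::uint64_t R = 0;
  for (std::size_t I = 0; I < A.size(); ++I) {
    std::uint64_t Tmp5 = A[I];    // %tmp2 reaches %tmp5 when C is false
    if (C)                        // if.then
      Tmp5 = A[I] / (A[I] + X);   // %tmp4 = udiv %tmp2, (add nsw %tmp2, %x)
    R += Tmp5;
  }
  return R;
}

// Emulation of the VF = 2 vector.body: the add is scalarized and sunk next
// to the predicated udiv inside each lane's block.
std::uint64_t vectorizedLoop(const std::vector<std::uint64_t> &A, bool C,
                             std::uint64_t X) {
  constexpr std::size_t VF = 2;
  assert(A.size() % VF == 0 && "no scalar epilogue is modeled");
  std::array<std::uint64_t, VF> VecR{};         // vector reduction phi
  for (std::size_t Index = 0; Index < A.size(); Index += VF) {
    std::array<std::uint64_t, VF> WideLoad{};   // %wide.load
    for (std::size_t L = 0; L < VF; ++L)
      WideLoad[L] = A[Index + L];
    std::array<std::uint64_t, VF> Div{};        // undef lanes modeled as 0
    for (std::size_t L = 0; L < VF; ++L) {
      if (C) {                                  // per-lane predicated block
        std::uint64_t T00 = WideLoad[L];        // extractelement for the udiv
        std::uint64_t T01 = WideLoad[L];        // extractelement for the add
        std::uint64_t T02 = T01 + X;            // sunk scalar add
        Div[L] = T00 / T02;                     // udiv + insertelement
      }
    }
    for (std::size_t L = 0; L < VF; ++L)        // if-converted %tmp5 phi
      VecR[L] += C ? Div[L] : WideLoad[L];
  }
  return VecR[0] + VecR[1];                     // final reduction
}
```

The two extracts per lane mirror `%[[T00]]` and `%[[T01]]` in the CHECK lines: each scalarized user extracts its own copy of the load result. Both functions agree whenever active lanes have a nonzero divisor; for example, `{10, 20, 30, 40}` yields 100 with the predicate false and 4 with the predicate true and `X == 0`.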

llvm/trunk/test/Transforms/LoopVectorize/AArch64/predication_costs.ll

	[first 51 lines not shown]

	; CHECK-LABEL: predicated_store			; CHECK-LABEL: predicated_store
	;			;
	; This test checks that we correctly compute the cost of the predicated store			; This test checks that we correctly compute the cost of the predicated store
	; instruction. If we assume the block probability is 50%, we compute the cost			; instruction. If we assume the block probability is 50%, we compute the cost
	; as:			; as:
	;			;
	; Cost of store:			; Cost of store:
	; (store(4) + extractelement(6)) / 2 = 5			; (store(4) + extractelement(3)) / 2 = 3
	;			;
	; CHECK: Found an estimated cost of 5 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4			; CHECK: Found an estimated cost of 3 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
	; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4			; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4
	;			;
	define void @predicated_store(i32* %a, i1 %c, i32 %x, i64 %n) {			define void @predicated_store(i32* %a, i1 %c, i32 %x, i64 %n) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]			%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
	[9 lines not shown]
	for.inc:			for.inc:
	%i.next = add nuw nsw i64 %i, 1			%i.next = add nuw nsw i64 %i, 1
	%cond = icmp slt i64 %i.next, %n			%cond = icmp slt i64 %i.next, %n
	br i1 %cond, label %for.body, label %for.end			br i1 %cond, label %for.body, label %for.end

	for.end:			for.end:
	ret void			ret void
	}			}

				; CHECK-LABEL: predicated_udiv_scalarized_operand
				;
				; This test checks that we correctly compute the cost of the predicated udiv
				; instruction and the add instruction it uses. The add is scalarized and sunk
				; inside the predicated block. If we assume the block probability is 50%, we
				; compute the cost as:
				;
				; Cost of add:
				; (add(2) + extractelement(3)) / 2 = 2
				; Cost of udiv:
				; (udiv(2) + extractelement(3) + insertelement(3)) / 2 = 4
				;
				; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp3 = add nsw i32 %tmp2, %x
				; CHECK: Found an estimated cost of 4 for VF 2 For instruction: %tmp4 = udiv i32 %tmp2, %tmp3
				; CHECK: Scalarizing: %tmp3 = add nsw i32 %tmp2, %x
				; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp2, %tmp3
				;
				define i32 @predicated_udiv_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
				%r = phi i32 [ 0, %entry ], [ %tmp6, %for.inc ]
				%tmp0 = getelementptr inbounds i32, i32* %a, i64 %i
				%tmp2 = load i32, i32* %tmp0, align 4
				br i1 %c, label %if.then, label %for.inc

				if.then:
				%tmp3 = add nsw i32 %tmp2, %x
				%tmp4 = udiv i32 %tmp2, %tmp3
				br label %for.inc

				for.inc:
				%tmp5 = phi i32 [ %tmp2, %for.body ], [ %tmp4, %if.then]
				%tmp6 = add i32 %r, %tmp5
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end:
				%tmp7 = phi i32 [ %tmp6, %for.inc ]
				ret i32 %tmp7
				}

				; CHECK-LABEL: predicated_store_scalarized_operand
				;
				; This test checks that we correctly compute the cost of the predicated store
				; instruction and the add instruction it uses. The add is scalarized and sunk
				; inside the predicated block. If we assume the block probability is 50%, we
				; compute the cost as:
				;
				; Cost of add:
				; (add(2) + extractelement(3)) / 2 = 2
				; Cost of store:
				; store(4) / 2 = 2
				;
				; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp2 = add nsw i32 %tmp1, %x
				; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp2, i32* %tmp0, align 4
				; CHECK: Scalarizing: %tmp2 = add nsw i32 %tmp1, %x
				; CHECK: Scalarizing and predicating: store i32 %tmp2, i32* %tmp0, align 4
				;
				define void @predicated_store_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
				%tmp0 = getelementptr inbounds i32, i32* %a, i64 %i
				%tmp1 = load i32, i32* %tmp0, align 4
				br i1 %c, label %if.then, label %for.inc

				if.then:
				%tmp2 = add nsw i32 %tmp1, %x
				store i32 %tmp2, i32* %tmp0, align 4
				br label %for.inc

				for.inc:
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end:
				ret void
				}

				; CHECK-LABEL: predication_multi_context
				;
				; This test checks that we correctly compute the cost of multiple predicated
				; instructions in the same block. The sdiv, udiv, and store must be scalarized
				; and predicated. The sub feeding the store is scalarized and sunk inside the
				; store's predicated block. However, the add feeding the sdiv and udiv cannot
				; be sunk and is not scalarized. If we assume the block probability is 50%, we
				; compute the cost as:
				;
				; Cost of add:
				; add(1) = 1
				; Cost of sdiv:
				; (sdiv(2) + extractelement(6) + insertelement(3)) / 2 = 5
				; Cost of udiv:
				; (udiv(2) + extractelement(6) + insertelement(3)) / 2 = 5
				; Cost of sub:
				; (sub(2) + extractelement(3)) / 2 = 2
				; Cost of store:
				; store(4) / 2 = 2
				;
				; CHECK: Found an estimated cost of 1 for VF 2 For instruction: %tmp2 = add i32 %tmp1, %x
				; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp3 = sdiv i32 %tmp1, %tmp2
				; CHECK: Found an estimated cost of 5 for VF 2 For instruction: %tmp4 = udiv i32 %tmp3, %tmp2
				; CHECK: Found an estimated cost of 2 for VF 2 For instruction: %tmp5 = sub i32 %tmp4, %x
				; CHECK: Found an estimated cost of 2 for VF 2 For instruction: store i32 %tmp5, i32* %tmp0, align 4
				; CHECK-NOT: Scalarizing: %tmp2 = add i32 %tmp1, %x
				; CHECK: Scalarizing and predicating: %tmp3 = sdiv i32 %tmp1, %tmp2
				; CHECK: Scalarizing and predicating: %tmp4 = udiv i32 %tmp3, %tmp2
				; CHECK: Scalarizing: %tmp5 = sub i32 %tmp4, %x
				; CHECK: Scalarizing and predicating: store i32 %tmp5, i32* %tmp0, align 4
				;
				define void @predication_multi_context(i32* %a, i1 %c, i32 %x, i64 %n) {
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
				%tmp0 = getelementptr inbounds i32, i32* %a, i64 %i
				%tmp1 = load i32, i32* %tmp0, align 4
				br i1 %c, label %if.then, label %for.inc

				if.then:
				%tmp2 = add i32 %tmp1, %x
				%tmp3 = sdiv i32 %tmp1, %tmp2
				%tmp4 = udiv i32 %tmp3, %tmp2
				%tmp5 = sub i32 %tmp4, %x
				store i32 %tmp5, i32* %tmp0, align 4
				br label %for.inc

				for.inc:
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end:
				ret void
				}
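The cost comments in `predication_costs.ll` all follow one pattern: add the scalar instruction cost to its insert/extract overhead, then divide by the reciprocal block probability (2, for the assumed 50%). The helper below is a hypothetical sketch of that arithmetic only. It does not call the real cost model. Note the integer division, which is why `(add(2) + extractelement(3)) / 2` is reported as 2 rather than 2.5.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical helper reproducing the arithmetic in the test comments:
// (scalar cost + insert/extract overhead) / reciprocal block probability.
// Integer division truncates, as it does for the unsigned costs in the pass.
unsigned scaledCost(std::initializer_list<unsigned> Parts,
                    unsigned ReciprocalPredBlockProb = 2) {
  assert(ReciprocalPredBlockProb != 0 && "probability must be nonzero");
  unsigned Sum = 0;
  for (unsigned P : Parts)
    Sum += P;
  return Sum / ReciprocalPredBlockProb;
}
```

For example, `scaledCost({2, 6, 3})` reproduces the cost of 5 reported for both the sdiv and the udiv in `predication_multi_context`, and `scaledCost({4})` gives the store cost of 2 once its operand has been sunk and no extract is charged to the store.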

llvm/trunk/test/Transforms/LoopVectorize/X86/x86-predication.ll

				; RUN: opt < %s -mattr=avx -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -simplifycfg -S | FileCheck %s

				target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
				target triple = "x86_64-apple-macosx10.8.0"

				; CHECK-LABEL: predicated_sdiv_masked_load
				;
				; This test ensures that we don't scalarize the predicated load. Since the load
				; can be vectorized with predication, scalarizing it would cause its pointer
				; operand to become non-uniform.
				;
				; CHECK: vector.body:
				; CHECK: %wide.masked.load = call <2 x i32> @llvm.masked.load.v2i32.p0v2i32
				; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]]
				; CHECK: [[IF0]]:
				; CHECK: %[[T0:.+]] = extractelement <2 x i32> %wide.masked.load, i32 0
				; CHECK: %[[T1:.+]] = sdiv i32 %[[T0]], %x
				; CHECK: %[[T2:.+]] = insertelement <2 x i32> undef, i32 %[[T1]], i32 0
				; CHECK: br label %[[CONT0]]
				; CHECK: [[CONT0]]:
				; CHECK: %[[T3:.+]] = phi <2 x i32> [ undef, %vector.body ], [ %[[T2]], %[[IF0]] ]
				; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]]
				; CHECK: [[IF1]]:
				; CHECK: %[[T4:.+]] = extractelement <2 x i32> %wide.masked.load, i32 1
				; CHECK: %[[T5:.+]] = sdiv i32 %[[T4]], %x
				; CHECK: %[[T6:.+]] = insertelement <2 x i32> %[[T3]], i32 %[[T5]], i32 1
				; CHECK: br label %[[CONT1]]
				; CHECK: [[CONT1]]:
				; CHECK: phi <2 x i32> [ %[[T3]], %[[CONT0]] ], [ %[[T6]], %[[IF1]] ]
				; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body

				define i32 @predicated_sdiv_masked_load(i32* %a, i32* %b, i32 %x, i1 %c) {
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
				%r = phi i32 [ 0, %entry ], [ %tmp7, %for.inc ]
				%tmp0 = getelementptr inbounds i32, i32* %a, i64 %i
				%tmp1 = load i32, i32* %tmp0, align 4
				br i1 %c, label %if.then, label %for.inc

				if.then:
				%tmp2 = getelementptr inbounds i32, i32* %b, i64 %i
				%tmp3 = load i32, i32* %tmp2, align 4
				%tmp4 = sdiv i32 %tmp3, %x
				%tmp5 = add nsw i32 %tmp4, %tmp1
				br label %for.inc

				for.inc:
				%tmp6 = phi i32 [ %tmp1, %for.body ], [ %tmp5, %if.then]
				%tmp7 = add i32 %r, %tmp6
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp eq i64 %i.next, 10000
				br i1 %cond, label %for.end, label %for.body

				for.end:
				%tmp8 = phi i32 [ %tmp7, %for.inc ]
				ret i32 %tmp8
				}
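The shape checked in `x86-predication.ll` can be emulated the same way. In this hedged C++ sketch (all function and struct names are hypothetical), the load from `%b` stays a masked vector load, while the `sdiv` is scalarized with one predicated branch per lane. A counter records how many divisions actually run. The masked-load passthrough is modeled as 0, `%c` is broadcast to every lane, and the trip count is assumed to be a multiple of 2.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts the scalar divisions that actually execute.
struct DivCounter {
  int Divisions = 0;
};

// Reference semantics of @predicated_sdiv_masked_load.
std::int32_t scalarLoop(const std::vector<std::int32_t> &A,
                        const std::vector<std::int32_t> &B, std::int32_t X,
                        bool C, DivCounter &S) {
  assert(A.size() == B.size());
  std::int32_t R = 0;
  for (std::size_t I = 0; I < A.size(); ++I) {
    std::int32_t Tmp6 = A[I];          // %tmp1 reaches %tmp6 when C is false
    if (C) {                           // if.then
      ++S.Divisions;
      Tmp6 = B[I] / X + A[I];          // %tmp5 = add (sdiv %tmp3, %x), %tmp1
    }
    R += Tmp6;
  }
  return R;
}

// Emulation of the VF = 2 vector.body: masked vector load of B, one
// predicated scalar sdiv per lane, then a vector add and a blend.
std::int32_t vectorizedLoop(const std::vector<std::int32_t> &A,
                            const std::vector<std::int32_t> &B, std::int32_t X,
                            bool C, DivCounter &S) {
  constexpr std::size_t VF = 2;
  assert(A.size() == B.size() && A.size() % VF == 0);
  const std::array<bool, VF> Mask{{C, C}};  // %c broadcast to every lane
  std::array<std::int32_t, VF> VecR{};
  for (std::size_t Index = 0; Index < A.size(); Index += VF) {
    std::array<std::int32_t, VF> WideLoad{}, MaskedLoad{}, Div{};
    for (std::size_t L = 0; L < VF; ++L) {
      WideLoad[L] = A[Index + L];
      if (Mask[L])                      // masked load: inactive lanes keep
        MaskedLoad[L] = B[Index + L];   // the passthrough (0 here)
    }
    for (std::size_t L = 0; L < VF; ++L)
      if (Mask[L]) {                    // one predicated block per lane
        ++S.Divisions;
        Div[L] = MaskedLoad[L] / X;     // extractelement, sdiv, insertelement
      }
    for (std::size_t L = 0; L < VF; ++L)
      VecR[L] += Mask[L] ? Div[L] + WideLoad[L] : WideLoad[L];
  }
  return VecR[0] + VecR[1];
}
```

With the predicate false and `X == 0`, neither version executes a division. This illustrates why a division in a conditional block stays predicated instead of being executed unconditionally across all lanes, where a division by zero could trap.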

llvm/trunk/test/Transforms/LoopVectorize/if-pred-non-void.ll

	[first 201 lines not shown]

	if.end: ; preds = %if.then, %check			if.end: ; preds = %if.then, %check
	%ysd.0 = phi i32 [ %rsd, %if.then ], [ %psd, %check ]			%ysd.0 = phi i32 [ %rsd, %if.then ], [ %psd, %check ]
	store i32 %ysd.0, i32* %isd, align 4			store i32 %ysd.0, i32* %isd, align 4
	%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1			%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
	%exitcond = icmp eq i64 %indvars.iv.next, 128			%exitcond = icmp eq i64 %indvars.iv.next, 128
	br i1 %exitcond, label %for.cond.cleanup, label %for.body			br i1 %exitcond, label %for.cond.cleanup, label %for.body
	}			}


				define i32 @predicated_udiv_scalarized_operand(i32* %a, i1 %c, i32 %x, i64 %n) {
				entry:
				br label %for.body

				; CHECK-LABEL: predicated_udiv_scalarized_operand
				; CHECK: vector.body:
				; CHECK: %wide.load = load <2 x i32>, <2 x i32>* {{.*}}, align 4
				; CHECK: br i1 {{.*}}, label %[[IF0:.+]], label %[[CONT0:.+]]
				; CHECK: [[IF0]]:
				; CHECK: %[[T00:.+]] = extractelement <2 x i32> %wide.load, i32 0
				; CHECK: %[[T01:.+]] = extractelement <2 x i32> %wide.load, i32 0
				; CHECK: %[[T02:.+]] = add nsw i32 %[[T01]], %x
				; CHECK: %[[T03:.+]] = udiv i32 %[[T00]], %[[T02]]
				; CHECK: %[[T04:.+]] = insertelement <2 x i32> undef, i32 %[[T03]], i32 0
				; CHECK: br label %[[CONT0]]
				; CHECK: [[CONT0]]:
				; CHECK: %[[T05:.+]] = phi <2 x i32> [ undef, %vector.body ], [ %[[T04]], %[[IF0]] ]
				; CHECK: br i1 {{.*}}, label %[[IF1:.+]], label %[[CONT1:.+]]
				; CHECK: [[IF1]]:
				; CHECK: %[[T06:.+]] = extractelement <2 x i32> %wide.load, i32 1
				; CHECK: %[[T07:.+]] = extractelement <2 x i32> %wide.load, i32 1
				; CHECK: %[[T08:.+]] = add nsw i32 %[[T07]], %x
				; CHECK: %[[T09:.+]] = udiv i32 %[[T06]], %[[T08]]
				; CHECK: %[[T10:.+]] = insertelement <2 x i32> %[[T05]], i32 %[[T09]], i32 1
				; CHECK: br label %[[CONT1]]
				; CHECK: [[CONT1]]:
				; CHECK: phi <2 x i32> [ %[[T05]], %[[CONT0]] ], [ %[[T10]], %[[IF1]] ]
				; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body

				for.body:
				%i = phi i64 [ 0, %entry ], [ %i.next, %for.inc ]
				%r = phi i32 [ 0, %entry ], [ %tmp6, %for.inc ]
				%tmp0 = getelementptr inbounds i32, i32* %a, i64 %i
				%tmp2 = load i32, i32* %tmp0, align 4
				br i1 %c, label %if.then, label %for.inc

				if.then:
				%tmp3 = add nsw i32 %tmp2, %x
				%tmp4 = udiv i32 %tmp2, %tmp3
				br label %for.inc

				for.inc:
				%tmp5 = phi i32 [ %tmp2, %for.body ], [ %tmp4, %if.then]
				%tmp6 = add i32 %r, %tmp5
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end:
				%tmp7 = phi i32 [ %tmp6, %for.inc ]
				ret i32 %tmp7
				}

llvm/trunk/test/Transforms/LoopVectorize/if-pred-stores.ll

	; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=1 -force-vector-interleave=2 -loop-vectorize -verify-loop-info -simplifycfg < %s | FileCheck %s --check-prefix=UNROLL			; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=1 -force-vector-interleave=2 -loop-vectorize -verify-loop-info -simplifycfg < %s | FileCheck %s --check-prefix=UNROLL
	; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=1 -force-vector-interleave=2 -loop-vectorize -verify-loop-info < %s | FileCheck %s --check-prefix=UNROLL-NOSIMPLIFY			; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=1 -force-vector-interleave=2 -loop-vectorize -verify-loop-info < %s | FileCheck %s --check-prefix=UNROLL-NOSIMPLIFY
	; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -enable-cond-stores-vec -verify-loop-info -simplifycfg < %s | FileCheck %s --check-prefix=VEC			; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -enable-cond-stores-vec -verify-loop-info -simplifycfg < %s | FileCheck %s --check-prefix=VEC

	target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

	; Test predication of stores.			; Test predication of stores.
	define i32 @test(i32* nocapture %f) #0 {			define i32 @test(i32* nocapture %f) #0 {
	entry:			entry:
	br label %for.body			br label %for.body

	; VEC-LABEL: test			; VEC-LABEL: test
	; VEC: %[[v0:.+]] = add i64 %index, 0			; VEC: %[[v0:.+]] = add i64 %index, 0
	; VEC: %[[v8:.+]] = icmp sgt <2 x i32> %{{.*}}, <i32 100, i32 100>			; VEC: %[[v8:.+]] = icmp sgt <2 x i32> %{{.*}}, <i32 100, i32 100>
	; VEC: %[[v9:.+]] = add nsw <2 x i32> %{{.*}}, <i32 20, i32 20>
	; VEC: %[[v10:.+]] = and <2 x i1> %[[v8]], <i1 true, i1 true>			; VEC: %[[v10:.+]] = and <2 x i1> %[[v8]], <i1 true, i1 true>
	; VEC: %[[o1:.+]] = or <2 x i1> zeroinitializer, %[[v10]]			; VEC: %[[o1:.+]] = or <2 x i1> zeroinitializer, %[[v10]]
	; VEC: %[[v11:.+]] = extractelement <2 x i1> %[[o1]], i32 0			; VEC: %[[v11:.+]] = extractelement <2 x i1> %[[o1]], i32 0
	; VEC: %[[v12:.+]] = icmp eq i1 %[[v11]], true			; VEC: %[[v12:.+]] = icmp eq i1 %[[v11]], true
	; VEC: br i1 %[[v12]], label %[[cond:.+]], label %[[else:.+]]			; VEC: br i1 %[[v12]], label %[[cond:.+]], label %[[else:.+]]
	;			;
	; VEC: [[cond]]:			; VEC: [[cond]]:
	; VEC: %[[v13:.+]] = extractelement <2 x i32> %[[v9]], i32 0			; VEC: %[[v13:.+]] = extractelement <2 x i32> %wide.load, i32 0
				; VEC: %[[v9a:.+]] = add nsw i32 %[[v13]], 20
	; VEC: %[[v2:.+]] = getelementptr inbounds i32, i32* %f, i64 %[[v0]]			; VEC: %[[v2:.+]] = getelementptr inbounds i32, i32* %f, i64 %[[v0]]
	; VEC: store i32 %[[v13]], i32* %[[v2]], align 4			; VEC: store i32 %[[v9a]], i32* %[[v2]], align 4
	; VEC: br label %[[else:.+]]			; VEC: br label %[[else:.+]]
	;			;
	; VEC: [[else]]:			; VEC: [[else]]:
	; VEC: %[[v15:.+]] = extractelement <2 x i1> %[[o1]], i32 1			; VEC: %[[v15:.+]] = extractelement <2 x i1> %[[o1]], i32 1
	; VEC: %[[v16:.+]] = icmp eq i1 %[[v15]], true			; VEC: %[[v16:.+]] = icmp eq i1 %[[v15]], true
	; VEC: br i1 %[[v16]], label %[[cond2:.+]], label %[[else2:.+]]			; VEC: br i1 %[[v16]], label %[[cond2:.+]], label %[[else2:.+]]
	;			;
	; VEC: [[cond2]]:			; VEC: [[cond2]]:
	; VEC: %[[v17:.+]] = extractelement <2 x i32> %[[v9]], i32 1			; VEC: %[[v17:.+]] = extractelement <2 x i32> %wide.load, i32 1
				; VEC: %[[v9b:.+]] = add nsw i32 %[[v17]], 20
	; VEC: %[[v1:.+]] = add i64 %index, 1			; VEC: %[[v1:.+]] = add i64 %index, 1
	; VEC: %[[v4:.+]] = getelementptr inbounds i32, i32* %f, i64 %[[v1]]			; VEC: %[[v4:.+]] = getelementptr inbounds i32, i32* %f, i64 %[[v1]]
	; VEC: store i32 %[[v17]], i32* %[[v4]], align 4			; VEC: store i32 %[[v9b]], i32* %[[v4]], align 4
	; VEC: br label %[[else2:.+]]			; VEC: br label %[[else2:.+]]
	;			;
	; VEC: [[else2]]:			; VEC: [[else2]]:

	; UNROLL-LABEL: test			; UNROLL-LABEL: test
	; UNROLL: vector.body:			; UNROLL: vector.body:
	; UNROLL: %[[IND:[a-zA-Z0-9]+]] = add i64 %{{.*}}, 0			; UNROLL: %[[IND:[a-zA-Z0-9]+]] = add i64 %{{.*}}, 0
	; UNROLL: %[[IND1:[a-zA-Z0-9]+]] = add i64 %{{.*}}, 1			; UNROLL: %[[IND1:[a-zA-Z0-9]+]] = add i64 %{{.*}}, 1
	[87 remaining lines not shown]