This is an archive of the discontinued LLVM Phabricator instance.

[Loop Vectorizer] Support predication of div/rem
ClosedPublic

Authored by gilr on Jul 28 2016, 8:43 AM.

Details

Reviewers

anemet
jmolloy
mkuper

Commits

rG550148b2f662: [Loop Vectorizer] Support predication of div/rem
rL279620: [Loop Vectorizer] Support predication of div/rem

Summary

div/rem instructions in basic blocks that require predication currently prevent vectorization. This patch extends the existing mechanism for predicating stores to handle other instructions and leverages it to predicate divs and rems.

The generated vector extracts and inserts are now moved into the predicated block (reflected in the cost model for scalarization).
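
As a rough illustration of the transformation described in the summary, the sketch below emulates in plain C++ what a scalarized, predicated division does for one vector of lanes. It is not vectorizer code: `kVF`, `Vec`, `Mask`, and `predicatedSDiv` are hypothetical names, and the zero placeholder stands in for the undef value that a masked-off lane would carry in IR.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical vectorization factor and lane containers for this sketch.
constexpr std::size_t kVF = 4;
using Vec = std::array<int32_t, kVF>;
using Mask = std::array<bool, kVF>;

// Emulates a scalarized, predicated sdiv: the extracts, the division and the
// insert for a lane all live inside that lane's guarded block, so a
// masked-off lane never executes the division.
Vec predicatedSDiv(const Vec &Num, const Vec &Den, const Mask &Pred) {
  Vec Result{}; // placeholder lanes (assumption: zero stands in for undef)
  for (std::size_t Lane = 0; Lane < kVF; ++Lane) {
    if (Pred[Lane]) {        // guarded block for this lane
      int32_t N = Num[Lane]; // plays the role of an extractelement
      int32_t D = Den[Lane]; // plays the role of an extractelement
      Result[Lane] = N / D;  // scalar sdiv, then insertelement
    }
  }
  return Result;
}
```

Calling `predicatedSDiv` with a zero divisor in a lane whose predicate is false is safe in this model, which is the property the patch relies on.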

Event Timeline

gilr updated this revision to Diff 65934. Jul 28 2016, 8:43 AM

gilr retitled this revision from to [Loop Vectorizer] Support predication of div/rem.

gilr updated this object.

gilr added reviewers: mkuper, jmolloy.

gilr added subscribers: llvm-commits, Ayal, delena.

Herald added a subscriber: mzolotukhin. Jul 28 2016, 8:43 AM

anemet added a subscriber: anemet. Jul 28 2016, 9:58 AM

anemet added inline comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
3899–3900	Only a drive-by comment: please don't make vectorizeLoop any more unreadable than it already is. Please consider a prequel to this patch that moves store-predication into its own function.

You might want to consider special-casing division by a constant integer. For example, on x86, we can convert a 16-bit unsigned divide by a constant into a pmulhuw+psrlw.
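
To make the constant-divisor suggestion concrete, here is a scalar sketch of the multiply-high-then-shift pattern that pmulhuw+psrlw would apply to each 16-bit lane. The divisor 10 and its magic constant 0xCCCD are chosen purely for illustration, and `mulhi16` and `udiv10` are hypothetical helpers, not LLVM functions.

```cpp
#include <cstdint>

// High 16 bits of a 16x16 -> 32-bit unsigned multiply, which is what
// pmulhuw computes for each lane.
inline uint16_t mulhi16(uint16_t A, uint16_t B) {
  return static_cast<uint16_t>((static_cast<uint32_t>(A) * B) >> 16);
}

// X / 10 for any 16-bit unsigned X: multiply by ceil(2^19 / 10) = 0xCCCD,
// keep the high half, then shift right by the remaining 3 bits (the psrlw).
inline uint16_t udiv10(uint16_t X) {
  return static_cast<uint16_t>(mulhi16(X, 0xCCCD) >> 3);
}
```

Because no division instruction is executed, this form cannot trap and would not need predication at all.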

mssimpso added a subscriber: mssimpso. Jul 28 2016, 11:49 AM

mkuper added inline comments. Jul 28 2016, 3:50 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
3399	I'm not entirely sure what "the cost of a phi" means, especially without a type. Also, I don't believe any in-tree target actually assigns a non-zero cost to PHIs right now. So if you actually want a meaningful cost here for, say, x86, you may want to look into the cost model as well...
3899–3900	...that need predication? (Before, we really would predicate any store)
3899–3900	+1
3899–3900	We don't run anything that will sink this later in the pipeline? (To be honest, even if we do, I'm not sure whether we should do the cleanup here or rely on a later pass, but I'm curious)
3901	Any reason not to use a range for over the operands?
3907	Do you know if we have an isOnlyUserOf helper? I know we have one for SDNode... but, I couldn't find one in IR.
3925	I->hasOneUse()? Or do you care specifically about the users() list? In any case, no need to compute std::distance.
3926	I->user_begin()?
4378	FRem looks out of place. Did you mean URem?
4381	To expand on what Eli said, if we have division by a non-zero constant, then: (1) it may be efficiently lowered; (2) since the constant is non-zero, it doesn't need predication.
4391–4392	And, correspondingly, this should probably be FRem.
5257–5258	As long as you're touching this - remove this comment?

gilr added inline comments. Jul 31 2016, 8:19 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
3399	So IIRC I took it from CostModelAnalysis::getInstructionCost(), but I'll remove it if it makes no sense.
3899–3900	Actually the vectorizer currently seems to expect inst-combine to do so (see the deleted VEC-IC case in if-pred-stores.ll), but IINM the cost model didn't reflect that. I agree, there's the general issue of generating efficient code here vs relying on later passes to clean up, which I think was also brought up here.
3901	No, will fix.
3907	Me either - anyone?
3925	Right, will fix.
3926	Right, will fix.
4378	Yes.
4391–4392	Indeed.

mkuper added inline comments. Aug 1 2016, 2:54 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
3399	I'm not sure what makes sense here, to be honest. What cost, in the final generated code, are you trying to account for? An extra register copy?
4378	Too bad there wasn't a test that would have failed because of this. ;-)

Implemented several reviewer comments, notably avoiding predication when dividing by a non-zero constant.

Added may-divide-by-zero logic to cost model (was missing in previous patch); moved logic to its own helper function.
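
The may-divide-by-zero check mentioned here can be pictured with the toy model below. It deliberately avoids LLVM types so that it stands alone: `Divisor` is a hypothetical stand-in for the second operand of a div/rem, flagged as constant only when that operand is a compile-time integer. The real helper in the patch operates on LLVM instructions and may differ in detail.

```cpp
#include <cstdint>

// Hypothetical stand-in for the divisor operand of a div/rem.
struct Divisor {
  bool IsConstant; // true when the operand is a compile-time integer constant
  int64_t Value;   // meaningful only when IsConstant is true
};

// A div/rem needs predication unless its divisor is a known non-zero
// constant. Unknown divisors are conservatively treated as possibly zero.
inline bool mayDivideByZero(const Divisor &D) {
  return !D.IsConstant || D.Value == 0;
}
```

Under this model a division by the constant 7 is scalarized or widened without a guard, while a division by a loaded value keeps its guard.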

More drive-by comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
3905–3914	Would be good to describe the high-level strategy of how we predicate these instructions, perhaps with an IR example.
3916–3917	I think that Twine knows how to concatenate string-like things. You only need the explicit ctor on the first one.
test/Transforms/LoopVectorize/if-pred-non-void.ll
19–52	I think we're pretty consistent about using uppercase for the named regexes. That helps readability.

gilr added inline comments. Aug 4 2016, 3:56 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
3399	I actually wasn't trying to model any specific cost in the generated code, just trying to be consistent about accounting for every generated instruction at IR level (and letting TTI decide their cost). So if PHIs have zero cost by definition/convention and should not be taken into account in cost models then I should just remove this. Otherwise, placing the call now makes sure we don't miss that cost if targets start modelling it. What say you?
3916–3917	Right, will fix.

mkuper added inline comments. Aug 4 2016, 9:28 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
3399	We're not even consistent about it in our different cost models - CostModelAnalysis::getInstructionCost() calls getCFInstrCost(), while the (admittedly old) BBVectorizer cost model just returns 0. But I get what you're saying. If you prefer to leave it, leave it, but I think it'd be nice to document the fact you don't currently expect a real cost here.

Merged with prequel patch r277595 (D23013).
Changed named registers in new lit test to uppercase.
Documented predication logic
Removed unnecessary Twine ctor calls.

LGTM, but please wait a bit for anemet, in case he also wants to review this in non-drive-by-mode. :-)

This revision is now accepted and ready to land. Aug 5 2016, 10:14 AM

anemet added inline comments. Aug 9 2016, 12:01 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
3385–3386	There is something wrong with this sentence.
3389	Predicate -> Predicated. Same in the other functions.
3417–3418	Explain in the comment how this guy is different from the previous one.
3432–3433	Same here.
3434	Type* -> Type *. Did you run clang-format on the diff?
4108	auto *
4111–4118	Is this any different than OpInst->hasOneUse?
4138–4141	Is there a test for the non-insertelt case?
4141	Why is the undef correct here?
4350	auto *

gilr added inline comments. Aug 10 2016, 3:53 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4111–4118	Yes, for Instructions that use the same value more than once (see Michael's comment).
4138–4141	No. We currently always create an insertelement on scalarization. I added support for the non-insertelement case for completeness, since predication is done separately from scalarization. IIUC this case will need to be supported if & when Matthew's patch is committed, but for now it's really FFU. I'll replace this case with an assertion for this patch and leave it to Matthew to resurrect in his patch as needed.
4141	We are re-introducing the original scalar conditional execution of an instruction here. This undef can reach either (a) a select that will blend it out, or (b) a Use dominated by this instruction's BB. In case (b) the Use is either predicated by (at least) the same predicate, so it won't use the undef, or not predicated because it has no side effects, in which case undef is safe.

mssimpso added inline comments. Aug 10 2016, 5:06 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4138–4141	Gil, For the non-insert cases that might arise with my patch, I think it would be better to leave the case here (don't add an assert), but include a test with this patch that would break with the other patch. Does that make sense? That will keep the patches better self-contained and prevent us from having to revisit this. You will need a test where the div only feeds an instruction that will also be scalar (like a GEP). If this patch lands before the other one, it will be cleaner if in the other we just have to change the test. Presumably, that would involve replacing the PHI for the inserts with a PHI for the predicated instruction, since the inserts will no longer be there? If the other patch lands first, there will be no issue since you'll already have a test for the non-insert case.

Ayal added inline comments. Aug 10 2016, 6:33 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
3397–3398	Suggest to first add the cost of the InsertElement and then the (optional) cost of the Phi, just to keep things in the same order they will be executed. Furthermore, the cost of Extract should precede that of Insert.
4349–4350	Suggest to rename Op2 to Divisor. It may be worthwhile to generalize and check isKnownNonZero(). Non-zero divisors that are not compile-time constants will not be converted into multiplication, so we will still end up scalarizing the division, but can do so w/o predication.

Also, store predication is currently behind a flag that defaults to false (-enable-cond-stores-vec). Since you're reusing the store predication logic, I'm wondering if the mayDivideByZero cases should be under the flag as well. What do you think?

Matt.

In D22918#511194, @mssimpso wrote:

Also, store predication is currently behind a flag that defaults to false (-enable-cond-stores-vec). Since you're reusing the store predication logic, I'm wondering if the mayDivideByZero cases should be under the flag as well. What do you think?

Matt.

The reason enable-cond-stores-vec defaults to false is the lack of cost modeling for predicated stores.
The change to getScalarizationOverhead() is supposed to solve this for divs. But I guess it depends on the real-world performance impact this patch has.
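
As a back-of-the-envelope sketch of the revised overhead computation, the function below mirrors the shape of getScalarizationOverhead() in this diff, with the target queries replaced by caller-supplied per-lane costs. `LaneCosts`, its fields, and `scalarizationOverhead` are hypothetical; in the patch the individual costs come from TargetTransformInfo.

```cpp
// Hypothetical per-lane costs; in the patch these come from TTI queries.
struct LaneCosts {
  unsigned Insert;
  unsigned Extract;
  unsigned Phi;
};

// Sum the extract/insert costs (plus a PHI per insert when predicated) over
// all lanes, then halve the total for predicated instructions, reflecting
// the assumption that an if-converted block runs about half the time.
inline unsigned scalarizationOverhead(unsigned NumLanes, bool Insert,
                                      bool Extract, bool Predicated,
                                      const LaneCosts &C) {
  unsigned Cost = 0;
  for (unsigned I = 0; I < NumLanes; ++I) {
    if (Extract)
      Cost += C.Extract;
    if (Insert) {
      Cost += C.Insert;
      if (Predicated)
        Cost += C.Phi;
    }
  }
  if (Predicated)
    Cost /= 2;
  return Cost;
}
```

With a PHI cost of zero, which is what the discussion above suggests targets report today, predication simply halves the insert and extract overhead in this model.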

Per Matt's comment: code continues to support the non-insertelement case; added an FFU test that should fail once the vectorizer supports direct scalar-scalar use.
Implemented (hopefully) all other review comments
Ran clang-format again

mssimpso added inline comments. Aug 11 2016, 2:26 PM

test/Transforms/LoopVectorize/if-pred-non-void.ll
136–138	Thanks for adding the additional "future" test. I don't think it will exercise the non-insert case, though. I'm very sorry for not being more clear previously. Here, %rsd will always have to be inserted into a vector since it will be directly used by a select instruction, which will remain vectorized. I didn't think of this when I last commented. But I think if you add an additional instruction, this should produce the desired effect. Something like:

if.then:
  %tmp = sdiv i32 %psd, %lsd
  %rsd = sdiv i32 %tmp, %lsd
  br label %if.end

When I ran the modified test with this patch and the scalar patch, the non-insert case was used for %tmp and the insert case was used for %rsd. This makes sense because %tmp is only used by %rsd (will be scalar), and %rsd will again feed the vector select.

gilr added inline comments. Aug 12 2016, 2:01 PM

test/Transforms/LoopVectorize/if-pred-non-void.ll
136–138	Argh, sorry about that. Your explanation was clear - just a hasty implementation on my side :( Yes, the second sdiv should go under the same condition - will fix.

Fixed FFU test case.

mssimpso added inline comments. Aug 15 2016, 11:39 AM

test/Transforms/LoopVectorize/if-pred-non-void.ll
138–140	The test looks good to me now. Thanks!

Adam,

Do you have any other comments? Am I good to go with this change?
Thanks!

Gil, I will look at this today.

anemet added inline comments. Aug 17 2016, 11:58 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
4070–4096	I would also include the select instruction in this excerpt (omitting anything in between with a ...). Then you can explain in the initial comment that the value produced on the false branch is not used (the conditional execution is only reintroduced to avoid side-effects).
4076	"So for the first element of a scalarized instruction, e.g."
4111–4118	OK, is this difference relevant? If yes, add a helper either in this module or at a more global place and use it. You can probably also write this with std::all_of or something.
4141	"We are re-introducing the original scalar conditional execution of an instruction here. This undef can reach either..." Makes sense, but we need a comment for this. I think the best is to explain this on the excerpt at the beginning; see my comment there.
test/Transforms/LoopVectorize/if-pred-non-void.ll
9–10	As a demo for how this works it would be actually good to include at least one of the second element sequences as well.
21–22	Can you please name and match this extract as well, it helps reading. Everywhere in these tests.
23–24	It would be also good to check the extractelements feeding the divs here.

gilr added inline comments. Aug 18 2016, 2:32 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
4111–4118	It shouldn't make a difference for the currently predicated instructions. I'll replace with hasOneUse with a comment to capture the possible (conservative) inaccuracy.
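
The distinction debated in this thread can be shown with a toy use list. This is not LLVM's Value/Use machinery: `ToyValue`, `hasOneUse`, and `hasSingleUser` are hypothetical, and each entry of `Users` records the id of the instruction behind one use.

```cpp
#include <algorithm>
#include <vector>

// Toy value: one entry per use, holding the id of the using instruction.
struct ToyValue {
  std::vector<int> Users;
};

// True when there is exactly one use.
inline bool hasOneUse(const ToyValue &V) { return V.Users.size() == 1; }

// True when every use belongs to the same instruction, even if that
// instruction uses the value more than once (for example x * x).
inline bool hasSingleUser(const ToyValue &V) {
  return !V.Users.empty() &&
         std::all_of(V.Users.begin(), V.Users.end(),
                     [&](int U) { return U == V.Users.front(); });
}
```

In this model, whenever `hasOneUse` is true `hasSingleUser` is also true, so substituting the former for the latter can only err on the conservative side.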

Implemented Adam's comments.

Ping

LGTM too with the comments addressed below. Thanks!

lib/Transforms/Vectorize/LoopVectorize.cpp
4079	Please remove the '; pred =' comments. You may want to add a // ... after the for.body: label in all these loops. That is where the div operand would be loaded, etc.
4091	s/selected-out/if-converted using a select/
4101–4102	%33 and %34 are not used, please remove

Closed by commit rL279620: [Loop Vectorizer] Support predication of div/rem (authored by gilr). Aug 24 2016, 4:46 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Transforms/Vectorize/LoopVectorize.cpp	273 lines
test/Transforms/LoopVectorize/if-pred-non-void.ll	149 lines
test/Transforms/LoopVectorize/if-pred-not-when-safe.ll	90 lines
test/Transforms/LoopVectorize/if-pred-stores.ll	31 lines

Diff 67673

lib/Transforms/Vectorize/LoopVectorize.cpp

[... 361 lines not shown ...]	protected:
void fixFirstOrderRecurrence(PHINode *Phi);		void fixFirstOrderRecurrence(PHINode *Phi);

/// \brief The Loop exit block may have single value PHI nodes where the		/// \brief The Loop exit block may have single value PHI nodes where the
/// incoming value is 'Undef'. While vectorizing we only handled real values		/// incoming value is 'Undef'. While vectorizing we only handled real values
/// that were defined inside the loop. Here we fix the 'undef case'.		/// that were defined inside the loop. Here we fix the 'undef case'.
/// See PR14725.		/// See PR14725.
void fixLCSSAPHIs();		void fixLCSSAPHIs();

/// Predicate conditional stores on their respective conditions.		/// Predicate conditional instructions that require predication on their
void predicateStores();		/// respective conditions.
		void predicateInstructions();

/// Shrinks vector element sizes based on information in "MinBWs".		/// Shrinks vector element sizes based on information in "MinBWs".
void truncateToMinimalBitwidths();		void truncateToMinimalBitwidths();

/// A helper function that computes the predicate of the block BB, assuming		/// A helper function that computes the predicate of the block BB, assuming
/// that the header block of the loop is set to True. It returns the entry		/// that the header block of the loop is set to True. It returns the entry
/// mask for the block BB.		/// mask for the block BB.
VectorParts createBlockInMask(BasicBlock *BB);		VectorParts createBlockInMask(BasicBlock *BB);
[... 10 lines not shown ...]	protected:
void widenPHIInstruction(Instruction *PN, VectorParts &Entry, unsigned UF,		void widenPHIInstruction(Instruction *PN, VectorParts &Entry, unsigned UF,
unsigned VF, PhiVector *PV);		unsigned VF, PhiVector *PV);

/// Insert the new loop to the loop hierarchy and pass manager		/// Insert the new loop to the loop hierarchy and pass manager
/// and update the analysis passes.		/// and update the analysis passes.
void updateAnalysis();		void updateAnalysis();

/// This instruction is un-vectorizable. Implement it as a sequence		/// This instruction is un-vectorizable. Implement it as a sequence
/// of scalars. If \p IfPredicateStore is true we need to 'hide' each		/// of scalars. If \p IfPredicateInstr is true we need to 'hide' each
/// scalarized instruction behind an if block predicated on the control		/// scalarized instruction behind an if block predicated on the control
/// dependence of the instruction.		/// dependence of the instruction.
virtual void scalarizeInstruction(Instruction *Instr,		virtual void scalarizeInstruction(Instruction *Instr,
bool IfPredicateStore = false);		bool IfPredicateInstr = false);

/// Vectorize Load and Store instructions,		/// Vectorize Load and Store instructions,
virtual void vectorizeMemoryInstruction(Instruction *Instr);		virtual void vectorizeMemoryInstruction(Instruction *Instr);

/// Create a broadcast instruction. This method generates a broadcast		/// Create a broadcast instruction. This method generates a broadcast
/// instruction (shuffle) for loop invariant values and for the induction		/// instruction (shuffle) for loop invariant values and for the induction
/// value. If this is the induction variable then we extend it to N, N+1, ...		/// value. If this is the induction variable then we extend it to N, N+1, ...
/// this is needed because each iteration in the loop corresponds to a SIMD		/// this is needed because each iteration in the loop corresponds to a SIMD
[... 189 lines not shown ...]	protected:
/// ScalarIVMap maps induction variables from the original loop that are not		/// ScalarIVMap maps induction variables from the original loop that are not
/// vectorized to their scalar equivalents in the vector loop. Maintaining a		/// vectorized to their scalar equivalents in the vector loop. Maintaining a
/// separate map for scalarized induction variables allows us to avoid		/// separate map for scalarized induction variables allows us to avoid
/// unnecessary scalar-to-vector-to-scalar conversions.		/// unnecessary scalar-to-vector-to-scalar conversions.
DenseMap<Value *, SmallVector<Value *, 8>> ScalarIVMap;		DenseMap<Value *, SmallVector<Value *, 8>> ScalarIVMap;

/// Store instructions that should be predicated, as a pair		/// Store instructions that should be predicated, as a pair
/// <StoreInst, Predicate>		/// <StoreInst, Predicate>
SmallVector<std::pair<StoreInst *, Value *>, 4> PredicatedStores;		SmallVector<std::pair<Instruction *, Value *>, 4> PredicatedInstructions;
EdgeMaskCache MaskCache;		EdgeMaskCache MaskCache;
/// Trip count of the original loop.		/// Trip count of the original loop.
Value *TripCount;		Value *TripCount;
/// Trip count of the widened loop (TripCount - TripCount % (VF*UF))		/// Trip count of the widened loop (TripCount - TripCount % (VF*UF))
Value *VectorTripCount;		Value *VectorTripCount;

/// Map of scalar integer values to the smallest bitwidth they can be legally		/// Map of scalar integer values to the smallest bitwidth they can be legally
/// represented as. The vector equivalents of these values should be truncated		/// represented as. The vector equivalents of these values should be truncated
[... 13 lines not shown ...]	InnerLoopUnroller(Loop *OrigLoop, PredicatedScalarEvolution &PSE,
const TargetLibraryInfo *TLI,		const TargetLibraryInfo *TLI,
const TargetTransformInfo *TTI, AssumptionCache *AC,		const TargetTransformInfo *TTI, AssumptionCache *AC,
OptimizationRemarkEmitter *ORE, unsigned UnrollFactor)		OptimizationRemarkEmitter *ORE, unsigned UnrollFactor)
: InnerLoopVectorizer(OrigLoop, PSE, LI, DT, TLI, TTI, AC, ORE, 1,		: InnerLoopVectorizer(OrigLoop, PSE, LI, DT, TLI, TTI, AC, ORE, 1,
UnrollFactor) {}		UnrollFactor) {}

private:		private:
void scalarizeInstruction(Instruction *Instr,		void scalarizeInstruction(Instruction *Instr,
bool IfPredicateStore = false) override;		bool IfPredicateInstr = false) override;
void vectorizeMemoryInstruction(Instruction *Instr) override;		void vectorizeMemoryInstruction(Instruction *Instr) override;
Value *getBroadcastInstrs(Value *V) override;		Value *getBroadcastInstrs(Value *V) override;
Value *getStepVector(Value *Val, int StartIdx, Value *Step,		Value *getStepVector(Value *Val, int StartIdx, Value *Step,
Instruction::BinaryOps Opcode =		Instruction::BinaryOps Opcode =
Instruction::BinaryOpsEnd) override;		Instruction::BinaryOpsEnd) override;
Value *reverseVector(Value *Vec) override;		Value *reverseVector(Value *Vec) override;
};		};

[... 2,094 lines not shown ...]	if (CreateGatherScatter) {
NewLI = Builder.CreateAlignedLoad(VecPtr, Alignment, "wide.load");		NewLI = Builder.CreateAlignedLoad(VecPtr, Alignment, "wide.load");
Entry[Part] = Reverse ? reverseVector(NewLI) : NewLI;		Entry[Part] = Reverse ? reverseVector(NewLI) : NewLI;
}		}
addMetadata(NewLI, LI);		addMetadata(NewLI, LI);
}		}
}		}

void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr,		void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr,
bool IfPredicateStore) {		bool IfPredicateInstr) {
assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");		assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
		DEBUG(dbgs() << "LV: Scalarizing"
		<< (IfPredicateInstr ? " and predicating:" : ":") << *Instr
		<< '\n');
// Holds vector parameters or scalars, in case of uniform vals.		// Holds vector parameters or scalars, in case of uniform vals.
SmallVector<VectorParts, 4> Params;		SmallVector<VectorParts, 4> Params;

setDebugLocFromInst(Builder, Instr);		setDebugLocFromInst(Builder, Instr);

// Find all of the vectorized parameters.		// Find all of the vectorized parameters.
for (Value *SrcOp : Instr->operands()) {		for (Value *SrcOp : Instr->operands()) {
// If we are accessing the old induction variable, use the new one.		// If we are accessing the old induction variable, use the new one.
[... 27 lines not shown ...]	void InnerLoopVectorizer::scalarizeInstruction(Instruction *Instr,

Value *UndefVec =		Value *UndefVec =
IsVoidRetTy ? nullptr		IsVoidRetTy ? nullptr
: UndefValue::get(VectorType::get(Instr->getType(), VF));		: UndefValue::get(VectorType::get(Instr->getType(), VF));
// Create a new entry in the WidenMap and initialize it to Undef or Null.		// Create a new entry in the WidenMap and initialize it to Undef or Null.
VectorParts &VecResults = WidenMap.splat(Instr, UndefVec);		VectorParts &VecResults = WidenMap.splat(Instr, UndefVec);

VectorParts Cond;		VectorParts Cond;
if (IfPredicateStore) {		if (IfPredicateInstr) {
assert(Instr->getParent()->getSinglePredecessor() &&		assert(Instr->getParent()->getSinglePredecessor() &&
"Only support single predecessor blocks");		"Only support single predecessor blocks");
Cond = createEdgeMask(Instr->getParent()->getSinglePredecessor(),		Cond = createEdgeMask(Instr->getParent()->getSinglePredecessor(),
Instr->getParent());		Instr->getParent());
}		}

// For each vector unroll 'part':		// For each vector unroll 'part':
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
// For each scalar that we create:		// For each scalar that we create:
for (unsigned Width = 0; Width < VF; ++Width) {		for (unsigned Width = 0; Width < VF; ++Width) {

// Start if-block.		// Start if-block.
Value *Cmp = nullptr;		Value *Cmp = nullptr;
if (IfPredicateStore) {		if (IfPredicateInstr) {
Cmp = Builder.CreateExtractElement(Cond[Part], Builder.getInt32(Width));		Cmp = Builder.CreateExtractElement(Cond[Part], Builder.getInt32(Width));
Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cmp,		Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cmp,
ConstantInt::get(Cmp->getType(), 1));		ConstantInt::get(Cmp->getType(), 1));
}		}

Instruction *Cloned = Instr->clone();		Instruction *Cloned = Instr->clone();
if (!IsVoidRetTy)		if (!IsVoidRetTy)
Cloned->setName(Instr->getName() + ".cloned");		Cloned->setName(Instr->getName() + ".cloned");
[... 22 lines not shown ...]	for (unsigned Width = 0; Width < VF; ++Width) {
AC->registerAssumption(II);		AC->registerAssumption(II);

// If the original scalar returns a value we need to place it in a vector		// If the original scalar returns a value we need to place it in a vector
// so that future users will be able to use it.		// so that future users will be able to use it.
if (!IsVoidRetTy)		if (!IsVoidRetTy)
VecResults[Part] = Builder.CreateInsertElement(VecResults[Part], Cloned,		VecResults[Part] = Builder.CreateInsertElement(VecResults[Part], Cloned,
Builder.getInt32(Width));		Builder.getInt32(Width));
// End if-block.		// End if-block.
if (IfPredicateStore)		if (IfPredicateInstr)
PredicatedStores.push_back(		PredicatedInstructions.push_back(std::make_pair(Cloned, Cmp));
std::make_pair(cast<StoreInst>(Cloned), Cmp));
}		}
}		}
}		}

PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start,		PHINode *InnerLoopVectorizer::createInductionVariable(Loop *L, Value *Start,
Value *End, Value *Step,		Value *End, Value *Step,
Instruction *DL) {		Instruction *DL) {
BasicBlock *Header = L->getHeader();		BasicBlock *Header = L->getHeader();
[... 514 lines not shown ...]	static Value *addFastMathFlag(Value *V) {
if (isa<FPMathOperator>(V)) {		if (isa<FPMathOperator>(V)) {
FastMathFlags Flags;		FastMathFlags Flags;
Flags.setUnsafeAlgebra();		Flags.setUnsafeAlgebra();
cast<Instruction>(V)->setFastMathFlags(Flags);		cast<Instruction>(V)->setFastMathFlags(Flags);
}		}
return V;		return V;
}		}

/// Estimate the overhead of scalarizing a value. Insert and Extract are set if		/// \brief Estimate the overhead of scalarizing a value based on its type.
/// the result needs to be inserted and/or extracted from vectors.		/// Insert and Extract are set if the result needs to be inserted and/or
		/// extracted from vectors.
		/// If the instruction is also to be predicated, add the cost of a PHI
		/// node to the insertion cost.
static unsigned getScalarizationOverhead(Type *Ty, bool Insert, bool Extract,		static unsigned getScalarizationOverhead(Type *Ty, bool Insert, bool Extract,
		bool Predicated,
const TargetTransformInfo &TTI) {		const TargetTransformInfo &TTI) {
if (Ty->isVoidTy())		if (Ty->isVoidTy())
return 0;		return 0;

assert(Ty->isVectorTy() && "Can only scalarize vectors");		assert(Ty->isVectorTy() && "Can only scalarize vectors");
unsigned Cost = 0;		unsigned Cost = 0;

for (unsigned I = 0, E = Ty->getVectorNumElements(); I < E; ++I) {		for (unsigned I = 0, E = Ty->getVectorNumElements(); I < E; ++I) {
if (Insert)
Cost += TTI.getVectorInstrCost(Instruction::InsertElement, Ty, I);
if (Extract)		if (Extract)
Cost += TTI.getVectorInstrCost(Instruction::ExtractElement, Ty, I);		Cost += TTI.getVectorInstrCost(Instruction::ExtractElement, Ty, I);
		if (Insert) {
		Cost += TTI.getVectorInstrCost(Instruction::InsertElement, Ty, I);
		if (Predicated)
		Cost += TTI.getCFInstrCost(Instruction::PHI);
}		}
		}

		// We assume that if-converted blocks have a 50% chance of being executed.
		// Predicated scalarized instructions are avoided due to the CF that bypasses
		// turned off lanes. The extracts and inserts will be sinked/hoisted to the
		// predicated basic-block and are subjected to the same assumption.
		if (Predicated)
		Cost /= 2;

return Cost;		return Cost;
}		}

		/// \brief Estimate the overhead of scalarizing an Instruction based on the
		/// types of its operands and return value.
		static unsigned getScalarizationOverhead(SmallVectorImpl<Type *> &OpTys,
		Type *RetTy, bool Predicated,
		const TargetTransformInfo &TTI) {
		unsigned ScalarizationCost =
		getScalarizationOverhead(RetTy, true, false, Predicated, TTI);

		for (Type *Ty : OpTys)
		ScalarizationCost +=
		getScalarizationOverhead(Ty, false, true, Predicated, TTI);

		return ScalarizationCost;
		}

		/// \brief Estimate the overhead of scalarizing an instruction. This is a
		/// convenience wrapper for the type-based getScalarizationOverhead API.
		static unsigned getScalarizationOverhead(Instruction *I, unsigned VF,
		bool Predicated,
		const TargetTransformInfo &TTI) {
		if (VF == 1)
		return 0;

		Type *RetTy = ToVectorTy(I->getType(), VF);

		SmallVector<Type *, 4> OpTys;
		unsigned OperandsNum = I->getNumOperands();
		for (unsigned OpInd = 0; OpInd < OperandsNum; ++OpInd)
		OpTys.push_back(ToVectorTy(I->getOperand(OpInd)->getType(), VF));

		return getScalarizationOverhead(OpTys, RetTy, Predicated, TTI);
		}

// Estimate cost of a call instruction CI if it were vectorized with factor VF.		// Estimate cost of a call instruction CI if it were vectorized with factor VF.
// Return the cost of the instruction, including scalarization overhead if it's		// Return the cost of the instruction, including scalarization overhead if it's
// needed. The flag NeedToScalarize shows if the call needs to be scalarized -		// needed. The flag NeedToScalarize shows if the call needs to be scalarized -
// i.e. either vector version isn't available, or is too expensive.		// i.e. either vector version isn't available, or is too expensive.
static unsigned getVectorCallCost(CallInst *CI, unsigned VF,		static unsigned getVectorCallCost(CallInst *CI, unsigned VF,
const TargetTransformInfo &TTI,		const TargetTransformInfo &TTI,
const TargetLibraryInfo *TLI,		const TargetLibraryInfo *TLI,
bool &NeedToScalarize) {		bool &NeedToScalarize) {
Show All 14 Lines	static unsigned getVectorCallCost(CallInst *CI, unsigned VF,

// Compute corresponding vector type for return value and arguments.		// Compute corresponding vector type for return value and arguments.
Type *RetTy = ToVectorTy(ScalarRetTy, VF);		Type *RetTy = ToVectorTy(ScalarRetTy, VF);
for (Type *ScalarTy : ScalarTys)		for (Type *ScalarTy : ScalarTys)
Tys.push_back(ToVectorTy(ScalarTy, VF));		Tys.push_back(ToVectorTy(ScalarTy, VF));

// Compute costs of unpacking argument values for the scalar calls and		// Compute costs of unpacking argument values for the scalar calls and
// packing the return values to a vector.		// packing the return values to a vector.
unsigned ScalarizationCost =		unsigned ScalarizationCost = getScalarizationOverhead(Tys, RetTy, false, TTI);
getScalarizationOverhead(RetTy, true, false, TTI);
for (Type *Ty : Tys)
ScalarizationCost += getScalarizationOverhead(Ty, false, true, TTI);

unsigned Cost = ScalarCallCost * VF + ScalarizationCost;		unsigned Cost = ScalarCallCost * VF + ScalarizationCost;

// If we can't emit a vector call for this function, then the currently found		// If we can't emit a vector call for this function, then the currently found
// cost is the cost we need to return.		// cost is the cost we need to return.
NeedToScalarize = true;		NeedToScalarize = true;
if (!TLI || !TLI->isFunctionVectorizable(FnName, VF) || CI->isNoBuiltin())		if (!TLI || !TLI->isFunctionVectorizable(FnName, VF) || CI->isNoBuiltin())
return Cost;		return Cost;
▲ Show 20 Lines • Show All 402 Lines • ▼ Show 20 Lines	for (PHINode *Phi : PHIsToFix) {
Phi->setIncomingValue(SelfEdgeBlockIdx, BCBlockPhi);		Phi->setIncomingValue(SelfEdgeBlockIdx, BCBlockPhi);
Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst);		Phi->setIncomingValue(IncomingEdgeBlockIdx, LoopExitInst);
} // end of for each Phi in PHIsToFix.		} // end of for each Phi in PHIsToFix.

fixLCSSAPHIs();		fixLCSSAPHIs();

// Make sure DomTree is updated.		// Make sure DomTree is updated.
updateAnalysis();		updateAnalysis();

predicateStores();		predicateInstructions();
		anemet: Only a drive-by comment: please don't make vectorizeLoop any more unreadable than it already is. Please consider a prequel to this patch that moves store predication into its own function.
		mkuper: +1
		mkuper: ...that need predication? (Before, we really would predicate any store.)
		mkuper: We don't run anything that will sink this later in the pipeline? (To be honest, even if we do, I'm not sure whether we should do the cleanup here or rely on a later pass, but I'm curious.)
		gilr (author): Actually the vectorizer currently seems to expect inst-combine to do so (see the deleted VEC-IC case in if-pred-stores.ll), but IINM the cost model didn't reflect that. I agree, there's the general issue of generating efficient code here vs relying on later passes to clean up, which I think was also brought up here.

		mkuper: Any reason not to use a range for over the operands?
		gilr (author): No, will fix.
// Remove redundant induction instructions.		// Remove redundant induction instructions.
cse(LoopVectorBody);		cse(LoopVectorBody);
}		}

void InnerLoopVectorizer::fixFirstOrderRecurrence(PHINode *Phi) {		void InnerLoopVectorizer::fixFirstOrderRecurrence(PHINode *Phi) {

		mkuper: Do you know if we have an isOnlyUserOf helper? I know we have one for SDNode... but, I couldn't find one in IR.
		gilr (author): Me either - anyone?
// This is the second phase of vectorizing first-order recurrences. An		// This is the second phase of vectorizing first-order recurrences. An
// overview of the transformation is described below. Suppose we have the		// overview of the transformation is described below. Suppose we have the
// following loop.		// following loop.
//		//
// for (int i = 0; i < n; ++i)		// for (int i = 0; i < n; ++i)
// b[i] = a[i] - a[i - 1];		// b[i] = a[i] - a[i - 1];
//		//
		anemet: Would be good to describe the high-level strategy of how we predicate these instructions, perhaps with an IR example.
// There is a first-order recurrence on "a". For this loop, the shorthand		// There is a first-order recurrence on "a". For this loop, the shorthand
// scalar IR looks like:		// scalar IR looks like:
//		//
		anemet: I think that Twine knows how to concatenate string-like things. You only need the explicit ctor on the first one.
		gilr (author): Right, will fix.
// scalar.ph:		// scalar.ph:
// s_init = a[-1]		// s_init = a[-1]
// br scalar.body		// br scalar.body
//		//
// scalar.body:		// scalar.body:
// i = phi [0, scalar.ph], [i+1, scalar.body]		// i = phi [0, scalar.ph], [i+1, scalar.body]
// s1 = phi [s_init, scalar.ph], [s2, scalar.body]		// s1 = phi [s_init, scalar.ph], [s2, scalar.body]
// s2 = a[i]		// s2 = a[i]
		mkuper: I->hasOneUse()? Or do you care specifically about the users() list? In any case, no need to compute std::distance.
		gilr (author): Right, will fix.
// b[i] = s2 - s1		// b[i] = s2 - s1
		mkuper: I->user_begin()?
		gilr (author): Right, will fix.
// br cond, scalar.body, ...		// br cond, scalar.body, ...
//		//
// In this example, s1 is a recurrence because its value depends on the		// In this example, s1 is a recurrence because its value depends on the
// previous iteration. In the first phase of vectorization, we created a		// previous iteration. In the first phase of vectorization, we created a
// temporary value for s1. We now complete the vectorization and produce the		// temporary value for s1. We now complete the vectorization and produce the
// shorthand vector IR shown below (for VF = 4, UF = 1).		// shorthand vector IR shown below (for VF = 4, UF = 1).
//		//
// vector.ph:		// vector.ph:
▲ Show 20 Lines • Show All 124 Lines • ▼ Show 20 Lines	for (Instruction &LEI : *LoopExitBlock) {
auto *LCSSAPhi = dyn_cast<PHINode>(&LEI);		auto *LCSSAPhi = dyn_cast<PHINode>(&LEI);
if (!LCSSAPhi)		if (!LCSSAPhi)
break;		break;
if (LCSSAPhi->getNumIncomingValues() == 1)		if (LCSSAPhi->getNumIncomingValues() == 1)
LCSSAPhi->addIncoming(UndefValue::get(LCSSAPhi->getType()),		LCSSAPhi->addIncoming(UndefValue::get(LCSSAPhi->getType()),
LoopMiddleBlock);		LoopMiddleBlock);
}		}
}		}

void InnerLoopVectorizer::predicateStores() {		void InnerLoopVectorizer::predicateInstructions() {
for (auto KV : PredicatedStores) {
		// For each instruction I marked for predication on value C, split I into its
		// own basic block to form an if-then construct over C.
		// Since I may be fed by extractelement and/or be feeding an insertelement
		// generated during scalarization we try to move such instructions into the
		// predicated basic block as well. For the insertelement this also means that
		// the PHI will be created for the resulting vector rather than for the
		// scalar instruction. So for some scalarized instruction, e.g.
		anemet: "So for the first element of a scalarized instruction, e.g."
		//
		// %34 = extractelement <2 x i32> %26, i32 0
		// %35 = extractelement <2 x i32> %wide.load, i32 0
		anemet: Please remove the '; pred =' comments. You may want to add a // ... after the for.body: label in all these loops. That is where the div operand would be loaded, etc.
		// %36 = sdiv i32 %34, %35
		// %37 = insertelement <2 x i32> undef, i32 %36, i32 0
		//
		// predication typically yields:
		//
		// %33 = icmp eq i1 %32, true
		// br i1 %33, label %pred.sdiv.if, label %pred.sdiv.continue
		//
		// pred.sdiv.if: ; preds = %vector.body
		// %34 = extractelement <2 x i32> %26, i32 0
		// %35 = extractelement <2 x i32> %wide.load, i32 0
		// %36 = sdiv i32 %34, %35
		anemet: s/selected-out/if-converted using a select/
		// %37 = insertelement <2 x i32> undef, i32 %36, i32 0
		// br label %pred.sdiv.continue
		//
		// pred.sdiv.continue: ; preds = %pred.sdiv.if, %vector.body
		// %38 = phi <2 x i32> [ undef, %vector.body ], [ %37, %pred.sdiv.if ]
		anemet: I would also include the select instruction in this excerpt (omitting anything in between with a ...). Then you can explain in the initial comment that the value produced on the false branch is not used (the conditional execution is only reintroduced to avoid side-effects).

		for (auto KV : PredicatedInstructions) {
BasicBlock::iterator I(KV.first);		BasicBlock::iterator I(KV.first);
auto BB = SplitBlock(I->getParent(), &*std::next(I), DT, LI);		BasicBlock *Head = I->getParent();
		auto BB = SplitBlock(Head, &*std::next(I), DT, LI);
auto T = SplitBlockAndInsertIfThen(KV.second, &*I, /*Unreachable=*/false,		auto T = SplitBlockAndInsertIfThen(KV.second, &*I, /*Unreachable=*/false,
		anemet: %33 and %34 are not used, please remove.
/*BranchWeights=*/nullptr, DT, LI);		/*BranchWeights=*/nullptr, DT, LI);
I->moveBefore(T);		I->moveBefore(T);
I->getParent()->setName("pred.store.if");		// Try to move any extractelement we may have created for the predicated
BB->setName("pred.store.continue");		// instruction into the Then block.
		for (Use &Op : I->operands()) {
		auto OpInst = dyn_cast<ExtractElementInst>(&Op);
		anemet: auto *
		if (!OpInst)
		continue;
		bool CanSinkToUse = true;
		for (User *U : OpInst->users()) {
		if (U != &*I) {
		// The extractelement is feeding another instruction - give up.
		CanSinkToUse = false;
		break;
		}
		}
		anemet: Is this any different than OpInst->hasOneUse?
		gilr (author): Yes, for Instructions that use the same value more than once (see Michael's comment).
		anemet: OK, is this difference relevant? If yes, add a helper either in this module or at a more global place and use it. You can probably also write this with std::all_of or something.
		gilr (author): It shouldn't make a difference for the currently predicated instructions. I'll replace with hasOneUse with a comment to capture the possible (conservative) inaccuracy.
		if (CanSinkToUse)
		OpInst->moveBefore(&*I);
		}

		I->getParent()->setName(Twine("pred.") + I->getOpcodeName() + ".if");
		BB->setName(Twine("pred.") + I->getOpcodeName() + ".continue");

		// If the instruction is non-void create a Phi node at reconvergence point.
		if (!I->getType()->isVoidTy()) {
		Value *IncomingTrue = nullptr;
		Value *IncomingFalse = nullptr;

		if (I->hasOneUse() && isa<InsertElementInst>(*I->user_begin())) {
		// If the predicated instruction is feeding an insert-element, move it
		// into the Then block; Phi node will be created for the vector.
		InsertElementInst *IEI = cast<InsertElementInst>(*I->user_begin());
		IEI->moveBefore(T);
		IncomingTrue = IEI; // the new vector with the inserted element.
		IncomingFalse = IEI->getOperand(0); // the unmodified vector
		} else {
		// Phi node will be created for the scalar predicated instruction.
		IncomingTrue = &*I;
		IncomingFalse = UndefValue::get(I->getType());
		anemet: Is there a test for the non-insertelt case?
		gilr (author): No. We currently always create an insertelement on scalarization. I added support for the non-insertelement case for completeness, since predication is done separately from scalarization. IIUC this case will need to be supported if & when Matthew's patch is committed, but for now it's really FFU. I'll replace this case with an assertion for this patch and leave it to Matthew to resurrect in his patch as needed.
		mssimpso: Gil, for the non-insert cases that might arise with my patch, I think it would be better to leave the case here (don't add an assert), but include a test with this patch that would break with the other patch. Does that make sense? That will keep the patches better self-contained and prevent us from having to revisit this. You will need a test where the div only feeds an instruction that will also be scalar (like a GEP). If this patch lands before the other one, it will be cleaner if in the other we just have to change the test. Presumably, that would involve replacing the PHI for the inserts with a PHI for the predicated instruction, since the inserts will no longer be there? If the other patch lands first, there will be no issue since you'll already have a test for the non-insert case.
		anemet: Why is the undef correct here?
		gilr (author): We are re-introducing the original scalar conditional execution of an instruction here. This undef can reach either a select that will blend it out, or a Use dominated by this instruction's BB that is either predicated by (at least) the same predicate, which won't use the undef, or not predicated due to no side effects, where undef would be safe.
		anemet: (quoting gilr: "We are re-introducing the original scalar conditional execution of an instruction here. This undef can reach either...") Makes sense, but we need a comment for this. I think the best is to explain this on the excerpt at the beginning; see my comment there.
		}

		BasicBlock *PostDom = I->getParent()->getSingleSuccessor();
		assert(PostDom && "Then block has multiple successors");
		PHINode *Phi =
		PHINode::Create(IncomingTrue->getType(), 2, "", &PostDom->front());
		IncomingTrue->replaceAllUsesWith(Phi);
		Phi->addIncoming(IncomingFalse, Head);
		Phi->addIncoming(IncomingTrue, I->getParent());
}		}
		}

DEBUG(DT->verifyDomTree());		DEBUG(DT->verifyDomTree());
}		}
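The effect of predicateInstructions() on a scalarized division can be modeled in plain C++, one lane at a time. The sketch below is not LLVM code: `toyPredicatedSDiv` and its parameters are hypothetical, the zero-initialized `Vec` merely stands in for the undef starting vector, and the final blend assumes that a later select picks between the predicated result and some else-value, as discussed in the review comments above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Per-lane model of a predicated, scalarized sdiv followed by a select.
template <std::size_t VF>
std::array<int, VF> toyPredicatedSDiv(const std::array<int, VF> &Num,
                                      const std::array<int, VF> &Den,
                                      const std::array<bool, VF> &Mask,
                                      const std::array<int, VF> &Else) {
  // Stand-in for the undef vector that feeds the first insertelement.
  std::array<int, VF> Vec{};
  for (std::size_t Lane = 0; Lane < VF; ++Lane) {
    // pred.sdiv.if: extract both operands, divide, insert the result.
    // The division only executes for lanes whose mask bit is set.
    if (Mask[Lane])
      Vec[Lane] = Num[Lane] / Den[Lane];
    // pred.sdiv.continue: the phi carries either the updated vector or the
    // unmodified one, which is what Vec holds at this point in both cases.
  }
  // Later if-conversion: masked-off lanes take the else-value, so whatever
  // the phi carried for them is never observed.
  std::array<int, VF> Out{};
  for (std::size_t Lane = 0; Lane < VF; ++Lane)
    Out[Lane] = Mask[Lane] ? Vec[Lane] : Else[Lane];
  return Out;
}
```

A lane with a zero divisor is harmless as long as its mask bit is off, which is the property the if-then construct is there to preserve.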

InnerLoopVectorizer::VectorParts		InnerLoopVectorizer::VectorParts
InnerLoopVectorizer::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) {		InnerLoopVectorizer::createEdgeMask(BasicBlock *Src, BasicBlock *Dst) {
assert(std::find(pred_begin(Dst), pred_end(Dst), Src) != pred_end(Dst) &&		assert(std::find(pred_begin(Dst), pred_end(Dst), Src) != pred_end(Dst) &&
"Invalid edge");		"Invalid edge");

▲ Show 20 Lines • Show All 171 Lines • ▼ Show 20 Lines	case InductionDescriptor::IK_FpInduction: {
for (unsigned part = 0; part < UF; ++part)		for (unsigned part = 0; part < UF; ++part)
Entry[part] = getStepVector(Broadcasted, VF * part, StepVal,		Entry[part] = getStepVector(Broadcasted, VF * part, StepVal,
II.getInductionOpcode());		II.getInductionOpcode());
return;		return;
}		}
}		}
}		}

		/// A helper function for checking whether an integer division-related
		/// instruction may divide by zero (in which case it must be predicated if
		/// executed conditionally in the scalar code).
		/// TODO: It may be worthwhile to generalize and check isKnownNonZero().
		/// Non-zero divisors that are non compile-time constants will not be
		/// converted into multiplication, so we will still end up scalarizing
		/// the division, but can do so w/o predication.
		static bool mayDivideByZero(Instruction &I) {
		assert((I.getOpcode() == Instruction::UDiv ||
		I.getOpcode() == Instruction::SDiv ||
		anemet: auto *
		Ayal: Suggest to rename Op2 to Divisor. It may be worthwhile to generalize and check isKnownNonZero(). Non-zero divisors that are not compile-time constants will not be converted into multiplication, so we will still end up scalarizing the division, but can do so w/o predication.
		I.getOpcode() == Instruction::URem ||
		I.getOpcode() == Instruction::SRem) &&
		"Unexpected instruction");
		Value *Divisor = I.getOperand(1);
		auto *CInt = dyn_cast<ConstantInt>(Divisor);
		return !CInt || CInt->isZero();
		}
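The rule in mayDivideByZero(), and the way the vectorizer combines it with block predication further down, can be stated without any LLVM types. In this sketch `ToyDivisor`, `toyMayDivideByZero`, and `toyNeedsPredicatedScalarization` are hypothetical names; the divisor is modeled as "compile-time constant or not" plus a value, which is all the helper above inspects.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a divisor operand.
struct ToyDivisor {
  bool IsConstant;    // true if the divisor is a compile-time integer constant
  std::int64_t Value; // meaningful only when IsConstant is true
};

// Conservative check: anything that is not a known non-zero constant may be
// zero at run time.
bool toyMayDivideByZero(const ToyDivisor &D) {
  return !D.IsConstant || D.Value == 0;
}

// A div/rem is scalarized with predication only if it may trap and its block
// executes conditionally; otherwise it can simply be widened.
bool toyNeedsPredicatedScalarization(const ToyDivisor &D,
                                     bool BlockNeedsPredication) {
  return toyMayDivideByZero(D) && BlockNeedsPredication;
}
```

Division by a non-zero constant therefore stays on the plain widening path even inside a conditional block, which matches the point raised in the review about constant divisors.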

void InnerLoopVectorizer::vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV) {		void InnerLoopVectorizer::vectorizeBlockInLoop(BasicBlock *BB, PhiVector *PV) {
// For each instruction in the old loop.		// For each instruction in the old loop.
for (Instruction &I : *BB) {		for (Instruction &I : *BB) {
VectorParts &Entry = WidenMap.get(&I);		VectorParts &Entry = WidenMap.get(&I);

switch (I.getOpcode()) {		switch (I.getOpcode()) {
case Instruction::Br:		case Instruction::Br:
// Nothing to do for PHIs and BR, since we already took care of the		// Nothing to do for PHIs and BR, since we already took care of the
// loop control flow instructions.		// loop control flow instructions.
continue;		continue;
case Instruction::PHI: {		case Instruction::PHI: {
// Vectorize PHINodes.		// Vectorize PHINodes.
widenPHIInstruction(&I, Entry, UF, VF, PV);		widenPHIInstruction(&I, Entry, UF, VF, PV);
continue;		continue;
} // End of PHI.		} // End of PHI.

		case Instruction::UDiv:
		case Instruction::SDiv:
		case Instruction::SRem:
		case Instruction::URem:
		mkuper: FRem looks out of place. Did you mean URem?
		gilr (author): Yes.
		mkuper: Too bad there wasn't a test that would have failed because of this. ;-)
		// Scalarize with predication if this instruction may divide by zero and
		// block execution is conditional, otherwise fallthrough.
		if (mayDivideByZero(I) && Legal->blockNeedsPredication(I.getParent())) {
		mkuper: To expand on what Eli said, if we have division by a non-zero constant, then: 1) It may be efficiently lowered. 2) Since the constant is non-zero, it doesn't need predication.
		scalarizeInstruction(&I, true);
		continue;
		}
case Instruction::Add:		case Instruction::Add:
case Instruction::FAdd:		case Instruction::FAdd:
case Instruction::Sub:		case Instruction::Sub:
case Instruction::FSub:		case Instruction::FSub:
case Instruction::Mul:		case Instruction::Mul:
case Instruction::FMul:		case Instruction::FMul:
case Instruction::UDiv:
case Instruction::SDiv:
case Instruction::FDiv:		case Instruction::FDiv:
case Instruction::URem:
case Instruction::SRem:
case Instruction::FRem:		case Instruction::FRem:
		mkuper: And, correspondingly, this should probably be FRem.
		gilr (author): Indeed.
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor: {		case Instruction::Xor: {
// Just widen binops.		// Just widen binops.
auto *BinOp = cast<BinaryOperator>(&I);		auto *BinOp = cast<BinaryOperator>(&I);
▲ Show 20 Lines • Show All 848 Lines • ▼ Show 20 Lines	if (I.mayReadFromMemory()) {
continue;		continue;
}		}
// !llvm.mem.parallel_loop_access implies if-conversion safety.		// !llvm.mem.parallel_loop_access implies if-conversion safety.
if (IsAnnotatedParallel)		if (IsAnnotatedParallel)
continue;		continue;
return false;		return false;
}		}
}		}

if (I.mayWriteToMemory()) {		if (I.mayWriteToMemory()) {
		mkuper: As long as you're touching this - remove this comment?
auto *SI = dyn_cast<StoreInst>(&I);		auto *SI = dyn_cast<StoreInst>(&I);
// We only support predication of stores in basic blocks with one		// We only support predication of stores in basic blocks with one
// predecessor.		// predecessor.
if (!SI)		if (!SI)
return false;		return false;

// Build a masked store if it is legal for the target.		// Build a masked store if it is legal for the target.
if (isLegalMaskedStore(SI->getValueOperand()->getType(),		if (isLegalMaskedStore(SI->getValueOperand()->getType(),
SI->getPointerOperand()) ||		SI->getPointerOperand()) ||
isLegalMaskedScatter(SI->getValueOperand()->getType())) {		isLegalMaskedScatter(SI->getValueOperand()->getType())) {
MaskedOp.insert(SI);		MaskedOp.insert(SI);
continue;		continue;
}		}

bool isSafePtr = (SafePtrs.count(SI->getPointerOperand()) != 0);		bool isSafePtr = (SafePtrs.count(SI->getPointerOperand()) != 0);
bool isSinglePredecessor = SI->getParent()->getSinglePredecessor();		bool isSinglePredecessor = SI->getParent()->getSinglePredecessor();

if (++NumPredStores > NumberOfStoresToPredicate || !isSafePtr ||		if (++NumPredStores > NumberOfStoresToPredicate || !isSafePtr ||
!isSinglePredecessor)		!isSinglePredecessor)
return false;		return false;
}		}
if (I.mayThrow())		if (I.mayThrow())
return false;		return false;

// The instructions below can trap.
switch (I.getOpcode()) {
default:
continue;
case Instruction::UDiv:
case Instruction::SDiv:
case Instruction::URem:
case Instruction::SRem:
return false;
}
}		}

return true;		return true;
}		}

void InterleavedAccessInfo::collectConstStrideAccesses(		void InterleavedAccessInfo::collectConstStrideAccesses(
MapVector<Instruction *, StrideDescriptor> &AccessStrideInfo,		MapVector<Instruction *, StrideDescriptor> &AccessStrideInfo,
const ValueToValueMap &Strides) {		const ValueToValueMap &Strides) {
▲ Show 20 Lines • Show All 756 Lines • ▼ Show 20 Lines	LoopVectorizationCostModel::calculateRegisterUsage(ArrayRef<unsigned> VFs) {
}		}

return RUs;		return RUs;
}		}

LoopVectorizationCostModel::VectorizationCostTy		LoopVectorizationCostModel::VectorizationCostTy
LoopVectorizationCostModel::expectedCost(unsigned VF) {		LoopVectorizationCostModel::expectedCost(unsigned VF) {
VectorizationCostTy Cost;		VectorizationCostTy Cost;

// For each block.		// For each block.
for (BasicBlock *BB : TheLoop->blocks()) {		for (BasicBlock *BB : TheLoop->blocks()) {
VectorizationCostTy BlockCost;		VectorizationCostTy BlockCost;

// For each instruction in the old loop.		// For each instruction in the old loop.
for (Instruction &I : *BB) {		for (Instruction &I : *BB) {
// Skip dbg intrinsics.		// Skip dbg intrinsics.
if (isa<DbgInfoIntrinsic>(I))		if (isa<DbgInfoIntrinsic>(I))
▲ Show 20 Lines • Show All 127 Lines • ▼ Show 20 Lines	case Instruction::PHI: {
// First-order recurrences are replaced by vector shuffles inside the loop.		// First-order recurrences are replaced by vector shuffles inside the loop.
if (VF > 1 && Legal->isFirstOrderRecurrence(Phi))		if (VF > 1 && Legal->isFirstOrderRecurrence(Phi))
return TTI.getShuffleCost(TargetTransformInfo::SK_ExtractSubvector,		return TTI.getShuffleCost(TargetTransformInfo::SK_ExtractSubvector,
VectorTy, VF - 1, VectorTy);		VectorTy, VF - 1, VectorTy);

// TODO: IF-converted IFs become selects.		// TODO: IF-converted IFs become selects.
return 0;		return 0;
}		}
		case Instruction::UDiv:
		case Instruction::SDiv:
		case Instruction::URem:
		case Instruction::SRem:
		// We assume that if-converted blocks have a 50% chance of being executed.
		// Predicated scalarized instructions are avoided due to the CF that
		// bypasses turned off lanes. If we are not predicating, fallthrough.
		if (VF > 1 && mayDivideByZero(*I) &&
		Legal->blockNeedsPredication(I->getParent()))
		return VF * TTI.getArithmeticInstrCost(I->getOpcode(), RetTy) / 2 +
		getScalarizationOverhead(I, VF, true, TTI);
case Instruction::Add:		case Instruction::Add:
case Instruction::FAdd:		case Instruction::FAdd:
case Instruction::Sub:		case Instruction::Sub:
case Instruction::FSub:		case Instruction::FSub:
case Instruction::Mul:		case Instruction::Mul:
case Instruction::FMul:		case Instruction::FMul:
case Instruction::UDiv:
case Instruction::SDiv:
case Instruction::FDiv:		case Instruction::FDiv:
case Instruction::URem:
case Instruction::SRem:
case Instruction::FRem:		case Instruction::FRem:
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor: {		case Instruction::Xor: {
// Since we will replace the stride by 1 the multiplication should go away.		// Since we will replace the stride by 1 the multiplication should go away.
▲ Show 20 Lines • Show All 217 Lines • ▼ Show 20 Lines	unsigned LoopVectorizationCostModel::getInstructionCost(Instruction *I,
case Instruction::Call: {		case Instruction::Call: {
bool NeedToScalarize;		bool NeedToScalarize;
CallInst *CI = cast<CallInst>(I);		CallInst *CI = cast<CallInst>(I);
unsigned CallCost = getVectorCallCost(CI, VF, TTI, TLI, NeedToScalarize);		unsigned CallCost = getVectorCallCost(CI, VF, TTI, TLI, NeedToScalarize);
if (getVectorIntrinsicIDForCall(CI, TLI))		if (getVectorIntrinsicIDForCall(CI, TLI))
return std::min(CallCost, getVectorIntrinsicCost(CI, VF, TTI, TLI));		return std::min(CallCost, getVectorIntrinsicCost(CI, VF, TTI, TLI));
return CallCost;		return CallCost;
}		}
default: {		default:
// We are scalarizing the instruction. Return the cost of the scalar
// instruction, plus the cost of insert and extract into vector
// elements, times the vector width.
unsigned Cost = 0;

if (!RetTy->isVoidTy() && VF != 1) {
unsigned InsCost =
TTI.getVectorInstrCost(Instruction::InsertElement, VectorTy);
unsigned ExtCost =
TTI.getVectorInstrCost(Instruction::ExtractElement, VectorTy);

// The cost of inserting the results plus extracting each one of the
// operands.
Cost += VF * (InsCost + ExtCost * I->getNumOperands());
}

// The cost of executing VF copies of the scalar instruction. This opcode		// The cost of executing VF copies of the scalar instruction. This opcode
// is unknown. Assume that it is the same as 'mul'.		// is unknown. Assume that it is the same as 'mul'.
Cost += VF * TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy);		return VF * TTI.getArithmeticInstrCost(Instruction::Mul, VectorTy) +
return Cost;		getScalarizationOverhead(I, VF, false, TTI);
}
} // end of switch.		} // end of switch.
}		}
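As a worked instance of the predicated div/rem cost returned by the new UDiv/SDiv/URem/SRem case above, the formula is VF copies of the scalar instruction, halved for the assumed 50% execution probability, plus the (already halved) scalarization overhead. The function name and the numbers in the sketch are hypothetical; only the shape of the expression comes from the patch.

```cpp
#include <cassert>

// Cost of a predicated, scalarized div/rem under the patch's formula:
//   VF * ScalarCost / 2 + PredicatedOverhead
// ScalarCost would come from TTI.getArithmeticInstrCost and
// PredicatedOverhead from getScalarizationOverhead(..., Predicated=true).
unsigned toyPredicatedDivCost(unsigned VF, unsigned ScalarCost,
                              unsigned PredicatedOverhead) {
  // Integer arithmetic: the product is formed first, then halved.
  return VF * ScalarCost / 2 + PredicatedOverhead;
}
```

For example, with VF = 4, an assumed scalar division cost of 20, and an assumed predicated overhead of 6, the result is 4 * 20 / 2 + 6 = 46.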

char LoopVectorize::ID = 0;		char LoopVectorize::ID = 0;
static const char lv_name[] = "Loop Vectorization";		static const char lv_name[] = "Loop Vectorization";
INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false)		INITIALIZE_PASS_BEGIN(LoopVectorize, LV_NAME, lv_name, false, false)
INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)		INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
INITIALIZE_PASS_DEPENDENCY(BasicAAWrapperPass)		INITIALIZE_PASS_DEPENDENCY(BasicAAWrapperPass)
▲ Show 20 Lines • Show All 41 Lines • ▼ Show 20 Lines	void LoopVectorizationCostModel::collectValuesToIgnore() {
// Insert values known to be scalar into VecValuesToIgnore.		// Insert values known to be scalar into VecValuesToIgnore.
for (auto *BB : TheLoop->getBlocks())		for (auto *BB : TheLoop->getBlocks())
for (auto &I : *BB)		for (auto &I : *BB)
if (Legal->isScalarAfterVectorization(&I))		if (Legal->isScalarAfterVectorization(&I))
VecValuesToIgnore.insert(&I);		VecValuesToIgnore.insert(&I);
}		}

void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,		void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,
bool IfPredicateStore) {		bool IfPredicateInstr) {
assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");		assert(!Instr->getType()->isAggregateType() && "Can't handle vectors");
// Holds vector parameters or scalars, in case of uniform vals.		// Holds vector parameters or scalars, in case of uniform vals.
SmallVector<VectorParts, 4> Params;		SmallVector<VectorParts, 4> Params;

setDebugLocFromInst(Builder, Instr);		setDebugLocFromInst(Builder, Instr);

// Find all of the vectorized parameters.		// Find all of the vectorized parameters.
for (Value *SrcOp : Instr->operands()) {		for (Value *SrcOp : Instr->operands()) {
Show All 26 Lines	void InnerLoopUnroller::scalarizeInstruction(Instruction *Instr,
// Does this instruction return a value ?		// Does this instruction return a value ?
bool IsVoidRetTy = Instr->getType()->isVoidTy();		bool IsVoidRetTy = Instr->getType()->isVoidTy();

Value *UndefVec = IsVoidRetTy ? nullptr : UndefValue::get(Instr->getType());		Value *UndefVec = IsVoidRetTy ? nullptr : UndefValue::get(Instr->getType());
// Create a new entry in the WidenMap and initialize it to Undef or Null.		// Create a new entry in the WidenMap and initialize it to Undef or Null.
VectorParts &VecResults = WidenMap.splat(Instr, UndefVec);		VectorParts &VecResults = WidenMap.splat(Instr, UndefVec);

VectorParts Cond;		VectorParts Cond;
if (IfPredicateStore) {		if (IfPredicateInstr) {
assert(Instr->getParent()->getSinglePredecessor() &&		assert(Instr->getParent()->getSinglePredecessor() &&
"Only support single predecessor blocks");		"Only support single predecessor blocks");
Cond = createEdgeMask(Instr->getParent()->getSinglePredecessor(),		Cond = createEdgeMask(Instr->getParent()->getSinglePredecessor(),
Instr->getParent());		Instr->getParent());
}		}

// For each vector unroll 'part':		// For each vector unroll 'part':
for (unsigned Part = 0; Part < UF; ++Part) {		for (unsigned Part = 0; Part < UF; ++Part) {
// For each scalar that we create:		// For each scalar that we create:

// Start an "if (pred) a[i] = ..." block.		// Start an "if (pred) a[i] = ..." block.
Value *Cmp = nullptr;		Value *Cmp = nullptr;
if (IfPredicateStore) {		if (IfPredicateInstr) {
if (Cond[Part]->getType()->isVectorTy())		if (Cond[Part]->getType()->isVectorTy())
Cond[Part] =		Cond[Part] =
Builder.CreateExtractElement(Cond[Part], Builder.getInt32(0));		Builder.CreateExtractElement(Cond[Part], Builder.getInt32(0));
Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cond[Part],		Cmp = Builder.CreateICmp(ICmpInst::ICMP_EQ, Cond[Part],
ConstantInt::get(Cond[Part]->getType(), 1));		ConstantInt::get(Cond[Part]->getType(), 1));
}		}

Instruction *Cloned = Instr->clone();		Instruction *Cloned = Instr->clone();
[14 lines hidden]	if (auto *II = dyn_cast<IntrinsicInst>(Cloned))
AC->registerAssumption(II);		AC->registerAssumption(II);

// If the original scalar returns a value we need to place it in a vector		// If the original scalar returns a value we need to place it in a vector
// so that future users will be able to use it.		// so that future users will be able to use it.
if (!IsVoidRetTy)		if (!IsVoidRetTy)
VecResults[Part] = Cloned;		VecResults[Part] = Cloned;

// End if-block.		// End if-block.
if (IfPredicateStore)		if (IfPredicateInstr)
PredicatedStores.push_back(std::make_pair(cast<StoreInst>(Cloned), Cmp));		PredicatedInstructions.push_back(std::make_pair(Cloned, Cmp));
}		}
}		}

void InnerLoopUnroller::vectorizeMemoryInstruction(Instruction *Instr) {		void InnerLoopUnroller::vectorizeMemoryInstruction(Instruction *Instr) {
auto *SI = dyn_cast<StoreInst>(Instr);		auto *SI = dyn_cast<StoreInst>(Instr);
bool IfPredicateStore = (SI && Legal->blockNeedsPredication(SI->getParent()));		bool IfPredicateInstr = (SI && Legal->blockNeedsPredication(SI->getParent()));

return scalarizeInstruction(Instr, IfPredicateStore);		return scalarizeInstruction(Instr, IfPredicateInstr);
}		}

Value *InnerLoopUnroller::reverseVector(Value *Vec) { return Vec; }		Value *InnerLoopUnroller::reverseVector(Value *Vec) { return Vec; }

Value *InnerLoopUnroller::getBroadcastInstrs(Value *V) { return V; }		Value *InnerLoopUnroller::getBroadcastInstrs(Value *V) { return V; }

Value *InnerLoopUnroller::getStepVector(Value *Val, int StartIdx, Value *Step,		Value *InnerLoopUnroller::getStepVector(Value *Val, int StartIdx, Value *Step,
Instruction::BinaryOps BinOp) {		Instruction::BinaryOps BinOp) {
[356 lines hidden]

test/Transforms/LoopVectorize/if-pred-non-void.ll

				; RUN: opt -S -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -verify-loop-info -simplifycfg < %s | FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				; Test predication of non-void instructions, specifically (i) that these
				; instructions permit vectorization and (ii) the creation of an insertelement
				; and a Phi node. For each predicated instruction we search for the code
				; generated for the first element.
				define void @test(i32* nocapture %asd, i32* nocapture %aud,
				anemet: As a demo for how this works it would be actually good to include at least one of the second element sequences as well.
				i32* nocapture %asr, i32* nocapture %aur) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %if.end
				ret void

				; CHECK-LABEL: test
				; CHECK: vector.body:
				; CHECK: %{{.*}} = extractelement <2 x i1> %{{.*}}, i32 0
				; CHECK: br i1 %{{.*}}, label %[[CSD:[a-zA-Z0-9.]+]], label %[[ESD:[a-zA-Z0-9.]+]]
				; CHECK: [[CSD]]:
				anemet: Can you please name and match this extract as well, it helps reading. Everywhere in these tests.
				; CHECK: %[[SD0:[a-zA-Z0-9]+]] = sdiv i32 %{{.*}}, %{{.*}}
				; CHECK: %[[SD1:[a-zA-Z0-9]+]] = insertelement <2 x i32> undef, i32 %[[SD0]], i32 0
				anemet: It would be also good to check the extractelements feeding the divs here.
				; CHECK: br label %[[ESD]]
				; CHECK: [[ESD]]:
				; CHECK: %{{.*}} = phi <2 x i32> [ undef, %vector.body ], [ %[[SD1]], %[[CSD]] ]
				; CHECK: %{{.*}} = extractelement <2 x i1> %{{.*}}, i32 0
				; CHECK: br i1 %{{.*}}, label %[[CUD:[a-zA-Z0-9.]+]], label %[[EUD:[a-zA-Z0-9.]+]]
				; CHECK: [[CUD]]:
				; CHECK: %[[UD0:[a-zA-Z0-9]+]] = udiv i32 %{{.*}}, %{{.*}}
				; CHECK: %[[UD1:[a-zA-Z0-9]+]] = insertelement <2 x i32> undef, i32 %[[UD0]], i32 0
				; CHECK: br label %[[EUD]]
				; CHECK: [[EUD]]:
				; CHECK: %{{.*}} = phi <2 x i32> [ undef, %{{.*}} ], [ %[[UD1]], %[[CUD]] ]
				; CHECK: %{{.*}} = extractelement <2 x i1> %{{.*}}, i32 0
				; CHECK: br i1 %{{.*}}, label %[[CSR:[a-zA-Z0-9.]+]], label %[[ESR:[a-zA-Z0-9.]+]]
				; CHECK: [[CSR]]:
				; CHECK: %[[SR0:[a-zA-Z0-9]+]] = srem i32 %{{.*}}, %{{.*}}
				; CHECK: %[[SR1:[a-zA-Z0-9]+]] = insertelement <2 x i32> undef, i32 %[[SR0]], i32 0
				; CHECK: br label %[[ESR]]
				; CHECK: [[ESR]]:
				; CHECK: %{{.*}} = phi <2 x i32> [ undef, %{{.*}} ], [ %[[SR1]], %[[CSR]] ]
				; CHECK: %{{.*}} = extractelement <2 x i1> %{{.*}}, i32 0
				; CHECK: br i1 %{{.*}}, label %[[CUR:[a-zA-Z0-9.]+]], label %[[EUR:[a-zA-Z0-9.]+]]
				; CHECK: [[CUR]]:
				; CHECK: %[[UR0:[a-zA-Z0-9]+]] = urem i32 %{{.*}}, %{{.*}}
				; CHECK: %[[UR1:[a-zA-Z0-9]+]] = insertelement <2 x i32> undef, i32 %[[UR0]], i32 0
				; CHECK: br label %[[EUR]]
				; CHECK: [[EUR]]:
				; CHECK: %{{.*}} = phi <2 x i32> [ undef, %{{.*}} ], [ %[[UR1]], %[[CUR]] ]

				anemet: I think we're pretty consistent about using uppercase for the named regexes. That helps readability.
				for.body: ; preds = %if.end, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %if.end ]
				%isd = getelementptr inbounds i32, i32* %asd, i64 %indvars.iv
				%iud = getelementptr inbounds i32, i32* %aud, i64 %indvars.iv
				%isr = getelementptr inbounds i32, i32* %asr, i64 %indvars.iv
				%iur = getelementptr inbounds i32, i32* %aur, i64 %indvars.iv
				%lsd = load i32, i32* %isd, align 4
				%lud = load i32, i32* %iud, align 4
				%lsr = load i32, i32* %isr, align 4
				%lur = load i32, i32* %iur, align 4
				%psd = add nsw i32 %lsd, 23
				%pud = add nsw i32 %lud, 24
				%psr = add nsw i32 %lsr, 25
				%pur = add nsw i32 %lur, 26
				%cmp1 = icmp slt i32 %lsd, 100
				br i1 %cmp1, label %if.then, label %if.end

				if.then: ; preds = %for.body
				%rsd = sdiv i32 %psd, %lsd
				%rud = udiv i32 %pud, %lud
				%rsr = srem i32 %psr, %lsr
				%rur = urem i32 %pur, %lur
				br label %if.end

				if.end: ; preds = %if.then, %for.body
				%ysd.0 = phi i32 [ %rsd, %if.then ], [ %psd, %for.body ]
				%yud.0 = phi i32 [ %rud, %if.then ], [ %pud, %for.body ]
				%ysr.0 = phi i32 [ %rsr, %if.then ], [ %psr, %for.body ]
				%yur.0 = phi i32 [ %rur, %if.then ], [ %pur, %for.body ]
				store i32 %ysd.0, i32* %isd, align 4
				store i32 %yud.0, i32* %iud, align 4
				store i32 %ysr.0, i32* %isr, align 4
				store i32 %yur.0, i32* %iur, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 128
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}
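
As an illustration of what the CHECK lines in this test look for, here is a minimal C++ sketch of predicated scalarization at VF = 2. It is a model, not the vectorizer's code: `Vec2`, `Mask2`, and `predicatedSDiv` are hypothetical names, and initializing the result from a fallback vector stands in for the undef vector, the per-lane phi, and the later select that the real transformation emits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a 2-wide vector of i32 and its i1 mask (VF = 2).
using Vec2 = std::array<int32_t, 2>;
using Mask2 = std::array<bool, 2>;

// Models one vector iteration of a predicated sdiv: each lane performs the
// division only when its mask bit is set, so a zero divisor sitting in a
// masked-off lane is never evaluated.
Vec2 predicatedSDiv(const Vec2 &Num, const Vec2 &Den, const Mask2 &Mask,
                    const Vec2 &Fallback) {
  // Simplification: start from the else-value instead of an undef vector
  // that is merged by a phi and a select afterwards.
  Vec2 Result = Fallback;
  for (std::size_t Lane = 0; Lane < Result.size(); ++Lane) {
    if (Mask[Lane]) { // corresponds to the per-lane "br i1" guard
      assert(Den[Lane] != 0 && "active lane must have a non-zero divisor");
      Result[Lane] = Num[Lane] / Den[Lane]; // scalar sdiv + insertelement
    }
  }
  return Result;
}
```

Calling `predicatedSDiv({46, 7}, {2, 0}, {true, false}, {-1, -2})` divides only in lane 0; lane 1 keeps its fallback value even though its divisor is zero.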

				declare i32 @scalarized(i32 %a, i32 %b)

				; Future-use test for predication under smarter scalar-scalar: this test will
				; fail when the vectorizer starts feeding scalarized values directly to their
				; scalar users, i.e. w/o generating redundant insertelement/extractelement
				; instructions. This case is already supported by the predication code (which
				; should generate a phi for the scalar predicated value rather than for the
				; insertelement), but cannot be tested yet.
				; If you got this test to fail, fix the test by using the alternative FFU
				; sequence to make this test check how we handle this case from now on.
				define void @test_scalar2scalar(i32* nocapture %asd, i32* nocapture %bsd) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %if.end
				ret void

				; CHECK-LABEL: test_scalar2scalar
				; CHECK: vector.body:
				; CHECK: %{{.*}} = extractelement <2 x i1> %{{.*}}, i32 0
				; CHECK: br i1 %{{.*}}, label %[[THEN:[a-zA-Z0-9.]+]], label %[[FI:[a-zA-Z0-9.]+]]
				; CHECK: [[THEN]]:
				; CHECK: %[[PD:[a-zA-Z0-9]+]] = sdiv i32 %{{.*}}, %{{.*}}
				; CHECK: %[[PDV:[a-zA-Z0-9]+]] = insertelement <2 x i32> undef, i32 %[[PD]], i32 0
				; CHECK: br label %[[FI]]
				; CHECK: [[FI]]:
				; CHECK: %[[PH:[a-zA-Z0-9]+]] = phi <2 x i32> [ undef, %vector.body ], [ %[[PDV]], %[[THEN]] ]
				; FFU-LABEL: test_scalar2scalar
				; FFU: vector.body:
				; FFU: %{{.*}} = extractelement <2 x i1> %{{.*}}, i32 0
				; FFU: br i1 %{{.*}}, label %[[THEN:[a-zA-Z0-9.]+]], label %[[FI:[a-zA-Z0-9.]+]]
				; FFU: [[THEN]]:
				; FFU: %[[PD:[a-zA-Z0-9]+]] = sdiv i32 %{{.*}}, %{{.*}}
				; FFU: br label %[[FI]]
				; FFU: [[FI]]:
				; FFU: %{{.*}} = phi i32 [ undef, %vector.body ], [ %[[PD]], %[[THEN]] ]

				for.body: ; preds = %if.end, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %if.end ]
				%isd = getelementptr inbounds i32, i32* %asd, i64 %indvars.iv
				%lsd = load i32, i32* %isd, align 4
				%psd = add nsw i32 %lsd, 23
				%cmp1 = icmp slt i32 %lsd, 100
				br i1 %cmp1, label %if.then, label %if.end

				if.then: ; preds = %for.body
				%rsd = sdiv i32 %psd, %lsd
				br label %if.end
				mssimpso: Thanks for adding the additional "future" test. I don't think it will exercise the non-insert case, though. I'm very sorry for not being more clear previously. Here, %rsd will always have to be inserted into a vector since it will be directly used by a select instruction, which will remain vectorized. I didn't think of this when I last commented. But I think if you add an additional instruction, this should produce the desired effect. Something like:

				  if.then:
				    %tmp = sdiv i32 %psd, %lsd
				    %rsd = sdiv i32 %tmp, %lsd
				    br label %if.end

				When I ran the modified test with this patch and the scalar patch, the non-insert case was used for %tmp and the insert case was used for %rsd. This makes sense because %tmp is only used by %rsd (will be scalar), and %rsd will again feed the vector select.
				gilr (author): Argh, sorry about that. Your explanation was clear - just a hasty implementation on my side :( Yes, the second sdiv should go under the same condition - will fix.

				if.end: ; preds = %if.then, %for.body
				mssimpso: The test looks good to me now. Thanks!
				%ysd.0 = phi i32 [ %rsd, %if.then ], [ %psd, %for.body ]
				%isd.b = getelementptr inbounds i32, i32* %bsd, i64 %indvars.iv
				%lsd.b = load i32, i32* %isd.b, align 4
				%z = sdiv i32 %lsd.b, %ysd.0
				store i32 %z, i32* %isd, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 128
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

test/Transforms/LoopVectorize/if-pred-not-when-safe.ll

				; RUN: opt -S -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -verify-loop-info -simplifycfg < %s | FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				; Test no-predication of instructions that are provably safe, e.g. dividing by
				; a non-zero constant.
				define void @test(i32* nocapture %asd, i32* nocapture %aud,
				i32* nocapture %asr, i32* nocapture %aur,
				i32* nocapture %asd0, i32* nocapture %aud0,
				i32* nocapture %asr0, i32* nocapture %aur0
				) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %if.end
				ret void

				; CHECK-LABEL: test
				; CHECK: vector.body:
				; CHECK: %{{.*}} = sdiv <2 x i32> %{{.*}}, <i32 11, i32 11>
				; CHECK: %{{.*}} = udiv <2 x i32> %{{.*}}, <i32 13, i32 13>
				; CHECK: %{{.*}} = srem <2 x i32> %{{.*}}, <i32 17, i32 17>
				; CHECK: %{{.*}} = urem <2 x i32> %{{.*}}, <i32 19, i32 19>
				; CHECK-NOT: %{{.*}} = sdiv <2 x i32> %{{.*}}, <i32 0, i32 0>
				; CHECK-NOT: %{{.*}} = udiv <2 x i32> %{{.*}}, <i32 0, i32 0>
				; CHECK-NOT: %{{.*}} = srem <2 x i32> %{{.*}}, <i32 0, i32 0>
				; CHECK-NOT: %{{.*}} = urem <2 x i32> %{{.*}}, <i32 0, i32 0>

				for.body: ; preds = %if.end, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %if.end ]
				%isd = getelementptr inbounds i32, i32* %asd, i64 %indvars.iv
				%iud = getelementptr inbounds i32, i32* %aud, i64 %indvars.iv
				%isr = getelementptr inbounds i32, i32* %asr, i64 %indvars.iv
				%iur = getelementptr inbounds i32, i32* %aur, i64 %indvars.iv
				%lsd = load i32, i32* %isd, align 4
				%lud = load i32, i32* %iud, align 4
				%lsr = load i32, i32* %isr, align 4
				%lur = load i32, i32* %iur, align 4
				%psd = add nsw i32 %lsd, 23
				%pud = add nsw i32 %lud, 24
				%psr = add nsw i32 %lsr, 25
				%pur = add nsw i32 %lur, 26
				%isd0 = getelementptr inbounds i32, i32* %asd0, i64 %indvars.iv
				%iud0 = getelementptr inbounds i32, i32* %aud0, i64 %indvars.iv
				%isr0 = getelementptr inbounds i32, i32* %asr0, i64 %indvars.iv
				%iur0 = getelementptr inbounds i32, i32* %aur0, i64 %indvars.iv
				%lsd0 = load i32, i32* %isd0, align 4
				%lud0 = load i32, i32* %iud0, align 4
				%lsr0 = load i32, i32* %isr0, align 4
				%lur0 = load i32, i32* %iur0, align 4
				%psd0 = add nsw i32 %lsd, 27
				%pud0 = add nsw i32 %lud, 28
				%psr0 = add nsw i32 %lsr, 29
				%pur0 = add nsw i32 %lur, 30
				%cmp1 = icmp slt i32 %lsd, 100
				br i1 %cmp1, label %if.then, label %if.end

				if.then: ; preds = %for.body
				%rsd = sdiv i32 %psd, 11
				%rud = udiv i32 %pud, 13
				%rsr = srem i32 %psr, 17
				%rur = urem i32 %pur, 19
				%rsd0 = sdiv i32 %psd0, 0
				%rud0 = udiv i32 %pud0, 0
				%rsr0 = srem i32 %psr0, 0
				%rur0 = urem i32 %pur0, 0
				br label %if.end

				if.end: ; preds = %if.then, %for.body
				%ysd.0 = phi i32 [ %rsd, %if.then ], [ %psd, %for.body ]
				%yud.0 = phi i32 [ %rud, %if.then ], [ %pud, %for.body ]
				%ysr.0 = phi i32 [ %rsr, %if.then ], [ %psr, %for.body ]
				%yur.0 = phi i32 [ %rur, %if.then ], [ %pur, %for.body ]
				%ysd0.0 = phi i32 [ %rsd0, %if.then ], [ %psd0, %for.body ]
				%yud0.0 = phi i32 [ %rud0, %if.then ], [ %pud0, %for.body ]
				%ysr0.0 = phi i32 [ %rsr0, %if.then ], [ %psr0, %for.body ]
				%yur0.0 = phi i32 [ %rur0, %if.then ], [ %pur0, %for.body ]
				store i32 %ysd.0, i32* %isd, align 4
				store i32 %yud.0, i32* %iud, align 4
				store i32 %ysr.0, i32* %isr, align 4
				store i32 %yur.0, i32* %iur, align 4
				store i32 %ysd0.0, i32* %isd0, align 4
				store i32 %yud0.0, i32* %iud0, align 4
				store i32 %ysr0.0, i32* %isr0, align 4
				store i32 %yur0.0, i32* %iur0, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 128
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}
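
The decision this test exercises can be sketched in a few lines of C++. This is a simplified model under stated assumptions, not the patch's implementation: `Divisor`, `mayDivideByZero`, and `needsPredication` are hypothetical names, and a divisor is modeled as either a known constant or an unknown run-time value.

```cpp
#include <cassert>

// Hypothetical stand-in for the divisor operand of a div/rem.
struct Divisor {
  bool IsConstant; // true when the divisor is a compile-time constant
  int Value;       // meaningful only when IsConstant is true
};

// A div/rem can trap only if its divisor might be zero: either the value is
// unknown at compile time, or it is the constant zero.
bool mayDivideByZero(const Divisor &D) {
  return !D.IsConstant || D.Value == 0;
}

// Predicate only when the instruction sits in a conditionally executed block
// AND executing it unconditionally could divide by zero.
bool needsPredication(bool BlockIsConditional, const Divisor &D) {
  return BlockIsConditional && mayDivideByZero(D);
}

// Usage sketch mirroring the constants used in the test above.
inline void checkExamples() {
  assert(!needsPredication(true, {true, 11})); // sdiv by 11: widen directly
  assert(needsPredication(true, {true, 0}));   // sdiv by 0: keep the guard
  assert(needsPredication(true, {false, 0}));  // unknown divisor: keep guard
}
```

Under this model, the divisions by 11, 13, 17 and 19 become plain vector operations, while the divisions by constant zero stay behind their guard, which is what the CHECK and CHECK-NOT lines assert.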

test/Transforms/LoopVectorize/if-pred-stores.ll

	; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=1 -force-vector-interleave=2 -loop-vectorize -verify-loop-info -simplifycfg < %s | FileCheck %s --check-prefix=UNROLL			; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=1 -force-vector-interleave=2 -loop-vectorize -verify-loop-info -simplifycfg < %s | FileCheck %s --check-prefix=UNROLL
	; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=1 -force-vector-interleave=2 -loop-vectorize -verify-loop-info < %s | FileCheck %s --check-prefix=UNROLL-NOSIMPLIFY			; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=1 -force-vector-interleave=2 -loop-vectorize -verify-loop-info < %s | FileCheck %s --check-prefix=UNROLL-NOSIMPLIFY
	; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -enable-cond-stores-vec -verify-loop-info -simplifycfg < %s | FileCheck %s --check-prefix=VEC			; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -enable-cond-stores-vec -verify-loop-info -simplifycfg < %s | FileCheck %s --check-prefix=VEC
	; RUN: opt -S -vectorize-num-stores-pred=1 -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -enable-cond-stores-vec -verify-loop-info -simplifycfg -instcombine < %s | FileCheck %s --check-prefix=VEC-IC

	target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
	target triple = "x86_64-apple-macosx10.9.0"			target triple = "x86_64-apple-macosx10.9.0"

	; Test predication of stores.			; Test predication of stores.
	define i32 @test(i32* nocapture %f) #0 {			define i32 @test(i32* nocapture %f) #0 {
	entry:			entry:
	br label %for.body			br label %for.body

	; VEC-LABEL: test			; VEC-LABEL: test
	; VEC: %[[v8:.+]] = icmp sgt <2 x i32> %{{.*}}, <i32 100, i32 100>			; VEC: %[[v8:.+]] = icmp sgt <2 x i32> %{{.*}}, <i32 100, i32 100>
	; VEC: %[[v9:.+]] = add nsw <2 x i32> %{{.*}}, <i32 20, i32 20>			; VEC: %[[v9:.+]] = add nsw <2 x i32> %{{.*}}, <i32 20, i32 20>
	; VEC: %[[v10:.+]] = and <2 x i1> %[[v8]], <i1 true, i1 true>			; VEC: %[[v10:.+]] = and <2 x i1> %[[v8]], <i1 true, i1 true>
	; VEC: %[[v11:.+]] = extractelement <2 x i1> %[[v10]], i32 0			; VEC: %[[v11:.+]] = extractelement <2 x i1> %[[v10]], i32 0
	; VEC: %[[v12:.+]] = icmp eq i1 %[[v11]], true			; VEC: %[[v12:.+]] = icmp eq i1 %[[v11]], true
	; VEC: %[[v13:.+]] = extractelement <2 x i32> %[[v9]], i32 0
	; VEC: %[[v14:.+]] = extractelement <2 x i32*> %{{.*}}, i32 0
	; VEC: br i1 %[[v12]], label %[[cond:.+]], label %[[else:.+]]			; VEC: br i1 %[[v12]], label %[[cond:.+]], label %[[else:.+]]
	;			;
	; VEC: [[cond]]:			; VEC: [[cond]]:
				; VEC: %[[v13:.+]] = extractelement <2 x i32> %[[v9]], i32 0
				; VEC: %[[v14:.+]] = extractelement <2 x i32*> %{{.*}}, i32 0
	; VEC: store i32 %[[v13]], i32* %[[v14]], align 4			; VEC: store i32 %[[v13]], i32* %[[v14]], align 4
	; VEC: br label %[[else:.+]]			; VEC: br label %[[else:.+]]
	;			;
	; VEC: [[else]]:			; VEC: [[else]]:
	; VEC: %[[v15:.+]] = extractelement <2 x i1> %[[v10]], i32 1			; VEC: %[[v15:.+]] = extractelement <2 x i1> %[[v10]], i32 1
	; VEC: %[[v16:.+]] = icmp eq i1 %[[v15]], true			; VEC: %[[v16:.+]] = icmp eq i1 %[[v15]], true
	; VEC: %[[v17:.+]] = extractelement <2 x i32> %[[v9]], i32 1
	; VEC: %[[v18:.+]] = extractelement <2 x i32*> %{{.+}} i32 1
	; VEC: br i1 %[[v16]], label %[[cond2:.+]], label %[[else2:.+]]			; VEC: br i1 %[[v16]], label %[[cond2:.+]], label %[[else2:.+]]
	;			;
	; VEC: [[cond2]]:			; VEC: [[cond2]]:
				; VEC: %[[v17:.+]] = extractelement <2 x i32> %[[v9]], i32 1
				; VEC: %[[v18:.+]] = extractelement <2 x i32*> %{{.+}} i32 1
	; VEC: store i32 %[[v17]], i32* %[[v18]], align 4			; VEC: store i32 %[[v17]], i32* %[[v18]], align 4
	; VEC: br label %[[else2:.+]]			; VEC: br label %[[else2:.+]]
	;			;
	; VEC: [[else2]]:			; VEC: [[else2]]:

	; VEC-IC-LABEL: test
	; VEC-IC: %[[v1:.+]] = icmp sgt <2 x i32> %{{.*}}, <i32 100, i32 100>
	; VEC-IC: %[[v2:.+]] = add nsw <2 x i32> %{{.*}}, <i32 20, i32 20>
	; VEC-IC: %[[v3:.+]] = extractelement <2 x i1> %[[v1]], i32 0
	; VEC-IC: br i1 %[[v3]], label %[[cond:.+]], label %[[else:.+]]
	;
	; VEC-IC: [[cond]]:
	; VEC-IC: %[[v4:.+]] = extractelement <2 x i32> %[[v2]], i32 0
	; VEC-IC: store i32 %[[v4]], i32* %{{.*}}, align 4
	; VEC-IC: br label %[[else:.+]]
	;
	; VEC-IC: [[else]]:
	; VEC-IC: %[[v5:.+]] = extractelement <2 x i1> %[[v1]], i32 1
	; VEC-IC: br i1 %[[v5]], label %[[cond2:.+]], label %[[else2:.+]]
	;
	; VEC-IC: [[cond2]]:
	; VEC-IC: %[[v6:.+]] = extractelement <2 x i32> %[[v2]], i32 1
	; VEC-IC: store i32 %[[v6]], i32* %{{.*}}, align 4
	; VEC-IC: br label %[[else2:.+]]
	;
	; VEC-IC: [[else2]]:

	; UNROLL-LABEL: test			; UNROLL-LABEL: test
	; UNROLL: vector.body:			; UNROLL: vector.body:
	; UNROLL: %[[IND:[a-zA-Z0-9]+]] = add i64 %{{.*}}, 0			; UNROLL: %[[IND:[a-zA-Z0-9]+]] = add i64 %{{.*}}, 0
	; UNROLL: %[[IND1:[a-zA-Z0-9]+]] = add i64 %{{.*}}, 1			; UNROLL: %[[IND1:[a-zA-Z0-9]+]] = add i64 %{{.*}}, 1
	; UNROLL: %[[v0:[a-zA-Z0-9]+]] = getelementptr inbounds i32, i32* %f, i64 %[[IND]]			; UNROLL: %[[v0:[a-zA-Z0-9]+]] = getelementptr inbounds i32, i32* %f, i64 %[[IND]]
	; UNROLL: %[[v1:[a-zA-Z0-9]+]] = getelementptr inbounds i32, i32* %f, i64 %[[IND1]]			; UNROLL: %[[v1:[a-zA-Z0-9]+]] = getelementptr inbounds i32, i32* %f, i64 %[[IND1]]
	; UNROLL: %[[v2:[a-zA-Z0-9]+]] = load i32, i32* %[[v0]], align 4			; UNROLL: %[[v2:[a-zA-Z0-9]+]] = load i32, i32* %[[v0]], align 4
	; UNROLL: %[[v3:[a-zA-Z0-9]+]] = load i32, i32* %[[v1]], align 4			; UNROLL: %[[v3:[a-zA-Z0-9]+]] = load i32, i32* %[[v1]], align 4
	[81 lines hidden]