This is an archive of the discontinued LLVM Phabricator instance.

Differential D11758

[Unroll] Implement a conservative and monotonically increasing cost tracking system during the full unroll heuristic analysis that avoids counting any instruction cost until that instruction becomes "live" through a side-effect or use outside the...
ClosedPublic

Authored by mzolotukhin on Aug 5 2015, 1:54 AM.

Details

Reviewers

chandlerc

Commits

rGb7b8052982de: [Unroll] Implement a conservative and monotonically increasing cost tracking…
rL269388: [Unroll] Implement a conservative and monotonically increasing cost tracking…

Summary

...loop after the last iteration.

This is really hard to do correctly. The core problem is that we need to
model liveness through the induction PHIs from iteration to iteration in
order to get the correct results, and we need to correctly de-duplicate
the common subgraphs of instructions feeding some subset of the
induction PHIs. All of this can be driven either from a side effect at
some iteration or from the loop values used after the loop finishes.

This patch implements this by storing the forward-propagating analysis
of each instruction in a cache to recall whether it was free and whether
it has become live and thus counted toward the total unroll cost. Then,
at each sink for a value in the loop, we recursively walk back through
every value that feeds the sink, including looping back through the
iterations as needed, until we have marked the entire input graph as
live. Because we cache this, we never visit instructions more than twice:
once when we analyze them and put them into the cache, and once when
we count their cost towards the unrolled loop. Also, because the cache
is only two bits and because we are dealing with relatively small
iteration counts, we can store all of this very densely in memory to
keep this from becoming an excessively slow analysis.
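
The cache-then-walk scheme described above can be sketched on a toy model. This is an illustration only, not the patch's code: `ToyInst`, `ToyState`, `CostTracker`, and `demoUnrolledCost` are hypothetical names, instructions are plain indices, and a `std::map` stands in for the densely packed cache.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for an IR instruction (not an LLVM type).
struct ToyInst {
  int Cost;                  // cost charged if the instruction turns out live
  std::vector<int> Operands; // in-loop operands, as indices into the body
  int BackedgeOperand;       // >= 0 only for header PHIs, otherwise -1
};

struct ToyState {
  bool IsFree;    // simplified away by the forward analysis
  bool IsCounted; // already added to UnrolledCost
};

struct CostTracker {
  const std::vector<ToyInst> &Body;
  std::map<std::pair<int, int>, ToyState> Cache; // (inst, iteration) -> state
  int UnrolledCost;

  explicit CostTracker(const std::vector<ToyInst> &B)
      : Body(B), UnrolledCost(0) {}

  // Forward pass: remember the analysis result, but charge nothing yet.
  void record(int Inst, int Iteration, bool IsFree) {
    ToyState S = {IsFree, false};
    Cache[std::make_pair(Inst, Iteration)] = S;
  }

  // Backward pass from a sink: mark everything that feeds it as live.
  void addCostRecursively(int Root, int Iteration) {
    std::vector<std::pair<int, int> > Worklist;
    Worklist.push_back(std::make_pair(Root, Iteration));
    while (!Worklist.empty()) {
      std::pair<int, int> Key = Worklist.back();
      Worklist.pop_back();
      std::map<std::pair<int, int>, ToyState>::iterator It = Cache.find(Key);
      if (It == Cache.end() || It->second.IsCounted)
        continue; // never analyzed (dead path) or already counted
      It->second.IsCounted = true;

      const ToyInst &Inst = Body[Key.first];
      if (Inst.BackedgeOperand >= 0) {
        // Header PHI: free itself; keep walking in the previous iteration.
        if (Key.second > 0)
          Worklist.push_back(
              std::make_pair(Inst.BackedgeOperand, Key.second - 1));
        continue;
      }
      if (!It->second.IsFree)
        UnrolledCost += Inst.Cost;
      for (int Op : Inst.Operands)
        Worklist.push_back(std::make_pair(Op, Key.second));
    }
  }
};

// Two iterations of: iv = phi; iv.next = iv + 1; dead = iv * iv; store iv.next
int demoUnrolledCost() {
  std::vector<ToyInst> Body = {
      {0, {}, 1},   // 0: header PHI, backedge value is instruction 1
      {1, {0}, -1}, // 1: iv.next, uses the PHI
      {1, {0}, -1}, // 2: dead multiply, feeds no sink
      {1, {1}, -1}, // 3: store of iv.next (the only sink)
  };
  CostTracker T(Body);
  for (int Iter = 0; Iter < 2; ++Iter) {
    T.record(0, Iter, /*IsFree=*/true);
    for (int Inst = 1; Inst <= 3; ++Inst)
      T.record(Inst, Iter, /*IsFree=*/false);
    T.addCostRecursively(3, Iter); // the store has a side effect
  }
  return T.UnrolledCost;
}
```

In this toy, the multiply is analyzed in both iterations but never charged, and the second iteration's walk stops as soon as it reaches the already-counted `iv.next` of the first iteration.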

The code here is still pretty gross. I would appreciate suggestions
about better ways to factor or split this up; I've stared too long at
the algorithmic side to really have a good sense of what the design
should probably look like.

Also, it might seem like we should do all of this bottom-up, but I think
that is a red herring. Specifically, the simplification power is *much*
greater working top-down. We can forward propagate very effectively,
even across strange and interesting recurrences around the backedge.
Because we use data to propagate, this doesn't cause a state space
explosion. Doing this level of constant folding, etc, would be very
expensive to do bottom-up because it wouldn't be until the last moment
that you could collapse everything. The current solution is essentially
a top-down simplification with a bottom-up cost accounting which seems
to get the best of both worlds. It makes the simplification incremental
and powerful while leaving everything dead until we *know* it is needed.
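
As a rough illustration of forward propagation "using data", the sketch below simulates a counted loop one iteration at a time and carries known constants across the backedge. It is a toy under stated assumptions, not the analyzer from this patch: `Known` and `countFoldedInsts` are hypothetical names, and the loop shape (`sum += Array[iv]`) is chosen only for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A value is either a known constant or unknown.
struct Known {
  bool IsConst;
  long Value;
};

// Simulates `for (iv = 0; iv != N; ++iv) sum += Array[iv];` and counts how
// many per-iteration instructions fold to constants. ArrayIsKnown models
// whether loads from Array can be resolved at analysis time.
int countFoldedInsts(const std::vector<long> &Array, bool ArrayIsKnown) {
  const long TripCount = static_cast<long>(Array.size());
  Known IV = {true, 0};  // header PHI, entry value 0
  Known Sum = {true, 0}; // header PHI, entry value 0
  int Folded = 0;
  for (long Iter = 0; Iter < TripCount; ++Iter) {
    // iv.next = iv + 1 folds whenever iv is known.
    Known IVNext = {IV.IsConst, IV.Value + 1};
    // exitcond = (iv.next == N) folds whenever iv.next is known.
    Known ExitCond = {IVNext.IsConst, IVNext.Value == TripCount ? 1L : 0L};
    // x = load Array[iv] folds only if the array contents are known.
    Known X = {ArrayIsKnown,
               ArrayIsKnown ? Array[static_cast<std::size_t>(Iter)] : 0L};
    // sum.next = sum + x folds only if both inputs are known.
    Known SumNext = {Sum.IsConst && X.IsConst, Sum.Value + X.Value};
    Folded += IVNext.IsConst + ExitCond.IsConst + X.IsConst + SumNext.IsConst;
    // Carry the recurrences across the backedge as plain data.
    IV = IVNext;
    Sum = SumNext;
  }
  return Folded;
}
```

Each iteration is visited once and only the current values of the recurrences are kept, so the work is linear in the trip count rather than growing with the number of possible paths.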

Finally, a core property of this approach is its *monotonicity*. At all
times, the current UnrolledCost is a conservatively low estimate. This
ensures that we will never early-exit from the analysis due to exceeding
a threshold when, had we continued, the cost would have gone back
below the threshold. These kinds of bugs can cause random changes to
behavior that are incredibly hard to track down.
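
A small sketch of why monotonicity matters for the early exit. The two helpers below are hypothetical and independent of the patch: `bailsOutEarly` checks the threshold after every update, while `finalCostExceeds` checks only the final total. If every update is non-negative the two always agree; if an estimate can later shrink, they can disagree.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Checks the running cost against the threshold after every update.
bool bailsOutEarly(const std::vector<int> &Deltas, int Threshold) {
  int Cost = 0;
  for (int D : Deltas) {
    Cost += D;
    if (Cost > Threshold)
      return true;
  }
  return false;
}

// Checks only the cost obtained after applying every update.
bool finalCostExceeds(const std::vector<int> &Deltas, int Threshold) {
  return std::accumulate(Deltas.begin(), Deltas.end(), 0) > Threshold;
}
```

With updates `{5, 4, -6}` and a threshold of 8, the running total peaks at 9 and the early check gives up, although the final total is 3. Counting a cost only once it is known to be live rules out the negative update.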

We could use a similar (but much simpler) technique within the inliner
as well to avoid considering speculated code in the inline cost.

Diff Detail

Repository: rL LLVM

Event Timeline

chandlerc updated this revision to Diff 31339. Aug 5 2015, 1:54 AM

chandlerc retitled this revision from to [Unroll] Implement a conservative and monotonically increasing cost tracking system during the full unroll heuristic analysis that avoids counting any instruction cost until that instruction becomes "live" through a side-effect or use outside the....

chandlerc updated this object.

chandlerc added a reviewer: mzolotukhin.

chandlerc added a subscriber: llvm-commits.

Rebase after landing prerequisite patches and merge a missing commit into this
patch. Sorry for the broken original diff.

Hi Chandler,

Mostly the patch looks good, a couple of nit-picks inline.

But I think we need more tests for this. I think we can add debug prints and pin-point the cases we want to check with it in tests.

Thanks,
Michael

lib/Transforms/Scalar/LoopUnrollPass.cpp
596 ↗	(On Diff #31384)	s/instrution/instruction/
616 ↗	(On Diff #31384)	Maybe worth commenting here that only `I` and `Iteration` are actually used as a key. It might be not-obvious for someone looking at this code without context of this patch.
620 ↗	(On Diff #31384)	s/free.a/free./
739 ↗	(On Diff #31384)	I find debug prints very convenient in this kind of analysis. Would it be possible to bring them back? We could report instructions that were simplified, and instructions that were proven to be live, for instance. Also, I think we can use this diagnostic in tests, as it gets harder and harder to write them.
749 ↗	(On Diff #31384)	Is word 'either' redundant here, or is it just my English?:)

Hi Chandler,

I found that with this patch compile time is significantly worse on some benchmarks, a reduced testcase from one of them is attached. It seems like the problem isn’t in the patch itself, but in the way we perform actual unrolling - you can see that if you add ‘-debug’ flag. I can look into what’s bad there later, but if you’d like to poke it too, you are welcome:)

reduced.ll (2 KB)
msg-15878-295.txt (2 KB)

Hi Chandler,

This patch got stuck, but finally I got some time and investigated it more carefully. I found a bug in current implementation that explains everything, and as a consequence I constructed a simple test-case to demonstrate the problem:

int foo(char *a) {
  int i = 0;
  for (i = 0; i < 500; i++)
    if (a[i] == 0)
      return i;
  return 0;
}

The problem is that the only instructions we consider live are those having side effects. However, they might have no side effects (by LLVM's definition of side effects: mayWriteToMemory() || mayThrow() || !mayReturn()) and still affect program behavior, as in the case above.

Thanks,
Michael

lib/Transforms/Scalar/LoopUnrollPass.cpp
752–753 ↗	(On Diff #31384)	Should we also take into account instructions with out-of-the-loop uses and early-exit instructions? Are there any other cases we want not to miss?
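
To make the gap concrete, here is a toy predicate comparison. It is a sketch with hypothetical names (`ToyInstFlags`, `hasSideEffects`, `isLivenessRoot`), not an LLVM API; the first predicate mirrors the side-effect definition quoted above, and the second shows one possible broader notion of a liveness root.

```cpp
#include <cassert>

// Hypothetical per-instruction facts; field names are illustrative only.
struct ToyInstFlags {
  bool MayWriteToMemory;
  bool MayThrow;
  bool MayNotReturn;
  bool IsExitingBranch;   // terminator with a successor outside the loop
  bool HasUseOutsideLoop; // value is consumed after the loop
};

// The side-effect definition quoted in the comment above.
bool hasSideEffects(const ToyInstFlags &F) {
  return F.MayWriteToMemory || F.MayThrow || F.MayNotReturn;
}

// One possible broader root set: also keep anything that decides whether
// the loop exits or that is observed after the loop.
bool isLivenessRoot(const ToyInstFlags &F) {
  return hasSideEffects(F) || F.IsExitingBranch || F.HasUseOutsideLoop;
}
```

The branch guarding `return i;` in the test case writes no memory and cannot throw, so only the broader predicate treats it, and the load and compare feeding it, as live.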

Hi Chandler,

We briefly discussed it on IRC, but I'll duplicate my latest findings here too.

I tested this patch and found several issues, which lead to undesired unrolling in some cases, and thus, have significant compile-time impact for no performance benefit. With them fixed/worked-around, the compile time regressions seemed not that bad, but I'll need to remeasure it when we fix the issues in a proper way. With these issues worked-around I still saw some nice performance gains.

Here is the list of problems that I found:

1. With this improved algorithm for finding dead instructions, we're now able to figure out that loop control flow becomes dead after unrolling. If a loop has a small body, then the control flow might be up to 50% of the loop body, but it doesn't seem reasonable to unroll such loops. For instance:

for (i = 0; i < 500; i++)
   a[i] += 1;

In this case unrolling removes nothing except the control flow, but it fools the current heuristic so the loop is unrolled. Such loops are pretty popular, so the compile time hit is severe if we unroll them. The performance gain is questionable, and we probably actually regress performance in such cases.

2. Currently simplifyInstWithSCEV returns true (meaning the corresponding instruction is simplified) for expressions of the form Address + ConstantOffset. However, unrolling doesn't necessarily lead to simplification of such an instruction, so our estimate might be wrong here. For example:

for (int i=0; i < 16; i++)  {
   a[i][0]=b[S.y][S.x+i];
}

In this case index expressions make up most of the loop body, but, contrary to our estimate, unrolling doesn't help to simplify them.

3. After we unroll a loop we should make sure that we clean up everything we expected to be simplified/dead, otherwise we will count it again when we analyze the parent loop.
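
The first problem can be made concrete with a little arithmetic. The sketch below is hypothetical and is not the pass's actual decision code: `percentDynamicCostSaved` and `passesSavingsThreshold` are illustrative helpers, and the unit costs (four "work" instructions and three control-flow instructions per iteration) are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Share of the rolled dynamic cost that unrolling is estimated to remove,
// as an integer percentage (rounded down).
unsigned percentDynamicCostSaved(unsigned RolledDynamicCost,
                                 unsigned UnrolledCost) {
  assert(RolledDynamicCost > 0 && UnrolledCost <= RolledDynamicCost);
  return static_cast<unsigned>(
      static_cast<std::uint64_t>(RolledDynamicCost - UnrolledCost) * 100 /
      RolledDynamicCost);
}

// True if the estimated saving reaches the given percentage threshold.
bool passesSavingsThreshold(unsigned RolledDynamicCost, unsigned UnrolledCost,
                            unsigned ThresholdPercent) {
  return percentDynamicCostSaved(RolledDynamicCost, UnrolledCost) >=
         ThresholdPercent;
}
```

Under those assumed costs, 500 iterations of `a[i] += 1` give a rolled cost of 3500 and an unrolled cost of 2000 once the control flow is treated as dead, an apparent saving of 42% even though no real work was removed. Such a loop would pass a 40% threshold and fail a 60% one.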

Michael

I plan to rebase this patch and slightly update it.

Rebase onto TOT.
Only work on inner loops: we can't accurately estimate cost of instructions from inner loops when analyzing an outer loop, and we won't be visiting inner loops later anyway, so even if unrolling of the outer loop could expose new opportunities in inner loops, we won't be catching them now. We can revisit this later.
Don't unroll loops with calls inside. Experiments showed some problems with such loops (our estimate is far from accurate for them), and disabling it doesn't hurt any test. Again, we can revisit it later if we have a better way to estimate the cost of calls.

Ping!

OK, I've paged all this back into my brain.

The inner loop aspect makes more sense to me now. I'm still a bit dubious about skipping calls, but I understand that at least we're not ready for them yet and it's important to get this stuff moving, so it seems like a good incremental step.

Some comments below. Only the test for the inner loop case is really interesting. Feel free to submit with these comments addressed or post any updates with more questions if useful.

lib/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll
1 ↗	(On Diff #49862)	This file is under lib, not test?
lib/Transforms/Scalar/LoopUnrollPass.cpp
241–244 ↗	(On Diff #49862)	It would make slightly more sense to hoist this out to the caller... There is nothing about this routine that is specific to inner loops, it's just that it isn't useful right now? Not a big deal either way.
270 ↗	(On Diff #49862)	Heh, now that you've taken this over, need to address your own comment here. ;]
290 ↗	(On Diff #49862)	I'm fine with a comment, or extending the DenseMapInfo to support using .find_as(std::make_pair(I, Iteration)).
417 ↗	(On Diff #49862)	I have no problem adding them back. Want to do it in a follow-up patch? In this patch?
430–431 ↗	(On Diff #49862)	Not sure what this comment was referencing... The idea is that all out-of-loop uses should be via a LCSSA phi node in the exit block(s) and we backwards saturate their cost there?
432–433 ↗	(On Diff #49862)	Maybe leave a FIXME? I feel like we should make some effort to address this in follow-up commits. At the very least I suspect we want to handle intrinsics here.
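
The point about only `I` and `Iteration` acting as the key can be illustrated with standard containers. This is a sketch under stated assumptions, not the patch's `DenseSet` code: `ToyInstState`, `KeyHash`, `KeyEq`, `StateSet`, and `markCounted` are hypothetical names, and `std::unordered_set` stands in for LLVM's container. Because elements of a standard set are const, the flag that changes after insertion is declared `mutable` here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>

// Key fields (I, Iteration) plus two payload bits packed alongside them.
struct ToyInstState {
  const void *I;
  int Iteration : 30;
  unsigned IsFree : 1;
  mutable unsigned IsCounted : 1; // payload, updated after insertion
  ToyInstState(const void *Inst, int Iter, bool Free)
      : I(Inst), Iteration(Iter), IsFree(Free), IsCounted(0) {}
};

// Hash and equality deliberately ignore the payload bits.
struct KeyHash {
  std::size_t operator()(const ToyInstState &S) const {
    int Iter = S.Iteration;
    return std::hash<const void *>()(S.I) * 31u + std::hash<int>()(Iter);
  }
};
struct KeyEq {
  bool operator()(const ToyInstState &A, const ToyInstState &B) const {
    return A.I == B.I && A.Iteration == B.Iteration;
  }
};
typedef std::unordered_set<ToyInstState, KeyHash, KeyEq> StateSet;

// Looks up by key only; the IsFree value in the probe does not matter.
bool markCounted(const StateSet &Set, const void *I, int Iteration) {
  StateSet::const_iterator It = Set.find(ToyInstState(I, Iteration, false));
  if (It == Set.end() || It->IsCounted)
    return false;
  It->IsCounted = 1;
  return true;
}
```

A probe built with arbitrary payload values finds the stored entry, which is the behaviour the requested code comment would document.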

This revision is now accepted and ready to land. May 12 2016, 3:52 PM

Random drive-by comment inline

lib/Transforms/Scalar/LoopUnrollPass.cpp
432 ↗	(On Diff #49862)	Nit: spelling

Closed by commit rL269388: [Unroll] Implement a conservative and monotonically increasing cost tracking… (authored by mzolotukhin). May 12 2016, 6:48 PM

This revision was automatically updated to reflect the committed changes.

Hi,

I committed the patch in r269388. Along with the changes you requested, I adjusted some thresholds in the test invocations and made one semantic change in visitPHINode. In the previous version we checked if (PN.getParent() == L->getHeader()) and exited early if it was true. That prevented us from running simplifyInstWithSCEV on the phi value, so the analysis didn't get the information about this phi. Now we run the base visitor first to mine the data if we can, and only then do we check whether the phi is in the header.

Michael

lib/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll
1 ↗	(On Diff #49862)	Ouch, thanks for the catch!
lib/Transforms/Scalar/LoopUnrollPass.cpp
241–244 ↗	(On Diff #49862)	We need to treat inner loops in a special way (e.g. using weights for basic blocks based on the profile info), but we're not doing it now, so I tend to think this function in current implementation isn't designed to work on outer loops.
270 ↗	(On Diff #49862)	Fixed :)
417 ↗	(On Diff #49862)	I'll add them back in another patch if I see a need for them.
430–431 ↗	(On Diff #49862)	Yeah, I think I may have misunderstood this part at the time.

chapuni added a subscriber: chapuni. May 12 2016, 11:43 PM

chapuni added inline comments.

llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
434	FYI, it seems msc doesn't like it. bool Inserted = InstCostMap.insert({&I, (int)Iteration, (unsigned)IsFree, /*IsCounted*/ false}) would work for me.

Thanks, I reapplied the patch with your fix (r269486). Hopefully, the bots will be happy!

Michael
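
For readers unfamiliar with the pattern in that fix, the sketch below shows a braced initializer for a bit-field struct with explicit casts. The thread does not say what the compiler objected to; treating it as a complaint about implicit conversions inside the braces is an assumption. `PackedState` and `makeState` are hypothetical names that only mirror the layout of `UnrolledInstState`.

```cpp
#include <cassert>

// Same shape as the struct in the diff: a pointer plus 32 bits of bit-fields.
struct PackedState {
  const void *I;
  int Iteration : 30;
  unsigned IsFree : 1;
  unsigned IsCounted : 1;
};

// The casts make each initializer's type match the declared type of its
// field, so no implicit signedness or bool conversion happens in the braces.
PackedState makeState(const void *I, unsigned Iteration, bool IsFree) {
  PackedState S = {I, (int)Iteration, (unsigned)IsFree, /*IsCounted*/ false};
  return S;
}
```

The casts do not change the stored values; they only spell out conversions that would otherwise be implicit.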

Revision Contents

Path	Size
llvm/trunk/include/llvm/Analysis/LoopUnrollAnalyzer.h	1 line
llvm/trunk/lib/Analysis/LoopUnrollAnalyzer.cpp	10 lines
llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp	191 lines
llvm/trunk/test/Transforms/LoopUnroll/full-unroll-heuristics-2.ll	2 lines
llvm/trunk/test/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll	38 lines
llvm/trunk/test/Transforms/LoopUnroll/full-unroll-heuristics-geps.ll	2 lines
llvm/trunk/unittests/Analysis/UnrollAnalyzer.cpp	10 lines

Diff 57128

llvm/trunk/include/llvm/Analysis/LoopUnrollAnalyzer.h

(83 lines not shown)	private:

bool simplifyInstWithSCEV(Instruction *I);		bool simplifyInstWithSCEV(Instruction *I);

bool visitInstruction(Instruction &I) { return simplifyInstWithSCEV(&I); }		bool visitInstruction(Instruction &I) { return simplifyInstWithSCEV(&I); }
bool visitBinaryOperator(BinaryOperator &I);		bool visitBinaryOperator(BinaryOperator &I);
bool visitLoad(LoadInst &I);		bool visitLoad(LoadInst &I);
bool visitCastInst(CastInst &I);		bool visitCastInst(CastInst &I);
bool visitCmpInst(CmpInst &I);		bool visitCmpInst(CmpInst &I);
		bool visitPHINode(PHINode &PN);
};		};
}		}
#endif		#endif

llvm/trunk/lib/Analysis/LoopUnrollAnalyzer.cpp

(183 lines not shown)	if (Constant *CRHS = dyn_cast<Constant>(RHS)) {
SimplifiedValues[&I] = C;		SimplifiedValues[&I] = C;
return true;		return true;
}		}
}		}
}		}

return Base::visitCmpInst(I);		return Base::visitCmpInst(I);
}		}

		bool UnrolledInstAnalyzer::visitPHINode(PHINode &PN) {
		// Run base visitor first. This way we can gather some useful for later
		// analysis information.
		if (Base::visitPHINode(PN))
		return true;

		// The loop induction PHI nodes are definitionally free.
		return PN.getParent() == L->getHeader();
		}

llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp

(179 lines not shown)	if (UP.PartialThreshold != NoThreshold)
UP.PartialThreshold =		UP.PartialThreshold =
std::max<unsigned>(UP.PartialThreshold, PragmaUnrollThreshold);		std::max<unsigned>(UP.PartialThreshold, PragmaUnrollThreshold);
}		}

return UP;		return UP;
}		}

namespace {		namespace {
		/// A struct to densely store the state of an instruction after unrolling at
		/// each iteration.
		///
		/// This is designed to work like a tuple of <Instruction *, int> for the
		/// purposes of hashing and lookup, but to be able to associate two boolean
		/// states with each key.
		struct UnrolledInstState {
		Instruction *I;
		int Iteration : 30;
		unsigned IsFree : 1;
		unsigned IsCounted : 1;
		};

		/// Hashing and equality testing for a set of the instruction states.
		struct UnrolledInstStateKeyInfo {
		typedef DenseMapInfo<Instruction *> PtrInfo;
		typedef DenseMapInfo<std::pair<Instruction *, int>> PairInfo;
		static inline UnrolledInstState getEmptyKey() {
		return {PtrInfo::getEmptyKey(), 0, 0, 0};
		}
		static inline UnrolledInstState getTombstoneKey() {
		return {PtrInfo::getTombstoneKey(), 0, 0, 0};
		}
		static inline unsigned getHashValue(const UnrolledInstState &S) {
		return PairInfo::getHashValue({S.I, S.Iteration});
		}
		static inline bool isEqual(const UnrolledInstState &LHS,
		const UnrolledInstState &RHS) {
		return PairInfo::isEqual({LHS.I, LHS.Iteration}, {RHS.I, RHS.Iteration});
		}
		};
		}

		namespace {
struct EstimatedUnrollCost {		struct EstimatedUnrollCost {
/// \brief The estimated cost after unrolling.		/// \brief The estimated cost after unrolling.
int UnrolledCost;		int UnrolledCost;

/// \brief The estimated dynamic cost of executing the instructions in the		/// \brief The estimated dynamic cost of executing the instructions in the
/// rolled form.		/// rolled form.
int RolledDynamicCost;		int RolledDynamicCost;
};		};
(17 lines not shown)	analyzeLoopUnrollCost(const Loop *L, unsigned TripCount, DominatorTree &DT,
ScalarEvolution &SE, const TargetTransformInfo &TTI,		ScalarEvolution &SE, const TargetTransformInfo &TTI,
int MaxUnrolledLoopSize) {		int MaxUnrolledLoopSize) {
// We want to be able to scale offsets by the trip count and add more offsets		// We want to be able to scale offsets by the trip count and add more offsets
// to them without checking for overflows, and we already don't want to		// to them without checking for overflows, and we already don't want to
// analyze massive trip counts, so we force the max to be reasonably small.		// analyze massive trip counts, so we force the max to be reasonably small.
assert(UnrollMaxIterationsCountToAnalyze < (INT_MAX / 2) &&		assert(UnrollMaxIterationsCountToAnalyze < (INT_MAX / 2) &&
"The unroll iterations max is too large!");		"The unroll iterations max is too large!");

		// Only analyze inner loops. We can't properly estimate cost of nested loops
		// and we won't visit inner loops again anyway.
		if (!L->empty())
		return None;

// Don't simulate loops with a big or unknown tripcount		// Don't simulate loops with a big or unknown tripcount
if (!UnrollMaxIterationsCountToAnalyze || !TripCount ||		if (!UnrollMaxIterationsCountToAnalyze || !TripCount ||
TripCount > UnrollMaxIterationsCountToAnalyze)		TripCount > UnrollMaxIterationsCountToAnalyze)
return None;		return None;

SmallSetVector<BasicBlock *, 16> BBWorklist;		SmallSetVector<BasicBlock *, 16> BBWorklist;
		SmallSetVector<std::pair<BasicBlock *, BasicBlock *>, 4> ExitWorklist;
DenseMap<Value *, Constant *> SimplifiedValues;		DenseMap<Value *, Constant *> SimplifiedValues;
SmallVector<std::pair<Value *, Constant *>, 4> SimplifiedInputValues;		SmallVector<std::pair<Value *, Constant *>, 4> SimplifiedInputValues;

// The estimated cost of the unrolled form of the loop. We try to estimate		// The estimated cost of the unrolled form of the loop. We try to estimate
// this by simplifying as much as we can while computing the estimate.		// this by simplifying as much as we can while computing the estimate.
int UnrolledCost = 0;		int UnrolledCost = 0;

// We also track the estimated dynamic (that is, actually executed) cost in		// We also track the estimated dynamic (that is, actually executed) cost in
// the rolled form. This helps identify cases when the savings from unrolling		// the rolled form. This helps identify cases when the savings from unrolling
// aren't just exposing dead control flows, but actual reduced dynamic		// aren't just exposing dead control flows, but actual reduced dynamic
// instructions due to the simplifications which we expect to occur after		// instructions due to the simplifications which we expect to occur after
// unrolling.		// unrolling.
int RolledDynamicCost = 0;		int RolledDynamicCost = 0;

		// We track the simplification of each instruction in each iteration. We use
		// this to recursively merge costs into the unrolled cost on-demand so that
		// we don't count the cost of any dead code. This is essentially a map from
		// <instruction, int> to <bool, bool>, but stored as a densely packed struct.
		DenseSet<UnrolledInstState, UnrolledInstStateKeyInfo> InstCostMap;

		// A small worklist used to accumulate cost of instructions from each
		// observable and reached root in the loop.
		SmallVector<Instruction *, 16> CostWorklist;

		// PHI-used worklist used between iterations while accumulating cost.
		SmallVector<Instruction *, 4> PHIUsedList;

		// Helper function to accumulate cost for instructions in the loop.
		auto AddCostRecursively = [&](Instruction &RootI, int Iteration) {
		assert(Iteration >= 0 && "Cannot have a negative iteration!");
		assert(CostWorklist.empty() && "Must start with an empty cost list");
		assert(PHIUsedList.empty() && "Must start with an empty phi used list");
		CostWorklist.push_back(&RootI);
		for (;; --Iteration) {
		do {
		Instruction *I = CostWorklist.pop_back_val();

		// InstCostMap only uses I and Iteration as a key, the other two values
		// don't matter here.
		auto CostIter = InstCostMap.find({I, Iteration, 0, 0});
		if (CostIter == InstCostMap.end())
		// If an input to a PHI node comes from a dead path through the loop
		// we may have no cost data for it here. What that actually means is
		// that it is free.
		continue;
		auto &Cost = *CostIter;
		if (Cost.IsCounted)
		// Already counted this instruction.
		continue;

		// Mark that we are counting the cost of this instruction now.
		Cost.IsCounted = true;

		// If this is a PHI node in the loop header, just add it to the PHI set.
		if (auto *PhiI = dyn_cast<PHINode>(I))
		if (PhiI->getParent() == L->getHeader()) {
		assert(Cost.IsFree && "Loop PHIs shouldn't be evaluated as they "
		"inherently simplify during unrolling.");
		if (Iteration == 0)
		continue;

		// Push the incoming value from the backedge into the PHI used list
		// if it is an in-loop instruction. We'll use this to populate the
		// cost worklist for the next iteration (as we count backwards).
		if (auto *OpI = dyn_cast<Instruction>(
		PhiI->getIncomingValueForBlock(L->getLoopLatch())))
		if (L->contains(OpI))
		PHIUsedList.push_back(OpI);
		continue;
		}

		// First accumulate the cost of this instruction.
		if (!Cost.IsFree) {
		UnrolledCost += TTI.getUserCost(I);
		DEBUG(dbgs() << "Adding cost of instruction (iteration " << Iteration
		<< "): ");
		DEBUG(I->dump());
		}

		// We must count the cost of every operand which is not free,
		// recursively. If we reach a loop PHI node, simply add it to the set
		// to be considered on the next iteration (backwards!).
		for (Value *Op : I->operands()) {
		// Check whether this operand is free due to being a constant or
		// outside the loop.
		auto *OpI = dyn_cast<Instruction>(Op);
		if (!OpI || !L->contains(OpI))
		continue;

		// Otherwise accumulate its cost.
		CostWorklist.push_back(OpI);
		}
		} while (!CostWorklist.empty());

		if (PHIUsedList.empty())
		// We've exhausted the search.
		break;

		assert(Iteration > 0 &&
		"Cannot track PHI-used values past the first iteration!");
		CostWorklist.append(PHIUsedList.begin(), PHIUsedList.end());
		PHIUsedList.clear();
		}
		};

// Ensure that we don't violate the loop structure invariants relied on by		// Ensure that we don't violate the loop structure invariants relied on by
// this analysis.		// this analysis.
assert(L->isLoopSimplifyForm() && "Must put loop into normal form first.");		assert(L->isLoopSimplifyForm() && "Must put loop into normal form first.");
assert(L->isLCSSAForm(DT) &&		assert(L->isLCSSAForm(DT) &&
"Must have loops in LCSSA form to track live-out values.");		"Must have loops in LCSSA form to track live-out values.");

DEBUG(dbgs() << "Starting LoopUnroll profitability analysis...\n");		DEBUG(dbgs() << "Starting LoopUnroll profitability analysis...\n");

(38 lines not shown)	for (unsigned Iteration = 0; Iteration < TripCount; ++Iteration) {
// Note that we must not cache the size, this loop grows the worklist.		// Note that we must not cache the size, this loop grows the worklist.
for (unsigned Idx = 0; Idx != BBWorklist.size(); ++Idx) {		for (unsigned Idx = 0; Idx != BBWorklist.size(); ++Idx) {
BasicBlock *BB = BBWorklist[Idx];		BasicBlock *BB = BBWorklist[Idx];

// Visit all instructions in the given basic block and try to simplify		// Visit all instructions in the given basic block and try to simplify
// it. We don't change the actual IR, just count optimization		// it. We don't change the actual IR, just count optimization
// opportunities.		// opportunities.
for (Instruction &I : *BB) {		for (Instruction &I : *BB) {
int InstCost = TTI.getUserCost(&I);		// Track this instruction's expected baseline cost when executing the
		// rolled loop form.
		RolledDynamicCost += TTI.getUserCost(&I);

// Visit the instruction to analyze its loop cost after unrolling,		// Visit the instruction to analyze its loop cost after unrolling,
// and if the visitor returns false, include this instruction in the		// and if the visitor returns true, mark the instruction as free after
// unrolled cost.		// unrolling and continue.
if (!Analyzer.visit(I))		bool IsFree = Analyzer.visit(I);
UnrolledCost += InstCost;		bool Inserted = InstCostMap.insert({&I, (int)Iteration, IsFree,
else {		/*IsCounted*/ false})
DEBUG(dbgs() << " " << I		.second;
<< " would be simplified if loop is unrolled.\n");		(void)Inserted;
(void)0;		assert(Inserted && "Cannot have a state for an unvisited instruction!");
}

// Also track this instructions expected cost when executing the rolled		if (IsFree)
// loop form.		continue;
RolledDynamicCost += InstCost;
		// If the instruction might have a side-effect recursively account for
		// the cost of it and all the instructions leading up to it.
		if (I.mayHaveSideEffects())
		AddCostRecursively(I, Iteration);

		// Can't properly model a cost of a call.
		// FIXME: With a proper cost model we should be able to do it.
		if(isa<CallInst>(&I))
		return None;

// If unrolled body turns out to be too big, bail out.		// If unrolled body turns out to be too big, bail out.
if (UnrolledCost > MaxUnrolledLoopSize) {		if (UnrolledCost > MaxUnrolledLoopSize) {
DEBUG(dbgs() << " Exceeded threshold.. exiting.\n"		DEBUG(dbgs() << " Exceeded threshold.. exiting.\n"
<< " UnrolledCost: " << UnrolledCost		<< " UnrolledCost: " << UnrolledCost
<< ", MaxUnrolledLoopSize: " << MaxUnrolledLoopSize		<< ", MaxUnrolledLoopSize: " << MaxUnrolledLoopSize
<< "\n");		<< "\n");
return None;		return None;
(12 lines not shown)	for (unsigned Idx = 0; Idx != BBWorklist.size(); ++Idx) {
// Just take the first successor if condition is undef		// Just take the first successor if condition is undef
if (isa<UndefValue>(SimpleCond))		if (isa<UndefValue>(SimpleCond))
Succ = BI->getSuccessor(0);		Succ = BI->getSuccessor(0);
else		else
Succ = BI->getSuccessor(		Succ = BI->getSuccessor(
cast<ConstantInt>(SimpleCond)->isZero() ? 1 : 0);		cast<ConstantInt>(SimpleCond)->isZero() ? 1 : 0);
if (L->contains(Succ))		if (L->contains(Succ))
BBWorklist.insert(Succ);		BBWorklist.insert(Succ);
		else
		ExitWorklist.insert({BB, Succ});
continue;		continue;
}		}
}		}
} else if (SwitchInst *SI = dyn_cast<SwitchInst>(TI)) {		} else if (SwitchInst *SI = dyn_cast<SwitchInst>(TI)) {
if (Constant *SimpleCond =		if (Constant *SimpleCond =
SimplifiedValues.lookup(SI->getCondition())) {		SimplifiedValues.lookup(SI->getCondition())) {
BasicBlock *Succ = nullptr;		BasicBlock *Succ = nullptr;
// Just take the first successor if condition is undef		// Just take the first successor if condition is undef
if (isa<UndefValue>(SimpleCond))		if (isa<UndefValue>(SimpleCond))
Succ = SI->getSuccessor(0);		Succ = SI->getSuccessor(0);
else		else
Succ = SI->findCaseValue(cast<ConstantInt>(SimpleCond))		Succ = SI->findCaseValue(cast<ConstantInt>(SimpleCond))
.getCaseSuccessor();		.getCaseSuccessor();
if (L->contains(Succ))		if (L->contains(Succ))
BBWorklist.insert(Succ);		BBWorklist.insert(Succ);
		else
		ExitWorklist.insert({BB, Succ});
continue;		continue;
}		}
}		}

// Add BB's successors to the worklist.		// Add BB's successors to the worklist.
for (BasicBlock *Succ : successors(BB))		for (BasicBlock *Succ : successors(BB))
if (L->contains(Succ))		if (L->contains(Succ))
BBWorklist.insert(Succ);		BBWorklist.insert(Succ);
		else
		ExitWorklist.insert({BB, Succ});
}		}

// If we found no optimization opportunities on the first iteration, we		// If we found no optimization opportunities on the first iteration, we
// won't find them on later ones too.		// won't find them on later ones too.
if (UnrolledCost == RolledDynamicCost) {		if (UnrolledCost == RolledDynamicCost) {
DEBUG(dbgs() << " No opportunities found.. exiting.\n"		DEBUG(dbgs() << " No opportunities found.. exiting.\n"
<< " UnrolledCost: " << UnrolledCost << "\n");		<< " UnrolledCost: " << UnrolledCost << "\n");
return None;		return None;
}		}
}		}

		while (!ExitWorklist.empty()) {
		BasicBlock *ExitingBB, *ExitBB;
		std::tie(ExitingBB, ExitBB) = ExitWorklist.pop_back_val();

		for (Instruction &I : *ExitBB) {
		auto *PN = dyn_cast<PHINode>(&I);
		if (!PN)
		break;

		Value *Op = PN->getIncomingValueForBlock(ExitingBB);
		if (auto *OpI = dyn_cast<Instruction>(Op))
		if (L->contains(OpI))
		AddCostRecursively(*OpI, TripCount - 1);
		}
		}

DEBUG(dbgs() << "Analysis finished:\n"		DEBUG(dbgs() << "Analysis finished:\n"
<< "UnrolledCost: " << UnrolledCost << ", "		<< "UnrolledCost: " << UnrolledCost << ", "
<< "RolledDynamicCost: " << RolledDynamicCost << "\n");		<< "RolledDynamicCost: " << RolledDynamicCost << "\n");
return {{UnrolledCost, RolledDynamicCost}};		return {{UnrolledCost, RolledDynamicCost}};
}		}

/// ApproximateLoopSize - Approximate the size of the loop.		/// ApproximateLoopSize - Approximate the size of the loop.
static unsigned ApproximateLoopSize(const Loop *L, unsigned &NumCalls,		static unsigned ApproximateLoopSize(const Loop *L, unsigned &NumCalls,
(465 lines not shown)

llvm/trunk/test/Transforms/LoopUnroll/full-unroll-heuristics-2.ll

	; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=1000 -unroll-threshold=10 -unroll-percent-dynamic-cost-saved-threshold=50 -unroll-dynamic-cost-savings-discount=90 | FileCheck %s			; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=1000 -unroll-threshold=10 -unroll-percent-dynamic-cost-saved-threshold=70 -unroll-dynamic-cost-savings-discount=90 | FileCheck %s
	target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

	@unknown_global = internal unnamed_addr global [9 x i32] [i32 0, i32 -1, i32 0, i32 -1, i32 5, i32 -1, i32 0, i32 -1, i32 0], align 16			@unknown_global = internal unnamed_addr global [9 x i32] [i32 0, i32 -1, i32 0, i32 -1, i32 5, i32 -1, i32 0, i32 -1, i32 0], align 16
	@weak_constant = weak unnamed_addr constant [9 x i32] [i32 0, i32 -1, i32 0, i32 -1, i32 5, i32 -1, i32 0, i32 -1, i32 0], align 16			@weak_constant = weak unnamed_addr constant [9 x i32] [i32 0, i32 -1, i32 0, i32 -1, i32 5, i32 -1, i32 0, i32 -1, i32 0], align 16

	; Though @unknown_global is initialized with constant values, we can't consider			; Though @unknown_global is initialized with constant values, we can't consider
	; it as a constant, so we shouldn't unroll the loop.			; it as a constant, so we shouldn't unroll the loop.
	; CHECK-LABEL: @foo			; CHECK-LABEL: @foo
	(48 lines not shown)

llvm/trunk/test/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll

				; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=100 -unroll-dynamic-cost-savings-discount=1000 -unroll-threshold=10 -unroll-percent-dynamic-cost-saved-threshold=60 | FileCheck %s
				target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

				@known_constant = internal unnamed_addr constant [10 x i32] [i32 0, i32 0, i32 0, i32 0, i32 1, i32 0, i32 0, i32 0, i32 0, i32 0], align 16

				; If a load becomes a constant after loop unrolling, we sometimes can simplify
				; CFG. This test verifies that we handle such cases.
				; After one operand in an instruction is constant-folded and the
				; instruction is simplified, the other operand might become dead.
				; In this test we have::
				; for i in 1..10:
				; r += A[i] * B[i]
				; A[i] is 0 almost at every iteration, so there is no need in loading B[i] at
				; all.


				; CHECK-LABEL: @unroll_dce
				; CHECK-NOT: br i1 %exitcond, label %for.end, label %for.body
				define i32 @unroll_dce(i32* noalias nocapture readonly %b) {
				entry:
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%iv.0 = phi i64 [ 0, %entry ], [ %iv.1, %for.body ]
				%r.0 = phi i32 [ 0, %entry ], [ %r.1, %for.body ]
				%arrayidx1 = getelementptr inbounds [10 x i32], [10 x i32]* @known_constant, i64 0, i64 %iv.0
				%x1 = load i32, i32* %arrayidx1, align 4
				%arrayidx2 = getelementptr inbounds i32, i32* %b, i64 %iv.0
				%x2 = load i32, i32* %arrayidx2, align 4
				%mul = mul i32 %x1, %x2
				%r.1 = add i32 %mul, %r.0
				%iv.1 = add nuw nsw i64 %iv.0, 1
				%exitcond = icmp eq i64 %iv.1, 10
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body
				ret i32 %r.1
				}
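
The idea behind this test can be sketched outside of LLVM. The following is only an illustration, not code from the patch: `KnownConstant` mirrors the initializer of `@known_constant`, while `dotFull`, `countLiveLoads`, `dotSimplified`, and `selfCheck` are hypothetical names introduced here. Once the loop is fully unrolled and each `A[i]` is folded to its constant, the product `A[i] * B[i]` is zero whenever `A[i]` is zero, so the load of `B[i]` is dead in those iterations.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Mirrors the initializer of @known_constant: only element 4 is nonzero.
const std::array<int, 10> KnownConstant = {0, 0, 0, 0, 1, 0, 0, 0, 0, 0};

// The loop as written in the test: r += A[i] * B[i] for every iteration.
int dotFull(const std::array<int, 10> &A, const int *B) {
  int R = 0;
  for (std::size_t I = 0; I < A.size(); ++I)
    R += A[I] * B[I];
  return R;
}

// Number of iterations in which the load of B[i] is still needed after
// A[i] has been folded to a constant (x * 0 == 0 makes the load dead).
std::size_t countLiveLoads(const std::array<int, 10> &A) {
  std::size_t Live = 0;
  for (int X : A)
    if (X != 0)
      ++Live;
  return Live;
}

// Hand-simplified form of the fully unrolled loop for KnownConstant:
// only iteration 4 contributes, and its multiplier is 1.
int dotSimplified(const int *B) { return 1 * B[4]; }

// Checks that the simplified form agrees with the original loop.
void selfCheck() {
  const int B[10] = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};
  assert(countLiveLoads(KnownConstant) == 1);
  assert(dotFull(KnownConstant, B) == dotSimplified(B));
}
```

In this sketch only one of the ten loads of `B` stays live, which is the kind of saving the cost analysis is meant to recognize before deciding to unroll.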

llvm/trunk/test/Transforms/LoopUnroll/full-unroll-heuristics-geps.ll

-; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=100 -unroll-dynamic-cost-savings-discount=1000 -unroll-threshold=10 -unroll-percent-dynamic-cost-saved-threshold=40 | FileCheck %s
+; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=100 -unroll-dynamic-cost-savings-discount=1000 -unroll-threshold=10 -unroll-percent-dynamic-cost-saved-threshold=60 | FileCheck %s
 target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

 ; When examining gep-instructions we shouldn't consider them simplified if the
 ; corresponding memory access isn't simplified. Doing the opposite might bias
 ; our estimate, so that we might decide to unroll even a simple memcpy loop.
 ;
 ; Thus, the following loop shouldn't be unrolled:
 ; CHECK-LABEL: @not_simplified_geps
[19 more lines not shown]

llvm/trunk/unittests/Analysis/UnrollAnalyzer.cpp

[128 lines not shown]
 TEST(UnrollAnalyzerTest, OuterLoopSimplification) {
   const char *ModuleStr =
       "target datalayout = \"e-m:o-i64:64-f80:128-n8:16:32:64-S128\"\n"
       "define void @foo() {\n"
       "entry:\n"
       " br label %outer.loop\n"
       "outer.loop:\n"
       " %iv.outer = phi i64 [ 0, %entry ], [ %iv.outer.next, %outer.loop.latch ]\n"
+      " %iv.outer.next = add nuw nsw i64 %iv.outer, 1\n"
       " br label %inner.loop\n"
       "inner.loop:\n"
       " %iv.inner = phi i64 [ 0, %outer.loop ], [ %iv.inner.next, %inner.loop ]\n"
       " %iv.inner.next = add nuw nsw i64 %iv.inner, 1\n"
       " %exitcond.inner = icmp eq i64 %iv.inner.next, 1000\n"
       " br i1 %exitcond.inner, label %outer.loop.latch, label %inner.loop\n"
       "outer.loop.latch:\n"
-      " %iv.outer.next = add nuw nsw i64 %iv.outer, 1\n"
       " %exitcond.outer = icmp eq i64 %iv.outer.next, 40\n"
       " br i1 %exitcond.outer, label %exit, label %outer.loop\n"
       "exit:\n"
       " ret void\n"
       "}\n";

   UnrollAnalyzerTest *P = new UnrollAnalyzerTest();
   LLVMContext Context;
   std::unique_ptr<Module> M = makeLLVMModule(Context, P, ModuleStr);
   legacy::PassManager Passes;
   Passes.add(P);
   Passes.run(*M);

   Module::iterator MI = M->begin();
   Function *F = &*MI++;
   Function::iterator FI = F->begin();
   FI++;
   BasicBlock *Header = &*FI++;
   BasicBlock *InnerBody = &*FI++;

   BasicBlock::iterator BBI = Header->begin();
-  Instruction *Y1 = &*BBI++;
+  BBI++;
+  Instruction *Y1 = &*BBI;
   BBI = InnerBody->begin();
-  Instruction *Y2 = &*BBI++;
+  BBI++;
+  Instruction *Y2 = &*BBI;
   // Check that we can simplify IV of the outer loop, but can't simplify the IV
   // of the inner loop if we only know the iteration number of the outer loop.
+  //
+  // Y1 is %iv.outer.next, Y2 is %iv.inner.next
   auto I1 = SimplifiedValuesVector[0].find(Y1);
   EXPECT_TRUE(I1 != SimplifiedValuesVector[0].end());
   auto I2 = SimplifiedValuesVector[0].find(Y2);
   EXPECT_TRUE(I2 == SimplifiedValuesVector[0].end());
 }
 TEST(UnrollAnalyzerTest, CmpSimplifications) {
   const char *ModuleStr =
       "target datalayout = \"e-m:o-i64:64-f80:128-n8:16:32:64-S128\"\n"
[149 more lines not shown]


Diff 57128
