This is an archive of the discontinued LLVM Phabricator instance.


Differential D11758

[Unroll] Implement a conservative and monotonically increasing cost tracking system during the full unroll heuristic analysis that avoids counting any instruction cost until that instruction becomes "live" through a side-effect or use outside the...
ClosedPublic

Authored by mzolotukhin on Aug 5 2015, 1:54 AM.


Details

Reviewers

chandlerc

Commits

rGb7b8052982de: [Unroll] Implement a conservative and monotonically increasing cost tracking…
rL269388: [Unroll] Implement a conservative and monotonically increasing cost tracking…

Summary

...loop after the last iteration.

This is really hard to do correctly. The core problem is that we need to
model liveness through the induction PHIs from iteration to iteration in
order to get the correct results, and we need to correctly de-duplicate
the common subgraphs of instructions feeding some subset of the
induction PHIs. All of this can be driven either from a side effect at
some iteration or from the loop values used after the loop finishes.

This patch implements this by storing the forward-propagating analysis
of each instruction in a cache to recall whether it was free and whether
it has become live and thus counted toward the total unroll cost. Then,
at each sink for a value in the loop, we recursively walk back through
every value that feeds the sink, including looping back through the
iterations as needed, until we have marked the entire input graph as
live. Because we cache this, we never visit instructions more than twice:
once when we analyze them and put them into the cache, and once when
we count their cost towards the unrolled loop. Also, because each cache
entry is only two bits of state and because we are dealing with relatively
small iteration counts, we can store all of this very densely in memory to
keep this from becoming an excessively slow analysis.
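
As a rough illustration of the scheme described above, here is a minimal standalone sketch. It is not the code from this patch: `ToyInst`, `ToyState`, and `estimateUnrolledCost` are hypothetical names, the forward simplification is reduced to a precomputed `IsFree` flag, and a `std::map` stands in for the dense cache. It only shows the shape of the idea: record a state per (instruction, iteration) on the way forward, and add cost only when a side-effecting sink pulls its inputs live, walking back through header PHIs into earlier iterations.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical toy IR, not LLVM's. Operands are indices into the loop body.
struct ToyInst {
  int Cost;
  std::vector<int> Operands; // in-loop operands read in the same iteration
  bool IsHeaderPhi;          // header PHI: reads BackedgeOperand from Iter - 1
  int BackedgeOperand;       // -1 if there is none
  bool HasSideEffect;        // acts as a sink that makes its inputs live
  bool IsFree;               // stand-in for the forward simplification result
};

// The two bits of cached state per (instruction, iteration).
struct ToyState {
  bool IsFree;
  bool IsCounted;
};

int estimateUnrolledCost(const std::vector<ToyInst> &Body, int TripCount) {
  assert(TripCount >= 0 && "Cannot have a negative trip count!");
  std::map<std::pair<int, int>, ToyState> Cache; // key: (index, iteration)
  int UnrolledCost = 0; // only ever increases

  // Walk backwards from a sink, counting every uncounted, non-free input.
  auto AddCostRecursively = [&](int Root, int Iteration) {
    std::vector<std::pair<int, int> > Worklist;
    Worklist.push_back(std::make_pair(Root, Iteration));
    while (!Worklist.empty()) {
      std::pair<int, int> Key = Worklist.back();
      Worklist.pop_back();
      std::map<std::pair<int, int>, ToyState>::iterator It = Cache.find(Key);
      // No entry means the value was never reached, so it is free.
      if (It == Cache.end() || It->second.IsCounted)
        continue;
      It->second.IsCounted = true;

      const ToyInst &I = Body[Key.first];
      if (I.IsHeaderPhi) {
        // Header PHIs are free; follow the backedge into the prior iteration.
        if (Key.second > 0 && I.BackedgeOperand >= 0)
          Worklist.push_back(
              std::make_pair(I.BackedgeOperand, Key.second - 1));
        continue;
      }
      if (!It->second.IsFree)
        UnrolledCost += I.Cost;
      for (int Op : I.Operands)
        Worklist.push_back(std::make_pair(Op, Key.second));
    }
  };

  for (int Iter = 0; Iter < TripCount; ++Iter) {
    for (int Idx = 0; Idx < static_cast<int>(Body.size()); ++Idx) {
      const ToyInst &I = Body[Idx];
      bool IsFree = I.IsHeaderPhi || I.IsFree;
      Cache.insert(std::make_pair(std::make_pair(Idx, Iter),
                                  ToyState{IsFree, false}));
      if (!IsFree && I.HasSideEffect)
        AddCostRecursively(Idx, Iter);
    }
  }
  return UnrolledCost;
}
```

In this toy model, an instruction that feeds no sink never contributes to the total, and an instruction reached from several sinks is charged once because of the `IsCounted` bit.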

The code here is still pretty gross. I would appreciate suggestions
about better ways to factor or split this up; I've stared too long at
the algorithmic side to really have a good sense of what the design
should probably look like.

Also, it might seem like we should do all of this bottom-up, but I think
that is a red herring. Specifically, the simplification power is *much*
greater working top-down. We can forward propagate very effectively,
even across strange and interesting recurrences around the backedge.
Because we use data to propagate, this doesn't cause a state space
explosion. Doing this level of constant folding, etc, would be very
expensive to do bottom-up because it wouldn't be until the last moment
that you could collapse everything. The current solution is essentially
a top-down simplification with a bottom-up cost accounting which seems
to get the best of both worlds. It makes the simplification incremental
and powerful while leaving everything dead until we *know* it is needed.

Finally, a core property of this approach is its *monotonicity*. At all
times, the current UnrolledCost is a conservatively low estimate. This
ensures that we will never early-exit from the analysis due to exceeding
a threshold when, had we continued, the cost would have gone back
below the threshold. These kinds of bugs can cause random changes in
behavior that are incredibly hard to track down.
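
To make the monotonicity argument concrete, here is a small hedged sketch. The function name `bailsEarly` and the cost deltas used with it are made up for illustration and do not come from the patch. It models an analysis that adds cost deltas one at a time and exits as soon as the running total exceeds a threshold.

```cpp
#include <cassert>
#include <vector>

// Returns true if the running total ever exceeds Threshold while scanning
// Deltas from left to right, which is when an early exit would fire.
bool bailsEarly(const std::vector<int> &Deltas, int Threshold) {
  int Cost = 0;
  for (int D : Deltas) {
    Cost += D;
    if (Cost > Threshold)
      return true;
  }
  return false;
}
```

With a scheme that first charges speculative cost and later refunds it, a sequence such as `{8, 5, -9}` against a threshold of 10 triggers the early exit even though the final total is 4. When every delta is non-negative, as in the approach described above, the running total is always a lower bound on the final cost, so an early exit can only fire if the final cost would exceed the threshold too.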

We could use a similar (but much simpler) technique within the inliner
as well to avoid considering speculated code in the inline cost.

Diff Detail

Event Timeline

chandlerc updated this revision to Diff 31339. Aug 5 2015, 1:54 AM

chandlerc retitled this revision from to [Unroll] Implement a conservative and monotonically increasing cost tracking system during the full unroll heuristic analysis that avoids counting any instruction cost until that instruction becomes "live" through a side-effect or use outside the....

chandlerc updated this object.

chandlerc added a reviewer: mzolotukhin.

chandlerc added a subscriber: llvm-commits.

Rebase after landing prerequisite patches and merge a missing commit into this
patch. Sorry for the broken original diff.

Hi Chandler,

Mostly the patch looks good, a couple of nit-picks inline.

But I think we need more tests for this. I think we can add debug prints and pin-point the cases we want to check with it in tests.

Thanks,
Michael

lib/Transforms/Scalar/LoopUnrollPass.cpp
270	s/instrution/instruction/
290	Maybe worth commenting here that only `I` and `Iteration` are actually used as a key. It might not be obvious to someone looking at this code without the context of this patch.
294	s/free.a/free./
417	I find debug prints very convenient in this kind of analysis. Would it be possible to bring them back? We could report instructions that were simplified and instructions that were proven to be live, for instance. Also, I think we can use this diagnostic in tests, as it gets harder and harder to write them.
427–434	Is the word 'either' redundant here, or is it just my English? :)
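
The comment on line 290 concerns a set whose elements carry extra flags that take no part in hashing or equality. A minimal standalone sketch of that pattern, using `std::unordered_set` instead of LLVM's `DenseSet`, might look like the following. The names `InstState`, `KeyHash`, `KeyEq`, `makeKey`, and `markCounted` are hypothetical, and `const void *` stands in for `Instruction *`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <unordered_set>

// Densely packed state. Only I and Iteration form the key; the two flags
// are payload, so they are mutable and ignored by hashing and equality.
struct InstState {
  const void *I; // stand-in for Instruction *
  int Iteration : 30;
  mutable unsigned IsFree : 1;
  mutable unsigned IsCounted : 1;
};

struct KeyHash {
  std::size_t operator()(const InstState &S) const {
    return std::hash<const void *>()(S.I) * 31u +
           std::hash<int>()(S.Iteration);
  }
};

struct KeyEq {
  bool operator()(const InstState &A, const InstState &B) const {
    return A.I == B.I && A.Iteration == B.Iteration;
  }
};

typedef std::unordered_set<InstState, KeyHash, KeyEq> InstStateSet;

// Builds a lookup key. The flag values are irrelevant for find().
InstState makeKey(const void *I, int Iteration) {
  InstState S;
  S.I = I;
  S.Iteration = Iteration;
  S.IsFree = 0;
  S.IsCounted = 0;
  return S;
}

// Marks an entry as counted; returns true only the first time.
bool markCounted(InstStateSet &Set, const void *I, int Iteration) {
  InstStateSet::iterator It = Set.find(makeKey(I, Iteration));
  if (It == Set.end() || It->IsCounted)
    return false;
  It->IsCounted = 1; // fine: the flag is mutable and not part of the key
  return true;
}
```

Because lookups ignore the flags, a probe built with zeroed flags finds an entry whose flags are set, which is the same trick as the `InstCostMap.find({I, Iteration, 0, 0})` call in the diff.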

Hi Chandler,

I found that with this patch compile time is significantly worse on some benchmarks; a reduced testcase from one of them is attached. It seems like the problem isn’t in the patch itself, but in the way we perform the actual unrolling. You can see that if you add the ‘-debug’ flag. I can look into what’s bad there later, but if you’d like to poke at it too, you are welcome :)

reduced.ll (2 KB)
msg-15878-295.txt (2 KB)

Hi Chandler,

This patch got stuck, but I finally got some time and investigated it more carefully. I found a bug in the current implementation that explains everything, and as a consequence I constructed a simple test case to demonstrate the problem:

int foo(char *a) {
  int i = 0;
  for (i = 0; i < 500; i++)
    if (a[i] == 0)
      return i;
  return 0;
}

The problem is that the only instructions we consider live are those having side effects. However, an instruction might have no side effects (by LLVM's definition of side effects: mayWriteToMemory() || mayThrow() || !mayReturn()) and still affect program behavior, as the early exit does in the case above.

Thanks,
Michael

lib/Transforms/Scalar/LoopUnrollPass.cpp
430–431	Should we also take into account instructions with out-of-the-loop uses and early-exit instructions? Are there any other cases we don't want to miss?

Hi Chandler,

We briefly discussed it on IRC, but I'll duplicate my latest findings here too.

I tested this patch and found several issues, which lead to undesired unrolling in some cases, and thus, have significant compile-time impact for no performance benefit. With them fixed/worked-around, the compile time regressions seemed not that bad, but I'll need to remeasure it when we fix the issues in a proper way. With these issues worked-around I still saw some nice performance gains.

Here is the list of problems that I found:

1. With this improved algorithm for finding dead instructions, we're now able to figure out that loop control flow becomes dead after unrolling. If a loop has a small body, then the control flow might be up to 50% of the loop body, but it doesn't seem reasonable to unroll such loops. For instance:

for (i = 0; i < 500; i++)
   a[i] += 1;

In this case unrolling removes nothing except the control flow, but it fools the current heuristic, so the loop is unrolled. Such loops are pretty common, so the compile-time hit is severe if we unroll them. The performance gain is questionable, and we probably regress performance in such cases.

2. Currently simplifyInstWithSCEV returns true (meaning the corresponding instruction is simplified) for expressions of the form Address + ConstantOffset. However, unrolling doesn't necessarily lead to simplification of such instructions, so our estimate might be wrong here. For example:

for (int i=0; i < 16; i++)  {
   a[i][0]=b[S.y][S.x+i];
}

In this case index expressions make up most of the loop body, but, contrary to our estimate, unrolling doesn't help to simplify them.

3. After we unroll a loop, we should make sure that we clean up everything we expected to be simplified or dead; otherwise we will count it again when we analyze the parent loop.
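
The first problem in the list can be put in numbers with a small hedged sketch. The struct `LoopCost`, the function `percentSavedByControlFlow`, and the unit costs used with it are all assumptions for illustration, not values taken from the cost model or from this patch. The idea is simply to ask what share of the rolled per-iteration cost vanishes if unrolling removes only the control flow.

```cpp
#include <cassert>

// Hypothetical per-iteration costs of a loop in rolled form.
struct LoopCost {
  int Body;    // instructions doing the actual work
  int Control; // induction update, exit compare, and branch
};

// Percentage of the rolled cost that disappears if unrolling removes
// nothing except the control flow.
int percentSavedByControlFlow(LoopCost C) {
  int Rolled = C.Body + C.Control;
  return Rolled == 0 ? 0 : (100 * C.Control) / Rolled;
}
```

If `a[i] += 1` is assumed to cost three units of body (load, add, store) and three units of control, the apparent saving is 50%, in line with the "up to 50%" figure above, even though no real work was removed. For a body of 30 units with the same control overhead the figure drops to 9%.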

Michael

I plan to rebase this patch and slightly update it.

Rebase onto TOT.
Only work on inner loops: we can't accurately estimate the cost of instructions from inner loops when analyzing an outer loop, and we won't be visiting inner loops later anyway, so even if unrolling the outer loop could expose new opportunities in inner loops, we wouldn't catch them now. We can revisit this later.
Don't unroll loops with calls inside. Experiments showed some problems with such loops (our estimate is far from accurate for them), and disabling it doesn't hurt any test. Again, we can revisit it later if we have a better way to estimate the cost of calls.

Ping!

OK, I've paged all this back into my brain.

The inner loop aspect makes more sense to me now. I'm still a bit dubious about skipping calls, but I understand that at least we're not ready for them yet and it's important to get this stuff moving, so it seems like a good incremental step.

Some comments below. Only the test for the inner loop case is really interesting. Feel free to submit with these comments addressed or post any updates with more questions if useful.

lib/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll
1	This file is under lib, not test?
lib/Transforms/Scalar/LoopUnrollPass.cpp
241–244	It would make slightly more sense to hoist this out to the caller... There is nothing about this routine that is specific to inner loops, it's just that it isn't useful right now? Not a big deal either way.
270	Heh, now that you've taken this over, need to address your own comment here. ;]
290	I'm fine with a comment, or extending the DenseMapInfo to support using .find_as(std::make_pair(I, Iteration)).
417	I have no problem adding them back. Want to do it in a follow-up patch? In this patch?
430–431	Not sure what this comment was referencing... The idea is that all out-of-loop uses should be via an LCSSA phi node in the exit block(s) and we backwards saturate their cost there?
432–433	Maybe leave a FIXME? I feel like we should make some effort to address this in follow-up commits. At the very least I suspect we want to handle intrinsics here.

This revision is now accepted and ready to land. May 12 2016, 3:52 PM

Random drive-by comment inline

lib/Transforms/Scalar/LoopUnrollPass.cpp
432	Nit: spelling

Closed by commit rL269388: [Unroll] Implement a conservative and monotonically increasing cost tracking… (authored by mzolotukhin). May 12 2016, 6:48 PM

This revision was automatically updated to reflect the committed changes.

Hi,

I committed the patch in r269388. Along with the changes you requested, I adjusted some thresholds in the test invocations and made one semantic change in visitPHINode. In the previous version we checked if (PN.getParent() == L->getHeader()) and exited early if it was true. That prevented us from running simplifyInstWithSCEV on the phi value, so the analysis didn't get the information about this phi. Now we run the base visitor first to mine the data if we can, and only then check whether the phi is in the header.

Michael
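
The ordering change in visitPHINode described in the message above can be sketched with a toy model. This is not the committed code: `ToyPhi`, `ToyAnalyzer`, and both visit functions are hypothetical, and `visitBase` is only a stand-in for the base visitor that records simplified values.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical PHI description for the toy model.
struct ToyPhi {
  std::string Name;
  bool InHeader;        // PHI lives in the loop header
  bool FoldsToConstant; // the base visitor could simplify it
};

struct ToyAnalyzer {
  std::set<std::string> SimplifiedValues;

  // Stand-in for the base visitor: records the value if it simplifies.
  bool visitBase(const ToyPhi &PN) {
    if (!PN.FoldsToConstant)
      return false;
    SimplifiedValues.insert(PN.Name);
    return true;
  }

  // Earlier ordering: header PHIs return before the base visitor runs.
  bool visitPhiHeaderFirst(const ToyPhi &PN) {
    if (PN.InHeader)
      return true;
    return visitBase(PN);
  }

  // Later ordering: mine the data first, then treat header PHIs as free.
  bool visitPhiBaseFirst(const ToyPhi &PN) {
    if (visitBase(PN))
      return true;
    return PN.InHeader;
  }
};
```

Both orderings report a header PHI as free, but only the second one leaves an entry in `SimplifiedValues` for later visitors to use.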

lib/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll
1	Ouch, thanks for the catch!
lib/Transforms/Scalar/LoopUnrollPass.cpp
241–244	We need to treat inner loops in a special way (e.g. using weights for basic blocks based on the profile info), but we're not doing it now, so I tend to think this function in its current implementation isn't designed to work on outer loops.
270	Fixed :)
417	I'll add them back in another patch if I see a need for them.
430–431	Yeah, I think I might have misunderstood this part at the time.

chapuni added a subscriber: chapuni. May 12 2016, 11:43 PM

chapuni added inline comments.

llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
434 ↗	(On Diff #57128)	FYI, it seems msc doesn't like it. bool Inserted = InstCostMap.insert({&I, (int)Iteration, (unsigned)IsFree, /*IsCounted*/ false}) would work for me.

Thanks, I reapplied the patch with your fix (r269486). Hopefully, the bots will be happy!

Michael

Revision Contents

Path                                                      Size
include/llvm/Analysis/LoopUnrollAnalyzer.h                1 line
lib/Analysis/LoopUnrollAnalyzer.cpp                       8 lines
lib/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll   38 lines
lib/Transforms/Scalar/LoopUnrollPass.cpp                  188 lines
test/Transforms/LoopUnroll/full-unroll-heuristics-2.ll    2 lines
unittests/Analysis/UnrollAnalyzer.cpp                     10 lines

Diff 49862

include/llvm/Analysis/LoopUnrollAnalyzer.h

[83 lines not shown]
private:

bool simplifyInstWithSCEV(Instruction *I);		bool simplifyInstWithSCEV(Instruction *I);

bool visitInstruction(Instruction &I) { return simplifyInstWithSCEV(&I); }		bool visitInstruction(Instruction &I) { return simplifyInstWithSCEV(&I); }
bool visitBinaryOperator(BinaryOperator &I);		bool visitBinaryOperator(BinaryOperator &I);
bool visitLoad(LoadInst &I);		bool visitLoad(LoadInst &I);
bool visitCastInst(CastInst &I);		bool visitCastInst(CastInst &I);
bool visitCmpInst(CmpInst &I);		bool visitCmpInst(CmpInst &I);
		bool visitPHINode(PHINode &PN);
};		};
}		}
#endif		#endif

lib/Analysis/LoopUnrollAnalyzer.cpp

[183 lines not shown]
if (Constant *CRHS = dyn_cast<Constant>(RHS)) {
SimplifiedValues[&I] = C;		SimplifiedValues[&I] = C;
return true;		return true;
}		}
}		}
}		}

return Base::visitCmpInst(I);		return Base::visitCmpInst(I);
}		}

		bool UnrolledInstAnalyzer::visitPHINode(PHINode &PN) {
		// The loop induction PHI nodes are definitionally free.
		if (PN.getParent() == L->getHeader())
		return true;

		return Base::visitPHINode(PN);
		}

lib/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll

This file was added.

				; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=100 -unroll-dynamic-cost-savings-discount=1000 -unroll-threshold=10 -unroll-percent-dynamic-cost-saved-threshold=80 | FileCheck %s
				target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

				@known_constant = internal unnamed_addr constant [10 x i32] [i32 0, i32 0, i32 0, i32 0, i32 1, i32 0, i32 0, i32 0, i32 0, i32 0], align 16

				; If a load becomes a constant after loop unrolling, we sometimes can simplify
				; CFG. This test verifies that we handle such cases.
				; After one operand in an instruction is constant-folded and the
				; instruction is simplified, the other operand might become dead.
				; In this test we have::
				; for i in 1..10:
				; r += A[i] * B[i]
				; A[i] is 0 almost at every iteration, so there is no need in loading B[i] at
				; all.


				; CHECK-LABEL: @unroll_dce
				; CHECK-NOT: br i1 %exitcond, label %for.end, label %for.body
				define i32 @unroll_dce(i32* noalias nocapture readonly %b) {
				entry:
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%iv.0 = phi i64 [ 0, %entry ], [ %iv.1, %for.body ]
				%r.0 = phi i32 [ 0, %entry ], [ %r.1, %for.body ]
				%arrayidx1 = getelementptr inbounds [10 x i32], [10 x i32]* @known_constant, i64 0, i64 %iv.0
				%x1 = load i32, i32* %arrayidx1, align 4
				%arrayidx2 = getelementptr inbounds i32, i32* %b, i64 %iv.0
				%x2 = load i32, i32* %arrayidx2, align 4
				%mul = mul i32 %x1, %x2
				%r.1 = add i32 %mul, %r.0
				%iv.1 = add nuw nsw i64 %iv.0, 1
				%exitcond = icmp eq i64 %iv.1, 10
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body
				ret i32 %r.1
				}

lib/Transforms/Scalar/LoopUnrollPass.cpp

[165 lines not shown]
if (UP.PartialThreshold != NoThreshold)
UP.PartialThreshold =		UP.PartialThreshold =
std::max<unsigned>(UP.PartialThreshold, PragmaUnrollThreshold);		std::max<unsigned>(UP.PartialThreshold, PragmaUnrollThreshold);
}		}

return UP;		return UP;
}		}

namespace {		namespace {
		/// A struct to densely store the state of an instruction after unrolling at
		/// each iteration.
		///
		/// This is designed to work like a tuple of <Instruction *, int> for the
		/// purposes of hashing and lookup, but to be able to associate two boolean
		/// states with each key.
		struct UnrolledInstState {
		Instruction *I;
		int Iteration : 30;
		unsigned IsFree : 1;
		unsigned IsCounted : 1;
		};

		/// Hashing and equality testing for a set of the instruction states.
		struct UnrolledInstStateKeyInfo {
		typedef DenseMapInfo<Instruction *> PtrInfo;
		typedef DenseMapInfo<std::pair<Instruction *, int>> PairInfo;
		static inline UnrolledInstState getEmptyKey() {
		return {PtrInfo::getEmptyKey(), 0, 0, 0};
		}
		static inline UnrolledInstState getTombstoneKey() {
		return {PtrInfo::getTombstoneKey(), 0, 0, 0};
		}
		static inline unsigned getHashValue(const UnrolledInstState &S) {
		return PairInfo::getHashValue({S.I, S.Iteration});
		}
		static inline bool isEqual(const UnrolledInstState &LHS,
		const UnrolledInstState &RHS) {
		return PairInfo::isEqual({LHS.I, LHS.Iteration}, {RHS.I, RHS.Iteration});
		}
		};
		}

		namespace {
struct EstimatedUnrollCost {		struct EstimatedUnrollCost {
/// \brief The estimated cost after unrolling.		/// \brief The estimated cost after unrolling.
int UnrolledCost;		int UnrolledCost;

/// \brief The estimated dynamic cost of executing the instructions in the		/// \brief The estimated dynamic cost of executing the instructions in the
/// rolled form.		/// rolled form.
int RolledDynamicCost;		int RolledDynamicCost;
};		};
[17 lines not shown]
analyzeLoopUnrollCost(const Loop *L, unsigned TripCount, DominatorTree &DT,
ScalarEvolution &SE, const TargetTransformInfo &TTI,		ScalarEvolution &SE, const TargetTransformInfo &TTI,
int MaxUnrolledLoopSize) {		int MaxUnrolledLoopSize) {
// We want to be able to scale offsets by the trip count and add more offsets		// We want to be able to scale offsets by the trip count and add more offsets
// to them without checking for overflows, and we already don't want to		// to them without checking for overflows, and we already don't want to
// analyze massive trip counts, so we force the max to be reasonably small.		// analyze massive trip counts, so we force the max to be reasonably small.
assert(UnrollMaxIterationsCountToAnalyze < (INT_MAX / 2) &&		assert(UnrollMaxIterationsCountToAnalyze < (INT_MAX / 2) &&
"The unroll iterations max is too large!");		"The unroll iterations max is too large!");

		// Only analyze inner loops. We can't properly estimate cost of nested loops
		// and we won't visit inner loops again anyway.
		if (!L->empty())
		return None;

// Don't simulate loops with a big or unknown tripcount		// Don't simulate loops with a big or unknown tripcount
if (!UnrollMaxIterationsCountToAnalyze || !TripCount ||		if (!UnrollMaxIterationsCountToAnalyze || !TripCount ||
TripCount > UnrollMaxIterationsCountToAnalyze)		TripCount > UnrollMaxIterationsCountToAnalyze)
return None;		return None;

SmallSetVector<BasicBlock *, 16> BBWorklist;		SmallSetVector<BasicBlock *, 16> BBWorklist;
		SmallSetVector<std::pair<BasicBlock *, BasicBlock *>, 4> ExitWorklist;
DenseMap<Value *, Constant *> SimplifiedValues;		DenseMap<Value *, Constant *> SimplifiedValues;
SmallVector<std::pair<Value *, Constant *>, 4> SimplifiedInputValues;		SmallVector<std::pair<Value *, Constant *>, 4> SimplifiedInputValues;

// The estimated cost of the unrolled form of the loop. We try to estimate		// The estimated cost of the unrolled form of the loop. We try to estimate
// this by simplifying as much as we can while computing the estimate.		// this by simplifying as much as we can while computing the estimate.
int UnrolledCost = 0;		int UnrolledCost = 0;

// We also track the estimated dynamic (that is, actually executed) cost in		// We also track the estimated dynamic (that is, actually executed) cost in
// the rolled form. This helps identify cases when the savings from unrolling		// the rolled form. This helps identify cases when the savings from unrolling
// aren't just exposing dead control flows, but actual reduced dynamic		// aren't just exposing dead control flows, but actual reduced dynamic
// instructions due to the simplifications which we expect to occur after		// instructions due to the simplifications which we expect to occur after
// unrolling.		// unrolling.
int RolledDynamicCost = 0;		int RolledDynamicCost = 0;

		// We track the simplification of each instruction in each iteration. We use
		// this to recursively merge costs into the unrolled cost on-demand so that
		// we don't count the cost of any dead code. This is essentially a map from
		// <instrution, int> to <bool, bool>, but stored as a densely packed struct.
		DenseSet<UnrolledInstState, UnrolledInstStateKeyInfo> InstCostMap;

		// A small worklist used to accumulate cost of instructions from each
		// observable and reached root in the loop.
		SmallVector<Instruction *, 16> CostWorklist;

		// PHI-used worklist used between iterations while accumulating cost.
		SmallVector<Instruction *, 4> PHIUsedList;

		// Helper function to accumulate cost for instructions in the loop.
		auto AddCostRecursively = [&](Instruction &RootI, int Iteration) {
		assert(Iteration >= 0 && "Cannot have a negative iteration!");
		assert(CostWorklist.empty() && "Must start with an empty cost list");
		assert(PHIUsedList.empty() && "Must start with an empty phi used list");
		CostWorklist.push_back(&RootI);
		for (;; --Iteration) {
		do {
		Instruction *I = CostWorklist.pop_back_val();

		auto CostIter = InstCostMap.find({I, Iteration, 0, 0});
		if (CostIter == InstCostMap.end())
		// If an input to a PHI node comes from a dead path through the loop
		// we may have no cost data for it here. What that actually means is
		// that it is free.
		continue;
		auto &Cost = *CostIter;
		if (Cost.IsCounted)
		// Already counted this instruction.
		continue;

		// Mark that we are counting the cost of this instruction now.
		Cost.IsCounted = true;

		// If this is a PHI node in the loop header, just add it to the PHI set.
		if (auto *PhiI = dyn_cast<PHINode>(I))
		if (PhiI->getParent() == L->getHeader()) {
		assert(Cost.IsFree && "Loop PHIs shouldn't be evaluated as they "
		"inherently simplify during unrolling.");
		if (Iteration == 0)
		continue;

		// Push the incoming value from the backedge into the PHI used list
		// if it is an in-loop instruction. We'll use this to populate the
		// cost worklist for the next iteration (as we count backwards).
		if (auto *OpI = dyn_cast<Instruction>(
		PhiI->getIncomingValueForBlock(L->getLoopLatch())))
		if (L->contains(OpI))
		PHIUsedList.push_back(OpI);
		continue;
		}

		// First accumulate the cost of this instruction.
		if (!Cost.IsFree) {
		UnrolledCost += TTI.getUserCost(I);
		DEBUG(dbgs() << "Adding cost of instruction (iteration " << Iteration
		<< "): ");
		DEBUG(I->dump());
		}

		// We must count the cost of every operand which is not free,
		// recursively. If we reach a loop PHI node, simply add it to the set
		// to be considered on the next iteration (backwards!).
		for (Value *Op : I->operands()) {
		// Check whether this operand is free due to being a constant or
		// outside the loop.
		auto *OpI = dyn_cast<Instruction>(Op);
		if (!OpI || !L->contains(OpI))
		continue;

		// Otherwise accumulate its cost.
		CostWorklist.push_back(OpI);
		}
		} while (!CostWorklist.empty());

		if (PHIUsedList.empty())
		// We've exhausted the search.
		break;

		assert(Iteration > 0 &&
		"Cannot track PHI-used values past the first iteration!");
		CostWorklist.append(PHIUsedList.begin(), PHIUsedList.end());
		PHIUsedList.clear();
		}
		};

// Ensure that we don't violate the loop structure invariants relied on by		// Ensure that we don't violate the loop structure invariants relied on by
// this analysis.		// this analysis.
assert(L->isLoopSimplifyForm() && "Must put loop into normal form first.");		assert(L->isLoopSimplifyForm() && "Must put loop into normal form first.");
assert(L->isLCSSAForm(DT) &&		assert(L->isLCSSAForm(DT) &&
"Must have loops in LCSSA form to track live-out values.");		"Must have loops in LCSSA form to track live-out values.");

DEBUG(dbgs() << "Starting LoopUnroll profitability analysis...\n");		DEBUG(dbgs() << "Starting LoopUnroll profitability analysis...\n");

[38 lines not shown]
for (unsigned Iteration = 0; Iteration < TripCount; ++Iteration) {
// Note that we must not cache the size, this loop grows the worklist.		// Note that we must not cache the size, this loop grows the worklist.
for (unsigned Idx = 0; Idx != BBWorklist.size(); ++Idx) {		for (unsigned Idx = 0; Idx != BBWorklist.size(); ++Idx) {
BasicBlock *BB = BBWorklist[Idx];		BasicBlock *BB = BBWorklist[Idx];

// Visit all instructions in the given basic block and try to simplify		// Visit all instructions in the given basic block and try to simplify
// it. We don't change the actual IR, just count optimization		// it. We don't change the actual IR, just count optimization
// opportunities.		// opportunities.
for (Instruction &I : *BB) {		for (Instruction &I : *BB) {
int InstCost = TTI.getUserCost(&I);		// Track this instruction's expected baseline cost when executing the
		// rolled loop form.
		RolledDynamicCost += TTI.getUserCost(&I);

// Visit the instruction to analyze its loop cost after unrolling,		// Visit the instruction to analyze its loop cost after unrolling,
// and if the visitor returns false, include this instruction in the		// and if the visitor returns true, mark the instruction as free after
// unrolled cost.		// unrolling and continue.
if (!Analyzer.visit(I))		bool IsFree = Analyzer.visit(I);
UnrolledCost += InstCost;		bool Inserted = InstCostMap.insert({&I, (int)Iteration, IsFree,
else {		/*IsCounted*/ false})
DEBUG(dbgs() << " " << I		.second;
<< " would be simplified if loop is unrolled.\n");		(void)Inserted;
(void)0;		assert(Inserted && "Cannot have a state for an unvisited instruction!");
}

// Also track this instructions expected cost when executing the rolled		if (IsFree)
// loop form.		continue;
RolledDynamicCost += InstCost;
		// If the instruction might have a side-effect recursively account for
		// the cost of it and all the instructions leading up to it.
		if (I.mayHaveSideEffects())
		AddCostRecursively(I, Iteration);

		// Can't properly modeal a cost of a call.
		sanjoy (inline comment): Nit: spelling
		if(isa<CallInst>(&I))
		return None;

// If unrolled body turns out to be too big, bail out.		// If unrolled body turns out to be too big, bail out.
if (UnrolledCost > MaxUnrolledLoopSize) {		if (UnrolledCost > MaxUnrolledLoopSize) {
DEBUG(dbgs() << " Exceeded threshold.. exiting.\n"		DEBUG(dbgs() << " Exceeded threshold.. exiting.\n"
<< " UnrolledCost: " << UnrolledCost		<< " UnrolledCost: " << UnrolledCost
<< ", MaxUnrolledLoopSize: " << MaxUnrolledLoopSize		<< ", MaxUnrolledLoopSize: " << MaxUnrolledLoopSize
<< "\n");		<< "\n");
return None;		return None;
[12 lines not shown]
for (unsigned Idx = 0; Idx != BBWorklist.size(); ++Idx) {
// Just take the first successor if condition is undef		// Just take the first successor if condition is undef
if (isa<UndefValue>(SimpleCond))		if (isa<UndefValue>(SimpleCond))
Succ = BI->getSuccessor(0);		Succ = BI->getSuccessor(0);
else		else
Succ = BI->getSuccessor(		Succ = BI->getSuccessor(
cast<ConstantInt>(SimpleCond)->isZero() ? 1 : 0);		cast<ConstantInt>(SimpleCond)->isZero() ? 1 : 0);
if (L->contains(Succ))		if (L->contains(Succ))
BBWorklist.insert(Succ);		BBWorklist.insert(Succ);
		else
		ExitWorklist.insert({BB, Succ});
continue;		continue;
}		}
}		}
} else if (SwitchInst *SI = dyn_cast<SwitchInst>(TI)) {		} else if (SwitchInst *SI = dyn_cast<SwitchInst>(TI)) {
if (Constant *SimpleCond =		if (Constant *SimpleCond =
SimplifiedValues.lookup(SI->getCondition())) {		SimplifiedValues.lookup(SI->getCondition())) {
BasicBlock *Succ = nullptr;		BasicBlock *Succ = nullptr;
// Just take the first successor if condition is undef		// Just take the first successor if condition is undef
if (isa<UndefValue>(SimpleCond))		if (isa<UndefValue>(SimpleCond))
Succ = SI->getSuccessor(0);		Succ = SI->getSuccessor(0);
else		else
Succ = SI->findCaseValue(cast<ConstantInt>(SimpleCond))		Succ = SI->findCaseValue(cast<ConstantInt>(SimpleCond))
.getCaseSuccessor();		.getCaseSuccessor();
if (L->contains(Succ))		if (L->contains(Succ))
BBWorklist.insert(Succ);		BBWorklist.insert(Succ);
		else
		ExitWorklist.insert({BB, Succ});
continue;		continue;
}		}
}		}

// Add BB's successors to the worklist.		// Add BB's successors to the worklist.
for (BasicBlock *Succ : successors(BB))		for (BasicBlock *Succ : successors(BB))
if (L->contains(Succ))		if (L->contains(Succ))
BBWorklist.insert(Succ);		BBWorklist.insert(Succ);
		else
		ExitWorklist.insert({BB, Succ});
}		}

// If we found no optimization opportunities on the first iteration, we		// If we found no optimization opportunities on the first iteration, we
// won't find them on later ones too.		// won't find them on later ones too.
if (UnrolledCost == RolledDynamicCost) {		if (UnrolledCost == RolledDynamicCost) {
DEBUG(dbgs() << " No opportunities found.. exiting.\n"		DEBUG(dbgs() << " No opportunities found.. exiting.\n"
<< " UnrolledCost: " << UnrolledCost << "\n");		<< " UnrolledCost: " << UnrolledCost << "\n");
return None;		return None;
}		}
}		}

		while (!ExitWorklist.empty()) {
		BasicBlock *ExitingBB, *ExitBB;
		std::tie(ExitingBB, ExitBB) = ExitWorklist.pop_back_val();

		for (Instruction &I : *ExitBB) {
		auto *PN = dyn_cast<PHINode>(&I);
		if (!PN)
		break;

		Value *Op = PN->getIncomingValueForBlock(ExitingBB);
		if (auto *OpI = dyn_cast<Instruction>(Op))
		if (L->contains(OpI))
		AddCostRecursively(*OpI, TripCount - 1);
		}
		}

DEBUG(dbgs() << "Analysis finished:\n"		DEBUG(dbgs() << "Analysis finished:\n"
<< "UnrolledCost: " << UnrolledCost << ", "		<< "UnrolledCost: " << UnrolledCost << ", "
<< "RolledDynamicCost: " << RolledDynamicCost << "\n");		<< "RolledDynamicCost: " << RolledDynamicCost << "\n");
return {{UnrolledCost, RolledDynamicCost}};		return {{UnrolledCost, RolledDynamicCost}};
}		}

/// ApproximateLoopSize - Approximate the size of the loop.		/// ApproximateLoopSize - Approximate the size of the loop.
static unsigned ApproximateLoopSize(const Loop *L, unsigned &NumCalls,		static unsigned ApproximateLoopSize(const Loop *L, unsigned &NumCalls,
▲ Show 20 Lines • Show All 413 Lines • Show Last 20 Lines

test/Transforms/LoopUnroll/full-unroll-heuristics-2.ll

	; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=1000 -unroll-threshold=10 -unroll-percent-dynamic-cost-saved-threshold=50 -unroll-dynamic-cost-savings-discount=90 \| FileCheck %s			; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=1000 -unroll-threshold=10 -unroll-percent-dynamic-cost-saved-threshold=70 -unroll-dynamic-cost-savings-discount=90 \| FileCheck %s
	target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"			target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

	@unknown_global = internal unnamed_addr global [9 x i32] [i32 0, i32 -1, i32 0, i32 -1, i32 5, i32 -1, i32 0, i32 -1, i32 0], align 16			@unknown_global = internal unnamed_addr global [9 x i32] [i32 0, i32 -1, i32 0, i32 -1, i32 5, i32 -1, i32 0, i32 -1, i32 0], align 16
	@weak_constant = weak unnamed_addr constant [9 x i32] [i32 0, i32 -1, i32 0, i32 -1, i32 5, i32 -1, i32 0, i32 -1, i32 0], align 16			@weak_constant = weak unnamed_addr constant [9 x i32] [i32 0, i32 -1, i32 0, i32 -1, i32 5, i32 -1, i32 0, i32 -1, i32 0], align 16

	; Though @unknown_global is initialized with constant values, we can't consider			; Though @unknown_global is initialized with constant values, we can't consider
	; it as a constant, so we shouldn't unroll the loop.			; it as a constant, so we shouldn't unroll the loop.
	; CHECK-LABEL: @foo			; CHECK-LABEL: @foo
	▲ Show 20 Lines • Show All 48 Lines • Show Last 20 Lines

unittests/Analysis/UnrollAnalyzer.cpp

	Show First 20 Lines • Show All 127 Lines • ▼ Show 20 Lines
	TEST(UnrollAnalyzerTest, OuterLoopSimplification) {			TEST(UnrollAnalyzerTest, OuterLoopSimplification) {
	const char *ModuleStr =			const char *ModuleStr =
	"target datalayout = \"e-m:o-i64:64-f80:128-n8:16:32:64-S128\"\n"			"target datalayout = \"e-m:o-i64:64-f80:128-n8:16:32:64-S128\"\n"
	"define void @foo() {\n"			"define void @foo() {\n"
	"entry:\n"			"entry:\n"
	" br label %outer.loop\n"			" br label %outer.loop\n"
	"outer.loop:\n"			"outer.loop:\n"
	" %iv.outer = phi i64 [ 0, %entry ], [ %iv.outer.next, %outer.loop.latch ]\n"			" %iv.outer = phi i64 [ 0, %entry ], [ %iv.outer.next, %outer.loop.latch ]\n"
				" %iv.outer.next = add nuw nsw i64 %iv.outer, 1\n"
	" br label %inner.loop\n"			" br label %inner.loop\n"
	"inner.loop:\n"			"inner.loop:\n"
	" %iv.inner = phi i64 [ 0, %outer.loop ], [ %iv.inner.next, %inner.loop ]\n"			" %iv.inner = phi i64 [ 0, %outer.loop ], [ %iv.inner.next, %inner.loop ]\n"
	" %iv.inner.next = add nuw nsw i64 %iv.inner, 1\n"			" %iv.inner.next = add nuw nsw i64 %iv.inner, 1\n"
	" %exitcond.inner = icmp eq i64 %iv.inner.next, 1000\n"			" %exitcond.inner = icmp eq i64 %iv.inner.next, 1000\n"
	" br i1 %exitcond.inner, label %outer.loop.latch, label %inner.loop\n"			" br i1 %exitcond.inner, label %outer.loop.latch, label %inner.loop\n"
	"outer.loop.latch:\n"			"outer.loop.latch:\n"
	" %iv.outer.next = add nuw nsw i64 %iv.outer, 1\n"
	" %exitcond.outer = icmp eq i64 %iv.outer.next, 40\n"			" %exitcond.outer = icmp eq i64 %iv.outer.next, 40\n"
	" br i1 %exitcond.outer, label %exit, label %outer.loop\n"			" br i1 %exitcond.outer, label %exit, label %outer.loop\n"
	"exit:\n"			"exit:\n"
	" ret void\n"			" ret void\n"
	"}\n";			"}\n";

	UnrollAnalyzerTest *P = new UnrollAnalyzerTest();			UnrollAnalyzerTest *P = new UnrollAnalyzerTest();
	std::unique_ptr<Module> M = makeLLVMModule(P, ModuleStr);			std::unique_ptr<Module> M = makeLLVMModule(P, ModuleStr);
	legacy::PassManager Passes;			legacy::PassManager Passes;
	Passes.add(P);			Passes.add(P);
	Passes.run(*M);			Passes.run(*M);

	Module::iterator MI = M->begin();			Module::iterator MI = M->begin();
	Function *F = &*MI++;			Function *F = &*MI++;
	Function::iterator FI = F->begin();			Function::iterator FI = F->begin();
	FI++;			FI++;
	BasicBlock *Header = &*FI++;			BasicBlock *Header = &*FI++;
	BasicBlock *InnerBody = &*FI++;			BasicBlock *InnerBody = &*FI++;

	BasicBlock::iterator BBI = Header->begin();			BasicBlock::iterator BBI = Header->begin();
	Instruction *Y1 = &*BBI++;			BBI++;
				Instruction *Y1 = &*BBI;
	BBI = InnerBody->begin();			BBI = InnerBody->begin();
	Instruction *Y2 = &*BBI++;			BBI++;
				Instruction *Y2 = &*BBI;
	// Check that we can simplify IV of the outer loop, but can't simplify the IV			// Check that we can simplify IV of the outer loop, but can't simplify the IV
	// of the inner loop if we only know the iteration number of the outer loop.			// of the inner loop if we only know the iteration number of the outer loop.
				//
				// Y1 is %iv.outer.next, Y2 is %iv.inner.next
	auto I1 = SimplifiedValuesVector[0].find(Y1);			auto I1 = SimplifiedValuesVector[0].find(Y1);
	EXPECT_TRUE(I1 != SimplifiedValuesVector[0].end());			EXPECT_TRUE(I1 != SimplifiedValuesVector[0].end());
	auto I2 = SimplifiedValuesVector[0].find(Y2);			auto I2 = SimplifiedValuesVector[0].find(Y2);
	EXPECT_TRUE(I2 == SimplifiedValuesVector[0].end());			EXPECT_TRUE(I2 == SimplifiedValuesVector[0].end());
	}			}
	} // end namespace llvm			} // end namespace llvm

	INITIALIZE_PASS_BEGIN(UnrollAnalyzerTest, "unrollanalyzertestpass",			INITIALIZE_PASS_BEGIN(UnrollAnalyzerTest, "unrollanalyzertestpass",
	"unrollanalyzertestpass", false, false)			"unrollanalyzertestpass", false, false)
	INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)			INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
	INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)			INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
	INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)			INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)
	INITIALIZE_PASS_END(UnrollAnalyzerTest, "unrollanalyzertestpass",			INITIALIZE_PASS_END(UnrollAnalyzerTest, "unrollanalyzertestpass",
	"unrollanalyzertestpass", false, false)			"unrollanalyzertestpass", false, false)

Diff 49862