This is an archive of the discontinued LLVM Phabricator instance.

Estimate DCE effect in heuristic for estimating complete-unroll optimization effects.
Needs Review · Public

Authored by mzolotukhin on Apr 2 2015, 7:30 PM.

Details

Reviewers
hfinkel
Summary

This patch adds the capability to estimate the effect of DCE on a completely unrolled loop.
That helps improve the accuracy of the prediction, and thus makes the decision
whether to completely unroll better informed.

Diff Detail

Event Timeline

mzolotukhin updated this revision to Diff 23202. Apr 2 2015, 7:30 PM
mzolotukhin retitled this revision from to Estimate DCE effect in heuristic for estimating complete-unroll optimization effects..
mzolotukhin updated this object.
mzolotukhin edited the test plan for this revision. (Show Details)
mzolotukhin added reviewers: hfinkel, chandlerc.
mzolotukhin added a subscriber: Unknown Object (MLST).
chandlerc edited edge metadata. Apr 9 2015, 2:01 PM

Before we go here, I think you should handle dead CFG paths, which is a *much* simpler problem.

Specifically, you should simplify the condition for branches or the input for switches, and only add a single successor when it folds to a constant.

Once that's handled, I think this should work the other way. The problem with doing it as you have is that walking all the users of an instruction is very expensive (a linked-list traversal, so cache hostile). Instead, when we first see an instruction in the body of the loop without side effects, we should add it to a dead set rather than accumulating its cost, and we should remove all operands of all instructions we see in the body of the loop from the dead set (and if one was removed, account for its cost there). On the last iteration, we can also remove all live-out instructions from the dead set using the LCSSA PHI nodes (or just not use the dead set for the last iteration).

Does that make sense?
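A minimal sketch of the dead-set bookkeeping described above, assuming the analyzer's walk visits definitions before uses; the function shape and the CostOf hook are illustrative stand-ins, not the patch's actual code:

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Optimistically treat side-effect-free instructions as dead, deferring
// their cost; pay the cost back the moment something uses them.
unsigned estimateCostWithDeadSet(Loop &L, unsigned (*CostOf)(Instruction *)) {
  unsigned Cost = 0;
  SmallPtrSet<Instruction *, 16> DeadSet;
  for (BasicBlock *BB : L.getBlocks())
    for (Instruction &I : *BB) {
      // Any value used as an operand is live after all: pull it out of
      // the dead set and account for its previously deferred cost.
      for (Value *Op : I.operands())
        if (auto *OpI = dyn_cast<Instruction>(Op))
          if (DeadSet.erase(OpI))
            Cost += CostOf(OpI);
      if (I.mayHaveSideEffects())
        Cost += CostOf(&I); // side effects: always counted
      else
        DeadSet.insert(&I); // defer; this may turn out to be dead
    }
  // For the last unrolled iteration, live-out values (LCSSA PHIs) would
  // also have to be erased from DeadSet, or the set not used at all.
  return Cost;
}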

Handling dead CFG paths only looks like a simple problem; in fact, it's much trickier.

Let me start with an example:

for (i = 0; i < 1000; i++) {
  a[i] = b[i] + c[i];
  if (d[i]) {
    // very expensive code - let's say 998 instructions.
  }
}

The cost of the loop body here would be 1+1+998 = 1000, and the estimated cost of the original loop is TripCount*BodyCost = 10^6.
Suppose that d[] is filled with zeros, so if (d[i]) is always false and we never take the expensive path.
That means that after complete unrolling we'll end up with 1000 instructions: a[0] = b[0] + c[0], a[1] = b[1] + c[1], ...
That looks like a huge win - the cost of the unrolled loop is 10^3, while the cost of the original loop is 10^6. But what did we actually do? We significantly increased the code size and gained nothing in terms of performance - that expensive code would never have been executed in the original loop either! Things would get even worse if, e.g., d[] contained non-zeros too.
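Concretely, the completely unrolled loop with the always-false branches folded away reduces to straight-line code like:

a[0] = b[0] + c[0];
a[1] = b[1] + c[1];
// ... 997 more such statements ...
a[999] = b[999] + c[999];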

So, we can't simply fold the branch and take only one successor - it would be incorrect to compare the cost of the loop computed this way with the original cost. To be precise, that works well for a code-size estimate, but not for execution time (~performance). And the goal of the optimization is to improve performance - i.e., if the completely unrolled loop would execute 20% fewer instructions at run time, then it's worth unrolling.

Having said that, it might be interesting to take branch folding into account, but that would need a much more complicated cost model (and would thus increase code complexity). Currently I'm inclined to put it off until we have a real use case where it can help. What do you think?

Now to the DCE part :)
So, you basically suggest adding all instructions to the 'dead' set and then removing from it the instructions that have uses? Is that the idea? If so, I guess we'll end up with what we have now, but in the removal phase - when you remove an instruction's operands from the set, you also need to remove their operands, and their operands' operands, etc. - that's effectively the same linked-list traversal we have now.

There is a high chance that I misunderstood your suggestion here, so please correct me!
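To make that concrete, the removal phase being described would look roughly like this (markLive is a hypothetical name) - the same transitive traversal, just driven from operands instead of users:

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Hypothetical removal phase: once an instruction is known to be live,
// its operands are live too, and their operands, and so on - the same
// transitive walk, just from the other end.
void markLive(Instruction *I, SmallPtrSetImpl<Instruction *> &DeadSet) {
  if (!DeadSet.erase(I))
    return; // already known live (or never optimistically marked dead)
  for (Value *Op : I->operands())
    if (auto *OpI = dyn_cast<Instruction>(Op))
      markLive(OpI, DeadSet);
}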

mzolotukhin edited edge metadata.

Rebase to trunk.

  • Rebase on top of the recent trunk.
  • Add a test.
mzolotukhin updated this revision to Diff 27263. Jun 5 2015, 9:13 PM
  • Rebase.
  • Adjust according to the new naming scheme.
  • Adjust DCE test according to new naming scheme.
Gerolf added a subscriber: Gerolf. Jul 7 2015, 10:20 PM
Gerolf added inline comments.
lib/Transforms/Scalar/LoopUnrollPass.cpp
580

How about: // When there is no optimization opportunity in the first iteration, we won't find opportunities in later iterations because ...

582

I'm missing context. Where do the costs get computed?

test/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll
7

This should be true for expressions also, not just loads.

mzolotukhin marked an inline comment as done. Jul 7 2015, 10:28 PM

Thanks, Gerolf,
I'll update the comments.

lib/Transforms/Scalar/LoopUnrollPass.cpp
582

On lines 594, 598.

Was there an update?

Thanks
Gerolf

chandlerc added inline comments. Jul 15 2015, 5:20 PM
lib/Transforms/Scalar/LoopUnrollPass.cpp
585–601

This seems like it may be somewhat slow, and I expect it to impact the computation relatively rarely - SimplifiedValues should have forward-pruned most of the dead instructions here?

What about a slightly different approach:

  • Each time we simplify something whose operands are not simplified, add that instruction to a SimplifiedRootsSet and SimplifiedRootsWorklist.
  • Each time we actually count an instruction's cost, add it to a set of cost counted instructions, and increment a count of uses for each of its operands.
  • Here, for each instruction in the worklist, for each operand to that instruction: if the operand is in the set of cost-counted instructions, is not in the SimplifiedRootsSet, and has a zero count of uses, subtract its cost, decrement the use counts of all its operands, add it to the SimplifiedRootsSet, and add it to the worklist.

This should only start from the instructions we can't forward-simplify (things that SCEV simplifies, for example), and walk recursively up their operands, GC-ing everything whose use count in the loop reaches zero as a consequence.

This seems like it should be faster in the common cases than walking the user lists of every instruction? As far as I can tell, we at most walk every operand twice (once to increment, once to decrement)....
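For illustration, the backward GC pass proposed here might look roughly as follows; every name (NumLoopUses, CostCounted, SimplifiedRoots, CostOf) is invented for the sketch, and the containers are assumed to have been populated during the forward walk as the bullets describe:

#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Walk back from the simplified roots, reclaiming the cost of every
// counted instruction whose in-loop use count has dropped to zero.
void reclaimDeadCost(SmallVectorImpl<Instruction *> &Worklist,
                     SmallPtrSetImpl<Instruction *> &SimplifiedRoots,
                     const SmallPtrSetImpl<Instruction *> &CostCounted,
                     DenseMap<Instruction *, unsigned> &NumLoopUses,
                     unsigned &Cost, unsigned (*CostOf)(Instruction *)) {
  while (!Worklist.empty()) {
    Instruction *I = Worklist.pop_back_val();
    for (Value *Op : I->operands()) {
      auto *OpI = dyn_cast<Instruction>(Op);
      // Reclaimable: its cost was counted, nothing in the loop still
      // uses it, and we haven't already reclaimed it.
      if (!OpI || !CostCounted.count(OpI) || NumLoopUses.lookup(OpI) != 0 ||
          !SimplifiedRoots.insert(OpI).second)
        continue;
      Cost -= CostOf(OpI); // take back the cost counted earlier
      for (Value *OpOp : OpI->operands())
        if (auto *OpOpI = dyn_cast<Instruction>(OpOp))
          --NumLoopUses[OpOpI]; // each operand loses one in-loop user
      Worklist.push_back(OpI); // and may now be reclaimable itself
    }
  }
}

Under that scheme each operand edge is indeed touched at most twice - once for the increment during the forward walk, once for the decrement during GC - so the whole pass stays linear in the number of operands.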

mzolotukhin marked an inline comment as done.
  • Rebase on master.
  • Avoid visiting all basic blocks and all their instructions - instead, use a worklist.

Hi Chandler,

I chose a hybrid approach, somewhere between how it was implemented before and what you suggested.

I did use a worklist to avoid one more visit of all instructions, but I decided not to count uses. I had two reasons in mind for that (see the sketch after the list):

  1. It would work incorrectly (or would require some special handling) with uses outside the loop - we wouldn't count them, and thus might think an instruction is dead when it's not.
  2. It would additionally complicate the code.
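A rough sketch of that hybrid shape (all names illustrative; the worklist and dead set are assumed to be seeded with the simplified instructions, and deadness is established by walking users directly rather than by counting):

#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Worklist-driven estimation without use counts: an instruction is dead
// only if every user is a known-dead instruction inside the loop, so a
// use outside the loop automatically keeps it alive.
void propagateDeadness(Loop &L, SmallVectorImpl<Instruction *> &Worklist,
                       SmallPtrSetImpl<Instruction *> &DeadInstructions,
                       unsigned &Cost, unsigned (*CostOf)(Instruction *)) {
  auto IsDead = [&](Instruction *I) {
    if (I->mayHaveSideEffects())
      return false;
    for (User *U : I->users()) {
      auto *UI = dyn_cast<Instruction>(U);
      if (!UI || !L.contains(UI) || !DeadInstructions.count(UI))
        return false; // a live or out-of-loop user keeps I alive
    }
    return true;
  };
  while (!Worklist.empty()) {
    Instruction *I = Worklist.pop_back_val();
    for (Value *Op : I->operands())
      if (auto *OpI = dyn_cast<Instruction>(Op))
        if (L.contains(OpI) && !DeadInstructions.count(OpI) && IsDead(OpI)) {
          DeadInstructions.insert(OpI);
          Cost -= CostOf(OpI);     // reclaim the previously counted cost
          Worklist.push_back(OpI); // its operands may be dead now too
        }
  }
}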

What do you think? Does it sound reasonable?

Michael