This is an archive of the discontinued LLVM Phabricator instance.

Estimate DCE effect in heuristic for estimating complete-unroll optimization effects.
Needs ReviewPublic

Authored by mzolotukhin on Apr 2 2015, 7:30 PM.

Details

Reviewers
hfinkel
Summary

This patch adds the capability to estimate the effect of DCE on a completely unrolled loop.
That helps to improve the accuracy of the prediction, and thus makes the decision
whether to completely unroll better informed.

Diff Detail

Event Timeline

mzolotukhin updated this revision to Diff 23202.Apr 2 2015, 7:30 PM
mzolotukhin retitled this revision from to Estimate DCE effect in heuristic for estimating complete-unroll optimization effects..
mzolotukhin updated this object.
mzolotukhin edited the test plan for this revision. (Show Details)
mzolotukhin added reviewers: hfinkel, chandlerc.
mzolotukhin added a subscriber: Unknown Object (MLST).
chandlerc edited edge metadata.Apr 9 2015, 2:01 PM

Before we go here, I think you should handle dead CFG paths, which is a *much* simpler problem.

Specifically, you should simplify the condition for branches or the input for switches, and only add a single successor when it folds to a constant.

Once that's handled, I think this should work the other way. The problem with doing it as you have is that walking all the users of an instruction is very expensive (a linked-list traversal, which is cache hostile). Instead, when we first see an instruction in the body of the loop without side effects, we should add it to a dead set rather than accumulating its cost, and we should remove the operands of every instruction we see in the body of the loop from the dead set (and if an operand was removed, account for its cost there). On the last iteration, we can also remove all live-out instructions from the dead set using the LCSSA PHI nodes (or just not use the dead set for the last iteration).

Does that make sense?
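
To make the shape of that suggestion concrete, here is a minimal sketch of the forward dead-set idea, using toy stand-in types (Inst, estimateBodyCost, and the Cost/HasSideEffects fields are assumptions for illustration, not the actual LLVM classes):

    #include <unordered_set>
    #include <vector>

    // Toy stand-in for llvm::Instruction, for illustration only.
    struct Inst {
      std::vector<Inst *> Operands;
      unsigned Cost = 1;
      bool HasSideEffects = false;
    };

    // One forward walk over the loop body, in program order.
    // Side-effect-free instructions are deferred into DeadSet instead of
    // being charged; an instruction is charged only once something uses it.
    unsigned estimateBodyCost(const std::vector<Inst *> &Body) {
      std::unordered_set<Inst *> DeadSet;
      unsigned Cost = 0;
      for (Inst *I : Body) {
        // Seeing a use "resurrects" an operand: charge its cost now.
        for (Inst *Op : I->Operands)
          if (DeadSet.erase(Op))
            Cost += Op->Cost;
        if (I->HasSideEffects)
          Cost += I->Cost;
        else
          DeadSet.insert(I); // defer: maybe nothing will ever use it
      }
      // Whatever is left in DeadSet was never used in the body, so its
      // cost was never added.  (Live-outs would need the LCSSA-PHI
      // handling mentioned above.)
      return Cost;
    }

Note that in this sketch a dead instruction still resurrects its own operands when it is visited - which is exactly the transitivity concern raised in the reply below.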

Handling dead CFG paths only looks like a simple problem; in fact, it's much trickier.

Let me start with an example:

for (i = 0; i < 1000; i++) {
  a[i] = b[i] + c[i];
  if (d[i]) {
    // very expensive code - let's say 998 instructions.
  }
}

The cost of the loop body here would be 1+1+998 = 1000, and the estimated cost of the original loop is TripCount*BodyCost = 10^6.
Suppose that d[i] is filled with zeros, so if (d[i]) is always false and we never take the expensive path.
That means that after complete unrolling we'll end up with 1000 instructions: a[0] = b[0] + c[0], a[1] = b[1] + c[1], ...
That looks like a huge win - the cost of the unrolled loop is 10^3, while the cost of the original loop is 10^6. But what did we actually do? We significantly increased the code size, and gained nothing in terms of performance - that expensive code would never have been executed in the original loop either! Things would get even worse if, e.g., d[] contained non-zeros too.

So, we can't simply fold the branch and take only one successor - it would be incorrect to compare the cost of the loop computed this way with the original cost. To be precise, that works well for a code-size estimate, but not for execution time (~performance). And the goal of the optimization is to improve performance - i.e., if in the completely unrolled loop we'd need to execute 20% fewer instructions at run time, then it's worth unrolling.

Having said that, it might be interesting to take branch folding into account, but that would need a much more complicated cost model (and thus would increase code complexity). Currently I'm inclined to put it off until we get a real use case where it can help. What do you think?

Now to the DCE part :)
So, you basically suggest adding all instructions to the 'dead' set and then removing from it the instructions that have uses? Is that the idea? If so, I guess we'll end up with what we have now, just in the removal phase - when you remove an instruction's operands from the set, you also need to remove their operands, and their operands' operands, etc. - that's effectively the same linked-list traversal as we have now.

There is a high chance that I misunderstood your suggestion here, so please correct me!

mzolotukhin edited edge metadata.

Rebase to trunk.

  • Rebase on top of the recent trunk.
  • Add a test.
mzolotukhin updated this revision to Diff 27263.Jun 5 2015, 9:13 PM
  • Rebase.
  • Adjust according to the new naming scheme.
  • Adjust DCE test according to new naming scheme.
Gerolf added a subscriber: Gerolf.Jul 7 2015, 10:20 PM
Gerolf added inline comments.
lib/Transforms/Scalar/LoopUnrollPass.cpp
633

How about: // When there is no optimization opportunity in the first iteration, we won't find opportunities in later iterations because ...

635–637

I'm missing context. Where do the costs get computed?

test/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll
7

This should be true for expressions also, not just loads.

mzolotukhin marked an inline comment as done.Jul 7 2015, 10:28 PM

Thanks, Gerolf,
I'll update the comments.

lib/Transforms/Scalar/LoopUnrollPass.cpp
635–637

On lines 594, 598.

Was there an update?

Thanks
Gerolf

chandlerc added inline comments.Jul 15 2015, 5:20 PM
lib/Transforms/Scalar/LoopUnrollPass.cpp
640–656

This seems like it may be somewhat slow, and I expect it to affect the computation relatively rarely - SimplifiedValues should have forward-pruned most of the dead instructions here?

What about a slightly different approach:

  • Each time we simplify something whose operands are not simplified, add that instruction to a SimplifiedRootsSet and SimplifiedRootsWorklist.
  • Each time we actually count an instruction's cost, add it to a set of cost-counted instructions, and increment a count of uses for each of its operands.
  • Here, for each instruction in the worklist and for each operand of that instruction: if the operand is in the set of cost-counted instructions, is not in the SimplifiedRootsSet, and has a use count of zero, subtract its cost, decrement the use counts of all of its operands, add it to the SimplifiedRootsSet, and add it to the worklist.

This should only start from the instructions we can't forward-simplify (things that SCEV simplifies, for example), and walk recursively up their operands, GC-ing everything whose use count in the loop reaches zero as a consequence.

This seems like it should be faster in the common cases than walking the user lists of every instruction? As far as I can tell, we at most walk every operand twice (once to increment, once to decrement)....
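
A minimal sketch of that use-count GC, again with toy stand-in types (Inst, DCEState, and pruneDeadCost are illustrative assumptions, not the patch's actual code):

    #include <unordered_map>
    #include <unordered_set>
    #include <vector>

    struct Inst {
      std::vector<Inst *> Operands;
      unsigned Cost = 1;
    };

    // State accumulated during the forward walk of the loop body:
    //  - an instruction that folds even though its operands did not goes
    //    into SimplifiedRoots and Worklist;
    //  - an instruction whose cost is actually counted goes into
    //    CostCounted and bumps UseCount for each of its operands.
    struct DCEState {
      std::unordered_set<Inst *> SimplifiedRoots;
      std::vector<Inst *> Worklist;
      std::unordered_set<Inst *> CostCounted;
      std::unordered_map<Inst *, unsigned> UseCount;
      unsigned Cost = 0;
    };

    // Backward GC run after the walk.  A cost-counted operand whose live
    // use count is zero only fed simplified or dead users, so it is
    // retroactively dead: un-count it and propagate to its own operands.
    void pruneDeadCost(DCEState &S) {
      while (!S.Worklist.empty()) {
        Inst *I = S.Worklist.back();
        S.Worklist.pop_back();
        for (Inst *Op : I->Operands) {
          if (!S.CostCounted.count(Op) || S.SimplifiedRoots.count(Op) ||
              S.UseCount[Op] != 0)
            continue;
          S.Cost -= Op->Cost;
          for (Inst *OpOp : Op->Operands) {
            auto It = S.UseCount.find(OpOp);
            if (It != S.UseCount.end() && It->second > 0)
              --It->second; // Op's use of OpOp dies with Op
          }
          S.SimplifiedRoots.insert(Op);
          S.Worklist.push_back(Op);
        }
      }
    }

In this sketch each operand edge is touched at most twice - once to increment during the walk and once to decrement during the GC - matching the cost estimate above, and no user lists are traversed.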

mzolotukhin marked an inline comment as done.
  • Rebase on master.
  • Avoid visiting all basic blocks and all their instructions - instead, use a worklist.

Hi Chandler,

I chose a hybrid approach, somewhere between how it was implemented and what you suggested.

I did use a worklist to avoid one more pass over all the instructions, but I decided not to count uses. I had two reasons in mind:

  1. It would work incorrectly (or would require some special handling) with uses outside the loop - we wouldn't count them and thus might think an instruction is dead while it's not.
  2. It would additionally complicate the code.

What do you think? Does it sound reasonable?

Michael
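
For comparison, a sketch of the hybrid described above - worklist-driven, but consulting real user lists instead of keeping use counts (toy types again; InLoop, DeadOrSimplified, and the rest are illustrative assumptions, not the patch's code):

    #include <unordered_set>
    #include <vector>

    struct Inst {
      std::vector<Inst *> Operands;
      std::vector<Inst *> Users; // may include users outside the loop
      unsigned Cost = 1;
    };

    // Worklist-driven DCE estimate without use counts: an instruction is
    // dead only if every one of its users is already known to be dead or
    // simplified.  Walking Users costs a traversal, but it naturally
    // handles uses outside the loop (reason 1 above): such a user is
    // never in DeadOrSimplified, so the instruction stays live.
    unsigned pruneDeadCost(std::vector<Inst *> Worklist, // simplified roots
                           const std::unordered_set<Inst *> &InLoop,
                           std::unordered_set<Inst *> &DeadOrSimplified,
                           unsigned Cost) {
      while (!Worklist.empty()) {
        Inst *I = Worklist.back();
        Worklist.pop_back();
        for (Inst *Op : I->Operands) {
          if (!InLoop.count(Op) || DeadOrSimplified.count(Op))
            continue;
          bool AllUsersDead = true;
          for (Inst *U : Op->Users)
            if (!DeadOrSimplified.count(U))
              AllUsersDead = false;
          if (AllUsersDead) {
            Cost -= Op->Cost; // its cost was counted during the walk
            DeadOrSimplified.insert(Op);
            Worklist.push_back(Op);
          }
        }
      }
      return Cost;
    }

The price is the walk over Op->Users, which is the trade-off weighed against use counting in the exchange above.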