This is an archive of the discontinued LLVM Phabricator instance.

[Unroll] Improve the brute force loop unroll estimate by propagating through PHI nodes across iterations.
Closed, Public

Authored by chandlerc on Aug 2 2015, 12:06 AM.


Commits

rG87adb7a2e2fb: [Unroll] Improve the brute force loop unroll estimate by propagating through…
rL243900: [Unroll] Improve the brute force loop unroll estimate by propagating

Summary

This patch teaches the new advanced loop unrolling heuristics to propagate
constants into the loop from the preheader and around the backedge after
simulating each iteration. This lets us brute force solve simple recurrences
that aren't modeled effectively by SCEV. It also makes it more clear why we
need to process the loop in-order rather than bottom-up which might otherwise
make much more sense (for example, for DCE).

This came out of an attempt I'm making to develop a principled way to account
for dead code in the unroll estimation. When I implemented
a forward-propagating version of that it produced incorrect results due to
failing to propagate *cost* between loop iterations through the PHI nodes, and
it occurred to me we really should at least propagate simplifications across
those edges, and it is quite easy thanks to the loop being in canonical and
LCSSA form.
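
As a rough illustration of the propagation described in the summary, here is a minimal standalone sketch. It does not use LLVM's API: the `HeaderPhi`, `OrInst`, and `simulate` names and the string-keyed toy IR are hypothetical, chosen only to mirror the shape of the patch (collect the header PHI inputs, clear the map, re-seed it, then walk the body in order).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy IR (not LLVM's): every value is identified by name.
struct HeaderPhi {
  std::string Name;         // the PHI in the loop header
  bool HasEntryConst;       // is the preheader input a known constant?
  std::uint64_t EntryConst; // that constant, if HasEntryConst is true
  std::string Backedge;     // name of the value arriving from the latch
};

struct OrInst { // Dest = Src | Imm
  std::string Dest;
  std::string Src;
  std::uint64_t Imm;
};

using ConstMap = std::map<std::string, std::uint64_t>;

// Simulates TripCount iterations and returns the constants known during the
// last simulated iteration.
ConstMap simulate(const std::vector<HeaderPhi> &Phis,
                  const std::vector<OrInst> &Body, unsigned TripCount) {
  ConstMap Simplified;
  std::vector<std::pair<std::string, std::uint64_t>> Inputs;
  for (unsigned Iter = 0; Iter < TripCount; ++Iter) {
    // Collect the PHI inputs before clearing the map: backedge values are
    // looked up in the previous iteration's results.
    for (const HeaderPhi &P : Phis) {
      assert(!P.Backedge.empty() && "header PHI needs a latch input");
      if (Iter == 0) {
        if (P.HasEntryConst)
          Inputs.push_back({P.Name, P.EntryConst});
        continue;
      }
      auto It = Simplified.find(P.Backedge);
      if (It != Simplified.end())
        Inputs.push_back({P.Name, It->second});
    }

    // Clear and re-seed the map for this iteration.
    Simplified.clear();
    for (const auto &In : Inputs)
      Simplified.insert(In);
    Inputs.clear();

    // Walk the body in program order so each operand is already resolved
    // when its user is visited.
    for (const OrInst &I : Body) {
      auto It = Simplified.find(I.Src);
      if (It == Simplified.end())
        continue; // operand is not a known constant; nothing to fold
      const std::uint64_t Folded = It->second | I.Imm;
      Simplified[I.Dest] = Folded;
    }
  }
  return Simplified;
}
```

Encoding the test's loop as one PHI `x0` (entry constant 0, backedge `x2`) with the body `x1 = x0 | 1; x2 = x1 | 2`, ten simulated iterations leave `x2` mapped to 3, which matches the `ret i64 3` the test checks for. If the entry value is not a constant, the map stays empty and nothing is folded.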

Diff Detail

Repository: rL LLVM

Event Timeline

chandlerc updated this revision to Diff 31198. Aug 2 2015, 12:06 AM

chandlerc retitled this revision to [Unroll] Improve the brute force loop unroll estimate by propagating through PHI nodes across iterations.

chandlerc updated this object.

chandlerc added a reviewer: mzolotukhin.

chandlerc added a subscriber: llvm-commits.

Hi Chandler,

That's really cool! Looks good to me.

Michael

lib/Transforms/Scalar/LoopUnrollPass.cpp
551–553 ↗	(On Diff #31198)	Shouldn't we just bail out in this case instead of crashing with an assert? In other words, is it possible to construct a loop that isn't simplified but still reaches this point?
test/Transforms/LoopUnroll/full-unroll-heuristics-phi-prop.ll
14–22 ↗	(On Diff #31198)	Maybe just leave the first 2-3 of them to make the test shorter? For example: %x1 = or i64 %x0, 1; %x2 = or i64 %x1, 2

mzolotukhin accepted this revision. Aug 2 2015, 11:32 PM

mzolotukhin edited edge metadata.

This revision is now accepted and ready to land. Aug 2 2015, 11:32 PM

Thanks, will submit with a minimal test case. See below for detailed replies.

lib/Transforms/Scalar/LoopUnrollPass.cpp
551–553 ↗	(On Diff #31198)	My intent is for that to be the responsibility of the caller. In this case, a LoopPass operates under these specific invariants. This is why we require and preserve LoopSimplify and LCSSA above. So currently, no, it isn't possible without a bug somewhere.
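
To make that split of responsibilities concrete, here is a minimal sketch. `ToyLoop` and `estimateCost` are hypothetical and are not the LLVM code. Structural invariants that the caller is expected to guarantee are asserted, while conditions that can legitimately occur on valid input (an unknown or too-large trip count) are reported through the return value.

```cpp
#include <cassert>

// Hypothetical stand-in for a loop and the properties the analysis cares about.
struct ToyLoop {
  bool InSimplifyForm; // caller-guaranteed structural invariant
  bool InLCSSAForm;    // caller-guaranteed structural invariant
  unsigned TripCount;  // 0 means "unknown"
  unsigned BodySize;   // instruction count of one iteration
};

// Returns false when no estimate is produced. On success, writes a placeholder
// cost (trip count times body size) to CostOut.
bool estimateCost(const ToyLoop &L, unsigned MaxTripCount, unsigned &CostOut) {
  // Violations here indicate a bug in the caller, not an unusual input.
  assert(L.InSimplifyForm && "Must put loop into normal form first.");
  assert(L.InLCSSAForm && "Must have loops in LCSSA form.");

  // Expected analysis failures are handled by bailing out.
  if (L.TripCount == 0 || L.TripCount > MaxTripCount)
    return false;

  CostOut = L.TripCount * L.BodySize;
  return true;
}
```

The design choice is the usual one: an assert documents and checks a contract in debug builds, whereas a bail-out path is for inputs the function is actually expected to see.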
test/Transforms/LoopUnroll/full-unroll-heuristics-phi-prop.ll
14–22 ↗	(On Diff #31198)	I mostly wanted it to be more clear that these couldn't just be simplified via SCEV. The add and icmp are simplified already for example. I'll see if there is a useful smaller number that still clearly fails without my change.
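
A small sketch of the distinction, using only the values from the test's IR. The induction variable is an affine add recurrence, so its value at any iteration has a closed form. The `or` chain in this sketch is instead evaluated by stepping through the iterations one at a time. The function names are hypothetical.

```cpp
#include <cstdint>

// Closed form of an affine add recurrence {Start,+,Step} at iteration I.
std::uint64_t addRecAt(std::uint64_t Start, std::uint64_t Step,
                       std::uint64_t I) {
  return Start + Step * I;
}

// Value of %x2 on the last of TripCount iterations, where %x0 starts at Init
// and the loop body is: %x1 = or %x0, 1; %x2 = or %x1, 2.
// Computed by brute force, one iteration at a time.
std::uint64_t orChainAfter(std::uint64_t Init, unsigned TripCount) {
  std::uint64_t X0 = Init;
  std::uint64_t X2 = Init;
  for (unsigned I = 0; I < TripCount; ++I) {
    const std::uint64_t X1 = X0 | 1;
    X2 = X1 | 2;
    X0 = X2; // value carried around the backedge into the header PHI
  }
  return X2;
}
```

For the test loop, `addRecAt(0, 1, 9) + 1` gives the value of `%inc` on the final iteration (10, which satisfies the exit condition), and `orChainAfter(0, 10)` gives the live-out value 3.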

Closed by commit rL243900: [Unroll] Improve the brute force loop unroll estimate by propagating (authored by chandlerc). Aug 3 2015, 1:33 PM

This revision was automatically updated to reflect the committed changes.

Thanks!

lib/Transforms/Scalar/LoopUnrollPass.cpp
551–553 ↗	(On Diff #31198)	Sounds good, just wanted to verify this.

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp	46 lines
llvm/trunk/test/Transforms/LoopUnroll/full-unroll-heuristics-phi-prop.ll	23 lines

Diff 31259

llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp

(131 lines not shown)	public:

bool runOnLoop(Loop *L, LPPassManager &LPM) override;		bool runOnLoop(Loop *L, LPPassManager &LPM) override;

/// This transformation requires natural loop information & requires that		/// This transformation requires natural loop information & requires that
/// loop preheaders be inserted into the CFG...		/// loop preheaders be inserted into the CFG...
///		///
void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<AssumptionCacheTracker>();		AU.addRequired<AssumptionCacheTracker>();
		AU.addRequired<DominatorTreeWrapperPass>();
AU.addRequired<LoopInfoWrapperPass>();		AU.addRequired<LoopInfoWrapperPass>();
AU.addPreserved<LoopInfoWrapperPass>();		AU.addPreserved<LoopInfoWrapperPass>();
AU.addRequiredID(LoopSimplifyID);		AU.addRequiredID(LoopSimplifyID);
AU.addPreservedID(LoopSimplifyID);		AU.addPreservedID(LoopSimplifyID);
AU.addRequiredID(LCSSAID);		AU.addRequiredID(LCSSAID);
AU.addPreservedID(LCSSAID);		AU.addPreservedID(LCSSAID);
AU.addRequired<ScalarEvolution>();		AU.addRequired<ScalarEvolution>();
AU.addPreserved<ScalarEvolution>();		AU.addPreserved<ScalarEvolution>();
(82 lines not shown)	bool canUnrollCompletely(Loop *L, unsigned Threshold,
uint64_t UnrolledCost, uint64_t RolledDynamicCost);		uint64_t UnrolledCost, uint64_t RolledDynamicCost);
};		};
}		}

char LoopUnroll::ID = 0;		char LoopUnroll::ID = 0;
INITIALIZE_PASS_BEGIN(LoopUnroll, "loop-unroll", "Unroll loops", false, false)		INITIALIZE_PASS_BEGIN(LoopUnroll, "loop-unroll", "Unroll loops", false, false)
INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)		INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)		INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
		INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)		INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
INITIALIZE_PASS_DEPENDENCY(LoopSimplify)		INITIALIZE_PASS_DEPENDENCY(LoopSimplify)
INITIALIZE_PASS_DEPENDENCY(LCSSA)		INITIALIZE_PASS_DEPENDENCY(LCSSA)
INITIALIZE_PASS_DEPENDENCY(ScalarEvolution)		INITIALIZE_PASS_DEPENDENCY(ScalarEvolution)
INITIALIZE_PASS_END(LoopUnroll, "loop-unroll", "Unroll loops", false, false)		INITIALIZE_PASS_END(LoopUnroll, "loop-unroll", "Unroll loops", false, false)

Pass *llvm::createLoopUnrollPass(int Threshold, int Count, int AllowPartial,		Pass *llvm::createLoopUnrollPass(int Threshold, int Count, int AllowPartial,
int Runtime) {		int Runtime) {
(265 lines not shown)
/// dynamic cost we mean that we won't count costs of blocks that are known not		/// dynamic cost we mean that we won't count costs of blocks that are known not
/// to be executed (i.e. if we have a branch in the loop and we know that at the		/// to be executed (i.e. if we have a branch in the loop and we know that at the
/// given iteration its condition would be resolved to true, we won't add up the		/// given iteration its condition would be resolved to true, we won't add up the
/// cost of the 'false'-block).		/// cost of the 'false'-block).
/// \returns Optional value, holding the RolledDynamicCost and UnrolledCost. If		/// \returns Optional value, holding the RolledDynamicCost and UnrolledCost. If
/// the analysis failed (no benefits expected from the unrolling, or the loop is		/// the analysis failed (no benefits expected from the unrolling, or the loop is
/// too big to analyze), the returned value is None.		/// too big to analyze), the returned value is None.
Optional<EstimatedUnrollCost>		Optional<EstimatedUnrollCost>
analyzeLoopUnrollCost(const Loop *L, unsigned TripCount, ScalarEvolution &SE,		analyzeLoopUnrollCost(const Loop *L, unsigned TripCount, DominatorTree &DT,
const TargetTransformInfo &TTI,		ScalarEvolution &SE, const TargetTransformInfo &TTI,
unsigned MaxUnrolledLoopSize) {		unsigned MaxUnrolledLoopSize) {
// We want to be able to scale offsets by the trip count and add more offsets		// We want to be able to scale offsets by the trip count and add more offsets
// to them without checking for overflows, and we already don't want to		// to them without checking for overflows, and we already don't want to
// analyze massive trip counts, so we force the max to be reasonably small.		// analyze massive trip counts, so we force the max to be reasonably small.
assert(UnrollMaxIterationsCountToAnalyze < (INT_MAX / 2) &&		assert(UnrollMaxIterationsCountToAnalyze < (INT_MAX / 2) &&
"The unroll iterations max is too large!");		"The unroll iterations max is too large!");

// Don't simulate loops with a big or unknown tripcount		// Don't simulate loops with a big or unknown tripcount
if (!UnrollMaxIterationsCountToAnalyze || !TripCount ||		if (!UnrollMaxIterationsCountToAnalyze || !TripCount ||
TripCount > UnrollMaxIterationsCountToAnalyze)		TripCount > UnrollMaxIterationsCountToAnalyze)
return None;		return None;

SmallSetVector<BasicBlock *, 16> BBWorklist;		SmallSetVector<BasicBlock *, 16> BBWorklist;
DenseMap<Value *, Constant *> SimplifiedValues;		DenseMap<Value *, Constant *> SimplifiedValues;
		SmallVector<std::pair<Value *, Constant *>, 4> SimplifiedInputValues;

// The estimated cost of the unrolled form of the loop. We try to estimate		// The estimated cost of the unrolled form of the loop. We try to estimate
// this by simplifying as much as we can while computing the estimate.		// this by simplifying as much as we can while computing the estimate.
unsigned UnrolledCost = 0;		unsigned UnrolledCost = 0;
// We also track the estimated dynamic (that is, actually executed) cost in		// We also track the estimated dynamic (that is, actually executed) cost in
// the rolled form. This helps identify cases when the savings from unrolling		// the rolled form. This helps identify cases when the savings from unrolling
// aren't just exposing dead control flows, but actual reduced dynamic		// aren't just exposing dead control flows, but actual reduced dynamic
// instructions due to the simplifications which we expect to occur after		// instructions due to the simplifications which we expect to occur after
// unrolling.		// unrolling.
unsigned RolledDynamicCost = 0;		unsigned RolledDynamicCost = 0;

		// Ensure that we don't violate the loop structure invariants relied on by
		// this analysis.
		assert(L->isLoopSimplifyForm() && "Must put loop into normal form first.");
		assert(L->isLCSSAForm(DT) &&
		"Must have loops in LCSSA form to track live-out values.");

DEBUG(dbgs() << "Starting LoopUnroll profitability analysis...\n");		DEBUG(dbgs() << "Starting LoopUnroll profitability analysis...\n");

// Simulate execution of each iteration of the loop counting instructions,		// Simulate execution of each iteration of the loop counting instructions,
// which would be simplified.		// which would be simplified.
// Since the same load will take different values on different iterations,		// Since the same load will take different values on different iterations,
// we literally have to go through all loop's iterations.		// we literally have to go through all loop's iterations.
for (unsigned Iteration = 0; Iteration < TripCount; ++Iteration) {		for (unsigned Iteration = 0; Iteration < TripCount; ++Iteration) {
DEBUG(dbgs() << " Analyzing iteration " << Iteration << "\n");		DEBUG(dbgs() << " Analyzing iteration " << Iteration << "\n");

		// Prepare for the iteration by collecting any simplified entry or backedge
		// inputs.
		for (Instruction &I : *L->getHeader()) {
		auto *PHI = dyn_cast<PHINode>(&I);
		if (!PHI)
		break;

		// The loop header PHI nodes must have exactly two input: one from the
		// loop preheader and one from the loop latch.
		assert(
		PHI->getNumIncomingValues() == 2 &&
		"Must have an incoming value only for the preheader and the latch.");

		Value *V = PHI->getIncomingValueForBlock(
		Iteration == 0 ? L->getLoopPreheader() : L->getLoopLatch());
		Constant *C = dyn_cast<Constant>(V);
		if (Iteration != 0 && !C)
		C = SimplifiedValues.lookup(V);
		if (C)
		SimplifiedInputValues.push_back({PHI, C});
		}

		// Now clear and re-populate the map for the next iteration.
SimplifiedValues.clear();		SimplifiedValues.clear();
		while (!SimplifiedInputValues.empty())
		SimplifiedValues.insert(SimplifiedInputValues.pop_back_val());

UnrolledInstAnalyzer Analyzer(Iteration, SimplifiedValues, L, SE);		UnrolledInstAnalyzer Analyzer(Iteration, SimplifiedValues, L, SE);

BBWorklist.clear();		BBWorklist.clear();
BBWorklist.insert(L->getHeader());		BBWorklist.insert(L->getHeader());
// Note that we must not cache the size, this loop grows the worklist.		// Note that we must not cache the size, this loop grows the worklist.
for (unsigned Idx = 0; Idx != BBWorklist.size(); ++Idx) {		for (unsigned Idx = 0; Idx != BBWorklist.size(); ++Idx) {
BasicBlock *BB = BBWorklist[Idx];		BasicBlock *BB = BBWorklist[Idx];

(281 lines not shown)
}		}

bool LoopUnroll::runOnLoop(Loop *L, LPPassManager &LPM) {		bool LoopUnroll::runOnLoop(Loop *L, LPPassManager &LPM) {
if (skipOptnoneFunction(L))		if (skipOptnoneFunction(L))
return false;		return false;

Function &F = *L->getHeader()->getParent();		Function &F = *L->getHeader()->getParent();

		auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
LoopInfo *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();		LoopInfo *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
ScalarEvolution *SE = &getAnalysis<ScalarEvolution>();		ScalarEvolution *SE = &getAnalysis<ScalarEvolution>();
const TargetTransformInfo &TTI =		const TargetTransformInfo &TTI =
getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);		getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
auto &AC = getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);		auto &AC = getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);

BasicBlock *Header = L->getHeader();		BasicBlock *Header = L->getHeader();
DEBUG(dbgs() << "Loop Unroll: F[" << Header->getParent()->getName()		DEBUG(dbgs() << "Loop Unroll: F[" << Header->getParent()->getName()
(65 lines not shown)	if (TripCount && Count == TripCount) {
// If the loop is really small, we don't need to run an expensive analysis.		// If the loop is really small, we don't need to run an expensive analysis.
if (canUnrollCompletely(L, Threshold, 100, DynamicCostSavingsDiscount,		if (canUnrollCompletely(L, Threshold, 100, DynamicCostSavingsDiscount,
UnrolledSize, UnrolledSize)) {		UnrolledSize, UnrolledSize)) {
Unrolling = Full;		Unrolling = Full;
} else {		} else {
// The loop isn't that small, but we still can fully unroll it if that		// The loop isn't that small, but we still can fully unroll it if that
// helps to remove a significant number of instructions.		// helps to remove a significant number of instructions.
// To check that, run additional analysis on the loop.		// To check that, run additional analysis on the loop.
if (Optional<EstimatedUnrollCost> Cost = analyzeLoopUnrollCost(		if (Optional<EstimatedUnrollCost> Cost =
L, TripCount, *SE, TTI, Threshold + DynamicCostSavingsDiscount))		analyzeLoopUnrollCost(L, TripCount, DT, *SE, TTI,
		Threshold + DynamicCostSavingsDiscount))
if (canUnrollCompletely(L, Threshold, PercentDynamicCostSavedThreshold,		if (canUnrollCompletely(L, Threshold, PercentDynamicCostSavedThreshold,
DynamicCostSavingsDiscount, Cost->UnrolledCost,		DynamicCostSavingsDiscount, Cost->UnrolledCost,
Cost->RolledDynamicCost)) {		Cost->RolledDynamicCost)) {
Unrolling = Full;		Unrolling = Full;
}		}
}		}
} else if (TripCount && Count < TripCount) {		} else if (TripCount && Count < TripCount) {
Unrolling = Partial;		Unrolling = Partial;
(86 lines not shown)

llvm/trunk/test/Transforms/LoopUnroll/full-unroll-heuristics-phi-prop.ll

				; RUN: opt < %s -S -loop-unroll -unroll-max-iteration-count-to-analyze=100 -unroll-dynamic-cost-savings-discount=1000 -unroll-threshold=10 -unroll-percent-dynamic-cost-saved-threshold=50 | FileCheck %s
				target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

				define i64 @propagate_loop_phis() {
				; CHECK-LABEL: @propagate_loop_phis(
				; CHECK-NOT: br i1
				; CHECK: ret i64 3
				entry:
				br label %loop

				loop:
				%iv = phi i64 [ 0, %entry ], [ %inc, %loop ]
				%x0 = phi i64 [ 0, %entry ], [ %x2, %loop ]
				%x1 = or i64 %x0, 1
				%x2 = or i64 %x1, 2
				%inc = add nuw nsw i64 %iv, 1
				%cond = icmp sge i64 %inc, 10
				br i1 %cond, label %loop.end, label %loop

				loop.end:
				%x.lcssa = phi i64 [ %x2, %loop ]
				ret i64 %x.lcssa
				}