This is an archive of the discontinued LLVM Phabricator instance.

[CaptureTracking] Avoid long compilation time on large basic blocks
ClosedPublic

Authored by bruno on Jan 15 2015, 3:06 PM.

Download Raw Diff

Details

Reviewers

nicholas
hfinkel

Commits

rG7900ef891fc8: [CaptureTracking] Avoid long compilation time on large basic blocks
rL240560: [CaptureTracking] Avoid long compilation time on large basic blocks

Summary

Problem

For large basic blocks CaptureTracking may become very expensive: for each memory related instruction considered, walks top down the BB to check ordering and dominance among two instructions. In my testcase there are 81782 instructions in a BB, which leads to compilation time of ~12min w/ 'opt -O1'. This triggers in the flow: DeadStoreElimination -> MepDepAnalysis -> CaptureTracking.

Solution

To fix the compile time bloat, the patch changes 'shouldExplore' to do two things:

Add a special case to compute the ordering between instructions when both are in the same basic block. It avoids the use of two expensive functions in this scenario: 'dominates' and 'isPotentiallyReachable'.
Limit the search by using a threshold=1000 on the number of instructions to search. This limit presented no measurable regressions on the test-suite.

This lead to a compile time reduction of 'opt -O1' from ~12min to 1s.

Diff Detail

Repository: rL LLVM

Event Timeline

bruno updated this revision to Diff 18265.Jan 15 2015, 3:06 PM

bruno retitled this revision from to [CaptureTracking] Avoid long compilation time on large basic blocks.

bruno updated this object.

bruno edited the test plan for this revision. (Show Details)

bruno added reviewers: nicholas, hfinkel.

bruno set the repository for this revision to rL LLVM.

bruno added a subscriber: Unknown Object (MLST).

Drive by comment - this might be completely wrong.

I wouldn't expect that dominates or isPotentiallyReachable would be expensive when querying two instructions in the same BB. At worst, it's a linear scan of the BB. Even at 81782 instructions, that's pretty fast. How many time is this getting called?

It would seem reasonable to push the search limit down into isPotentiallyReachable. Not dominates unfortunately, that has to be an exact answer.

However, do you actually need to an instruction level dominance answer when the instructions are in the same BB? Wouldn't isPotentiallyReachable take care of that?

You might be able to push your limit down into isPotentiallyReachable and just not call dominates when the def and use are in the same BB.

You might also consider looking at why those loops are slow. Are there micro-optimizations (i.e. loop structure, iterator use, etc..) which would help a lot? For example, you could consider advancing both iterators on every iteration to help with cases where one is near the end of the BB. (Not sure this is actually a good idea!)

Taking a step back, you could consider grouping uses by basic block, and then trying to share work across uses in the same block.

Having said all of that, I don't actually have any *objection* to the current approach. :)

Hi Philip,

Drive by comment - this might be completely wrong.

Thanks for the comments :-)

I wouldn't expect that dominates or isPotentiallyReachable would be expensive when querying two instructions in the same BB. At worst, it's a linear scan of the BB. Even at 81782 instructions, that's pretty fast. How many time is this getting called?

I don't have exact numbers, but many calls pile up and if you tweak the 'Limit' value in the patch you see how fast it grows. I may clarify that in the comments, but the problem is not that dominates or isPotentiallyReachable are expensive per-se, but the piled up number of calls to those greatly increase the compile time if we have to walk a large number of instructions.

It would seem reasonable to push the search limit down into isPotentiallyReachable. Not dominates unfortunately, that has to be an exact answer.

Because 'dominates' needs an exact answer, it returns TooExpensive instead of 'true' or 'false'.

However, do you actually need to an instruction level dominance answer when the instructions are in the same BB? Wouldn't isPotentiallyReachable take care of that?

You might be able to push your limit down into isPotentiallyReachable and just not call dominates when the def and use are in the same BB.

Note that what I'm actually trying to achieve here is a fast path for:

if (BeforeHere != I && DT->dominates(BeforeHere, I) &&
    !isPotentiallyReachable(I, BeforeHere, DT))

Hence, if 'BeforeHere' dominates 'I', we don't actually need to do any BB walk like what it's done in 'isPotentiallyReachable', and early prune the search in case we're on a entry block or don't have successors. Otherwise we must continue the search, because the BB may be in a Loop, etc.

You might also consider looking at why those loops are slow. Are there micro-optimizations (i.e. loop structure, iterator use, etc..) which would help a lot? For example, you could consider advancing both iterators on every iteration to help with cases where one is near the end of the BB. (Not sure this is actually a good idea!)

Taking a step back, you could consider grouping uses by basic block, and then trying to share work across uses in the same block.

This looks like a really good idea but that would also require more intrusive changes to the current implementation without the actual guarantee that it yield benefits.

Having said all of that, I don't actually have any *objection* to the current approach. :)

Thanks again! :D

Ping!

We hit a similar issue internally.

I reported it in pr22348, and your patch does speed it up.

BTW, have you looked at any options for speeding up dominance check for two
instructions in the same BB? CCing Daniel in case he knows a nice algorithm
:-)

I wonder if there might be a better way to solve this problem. For example, what if, for large blocks, we scan them only once, number the instructions, and then use that ordering to answer the intra-block dominance queries instead of scanning over and over again?

lib/Analysis/CaptureTracking.cpp
69 ↗	(On Diff #18265)	I'd much rather this be a command-line opt (1000 as a default is likely fine), than a static constant.
83 ↗	(On Diff #18265)	"Early compute" sounds odd. Just say "Compute"

The bottleneck here is 'shouldExplore', which is called by llvm::PointerMayBeCaptured. This function
is used by FunctionAttrs, DeadStoreElimination, BasicAliasAnalysis and llvm::PointerMayBeCapturedBefore;
this last one is also called by AliasAnalysis and InlineFunction.

The simple and direct solution where we number the instructions ins't easy to cope with the current
way capture tracking is used in the codebase (and please correct me if I'm wrong):

(1) we call functions to use capture tracking and don't have a way to easily store/update a cache containing
the numbered insns unless we do that in each CaptureTracker user mentioned above.
(2) an alternative to (1) is to transform it in a pass which internally contains a cache, where we would need
an interface for invalidation whenever a BB changes.
(3) it isn't clear to me whether numbering a BB each time it gets invalidated by one of CaptureTracker users
will totally remove the compile time issue for large basic blocks. Although I'm inclined to believe it would,
I would like to get numbers first :-)

I would be happy to try out this approach to see where it gets, but given the steps it takes I unfortunately
don't have time to try this out right now. The best I can do now is to use this patch as a short term solution.
I believe we can still keep track for an enhancement under PR22348.

Updated a new version of the patch after Hal's comments.

The bottleneck here is 'shouldExplore', which is called by llvm::PointerMayBeCaptured. This function is used by FunctionAttrs, DeadStoreElimination, BasicAliasAnalysis and llvm::PointerMayBeCapturedBefore; this last one is also called by AliasAnalysis and InlineFunction.

I don't disagree that it might be difficult to re-use the cache inbetween calls to PointerMayBeCaptured (especially as the IR is being mutated -- we'd need to think about how this was laid out and what needed to actively invalidate it), but I thought the expensive part was really the repeated BB scans *within* PointerMayBeCaptured. That could be solved by having a local cache, right?

Hi,

Sorry for the delay folks. Resuming this thread with more information
and patches...

I don't disagree that it might be difficult to re-use the cache inbetween calls to PointerMayBeCaptured (especially as the IR is being mutated -- we'd need to think about how this was laid out and what needed to actively invalidate it), but I thought the expensive part was really the repeated BB scans *within* PointerMayBeCaptured. That could be solved by having a local cache, right?

Currently, PointerMayBeCaptured scans the BB the number of times equal
to the number of uses of 'BeforeHere', which is currently capped at 20
and bails out with Tracker->tooManyUses(). The main issue here is the
number of calls *to* PointerMayBeCaptured times the basic block scan.
In the testcase with 82k instructions, PointerMayBeCaptured is called
130k times, leading to 'shouldExplore' taking 527k runs, this
currently takes ~12min.

I tried the approach where I locally (within PointerMayBeCaptured )
number the instructions in the basic block using a DenseMap to cache
instruction positions/numbers. I experimented with two approaches:

(1) Build the local cache once in the beginning and consult the cache
to gather position of instructions => Takes ~4min.
(2) Build the cache incrementally every time we need to scan an
unexplored part of the BB => Takes ~2min.

I've attached the implementation for (2) in case there's any interest.

Considering the reduction from 12min to ~2min, this is a good gain,
but still looks to me like a long time to have a user waiting for the
compiler to finish, specially under -O1/-O2. I still believe the
limited scan approach is a better fix.

Let me know what you think.

Cheers,

capture-densemap.patch8 KBDownload

Hi Hal,

Updated the patch with your suggestions + the local densemap cache. It reduces the compile time from 12min to 2min. FTR, this approach will still check the entire BB and won't change capture tracker precision. Although I still believe we should be able to reduce the compile further under O1/O2 while paying the cost of losing precision, this discussion can be postponed to a forthcoming patch.

Thanks,

LGTM, thanks for all the work you've put into this!

This revision is now accepted and ready to land.Jun 22 2015, 3:01 PM

Closed by commit rL240560: [CaptureTracking] Avoid long compilation time on large basic blocks (authored by bruno). · Explain WhyJun 24 2015, 10:57 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

llvm/

trunk/

include/

llvm/

Analysis/

CFG.h

11 lines

lib/

Analysis/

CFG.cpp

16 lines

CaptureTracking.cpp

127 lines

Diff 28368

llvm/trunk/include/llvm/Analysis/CFG.h

	Show First 20 Lines • Show All 72 Lines • ▼ Show 20 Lines
	///			///
	/// Determine whether there is a path from From to To within a single function.			/// Determine whether there is a path from From to To within a single function.
	/// Returns false only if we can prove that once 'From' has been reached then			/// Returns false only if we can prove that once 'From' has been reached then
	/// 'To' can not be executed. Conservatively returns true.			/// 'To' can not be executed. Conservatively returns true.
	bool isPotentiallyReachable(const BasicBlock From, const BasicBlock To,			bool isPotentiallyReachable(const BasicBlock From, const BasicBlock To,
	const DominatorTree *DT = nullptr,			const DominatorTree *DT = nullptr,
	const LoopInfo *LI = nullptr);			const LoopInfo *LI = nullptr);

				/// \brief Determine whether there is at least one path from a block in
				/// 'Worklist' to 'StopBB', returning true if uncertain.
				///
				/// Determine whether there is a path from at least one block in Worklist to
				/// StopBB within a single function. Returns false only if we can prove that
				/// once any block in 'Worklist' has been reached then 'StopBB' can not be
				/// executed. Conservatively returns true.
				bool isPotentiallyReachableFromMany(SmallVectorImpl<BasicBlock *> &Worklist,
				BasicBlock *StopBB,
				const DominatorTree *DT = nullptr,
				const LoopInfo *LI = nullptr);
	} // End llvm namespace			} // End llvm namespace

	#endif			#endif

llvm/trunk/lib/Analysis/CFG.cpp

Show First 20 Lines • Show All 120 Lines • ▼ Show 20 Lines
// True if there is a loop which contains both BB1 and BB2.		// True if there is a loop which contains both BB1 and BB2.
static bool loopContainsBoth(const LoopInfo *LI,		static bool loopContainsBoth(const LoopInfo *LI,
const BasicBlock BB1, const BasicBlock BB2) {		const BasicBlock BB1, const BasicBlock BB2) {
const Loop *L1 = getOutermostLoop(LI, BB1);		const Loop *L1 = getOutermostLoop(LI, BB1);
const Loop *L2 = getOutermostLoop(LI, BB2);		const Loop *L2 = getOutermostLoop(LI, BB2);
return L1 != nullptr && L1 == L2;		return L1 != nullptr && L1 == L2;
}		}

static bool isPotentiallyReachableInner(SmallVectorImpl<BasicBlock *> &Worklist,		bool llvm::isPotentiallyReachableFromMany(
BasicBlock *StopBB,		SmallVectorImpl<BasicBlock > &Worklist, BasicBlock StopBB,
const DominatorTree *DT,		const DominatorTree DT, const LoopInfo LI) {
const LoopInfo *LI) {
// When the stop block is unreachable, it's dominated from everywhere,		// When the stop block is unreachable, it's dominated from everywhere,
// regardless of whether there's a path between the two blocks.		// regardless of whether there's a path between the two blocks.
if (DT && !DT->isReachableFromEntry(StopBB))		if (DT && !DT->isReachableFromEntry(StopBB))
DT = nullptr;		DT = nullptr;

// Limit the number of blocks we visit. The goal is to avoid run-away compile		// Limit the number of blocks we visit. The goal is to avoid run-away compile
// times on large CFGs without hampering sensible code. Arbitrarily chosen.		// times on large CFGs without hampering sensible code. Arbitrarily chosen.
unsigned Limit = 32;		unsigned Limit = 32;
Show All 33 Lines
bool llvm::isPotentiallyReachable(const BasicBlock A, const BasicBlock B,		bool llvm::isPotentiallyReachable(const BasicBlock A, const BasicBlock B,
const DominatorTree DT, const LoopInfo LI) {		const DominatorTree DT, const LoopInfo LI) {
assert(A->getParent() == B->getParent() &&		assert(A->getParent() == B->getParent() &&
"This analysis is function-local!");		"This analysis is function-local!");

SmallVector<BasicBlock*, 32> Worklist;		SmallVector<BasicBlock*, 32> Worklist;
Worklist.push_back(const_cast<BasicBlock*>(A));		Worklist.push_back(const_cast<BasicBlock*>(A));

return isPotentiallyReachableInner(Worklist, const_cast<BasicBlock*>(B),		return isPotentiallyReachableFromMany(Worklist, const_cast<BasicBlock *>(B),
DT, LI);		DT, LI);
}		}

bool llvm::isPotentiallyReachable(const Instruction A, const Instruction B,		bool llvm::isPotentiallyReachable(const Instruction A, const Instruction B,
const DominatorTree DT, const LoopInfo LI) {		const DominatorTree DT, const LoopInfo LI) {
assert(A->getParent()->getParent() == B->getParent()->getParent() &&		assert(A->getParent()->getParent() == B->getParent()->getParent() &&
"This analysis is function-local!");		"This analysis is function-local!");

SmallVector<BasicBlock*, 32> Worklist;		SmallVector<BasicBlock*, 32> Worklist;
Show All 33 Lines	if (A->getParent() == B->getParent()) {
Worklist.push_back(const_cast<BasicBlock*>(A->getParent()));		Worklist.push_back(const_cast<BasicBlock*>(A->getParent()));
}		}

if (A->getParent() == &A->getParent()->getParent()->getEntryBlock())		if (A->getParent() == &A->getParent()->getParent()->getEntryBlock())
return true;		return true;
if (B->getParent() == &A->getParent()->getParent()->getEntryBlock())		if (B->getParent() == &A->getParent()->getParent()->getEntryBlock())
return false;		return false;

return isPotentiallyReachableInner(Worklist,		return isPotentiallyReachableFromMany(
const_cast<BasicBlock*>(B->getParent()),		Worklist, const_cast<BasicBlock *>(B->getParent()), DT, LI);
DT, LI);
}		}

llvm/trunk/lib/Analysis/CaptureTracking.cpp

Show First 20 Lines • Show All 46 Lines • ▼ Show 20 Lines	bool captured(const Use *U) override {
return true;		return true;
}		}

bool ReturnCaptures;		bool ReturnCaptures;

bool Captured;		bool Captured;
};		};

		struct NumberedInstCache {
		SmallDenseMap<const Instruction *, unsigned, 32> NumberedInsts;
		BasicBlock::const_iterator LastInstFound;
		unsigned LastInstPos;
		const BasicBlock *BB;

		NumberedInstCache(const BasicBlock *BasicB) : LastInstPos(0), BB(BasicB) {
		LastInstFound = BB->end();
		}

		/// \brief Find the first instruction 'A' or 'B' in 'BB'. Number out
		/// instruction while walking 'BB'.
		const Instruction find(const Instruction A, const Instruction *B) {
		const Instruction *Inst = nullptr;
		assert(!(LastInstFound == BB->end() && LastInstPos != 0) &&
		"Instruction supposed to be in NumberedInsts");

		// Start the search with the instruction found in the last lookup round.
		auto II = BB->begin();
		auto IE = BB->end();
		if (LastInstFound != IE)
		II = std::next(LastInstFound);

		// Number all instructions up to the point where we find 'A' or 'B'.
		for (++LastInstPos; II != IE; ++II, ++LastInstPos) {
		Inst = cast<Instruction>(II);
		NumberedInsts[Inst] = LastInstPos;
		if (Inst == A \|\| Inst == B)
		break;
		}

		assert(II != IE && "Instruction not found?");
		LastInstFound = II;
		return Inst;
		}

		/// \brief Find out whether 'A' dominates 'B', meaning whether 'A'
		/// comes before 'B' in 'BB'. This is a simplification that considers
		/// cached instruction positions and ignores other basic blocks, being
		/// only relevant to compare relative instructions positions inside 'BB'.
		bool dominates(const Instruction A, const Instruction B) {
		assert(A->getParent() == B->getParent() &&
		"Instructions must be in the same basic block!");

		unsigned NA = NumberedInsts.lookup(A);
		unsigned NB = NumberedInsts.lookup(B);
		if (NA && NB)
		return NA < NB;
		if (NA)
		return true;
		if (NB)
		return false;

		return A == find(A, B);
		}
		};

/// Only find pointer captures which happen before the given instruction. Uses		/// Only find pointer captures which happen before the given instruction. Uses
/// the dominator tree to determine whether one instruction is before another.		/// the dominator tree to determine whether one instruction is before another.
/// Only support the case where the Value is defined in the same basic block		/// Only support the case where the Value is defined in the same basic block
/// as the given instruction and the use.		/// as the given instruction and the use.
struct CapturesBefore : public CaptureTracker {		struct CapturesBefore : public CaptureTracker {

CapturesBefore(bool ReturnCaptures, const Instruction I, DominatorTree DT,		CapturesBefore(bool ReturnCaptures, const Instruction I, DominatorTree DT,
bool IncludeI)		bool IncludeI)
: BeforeHere(I), DT(DT), ReturnCaptures(ReturnCaptures),		: LocalInstCache(I->getParent()), BeforeHere(I), DT(DT),
IncludeI(IncludeI), Captured(false) {}		ReturnCaptures(ReturnCaptures), IncludeI(IncludeI), Captured(false) {}

void tooManyUses() override { Captured = true; }		void tooManyUses() override { Captured = true; }

bool shouldExplore(const Use *U) override {		bool isSafeToPrune(Instruction *I) {
Instruction *I = cast<Instruction>(U->getUser());
if (BeforeHere == I && !IncludeI)
return false;

BasicBlock *BB = I->getParent();		BasicBlock *BB = I->getParent();
// We explore this usage only if the usage can reach "BeforeHere".		// We explore this usage only if the usage can reach "BeforeHere".
// If use is not reachable from entry, there is no need to explore.		// If use is not reachable from entry, there is no need to explore.
if (BeforeHere != I && !DT->isReachableFromEntry(BB))		if (BeforeHere != I && !DT->isReachableFromEntry(BB))
		return true;

		// Compute the case where both instructions are inside the same basic
		// block. Since instructions in the same BB as BeforeHere are numbered in
		// 'LocalInstCache', avoid using 'dominates' and 'isPotentiallyReachable'
		// which are very expensive for large basic blocks.
		if (BB == BeforeHere->getParent()) {
		// 'I' dominates 'BeforeHere' => not safe to prune.
		//
		// The value defined by an invoke dominates an instruction only if it
		// dominates every instruction in UseBB. A PHI is dominated only if
		// the instruction dominates every possible use in the UseBB. Since
		// UseBB == BB, avoid pruning.
		if (isa<InvokeInst>(BeforeHere) \|\| isa<PHINode>(I) \|\| I == BeforeHere)
		return false;
		if (!LocalInstCache.dominates(BeforeHere, I))
		return false;

		// 'BeforeHere' comes before 'I', it's safe to prune if we also
		// guarantee that 'I' never reaches 'BeforeHere' through a back-edge or
		// by its successors, i.e, prune if:
		//
		// (1) BB is an entry block or have no sucessors.
		// (2) There's no path coming back through BB sucessors.
		if (BB == &BB->getParent()->getEntryBlock() \|\|
		!BB->getTerminator()->getNumSuccessors())
		return true;

		SmallVector<BasicBlock*, 32> Worklist;
		Worklist.append(succ_begin(BB), succ_end(BB));
		if (!isPotentiallyReachableFromMany(Worklist, BB, DT))
		return true;

return false;		return false;
		}

// If the value is defined in the same basic block as use and BeforeHere,		// If the value is defined in the same basic block as use and BeforeHere,
// there is no need to explore the use if BeforeHere dominates use.		// there is no need to explore the use if BeforeHere dominates use.
// Check whether there is a path from I to BeforeHere.		// Check whether there is a path from I to BeforeHere.
if (BeforeHere != I && DT->dominates(BeforeHere, I) &&		if (BeforeHere != I && DT->dominates(BeforeHere, I) &&
!isPotentiallyReachable(I, BeforeHere, DT))		!isPotentiallyReachable(I, BeforeHere, DT))
return false;
return true;		return true;
}

bool captured(const Use *U) override {
if (isa<ReturnInst>(U->getUser()) && !ReturnCaptures)
return false;		return false;
		}

		bool shouldExplore(const Use *U) override {
Instruction *I = cast<Instruction>(U->getUser());		Instruction *I = cast<Instruction>(U->getUser());

if (BeforeHere == I && !IncludeI)		if (BeforeHere == I && !IncludeI)
return false;		return false;

BasicBlock *BB = I->getParent();		if (isSafeToPrune(I))
// Same logic as in shouldExplore.
if (BeforeHere != I && !DT->isReachableFromEntry(BB))
return false;		return false;
if (BeforeHere != I && DT->dominates(BeforeHere, I) &&
!isPotentiallyReachable(I, BeforeHere, DT))		return true;
		}

		bool captured(const Use *U) override {
		if (isa<ReturnInst>(U->getUser()) && !ReturnCaptures)
return false;		return false;

		if (!shouldExplore(U))
		return false;

Captured = true;		Captured = true;
return true;		return true;
}		}

		NumberedInstCache LocalInstCache;
const Instruction *BeforeHere;		const Instruction *BeforeHere;
DominatorTree *DT;		DominatorTree *DT;

bool ReturnCaptures;		bool ReturnCaptures;
bool IncludeI;		bool IncludeI;

bool Captured;		bool Captured;
};		};
▲ Show 20 Lines • Show All 158 Lines • Show Last 20 Lines