This is an archive of the discontinued LLVM Phabricator instance.

[CaptureTracking] Avoid long compilation time on large basic blocks
ClosedPublic

Authored by bruno on Jan 15 2015, 3:06 PM.

Details

Summary

Problem

For large basic blocks, CaptureTracking may become very expensive: for each memory-related instruction considered, it walks the BB top-down to check ordering and dominance between two instructions. In my testcase there are 81782 instructions in a single BB, which leads to a compilation time of ~12min with 'opt -O1'. This triggers in the flow: DeadStoreElimination -> MemDepAnalysis -> CaptureTracking.

Solution

To fix the compile time bloat, the patch changes 'shouldExplore' to do two things:

  1. Add a special case to compute the ordering between instructions when both are in the same basic block. It avoids the use of two expensive functions in this scenario: 'dominates' and 'isPotentiallyReachable'.
  2. Limit the search to a threshold of 1000 instructions. This limit showed no measurable regressions on the test-suite.

This led to a reduction in 'opt -O1' compile time from ~12min to 1s.
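For illustration, the same-BB fast path with the limit could look roughly like the sketch below. This is not the attached diff; the function name, the OrderResult enum, and the overall shape are made up here, and only the idea (a single bounded top-down walk that gives up past the threshold) reflects the patch.

#include <cassert>

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/ErrorHandling.h"

using namespace llvm;

enum class OrderResult { Before, After, Unknown };

// Walk the block top-down, visiting at most Limit instructions, and report
// which of the two (distinct) instructions appears first.  Hitting the limit
// returns Unknown so the caller can fall back to the conservative answer.
static OrderResult orderInSameBB(const Instruction *BeforeHere,
                                 const Instruction *I, unsigned Limit = 1000) {
  const BasicBlock *BB = BeforeHere->getParent();
  assert(BB == I->getParent() && BeforeHere != I &&
         "expected two distinct instructions in the same block");

  unsigned Scanned = 0;
  for (const Instruction &Inst : *BB) {
    if (++Scanned > Limit)
      return OrderResult::Unknown;   // Block too large: give up, stay safe.
    if (&Inst == BeforeHere)
      return OrderResult::Before;    // BeforeHere encountered first.
    if (&Inst == I)
      return OrderResult::After;     // I encountered first.
  }
  llvm_unreachable("both instructions should live in BB");
}

Presumably a caller treats Unknown the same way as the conservative "may be captured" answer, so the threshold only costs precision on oversized blocks.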

Diff Detail

Event Timeline

bruno updated this revision to Diff 18265.Jan 15 2015, 3:06 PM
bruno retitled this revision from to [CaptureTracking] Avoid long compilation time on large basic blocks.
bruno updated this object.
bruno edited the test plan for this revision. (Show Details)
bruno added reviewers: nicholas, hfinkel.
bruno set the repository for this revision to rL LLVM.
bruno added a subscriber: Unknown Object (MLST).
reames added a subscriber: reames.Jan 15 2015, 5:39 PM

Drive by comment - this might be completely wrong.

I wouldn't expect that dominates or isPotentiallyReachable would be expensive when querying two instructions in the same BB. At worst, it's a linear scan of the BB. Even at 81782 instructions, that's pretty fast. How many times is this getting called?

It would seem reasonable to push the search limit down into isPotentiallyReachable. Not dominates, unfortunately; that has to be an exact answer.

However, do you actually need an instruction-level dominance answer when the instructions are in the same BB? Wouldn't isPotentiallyReachable take care of that?

You might be able to push your limit down into isPotentiallyReachable and just not call dominates when the def and use are in the same BB.

You might also consider looking at why those loops are slow. Are there micro-optimizations (e.g. loop structure, iterator use, etc.) which would help a lot? For example, you could consider advancing both iterators on every iteration to help with cases where one is near the end of the BB. (Not sure this is actually a good idea!)
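One possible reading of the "advance both iterators" idea, sketched here purely for illustration (the helper name and exact shape are not from the review): walk forward from both instructions in lockstep, so the scan stops after the smaller of the two forward walks instead of depending on where the earlier instruction sits in the block.

#include <cassert>

#include "llvm/IR/Instruction.h"

using namespace llvm;

// Returns true if A appears before B in their (shared) basic block, walking
// forward from both instructions at the same time.
static bool comesBeforeLockstep(const Instruction *A, const Instruction *B) {
  assert(A->getParent() == B->getParent() && A != B &&
         "expected two distinct instructions in the same block");
  const Instruction *CurA = A, *CurB = B;
  while (true) {
    if (!CurA)
      return false;          // A's walk fell off the end first: B is earlier.
    if (CurA == B)
      return true;           // Walking forward from A reached B: A is earlier.
    CurA = CurA->getNextNode();

    if (!CurB)
      return true;           // B's walk fell off the end first: A is earlier.
    if (CurB == A)
      return false;          // Walking forward from B reached A: B is earlier.
    CurB = CurB->getNextNode();
  }
}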

Taking a step back, you could consider grouping uses by basic block, and then trying to share work across uses in the same block.

Having said all of that, I don't actually have any *objection* to the current approach. :)

bruno added a comment.Jan 16 2015, 9:42 AM

Hi Philip,

Drive by comment - this might be completely wrong.

Thanks for the comments :-)

I wouldn't expect that dominates or isPotentiallyReachable would be expensive when querying two instructions in the same BB. At worst, it's a linear scan of the BB. Even at 81782 instructions, that's pretty fast. How many times is this getting called?

I don't have exact numbers, but many calls pile up, and if you tweak the 'Limit' value in the patch you can see how fast it grows. I can clarify that in the comments, but the problem is not that dominates or isPotentiallyReachable are expensive per se; it's that the accumulated number of calls to them greatly increases the compile time when each call has to walk a large number of instructions.

It would seem reasonable to push the search limit down into isPotentiallyReachable. Not dominates, unfortunately; that has to be an exact answer.

Because 'dominates' needs an exact answer, the patch returns TooExpensive instead of 'true' or 'false' in that case.

However, do you actually need an instruction-level dominance answer when the instructions are in the same BB? Wouldn't isPotentiallyReachable take care of that?

You might be able to push your limit down into isPotentiallyReachable and just not call dominates when the def and use are in the same BB.

Note that what I'm actually trying to achieve here is a fast path for:

if (BeforeHere != I && DT->dominates(BeforeHere, I) &&
    !isPotentiallyReachable(I, BeforeHere, DT))

Hence, if 'BeforeHere' dominates 'I', we don't actually need to do any BB walk like the one done in 'isPotentiallyReachable', and we can prune the search early when we're in the entry block or the block has no successors. Otherwise we must continue the search, because the BB may be in a loop, etc.
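To make that pruning condition concrete, here is a minimal sketch of the check being described (the helper name is made up; only the reasoning is taken from the comment above): once 'BeforeHere' is known to precede 'I' inside the block, 'I' can only get back to 'BeforeHere' through an edge that re-enters the block, so a block with no successors, or the entry block (which may not have predecessors), ends the search immediately.

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// True if, once execution has moved past 'BeforeHere' in its block, control
// can never return to it: either the block never exits to a successor, or it
// is the entry block, which cannot be re-entered (it has no predecessors).
static bool canPruneSameBlockSearch(const Instruction *BeforeHere) {
  const BasicBlock *BB = BeforeHere->getParent();
  return succ_empty(BB) || BB == &BB->getParent()->getEntryBlock();
}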

You might also consider looking at why those loops are slow. Are there micro-optimizations (e.g. loop structure, iterator use, etc.) which would help a lot? For example, you could consider advancing both iterators on every iteration to help with cases where one is near the end of the BB. (Not sure this is actually a good idea!)

Taking a step back, you could consider grouping uses by basic block, and then trying to share work across uses in the same block.

This looks like a really good idea, but it would also require more intrusive changes to the current implementation, with no guarantee that it would actually yield benefits.

Having said all of that, I don't actually have any *objection* to the current approach. :)

Thanks again! :D

We hit a similar issue internally.

I reported it in pr22348, and your patch does speed it up.

BTW, have you looked at any options for speeding up the dominance check for two instructions in the same BB? CCing Daniel in case he knows a nice algorithm :-)

hfinkel edited edge metadata.Jan 27 2015, 1:47 PM

I wonder if there might be a better way to solve this problem. For example, what if, for large blocks, we scan them only once, number the instructions, and then use that ordering to answer the intra-block dominance queries instead of scanning over and over again?
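As a rough illustration of that suggestion (the class and its names are invented here, not part of any posted patch): number the instructions in one pass and answer every subsequent intra-block ordering query with two map lookups.

#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Number every instruction of a block once; later ordering queries become
// O(1) map lookups instead of repeated linear scans.
class BlockNumbering {
  DenseMap<const Instruction *, unsigned> Positions;

public:
  explicit BlockNumbering(const BasicBlock &BB) {
    unsigned N = 0;
    for (const Instruction &I : BB)
      Positions[&I] = N++;          // Single linear scan of the block.
  }

  // True if A appears strictly before B in the block.
  bool comesBefore(const Instruction *A, const Instruction *B) const {
    return Positions.lookup(A) < Positions.lookup(B);
  }
};

The open question, discussed below, is where such a cache should live and who invalidates it when the block changes.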

lib/Analysis/CaptureTracking.cpp
74

I'd much rather this be a command-line opt (1000 as a default is likely fine) than a static constant.
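For reference, a hidden cl::opt along these lines would look roughly like the following; the option name shown here is only a placeholder, not necessarily what the patch uses.

#include "llvm/Support/CommandLine.h"

static llvm::cl::opt<unsigned> CaptureSearchThreshold(
    "capture-search-threshold", llvm::cl::Hidden, llvm::cl::init(1000),
    llvm::cl::desc("Maximum number of instructions to scan per basic block "
                   "in CaptureTracking"));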

85

"Early compute" sounds odd. Just say "Compute"

bruno updated this revision to Diff 20394.Feb 20 2015, 6:39 AM
bruno updated this object.
bruno edited edge metadata.
bruno removed rL LLVM as the repository for this revision.
bruno added a subscriber: bob.wilson.

The bottleneck here is 'shouldExplore', which is called by llvm::PointerMayBeCaptured. This function
is used by FunctionAttrs, DeadStoreElimination, BasicAliasAnalysis and llvm::PointerMayBeCapturedBefore;
this last one is also called by AliasAnalysis and InlineFunction.

The simple and direct solution where we number the instructions isn't easy to fit into the way capture tracking is currently used in the codebase (and please correct me if I'm wrong):

(1) capture tracking is used through plain function calls, so there is no easy place to store/update a cache of the numbered instructions unless we do that in each CaptureTracker user mentioned above.
(2) an alternative to (1) is to turn it into a pass that internally contains a cache, in which case we would need an interface for invalidation whenever a BB changes.
(3) it isn't clear to me whether numbering a BB each time it gets invalidated by one of the CaptureTracker users will totally remove the compile-time issue for large basic blocks. Although I'm inclined to believe it would, I would like to get numbers first :-)

I would be happy to try out this approach to see where it goes, but given the steps it requires, I unfortunately don't have time to do that right now. The best I can do for now is to use this patch as a short-term solution. I believe we can still track a follow-up enhancement under PR22348.

Uploaded a new version of the patch after Hal's comments.

The bottleneck here is 'shouldExplore', which is called by llvm::PointerMayBeCaptured. This function is used by FunctionAttrs, DeadStoreElimination, BasicAliasAnalysis and llvm::PointerMayBeCapturedBefore; this last one is also called by AliasAnalysis and InlineFunction.

I don't disagree that it might be difficult to re-use the cache between calls to PointerMayBeCaptured (especially as the IR is being mutated -- we'd need to think about how this was laid out and what needed to actively invalidate it), but I thought the expensive part was really the repeated BB scans *within* PointerMayBeCaptured. That could be solved by having a local cache, right?

bruno added a comment.May 5 2015, 11:57 AM

Hi,

Sorry for the delay, folks. Resuming this thread with more information and patches...

I don't disagree that it might be difficult to re-use the cache between calls to PointerMayBeCaptured (especially as the IR is being mutated -- we'd need to think about how this was laid out and what needed to actively invalidate it), but I thought the expensive part was really the repeated BB scans *within* PointerMayBeCaptured. That could be solved by having a local cache, right?

Currently, PointerMayBeCaptured scans the BB a number of times equal to the number of uses of 'BeforeHere', which is capped at 20 before bailing out with Tracker->tooManyUses(). The main issue here is the number of calls *to* PointerMayBeCaptured multiplied by the cost of each basic block scan. In the testcase with 82k instructions, PointerMayBeCaptured is called 130k times, leading to 527k runs of 'shouldExplore'; this currently takes ~12min.

I tried the approach where I locally (within PointerMayBeCaptured) number the instructions in the basic block, using a DenseMap to cache instruction positions/numbers. I experimented with two approaches:

(1) Build the local cache once at the beginning and consult it to look up instruction positions => Takes ~4min.
(2) Build the cache incrementally every time we need to scan an unexplored part of the BB => Takes ~2min.

I've attached the implementation for (2) in case there's any interest.
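A minimal sketch of what approach (2) amounts to, written here only to make the idea concrete (the class and its names are illustrative; this is not the attached implementation): extend the position map lazily, so across all queries each instruction in the block is numbered at most once.

#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/ErrorHandling.h"

using namespace llvm;

// Incrementally numbers a basic block: a query only walks the part of the
// block that no earlier query has visited.
class LazyBlockNumbering {
  DenseMap<const Instruction *, unsigned> Positions;
  const Instruction *NextToNumber;   // First instruction not yet numbered.
  unsigned NextIndex = 0;

public:
  explicit LazyBlockNumbering(const BasicBlock &BB)
      : NextToNumber(BB.empty() ? nullptr : &BB.front()) {}

  // Return I's position, extending the numbering only as far as needed.
  unsigned positionOf(const Instruction *I) {
    auto It = Positions.find(I);
    if (It != Positions.end())
      return It->second;             // Numbered by a previous query.
    while (NextToNumber) {
      Positions[NextToNumber] = NextIndex++;
      const Instruction *Cur = NextToNumber;
      NextToNumber = NextToNumber->getNextNode();
      if (Cur == I)
        return NextIndex - 1;        // Stop as soon as we reach I.
    }
    llvm_unreachable("instruction is not in the numbered block");
  }

  bool comesBefore(const Instruction *A, const Instruction *B) {
    return positionOf(A) < positionOf(B);
  }
};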

Considering the reduction from 12min to ~2min this is a good gain, but it still looks to me like a long time to have a user waiting for the compiler to finish, especially under -O1/-O2. I still believe the limited-scan approach is the better fix.

Let me know what you think.

Cheers,

bruno updated this revision to Diff 27593.Jun 12 2015, 1:46 PM
bruno set the repository for this revision to rL LLVM.

Hi Hal,

Updated the patch with your suggestions plus the local DenseMap cache. It reduces the compile time from 12min to 2min. FTR, this approach still checks the entire BB and doesn't change capture tracker precision. Although I still believe we should be able to reduce the compile time further under -O1/-O2 at the cost of some precision, that discussion can be postponed to a forthcoming patch.

Thanks,

hfinkel accepted this revision.Jun 22 2015, 3:01 PM
hfinkel edited edge metadata.

LGTM, thanks for all the work you've put into this!

This revision is now accepted and ready to land.Jun 22 2015, 3:01 PM
This revision was automatically updated to reflect the committed changes.