This is an archive of the discontinued LLVM Phabricator instance.

Randomly outline code for cold regions
Needs Review · Public

Authored by hiraditya on Jul 28 2019, 6:33 AM.

Details

Reviewers

vsk
tejohnson
sebpop
bcahoon
ronl

Summary

In the absence of profile information, it is possible to take advantage of hot-cold splitting
if the code base is mostly cold, e.g. hot/cold < 0.2/0.8 (the 80:20 rule). This is non-deterministic
outlining, but it can be beneficial in several instances where profile data is not available and determinism is not important.

Putting the diff here in case there is interest in having aggressive hot-cold-splitting.

Event Timeline

hiraditya created this revision. Jul 28 2019, 6:33 AM

Herald added a project: Restricted Project. Jul 28 2019, 6:33 AM

Herald added a subscriber: llvm-commits.

hiraditya added reviewers: sebpop, bcahoon, ronl. Jul 28 2019, 6:34 AM

I'm not sure it is a good idea to have non-deterministic optimizations.

It's worth noting that you could make this deterministic by seeding an RNG with something like the function name.
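The suggestion above can be sketched roughly as follows. This is not code from the patch: `fnv1a` and `pickBlockAsCold` are hypothetical names, and mixing in a block index is an assumption made for illustration. It hashes the function name with FNV-1a (instead of `std::hash`, whose values are not guaranteed to match across standard library implementations) and seeds `std::mt19937_64`, whose output sequence is fixed by the C++ standard, so the same input always produces the same decision.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <string_view>

// FNV-1a 64-bit hash: the result depends only on the input bytes, so it is
// stable across compilers, platforms, and runs.
inline std::uint64_t fnv1a(std::string_view s) {
  std::uint64_t h = 14695981039346656037ULL; // FNV offset basis
  for (unsigned char c : s) {
    h ^= c;
    h *= 1099511628211ULL; // FNV prime
  }
  return h;
}

// Hypothetical helper: decide whether block number `blockIndex` of the
// function named `fnName` is treated as cold, with probability of roughly
// `percent`/100. The engine is seeded from the name and index only, so the
// answer is reproducible from build to build.
inline bool pickBlockAsCold(std::string_view fnName, unsigned blockIndex,
                            unsigned percent) {
  assert(percent <= 100 && "percent must be in [0, 100]");
  std::mt19937_64 rng(fnv1a(fnName) + blockIndex);
  return rng() % 100 < percent;
}
```

With this scheme, changing one function does not change the decisions made for any other function, which a single global seed cannot guarantee.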

In D65376#1603895, @lebedev.ri wrote:

I'm not sure it is a good idea to have non-deterministic optimizations.

I agree with @lebedev.ri, I don't think non-deterministic optimizations are good as it would make triaging and debugging really difficult. I see @pcc has a follow on suggestion to make this deterministic. At the least that would be needed. But is it useful to do this optimization randomly in practice?

We can certainly make this deterministic with a sequence of lists. Thanks for the suggestion, I'll update this soon.

In D65376#1604460, @tejohnson wrote:

In D65376#1603895, @lebedev.ri wrote:

I'm not sure it is a good idea to have non-deterministic optimizations.

I agree with @lebedev.ri, I don't think non-deterministic optimizations are good as it would make triaging and debugging really difficult. I see @pcc has a follow on suggestion to make this deterministic. At the least that would be needed. But is it useful to do this optimization randomly in practice?

I guess it depends on the workload. For user-facing apps, for example, most of the code is cold, and aggressive outlining reduces program load time.

Updated to have a fixed sequence of 'random' numbers.

I think that this is probably too risky. I've seen that splitting out the wrong block can introduce major performance regressions (up to and including stack overflow).

As a future direction for code reordering in llvm, a pass that operates on the machine instruction level may have more promise. At that level, there's a reduced/eliminated code size penalty and landingpads can be moved easily. There's less pass ordering flexibility, but experiments seem to show that late splitting works better anyway.

In D65376#1607799, @vsk wrote:

I think that this is probably too risky. I've seen that splitting out the wrong block can introduce major performance regressions (up to and including stack overflow).

As a future direction for code reordering in llvm, a pass that operates on the machine instruction level may have more promise. At that level, there's a reduced/eliminated code size penalty and landingpads can be moved easily. There's less pass ordering flexibility, but experiments seem to show that late splitting works better anyway.

Agreed, this is mostly intended for specific workloads where wrong outlining doesn't incur major overhead, e.g., user-facing applications and applications where initial load time is a major issue. We can keep this disabled by default so that only interested parties can use it. For some cases it can provide major startup improvements.

As a future direction for code reordering in llvm, a pass that operates on the machine instruction level may have more promise. At that level, there's a reduced/eliminated code size penalty and landingpads can be moved easily. There's less pass ordering flexibility, but experiments seem to show that late splitting works better anyway.

I agree with the reasoning here; that is probably why all previous hot-cold splitting was done at the RTL level. The problem with machine-level passes like these is porting the optimization to every backend, plus maintenance. In my experience, IR-level passes scale well even when they don't give optimal results.

Do you actually need truly random, non-deterministic transform?
Can it instead be solved by non-non-deterministic outlining, based on some heuristic?

In D65376#1612295, @lebedev.ri wrote:

Do you actually need truly random, non-deterministic transform?

The idea is to outline more aggressively, because for some applications most of the code is cold, getting profile information is difficult, and the goal is to improve startup time. Non-determinism is not required; it would be great to have a deterministic heuristic. I'm open to ideas.

Can it instead be solved by non-non-deterministic outlining, based on some heuristic?

There is some heuristic in the static-profile analysis but it does not outline as much.

WIP

Herald added a subscriber: mehdi_amini. Oct 21 2019, 7:04 AM

Rebase

Revision Contents

Path                                            Size
llvm/lib/Transforms/IPO/HotColdSplitting.cpp    48 lines

Diff 212502

llvm/lib/Transforms/IPO/HotColdSplitting.cpp

[78 lines not shown]

 static cl::opt<bool> EnableStaticAnalyis("hot-cold-static-analysis",
                                          cl::init(true), cl::Hidden);

 static cl::opt<int>
     SplittingThreshold("hotcoldsplit-threshold", cl::init(2), cl::Hidden,
                        cl::desc("Base penalty for splitting cold code (as a "
                                 "multiple of TCC_Basic)"));

+static cl::opt<bool> EnableRandomOutlining("hot-cold-randomly-outline-cold-code",
+                                           cl::init(false), cl::Hidden);
+
+static cl::opt<bool> EnableDeterministicRandomOutlining("hot-cold-deterministic-random-outline-code",
+                                                        cl::init(true), cl::Hidden);
+
 namespace {

 /// A sequence of basic blocks.
 ///
 /// A 0-sized SmallVector is slightly cheaper to move than a std::vector.
 using BlockSequence = SmallVector<BasicBlock *, 0>;

 // Same as blockEndsInUnreachable in CodeGen/BranchFolding.cpp. Do not modify
 // this function unless you modify the MBB version as well.
 //
 /// A no successor, non-return block probably ends in unreachable and is cold.
 /// Also consider a block that ends in an indirect branch to be a return block,
 /// since many targets use plain indirect branches to return.
 bool blockEndsInUnreachable(const BasicBlock &BB) {
   if (!succ_empty(&BB))
     return false;
   if (BB.empty())
     return true;
   const Instruction *I = BB.getTerminator();
   return !(isa<ReturnInst>(I) || isa<IndirectBrInst>(I));
 }

+static Optional<std::pair<const BasicBlock *, const BasicBlock *>>
+hasSingleSuccAndPred(const BasicBlock &BB) {
+  auto FirstSucc = succ_begin(&BB);
+  auto SuccEnd = succ_end(&BB);
+  if (FirstSucc == SuccEnd || ++FirstSucc != SuccEnd)
+    return None;
+  auto FirstPred = pred_begin(&BB);
+  auto PredEnd = pred_end(&BB);
+  if (FirstPred == PredEnd || ++FirstPred != PredEnd)
+    return None;
+  return Optional<std::pair<const BasicBlock *, const BasicBlock *>>(
+      {*succ_begin(&BB), *pred_begin(&BB)});
+}
+
 bool unlikelyExecuted(BasicBlock &BB) {
   // Exception handling blocks are unlikely executed.
   if (BB.isEHPad() || isa<ResumeInst>(BB.getTerminator()))
     return true;

   // The block is cold if it calls/invokes a cold function. However, do not
   // mark sanitizer traps as cold.
   for (Instruction &I : BB)
     if (auto CS = CallSite(&I))
       if (CS.hasFnAttr(Attribute::Cold) && !CS->getMetadata("nosanitize"))
         return true;

   // The block is cold if it has an unreachable terminator, unless it's
   // preceded by a call to a (possibly warm) noreturn call (e.g. longjmp).
   if (blockEndsInUnreachable(BB)) {
     if (auto *CI =
             dyn_cast_or_null<CallInst>(BB.getTerminator()->getPrevNode()))
       if (CI->hasFnAttr(Attribute::NoReturn))
         return false;
     return true;
   }

-  return false;
+  if (!EnableRandomOutlining)
+    return false;
+
+  // Game of chance hypothesis: when most (80:20 rule) of the code is cold,
+  // a randomly selected basic block has a higher chance of being cold. Do this
+  // if the basic block is part of a diamond structure (if-else).
+  // NB: This causes non-deterministic outlining.
+  if (auto SuccPred = hasSingleSuccAndPred(BB)) {
+    auto Succ = SuccPred->first;
+    int SuccPredCount = 0;
+    for (auto X : predecessors(Succ))
+      ++SuccPredCount;
+    if (SuccPredCount < 2)
+      return false;
+    auto Pred = SuccPred->second;
+    int PredSuccCount = 0;
+    for (auto X : successors(Pred))
+      ++PredSuccCount;
+    if (PredSuccCount < 2)
+      return false;
+    double Chance = (std::rand() % PredSuccCount) / double(PredSuccCount);
+    // Threshold can be decreased to enable aggressive outlining.
+    return (Chance < 0.5);
+  }
 }

 /// Check whether it's safe to outline \p BB.
 static bool mayExtractBlock(const BasicBlock &BB) {
   // EH pads are unsafe to outline because doing so breaks EH type tables. It
   // follows that invoke instructions cannot be extracted, because CodeExtractor
   // requires unwind destinations to be within the extraction region.
   //

[184 lines not shown]

     LLVM_DEBUG(llvm::dbgs() << "Outlining in " << F.getName() << "\n");
     Changed |= outlineColdRegions(F, HasProfileSummary);
   }
   return Changed;
 }

 bool HotColdSplittingLegacyPass::runOnModule(Module &M) {
+  // Initialize rng for deterministic aggressive outlining.
+  if (EnableDeterministicRandomOutlining)
+    std::srand(1);
   if (skipModule(M))
     return false;
   ProfileSummaryInfo *PSI =
       &getAnalysis<ProfileSummaryInfoWrapperPass>().getPSI();
   auto GTTI = [this](Function &F) -> TargetTransformInfo & {
     return this->getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
   };
   auto GBFI = [this](Function &F) {

[59 lines not shown]
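
To make the diamond check and the `Chance < 0.5` threshold in the `unlikelyExecuted` change easier to reason about, here is a minimal standalone sketch that does not depend on LLVM. `ToyCFG`, `isDiamondArm`, and `outcomesBelowHalf` are hypothetical names introduced only for illustration; the conditions mirror what the diff appears to test (exactly one predecessor and one successor, a predecessor with at least two successors, a successor with at least two predecessors).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy control-flow graph: succs[i] lists the successors of block i.
struct ToyCFG {
  std::vector<std::vector<int>> succs;

  // Predecessors of block b, derived by scanning every successor list.
  std::vector<int> preds(int b) const {
    std::vector<int> result;
    for (std::size_t p = 0; p < succs.size(); ++p)
      for (int s : succs[p])
        if (s == b)
          result.push_back(static_cast<int>(p));
    return result;
  }
};

// True if block b looks like one arm of an if/else diamond: it has exactly
// one predecessor and one successor, the predecessor branches (at least two
// successors), and the successor is a join point (at least two predecessors).
inline bool isDiamondArm(const ToyCFG &g, int b) {
  const std::vector<int> ps = g.preds(b);
  const std::vector<int> &ss = g.succs[b];
  if (ps.size() != 1 || ss.size() != 1)
    return false;
  return g.succs[ps[0]].size() >= 2 && g.preds(ss[0]).size() >= 2;
}

// Counts the values r in [0, n) with r / double(n) < 0.5, i.e. how many
// outcomes of `rand() % n` would mark the block as cold under the threshold.
inline int outcomesBelowHalf(int n) {
  assert(n > 0 && "n must be positive");
  int count = 0;
  for (int r = 0; r < n; ++r)
    if (r / double(n) < 0.5)
      ++count;
  return count;
}
```

Assuming `rand() % n` is roughly uniform, a block under a two-way branch is picked with probability 1/2 (1 outcome of 2), while a block under a three-way branch is picked with probability 2/3 (2 outcomes of 3). The sketch returns false for blocks that are not diamond arms; in the diff as shown here, that path of `unlikelyExecuted` has no return statement.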