This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO] Add switches to control prelink/postlink inline separately
Abandoned · Public

Authored by wenlei on Feb 5 2021, 9:08 PM.

Details

Reviewers

wmi
hoy
davidxl

Summary

With CSSPGO, to maximize the benefit of global top-down context-sensitive inlining, we are experimenting with shifting more inlining from LTO prelink to postlink, from cgscc inlining to sample loader inlining. This change adds switches to turn off prelink/postlink cgscc/sample-loader inlining separately. Note that this differs from -fno-inline: the new switches only alter the pass pipeline and do not change the inline attributes on functions.

We saw better performance on some benchmarks with prelto inlining turned off, but there are regressions elsewhere too. Both switches currently default to on, so nothing changes by default, even with CSSPGO. I'm exposing the switches to help further tuning.
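As a rough illustration of what the two switches gate, here is a toy model of the inlining-related steps in one compile phase. It is a sketch only: InlineSwitches, buildInlineSteps and the step names are hypothetical and are not part of LLVM. The assumption, taken from the diff in this revision, is that profile annotation and mandatory inlining keep running while only the sample loader's inlining and the regular CGSCC inliner are skipped.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the two switches proposed in this revision.
// In the patch they are cl::opt<bool> flags that default to true.
struct InlineSwitches {
  bool EnableCGSCCInline = true;         // models -enable-cgscc-inline
  bool EnableSampleProfileInline = true; // models -enable-sample-profile-inline
};

// Returns the names of the inlining-related steps that would run in one
// phase. This is a toy model, not the LLVM pass builder.
std::vector<std::string> buildInlineSteps(const InlineSwitches &S) {
  std::vector<std::string> Steps;
  // The sample loader still runs so that profile annotation happens;
  // only its inlining part is gated.
  Steps.push_back("sample-loader-annotate");
  if (S.EnableSampleProfileInline)
    Steps.push_back("sample-loader-inline");
  // Mandatory inlining is left alone; only the regular CGSCC inliner
  // pass is gated.
  Steps.push_back("mandatory-inline");
  if (S.EnableCGSCCInline)
    Steps.push_back("cgscc-inline");
  return Steps;
}
```

With both fields set to false, the model returns only the annotation and mandatory-inline steps, which mirrors the "turn off inlining without touching function attributes" intent described above.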

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
290 ms	x64 debian > libarcher.races::task-dependency.c
380 ms	x64 debian > libarcher.races::task-taskgroup-unrelated.c
370 ms	x64 debian > libarcher.races::task-taskwait-nested.c
360 ms	x64 debian > libarcher.races::task-two.c
340 ms	x64 debian > libarcher.task::task-barrier.c

Full test results: 13 failed.

Event Timeline

wenlei created this revision. (Feb 5 2021, 9:08 PM)

Herald added subscribers: modimo, lxfind, hiraditya, eraman. (Feb 5 2021, 9:08 PM)

wenlei requested review of this revision. (Feb 5 2021, 9:08 PM)

Herald added a project: Restricted Project. (Feb 5 2021, 9:08 PM)

Herald added a subscriber: llvm-commits.

wenlei edited the summary of this revision. (Feb 5 2021, 9:11 PM)

wenlei added a parent revision: D95980: [CSSPGO] Use merged base profile for hot threshold calculation.

Harbormaster completed remote builds in B88166: Diff 321917. (Feb 5 2021, 9:48 PM)

rebase

Harbormaster completed remote builds in B88582: Diff 322596. (Feb 9 2021, 10:58 PM)

Have you tried tuning the inline params instead of disabling CGSCC inlining in prelto? From our experience, thinlto wants some inlining to happen in the prelink phase, especially for smaller callees, so that the heuristic for choosing functions to import can be more effective. Current sampleFDO also turns off the hot callsite inline heuristic in CGSCC inlining by setting the inline param, to prevent prelink optimizations from obstructing profile annotation in postlink, but keeps the other params the same. I feel that using inline params for the tuning would be more flexible.

In D96197#2553329, @wmi wrote:

> Have you tried tuning the inline params instead of disabling CGSCC inlining in prelto? From our experience, thinlto wants some inlining to happen in the prelink phase, especially for smaller callees, so that the heuristic for choosing functions to import can be more effective. Current sampleFDO also turns off the hot callsite inline heuristic in CGSCC inlining by setting the inline param, to prevent prelink optimizations from obstructing profile annotation in postlink, but keeps the other params the same. I feel that using inline params for the tuning would be more flexible.

Yeah, the amount of inlining affects the importing due to call graph depth difference. The part you mentioned about turning off hot call site CGSCC inlining in prelto is the place where HotCallSiteThreshold is set to zero in buildInlinerPipeline, right? We would still have some inlining for small functions as they may end up with negative cost. Using threshold is more flexible and can achieve the same thing, though we'd need to pass four zero or negative thresholds (hot|regular x fdo|cgscc), so I thought switch would be a bit easier. It's somewhat similar to -fno-inline - we could theoretically achieve the same thing by tweaking thresholds too.
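To make the threshold-versus-switch point concrete, here is a deliberately simplified model of the inline decision. It assumes a call site is inlined when its computed cost is below the threshold; the real cost analysis has more cases. InlinerConfig and wouldInline are hypothetical names used only for this sketch.

```cpp
// Simplified model of one inliner instance. Assumption: the decision is
// just "cost below threshold"; always-inline and never-inline cases are
// ignored here.
struct InlinerConfig {
  bool Enabled;  // hypothetical on/off switch, as proposed in this patch
  int Threshold; // inline threshold for this inliner
};

// Returns true if a call site with the given cost would be inlined.
bool wouldInline(const InlinerConfig &C, int Cost) {
  if (!C.Enabled)
    return false; // switch off: nothing is inlined, whatever the cost
  return Cost < C.Threshold;
}
```

Under this model, a zero threshold still lets a callee with cost -15 through, while the switch (or a sufficiently negative threshold) blocks it. With four inliner configurations (hot|regular x fdo|cgscc), the threshold route needs four values and the switch route needs one flag per inliner.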

> Yeah, the amount of inlining affects the importing due to call graph depth difference. The part you mentioned about turning off hot call site CGSCC inlining in prelto is the place where HotCallSiteThreshold is set to zero in buildInlinerPipeline, right?

Right.

> We would still have some inlining for small functions as they may end up with negative cost.

If EnableRegularCGSCCInline and EnableSampleProfileInline are false in prelto, how would you have inlining for small functions?

> Using threshold is more flexible and can achieve the same thing, though we'd need to pass four zero or negative thresholds (hot|regular x fdo|cgscc), so I thought switch would be a bit easier. It's somewhat similar to -fno-inline - we could theoretically achieve the same thing by tweaking thresholds too.

Passing multiple flags to set params may be OK for tuning but not for the default usage. I think we can hardcode the threshold value for CSSPGO after tuning is done.

> we are experimenting with shifting more inlining from LTO prelink to postlink, from cgscc inlining to sample loader inlining.

Talking about the shift from cgscc inlining to sample loader inlining: one thing missing in sample loader inlining is the iterative cleanup during inlining that cgscc inlining provides. Do you think it matters?

In D96197#2554808, @wmi wrote:

>> Yeah, the amount of inlining affects the importing due to call graph depth difference. The part you mentioned about turning off hot call site CGSCC inlining in prelto is the place where HotCallSiteThreshold is set to zero in buildInlinerPipeline, right?
>
> Right.
>
>> We would still have some inlining for small functions as they may end up with negative cost.
>
> If EnableRegularCGSCCInline and EnableSampleProfileInline are false in prelto, how would you have inlining for small functions?

In that case, small function inlining will also move to LTO, though it would require tweaking the importing instr limit/threshold.

>> Using threshold is more flexible and can achieve the same thing, though we'd need to pass four zero or negative thresholds (hot|regular x fdo|cgscc), so I thought switch would be a bit easier. It's somewhat similar to -fno-inline - we could theoretically achieve the same thing by tweaking thresholds too.
>
> Passing multiple flags to set params may be OK for tuning but not for the default usage. I think we can hardcode the threshold value for CSSPGO after tuning is done.

Makes sense, we can use thresholds for now and change the defaults for CSSPGO after it settles. I will skip this patch for now then.

>> we are experimenting with shifting more inlining from LTO prelink to postlink, from cgscc inlining to sample loader inlining.
>
> Talking about the shift from cgscc inlining to sample loader inlining: one thing missing in sample loader inlining is the iterative cleanup during inlining that cgscc inlining provides. Do you think it matters?

Yeah, this could potentially be a challenge. Without the iterative cleanup, the cost the inliner sees may not be accurate. We hope this can be mitigated by 1) tweaking the threshold for the sample loader inliner, and 2) potentially using post-codegen size from a previous build to help estimate the cost, e.g. we could put function size alongside the CFG checksum in profile metadata, and reference that size to see through potential cleanup in the sample loader. We always have cgscc passes in LTO, so the actual cleanup should still do a good job there.
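A minimal sketch of the second mitigation, under stated assumptions: the record layout, the field names and estimateCalleeSize are all hypothetical, and nothing here reflects an existing profile format. The idea is only that a size recorded from a previous build is trusted when the CFG checksum still matches, and the pre-cleanup estimate is used otherwise.

```cpp
#include <cstdint>

// Hypothetical per-function profile metadata; field names are illustrative.
struct FunctionProfileMeta {
  std::uint64_t CFGChecksum;     // checksum of the CFG the profile was taken on
  std::uint32_t PostCodegenSize; // size from the previous build, 0 if unknown
};

// Picks a size estimate for a callee. The previous build's size is used only
// when it is known and the checksum matches the current CFG; otherwise the
// estimate computed before any cleanup is returned.
std::uint32_t estimateCalleeSize(const FunctionProfileMeta &Meta,
                                 std::uint64_t CurrentChecksum,
                                 std::uint32_t PreCleanupEstimate) {
  const bool Usable =
      Meta.PostCodegenSize != 0 && Meta.CFGChecksum == CurrentChecksum;
  return Usable ? Meta.PostCodegenSize : PreCleanupEstimate;
}
```

Falling back on a checksum mismatch keeps a stale size from being applied to a function whose body has changed since the profiled build.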

Will revisit after CSSPGO tuning settles.

wenlei mentioned this in D104926: [CSSPGO] Switches to disable pre-link inlining. (Jun 25 2021, 9:22 AM)

Revision Contents

Path	Size
llvm/lib/Transforms/IPO/Inliner.cpp	7 lines
llvm/lib/Transforms/IPO/SampleProfile.cpp	9 lines

Diff 321917

llvm/lib/Transforms/IPO/Inliner.cpp

[93 lines collapsed]

 static cl::opt<std::string> CGSCCInlineReplayFile(
     "cgscc-inline-replay", cl::init(""), cl::value_desc("filename"),
     cl::desc(
         "Optimization remarks file containing inline remarks to be replayed "
         "by inlining from cgscc inline remarks."),
     cl::Hidden);

+static cl::opt<bool>
+    EnableRegularCGSCCInline("enable-cgscc-inline", cl::init(true), cl::Hidden,
+                             cl::desc("Enable CGSCC Inlining"));
+
 LegacyInlinerBase::LegacyInlinerBase(char &ID) : CallGraphSCCPass(ID) {}

 LegacyInlinerBase::LegacyInlinerBase(char &ID, bool InsertLifetime)
     : CallGraphSCCPass(ID), InsertLifetime(InsertLifetime) {}

 /// For this class, we declare that we require and preserve the call graph.
 /// If the derived class implements this method, it should
 /// always explicitly call the implementation here.

[891 lines collapsed]  : Params(Params), Mode(Mode), MaxDevirtIterations(MaxDevirtIterations),

       PM(Debugging), MPM(Debugging) {
   // Run the inliner first. The theory is that we are walking bottom-up and so
   // the callees have already been fully optimized, and we want to inline them
   // into the callers so that our optimizations can reflect that.
   // For PreLinkThinLTO pass, we disable hot-caller heuristic for sample PGO
   // because it makes profile annotation in the backend inaccurate.
   if (MandatoryFirst)
     PM.addPass(InlinerPass(/*OnlyMandatory*/ true));
+  if (EnableRegularCGSCCInline)
     PM.addPass(InlinerPass());
 }

 PreservedAnalyses ModuleInlinerWrapperPass::run(Module &M,
                                                 ModuleAnalysisManager &MAM) {
   auto &IAA = MAM.getResult<InlineAdvisorAnalysis>(M);
   if (!IAA.tryCreate(Params, Mode, CGSCCInlineReplayFile)) {
     M.getContext().emitError(
         "Could not setup Inlining Advisor for the requested "

[21 lines collapsed]

llvm/lib/Transforms/IPO/SampleProfile.cpp

[171 lines collapsed]  cl::desc("Merge past inlinee's profile to outline version if sample "

              "enabled. "));

 static cl::opt<bool> ProfileTopDownLoad(
     "sample-profile-top-down-load", cl::Hidden, cl::init(true),
     cl::desc("Do profile annotation and inlining for functions in top-down "
              "order of call graph during sample profile loading. It only "
              "works for new pass manager. "));

+static cl::opt<bool> EnableSampleProfileInline(
+    "enable-sample-profile-inline", cl::Hidden, cl::init(true),
+    cl::desc("Enable Inlining from sample profile loader."));
+
 static cl::opt<bool> ProfileSizeInline(
     "sample-profile-inline-size", cl::Hidden, cl::init(false),
     cl::desc("Inline cold call sites in profile loader if it's beneficial "
              "for code size."));

 static cl::opt<int> ProfileInlineGrowthLimit(
     "sample-profile-inline-growth-limit", cl::Hidden, cl::init(12),
     cl::desc("The size growth ratio limit for proirity-based sample profile "

[1,148 lines collapsed]  if (ProfileMergeInlinee) {

       pair.first->second.entryCount += FS->getEntrySamples();
     }
   }
   return Changed;
 }

 bool SampleProfileLoader::tryInlineCandidate(
     InlineCandidate &Candidate, SmallVector<CallBase *, 8> *InlinedCallSites) {
+  // Disable only the inlining part of sample profile loader, and still
+  // perform profile annotation, ICP as well as computing import GUIDs for
+  // ThinLTO's pre-link.
+  if (!EnableSampleProfileInline)
+    return false;
+
   CallBase &CB = *Candidate.CallInstr;
   Function *CalledFunction = CB.getCalledFunction();
   assert(CalledFunction && "Expect a callee with definition");
   DebugLoc DLoc = CB.getDebugLoc();
   BasicBlock *BB = CB.getParent();

   InlineCost Cost = shouldInlineCandidate(Candidate);

[1,205 lines collapsed]