This is an archive of the discontinued LLVM Phabricator instance.

Differential D99146

[CSSPGO][llvm-profgen] Context-sensitive global pre-inliner
ClosedPublic

Authored by wenlei on Mar 22 2021, 11:06 PM.

Details

Reviewers

hoy
wmi
davidxl
wlei

Commits

rG30b023233696: [CSSPGO][llvm-profgen] Context-sensitive global pre-inliner

Summary

This change sets up a framework in llvm-profgen to estimate inline decisions and adjust the context-sensitive profile accordingly. We call it a global pre-inliner in llvm-profgen.

It will serve two purposes:

Since the context profile for a non-inlined context will be merged into the base profile, if we estimate that a context will not be inlined, we can merge its context profile in the output to reduce profile size.
For ThinLTO, when a context involving functions from different modules is not inlined, we can't merge function profiles across modules, leading to suboptimal post-inline count quality. By estimating some inline decisions, we would be able to adjust/merge context profiles beforehand as a mitigation.

The compiler's inline heuristic uses inline cost, which is not available in llvm-profgen. But since inline cost is closely related to size, we can get an estimate from function size in debug info. Because the size we have in llvm-profgen is the final size, it could also be more accurate than the inline cost estimation in the compiler.

This change only has the framework, with a few TODOs left for follow-up patches to complete the implementation:

We need to retrieve the size of each function/inlinee from debug info for inlining estimation. Currently we use the number of samples in a profile as a placeholder for size estimation.
Currently the thresholds are the values used by the sample loader inliner. They need to be tuned, since the size here is fully optimized machine code size rather than inline cost based on not yet fully optimized IR.
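As a rough illustration of the threshold-based estimate described above, here is a minimal standalone sketch. `InlineCandidateSketch`, `shouldInlineSketch`, and the `HotCountCutoff` parameter are hypothetical names, not the llvm-profgen API; the two threshold constants mirror the defaults of `sample-profile-hot-inline-threshold` (3000) and `sample-profile-cold-inline-threshold` (45) shown in this revision's SampleProfile.cpp.

```cpp
#include <cstdint>

// Hypothetical candidate: a call site with its sample count and a size
// estimate for the callee. Per the TODO above, the size is currently only a
// placeholder derived from the number of samples in the profile.
struct InlineCandidateSketch {
  uint64_t CallsiteCount;
  uint64_t SizeCost;
};

// Defaults of the sample loader inliner options in this revision.
constexpr uint64_t HotCallSiteThreshold = 3000;
constexpr uint64_t ColdCallSiteThreshold = 45;

// Pick a size threshold based on call site hotness, then compare the size
// estimate against it. HotCountCutoff is an assumed input; how hotness is
// actually determined is not shown here.
bool shouldInlineSketch(const InlineCandidateSketch &C,
                        uint64_t HotCountCutoff) {
  uint64_t Threshold = C.CallsiteCount > HotCountCutoff
                           ? HotCallSiteThreshold
                           : ColdCallSiteThreshold;
  return C.SizeCost < Threshold;
}
```

With a strict `>` comparison, a call site whose count equals the cutoff is treated as cold and gets the smaller threshold.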

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

wenlei created this revision.Mar 22 2021, 11:06 PM

Herald added subscribers: modimo, lxfind, hiraditya and 2 others.Mar 22 2021, 11:06 PM

wenlei requested review of this revision.Mar 22 2021, 11:06 PM

Herald added a project: Restricted Project.Mar 22 2021, 11:06 PM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B95169: Diff 332537.Mar 22 2021, 11:56 PM

hoy added inline comments.Mar 24 2021, 11:08 AM

llvm/tools/llvm-profgen/CSPreInliner.cpp
106	nit: a function call
151	Should this be `>=`?
169	This currently only reflects the number of live/hot lines. Might be extended to using static size from dwarf/probe decoding or disassembling. Can you leave a TODO for this?
llvm/tools/llvm-profgen/ProfileGenerator.cpp
470	Nit: context
llvm/tools/llvm-profgen/ProfiledCallGraph.h
29 ↗	(On Diff #332537)	Nit: ProfiledCallGraphNodeComparer

wenlei added inline comments.Mar 24 2021, 7:13 PM

llvm/tools/llvm-profgen/CSPreInliner.cpp
106	updated.
151	`>` is consistent with `SampleProfileLoader::shouldInlineCandidate`, though practically I don't think it matters.
169	Todo added.
llvm/tools/llvm-profgen/ProfileGenerator.cpp
470	updated.
llvm/tools/llvm-profgen/ProfiledCallGraph.h
29 ↗	(On Diff #332537)	Good catch, changed.

Address Hongtao's comments. Also moving ProfiledCallGraph.h into IPO so we can use it in compiler too (there's a need to make it available in compiler and we will follow up with separate patch).

ThinLTO is known to have issues related to profile update (cross module), so we were thinking something similar in ThinLink phase. One of the issues is that the pre-inlining needs to make similar decisions as the compiler. How well is the preinliner doing in this regard?

In D99146#2649557, @davidxl wrote:

ThinLTO is known to have issues related to profile update (cross module), so we were thinking something similar in ThinLink phase.

This is the exact problem we are trying to mitigate. We also considered doing this in ThinLink but adjusting profiles for thin-backends and communicating inline decisions to thin-backends would add quite a bit of complexity, which could also slow down ThinLink. With CSSPGO, doing it in profile generation and use adjusted profile to convey inline estimation/suggestion is much simpler and cheaper.

One of the issues is that the pre-inlining needs to make similar decisions as the compiler. How well is the preinliner doing in this regard?

Yes, this is a challenge. We don't have data yet, but I hope with some tuning we can get them to be close. One problem with doing pre-inlining is we don't have a lot of information that compiler can see from IR, though if needed some of that can be embedded into binary (some metadata in probe descriptor, etc.) for preinliner. I hope a more accurate view on machine code byte size for inline cost can offset some of the disadvantages due to lack of IR.

We'll be working on tuning the preinliner to get it to be close to compiler inliner. This is similar to the effort of transferring more inlining from cgscc inliner to sample loader inliner in that we may not see immediate results, but over time, as the new component matures, we hope to reap benefits later.

MaskRay added a subscriber: MaskRay.Mar 24 2021, 9:53 PM

In D99146#2649599, @wenlei wrote:

In D99146#2649557, @davidxl wrote:

ThinLTO is known to have issues related to profile update (cross module), so we were thinking something similar in ThinLink phase.

This is the exact problem we are trying to mitigate. We also considered doing this in ThinLink but adjusting profiles for thin-backends and communicating inline decisions to thin-backends would add quite a bit of complexity, which could also slow down ThinLink. With CSSPGO, doing it in profile generation and use adjusted profile to convey inline estimation/suggestion is much simpler and cheaper.

One of the issues is that the pre-inlining needs to make similar decisions as the compiler. How well is the preinliner doing in this regard?

Yes, this is a challenge. We don't have data yet, but I hope with some tuning we can get them to be close. One problem with doing pre-inlining is we don't have a lot of information that compiler can see from IR, though if needed some of that can be embedded into binary (some metadata in probe descriptor, etc.) for preinliner. I hope a more accurate view on machine code byte size for inline cost can offset some of the disadvantages due to lack of IR.

It is a good idea to have a non-intrusive way to predict cross-module inlining decisions and update the profile beforehand.

To mitigate ThinLTO profile update issue, either apparent inline or no-inline decisions can be made. From the patch description, seems currently it only considers the case that no-inline decision is made and profile can be merged back. Have you considered the case that inline is apparently beneficial and profile without context can be split?

We'll be working on tuning the preinliner to get it to be close to compiler inliner. This is similar to the effort of transferring more inlining from cgscc inliner to sample loader inliner in that we may not see immediate results, but over time, as the new component matures, we hope to reap benefits later.

In D99146#2649707, @wmi wrote:

In D99146#2649599, @wenlei wrote:

In D99146#2649557, @davidxl wrote:

ThinLTO is known to have issues related to profile update (cross module), so we were thinking something similar in ThinLink phase.

This is the exact problem we are trying to mitigate. We also considered doing this in ThinLink but adjusting profiles for thin-backends and communicating inline decisions to thin-backends would add quite a bit of complexity, which could also slow down ThinLink. With CSSPGO, doing it in profile generation and use adjusted profile to convey inline estimation/suggestion is much simpler and cheaper.

One of the issues is that the pre-inlining needs to make similar decisions as the compiler. How well is the preinliner doing in this regard?

Yes, this is a challenge. We don't have data yet, but I hope with some tuning we can get them to be close. One problem with doing pre-inlining is we don't have a lot of information that compiler can see from IR, though if needed some of that can be embedded into binary (some metadata in probe descriptor, etc.) for preinliner. I hope a more accurate view on machine code byte size for inline cost can offset some of the disadvantages due to lack of IR.

It is a good idea to have a non-intrusive way to predict cross-module inlining decisions and update the profile beforehand.

To mitigate ThinLTO profile update issue, either apparent inline or no-inline decisions can be made. From the patch description, seems currently it only considers the case that no-inline decision is made and profile can be merged back. Have you considered the case that inline is apparently beneficial and profile without context can be split?

If we don't have context profile from raw input profile, the split is going to be a simple scaling based on call site counts, right? In that case, doing it in profile generation won't improve profile quality because the scaling won't be very different from the scaling done by cgscc inliner. Though if we split the profiles to synthesize context profile, sample loader would be able to inline more, but if we want we could allow sample loader inlining to do scaling.

We'll be working on tuning the preinliner to get it to be close to compiler inliner. This is similar to the effort of transferring more inlining from cgscc inliner to sample loader inliner in that we may not see immediate results, but over time, as the new component matures, we hope to reap benefits later.

Harbormaster completed remote builds in B95619: Diff 333190.Mar 25 2021, 12:42 AM

hoy added a child revision: D99351: [CSSPGO] Top-down processing order based on full profile..Mar 25 2021, 9:11 AM

Update header comments and macro for ProfiledCallGraph.

Harbormaster completed remote builds in B95735: Diff 333362.Mar 25 2021, 12:24 PM

hoy added inline comments.Mar 25 2021, 12:31 PM

llvm/tools/llvm-profgen/CSPreInliner.cpp
64	I'm wondering in the future if functions without profile should be considered so that a broader inline decisions can be made regardless of callsite hotness.
llvm/tools/llvm-profgen/ProfileGenerator.cpp
470	Trying to understand how keys can change. Do we only remove the profiles from `ProfileMap` once they are merged into base profiles? Are base profiles from the reuse of first non-inlined profiles?

wenlei added inline comments.Mar 25 2021, 4:33 PM

llvm/tools/llvm-profgen/CSPreInliner.cpp
64	For pre-inlining, we need to have a callee profile; otherwise there's no profile to be adjusted regardless of whether we inline or not. For the caller profile, we currently require it to trigger pre-inline, but we could do pre-inline without a caller profile. Is that what you meant? Currently the compiler's inlining also requires a caller profile, so what we have here aligns with the compiler. (Not sure if I get the question; it's building top-down order here, not actual pre-inlining.)
llvm/tools/llvm-profgen/ProfileGenerator.cpp
470	Right, base profile can be from the first promoted context profile. We don't remove profiles from ProfileMap during context promotion and merging. ProfileMap owns the function profiles, and context promotion is done on context trie only, which also updates context for the profiles. The key of that map does not change, but the context for profile in the map changes, so key is no longer accurate. Fortunately for profile writing, we don't look at the keys, so as long we fix this spot to avoid using the key, we are good. We could also update the map to keep keys accurate, but it involves moving profiles around which has some cost.
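The stale-key situation described in this reply can be illustrated with a minimal standalone sketch. `ProfileSketch`, `ProfileMapSketch`, `promote`, and `namesForWriting` are hypothetical names, not llvm-profgen APIs, and the context strings used with them are made up for illustration.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a function profile that carries its own context.
struct ProfileSketch {
  std::string Context;
  uint64_t TotalSamples;
};

// The map owns the profiles; each key is the context at insertion time.
using ProfileMapSketch = std::map<std::string, ProfileSketch>;

// Promotion rewrites the context stored in the profile but leaves the map
// key untouched, so the key no longer matches the profile.
void promote(ProfileMapSketch &Map, const std::string &Key,
             const std::string &NewContext) {
  auto It = Map.find(Key);
  if (It != Map.end())
    It->second.Context = NewContext;
}

// The writer reads the context from each profile instead of trusting the key.
std::vector<std::string> namesForWriting(const ProfileMapSketch &Map) {
  std::vector<std::string> Names;
  for (const auto &Entry : Map)
    Names.push_back(Entry.second.Context);
  return Names;
}
```

Keeping the keys accurate instead would mean erasing and re-inserting entries, which is the cost of moving profiles around that the reply mentions.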

hoy added inline comments.Mar 25 2021, 5:51 PM

llvm/tools/llvm-profgen/CSPreInliner.cpp
64	Sorry for the confusion. I actually meant to explore non-profiled callees (such as getters/setters) here and in `getInlineCandidates` to mimic the situation where both the hot callees of those getters/setters and the getters/setters themselves are inlined into the current caller. The getter/setter inlining, though their callsites are not hot, is likely done in prelink CGSCC inlining. I was thinking about simulating the prelink inlining if possible. There's no such need if prelink inlining is disabled.

wenlei added inline comments.Mar 25 2021, 9:28 PM

llvm/tools/llvm-profgen/CSPreInliner.cpp
64	What do you mean by non-profiled callees? If function does not have profile, there's no profile to be adjusted, then doing pre-inline or not doesn't matter. CGSCC inline in prelink can happen but since there's no profile for the callee, no profile need to be adjusted and there's no count quality issue. Or are you suggesting considering looking at multiple levels of callees when evaluating a call site? That is orthogonal to whether a function has profile.

hoy added inline comments.Mar 25 2021, 10:02 PM

llvm/tools/llvm-profgen/CSPreInliner.cpp
64	Yeah, even if the first-level callee doesn't have a profile, the callee's callee could have a hot profile. Skipping such first-level callees also skips next-level callees. It's fine if first-level callees are not inlined in prelink, since they will also not be inlined in the postlink sample loader. If such callees are inlined by prelink cgscc, then next-level callees will likely be inlined by postlink FDO, which is a discrepancy from the llvm-profgen preinliner. Currently if all functions have profiles, multi-level is naturally supported with the priority-based BFS processing by tweaking callsite costs. If any function in one call chain doesn't have a profile, BFS will stop at that level.

wenlei added inline comments.Mar 25 2021, 10:09 PM

llvm/tools/llvm-profgen/CSPreInliner.cpp
64	We don't support multiple levels even with priority-based BFS inliner in the sense that we never look ahead to see if there's anything hot underneath a cold callee. Callee without profile is just one example of cold callee.
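The "no look-ahead under a cold callee" behavior discussed in this thread can be sketched with a small standalone simulation. All names here (`CallEdgeSketch`, `CallGraphSketch`, `simulateTopDownInlining`, `HotCutoff`) are hypothetical and the hotness test is deliberately simplistic; this is not the CSPreInliner implementation.

```cpp
#include <cstdint>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical call edge: callee name plus call site sample count.
struct CallEdgeSketch {
  std::string Callee;
  uint64_t Count;
};
using CallGraphSketch = std::map<std::string, std::vector<CallEdgeSketch>>;

// Process call sites hottest-first, starting from Root. A cold call site is
// skipped, and because its callee is never expanded, anything hot underneath
// it is never examined.
std::vector<std::string> simulateTopDownInlining(const CallGraphSketch &Graph,
                                                 const std::string &Root,
                                                 uint64_t HotCutoff) {
  auto LessHot = [](const CallEdgeSketch &A, const CallEdgeSketch &B) {
    return A.Count < B.Count;
  };
  std::priority_queue<CallEdgeSketch, std::vector<CallEdgeSketch>,
                      decltype(LessHot)>
      Queue(LessHot);
  auto PushCallees = [&](const std::string &Func) {
    auto It = Graph.find(Func);
    if (It == Graph.end())
      return;
    for (const CallEdgeSketch &E : It->second)
      Queue.push(E);
  };

  std::vector<std::string> Inlined;
  std::set<std::string> Seen; // Visit each callee once so recursion terminates.
  PushCallees(Root);
  while (!Queue.empty()) {
    CallEdgeSketch E = Queue.top();
    Queue.pop();
    if (E.Count <= HotCutoff || !Seen.insert(E.Callee).second)
      continue;
    Inlined.push_back(E.Callee);
    PushCallees(E.Callee); // Only inlined callees expose further call sites.
  }
  return Inlined;
}
```

For a graph where `main` calls a hot function and a cold getter, and the getter calls something hot, the simulation inlines the hot function and its hot callees but never reaches the function below the getter.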

hoy added inline comments.Mar 25 2021, 10:22 PM

llvm/tools/llvm-profgen/CSPreInliner.cpp
64	I see. Yes, callees without profiles are a case of cold callees. They shouldn't be an issue without cgscc inlining sitting between the preinliner and the targeted postlink FDO inliner. With the cgscc inlining, we might need some sort of simulation for that, which might be quite different from the current top-down simulator.

wenlei added inline comments.Mar 25 2021, 10:36 PM

llvm/tools/llvm-profgen/CSPreInliner.cpp
64	Yeah, I hope that cgscc inline will mostly only deal with cold/small function inlining for csspgo, in which case profile adjustment is less important, hence preinline is less important (extreme case of cold inlining is cases where we don't have profile at all, then pre-inline doesn't matter). If we actually find cgscc inlining covering some hot inlining, I think it'd be worth looking to see why sample loader doesn't handle such cases. So I hope we don't have to do much for preinline estimation for cgscc inliner.

hoy accepted this revision.Mar 25 2021, 10:55 PM

hoy added inline comments.

llvm/tools/llvm-profgen/CSPreInliner.cpp
64	Sounds good. Current implementation looks good to me. Would like to see this moving forward and evolving close to the real inliner.

This revision is now accepted and ready to land.Mar 25 2021, 10:55 PM

wmi added inline comments.Mar 26 2021, 10:04 AM

llvm/tools/llvm-profgen/CMakeLists.txt
8	Besides the flags, is there anything else needed for the patch to add IPO as a dependency for llvm-profgen? IPO includes many unrelated things.

wmi added inline comments.Mar 26 2021, 10:56 AM

llvm/tools/llvm-profgen/CSPreInliner.cpp
182	The definition of ShouldInline is only used in LLVM_DEBUG. It may trigger warning in release mode.
182–190	I didn't find if the candidate should not be inlined, where the context profile is merged into the base profile. Could you show me?
200	Included in LLVM_DEBUG.
222	printProfileNames is only used in LLVM_DEBUG. Need to include it in #ifndef NDEBUG.

wenlei added inline comments.Mar 26 2021, 10:57 AM

llvm/tools/llvm-profgen/CMakeLists.txt
8	That is for reusing context tracker and the profiled call graph added in IPO. These two are shared between sample loader and llvm-profgen, which hopefully helps to make preinliner close to compiler inline. I moved profiled call graph into IPO so compiler can use it too (we now use it in https://reviews.llvm.org/D99351).

wenlei added inline comments.Mar 26 2021, 11:01 AM

llvm/tools/llvm-profgen/CSPreInliner.cpp
182	Good point, let me check and adjust.
182–190	The merge is done within getBaseSamplesFor on-demand, same as how it's done in compiler.
200	This is all in LLVM_DEBUG on line 197. Do you mean we need a separate LLVM_DEBUG?
222	Good point, will do.

wmi added inline comments.Mar 26 2021, 12:28 PM

llvm/tools/llvm-profgen/CSPreInliner.cpp
182–190	Ah, I see. markContextSamplesInlined marks those contexts which will be inlined. The rest are non-inlined and will be merged when getBaseSamplesFor is called.
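The mark-then-merge-on-demand flow summarized in this reply can be sketched in isolation. `ContextTrackerSketch` and its members are hypothetical and only loosely echo `markContextSamplesInlined` and `getBaseSamplesFor`; the real tracker works on a context trie and full `FunctionSamples`, not plain totals.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-context record: a context string, its sample total, and
// whether the pre-inliner marked the context as inlined.
struct ContextSamplesSketch {
  std::string Context;
  uint64_t TotalSamples;
  bool Inlined;
};

class ContextTrackerSketch {
public:
  void addContext(const std::string &Func, const std::string &Context,
                  uint64_t Total) {
    Contexts[Func].push_back(ContextSamplesSketch{Context, Total, false});
  }

  // Mark one context of Func as inlined so it is excluded from the base.
  void markInlined(const std::string &Func, const std::string &Context) {
    for (ContextSamplesSketch &C : Contexts[Func])
      if (C.Context == Context)
        C.Inlined = true;
  }

  // Merge on demand: the base total is the sum over contexts not marked as
  // inlined. Nothing is merged until this is called.
  uint64_t baseSamplesFor(const std::string &Func) const {
    auto It = Contexts.find(Func);
    if (It == Contexts.end())
      return 0;
    uint64_t Base = 0;
    for (const ContextSamplesSketch &C : It->second)
      if (!C.Inlined)
        Base += C.TotalSamples;
    return Base;
  }

private:
  std::map<std::string, std::vector<ContextSamplesSketch>> Contexts;
};
```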
200	Ah, I missed that. Never mind.

wenlei added inline comments.Mar 26 2021, 4:19 PM

llvm/tools/llvm-profgen/CSPreInliner.cpp
182	Seems the assignment is treated as a use, so we don't get a warning in release. But we do have one for printProfileNames, which is now wrapped in a macro.
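The pattern being discussed, a helper that is only referenced from debug output, can be sketched without LLVM headers. `printNamesSketch` and `processProfilesSketch` are hypothetical, and plain `#ifndef NDEBUG` guards stand in for `LLVM_DEBUG`, which compiles to nothing when `NDEBUG` is defined.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

#ifndef NDEBUG
// Debug-only helper. Guarding the definition avoids an unused-function
// warning in release builds, where the only call site is compiled out.
static void printNamesSketch(const std::vector<std::string> &Names) {
  for (const std::string &Name : Names)
    std::cerr << Name << "\n";
}
#endif

std::size_t processProfilesSketch(const std::vector<std::string> &Names) {
#ifndef NDEBUG
  printNamesSketch(Names); // Stands in for an LLVM_DEBUG(...) call site.
#endif
  return Names.size();
}
```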

Addressed Wei's comment, also moved profile call graph building into ProfiledCallGraph.

linter

Harbormaster completed remote builds in B95948: Diff 333652.Mar 26 2021, 4:51 PM

LGTM. Thanks.

Harbormaster completed remote builds in B95949: Diff 333653.Mar 26 2021, 5:29 PM

hoy mentioned this in D99351: [CSSPGO] Top-down processing order based on full profile..Mar 27 2021, 11:00 AM

This revision was landed with ongoing or failed builds.Mar 29 2021, 9:54 AM

Closed by commit rG30b023233696: [CSSPGO][llvm-profgen] Context-sensitive global pre-inliner (authored by wenlei).

This revision was automatically updated to reflect the committed changes.

wenlei added a commit: rG30b023233696: [CSSPGO][llvm-profgen] Context-sensitive global pre-inliner.

In D99146#2649599, @wenlei wrote:

In D99146#2649557, @davidxl wrote:

ThinLTO is known to have issues related to profile update (cross module), so we were thinking something similar in ThinLink phase.

This is the exact problem we are trying to mitigate. We also considered doing this in ThinLink but adjusting profiles for thin-backends and communicating inline decisions to thin-backends would add quite a bit of complexity, which could also slow down ThinLink. With CSSPGO, doing it in profile generation and use adjusted profile to convey inline estimation/suggestion is much simpler and cheaper.

One of the issues is that the pre-inlining needs to make similar decisions as the compiler. How well is the preinliner doing in this regard?

Yes, this is a challenge. We don't have data yet, but I hope with some tuning we can get them to be close. One problem with doing pre-inlining is we don't have a lot of information that compiler can see from IR, though if needed some of that can be embedded into binary (some metadata in probe descriptor, etc.) for preinliner. I hope a more accurate view on machine code byte size for inline cost can offset some of the disadvantages due to lack of IR.

We'll be working on tuning the preinliner to get it to be close to compiler inliner. This is similar to the effort of transferring more inlining from cgscc inliner to sample loader inliner in that we may not see immediate results, but over time, as the new component matures, we hope to reap benefits later.

@davidxl @wmi just to report back now that we've done more work along this path, and got some results. We observed global preinliner boosting performance for a large internal workload by ~1% (baseline is CSSPGO without global preinliner). Hopefully with some work we can rely more on preinliner for AutoFDO too.

Herald added a subscriber: ormris.Sep 21 2021, 10:07 AM

hvdijk mentioned this in D119798: [ADT, CSSPGO] Specify set comparer.Feb 14 2022, 5:20 PM

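Revision Contents

Path	Size
llvm/include/llvm/Transforms/IPO/ProfiledCallGraph.h	135 lines
llvm/include/llvm/Transforms/IPO/SampleContextTracker.h	3 lines
llvm/lib/Transforms/IPO/SampleContextTracker.cpp	3 lines
llvm/lib/Transforms/IPO/SampleProfile.cpp	23 lines
llvm/test/tools/llvm-profgen/cs-preinline.test	41 lines
llvm/tools/llvm-profgen/CMakeLists.txt	2 lines
llvm/tools/llvm-profgen/CSPreInliner.h	92 lines
llvm/tools/llvm-profgen/CSPreInliner.cpp	229 lines
llvm/tools/llvm-profgen/ProfileGenerator.h	10 lines
llvm/tools/llvm-profgen/ProfileGenerator.cpp	30 lines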

Diff 333915

llvm/include/llvm/Transforms/IPO/ProfiledCallGraph.h

This file was added.

//===-- ProfiledCallGraph.h - Profiled Call Graph ----------------- C++ -*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//

#ifndef LLVM_TOOLS_LLVM_PROFGEN_PROFILEDCALLGRAPH_H
#define LLVM_TOOLS_LLVM_PROFGEN_PROFILEDCALLGRAPH_H

#include "llvm/ADT/GraphTraits.h"
#include "llvm/ADT/StringMap.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/ProfileData/SampleProf.h"
#include "llvm/Transforms/IPO/SampleContextTracker.h"
#include <queue>
#include <set>
#include <string>

using namespace llvm;
using namespace sampleprof;

namespace llvm {
namespace sampleprof {

struct ProfiledCallGraphNode {
  ProfiledCallGraphNode(StringRef FName = StringRef()) : Name(FName) {}
  StringRef Name;

  struct ProfiledCallGraphNodeComparer {
    bool operator()(const ProfiledCallGraphNode *L,
                    const ProfiledCallGraphNode *R) const {
      return L->Name < R->Name;
    }
  };
  std::set<ProfiledCallGraphNode *, ProfiledCallGraphNodeComparer> Callees;
};

class ProfiledCallGraph {
public:
  using iterator = std::set<ProfiledCallGraphNode *>::iterator;
  ProfiledCallGraph(StringMap<FunctionSamples> &ProfileMap,
                    SampleContextTracker &ContextTracker) {
    // Add all profiled functions into profiled call graph.
    // We only add function with actual context profile
    for (auto &FuncSample : ProfileMap) {
      FunctionSamples *FSamples = &FuncSample.second;
      addProfiledFunction(FSamples->getName());
    }

    // BFS traverse the context profile trie to add call edges for
    // both samples calls as well as calls shown in context.
    std::queue<ContextTrieNode *> Queue;
    Queue.push(&ContextTracker.getRootContext());
    while (!Queue.empty()) {
      ContextTrieNode *Caller = Queue.front();
      Queue.pop();
      FunctionSamples *CallerSamples = Caller->getFunctionSamples();

      // Add calls for context, if both caller and callee has context profile.
      for (auto &Child : Caller->getAllChildContext()) {
        ContextTrieNode *Callee = &Child.second;
        Queue.push(Callee);
        if (CallerSamples && Callee->getFunctionSamples()) {
          addProfiledCall(Caller->getFuncName(), Callee->getFuncName());
        }
      }

      // Add calls from call site samples
      if (CallerSamples) {
        for (auto &LocCallSite : CallerSamples->getBodySamples()) {
          for (auto &NameCallSite : LocCallSite.second.getCallTargets()) {
            addProfiledCall(Caller->getFuncName(), NameCallSite.first());
          }
        }
      }
    }
  }

  iterator begin() { return Root.Callees.begin(); }
  iterator end() { return Root.Callees.end(); }
  ProfiledCallGraphNode *getEntryNode() { return &Root; }
  void addProfiledFunction(StringRef Name) {
    if (!ProfiledFunctions.count(Name)) {
      // Link to synthetic root to make sure every node is reachable
      // from root. This does not affect SCC order.
      Root.Callees.insert(&ProfiledFunctions[Name]);
      ProfiledFunctions[Name] = ProfiledCallGraphNode(Name);
    }
  }
  void addProfiledCall(StringRef CallerName, StringRef CalleeName) {
    assert(ProfiledFunctions.count(CallerName));
    auto CalleeIt = ProfiledFunctions.find(CalleeName);
    if (CalleeIt == ProfiledFunctions.end()) {
      return;
    }
    ProfiledFunctions[CallerName].Callees.insert(&CalleeIt->second);
  }

private:
  ProfiledCallGraphNode Root;
  StringMap<ProfiledCallGraphNode> ProfiledFunctions;
};

} // end namespace sampleprof

template <> struct GraphTraits<ProfiledCallGraphNode *> {
  using NodeRef = ProfiledCallGraphNode *;
  using ChildIteratorType = std::set<ProfiledCallGraphNode *>::iterator;

  static NodeRef getEntryNode(NodeRef PCGN) { return PCGN; }
  static ChildIteratorType child_begin(NodeRef N) { return N->Callees.begin(); }
  static ChildIteratorType child_end(NodeRef N) { return N->Callees.end(); }
};

template <>
struct GraphTraits<ProfiledCallGraph *>
    : public GraphTraits<ProfiledCallGraphNode *> {
  static NodeRef getEntryNode(ProfiledCallGraph *PCG) {
    return PCG->getEntryNode();
  }

  static ChildIteratorType nodes_begin(ProfiledCallGraph *PCG) {
    return PCG->begin();
  }

  static ChildIteratorType nodes_end(ProfiledCallGraph *PCG) {
    return PCG->end();
  }
};

} // end namespace llvm

#endif
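The graph shape used by this header, nodes owned by a name-keyed map and linked through name-ordered callee sets under a synthetic root, can be tried out with a standalone sketch that needs no LLVM headers. `NodeSketch` and `ProfiledGraphSketch` are hypothetical names, and `std::map` stands in for `StringMap`. In this sketch a node's name is set before the node is inserted into any set, because the comparator reads the name at insertion time.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical node: a function name plus a name-ordered set of callees.
struct NodeSketch {
  std::string Name;
  struct ByName {
    bool operator()(const NodeSketch *L, const NodeSketch *R) const {
      return L->Name < R->Name;
    }
  };
  std::set<NodeSketch *, ByName> Callees;
};

class ProfiledGraphSketch {
public:
  // Create the named node first, then link it under the synthetic root so
  // every node is reachable from the root.
  void addFunction(const std::string &Name) {
    auto Inserted = Functions.emplace(Name, NodeSketch{Name, {}});
    if (Inserted.second)
      Root.Callees.insert(&Inserted.first->second);
  }

  // Add an edge only when both endpoints are known functions.
  bool addCall(const std::string &Caller, const std::string &Callee) {
    auto CallerIt = Functions.find(Caller);
    auto CalleeIt = Functions.find(Callee);
    if (CallerIt == Functions.end() || CalleeIt == Functions.end())
      return false;
    CallerIt->second.Callees.insert(&CalleeIt->second);
    return true;
  }

  NodeSketch Root;
  std::map<std::string, NodeSketch> Functions; // std::map keeps node addresses stable.
};
```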

llvm/include/llvm/Transforms/IPO/SampleContextTracker.h

... (108 lines not shown) ...
 public:
   // Get all context profile for given function.
   ContextSamplesTy &getAllContextSamplesFor(const Function &Func);
   ContextSamplesTy &getAllContextSamplesFor(StringRef Name);
   // Query base profile for a given function. A base profile is a merged view
   // of all context profiles for contexts that are not inlined.
   FunctionSamples *getBaseSamplesFor(const Function &Func,
                                      bool MergeContext = true);
   // Query base profile for a given function by name.
-  FunctionSamples *getBaseSamplesFor(StringRef Name, bool MergeContext);
+  FunctionSamples *getBaseSamplesFor(StringRef Name, bool MergeContext = true);
   // Retrieve the context trie node for given profile context
   ContextTrieNode *getContextFor(const SampleContext &Context);
   // Mark a context profile as inlined when function is inlined.
   // This makes sure that inlined context profile will be excluded in
   // function's base profile.
   void markContextSamplesInlined(const FunctionSamples *InlinedSamples);
+  ContextTrieNode &getRootContext();
   void promoteMergeContextSamplesTree(const Instruction &Inst,
                                       StringRef CalleeName);
   void addCallGraphEdges(CallGraph &CG, StringMap<Function *> &SymbolMap);
   // Dump the internal context profile trie.
   void dump();

 private:
   ContextTrieNode *getContextFor(const DILocation *DIL);
... (22 lines not shown) ...

llvm/lib/Transforms/IPO/SampleContextTracker.cpp

... (322 lines not shown) ...
 void SampleContextTracker::markContextSamplesInlined(
     const FunctionSamples *InlinedSamples) {
   assert(InlinedSamples && "Expect non-null inlined samples");
   LLVM_DEBUG(dbgs() << "Marking context profile as inlined: "
                     << InlinedSamples->getContext() << "\n");
   InlinedSamples->getContext().setState(InlinedContext);
 }

+ContextTrieNode &SampleContextTracker::getRootContext() { return RootContext; }
+
 void SampleContextTracker::promoteMergeContextSamplesTree(
     const Instruction &Inst, StringRef CalleeName) {
   LLVM_DEBUG(dbgs() << "Promoting and merging context tree for instr: \n"
                     << Inst << "\n");
   // Get the caller context for the call instruction, we don't use callee
   // name from call because there can be context from indirect calls too.
   DILocation *DIL = Inst.getDebugLoc();
   ContextTrieNode *CallerNode = getContextFor(DIL);
... (146 lines not shown) ...
 SampleContextTracker::getOrCreateContextPath(const SampleContext &Context,
   }

   assert((!AllowCreate || ContextNode) &&
          "Node must exist if creation is allowed");
   return ContextNode;
 }

 ContextTrieNode *SampleContextTracker::getTopLevelContextNode(StringRef FName) {
+  assert(!FName.empty() && "Top level node query must provide valid name");
   return RootContext.getChildContext(LineLocation(0, 0), FName);
 }

 ContextTrieNode &SampleContextTracker::addTopLevelContextNode(StringRef FName) {
   assert(!getTopLevelContextNode(FName) && "Node to add must not exist");
   return *RootContext.getOrCreateChildContext(LineLocation(0, 0), FName);
 }

... (90 lines not shown) ...

llvm/lib/Transforms/IPO/SampleProfile.cpp

... (169 lines not shown) ...
 static cl::opt<bool> UseProfileTopDownOrder(
     cl::desc("Process functions in one SCC in a top-down order "
              "based on the input profile."));

 static cl::opt<bool> ProfileSizeInline(
     "sample-profile-inline-size", cl::Hidden, cl::init(false),
     cl::desc("Inline cold call sites in profile loader if it's beneficial "
              "for code size."));

-static cl::opt<int> ProfileInlineGrowthLimit(
+cl::opt<int> ProfileInlineGrowthLimit(
     "sample-profile-inline-growth-limit", cl::Hidden, cl::init(12),
     cl::desc("The size growth ratio limit for proirity-based sample profile "
              "loader inlining."));

-static cl::opt<int> ProfileInlineLimitMin(
+cl::opt<int> ProfileInlineLimitMin(
     "sample-profile-inline-limit-min", cl::Hidden, cl::init(100),
     cl::desc("The lower bound of size growth limit for "
              "proirity-based sample profile loader inlining."));

-static cl::opt<int> ProfileInlineLimitMax(
+cl::opt<int> ProfileInlineLimitMax(
     "sample-profile-inline-limit-max", cl::Hidden, cl::init(10000),
     cl::desc("The upper bound of size growth limit for "
              "proirity-based sample profile loader inlining."));

+cl::opt<int> SampleHotCallSiteThreshold(
+    "sample-profile-hot-inline-threshold", cl::Hidden, cl::init(3000),
+    cl::desc("Hot callsite threshold for proirity-based sample profile loader "
+             "inlining."));
+
+cl::opt<int> SampleColdCallSiteThreshold(
+    "sample-profile-cold-inline-threshold", cl::Hidden, cl::init(45),
+    cl::desc("Threshold for inlining cold callsites"));
+
 static cl::opt<int> ProfileICPThreshold(
     "sample-profile-icp-threshold", cl::Hidden, cl::init(5),
     cl::desc(
         "Relative hotness threshold for indirect "
         "call promotion in proirity-based sample profile loader inlining."));

-static cl::opt<int> SampleHotCallSiteThreshold(
-    "sample-profile-hot-inline-threshold", cl::Hidden, cl::init(3000),
-    cl::desc("Hot callsite threshold for proirity-based sample profile loader "
"inlining."));

static cl::opt<bool> CallsitePrioritizedInline(		static cl::opt<bool> CallsitePrioritizedInline(
"sample-profile-prioritized-inline", cl::Hidden, cl::ZeroOrMore,		"sample-profile-prioritized-inline", cl::Hidden, cl::ZeroOrMore,
cl::init(false),		cl::init(false),
cl::desc("Use call site prioritized inlining for sample profile loader."		cl::desc("Use call site prioritized inlining for sample profile loader."
"Currently only CSSPGO is supported."));		"Currently only CSSPGO is supported."));

static cl::opt<int> SampleColdCallSiteThreshold(
"sample-profile-cold-inline-threshold", cl::Hidden, cl::init(45),
cl::desc("Threshold for inlining cold callsites"));

static cl::opt<std::string> ProfileInlineReplayFile(		static cl::opt<std::string> ProfileInlineReplayFile(
"sample-profile-inline-replay", cl::init(""), cl::value_desc("filename"),		"sample-profile-inline-replay", cl::init(""), cl::value_desc("filename"),
cl::desc(		cl::desc(
"Optimization remarks file containing inline remarks to be replayed "		"Optimization remarks file containing inline remarks to be replayed "
"by inlining from sample profile loader."),		"by inlining from sample profile loader."),
cl::Hidden);		cl::Hidden);

[1,765 lines not shown]

llvm/test/tools/llvm-profgen/cs-preinline.test

This file was added.

				; Test default llvm-profgen with preinline off
				; RUN: llvm-profgen --perfscript=%S/Inputs/inline-cs-noprobe.perfscript --binary=%S/Inputs/inline-cs-noprobe.perfbin --output=%t
				; RUN: FileCheck %s --input-file %t --check-prefix=CHECK-DEFAULT

				; Test llvm-profgen with preinliner on will merge not inlinable profile into base profile.
				; RUN: llvm-profgen --perfscript=%S/Inputs/inline-cs-noprobe.perfscript --binary=%S/Inputs/inline-cs-noprobe.perfbin --output=%t --csspgo-preinliner=1
				; RUN: FileCheck %s --input-file %t --check-prefix=CHECK-PREINL

				; Test preinliner threshold that prevents all possible inlining and merges everything into base profile.
				; RUN: llvm-profgen --perfscript=%S/Inputs/inline-cs-noprobe.perfscript --binary=%S/Inputs/inline-cs-noprobe.perfbin --output=%t --csspgo-preinliner=1 -sample-profile-hot-inline-threshold=0
				; RUN: FileCheck %s --input-file %t --check-prefix=CHECK-NO-PREINL

				; CHECK-DEFAULT: [main:1 @ foo]:309:0
				; CHECK-DEFAULT-NEXT: 2.1: 14
				; CHECK-DEFAULT-NEXT: 3: 15
				; CHECK-DEFAULT-NEXT: 3.1: 14 bar:14
				; CHECK-DEFAULT-NEXT: 3.2: 1
				; CHECK-DEFAULT-NEXT: !Attributes: 1
				; CHECK-DEFAULT-NEXT:[main:1 @ foo:3.1 @ bar]:84:0
				; CHECK-DEFAULT-NEXT: 1: 14
				; CHECK-DEFAULT-NEXT: !Attributes: 1

				; CHECK-PREINL: [foo]:309:0
				; CHECK-PREINL-NEXT: 2.1: 14
				; CHECK-PREINL-NEXT: 3: 15
				; CHECK-PREINL-NEXT: 3.1: 14 bar:14
				; CHECK-PREINL-NEXT: 3.2: 1
				; CHECK-PREINL-NEXT: !Attributes: 1
				; CHECK-PREINL-NEXT:[foo:3.1 @ bar]:84:0
				; CHECK-PREINL-NEXT: 1: 14
				; CHECK-PREINL-NEXT: !Attributes: 3

				; CHECK-NO-PREINL: [foo]:309:0
				; CHECK-NO-PREINL-NEXT: 2.1: 14
				; CHECK-NO-PREINL-NEXT: 3: 15
				; CHECK-NO-PREINL-NEXT: 3.1: 14 bar:14
				; CHECK-NO-PREINL-NEXT: 3.2: 1
				; CHECK-NO-PREINL-NEXT: !Attributes: 1
				; CHECK-NO-PREINL-NEXT:[bar]:84:0
				; CHECK-NO-PREINL-NEXT: 1: 14
				; CHECK-NO-PREINL-NEXT: !Attributes: 1

llvm/tools/llvm-profgen/CMakeLists.txt


	set(LLVM_LINK_COMPONENTS			set(LLVM_LINK_COMPONENTS
	AllTargetsDescs			AllTargetsDescs
	AllTargetsDisassemblers			AllTargetsDisassemblers
	AllTargetsInfos			AllTargetsInfos
	Core			Core
	MC			MC
				IPO
				[Inline comment] wmi: Besides the flags, does anything else in the patch require adding IPO as a dependency for llvm-profgen? IPO includes many unrelated things.
				[Inline comment] wenlei (author): That is for reusing the context tracker and the profiled call graph added in IPO. These two are shared between the sample loader and llvm-profgen, which hopefully helps keep the preinliner close to the compiler's inlining. I moved the profiled call graph into IPO so the compiler can use it too (we now use it in https://reviews.llvm.org/D99351).
	MCDisassembler			MCDisassembler
	Object			Object
	ProfileData			ProfileData
	Support			Support
	Symbolize			Symbolize
	)			)

	add_llvm_tool(llvm-profgen			add_llvm_tool(llvm-profgen
	llvm-profgen.cpp			llvm-profgen.cpp
	PerfReader.cpp			PerfReader.cpp
				CSPreInliner.cpp
	ProfiledBinary.cpp			ProfiledBinary.cpp
	ProfileGenerator.cpp			ProfileGenerator.cpp
	PseudoProbe.cpp			PseudoProbe.cpp
	)			)

llvm/tools/llvm-profgen/CSPreInliner.h

This file was added.

				//===-- CSPreInliner.h - Profile guided preinliner ---------------- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_TOOLS_LLVM_PROFGEN_PGOINLINEADVISOR_H
				#define LLVM_TOOLS_LLVM_PROFGEN_PGOINLINEADVISOR_H

				#include "llvm/ADT/PriorityQueue.h"
				#include "llvm/ProfileData/ProfileCommon.h"
				#include "llvm/ProfileData/SampleProf.h"
				#include "llvm/Transforms/IPO/ProfiledCallGraph.h"
				#include "llvm/Transforms/IPO/SampleContextTracker.h"

				using namespace llvm;
				using namespace sampleprof;

				namespace llvm {
				namespace sampleprof {

				// Inline candidate seen from profile
				struct ProfiledInlineCandidate {
				ProfiledInlineCandidate(const FunctionSamples *Samples, uint64_t Count)
				: CalleeSamples(Samples), CallsiteCount(Count),
				SizeCost(Samples->getBodySamples().size()) {}
				// Context-sensitive function profile for inline candidate
				const FunctionSamples *CalleeSamples;
				// Call site count for an inline candidate
				// TODO: make sure entry count for context profile and call site
				// target count for corresponding call are consistent.
				uint64_t CallsiteCount;
				// Size proxy for function under particular call context.
				// TODO: use post-inline callee size from debug info.
				uint64_t SizeCost;
				};

				// Inline candidate comparer using call site weight
				struct ProfiledCandidateComparer {
				bool operator()(const ProfiledInlineCandidate &LHS,
				const ProfiledInlineCandidate &RHS) {
				if (LHS.CallsiteCount != RHS.CallsiteCount)
				return LHS.CallsiteCount < RHS.CallsiteCount;

				if (LHS.SizeCost != RHS.SizeCost)
				return LHS.SizeCost > RHS.SizeCost;

				// Tie breaker using GUID so we have stable/deterministic inlining order
				assert(LHS.CalleeSamples && RHS.CalleeSamples &&
				"Expect non-null FunctionSamples");
				return LHS.CalleeSamples->getGUID(LHS.CalleeSamples->getName()) <
				RHS.CalleeSamples->getGUID(RHS.CalleeSamples->getName());
				}
				};

				using ProfiledCandidateQueue =
				PriorityQueue<ProfiledInlineCandidate, std::vector<ProfiledInlineCandidate>,
				ProfiledCandidateComparer>;

				// Pre-compilation inliner based on context-sensitive profile.
				// The PreInliner estimates inline decisions using hotness from the profile
				// and cost estimation from machine code size. It helps merge context
				// profiles globally and achieves better post-inline profile quality, which
				// otherwise won't be possible for ThinLTO. It also reduces context profile
				// size by only keeping contexts that are estimated to be inlined.
				class CSPreInliner {
				public:
				CSPreInliner(StringMap<FunctionSamples> &Profiles, uint64_t HotThreshold,
				uint64_t ColdThreshold);
				void run();

				private:
				bool getInlineCandidates(ProfiledCandidateQueue &CQueue,
				const FunctionSamples *FCallerContextSamples);
				std::vector<StringRef> buildTopDownOrder();
				void processFunction(StringRef Name);
				bool shouldInline(ProfiledInlineCandidate &Candidate);
				SampleContextTracker ContextTracker;
				StringMap<FunctionSamples> &ProfileMap;

				// Count thresholds to answer isHotCount and isColdCount queries.
				// Mirrors the threshold in ProfileSummaryInfo.
				uint64_t HotCountThreshold;
				uint64_t ColdCountThreshold;
				};

				} // end namespace sampleprof
				} // end namespace llvm

				#endif
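
The visiting order implied by `ProfiledCandidateComparer` can be checked outside LLVM. The following is a minimal sketch, not the patch's code: `Candidate`, `CandidateComparer`, and `visitOrder` are hypothetical stand-ins, `std::priority_queue` replaces `llvm::PriorityQueue`, and the GUID is assumed to be a precomputed integer rather than derived from `FunctionSamples`. It keeps the same comparison order as the header above: call site count first, then size, then GUID.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-in for ProfiledInlineCandidate; not the LLVM type.
struct Candidate {
  uint64_t CallsiteCount;
  uint64_t SizeCost;
  uint64_t Guid; // assumed precomputed; the patch derives it from the name
};

// Same comparison order as the comparer in the patch.
// The queue's top() is the element that compares greatest, so:
//  - a higher call site count is visited first,
//  - on equal counts, a smaller size is visited first,
//  - on equal count and size, the larger GUID is visited first.
struct CandidateComparer {
  bool operator()(const Candidate &LHS, const Candidate &RHS) const {
    if (LHS.CallsiteCount != RHS.CallsiteCount)
      return LHS.CallsiteCount < RHS.CallsiteCount;
    if (LHS.SizeCost != RHS.SizeCost)
      return LHS.SizeCost > RHS.SizeCost;
    return LHS.Guid < RHS.Guid;
  }
};

// Drain a queue and return the GUIDs in the order candidates are visited.
std::vector<uint64_t> visitOrder(const std::vector<Candidate> &Input) {
  std::priority_queue<Candidate, std::vector<Candidate>, CandidateComparer> Q(
      CandidateComparer(), Input);
  std::vector<uint64_t> Order;
  while (!Q.empty()) {
    Order.push_back(Q.top().Guid);
    Q.pop();
  }
  return Order;
}
```

The GUID tie-breaker carries no priority meaning; it only makes the order deterministic when count and size are equal.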

llvm/tools/llvm-profgen/CSPreInliner.cpp

This file was added.

				//===-- CSPreInliner.cpp - Profile guided preinliner -------------- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "CSPreInliner.h"
				#include "llvm/ADT/SCCIterator.h"
				#include <cstdint>
				#include <queue>

				#define DEBUG_TYPE "cs-preinliner"

				using namespace llvm;
				using namespace sampleprof;

				static cl::opt<bool> EnableCSPreInliner(
				"csspgo-preinliner", cl::Hidden, cl::init(false),
				cl::desc("Run a global pre-inliner to merge context profile based on "
				"estimated global top-down inline decisions"));

				// The switches specify inline thresholds used in SampleProfileLoader inlining.
				// TODO: the actual threshold to be tuned here because the size here is based
				// on machine code not LLVM IR.
				extern cl::opt<int> SampleHotCallSiteThreshold;
				extern cl::opt<int> SampleColdCallSiteThreshold;
				extern cl::opt<int> ProfileInlineGrowthLimit;
				extern cl::opt<int> ProfileInlineLimitMin;
				extern cl::opt<int> ProfileInlineLimitMax;

				static cl::opt<bool> SamplePreInlineReplay(
				"csspgo-replay-preinline", cl::Hidden, cl::init(false),
				cl::desc(
				"Replay previous inlining and adjust context profile accordingly"));

				CSPreInliner::CSPreInliner(StringMap<FunctionSamples> &Profiles,
				uint64_t HotThreshold, uint64_t ColdThreshold)
				: ContextTracker(Profiles), ProfileMap(Profiles),
				HotCountThreshold(HotThreshold), ColdCountThreshold(ColdThreshold) {}

				std::vector<StringRef> CSPreInliner::buildTopDownOrder() {
				std::vector<StringRef> Order;
				ProfiledCallGraph ProfiledCG(ProfileMap, ContextTracker);

				// Now that we have a profiled call graph, construct top-down order
				// by building up SCC and reversing SCC order.
				scc_iterator<ProfiledCallGraph *> I = scc_begin(&ProfiledCG);
				while (!I.isAtEnd()) {
				for (ProfiledCallGraphNode Node : I) {
				if (Node != ProfiledCG.getEntryNode())
				Order.push_back(Node->Name);
				}
				++I;
				}
				std::reverse(Order.begin(), Order.end());

				return Order;
				}

				bool CSPreInliner::getInlineCandidates(ProfiledCandidateQueue &CQueue,
				const FunctionSamples *CallerSamples) {
				assert(CallerSamples && "Expect non-null caller samples");
				[Inline comment] hoy: I'm wondering whether, in the future, functions without a profile should be considered so that broader inline decisions can be made regardless of callsite hotness.
				[Inline comment] wenlei (author): For pre-inlining, we need a callee profile; otherwise there's no profile to be adjusted regardless of whether we inline or not. For the caller profile, we currently require it to trigger pre-inline, but we could do pre-inline without a caller profile. Is that what you meant? Currently the compiler's inlining also requires a caller profile, so what we have here aligns with the compiler. (Not sure if I get the question; it's building the top-down order here, not actual pre-inlining.)
				[Inline comment] hoy: Sorry for the confusion. I actually meant to explore non-profiled callees (such as getters/setters) here and in `getInlineCandidates`, to mimic the situation where both those getters/setters and their hot callees are inlined into the current caller. The getter/setter inlining, though their callsites are not hot, is likely done in prelink CGSCC inlining. I was thinking about simulating the prelink inlining if possible. There's no such need if prelink inlining is disabled.
				[Inline comment] wenlei (author): What do you mean by non-profiled callees? If a function does not have a profile, there's no profile to be adjusted, so doing pre-inline or not doesn't matter. CGSCC inlining in prelink can happen, but since there's no profile for the callee, no profile needs to be adjusted and there's no count quality issue. Or are you suggesting looking at multiple levels of callees when evaluating a call site? That is orthogonal to whether a function has a profile.
				[Inline comment] hoy: Yeah, even if the first-level callee doesn't have a profile, the callee's callee could have a hot profile. Skipping such first-level callees also skips next-level callees. It's fine if first-level callees are not inlined in prelink; they will also not be inlined in the postlink sample loader. If such callees are inlined by prelink CGSCC, then next-level callees will likely be inlined by postlink FDO, which is a discrepancy from the llvm-profgen preinliner. Currently, if all functions have profiles, multi-level is naturally supported with the priority-based BFS processing by tweaking callsite costs. If any function in a call chain doesn't have a profile, BFS will stop at that level.
				[Inline comment] wenlei (author): We don't support multiple levels even with the priority-based BFS inliner, in the sense that we never look ahead to see if there's anything hot underneath a cold callee. A callee without a profile is just one example of a cold callee.
				[Inline comment] hoy: I see. Yes, callees without a profile are a case of cold callees. They shouldn't be an issue without CGSCC inlining sitting between the preinliner and the targeted postlink FDO inliner. With CGSCC inlining, we might need some sort of simulation for that, which might be quite different from the current top-down simulator.
				[Inline comment] wenlei (author): Yeah, I hope that CGSCC inlining will mostly only deal with cold/small function inlining for CSSPGO, in which case profile adjustment is less important, and hence preinline is less important (the extreme case of cold inlining is where we don't have a profile at all, and then pre-inline doesn't matter). If we actually find CGSCC inlining covering some hot inlining, I think it'd be worth looking into why the sample loader doesn't handle such cases. So I hope we don't have to do much for preinline estimation for the CGSCC inliner.
				[Inline comment] hoy: Sounds good. The current implementation looks good to me. Would like to see this moving forward and evolving close to the real inliner.

				// Ideally we want to consider everything a function calls, but as far as
				// context profile is concerned, only those frames that are children of
				// the current one in the trie are relevant. So we walk the trie instead of
				// call targets from the function profile.
				ContextTrieNode *CallerNode =
				ContextTracker.getContextFor(CallerSamples->getContext());

				bool HasNewCandidate = false;
				for (auto &Child : CallerNode->getAllChildContext()) {
				ContextTrieNode *CalleeNode = &Child.second;
				FunctionSamples *CalleeSamples = CalleeNode->getFunctionSamples();
				if (!CalleeSamples)
				continue;

				// Call site count is more reliable, so we look up the corresponding call
				// target profile in caller's context profile to retrieve call site count.
				uint64_t CalleeEntryCount = CalleeSamples->getEntrySamples();
				uint64_t CallsiteCount = 0;
				LineLocation Callsite = CalleeNode->getCallSiteLoc();
				if (auto CallTargets = CallerSamples->findCallTargetMapAt(Callsite)) {
				SampleRecord::CallTargetMap &TargetCounts = CallTargets.get();
				auto It = TargetCounts.find(CalleeSamples->getName());
				if (It != TargetCounts.end())
				CallsiteCount = It->second;
				}

				// TODO: call site and callee entry count should be mostly consistent, add
				// check for that.
				HasNewCandidate = true;
				CQueue.emplace(CalleeSamples, std::max(CallsiteCount, CalleeEntryCount));
				}

				return HasNewCandidate;
				}

				bool CSPreInliner::shouldInline(ProfiledInlineCandidate &Candidate) {
				// If replay inline is requested, simply follow the inline decision of the
				// profiled binary.
				if (SamplePreInlineReplay)
				return Candidate.CalleeSamples->getContext().hasAttribute(
				ContextWasInlined);
				[Inline comment] hoy: nit: a function call
				[Inline comment] wenlei (author): updated.

				// Adjust threshold based on call site hotness, only do this for callsite
				// prioritized inliner because otherwise cost-benefit check is done earlier.
				unsigned int SampleThreshold = SampleColdCallSiteThreshold;
				if (Candidate.CallsiteCount > HotCountThreshold)
				SampleThreshold = SampleHotCallSiteThreshold;

				// TODO: for small cold functions, we may inline them and we need to keep
				// context profile accordingly.
				if (Candidate.CallsiteCount < ColdCountThreshold)
				SampleThreshold = SampleColdCallSiteThreshold;

				return (Candidate.SizeCost < SampleThreshold);
				}

				void CSPreInliner::processFunction(const StringRef Name) {
				LLVM_DEBUG(dbgs() << "Process " << Name
				<< " for context-sensitive pre-inlining\n");

				FunctionSamples *FSamples = ContextTracker.getBaseSamplesFor(Name);
				if (!FSamples)
				return;

				// Use the number of lines/probes as proxy for function size for now.
				// TODO: retrieve accurate size from dwarf or binary instead.
				unsigned FuncSize = FSamples->getBodySamples().size();
				unsigned FuncFinalSize = FuncSize;
				unsigned SizeLimit = FuncSize * ProfileInlineGrowthLimit;
				SizeLimit = std::min(SizeLimit, (unsigned)ProfileInlineLimitMax);
				SizeLimit = std::max(SizeLimit, (unsigned)ProfileInlineLimitMin);

				ProfiledCandidateQueue CQueue;
				getInlineCandidates(CQueue, FSamples);

				while (!CQueue.empty() && FuncFinalSize < SizeLimit) {
				ProfiledInlineCandidate Candidate = CQueue.top();
				CQueue.pop();
				bool ShouldInline = false;
				if ((ShouldInline = shouldInline(Candidate))) {
				// We mark context as inlined as the corresponding context profile
				// won't be merged into that function's base profile.
				ContextTracker.markContextSamplesInlined(Candidate.CalleeSamples);
				Candidate.CalleeSamples->getContext().setAttribute(
				ContextShouldBeInlined);
				FuncFinalSize += Candidate.SizeCost;
				[Inline comment] hoy: Should this be `>=`?
				[Inline comment] wenlei (author): `>` is consistent with `SampleProfileLoader::shouldInlineCandidate`, though practically I don't think it matters.
				getInlineCandidates(CQueue, Candidate.CalleeSamples);
				}
				LLVM_DEBUG(dbgs() << (ShouldInline ? " Inlined" : " Outlined")
				<< " context profile for: "
				<< Candidate.CalleeSamples->getNameWithContext()
				<< " (callee size: " << Candidate.SizeCost
				<< ", call count:" << Candidate.CallsiteCount << ")\n");
				}

				LLVM_DEBUG({
				if (!CQueue.empty())
				dbgs() << " Inline candidates ignored due to size limit (inliner "
				"original size: "
				<< FuncSize << ", inliner final size: " << FuncFinalSize
				<< ", size limit: " << SizeLimit << ")\n";

				while (!CQueue.empty()) {
				ProfiledInlineCandidate Candidate = CQueue.top();
				[Inline comment] hoy: This currently only reflects the number of live/hot lines. It might be extended to use static size from dwarf/probe decoding or disassembling. Can you leave a TODO for this?
				[Inline comment] wenlei (author): TODO added.
				CQueue.pop();
				bool WasInlined =
				Candidate.CalleeSamples->getContext().hasAttribute(ContextWasInlined);
				dbgs() << " " << Candidate.CalleeSamples->getNameWithContext()
				<< " (candidate size:" << Candidate.SizeCost
				<< ", call count: " << Candidate.CallsiteCount << ", previously "
				<< (WasInlined ? "inlined)\n" : "not inlined)\n");
				}
				});
				}

				void CSPreInliner::run() {
				if (!EnableCSPreInliner)
				[Inline comment] wmi: The definition of ShouldInline is only used in LLVM_DEBUG. It may trigger a warning in release mode.
				[Inline comment] wenlei (author): Good point, let me check and adjust.
				[Inline comment] wenlei (author): It seems the assignment is treated as a use, so we don't get a warning in release. But we do get one for printProfileNames, which is now wrapped in a macro.
				return;

				#ifndef NDEBUG
				auto printProfileNames = [](StringMap<FunctionSamples> &Profiles,
				bool IsInput) {
				dbgs() << (IsInput ? "Input" : "Output") << " context-sensitive profiles ("
				<< Profiles.size() << " total):\n";
				for (auto &It : Profiles) {
				[Inline comment] wmi: If the candidate should not be inlined, I couldn't find where the context profile is merged into the base profile. Could you show me?
				[Inline comment] wenlei (author): The merge is done within getBaseSamplesFor on demand, same as how it's done in the compiler.
				[Inline comment] wmi: Ah, I see. markContextSamplesInlined marks those contexts which will be inlined. The rest are non-inlined and will be merged when getBaseSamplesFor is called.
				const FunctionSamples &Samples = It.second;
				dbgs() << " [" << Samples.getNameWithContext() << "] "
				<< Samples.getTotalSamples() << ":" << Samples.getHeadSamples()
				<< "\n";
				}
				};
				#endif

				LLVM_DEBUG(printProfileNames(ProfileMap, true));

				[Inline comment] wmi: Included in LLVM_DEBUG.
				[Inline comment] wenlei (author): This is all in LLVM_DEBUG on line 197. Do you mean we need a separate LLVM_DEBUG?
				[Inline comment] wmi: Ah, I missed that. Never mind.
				// Execute global pre-inliner to estimate a global top-down inline
				// decision and merge profiles accordingly. This helps with profile
				// merge for ThinLTO otherwise we won't be able to merge profiles back
				// to base profile across module/thin-backend boundaries.
				// It also helps better compress context profile to control profile
				// size, as we now only need context profile for functions going to
				// be inlined.
				for (StringRef FuncName : buildTopDownOrder()) {
				processFunction(FuncName);
				}

				// Not inlined context profiles are merged into its base, so we can
				// trim out such profiles from the output.
				std::vector<StringRef> ProfilesToBeRemoved;
				for (auto &It : ProfileMap) {
				SampleContext Context = It.second.getContext();
				if (!Context.isBaseContext() && !Context.hasState(InlinedContext)) {
				assert(Context.hasState(MergedContext) &&
				"Not inlined context profile should be merged already");
				ProfilesToBeRemoved.push_back(It.first());
				}
				}
				[Inline comment] wmi: printProfileNames is only used in LLVM_DEBUG. It needs to be wrapped in #ifndef NDEBUG.
				[Inline comment] wenlei (author): Good point, will do.

				for (StringRef ContextName : ProfilesToBeRemoved) {
				ProfileMap.erase(ContextName);
				}

				LLVM_DEBUG(printProfileNames(ProfileMap, false));
				}
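
The size budget in `processFunction` and the threshold choice in `shouldInline` reduce to a few lines of arithmetic. The following is a rough sketch under stated assumptions, not the patch's code: `sizeLimitFor` and `wouldInline` are hypothetical free functions, replay mode is omitted, and the constants are the `cl::init` defaults shown in this diff (growth limit 12, limit min 100, limit max 10000, hot threshold 3000, cold threshold 45).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Defaults copied from the cl::opt declarations in this diff.
constexpr unsigned GrowthLimit = 12;  // sample-profile-inline-growth-limit
constexpr unsigned LimitMin = 100;    // sample-profile-inline-limit-min
constexpr unsigned LimitMax = 10000;  // sample-profile-inline-limit-max

// Size budget for a caller: scale by the growth ratio, then clamp to
// [LimitMin, LimitMax], in the same order as processFunction.
unsigned sizeLimitFor(unsigned FuncSize) {
  unsigned Limit = FuncSize * GrowthLimit;
  Limit = std::min(Limit, LimitMax);
  return std::max(Limit, LimitMin);
}

// Hypothetical free-function form of shouldInline (replay mode omitted).
// HotSize/ColdSize default to sample-profile-hot-inline-threshold and
// sample-profile-cold-inline-threshold respectively.
bool wouldInline(uint64_t CallsiteCount, uint64_t SizeCost,
                 uint64_t HotCountThreshold, uint64_t ColdCountThreshold,
                 unsigned HotSize = 3000, unsigned ColdSize = 45) {
  // Start from the cold size threshold, raise it for hot call sites.
  unsigned Threshold = ColdSize;
  if (CallsiteCount > HotCountThreshold)
    Threshold = HotSize;
  // Cold call sites always fall back to the cold size threshold.
  if (CallsiteCount < ColdCountThreshold)
    Threshold = ColdSize;
  return SizeCost < Threshold;
}
```

Passing `HotSize = 0` makes `wouldInline` return false for every hot call site, which corresponds to the `-sample-profile-hot-inline-threshold=0` run in cs-preinline.test.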

llvm/tools/llvm-profgen/ProfileGenerator.h

//===-- ProfileGenerator.h - Profile Generator ------------------ C++ --===//		//===-- ProfileGenerator.h - Profile Generator ------------------ C++ --===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef LLVM_TOOLS_LLVM_PROGEN_PROFILEGENERATOR_H		#ifndef LLVM_TOOLS_LLVM_PROGEN_PROFILEGENERATOR_H
#define LLVM_TOOLS_LLVM_PROGEN_PROFILEGENERATOR_H		#define LLVM_TOOLS_LLVM_PROGEN_PROFILEGENERATOR_H
		#include "CSPreInliner.h"
#include "ErrorHandling.h"		#include "ErrorHandling.h"
#include "PerfReader.h"		#include "PerfReader.h"
#include "ProfiledBinary.h"		#include "ProfiledBinary.h"
#include "llvm/Analysis/ProfileSummaryInfo.h"		#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/ProfileData/SampleProfWriter.h"		#include "llvm/ProfileData/SampleProfWriter.h"
#include <memory>		#include <memory>

using namespace llvm;		using namespace llvm;
[154 lines not shown; context: `while (I <= MaxDedupSize) {`]
MaxDedupSize = std::min(static_cast<uint32_t>(End / 2), MaxDedupSize);		MaxDedupSize = std::min(static_cast<uint32_t>(End / 2), MaxDedupSize);
}		}
}		}

protected:		protected:
// Lookup or create FunctionSamples for the context		// Lookup or create FunctionSamples for the context
FunctionSamples &getFunctionProfileForContext(StringRef ContextId,		FunctionSamples &getFunctionProfileForContext(StringRef ContextId,
bool WasLeafInlined = false);		bool WasLeafInlined = false);
		// Post processing for profiles before writing out, such as merging
		// and trimming cold profiles, running preinliner on profiles.
		void postProcessProfiles();
// Merge cold context profile whose total sample is below threshold		// Merge cold context profile whose total sample is below threshold
// into base profile.		// into base profile.
void mergeAndTrimColdProfile(StringMap<FunctionSamples> &ProfileMap);		void mergeAndTrimColdProfile(StringMap<FunctionSamples> &ProfileMap);
void computeSummaryAndThreshold();		void computeSummaryAndThreshold();
void write(std::unique_ptr<SampleProfileWriter> Writer,		void write(std::unique_ptr<SampleProfileWriter> Writer,
StringMap<FunctionSamples> &ProfileMap) override;		StringMap<FunctionSamples> &ProfileMap) override;

		// Profile summary to answer isHotCount and isColdCount queries.
		std::unique_ptr<ProfileSummaryInfo> PSI;

private:		private:
// Helper function for updating body sample for a leaf location in		// Helper function for updating body sample for a leaf location in
// FunctionProfile		// FunctionProfile
void updateBodySamplesforFunctionProfile(FunctionSamples &FunctionProfile,		void updateBodySamplesforFunctionProfile(FunctionSamples &FunctionProfile,
const FrameLocation &LeafLoc,		const FrameLocation &LeafLoc,
uint64_t Count);		uint64_t Count);
void populateFunctionBodySamples(FunctionSamples &FunctionProfile,		void populateFunctionBodySamples(FunctionSamples &FunctionProfile,
const RangeSample &RangeCounters,		const RangeSample &RangeCounters,
ProfiledBinary *Binary);		ProfiledBinary *Binary);
void populateFunctionBoundarySamples(StringRef ContextId,		void populateFunctionBoundarySamples(StringRef ContextId,
FunctionSamples &FunctionProfile,		FunctionSamples &FunctionProfile,
const BranchSample &BranchCounters,		const BranchSample &BranchCounters,
ProfiledBinary *Binary);		ProfiledBinary *Binary);
void populateInferredFunctionSamples();		void populateInferredFunctionSamples();

// Profile summary to answer isHotCount and isColdCount queries.
std::unique_ptr<ProfileSummaryInfo> PSI;

public:		public:
// Deduplicate adjacent repeated context sequences up to a given sequence		// Deduplicate adjacent repeated context sequences up to a given sequence
// length. -1 means no size limit.		// length. -1 means no size limit.
static int32_t MaxCompressionSize;		static int32_t MaxCompressionSize;
};		};

using ProbeCounterMap = std::unordered_map<const PseudoProbe *, uint64_t>;		using ProbeCounterMap = std::unordered_map<const PseudoProbe *, uint64_t>;

[38 lines not shown]

llvm/tools/llvm-profgen/ProfileGenerator.cpp

[228 lines not shown]
		for (const auto &BI : BinarySampleCounters) {
}		}
}		}
// Fill in call site value sample for inlined calls and also use context to		// Fill in call site value sample for inlined calls and also use context to
// infer missing samples. Since we don't have call count for inlined		// infer missing samples. Since we don't have call count for inlined
// functions, we estimate it from inlinee's profile using the entry of the		// functions, we estimate it from inlinee's profile using the entry of the
// body sample.		// body sample.
populateInferredFunctionSamples();		populateInferredFunctionSamples();

// Compute hot/cold threshold based on profile. This will be used for cold		postProcessProfiles();
// context profile merging/trimming.
computeSummaryAndThreshold();
}		}

void CSProfileGenerator::updateBodySamplesforFunctionProfile(		void CSProfileGenerator::updateBodySamplesforFunctionProfile(
FunctionSamples &FunctionProfile, const FrameLocation &LeafLoc,		FunctionSamples &FunctionProfile, const FrameLocation &LeafLoc,
uint64_t Count) {		uint64_t Count) {
// Filter out invalid negative(int type) lineOffset		// Filter out invalid negative(int type) lineOffset
if (LeafLoc.second.LineOffset & 0x80000000)		if (LeafLoc.second.LineOffset & 0x80000000)
return;		return;
[139 lines not shown]
		CallerProfile.addCalledTargetSamples(
CalleeProfile.getContext().getNameWithoutContext(), EstimatedCallCount);		CalleeProfile.getContext().getNameWithoutContext(), EstimatedCallCount);
CallerProfile.addBodySamples(CallerLeafFrameLoc.second.LineOffset,		CallerProfile.addBodySamples(CallerLeafFrameLoc.second.LineOffset,
CallerLeafFrameLoc.second.Discriminator,		CallerLeafFrameLoc.second.Discriminator,
EstimatedCallCount);		EstimatedCallCount);
CallerProfile.addTotalSamples(EstimatedCallCount);		CallerProfile.addTotalSamples(EstimatedCallCount);
}		}
}		}

		void CSProfileGenerator::postProcessProfiles() {
		// Compute hot/cold threshold based on profile. This will be used for cold
		// context profile merging/trimming.
		computeSummaryAndThreshold();

		// Run global pre-inliner to adjust/merge context profile based on estimated
		// inline decisions.
		CSPreInliner(ProfileMap, PSI->getHotCountThreshold(),
		PSI->getColdCountThreshold())
		.run();

		mergeAndTrimColdProfile(ProfileMap);
		}
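The ordering in `postProcessProfiles` (thresholds first, then the pre-inliner, then cold merging) can be illustrated with a much simplified model of the last step. This is only a sketch under stated assumptions: `ContextCounts`, `leafName`, and `mergeColdContexts` are hypothetical names that do not exist in llvm-profgen, profiles are reduced to a single sample count, the `main:3 @ foo` context strings are illustrative, and the strict `<` comparison against the cold threshold is an assumption rather than the tool's actual rule.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for ProfileMap: context string -> total samples.
using ContextCounts = std::map<std::string, std::uint64_t>;

// Return the leaf function name of a context such as "main:3 @ foo".
// A context without " @ " is treated as a base (context-less) profile.
inline std::string leafName(const std::string &Context) {
  auto Pos = Context.rfind(" @ ");
  return Pos == std::string::npos ? Context : Context.substr(Pos + 3);
}

// Merge every non-base context whose count is below ColdThreshold into the
// base profile of its leaf function; keep all other profiles unchanged.
inline ContextCounts mergeColdContexts(const ContextCounts &In,
                                       std::uint64_t ColdThreshold) {
  ContextCounts Out;
  for (const auto &Item : In) {
    const std::string Leaf = leafName(Item.first);
    const bool IsBase = (Leaf == Item.first);
    if (!IsBase && Item.second < ColdThreshold)
      Out[Leaf] += Item.second; // cold context folded into base profile
    else
      Out[Item.first] += Item.second; // hot context or base profile kept
  }
  return Out;
}
```

In this model the threshold has to be known before any merging happens, which matches the reason `computeSummaryAndThreshold()` runs first in the function above.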

void CSProfileGenerator::computeSummaryAndThreshold() {		void CSProfileGenerator::computeSummaryAndThreshold() {
SampleProfileSummaryBuilder Builder(ProfileSummaryBuilder::DefaultCutoffs);		SampleProfileSummaryBuilder Builder(ProfileSummaryBuilder::DefaultCutoffs);
auto Summary = Builder.computeSummaryForProfiles(ProfileMap);		auto Summary = Builder.computeSummaryForProfiles(ProfileMap);
PSI.reset(new ProfileSummaryInfo(std::move(Summary)));		PSI.reset(new ProfileSummaryInfo(std::move(Summary)));
}		}

void CSProfileGenerator::mergeAndTrimColdProfile(		void CSProfileGenerator::mergeAndTrimColdProfile(
StringMap<FunctionSamples> &ProfileMap) {		StringMap<FunctionSamples> &ProfileMap) {
[43 lines not shown]
		for (const auto &I : BaseProfileMap) {
// as a new profile		// as a new profile
FunctionSamples &OrigProfile = getFunctionProfileForContext(I.getKey());		FunctionSamples &OrigProfile = getFunctionProfileForContext(I.getKey());
OrigProfile.merge(I.second);		OrigProfile.merge(I.second);
}		}
}		}

void CSProfileGenerator::write(std::unique_ptr<SampleProfileWriter> Writer,		void CSProfileGenerator::write(std::unique_ptr<SampleProfileWriter> Writer,
StringMap<FunctionSamples> &ProfileMap) {		StringMap<FunctionSamples> &ProfileMap) {
mergeAndTrimColdProfile(ProfileMap);
// Add bracket for context key to support different profile binary format		// Add bracket for context key to support different profile binary format
StringMap<FunctionSamples> CxtWithBracketPMap;		StringMap<FunctionSamples> CxtWithBracketPMap;
for (const auto &Item : ProfileMap) {		for (const auto &Item : ProfileMap) {
std::string ContextWithBracket = "[" + Item.first().str() + "]";		// After CSPreInliner the key of ProfileMap is no longer accurate for
		// context, use the context attached to function samples instead.
		hoy: Nit: context
		wenlei: updated.
		hoy: Trying to understand how keys can change. Do we only remove the profiles from `ProfileMap` once they are merged into base profiles? Are base profiles from the reuse of first non-inlined profiles?
		wenlei: Right, base profile can be from the first promoted context profile. We don't remove profiles from ProfileMap during context promotion and merging. ProfileMap owns the function profiles, and context promotion is done on the context trie only, which also updates the context for the profiles. The key of that map does not change, but the context for the profile in the map changes, so the key is no longer accurate. Fortunately, for profile writing we don't look at the keys, so as long as we fix this spot to avoid using the key, we are good. We could also update the map to keep keys accurate, but it involves moving profiles around, which has some cost.
		std::string ContextWithBracket =
		"[" + Item.second.getNameWithContext().str() + "]";
auto Ret = CxtWithBracketPMap.try_emplace(ContextWithBracket, Item.second);		auto Ret = CxtWithBracketPMap.try_emplace(ContextWithBracket, Item.second);
assert(Ret.second && "Must be a unique context");		assert(Ret.second && "Must be a unique context");
SampleContext FContext(Ret.first->first(), RawContext);		SampleContext FContext(Ret.first->first(), RawContext);
FunctionSamples &FProfile = Ret.first->second;		FunctionSamples &FProfile = Ret.first->second;
FContext.setAllAttributes(FProfile.getContext().getAllAttributes());		FContext.setAllAttributes(FProfile.getContext().getAllAttributes());
FProfile.setName(FContext.getNameWithContext(true));		FProfile.setName(FContext.getNameWithoutContext());
FProfile.setContext(FContext);		FProfile.setContext(FContext);
}		}
Writer->write(CxtWithBracketPMap);		Writer->write(CxtWithBracketPMap);
}		}
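The review thread above explains why `write` must not trust the keys of `ProfileMap` after the pre-inliner has run: promotion rewrites the context stored on each profile but leaves the map key alone. A minimal, self-contained model of that situation follows. Everything in it is hypothetical (`Profile`, `ProfileMapTy`, `promoteToBase`, `buildOutputMap` are not LLVM APIs, and a `std::map` stands in for `StringMap<FunctionSamples>`); it only shows the key going stale and the output key being derived from the attached context instead.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for FunctionSamples: it carries its own context.
struct Profile {
  std::string Context; // e.g. "main:3 @ foo" (illustrative format)
  unsigned long long TotalSamples;
};

// Stand-in for ProfileMap; each key is the context at insertion time.
using ProfileMapTy = std::map<std::string, Profile>;

// Model of context promotion: only the context stored on the profile is
// updated. The map key that owns the profile is deliberately left as-is.
inline void promoteToBase(Profile &P) {
  auto Pos = P.Context.rfind(" @ ");
  if (Pos != std::string::npos)
    P.Context = P.Context.substr(Pos + 3);
}

// Build the bracketed output map from the context attached to each profile,
// never from the (possibly stale) key.
inline std::map<std::string, Profile>
buildOutputMap(const ProfileMapTy &Profiles) {
  std::map<std::string, Profile> Out;
  for (const auto &Item : Profiles) {
    std::string Key = "[" + Item.second.Context + "]";
    bool Inserted = Out.emplace(Key, Item.second).second;
    assert(Inserted && "Must be a unique context");
    (void)Inserted; // silence unused warning when asserts are disabled
  }
  return Out;
}
```

After `promoteToBase`, the original map still holds the profile under its old key, while `buildOutputMap` emits it under `[foo]`. Keeping the keys accurate instead would mean re-inserting profiles under new keys, which is the cost wenlei mentions.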

// Helper function to extract context prefix string stack		// Helper function to extract context prefix string stack
// Extract context stack for reusing, leaf context stack will		// Extract context stack for reusing, leaf context stack will
// be added compressed while looking up function profile		// be added compressed while looking up function profile
[22 lines not shown]
		for (const auto &CI : BI.second) {
populateBodySamplesWithProbes(CI.second.RangeCounter, ContextStrStack,		populateBodySamplesWithProbes(CI.second.RangeCounter, ContextStrStack,
Binary);		Binary);
// Fill in boundary samples for a call probe		// Fill in boundary samples for a call probe
populateBoundarySamplesWithProbes(CI.second.BranchCounter,		populateBoundarySamplesWithProbes(CI.second.BranchCounter,
ContextStrStack, Binary);		ContextStrStack, Binary);
}		}
}		}

// Compute hot/cold threshold based on profile. This will be used for cold		postProcessProfiles();
// context profile merging/trimming.
computeSummaryAndThreshold();
}		}

void PseudoProbeCSProfileGenerator::extractProbesFromRange(		void PseudoProbeCSProfileGenerator::extractProbesFromRange(
const RangeSample &RangeCounter, ProbeCounterMap &ProbeCounter,		const RangeSample &RangeCounter, ProbeCounterMap &ProbeCounter,
ProfiledBinary *Binary) {		ProfiledBinary *Binary) {
RangeSample Ranges;		RangeSample Ranges;
findDisjointRanges(Ranges, RangeCounter);		findDisjointRanges(Ranges, RangeCounter);
for (const auto &Range : Ranges) {		for (const auto &Range : Ranges) {
[154 lines not shown]


Diff 333915
