This is an archive of the discontinued LLVM Phabricator instance.

[SampleFDO] Don't let inliner treat warm callsite with inline instance in the profile as cold
ClosedPublic

Authored by wmi on Apr 6 2018, 9:16 AM.

Details

Reviewers

danielcdh
davidxl
javed.absar

Commits

rG0c2f6be662d8: [SampleFDO] Don't treat warm callsite with inline instance in the profile as…
rL332058: [SampleFDO] Don't treat warm callsite with inline instance in the profile as…

Summary

We found a performance issue in the current SampleFDO while triaging a regression. For a callsite with an inline instance in the profile, even if the hot callsite inliner cannot inline it, it may still execute often enough that the regular inliner should not later treat it as cold. However, currently, if such a callsite is not inlined by the hot callsite inliner, and the BB containing the callsite gets no samples from its other instructions, the callsite has no profile metadata annotated. In the regular inliner's cost analysis, a callsite with no profile annotated whose caller has profile information is treated as cold.

The fix: warm callsites that lack a profile because they were inlined in the profile are still left without profile metadata. Other callsites whose parent BBs get no samples are explicitly annotated with a profile count of 0 (the profile metadata is not omitted). In the regular inliner's cost analysis, a callsite with no profile annotated is no longer treated as cold; we treat a callsite as cold only when its profile count exists and is less than the cold cutoff value.
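The change in cold classification described above can be sketched as follows. This is a minimal illustration, not the code from the patch: `CallsiteProfile`, `isColdBefore`, and `isColdAfter` are hypothetical names, and the exact comparison at the cutoff boundary is an assumption taken from the wording "less than cold cutoff value".

```cpp
#include <cstdint>

// Hypothetical stand-in for a callsite's profile annotation: either no
// count at all, or an explicit count (which may be 0).
struct CallsiteProfile {
  bool hasCount;
  uint64_t count;
};

// Before the fix, as the summary describes it: a callsite with no count
// inside a caller that has profile information is assumed to be cold.
bool isColdBefore(CallsiteProfile p, bool callerHasProfile,
                  uint64_t coldCutoff) {
  if (!p.hasCount)
    return callerHasProfile;
  return p.count < coldCutoff;
}

// After the fix: only an explicit count below the cutoff means cold.
// A missing count is no longer evidence of coldness.
bool isColdAfter(CallsiteProfile p, uint64_t coldCutoff) {
  return p.hasCount && p.count < coldCutoff;
}
```

With a cutoff of 100, a callsite with no annotation is cold under the old rule (when its caller is profiled) but not under the new one, while a callsite explicitly annotated with 0 stays cold.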

It fixes a 5% regression in the target application. I also evaluated it on two server benchmarks and found no performance difference there, but one server benchmark got a 2% reduction in size.

I also evaluated other alternatives for fixing the issue, such as relaxing the hotness criteria in the hot callsite inliner, but the results were not as good as with this strategy, probably because the regular inliner has more information about whether we should inline a warm callsite with a medium/small size callee.

Diff Detail

Repository: rL LLVM

Event Timeline

wmi created this revision. Apr 6 2018, 9:16 AM

Herald added subscribers: kristof.beyls, eraman, javed.absar, sanjoy. Apr 6 2018, 9:16 AM

Update a comment in the code.

davidxl added inline comments. Apr 11 2018, 9:32 AM

lib/Analysis/ProfileSummaryInfo.cpp
252 ↗	(On Diff #141366)	This can cause a problem if the caller function is newly added and has no profile associated with it -- all callsites there will be marked as cold.
lib/Transforms/IPO/SampleProfile.cpp
1305	Instead of skipping it, is it better to annotate it with a 'warm' profile count?

wmi added inline comments. Apr 11 2018, 10:00 AM

lib/Analysis/ProfileSummaryInfo.cpp
252 ↗	(On Diff #141366)	You are right. That is a problem. Need to figure out how to avoid it.
lib/Transforms/IPO/SampleProfile.cpp
1305	I considered that solution, but I was worried that if the callsite is annotated with a warm profile count, it will be treated as warm while the callsites inside the callee will still be treated as cold after the current callsite is inlined. The issue I am worried about is definitely more minor than the missing-profile issue for new functions introduced by source changes. I think this is still a good solution as long as the testing is fine, or unless we can come up with a better solution.
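The 'warm' profile count discussed in these inline comments is computed in the diff as `(PSI->getHotCountThreshold() + PSI->getColdCountThreshold()) >> 2`. The sketch below reproduces only that arithmetic on plain integers; `warmCount`, `Temp`, and `classify` are hypothetical names, and the three-way classification is an assumption for illustration rather than the inliner's actual logic.

```cpp
#include <cstdint>

// Same arithmetic as the expression in the diff: the sum of the two
// thresholds shifted right by 2, i.e. a quarter of the sum.
uint64_t warmCount(uint64_t hotThreshold, uint64_t coldThreshold) {
  return (hotThreshold + coldThreshold) >> 2;
}

// Hypothetical three-way classification, used only to show where the
// synthesized count lands relative to the two thresholds.
enum class Temp { Cold, Warm, Hot };

Temp classify(uint64_t count, uint64_t hotThreshold, uint64_t coldThreshold) {
  if (count >= hotThreshold)
    return Temp::Hot;
  if (count <= coldThreshold)
    return Temp::Cold;
  return Temp::Warm;
}
```

Because the shift divides by four rather than two, the result sits between the thresholds only when the hot threshold is roughly more than three times the cold one. For example, thresholds of 1000 and 10 give 252, which this sketch classifies as warm, while thresholds of 20 and 10 give 7, which falls at or below the cold threshold.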

Thanks for the fix!

I think maybe a preferred way to fix this is to change SampleProfileLoader::inlineHotFunctions to inline these "warm" inlined callsites early. The current algorithm uses callsiteIsHot, which compares inline instance's total count to the caller's total count, which could be misleading if the caller is super large/hot. A better algorithm should compare inline instance's total count to PSI to get a global hotness. In this way, if the profile annotator thinks a callsite is not hot, the later inliner should *not* even try to inline it. This makes the design cleaner and more stable. WDYT?
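The difference between the two hotness checks contrasted in this comment can be sketched as below. This is an illustration under assumptions: the function names, the percentage threshold, and the global hot count threshold are hypothetical values, not taken from the patch or from LLVM defaults. The first function mirrors the caller-relative percentage computed in callsiteIsHot; the second compares the same total against a global threshold, as a profile summary would provide.

```cpp
#include <cstdint>

// Caller-relative check, modeled on the percentage computed in
// callsiteIsHot: the callsite's share of the caller's total samples.
bool hotRelativeToCaller(uint64_t callsiteTotal, uint64_t callerTotal,
                         double percentThreshold) {
  if (callerTotal == 0 || callsiteTotal == 0)
    return false; // Avoid division by zero; zero samples is trivially cold.
  double percent =
      static_cast<double>(callsiteTotal) / static_cast<double>(callerTotal) *
      100.0;
  return percent >= percentThreshold;
}

// Global check: compare the callsite's total against a program-wide
// hot count threshold (hypothetical stand-in for a PSI query).
bool hotGlobally(uint64_t callsiteTotal, uint64_t hotCountThreshold) {
  return callsiteTotal >= hotCountThreshold;
}
```

With a very hot caller of 10,000,000 samples, a callsite with 50,000 samples is only 0.5% of the caller and fails a 1% relative check, even though it clears a global threshold of 10,000. This is the "misleading if the caller is super large/hot" case.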

In D45377#1068438, @danielcdh wrote:

Thanks for the fix!

I think maybe a preferred way to fix this is to change SampleProfileLoader::inlineHotFunctions to inline these "warm" inlined callsites early. The current algorithm uses callsiteIsHot, which compares inline instance's total count to the caller's total count, which could be misleading if the caller is super large/hot. A better algorithm should compare inline instance's total count to PSI to get a global hotness. In this way, if the profile annotator thinks a callsite is not hot, the later inliner should *not* even try to inline it. This makes the design cleaner and more stable. WDYT?

I tried the idea of computing the inline instance's total count divided by its BB count and comparing the result to the PSI hot threshold. That improved the regression benchmark but did not recover the whole regression. That is why I chose to keep the current callsiteIsHot check in the early inliner unchanged: I guessed the regular inliner may be in a better position to decide whether to inline such a warm/medium size callsite.

Tried David's suggestion and found the tests were good. The original regression for the target benchmark was recovered and we even got a little improvement. Another two server benchmarks had no performance change.

Patch was updated accordingly.

In D45377#1068853, @wmi wrote:

I tried the idea of computing the inline instance's total count divided by its BB count and comparing the result to the PSI hot threshold. That improved the regression benchmark but did not recover the whole regression. That is why I chose to keep the current callsiteIsHot check in the early inliner unchanged: I guessed the regular inliner may be in a better position to decide whether to inline such a warm/medium size callsite.

I suppose the regression comes from iterative-AutoFDO?

The problem with letting the regular inliner handle warm callsites is that the callee may have its profile missing if it is fully inlined. Maybe instead of comparing total_count/num_callee_bb to the hot threshold, just compare total_count to the hot threshold? I agree this may increase code size a little, but it should not be worse than the previous AFDO binary?

In D45377#1068900, @danielcdh wrote:

I suppose the regression comes from iterative-AutoFDO?

It is possible. Because it is only about a 1% regression left uncovered by the change, I don't have a good way to measure exactly where it comes from. And 1% is within the fluctuation range the target benchmark allows.

The problem with letting the regular inliner handle warm callsites is that the callee may have its profile missing if it is fully inlined. Maybe instead of comparing total_count/num_callee_bb to the hot threshold, just compare total_count to the hot threshold? I agree this may increase code size a little, but it should not be worse than the previous AFDO binary?

Yes, that is the same concern I had in my reply to David's suggestion, but the result seems fine. I can measure your suggested approach and see how it looks.

I tested the solution of comparing total_count to the hot threshold. For the two server benchmarks, performance did not change. But for the regressed benchmark, it is a little worse than comparing total_count/num_callee_bb to the hot threshold -- in two of my three runs the regression was larger than the fluctuation range the target benchmark allows. I know it is possible that some other side effect is at play here, but for now I don't have a detailed perf profile to find out.

Herald added a subscriber: chrib. Apr 18 2018, 8:42 AM

In D45377#1071031, @wmi wrote:

I tested the solution of comparing total_count to the hot threshold. For the two server benchmarks, performance did not change. But for the regressed benchmark, it is a little worse than comparing total_count/num_callee_bb to the hot threshold -- in two of my three runs the regression was larger than the fluctuation range the target benchmark allows. I know it is possible that some other side effect is at play here, but for now I don't have a detailed perf profile to find out.

OK, I found out why comparing total_count to the hot threshold didn't recover the regression. It is indeed caused by a side effect. The different inlining disabled a jump threading, which in turn prevented a block of code from being sunk into a cold region by machine sinking. This led to the regression. The patch in https://reviews.llvm.org/D46275 can fix the issue in jump threading. With D46275 applied, comparing total_count to the hot threshold recovers all of the regression and even brings a small improvement for the benchmark.

I will update the patch using the solution of comparing total_count to hot threshold.
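The two candidate heuristics compared in this thread differ only in whether the inline instance's total count is averaged over the callee's basic blocks before the comparison. A minimal sketch, assuming hypothetical function names and plain integer inputs in place of FunctionSamples and PSI queries:

```cpp
#include <cassert>
#include <cstdint>

// Chosen variant: compare the inline instance's total count directly
// against the hot threshold.
bool hotByTotal(uint64_t totalCount, uint64_t hotThreshold) {
  return totalCount >= hotThreshold;
}

// Earlier variant: average the total count over the callee's basic
// blocks first (total_count/num_callee_bb in the discussion).
bool hotByPerBlockAverage(uint64_t totalCount, uint64_t numCalleeBlocks,
                          uint64_t hotThreshold) {
  assert(numCalleeBlocks > 0 && "callee must have at least one block");
  return totalCount / numCalleeBlocks >= hotThreshold;
}
```

The per-block average is the stricter test: a callee with 12,000 total samples spread over 4 blocks passes a threshold of 5,000 by total count but not by average, so the total-count variant inlines more callsites early, which matches the concern about a small code size increase.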

Herald added a reviewer: javed.absar. Apr 30 2018, 10:51 AM

Update the patch to use the solution of comparing total count to hot cutoff threshold.

danielcdh added inline comments. Apr 30 2018, 6:41 PM

lib/Transforms/IPO/SampleProfile.cpp
372	In what situations will PSI be nullptr? If none, then please assert it instead. Also, I think this will override the later (PercentSamples >= SampleProfileHotThreshold) heuristic, and we should remove that flag.

wmi added inline comments. Apr 30 2018, 8:48 PM

lib/Transforms/IPO/SampleProfile.cpp
372	PSI will not be nullptr. Will add an assertion. It is possible for a callsite's CallsiteTotalSamples to be less than the hot cutoff threshold while its PercentSamples is still larger than SampleProfileHotThreshold. My original plan was that if a callsite is currently inlined, the new heuristic would still keep it. But I checked where SampleProfileHotThreshold is used and found it is also used to populate the InlinedGUIDs set. To keep that simple and consistent, as you suggest, I may remove SampleProfileHotThreshold and the related heuristic.

Remove SampleProfileHotThreshold. The benchmarks showed no regressions. I am now testing the iterative AFDO result.

Iterative AFDO result is comparable with AFDO result.

Ping.

danielcdh accepted this revision. May 10 2018, 1:09 PM

This revision is now accepted and ready to land. May 10 2018, 1:09 PM

Closed by commit rL332058: [SampleFDO] Don't treat warm callsite with inline instance in the profile as… (authored by wmi). May 10 2018, 4:06 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Transforms/IPO/SampleProfile.cpp	104 lines
test/Transforms/SampleProfile/Inputs/warm-inline-instance.prof	11 lines
test/Transforms/SampleProfile/warm-inline-instance.ll	116 lines

Diff 142643

lib/Transforms/IPO/SampleProfile.cpp

Show All 31 Lines
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringMap.h"		#include "llvm/ADT/StringMap.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/Twine.h"		#include "llvm/ADT/Twine.h"
#include "llvm/Analysis/AssumptionCache.h"		#include "llvm/Analysis/AssumptionCache.h"
#include "llvm/Analysis/InlineCost.h"		#include "llvm/Analysis/InlineCost.h"
#include "llvm/Analysis/LoopInfo.h"		#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/OptimizationRemarkEmitter.h"		#include "llvm/Analysis/OptimizationRemarkEmitter.h"
		#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/BasicBlock.h"		#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CFG.h"		#include "llvm/IR/CFG.h"
#include "llvm/IR/CallSite.h"		#include "llvm/IR/CallSite.h"
#include "llvm/IR/DebugInfoMetadata.h"		#include "llvm/IR/DebugInfoMetadata.h"
#include "llvm/IR/DebugLoc.h"		#include "llvm/IR/DebugLoc.h"
#include "llvm/IR/DiagnosticInfo.h"		#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/IR/Dominators.h"		#include "llvm/IR/Dominators.h"
▲ Show 20 Lines • Show All 133 Lines • ▼ Show 20 Lines	SampleProfileLoader(
StringRef Name, bool IsThinLTOPreLink,		StringRef Name, bool IsThinLTOPreLink,
std::function<AssumptionCache &(Function &)> GetAssumptionCache,		std::function<AssumptionCache &(Function &)> GetAssumptionCache,
std::function<TargetTransformInfo &(Function &)> GetTargetTransformInfo)		std::function<TargetTransformInfo &(Function &)> GetTargetTransformInfo)
: GetAC(std::move(GetAssumptionCache)),		: GetAC(std::move(GetAssumptionCache)),
GetTTI(std::move(GetTargetTransformInfo)), Filename(Name),		GetTTI(std::move(GetTargetTransformInfo)), Filename(Name),
IsThinLTOPreLink(IsThinLTOPreLink) {}		IsThinLTOPreLink(IsThinLTOPreLink) {}

bool doInitialization(Module &M);		bool doInitialization(Module &M);
bool runOnModule(Module &M, ModuleAnalysisManager *AM);		bool runOnModule(Module &M, ModuleAnalysisManager *AM,
		ProfileSummaryInfo *_PSI);

void dump() { Reader->dump(); }		void dump() { Reader->dump(); }

protected:		protected:
bool runOnFunction(Function &F, ModuleAnalysisManager *AM);		bool runOnFunction(Function &F, ModuleAnalysisManager *AM);
unsigned getFunctionLoc(Function &F);		unsigned getFunctionLoc(Function &F);
bool emitAnnotations(Function &F);		bool emitAnnotations(Function &F);
ErrorOr<uint64_t> getInstWeight(const Instruction &I);		ErrorOr<uint64_t> getInstWeight(const Instruction &I);
▲ Show 20 Lines • Show All 82 Lines • ▼ Show 20 Lines	protected:
bool ProfileIsValid = false;		bool ProfileIsValid = false;

/// \brief Flag indicating if the pass is invoked in ThinLTO compile phase.		/// \brief Flag indicating if the pass is invoked in ThinLTO compile phase.
///		///
/// In this phase, in annotation, we should not promote indirect calls.		/// In this phase, in annotation, we should not promote indirect calls.
/// Instead, we will mark GUIDs that needs to be annotated to the function.		/// Instead, we will mark GUIDs that needs to be annotated to the function.
bool IsThinLTOPreLink;		bool IsThinLTOPreLink;

		/// \brief Profile Summary Info computed from sample profile.
		ProfileSummaryInfo *PSI = nullptr;

/// \brief Total number of samples collected in this profile.		/// \brief Total number of samples collected in this profile.
///		///
/// This is the sum of all the samples collected in all the functions executed		/// This is the sum of all the samples collected in all the functions executed
/// at runtime.		/// at runtime.
uint64_t TotalCollectedSamples = 0;		uint64_t TotalCollectedSamples = 0;

		/// \brief For callsite which is inlined in the profile, we don't have
		/// sample for the call instruction. Even if hotCallsite inliner cannot
		/// inline such callsite because it is not hot enough, if only we think
		/// it is not cold, we will save such callsite in the WarmCallsWithoutProf
		/// set. We will pass the information to regular inliner to let it know
		/// such callsite is not cold too.
		DenseSet<Instruction *> WarmCallsWithoutProf;

/// \brief Optimization Remark Emitter used to emit diagnostic remarks.		/// \brief Optimization Remark Emitter used to emit diagnostic remarks.
OptimizationRemarkEmitter *ORE = nullptr;		OptimizationRemarkEmitter *ORE = nullptr;
};		};

class SampleProfileLoaderLegacyPass : public ModulePass {		class SampleProfileLoaderLegacyPass : public ModulePass {
public:		public:
// Class identification, replacement for typeinfo		// Class identification, replacement for typeinfo
static char ID;		static char ID;
Show All 18 Lines	public:
}		}

StringRef getPassName() const override { return "Sample profile pass"; }		StringRef getPassName() const override { return "Sample profile pass"; }
bool runOnModule(Module &M) override;		bool runOnModule(Module &M) override;

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<AssumptionCacheTracker>();		AU.addRequired<AssumptionCacheTracker>();
AU.addRequired<TargetTransformInfoWrapperPass>();		AU.addRequired<TargetTransformInfoWrapperPass>();
		AU.addRequired<ProfileSummaryInfoWrapperPass>();
}		}

private:		private:
SampleProfileLoader SampleLoader;		SampleProfileLoader SampleLoader;
AssumptionCacheTracker *ACT = nullptr;		AssumptionCacheTracker *ACT = nullptr;
TargetTransformInfoWrapperPass *TTIWP = nullptr;		TargetTransformInfoWrapperPass *TTIWP = nullptr;
};		};

Show All 14 Lines
/// If that fraction is larger than the default given by		/// If that fraction is larger than the default given by
/// SampleProfileHotThreshold, the callsite will be inlined again.		/// SampleProfileHotThreshold, the callsite will be inlined again.
static bool callsiteIsHot(const FunctionSamples *CallerFS,		static bool callsiteIsHot(const FunctionSamples *CallerFS,
const FunctionSamples *CallsiteFS) {		const FunctionSamples *CallsiteFS) {
if (!CallsiteFS)		if (!CallsiteFS)
return false; // The callsite was not inlined in the original binary.		return false; // The callsite was not inlined in the original binary.

uint64_t ParentTotalSamples = CallerFS->getTotalSamples();		uint64_t ParentTotalSamples = CallerFS->getTotalSamples();
if (ParentTotalSamples == 0)		if (ParentTotalSamples == 0)
return false; // Avoid division by zero.		return false; // Avoid division by zero.

uint64_t CallsiteTotalSamples = CallsiteFS->getTotalSamples();		uint64_t CallsiteTotalSamples = CallsiteFS->getTotalSamples();
if (CallsiteTotalSamples == 0)		if (CallsiteTotalSamples == 0)
return false; // Callsite is trivially cold.		return false; // Callsite is trivially cold.

double PercentSamples =		double PercentSamples =
(double)CallsiteTotalSamples / (double)ParentTotalSamples * 100.0;		(double)CallsiteTotalSamples / (double)ParentTotalSamples * 100.0;
▲ Show 20 Lines • Show All 98 Lines • ▼ Show 20 Lines	void SampleProfileLoader::clearFunctionData() {
VisitedEdges.clear();		VisitedEdges.clear();
EquivalenceClass.clear();		EquivalenceClass.clear();
DT = nullptr;		DT = nullptr;
PDT = nullptr;		PDT = nullptr;
LI = nullptr;		LI = nullptr;
Predecessors.clear();		Predecessors.clear();
Successors.clear();		Successors.clear();
CoverageTracker.clear();		CoverageTracker.clear();
		WarmCallsWithoutProf.clear();
}		}

#ifndef NDEBUG		#ifndef NDEBUG
/// \brief Print the weight of edge \p E on stream \p OS.		/// \brief Print the weight of edge \p E on stream \p OS.
///		///
/// \param OS Stream to emit the output to.		/// \param OS Stream to emit the output to.
/// \param E Edge to print.		/// \param E Edge to print.
void SampleProfileLoader::printEdgeWeight(raw_ostream &OS, Edge E) {		void SampleProfileLoader::printEdgeWeight(raw_ostream &OS, Edge E) {
▲ Show 20 Lines • Show All 283 Lines • ▼ Show 20 Lines	for (auto &BB : F) {
SmallVector<Instruction *, 10> Candidates;		SmallVector<Instruction *, 10> Candidates;
for (auto &I : BB.getInstList()) {		for (auto &I : BB.getInstList()) {
const FunctionSamples *FS = nullptr;		const FunctionSamples *FS = nullptr;
if ((isa<CallInst>(I) || isa<InvokeInst>(I)) &&		if ((isa<CallInst>(I) || isa<InvokeInst>(I)) &&
!isa<IntrinsicInst>(I) && (FS = findCalleeFunctionSamples(I))) {		!isa<IntrinsicInst>(I) && (FS = findCalleeFunctionSamples(I))) {
Candidates.push_back(&I);		Candidates.push_back(&I);
if (callsiteIsHot(Samples, FS))		if (callsiteIsHot(Samples, FS))
Hot = true;		Hot = true;
		else if (PSI && PSI->isHotCount(FS->getTotalSamples()))
		WarmCallsWithoutProf.insert(&I);
}		}
}		}
if (Hot) {		if (Hot) {
CIS.insert(CIS.begin(), Candidates.begin(), Candidates.end());		CIS.insert(CIS.begin(), Candidates.begin(), Candidates.end());
}		}
}		}
for (auto I : CIS) {		for (auto I : CIS) {
Function *CalledFunction = CallSite(I).getCalledFunction();		Function *CalledFunction = CallSite(I).getCalledFunction();
▲ Show 20 Lines • Show All 466 Lines • ▼ Show 20 Lines	void SampleProfileLoader::propagateWeights(Function &F) {
// Generate MD_prof metadata for every branch instruction using the		// Generate MD_prof metadata for every branch instruction using the
// edge weights computed during propagation.		// edge weights computed during propagation.
DEBUG(dbgs() << "\nPropagation complete. Setting branch weights\n");		DEBUG(dbgs() << "\nPropagation complete. Setting branch weights\n");
LLVMContext &Ctx = F.getContext();		LLVMContext &Ctx = F.getContext();
MDBuilder MDB(Ctx);		MDBuilder MDB(Ctx);
for (auto &BI : F) {		for (auto &BI : F) {
BasicBlock *BB = &BI;		BasicBlock *BB = &BI;

if (BlockWeights[BB]) {
for (auto &I : BB->getInstList()) {		for (auto &I : BB->getInstList()) {
if (!isa<CallInst>(I) && !isa<InvokeInst>(I))		if (!isa<CallInst>(I) && !isa<InvokeInst>(I))
continue;		continue;
		// If there is no sample in the BB, and the callsite is not in
		// WarmCallsWithoutProf set, no annotation will be done.
		if (BlockWeights[BB] == 0 && !WarmCallsWithoutProf.count(&I))
		continue;

CallSite CS(&I);		CallSite CS(&I);
if (!CS.getCalledFunction()) {		if (!CS.getCalledFunction()) {
const DebugLoc &DLoc = I.getDebugLoc();		const DebugLoc &DLoc = I.getDebugLoc();
if (!DLoc)		if (!DLoc)
continue;		continue;
const DILocation *DIL = DLoc;		const DILocation *DIL = DLoc;
uint32_t LineOffset = FunctionSamples::getOffset(DIL);		uint32_t LineOffset = FunctionSamples::getOffset(DIL);
uint32_t Discriminator = DIL->getBaseDiscriminator();		uint32_t Discriminator = DIL->getBaseDiscriminator();

const FunctionSamples *FS = findFunctionSamples(I);		const FunctionSamples *FS = findFunctionSamples(I);
if (!FS)		if (!FS)
continue;		continue;
auto T = FS->findCallTargetMapAt(LineOffset, Discriminator);		auto T = FS->findCallTargetMapAt(LineOffset, Discriminator);
if (!T || T.get().empty())		if (!T || T.get().empty())
continue;		continue;
SmallVector<InstrProfValueData, 2> SortedCallTargets =		SmallVector<InstrProfValueData, 2> SortedCallTargets =
SortCallTargets(T.get());		SortCallTargets(T.get());
uint64_t Sum;		uint64_t Sum;
findIndirectCallFunctionSamples(I, Sum);		findIndirectCallFunctionSamples(I, Sum);
annotateValueSite(*I.getParent()->getParent()->getParent(), I,		annotateValueSite(*I.getParent()->getParent()->getParent(), I,
SortedCallTargets, Sum, IPVK_IndirectCallTarget,		SortedCallTargets, Sum, IPVK_IndirectCallTarget,
SortedCallTargets.size());		SortedCallTargets.size());
} else if (!dyn_cast<IntrinsicInst>(&I)) {		} else if (!dyn_cast<IntrinsicInst>(&I)) {
		uint64_t Weight = BlockWeights[BB];
		// If weight of the BB is 0 and the callsite is in WarmCallsWithoutProf
		// set, annotate the callsite with a warm number.
		if (BlockWeights[BB] == 0 && WarmCallsWithoutProf.count(&I))
		Weight =
		(PSI->getHotCountThreshold() + PSI->getColdCountThreshold()) >> 2;
SmallVector<uint32_t, 1> Weights;		SmallVector<uint32_t, 1> Weights;
Weights.push_back(BlockWeights[BB]);		Weights.push_back(Weight);
I.setMetadata(LLVMContext::MD_prof, MDB.createBranchWeights(Weights));		I.setMetadata(LLVMContext::MD_prof, MDB.createBranchWeights(Weights));
}		}
}		}
}
TerminatorInst *TI = BB->getTerminator();		TerminatorInst *TI = BB->getTerminator();
if (TI->getNumSuccessors() == 1)		if (TI->getNumSuccessors() == 1)
continue;		continue;
if (!isa<BranchInst>(TI) && !isa<SwitchInst>(TI))		if (!isa<BranchInst>(TI) && !isa<SwitchInst>(TI))
continue;		continue;

DebugLoc BranchLoc = TI->getDebugLoc();		DebugLoc BranchLoc = TI->getDebugLoc();
DEBUG(dbgs() << "\nGetting weights for branch at line "		DEBUG(dbgs() << "\nGetting weights for branch at line "
▲ Show 20 Lines • Show All 196 Lines • ▼ Show 20 Lines
}		}

char SampleProfileLoaderLegacyPass::ID = 0;		char SampleProfileLoaderLegacyPass::ID = 0;

INITIALIZE_PASS_BEGIN(SampleProfileLoaderLegacyPass, "sample-profile",		INITIALIZE_PASS_BEGIN(SampleProfileLoaderLegacyPass, "sample-profile",
"Sample Profile loader", false, false)		"Sample Profile loader", false, false)
INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)		INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)		INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
		INITIALIZE_PASS_DEPENDENCY(ProfileSummaryInfoWrapperPass)
INITIALIZE_PASS_END(SampleProfileLoaderLegacyPass, "sample-profile",		INITIALIZE_PASS_END(SampleProfileLoaderLegacyPass, "sample-profile",
"Sample Profile loader", false, false)		"Sample Profile loader", false, false)

bool SampleProfileLoader::doInitialization(Module &M) {		bool SampleProfileLoader::doInitialization(Module &M) {
auto &Ctx = M.getContext();		auto &Ctx = M.getContext();
auto ReaderOrErr = SampleProfileReader::create(Filename, Ctx);		auto ReaderOrErr = SampleProfileReader::create(Filename, Ctx);
if (std::error_code EC = ReaderOrErr.getError()) {		if (std::error_code EC = ReaderOrErr.getError()) {
std::string Msg = "Could not open profile: " + EC.message();		std::string Msg = "Could not open profile: " + EC.message();
Ctx.diagnose(DiagnosticInfoSampleProfile(Filename, Msg));		Ctx.diagnose(DiagnosticInfoSampleProfile(Filename, Msg));
return false;		return false;
}		}
Reader = std::move(ReaderOrErr.get());		Reader = std::move(ReaderOrErr.get());
ProfileIsValid = (Reader->read() == sampleprof_error::success);		ProfileIsValid = (Reader->read() == sampleprof_error::success);
return true;		return true;
}		}

ModulePass *llvm::createSampleProfileLoaderPass() {		ModulePass *llvm::createSampleProfileLoaderPass() {
return new SampleProfileLoaderLegacyPass(SampleProfileFile);		return new SampleProfileLoaderLegacyPass(SampleProfileFile);
}		}

ModulePass *llvm::createSampleProfileLoaderPass(StringRef Name) {		ModulePass *llvm::createSampleProfileLoaderPass(StringRef Name) {
return new SampleProfileLoaderLegacyPass(Name);		return new SampleProfileLoaderLegacyPass(Name);
}		}

bool SampleProfileLoader::runOnModule(Module &M, ModuleAnalysisManager *AM) {		bool SampleProfileLoader::runOnModule(Module &M, ModuleAnalysisManager *AM,
		ProfileSummaryInfo *_PSI) {
if (!ProfileIsValid)		if (!ProfileIsValid)
return false;		return false;

		PSI = _PSI;
		if (M.getProfileSummary() == nullptr)
		M.setProfileSummary(Reader->getSummary().getMD(M.getContext()));

// Compute the total number of samples collected in this profile.		// Compute the total number of samples collected in this profile.
for (const auto &I : Reader->getProfiles())		for (const auto &I : Reader->getProfiles())
TotalCollectedSamples += I.second.getTotalSamples();		TotalCollectedSamples += I.second.getTotalSamples();

// Populate the symbol map.		// Populate the symbol map.
for (const auto &N_F : M.getValueSymbolTable()) {		for (const auto &N_F : M.getValueSymbolTable()) {
StringRef OrigName = N_F.getKey();		StringRef OrigName = N_F.getKey();
Function *F = dyn_cast<Function>(N_F.getValue());		Function *F = dyn_cast<Function>(N_F.getValue());
Show All 14 Lines	bool SampleProfileLoader::runOnModule(Module &M, ModuleAnalysisManager *AM,
}		}

bool retval = false;		bool retval = false;
for (auto &F : M)		for (auto &F : M)
if (!F.isDeclaration()) {		if (!F.isDeclaration()) {
clearFunctionData();		clearFunctionData();
retval |= runOnFunction(F, AM);		retval |= runOnFunction(F, AM);
}		}
if (M.getProfileSummary() == nullptr)
M.setProfileSummary(Reader->getSummary().getMD(M.getContext()));
return retval;		return retval;
}		}

bool SampleProfileLoaderLegacyPass::runOnModule(Module &M) {		bool SampleProfileLoaderLegacyPass::runOnModule(Module &M) {
ACT = &getAnalysis<AssumptionCacheTracker>();		ACT = &getAnalysis<AssumptionCacheTracker>();
TTIWP = &getAnalysis<TargetTransformInfoWrapperPass>();		TTIWP = &getAnalysis<TargetTransformInfoWrapperPass>();
return SampleLoader.runOnModule(M, nullptr);		ProfileSummaryInfo *PSI =
		getAnalysis<ProfileSummaryInfoWrapperPass>().getPSI();
		return SampleLoader.runOnModule(M, nullptr, PSI);
}		}

bool SampleProfileLoader::runOnFunction(Function &F, ModuleAnalysisManager *AM) {		bool SampleProfileLoader::runOnFunction(Function &F, ModuleAnalysisManager *AM) {
// Initialize the entry count to -1, which will be treated conservatively		// Initialize the entry count to -1, which will be treated conservatively
// by getEntryCount as the same as unknown (None). If we have samples this		// by getEntryCount as the same as unknown (None). If we have samples this
// will be overwritten in emitAnnotations.		// will be overwritten in emitAnnotations.
F.setEntryCount(ProfileCount(-1, Function::PCT_Real));		F.setEntryCount(ProfileCount(-1, Function::PCT_Real));
std::unique_ptr<OptimizationRemarkEmitter> OwnedORE;		std::unique_ptr<OptimizationRemarkEmitter> OwnedORE;
// ... 25 unchanged lines not shown; the lines below belong to SampleProfileLoaderPass::run ...
};		};

SampleProfileLoader SampleLoader(		SampleProfileLoader SampleLoader(
ProfileFileName.empty() ? SampleProfileFile : ProfileFileName,		ProfileFileName.empty() ? SampleProfileFile : ProfileFileName,
IsThinLTOPreLink, GetAssumptionCache, GetTTI);		IsThinLTOPreLink, GetAssumptionCache, GetTTI);

SampleLoader.doInitialization(M);		SampleLoader.doInitialization(M);

if (!SampleLoader.runOnModule(M, &AM))		ProfileSummaryInfo *PSI = &AM.getResult<ProfileSummaryAnalysis>(M);
		if (!SampleLoader.runOnModule(M, &AM, PSI))
return PreservedAnalyses::all();		return PreservedAnalyses::all();

return PreservedAnalyses::none();		return PreservedAnalyses::none();
}		}

test/Transforms/SampleProfile/Inputs/warm-inline-instance.prof

				main:2257150:0
				 2.1: 5553
				 3: 5391
				 3.1: foo:5860
				  0: 5279
				  1: 5279
				  2: 5279
				 4.1: goo:60
				  0: 20
				  1: 20
				  2: 20

test/Transforms/SampleProfile/warm-inline-instance.ll

				; RUN: opt < %s -sample-profile -sample-profile-file=%S/Inputs/warm-inline-instance.prof -sample-profile-inline-hot-threshold=1 -S | FileCheck %s
				; RUN: opt < %s -passes=sample-profile -sample-profile-file=%S/Inputs/warm-inline-instance.prof -sample-profile-inline-hot-threshold=1 -S | FileCheck %s

				@.str = private unnamed_addr constant [11 x i8] c"sum is %d\0A\00", align 1

				; Function Attrs: nounwind uwtable
				define i32 @foo(i32 %x, i32 %y) !dbg !4 {
				entry:
				%x.addr = alloca i32, align 4
				%y.addr = alloca i32, align 4
				store i32 %x, i32* %x.addr, align 4
				store i32 %y, i32* %y.addr, align 4
				%t0 = load i32, i32* %x.addr, align 4, !dbg !11
				%t1 = load i32, i32* %y.addr, align 4, !dbg !11
				%add = add nsw i32 %t0, %t1, !dbg !11
				ret i32 %add, !dbg !11
				}

				define i32 @goo(i32 %x, i32 %y) {
				entry:
				%x.addr = alloca i32, align 4
				%y.addr = alloca i32, align 4
				store i32 %x, i32* %x.addr, align 4
				store i32 %y, i32* %y.addr, align 4
				%t0 = load i32, i32* %x.addr, align 4, !dbg !11
				%t1 = load i32, i32* %y.addr, align 4, !dbg !11
				%add = add nsw i32 %t0, %t1, !dbg !11
				ret i32 %add, !dbg !11
				}

				; Function Attrs: uwtable
				define i32 @main() !dbg !7 {
				entry:
				%retval = alloca i32, align 4
				%s = alloca i32, align 4
				%i = alloca i32, align 4
				store i32 0, i32* %retval
				store i32 0, i32* %i, align 4, !dbg !12
				br label %while.cond, !dbg !13

				while.cond: ; preds = %if.end, %entry
				%t0 = load i32, i32* %i, align 4, !dbg !14
				%inc = add nsw i32 %t0, 1, !dbg !14
				store i32 %inc, i32* %i, align 4, !dbg !14
				%cmp = icmp slt i32 %t0, 400000000, !dbg !14
				br i1 %cmp, label %while.body, label %while.end, !dbg !14

				while.body: ; preds = %while.cond
				%t1 = load i32, i32* %i, align 4, !dbg !16
				%cmp1 = icmp ne i32 %t1, 100, !dbg !16
				br i1 %cmp1, label %if.then, label %if.else, !dbg !16

				if.then: ; preds = %while.body
				%t2 = load i32, i32* %i, align 4, !dbg !18
				%t3 = load i32, i32* %s, align 4, !dbg !18
				; The calls to @foo and @goo below are not hot callsites, so the AFDO early
				; inliner will not inline them.
				; The call to @foo below is a warm callsite, so it is annotated with a warm
				; weight and the regular inliner will not treat it as cold.
				; CHECK-LABEL: @main(
				; CHECK: call i32 @foo(i32 %t2, i32 %t3), !dbg !{{[0-9]+}}, !prof [[PROF_LOC:![0-9]+]]
				%call1 = call i32 @foo(i32 %t2, i32 %t3), !dbg !18
				store i32 %call1, i32* %s, align 4, !dbg !18
				br label %if.end, !dbg !18

				if.else: ; preds = %while.body
				; The basic block containing the call to @goo below gets no samples, so no profile is annotated.
				; CHECK: call i32 @goo(i32 2, i32 3), !dbg !{{[0-9]+}}
				; CHECK-NOT: !prof
				; CHECK-SAME: {{$}}
				%call2 = call i32 @goo(i32 2, i32 3), !dbg !26
				store i32 %call2, i32* %s, align 4, !dbg !20
				br label %if.end

				if.end: ; preds = %if.else, %if.then
				br label %while.cond, !dbg !22

				while.end: ; preds = %while.cond
				%t4 = load i32, i32* %s, align 4, !dbg !24
				%call3 = call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([11 x i8], [11 x i8]* @.str, i32 0, i32 0), i32 %t4), !dbg !24
				ret i32 0, !dbg !25
				}

				; CHECK: [[PROF_LOC]] = !{!"branch_weights", i32 2695}
				declare i32 @printf(i8*, ...) #2

				!llvm.dbg.cu = !{!0}
				!llvm.module.flags = !{!8, !9}
				!llvm.ident = !{!10}

				!0 = distinct !DICompileUnit(language: DW_LANG_C_plus_plus, producer: "clang version 3.5 ", isOptimized: false, emissionKind: NoDebug, file: !1, enums: !2, retainedTypes: !2, globals: !2, imports: !2)
				!1 = !DIFile(filename: "calls.cc", directory: ".")
				!2 = !{}
				!4 = distinct !DISubprogram(name: "foo", line: 3, isLocal: false, isDefinition: true, virtualIndex: 6, flags: DIFlagPrototyped, isOptimized: false, unit: !0, scopeLine: 3, file: !1, scope: !5, type: !6, variables: !2)
				!5 = !DIFile(filename: "calls.cc", directory: ".")
				!6 = !DISubroutineType(types: !2)
				!7 = distinct !DISubprogram(name: "main", line: 7, isLocal: false, isDefinition: true, virtualIndex: 6, flags: DIFlagPrototyped, isOptimized: false, unit: !0, scopeLine: 7, file: !1, scope: !5, type: !6, variables: !2)
				!8 = !{i32 2, !"Dwarf Version", i32 4}
				!9 = !{i32 1, !"Debug Info Version", i32 3}
				!10 = !{!"clang version 3.5 "}
				!11 = !DILocation(line: 4, scope: !4)
				!12 = !DILocation(line: 8, scope: !7)
				!13 = !DILocation(line: 9, scope: !7)
				!14 = !DILocation(line: 9, scope: !15)
				!15 = !DILexicalBlockFile(discriminator: 2, file: !1, scope: !7)
				!16 = !DILocation(line: 10, scope: !17)
				!17 = distinct !DILexicalBlock(line: 10, column: 0, file: !1, scope: !7)
				!18 = !DILocation(line: 10, scope: !19)
				!19 = !DILexicalBlockFile(discriminator: 2, file: !1, scope: !17)
				!20 = !DILocation(line: 10, scope: !21)
				!21 = !DILexicalBlockFile(discriminator: 4, file: !1, scope: !17)
				!22 = !DILocation(line: 10, scope: !23)
				!23 = !DILexicalBlockFile(discriminator: 6, file: !1, scope: !17)
				!24 = !DILocation(line: 11, scope: !7)
				!25 = !DILocation(line: 12, scope: !7)
				!26 = !DILocation(line: 11, scope: !19)
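
The rule this test exercises can be modeled with a small standalone sketch. This is not the LLVM implementation: `CallSiteCount`, `isColdCallSite`, and the cutoff value used below are hypothetical names and numbers chosen for illustration. It assumes, following the revision summary, that a callsite is treated as cold only when a profile count exists and is below the cold cutoff, so a callsite without any annotation is not classified as cold. It needs C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the !prof annotation on a call instruction.
// std::nullopt models "no profile metadata annotated".
using CallSiteCount = std::optional<std::uint64_t>;

// Assumed rule from the summary: cold only if a count is present AND it is
// below the cold cutoff. A missing count never makes a callsite cold.
bool isColdCallSite(const CallSiteCount &Count, std::uint64_t ColdCutoff) {
  return Count.has_value() && *Count < ColdCutoff;
}
```

With an illustrative cutoff of 100, `isColdCallSite(std::nullopt, 100)` is false, `isColdCallSite(CallSiteCount(0), 100)` is true, and `isColdCallSite(CallSiteCount(2695), 100)` is false, where 2695 is the weight this test expects on the call to @foo.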