
[CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation
ClosedPublic

Authored by wlei on Jan 5 2021, 12:16 PM.

Details

Summary

For CS profile generation, the call stack unwinding process is time-consuming since for each LBR entry we need linear time to generate the context (hashing, compression, string concatenation). This change speeds that up by grouping all the call frames within one LBR sample into a trie and aggregating the result (sample counter) on it, deferring context compression and string generation to the end of unwinding.

Specifically, it uses StackLeaf as the top frame on the stack and manipulates it (popping or pushing a trie node) dynamically during virtual unwinding, so that a raw sample can simply be recorded on the leaf node; the path from root to leaf then represents its calling context. At the end, it traverses the trie and generates the contexts on the fly.
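As a rough illustration of the scheme described above, here is a hedged sketch with illustrative names (FrameNode, FrameTrie, pushFrame, etc. are invented for this example, not the actual llvm-profgen code): samples are recorded on the current leaf during unwinding, and contexts are only materialized in one DFS at the end.

```cpp
#include <cstdint>
#include <map>
#include <memory>
#include <unordered_map>
#include <vector>

// One trie node: a call frame address plus the samples recorded while this
// frame was the top of the virtually unwound stack.
struct FrameNode {
  uint64_t Address = 0;
  FrameNode *Parent = nullptr; // non-owning back-pointer
  uint64_t SampleCount = 0;
  std::unordered_map<uint64_t, std::unique_ptr<FrameNode>> Children;

  FrameNode *getOrCreateChild(uint64_t Addr) {
    auto &Slot = Children[Addr];
    if (!Slot) {
      Slot = std::make_unique<FrameNode>();
      Slot->Address = Addr;
      Slot->Parent = this;
    }
    return Slot.get();
  }
};

struct FrameTrie {
  FrameNode Root;          // dummy root; address 0 marks it
  FrameNode *Leaf = &Root; // current top-of-stack frame

  void pushFrame(uint64_t Addr) { Leaf = Leaf->getOrCreateChild(Addr); }
  void popFrame() {
    if (Leaf->Parent)
      Leaf = Leaf->Parent;
  }
  void recordSample(uint64_t Count) { Leaf->SampleCount += Count; }

  // Deferred context generation: one DFS at the end of unwinding emits each
  // root-to-leaf path (the calling context) with its aggregated count.
  void collect(const FrameNode &Node, std::vector<uint64_t> &Stack,
               std::map<std::vector<uint64_t>, uint64_t> &Out) const {
    if (Node.SampleCount)
      Out[Stack] += Node.SampleCount;
    for (const auto &Child : Node.Children) {
      Stack.push_back(Child.first);
      collect(*Child.second, Stack, Out);
      Stack.pop_back();
    }
  }
};
```

The point of the design is that the expensive per-context work (hashing, string building) happens once per trie path rather than once per LBR entry.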

Results:
Our internal branch shows about a 5X speed-up on some large workloads in the SPEC06 benchmark.

Diff Detail

Event Timeline

wlei created this revision.Jan 5 2021, 12:16 PM
wlei requested review of this revision.Jan 5 2021, 12:16 PM
Herald added a project: Restricted Project.Jan 5 2021, 12:16 PM
wlei retitled this revision from [CSSPGO][llvm-profgen] Aggregate sample on call frame trie to speed up profile generation to [CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation.Jan 5 2021, 3:08 PM
wlei edited the summary of this revision. (Show Details)
wlei added reviewers: wmi, davidxl, hoy, wenlei.
wmi added a comment.Jan 10 2021, 9:28 PM

This change speeds that up by grouping all the call frames within one LBR sample into a trie and aggregating the result (sample counter) on it.

A 5x speedup is a really impressive improvement. I am wondering whether there is call stack overlap between different LBR samples so you can have further grouping of call frames -- by reusing unwindState, you may also save some cost by reusing the frame trie. IIUC, although samples have been aggregated based on call stack, each LBR sample may have multiple call stacks inferred from unwindCall/unwindReturn. If there is call stack overlap between different LBR samples, you may be able to further group them.

llvm/tools/llvm-profgen/PerfReader.cpp
89

Use SmallVectorImpl<uint64_t>& as the parameter type instead of SmallVector<uint64_t, 16>&. There are some other places with the same issue.
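For context, here is a toy model of the SmallVector / SmallVectorImpl split the comment refers to (greatly simplified and written from scratch for illustration; the real LLVM ADT grows onto the heap): SmallVectorImpl is the size-agnostic base, SmallVector<T, N> carries the inline capacity, so a function taking SmallVectorImpl& works for callers using any N.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Size-agnostic base: knows about the buffer, not the inline capacity N.
template <typename T> class SmallVectorImpl {
  T *Data;
  size_t Size = 0;
  size_t Capacity;

protected:
  SmallVectorImpl(T *D, size_t C) : Data(D), Capacity(C) {}

public:
  void push_back(const T &V) {
    assert(Size < Capacity && "toy version does not grow");
    Data[Size++] = V;
  }
  size_t size() const { return Size; }
  T &operator[](size_t I) { return Data[I]; }
};

// Concrete type carrying the inline capacity as a template parameter.
template <typename T, unsigned N>
class SmallVector : public SmallVectorImpl<T> {
  T Buffer[N];

public:
  SmallVector() : SmallVectorImpl<T>(Buffer, N) {}
};

// The suggested signature style: accept the Impl base rather than
// SmallVector<uint64_t, 16>&, so the inline size stays a caller detail.
uint64_t sumStack(SmallVectorImpl<uint64_t> &Stack) {
  uint64_t Sum = 0;
  for (size_t I = 0; I < Stack.size(); ++I)
    Sum += Stack[I];
  return Sum;
}
```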

wlei updated this revision to Diff 316222.Jan 12 2021, 1:41 PM

Addressing Wei's feedback

wlei added a comment.Jan 12 2021, 1:42 PM
In D94110#2489390, @wmi wrote:

This change speeds that up by grouping all the call frames within one LBR sample into a trie and aggregating the result (sample counter) on it.

A 5x speedup is a really impressive improvement. I am wondering whether there is call stack overlap between different LBR samples so you can have further grouping of call frames -- by reusing unwindState, you may also save some cost by reusing the frame trie. IIUC, although samples have been aggregated based on call stack, each LBR sample may have multiple call stacks inferred from unwindCall/unwindReturn. If there is call stack overlap between different LBR samples, you may be able to further group them.

Yes, this is a very good idea, thanks. Further grouping of call frames might allow us to drop the aggregated samples map, i.e. we could do profile generation on the tree during the DFS.
One of my concerns is that if we maintain a whole tree of call frames, it might affect the performance of getOrCreateNode() during unwinding, because each tree node will have more children, increasing hash collisions. But if the degree of overlap is high, it will surely improve the performance. I will try this and do more experiments on the heavy workloads in SPEC.

llvm/tools/llvm-profgen/PerfReader.cpp
89

Fixed all those issues.

hoy added inline comments.Jan 19 2021, 9:16 AM
llvm/tools/llvm-profgen/PerfReader.cpp
108

Is the trie root the only case where this can be zero? Can we add an assert for that?

114

Typo: inliner

170

Would be nice to add a comment about why counts are reported on the parent frame for pseudo probes.

llvm/tools/llvm-profgen/PerfReader.h
213

Nit: how about naming it getOrCreateChildFrame?

226

FrameTrieRoot doesn't seem to be initialized anywhere in the constructor of UnwindState.

227

Nit: name it FrameCurrentLeaf?

wlei updated this revision to Diff 317772.Jan 19 2021, 9:59 PM

Addressing Hongtao's feedback

llvm/tools/llvm-profgen/PerfReader.cpp
108

Good point! Adding an assertion here would need a dummy-root parameter. How about changing this condition to isDummyRoot and adding the assertion inside getOrCreateChildFrame, so that we can also guarantee that only the root can have a zero address?

114

fixed

170

Good catch, comments added

llvm/tools/llvm-profgen/PerfReader.h
213

Sounds good!

226

I guess it will be initialized implicitly since it's not a pointer but a value. It will have a zero address and be treated as a dummy node. I changed the name to DummyTrieRoot to make it clearer.

227

Good point

wenlei added a comment.

In D94110#2489390, @wmi wrote:

This change speeds that up by grouping all the call frames within one LBR sample into a trie and aggregating the result (sample counter) on it.

A 5x speedup is a really impressive improvement. I am wondering whether there is call stack overlap between different LBR samples so you can have further grouping of call frames -- by reusing unwindState, you may also save some cost by reusing the frame trie. IIUC, although samples have been aggregated based on call stack, each LBR sample may have multiple call stacks inferred from unwindCall/unwindReturn. If there is call stack overlap between different LBR samples, you may be able to further group them.

Yeah, the aggregation was based on stack + LBR sample, so events with the same stack but different LBRs won't be aggregated today. Using a global trie would help defer context generation for each of the aggregations if they lead to shared contexts.

I think we could experiment with how helpful that is by checking how often we generate a context that's already in the context profile map when traversing the trie. If that happens very often, it suggests a global trie could save more.

Yes, this is a very good idea, thanks. Further grouping of call frames might allow us to drop the aggregated samples map, i.e. we could do profile generation on the tree during the DFS.

I think the aggregation is still worth keeping as it's probably still cheaper than the trie with hashing. But we will know for sure through experiments.

One of my concerns is that if we maintain a whole tree of call frames, it might affect the performance of getOrCreateNode() during unwinding, because each tree node will have more children, increasing hash collisions. But if the degree of overlap is high, it will surely improve the performance. I will try this and do more experiments on the heavy workloads in SPEC.

llvm/tools/llvm-profgen/PerfReader.cpp
123

Would be nice to have a unified representation for the frame stack and the probe stack, but without relying on reinterpret_cast of the probe pointer.

Using uint64_t and then relying on Binary->usePseudoProbes() to decide how to interpret the value seems less than ideal. How about doing this through template functions, with the differing parts handled by specialization helpers?
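The suggestion above could look roughly like this. This is a hedged sketch: the FrameStack/ProbeStack names follow the later discussion in this review, but the bodies and the buildContext helper are invented for illustration, not the actual patch.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Helper struct for address-based (dwarf) stacks: knows how to render
// a frame-address entry into a context string.
struct FrameStack {
  std::vector<uint64_t> Stack;
  void pushEntry(uint64_t FrameAddr) { Stack.push_back(FrameAddr); }
  std::string getContextStr() const {
    std::string Ctx;
    for (uint64_t Addr : Stack) {
      if (!Ctx.empty())
        Ctx += " @ ";
      Ctx += std::to_string(Addr); // real code would symbolize addresses
    }
    return Ctx;
  }
};

// Helper struct for pseudo-probe stacks: same interface, probe-specific
// interpretation of the entries.
struct ProbeStack {
  std::vector<uint32_t> Probes;
  void pushEntry(uint32_t ProbeId) { Probes.push_back(ProbeId); }
  std::string getContextStr() const {
    std::string Ctx;
    for (uint32_t Id : Probes) {
      if (!Ctx.empty())
        Ctx += " @ ";
      Ctx += "probe:" + std::to_string(Id);
    }
    return Ctx;
  }
};

// The traversal is written once; the stack type decides how entries are
// interpreted, so no reinterpret_cast keyed off usePseudoProbes() is needed.
template <typename StackT, typename EntryT>
std::string buildContext(const std::vector<EntryT> &Entries) {
  StackT S;
  for (EntryT E : Entries)
    S.pushEntry(E);
  return S.getContextStr();
}
```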

llvm/tools/llvm-profgen/PerfReader.h
204

What about naming it ProfiledFrame? (we have ProfiledBinary)

206

A raw pointer is used for the parent, but unique_ptr is used for the children. Would be good to keep them consistent.

211

This ctor can be merged into the one below with a default parameter.

227

nit: name them CurrentLeafFrame and DummyRootFrame?

281

A few more naming nits:

initFrameTrie? CallStackTrie suggests a trie whose nodes are call stacks.

appendFrame and popStackLeaf are functionally symmetric, so it would be good for them to follow similar names, e.g. pushFrame and popFrame, and then switchToFrame.

473

We "record" samples on the trie (recordRangeCount and recordBranchCount), then "collect" samples from the trie into context sample maps via a DFS traversal.

So we could have a more intuitive convention for readability:

  • ProfiledFrame::recordRangeCount
  • ProfiledFrame::recordBranchCount
  • VirtualUnwinder::collectSamplesFromFrame
  • VirtualUnwinder::collectSamplesFromFrameTrie
484–485

We can make this a private member since it's only a helper for getOrCreateCounter. (a few other functions can be private too).

With the latest version, do you see a similar speed-up for probe profiles and dwarf profiles?

wlei updated this revision to Diff 319102.Jan 25 2021, 1:17 PM

Addressing Wenlei's feedback: much refactoring work.

wlei added inline comments.Jan 25 2021, 1:30 PM
llvm/tools/llvm-profgen/PerfReader.cpp
123

Yes, changed to template functions and created two structs, FrameStack and ProbeStack, for this. Thanks!

llvm/tools/llvm-profgen/PerfReader.h
204

Good point

206

Oh, the child is a newly allocated instance, so we use a smart pointer for it, but the parent pointer can always point to a pre-allocated instance or null (for the root). Also, it seems making the parent a unique_ptr would cause a recursive unique_ptr destruction error.
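A minimal sketch of the ownership scheme being discussed (illustrative names, not the patch itself): children are owned via unique_ptr, while the parent is a non-owning raw pointer back to a node that outlives its children, so no ownership cycle can form.

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>

struct Frame {
  uint64_t Address;
  Frame *Parent; // non-owning back-pointer: the parent outlives the child
  std::unordered_map<uint64_t, std::unique_ptr<Frame>> Children; // owning

  explicit Frame(uint64_t Addr = 0, Frame *P = nullptr)
      : Address(Addr), Parent(P) {}

  // Returns the existing child for Addr, or creates one owned by this node.
  Frame *getOrCreateChildFrame(uint64_t Addr) {
    auto &Slot = Children[Addr];
    if (!Slot)
      Slot = std::make_unique<Frame>(Addr, this);
    return Slot.get();
  }
};
```

Destruction is then strictly top-down: destroying a node releases its children, and the raw Parent pointer never triggers a second destruction.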

211

Good catch!

281

Yeah, that's very good naming suggestions, thanks!

473

Cool, renamed

484–485

Yes, made all of them except unwind() private.

wlei added a comment.Jan 26 2021, 12:22 PM

I think the aggregation is still worth keeping as it's probably still cheaper than the trie with hashing. But we will know for sure through experiments.

With the latest version, do you see a similar speed-up for probe profiles and dwarf profiles?

I have implemented the global trie for the virtual unwinder part and evaluated the LBR trie against both the baseline and the global trie on probe profiles. See the chart below; only gobmk and sjeng are used here because the other benchmarks' virtual unwinder run times are very small (less than 1s).

The virtual unwinding time (s):

         No-Trie   LBR-Trie   Global-Trie
gobmk    8.31      3.6        4.11
sjeng    46.82     19.15      24.91

Speed-up:

         LBR-Trie vs No-Trie   Global-Trie vs No-Trie   Global-Trie vs LBR-Trie
gobmk    2.3                   2.02                     0.88
sjeng    2.44                  1.87                     0.77

Sum-up:

  • LBR-Trie and Global-Trie both give about a 2x speed-up over the baseline.
  • Global-Trie has a slight regression (about 10%) against LBR-Trie; as we discussed, this might be caused by hash overhead.
  • We didn't see the same speed-up as in our internal prototype; I guess that's because of the refinements made in previous patches, like removing the redundant string concatenation and reversal, and switching to a probe-based key.

Besides the timing, LBR-Trie seems better for readability, but Global-Trie makes it easier to detect shared contexts. I haven't tried deferring the context generation (which should give more speed-up for Global-Trie) or detecting shared contexts; I will try that.

CC: @wmi @hoy

wenlei added a comment.

Thanks for the quick experiment! Given that we don't see immediate speed up from global trie, I'm inclined to just use what you have in this patch, and defer further improvement for the future. What do you think?

llvm/tools/llvm-profgen/PerfReader.h
206

Definitely not unique_ptr for the parent, since multiple children would keep pointers to the same parent. I meant to say consistency between smart pointers vs. raw pointers; in the case of the parent, a shared_ptr?

229

FrameCurrentLeaf -> CurrentLeafFrame?

485–486

This private can be removed since there's one on line 456.

wlei added a comment.Jan 27 2021, 2:01 PM

Thanks for the quick experiment! Given that we don't see immediate speed up from global trie, I'm inclined to just use what you have in this patch, and defer further improvement for the future. What do you think?

Yeah, sounds good to me! It doesn't affect the functionality, and in the future we can try to improve it when we get to solving the shared-context issue.

llvm/tools/llvm-profgen/PerfReader.h
206

Just tried shared_ptr; it failed. It seems there is a cyclic destruction issue: when the parent object frees its children, each child's Parent shared_ptr is destroyed, and that shared_ptr in turn tries to free its parent object again, causing the failure. The code is like below (using this to initialize the Parent pointer):

std::shared_ptr<ProfiledFrame> Parent;
...
ProfiledFrame(uint64_t Addr = 0, ProfiledFrame *P = nullptr)
    : Address(Addr), Parent(P) {}
ProfiledFrame *getOrCreateChildFrame(uint64_t Address) {
  auto Ret = Children.emplace(
      Address, std::make_unique<ProfiledFrame>(Address, this));
  return Ret.first->second.get();
}
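The underlying trap can be shown in isolation. This is a hedged demo (Widget and hasSplitOwnership are invented for illustration): constructing a shared_ptr from a raw pointer obtained via this or get() creates a second, independent control block, so two ownership groups end up deleting the same object. A no-op deleter is used here so the demo itself does not double-free.

```cpp
#include <memory>

struct Widget {};

// Returns true when the two shared_ptrs were created with separate control
// blocks, i.e. neither ownership group sees the other's reference.
bool hasSplitOwnership() {
  auto Owner = std::make_shared<Widget>();
  // What constructing Parent from a raw pointer does: a brand-new control
  // block for an already-owned object (no-op deleter avoids the crash here).
  std::shared_ptr<Widget> Alias(Owner.get(), [](Widget *) {});
  return Owner.use_count() == 1 && Alias.use_count() == 1;
}
```

With real deleters, each control block would run the destructor once, hence the double free observed above; the usual fixes are std::enable_shared_from_this or, as done in this patch, a plain non-owning raw pointer.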
229

renamed!

485–486

Good catch!

wlei updated this revision to Diff 319683.Jan 27 2021, 2:41 PM

Addressing Wenlei's feedback

wenlei accepted this revision.Jan 27 2021, 2:48 PM

lgtm, thanks.

This revision is now accepted and ready to land.Jan 27 2021, 2:48 PM
wmi added a comment.Jan 27 2021, 10:35 PM

Sum-up:

  • LBR-Trie and Global-Trie both give about a 2x speed-up over the baseline.
  • Global-Trie has a slight regression (about 10%) against LBR-Trie; as we discussed, this might be caused by hash overhead.

Thanks for the experiment.
LBR-Trie is faster than Global-Trie. Does that mean there is not enough call stack overlap between different LBR samples? And could you elaborate on what the hash overhead is?

Thanks for the quick experiment! Given that we don't see immediate speed up from global trie, I'm inclined to just use what you have in this patch, and defer further improvement for the future. What do you think?

I assume the time in the table is in seconds -- 19.15 seconds for sjeng using LBR-Trie, and that is not very long. I agree with Wenlei that you can leave the improvement for the future.

wlei added a comment.Jan 28 2021, 11:04 AM

Thanks for the experiment.
LBR-Trie is faster than Global-Trie. Does that mean there is not enough call stack overlap between different LBR samples?

Thanks for the reminder. I just tried the experiment with aggregation on the call stack only; see the chart below. The columns show the size of the sample map aggregated by Callstack+LBR versus Callstack only.

         Callstack+LBR   Callstack   Ratio
gobmk    1387155         993516      1.40
sjeng    158270          133629      1.18
gcc      9281            7610        1.22

You can see there is only a 20%~40% additional saving if we use only the call stack for aggregation.

And could you elaborate on what the hash overhead is?

I meant using a global trie instead of the intra-LBR trie; in that case, every frame in the call stack and every branch in the LBR is a node in the trie.
The trie would then probably be very large, and each node might have a lot of children. We use an unordered_map to create or look up children; if the number of children is big, the unordered_map lookup might not be O(1). This is what I mean by the hash overhead. For the intra-LBR trie, we only have 16 entries, so the number of children is small and unordered_map lookup should be fine.

Thanks for the quick experiment! Given that we don't see immediate speed up from global trie, I'm inclined to just use what you have in this patch, and defer further improvement for the future. What do you think?

I assume the time in the table is in seconds -- 19.15 seconds for sjeng using LBR-Trie, and that is not very long. I agree with Wenlei you can leave the improvement for the future.

Yeah, it's in seconds; in my experiments, all the benchmarks in SPEC (train data) could finish within one minute.

wmi accepted this revision.Jan 28 2021, 11:14 AM

Thanks for the data and explanation! LGTM.

hoy accepted this revision.Feb 2 2021, 10:23 PM
This revision was landed with ongoing or failed builds.Feb 3 2021, 6:53 PM
This revision was automatically updated to reflect the committed changes.