This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation
ClosedPublic

Authored by wlei on Jan 5 2021, 12:16 PM.

Details

Summary

For CS profile generation, the process of call stack unwinding is time-consuming since for each LBR entry we need linear time to generate the context( hash, compression, string concatenation). This change speeds up this by grouping all the call frame within one LBR sample into a trie and aggregating the result(sample counter) on it, deferring the context compression and string generation to the end of unwinding.

Specifically, it uses StackLeaf as the top frame on the stack and manipulates(pop or push a trie node) it dynamically during virtual unwinding so that the raw sample can just be recoded on the leaf node, the path(root to leaf) will represent its calling context. In the end, it traverses the trie and generates the context on the fly.

Results:
Our internal branch shows about 5X speed-up on some large workloads in SPEC06 benchmark.

Diff Detail

Event Timeline

wlei created this revision.Jan 5 2021, 12:16 PM
wlei requested review of this revision.Jan 5 2021, 12:16 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 5 2021, 12:16 PM
wlei retitled this revision from [CSSPGO][llvm-profgen] Aggregate sample on call frame trie to speed up profile generation to [CSSPGO][llvm-profgen] Aggregate samples on call frame trie to speed up profile generation.Jan 5 2021, 3:08 PM
wlei edited the summary of this revision. (Show Details)
wlei added reviewers: wmi, davidxl, hoy, wenlei.
wmi added a comment.Jan 10 2021, 9:28 PM

This change speeds up this by grouping all the call frame within one LBR sample into a trie and aggregating the result(sample counter) on it.

5x speedup shows it is a really impressive improvement. I am wondering whether there is callstack overlap between different LBR samples so you can have further grouping of call frames -- by reusing unwindState. You may also save some cost by reusing the frame trie. IIUC although samples have been aggregated based on callstack, each LBR sample may have multiple callstacks inferred from unwindCall/unwindReturn. If there are callstack overlap between different LBR samples, you may be able to further group them.

llvm/tools/llvm-profgen/PerfReader.cpp
90

Use SmallVectorImpl<uint64_t>& as a parameter type instead of SmallVector<uint64_t, 16>&. There are some other places with the same issue.

wlei updated this revision to Diff 316222.Jan 12 2021, 1:41 PM

Addessing Wei's feedback

wlei added a comment.Jan 12 2021, 1:42 PM
In D94110#2489390, @wmi wrote:

This change speeds up this by grouping all the call frame within one LBR sample into a trie and aggregating the result(sample counter) on it.

5x speedup shows it is a really impressive improvement. I am wondering whether there is callstack overlap between different LBR samples so you can have further grouping of call frames -- by reusing unwindState. You may also save some cost by reusing the frame trie. IIUC although samples have been aggregated based on callstack, each LBR sample may have multiple callstacks inferred from unwindCall/unwindReturn. If there are callstack overlap between different LBR samples, you may be able to further group them.

Yes, this is a very good idea, thanks. Further grouping of call frames might allow us not needing the aggregated samples map, i.e. we can do profile generation on the tree during DFS.
One of my concern on this is that if we maintain a whole tree of call frame, it might affect the performance of getOrCreateNode() during the unwinding, because it will have more children of tree node increasing the hash collision. but if the overlapping degree is high, it will surely improve the performance. I will try this and do more experiments on the heavy workloads in SPEC.

llvm/tools/llvm-profgen/PerfReader.cpp
90

fixed all those issue.

hoy added inline comments.Jan 19 2021, 9:16 AM
llvm/tools/llvm-profgen/PerfReader.cpp
109

Is the trie root the only case this can be zero? Can we make an assert for that?

115

Typo: inliner

171

Would be nice to a comment about why counts are reported on the parent frame for pseudo prob .

llvm/tools/llvm-profgen/PerfReader.h
213

Nit: how about name it getOrCreateChildFrame?

226

FrameTrieRoot doesn't seem to be initialized anywhere in the constructor of UnwindState.

227

Nit: name it FrameCurrentLeaf?

wlei updated this revision to Diff 317772.Jan 19 2021, 9:59 PM

addressing Hongtao's feedback

llvm/tools/llvm-profgen/PerfReader.cpp
109

Good point! adding assertion here may need a dummy root parameter, how about changing this condition to isDummyRoot and add assertion inside getOrCreateChildFrame so that we can also guarantee only the root can be zero address.

115

fixed

171

Good catch, comments added

llvm/tools/llvm-profgen/PerfReader.h
213

Sounds good!

226

I guess it will be initialized implicitly since it's not a pointer but a value. It will have a zero value address and be taken as a dummy node. I changed the name to DummyTrieRoot to make it clearer.

227

Good point

In D94110#2489390, @wmi wrote:

This change speeds up this by grouping all the call frame within one LBR sample into a trie and aggregating the result(sample counter) on it.

5x speedup shows it is a really impressive improvement. I am wondering whether there is callstack overlap between different LBR samples so you can have further grouping of call frames -- by reusing unwindState. You may also save some cost by reusing the frame trie. IIUC although samples have been aggregated based on callstack, each LBR sample may have multiple callstacks inferred from unwindCall/unwindReturn. If there are callstack overlap between different LBR samples, you may be able to further group them.

Yeah, the aggregation was based on stack + LBR sample, so events with same stack but different LBR won't be aggregated today. Using a global trie would help defer context generation for each of the aggregations if they lead to shared context..

I think we could experiment how helpful that is by checking how often we generate a context that's already in the context profile map when traversing the trie. If that happens very often, then it suggests global trie could save more..

Yes, this is a very good idea, thanks. Further grouping of call frames might allow us not needing the aggregated samples map, i.e. we can do profile generation on the tree during DFS.

I think the aggregation is still worth keeping as it's probably still cheaper than trie with hashing. But we will know for sure through experiments.

One of my concern on this is that if we maintain a whole tree of call frame, it might affect the performance of getOrCreateNode() during the unwinding, because it will have more children of tree node increasing the hash collision. but if the overlapping degree is high, it will surely improve the performance. I will try this and do more experiments on the heavy workloads in SPEC.

llvm/tools/llvm-profgen/PerfReader.cpp
124

Would be nice to have a unified representation for frame stack and probe stack, but without replying on reinterpret_cast of probe pointer..

Using uint64_t then rely on Binary->usePseudoProbes() to decide how to interpret the value seem less than ideal.. Do this through template functions with the different part done through specialization helpers?

llvm/tools/llvm-profgen/PerfReader.h
204

What about name it ProfiledFrame? (we have ProfileBinary)

206

Raw pointer is used for parent, but unique_ptr is used for child. Would be good to keep consistent.

211

This ctor can be merged into the one below with default parameter.

227

nit: name them CurrentLeafFrame and DummyRootFrame?

281

A few more naming nits:

initFrameTrie? CallStackTrie prompts a trie whose nodes are call stack..

appendFrame and popStackLeaf are functionally symmetric, would be good for them to follow similar names. e.g. pushFrame, popFrame. and then swithToFrame..

473

We "record" samples on trie (have recordRangeCount and recordBranchCount), then "collect" samples from trie into context sample maps via a DFS traversal.

So we could have a more intuitive convention for readability:

  • ProfiledFrame::recordRangeCount
  • ProfiledFrame::recordBranchCount
  • VirtualUnwinder::collectSamplesFromFrame
  • VirtualUnwinder::collectSamplesFromFrameTrie
484–485

We can make this a private member since it's only a helper for getOrCreateCounter. (a few other functions can be private too).

With the latest, do you see similar speed up for probe profile and dwarf profile?

wlei updated this revision to Diff 319102.Jan 25 2021, 1:17 PM

Addressing Wenlei's feedback: much refactoring work.

wlei added inline comments.Jan 25 2021, 1:30 PM
llvm/tools/llvm-profgen/PerfReader.cpp
124

Yes, changed to template functions and create two struct FrameStack and ProbeStack for this, thanks!

llvm/tools/llvm-profgen/PerfReader.h
204

Good point

206

Oh, the child is a new allocated instance so we use smart pointer for it, but the parent pointer can always point to a pre-allocated instance or null(for root). Also seems it would cause a recursive deconstruction of unique_ptr error if making parent unique_ptr.

211

Good catch!

281

Yeah, that's very good naming suggestions, thanks!

473

Cool, renamed

484–485

Yes, make all of them except unwind() to private

wlei added a comment.Jan 26 2021, 12:22 PM

I think the aggregation is still worth keeping as it's probably still cheaper than trie with hashing. But we will know for sure through experiments.

With the latest, do you see similar speed up for probe profile and dwarf profile?

have implemented the global trie for the virtual unwinder part and done the evaluation over the LBR trie against baseline and global trie on probe profile. See the chart below, here only use gobmk and sjeng because other benchmarks's run time of virtual unwinder are very small(less than 1s).

The virtual unwinding time(s):

No-TrieLBR-TrieGlobal-Trie
gobmk8.313.64.11
sjeng46.8219.1524.91

Speed up:

LBR-Trie vs No-TrieGlobal-Trie vs No-TrieGlobal-Trie vs LBR-Trie
gobmk2.32.020.88
sjeng2.441.870.77

Sum-up:

  • LBR-Trie and Global-Trie can have about 2x speed-up over the baseline
  • Global-Trie have slight regression(about 10%) against LBR-Trie as we discussed this might be caused by hash overhead.
  • Didn't see the similar speed-up as our internal prototype, I guess it's because of the refinement we did in the previous patches like removing the redundant string concatenation and reversal, switching to use probe based key.

Beside the time, LBR-trie seems good for the readability but Global-Trie can be easy to detect shared context. Haven't tried to defer the context generation(should have more speed-up for Global-Trie) and detect shared context, will try it.

CC: @wmi @hoy

I think the aggregation is still worth keeping as it's probably still cheaper than trie with hashing. But we will know for sure through experiments.

With the latest, do you see similar speed up for probe profile and dwarf profile?

have implemented the global trie for the virtual unwinder part and done the evaluation over the LBR trie against baseline and global trie on probe profile. See the chart below, here only use gobmk and sjeng because other benchmarks's run time of virtual unwinder are very small(less than 1s).

The virtual unwinding time(s):

No-TrieLBR-TrieGlobal-Trie
gobmk8.313.64.11
sjeng46.8219.1524.91

Speed up:

LBR-Trie vs No-TrieGlobal-Trie vs No-TrieGlobal-Trie vs LBR-Trie
gobmk2.32.020.88
sjeng2.441.870.77

Sum-up:

  • LBR-Trie and Global-Trie can have about 2x speed-up over the baseline
  • Global-Trie have slight regression(about 10%) against LBR-Trie as we discussed this might be caused by hash overhead.
  • Didn't see the similar speed-up as our internal prototype, I guess it's because of the refinement we did in the previous patches like removing the redundant string concatenation and reversal, switching to use probe based key.

Beside the time, LBR-trie seems good for the readability but Global-Trie can be easy to detect shared context. Haven't tried to defer the context generation(should have more speed-up for Global-Trie) and detect shared context, will try it.

CC: @wmi @hoy

Thanks for the quick experiment! Given that we don't see immediate speed up from global trie, I'm inclined to just use what you have in this patch, and defer further improvement for the future. What do you think?

llvm/tools/llvm-profgen/PerfReader.h
206

Definitely not unique_ptr for parent and multiple children would keep pointer to same parent. I meant to say consistency between smart pointer vs raw pointer, and in the case of parent, a shared_ptr?

229

FrameCurrentLeaf -> CurrentLeafFrame?

485–486

this private can be removed since there's one on line 456.

wlei added a comment.Jan 27 2021, 2:01 PM

Thanks for the quick experiment! Given that we don't see immediate speed up from global trie, I'm inclined to just use what you have in this patch, and defer further improvement for the future. What do you think?

Yeah, sounds good to me! It didn't affect the functionality, then in the future we can try to improve it when we get to solve the sharing context issue.

llvm/tools/llvm-profgen/PerfReader.h
206

Just tried the share_ptr, it failed, it seems there is a cycle object free issue, something like the parent object free its children's field then it will free the Parent shared_ptr, the Parent shared_ptr again tried to free its parent object cause a cycle free failure. I code like below(use this to initial Parent ptr)

std::share_ptr<ProfiledFrame> Parent;
...
ProfiledFrame(uint64_t Addr = 0, ProfiledFrame *P = nullptr)
    : Address(Addr), Parent(P) {}
ProfiledFrame *getOrCreateChildFrame(uint64_t Address) {
  auto Ret = Children.emplace(
      Address, std::make_unique<ProfiledFrame>(Address, this));
  return Ret.first->second.get();
}
229

renamed!

485–486

Good catch!

wlei updated this revision to Diff 319683.Jan 27 2021, 2:41 PM

addressing Wenlei's feedback

wenlei accepted this revision.Jan 27 2021, 2:48 PM

lgtm, thanks.

This revision is now accepted and ready to land.Jan 27 2021, 2:48 PM
wmi added a comment.Jan 27 2021, 10:35 PM

Sum-up:

  • LBR-Trie and Global-Trie can have about 2x speed-up over the baseline
  • Global-Trie have slight regression(about 10%) against LBR-Trie as we discussed this might be caused by hash overhead.

Thanks for the experiment.
LBR-Trie is faster than Global-Trie. Does it mean there is not enough callstack overlap between different LBR samples? And could you elaborate what is the hash overhead?

Thanks for the quick experiment! Given that we don't see immediate speed up from global trie, I'm inclined to just use what you have in this patch, and defer further improvement for the future. What do you think?

I assume the time in the table is in seconds -- 19.15 seconds for sjeng using LBR-Trie, and that is not very long. I agree with Wenlei you can leave the improvement for the future.

wlei added a comment.Jan 28 2021, 11:04 AM

Thanks for the experiment.
LBR-Trie is faster than Global-Trie. Does it mean there is not enough callstack overlap between different LBR samples?

Thanks for your reminder. I just tried the experiment on aggregation on callstack only, see the chart below. The column means the size of the sample map aggregated by Callstack+LBR or Callstack only.

Callstack+LBRCallstackRatio
gobmk13871559935161.40
sjeng1582701336291.18
gcc928176101.22

you see there is only 20%~40% more save if we use only callstack for aggregation.

And could you elaborate what is the hash overhead?

I meant we use a global trie instead of the intra LBR trie, in that case all the frame in the call stack and branch in LBR is a node in the trie.
Then the trie is probably very huge and each node might have a lot of children. We use the unordered_map to create or get children, if the num of children is big, the unordered_map lookup might not be O(1). This is what I mean the hash overhead. For intra LBR trie, we only have 16 entries so the num of children is small, should be fine with unordered_map look-up.

Thanks for the quick experiment! Given that we don't see immediate speed up from global trie, I'm inclined to just use what you have in this patch, and defer further improvement for the future. What do you think?

I assume the time in the table is in seconds -- 19.15 seconds for sjeng using LBR-Trie, and that is not very long. I agree with Wenlei you can leave the improvement for the future.

Yeah, it's in second, for my experiments all the benchmarks in SPEC(train data) could be finished in one minute.

wmi accepted this revision.Jan 28 2021, 11:14 AM

Thanks for the data and explanation! LGTM.

hoy accepted this revision.Feb 2 2021, 10:23 PM
This revision was landed with ongoing or failed builds.Feb 3 2021, 6:53 PM
This revision was automatically updated to reflect the committed changes.