
[CSSPGO][llvm-profgen] Context-sensitive profile data generation
ClosedPublic

Authored by wlei on Oct 19 2020, 12:59 PM.

Details

Summary

This stack of changes introduces the llvm-profgen utility, which generates a profile data file from given perf script data files for sample-based PGO. It is part of (but not limited to) the CSSPGO work. Specifically, to support context-sensitive profiles with or without pseudo probes, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. High throughput is achieved by multiple levels of sample aggregation, and the profile is written in a compatible format in one stop at the end. Please refer to https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.

This change adds context-sensitive profile data generation to llvm-profgen. With simultaneous sampling of the LBR and the call stack, we can identify the leaf of an LBR sample with the calling context from the stack sample. While deriving fall-through paths from LBR entries, we unwind the LBR by replaying all the calls and returns (including implicit calls/returns due to inlining) backwards on top of the sampled call stack. The state of the call stack as we unwind through the LBR then always represents the calling context of the current fall-through path.

We have two types of virtual unwinding: 1) LBR unwinding and 2) linear range unwinding.
Specifically, each LBR entry can be classified as a call, a return, or a regular branch, and LBR unwinding replays the operation by pushing, popping, or switching the leaf frame of the call stack. Since the initial call stack is the most recently sampled state, the replay runs in anti-execution order: in the regular case, pop the call stack when the LBR entry is a call, and push a frame onto the call stack when the LBR entry is a return. After each LBR entry is processed, the unwinder also needs to align with the next LBR entry by walking the instructions from the previous LBR's target to the current LBR's source, which we call linear unwinding. Because instructions in a linear range can come from different functions due to inlining, linear unwinding splits the range and records counters for each sub-range with the same inline context.
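
The anti-execution-order replay can be illustrated with a minimal sketch. This is not the patch's VirtualUnwinder: the names LBREntry, BranchKind, and unwindLBR are hypothetical, frames are identified by function name instead of address, and linear unwinding and inlining are left out.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical classification of an LBR entry.
enum class BranchKind { Call, Return, Regular };

// Hypothetical, simplified LBR entry: the branch kind and the function
// that contains the branch source.
struct LBREntry {
  BranchKind Kind;
  std::string SourceFunc;
};

// Call stack with the outermost frame first and the leaf frame last.
using CallStack = std::vector<std::string>;

// Replays LBR entries from the most recent (index 0) to the oldest and
// returns the calling context that was in effect before each branch.
std::vector<CallStack> unwindLBR(CallStack Stack,
                                 const std::vector<LBREntry> &LBRs) {
  std::vector<CallStack> Contexts;
  for (const LBREntry &E : LBRs) {
    switch (E.Kind) {
    case BranchKind::Call:
      // Undo a call: the callee frame did not exist before the call.
      assert(!Stack.empty() && "call entry seen with an empty call stack");
      Stack.pop_back();
      break;
    case BranchKind::Return:
      // Undo a return: the returning function was still on the stack.
      Stack.push_back(E.SourceFunc);
      break;
    case BranchKind::Regular:
      // A regular branch stays in the same frame in this simplified model.
      break;
    }
    Contexts.push_back(Stack);
  }
  return Contexts;
}
```

For a sample taken in foo with stack main, foo, and LBR entries (newest first) "return from bar", "call from foo", "call from main", the replay yields the contexts main/foo/bar, then main/foo, then main.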

With each fall-through path from LBR unwinding, we aggregate each sample into counters keyed by calling context and eventually generate a fully context-sensitive profile (without relying on inlining) to drive the compiler's PGO/FDO.

A breakdown of noteworthy changes:

  • Added the HybridSample class as the abstraction of a perf sample that includes both an LBR stack and a call stack
  • Extended PerfReader to auto-detect whether the input perf script output contains a CS profile, then do the parsing. Multiple HybridSamples are extracted
  • Sped up processing by aggregating HybridSamples into AggregatedSamples
  • Added VirtualUnwinder, which consumes aggregated HybridSamples and implements unwinding of calls, returns, and linear paths that contain implicit calls/returns from inlining. Range and branch counters are aggregated by calling context.
 Here the calling context is a string; each frame in it is a pair of function name and callsite location info, and the whole context looks like main:1 @ foo:2 @ bar.
  • Added ProfileGenerator, which accumulates counters by range unfolding or branch target mapping, then generates the context-sensitive function profile, including the function body, the inferred callee head samples, and callsite target samples, and eventually records it into ProfileMap.
  • Leveraged LLVM's built-in writer (SampleProfWriter) to support different serialization formats in one stop
  • Used getCanonicalFnName for callee names and names from the ELF section
  • Added regression tests for both unwinding and profile generation
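
As an illustration of the string-typed calling context mentioned in the list above, here is a minimal sketch of how a context key such as main:1 @ foo:2 @ bar could be built and used to key range counters. ContextFrame, buildContextKey, ContextCounters, and recordRange are hypothetical names chosen for this example, not the patch's actual API.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical frame: function name plus callsite location info.
// The leaf frame has no callsite, so CallSite is left empty for it.
struct ContextFrame {
  std::string FuncName;
  std::string CallSite;
};

// Joins frames (outermost first) into a key like "main:1 @ foo:2 @ bar".
std::string buildContextKey(const std::vector<ContextFrame> &Frames) {
  std::string Key;
  for (std::size_t I = 0; I < Frames.size(); ++I) {
    if (I != 0)
      Key += " @ ";
    Key += Frames[I].FuncName;
    if (!Frames[I].CallSite.empty()) {
      Key += ":";
      Key += Frames[I].CallSite;
    }
  }
  return Key;
}

// Address range and per-context range counters.
using Range = std::pair<uint64_t, uint64_t>;
using ContextCounters = std::map<std::string, std::map<Range, uint64_t>>;

// Adds Count to the counter of range R under the given context key.
void recordRange(ContextCounters &Counters, const std::string &Key, Range R,
                 uint64_t Count) {
  Counters[Key][R] += Count;
}
```

Keying counters by the full context string keeps samples from different call paths apart, which matches the shape of the --show-unwinder-output dumps later in this review (a context line followed by range: count lines).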

Test Plan:
ninja & ninja check-llvm

Diff Detail

Unit Tests: Failed

Time    Test
60 ms   x64 windows > LLVM.CodeGen/XCore::threads.ll
Script: -- : 'RUN: at line 1'; c:\ws\w64\llvm-project\premerge-checks\build\bin\llc.exe -march=xcore < C:\ws\w64\llvm-project\premerge-checks\llvm\test\CodeGen\XCore\threads.ll | c:\ws\w64\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w64\llvm-project\premerge-checks\llvm\test\CodeGen\XCore\threads.ll
60 ms   x64 windows > LLVM.tools/llvm-profgen::inline-cs-noprobe.test
Script: -- : 'RUN: at line 1'; llvm-profgen --perfscript=C:\ws\w64\llvm-project\premerge-checks\llvm\test\tools\llvm-profgen/Inputs/inline-cs-noprobe.perfscript --binary=C:\ws\w64\llvm-project\premerge-checks\llvm\test\tools\llvm-profgen/Inputs/inline-cs-noprobe.perfbin --output=C:\ws\w64\llvm-project\premerge-checks\build\test\tools\llvm-profgen\Output\inline-cs-noprobe.test.tmp --show-unwinder-output | c:\ws\w64\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w64\llvm-project\premerge-checks\llvm\test\tools\llvm-profgen\inline-cs-noprobe.test --check-prefix=CHECK-UNWINDER
50 ms   x64 windows > LLVM.tools/llvm-profgen::noinline-cs-noprobe.test
Script: -- : 'RUN: at line 1'; llvm-profgen --perfscript=C:\ws\w64\llvm-project\premerge-checks\llvm\test\tools\llvm-profgen/Inputs/noinline-cs-noprobe.perfscript --binary=C:\ws\w64\llvm-project\premerge-checks\llvm\test\tools\llvm-profgen/Inputs/noinline-cs-noprobe.perfbin --output=C:\ws\w64\llvm-project\premerge-checks\build\test\tools\llvm-profgen\Output\noinline-cs-noprobe.test.tmp --show-unwinder-output | c:\ws\w64\llvm-project\premerge-checks\build\bin\filecheck.exe C:\ws\w64\llvm-project\premerge-checks\llvm\test\tools\llvm-profgen\noinline-cs-noprobe.test --check-prefix=CHECK-UNWINDER

Event Timeline

hoy added inline comments.Nov 5 2020, 1:50 PM
llvm/tools/llvm-profgen/PerfReader.h
182

The comment could be retired since we have a tail call tracker coming that tracks both in-LBR tail calls and out-of-LBR tail calls universally.

llvm/tools/llvm-profgen/llvm-profgen.cpp
44–54

Perhaps it's better to include the unwinder in the reader since this driver will also handle non-CS profiles in future.

The dataflow from the reader to the profile generator may need a flexible definition (currently is Unwinder.getSampleCounters()) for future extension.

wenlei added inline comments.Nov 5 2020, 3:06 PM
llvm/tools/llvm-profgen/PerfReader.h
182

I think the comment needs to be updated, but an explanation is still needed here because, IIUC, missing frame inference happens more like a post-process (hence somewhat orthogonal). Here isCallState decides the unwind operation on the stack sample (which is not changed by frame inference), and the stack sample will always miss tail call frames (unless DWARF stack walking is used by perf).

llvm/tools/llvm-profgen/llvm-profgen.cpp
44–54

Agreed that unwinder better be driven by PerfReader since unwinder is something PerfReader depends on directly (vs depending on its output like ProfileGenerator on PerfReader's output).

wlei updated this revision to Diff 305608.Nov 16 2020, 3:00 PM
wlei marked 5 inline comments as done.

move unwinder into PerfReader
use a BinarytoSampleCounter map to group sample counters by binary
add PrologEpilog tracker
support to use getCanonicalFnName for ELF Section based symbol name
fix a negative line offset bug
other refactoring work

hoy added inline comments.Nov 17 2020, 4:40 PM
llvm/tools/llvm-profgen/PerfReader.cpp
471

PerfType should be defined on the else branch if it is not initialized anywhere else.

llvm/tools/llvm-profgen/PerfReader.h
283

Should this be rewritten with a stream-based file reader as done in D89707?

wlei updated this revision to Diff 306182.Nov 18 2020, 12:06 PM

Address reviewer's feedback on PerfType definition

wlei added inline comments.Nov 18 2020, 12:07 PM
llvm/tools/llvm-profgen/PerfReader.h
283

I guess you mean keeping it consistent with other parts of the code? Here it only reads 4000 bytes of data from the file (getFileOrSTDIN(FileName, 4000);), so there shouldn't be a memory issue.
Currently the stream-based reader only supports reading one line at a time, so it would need to search line by line, which would be slower than searching the whole 4k of memory. So which one do you prefer?

hoy added inline comments.Nov 18 2020, 4:24 PM
llvm/tools/llvm-profgen/PerfReader.h
283

I see. The current implementation looks good to me.

wlei updated this revision to Diff 306285.Nov 18 2020, 6:55 PM

[NFC]rebase

wmi added inline comments.Nov 19 2020, 11:22 AM
llvm/tools/llvm-profgen/PerfReader.h
214–215

This virtual unwinder is not doing the classic unwinding thing. It walks through the LBR stack of an LBR sample and, based on the sample's call stack, infers the call stack for each address range covered by the LBR sample. The comment could be clearer about that.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
63–71

Why is there a region [A, B]: 300, while B: (0, 100) only has a sample count of 100?

263

Conext --> Context?

wlei updated this revision to Diff 306726.Nov 20 2020, 10:01 AM

add more comments for unwinder and BoundaryPoint
remove Skylake only LBR duplication filter

wlei marked 43 inline comments as done.Nov 20 2020, 10:07 AM
wlei added inline comments.
llvm/tools/llvm-profgen/PerfReader.h
214–215

Thanks for your suggestion, more comments are added.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
63–71

Sorry for the confusion. See the graph below: here B:(0, 100) is a boundary point, where 0 means no samples begin at B and 100 means one sample (Sample1), whose count is 100, ends at B. I changed the explanation in the comment; let me know whether it is clear now.

|<--100-->|                  Sample1
|<------200------>|          Sample2
A         B       C
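
To make the boundary-point idea concrete, here is a minimal sketch of how overlapping range samples could be split into disjoint segments with accumulated counts. It is an illustration under assumptions, not the code in ProfileGenerator.cpp: RangeSample, Segment, and splitRanges are hypothetical names, and the exact inclusive/exclusive handling of boundary addresses is glossed over.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

// Per-address bookkeeping: total count of samples that begin here and
// total count of samples that end here.
struct BoundaryPoint {
  uint64_t BeginCount = 0;
  uint64_t EndCount = 0;
};

// Hypothetical input: a sampled address range with its count.
struct RangeSample {
  uint64_t Begin, End, Count;
};

// Hypothetical output: a disjoint segment between two adjacent boundary
// points, with the summed count of all samples covering it.
struct Segment {
  uint64_t Start, End, Count;
};

std::vector<Segment> splitRanges(const std::vector<RangeSample> &Samples) {
  std::map<uint64_t, BoundaryPoint> Points; // ordered by address
  for (const RangeSample &S : Samples) {
    Points[S.Begin].BeginCount += S.Count;
    Points[S.End].EndCount += S.Count;
  }

  std::vector<Segment> Segments;
  uint64_t Running = 0; // count of samples covering the current segment
  for (auto It = Points.begin(); It != Points.end(); ++It) {
    Running += It->second.BeginCount;
    Running -= It->second.EndCount;
    auto Next = std::next(It);
    if (Next != Points.end() && Running != 0)
      Segments.push_back({It->first, Next->first, Running});
  }
  return Segments;
}
```

With Sample1 = [A, B] of count 100 and Sample2 = [A, C] of count 200, the boundary points are A:(300, 0), B:(0, 100), and C:(0, 200), and the sweep produces [A, B]: 300 and [B, C]: 200.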
hoy added inline comments.Nov 20 2020, 10:50 AM
llvm/tools/llvm-profgen/PerfReader.cpp
26

Nit: please add a TODO here to check if Source is in prolog/epilog using precise prolog/epilog table.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
48

I'm wondering if a separate profile file should be output for each binary. Since the samples are already separated for binaries via BinarySampleCounters, ProfileMap can be made like that too.

llvm/tools/llvm-profgen/ProfiledBinary.cpp
138

Nit: remove the check and add it back with the compression work.

wenlei added inline comments.Nov 20 2020, 10:55 AM
llvm/test/tools/llvm-profgen/Inputs/noinline-cs-noprobe.perfscript
3

I think we also need to support cases where PERF_RECORD_MMAP2 event isn't available, in which case we just use preferred load address from ELF header.

Can you add a test case that doesn't have PERF_RECORD_MMAP2? Looks like currently we would just proceed with parsing without a base address set?

llvm/tools/llvm-profgen/PerfReader.cpp
503–506

What would be the workflow for (non-CS) AutoFDO with this new implementation?

It looks like parseTrace is responsible for aggregation only, then even for AutoFDO, there'll be a post-process after that, to get range:count, right?

So it looks to me that a unified workflow could be something like this:

for (auto Filename : PerfTraceFilenames)
    parseAndAggregateTrace(Filename);

generateRawProfile();

Inside generateRawProfile, we would do a simple range overlap computation for AutoFDO, or unwind for CSSPGO.

Also see comments on AggregationCounter - in addition to unifying the workflow, it would be good to unify data structure as well if possible. What do you think?

llvm/tools/llvm-profgen/PerfReader.h
211

The idea of aggregation applies to (non-CS) AutoFDO too. It'd be good to put infrastructure in place that can cover both AutoFDO and CSSPGO in a generic way.

Perhaps we can treat non-CS AutoFDO profile (or regular LBR perf profile) just like a hybrid profile except stack part is always empty? Is that what you have in mind?
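
A minimal sketch of that idea, assuming a generic sample type whose call stack is simply empty for LBR-only (non-CS) input. Sample, LBRBranch, AggregatedSamples, and aggregate are hypothetical names for illustration and do not mirror the patch's HybridSample or AggregationCounter.

```cpp
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical LBR entry: (source address, target address).
using LBRBranch = std::pair<uint64_t, uint64_t>;

// Generic sample: an LBR stack plus a call stack. For a regular LBR
// profile (non-CS AutoFDO) the call stack is left empty.
struct Sample {
  std::vector<LBRBranch> LBRStack;
  std::vector<uint64_t> CallStack;

  // Ordering so that identical samples map to the same key.
  bool operator<(const Sample &Other) const {
    return std::tie(LBRStack, CallStack) <
           std::tie(Other.LBRStack, Other.CallStack);
  }
};

// Identical samples collapse into one entry with a repeat count, so the
// later, more expensive processing runs once per unique sample.
using AggregatedSamples = std::map<Sample, uint64_t>;

void aggregate(AggregatedSamples &Agg, const Sample &S) { ++Agg[S]; }
```

Because the key includes the (possibly empty) call stack, the same container serves both workflows; only the post-processing step differs.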

wlei added inline comments.Nov 20 2020, 11:47 AM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
48

Yeah, it's doable, but it needs more command-line design. Currently we only support one output file, so we would have to add support for multiple output files, which also need an exact one-to-one mapping to the binaries. We could use OutputFilenames to receive multiple output files and match them in order on the command line. Alternatively, we could keep the current behavior, and if users really need separate output per binary, they could call the tool multiple times with different input binaries. Any suggestions on the command line?

wlei added inline comments.Nov 20 2020, 1:14 PM
llvm/test/tools/llvm-profgen/Inputs/noinline-cs-noprobe.perfscript
3

Yeah, currently PERF_RECORD_MMAP2 is required.
The problem with using the preferred load address when there is no mmap event is that one perf address might belong to multiple binaries, which would mess up the whole process. We would also need one more scan of the perf trace to confirm there is no mmap2 event before switching to the preferred address.
Alternatively, we could have a switch like "--no-mmp2-events" to explicitly tell the tool to use the preferred address, and only support one binary under this switch. Or we would need some info in the perf trace that tells which binary an address belongs to (I remember we discussed this internally). Any suggestions on this?

llvm/tools/llvm-profgen/PerfReader.cpp
503–506

Good suggestion! As you mentioned, we can incorporate everything into the unwinder by treating a non-CS profile as a hybrid sample with an empty call stack. How about we do that when implementing the non-CS part, and for now I change the code to something like the below?

void generateRawProfile(..) {
  if (getPerfScriptType() == PERF_LBR) {
    // Range overlap computation for regular AutoFDO
    ...
  } else if (getPerfScriptType() == PERF_LBR_STACK) {
    // Unwind samples if it's a hybrid sample
    unwindSamples();
  }
}
llvm/tools/llvm-profgen/PerfReader.h
211

Yeah, it should not be specific to the unwinder. I will move it to PerfReader to support both AutoFDO and CSSPGO.

hoy added inline comments.Nov 20 2020, 1:53 PM
llvm/test/tools/llvm-profgen/Inputs/noinline-cs-noprobe.perfscript
3

Maybe the binary lookup table can be pre-filled with the preferred load address when the binary is loaded/constructed. Without mmap2 events in the trace file, subsequent processing will just use the preferred addresses.
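
A minimal sketch of that fallback, under assumptions: BinaryTable and its methods are hypothetical names, and the table is keyed by binary name purely for illustration (the real lookup in the tool may be organized differently).

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical table mapping each profiled binary to its base address.
class BinaryTable {
  std::map<std::string, uint64_t> BaseByName;

public:
  // Called when a binary is loaded: pre-fill with the preferred load
  // address taken from the ELF header.
  void addBinary(const std::string &Name, uint64_t PreferredBase) {
    BaseByName[Name] = PreferredBase;
  }

  // Called for each mmap2 event in the trace: override the pre-filled
  // address with the actual one. Returns false for unknown binaries.
  bool onMMap(const std::string &Name, uint64_t LoadBase) {
    auto It = BaseByName.find(Name);
    if (It == BaseByName.end())
      return false;
    It->second = LoadBase;
    return true;
  }

  // Base address to use for symbolization; throws std::out_of_range if
  // the binary was never added.
  uint64_t baseOf(const std::string &Name) const {
    return BaseByName.at(Name);
  }
};
```

If the trace contains no mmap2 event, onMMap is never called and baseOf keeps returning the preferred address, so no extra scan of the trace is needed to decide which mode to use.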

llvm/tools/llvm-profgen/ProfileGenerator.cpp
48

I see. Let's keep a single output for now.

wenlei added inline comments.Nov 20 2020, 2:24 PM
llvm/tools/llvm-profgen/PerfReader.cpp
503–506

Yes, that looks good for now.

wmi accepted this revision.Nov 25 2020, 4:32 PM

LGTM.

llvm/tools/llvm-profgen/PerfReader.h
214–215

That is helpful. Thanks.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
63–71

It is helpful too. Thanks.

This revision is now accepted and ready to land.Nov 25 2020, 4:32 PM
hoy added inline comments.Nov 30 2020, 9:34 AM
llvm/test/tools/llvm-profgen/inline-cs-noprobe.test
28

Can you please add a comment on what compiler command line switches are used to build the source code?

llvm/tools/llvm-profgen/PerfReader.cpp
49

Nit: just use PrevIP here instead of using Start?

228

Nit: curly braces not needed for single-statement block.

390

Use exitWithError?

llvm/tools/llvm-profgen/PerfReader.h
83

Nit: consider using std::vector to reduce the number of memory allocations and for better locality.

155

Nit: const qualifier for these getters?

319

Nit: const qualifier for getters?

wlei updated this revision to Diff 308693.Dec 1 2020, 9:50 AM

Address reviewers' feedback: added more comments and some refactoring work

wlei marked 11 inline comments as done.Dec 1 2020, 10:17 AM
wlei added inline comments.
llvm/test/tools/llvm-profgen/inline-cs-noprobe.test
28

Good suggestion, comment added

llvm/tools/llvm-profgen/PerfReader.h
83

A list is used here because CallStack needs both push_back and push_front; in the future it will switch to a trie.

155

fixed, good suggestion, thanks!

hoy added inline comments.Dec 1 2020, 1:02 PM
llvm/tools/llvm-profgen/PerfReader.h
155

Actually I meant something like:

ProfiledBinary *getBinary() const { return Binary; }
bool hasNextLBR() const { return LBRIndex < LBRStack.size(); }
...

Sorry for the confusion.

wlei updated this revision to Diff 308758.Dec 1 2020, 1:50 PM
wlei marked 3 inline comments as done.

add const qualifier for some functions

wlei added inline comments.Dec 1 2020, 1:51 PM
llvm/tools/llvm-profgen/PerfReader.h
155

fixed, thanks for clarification!

hoy accepted this revision.Dec 1 2020, 2:05 PM
wenlei added inline comments.Dec 2 2020, 9:56 AM
llvm/test/tools/llvm-profgen/Inputs/noinline-cs-noprobe.perfscript
3

Yeah, what @hoy suggested is what I was thinking about: default to the preferred load address if mmap is absent. We need that, but I think it's fine to deal with it in a separate patch.

llvm/tools/llvm-profgen/PerfReader.h
156

const qualifier here as well?

224

For linear unwinding, some brief explanation for handling of inlining would be helpful too.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
48

What about limiting to a single binary input for now? Error out with a message saying it is unsupported if multiple binaries are provided. Generating profiles for multiple binaries in a single output file will make the profile summary info inaccurate (e.g. percentile-based hot thresholds).

wlei updated this revision to Diff 309371.Dec 3 2020, 2:37 PM

Address wenlei's feedback

wenlei accepted this revision.Dec 3 2020, 8:34 PM

This looks great. Thanks for working on this and making all the changes!

wlei retitled this revision from [CSSPGO][llvm-profgen]Context-sensitive profile data generation to [CSSPGO][llvm-profgen] Context-sensitive profile data generation.Dec 7 2020, 1:06 PM
wlei edited the summary of this revision. (Show Details)
wlei updated this revision to Diff 310002.Dec 7 2020, 1:07 PM

rebase and update the diff summary

This revision was landed with ongoing or failed builds.Dec 7 2020, 1:54 PM
This revision was automatically updated to reflect the committed changes.

fails here http://lab.llvm.org:8011/#/builders/99/builds/1031

FAIL: LLVM :: tools/llvm-profgen/noinline-cs-noprobe.test (68769 of 72066)
******************** TEST 'LLVM :: tools/llvm-profgen/noinline-cs-noprobe.test' FAILED ********************
Script:
--
: 'RUN: at line 1';   llvm-profgen --perfscript=/b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/Inputs/noinline-cs-noprobe.perfscript --binary=/b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/Inputs/noinline-cs-noprobe.perfbin --output=/b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/test/tools/llvm-profgen/Output/noinline-cs-noprobe.test.tmp --show-unwinder-output | /b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/bin/FileCheck /b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test --check-prefix=CHECK-UNWINDER
: 'RUN: at line 2';   /b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/bin/FileCheck /b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test --input-file /b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/test/tools/llvm-profgen/Output/noinline-cs-noprobe.test.tmp
--
Exit Code: 1
Command Output (stderr):
--
/b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test:20:19: error: CHECK-UNWINDER: expected string not found in input
; CHECK-UNWINDER: (5b0, 5c8): 1
                  ^
<stdin>:14:2: note: scanning from here
 (5c8, 5dc): 2
 ^
Input file: <stdin>
Check file: /b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test
-dump-input=help explains the following input dump.
Input was:
<<<<<<
          .
          .
          .
          9:  (634, 637): 3
         10:  (645, 645): 3
         11: 
         12: Binary(noinline-cs-noprobe.perfbin)'s Branch Counter:
         13: main:1 @ foo:3 @ bar
         14:  (5c8, 5dc): 2
check:20      X~~~~~~~~~~~~ error: no match found
         15:  (5d7, 5e5): 2
check:20     ~~~~~~~~~~~~~~
         16:  (5e9, 634): 3
check:20     ~~~~~~~~~~~~~~
         17: main:1 @ foo
check:20     ~~~~~~~~~~~~
         18:  (62f, 5b0): 3
check:20     ~~~~~~~~~~~~~~
         19:  (637, 645): 3
check:20     ~~~~~~~~~~~~~~
         20:  (645, 5ff): 3
check:20     ~~~~~~~~~~~~~~
>>>>>>
--
********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90
FAIL: LLVM :: tools/llvm-profgen/inline-cs-noprobe.test (68770 of 72066)
******************** TEST 'LLVM :: tools/llvm-profgen/inline-cs-noprobe.test' FAILED ********************
Script:
--
: 'RUN: at line 1';   llvm-profgen --perfscript=/b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/Inputs/inline-cs-noprobe.perfscript --binary=/b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/Inputs/inline-cs-noprobe.perfbin --output=/b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/test/tools/llvm-profgen/Output/inline-cs-noprobe.test.tmp --show-unwinder-output | /b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/bin/FileCheck /b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/inline-cs-noprobe.test --check-prefix=CHECK-UNWINDER
: 'RUN: at line 2';   /b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/bin/FileCheck /b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/inline-cs-noprobe.test --input-file /b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/test/tools/llvm-profgen/Output/inline-cs-noprobe.test.tmp
--
Exit Code: 1
Command Output (stderr):
--
/b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/inline-cs-noprobe.test:16:19: error: CHECK-UNWINDER: expected string not found in input
; CHECK-UNWINDER: (670, 6ad): 1
                  ^
<stdin>:12:2: note: scanning from here
 (69b, 670): 1
 ^
Input file: <stdin>
Check file: /b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/inline-cs-noprobe.test
-dump-input=help explains the following input dump.
Input was:
<<<<<<
          .
          .
          .
          7: main:1 @ foo:3.2 @ bar
          8:  (6af, 6bb): 14
          9: 
         10: Binary(inline-cs-noprobe.perfbin)'s Branch Counter:
         11: main:1 @ foo
         12:  (69b, 670): 1
check:16      X~~~~~~~~~~~~ error: no match found
         13:  (6c8, 67e): 15
check:16     ~~~~~~~~~~~~~~~
>>>>>>
--
wlei added a comment.Dec 7 2020, 10:48 PM

Hi @vitalybuka, sorry for the test failure. The fix-up patch (https://reviews.llvm.org/D92816) has already landed; please update the repo. Thanks!