This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO][llvm-profgen] Context-sensitive profile data generation
ClosedPublic

Authored by wlei on Oct 19 2020, 12:59 PM.

Details

Summary

This stack of changes introduces llvm-profgen utility which generates a profile data file from given perf script data files for sample-based PGO. It’s part of(not only) the CSSPGO work. Specifically to support context-sensitive with/without pseudo probe profile, it implements a series of functionalities including perf trace parsing, instruction symbolization, LBR stack/call frame stack unwinding, pseudo probe decoding, etc. Also high throughput is achieved by multiple levels of sample aggregation and compatible format with one stop is generated at the end. Please refer to: https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s for the CSSPGO RFC.

This change supports context-sensitive profile data generation into llvm-profgen. With simultaneous sampling for LBR and call stack, we can identify leaf of LBR sample with calling context from stack sample . During the process of deriving fall through path from LBR entries, we unwind LBR by replaying all the calls and returns (including implicit calls/returns due to inlining) backwards on top of the sampled call stack. Then the state of call stack as we unwind through LBR always represents the calling context of current fall through path.

we have two types of virtual unwinding 1) LBR unwinding and 2) linear range unwinding.
Specifically, for each LBR entry which can be classified into call, return, regular branch, LBR unwinding will replay the operation by pushing, popping or switching leaf frame towards the call stack and since the initial call stack is most recently sampled, the replay should be in anti-execution order, i.e. for the regular case, pop the call stack when LBR is call, push frame on call stack when LBR is return. After each LBR processed, it also needs to align with the next LBR by going through instructions from previous LBR's target to current LBR's source, which we named linear unwinding. As instruction from linear range can come from different function by inlining, linear unwinding will do the range splitting and record counters through the range with same inline context.

With each fall through path from LBR unwinding, we aggregate each sample into counters by the calling context and eventually generate full context sensitive profile (without relying on inlining) to driver compiler's PGO/FDO.

A breakdown of noteworthy changes:

  • Added HybridSample class as the abstraction perf sample including LBR stack and call stack
  • Extended PerfReader to implement auto-detect whether input perf script output contains CS profile, then do the parsing. Multiple HybridSample are extracted
  • Speed up by aggregating HybridSample into AggregatedSamples
  • Added VirtualUnwinder that consumes aggregated HybridSample and implements unwinding of calls, returns, and linear path that contains implicit call/return from inlining. Ranges and branches counters are aggregated by the calling context.
 Here calling context is string type, each context is a pair of function name and callsite location info, the whole context is like main:1 @ foo:2 @ bar.
  • Added PorfileGenerater that accumulates counters by ranges unfolding or branch target mapping, then generates context-sensitive function profile including function body, inferring callee's head sample, callsite target samples, eventually records into ProfileMap.

  • Leveraged LLVM build-in(SampleProfWriter) writer to support different serialization format with no stop
  • getCanonicalFnName for callee name and name from ELF section
  • Added regression test for both unwinding and profile generation

Test Plan:
ninja & ninja check-llvm

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
wlei retitled this revision from [AutoFdo][llvm-profgen]Context-sensitive profile data generation to [AutoFDO][llvm-profgen]Context-sensitive profile data generation.Oct 19 2020, 1:40 PM
wlei edited the summary of this revision. (Show Details)
wlei added reviewers: hoy, wenlei, wmi, davidxl.
hoy added inline comments.Oct 21 2020, 11:04 AM
llvm/tools/llvm-profgen/PerfReader.h
43

An artificial branch stands for a series of consecutive branches starting from the current binary with a transition through external code and eventually landing back in the current binary.

llvm/tools/llvm-profgen/ProfiledBinary.cpp
118–123

Nit: std::equal(Context1.begin(), Context1.begin() + Context1.size() - 1, Context2.begin(), Context2.begin() + Context2.size() - 1)

149

RemoveLeaf should be checked here?

327

Please comment here and below that the prolog/epilog built here is based on an estimated size. Dwarf decoding is needed for a building precise prolog/epilog.

Also how about separating the code to form a prolog/epilog builder that can be based on FuncStartAddrMap? The builder will be based on Dwarf CFI in the future.

llvm/tools/llvm-profgen/ProfiledBinary.h
195

Nit: Get the full context string for the given call stack with inline context filled in?

wenlei added inline comments.Oct 21 2020, 11:16 AM
llvm/include/llvm/ProfileData/SampleProf.h
350

Let's separate CSSPGO changes in SampleProf out from the llvm-profgen changes.

We'll send CSSPGO compiler/infrastructure patches separately, and then this llvm-profgen patch can depend on that.

wmi added a comment.Oct 27 2020, 2:56 PM

Just read the unwinding part and have some questions inlined. Could you add some standalone tests for unwinding part?

llvm/test/tools/llvm-profgen/inline-cs-noprobe.test
2

Is it possible to use a small manually crafted perfscript as input? It is easier to know whether the number in the output makes sense or not when the perfscript is small. It will also be easier if something in the test needs to be adjusted in the future.

llvm/tools/llvm-profgen/PerfReader.cpp
133

It is needed or unneeded? Without call/ret, I assume there is no need to push/pop callstack?

llvm/tools/llvm-profgen/PerfReader.h
104

What is the rationale behind the condition?

126–127

What do the three uint64_t fields in the map represent?

205

Rename it to 'isCtxSensitivePerfScript'?

wenlei added inline comments.Oct 28 2020, 12:35 AM
llvm/tools/llvm-profgen/PerfReader.cpp
63

Let getCurrentLBR return reference? or why would you want a temporary to bind const-reference to?

97

Is this still needed now that we normalize traces before aggregation (line 259)?

106

Perhaps add a wrapper State.hasMoreLBRs() for this?

117

Use getCurrentLBR instead of State.LBRStack[State.LBRIndex]?

133

There's no need to adjust stack for intra-function branches, though conceptually we still need to unwind through the LBRs (updates the State). Agreed that the mention of push/pop stack isn't accurate..

llvm/tools/llvm-profgen/PerfReader.h
1

Would be good to add more comments for the classes/types defined, especially non-trivial ones.

56

Curious why use list for Stack but SmallVector (and Index) for LBRStack? Seem a bit inconsistent for similar use case.

104

Ideally we want tip of LBR target and tip of stack leaf to align with the help of PEBS. When we take a stack sample, the leaf IP of stack could be last LBR target address +N bytes, and N shouldn't be too large because N is essentially the sampling skid distance. So I had this in my original prototype as a sanity check to filter out broken records. However the distance chosen was somewhat arbitrary.. In reality, I don't think we've seen this firing with PEBS, but it could happen without PEBS, or if cycles is instead of branch_retired as triggering event.

205

I think context-sensitivity is a concept that only exists at FDO profile level. Thus for perf script, we used the term hybrid which faithfully represents the fact that both LBR and stack are sampled together.

wlei retitled this revision from [AutoFDO][llvm-profgen]Context-sensitive profile data generation to [CSSPGO][llvm-profgen]Context-sensitive profile data generation.Oct 28 2020, 12:50 PM
wmi added inline comments.Oct 28 2020, 9:16 PM
llvm/tools/llvm-profgen/PerfReader.cpp
184

Can you give an example of LBRStack so it is easy to understand what the code is parsing here?

219–238

Removing the else if make the code a little easier to read.

if (!SrcIsInternal && !DstIsInternal) 
  continue;

if (!SrcIsInternal && DstIsInternal) {
  PrevTrDst = Dst;
  continue;
}

if (SrcIsInternal && !DstIsInternal) {
  if (!PrevTrDst)
    continue;
  Dst = PrevTrDst;
  PrevTrDst = 0;
  IsArtificial = true;
}

// Filter out the branch sample
...
265

Same as above, better to give an example showing what the function is parsing here.

308–310

What sampling event are you using? If br_inst_retired:near_taken is used, the sample won't end up in prolog/epilog.

llvm/tools/llvm-profgen/ProfileGenerator.h
24

I thought the tool can also generate profile for current debug info based non CS AFDO but I am not sure. I guess that is a special case handled by CSProfileGenerator. Could you confirm?

wmi added inline comments.Oct 28 2020, 9:19 PM
llvm/tools/llvm-profgen/PerfReader.h
104

Thanks for the detailed explanation. Copying it to comment will be useful.

wmi added inline comments.Oct 29 2020, 2:46 PM
llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test
8

Is it possible for us to tell one level in context is inlined or not? It will make the profile more informative.

hoy added inline comments.Oct 29 2020, 3:16 PM
llvm/tools/llvm-profgen/PerfReader.cpp
308–310

Yes, we are sampling the br_inst_retired:near_taken event. Normally there will be no branch instructions in a prolog. This deals with a weird case where a branch instruction ends up in a shrink-wrapped prolog.

wenlei added inline comments.Oct 29 2020, 3:21 PM
llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test
8

Yes, agree that can useful, especially for tuning purpose to see how CS inline decision differs from previous build. We wanted to add a metadata (similar to !CFGChecksum for pseudo-probe profile) to indicate whether a context is inlined or not. Note that in this case, it would only tell whether bar is inlined along main:1 @ foo:3, but not whether foo is inlined along main:1. What do you think?

Also to keep patch smallish, I think we can add this later separately.

[main:1 @ foo:3 @ bar]:29103:1745
  ...
  ...
!CFGChecksum: ...
!Flag: Inline
llvm/tools/llvm-profgen/ProfileGenerator.h
24

Yes, eventually llvm-profgen will support both. Our internal implementation for AFDO profile generation is also llvm based, but it's somewhat separated from this one. And we wanted to do some refactoring before we merge the two.

That said, agreed that the two can share common interface though I think we could defer that a bit? We will upstream the AFDO profile generation after CSSPGO part is cleared.

wmi added inline comments.Oct 29 2020, 3:37 PM
llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test
8

Note that in this case, it would only tell whether bar is inlined along main:1 @ foo:3, but not whether foo is inlined along main:1. What do you think?

What is the main difficulty to keep the inline information for each Context level?

Also to keep patch smallish, I think we can add this later separately.

Sure.

> [main:1 @ foo:3 @ bar]:29103:1745
!CFGChecksum: ...
!Flag: Inline

Can we use some special sign to mark whether bar is inline or not, "*" for example?

[main:1 @ foo:3 @ bar*]:29103:1745
llvm/tools/llvm-profgen/PerfReader.cpp
308–310

I see. Thanks!

wenlei added inline comments.Oct 29 2020, 4:02 PM
llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test
8

What is the main difficulty to keep the inline information for each Context level?

Even if we only mark for leaf frame, middle frames inline decision could be found in its own leaf context (if it exists). I think it's also doable if we want to embed inline decision for each frame in a context (either with metadata or header), but since this is mostly for tuning/debugging, we're trying to keep it at minimum for now, and we can expand later if needed..

Now as to header vs metadata as carrier. There're couple reasons we don't do it in the header.

  1. We thought consistency between inline vs non-inline context for main profile and header keep things clean. In fact, CSSPGO really treats them indifferently.
  2. !metadata can be mapped/converted to binary profile with general framework support. Doing it with special character in header may require special (not-so-clean) handling for text-binary conversion.
  3. There's only that much we can do with special character in header, so it's not extensible if we want to encode something else. Initially we have checksum in header as well, but later move it to metadata to keep header clean, and thought it'd be good to keep all auxiliary data in metadata form.
wmi added inline comments.Oct 29 2020, 4:27 PM
llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test
8

We thought consistency between inline vs non-inline context for main profile and header keep things clean. In fact, CSSPGO really treats them indifferently.

Yes, I understand CSSPGO treats inline callsite or not inline callsite indifferently, and the special character showing whether a callsite is inlined is just for easy debug.

Since it is only for debug, you just need to add it when you output the profile into text and you can strip it once for all when you read the text. That will be a standalone processing step. CSSPGO has already include a lot more useful information than current SPGO profile, I just hope the CSSPGO afdo file can show us the inline hierarchy as easy as what we current have.

wenlei added inline comments.Oct 29 2020, 5:08 PM
llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test
8

I just hope the CSSPGO afdo file can show us the inline hierarchy as easy as what we current have.

Yeah, I can see how that can make debugging a bit easier. Perhaps a stack of flags in metadata can do it too if needed. On the other hand, the info is all from dwarf, so what we are discussing is just the visualization. Visualizing inline hierarchy isn't the responsibility of a profile, but afdo happens to be able to visualize inline hierarchy in a nice way, though it's still more of a side effect of the way inline profile is represented.

I'd argue that cleanness and consistency probably weighs more than trying to reach parity for that nice side effect (and actually even if we use * for inline frame, it's still not as nice as afdo profile's tree style inline hierarchy..) Perhaps we can leave this one open for now, and see where actual need leads us to?

wmi added inline comments.Oct 29 2020, 6:04 PM
llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test
8

Agree. It is not critical. We can leave it open at this moment.

wmi added a comment.Oct 29 2020, 6:09 PM

One issue I notice is getCanonicalFnName should be called somewhere to make the context string canonical.

hoy added a comment.Oct 29 2020, 6:21 PM
In D89723#2363556, @wmi wrote:

One issue I notice is getCanonicalFnName should be called somewhere to make the context string canonical.

Good point. Is that for LTO-backend static function renaming? The function names recorded in the profile are actually from the Dwarf debug data but not the Elf symbol table. I'd expect the names there should not be touched by the LTO backend. Correct me if I'm missing anything.

hoy added inline comments.Oct 29 2020, 7:13 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
53

typedef or using an alias like RangeCountMap for this type?

146

Just return ProfileMap[ContextStr]?

195

Can this be moved out of the loop? The string-based look up is potentially slow.

200

Binary can be achieved via Reader so no need to pass in as an argument.

217

@wmi Looks like here is a place we use ELF symbol name as function name. May need to call getCanonicalFnName here.

llvm/tools/llvm-profgen/ProfileGenerator.h
57

This sounds like a method of CSProfileGenerator.

83

Please make this TODO more clear.

llvm/tools/llvm-profgen/ProfiledBinary.h
186

Can this just return a reference so that you don't need to use the move semantics when it gets called?

wmi added a comment.Oct 29 2020, 8:45 PM
In D89723#2363568, @hoy wrote:
In D89723#2363556, @wmi wrote:

One issue I notice is getCanonicalFnName should be called somewhere to make the context string canonical.

Good point. Is that for LTO-backend static function renaming? The function names recorded in the profile are actually from the Dwarf debug data but not the Elf symbol table. I'd expect the names there should not be touched by the LTO backend. Correct me if I'm missing anything.

Yes, the renaming for LTO local function promotion is one case -- adding ".llvm.xxx" suffix. There are also other cases, like function splitting will add ".cold" suffix. The function name in Dwarf after LTO local function promotion should be consistent with the name in Elf symbol table which contains ".llvm" suffix.

wenlei added inline comments.Oct 30 2020, 12:10 AM
llvm/include/llvm/ProfileData/SampleProf.h
526

This can now be merged with getEntrySamples, with dispatching based on ProfileIsCS.

533–534

We probably shouldn't arbitrarily assume live for a general helper function. The logic to assume live can be moved to caller if needed.

536

nit: there's no "average" now with this version.

llvm/tools/llvm-profgen/PerfReader.cpp
126–129

The comment is no longer accurate - now we don't check if src/dst crossing function binary, instead we check whether the IP is indeed at a return instruction and tail call is no longer a problem for this particular processing. (My bad I didn't update the comment in the prototype..)

241

Remove internal task Id T24431811

302

This function can take reference to Trace.CallStack directly since it's only checking call stack.

Actually this function doesn't seem needed, line 290 check non-empty already, and we just need a return Trace.Binary->addressInPrologEpilog(Trace.CallStack.front()) at line 299.

304

remove the blank line?

llvm/tools/llvm-profgen/PerfReader.h
52

This is just an encapsulation for synchronized stack and lbr sample, and it's orthogonal to unwinder. Suggest decouple it from unwinder in naming, instead call it HybridSample.

55

nit: I'd replace anti-execution with bottom-up order (or leaf to root).

57

nit: anti-execution sounds a bit confusing. I think the canonical term is FIFO order (LBR sampling has two modes, FIFO for trace and FILO for call stack).

90

Can this be a reference as well just like LBRStack and point to the input Sample? Note header comment also states "it doesn't hold the data but only keep the pointer/index of the data".

128
  1. To be accurate, this is actually a counter rather than a map.
  2. It can be confusing if this is used for both branch and range, even though branch and range share the bit-wise representation. I think we can typedef BranchSample and RangeSample both to std::pair<uint64_t, uint64_t>, and then typedef ContextBranchCounter and ContextRangeCounter to std::unordered_map<std::string, std::map<BranchSample, uint64_t>, ..
229

Do we actually use address now? Can we remove all address and probe related stuff and add them properly in later patches?

234–238

I suggest let's establish a consistent naming convention here wrt what is a trace and what is an event:

  • Trace is a series of perf events.
  • Each perf event can be an mmap event or a sample.
  • hybrid trace is a series of perf event: mmap events and hybrid samples.
  • hybrid sample is lbr sample plus stack sample.

With that, we can rename the following:

void parseHybridTrace(line_iterator &Line);
->
void parseHybridSample(line_iterator &Line);

UnwinderTrace
->
HybridSample

TraceAggregation
->
SampleAggregation
236

nit: why name line_iterator Index for this one, and different from others?

278–281

I suggest we either include them properly or remove them from the patch. We still quite a few things not in upstream patch, and it's not possible to have TODOs for all of them.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
3

nit: all header comments are screwed up by our internal linter.

wenlei added inline comments.Oct 30 2020, 12:26 AM
llvm/tools/llvm-profgen/ProfiledBinary.cpp
327

Agree it's cleaner to decouple from the main disasm loop. For now a separate function trackPrologEpilog should be enough. (btw, if we need to extend to a class in the future, tracker may be better name than builder..)

wlei added inline comments.Oct 30 2020, 12:32 AM
llvm/tools/llvm-profgen/PerfReader.cpp
63

The LBR data is only kept in the UnwinderTrace and UnwindState only keep the index and ref. And because the UnwinderTrace is the key of the TraceAggregationMap, its type is converted to const.
If that makes code confusing, I can wrap all the LBRSource/LBRTarget to getCurrentLBRSource()/getCurrentLBRTarget().

llvm/tools/llvm-profgen/PerfReader.h
56

Currently using list is for easy copy data from UnwinderTrace to UnwindState(see line 90 also list). Why copying the data but not using the Index like LBRStack is because Callstack is changed dynamically during the unwinding(aka need push_front)
This is kind of a temporary solution, considering our next step is to use trie node to represent callstack, at that time I will make it consistent to use SmallVector. So Ideally UnwinderTrace only keep the data and UnwindState only keep the index/trie node.

90

Same here, there is pop_front() with CallStack so that I used list. This will be solved by trie. The comments is the final ideal one..sorry for the confusing.

wenlei added inline comments.Oct 30 2020, 12:51 AM
llvm/tools/llvm-profgen/PerfReader.cpp
63

Ok, not a big deal I guess since LBREntry is small anyways. but getCurrentLBR can still return const reference?

llvm/tools/llvm-profgen/PerfReader.h
56

Got it, makes sense. Thanks for clarifying.

90

I see. So because Trace is the key of TraceAggregationMap and we will need to mutate this list, you have to create a copy of what's in UnwindTrace, correct?

wlei added inline comments.Oct 30 2020, 1:06 AM
llvm/tools/llvm-profgen/PerfReader.h
90

Yes, the map key enforces the const type and it at least copy one to the State even using trie node( for the tie initialization) because of the mutation.

wenlei added inline comments.Oct 30 2020, 4:42 PM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
165

This would not be consistent with the definition of total samples. I think we should only add the portion that was added to body samples.

209

This now actually does more than head samples as call target value sample is handled here too. Perhaps populateFunctionBoundarySamples ?

222

Same as line 195, this could be hoisted out of the loop?

240

explicit namespace is not needed here.

264

This is now more than just value samples. The difference is this one us context to infer missing samples, but others use range and branch to populate samples directly. So name it populateInferredFunctionSamples?

279

nit: name it CallerLeafFrameLoc?

283

caller_profile -> CallerProfile

llvm/tools/llvm-profgen/ProfiledBinary.h
61

Is operator++ and operator-- used? these two are duplicated with advance/backward, and we only need to keep one set?

181

return a const ref, or StringRef since FuncStartAddrMap owns the string?

llvm/tools/llvm-profgen/llvm-profgen.cpp
44–54

If we let ProfileGenerator be the driver, I think we should also let ProfileGenerator initiate the perf loading (line 38); otherwise if we intend to decouple them, and let PerfReader read profile outside of ProfileGenerator, then it's better only pass the loaded profile to ProfileGenerator for cleaner separation.

wlei updated this revision to Diff 302315.Nov 2 2020, 9:32 AM

address reviewers' comments,getCanonicalFnName and PrologEpilogTracker will come later

llvm/test/tools/llvm-profgen/inline-cs-noprobe.test
2

Thanks for your feedbacks. a small perfscript with only one or two sample is replaced. also add unwinder's test

llvm/tools/llvm-profgen/PerfReader.cpp
106

change to State.hasNextLBR()

184

example is added

llvm/tools/llvm-profgen/PerfReader.h
126–127

fixed according wenlei's suggestion below by using ContextBranchCounter and ContextRangeCounter , also give more explanation

234–238

Thanks for suggesting for a consistent naming convention, did the refactor.

llvm/tools/llvm-profgen/llvm-profgen.cpp
44–54

Good suggestion, change to not include PerfReader in ProfileGenerator, then I also decoupled the unwinder from the reader. for the unwinder, the input is the aggregated hybrid sample, the output is the sample counters which is later forwarded to the generator.

hoy added inline comments.Nov 5 2020, 1:50 PM
llvm/tools/llvm-profgen/PerfReader.h
182

The comment could be retired since we have a tail call tracker coming that tracks both in-LBR tail calls and out-of-LBR tail calls universally.

llvm/tools/llvm-profgen/llvm-profgen.cpp
44–54

Perhaps it's better to include the unwinder in the reader since this driver will also handle non-CS profiles in future.

The dataflow from the reader to the profile generator may need a flexible definition (currently is Unwinder.getSampleCounters()) for future extension.

wenlei added inline comments.Nov 5 2020, 3:06 PM
llvm/tools/llvm-profgen/PerfReader.h
182

I think the comment needs to be updated, but explanation here is still needed because IIUC missing frame inference happens more like a post process (hence somewhat orthogonal), and here isCallState decides the unwind operation on the stack sample (not changed by frame inference) which will always miss tail call frame (unless dwarf stack walking is used by perf).

llvm/tools/llvm-profgen/llvm-profgen.cpp
44–54

Agreed that unwinder better be driven by PerfReader since unwinder is something PerfReader depends on directly (vs depending on its output like ProfileGenerator on PerfReader's output).

wlei updated this revision to Diff 305608.Nov 16 2020, 3:00 PM
wlei marked 5 inline comments as done.

move unwinder into PerfReader
use a BinarytoSampleCounter map to group sample counters by binary
add PrologEpilog tracker
support to use getCanonicalFnName for ELF Section based symbol name
fix a negative line offset bug
other refactoring work

hoy added inline comments.Nov 17 2020, 4:40 PM
llvm/tools/llvm-profgen/PerfReader.cpp
471

PerfType should be defined on the else branch if it is not initialized anywhere else.

llvm/tools/llvm-profgen/PerfReader.h
283

Should this be rewritten with a stream-based file reader as done in D89707?

wlei updated this revision to Diff 306182.Nov 18 2020, 12:06 PM

Address reviewer's feedback on PerfType definition

wlei added inline comments.Nov 18 2020, 12:07 PM
llvm/tools/llvm-profgen/PerfReader.h
283

I guess you mean to keep consistent to other part of code? Here you see it only read 4000bytes data from the file(getFileOrSTDIN(FileName, 4000);), so there shouldn't have memory issue.
Currently stream-based liner only support read one line at a time, it need to search line by line, which would be slower than searching in the whole 4k memory. So which one do you prefer?

hoy added inline comments.Nov 18 2020, 4:24 PM
llvm/tools/llvm-profgen/PerfReader.h
283

I see. The current implementation looks good to me.

wlei updated this revision to Diff 306285.Nov 18 2020, 6:55 PM

[NFC]rebase

wmi added inline comments.Nov 19 2020, 11:22 AM
llvm/tools/llvm-profgen/PerfReader.h
214–215

This virtual unwinder is not doing the classic unwinding thing. It is walking through the LBR stack of a LBR sample, based on the sample's callstack, and infer the callstack for each address range covered by the LBR sample. The comment can be more clear about it.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
63–71

Why there is region [A, B]: 300, but B: (0, 100) only has 100 sample count?

263

Conext --> Context?

wlei updated this revision to Diff 306726.Nov 20 2020, 10:01 AM

add more comments for unwinder and BoundaryPoint
remove Skylake only LBR duplication filter

wlei marked 43 inline comments as done.Nov 20 2020, 10:07 AM
wlei added inline comments.
llvm/tools/llvm-profgen/PerfReader.h
214–215

Thanks for your suggestion, more comments are added.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
63–71

Sorry for the confusion. See the graph below, here B:(0, 100) is the boundary point, 0 means no samples begin at B, 100 means one sample(sample1) ends at B whose count is 100. I changed the explanation in the comment, see whether it's clear or not.

|<--100-->|                  Sample1
|<------200------>|          Sample2
A         B       C
hoy added inline comments.Nov 20 2020, 10:50 AM
llvm/tools/llvm-profgen/PerfReader.cpp
26

Nit: please add a TODO here to check if Source is in prolog/epilog using precise prolog/epilog table.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
48

I'm wondering if a separate profile file should be output for each binary. Since the samples are already separated for binaries via BinarySampleCounters, ProfileMap can be made like that too.

llvm/tools/llvm-profgen/ProfiledBinary.cpp
138

Nit: remove the check and add it back with the compression work.

wenlei added inline comments.Nov 20 2020, 10:55 AM
llvm/test/tools/llvm-profgen/Inputs/noinline-cs-noprobe.perfscript
3

I think we also need to support cases where PERF_RECORD_MMAP2 event isn't available, in which case we just use preferred load address from ELF header.

Can you add a test case that doesn't have PERF_RECORD_MMAP2? Looks like currently we would just proceed with parsing without a base address set?

llvm/tools/llvm-profgen/PerfReader.cpp
503–506

What would be the workflow for (non-CS) AutoFDO with this new implementation?

It looks like parseTrace is responsible for aggregation only, then even for AutoFDO, there'll be a post-process after that, to get range:count, right?

so it looks to me that a unified workflow could be something like this?

for (auto Filename : PerfTraceFilenames)
    parseAndAggregateTrace(Filename);

generateRawProfile();

In side generateRawProfile, we would do simple range overlap computation for AutoFDO, or unwind for CSSPGO.

Also see comments on AggregationCounter - in addition to unifying the workflow, it would be good to unify data structure as well if possible. What do you think?

llvm/tools/llvm-profgen/PerfReader.h
211

The idea of aggregation applies to (non-CS) AutoFDO too. It'd be good to put infrastructure in place that can cover both AutoFDO and CSSPGO in a generic way.

Perhaps we can treat non-CS AutoFDO profile (or regular LBR perf profile) just like a hybrid profile except stack part is always empty? Is that what you have in mind?

wlei added inline comments.Nov 20 2020, 11:47 AM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
48

Yeah, it's doable. but that needs more CL design, currently we only support one output file, so we have to change supporting multiple output files which also need an exact one-one mapping to the binary. So we can use OutputFilenames to receives multiple output files and match them in order on the command line? or I'm also thinking we just remain this and if the user really need to separate the output for binary, they could call the tool multiple times with different input binary. any suggestions on the command?

wlei added inline comments.Nov 20 2020, 1:14 PM
llvm/test/tools/llvm-profgen/Inputs/noinline-cs-noprobe.perfscript
3

Yeah, currently PERF_RECORD_MMAP2 is required.
The problem using preferred load address for non-mmap event is one perf address might belong to multiple binaries, which will mess up the whole process. Also we need to one more perftrace scan to confirm there is no mmap2 event so that we can switch to use preferred address.
or we can have a switch like "--no-mmp2-events" to explicitly tell the tool use preferred address, also only support one binary under this switch. or we need some info in the perf trace tell which binary it belong to(I remembered we discuss this internally). any suggestion on this?

llvm/tools/llvm-profgen/PerfReader.cpp
503–506

Good suggestion! As you mention, we can incorporate all into unwinder by treating non-CS profile as hybrid sample with empty call stack. So how about we do that when implementing non-CS part, right now I will change to code like blow?

void generateRawProfile (..) {
  if(getPerfScriptType() == PERF_LBR) {
     // range overlap computation for regular AutoFdo
     ...
    } else if (getPerfScriptType() == PERF_LBR_STACK) {
    // Unwind samples if it's hybird sample
    unwindSamples();
  }
}
llvm/tools/llvm-profgen/PerfReader.h
211

Yeah, it should not specific to unwinder, I will move to PerfReader to support both AutoFDO and CSSPGO

hoy added inline comments.Nov 20 2020, 1:53 PM
llvm/test/tools/llvm-profgen/Inputs/noinline-cs-noprobe.perfscript
3

Maybe the binary lookup table can be pre-filled with preferred load address when the binary is loaded/constructed. Without mmap2 events in the trace file, subsequent processing with just use the preferred addresses.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
48

I see. Let's keep a single output for now.

wenlei added inline comments.Nov 20 2020, 2:24 PM
llvm/tools/llvm-profgen/PerfReader.cpp
503–506

Yes, that looks good for now.

wmi accepted this revision.Nov 25 2020, 4:32 PM

LGTM.

llvm/tools/llvm-profgen/PerfReader.h
214–215

That is helpful. Thanks.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
63–71

It is helpful too. Thanks.

This revision is now accepted and ready to land.Nov 25 2020, 4:32 PM
hoy added inline comments.Nov 30 2020, 9:34 AM
llvm/test/tools/llvm-profgen/inline-cs-noprobe.test
28

Can you please add a comment on what compiler command line switches are used to build the source code?

llvm/tools/llvm-profgen/PerfReader.cpp
49

Nit: just use PrevIP here instead of using Start?

228

Nit: curly braces not needed for single-statement block.

390

Use exitWithError?

llvm/tools/llvm-profgen/PerfReader.h
83

Nit: consider using std::vector to reduce the number of memory allocations and for better locality.

155

Nit: const qualifier for these getters?

319

Nit: const qualifier for getters?

wlei updated this revision to Diff 308693.Dec 1 2020, 9:50 AM

Address reviewers' feedback: added more comments and some refactoring work

wlei marked 11 inline comments as done.Dec 1 2020, 10:17 AM
wlei added inline comments.
llvm/test/tools/llvm-profgen/inline-cs-noprobe.test
28

Good suggestion, comment added

llvm/tools/llvm-profgen/PerfReader.h
83

Here using list is because CallStack has both push_back and push_front action, in the future it will switch to trie.

155

fixed, good suggestion, thanks!

hoy added inline comments.Dec 1 2020, 1:02 PM
llvm/tools/llvm-profgen/PerfReader.h
155

Actually I meant something like:

ProfiledBinary *getBinary() const { return Binary; }
bool hasNextLBR() const { return LBRIndex < LBRStack.size(); }
...

Sorry for the confusion.

wlei updated this revision to Diff 308758.Dec 1 2020, 1:50 PM
wlei marked 3 inline comments as done.

add const qualifier for some functions

wlei added inline comments.Dec 1 2020, 1:51 PM
llvm/tools/llvm-profgen/PerfReader.h
155

fixed, thanks for clarification!

hoy accepted this revision.Dec 1 2020, 2:05 PM
wenlei added inline comments.Dec 2 2020, 9:56 AM
llvm/test/tools/llvm-profgen/Inputs/noinline-cs-noprobe.perfscript
3

Yeah, what @hoy suggested is what I was thinking about - default to preferred load address if mmap is absent. We need that but I think It's fine to deal with it in a separate patch.

llvm/tools/llvm-profgen/PerfReader.h
156

const qualifier here as well?

224

For linear unwinding, some brief explanation for handling of inlining would be helpful too.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
48

What about limiting to single binary input for now? Error our with message saying unsupported if multiple binaries are provided. Generating profiles for multiple binaries in a single output file will make the profile summary info inaccurate (e.g. percentile based hot thresholds).

wlei updated this revision to Diff 309371.Dec 3 2020, 2:37 PM

Address wenlei's feedback

wenlei accepted this revision.Dec 3 2020, 8:34 PM

This looks great. Thanks for working on this and making all the changes!

wlei retitled this revision from [CSSPGO][llvm-profgen]Context-sensitive profile data generation to [CSSPGO][llvm-profgen] Context-sensitive profile data generation.Dec 7 2020, 1:06 PM
wlei edited the summary of this revision. (Show Details)
wlei updated this revision to Diff 310002.Dec 7 2020, 1:07 PM

rebase and update the diff summary

This revision was landed with ongoing or failed builds.Dec 7 2020, 1:54 PM
This revision was automatically updated to reflect the committed changes.

fails here http://lab.llvm.org:8011/#/builders/99/builds/1031

FAIL: LLVM :: tools/llvm-profgen/noinline-cs-noprobe.test (68769 of 72066)
******************** TEST 'LLVM :: tools/llvm-profgen/noinline-cs-noprobe.test' FAILED ********************
Script:
--
: 'RUN: at line 1';   llvm-profgen --perfscript=/b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/Inputs/noinline-cs-noprobe.perfscript --binary=/b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/Inputs/noinline-cs-noprobe.perfbin --output=/b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/test/tools/llvm-profgen/Output/noinline-cs-noprobe.test.tmp --show-unwinder-output | /b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/bin/FileCheck /b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test --check-prefix=CHECK-UNWINDER
: 'RUN: at line 2';   /b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/bin/FileCheck /b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test --input-file /b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/test/tools/llvm-profgen/Output/noinline-cs-noprobe.test.tmp
--
Exit Code: 1
Command Output (stderr):
--
/b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test:20:19: error: CHECK-UNWINDER: expected string not found in input
; CHECK-UNWINDER: (5b0, 5c8): 1
                  ^
<stdin>:14:2: note: scanning from here
 (5c8, 5dc): 2
 ^
Input file: <stdin>
Check file: /b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/noinline-cs-noprobe.test
-dump-input=help explains the following input dump.
Input was:
<<<<<<
          .
          .
          .
          9:  (634, 637): 3
         10:  (645, 645): 3
         11: 
         12: Binary(noinline-cs-noprobe.perfbin)'s Branch Counter:
         13: main:1 @ foo:3 @ bar
         14:  (5c8, 5dc): 2
check:20      X~~~~~~~~~~~~ error: no match found
         15:  (5d7, 5e5): 2
check:20     ~~~~~~~~~~~~~~
         16:  (5e9, 634): 3
check:20     ~~~~~~~~~~~~~~
         17: main:1 @ foo
check:20     ~~~~~~~~~~~~
         18:  (62f, 5b0): 3
check:20     ~~~~~~~~~~~~~~
         19:  (637, 645): 3
check:20     ~~~~~~~~~~~~~~
         20:  (645, 5ff): 3
check:20     ~~~~~~~~~~~~~~
>>>>>>
--
********************
Testing:  0.. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90
FAIL: LLVM :: tools/llvm-profgen/inline-cs-noprobe.test (68770 of 72066)
******************** TEST 'LLVM :: tools/llvm-profgen/inline-cs-noprobe.test' FAILED ********************
Script:
--
: 'RUN: at line 1';   llvm-profgen --perfscript=/b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/Inputs/inline-cs-noprobe.perfscript --binary=/b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/Inputs/inline-cs-noprobe.perfbin --output=/b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/test/tools/llvm-profgen/Output/inline-cs-noprobe.test.tmp --show-unwinder-output | /b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/bin/FileCheck /b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/inline-cs-noprobe.test --check-prefix=CHECK-UNWINDER
: 'RUN: at line 2';   /b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/bin/FileCheck /b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/inline-cs-noprobe.test --input-file /b/sanitizer-x86_64-linux-bootstrap/build/llvm_build_asan/test/tools/llvm-profgen/Output/inline-cs-noprobe.test.tmp
--
Exit Code: 1
Command Output (stderr):
--
/b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/inline-cs-noprobe.test:16:19: error: CHECK-UNWINDER: expected string not found in input
; CHECK-UNWINDER: (670, 6ad): 1
                  ^
<stdin>:12:2: note: scanning from here
 (69b, 670): 1
 ^
Input file: <stdin>
Check file: /b/sanitizer-x86_64-linux-bootstrap/build/llvm-project/llvm/test/tools/llvm-profgen/inline-cs-noprobe.test
-dump-input=help explains the following input dump.
Input was:
<<<<<<
          .
          .
          .
          7: main:1 @ foo:3.2 @ bar
          8:  (6af, 6bb): 14
          9: 
         10: Binary(inline-cs-noprobe.perfbin)'s Branch Counter:
         11: main:1 @ foo
         12:  (69b, 670): 1
check:16      X~~~~~~~~~~~~ error: no match found
         13:  (6c8, 67e): 15
check:16     ~~~~~~~~~~~~~~~
>>>>>>
--
wlei added a comment.Dec 7 2020, 10:48 PM

Hi, @vitalybuka , sorry for the test failure, the fix-up patch(https://reviews.llvm.org/D92816) is already landed, please update the repo, thanks!