This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO] Top-down processing order based on full profile.
ClosedPublic

Authored by hoy on Mar 25 2021, 9:09 AM.

Details

Summary

Use profiled call edges to augment the top-down order. There are cases that the top-down order computed based on the static call graph doesn't reflect real execution order. For example:

  1. Incomplete static call graph due to unknown indirect call targets. Adjusting the order by considering indirect call edges from the profile can enable the inlining of indirect call targets by allowing the caller processed before them.
  1. Mutual call edges in an SCC. The static processing order computed for an SCC may not reflect the call contexts in the context-sensitive profile, thus may cause potential inlining to be overlooked. The function order in one SCC is being adjusted to a top-down order based on the profile to favor more inlining.
  1. Transitive indirect call edges due to inlining. When a callee function is inlined into into a caller function in LTO prelink, every call edge originated from the callee will be transferred to the caller. If any of the transferred edges is indirect, the original profiled indirect edge, even if considered, would not enforce a top-down order from the caller to the potential indirect call target in LTO postlink since the inlined callee is gone from the static call graph.
  1. #3 can happen even for direct call targets, due to functions defined in header files. Header functions, when included into source files, are defined multiple times but only one definition survives due to ODR. Therefore, the LTO prelink inlining done on those dropped definitions can be useless based on a local file scope. More importantly, the inlinee, once fully inlined to a to-be-dropped inliner, will have no profile to consume when its outlined version is compiled. This can lead to a profile-less prelink compilation for the outlined version of the inlinee function which may be called from external modules. while this isn't easy to fix, we rely on the postlink AutoFDO pipeline to optimize the inlinee. Since the survived copy of the inliner (defined in headers) can be inlined in its local scope in prelink, it may not exist in the merged IR in postlink, and we'll need the profiled call edges to enforce a top-down order for the rest of the functions.

Considering those cases, a profiled call graph completely independent of the static call graph is constructed based on profile data, where function objects are not even needed to handle case #3 and case 4.

I'm seeing an average 0.4% perf win out of SPEC2017. For certain benchmark such as Xalanbmk and GCC, the win is bigger, above 2%.

The change is an enhancement to https://reviews.llvm.org/D95988.

Diff Detail

Event Timeline

hoy created this revision.Mar 25 2021, 9:09 AM
hoy requested review of this revision.Mar 25 2021, 9:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 25 2021, 9:09 AM
hoy edited the summary of this revision. (Show Details)Mar 25 2021, 9:14 AM
hoy added reviewers: davidxl, wmi, wenlei.
hoy edited the summary of this revision. (Show Details)
hoy edited the summary of this revision. (Show Details)
wlei added a subscriber: wlei.Mar 25 2021, 1:08 PM
wenlei added inline comments.Mar 25 2021, 11:00 PM
llvm/lib/Transforms/IPO/SampleContextTracker.cpp
570–571

This is very similar to CSPreInliner::buildTopDownOrder. We had to let context track do things like addCallGraphEdges in the past. But now it probably makes more sense to let ProfileCallGraph take care of the graph building. I can refactor part of CSPreInliner::buildTopDownOrder into ctor of ProfileCallGraph so it can be reused here.

570–571

Is it intentional that we only look at trie, but not call targets from body samples for call edges?

llvm/lib/Transforms/IPO/SampleProfile.cpp
165

Nit: the description implies that with -use-profiled-call-graph=1, we would do top-down order even if -sample-profile-top-down-load=0 is used. But the implementation doesn't do that.

Would be good to have a cohesive connection between the two switches, and description to reflect that. How about use-profiled-top-down-order with description like "Use the top-down order defined by profiled call graph when -sample-profile-top-down-load is on"?

1579

assert that we don't go this path for csspgo?

1631

What happens if a function is not in input profile, looks like it will be skipped in sample loader after this change? Before the change, we would still set entry count for a function if it has no profile.

1642–1643

Such case has to involve indirect call because if we have direct call, say we have A->B->C, if B is gone in post-link, we will still honor A->C order because the call to C is visible in A, correct?

It may be clearer to use example in the description. Same for #4.

It'd be good to call out that some of this only applies to csspgo (e.g. #2 shouldn't be a problem for AutoFDO).

1663–1665

Using ProfiledCallGraph allows us to order without needing function object, but profile could be stale (e.g. missing a new edge after source drift). Can we build with ProfiledCallGraph with static call graph nodes and edges included as well?

1667–1668

How about moving the dispatch for ProfileIsCS into SampleProfileLoader::buildProfiledCallGraph?

hoy marked an inline comment as done.Mar 26 2021, 12:16 PM
hoy added inline comments.
llvm/lib/Transforms/IPO/SampleContextTracker.cpp
570–571

Sounds good. Moving the logic into ProfileCallGraph makes more sense.

570–571

It is intentional. A call target that doesn't come with a profile or is not on a call path to its child profile can be ignored since processing it before its caller (if this is the only context) shouldn't lose anything.

llvm/lib/Transforms/IPO/SampleProfile.cpp
165

Good point. Description changed.

1579

Done.

1631

Good catch! It's an overlook.

1642–1643

Exactly. A->C can be used to recover A->B->C with C's !dbg information.

Examples added.

1663–1665

Actually adding static edges leads to worse performance for some benchmarks because of SCC. In that case, static edges in SCC should be completely removed so that only profile edges are honored.

On the other hand, yes, profile could be stale, but that's the information FDO relies on. I think without the profile, top-down order isn't important. In other words, static call edges seems not important when they don't correspond to a context in the profile.

hoy updated this revision to Diff 333607.Mar 26 2021, 12:17 PM
hoy marked an inline comment as done.

Addressing Wenlei's comments.

wmi added a comment.Mar 26 2021, 1:54 PM

The change is an enhancement to https://reviews.llvm.org/D95988 so it is better to mention D95988 in the description/commit log.

Before this patch UseProfileIndirectCallEdges is true for non-CS sampleFDO, now UseProfiledCallGraph is false by default so it change the behavior of non-CS sampleFDO. I will test UseProfiledCallGraph==true for non-CS sampleFDO and if performance is ok, we can enable UseProfiledCallGraph by default. How does it sound?

hoy edited the summary of this revision. (Show Details)Mar 26 2021, 1:57 PM
hoy added a comment.Mar 26 2021, 2:00 PM
In D99351#2653712, @wmi wrote:

The change is an enhancement to https://reviews.llvm.org/D95988 so it is better to mention D95988 in the description/commit log.

Before this patch UseProfileIndirectCallEdges is true for non-CS sampleFDO, now UseProfiledCallGraph is false by default so it change the behavior of non-CS sampleFDO. I will test UseProfiledCallGraph==true for non-CS sampleFDO and if performance is ok, we can enable UseProfiledCallGraph by default. How does it sound?

In D99351#2653712, @wmi wrote:

The change is an enhancement to https://reviews.llvm.org/D95988 so it is better to mention D95988 in the description/commit log.

Before this patch UseProfileIndirectCallEdges is true for non-CS sampleFDO, now UseProfiledCallGraph is false by default so it change the behavior of non-CS sampleFDO. I will test UseProfiledCallGraph==true for non-CS sampleFDO and if performance is ok, we can enable UseProfiledCallGraph by default. How does it sound?

Sound good, thanks for testing it for non-CS!

BTW, UseProfileIndirectCallEdges was true by default but only worked with CS FDO, like if (UseProfileIndirectCallEdges && ProfileIsCS), because the support of adding indirect call edges for non-CS was not done.

wenlei added inline comments.Mar 26 2021, 4:39 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
1663–1665

Ok, this makes sense. So static edges can be conflicting then we may end up with SCC order not compatible with context trie. Using strictly profile order makes sure we will get maximum inlining along context trie. I think that (intentionally not adding call graph edges) worth a comment explaining by itself.

wmi added a comment.Mar 26 2021, 11:30 PM

Just find ProfiledCallGraph.h is not included in the patch so the build failed after applying the patch.

hoy added a comment.Mar 27 2021, 11:00 AM
In D99351#2654048, @wmi wrote:

Just find ProfiledCallGraph.h is not included in the patch so the build failed after applying the patch.

Thanks for trying it. You’ll also need to apply Wenlei’s patch D99146.

wmi added a comment.Mar 28 2021, 2:45 PM
In D99351#2654236, @hoy wrote:
In D99351#2654048, @wmi wrote:

Just find ProfiledCallGraph.h is not included in the patch so the build failed after applying the patch.

Thanks for trying it. You’ll also need to apply Wenlei’s patch D99146.

Then I run into the cyclic including issue:
In this patch llvm/Transforms/IPO/SampleContextTracker.h includes llvm/Transforms/IPO/ProfiledCallGraph.h while in D99146 llvm/Transforms/IPO/ProfiledCallGraph.h includes llvm/Transforms/IPO/SampleContextTracker.h.

hoy added a comment.Mar 28 2021, 4:10 PM
In D99351#2654704, @wmi wrote:
In D99351#2654236, @hoy wrote:
In D99351#2654048, @wmi wrote:

Just find ProfiledCallGraph.h is not included in the patch so the build failed after applying the patch.

Thanks for trying it. You’ll also need to apply Wenlei’s patch D99146.

Then I run into the cyclic including issue:
In this patch llvm/Transforms/IPO/SampleContextTracker.h includes llvm/Transforms/IPO/ProfiledCallGraph.h while in D99146 llvm/Transforms/IPO/ProfiledCallGraph.h includes llvm/Transforms/IPO/SampleContextTracker.h.

I see. Sounds like the inclusion of ProfiledCallGraph.h should be moved into SampleContextTracker.cpp, and SampleContextTracker.h will need to have a forward declaration of class ProfiledCallGraph.

hoy updated this revision to Diff 333747.Mar 28 2021, 4:27 PM

Addressing Wenlei's and Wei's comment.

wmi added a comment.Mar 28 2021, 11:01 PM

Seems there is still build error:

lib/Transforms/IPO/SampleProfile.cpp:1692:23: error: no matching constructor for initialization of 'llvm::sampleprof::ProfiledCallGraph'

ProfiledCallGraph ProfiledCG;

include/llvm/Transforms/IPO/ProfiledCallGraph.h:43:3: note: candidate constructor not viable: requires 2 arguments, but 0 were provided

ProfiledCallGraph(StringMap<FunctionSamples> &ProfileMap,
hoy added a comment.Mar 28 2021, 11:21 PM
In D99351#2654965, @wmi wrote:

Seems there is still build error:

lib/Transforms/IPO/SampleProfile.cpp:1692:23: error: no matching constructor for initialization of 'llvm::sampleprof::ProfiledCallGraph'

ProfiledCallGraph ProfiledCG;

include/llvm/Transforms/IPO/ProfiledCallGraph.h:43:3: note: candidate constructor not viable: requires 2 arguments, but 0 were provided

ProfiledCallGraph(StringMap<FunctionSamples> &ProfileMap,

Oh I see. I didn't sync to Wenlei's latest patch whee the profiled callgraph construction was moved into the constructor. Since there is a subtle difference about whether to limit the call graph build to nodes with samples only, I'm adding back the default constructor for now. We can discuss how to make the code well shared.

llvm/lib/Transforms/IPO/SampleProfile.cpp
1663–1665

Sounds good, comment added.

hoy updated this revision to Diff 333772.Mar 28 2021, 11:21 PM

Updating D99351: [CSSPGO] Top-down processing order based on full profile.

In D99351#2654984, @hoy wrote:
In D99351#2654965, @wmi wrote:

Seems there is still build error:

lib/Transforms/IPO/SampleProfile.cpp:1692:23: error: no matching constructor for initialization of 'llvm::sampleprof::ProfiledCallGraph'

ProfiledCallGraph ProfiledCG;

include/llvm/Transforms/IPO/ProfiledCallGraph.h:43:3: note: candidate constructor not viable: requires 2 arguments, but 0 were provided

ProfiledCallGraph(StringMap<FunctionSamples> &ProfileMap,

Oh I see. I didn't sync to Wenlei's latest patch whee the profiled callgraph construction was moved into the constructor. Since there is a subtle difference about whether to limit the call graph build to nodes with samples only, I'm adding back the default constructor for now. We can discuss how to make the code well shared.

Let me commit my change so this can also move forward (sorry for delay). I was thinking about adding another ctor for profiled call graph for AutoFDO case (would require changes in this patch). But we could also move the graph building into a helper function..

hoy added a comment.Mar 29 2021, 9:54 AM
In D99351#2654984, @hoy wrote:
In D99351#2654965, @wmi wrote:

Seems there is still build error:

lib/Transforms/IPO/SampleProfile.cpp:1692:23: error: no matching constructor for initialization of 'llvm::sampleprof::ProfiledCallGraph'

ProfiledCallGraph ProfiledCG;

include/llvm/Transforms/IPO/ProfiledCallGraph.h:43:3: note: candidate constructor not viable: requires 2 arguments, but 0 were provided

ProfiledCallGraph(StringMap<FunctionSamples> &ProfileMap,

Oh I see. I didn't sync to Wenlei's latest patch whee the profiled callgraph construction was moved into the constructor. Since there is a subtle difference about whether to limit the call graph build to nodes with samples only, I'm adding back the default constructor for now. We can discuss how to make the code well shared.

Let me commit my change so this can also move forward (sorry for delay). I was thinking about adding another ctor for profiled call graph for AutoFDO case (would require changes in this patch). But we could also move the graph building into a helper function..

Sounds good. We can do the refactoring in the current change. Thanks.

The ProfiledCallGraph change D99146 is in, now some adjustment is needed here.

llvm/include/llvm/Transforms/IPO/ProfiledCallGraph.h
44

I think we can let ProfiledCallGraph take over the responsibility of graph building, for both autofdo and csspgo. We could do this through two separate ctor, or a common ctor plus two "build graph" helpers.

llvm/include/llvm/Transforms/IPO/SampleContextTracker.h
33

With graph building all moving into ProfileCallGraph, we can avoid referencing ProfiledCallGraph in context tracker. Basically SampleContextTracker::addProfiledCallEdges can be removed?

hoy updated this revision to Diff 333942.Mar 29 2021, 11:39 AM

Moving call graph build into profiledCallGraph.h.

llvm/include/llvm/Transforms/IPO/ProfiledCallGraph.h
44

Sounds good.

wenlei added inline comments.Mar 29 2021, 4:31 PM
llvm/include/llvm/Transforms/IPO/ProfiledCallGraph.h
43–44

The purpose of this is to have all nodes populated before adding call edges and enable sanity check when adding edges. But I can see that's not possible for afdo profiles. If we end up adding nodes on the fly for afdo, we could do the same for csspgo, in which case we can remove ProfileMap from parameter.

46

Can we just pass in StringMap<FunctionSamples> &ProfileMap instead of the Reader? ProfiledCallGraph doesn't need to interact with Reader except for getting the profiles.

For checking CS profile, we could do FunctionsSamples::ProfileIsCS too.

54

Can we merge the two ctors for CS profile? Some refactoring to merge them, with parameter to control whether we add edges for call targets and comments explaining why call target edges are skipped would be good.

It also makes sense for llvm-profgen path to use exactly what the compiler uses for top-down ordering. So if adding call target edges are problematic for compiler, maybe we should skip that for llvm-profgen too.

104

If we name the one above as addProfiledCall, this would be addProfiledCalls to be consistent?

(And this is indeed adding both nodes and edges)

llvm/include/llvm/Transforms/IPO/SampleContextTracker.h
20–21

This can be removed too.

wmi added a comment.Mar 29 2021, 5:25 PM

The performance test result is neutral, so I think we can enable UseProfiledCallGraph by default.

llvm/lib/Transforms/IPO/SampleProfile.cpp
165

Emit an error to prevent misuse if ProfileTopDownLoad is false and UseProfiledCallGraph is true?

hoy marked 2 inline comments as done.Mar 29 2021, 6:30 PM
In D99351#2657349, @wmi wrote:

The performance test result is neutral, so I think we can enable UseProfiledCallGraph by default.

Thanks for measuring the performance. Sounds good to turn it on by default.

llvm/include/llvm/Transforms/IPO/ProfiledCallGraph.h
46

Good point, Reader is not really needed here.

54

Yes, they can be merged. The existing constructor will need to get rid of ProfileMap as you mentioned in the other comment.

104

Sounds good.

llvm/include/llvm/Transforms/IPO/SampleContextTracker.h
20–21

Good catch.

llvm/lib/Transforms/IPO/SampleProfile.cpp
165

Actually when ProfileTopDownLoad is false, UseProfiledCallGraph doesn't do anything since it'll return early in buildProfiledCallGraph. Do you think an error is needed when it returns early while UseProfiledCallGraph is true?

hoy updated this revision to Diff 334031.Mar 29 2021, 6:30 PM
hoy marked 2 inline comments as done.

Updating D99351: [CSSPGO] Top-down processing order based on full profile.

wmi added inline comments.Mar 29 2021, 8:35 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
165

Silently ignoring this flag may cause confusion. A warning may be enough.

wenlei added inline comments.Mar 29 2021, 9:22 PM
llvm/include/llvm/Transforms/IPO/ProfiledCallGraph.h
67–74

Remove AddNodeWithSamplesOnly parameter and always add everything even for llvm-profgen? CSPreInliner::processFunction skips names without profile, so it should just work. llvm-profgen should follow what compiler does.

69–72

Actually adding extra edges from call target samples has the risk of forming top-down order that is not compliant with context order due to SCC, right? That is similar to how adding static edges can hurt.

So instead of saying call target edge doesn't help, it would be helpful if we make it clear in the comment explaining how adding extra edges from call targets may hurt.

llvm/lib/Transforms/IPO/SampleProfile.cpp
165

Looks like this is not enforced though.. A common pattern is tuning knobs for an optimization and when optimization is turned off, we don't emit warning when tuning flags are still used. It looks to me that silently ignore a tuning flag when an optimization is off is more mainstream then emitting a warning.. don't have a strong opinion though.

hoy added inline comments.Mar 29 2021, 10:38 PM
llvm/include/llvm/Transforms/IPO/ProfiledCallGraph.h
67–74

Sounds good. They are now unified.

69–72

Good point. Comment updated.

llvm/lib/Transforms/IPO/SampleProfile.cpp
165

Yeah, looks that silently ignoring tuning flags is common. Added a warning though, which should be clear and helpful to users.

hoy updated this revision to Diff 334049.Mar 29 2021, 10:39 PM

Addressing comments from Wei and Wenlei.

wenlei added inline comments.Mar 29 2021, 11:18 PM
llvm/include/llvm/Transforms/IPO/ProfiledCallGraph.h
73

typo: which

llvm/lib/Transforms/IPO/SampleProfile.cpp
1605

Is this the canonical way of emit warning? Or something through diagnostics like LLVMContext.diagnose(DiagnosticInfoSampleProfile(..., DS_Warning))?

hoy added inline comments.Mar 30 2021, 9:07 AM
llvm/include/llvm/Transforms/IPO/ProfiledCallGraph.h
73

Fixed.

llvm/lib/Transforms/IPO/SampleProfile.cpp
1605

The usage of errs() to display text messages is quite common, I also used that in lld, though it is not a formal way to emit warnings that users can track in documents.

hoy updated this revision to Diff 334182.Mar 30 2021, 9:09 AM

Updating D99351: [CSSPGO] Top-down processing order based on full profile.

wenlei accepted this revision.Mar 30 2021, 9:19 AM

lgtm, thanks.

llvm/lib/Transforms/IPO/SampleProfile.cpp
1605

Yeah, saw inconsistent messages all over the place.. "WARNING", "warning" and "Warning". I guess we're not making it worse. :)

This revision is now accepted and ready to land.Mar 30 2021, 9:19 AM
wmi accepted this revision.Mar 30 2021, 9:31 AM

LGTM.

This revision was automatically updated to reflect the committed changes.