This is an archive of the discontinued LLVM Phabricator instance.

Add function importing info from samplepgo profile to the module summary.
ClosedPublic

Authored by danielcdh on Feb 16 2017, 1:29 PM.

Details

Summary

For SamplePGO, the profile may contain cross-module inline stacks. As we need to make sure the profile annotation happens when all the hot inline stacks are expanded, we need to pass this info to the module importer so that it can import proper functions if necessary. This patch implemented this feature by emitting cross-module targets as part of function entry metadata. In the module-summary phase, the metadata is used to build call edges that points to functions need to be imported.

Event Timeline

danielcdh created this revision.Feb 16 2017, 1:29 PM
mehdi_amini edited edge metadata.Feb 16 2017, 2:38 PM
test/Transforms/SampleProfile/import.ll
11

Can you document what is this in the test?

And also document what this test is doing itself.

danielcdh updated this revision to Diff 88790.Feb 16 2017, 2:49 PM
danielcdh marked an inline comment as done.

update

mehdi_amini added inline comments.Feb 16 2017, 3:01 PM
lib/Analysis/ModuleSummaryAnalysis.cpp
264

I'm still unsure why we need this side channel to communicate the hotness, while we have above already some infrastructure? Why isn't samplePGO integrate in the general PGO infrastructure?

danielcdh added inline comments.Feb 16 2017, 3:05 PM
lib/Analysis/ModuleSummaryAnalysis.cpp
264

The root cause is the profile annotation mechanism is different.

SamplePGO needs to make sure that before profile annotation, the IR resembles the hot portion of the profiling binary. As a result, we need to explicitly inline functions before profile annotation. But in order to inline, we need to first have it imported.

I can go into more details with an example if you want. Please let me know.

mehdi_amini added inline comments.Feb 16 2017, 4:01 PM
lib/Analysis/ModuleSummaryAnalysis.cpp
264

I'm still not fond of this "side channel", and I rather have a solution that annotates hot calls in the IR instead, that would be agnostic to the source of information we derive this "hotness" from.

danielcdh added inline comments.Feb 16 2017, 4:49 PM
lib/Analysis/ModuleSummaryAnalysis.cpp
264

I think we already have a side channel for ICP?

The issue for marking "hot call" is that it only goes down by 1 level, for the cases where foo_in_a_cc()->bar_in_b_cc()->baz_in_c_cc(), we may only go one level to include bar_in_b_cc if we mark the hot call. But in practice, we need both levels to be imported and inlined.

mehdi_amini added inline comments.Feb 16 2017, 4:51 PM
lib/Analysis/ModuleSummaryAnalysis.cpp
264

I don't see how having one level at a time is an issue. The two calls can be marked hot and considered independently when building the chain.

danielcdh added inline comments.Feb 16 2017, 4:58 PM
lib/Analysis/ModuleSummaryAnalysis.cpp
264

The issue is, when compiling bar_in_b_cc(), the profile for foo_in_a_cc() is not visible. If the inline instance is the only instance for bar_in_b_cc, then the standalone copy of bar_in_b_cc may not have profile and is considered cold and thus may not be able to mark the callsite to baz_in_c_cc as hot.

mehdi_amini added inline comments.Feb 16 2017, 5:04 PM
lib/Analysis/ModuleSummaryAnalysis.cpp
264

OK it is getting more clear: it is interesting because it means we have some sort of "context sensitivity" : we're not recording that bar_in_b_cc()->baz_in_c_cc() is hot in the absolute, but only when called from foo_in_a_cc().

It is still not great to add edges in the call graph to model this. You not recording the information described above IIUC, but you're recording a "hot call" from foo_in_a_cc() to bar_in_b_cc(). I'd have to think about possible unintended consequences from what it means for the summary representation and the analyses.

danielcdh added inline comments.Feb 16 2017, 5:12 PM
lib/Analysis/ModuleSummaryAnalysis.cpp
264

Yeah, context-sensitivity is one big benefit of samplepgo, and it gets even better when comes to iterative compilation: more iteration will introduce richer context, and in later iterations, the real inline pass will be a noop, and samplepgo will perform perfect top-down inlining to capture all possible contexts.

Please let me know if you can think of better ways to force importing functions by simply looking at the profile. One thing I can think of is to pass the profile to thin_link, but it seems to add much complexity.

Is it better to introduce a new meta data for this, e.g. MD_inline_instance_imports ? Can this information be directly communicated to function importer instead of relying on modifying Callgraph/profile data?

Is it better to introduce a new meta data for this, e.g. MD_inline_instance_imports ? Can this information be directly communicated to function importer instead of relying on modifying Callgraph/profile data?

We need to expose this to the thin-link for the distributed mode, otherwise the dependent files the function importer would want to import from may not be available. So the summary-based importing analysis will need to be aware of this. Right now the solution is to trick it like we do for ICP by adding "fake" edges. The difference is that for ICP we just make indirect edges direct, but we don't really cheat on the graph structure otherwise. Here we'd create "short-circuit" edges (adding an edge foo->bar when in fact the reality is foo->baz->bar).

Is it better to introduce a new meta data for this, e.g. MD_inline_instance_imports ?

I'm fine with that. But the function definition only allows one !prof metadata. So MD_inline_instance_imports can not use !prof, what should it use?

Can this information be directly communicated to function importer instead of relying on modifying Callgraph/profile data?

It needs to be part of module summary. I suppose function importer cannot have access to IR?

tejohnson edited edge metadata.Feb 27 2017, 9:21 AM

Is it better to introduce a new meta data for this, e.g. MD_inline_instance_imports ? Can this information be directly communicated to function importer instead of relying on modifying Callgraph/profile data?

We need to expose this to the thin-link for the distributed mode, otherwise the dependent files the function importer would want to import from may not be available. So the summary-based importing analysis will need to be aware of this. Right now the solution is to trick it like we do for ICP by adding "fake" edges. The difference is that for ICP we just make indirect edges direct, but we don't really cheat on the graph structure otherwise. Here we'd create "short-circuit" edges (adding an edge foo->bar when in fact the reality is foo->baz->bar).

It isn't just for the distributed mode - the Thin Link needs to know this information so that it can appropriately mark references as exported . We need to make all importing decisions in the Thin Link not the ThinLTO backends, so it needs to be in the summary in some form. Adding the direct edges for the transitive edges seemed like the cleanest way to do this as it doesn't require any changes to the function importing decisions during the thin link, rather than adding a new mechanism.

docs/BranchWeightMetadata.rst
144

Think this needs more explanation with a small example like the one you gave in the review discussion.

include/llvm/IR/Function.h
212

add something like "for sample PGO, to enable the same inlines as the profiled optimized binary".

223

ditto

include/llvm/IR/MDBuilder.h
69

Document the new parameter.

include/llvm/ProfileData/SampleProf.h
309

Since this isn't actually doing any importing, I think it would be clearer to rename to something like "findImportedFunctions" (i.e. it is identifying functions that were presumably imported and inlined in the profiled binary).

lib/Analysis/ModuleSummaryAnalysis.cpp
262

Needs comment

264

It is still not great to add edges in the call graph to model this. You not recording the information described above IIUC, but you're recording a "hot call" from foo_in_a_cc() to bar_in_b_cc(). I'd have to think about possible unintended consequences from what it means for the summary representation and the analyses.

In the case Dehao gave, which is foo_in_a_cc()->bar_in_b_cc()->baz_in_c_cc(), we presumably already have an edge in the call graph from foo_in_a_cc to bar_in_b_cc, this is just forcing that edge to be "hot" to force the import. The added edge that wouldn't have been there before would be foo_in_a_cc to baz_in_c_cc, and there we are adding a direct edge for what is a transitive dependence. I don't think this should cause any issues, since the other things we use the edges for are things like liveness and internalization, which aren't changed by adding a direct edge for the transitive dependence.

lib/IR/Function.cpp
1285

When would we have the metadata but not the operand?

lib/Transforms/IPO/SampleProfile.cpp
613

Document new parameter

1288

note that this is also recording the GUIDs that need to be imported for the IR to match

danielcdh updated this revision to Diff 89930.Feb 27 2017, 1:55 PM
danielcdh marked 8 inline comments as done.

update

lib/IR/Function.cpp
1285

Shouldn't happen, Removed the 2nd check.

Is it better to introduce a new meta data for this, e.g. MD_inline_instance_imports ? Can this information be directly communicated to function importer instead of relying on modifying Callgraph/profile data?

We need to expose this to the thin-link for the distributed mode, otherwise the dependent files the function importer would want to import from may not be available. So the summary-based importing analysis will need to be aware of this. Right now the solution is to trick it like we do for ICP by adding "fake" edges. The difference is that for ICP we just make indirect edges direct, but we don't really cheat on the graph structure otherwise. Here we'd create "short-circuit" edges (adding an edge foo->bar when in fact the reality is foo->baz->bar).

It isn't just for the distributed mode - the Thin Link needs to know this information so that it can appropriately mark references as exported . We need to make all importing decisions in the Thin Link not the ThinLTO backends,

In practice, one could still import in the backend based on IR information if it wasn't for the distributed mode (and because I've been strongly against doing this anyway for layering reason). That's why I talked about the distributed mode to simplify :)

Adding the direct edges for the transitive edges seemed like the cleanest way to do this as it doesn't require any changes to the function importing decisions during the thin link, rather than adding a new mechanism.

I'm not convinced: while I agree this is the least intrusive way to implement this functionality, I don't find it "clean": we're producing a graph that is less accurate than before.

lib/Analysis/ModuleSummaryAnalysis.cpp
264

we are adding a direct edge for what is a transitive dependence.

Right.

I don't think this should cause any issues, since the other things we use the edges for are things like liveness and internalization, which aren't changed by adding a direct edge for the transitive dependence.

I don't see any issue *right now*, but that still means that our call-graph is not longer accurate, and that makes me worried about long-term consequences and what kind of analysis or heuristic could be affected by this. I'm concerned about things like that where we break some invariant of a data structure or a component and while it seems fine on the moment, it may bite in the future.

That said the alternative (adding an explicit handling for this) seems overkill at this point, so that's fine with me.

tejohnson accepted this revision.Feb 28 2017, 6:43 AM

LGTM

docs/BranchWeightMetadata.rst
144

I would explicitly add that it is needed because the sampling based profile was collected on a binary that had already imported and inlined these functions, and we need to ensure the IR matches in the ThinLTO backends for profile annotation.

lib/Analysis/ModuleSummaryAnalysis.cpp
264

Agree that we generally want the call graph to be accurate. I think there shouldn't ever be a correctness issue though when the call graph is more "conservative" as it will be here, with the additional edges. Note that in some sense the edges added for indirect calls can be conservative too - we may not actually call those same functions if the input changes.

This revision is now accepted and ready to land.Feb 28 2017, 6:43 AM
danielcdh updated this revision to Diff 90055.Feb 28 2017, 10:20 AM
danielcdh marked an inline comment as done.

update

danielcdh closed this revision.Feb 28 2017, 10:21 AM