This is an archive of the discontinued LLVM Phabricator instance.

[AutoFDO] Top-down Inlining for specialization with context-sensitive profile
ClosedPublic

Authored by wenlei on Nov 25 2019, 12:06 AM.

Details

Summary

AutoFDO's sample profile loader processes functions in arbitrary source-code order, so reordering two functions in the source can change inline decisions. This also prevents using the context-sensitive profile to do specialization while inlining. This commit enforces SCC top-down order for the sample profile loader. With this change, we can now do specialization, as illustrated by the added test case:

Say we have call paths A->B->C and D->B->C, and we want to inline C into B when the root inliner is B, but not when the root inliner is A or D. This is not possible without enforcing top-down order: once C is inlined into B, A and D can only inline (B->C) as a whole or nothing, but what we want is to inline only B into A and D, not its recursive callee C. If we process functions in top-down order, this is no longer a problem, which is what this commit does.

This change is guarded by a new switch "-sample-profile-top-down-load" for tuning, and it depends on D70653. Eventually, top-down can become the default order for the sample profile loader.

Diff Detail

Event Timeline

wenlei created this revision.Nov 25 2019, 12:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 25 2019, 12:06 AM

This looks good. Can this be handled for the cross-module (ThinLTO) case somehow too?

Can this be handled for the cross-module (ThinLTO) case somehow too?

That'd be nice, but the parallelization and isolation of thin-backends make two things difficult: 1) enforcing a global top-down order; 2) adjusting the profile based on inline decisions across thin-backends, e.g. what D70653 was trying to do (the current adjustment, scale or merge, only happens on the imported clone, which is sometimes not very useful).

What I'm thinking about is having ThinLink make global inline decisions without reading IR (e.g. based on profile and summary) so it's still reasonably thin, and also adjust the profile globally based on those inline decisions. In addition, ThinLink would need to pass the adjusted profile and inline decisions to the thin-backends for execution. (We wouldn't do the mechanics of inlining during ThinLink, as that can slow it down a lot. The challenge is that without seeing the IR, the inline decision is going to be an approximation, though we may be able to make it very close, since the inline heuristics used by the sample loader are relatively simple too.)

I think conceptually this shouldn't be too disruptive to the current implementation, as the sample loader runs very early in the LTO-time passes, close to ThinLink. But implementation-wise, this is going to be a lot of changes, and a bit intrusive. I chatted with @tejohnson about this during the LLVM dev meeting, and I'd love to hear alternative ideas too.

One way to handle it is to 1) delay early inlining of sites into a hot function if a big percentage of calls to the function are from other modules (this still allows intra-module top-down inlining of it); or 2) keep a clone of the uninlined body of the original function and use that one during cross-module inlining.

One way to handle it is to 1) delay early inlining of sites into a hot function if a big percentage of calls to the function are from other modules (this still allows intra-module top-down inlining of it); or 2) keep a clone of the uninlined body of the original function and use that one during cross-module inlining.

Both would work for leveraging the context-sensitive profile to drive better inlining with specialization, if the profile at the time of processing a specific function is accurate. However, the problem of not having cross-module profile adjustment for inlining still exists. With top-down inlining, profile adjustment and PGO inlining are an iterative process: in the call graph, a lower function's inlining relies on a higher function's inlining (and its profile adjustment). So not being able to adjust the profile across thin-backends not only affects post-inline profile fidelity, it also limits PGO inlining.

In short, I think the two ideas give us a mechanism to do specialization during inlining, but we would still be missing the good (adjusted) profile that drives that kind of inlining.

What I mentioned should be complementary to the top-down method in this patch -- it just allows full top-down to be doable for the cross-module scenario as well.

wmi added a comment.Nov 25 2019, 11:36 AM

That is a good catch. From my understanding, it could potentially reduce inlining in some cold places and potentially increase regular inlining. How much impact on performance and code size do you see from this change? I will do some evaluation on my side.

llvm/lib/Transforms/IPO/SampleProfile.cpp
434–435

If the buildFunctionOrder call is moved to runOnModule, we don't need the field and buildFunctionOrder can return a function order vector.

1749

We can move the check about whether the Function is a declaration to here, so FunctionOrderList can be a little bit smaller.

1807

The check can be moved above.

llvm/test/Transforms/SampleProfile/inline-topdown.ll
19

rename the variable using opt -instnamer.

Thanks for the discussions and code review. I'll address the comments later, but wanted to answer this one first.

How much impact on performance and code size do you see from this change? I will do some evaluation on my side.

I ran it with MySQL linkbench (https://github.com/facebookarchive/linkbench), and it showed a ~0.5% performance win and a slight code-size increase (<1%) with D70653, this patch, and one more patch combined. (That other patch lets inline replay also inline small functions if the CGSCC inliner would inline them eventually; I'll upstream it too.)

I haven't tried to measure the breakdown yet, but I think the profile merge and the top-down inline change should be the most important ones. I plan to experiment more with each individual change, and I'm also interested in what you see from your workload. (I'm putting each of these changes under a switch so it's easier for all of us to measure and tune.)

wenlei updated this revision to Diff 230985.Nov 25 2019, 4:09 PM

address review feedback

wenlei marked 2 inline comments as done.Nov 25 2019, 4:12 PM
wenlei added inline comments.
llvm/lib/Transforms/IPO/SampleProfile.cpp
434–435

Good point, thanks for the suggestion. I moved buildFunctionOrder into runOnModule, and also moved the !isDeclaration check into buildFunctionOrder.

llvm/test/Transforms/SampleProfile/inline-topdown.ll
19

done.

wmi accepted this revision.Nov 27 2019, 9:18 AM

I did performance test for this change and the result is neutral.

llvm/lib/Transforms/IPO/SampleProfile.cpp
1808

Add some message for the assertion.

This revision is now accepted and ready to land.Nov 27 2019, 9:18 AM
This revision was automatically updated to reflect the committed changes.
ychen added a subscriber: ychen.Nov 30 2020, 7:56 PM
ychen added inline comments.
llvm/lib/Transforms/IPO/SampleProfile.cpp
1826

The new pass manager computes the call graph, but here it is skipped. Should we also compute the call graph here?

hoy added inline comments.Dec 1 2020, 9:36 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
1826

Passing a call graph into the sample loader sounds helpful to me. I'm not sure whether there is an available FunctionAnalysisManager object so that the call graph can be reused; computing a call graph on the spot might hurt compiler throughput. @wenlei, can you shed light on this?