This is an archive of the discontinued LLVM Phabricator instance.

Infrastructure to allow use of PGO in inliner
ClosedPublic

Authored by eraman on Jan 20 2016, 4:55 PM.

Details

Summary

This patch provides the following infrastructure to enable the use of PGO in inliner:

  • Enable the use of block level profile information in inliner
  • Incremental update of block frequency information during inlining
  • Update the function entry counts of callees when they get inlined into callers.

Diff Detail

Event Timeline

eraman updated this revision to Diff 45464.Jan 20 2016, 4:55 PM
eraman retitled this revision from to Adjust inlining thresholds based on callsite hotness.
eraman updated this object.
eraman updated this object.Jan 20 2016, 4:57 PM
eraman added reviewers: hfinkel, chandlerc.
eraman added subscribers: llvm-commits, davidxl.

The subject of the patch is misleading -- it should really be 'Make block/callsite level profile data available for inliner'

zzheng edited edge metadata.Jan 21 2016, 1:18 PM

A few suggestions to make code more easy to read. This patch of course needs a PGO expert to review.

include/llvm/Transforms/IPO/InlinerPass.h
93

Where's these two being used?

lib/Analysis/InlineCost.cpp
622

Unused variable: CallerEntryCount

627

Caller->getEntryCount() called twice, this looks simpler:

Optional<uint64_t> CallerEntryCount = Caller->getEntryCount();
if (CallerEntryCount.hasValue()) {
  uint64_t CallSiteCount =
          CallerEntryCount.getValue() * CallSiteFreq / CallerEntryFreq;
}
1567

Empty comment line, I think it can be removed.

eraman updated this revision to Diff 45589.Jan 21 2016, 1:35 PM
eraman edited edge metadata.

Address comments by zzheng

eraman marked 3 inline comments as done.Jan 21 2016, 1:37 PM

Thanks for the comments.

include/llvm/Transforms/IPO/InlinerPass.h
93

Removed them. I had initially used them and later replaced them by getXXXFunctor methods but forgot to remove those fields.

junbuml added inline comments.Jan 22 2016, 7:43 AM
lib/Analysis/InlineCost.cpp
1455

I cannot see any use of this function. Are you planing to hook this function in heuristic in this patch ?

1573

You can call LoopInfo LI(DT) instead of LoopInfo LI; LI.analyze(DT);

1576

BranchProbabilityInfo BPI(*F, LI); will call calculate(F, LI);

1579

You can also call BlockFrequencyInfo(*F, BPI, LI), which call calculate(*F, BPI, LI); inside.

davidxl added inline comments.Jan 22 2016, 4:28 PM
include/llvm/Transforms/IPO/InlinerPass.h
86

Can they be made pure virtual -- probably not if the derived class is not required to override it.

include/llvm/Transforms/Utils/Cloning.h
51

Is there a common header to put this decl in?

lib/Analysis/InlineCost.cpp
75

Remove this -- it can be folded in the threshold adjusting method tuned in the future.

81

I suggest removing these two parameters in these patch, as they are for not irrelevant and confusing (and subject to change in the very near future when summary data is available to be used here).

The objective of this patch is to provide infrastructure hooks to feed profile data to the inliner, not to tune/or change the current inline behavior/decision -- as that needs more design and experiment. To accomplish that goal, I suggest simply add a dummy wrapper function to do this:

int getAdjustedThreshold(int current_threshold, uint64_t callsite_count attribute((unused))) {

// just return the input threshold for now
return current_threshold;

}

In the future, I expect the threshold is different based on callsite_count and summary cutoffs.

623

Define a helper method to get callsite profile count.

629

call the proposed 'getAdjustedThreshold(Threshold, CallSiteCount);

1455

Remove unused function.

davidxl edited edge metadata.Jan 23 2016, 11:34 AM

When the invalidation happens, the updated entry count information gets discarded it seems. This is wrong. The data should be propagated back to the IR. The information can be useful later for decisions such as function layout and optimize for size etc.

Some test case is also needed to test that the entry counts can be properly updated during inlining.

lib/Transforms/IPO/InlineSimple.cpp
142

\p OrigBB \p NewBB

157

Freq --> OrigBBFreq with comment it is OrigBB's frequency in callee

162

\p Callee ... after it got inlined at callsite in block \p CallBB

177

computing callsite count belongs to a common helper function -- which can be used elsewhere too.

180

It is an estimate because the limitation of frequency propagation algorithm (unable to handle zero BP etc).

It is helpful to add a debug message here to indicate this condition.

eraman updated this revision to Diff 46023.Jan 26 2016, 11:53 AM
eraman edited edge metadata.
eraman marked 14 inline comments as done.

Address review comments and add a test case to function entry counts are correctly updated.

When the invalidation happens, the updated entry count information gets discarded it seems. This is wrong. The data should be propagated back to the IR. The information can be useful later for decisions such as function layout and optimize for size etc.

UpdateEntryCount in InlineSimple.cpp calls setEntryCount and hence the counts are not lost.

Some test case is also needed to test that the entry counts can be properly updated during inlining.

Added a test case.

lib/Transforms/IPO/InlineSimple.cpp
177

I have placed this helper in Inlinecost.cpp.

eraman added inline comments.Jan 26 2016, 11:53 AM
include/llvm/Transforms/IPO/InlinerPass.h
86

AlwaysInliner then has to implement this method to return nullptr. I don't see any advantage in making it pyure virtual.

include/llvm/Transforms/Utils/Cloning.h
51

I could leave it in Cloning.h and include it in InlineSimple.cpp and get rid of it in InlinerPass.h, but this adds an unnecessary dependence to InlineSimple.cpp. There is no other header file common to the inliner proper and CloneFunction.cpp where it is appropriate to add this.

lib/Analysis/InlineCost.cpp
1455

Yes, I was planning to use this as part of tuning, but I've now removed it from this patch.

eraman retitled this revision from Adjust inlining thresholds based on callsite hotness to Infrastructure to allow use of PGO in inliner.Jan 29 2016, 11:44 AM
eraman updated this object.

We need to add more test cases. For instance

  1. two callsites to one callee all get inlined -- the callee should have zero count left
  2. scaling testing -- especially interesting in the context of top-down inlining (in the future we may have more heuristics to enable more top down inining, there are existing heuristics that can be used to trigger the case):

a ---> (200) ---> c
b ---> (300) ---> c
where c has entry count of 500

c-->(200) --> e
d -->(300) --> e
where e has entry count of 500.

Suppose there are two inlines in the following order: a->c, and a->e

After inlining, testing e's entry count is properly updated.

include/llvm/Analysis/InlineCost.h
44

suggested comment change: ... on demand, cached (for callees), and incrementally updated (for callers). The caller BFI is invalidated after inlining is done for (the caller).

53

Typo -- frequency

lib/Analysis/InlineCost.cpp
620

This guard is wrong :1) it should check caller entry count, not callee entry count. 2) hasPGOCounts also depends on other conditions.

In fact, this condition can be removed.

1554

Add a comment for the function.

1567

Irrelevant comment line?

eraman updated this revision to Diff 46583.Feb 1 2016, 3:23 PM
eraman marked 4 inline comments as done.

Address David's review comments.

eraman added a comment.Feb 1 2016, 3:24 PM

I've added two more test cases.

include/llvm/Analysis/InlineCost.h
44

This class itself is agnostic about callers and callees. All it does is cache the BFI and invalidate them when required - and that's what the comment says.

lib/Analysis/InlineCost.cpp
620

Yes, this check is not needed and I've removed it.

eraman added a comment.Feb 1 2016, 3:29 PM

Chandler and Hal, could you please give your feedback on this patch? Even if you can not give detailed comments right now, any high level comments will be very useful and I can iterate on this patch based on them.

Nice test case.

Dehao mentioned there is a problem with BB splitting when a basic block has two callsites -- after the first call gets inlined, the newly split block with the second callsite does not have updated profile data. Please handle that (with a proper test case).

junbuml added inline comments.Feb 2 2016, 9:47 AM
lib/Analysis/InlineCost.cpp
618–619

Call and CallBB seems to be used only in llvm::getBlockCount(CallBB, BFA);

1564

Can we guarantee FunctionEntryFreq is always non-zero if EntryCount is non-zero ?

lib/Transforms/IPO/InlineSimple.cpp
159

Should we have (double) here ?

160

why not checking CalleeEntryFreq * CallSiteFreq is non-zero.

196

No need to have braces.

Nice test case.

Dehao mentioned there is a problem with BB splitting when a basic block has two callsites -- after the first call gets inlined, the newly split block with the second callsite does not have updated profile data. Please handle that (with a proper test case).

I am working on the fix. I am also going to refactor the code a bit to do all updates in the base Inliner class rather than SimpleInliner. My original reasoning was that we don't have to do the updates for AlwaysInliner, but that is not true. With PGO, if always inline pass inlines a hot callsite, the callee will not be updated with the right entruy count. Besides using callbacks between the base and derived class looks ugly. I am going to move all the updates to the base class (there will still be a callback passed to the cloning methods) and guard the updates with a check for PGO. Later, we can refine this to support updates for non-PGO cases as well.

eraman updated this revision to Diff 46720.Feb 2 2016, 4:17 PM
eraman marked 2 inline comments as done.

This fixes a bug (BB containing call's successor didn't have the right BFI), addresses reviewer comments and has been rebased to a recent head.

lib/Analysis/InlineCost.cpp
1564

FunctionEntryFreq is always non-zero and that doesn't depend on entry count.

lib/Transforms/IPO/InlineSimple.cpp
159

Not sure why that is needed.

160

It is dividing by CalleeEntryFreq and multiplying by CallSiteFreq, so this is a problem only when CalleeEntryFreq is 0. As I wrote in a previous comment, that is not possible.

davidxl added inline comments.Feb 2 2016, 4:35 PM
include/llvm/Transforms/IPO/InlinerPass.h
79

--> hasProfileData

lib/Transforms/IPO/Inliner.cpp
428

hasProfileData

489

add a comment here.

test/Transforms/Inline/function-count-update.ll
4 ↗(On Diff #46720)

into two callsites in the same block.

20 ↗(On Diff #46720)

Can you make these two calls in a block that is not the entry block?

Please also add a test case that covers proper profile update with always inline as intended for the refactoring.

davidxl added inline comments.Feb 2 2016, 4:43 PM
include/llvm/Analysis/InlineCost.h
152

I suggest making this a member method of BFA. Also provide a wrapper method -- this will be the primary API used by other clients for callsite hotness:

Optional<uint64_t> getCallsiteCount(CallSite CS);

eraman marked 4 inline comments as done.Feb 3 2016, 4:21 PM

Please also add a test case that covers proper profile update with always inline as intended for the refactoring.

Done. I have added the alwaysinline attribute to the callee in the first test case and have run it with -always-inline alone.

include/llvm/Analysis/InlineCost.h
152

I agree that this needs to be moved somewhere else, but I am not convinced BFA is that place. In the other profile refactoring patch, I have created a ProfileCommon.h. Perhaps that is a good place for profile related urility methods like this?

test/Transforms/Inline/function-count-update.ll
20 ↗(On Diff #46720)

That does not matter for the purposes of this test case.

eraman updated this revision to Diff 46850.Feb 3 2016, 4:22 PM

Address David's comments.

davidxl added inline comments.Feb 3 2016, 4:42 PM
include/llvm/Analysis/InlineCost.h
152

ok -- ProfileCommon.h will be a better place in the future. For now let's keep it here -- but I think we should have a top level wrapper to get count for callsite here.

test/Transforms/Inline/function-count-update.ll
5 ↗(On Diff #46850)

in a single callee or in a single BB?

The test intends to verify that the profile update of the second inlined callsite is correct after block split -- so having the callee identical in the two sites are not essential to the test.

21 ↗(On Diff #46850)

The purpose is to verify that copyBlockFrequency after bb split works in general not just the entry block.

eraman marked an inline comment as done.Feb 4 2016, 1:25 PM
eraman added inline comments.
include/llvm/Analysis/InlineCost.h
152

Do you want the wrapper in InlineCost.cpp or in ProfileCommon.h (later)? As of now, there are two calls to this and one of them can not pass CS (this is in Inliner.cpp during BFI updates) since the call instruction in CS has been removed.

I agree that getProfileCount(CS) is a useful API, but I think that should be somewhere like ProfileCommon

test/Transforms/Inline/function-count-update.ll
5 ↗(On Diff #46850)

This is a typo - I meant to write "in the same caller". I have fixed the comments. Yes, for the purpose of this test, we don't need the callee to be the same. I have made the caller call 2 different functions. I also don't think making the calls in a block other than the entry makes any difference, but since it doesn't hurt, I've made that change as well.

eraman updated this revision to Diff 46955.Feb 4 2016, 1:26 PM

Tweaks to a test case based on David's feedback.

This is a very important enhancement that many people and tuning work are waiting for. It is IMO very clean, solid and non-intrusive. Before we move on, I'd like to see more testings on large apps for sanity.

Ping. I would like to move forward on this to enable PGO improvements in
inliner and function layout (separating cold functions becomes effective if
we have their entry counts after inlining). Specifically, I think the code
that computes BFA is isolated enough that ripping it out when inliner is
ported to the new pass manager is not very intrusive.

I did more testing as asked by David. Specifically,

  1. Turn on the updates by default (instead of inly in PGO) and ran llvm

testsuite

  1. Turn on the updates by default and build clang.
  2. Build Google internal compiler benchmark suite in PGO mode with this

patch enabled.

These tests above didn't reveal any issues.

Thanks,
Easwaran

twoh added a subscriber: twoh.Feb 25 2016, 9:00 PM

Thanks for collecting the data!

I have a couple of more comments below:

lib/Analysis/InlineCost.cpp
618

This call has the side effect of force computing BFI even when it is disabled (with option). I think CallAnalyzer needs to have a wrapper to this function and check the settings before this is called.

lib/Transforms/IPO/Inliner.cpp
360

I think we should introduce a more general flag here EnableProfile. This flag's value is determined by an internal option:

--enable-profile-in-inliner=default|yes|no

With yes and no can be used to force turning on|off while default is subject to compiler: for now if hasProfile is true, turn it on otherwise off.

423

Needs to be guarded with EnableProfile?

493

Should this be guarded with EnableProfile too?

eraman marked an inline comment as done.Mar 2 2016, 11:44 AM
eraman added inline comments.
lib/Analysis/InlineCost.cpp
618

I have guarded the computation in getBlockCount by a check to see if BFA is nullptr. Also made sure getInlineCost gets a non-null BFA only when HasProfileData is true.

lib/Transforms/IPO/Inliner.cpp
360

I don't prefer to add a new flag for that. I think the current solution of checking if profile is turned on is sufficient for now.

493

The callee copyBlockFrequency has the guard, so it is not necessary at the callsite.

eraman updated this revision to Diff 49656.Mar 2 2016, 11:45 AM
davidxl accepted this revision.Mar 3 2016, 9:32 AM
davidxl edited edge metadata.

LGTM.

This looks really great, so let's move on with this long waited missing feature.

  1. watch for sanitizer build results
  2. there are more tuning opportunities that can be done in the future -- e.g, if all incoming callgraph edges of a node have been analyzed and marked as non-inlinable, the node's prof data can be evicted from the cache. Maybe useful for LTO.
This revision is now accepted and ready to land.Mar 3 2016, 9:32 AM
This revision was automatically updated to reflect the committed changes.
chandlerc edited edge metadata.Mar 7 2016, 1:40 AM

LGTM.

This looks really great, so let's move on with this long waited missing feature.

David, this is not an area of LLVM you have done substantial work on, and this is a very significant feature.

I'm sorry that I have not had time to review this yet, but the correct response is not for you to make a patch as LGTM. Please revert this and let's actually get it reviewed before it goes into the project.

First and foremost, sorry about snapping earlier. I shouldn't have done that, I was frustrated and not communicating very effectively. Thanks to Sean and Hal and others who wrote constructive and helpful emails to get this back on the rails. Secondly, sorry that I've neglected this patch for so long. I kept prioritizing working on the actual pass manager stuff over it, and I should have at least written this up.

So I do have some serious concerns with the approach. I think it is probably a nice way to prototype things and understand their impact (which is very important!) but there are several aspects of the design that seem to be the wrong approach. I've detailed them below.

I've said a few times before, but I continue to think that while some of the issues I've raised below could be addressed, that may not be the best use of time. I think it would probably be more effective to work on helping with the pass manager effort in order to make that effort finish more rapidly. There is a large amount of work that is essentially unblocked right now to help port function passes over. All of this is very easily parallelized.

Once that is done, I would work on designing a good interprocedural profile analysis interface with a good update API, and then teach the inliner to use that. I think that the process of porting the inliner to the new pass manager will end up being a fairly substantial change (essentially a re-write) which will make integrating this much cleaner and easier.

I'm somewhat worried that in its current form, this change would actually make it *harder* to port the inliner and more work to introduce the correct long-term design.

Below are detailed comments about the design issues, but also quite a few comments about serious API, code, and implementation issues with the patch as it currently stands. I think it still needs substantial work to go into the tree, even if the high level design concern above is addressed.

llvm/trunk/include/llvm/Analysis/InlineCost.h
42–55 ↗(On Diff #49755)

This is pretty much re-implementing a tiny part of the new pass manager inside the inline cost analysis. That really feels like the wrong approach to me. Notably, this is even named exactly the same as we would want to name a port of BlockFrequencyInfo to the new pass manager.

I don't agree with this approach. Fundamentally, this is not the right long-term design for how the inliner should access profile information, and it actually obstructs getting the right design in place.

151–152 ↗(On Diff #49755)

This seems like a surprising API to expose at the top level from the inline cost analysis.

llvm/trunk/include/llvm/Transforms/IPO/InlinerPass.h
30–37 ↗(On Diff #49755)

These functor typedefs are a bit confusing. First off, std::function is very slow. Is this overhead going to cause a compile time issue?

The second two of these typedefs aren't used, please don't leave dead code in the patch.

Lastly, having them at the global scope seems like a bad design. It seems like they should be part of some API instead of just being global?

82 ↗(On Diff #49755)

I would keep vertical space between declarations. It makes the comments much easier to read.

102 ↗(On Diff #49755)

I would say "Indicates whether we are using profile guided optimization." in the comment.

However, I'm surprised we need both a bool and a unique_ptr -- wouldn't the nullness of the pointer indicate the same thing as the bool?

llvm/trunk/include/llvm/Transforms/Utils/Cloning.h
51–52 ↗(On Diff #49755)

This is in essence an ODR violation -- two headers are defining the same typedef.

164 ↗(On Diff #49755)

Initializing a functor from nullptr is pretty surprising. I would just use BlockCloningFunctor Ftor = BlockCloningFunctor() instead.

However, I think this is an indication of a larger design issue. I think the fact that you need to thread this cloning functor between very disparate layers of the code isn't good. I feel like the cloning layer should really be taught to directly support updating profile information when cloning. This of course may not be terrible reasonable to implement currently, as you have the inliner essentially maintaining a mini form of the pass manager internally and need to thread some way to update it through these layers. If that's the problem, I think it really is indicating that this needs to wait until we have the pass manager fixed so that we can layer this logic more cleanly.

llvm/trunk/lib/Analysis/InlineCost.cpp
585–588 ↗(On Diff #49755)

It's a bit odd that this isn't a doxygen comment. Also that this is marked as "unused". What's the plan here?

618–620 ↗(On Diff #49755)

Separating this out into two functions doesn't make a lot of sense here. You now have two parts of the code handling profile based inlining adjustments.

I think you should keep all of the profile based adjustments inline here, and integrate it with the function count logic above.

1570–1576 ↗(On Diff #49755)

So, inside this routine, you're actually implementing a proper interprocedural profile analysis. I really think this needs to be extracted to provide a high level analysis API around interprocedurally weighted profile information.

1591–1595 ↗(On Diff #49755)

While it happens that BPI doesn't keep any references or pointers back into the dominator tree or loop info, the APIs of these analyses actually would make that OK to do... Which makes this code a rather subtle mini implementation of the core logic of the pass manager but inside the inline cost analysis. That worries me some. At the very least it would need a lot of documentation to make it clear exactly what was going on and why this was a safe thing to do.

llvm/trunk/lib/Transforms/IPO/Inliner.cpp
376–380 ↗(On Diff #49755)

You're dividing by CalleeEntryFreq because the original value was scaled up by that. This essentially strips some precision off. It would seem better to instead directly access the unscaled block frequency for the old BB, and then scale it correctly for the post-inlined state.

383–384 ↗(On Diff #49755)

We have worked very hard throughout the PGO analysis infrastructure to not rely on actual hardware floating point. We have a broad collection of carefully written math utilities for this. One part of factoring this logic into a real IPO profile update routine would be to use the same facilities and infrastructure that the existing profile (and profile update) analyses use for this purpose.

570 ↗(On Diff #49755)

Please use lambdas instead of std::bind -- they make the code much more readable as it is easier to understand how they behave. It also avoids the pain of using placeholder values.

759 ↗(On Diff #49755)

I'm really concerned by the need to manually invalidate BFI in so many places. I think this is going to prove to be a very error prone pattern to maintain long term.

eraman added a comment.Mar 8 2016, 2:11 PM

Thanks for the detailed comments Chandler. The comments related to problems around API, code and implementation are very helpful and I'll address them.

Your main concern is that this patch tries to provide the functionality of pass manager inside inline cost analysis. The intention is indeed to provide a restricted subset of pass manager's functionality for the inliner to use, but with the understanding that this is a temporary fix for inliner that should be easily replaceable when this functionality is available in PM. If this doesn't meet those goals (restricted to inliner and takes minimal effort to replace it by the new PM), I'll happily iterate on this patch to address that concern. I have responded inline to the comments related to this aspect.

Thanks,
Easwaran

llvm/trunk/include/llvm/Analysis/InlineCost.h
42–55 ↗(On Diff #49755)

This is not *intended* to be a long term solution. The idea is to provide an interface similar to what the new pass manager provides so that when it lands, this code can be ripped off. I can rename it and/or move this code to a separate file if that will make the migration smoother.

151–152 ↗(On Diff #49755)

I'll move it to ProfileCommon.h. I left it here because I don't want to make anything outside of inliner to use BlockFrequencyAnalysis

llvm/trunk/include/llvm/Transforms/Utils/Cloning.h
164 ↗(On Diff #49755)

I think clients that invoke the cloning code might want to update different analyses or choose to not update at all. IMO, sinking all that into the cloning code doesn't seem a good idea,

llvm/trunk/lib/Analysis/InlineCost.cpp
1570–1576 ↗(On Diff #49755)

Ok.

1591–1595 ↗(On Diff #49755)

To reiterate, the intention is to *temporarily* provide the functionality that will be provided by the PM in a way that will make the transition easier. Agreed that this is not documented at all. Will fix that.

llvm/trunk/lib/Transforms/IPO/Inliner.cpp
376–380 ↗(On Diff #49755)

The unscaled block frequency is meant to be used only for the initial BFI computation right? Moreover as we incrementally update the block frequency using setBlockFrequency interface of BFI, there may be blocks that do not have an unscaled value associated with them,

759 ↗(On Diff #49755)

The alternative is to conservatively invalidate BFI at the beginning of the inliner loop. That is fine with me if the increase in compilation time is ok.