Page MenuHomePhabricator

Infrastructure to allow use of PGO in inliner
ClosedPublic

Authored by eraman on Jan 20 2016, 4:55 PM.

Details

Summary

This patch provides the following infrastructure to enable the use of PGO in inliner:

  • Enable the use of block level profile information in inliner
  • Incremental update of block frequency information during inlining
  • Update the function entry counts of callees when they get inlined into callers.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
eraman added inline comments.Jan 26 2016, 11:53 AM
include/llvm/Transforms/IPO/InlinerPass.h
86

AlwaysInliner then has to implement this method to return nullptr. I don't see any advantage in making it pyure virtual.

include/llvm/Transforms/Utils/Cloning.h
51

I could leave it in Cloning.h and include it in InlineSimple.cpp and get rid of it in InlinerPass.h, but this adds an unnecessary dependence to InlineSimple.cpp. There is no other header file common to the inliner proper and CloneFunction.cpp where it is appropriate to add this.

lib/Analysis/InlineCost.cpp
1455

Yes, I was planning to use this as part of tuning, but I've now removed it from this patch.

eraman retitled this revision from Adjust inlining thresholds based on callsite hotness to Infrastructure to allow use of PGO in inliner.Jan 29 2016, 11:44 AM
eraman updated this object.

We need to add more test cases. For instance

  1. two callsites to one callee all get inlined -- the callee should have zero count left
  2. scaling testing -- especially interesting in the context of top-down inlining (in the future we may have more heuristics to enable more top down inining, there are existing heuristics that can be used to trigger the case):

a ---> (200) ---> c
b ---> (300) ---> c
where c has entry count of 500

c-->(200) --> e
d -->(300) --> e
where e has entry count of 500.

Suppose there are two inlines in the following order: a->c, and a->e

After inlining, testing e's entry count is properly updated.

include/llvm/Analysis/InlineCost.h
44

suggested comment change: ... on demand, cached (for callees), and incrementally updated (for callers). The caller BFI is invalidated after inlining is done for (the caller).

53

Typo -- frequency

lib/Analysis/InlineCost.cpp
620

This guard is wrong :1) it should check caller entry count, not callee entry count. 2) hasPGOCounts also depends on other conditions.

In fact, this condition can be removed.

1554

Add a comment for the function.

1567

Irrelevant comment line?

eraman updated this revision to Diff 46583.Feb 1 2016, 3:23 PM
eraman marked 4 inline comments as done.

Address David's review comments.

eraman added a comment.Feb 1 2016, 3:24 PM

I've added two more test cases.

include/llvm/Analysis/InlineCost.h
44

This class itself is agnostic about callers and callees. All it does is cache the BFI and invalidate them when required - and that's what the comment says.

lib/Analysis/InlineCost.cpp
620

Yes, this check is not needed and I've removed it.

eraman added a comment.Feb 1 2016, 3:29 PM

Chandler and Hal, could you please give your feedback on this patch? Even if you can not give detailed comments right now, any high level comments will be very useful and I can iterate on this patch based on them.

Nice test case.

Dehao mentioned there is a problem with BB splitting when a basic block has two callsites -- after the first call gets inlined, the newly split block with the second callsite does not have updated profile data. Please handle that (with a proper test case).

junbuml added inline comments.Feb 2 2016, 9:47 AM
lib/Analysis/InlineCost.cpp
618–619

Call and CallBB seems to be used only in llvm::getBlockCount(CallBB, BFA);

1564

Can we guarantee FunctionEntryFreq is always non-zero if EntryCount is non-zero ?

lib/Transforms/IPO/InlineSimple.cpp
159

Should we have (double) here ?

160

why not checking CalleeEntryFreq * CallSiteFreq is non-zero.

196

No need to have braces.

Nice test case.

Dehao mentioned there is a problem with BB splitting when a basic block has two callsites -- after the first call gets inlined, the newly split block with the second callsite does not have updated profile data. Please handle that (with a proper test case).

I am working on the fix. I am also going to refactor the code a bit to do all updates in the base Inliner class rather than SimpleInliner. My original reasoning was that we don't have to do the updates for AlwaysInliner, but that is not true. With PGO, if always inline pass inlines a hot callsite, the callee will not be updated with the right entruy count. Besides using callbacks between the base and derived class looks ugly. I am going to move all the updates to the base class (there will still be a callback passed to the cloning methods) and guard the updates with a check for PGO. Later, we can refine this to support updates for non-PGO cases as well.

eraman updated this revision to Diff 46720.Feb 2 2016, 4:17 PM
eraman marked 2 inline comments as done.

This fixes a bug (BB containing call's successor didn't have the right BFI), addresses reviewer comments and has been rebased to a recent head.

lib/Analysis/InlineCost.cpp
1564

FunctionEntryFreq is always non-zero and that doesn't depend on entry count.

lib/Transforms/IPO/InlineSimple.cpp
159

Not sure why that is needed.

160

It is dividing by CalleeEntryFreq and multiplying by CallSiteFreq, so this is a problem only when CalleeEntryFreq is 0. As I wrote in a previous comment, that is not possible.

davidxl added inline comments.Feb 2 2016, 4:35 PM
include/llvm/Transforms/IPO/InlinerPass.h
79

--> hasProfileData

lib/Transforms/IPO/Inliner.cpp
428

hasProfileData

489

add a comment here.

test/Transforms/Inline/function-count-update.ll
4 ↗(On Diff #46720)

into two callsites in the same block.

20 ↗(On Diff #46720)

Can you make these two calls in a block that is not the entry block?

Please also add a test case that covers proper profile update with always inline as intended for the refactoring.

davidxl added inline comments.Feb 2 2016, 4:43 PM
include/llvm/Analysis/InlineCost.h
152

I suggest making this a member method of BFA. Also provide a wrapper method -- this will be the primary API used by other clients for callsite hotness:

Optional<uint64_t> getCallsiteCount(CallSite CS);

eraman marked 4 inline comments as done.Feb 3 2016, 4:21 PM

Please also add a test case that covers proper profile update with always inline as intended for the refactoring.

Done. I have added the alwaysinline attribute to the callee in the first test case and have run it with -always-inline alone.

include/llvm/Analysis/InlineCost.h
152

I agree that this needs to be moved somewhere else, but I am not convinced BFA is that place. In the other profile refactoring patch, I have created a ProfileCommon.h. Perhaps that is a good place for profile related urility methods like this?

test/Transforms/Inline/function-count-update.ll
20 ↗(On Diff #46720)

That does not matter for the purposes of this test case.

eraman updated this revision to Diff 46850.Feb 3 2016, 4:22 PM

Address David's comments.

davidxl added inline comments.Feb 3 2016, 4:42 PM
include/llvm/Analysis/InlineCost.h
152

ok -- ProfileCommon.h will be a better place in the future. For now let's keep it here -- but I think we should have a top level wrapper to get count for callsite here.

test/Transforms/Inline/function-count-update.ll
5 ↗(On Diff #46850)

in a single callee or in a single BB?

The test intends to verify that the profile update of the second inlined callsite is correct after block split -- so having the callee identical in the two sites are not essential to the test.

21 ↗(On Diff #46850)

The purpose is to verify that copyBlockFrequency after bb split works in general not just the entry block.

eraman marked an inline comment as done.Feb 4 2016, 1:25 PM
eraman added inline comments.
include/llvm/Analysis/InlineCost.h
152

Do you want the wrapper in InlineCost.cpp or in ProfileCommon.h (later)? As of now, there are two calls to this and one of them can not pass CS (this is in Inliner.cpp during BFI updates) since the call instruction in CS has been removed.

I agree that getProfileCount(CS) is a useful API, but I think that should be somewhere like ProfileCommon

test/Transforms/Inline/function-count-update.ll
5 ↗(On Diff #46850)

This is a typo - I meant to write "in the same caller". I have fixed the comments. Yes, for the purpose of this test, we don't need the callee to be the same. I have made the caller call 2 different functions. I also don't think making the calls in a block other than the entry makes any difference, but since it doesn't hurt, I've made that change as well.

eraman updated this revision to Diff 46955.Feb 4 2016, 1:26 PM

Tweaks to a test case based on David's feedback.

This is a very important enhancement that many people and tuning work are waiting for. It is IMO very clean, solid and non-intrusive. Before we move on, I'd like to see more testings on large apps for sanity.

Ping. I would like to move forward on this to enable PGO improvements in
inliner and function layout (separating cold functions becomes effective if
we have their entry counts after inlining). Specifically, I think the code
that computes BFA is isolated enough that ripping it out when inliner is
ported to the new pass manager is not very intrusive.

I did more testing as asked by David. Specifically,

  1. Turn on the updates by default (instead of inly in PGO) and ran llvm

testsuite

  1. Turn on the updates by default and build clang.
  2. Build Google internal compiler benchmark suite in PGO mode with this

patch enabled.

These tests above didn't reveal any issues.

Thanks,
Easwaran

twoh added a subscriber: twoh.Feb 25 2016, 9:00 PM

Thanks for collecting the data!

I have a couple of more comments below:

lib/Analysis/InlineCost.cpp
618

This call has the side effect of force computing BFI even when it is disabled (with option). I think CallAnalyzer needs to have a wrapper to this function and check the settings before this is called.

lib/Transforms/IPO/Inliner.cpp
360

I think we should introduce a more general flag here EnableProfile. This flag's value is determined by an internal option:

--enable-profile-in-inliner=default|yes|no

With yes and no can be used to force turning on|off while default is subject to compiler: for now if hasProfile is true, turn it on otherwise off.

423

Needs to be guarded with EnableProfile?

493

Should this be guarded with EnableProfile too?

eraman marked an inline comment as done.Mar 2 2016, 11:44 AM
eraman added inline comments.
lib/Analysis/InlineCost.cpp
618

I have guarded the computation in getBlockCount by a check to see if BFA is nullptr. Also made sure getInlineCost gets a non-null BFA only when HasProfileData is true.

lib/Transforms/IPO/Inliner.cpp
360

I don't prefer to add a new flag for that. I think the current solution of checking if profile is turned on is sufficient for now.

493

The callee copyBlockFrequency has the guard, so it is not necessary at the callsite.

eraman updated this revision to Diff 49656.Mar 2 2016, 11:45 AM
davidxl accepted this revision.Mar 3 2016, 9:32 AM
davidxl edited edge metadata.

LGTM.

This looks really great, so let's move on with this long waited missing feature.

  1. watch for sanitizer build results
  2. there are more tuning opportunities that can be done in the future -- e.g, if all incoming callgraph edges of a node have been analyzed and marked as non-inlinable, the node's prof data can be evicted from the cache. Maybe useful for LTO.
This revision is now accepted and ready to land.Mar 3 2016, 9:32 AM
This revision was automatically updated to reflect the committed changes.
chandlerc edited edge metadata.Mar 7 2016, 1:40 AM

LGTM.

This looks really great, so let's move on with this long waited missing feature.

David, this is not an area of LLVM you have done substantial work on, and this is a very significant feature.

I'm sorry that I have not had time to review this yet, but the correct response is not for you to make a patch as LGTM. Please revert this and let's actually get it reviewed before it goes into the project.

First and foremost, sorry about snapping earlier. I shouldn't have done that, I was frustrated and not communicating very effectively. Thanks to Sean and Hal and others who wrote constructive and helpful emails to get this back on the rails. Secondly, sorry that I've neglected this patch for so long. I kept prioritizing working on the actual pass manager stuff over it, and I should have at least written this up.

So I do have some serious concerns with the approach. I think it is probably a nice way to prototype things and understand their impact (which is very important!) but there are several aspects of the design that seem to be the wrong approach. I've detailed them below.

I've said a few times before, but I continue to think that while some of the issues I've raised below could be addressed, that may not be the best use of time. I think it would probably be more effective to work on helping with the pass manager effort in order to make that effort finish more rapidly. There is a large amount of work that is essentially unblocked right now to help port function passes over. All of this is very easily parallelized.

Once that is done, I would work on designing a good interprocedural profile analysis interface with a good update API, and then teach the inliner to use that. I think that the process of porting the inliner to the new pass manager will end up being a fairly substantial change (essentially a re-write) which will make integrating this much cleaner and easier.

I'm somewhat worried that in its current form, this change would actually make it *harder* to port the inliner and more work to introduce the correct long-term design.

Below are detailed comments about the design issues, but also quite a few comments about serious API, code, and implementation issues with the patch as it currently stands. I think it still needs substantial work to go into the tree, even if the high level design concern above is addressed.

llvm/trunk/include/llvm/Analysis/InlineCost.h
42–55 ↗(On Diff #49755)

This is pretty much re-implementing a tiny part of the new pass manager inside the inline cost analysis. That really feels like the wrong approach to me. Notably, this is even named exactly the same as we would want to name a port of BlockFrequencyInfo to the new pass manager.

I don't agree with this approach. Fundamentally, this is not the right long-term design for how the inliner should access profile information, and it actually obstructs getting the right design in place.

151–152 ↗(On Diff #49755)

This seems like a surprising API to expose at the top level from the inline cost analysis.

llvm/trunk/include/llvm/Transforms/IPO/InlinerPass.h
30–37 ↗(On Diff #49755)

These functor typedefs are a bit confusing. First off, std::function is very slow. Is this overhead going to cause a compile time issue?

The second two of these typedefs aren't used, please don't leave dead code in the patch.

Lastly, having them at the global scope seems like a bad design. It seems like they should be part of some API instead of just being global?

82 ↗(On Diff #49755)

I would keep vertical space between declarations. It makes the comments much easier to read.

102 ↗(On Diff #49755)

I would say "Indicates whether we are using profile guided optimization." in the comment.

However, I'm surprised we need both a bool and a unique_ptr -- wouldn't the nullness of the pointer indicate the same thing as the bool?

llvm/trunk/include/llvm/Transforms/Utils/Cloning.h
51–52 ↗(On Diff #49755)

This is in essence an ODR violation -- two headers are defining the same typedef.

164 ↗(On Diff #49755)

Initializing a functor from nullptr is pretty surprising. I would just use BlockCloningFunctor Ftor = BlockCloningFunctor() instead.

However, I think this is an indication of a larger design issue. I think the fact that you need to thread this cloning functor between very disparate layers of the code isn't good. I feel like the cloning layer should really be taught to directly support updating profile information when cloning. This of course may not be terrible reasonable to implement currently, as you have the inliner essentially maintaining a mini form of the pass manager internally and need to thread some way to update it through these layers. If that's the problem, I think it really is indicating that this needs to wait until we have the pass manager fixed so that we can layer this logic more cleanly.

llvm/trunk/lib/Analysis/InlineCost.cpp
585–588 ↗(On Diff #49755)

It's a bit odd that this isn't a doxygen comment. Also that this is marked as "unused". What's the plan here?

618–620 ↗(On Diff #49755)

Separating this out into two functions doesn't make a lot of sense here. You now have two parts of the code handling profile based inlining adjustments.

I think you should keep all of the profile based adjustments inline here, and integrate it with the function count logic above.

1570–1576 ↗(On Diff #49755)

So, inside this routine, you're actually implementing a proper interprocedural profile analysis. I really think this needs to be extracted to provide a high level analysis API around interprocedurally weighted profile information.

1591–1595 ↗(On Diff #49755)

While it happens that BPI doesn't keep any references or pointers back into the dominator tree or loop info, the APIs of these analyses actually would make that OK to do... Which makes this code a rather subtle mini implementation of the core logic of the pass manager but inside the inline cost analysis. That worries me some. At the very least it would need a lot of documentation to make it clear exactly what was going on and why this was a safe thing to do.

llvm/trunk/lib/Transforms/IPO/Inliner.cpp
376–380 ↗(On Diff #49755)

You're dividing by CalleeEntryFreq because the original value was scaled up by that. This essentially strips some precision off. It would seem better to instead directly access the unscaled block frequency for the old BB, and then scale it correctly for the post-inlined state.

383–384 ↗(On Diff #49755)

We have worked very hard throughout the PGO analysis infrastructure to not rely on actual hardware floating point. We have a broad collection of carefully written math utilities for this. One part of factoring this logic into a real IPO profile update routine would be to use the same facilities and infrastructure that the existing profile (and profile update) analyses use for this purpose.

570 ↗(On Diff #49755)

Please use lambdas instead of std::bind -- they make the code much more readable as it is easier to understand how they behave. It also avoids the pain of using placeholder values.

759 ↗(On Diff #49755)

I'm really concerned by the need to manually invalidate BFI in so many places. I think this is going to prove to be a very error prone pattern to maintain long term.

eraman added a comment.Mar 8 2016, 2:11 PM

Thanks for the detailed comments Chandler. The comments related to problems around API, code and implementation are very helpful and I'll address them.

Your main concern is that this patch tries to provide the functionality of pass manager inside inline cost analysis. The intention is indeed to provide a restricted subset of pass manager's functionality for the inliner to use, but with the understanding that this is a temporary fix for inliner that should be easily replaceable when this functionality is available in PM. If this doesn't meet those goals (restricted to inliner and takes minimal effort to replace it by the new PM), I'll happily iterate on this patch to address that concern. I have responded inline to the comments related to this aspect.

Thanks,
Easwaran

llvm/trunk/include/llvm/Analysis/InlineCost.h
42–55 ↗(On Diff #49755)

This is not *intended* to be a long term solution. The idea is to provide an interface similar to what the new pass manager provides so that when it lands, this code can be ripped off. I can rename it and/or move this code to a separate file if that will make the migration smoother.

151–152 ↗(On Diff #49755)

I'll move it to ProfileCommon.h. I left it here because I don't want to make anything outside of inliner to use BlockFrequencyAnalysis

llvm/trunk/include/llvm/Transforms/Utils/Cloning.h
164 ↗(On Diff #49755)

I think clients that invoke the cloning code might want to update different analyses or choose to not update at all. IMO, sinking all that into the cloning code doesn't seem a good idea,

llvm/trunk/lib/Analysis/InlineCost.cpp
1570–1576 ↗(On Diff #49755)

Ok.

1591–1595 ↗(On Diff #49755)

To reiterate, the intention is to *temporarily* provide the functionality that will be provided by the PM in a way that will make the transition easier. Agreed that this is not documented at all. Will fix that.

llvm/trunk/lib/Transforms/IPO/Inliner.cpp
376–380 ↗(On Diff #49755)

The unscaled block frequency is meant to be used only for the initial BFI computation right? Moreover as we incrementally update the block frequency using setBlockFrequency interface of BFI, there may be blocks that do not have an unscaled value associated with them,

759 ↗(On Diff #49755)

The alternative is to conservatively invalidate BFI at the beginning of the inliner loop. That is fine with me if the increase in compilation time is ok.