This is an archive of the discontinued LLVM Phabricator instance.

Improve PGO support for the new inliner
ClosedPublic

Authored by eraman on Jan 4 2017, 3:07 PM.

Details

Summary

This adds the following to the new PM based inliner in PGO mode:

  • Use block frequency analysis to derive callsite's profile count and use that to adjust thresholds of hot and cold callsites.
  • Incrementally update the BFI of the caller after a callee gets inlined into it. This incremental update is only within an invocation of the run method - BFI is not preserved across calls to run.
  • Update the function entry count of the callee after inlining it into a caller.

I've tuned the thresholds for the hot and cold callsites using a hacked up version of the old inliner that explicitly computes BFI on a set of internal benchmarks and spec. Once the new PM based pipeline stabilizes (IIRC Chandler mentioned there are known issues) I'll benchmark this again and adjust the thresholds if required.

Diff Detail

Event Timeline

eraman updated this revision to Diff 83147.Jan 4 2017, 3:07 PM
eraman retitled this revision from to Improve PGO support for the new inliner.
eraman updated this object.
eraman added reviewers: chandlerc, davidxl.
eraman added a subscriber: llvm-commits.
davide added a subscriber: davide.Jan 4 2017, 3:11 PM
chandlerc edited edge metadata.Jan 4 2017, 7:05 PM

Some quick initial comments, thanks!

include/llvm/Analysis/InlineCost.h
178

I would use llvm's function_ref rather than std::function -- it is much lighter weight and the caller can ensure that the function object outlives the call.

Also, I would prefer Optional<function_ref<...>> over function_ref<...>* as the pointer seems a more error prone API.

include/llvm/Transforms/Utils/Cloning.h
181

My brain is probably just being a bit dense, but is there no way to just give an (optional) BFI (or pair of BFIs) directly? This doesn't need to query recursively the way inline cost does?

lib/Analysis/InlineCost.cpp
649–650

It seems problematic to couple these hints in this way, and I think that results in the need to use a predicate here.

What about first a small refactoring patch that just separates the PSI-based threshold adjustment from InlineHint, and then this code can add a preferred path instead of the PSI-exclusive one when we have BFI available?

649–650

This code is now very confusing combined with the below code. What is it that determines whether a callsite is hot?

651

Why would CallBB ever be null?

657

You don't print cold here, just hot?

lib/Transforms/Utils/InlineFunction.cpp
1463–1474

I feel like much of this logic should really live in BFI next to the other logic for updating and maintaining these values rather than buried in the inliner.

In particular, it seems like there should be generic logic for taking a collection of basic blocks with existing frequencies and scaling them based on some other block's frequencies.

The same core logic seems like it should be usable by things like loop unrolling, loop splitting, etc where they clone a region of blocks that have existing *relative* frequencies but are suddenly below some dominating predicate that needs to scale those frequencies.

Then the only thing happening here is the cloning of (raw, unscaled) frequencies from the callee into the inlined blocks in the caller. That cloning logic makes sense here as it has everything to do with the inlining step and nothing to do with the particular values being cloned.

1494–1499

This is why I intensely dislike uint64_t... If these were int64_t, then this would just be std::max(0, ...).

Anyways, not something to change in this patch.

1586–1587

Why do we need a dedicated cloned blocks map rather than re-using the VMap below? That is the map the cloning routine uses to map between basic blocks IIRC, so it already seems like it would have the information we need.

davidxl added inline comments.Jan 4 2017, 8:28 PM
include/llvm/Transforms/Utils/Cloning.h
181

This sounds right. IFI probably just needs a pair of BFIs.

lib/Analysis/InlineCost.cpp
649–650

What the following code says is that if callsite hotness info is available, do not use callee-entry based hotness hint.

This needs to be documented.

lib/Transforms/Utils/InlineFunction.cpp
1463–1474

I agree this probably belongs to BFI. Besides IFI is not even used here.

eraman marked 8 inline comments as done.Jan 9 2017, 6:02 PM

Thanks for the detailed comments. In the course of addressing these, I've added an isColdCallSite and isHotCallSite to ProfileSummaryInfo and moved that logic out of inline cost.

include/llvm/Transforms/Utils/Cloning.h
181

Yes, GetBFI is not needed. Now passing caller and callee BFI pointers.

lib/Analysis/InlineCost.cpp
651

You're right, it can never be null. Removed the check.

657

I never needed it when I was debugging the code. I've added a new set of debug messages based on {callsite/callee} {hotness/coldness}

eraman updated this revision to Diff 83761.Jan 9 2017, 6:09 PM
eraman edited edge metadata.
eraman marked 2 inline comments as done.

Address review comments.

davidxl added inline comments.Jan 10 2017, 11:55 AM
lib/Analysis/BlockFrequencyInfo.cpp
174 ↗(On Diff #83761)

can this one be split into a different patch (and with a unit test)?

lib/Analysis/ProfileSummaryInfo.cpp
147 ↗(On Diff #83761)

The following new APIs belong to a separate patch perhaps.

eraman updated this revision to Diff 85008.Jan 19 2017, 12:25 PM

Rebase after pushing some refactoring from previous revision into separate patches.

eraman marked 2 inline comments as done.Jan 19 2017, 12:26 PM

This is looking pretty close. Couple of nits and a couple of more interesting comments inline.

lib/Analysis/InlineCost.cpp
658

nit: Here and below you shouldn't need the llvm:: qualifier on dbgs I think.

lib/Transforms/IPO/Inliner.cpp
769–773

I think you can name the lambda GetBFI and below pass it as {GetBFI} if GetBFI doesn't work. You shouldn't need to construct a variable for the function_ref.

lib/Transforms/Utils/InlineFunction.cpp
1396–1446

missing vertical whitespace here.

1409

The value map seems much larger than the set of basic blocks cloned? Why not walk the cloned region of basic blocks directly and just look them up in the value map?

2247–2249

If the else needs braces, use them on the if?

2250–2252

The location of this makes it easy to completely miss...

I wonder, would it be better to just do this immediately when we create AfterCallBB even though we'll delete it in some cases?

eraman marked 4 inline comments as done.Jan 19 2017, 2:58 PM

Thanks for the comments! Will update the diff in a bit.

lib/Transforms/IPO/Inliner.cpp
769–773

{GetBFI} does the trick. Thanks.

lib/Transforms/Utils/InlineFunction.cpp
2250–2252

I've moved it earlier as you prefer.

eraman updated this revision to Diff 85045.Jan 19 2017, 3:00 PM
eraman marked an inline comment as done.

Address Chandler's comments.

chandlerc accepted this revision.Jan 19 2017, 3:57 PM

Sorry, I think you should have argued with one of my suggestions, see below. =D Your code was already right.

I also have one comment below that I'd mostly like to talk about and address in a follow-up patch if needed, it doesn't need to block this going in. Feel free to submit once you go back to your original (better) approach on the loop.

lib/Analysis/InlineCost.cpp
52–62

Is there any way to avoid introducing yet another tunable here? It feels weird to have two cold tunables, and more weird for them to be different.

Also, I would expect the threshold here to be *very* similar to the -Oz inlining threshold. Should they be the same? Or at least documented together?

lib/Transforms/Utils/InlineFunction.cpp
1408–1409

I was hoping we could do still better by walking the inlined region in the caller so we would *only* touch cloned blocks. However, there isn't a map from cloned BB to original BB, which is why you can't do that. And why you wrote the code you did. Sorry for the confusion.

Unfortunately, I think walking the entire map is better than walking the entire callee as the map only contains the inlined part. Also, in all of the other code here we walk the map to find relatively sparse and infrequent things so it seems fine. Like I said, sorry for sending you down a wrong path, I hadn't thought about the lack of clone -> original BB.

This revision is now accepted and ready to land.Jan 19 2017, 3:58 PM

(Er, I also thought David had already LGTMed for some reason, but it isn't showing... Clearly wait for his LGTM as well!)

davidxl added inline comments.Jan 19 2017, 4:27 PM
test/Transforms/Inline/function-count-update-2.ll
13

Check no calls remaining.

18

same here

test/Transforms/Inline/function-count-update-3.ll
14

Should it be C2 * (C1/C4) where C4 is the entry count of function C? which happens to be the same to C3 here. C3 is irrelevant

32

Also check remaining count of C

test/Transforms/Inline/inline-cold-callee.ll
30

Wrap callee2 call in conditional -- otherwise its entry count should be at least 50.

31

This test case is flawed. callee1's entry count should be a least 50 given new change.

test/Transforms/Inline/inline-cold-callsite.ll
16

Is there a need to introduce callee2? Just calling callee1 twice should also illustrate the point.

test/Transforms/Inline/inline-hot-callsite-2.ll
18

just reuse callee1?

eraman added inline comments.Jan 19 2017, 4:28 PM
lib/Analysis/InlineCost.cpp
52–62

Agreed. We likely want to have one threshold for Oz, cold callsite and cold callee. I'll get the numbers and send the threshold change separately.

lib/Transforms/Utils/InlineFunction.cpp
1408–1409

But the size of the value map is O(|instructions in the cloned region|) and not O(|BBs in the cloned region|). My thinking was O(|instructions in the cloned region|) likely to be greater than O(|BBs in the original callee|) and that's why you preferred this.

chandlerc added inline comments.Jan 19 2017, 4:35 PM
lib/Transforms/Utils/InlineFunction.cpp
1408–1409

Well, we already do tons of things O(|instructions in the cloned region|) including walking the VMap in at least three other places and the original InlineCost computation. This is typically OK because we have a constant bound on it (the threshold).

But we have no realistic bound on the number of basic blocks in the callee. So it might be faster on average but have much worse long tail.

I think I'd feel safer staying within the inlined region whether its instructions or basic blocks. And if this proves hot, we should just add a reverse map (at least for basic blocks) and walk the basic blocks in the cloned region.

Sorry for arriving late, glad to see this making progress! I left one comment inline.

lib/Analysis/InlineCost.cpp
48–57

Is it possible to introduce another cl::opt here to disable this mode while in PGO? It would be extremely useful for testing the benefit of this feature compared to what's already there in PGO.

davide added inline comments.Jan 19 2017, 7:57 PM
lib/Analysis/InlineCost.cpp
1588–1589

Nit: the option is now called -inline-cold-callsite-threshold

Prazek added a subscriber: Prazek.Jan 20 2017, 3:41 AM
Prazek added inline comments.
lib/Analysis/InlineCost.cpp
48–57

Probably passing normal inlining threshold would be the same as disabling this mode.

660–662

Are there any plans in the nearest future to only rely on caller hotness/coldness and not on callee hotness?

lib/Transforms/Utils/InlineFunction.cpp
1409

auto *ClonedBB

1416

auto

eraman marked 9 inline comments as done.Jan 20 2017, 11:28 AM
eraman added inline comments.
lib/Analysis/InlineCost.cpp
48–57

I'm generally wary of the proliferation of options, and as Piotr mentioned you could set -hot-callsite-threshold to the same value as -hinlinhint-threshold and -inline-cold-callsite-threshold to -cold-callee-threshold to get the same effect.

660–662

Not sure what qualifies as "nearest" future but the plan is to use the same threshold for cold callee, cold callsite and -Oz and once we fully migrate to the new PM based pipeline consider only the callsite's hotness.

1588–1589

Thanks. There is inconsistency in the way options are named, but I prefer to prefix inline- everywhere and will fix the other options in a separate patch.

lib/Transforms/Utils/InlineFunction.cpp
1408–1409

Ok, I'll revert back to walking the VMap.

test/Transforms/Inline/function-count-update-2.ll
13

There is nothing interesting in terms of inlining here (ie the decision to inline here doesn't exercise any code path associated with this patch), but if you insist I'll add the checks.

test/Transforms/Inline/function-count-update-3.ll
14

Good catch!

test/Transforms/Inline/inline-cold-callee.ll
31

Yes and I think the best course of action is to disable the test for the new PM. Excepting some profile insanity, there is no way a callee is going to be cold without all callsites to the callee being cold and callsite coldness takes precedence over callee precedence (and there is a separate test for that)

eraman updated this revision to Diff 85164.Jan 20 2017, 11:29 AM
eraman marked 3 inline comments as done.

Address review comments.

davidxl added inline comments.Jan 20 2017, 11:31 AM
test/Transforms/Inline/function-count-update-2.ll
13

This is just to make sure the inline decisions and expected end result of callee entry count are fully in sync.

eraman updated this revision to Diff 85170.Jan 20 2017, 11:50 AM

Fix a test case as suggested by David

eraman marked an inline comment as done.Jan 20 2017, 11:51 AM
davidxl accepted this revision.Jan 20 2017, 2:11 PM

lgtm

Prazek accepted this revision.Jan 20 2017, 2:14 PM

It is nice to see that gbranch inliner is moving forwards to the trunk :)

This revision was automatically updated to reflect the committed changes.