This is an archive of the discontinued LLVM Phabricator instance.

Avoid inlining CallSites leading to unreachable
AbandonedPublic

Authored by junbuml on Dec 7 2015, 6:40 AM.

Details

Summary

It might be reasonably to avoid inlining CallSites leading to
an unreachable-terminated block unless the inline cost is obviously low.
Before analysis passes such as BPI and BFI are included in inliner, this change
could support performing conservative inlining for CallSites which will be
unlikely executed in their normal execution flows.

This change might reduce code size blow-up in exceptional blocks that are rarely
taken (e.g., exception handle regions) as well as indirectly increase inline opportunities
for unwinding functions containing exception handling code.

Diff Detail

Event Timeline

junbuml updated this revision to Diff 42067.Dec 7 2015, 6:40 AM
junbuml retitled this revision from to Avoid inlining CallSites leading to unreachable.
junbuml updated this object.
junbuml added a reviewer: majnemer.
junbuml added subscribers: mssimpso, gberry.
mcrosier added a subscriber: Gerolf.Dec 7 2015, 6:47 AM
majnemer edited edge metadata.Dec 7 2015, 8:42 AM

Do you have benchmark numbers to substantiate this change? Also, it's not totally uncommon to use exception handling for control flow in languages like C#.

I had performance tests only for spec2006 in cortex-a57 where I found some gains in xalancbmk (3%), povary (4%), and gcc (2%).

The basic idea of this patch is in the assumption of that unreachable-terminated blocks (e.g., exception handling) are less frequently executed comparing with the normal execution flow. If this basic idea is reasonable, do you think making the unreachable-threshold less aggressive could be a reasonable compromise? As of now, we have very aggressive unreachable-threshold, which is 0 in this change.

Gerolf added a comment.Dec 7 2015, 1:29 PM

Hi Jun

can you share some insight into where the gains are coming from? My reading is that the patch suppresses inlining of unreachable code. That static code size increase should not decrease performance. Does it push the thresholds and suppress inlining of “good” functions? Or does it suppress optimizations from happening?

Thanks
Gerolf

@Gerolf, I don't think this patch wants to make code which contains unreachable less favorable to inlining. I think the idea is that if you have code which is post-dominated by unreachable, then it might be on an error handling path (e.g. exit, abort, throw) which might mean that it is colder.

@junbuml I think it would be good to gather data on a few different thresholds here.

manmanren edited edge metadata.Dec 7 2015, 2:18 PM

This approach looks okay to me in general. Is there any compile-time impact with this change?
Is there any stats behind choosing 0 as threshold?

Cheers,
Manman

lib/Transforms/IPO/Inliner.cpp
282

Can you add a comment here on the relationship of this function and calcUnreachableHeuristics in BranchProbabilityInfo.cpp? i.e the difference and what to do when Inliner can include BPI and BFI.

292

Please elaborate why in the comment.

304

Can this if statement be removed or combined with the next if statement?

@Gerolf As David mentioned above, this change tries to avoid inlining call sites which is post-dominated by unreachable in their normal execution flow. For example, in below code, Exception() and getErrorMsg() would be post-dominated by unreachable so that we don't want to inline them in elementAt(). By avoiding inlining them in such cold regions, elementAt() itself could get more opportunities to be inlined.

int elementAt(int idx) {
  if (idx > limit)
    throw Exception(idx, getErrorMsg());
  return Data[idx];
}

@majnemer
Regarding the threshold, let me start addition performance runs with different thresholds and update comments.

@manmanren
I didn't have chance to measure compile-time with/without this change, but I believe that this change may not have any significant impact on compile-time because collectBlocksLeadingToUnreachable() is executed only once per function and it iterates basic blocks in the post order only once. Please let me know if you need more clarification about compile-time. I will address your inline comments soon.

Thanks for the reviews.

Gerolf added inline comments.Dec 7 2015, 5:25 PM
include/llvm/Transforms/IPO/InlinerPass.h
62

// Find blocks post-dominated by an unreachable-terminated block

would be more accurate.

lib/Transforms/IPO/Inliner.cpp
285

Would it make sense to have a hasUnreachableBlock function attribute that marks a routine that has no (or at least one) unreachable block? For such function the analysis here is unnecessary.

527–535

You could compute a hasUnreachableBlock attribute here, too.

Gerolf added a comment.Dec 7 2015, 7:20 PM

Do you also plan to investigate/apply similar inlining heuristics to call sites dominated by catch clauses etc?

Thanks
Gerolf

Do you also plan to investigate/apply similar inlining heuristics to call sites dominated by catch clauses etc?

Once this patch is landed. I could investigate the catch block case and other cases as well based on BranchProbabilityInfo::calculate().

lib/Transforms/IPO/Inliner.cpp
527–535

If the purpose of new function attribute is to minimize the analysis (collectBlocksLeadingToUnreachable()), one easiest way could be adding a new variable here like HasUnreachable and mark HasUnreachable=true when we detect an unreachable in the function. And then we can call collectBlocksLeadingToUnreachable() only when both HasCallSites and HasUnreachable true.

junbuml updated this revision to Diff 42207.Dec 8 2015, 12:02 PM
junbuml edited edge metadata.

Addressed comments from Manman and Gerolf. Still running performance tests with different thresholds.

hfinkel edited edge metadata.Dec 9 2015, 6:25 AM

You might wish to make an exception for main() here... there is plenty of code like this:

int main() {
  ...
  exit(0);
}

Should you add llvm-commits as a subscriber?

Cheers,
Manman

junbuml edited edge metadata.Dec 9 2015, 9:52 AM
junbuml added a subscriber: llvm-commits.
junbuml updated this revision to Diff 42312.Dec 9 2015, 10:09 AM

Addressed Hal's comment. So, I made an exception for main() and added a test.
Thanks Hal for the review.

hfinkel added inline comments.Dec 9 2015, 10:39 AM
lib/Transforms/IPO/Inliner.cpp
287

To ask a higher-level question: Why do you want to take this intermediate step? Why not just jump straight to the end and use BPI? I'm a bit afraid of introducing yet more heuristics that we intend to rip out because of the resulting performance churn.

It is true, doing this outside the pass manager adds compile-time expense (this will be better with the new pass manager, because we'll be able to properly cache and update these things), but I suspect the compile-time benefit from not inlining will compensate to a large extent.

DominatorTree DT;
DT.recalculate(*F);
LoopInfo LI(DT);
BranchProbabilityInfo BPI(*F, LI);

and now you have a BPI object, and can do the "right" thing. Plus, we can then concentrate on adding branch probability heuristics in BPI, where we really want them, and not implicitly in the inliner.

I took this intermediate step simply because I thought that BPI should be hooked with new pass manager.
I believe using the BPI object directly to find call sites leading to some uncommon regions must be a better approach and will make everything smoother with the new pass mananger later.

Gerolf added a comment.Dec 9 2015, 3:02 PM

LGTM assuming you address the check for 'main'.

lib/Transforms/IPO/Inliner.cpp
558

That is a bit too hacky. :-) There could be exception paths even in main where you want your optimization to fire. There needs to be a way to distinguish between "hot" and "cold" unreachable blocks, eg. if it is cold if is part of EH handling or if it is an exit returning an error code etc. And that heuristic should be applicable eg. for BranchProbabilityInfo::calcUnreachableHeuristics also eventually. I suggest a FIXME for now instead of checking for a specific function name.

Do you want to land this change first. And then bring BPI object here as a separate patch?

lib/Transforms/IPO/Inliner.cpp
558

For me, checking function names for main() doesn't seem uncommon. Please let me know if there is any better way.

I agree that even in main we can distinguish real cold blocks, but I didn't want to make this patch bigger. I think a separate patch only for that could lead the review process much easier. So as you suggested, I want to add FIXME for now if reviewers are okay with it.

Do you want to land this change first. And then bring BPI object here as a separate patch?

We can do that; the test cases will be good to have regardless.

lib/Transforms/IPO/Inliner.cpp
558

A FIXME is fine for now (so long as we address it in the near future).

junbuml updated this revision to Diff 42441.Dec 10 2015, 10:00 AM

Add FIXME about detecting real cold unreachable blocks in main.

We can do that; the test cases will be good to have regardless.

Can you give me more detail about the test cases I missed ?

We can do that; the test cases will be good to have regardless.

Can you give me more detail about the test cases I missed ?

I meant that the test cases you have will be useful as regression tests along with a BPI-based implementation as well. Thus, when we have a BPI-based implementation, we that discard this one, but keep the tests.

Thank you for clarifying that.
Still running performance tests.

Note on windows "main" may be "wmain" "WinMain" "WINMAIN" "MAIN" "AfxWinMain" "ENTGQQ" etc.

I've tried doing things like this in the past and it is tricky to get right outside of benchmarks. For instance we had performance-sensitive loops postdominated by calls to exit(). (And if you think interprocedurally, everything is generally postdominated by a very infrequently taken exit point).

Using BPI or similar is better in that at least all the parts of the framework can agree on the criteria for what is relative hot and what is relatively cold. As I told our devs, if you think you know better, please work to enhance the static heuristics or the profile count maintenance so every part of the code can benefit.

Also there are lots of cases where inlining in cold paths is helpful to improving CQ on hot paths. But that's a discussion for another day, perhaps.

Note on windows "main" may be "wmain" "WinMain" "WINMAIN" "MAIN" "AfxWinMain" "ENTGQQ" etc.

I'm not sure if it make sense to add all these names here. Is there any existing function to find an entry point for Windows? Please let me know any suggestion?

I've tried doing things like this in the past and it is tricky to get right outside of benchmarks. For instance we had performance-sensitive loops postdominated by calls to exit(). (And if you think interprocedurally, everything is generally postdominated by a very infrequently taken exit point).

I agree this could happen. So, I think we cannot blindly assume that all unreachable-terminated blocks are rarely executed. This is also the case in BranchProbabilityInfo::calcUnreachableHeuristics(). However, it might be still reasonable to assume that EH paths are less frequently taken in general so that we can inline little conservatively when call sites are post-dominated by EH code. So instead of detecting all unreachable blocks, we can limit the search only for blocks post-dominated by EH code (e.g., throw ()) ?

Using BPI or similar is better in that at least all the parts of the framework can agree on the criteria for what is relative hot and what is relatively cold. As I told our devs, if you think you know better, please work to enhance the static heuristics or the profile count maintenance so every part of the code can benefit.

I agree using BPI with some general heuristic here and enhancing BPI itself must be a right approach in general. But, I prefer to take incremental changes as long as the basic idea and direction is reasonable.

Also there are lots of cases where inlining in cold paths is helpful to improving CQ on hot paths. But that's a discussion for another day, perhaps.

Can you give me more detail about this?

Note on windows "main" may be "wmain" "WinMain" "WINMAIN" "MAIN" "AfxWinMain" "ENTGQQ" etc.

...

I agree using BPI with some general heuristic here and enhancing BPI itself must be a right approach in general. But, I prefer to take incremental changes as long as the basic idea and direction is reasonable.

This is why I'm not super happy about this patch: Aside for the test cases, it is not clear this is really any kind of incremental step toward the solution we want (based on BPI). I'm not unhappy enough to say it shouldn't go in if others are okay with it going in, but I agree that the BPI-based solution is really where we should be devoting our efforts.

...

I don't recall seeing any data collection with regard to the threshold.

I'm saying you probably can't really reliably use "main" (or any of windows's myriad variations of the name) as something to base heuristics on. Even if you could, it would be annoying to get different optimizations for code depending on the name of the function that contained the code.

This code basically duplicates unreachable heuristics in BPI analysis, which is not desirable as mentioned by Hal.

The good news is that Easwaran is working on a solution that can remove the inliner's limitation so that profile data will be made available to inliner. I expect this to be landed very soon.

@hfinkel @AndyAyers @davidxl
As you mentioned, using BPI directly in this patch must be the right thing to do. As a first step, I will limit the use of BPI to identify blocks leading to a real cold unreachable blocks and inliner will only use BPI without any function name check. I should enhance calcUnreachableHeuristics() to consider unwinding edges of InvokeInst and exit(0).

@majnemer
Regarding threshold, base on my spec2006 runs, I'm trying to find the maximum possible value so that we can avoid inlining only when the call site leading to cold region is large enough. Hopefully, I will be able to finish my tests today.

@davidxl
For me it appears that Easwaran's change performs aggressive inlining for the hot sites based on profile data? I think my change somewhat different because it statically try to identify cold blocks and perform conservative inlining.

As Easwaran is working on bringing BPI/BFI in inliner and expect it to be landed soon, I may want to revisit this patch after BPI/BFI is hooked. Please let me know if the BPI/BFI hook with inliner is not expected to be committed soon.

I submitted a separate patch (D15466) to enhance BranchProbabilityInfo::calcUnreachableHeuristics for InvokeInst which is basically the same idea used in collectBlocksLeadingToUnreachable in this patch.

junbuml abandoned this revision.Jan 12 2016, 8:00 AM

I will revisit this patch after BPI/BFI is hooked in inliner.