Page MenuHomePhabricator

Partial Inlining with multi-region outlining based on PGO information

Authored by gyiu on Sep 22 2017, 2:19 PM.



With PGO information, we can do more aggressive outlining of cold regions in the inline candidate function. This contrasts with the scheme of keeping only the 'early return' portion of the inline candidate and outlining the rest of the function as a single function call.

Support for outlining multiple regions of each function is added, as well as some basic heuristics to determine which regions are good to outline. Outline candidates limited to regions that are single-entry & single-exit. Also we don't account for live-ranges we may be killing across the region with a function. These are enhancements we can consider in another patch.

Fallback to the regular partial inlining scheme is retained when either i) no regions are identified for outlining in the function, or ii) the outlined function could not be inlined in any of its callers.

Diff Detail


Event Timeline

gyiu created this revision.Sep 22 2017, 2:19 PM
davidxl added inline comments.Sep 25 2017, 1:49 PM
45 ↗(On Diff #116397)

Make the name more meaningful? New does not tells us anything.

63 ↗(On Diff #116397)

Make this an enum option?

66 ↗(On Diff #116397)

Merge with the enum trace option

77 ↗(On Diff #116397)

Indicate the purpose of the ratio, e.g, 'to decide xxx'

83 ↗(On Diff #116397)

Why is option needed?

87 ↗(On Diff #116397)

Why using BP to determine region coldness instead of PSI and profile count?

161 ↗(On Diff #116397)

Add documentation.

417 ↗(On Diff #116397)

Add a high level description of how the cost/savings are computed below.

1060 ↗(On Diff #116397)

Is DT updated by the extractor?

1150 ↗(On Diff #116397)

return { false, nullptr };

1173 ↗(On Diff #116397)

Better check if the function has EntryCount

gyiu marked 6 inline comments as done.Oct 2 2017, 6:55 PM
gyiu added inline comments.
87 ↗(On Diff #116397)

No particular reason, mainly for convenience. I figure I would still need to compare the profile count of a block to its sibling blocks to know its relative coldness. Is PSI and profile count more accurate?

1060 ↗(On Diff #116397)

I don't believe so. You're thinking we should update the DT each time we extract a code region?

gyiu marked 2 inline comments as done.Oct 2 2017, 7:22 PM
gyiu added inline comments.
83 ↗(On Diff #116397)

Just for added tune-ability. I wasn't sure if '100' is a reasonable minimum execution count to have.

One high level comment that I added inline as well:

What's the criteria for deleting the old code? When should we plan on doing this?

168–169 ↗(On Diff #116397)

When is this?

gyiu added inline comments.Oct 10 2017, 8:33 AM
168–169 ↗(On Diff #116397)

This is a leftover comment that's no longer valid. We'll need to keep both constructors. However, there's a refinement to how we track outlined functions that is now redundant (class members OutlinedFunctions vs. OutlinedFunc+OutliningCallBB), so I'll make that change now.

gyiu updated this revision to Diff 118670.Oct 11 2017, 12:44 PM
  • Updated code based on review comments.
  • Also added check for candidate region outputs. If cold region has an output (ie. a live exit variable) then we could be blocking some code motion in the caller after it has been outlined.
  • Also added code to mark outlined functions/callsites with coldCC.
davidxl edited edge metadata.Oct 16 2017, 12:05 PM

please also add test cases.

165 ↗(On Diff #118670)

Make these two member variables.

428 ↗(On Diff #118670)

This skips SamplePGO. Is it intended? If yes, only the second check should be enough.

437 ↗(On Diff #118670)

Is this too restrictive? Single entry region does not mean the entry block can not have multiple predecessors.

506 ↗(On Diff #118670)

should check if the the Succ block is actually cold enough.

520 ↗(On Diff #118670)

if (!IsSingleExit(..)))

545 ↗(On Diff #118670)

Why this limitation? This should be handled by the code extractor

1437 ↗(On Diff #118670)

Most of the code here can be shared with the existing partial inliner. Refactor them to avoid code duplication?

87 ↗(On Diff #116397)

Using PSI gives you information about global hotness of a block, so that the compiler only focus on regions that are globally hot or cold.

gyiu marked 5 inline comments as done.Oct 25 2017, 7:24 PM
gyiu added inline comments.
428 ↗(On Diff #118670)

Yeah, I'm not confident that the sampling profile will give us proper coldness information, and may lead us to outline more than we should.

437 ↗(On Diff #118670)

I guess I'm unsure what a dominator block with multiple predecessors actually means in this case, because the code looks for a cold successor block to a branch. If that cold successor has multiple predecessors, then doesn't that we can enter this region through multiple blocks, hence not be single entry at all?

506 ↗(On Diff #118670)

You mean check the global coldness of the Succ Block? I guess this is where we need to weigh the pros and cons of global coldnress vs. relative coldness (similarly for hotness). Are you thinking we should check relative AND global coldness? Would this be reasonable in a situation where the callgraph is disjoint (eg. building a shared library) and we can't make reasonable comparisons of the counters between subgraphs?

545 ↗(On Diff #118670)

You're right. I didn't realize that.

87 ↗(On Diff #116397)

Interesting. One concern I have, though, is whether we should check the predecessor block for 'isHotBB', or the cold successor block for 'isColdBB', or both. Checking if the predecessor is globally hot will still require some branch probability ratio to indicate a relatively cold successor block. Similarly with a globally cold successor, we will still need to check the predecessor block for relative hotness. Checking both could be good, though we will likely miss a lot of opportunities.

gyiu updated this revision to Diff 121017.Oct 31 2017, 10:47 AM
gyiu marked 2 inline comments as done.

Updated based on latest comments. Still missing testcase that I'm working on currently. Will update patch when ready.

Kader added a subscriber: Kader.Nov 1 2017, 11:59 AM
fhahn added a subscriber: fhahn.Nov 2 2017, 4:33 PM
gyiu updated this revision to Diff 121831.Nov 6 2017, 6:39 PM
  • Added testcase.
  • Remove the requirement for a conditional branch as the terminator of the entry block, for multi-region outlining specifically.
davidxl added inline comments.Nov 8 2017, 1:27 PM
429 ↗(On Diff #121831)

Consider making this into a message with -Rpass-missed=partial-inining

497 ↗(On Diff #121831)

This trace does not look useful. Consider removing it.

508 ↗(On Diff #121831)

merge these into one output stream

514 ↗(On Diff #121831)

Consider changing this into -Rpass-analysis=partial-inlining

1151 ↗(On Diff #121831)


1228 ↗(On Diff #121831)

clang-format and similar other places.

1243 ↗(On Diff #121831)

Consider removing this trace. Should be handled by -print-before-xxx and -print-after-xxx

1464 ↗(On Diff #121831)

You can achieve the same thing with -Rpass=partial-inlining. Consider remove this tracing output.

5 ↗(On Diff #121831)

Perhaps add another case with multiple regions outlined?

gyiu marked 9 inline comments as done.Nov 21 2017, 4:26 PM
gyiu updated this revision to Diff 123873.Nov 21 2017, 6:57 PM
  • Added new multi-outlined-region testcase.
  • Removed unnecessary traces and moved useful ones to 'ifndef NDEBUG' macro.
  • Modified/added ORE messages according to suggestions.
  • General clean-up using clang-format.
davidxl accepted this revision.Nov 27 2017, 9:23 AM


This revision is now accepted and ready to land.Nov 27 2017, 9:23 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptApr 13 2019, 11:11 PM