This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO] Add switches to control prelink/postlink inline separately
AbandonedPublic

Authored by wenlei on Feb 5 2021, 9:08 PM.

Details

Reviewers
wmi
hoy
davidxl
Summary

With CSSPGO, to maximize the benefit of global top-down context-sensitive inlining, we are experimenting with shifting more inlining from LTO prelink to postlink, from cgscc inlining to sample loader inlining. The change adds switches to turn off prelink/postlink cgscc/sample-loader inlining separately. Note this is different from -fno-inline as the new switches only augment pass pipeline without changing inline attribute on functions.

We saw better performance on some benchmark with prelto inlining turned off, but there're other regressions too. Currently both switches default to on so nothing changes by default, even with CSSPGO. I'm exposing the switches to help further tuning.

Diff Detail

Event Timeline

wenlei created this revision.Feb 5 2021, 9:08 PM
wenlei requested review of this revision.Feb 5 2021, 9:08 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 5 2021, 9:08 PM
wmi added a comment.Feb 9 2021, 11:01 PM

Have you tried tuning the inline param instead of disabling CGSCC inlining in prelto? From our experience, thinlto wants some inlining happening in prelink phase especially for smaller callees so that the heuristic of choosing functions to import can be more effective. Current sampleFDO also turned off hot callsite inline heuristic in CGSCC inlining by setting the inline param, in order to prevent optimizations in prelink from obstructing profile annotation in postlink, but keep the other params the same. Feel using inline param for the tuning will be more flexible.

In D96197#2553329, @wmi wrote:

Have you tried tuning the inline param instead of disabling CGSCC inlining in prelto? From our experience, thinlto wants some inlining happening in prelink phase especially for smaller callees so that the heuristic of choosing functions to import can be more effective. Current sampleFDO also turned off hot callsite inline heuristic in CGSCC inlining by setting the inline param, in order to prevent optimizations in prelink from obstructing profile annotation in postlink, but keep the other params the same. Feel using inline param for the tuning will be more flexible.

Yeah, the amount of inlining affects the importing due to call graph depth difference. The part you mentioned about turning off hot call site CGSCC inlining in prelto is the place where HotCallSiteThreshold is set to zero in buildInlinerPipeline, right? We would still have some inlining for small functions as they may end up with negative cost. Using threshold is more flexible and can achieve the same thing, though we'd need to pass four zero or negative thresholds (hot|regular x fdo|cgscc), so I thought switch would be a bit easier. It's somewhat similar to -fno-inline - we could theoretically achieve the same thing by tweaking thresholds too.

wmi added a comment.Feb 10 2021, 11:14 AM

Yeah, the amount of inlining affects the importing due to call graph depth difference. The part you mentioned about turning off hot call site CGSCC inlining in prelto is the place where HotCallSiteThreshold is set to zero in buildInlinerPipeline, right?

Right.

We would still have some inlining for small functions as they may end up with negative cost.

If EnableRegularCGSCCInline and EnableSampleProfileInline are false in prelto, how would you have inlining for small functions?

Using threshold is more flexible and can achieve the same thing, though we'd need to pass four zero or negative thresholds (hot|regular x fdo|cgscc), so I thought switch would be a bit easier. It's somewhat similar to -fno-inline - we could theoretically achieve the same thing by tweaking thresholds too.

Passing multiple flags to set params may be ok for tuning but not for the default usage. I think we can hardcode the threshold value for CSSPGO after tuning is done.

we are experimenting with shifting more inlining from LTO prelink to postlink, from cgscc inlining to sample loader inlining.

Talking about the shifting from cgscc inlining to sample loader inlining. One thing missing in sample loader inlining is it will be lack of iterative cleaning during inlining which cgscc inlining provides. Do you think whether it matters?

In D96197#2554808, @wmi wrote:

Yeah, the amount of inlining affects the importing due to call graph depth difference. The part you mentioned about turning off hot call site CGSCC inlining in prelto is the place where HotCallSiteThreshold is set to zero in buildInlinerPipeline, right?

Right.

We would still have some inlining for small functions as they may end up with negative cost.

If EnableRegularCGSCCInline and EnableSampleProfileInline are false in prelto, how would you have inlining for small functions?

In that case, small function inlining will also move to LTO, though it would require tweaking importing instr limit/threshold.

Using threshold is more flexible and can achieve the same thing, though we'd need to pass four zero or negative thresholds (hot|regular x fdo|cgscc), so I thought switch would be a bit easier. It's somewhat similar to -fno-inline - we could theoretically achieve the same thing by tweaking thresholds too.

Passing multiple flags to set params may be ok for tuning but not for the default usage. I think we can hardcode the threshold value for CSSPGO after tuning is done.

Makes sense, we can use threshold for now and change the defaults for CSSPGO after it settles. I will skip this patch for now then.

we are experimenting with shifting more inlining from LTO prelink to postlink, from cgscc inlining to sample loader inlining.

Talking about the shifting from cgscc inlining to sample loader inlining. One thing missing in sample loader inlining is it will be lack of iterative cleaning during inlining which cgscc inlining provides. Do you think whether it matters?

Yeah, this could potentially be a challenge. Without the iterative cleanup, the cost inliner sees may not be accurate. We hope that this could be mitigated by 1) tweaking the threshold for sample loader inliner, 2) potentially use post-codegen size from previous build to help estimating the cost, .e.g we could put function size alongside with cfg checksum in profile metadata, and reference that size to see through potential cleanup during sample loader. We always have cgscc passes in LTO, so the actually clean up should still do a good job there.

wenlei abandoned this revision.Mar 13 2021, 11:55 AM

Will revisit after CSSPGO tuning settles.