This is an archive of the discontinued LLVM Phabricator instance.

[HotColdSplit] Move splitting after instrumented PGO use
ClosedPublic

Authored by tejohnson on Feb 5 2019, 8:23 PM.

Details

Summary

Follow-up to D57082, which moved splitting earlier in the pipeline in order
to perform it before inlining. However, it was moved too early, before the IR
is annotated with instrumented PGO data, which caused the splitting to
incorrectly determine which functions are cold.

Move it to just after PGO annotation (still before inlining), in both
pass managers.
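
For reference, the intended placement looks roughly like the sketch below. This is a hedged illustration, not the literal diff; the surrounding helper and flag names (addPGOInstrPasses, EnableHotColdSplit) are assumptions based on the LLVM tree of that era.

  // New pass manager -- in PassBuilder's module simplification pipeline:
  addPGOInstrPasses(MPM, ...);              // IR is now annotated with counts
  if (EnableHotColdSplit)
    MPM.addPass(HotColdSplittingPass());    // split before the inliner runs

  // Legacy pass manager -- in PassManagerBuilder::populateModulePassManager:
  addPGOInstrPasses(MPM);
  if (EnableHotColdSplit)
    MPM.add(createHotColdSplittingPass());  // likewise, after PGO annotation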

Diff Detail

Repository
rL LLVM

Event Timeline

tejohnson created this revision. Feb 5 2019, 8:23 PM
Herald added a project: Restricted Project. Feb 5 2019, 8:23 PM
Herald added a subscriber: mehdi_amini.
vsk accepted this revision. Feb 5 2019, 8:26 PM

Thanks, LGTM.

test/Other/pass-pipelines.ll
Line 13 (On Diff #185481)

Nit: as this isn't a PGO-specific test, it might aid readability to use 'PGOUSE' (or similar) as the check prefix.
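
For illustration only, the suggested prefix could look something like the lines below; the exact opt flags and printed pass names are assumptions, not the committed test.

  ; RUN: opt -disable-output -debug-pass=Structure -O2 \
  ; RUN:     -pgo-kind=pgo-instr-use-pipeline -profile-file=%t.profdata %s 2>&1 \
  ; RUN:     | FileCheck %s --check-prefix=CHECK-PGOUSE
  ;
  ; CHECK-PGOUSE: Read PGO instrumentation profile
  ; CHECK-PGOUSE: Hot Cold Splitting
  ; CHECK-PGOUSE: Function Integration/Inlining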

This revision is now accepted and ready to land. Feb 5 2019, 8:26 PM
tejohnson marked an inline comment as done. Feb 5 2019, 8:28 PM
tejohnson updated this revision to Diff 185482. Feb 5 2019, 8:29 PM

Implement suggestion

This revision was automatically updated to reflect the committed changes.
vsk added a comment. Feb 11 2019, 11:12 AM

@tejohnson have you had a chance to evaluate performance with IR-PGO + splitting enabled?

Our internal CI shows performance regressions on SPEC/CINT2000 with FE-PGO + splitting enabled. Allowing inlining of split functions reduces the perf regression, and moving splitting after inlining eliminates it. It seems important to inline and optimize certain basic blocks which FE PGO data marks cold. We may need to address this by changing the FE-PGO instrumentation, or by ignoring ProfileSummaryInfo when it's based on a FE profile.

What's interesting about this is that I didn't see a perf regression on SPEC/CINT2000 with IR-PGO + splitting. I'd be curious to know if your testing bears this out.
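
(As a minimal sketch of the "ignore ProfileSummaryInfo when it's based on a FE profile" idea above: the helper name shouldUseProfileForSplitting is hypothetical, and whether isIRPGOFlagSet is the right predicate for telling FE from IR instrumentation profiles is an assumption.)

  #include "llvm/Analysis/ProfileSummaryInfo.h"
  #include "llvm/IR/Module.h"
  #include "llvm/ProfileData/InstrProf.h"

  using namespace llvm;

  // Hypothetical guard, not part of this patch: only let PSI-based coldness
  // drive splitting when the profile came from IR-level instrumentation.
  static bool shouldUseProfileForSplitting(const Module &M,
                                           ProfileSummaryInfo *PSI) {
    // isIRPGOFlagSet() checks the IR-PGO bit on the profile version variable.
    return PSI && PSI->hasProfileSummary() && isIRPGOFlagSet(&M);
  }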

tejohnson added a comment.

In D57805#1393416, @vsk wrote:

@tejohnson have you had a chance to evaluate performance with IR-PGO + splitting enabled?

I have one data point, more below.

Our internal CI shows performance regressions on SPEC/CINT2000 with FE-PGO + splitting enabled. Allowing inlining of split functions reduces the perf regression,

This is controlled by the MinSize attribute, right?

and moving splitting after inlining eliminates it.

It seems important to inline and optimize certain basic blocks which FE PGO data marks cold. We may need to address this by changing the FE-PGO instrumentation, or by ignoring ProfileSummaryInfo when it's based on a FE profile.

What's interesting about this is that I didn't see a perf regression on SPEC/CINT2000 with IR-PGO + splitting. I'd be curious to know if your testing bears this out.

I tried this for one important internal app that we build with IR-PGO and ThinLTO (late last week, with this patch and r353434). Unfortunately it is degrading by around 1%. I verified that if I remove the part that marks existing cold functions (i.e. those that don't get split) with the MinSize attribute, the degradation drops to around 0.5%. If I prevent marking the new split cold functions with MinSize, it possibly gets a bit better (a degradation of only around 0.4%). When I tried splitting a while back, when it was in its original position after inlining (in the ThinLTO backends), I got roughly neutral performance. I was hoping for some improvement based on the experiments we had done back with gcc's function splitting (-freorder-blocks-and-partition).

I did do some profiling to compare function profiles with and without splitting enabled. I see only one case where we are spending any time in a cold split function (i.e. where the profile presumably wasn't accurate), but I don't think this is causing most of the difference. It looks like the inlining decisions are very different (expected), but these might be causing a degradation for some reason. I will try moving the splitting to after the ThinLTO backend (post-thinlink) inlining and see what effect that has. In theory we should be getting more accurate importing/inlining; if not, it would be good to understand where this is going wrong!

vsk added a comment. Feb 11 2019, 12:36 PM
In D57805#1393416, @vsk wrote:

@tejohnson have you had a chance to evaluate performance with IR-PGO + splitting enabled?

I have one data point, more below.

Our internal CI shows performance regressions on SPEC/CINT2000 with FE-PGO + splitting enabled. Allowing inlining of split functions reduces the perf regression,

This is controlled by the MinSize attribute, right?

Partially, yes; I think MinSize affects inlining thresholds. It's also controlled by the 'noinline' attribute. To tweak this, you can disable CI->setIsNoInline() in extractColdFunction.
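
A minimal sketch of the two knobs being discussed, assuming the splitting code of that era; the helper name markSplitRegionCold is hypothetical.

  #include "llvm/IR/Attributes.h"
  #include "llvm/IR/Function.h"
  #include "llvm/IR/Instructions.h"

  using namespace llvm;

  // Hypothetical helper mirroring how split regions get marked.
  static void markSplitRegionCold(Function &OutlinedFn, CallInst &CallToSplit) {
    // Coldness/size hints on the outlined function: MinSize shrinks its
    // codegen and biases inline cost at its call sites.
    OutlinedFn.addFnAttr(Attribute::Cold);
    OutlinedFn.addFnAttr(Attribute::MinSize);
    // Hard block on re-inlining the split region: the call back into the
    // outlined function is marked 'noinline' (the CI->setIsNoInline() above).
    CallToSplit.setIsNoInline();
  }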

and moving splitting after inlining eliminates it.

It seems important to inline and optimize certain basic blocks which FE PGO data marks cold. We may need to address this by changing the FE-PGO instrumentation, or by ignoring ProfileSummaryInfo when it's based on a FE profile.

What's interesting about this is that I didn't see a perf regression on SPEC/CINT2000 with IR-PGO + splitting. I'd be curious to know if your testing bears this out.

I tried this for one important internal app that we build with IR-PGO and ThinLTO (late last week, with this patch and r353434). Unfortunately it is degrading by around 1%. I verified that if I remove the part that marks existing cold functions (i.e. those that don't get split) with the MinSize attribute, the degradation drops to around 0.5%. If I prevent marking the new split cold functions with MinSize, it possibly gets a bit better (a degradation of only around 0.4%).

I see, it sounds like the perf regression with PGO may not be specific to FE-PGO.

When I tried splitting a while back, when it was in its original position after inlining (in the ThinLTO backends), I got roughly neutral performance. I was hoping for some improvement based on the experiments we had done back with gcc's function splitting (-freorder-blocks-and-partition).

I did do some profiling to compare function profiles with and without splitting enabled. I see only one case where we are spending any time in a cold split function (i.e. where the profile presumably wasn't accurate), but I don't think this is causing most of the difference. It looks like the inlining decisions are very different (expected), but these might be causing a degradation for some reason. I will try moving the splitting to after the ThinLTO backend (post-thinlink) inlining and see what effect that has. In theory we should be getting more accurate importing/inlining; if not, it would be good to understand where this is going wrong!

Thanks, this would be a really useful experiment.

vsk added a comment. Feb 14 2019, 3:37 PM

... I will try moving the splitting to after the ThinLTO backend (post-thinlink) inlining and see what effect that has. In theory we should be getting more accurate importing/inlining; if not, it would be good to understand where this is going wrong!

I have not yet tried the experiment you've described here. However, we've noticed that scheduling splitting early causes a regression for certain SPEC benchmarks even without PGO data applied. As the splitting heuristics are very conservative without PGO, this suggests that splitting before inlining may inadvertently hide important context from the optimizer.

tejohnson added a comment.

In D57805#1398718, @vsk wrote:

... I will try moving the splitting to after the ThinLTO backend (post-thinlink) inlining and see what effect that has. In theory we should be getting more accurate importing/inlining; if not, it would be good to understand where this is going wrong!

I have not yet tried the experiment you've described here. However, we've noticed that scheduling splitting early causes a regression for certain SPEC benchmarks even without PGO data applied. As the splitting heuristics are very conservative without PGO, this suggests that splitting before inlining may inadvertently hide important context from the optimizer.

I gave this a try with our internal benchmark that was slowing down with splitting before inlining. Moving the splitting to after inlining (the post-link inlining in the ThinLTO backends) made the degradation go away, although unfortunately there was no speedup either. I haven't had a chance to dig into the performance results beyond that (and I am heading out of town for the better part of the next couple of weeks), but I will shortly review your new patch that moves the pass later in the pipeline.