This is an archive of the discontinued LLVM Phabricator instance.

Schedule Hot Cold Splitting pass after most optimization passes
ClosedPublic

Authored by hiraditya on Oct 19 2018, 8:06 AM.

Details

Summary

In the new+old pass manager, hot cold splitting was schedule too early.
Thanks to Vedant for pointing this out.

Diff Detail

Repository
rL LLVM

Event Timeline

hiraditya created this revision.Oct 19 2018, 8:06 AM
hiraditya edited the summary of this revision. (Show Details)
vsk added a comment.Oct 19 2018, 8:53 AM

Could you add a test for this similar to 'test/Other/opt-Os-pipeline.ll'?

hiraditya updated this revision to Diff 170302.Oct 19 2018, 9:57 PM
hiraditya edited the summary of this revision. (Show Details)
This revision is now accepted and ready to land.Oct 20 2018, 8:47 AM
vsk accepted this revision.Oct 20 2018, 11:11 AM

Looks great, thanks!

This revision was automatically updated to reflect the committed changes.

It looks like the test is failing on some of the Arm buildbots. In particular some of them don't have the X86 backend: http://lab.llvm.org:8011/builders/clang-cmake-armv7-quick/builds/4611 so it looks like it is most likely missing a REQUIRES: x86. Can you take a look?

vsk added a comment.Oct 22 2018, 9:52 AM

It looks like the test is failing on some of the Arm buildbots. In particular some of them don't have the X86 backend: http://lab.llvm.org:8011/builders/clang-cmake-armv7-quick/builds/4611 so it looks like it is most likely missing a REQUIRES: x86. Can you take a look?

@peter.smith @hiraditya r344919 should address this.

Note that this addresses in the old PM one issue I found when enabling splitting with ThinLTO, but exposes some related issues. In the old location, which was before the early return when PrepareForThinLTO==true around line 590, we were doing two rounds of splitting: one during the compile step, and a second later in the ThinLTO backends. I saw that we split an already split function (see below for more info). But I notice that there are still a couple issues:

  1. In the old PM, we will still do splitting early (during the compile step) for regular LTO, which doesn't return early from this routine. You should presumably guard the adding of the hot cold split pass by "if (!PrepareForLTO)"
  2. Once you've done 1), regular LTO will no longer perform any splitting, since its backend doesn't invoke populateModulePassManager. You can fix this by adding your hot cold splitting pass to an equivalent location in addLTOOptimizationPasses(), which is invoked only in the regular LTO backend.
  3. In the new PM, the move of the hot cold splitting pass does not fix the issue for ThinLTO. The reason is that there is no early exit from buildModuleSimplificationPipeline. To fix, you can guard the hot cold split pass by "Phase != ThinLTOPhase::PreLink".
  4. In the new PM, regular LTO invokes buildModuleSimplificationPipeline in the compile step, but without any extra info to indicate that it is an LTO compile. So something else will need to be done to suppress the hot cold splitting from happening early for regular LTO (e.g. pass down a new flag?).
  5. Once 4) is done, regular LTO will no longer perform any splitting in the new PM since its backend doesn't invoke buildModuleSimplificationPipeline. To fix, you will want to add the hot cold split pass to an appropriate place in buildLTODefaultPipeline().

This all assumes that you want to only do splitting in the *LTO backends (i.e. after cross module inlining). What are the tradeoffs of doing the splitting before/after inlining? One advantage of doing the splitting earlier for ThinLTO, in the compile step, is that it would reduce the size of the functions during the import analysis, and we might import more hot (split) functions. But then the splitting would be happening before the cross-module inlining in the backends.

Regarding what I am currently seeing with two rounds of splitting with ThinLTO: I noticed that the second round of splitting in the ThinLTO backends was resplitting an already split function - specifically, the cold outlined function was getting split again, which doesn't make a lot of sense since presumably the edge weights were all cold. Note this is after r344558 (the fix for the issue of outlining the whole function). Why would that happen?

vsk added a comment.Oct 22 2018, 1:57 PM

Thanks for your notes @tejohnson.

Note that this addresses in the old PM one issue I found when enabling splitting with ThinLTO, but exposes some related issues. In the old location, which was before the early return when PrepareForThinLTO==true around line 590, we were doing two rounds of splitting: one during the compile step, and a second later in the ThinLTO backends. I saw that we split an already split function (see below for more info). But I notice that there are still a couple issues:

  1. In the old PM, we will still do splitting early (during the compile step) for regular LTO, which doesn't return early from this routine. You should presumably guard the adding of the hot cold split pass by "if (!PrepareForLTO)"
  2. Once you've done 1), regular LTO will no longer perform any splitting, since its backend doesn't invoke populateModulePassManager. You can fix this by adding your hot cold splitting pass to an equivalent location in addLTOOptimizationPasses(), which is invoked only in the regular LTO backend.
  3. In the new PM, the move of the hot cold splitting pass does not fix the issue for ThinLTO. The reason is that there is no early exit from buildModuleSimplificationPipeline. To fix, you can guard the hot cold split pass by "Phase != ThinLTOPhase::PreLink".
  4. In the new PM, regular LTO invokes buildModuleSimplificationPipeline in the compile step, but without any extra info to indicate that it is an LTO compile. So something else will need to be done to suppress the hot cold splitting from happening early for regular LTO (e.g. pass down a new flag?).
  5. Once 4) is done, regular LTO will no longer perform any splitting in the new PM since its backend doesn't invoke buildModuleSimplificationPipeline. To fix, you will want to add the hot cold split pass to an appropriate place in buildLTODefaultPipeline().

This all assumes that you want to only do splitting in the *LTO backends (i.e. after cross module inlining). What are the tradeoffs of doing the splitting before/after inlining?

One advantage of doing splitting before inlining is, as you pointed out, that more inlining opportunities are created because hot functions become smaller. The downside is that outlining may be a barrier for intra-function optimizations which benefit from seeing both sides of a branch (hoisting/sinking, const prop, maybe jump threading?).

I think the right thing to do is to conservatively avoid introducing barriers by delaying splitting as much as possible. I'll work on gathering some pre/post-patch performance numbers on arm64.

One advantage of doing the splitting earlier for ThinLTO, in the compile step, is that it would reduce the size of the functions during the import analysis, and we might import more hot (split) functions. But then the splitting would be happening before the cross-module inlining in the backends.

Regarding what I am currently seeing with two rounds of splitting with ThinLTO: I noticed that the second round of splitting in the ThinLTO backends was resplitting an already split function - specifically, the cold outlined function was getting split again, which doesn't make a lot of sense since presumably the edge weights were all cold. Note this is after r344558 (the fix for the issue of outlining the whole function). Why would that happen?

vsk added a comment.Oct 23 2018, 11:56 AM
In D53437#1271395, @vsk wrote:

Thanks for your notes @tejohnson.

Note that this addresses in the old PM one issue I found when enabling splitting with ThinLTO, but exposes some related issues. In the old location, which was before the early return when PrepareForThinLTO==true around line 590, we were doing two rounds of splitting: one during the compile step, and a second later in the ThinLTO backends. I saw that we split an already split function (see below for more info). But I notice that there are still a couple issues:

  1. In the old PM, we will still do splitting early (during the compile step) for regular LTO, which doesn't return early from this routine. You should presumably guard the adding of the hot cold split pass by "if (!PrepareForLTO)"
  2. Once you've done 1), regular LTO will no longer perform any splitting, since its backend doesn't invoke populateModulePassManager. You can fix this by adding your hot cold splitting pass to an equivalent location in addLTOOptimizationPasses(), which is invoked only in the regular LTO backend.
  3. In the new PM, the move of the hot cold splitting pass does not fix the issue for ThinLTO. The reason is that there is no early exit from buildModuleSimplificationPipeline. To fix, you can guard the hot cold split pass by "Phase != ThinLTOPhase::PreLink".
  4. In the new PM, regular LTO invokes buildModuleSimplificationPipeline in the compile step, but without any extra info to indicate that it is an LTO compile. So something else will need to be done to suppress the hot cold splitting from happening early for regular LTO (e.g. pass down a new flag?).
  5. Once 4) is done, regular LTO will no longer perform any splitting in the new PM since its backend doesn't invoke buildModuleSimplificationPipeline. To fix, you will want to add the hot cold split pass to an appropriate place in buildLTODefaultPipeline().

This all assumes that you want to only do splitting in the *LTO backends (i.e. after cross module inlining). What are the tradeoffs of doing the splitting before/after inlining?

One advantage of doing splitting before inlining is, as you pointed out, that more inlining opportunities are created because hot functions become smaller. The downside is that outlining may be a barrier for intra-function optimizations which benefit from seeing both sides of a branch (hoisting/sinking, const prop, maybe jump threading?).

I think the right thing to do is to conservatively avoid introducing barriers by delaying splitting as much as possible. I'll work on gathering some pre/post-patch performance numbers on arm64.

One advantage of doing the splitting earlier for ThinLTO, in the compile step, is that it would reduce the size of the functions during the import analysis, and we might import more hot (split) functions. But then the splitting would be happening before the cross-module inlining in the backends.

Regarding what I am currently seeing with two rounds of splitting with ThinLTO: I noticed that the second round of splitting in the ThinLTO backends was resplitting an already split function - specifically, the cold outlined function was getting split again, which doesn't make a lot of sense since presumably the edge weights were all cold. Note this is after r344558 (the fix for the issue of outlining the whole function). Why would that happen?

I have some thoughts about @tejohnson's second question here about how a cold outlined function could be split twice.

First, I think the cache of outlined Function objects (see shouldOutlineFrom) may be wiped out before LTO begins. Second, CodeExtractor inserts a (useless?) "header" block with an unconditional jump in the outlined function, and I think this creates more work for the splitting pass.

I'll look into getting rid of the cache and teaching CodeExtractor to run MergeBasicBlockIntoOnlyPred on the outlined function. I also have an upcoming patch that should guarantee that cold regions are maximal & have a warm predecessor -- these two conditions should prevent outlined code from being split again.

Note that this addresses in the old PM one issue I found when enabling splitting with ThinLTO, but exposes some related issues. In the old location, which was before the early return when PrepareForThinLTO==true around line 590, we were doing two rounds of splitting: one during the compile step, and a second later in the ThinLTO backends. I saw that we split an already split function (see below for more info). But I notice that there are still a couple issues:

  1. In the old PM, we will still do splitting early (during the compile step) for regular LTO, which doesn't return early from this routine. You should presumably guard the adding of the hot cold split pass by "if (!PrepareForLTO)"
  2. Once you've done 1), regular LTO will no longer perform any splitting, since its backend doesn't invoke populateModulePassManager. You can fix this by adding your hot cold splitting pass to an equivalent location in addLTOOptimizationPasses(), which is invoked only in the regular LTO backend.
  3. In the new PM, the move of the hot cold splitting pass does not fix the issue for ThinLTO. The reason is that there is no early exit from buildModuleSimplificationPipeline. To fix, you can guard the hot cold split pass by "Phase != ThinLTOPhase::PreLink".

I'm going to send a patch to fix #3 so that we no longer run hot cold splitting multiple times anywhere to unblock me.

  1. In the new PM, regular LTO invokes buildModuleSimplificationPipeline in the compile step, but without any extra info to indicate that it is an LTO compile. So something else will need to be done to suppress the hot cold splitting from happening early for regular LTO (e.g. pass down a new flag?).
  2. Once 4) is done, regular LTO will no longer perform any splitting in the new PM since its backend doesn't invoke buildModuleSimplificationPipeline. To fix, you will want to add the hot cold split pass to an appropriate place in buildLTODefaultPipeline().

This all assumes that you want to only do splitting in the *LTO backends (i.e. after cross module inlining). What are the tradeoffs of doing the splitting before/after inlining? One advantage of doing the splitting earlier for ThinLTO, in the compile step, is that it would reduce the size of the functions during the import analysis, and we might import more hot (split) functions. But then the splitting would be happening before the cross-module inlining in the backends.

Regarding what I am currently seeing with two rounds of splitting with ThinLTO: I noticed that the second round of splitting in the ThinLTO backends was resplitting an already split function - specifically, the cold outlined function was getting split again, which doesn't make a lot of sense since presumably the edge weights were all cold. Note this is after r344558 (the fix for the issue of outlining the whole function). Why would that happen?