This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroll] Enable PGO-based loop peeling by default
ClosedPublic

Authored by mkuper on Dec 13 2016, 2:31 PM.

Details

Summary

This enables loop peeling by default when profile information is available.

Performance on SPEC looks flat, modulo noise. I am, however, seeing improvements on internal benchmarks, for instrumentation-based PGO.
The previous improvement in SPEC povray disappeared. Either it was a fluke after all, or it was "swallowed" by r287186 (that is, the important thing for povray was disabling runtime unrolling, not enabling peeling).

Two caveats:

  1. The mean size increase for SPEC is 1.4%; the total .text size increase is ~0.5%. Most size increases are very modest, and we even have some size decreases (I guess due to interaction with inlining), but there are a couple of outliers - bzip2 grows by ~10.5%, and sphinx3 by ~4.5%. For the set of internal benchmarks, the size impact is much smaller.
  2. This may cause regressions if the profile information is very inaccurate (no surprise). The unfortunate part is that this currently happens for sampling PGO, because we can get weight(preheader) = weight(header), causing us to peel one iteration. The case I'm seeing should be fixed once D26256 lands, but there may be more.

Anyone interested in testing this for their PGO workloads?

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper updated this revision to Diff 81299.Dec 13 2016, 2:31 PM
mkuper retitled this revision from to [LoopUnroll] Enable PGO-based loop peeling by default.
mkuper updated this object.
mkuper added reviewers: anemet, danielcdh, haicheng.
mkuper added subscribers: llvm-commits, davidxl, mzolotukhin.
danielcdh accepted this revision.Dec 28 2016, 1:23 PM
danielcdh edited edge metadata.
This revision is now accepted and ready to land.Dec 28 2016, 1:23 PM

Thanks, Dehao.

Now that we're past the holidays, and more people are around - any concerns about this? Michael, Adam?

anemet edited edge metadata.Jan 10 2017, 8:36 PM

Hi Michael,

Is this on with -Os + PGO, or just O<n> + PGO?

Also, I forgot the heuristics from the original patch - are we only peeling hot low-trip-count loops, or also cold ones?

Thanks,
Adam

Regarding #1, this should not be a concern for PGO. PGO is a mode (same as O3 and above) that aims to maximize program speed at the cost of possible size and compile-time increases. Besides, for large apps, the size increase should be minimal; bzip2 is an outlier because it is tiny. For reference, GCC turns on loop peeling with PGO as well.

Regarding Adam's question: I have not yet seen users use the -Os + PGO combination. If this really is a concern, -Os should also be checked (under PGO) to avoid size growth.

Hi Michael,

Is this on with -Os + PGO, or just O<n> + PGO?

This is O<n> (O2, really) + PGO.
The peel count computation is limited by UP.Threshold, so we should not see it kick in with default thresholds for -Os + PGO. I'll add a test for that in a separate patch.

Also, I forgot the heuristics from the original patch - are we only peeling hot low-trip-count loops, or also cold ones?

We will not peel loops that are so cold they're dead. :)
The loop needs to have profile information, so it must have been entered at least once.

Other than that, the current heuristic doesn't distinguish between hot and cold loops. This is rather unfortunate, but I'm not sure how it can be fixed.
Both because using BFI in a loop pass is problematic (hence llvm::getLoopEstimatedTripCount() - although it's also cheaper than computing full BFI just to get the equivalent), and because it's not clear that relative hotness to function entry is the right metric here. Maybe relative hotness to the entire profile?

This is O<n> (O2, really) + PGO.

That's "O 2"... phab is being weird.

Added the optsize test in rL291708.

anemet accepted this revision.Jan 12 2017, 9:05 PM
anemet edited edge metadata.

Other than that, the current heuristic doesn't distinguish between hot and cold loops. This is rather unfortunate, but I'm not sure how it can be fixed.
Both because using BFI in a loop pass is problematic (hence llvm::getLoopEstimatedTripCount() - although it's also cheaper than computing full BFI just to get the equivalent), and because it's not clear that relative hotness to function entry is the right metric here. Maybe relative hotness to the entire profile?

ProfileSummaryInfo should have this capability.

Anyhow I didn't mean to hold up this patch for this reason; this could be a further improvement on top of this. LGTM.

Other than that, the current heuristic doesn't distinguish between hot and cold loops. This is rather unfortunate, but I'm not sure how it can be fixed.
Both because using BFI in a loop pass is problematic (hence llvm::getLoopEstimatedTripCount() - although it's also cheaper than computing full BFI just to get the equivalent), and because it's not clear that relative hotness to function entry is the right metric here. Maybe relative hotness to the entire profile?

ProfileSummaryInfo should have this capability.

There are two problems with this:

  1. ProfileSummaryInfo uses function entry counts to decide which functions are hot, but it doesn't give you information about the hotness of a block. To get the global hotness of a block (e.g. the loop header), you need both the relative hotness of the function (PSI) and the hotness of the loop header relative to the function entry (BFI). Without BFI, you can assume the branch probabilities aren't scaled and correspond exactly to the number of times each branch was taken/not-taken, but that is fishy. With BFI it's "more or less fine", if imprecise; relying on raw branch weights alone, when BFI isn't available, sounds like a bad idea.

(Querying BFI and then, if BFI thinks the block is cold, relying on the branch weights being "correct", is, in fact, exactly how ProfileSummaryInfo::isHotBB() works.)

  2. It's not clear what the thresholds should be. Again, ProfileSummaryInfo has thresholds for hot/cold functions, not blocks. I guess we could go with "only peel low-trip-count loops within a hot function" - that would partially solve the problem.

Anyhow I didn't mean to hold up this patch for this reason; this could be a further improvement on top of this. LGTM.

Thanks!
I'm currently holding this myself, because of D28593.

Err, never mind, I got confused, PSI has a single threshold for what a "hot" region is, doesn't matter if it's a function or a block.
But the main problem (lack of BFI) still stands.

Ok, I'm going to stop holding this. We still have some issues with sampling-based FDO due to lack of duplication factor, and it's not clear where D28593 is going, but I think at this point this is a win.

This revision was automatically updated to reflect the committed changes.