This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroll] Enable PGO-based loop peeling by default
ClosedPublic

Authored by mkuper on Dec 13 2016, 2:31 PM.

Download Raw Diff

Details

Reviewers

anemet
mzolotukhin
danielcdh
haicheng

Commits

rGc2af82b4b7ac: [LoopUnroll] Enable PGO-based loop peeling by default.
rL295796: [LoopUnroll] Enable PGO-based loop peeling by default.

Summary

This enables loop peeling by default when profile information is available.

Performance on SPEC looks flat, modulo noise. I am, however, seeing improvements on internal benchmarks, for instrumentation-based PGO.
The previous improvement in SPEC povray disappeared. Either it was a fluke after all, or it was "swallowed" by r287186 (that is, the important thing for povray was disabling runtime unrolling, not enabling peeling).

Two caveats:

The mean size increase for SPEC is 1.4%, the total .text size increase is ~0.5%. Most size increases are very modest, and we even have some size decreases (I guess due to interaction with inlining.), but there are a couple of outliers - bzip2 grows by ~10.5%, and sphinx3 by ~4.5%. For the set of internal benchmarks, the size impact is much smaller.
This may cause regressions if the profile information is very inaccurate (no surprise). The unfortunate part is that this currently happens for sampling PGO, because we can get weight(preheader) = weight(header), causing us to peel one iteration. The case I'm seeing should be fixed once D26256 lands, but there may be more.

Anyone interested in testing this for their PGO workloads?

Diff Detail

Repository: rL LLVM

Event Timeline

mkuper updated this revision to Diff 81299.Dec 13 2016, 2:31 PM

mkuper retitled this revision from to [LoopUnroll] Enable PGO-based loop peeling by default.

mkuper updated this object.

mkuper added reviewers: anemet, danielcdh, haicheng.

mkuper added subscribers: llvm-commits, davidxl, mzolotukhin.

danielcdh accepted this revision.Dec 28 2016, 1:23 PM

danielcdh edited edge metadata.

This revision is now accepted and ready to land.Dec 28 2016, 1:23 PM

Thanks, Dehao.

Now that we're past the holidays, and more people are around - any concerns about this? Michael, Adam?

Hi Michael,

Is this is on with -Os + PGO or just O<n> + PGO?

Also I forgot the heuristics from the original patch, are we only peeling hot low-trip count loops or also cold ones?

Thanks,
Adam

Regarding #1, this should not be a concern for PGO. PGO is a mode (same as O3 and above) that aims to maximize program speed at cost of possible size and compile time increase. besides, for large apps, the size increase should be minimal. bzip2 is an outlier as it is tiny. For reference, GCC turns on loop peeing with PGO as well.

Regarding Adam's question: I have not yet seen users using Os + PGO combination. If this really is a concern, Os should also be checked (under PGO) to avoid size growth.

In D27734#642276, @anemet wrote:

Hi Michael,

Is this is on with -Os + PGO or just O<n> + PGO?

This is O<n> (O2, really) + PGO.
The peel count computation is limited by UP.Threshold, so we should not see it kick in with default thresholds for -Os + PGO. I'll add a test for that in a separate patch.

Also I forgot the heuristics from the original patch, are we only peeling hot low-trip count loops or also cold ones?

We will not peel loops that are so cold they're dead. :)
The loop needs to have profile information, so it must have been entered at least once.

Other than that, the current heuristic doesn't distinguish between hot and cold loops. This is rather unfortunate, but I'm not sure how it can be fixed.
Both because using BFI in a loop pass is problematic (hence llvm::getLoopEstimatedTripCount() - although it's also cheaper than computing full BFI just to get the equivalent), and because it's not clear that relative hotness to function entry is the right metric here. Maybe relative hotness to the entire profile?

This is O<n> (O2, really) + PGO.

That's "O 2"... phab is being weird.

Added the optsize test in rL291708.

Other than that, the current heuristic doesn't distinguish between hot and cold loops. This is rather unfortunate, but I'm not sure how it can be fixed.
Both because using BFI in a loop pass is problematic (hence llvm::getLoopEstimatedTripCount() - although it's also cheaper than computing full BFI just to get the equivalent), and because it's not clear that relative hotness to function entry is the right metric here. Maybe relative hotness to the entire profile?

ProfileSummaryInfo should have this capability.

Anyhow I didn't mean to hold up this patch for this reason; this could be a further improvement on top of this. LGTM.

In D27734#644780, @anemet wrote:

Other than that, the current heuristic doesn't distinguish between hot and cold loops. This is rather unfortunate, but I'm not sure how it can be fixed.
Both because using BFI in a loop pass is problematic (hence llvm::getLoopEstimatedTripCount() - although it's also cheaper than computing full BFI just to get the equivalent), and because it's not clear that relative hotness to function entry is the right metric here. Maybe relative hotness to the entire profile?

ProfileSummaryInfo should have this capability.

There are two problems with this:

ProfileSummaryInfo uses function entry counts to decide which functions are hot. But it doesn't give you information about the hotness of a block. To get the global hotness of a block (e.g. the loop header), you'd need to know both the relative hotness of a function (PSI) and the hotness of the loop header relative to the function entry (BFI). If BFI is unavailable, you can assume branch probabilities aren't scaled, and correspond exactly to the number of times each branch was taken/not-taken, but this is fishy. It's "more or less fine" if you have BFI, but it's imprecise - but it sounds like a bad idea to rely on alone, when BFI isn't available.

(Querying BFI and then, if BFI thinks the block is cold, relying on the branch weights being "correct", is, in fact, exactly how ProfileSummaryInfo::isHotBB() works.)

It's not clear what the thresholds should be. Again, ProfileSummaryInfo has thresholds for hot/cold functions, not blocks. I guess we could go with "only peel low trip-count loops within a hot function". - this will partially solve the problem.

Anyhow I didn't mean to hold up this patch for this reason; this could be a further improvement on top of this. LGTM.

Thanks!
I'm currently holding this myself, because of D28593.

Err, never mind, I got confused, PSI has a single threshold for what a "hot" region is, doesn't matter if it's a function or a block.
But the main problem (lack of BFI) still stands.

Ok, I'm going to stop holding this. We still have some issues with sampling-based FDO due to lack of duplication factor, and it's not clear where D28593 is going, but I think at this point this is a win.

Closed by commit rL295796: [LoopUnroll] Enable PGO-based loop peeling by default. (authored by mkuper). · Explain WhyFeb 21 2017, 4:39 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

llvm/

trunk/

lib/

Transforms/

Scalar/

LoopUnrollPass.cpp

4 lines

test/

Transforms/

LoopUnroll/

peel-loop-pgo.ll

2 lines

Diff 89302

llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp

Show First 20 Lines • Show All 104 Lines • ▼ Show 20 Lines

static cl::opt<unsigned> FlatLoopTripCountThreshold(		static cl::opt<unsigned> FlatLoopTripCountThreshold(
"flat-loop-tripcount-threshold", cl::init(5), cl::Hidden,		"flat-loop-tripcount-threshold", cl::init(5), cl::Hidden,
cl::desc("If the runtime tripcount for the loop is lower than the "		cl::desc("If the runtime tripcount for the loop is lower than the "
"threshold, the loop is considered as flat and will be less "		"threshold, the loop is considered as flat and will be less "
"aggressively unrolled."));		"aggressively unrolled."));

static cl::opt<bool>		static cl::opt<bool>
UnrollAllowPeeling("unroll-allow-peeling", cl::Hidden,		UnrollAllowPeeling("unroll-allow-peeling", cl::init(true), cl::Hidden,
cl::desc("Allows loops to be peeled when the dynamic "		cl::desc("Allows loops to be peeled when the dynamic "
"trip count is known to be low."));		"trip count is known to be low."));

// This option isn't ever intended to be enabled, it serves to allow		// This option isn't ever intended to be enabled, it serves to allow
// experiments to check the assumptions about when this kind of revisit is		// experiments to check the assumptions about when this kind of revisit is
// necessary.		// necessary.
static cl::opt<bool> UnrollRevisitChildLoops(		static cl::opt<bool> UnrollRevisitChildLoops(
"unroll-revisit-child-loops", cl::Hidden,		"unroll-revisit-child-loops", cl::Hidden,
Show All 28 Lines	static TargetTransformInfo::UnrollingPreferences gatherUnrollingPreferences(
UP.FullUnrollMaxCount = UINT_MAX;		UP.FullUnrollMaxCount = UINT_MAX;
UP.BEInsns = 2;		UP.BEInsns = 2;
UP.Partial = false;		UP.Partial = false;
UP.Runtime = false;		UP.Runtime = false;
UP.AllowRemainder = true;		UP.AllowRemainder = true;
UP.AllowExpensiveTripCount = false;		UP.AllowExpensiveTripCount = false;
UP.Force = false;		UP.Force = false;
UP.UpperBound = false;		UP.UpperBound = false;
UP.AllowPeeling = false;		UP.AllowPeeling = true;

// Override with any target specific settings		// Override with any target specific settings
TTI.getUnrollingPreferences(L, UP);		TTI.getUnrollingPreferences(L, UP);

// Apply size attributes		// Apply size attributes
if (L->getHeader()->getParent()->optForSize()) {		if (L->getHeader()->getParent()->optForSize()) {
UP.Threshold = UP.OptSizeThreshold;		UP.Threshold = UP.OptSizeThreshold;
UP.PartialThreshold = UP.PartialOptSizeThreshold;		UP.PartialThreshold = UP.PartialOptSizeThreshold;
▲ Show 20 Lines • Show All 1,057 Lines • Show Last 20 Lines

llvm/trunk/test/Transforms/LoopUnroll/peel-loop-pgo.ll

	; RUN: opt < %s -S -debug-only=loop-unroll -loop-unroll -unroll-allow-peeling 2>&1 \| FileCheck %s			; RUN: opt < %s -S -debug-only=loop-unroll -loop-unroll 2>&1 \| FileCheck %s
	; REQUIRES: asserts			; REQUIRES: asserts

	; Make sure we use the profile information correctly to peel-off 3 iterations			; Make sure we use the profile information correctly to peel-off 3 iterations
	; from the loop, and update the branch weights for the peeled loop properly.			; from the loop, and update the branch weights for the peeled loop properly.

	; CHECK: Loop Unroll: F[basic]			; CHECK: Loop Unroll: F[basic]
	; CHECK: PEELING loop %for.body with iteration count 3!			; CHECK: PEELING loop %for.body with iteration count 3!
	; CHECK: Loop Unroll: F[optsize]			; CHECK: Loop Unroll: F[optsize]
	▲ Show 20 Lines • Show All 77 Lines • Show Last 20 Lines