Page MenuHomePhabricator

[LoopUnroll] Implement profile-based loop peeling
ClosedPublic

Authored by mkuper on Oct 25 2016, 1:30 PM.

Details

Summary

This implements profile-based loop peeling.

The basic idea is that when the average dynamic trip-count of a loop is known, based on PGO, to be low, we can expect a performance win by peeling off the first several iterations of that loop. Unlike unrolling based on a known trip count, or a trip count multiple, this doesn't save us the conditional check and branch on each iteration. However, it does allow us to simplify the straight-line code we get (constant-folding, etc.), which is important given that we know that we will usually only hit this code, and not the actual loop.

The code is somewhat similar (and is based on the original version of) the runtime unrolling code, but I think like they're sufficiently different that trying to share the implementation isn't a good idea. Since the current runtime unrolling implementation already has two different prolog/epilog cases, making it do peeling as well will make it rather unreadable.

I'm planning on committing this as disabled-by-default, until I have a bit more confidence in the performance - some more tuning may be required.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
mkuper marked 19 inline comments as done.Oct 27 2016, 5:50 PM

Updated per David's comments.

David, after some more thought, I think your analysis was right - so I used a narrower distribution. You can see the result for a 3-iteration peeling in the pgo test. Let me know if you think it makes sense.
Also I repurposed getPeelCount to be self-contained in terms of selecting the factor, but it led to a bit of ugliness, not sure if this is better or worse than the original.

Gerolf added inline comments.
lib/Transforms/Utils/LoopUnrollPeel.cpp
47 ↗(On Diff #76148)

Please add an option to turn it off by default.

125 ↗(On Diff #76148)

Could this be implemented as

  • distribute k: N-k (assuming trip count N)
  • complete unroll first k iterations?
mkuper added inline comments.Oct 27 2016, 8:07 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
47 ↗(On Diff #76148)

That option already exists - it's in LoopUnrollPass (because it's needed as part of the UnrollingPreferences initialization), and is currently off by default. These two options also started out living in LoopUnrollPass but moved since I tried to consolidate choosing the peeling factor here.
That's part of the "ugliness" I referred to above.

125 ↗(On Diff #76148)

Anyway, yes, I've originally thought about implementing it like that (it seemed like it'd be simpler, although I'm not entirely sure there's a big difference). But I think a more direct implementation fits the flow of the unroller better.

I wanted to make the "unroll / runtime unroll / peel" decision at the same point (otherwise this could be a separate pass that runs before the unroller, performs the split and marks the two loops as nounroll and force-unroll). And if this lives in the unroller - splitting and then unrolling the new loop while in the middle of "unrolling" the original one seemed awkward.

davidxl added inline comments.Oct 28 2016, 12:04 PM
lib/Transforms/Utils/LoopUnroll.cpp
357 ↗(On Diff #76148)

.\n --> iterations.\n

lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗(On Diff #76148)

Is UP parameter intended to be used?

121 ↗(On Diff #76148)

I think the following weights should be passed in:

  • TotalHeaderWeight of the currently peeled loop -- it is passed in for update: TotalHeaderWeight -= PeeledIterationHeaderWeight;
  • PeeledIterationHeaderWeight
    1. its initial value for the first peeled iteration is the PreheaderWeight or ExitWeight of the original loop
    2. After peeling of one iteration, its value will be updated to the fallthrough weight of the peel
  • TotalExitEdgeWeight -- passed in for update. After peeling, TotalExitEdgeWeight -= PeeledIterationExitWeight'
125 ↗(On Diff #76148)

for counted loops, distribute + complete unroll may be simple, but not necessarily good for general cases such as linked-list traversal loops etc.

168 ↗(On Diff #76148)

Should average trip count be used instead of PeelCount (which may be different)?

291 ↗(On Diff #76148)

See my comment above about what weights need to be passed for updating.

323 ↗(On Diff #76148)

Both headeTotalWeight and ExitEdge Weight need to be adjusted during peeling, and the new meta data can just use the updated backedge weight and exit weight.

mkuper marked an inline comment as done.Oct 28 2016, 1:33 PM
mkuper added inline comments.
lib/Transforms/Utils/LoopUnroll.cpp
357 ↗(On Diff #76148)

Sure.

lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗(On Diff #76148)

Sorry, I don't quite understand what you mean here. It is used (in line 84 or below). Could you explain?

121 ↗(On Diff #76148)

I don't think we need TotalExitEdgeWeight, but you're right, we need to accumulate the header weight to be able to update the final weight of the backedge properly.

168 ↗(On Diff #76148)

Right, I assumed they're the same (and currently they are), but it's not necessarily true in general, I'll add a parameter (and pass PeelCount).

davidxl added inline comments.Oct 28 2016, 1:36 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗(On Diff #76148)

Sorry I misread the early return as the regular return. Anyway, if UP is used here, there seems redundant to set UP.PeelCount in the caller of this function.

mkuper marked an inline comment as done.Oct 28 2016, 1:53 PM
mkuper added inline comments.
lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗(On Diff #76148)

I think it'd be better for it to be set explicitly in the caller, rather than hidden away in here. (This currently also sets it, but that was unintentional, thanks for noticing).
I'd prefer to pass UP here by const reference, and set UP.PeelCount in the caller.

mkuper updated this revision to Diff 76258.Oct 28 2016, 3:15 PM

Updated per David's comments.

Note that I'm becoming more convinced the current distribution for the latch branch copies is rather unreasonable, but we can fix that separately.

davidxl added inline comments.Nov 1 2016, 1:31 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
111 ↗(On Diff #76258)

I think this parameter should better be called 'PeeledHeaderWeight' -- it is really the weight of the 'header' block of the current peeled iteration. The PeeledHeaderWeight will be updated to the next iteration's header weight after this call is done.

The name 'Remaining' sounds like the original value of this weight parameter is the sum of all peeled iteration's header weight, but it is actually not.

295 ↗(On Diff #76258)

If user request a large peel count, this may end up < 0 which should be avoided. Make it as least >=0.

mkuper added inline comments.Nov 1 2016, 5:48 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
111 ↗(On Diff #76258)

Sure, I'll change it.

295 ↗(On Diff #76258)

Nice catch, thanks! I actually noticed it, but forgot to change the code.
This was one of the reasons I think the distribution is problematic - this shouldn't happen in a distribution when the peel count really is the average.

(Also, I think it probably should be >=1, not just >=0).

mkuper updated this revision to Diff 77068.Nov 7 2016, 11:22 AM
mkuper marked 2 inline comments as done.

Updated per David's comments.

mkuper added a comment.Nov 9 2016, 4:04 PM

Some performance numbers - SPECint and SPECfp C/C++ with instrumentation-based PGO:

444.namd         -0.43%
447.dealII       -0.69%
450.soplex       -0.12%
453.povray       +2.48%
433.milc         +0.38%
470.lbm          -0.58%
482.sphinx3      +0.27%
471.omnetpp      -0.40%
473.astar        +0.17%
483.xalancbmk    -0.46%
400.perlbench    -0.55%
401.bzip2        +0.17%
403.gcc          +0.38%
429.mcf          -0.36%
445.gobmk        +0.11%
456.hmmer        +0.69%
458.sjeng        +0.49%
462.libquantum   -0.07%
464.h264ref      +0.65%

geometric mean   +0.11%

The improvement in povray seems stable, the rest is probably noise.

davidxl edited edge metadata.Nov 14 2016, 11:18 AM

Looks good to me. Adam, do you have more comments?

anemet edited edge metadata.Nov 14 2016, 11:35 AM

Sorry guys about the delay. I'll look at this today!

haicheng edited edge metadata.Nov 14 2016, 2:55 PM

Just a few nits.

lib/Transforms/Scalar/LoopUnrollPass.cpp
985 ↗(On Diff #77068)

Unnecessary change.

1100 ↗(On Diff #77068)

Unnecessary change.

lib/Transforms/Utils/LoopUnrollPeel.cpp
212–214 ↗(On Diff #77068)

Braces are not necessary.

dberris added inline comments.
lib/Transforms/Utils/LoopUnrollPeel.cpp
211–214 ↗(On Diff #77068)

You might also be able to turn this into a range-based for loop:

for (const auto& KV : VMap)
  LVMap[KV.first] = KV.second;

The basic idea is that when the average dynamic trip-count of a loop is known, based on PGO, to be low, we can expect a performance win by peeling off the first several iterations of that loop. Unlike unrolling based on a known trip count, or a trip count multiple, this doesn't save us the conditional check and branch on each iteration. However, it does allow us to simplify the straight-line code we get (constant-folding, etc.), which is important given that we know that we will usually only hit this code, and not the actual loop.

The resulting code should also be more branch-predictor friendly; for small-trip count loops the loop-exiting misprediction can be significant.

include/llvm/Transforms/Utils/UnrollLoop.h
19 ↗(On Diff #77068)

forward-declare

lib/Transforms/Scalar/LoopUnrollPass.cpp
1181–1183 ↗(On Diff #77068)

If this is a formatting-only change please remove it from this patch.

lib/Transforms/Utils/LoopUnroll.cpp
356 ↗(On Diff #77068)

We don't want to print internal IR names like this. The remark will already point at the source location of the loop.

lib/Transforms/Utils/LoopUnrollPeel.cpp
12 ↗(On Diff #77068)

compile-time constant trip count

107–118 ↗(On Diff #77068)

All these function comments should be doxygen comments.

111 ↗(On Diff #77068)

"loop entries" is a bit confusing here; we're not entering the loop.

233–260 ↗(On Diff #77068)

I am not sure I don't understand this comment. Perhaps having the pseudo code when one more iteration is peeled may clear things up. It's unclear to me in this last paragraph for example how the conditional branch is generated in the split bottom anchor block.

63 ↗(On Diff #76148)

I agree with David that this interface is confusing. This function either knows about UP or doesn't. If it does it should handle both read-and-write.

lib/Transforms/Utils/LoopUtils.cpp
1084–1093 ↗(On Diff #77068)

Do we have a precedence adjusting weight like this back? Hopefully the underlying issue will be fixed soon so we shouldn't add band-aids like this.

test/Transforms/LoopUnroll/peel-loop.ll
1 ↗(On Diff #77068)

It would be good to explain what each of these tests.

Thanks, Adam!
I'll fix the cosmetics, but I'm still not sure what to do about the branch weight adjustments.

Thanks, Adam!
I'll fix the cosmetics, but I'm still not sure what to do about the branch weight adjustments.

Wasn't @davidxl working on fixing them?

mkuper added a subscriber: bogner.Nov 15 2016, 4:53 PM

Thanks, Adam!
I'll fix the cosmetics, but I'm still not sure what to do about the branch weight adjustments.

Wasn't @davidxl working on fixing them?

Not that I know of. @davidxl, did I miss something?

Also, phab seems to have swallowed a long comment I wrote about that. :-\

Basically, the underlying issue is that clang doesn't actually record the branch weights it gets from the profile. Rather, it records the branch weights it thinks are better for the purpose of branch probability estimation. From CodeGenPGO.cpp:
/ According to Laplace's Rule of Succession, it is better to compute the
/ weight based on the count plus 1, so universally add 1 to the value.
While this may be good from a branch-probability standpoint, it throws off using the weights for trip count estimates, when the exit weight is small.

I see 3 options:

  1. Keep what I have here as a band-aid.
  2. Don't do the -1 adjustment here - hopefully, in reality, we don't care about the cases where it makes a difference. I'll need to benchmark this, though.
  3. Remove the +1 adjustment in clang, and instead do it where we actually use the branch weights to estimate probabilities, that is, in BPI. Of course, there may be other places that read the weights but really want the adjusted weights (and, say, assume weights are never 0). I asked @bogner about this on IRC, and he seemed ambivalent.

What do you think?

danielcdh added inline comments.Nov 16 2016, 8:56 AM
include/llvm/Transforms/Utils/LoopUtils.h
465 ↗(On Diff #77068)

Is it possible to have this helper function in another patch (maybe NFC) and submit it first? I'd like to use it in another patch: https://reviews.llvm.org/D26527. Or if this patch is going in pretty soon, it's also fine. Thanks.

danielcdh added inline comments.Nov 16 2016, 9:20 AM
include/llvm/Transforms/Utils/LoopUtils.h
465 ↗(On Diff #77068)

Also, this function does not distinguish between "0-trip count" and "unknown tripcount". Maybe change the return value to int so that <0 means unknown tripcount? Or simply use Optional<unsigned> as return value.

mkuper added inline comments.Nov 16 2016, 11:51 AM
include/llvm/Transforms/Utils/LoopUtils.h
465 ↗(On Diff #77068)

I hope this can go in pretty soon. :-)

In any case, I'd rather not commit unused API as an NFC patch, but feel free to "steal" this into your patch in case that goes in first.
Re "0-trip count" and "unknown trip count" - fair point. I'll change the interface to use Optional<>.

mkuper updated this revision to Diff 78255.Nov 16 2016, 1:45 PM
mkuper edited edge metadata.
mkuper marked 12 inline comments as done.

Updated per comments.
Also changed getLoopEstimatedTripCount() to round-to-nearest when dividing.

This was probably the right thing to do to begin with, and compensates for an off-by-one if we have very precise, but adjusted by clang, branch weights.
If the original edge weights were 3000 and 1000, we really want to peel off 3 iterations, but 3001/1001 = 2.99.. and the previous code would truncate it to 2.

mkuper added inline comments.Nov 16 2016, 6:55 PM
lib/Transforms/Utils/LoopUtils.cpp
1089 ↗(On Diff #78255)

Actually, now that I think about this again, I'm not sure this is right.
Dehao, for instrumentation-based profiles, do back-edges of loops that are never taken get any profile metadata? I think they don't, in which case this should probably return 0.

danielcdh added inline comments.Nov 16 2016, 8:28 PM
lib/Transforms/Utils/LoopUtils.cpp
1089 ↗(On Diff #78255)

I'm not sure about instrumentation-based profiles, but for sample-based profiles, if the backedge is never taken (but executed and thus has sample), it will still get branch probability metadata annotated, which will be (sample_count+1, 1)

mkuper added inline comments.Nov 16 2016, 9:19 PM
lib/Transforms/Utils/LoopUtils.cpp
1089 ↗(On Diff #78255)

Ok, will keep it as is, then.

davidxl added inline comments.Nov 16 2016, 9:38 PM
lib/Transforms/Utils/LoopUtils.cpp
1089 ↗(On Diff #78255)

as long as the loop body is executed, it will have profile data annotated. If the backedge is never taken, the weight will be (0, 1)

mkuper updated this revision to Diff 78574.Nov 18 2016, 1:16 PM

Rebased to resolve conflicts with r287186.

Dehao, note I moved the flat loop check to be below the peeling computation, otherwise it would preempt it.
I think as long as it's above runtime unrolling, it does what you meant, right?

danielcdh edited edge metadata.Nov 18 2016, 1:28 PM

Rebased to resolve conflicts with r287186.

Dehao, note I moved the flat loop check to be below the peeling computation, otherwise it would preempt it.
I think as long as it's above runtime unrolling, it does what you meant, right?

SGTM. Looks like the check becomes noop when peeling is enabled. But it would still serve as a backup plan when peeling is not enabled.

Rebased to resolve conflicts with r287186.

Dehao, note I moved the flat loop check to be below the peeling computation, otherwise it would preempt it.
I think as long as it's above runtime unrolling, it does what you meant, right?

SGTM. Looks like the check becomes noop when peeling is enabled. But it would still serve as a backup plan when peeling is not enabled.

I don't think it's a noop. We may decide not to peel even with a low trip count, but we will still prefer not to unroll.

Adam, David, do you have any additional comments, or is this good to go?

davidxl accepted this revision.Nov 21 2016, 11:58 AM
davidxl edited edge metadata.

lgtm

This revision is now accepted and ready to land.Nov 21 2016, 11:58 AM
anemet added inline comments.Nov 22 2016, 12:14 AM
lib/Transforms/Utils/LoopUnrollPeel.cpp
111–112 ↗(On Diff #78574)

You should also say something about adjusting profile data otherwise it's not clear why all the other parameters are passed.

Does that actually need to happen in this same function? It feels that this could be further decomposed for improved readability.

224–226 ↗(On Diff #78574)

Missing function comment.

293 ↗(On Diff #78574)

.peel.new

mkuper updated this revision to Diff 78908.Nov 22 2016, 11:44 AM
mkuper edited edge metadata.

Thanks, Adam!

anemet accepted this revision.Nov 28 2016, 12:14 PM
anemet edited edge metadata.

LGTM with some minor changes.

lib/Transforms/Utils/LoopUnrollPeel.cpp
137 ↗(On Diff #78908)

Why the 0.9? This also needs a comment.

204–219 ↗(On Diff #78908)

This needs a bit more comment; here we're turning values that were before defined by the loop PHIs to use values directly from the preheader or the previous cloned loop body.

This revision was automatically updated to reflect the committed changes.
trentxintong added inline comments.
llvm/trunk/lib/Transforms/Utils/LoopUnrollPeel.cpp
357

@mkuper any particular reasons why the backedge weight has to be 1 instead of 0 ?

mkuper added inline comments.Jan 11 2017, 11:30 AM
llvm/trunk/lib/Transforms/Utils/LoopUnrollPeel.cpp
357

There was some concern (on IRC, I don't remember who expressed it, unfortunately) that other things that look at profile information will not be happy with 0 weights, since clang doesn't normally generate them (the whole "Laplace's Rule of Succession" adjustment).

I haven't actually tested it, but it didn't seem like it should matter much - if we get here, the distribution isn't that realistic to begin with. So I preferred to play it safe.