This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroll] Implement profile-based loop peeling
ClosedPublic

Authored by mkuper on Oct 25 2016, 1:30 PM.

Details

Summary

This implements profile-based loop peeling.

The basic idea is that when the average dynamic trip-count of a loop is known, based on PGO, to be low, we can expect a performance win by peeling off the first several iterations of that loop. Unlike unrolling based on a known trip count, or a trip count multiple, this doesn't save us the conditional check and branch on each iteration. However, it does allow us to simplify the straight-line code we get (constant-folding, etc.), which is important given that we know that we will usually only hit this code, and not the actual loop.
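
For illustration only (this sketch is not from the patch, and the names are made up), peeling two iterations of a simple loop conceptually turns

for (int i = 0; i < n; ++i)
  sum += a[i];

into

if (n > 0) {
  sum += a[0];
  if (n > 1) {
    sum += a[1];
    // The original loop handles the (rare) remaining iterations.
    for (int i = 2; i < n; ++i)
      sum += a[i];
  }
}

and later passes can then simplify the straight-line prefix.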

The code is somewhat similar to (and is based on the original version of) the runtime unrolling code, but I think they're sufficiently different that trying to share the implementation isn't a good idea. Since the current runtime unrolling implementation already has two different prolog/epilog cases, making it do peeling as well would make it rather unreadable.

I'm planning on committing this as disabled-by-default, until I have a bit more confidence in the performance - some more tuning may be required.

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper updated this revision to Diff 75780.Oct 25 2016, 1:30 PM
mkuper retitled this revision from to [LoopUnroll] Implement profile-based loop peeling.
mkuper updated this object.
mkuper added a subscriber: llvm-commits.
davidxl added inline comments.Oct 25 2016, 4:40 PM
lib/Transforms/Scalar/LoopUnrollPass.cpp
724 ↗(On Diff #75780)

Should PeelCount be a member of UnrollingPreferences class?

887 ↗(On Diff #75780)

I think all the logic here should probably be wrapped in a helper function "computeOptimalPeelCount".

904 ↗(On Diff #75780)

getPeelCount looks like a simple wrapper that is only used once. Probably get rid of it.

lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗(On Diff #75780)

Is there common utility to do body cloning?

As a high-level comment, it would be nice to also have loop metadata to specify a typical trip count (or trip counts).

Intel, for example has (https://software.intel.com/en-us/node/524502):

#pragma loop_count(n)

which asks the optimizer to optimize for a trip count of n. Moreover, and perhaps more importantly, it also supports:

#pragma loop_count(n1, n2, ...)

which asks for specializations for trip counts n1, n2, etc.

Also supported by Intel's compiler is:

#pragma loop_count min(n),max(n),avg(n)

(in this case, avg is a hint, but min and max are promises).

FWIW, obviously part of the problem with the average is that you might miss the common trip counts. A loop that is generally executed with a trip count of 3 or 5 might end up with an average near 4; I'm not sure what the best thing to do in that case would be.

Thanks, David.

lib/Transforms/Scalar/LoopUnrollPass.cpp
724 ↗(On Diff #75780)

I think not, but I don't really know.
To the best of my understanding, the main point of UnrollingPreferences is to give targets the ability to override the defaults, and I'm not sure this is useful here.

887 ↗(On Diff #75780)

Maybe. That was the idea behind getPeelCount() - but I thought to leave the forcing logic here, where it is for all the other user knobs.

904 ↗(On Diff #75780)

Right, it originally contained the code that now lives in getLoopEstimatedTripCount(), but I decided to factor that out because that's more generally useful (e.g. I can see it used in the vectorizer).

lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗(On Diff #75780)

Not really.

This is based on CloneLoopBlocks() in LoopUnrollRuntime, but it's sufficiently different so that sharing code didn't really make sense to me. The main loop unrolling code also has its own, more complicated, version of this.

The basic loop copying code is fairly straightforward

for (LoopBlocksDFS::RPOIterator BB = BlockBegin; BB != BlockEnd; ++BB) {
  BasicBlock *NewBB = CloneBasicBlock(*BB, VMap, ".", F);
  NewBlocks.push_back(NewBB);
  VMap[*BB] = NewBB;
}
remapInstructionsInBlocks(NewBlocks, VMap);

But the different versions do different fixups on the fly - hooking up inputs, outputs and branches, VMap changes, updating ScalarEvolution and/or LoopInfo, etc.
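
For a concrete example of one such fixup (this is essentially the wiring that also comes up in the inline comments below, modulo the cast that a VMap lookup needs), after cloning the blocks, the terminator of the block we insert the peeled copy after has to be redirected to the cloned header through the value map:

// Point the insertion point's branch at the clone of the loop header.
InsertTop->getTerminator()->setSuccessor(0, cast<BasicBlock>(VMap[Header]));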

As a high-level comment, it would be nice to also have loop metadata to specify a typical trip count (or trip counts).

Intel, for example has (https://software.intel.com/en-us/node/524502):

#pragma loop_count(n)

which asks the optimizer to optimize for a trip count of n. Moreover, and perhaps more importantly, it also supports:

#pragma loop_count(n1, n2, ...)

which asks for specializations for trip counts n1, n2, etc.

Also supported by Intel's compiler is:

#pragma loop_count min(n),max(n),avg(n)

I agree this would be nice, but I think it's somewhat orthogonal.
We can start with an implementation of "estimated trip count" that relies on branch weights, and refine to use more specialized metadata if/when we have it.

FWIW, obviously part of the problem with the average is that you might miss the common trip counts. A loop that is generally executed with a trip count of 3 or 5 might end up with an average near 4; I'm not sure what the best thing to do in that case would be.

Right, but at least for sampling-based PGO, I think average is the best we're going to get. (Instrumentation can probably do better, and user hints certainly can).
I'm not entirely sure this is a problem, though. We want to optimize for the common case, and I think the average gives us that - in the "0.5 * 3 + 0.5 * 5" case, if we peel off 4 iterations, then about 88% of the dynamically executed iterations (0.5 * 3 + 0.5 * 4 out of an average of 4) will hit the peeled-off section - all iterations of the "3 trips" case, and 4 out of 5 iterations of the "5 trips" case. Which is hopefully better than leaving the loop as is.

davidxl added inline comments.Oct 26 2016, 9:53 AM
lib/Transforms/Scalar/LoopUnrollPass.cpp
724 ↗(On Diff #75780)

The Count field in UP is used to specify the unrolling factor, including heuristic-based ones:

// 4th priority is partial unrolling.

// Try partial unroll only when TripCount could be statically calculated.
if (TripCount) {
  if (UP.Count == 0)
    UP.Count = TripCount;

Conceptually peeling Count should be treated in the same way.

lib/Transforms/Utils/LoopUnrollPeel.cpp
87 ↗(On Diff #75780)

Can this be moved outside the loop?

assert(VMap[Header]);
InsertTop->getTerminator()->setSuccessor(0, VMap[Header]);

90 ↗(On Diff #75780)

This can be handled outside the loop too.

91 ↗(On Diff #75780)

What does this erase do?

95 ↗(On Diff #75780)

Is this a stale comment?

101 ↗(On Diff #75780)

Why? I think we should update the branch probability here -- it depends on which iteration of the peeled clone we're in. If the peel count < average/estimated trip count, then each peeled iteration should be more biased towards the fall through. If peel_count == estimated trip_count, then the last peeled iteration should be biased toward the exit.

220 ↗(On Diff #75780)

The profile meta data of the original loop's back branch should be adjusted too.

Gerolf added a subscriber: Gerolf.Oct 26 2016, 10:08 AM

Hi,

Could you provide more background on this idea? What is your motivating use case? When the trip count is low, why optimize? If the profile is wrong and it actually is a hot loop for a regular/different input set, peeling could hurt. There are also side effects on code size, register pressure, etc. that could hurt performance.

Thanks
Gerolf

Hi Gerolf,

Hi,

Could you provide more background on this idea?

I can't take credit for the idea - this is something GCC already does.

What is your motivating use case? When the trip count is low, why optimize?

The motivating use case is a loop with a low trip count that is nested inside a loop with a high trip count.
Peeling the inner loop allows further passes in the optimization pipeline to simplify the code for the iterations that actually run, making the outer loop's body better.
Consider something like this:

for (int i = 0; i < bignum; ++i) {
  int ret = 0;
  for (int j = 0; j < param; ++j)
    ret += arr[i][j];
  out[i] = ret;
}

Imagine param is usually 1.
We can then peel this into:

for (int i = 0; i < bignum; ++i) {
  int ret = 0;
  if (param == 0)
    continue;
  ret += arr[i][0];
  for (int j = 1; j < param; ++j)
    ret += arr[i][j];
  out[i] = ret;
}

Which then becomes something morally equivalent to:

for (int i = 0; i < bignum; ++i) {
  if (param == 0)
     continue;
  if (param == 1) {
    out[i] = arr[i][0];
    continue;
  }
  int ret = arr[i][0];
  for (int j = 1; j < param; ++j)
    ret += arr[i][j];
  out[i] = ret;
}

So, we've improved the common case (param == 1) - we no longer have to initialize ret, we don't have to deal with the inner loop's IV, there's no add, just a direct assignment.

If the profile is wrong and it actually is a hot loop for a regular/different input set, peeling could hurt.

Sure, but this is true for any profile-guided optimization. PGO is only good with representative inputs. If an optimization was good regardless of input, we'd be doing it for non-PGO builds.

There are also side effects on code size, register pressure, etc. that could hurt performance.

Right. But that's not really different from any other kind of loop unrolling. Hence the thresholds, etc.

Thanks
Gerolf

Thanks,
Michael

lib/Transforms/Utils/LoopUnrollPeel.cpp
87 ↗(On Diff #75780)

Right, both this and the Latch handling should be moved outside the loop, thanks.

90 ↗(On Diff #75780)

Right, thanks.

91 ↗(On Diff #75780)

Nothing, nice catch!
(It's stale - it's needed when you replace LatchBR instead of modifying it in-place.)

95 ↗(On Diff #75780)

No, but I guess it's not clear.

Let's say we're peeling off K iterations.

For iteration J in 1..K-1, we want the branch that terminates the copy of the latch to be:
if (cond) goto header(J+1) else goto exit

For iteration K, we want to set this branch to be:
if (cond) goto new-ph else goto exit.

Here, new-ph is the preheader of the new loop (that is, the loop that handles iterations >= K+1). Technically, this preheader should be empty, and only contains a branch to the loop header - the only reason it exists is to keep the loop in canonical form.
Does this make sense? If it does, I'll try to improve the comment.
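
(For illustration, putting the two cases together for K = 2, the chain of peeled latch branches looks like:

iteration 1 latch: if (cond) goto header(2) else goto exit
iteration 2 latch: if (cond) goto new-ph    else goto exit
new-ph:            goto the header of the remaining loop (iterations >= 3)

with "exit" being the original loop exit in all cases.)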

101 ↗(On Diff #75780)

You're right, it's not that we don't know anything - but we don't know enough. I'm not sure how to attach a reasonable number to this, without knowing the distribution.
Do you have any suggestions? The trivial option would be to assume an extremely narrow distribution (the loop always exits after exactly K iterations), but that would mean having an extreme bias for all of the branches, and I'm not sure that's wise.

220 ↗(On Diff #75780)

Right, I missed that, thanks.
But, as above - I'm not sure by how much to adjust it.

mkuper added inline comments.Oct 26 2016, 11:13 AM
lib/Transforms/Scalar/LoopUnrollPass.cpp
724 ↗(On Diff #75780)

To be honest, I don't understand why Count lives in UnrollingPreferences either. It's never a target preference, it's a derived value.

davidxl added inline comments.Oct 26 2016, 1:09 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
101 ↗(On Diff #75780)

A reasonable way to annotate the branch is like this.
Say the original trip count of the loop is N; then for the m-th peeled iteration (m from 0 to N-1), the fall-through probability is a decreasing function:

(N - m )/N

Add some fuzzing factor to avoid creating extremely biased branch probabilities:

for instance (N-m)*3/(4*N)
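
For example, with an estimated trip count of N = 4, this gives fall-through probabilities of 12/16, 9/16, 6/16 and 3/16 (i.e. 75%, ~56%, ~38%, ~19%) for the peeled iterations m = 0..3, so each successive peeled copy is progressively more biased towards the exit.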

As a high-level comment, it would be nice to also have loop metadata to specify a typical trip count (or trip counts).

Intel, for example has (https://software.intel.com/en-us/node/524502):

#pragma loop_count(n)

which asks the optimizer to optimize for a trip count of n. Moreover, and perhaps more importantly, is also supports:

#pragma loop_count(n1, n2, ...)

which asks for specializations for trip counts n1, n2, etc.

Also supported by Intel's compiler is:

#pragma loop_count min(n),max(n),avg(n)

I agree this would be nice, but I think it's somewhat orthogonal.
We can start with an implementation of "estimated trip count" that relies on branch weights, and refine to use more specialized metadata if/when we have it.

Agreed.

FWIW, obviously part of the problem with the average is that you might miss the common trip counts. A loop that is generally executed with a trip count of 3 or 5 might end up with an average near 4; I'm not sure what the best thing to do in that case would be.

Right, but at least for sampling-based PGO, I think average is the best we're going to get. (Instrumentation can probably do better, and user hints certainly can).
I'm not entirely sure this is a problem, though. We want to optimize for the common case, and I think the average gives us that - in the "0.5 * 3 + 0.5 * 5" case, if we peel off 4 iterations, then about 88% of the dynamically executed iterations (0.5 * 3 + 0.5 * 4 out of an average of 4) will hit the peeled-off section - all iterations of the "3 trips" case, and 4 out of 5 iterations of the "5 trips" case. Which is hopefully better than leaving the loop as is.

I agree. Thanks for explaining this, because I did not understand what was happening. I thought that you were peeling off a fixed number of iterations as a single block. You're not. This will give a different performance vs. applicability tradeoff. I think that this probably makes more sense for PGO-driven information.

mkuper updated this revision to Diff 76148.Oct 27 2016, 5:50 PM
mkuper marked 19 inline comments as done.Oct 27 2016, 5:50 PM

Updated per David's comments.

David, after some more thought, I think your analysis was right - so I used a narrower distribution. You can see the result for a 3-iteration peeling in the pgo test. Let me know if you think it makes sense.
Also, I repurposed getPeelCount to be self-contained in terms of selecting the factor, but that led to a bit of ugliness; I'm not sure if this is better or worse than the original.

Gerolf added inline comments.
lib/Transforms/Utils/LoopUnrollPeel.cpp
47 ↗(On Diff #76148)

Please add an option to turn it off by default.

125 ↗(On Diff #76148)

Could this be implemented as

  • distributing the iteration space as k : N-k (assuming trip count N), and
  • completely unrolling the first k iterations?
mkuper added inline comments.Oct 27 2016, 8:07 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
47 ↗(On Diff #76148)

That option already exists - it's in LoopUnrollPass (because it's needed as part of the UnrollingPreferences initialization), and it is currently off by default. These two options also started out living in LoopUnrollPass, but moved here when I tried to consolidate choosing the peeling factor in this file.
That's part of the "ugliness" I referred to above.

125 ↗(On Diff #76148)

Anyway, yes, I originally thought about implementing it like that (it seemed like it'd be simpler, although I'm not entirely sure there's a big difference). But I think a more direct implementation fits the flow of the unroller better.

I wanted to make the "unroll / runtime unroll / peel" decision at the same point (otherwise this could be a separate pass that runs before the unroller, performs the split and marks the two loops as nounroll and force-unroll). And if this lives in the unroller - splitting and then unrolling the new loop while in the middle of "unrolling" the original one seemed awkward.

davidxl added inline comments.Oct 28 2016, 12:04 PM
lib/Transforms/Utils/LoopUnroll.cpp
357 ↗(On Diff #76148)

.\n --> iterations.\n

lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗(On Diff #76148)

Is the UP parameter intended to be used?

121 ↗(On Diff #76148)

I think the following weights should be passed in:

  • TotalHeaderWeight of the currently peeled loop -- it is passed in for update: TotalHeaderWeight -= PeeledIterationHeaderWeight;
  • PeeledIterationHeaderWeight
    1. its initial value for the first peeled iteration is the PreheaderWeight or ExitWeight of the original loop
    2. after peeling one iteration, its value is updated to the fall-through weight of the peeled iteration
  • TotalExitEdgeWeight -- passed in for update. After peeling, TotalExitEdgeWeight -= PeeledIterationExitWeight

125 ↗(On Diff #76148)

For counted loops, distribute + complete unroll may be simple, but it is not necessarily good for general cases such as linked-list traversal loops, etc.

168 ↗(On Diff #76148)

Should average trip count be used instead of PeelCount (which may be different)?

291 ↗(On Diff #76148)

See my comment above about what weights need to be passed for updating.

323 ↗(On Diff #76148)

Both headerTotalWeight and ExitEdgeWeight need to be adjusted during peeling, and the new metadata can just use the updated backedge weight and exit weight.

mkuper marked an inline comment as done.Oct 28 2016, 1:33 PM
mkuper added inline comments.
lib/Transforms/Utils/LoopUnroll.cpp
357 ↗(On Diff #76148)

Sure.

lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗(On Diff #76148)

Sorry, I don't quite understand what you mean here. It is used (in line 84 or below). Could you explain?

121 ↗(On Diff #76148)

I don't think we need TotalExitEdgeWeight, but you're right, we need to accumulate the header weight to be able to update the final weight of the backedge properly.

168 ↗(On Diff #76148)

Right, I assumed they're the same (and currently they are), but it's not necessarily true in general, I'll add a parameter (and pass PeelCount).

davidxl added inline comments.Oct 28 2016, 1:36 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗(On Diff #76148)

Sorry, I misread the early return as the regular return. Anyway, if UP is used here, it seems redundant to set UP.PeelCount in the caller of this function.

mkuper marked an inline comment as done.Oct 28 2016, 1:53 PM
mkuper added inline comments.
lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗(On Diff #76148)

I think it'd be better for it to be set explicitly in the caller, rather than hidden away in here. (This currently also sets it, but that was unintentional, thanks for noticing).
I'd prefer to pass UP here by const reference, and set UP.PeelCount in the caller.

mkuper updated this revision to Diff 76258.Oct 28 2016, 3:15 PM

Updated per David's comments.

Note that I'm becoming more convinced the current distribution for the latch branch copies is rather unreasonable, but we can fix that separately.

davidxl added inline comments.Nov 1 2016, 1:31 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
111 ↗(On Diff #76258)

I think this parameter would be better called 'PeeledHeaderWeight' -- it is really the weight of the 'header' block of the current peeled iteration. The PeeledHeaderWeight will be updated to the next iteration's header weight after this call is done.

The name 'Remaining' sounds like the original value of this weight parameter is the sum of all peeled iterations' header weights, but it actually isn't.
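
For reference, a minimal sketch of the update being described (the names N, PeelCount, ExitWeight and the loop structure here are illustrative, not the patch's actual code):

// Weight of the header block of the peeled iteration currently being emitted;
// it starts out as the exit (or preheader) weight of the original loop.
uint64_t PeeledHeaderWeight = ExitWeight;
for (unsigned m = 0; m < PeelCount; ++m) {
  // Probability of falling through into the next peeled iteration, using the
  // (N - m) / N style distribution discussed above.
  uint64_t FallThroughWeight = PeeledHeaderWeight * (N - m) / N;
  // Attach !prof = {FallThroughWeight, PeeledHeaderWeight - FallThroughWeight}
  // to the latch branch of peeled iteration m (metadata plumbing omitted).
  PeeledHeaderWeight = FallThroughWeight; // becomes the next header's weight
}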

295 ↗(On Diff #76258)

If the user requests a large peel count, this may end up < 0, which should be avoided. Make it at least >= 0.

mkuper added inline comments.Nov 1 2016, 5:48 PM
lib/Transforms/Utils/LoopUnrollPeel.cpp
111 ↗(On Diff #76258)

Sure, I'll change it.

295 ↗(On Diff #76258)

Nice catch, thanks! I actually noticed it, but forgot to change the code.
This was one of the reasons I think the distribution is problematic - this shouldn't happen with a distribution where the peel count really is the average.

(Also, I think it probably should be >=1, not just >=0).

mkuper updated this revision to Diff 77068.Nov 7 2016, 11:22 AM
mkuper marked 2 inline comments as done.

Updated per David's comments.

mkuper added a comment.Nov 9 2016, 4:04 PM

Some performance numbers - SPECint and SPECfp C/C++ with instrumentation-based PGO:

444.namd         -0.43%
447.dealII       -0.69%
450.soplex       -0.12%
453.povray       +2.48%
433.milc         +0.38%
470.lbm          -0.58%
482.sphinx3      +0.27%
471.omnetpp      -0.40%
473.astar        +0.17%
483.xalancbmk    -0.46%
400.perlbench    -0.55%
401.bzip2        +0.17%
403.gcc          +0.38%
429.mcf          -0.36%
445.gobmk        +0.11%
456.hmmer        +0.69%
458.sjeng        +0.49%
462.libquantum   -0.07%
464.h264ref      +0.65%

geometric mean   +0.11%

The improvement in povray seems stable, the rest is probably noise.

davidxl edited edge metadata.Nov 14 2016, 11:18 AM

Looks good to me. Adam, do you have more comments?

anemet edited edge metadata.Nov 14 2016, 11:35 AM

Sorry guys about the delay. I'll look at this today!

haicheng edited edge metadata.Nov 14 2016, 2:55 PM

Just a few nits.

lib/Transforms/Scalar/LoopUnrollPass.cpp
985 ↗(On Diff #77068)

Unnecessary change.

1100 ↗(On Diff #77068)

Unnecessary change.

lib/Transforms/Utils/LoopUnrollPeel.cpp
212–214 ↗(On Diff #77068)

Braces are not necessary.

dberris added inline comments.
lib/Transforms/Utils/LoopUnrollPeel.cpp
211–214 ↗(On Diff #77068)

You might also be able to turn this into a range-based for loop:

for (const auto& KV : VMap)
  LVMap[KV.first] = KV.second;

The basic idea is that when the average dynamic trip-count of a loop is known, based on PGO, to be low, we can expect a performance win by peeling off the first several iterations of that loop. Unlike unrolling based on a known trip count, or a trip count multiple, this doesn't save us the conditional check and branch on each iteration. However, it does allow us to simplify the straight-line code we get (constant-folding, etc.), which is important given that we know that we will usually only hit this code, and not the actual loop.

The resulting code should also be more branch-predictor friendly; for small trip-count loops the loop-exit misprediction can be significant.

include/llvm/Transforms/Utils/UnrollLoop.h
19 ↗(On Diff #77068)

forward-declare

lib/Transforms/Scalar/LoopUnrollPass.cpp
1181–1183 ↗(On Diff #77068)

If this is a formatting-only change please remove it from this patch.

lib/Transforms/Utils/LoopUnroll.cpp
356 ↗(On Diff #77068)

We don't want to print internal IR names like this. The remark will already point at the source location of the loop.

lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗(On Diff #76148)

I agree with David that this interface is confusing. This function either knows about UP or it doesn't. If it does, it should handle both the read and the write.

12 ↗(On Diff #77068)

compile-time constant trip count

107–118 ↗(On Diff #77068)

All these function comments should be doxygen comments.

111 ↗(On Diff #77068)

"loop entries" is a bit confusing here; we're not entering the loop.

233–260 ↗(On Diff #77068)

I am not sure I understand this comment. Perhaps adding pseudo code for when one more iteration is peeled would clear things up. For example, it's unclear to me from this last paragraph how the conditional branch is generated in the split bottom anchor block.

lib/Transforms/Utils/LoopUtils.cpp
1084–1093 ↗(On Diff #77068)

Do we have precedent for adjusting a weight back like this? Hopefully the underlying issue will be fixed soon, so we shouldn't add band-aids like this.

test/Transforms/LoopUnroll/peel-loop.ll
1 ↗(On Diff #77068)

It would be good to explain what each of these tests checks.

Thanks, Adam!
I'll fix the cosmetics, but I'm still not sure what to do about the branch weight adjustments.

Thanks, Adam!
I'll fix the cosmetics, but I'm still not sure what to do about the branch weight adjustments.

Wasn't @davidxl working on fixing them?

mkuper added a subscriber: bogner.Nov 15 2016, 4:53 PM

Thanks, Adam!
I'll fix the cosmetics, but I'm still not sure what to do about the branch weight adjustments.

Wasn't @davidxl working on fixing them?

Not that I know of. @davidxl, did I miss something?

Also, phab seems to have swallowed a long comment I wrote about that. :-\

Basically, the underlying issue is that clang doesn't actually record the branch weights it gets from the profile. Rather, it records the branch weights it thinks are better for the purpose of branch probability estimation. From CodeGenPGO.cpp:
// According to Laplace's Rule of Succession, it is better to compute the
// weight based on the count plus 1, so universally add 1 to the value.
While this may be good from a branch-probability standpoint, it throws off using the weights for trip count estimates, when the exit weight is small.
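
For illustration (assuming the estimate is computed as the backedge weight divided by the exit weight, as in the example further down): a loop whose backedge was taken 9 times for every exit has a real average of about 9 iterations, but after the +1 adjustment the recorded weights are 10 and 2, and the same division now estimates 5.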

I see 3 options:

  1. Keep what I have here as a band-aid.
  2. Don't do the -1 adjustment here - hopefully, in reality, we don't care about the cases where it makes a difference. I'll need to benchmark this, though.
  3. Remove the +1 adjustment in clang, and instead do it where we actually use the branch weights to estimate probabilities, that is, in BPI. Of course, there may be other places that read the weights but really want the adjusted weights (and, say, assume weights are never 0). I asked @bogner about this on IRC, and he seemed ambivalent.

What do you think?

danielcdh added inline comments.Nov 16 2016, 8:56 AM
include/llvm/Transforms/Utils/LoopUtils.h
465 ↗(On Diff #77068)

Is it possible to have this helper function in another patch (maybe NFC) and submit it first? I'd like to use it in another patch: https://reviews.llvm.org/D26527. Or if this patch is going in pretty soon, it's also fine. Thanks.

danielcdh added inline comments.Nov 16 2016, 9:20 AM
include/llvm/Transforms/Utils/LoopUtils.h
465 ↗(On Diff #77068)

Also, this function does not distinguish between a 0 trip count and an unknown trip count. Maybe change the return value to int so that < 0 means an unknown trip count? Or simply use Optional<unsigned> as the return value.

mkuper added inline comments.Nov 16 2016, 11:51 AM
include/llvm/Transforms/Utils/LoopUtils.h
465 ↗(On Diff #77068)

I hope this can go in pretty soon. :-)

In any case, I'd rather not commit unused API as an NFC patch, but feel free to "steal" this into your patch in case that goes in first.
Re "0-trip count" and "unknown trip count" - fair point. I'll change the interface to use Optional<>.

mkuper updated this revision to Diff 78255.Nov 16 2016, 1:45 PM
mkuper edited edge metadata.
mkuper marked 12 inline comments as done.

Updated per comments.
Also changed getLoopEstimatedTripCount() to round-to-nearest when dividing.

This was probably the right thing to do to begin with, and it compensates for an off-by-one when we have very precise, but clang-adjusted, branch weights.
If the original edge weights were 3000 and 1000, we really want to peel off 3 iterations, but 3001/1001 = 2.99..., and the previous code would truncate that to 2.
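
A minimal sketch of the rounding change (not the exact code in the patch; the variable names here are illustrative), given backedge and exit weights:

// Truncating division: 3001 / 1001 == 2.
unsigned TripCount = BackedgeWeight / ExitWeight;
// Round-to-nearest instead: (3001 + 1001 / 2) / 1001 == 3.
unsigned RoundedTripCount = (BackedgeWeight + ExitWeight / 2) / ExitWeight;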

mkuper added inline comments.Nov 16 2016, 6:55 PM
lib/Transforms/Utils/LoopUtils.cpp
1089 ↗(On Diff #78255)

Actually, now that I think about this again, I'm not sure this is right.
Dehao, for instrumentation-based profiles, do back-edges of loops that are never taken get any profile metadata? I think they don't, in which case this should probably return 0.

danielcdh added inline comments.Nov 16 2016, 8:28 PM
lib/Transforms/Utils/LoopUtils.cpp
1089 ↗(On Diff #78255)

I'm not sure about instrumentation-based profiles, but for sample-based profiles, if the backedge is never taken (but the loop body is executed and thus has samples), it will still get branch probability metadata annotated, which will be (sample_count+1, 1).

mkuper added inline comments.Nov 16 2016, 9:19 PM
lib/Transforms/Utils/LoopUtils.cpp
1089 ↗(On Diff #78255)

Ok, will keep it as is, then.

davidxl added inline comments.Nov 16 2016, 9:38 PM
lib/Transforms/Utils/LoopUtils.cpp
1089 ↗(On Diff #78255)

as long as the loop body is executed, it will have profile data annotated. If the backedge is never taken, the weight will be (0, 1)

mkuper updated this revision to Diff 78574.Nov 18 2016, 1:16 PM

Rebased to resolve conflicts with r287186.

Dehao, note I moved the flat loop check to be below the peeling computation, otherwise it would preempt it.
I think as long as it's above runtime unrolling, it does what you meant, right?

danielcdh edited edge metadata.Nov 18 2016, 1:28 PM

Rebased to resolve conflicts with r287186.

Dehao, note I moved the flat loop check to be below the peeling computation, otherwise it would preempt it.
I think as long as it's above runtime unrolling, it does what you meant, right?

SGTM. It looks like the check becomes a no-op when peeling is enabled, but it would still serve as a backup plan when peeling is not enabled.

Rebased to resolve conflicts with r287186.

Dehao, note I moved the flat loop check to be below the peeling computation, otherwise it would preempt it.
I think as long as it's above runtime unrolling, it does what you meant, right?

SGTM. It looks like the check becomes a no-op when peeling is enabled, but it would still serve as a backup plan when peeling is not enabled.

I don't think it's a no-op. We may decide not to peel even with a low trip count, but we will still prefer not to unroll.

Adam, David, do you have any additional comments, or is this good to go?

davidxl accepted this revision.Nov 21 2016, 11:58 AM
davidxl edited edge metadata.

lgtm

This revision is now accepted and ready to land.Nov 21 2016, 11:58 AM
anemet added inline comments.Nov 22 2016, 12:14 AM
lib/Transforms/Utils/LoopUnrollPeel.cpp
111–112 ↗(On Diff #78574)

You should also say something about adjusting the profile data; otherwise it's not clear why all the other parameters are passed.

Does that actually need to happen in this same function? It feels like this could be further decomposed for improved readability.

224–226 ↗(On Diff #78574)

Missing function comment.

293 ↗(On Diff #78574)

.peel.new

mkuper updated this revision to Diff 78908.Nov 22 2016, 11:44 AM
mkuper edited edge metadata.

Thanks, Adam!

anemet accepted this revision.Nov 28 2016, 12:14 PM
anemet edited edge metadata.

LGTM with some minor changes.

lib/Transforms/Utils/LoopUnrollPeel.cpp
137 ↗(On Diff #78908)

Why the 0.9? This also needs a comment.

204–219 ↗(On Diff #78908)

This needs a bit more comment: here we're rewriting values that were previously defined by the loop PHIs to instead use values directly from the preheader or the previous cloned loop body.

This revision was automatically updated to reflect the committed changes.
trentxintong added inline comments.
llvm/trunk/lib/Transforms/Utils/LoopUnrollPeel.cpp
357

@mkuper any particular reason why the backedge weight has to be 1 instead of 0?

mkuper added inline comments.Jan 11 2017, 11:30 AM
llvm/trunk/lib/Transforms/Utils/LoopUnrollPeel.cpp
357

There was some concern (on IRC, I don't remember who expressed it, unfortunately) that other things that look at profile information will not be happy with 0 weights, since clang doesn't normally generate them (the whole "Laplace's Rule of Succession" adjustment).

I haven't actually tested it, but it didn't seem like it should matter much - if we get here, the distribution isn't that realistic to begin with. So I preferred to play it safe.