This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroll] Implement profile-based loop peeling
ClosedPublic

Authored by mkuper on Oct 25 2016, 1:30 PM.

Details

Reviewers

anemet
mzolotukhin
davidxl
danielcdh
haicheng

Commits

rGb151a641aa8f: [LoopUnroll] Implement profile-based loop peeling
rL288274: [LoopUnroll] Implement profile-based loop peeling

Summary

This implements profile-based loop peeling.

The basic idea is that when the average dynamic trip-count of a loop is known, based on PGO, to be low, we can expect a performance win by peeling off the first several iterations of that loop. Unlike unrolling based on a known trip count, or a trip count multiple, this doesn't save us the conditional check and branch on each iteration. However, it does allow us to simplify the straight-line code we get (constant-folding, etc.), which is important given that we know that we will usually only hit this code, and not the actual loop.

The code is somewhat similar to (and is based on the original version of) the runtime unrolling code, but I think they're sufficiently different that trying to share the implementation isn't a good idea. Since the current runtime unrolling implementation already has two different prolog/epilog cases, making it do peeling as well would make it rather unreadable.

I'm planning on committing this as disabled-by-default, until I have a bit more confidence in the performance - some more tuning may be required.

Diff Detail

Repository: rL LLVM

Event Timeline

mkuper updated this revision to Diff 75780.Oct 25 2016, 1:30 PM

mkuper retitled this revision from to [LoopUnroll] Implement profile-based loop peeling.

mkuper updated this object.

mkuper added reviewers: mzolotukhin, haicheng, davidxl, danielcdh.

mkuper added a subscriber: llvm-commits.

Herald added subscribers: modocache, mgorny, beanz, sanjoy.Oct 25 2016, 1:30 PM

davidxl added inline comments.Oct 25 2016, 4:40 PM

lib/Transforms/Scalar/LoopUnrollPass.cpp
724 ↗	(On Diff #75780)	Should PeelCount be a member of UnrollingPreferences class?
887 ↗	(On Diff #75780)	I think all the logic here should probably be wrapped in a helper function "computeOptimalPeelCount".
904 ↗	(On Diff #75780)	getPeelCount looks like a simple wrapper that is only used once. Probably get rid of it.
lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗	(On Diff #75780)	Is there common utility to do body cloning?

As a high-level comment, it would be nice to also have loop metadata to specify a typical trip count (or trip counts).

Intel, for example has (https://software.intel.com/en-us/node/524502):

#pragma loop_count(n)

which asks the optimizer to optimize for a trip count of n. Moreover, and perhaps more importantly, it also supports:

#pragma loop_count(n1, n2, ...)

which asks for specializations for trip counts n1, n2, etc.

Also supported by Intel's compiler is:

#pragma loop_count min(n),max(n),avg(n)

(in this case, avg is a hint, but min and max are promises).

FWIW, obviously part of the problem with the average is that you might miss the common trip counts. A loop that is generally executed with a trip count of 3 or 5 might end up with an average near 4; I'm not sure what the best thing would be to do in that case.

Thanks, David.

lib/Transforms/Scalar/LoopUnrollPass.cpp
724 ↗	(On Diff #75780)	I think not, but I don't really know. To the best of my understanding, the main point of UnrollingPreferences is to give targets the ability to override the defaults, and I'm not sure this is useful here.
887 ↗	(On Diff #75780)	Maybe. That was the idea behind getPeelCount() - but I thought to leave the forcing logic here, where it is for all the other user knobs.
904 ↗	(On Diff #75780)	Right, it originally contained the code that now lives in getLoopEstimatedTripCount(), but I decided to factor that out because that's more generally useful (e.g. I can see it used in the vectorizer).
lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗	(On Diff #75780)	Not really. This is based on CloneLoopBlocks() in LoopUnrollRuntime, but it's sufficiently different that sharing code didn't really make sense to me. The main loop unrolling code also has its own, more complicated, version of this. The basic loop copying code is fairly straightforward:

for (LoopBlocksDFS::RPOIterator BB = BlockBegin; BB != BlockEnd; ++BB) {
  BasicBlock *NewBB = CloneBasicBlock(*BB, VMap, ".", F);
  NewBlocks.push_back(NewBB);
  VMap[*BB] = NewBB;
}
remapInstructionsInBlocks(NewBlocks, VMap);

But the different versions do different fixups on the fly - hooking up inputs, outputs and branches, VMap changes, updating ScalarEvolution and/or LoopInfo, etc.

In D25963#579328, @hfinkel wrote:
As a high-level comment, it would be nice to also have loop metadata to specify a typical trip count (or trip counts).

Intel, for example has (https://software.intel.com/en-us/node/524502):
#pragma loop_count(n)
which asks the optimizer to optimize for a trip count of n. Moreover, and perhaps more importantly, it also supports:
#pragma loop_count(n1, n2, ...)
which asks for specializations for trip counts n1, n2, etc.

Also supported by Intel's compiler is:
#pragma loop_count min(n),max(n),avg(n)

I agree this would be nice, but I think it's somewhat orthogonal.
We can start with an implementation of "estimated trip count" that relies on branch weights, and refine to use more specialized metadata if/when we have it.

FWIW, obviously part of the problem with the average is that you might miss the common trip counts. A loop that is generally executed with a trip count of 3 or 5 might end up with an average near 4; I'm not sure what the best thing would be to do in that case.

Right, but at least for sampling-based PGO, I think average is the best we're going to get. (Instrumentation can probably do better, and user hints certainly can).
I'm not entirely sure this is a problem, though. We want to optimize for the common case, and I think the average gives us that - in the "0.5 * 3 + 0.5 * 5" case, if we peel off 4 iterations, then 90% of the dynamically executed iterations will hit the peeled-off section - all iterations of the "3 trips" case, and 4 out of 5 iterations of the "5 trips" cases. Which is hopefully better than leaving the loop as is.
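The coverage argument above can be checked with a small calculation. The following is only a sketch: the function name `peeledFraction` and the representation of the trip-count distribution as (trip count, probability) pairs are assumptions made for illustration, not part of the patch. Under this simple model, peeling 4 iterations from the "0.5 * 3 + 0.5 * 5" distribution covers 7 of every 8 dynamic iterations (87.5%), which is close to the rough 90% figure quoted above.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Fraction of dynamically executed iterations that run in the peeled-off
// copies when `peel` iterations are peeled. Each entry of `dist` is a
// hypothetical (trip count, probability) pair describing how often the loop
// runs for that many iterations.
double peeledFraction(const std::vector<std::pair<unsigned, double>> &dist,
                      unsigned peel) {
  double covered = 0.0; // expected iterations executed in peeled code
  double total = 0.0;   // expected iterations overall
  for (const auto &[trips, prob] : dist) {
    covered += prob * std::min(trips, peel);
    total += prob * trips;
  }
  return total == 0.0 ? 0.0 : covered / total;
}
```

For example, `peeledFraction({{3, 0.5}, {5, 0.5}}, 4)` evaluates the case discussed above, and varying the second argument shows how coverage changes with the peel count.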

davidxl added inline comments.Oct 26 2016, 9:53 AM

lib/Transforms/Scalar/LoopUnrollPass.cpp
724 ↗	(On Diff #75780)	The Count field in UP is used to specify the unrolling factor, including the heuristic-based one:

// 4th priority is partial unrolling.
// Try partial unroll only when TripCount could be statically calculated.
if (TripCount) {
  if (UP.Count == 0)
    UP.Count = TripCount;

Conceptually, the peeling count should be treated in the same way.
lib/Transforms/Utils/LoopUnrollPeel.cpp
87 ↗	(On Diff #75780)	Can this be moved outside the loop? assert(VMap[Header]); InsertTop->getTerminator()->setSuccessor(0, VMap[Header]);
90 ↗	(On Diff #75780)	This can be handled outside the loop too.
91 ↗	(On Diff #75780)	What does this erase do?
95 ↗	(On Diff #75780)	Is this a stale comment?
101 ↗	(On Diff #75780)	Why? I think we should update the branch probability here -- it depends on which iteration of the peeled clone this is. If peel count < average/estimated trip count, then each peeled iteration should be more biased towards fall-through. If peel_count == est trip_count, then the last peeled iteration should be biased toward exit.
220 ↗	(On Diff #75780)	The profile meta data of the original loop's back branch should be adjusted too.

Hi,

Could you provide more background on this idea? What is your motivational use case? When the trip count is low why optimize? If the profile is wrong and it actually is a hot loop for a regular/different input set peeling could hurt. There are also side effects on code size, register pressure etc. that could hurt performance.

Thanks
Gerolf

Hi Gerolf,

In D25963#579918, @Gerolf wrote:

Hi,

Could you provide more background on this idea?

I can't take credit for the idea - this is something GCC already does.

What is your motivational use case? When the trip count is low why optimize?

The motivational use case is a loop with a low trip count that is nested inside a loop with a high trip count.
Peeling the inner loop allows further passes in the optimization pipeline to simplify the code for the iterations that actually run, making the outer loop's body better.
Consider something like this:

for (int i = 0; i < bignum; ++i) {
  int ret = 0;
  for (int j = 0; j < param; ++j)
    ret += arr[i][j];
  out[i] = ret;
}

Imagine param is usually 1.
We can then peel this into:

for (int i = 0; i < bignum; ++i) {
  int ret = 0;
  if (param == 0) {
    out[i] = ret;
    continue;
  }
  ret += arr[i][0];
  for (int j = 1; j < param; ++j)
    ret += arr[i][j];
  out[i] = ret;
}

Which then becomes something morally equivalent to:

for (int i = 0; i < bignum; ++i) {
  if (param == 0) {
    out[i] = 0;
    continue;
  }
  if (param == 1) {
    out[i] = arr[i][0];
    continue;
  }
  int ret = arr[i][0];
  for (int j = 1; j < param; ++j)
    ret += arr[i][j];
  out[i] = ret;
}

So, we've improved the common case (param == 1) - we no longer have to initialize ret, we don't have to deal with the inner loop's IV, there's no add, just a direct assignment.
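To make the equivalence concrete, here is a self-contained sketch of the original nest and the hand-peeled, simplified form. The function names, the `Matrix` alias, and the use of `std::vector` are assumptions for illustration; the real transformation happens on LLVM IR, not on source code. This sketch stores 0 explicitly when `param == 0`, so both variants write every `out[i]`.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Original nest: the inner loop runs `param` times, usually just once.
std::vector<int> sumRows(const Matrix &arr, std::size_t param) {
  std::vector<int> out(arr.size());
  for (std::size_t i = 0; i < arr.size(); ++i) {
    int ret = 0;
    for (std::size_t j = 0; j < param; ++j)
      ret += arr[i][j];
    out[i] = ret;
  }
  return out;
}

// First inner iteration peeled by hand, then simplified: the common
// param == 1 case becomes a plain copy with no accumulator and no inner IV.
std::vector<int> sumRowsPeeled(const Matrix &arr, std::size_t param) {
  std::vector<int> out(arr.size());
  for (std::size_t i = 0; i < arr.size(); ++i) {
    if (param == 0) {
      out[i] = 0;
      continue;
    }
    if (param == 1) {
      out[i] = arr[i][0];
      continue;
    }
    int ret = arr[i][0];
    for (std::size_t j = 1; j < param; ++j)
      ret += arr[i][j];
    out[i] = ret;
  }
  return out;
}
```

Both functions should return the same vector for any `param` no larger than the row length; only the shape of the code on the hot path differs.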

If the profile is wrong and it actually is a hot loop for a regular/different input set peeling could hurt.

Sure, but this is true for any profile-guided optimization. PGO is only good with representative inputs. If an optimization was good regardless of input, we'd be doing it for non-PGO builds.

There are also side effects on code size, register pressure etc. that could hurt performance.

Right. But that's not really different from any other kind of loop unrolling. Hence the thresholds, etc.

Thanks
Gerolf

Thanks,
Michael

lib/Transforms/Utils/LoopUnrollPeel.cpp
87 ↗	(On Diff #75780)	Right, both this and the Latch handling should be moved outside the loop, thanks.
90 ↗	(On Diff #75780)	Right, thanks.
91 ↗	(On Diff #75780)	Nothing, nice catch! (It's stale - it's needed when you replace LatchBR instead of modifying it in-place.)
95 ↗	(On Diff #75780)	No, but I guess it's not clear. Let's say we're peeling off K iterations. For iteration J in 1..K-1, we want the branch that terminates the copy of the latch to be: if (cond) goto header(J+1) else goto exit For iteration K, we want to set this branch to be: if (cond) goto new-ph else goto exit. Here, new-ph is the preheader of the new loop (that is, the loop that handles iterations >= K+1). Technically, this preheader should be empty, and only contains a branch to the loop header - the only reason it exists is to keep the loop in canonical form. Does this make sense? If it does, I'll try to improve the comment.
101 ↗	(On Diff #75780)	You're right, it's not that we don't know anything - but we don't know enough. I'm not sure how to attach a reasonable number to this, without knowing the distribution. Do you have any suggestions? The trivial option would be to assume an extremely narrow distribution (the loop always exits after exactly K iterations), but that would mean having an extreme bias for all of the branches, and I'm not sure that's wise.
220 ↗	(On Diff #75780)	Right, I missed that, thanks. But, as above - I'm not sure by how much to adjust it.

mkuper added inline comments.Oct 26 2016, 11:13 AM

lib/Transforms/Scalar/LoopUnrollPass.cpp
724 ↗	(On Diff #75780)	To be honest, I don't understand why Count lives in UnrollingPreferences either. It's never a target preference, it's a derived value.

davidxl added inline comments.Oct 26 2016, 1:09 PM

lib/Transforms/Utils/LoopUnrollPeel.cpp
101 ↗	(On Diff #75780)	A reasonable way to annotate the branch is like this. Say the original trip count of the loop is N; then for the m-th peeled iteration (m from 0 to N-1), the fall-through probability is a decreasing function: (N - m) / N. Add some fuzzing factor to avoid creating an extremely biased branch probability, for instance (N - m) * 3 / (4 * N).
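The suggested annotation can be written down directly. This is a sketch of the formula as proposed in the comment, with the 3/4 damping factor taken from that suggestion; the function name `fallThroughProb` is hypothetical, and the committed patch may use a different factor or scheme.

```cpp
// Fall-through probability for the m-th peeled iteration (m in [0, N)),
// given an estimated trip count N: the decreasing function (N - m) / N,
// scaled by 3/4 so that no branch ends up extremely biased.
// Returns 0 for out-of-range inputs (an assumption of this sketch).
double fallThroughProb(unsigned N, unsigned m) {
  if (N == 0 || m >= N)
    return 0.0;
  return 3.0 * (N - m) / (4.0 * N);
}
```

With N = 4 the successive peeled iterations get 0.75, 0.5625, 0.375 and 0.1875, so each later copy is less likely to fall through to the next one.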

In D25963#579367, @mkuper wrote:
In D25963#579328, @hfinkel wrote:
As a high-level comment, it would be nice to also have loop metadata to specify a typical trip count (or trip counts).

Intel, for example has (https://software.intel.com/en-us/node/524502):
#pragma loop_count(n)
which asks the optimizer to optimize for a trip count of n. Moreover, and perhaps more importantly, it also supports:
#pragma loop_count(n1, n2, ...)
which asks for specializations for trip counts n1, n2, etc.

Also supported by Intel's compiler is:
#pragma loop_count min(n),max(n),avg(n)
I agree this would be nice, but I think it's somewhat orthogonal.
We can start with an implementation of "estimated trip count" that relies on branch weights, and refine to use more specialized metadata if/when we have it.

Agreed.

FWIW, obviously part of the problem with the average is that you might miss the common trip counts. A loop that is generally executed with a trip count of 3 or 5 might end up with an average near 4; I'm not sure what the best thing would be to do in that case.

Right, but at least for sampling-based PGO, I think average is the best we're going to get. (Instrumentation can probably do better, and user hints certainly can).
I'm not entirely sure this is a problem, though. We want to optimize for the common case, and I think the average gives us that - in the "0.5 * 3 + 0.5 * 5" case, if we peel off 4 iterations, then 90% of the dynamically executed iterations will hit the peeled-off section - all iterations of the "3 trips" case, and 4 out of 5 iterations of the "5 trips" cases. Which is hopefully better than leaving the loop as is.

I agree. Thanks for explaining this, because I did not understand what was happening. I thought that you were peeling off a fixed number of iterations as a single block. You're not. This will give a different performance vs. applicability tradeoff. I think that this probably makes more sense for PGO-driven information.

igor-laevsky added a subscriber: igor-laevsky.Oct 27 2016, 6:43 AM

mkuper updated this revision to Diff 76148.Oct 27 2016, 5:50 PM

Updated per David's comments.

David, after some more thought, I think your analysis was right - so I used a narrower distribution. You can see the result for a 3-iteration peeling in the pgo test. Let me know if you think it makes sense.
Also I repurposed getPeelCount to be self-contained in terms of selecting the factor, but it led to a bit of ugliness, not sure if this is better or worse than the original.

Herald added a subscriber: anna.Oct 27 2016, 5:50 PM

Gerolf added a reviewer: anemet.Oct 27 2016, 7:49 PM

Gerolf added inline comments.

lib/Transforms/Utils/LoopUnrollPeel.cpp
47 ↗	(On Diff #76148)	Please add an option to turn it off by default.
125 ↗	(On Diff #76148)	Could this be implemented as distribute k: N-k (assuming trip count N) complete unroll first k iterations?

mkuper added inline comments.Oct 27 2016, 8:07 PM

lib/Transforms/Utils/LoopUnrollPeel.cpp
47 ↗	(On Diff #76148)	That option already exists - it's in LoopUnrollPass (because it's needed as part of the UnrollingPreferences initialization), and is currently off by default. These two options also started out living in LoopUnrollPass but moved since I tried to consolidate choosing the peeling factor here. That's part of the "ugliness" I referred to above.
125 ↗	(On Diff #76148)	Anyway, yes, I originally thought about implementing it like that (it seemed like it'd be simpler, although I'm not entirely sure there's a big difference). But I think a more direct implementation fits the flow of the unroller better. I wanted to make the "unroll / runtime unroll / peel" decision at the same point (otherwise this could be a separate pass that runs before the unroller, performs the split and marks the two loops as nounroll and force-unroll). And if this lives in the unroller - splitting and then unrolling the new loop while in the middle of "unrolling" the original one seemed awkward.

davidxl added inline comments.Oct 28 2016, 12:04 PM

lib/Transforms/Utils/LoopUnroll.cpp
357 ↗	(On Diff #76148)	.\n --> iterations.\n
lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗	(On Diff #76148)	Is UP parameter intended to be used?
121 ↗	(On Diff #76148)	I think the following weights should be passed in:
  TotalHeaderWeight of the currently peeled loop -- it is passed in for update: TotalHeaderWeight -= PeeledIterationHeaderWeight;
  PeeledIterationHeaderWeight -- its initial value for the first peeled iteration is the PreheaderWeight or ExitWeight of the original loop. After peeling of one iteration, its value will be updated to the fall-through weight of the peel.
  TotalExitEdgeWeight -- passed in for update. After peeling, TotalExitEdgeWeight -= PeeledIterationExitWeight.
125 ↗	(On Diff #76148)	for counted loops, distribute + complete unroll may be simple, but not necessarily good for general cases such as linked-list traversal loops etc.
168 ↗	(On Diff #76148)	Should average trip count be used instead of PeelCount (which may be different)?
291 ↗	(On Diff #76148)	See my comment above about what weights need to be passed for updating.
323 ↗	(On Diff #76148)	Both the total header weight and the exit edge weight need to be adjusted during peeling, and the new metadata can just use the updated backedge weight and exit weight.
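The point about linked-list traversal loops can be illustrated at the source level: peeling does not need a computable trip count, only a copy of the body and its exit test. The sketch below peels two iterations by hand; the `Node` type and the function names are hypothetical and serve only as an illustration of the shape of the result.

```cpp
struct Node {
  int value;
  const Node *next;
};

// Reference version: a plain traversal loop with no countable trip count.
int sumList(const Node *n) {
  int sum = 0;
  for (; n; n = n->next)
    sum += n->value;
  return sum;
}

// Same loop with the first two iterations peeled off. Each peeled copy keeps
// its own exit test, so the per-iteration check is not saved, but short lists
// never reach the loop at all.
int sumListPeeled2(const Node *n) {
  int sum = 0;
  if (!n)
    return sum; // exit test of peeled iteration 1
  sum += n->value;
  n = n->next;
  if (!n)
    return sum; // exit test of peeled iteration 2
  sum += n->value;
  n = n->next;
  for (; n; n = n->next) // remaining iterations
    sum += n->value;
  return sum;
}
```

A distribute-then-fully-unroll approach would need to split the iteration space at a known index, which is not available for a loop like this one.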

mkuper marked an inline comment as done.Oct 28 2016, 1:33 PM

mkuper added inline comments.

lib/Transforms/Utils/LoopUnroll.cpp
357 ↗	(On Diff #76148)	Sure.
lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗	(On Diff #76148)	Sorry, I don't quite understand what you mean here. It is used (in line 84 or below). Could you explain?
121 ↗	(On Diff #76148)	I don't think we need TotalExitEdgeWeight, but you're right, we need to accumulate the header weight to be able to update the final weight of the backedge properly.
168 ↗	(On Diff #76148)	Right, I assumed they're the same (and currently they are), but it's not necessarily true in general, I'll add a parameter (and pass PeelCount).

davidxl added inline comments.Oct 28 2016, 1:36 PM

lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗	(On Diff #76148)	Sorry I misread the early return as the regular return. Anyway, if UP is used here, it seems redundant to set UP.PeelCount in the caller of this function.

mkuper marked an inline comment as done.Oct 28 2016, 1:53 PM

mkuper added inline comments.

lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗	(On Diff #76148)	I think it'd be better for it to be set explicitly in the caller, rather than hidden away in here. (This currently also sets it, but that was unintentional, thanks for noticing). I'd prefer to pass UP here by const reference, and set UP.PeelCount in the caller.

Updated per David's comments.

Note that I'm becoming more convinced the current distribution for the latch branch copies is rather unreasonable, but we can fix that separately.

davidxl added inline comments.Nov 1 2016, 1:31 PM

lib/Transforms/Utils/LoopUnrollPeel.cpp
111 ↗	(On Diff #76258)	I think this parameter would be better called 'PeeledHeaderWeight' -- it is really the weight of the 'header' block of the current peeled iteration. The PeeledHeaderWeight will be updated to the next iteration's header weight after this call is done. The name 'Remaining' sounds like the original value of this weight parameter is the sum of all peeled iterations' header weights, but it is actually not.
295 ↗	(On Diff #76258)	If the user requests a large peel count, this may end up < 0, which should be avoided. Make it at least >= 0.

mkuper added inline comments.Nov 1 2016, 5:48 PM

lib/Transforms/Utils/LoopUnrollPeel.cpp
111 ↗	(On Diff #76258)	Sure, I'll change it.
295 ↗	(On Diff #76258)	Nice catch, thanks! I actually noticed it, but forgot to change the code. This was one of the reasons I think the distribution is problematic - this shouldn't happen in a distribution when the peel count really is the average. (Also, I think it probably should be >=1, not just >=0).
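A minimal sketch of the clamping discussed here, assuming the remaining backedge weight is computed by subtracting the accumulated header weights of the peeled iterations from the original total. The function name and signature are hypothetical and are not the code in the patch; the floor of 1 follows the ">=1" remark above.

```cpp
#include <cstdint>
#include <vector>

// Weight left for the backedge of the remaining loop after peeling.
// Unsigned arithmetic would wrap if the peeled weights exceed the total
// (for example with a large user-requested peel count), so the result is
// clamped to a floor of 1 instead of being allowed to go "negative".
std::uint64_t
remainingBackedgeWeight(std::uint64_t totalHeaderWeight,
                        const std::vector<std::uint64_t> &peeledHeaderWeights) {
  std::uint64_t peeled = 0;
  for (std::uint64_t w : peeledHeaderWeights)
    peeled += w;
  return peeled >= totalHeaderWeight ? 1 : totalHeaderWeight - peeled;
}
```

Keeping the floor at 1 rather than 0 also avoids producing zero weights, which the normal profile path would not generate.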

Updated per David's comments.

evstupac added a subscriber: evstupac.Nov 7 2016, 4:32 PM

Some performance numbers - SPECint and SPECfp C/C++ with instrumentation-based PGO:

444.namd         -0.43%
447.dealII       -0.69%
450.soplex       -0.12%
453.povray       +2.48%
433.milc         +0.38%
470.lbm          -0.58%
482.sphinx3      +0.27%
471.omnetpp      -0.40%
473.astar        +0.17%
483.xalancbmk    -0.46%
400.perlbench    -0.55%
401.bzip2        +0.17%
403.gcc          +0.38%
429.mcf          -0.36%
445.gobmk        +0.11%
456.hmmer        +0.69%
458.sjeng        +0.49%
462.libquantum   -0.07%
464.h264ref      +0.65%

geometric mean   +0.11%

The improvement in povray seems stable, the rest is probably noise.

Ping.

Looks good to me. Adam, do you have more comments?

Sorry guys about the delay. I'll look at this today!

Just a few nits.

lib/Transforms/Scalar/LoopUnrollPass.cpp
985 ↗	(On Diff #77068)	Unnecessary change.
1100 ↗	(On Diff #77068)	Unnecessary change.
lib/Transforms/Utils/LoopUnrollPeel.cpp
212–214 ↗	(On Diff #77068)	Braces are not necessary.

dberris added a subscriber: dberris.Nov 14 2016, 5:43 PM

dberris added inline comments.

lib/Transforms/Utils/LoopUnrollPeel.cpp
211–214 ↗	(On Diff #77068)	You might also be able to turn this into a range-based for loop: for (const auto& KV : VMap) LVMap[KV.first] = KV.second;

The basic idea is that when the average dynamic trip-count of a loop is known, based on PGO, to be low, we can expect a performance win by peeling off the first several iterations of that loop. Unlike unrolling based on a known trip count, or a trip count multiple, this doesn't save us the conditional check and branch on each iteration. However, it does allow us to simplify the straight-line code we get (constant-folding, etc.), which is important given that we know that we will usually only hit this code, and not the actual loop.

The resulting code should also be more branch-predictor friendly; for small-trip count loops the loop-exiting misprediction can be significant.

include/llvm/Transforms/Utils/UnrollLoop.h
19 ↗	(On Diff #77068)	forward-declare
lib/Transforms/Scalar/LoopUnrollPass.cpp
1181–1183 ↗	(On Diff #77068)	If this is a formatting-only change please remove it from this patch.
lib/Transforms/Utils/LoopUnroll.cpp
356 ↗	(On Diff #77068)	We don't want to print internal IR names like this. The remark will already point at the source location of the loop.
lib/Transforms/Utils/LoopUnrollPeel.cpp
63 ↗	(On Diff #76148)	I agree with David that this interface is confusing. This function either knows about UP or doesn't. If it does it should handle both read-and-write.
12 ↗	(On Diff #77068)	compile-time constant trip count
107–118 ↗	(On Diff #77068)	All these function comments should be doxygen comments.
111 ↗	(On Diff #77068)	"loop entries" is a bit confusing here; we're not entering the loop.
233–260 ↗	(On Diff #77068)	I am not sure I understand this comment. Perhaps having the pseudo code when one more iteration is peeled may clear things up. It's unclear to me in this last paragraph, for example, how the conditional branch is generated in the split bottom anchor block.
lib/Transforms/Utils/LoopUtils.cpp
1084–1093 ↗	(On Diff #77068)	Do we have a precedent for adjusting weights back like this? Hopefully the underlying issue will be fixed soon, so we shouldn't add band-aids like this.
test/Transforms/LoopUnroll/peel-loop.ll
1 ↗	(On Diff #77068)	It would be good to explain what each of these tests checks.

Thanks, Adam!
I'll fix the cosmetics, but I'm still not sure what to do about the branch weight adjustments.

In D25963#596582, @mkuper wrote:

Thanks, Adam!
I'll fix the cosmetics, but I'm still not sure what to do about the branch weight adjustments.

Wasn't @davidxl working on fixing them?

In D25963#596720, @anemet wrote:

In D25963#596582, @mkuper wrote:

Thanks, Adam!
I'll fix the cosmetics, but I'm still not sure what to do about the branch weight adjustments.

Wasn't @davidxl working on fixing them?

Not that I know of. @davidxl, did I miss something?

Also, phab seems to have swallowed a long comment I wrote about that. :-\

Basically, the underlying issue is that clang doesn't actually record the branch weights it gets from the profile. Rather, it records the branch weights it thinks are better for the purpose of branch probability estimation. From CodeGenPGO.cpp:
// According to Laplace's Rule of Succession, it is better to compute the
// weight based on the count plus 1, so universally add 1 to the value.
While this may be good from a branch-probability standpoint, it throws off using the weights for trip count estimates, when the exit weight is small.

I see 3 options:

1. Keep what I have here as a band-aid.
2. Don't do the -1 adjustment here - hopefully, in reality, we don't care about the cases where it makes a difference. I'll need to benchmark this, though.
3. Remove the +1 adjustment in clang, and instead do it where we actually use the branch weights to estimate probabilities, that is, in BPI. Of course, there may be other places that read the weights but really want the adjusted weights (and, say, assume weights are never 0). I asked @bogner about this on IRC, and he seemed ambivalent.

What do you think?

danielcdh added inline comments.Nov 16 2016, 8:56 AM

include/llvm/Transforms/Utils/LoopUtils.h
465 ↗	(On Diff #77068)	Is it possible to have this helper function in another patch (maybe NFC) and submit it first? I'd like to use it in another patch: https://reviews.llvm.org/D26527. Or if this patch is going in pretty soon, it's also fine. Thanks.

danielcdh added inline comments.Nov 16 2016, 9:20 AM

include/llvm/Transforms/Utils/LoopUtils.h
465 ↗	(On Diff #77068)	Also, this function does not distinguish between "0-trip count" and "unknown tripcount". Maybe change the return value to int so that <0 means unknown tripcount? Or simply use Optional<unsigned> as return value.

mkuper added inline comments.Nov 16 2016, 11:51 AM

include/llvm/Transforms/Utils/LoopUtils.h
465 ↗	(On Diff #77068)	I hope this can go in pretty soon. :-) In any case, I'd rather not commit unused API as an NFC patch, but feel free to "steal" this into your patch in case that goes in first. Re "0-trip count" and "unknown trip count" - fair point. I'll change the interface to use Optional<>.

Updated per comments.
Also changed getLoopEstimatedTripCount() to round-to-nearest when dividing.

This was probably the right thing to do to begin with, and compensates for an off-by-one if we have very precise, but adjusted by clang, branch weights.
If the original edge weights were 3000 and 1000, we really want to peel off 3 iterations, but 3001/1001 = 2.99.. and the previous code would truncate it to 2.
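The rounding change can be sketched as follows. This is an illustration only: the name `estimateTripCount`, the use of `std::optional` (standing in for the `Optional<>` return type agreed on earlier in the review), and the exact ratio of backedge weight to exit weight are assumptions based on the 3001/1001 example, not the actual body of getLoopEstimatedTripCount().

```cpp
#include <cstdint>
#include <optional>

// Estimate a trip count from latch branch weights, rounding to nearest.
// Returns std::nullopt when no estimate is possible, so that "unknown" is
// distinguishable from an estimate of 0.
std::optional<std::uint64_t> estimateTripCount(std::uint64_t backedgeWeight,
                                               std::uint64_t exitWeight) {
  if (exitWeight == 0)
    return std::nullopt; // no exit samples: trip count unknown
  // Adding half the divisor before dividing rounds to nearest.
  return (backedgeWeight + exitWeight / 2) / exitWeight;
}
```

With the adjusted weights 3001 and 1001, plain integer division yields 2, while the rounded version yields 3, which matches the unadjusted 3000/1000 ratio.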

mzolotukhin mentioned this in D26527: Use profile info to adjust loop unroll threshold..Nov 16 2016, 2:24 PM

mkuper added inline comments.Nov 16 2016, 6:55 PM

lib/Transforms/Utils/LoopUtils.cpp
1089 ↗	(On Diff #78255)	Actually, now that I think about this again, I'm not sure this is right. Dehao, for instrumentation-based profiles, do back-edges of loops that are never taken get any profile metadata? I think they don't, in which case this should probably return 0.

danielcdh added inline comments.Nov 16 2016, 8:28 PM

lib/Transforms/Utils/LoopUtils.cpp
1089 ↗	(On Diff #78255)	I'm not sure about instrumentation-based profiles, but for sample-based profiles, if the backedge is never taken (but executed and thus has sample), it will still get branch probability metadata annotated, which will be (sample_count+1, 1)

mkuper added inline comments.Nov 16 2016, 9:19 PM

lib/Transforms/Utils/LoopUtils.cpp
1089 ↗	(On Diff #78255)	Ok, will keep it as is, then.

davidxl added inline comments.Nov 16 2016, 9:38 PM

lib/Transforms/Utils/LoopUtils.cpp
1089 ↗	(On Diff #78255)	as long as the loop body is executed, it will have profile data annotated. If the backedge is never taken, the weight will be (0, 1)

Rebased to resolve conflicts with r287186.

Dehao, note I moved the flat loop check to be below the peeling computation, otherwise it would preempt it.
I think as long as it's above runtime unrolling, it does what you meant, right?

In D25963#600188, @mkuper wrote:

Rebased to resolve conflicts with r287186.

Dehao, note I moved the flat loop check to be below the peeling computation, otherwise it would preempt it.
I think as long as it's above runtime unrolling, it does what you meant, right?

SGTM. Looks like the check becomes noop when peeling is enabled. But it would still serve as a backup plan when peeling is not enabled.

In D25963#600198, @danielcdh wrote:

In D25963#600188, @mkuper wrote:

Rebased to resolve conflicts with r287186.

Dehao, note I moved the flat loop check to be below the peeling computation, otherwise it would preempt it.
I think as long as it's above runtime unrolling, it does what you meant, right?

SGTM. Looks like the check becomes noop when peeling is enabled. But it would still serve as a backup plan when peeling is not enabled.

I don't think it's a noop. We may decide not to peel even with a low trip count, but we will still prefer not to unroll.

mcrosier added a subscriber: mcrosier.Nov 21 2016, 9:43 AM

Adam, David, do you have any additional comments, or is this good to go?

lgtm

This revision is now accepted and ready to land.Nov 21 2016, 11:58 AM

anemet added inline comments.Nov 22 2016, 12:14 AM

lib/Transforms/Utils/LoopUnrollPeel.cpp
111–112 ↗	(On Diff #78574)	You should also say something about adjusting profile data otherwise it's not clear why all the other parameters are passed. Does that actually need to happen in this same function? It feels that this could be further decomposed for improved readability.
224–226 ↗	(On Diff #78574)	Missing function comment.
293 ↗	(On Diff #78574)	.peel.new

Thanks, Adam!

LGTM with some minor changes.

lib/Transforms/Utils/LoopUnrollPeel.cpp
137 ↗	(On Diff #78908)	Why the 0.9? This also needs a comment.
204–219 ↗	(On Diff #78908)	This needs a bit more comment; here we're turning values that were before defined by the loop PHIs to use values directly from the preheader or the previous cloned loop body.

Closed by commit rL288274: [LoopUnroll] Implement profile-based loop peeling (authored by mkuper).Nov 30 2016, 1:24 PM

This revision was automatically updated to reflect the committed changes.

trentxintong added a subscriber: trentxintong.Jan 11 2017, 11:13 AM

trentxintong added inline comments.

llvm/trunk/lib/Transforms/Utils/LoopUnrollPeel.cpp
357	@mkuper any particular reasons why the backedge weight has to be 1 instead of 0 ?

mkuper added inline comments.Jan 11 2017, 11:30 AM

llvm/trunk/lib/Transforms/Utils/LoopUnrollPeel.cpp
357	There was some concern (on IRC, I don't remember who expressed it, unfortunately) that other things that look at profile information will not be happy with 0 weights, since clang doesn't normally generate them (the whole "Laplace's Rule of Succession" adjustment). I haven't actually tested it, but it didn't seem like it should matter much - if we get here, the distribution isn't that realistic to begin with. So I preferred to play it safe.

Revision Contents

Path	Size
llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h	7 lines
llvm/trunk/include/llvm/Transforms/Utils/UnrollLoop.h	13 lines
llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp	49 lines
llvm/trunk/lib/Transforms/Utils/ (four files, names not preserved)	1 line, 36 lines, 405 lines, 14 lines
llvm/trunk/test/Transforms/LoopUnroll/peel-loop-pgo.ll	47 lines
llvm/trunk/test/Transforms/LoopUnroll/peel-loop.ll	96 lines

Diff 79803

llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h

Show First 20 Lines • Show All 259 Lines • ▼ Show 20 Lines	struct UnrollingPreferences {
/// OptSizeThreshold, but used for partial/runtime unrolling (set to		/// OptSizeThreshold, but used for partial/runtime unrolling (set to
/// UINT_MAX to disable).		/// UINT_MAX to disable).
unsigned PartialOptSizeThreshold;		unsigned PartialOptSizeThreshold;
/// A forced unrolling factor (the number of concatenated bodies of the		/// A forced unrolling factor (the number of concatenated bodies of the
/// original loop in the unrolled loop body). When set to 0, the unrolling		/// original loop in the unrolled loop body). When set to 0, the unrolling
/// transformation will select an unrolling factor based on the current cost		/// transformation will select an unrolling factor based on the current cost
/// threshold and other factors.		/// threshold and other factors.
unsigned Count;		unsigned Count;
		/// A forced peeling factor (the number of bodies of the original loop
		/// that should be peeled off before the loop body). When set to 0, the
		/// unrolling transformation will select a peeling factor based on profile
		/// information and other factors.
		unsigned PeelCount;
/// Default unroll count for loops with run-time trip count.		/// Default unroll count for loops with run-time trip count.
unsigned DefaultUnrollRuntimeCount;		unsigned DefaultUnrollRuntimeCount;
// Set the maximum unrolling factor. The unrolling factor may be selected		// Set the maximum unrolling factor. The unrolling factor may be selected
// using the appropriate cost threshold, but may not exceed this number		// using the appropriate cost threshold, but may not exceed this number
// (set to UINT_MAX to disable). This does not apply in cases where the		// (set to UINT_MAX to disable). This does not apply in cases where the
// loop is being fully unrolled.		// loop is being fully unrolled.
unsigned MaxCount;		unsigned MaxCount;
/// Set the maximum unrolling factor for full unrolling. Like MaxCount, but		/// Set the maximum unrolling factor for full unrolling. Like MaxCount, but
Show All 17 Lines	struct UnrollingPreferences {
/// Allow emitting expensive instructions (such as divisions) when computing		/// Allow emitting expensive instructions (such as divisions) when computing
/// the trip count of a loop for runtime unrolling.		/// the trip count of a loop for runtime unrolling.
bool AllowExpensiveTripCount;		bool AllowExpensiveTripCount;
/// Apply loop unroll on any kind of loop		/// Apply loop unroll on any kind of loop
/// (mainly to loops that fail runtime unrolling).		/// (mainly to loops that fail runtime unrolling).
bool Force;		bool Force;
/// Allow using trip count upper bound to unroll loops.		/// Allow using trip count upper bound to unroll loops.
bool UpperBound;		bool UpperBound;
		/// Allow peeling off loop iterations for loops with low dynamic tripcount.
		bool AllowPeeling;
};		};

/// \brief Get target-customized preferences for the generic loop unrolling		/// \brief Get target-customized preferences for the generic loop unrolling
/// transformation. The caller will initialize UP with the current		/// transformation. The caller will initialize UP with the current
/// target-independent defaults.		/// target-independent defaults.
void getUnrollingPreferences(Loop *L, UnrollingPreferences &UP) const;		void getUnrollingPreferences(Loop *L, UnrollingPreferences &UP) const;

/// @}		/// @}
▲ Show 20 Lines • Show All 874 Lines • Show Last 20 Lines

llvm/trunk/include/llvm/Transforms/Utils/UnrollLoop.h

	Show All 10 Lines
	// actual pass or policy, but provides a single function to perform loop			// actual pass or policy, but provides a single function to perform loop
	// unrolling.			// unrolling.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef LLVM_TRANSFORMS_UTILS_UNROLLLOOP_H			#ifndef LLVM_TRANSFORMS_UTILS_UNROLLLOOP_H
	#define LLVM_TRANSFORMS_UTILS_UNROLLLOOP_H			#define LLVM_TRANSFORMS_UTILS_UNROLLLOOP_H

				// Needed because we can't forward-declare the nested struct
				// TargetTransformInfo::UnrollingPreferences
				#include "llvm/Analysis/TargetTransformInfo.h"

	namespace llvm {			namespace llvm {

	class StringRef;			class StringRef;
	class AssumptionCache;			class AssumptionCache;
	class DominatorTree;			class DominatorTree;
	class Loop;			class Loop;
	class LoopInfo;			class LoopInfo;
	class LPPassManager;			class LPPassManager;
	class MDNode;			class MDNode;
	class Pass;			class Pass;
	class OptimizationRemarkEmitter;			class OptimizationRemarkEmitter;
	class ScalarEvolution;			class ScalarEvolution;

	bool UnrollLoop(Loop *L, unsigned Count, unsigned TripCount, bool Force,			bool UnrollLoop(Loop *L, unsigned Count, unsigned TripCount, bool Force,
	bool AllowRuntime, bool AllowExpensiveTripCount,			bool AllowRuntime, bool AllowExpensiveTripCount,
	bool PreserveCondBr, bool PreserveOnlyFirst,			bool PreserveCondBr, bool PreserveOnlyFirst,
	unsigned TripMultiple, LoopInfo *LI, ScalarEvolution *SE,			unsigned TripMultiple, unsigned PeelCount, LoopInfo *LI,
	DominatorTree *DT, AssumptionCache *AC,			ScalarEvolution *SE, DominatorTree *DT, AssumptionCache *AC,
	OptimizationRemarkEmitter *ORE, bool PreserveLCSSA);			OptimizationRemarkEmitter *ORE, bool PreserveLCSSA);

	bool UnrollRuntimeLoopRemainder(Loop *L, unsigned Count,			bool UnrollRuntimeLoopRemainder(Loop *L, unsigned Count,
	bool AllowExpensiveTripCount,			bool AllowExpensiveTripCount,
	bool UseEpilogRemainder, LoopInfo *LI,			bool UseEpilogRemainder, LoopInfo *LI,
	ScalarEvolution *SE, DominatorTree *DT,			ScalarEvolution *SE, DominatorTree *DT,
	bool PreserveLCSSA);			bool PreserveLCSSA);

				void computePeelCount(Loop *L, unsigned LoopSize,
				TargetTransformInfo::UnrollingPreferences &UP);

				bool peelLoop(Loop *L, unsigned PeelCount, LoopInfo *LI, ScalarEvolution *SE,
				DominatorTree *DT, bool PreserveLCSSA);

	MDNode *GetUnrollMetadata(MDNode *LoopID, StringRef Name);			MDNode *GetUnrollMetadata(MDNode *LoopID, StringRef Name);
	}			}

	#endif			#endif

llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp

Show All 18 Lines
#include "llvm/Analysis/GlobalsModRef.h"		#include "llvm/Analysis/GlobalsModRef.h"
#include "llvm/Analysis/InstructionSimplify.h"		#include "llvm/Analysis/InstructionSimplify.h"
#include "llvm/Analysis/LoopPass.h"		#include "llvm/Analysis/LoopPass.h"
#include "llvm/Analysis/LoopPassManager.h"		#include "llvm/Analysis/LoopPassManager.h"
#include "llvm/Analysis/LoopUnrollAnalyzer.h"		#include "llvm/Analysis/LoopUnrollAnalyzer.h"
#include "llvm/Analysis/OptimizationDiagnosticInfo.h"		#include "llvm/Analysis/OptimizationDiagnosticInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"		#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/Analysis/ScalarEvolutionExpressions.h"		#include "llvm/Analysis/ScalarEvolutionExpressions.h"
#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/IR/DataLayout.h"		#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Dominators.h"		#include "llvm/IR/Dominators.h"
#include "llvm/IR/InstVisitor.h"		#include "llvm/IR/InstVisitor.h"
#include "llvm/IR/IntrinsicInst.h"		#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/Metadata.h"		#include "llvm/IR/Metadata.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
▲ Show 20 Lines • Show All 67 Lines • ▼ Show 20 Lines	cl::desc("Unrolled size limit for loops with an unroll(full) or "
"unroll_count pragma."));		"unroll_count pragma."));

static cl::opt<unsigned> FlatLoopTripCountThreshold(		static cl::opt<unsigned> FlatLoopTripCountThreshold(
"flat-loop-tripcount-threshold", cl::init(5), cl::Hidden,		"flat-loop-tripcount-threshold", cl::init(5), cl::Hidden,
cl::desc("If the runtime tripcount for the loop is lower than the "		cl::desc("If the runtime tripcount for the loop is lower than the "
"threshold, the loop is considered as flat and will be less "		"threshold, the loop is considered as flat and will be less "
"aggressively unrolled."));		"aggressively unrolled."));

		static cl::opt<bool>
		UnrollAllowPeeling("unroll-allow-peeling", cl::Hidden,
		cl::desc("Allows loops to be peeled when the dynamic "
		"trip count is known to be low."));

/// A magic value for use with the Threshold parameter to indicate		/// A magic value for use with the Threshold parameter to indicate
/// that the loop unroll should be performed regardless of how much		/// that the loop unroll should be performed regardless of how much
/// code expansion would result.		/// code expansion would result.
static const unsigned NoThreshold = UINT_MAX;		static const unsigned NoThreshold = UINT_MAX;

/// Gather the various unrolling parameters based on the defaults, compiler		/// Gather the various unrolling parameters based on the defaults, compiler
/// flags, TTI overrides and user specified parameters.		/// flags, TTI overrides and user specified parameters.
static TargetTransformInfo::UnrollingPreferences gatherUnrollingPreferences(		static TargetTransformInfo::UnrollingPreferences gatherUnrollingPreferences(
Loop *L, const TargetTransformInfo &TTI, Optional<unsigned> UserThreshold,		Loop *L, const TargetTransformInfo &TTI, Optional<unsigned> UserThreshold,
Optional<unsigned> UserCount, Optional<bool> UserAllowPartial,		Optional<unsigned> UserCount, Optional<bool> UserAllowPartial,
Optional<bool> UserRuntime, Optional<bool> UserUpperBound) {		Optional<bool> UserRuntime, Optional<bool> UserUpperBound) {
TargetTransformInfo::UnrollingPreferences UP;		TargetTransformInfo::UnrollingPreferences UP;

// Set up the defaults		// Set up the defaults
UP.Threshold = 150;		UP.Threshold = 150;
UP.PercentDynamicCostSavedThreshold = 50;		UP.PercentDynamicCostSavedThreshold = 50;
UP.DynamicCostSavingsDiscount = 100;		UP.DynamicCostSavingsDiscount = 100;
UP.OptSizeThreshold = 0;		UP.OptSizeThreshold = 0;
UP.PartialThreshold = UP.Threshold;		UP.PartialThreshold = UP.Threshold;
UP.PartialOptSizeThreshold = 0;		UP.PartialOptSizeThreshold = 0;
UP.Count = 0;		UP.Count = 0;
		UP.PeelCount = 0;
UP.DefaultUnrollRuntimeCount = 8;		UP.DefaultUnrollRuntimeCount = 8;
UP.MaxCount = UINT_MAX;		UP.MaxCount = UINT_MAX;
UP.FullUnrollMaxCount = UINT_MAX;		UP.FullUnrollMaxCount = UINT_MAX;
UP.BEInsns = 2;		UP.BEInsns = 2;
UP.Partial = false;		UP.Partial = false;
UP.Runtime = false;		UP.Runtime = false;
UP.AllowRemainder = true;		UP.AllowRemainder = true;
UP.AllowExpensiveTripCount = false;		UP.AllowExpensiveTripCount = false;
UP.Force = false;		UP.Force = false;
UP.UpperBound = false;		UP.UpperBound = false;
		UP.AllowPeeling = false;

// Override with any target specific settings		// Override with any target specific settings
TTI.getUnrollingPreferences(L, UP);		TTI.getUnrollingPreferences(L, UP);

// Apply size attributes		// Apply size attributes
if (L->getHeader()->getParent()->optForSize()) {		if (L->getHeader()->getParent()->optForSize()) {
UP.Threshold = UP.OptSizeThreshold;		UP.Threshold = UP.OptSizeThreshold;
UP.PartialThreshold = UP.PartialOptSizeThreshold;		UP.PartialThreshold = UP.PartialOptSizeThreshold;
Show All 16 Lines	static TargetTransformInfo::UnrollingPreferences gatherUnrollingPreferences(
if (UnrollAllowPartial.getNumOccurrences() > 0)		if (UnrollAllowPartial.getNumOccurrences() > 0)
UP.Partial = UnrollAllowPartial;		UP.Partial = UnrollAllowPartial;
if (UnrollAllowRemainder.getNumOccurrences() > 0)		if (UnrollAllowRemainder.getNumOccurrences() > 0)
UP.AllowRemainder = UnrollAllowRemainder;		UP.AllowRemainder = UnrollAllowRemainder;
if (UnrollRuntime.getNumOccurrences() > 0)		if (UnrollRuntime.getNumOccurrences() > 0)
UP.Runtime = UnrollRuntime;		UP.Runtime = UnrollRuntime;
if (UnrollMaxUpperBound == 0)		if (UnrollMaxUpperBound == 0)
UP.UpperBound = false;		UP.UpperBound = false;
		if (UnrollAllowPeeling.getNumOccurrences() > 0)
		UP.AllowPeeling = UnrollAllowPeeling;

// Apply user values provided by argument		// Apply user values provided by argument
if (UserThreshold.hasValue()) {		if (UserThreshold.hasValue()) {
UP.Threshold = *UserThreshold;		UP.Threshold = *UserThreshold;
UP.PartialThreshold = *UserThreshold;		UP.PartialThreshold = *UserThreshold;
}		}
if (UserCount.hasValue())		if (UserCount.hasValue())
UP.Count = *UserCount;		UP.Count = *UserCount;
▲ Show 20 Lines • Show All 567 Lines • ▼ Show 20 Lines	if (PragmaFullUnroll && TripCount != 0) {
if (getUnrolledLoopSize(LoopSize, UP) < PragmaUnrollThreshold)		if (getUnrolledLoopSize(LoopSize, UP) < PragmaUnrollThreshold)
return false;		return false;
}		}

bool PragmaEnableUnroll = HasUnrollEnablePragma(L);		bool PragmaEnableUnroll = HasUnrollEnablePragma(L);
bool ExplicitUnroll = PragmaCount > 0 || PragmaFullUnroll ||		bool ExplicitUnroll = PragmaCount > 0 || PragmaFullUnroll ||
PragmaEnableUnroll || UserUnrollCount;		PragmaEnableUnroll || UserUnrollCount;

// Check if the runtime trip count is too small when profile is available.
if (L->getHeader()->getParent()->getEntryCount() && TripCount == 0) {
if (auto ProfileTripCount = getLoopEstimatedTripCount(L)) {
if (*ProfileTripCount < FlatLoopTripCountThreshold)
return false;
else
UP.AllowExpensiveTripCount = true;
}
}

if (ExplicitUnroll && TripCount != 0) {		if (ExplicitUnroll && TripCount != 0) {
// If the loop has an unrolling pragma, we want to be more aggressive with		// If the loop has an unrolling pragma, we want to be more aggressive with
// unrolling limits. Set thresholds to at least the PragmaThreshold value		// unrolling limits. Set thresholds to at least the PragmaThreshold value
// which is larger than the default limits.		// which is larger than the default limits.
UP.Threshold = std::max<unsigned>(UP.Threshold, PragmaUnrollThreshold);		UP.Threshold = std::max<unsigned>(UP.Threshold, PragmaUnrollThreshold);
UP.PartialThreshold =		UP.PartialThreshold =
std::max<unsigned>(UP.PartialThreshold, PragmaUnrollThreshold);		std::max<unsigned>(UP.PartialThreshold, PragmaUnrollThreshold);
}		}
▲ Show 20 Lines • Show All 98 Lines • ▼ Show 20 Lines	static bool computeUnrollCount(
if (PragmaFullUnroll)		if (PragmaFullUnroll)
ORE->emit(		ORE->emit(
OptimizationRemarkMissed(DEBUG_TYPE,		OptimizationRemarkMissed(DEBUG_TYPE,
"CantFullUnrollAsDirectedRuntimeTripCount",		"CantFullUnrollAsDirectedRuntimeTripCount",
L->getStartLoc(), L->getHeader())		L->getStartLoc(), L->getHeader())
<< "Unable to fully unroll loop as directed by unroll(full) pragma "		<< "Unable to fully unroll loop as directed by unroll(full) pragma "
"because loop has a runtime trip count.");		"because loop has a runtime trip count.");

// 5th priority is runtime unrolling.		// 5th priority is loop peeling
		computePeelCount(L, LoopSize, UP);
		if (UP.PeelCount) {
		UP.Runtime = false;
		UP.Count = 1;
		return ExplicitUnroll;
		}

		// 6th priority is runtime unrolling.
// Don't unroll a runtime trip count loop when it is disabled.		// Don't unroll a runtime trip count loop when it is disabled.
if (HasRuntimeUnrollDisablePragma(L)) {		if (HasRuntimeUnrollDisablePragma(L)) {
UP.Count = 0;		UP.Count = 0;
return false;		return false;
}		}

		// Check if the runtime trip count is too small when profile is available.
		if (L->getHeader()->getParent()->getEntryCount()) {
		if (auto ProfileTripCount = getLoopEstimatedTripCount(L)) {
		if (*ProfileTripCount < FlatLoopTripCountThreshold)
		return false;
		else
		UP.AllowExpensiveTripCount = true;
		}
		}

// Reduce count based on the type of unrolling and the threshold values.		// Reduce count based on the type of unrolling and the threshold values.
UP.Runtime |= PragmaEnableUnroll || PragmaCount > 0 || UserUnrollCount;		UP.Runtime |= PragmaEnableUnroll || PragmaCount > 0 || UserUnrollCount;
if (!UP.Runtime) {		if (!UP.Runtime) {
DEBUG(dbgs() << " will not try to unroll loop with runtime trip count "		DEBUG(dbgs() << " will not try to unroll loop with runtime trip count "
<< "-unroll-runtime not given\n");		<< "-unroll-runtime not given\n");
UP.Count = 0;		UP.Count = 0;
return false;		return false;
}		}
▲ Show 20 Lines • Show All 142 Lines • ▼ Show 20 Lines	if (!UP.Count)
return false;		return false;
// Unroll factor (Count) must be less or equal to TripCount.		// Unroll factor (Count) must be less or equal to TripCount.
if (TripCount && UP.Count > TripCount)		if (TripCount && UP.Count > TripCount)
UP.Count = TripCount;		UP.Count = TripCount;

// Unroll the loop.		// Unroll the loop.
if (!UnrollLoop(L, UP.Count, TripCount, UP.Force, UP.Runtime,		if (!UnrollLoop(L, UP.Count, TripCount, UP.Force, UP.Runtime,
UP.AllowExpensiveTripCount, UseUpperBound, MaxOrZero,		UP.AllowExpensiveTripCount, UseUpperBound, MaxOrZero,
TripMultiple, LI, SE, &DT, &AC, &ORE, PreserveLCSSA))		TripMultiple, UP.PeelCount, LI, SE, &DT, &AC, &ORE,
		PreserveLCSSA))
return false;		return false;

// If loop has an unroll count pragma or unrolled by explicitly set count		// If loop has an unroll count pragma or unrolled by explicitly set count
// mark loop as unrolled to prevent unrolling beyond that requested.		// mark loop as unrolled to prevent unrolling beyond that requested.
if (IsCountSetExplicitly)		// If the loop was peeled, we already "used up" the profile information
		// we had, so we don't want to unroll or peel again.
		if (IsCountSetExplicitly || UP.PeelCount)
SetLoopAlreadyUnrolled(L);		SetLoopAlreadyUnrolled(L);

return true;		return true;
}		}

namespace {		namespace {
class LoopUnroll : public LoopPass {		class LoopUnroll : public LoopPass {
public:		public:
static char ID; // Pass ID, replacement for typeid		static char ID; // Pass ID, replacement for typeid
LoopUnroll(Optional<unsigned> Threshold = None,		LoopUnroll(Optional<unsigned> Threshold = None,
▲ Show 20 Lines • Show All 116 Lines • Show Last 20 Lines

llvm/trunk/lib/Transforms/Utils/CMakeLists.txt

Show All 20 Lines	add_llvm_library(LLVMTransformUtils
ImportedFunctionsInliningStatistics.cpp		ImportedFunctionsInliningStatistics.cpp
InstructionNamer.cpp		InstructionNamer.cpp
IntegerDivision.cpp		IntegerDivision.cpp
LCSSA.cpp		LCSSA.cpp
LibCallsShrinkWrap.cpp		LibCallsShrinkWrap.cpp
Local.cpp		Local.cpp
LoopSimplify.cpp		LoopSimplify.cpp
LoopUnroll.cpp		LoopUnroll.cpp
		LoopUnrollPeel.cpp
LoopUnrollRuntime.cpp		LoopUnrollRuntime.cpp
LoopUtils.cpp		LoopUtils.cpp
LoopVersioning.cpp		LoopVersioning.cpp
LowerInvoke.cpp		LowerInvoke.cpp
LowerSwitch.cpp		LowerSwitch.cpp
Mem2Reg.cpp		Mem2Reg.cpp
MemorySSA.cpp		MemorySSA.cpp
MetaRenamer.cpp		MetaRenamer.cpp
Show All 24 Lines

llvm/trunk/lib/Transforms/Utils/LoopUnroll.cpp

Show First 20 Lines • Show All 196 Lines • ▼ Show 20 Lines
///		///
/// If AllowRuntime is true then UnrollLoop will consider unrolling loops that		/// If AllowRuntime is true then UnrollLoop will consider unrolling loops that
/// have a runtime (i.e. not compile time constant) trip count. Unrolling these		/// have a runtime (i.e. not compile time constant) trip count. Unrolling these
/// loops require a unroll "prologue" that runs "RuntimeTripCount % Count"		/// loops require a unroll "prologue" that runs "RuntimeTripCount % Count"
/// iterations before branching into the unrolled loop. UnrollLoop will not		/// iterations before branching into the unrolled loop. UnrollLoop will not
/// runtime-unroll the loop if computing RuntimeTripCount will be expensive and		/// runtime-unroll the loop if computing RuntimeTripCount will be expensive and
/// AllowExpensiveTripCount is false.		/// AllowExpensiveTripCount is false.
///		///
		/// If we want to perform PGO-based loop peeling, PeelCount is set to the
		/// number of iterations we want to peel off.
		///
/// The LoopInfo Analysis that is passed will be kept consistent.		/// The LoopInfo Analysis that is passed will be kept consistent.
///		///
/// This utility preserves LoopInfo. It will also preserve ScalarEvolution and		/// This utility preserves LoopInfo. It will also preserve ScalarEvolution and
/// DominatorTree if they are non-null.		/// DominatorTree if they are non-null.
bool llvm::UnrollLoop(Loop *L, unsigned Count, unsigned TripCount, bool Force,		bool llvm::UnrollLoop(Loop *L, unsigned Count, unsigned TripCount, bool Force,
bool AllowRuntime, bool AllowExpensiveTripCount,		bool AllowRuntime, bool AllowExpensiveTripCount,
bool PreserveCondBr, bool PreserveOnlyFirst,		bool PreserveCondBr, bool PreserveOnlyFirst,
unsigned TripMultiple, LoopInfo *LI, ScalarEvolution *SE,		unsigned TripMultiple, unsigned PeelCount, LoopInfo *LI,
DominatorTree *DT, AssumptionCache *AC,		ScalarEvolution *SE, DominatorTree *DT,
OptimizationRemarkEmitter *ORE, bool PreserveLCSSA) {		AssumptionCache *AC, OptimizationRemarkEmitter *ORE,
		bool PreserveLCSSA) {

BasicBlock *Preheader = L->getLoopPreheader();		BasicBlock *Preheader = L->getLoopPreheader();
if (!Preheader) {		if (!Preheader) {
DEBUG(dbgs() << " Can't unroll; loop preheader-insertion failed.\n");		DEBUG(dbgs() << " Can't unroll; loop preheader-insertion failed.\n");
return false;		return false;
}		}

BasicBlock *LatchBlock = L->getLoopLatch();		BasicBlock *LatchBlock = L->getLoopLatch();
if (!LatchBlock) {		if (!LatchBlock) {
Show All 29 Lines	bool llvm::UnrollLoop(Loop *L, unsigned Count, unsigned TripCount, bool Force,
if (TripMultiple != 1)		if (TripMultiple != 1)
DEBUG(dbgs() << " Trip Multiple = " << TripMultiple << "\n");		DEBUG(dbgs() << " Trip Multiple = " << TripMultiple << "\n");

// Effectively "DCE" unrolled iterations that are beyond the tripcount		// Effectively "DCE" unrolled iterations that are beyond the tripcount
// and will never be executed.		// and will never be executed.
if (TripCount != 0 && Count > TripCount)		if (TripCount != 0 && Count > TripCount)
Count = TripCount;		Count = TripCount;

// Don't enter the unroll code if there is nothing to do. This way we don't		// Don't enter the unroll code if there is nothing to do.
// need to support "partial unrolling by 1".		if (TripCount == 0 && Count < 2 && PeelCount == 0)
if (TripCount == 0 && Count < 2)
return false;		return false;

assert(Count > 0);		assert(Count > 0);
assert(TripMultiple > 0);		assert(TripMultiple > 0);
assert(TripCount == 0 || TripCount % TripMultiple == 0);		assert(TripCount == 0 || TripCount % TripMultiple == 0);

// Are we eliminating the loop control altogether?		// Are we eliminating the loop control altogether?
bool CompletelyUnroll = Count == TripCount;		bool CompletelyUnroll = Count == TripCount;
Show All 12 Lines	bool NeedToFixLCSSA = PreserveLCSSA && CompletelyUnroll &&
return isa<PHINode>(BB->begin());		return isa<PHINode>(BB->begin());
});		});

// We assume a run-time trip count if the compiler cannot		// We assume a run-time trip count if the compiler cannot
// figure out the loop trip count and the unroll-runtime		// figure out the loop trip count and the unroll-runtime
// flag is specified.		// flag is specified.
bool RuntimeTripCount = (TripCount == 0 && Count > 0 && AllowRuntime);		bool RuntimeTripCount = (TripCount == 0 && Count > 0 && AllowRuntime);

		assert((!RuntimeTripCount || !PeelCount) &&
		"Did not expect runtime trip-count unrolling "
		"and peeling for the same loop");

		if (PeelCount)
		peelLoop(L, PeelCount, LI, SE, DT, PreserveLCSSA);

// Loops containing convergent instructions must have a count that divides		// Loops containing convergent instructions must have a count that divides
// their TripMultiple.		// their TripMultiple.
DEBUG(		DEBUG(
{		{
bool HasConvergent = false;		bool HasConvergent = false;
for (auto &BB : L->blocks())		for (auto &BB : L->blocks())
for (auto &I : *BB)		for (auto &I : *BB)
if (auto CS = CallSite(&I))		if (auto CS = CallSite(&I))
HasConvergent |= CS.isConvergent();		HasConvergent |= CS.isConvergent();
assert((!HasConvergent || TripMultiple % Count == 0) &&		assert((!HasConvergent || TripMultiple % Count == 0) &&
"Unroll count must divide trip multiple if loop contains a "		"Unroll count must divide trip multiple if loop contains a "
"convergent operation.");		"convergent operation.");
});		});
// Don't output the runtime loop remainder if Count is a multiple of
// TripMultiple. Such a remainder is never needed, and is unsafe if the loop
// contains a convergent instruction.
if (RuntimeTripCount && TripMultiple % Count != 0 &&		if (RuntimeTripCount && TripMultiple % Count != 0 &&
!UnrollRuntimeLoopRemainder(L, Count, AllowExpensiveTripCount,		!UnrollRuntimeLoopRemainder(L, Count, AllowExpensiveTripCount,
UnrollRuntimeEpilog, LI, SE, DT,		UnrollRuntimeEpilog, LI, SE, DT,
PreserveLCSSA)) {		PreserveLCSSA)) {
if (Force)		if (Force)
RuntimeTripCount = false;		RuntimeTripCount = false;
else		else
return false;		return false;
Show All 19 Lines	bool llvm::UnrollLoop(Loop *L, unsigned Count, unsigned TripCount, bool Force,
// Report the unrolling decision.		// Report the unrolling decision.
if (CompletelyUnroll) {		if (CompletelyUnroll) {
DEBUG(dbgs() << "COMPLETELY UNROLLING loop %" << Header->getName()		DEBUG(dbgs() << "COMPLETELY UNROLLING loop %" << Header->getName()
<< " with trip count " << TripCount << "!\n");		<< " with trip count " << TripCount << "!\n");
ORE->emit(OptimizationRemark(DEBUG_TYPE, "FullyUnrolled", L->getStartLoc(),		ORE->emit(OptimizationRemark(DEBUG_TYPE, "FullyUnrolled", L->getStartLoc(),
L->getHeader())		L->getHeader())
<< "completely unrolled loop with "		<< "completely unrolled loop with "
<< NV("UnrollCount", TripCount) << " iterations");		<< NV("UnrollCount", TripCount) << " iterations");
		} else if (PeelCount) {
		DEBUG(dbgs() << "PEELING loop %" << Header->getName()
		<< " with iteration count " << PeelCount << "!\n");
		ORE->emit(OptimizationRemark(DEBUG_TYPE, "Peeled", L->getStartLoc(),
		L->getHeader())
		<< " peeled loop by " << NV("PeelCount", PeelCount)
		<< " iterations");
} else {		} else {
OptimizationRemark Diag(DEBUG_TYPE, "PartialUnrolled", L->getStartLoc(),		OptimizationRemark Diag(DEBUG_TYPE, "PartialUnrolled", L->getStartLoc(),
L->getHeader());		L->getHeader());
Diag << "unrolled loop by a factor of " << NV("UnrollCount", Count);		Diag << "unrolled loop by a factor of " << NV("UnrollCount", Count);

DEBUG(dbgs() << "UNROLLING loop %" << Header->getName()		DEBUG(dbgs() << "UNROLLING loop %" << Header->getName()
<< " by " << Count);		<< " by " << Count);
if (TripMultiple == 0 || BreakoutTrip != TripMultiple) {		if (TripMultiple == 0 || BreakoutTrip != TripMultiple) {
▲ Show 20 Lines • Show All 273 Lines • ▼ Show 20 Lines	bool llvm::UnrollLoop(Loop *L, unsigned Count, unsigned TripCount, bool Force,
// FIXME: We only preserve DT info for complete unrolling now. Incrementally		// FIXME: We only preserve DT info for complete unrolling now. Incrementally
// updating domtree after partial loop unrolling should also be easy.		// updating domtree after partial loop unrolling should also be easy.
if (DT && !CompletelyUnroll)		if (DT && !CompletelyUnroll)
DT->recalculate(*L->getHeader()->getParent());		DT->recalculate(*L->getHeader()->getParent());
else if (DT)		else if (DT)
DEBUG(DT->verifyDomTree());		DEBUG(DT->verifyDomTree());

// Simplify any new induction variables in the partially unrolled loop.		// Simplify any new induction variables in the partially unrolled loop.
if (SE && !CompletelyUnroll) {		if (SE && !CompletelyUnroll && Count > 1) {
SmallVector<WeakVH, 16> DeadInsts;		SmallVector<WeakVH, 16> DeadInsts;
simplifyLoopIVs(L, SE, DT, LI, DeadInsts);		simplifyLoopIVs(L, SE, DT, LI, DeadInsts);

// Aggressively clean up dead instructions that simplifyLoopIVs already		// Aggressively clean up dead instructions that simplifyLoopIVs already
// identified. Any remaining should be cleaned up below.		// identified. Any remaining should be cleaned up below.
while (!DeadInsts.empty())		while (!DeadInsts.empty())
if (Instruction *Inst =		if (Instruction *Inst =
dyn_cast_or_null<Instruction>(&*DeadInsts.pop_back_val()))		dyn_cast_or_null<Instruction>(&*DeadInsts.pop_back_val()))
▲ Show 20 Lines • Show All 99 Lines • Show Last 20 Lines
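
The UnrollLoop changes above call peelLoop whenever PeelCount is non-zero. As a rough source-level picture of what peeling does, the sketch below hand-peels two iterations of a simple summation loop. It is only an illustration under simplifying assumptions: the real transformation works on LLVM IR, and the function names `sumLoop` and `sumPeeled2` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Original loop: sums the first N elements of A.
int sumLoop(const int *A, std::size_t N) {
  int Sum = 0;
  for (std::size_t I = 0; I < N; ++I)
    Sum += A[I];
  return Sum;
}

// Hand-peeled equivalent with a peel count of 2. The first two iterations
// become straight-line code, each still guarded by the exit check, and the
// loop only runs for whatever iterations remain.
int sumPeeled2(const int *A, std::size_t N) {
  assert((A != nullptr || N == 0) && "non-empty input needs a valid pointer");
  int Sum = 0;
  std::size_t I = 0;
  if (I < N) {
    Sum += A[I]; // peeled iteration 0
    ++I;
    if (I < N) {
      Sum += A[I]; // peeled iteration 1
      ++I;
      for (; I < N; ++I) // remaining iterations, if any
        Sum += A[I];
    }
  }
  return Sum;
}
```

Both functions return the same result for every N. The exit checks are not removed, but when N is usually 1 or 2 the hot path never reaches the loop, and the straight-line copies are easier for later passes to simplify.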

llvm/trunk/lib/Transforms/Utils/LoopUnrollPeel.cpp

				//===-- LoopUnrollPeel.cpp - Loop peeling utilities -----------------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file implements some loop unrolling utilities for peeling loops
				// with dynamically inferred (from PGO) trip counts. See LoopUnroll.cpp for
				// unrolling loops with compile-time constant trip counts.
				//
				//===----------------------------------------------------------------------===//

				#include "llvm/ADT/Statistic.h"
				#include "llvm/Analysis/LoopIterator.h"
				#include "llvm/Analysis/LoopPass.h"
				#include "llvm/Analysis/ScalarEvolution.h"
				#include "llvm/Analysis/TargetTransformInfo.h"
				#include "llvm/IR/BasicBlock.h"
				#include "llvm/IR/Dominators.h"
				#include "llvm/IR/MDBuilder.h"
				#include "llvm/IR/Metadata.h"
				#include "llvm/IR/Module.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Support/raw_ostream.h"
				#include "llvm/Transforms/Scalar.h"
				#include "llvm/Transforms/Utils/BasicBlockUtils.h"
				#include "llvm/Transforms/Utils/Cloning.h"
				#include "llvm/Transforms/Utils/LoopUtils.h"
				#include "llvm/Transforms/Utils/UnrollLoop.h"
				#include <algorithm>

				using namespace llvm;

				#define DEBUG_TYPE "loop-unroll"
				STATISTIC(NumPeeled, "Number of loops peeled");

				static cl::opt<unsigned> UnrollPeelMaxCount(
				"unroll-peel-max-count", cl::init(7), cl::Hidden,
				cl::desc("Max average trip count which will cause loop peeling."));

				static cl::opt<unsigned> UnrollForcePeelCount(
				"unroll-force-peel-count", cl::init(0), cl::Hidden,
				cl::desc("Force a peel count regardless of profiling information."));

				// Check whether we are capable of peeling this loop.
				static bool canPeel(Loop *L) {
				// Make sure the loop is in simplified form
				if (!L->isLoopSimplifyForm())
				return false;

				// Only peel loops that contain a single exit
				if (!L->getExitingBlock() || !L->getUniqueExitBlock())
				return false;

				return true;
				}

				// Return the number of iterations we want to peel off.
				void llvm::computePeelCount(Loop *L, unsigned LoopSize,
				TargetTransformInfo::UnrollingPreferences &UP) {
				UP.PeelCount = 0;
				if (!canPeel(L))
				return;

				// Only try to peel innermost loops.
				if (!L->empty())
				return;

				// If the user provided a peel count, use that.
				bool UserPeelCount = UnrollForcePeelCount.getNumOccurrences() > 0;
				if (UserPeelCount) {
				DEBUG(dbgs() << "Force-peeling first " << UnrollForcePeelCount
				<< " iterations.\n");
				UP.PeelCount = UnrollForcePeelCount;
				return;
				}

				// If we don't know the trip count, but have reason to believe the average
				// trip count is low, peeling should be beneficial, since we will usually
				// hit the peeled section.
				// We only do this in the presence of profile information, since otherwise
				// our estimates of the trip count are not reliable enough.
				if (UP.AllowPeeling && L->getHeader()->getParent()->getEntryCount()) {
				Optional<unsigned> PeelCount = getLoopEstimatedTripCount(L);
				if (!PeelCount)
				return;

				DEBUG(dbgs() << "Profile-based estimated trip count is " << *PeelCount
				<< "\n");

				if (*PeelCount) {
				if ((*PeelCount <= UnrollPeelMaxCount) &&
				(LoopSize * (*PeelCount + 1) <= UP.Threshold)) {
				DEBUG(dbgs() << "Peeling first " << *PeelCount << " iterations.\n");
				UP.PeelCount = *PeelCount;
				return;
				}
				DEBUG(dbgs() << "Requested peel count: " << *PeelCount << "\n");
				DEBUG(dbgs() << "Max peel count: " << UnrollPeelMaxCount << "\n");
				DEBUG(dbgs() << "Peel cost: " << LoopSize * (*PeelCount + 1) << "\n");
				DEBUG(dbgs() << "Max peel cost: " << UP.Threshold << "\n");
				}
				}

				return;
				}
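
The profile-based branch of computePeelCount above boils down to two checks on the estimated trip count. The sketch below restates just those checks without any LLVM types. It deliberately leaves out the canPeel test, the innermost-loop test, the forced peel count and the AllowPeeling flag. `PeelLimits` and `choosePeelCount` are hypothetical names, and passing 0 as the estimate stands in for "no usable profile estimate".

```cpp
#include <cassert>

// Hypothetical stand-in for the limits the patch reads from
// UnrollingPreferences and the -unroll-peel-max-count option.
struct PeelLimits {
  unsigned MaxPeelCount; // patch default for -unroll-peel-max-count: 7
  unsigned Threshold;    // patch default for UP.Threshold: 150
};

// Returns how many iterations to peel, or 0 for "do not peel".
// EstimatedTripCount == 0 stands for "no usable profile estimate".
unsigned choosePeelCount(unsigned EstimatedTripCount, unsigned LoopSize,
                         const PeelLimits &Limits) {
  assert(LoopSize > 0 && "a loop body has at least one instruction");
  if (EstimatedTripCount == 0)
    return 0;
  // The estimate must not exceed the configured maximum peel count.
  if (EstimatedTripCount > Limits.MaxPeelCount)
    return 0;
  // Size check from the patch: the peeled copies plus the remaining loop,
  // LoopSize * (PeelCount + 1), must fit within the threshold.
  if (LoopSize * (EstimatedTripCount + 1) > Limits.Threshold)
    return 0;
  return EstimatedTripCount;
}
```

With the default limits, a 10-instruction loop with an estimated trip count of 3 costs 40 and is peeled, while a 50-instruction loop with the same estimate costs 200 and is left alone.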

				/// \brief Update the branch weights of the latch of a peeled-off loop
				/// iteration.
				/// This sets the branch weights for the latch of the recently peeled off loop
				/// iteration correctly.
				/// Our goal is to make sure that:
				/// a) The total weight of all the copies of the loop body is preserved.
				/// b) The total weight of the loop exit is preserved.
				/// c) The body weight is reasonably distributed between the peeled iterations.
				///
				/// \param Header The copy of the header block that belongs to next iteration.
				/// \param LatchBR The copy of the latch branch that belongs to this iteration.
				/// \param IterNumber The serial number of the iteration that was just
				/// peeled off.
				/// \param AvgIters The average number of iterations we expect the loop to have.
				/// \param[in,out] PeeledHeaderWeight The total number of dynamic loop
				/// iterations that are unaccounted for. As an input, it represents the number
				/// of times we expect to enter the header of the iteration currently being
				/// peeled off. The output is the number of times we expect to enter the
				/// header of the next iteration.
				static void updateBranchWeights(BasicBlock *Header, BranchInst *LatchBR,
				unsigned IterNumber, unsigned AvgIters,
				uint64_t &PeeledHeaderWeight) {

				// FIXME: Pick a more realistic distribution.
				// Currently the proportion of weight we assign to the fall-through
				// side of the branch drops linearly with the iteration number, and we use
				// a 0.9 fudge factor to make the drop-off less sharp...
				if (PeeledHeaderWeight) {
				uint64_t FallThruWeight =
				PeeledHeaderWeight * ((float)(AvgIters - IterNumber) / AvgIters * 0.9);
				uint64_t ExitWeight = PeeledHeaderWeight - FallThruWeight;
				PeeledHeaderWeight -= ExitWeight;

				unsigned HeaderIdx = (LatchBR->getSuccessor(0) == Header ? 0 : 1);
				MDBuilder MDB(LatchBR->getContext());
				MDNode *WeightNode =
				HeaderIdx ? MDB.createBranchWeights(ExitWeight, FallThruWeight)
				: MDB.createBranchWeights(FallThruWeight, ExitWeight);
				LatchBR->setMetadata(LLVMContext::MD_prof, WeightNode);
				}
				}

				/// \brief Clones the body of the loop L, putting it between \p InsertTop and \p
				/// InsertBot.
				/// \param IterNumber The serial number of the iteration currently being
				/// peeled off.
				/// \param Exit The exit block of the original loop.
				/// \param[out] NewBlocks A list of the blocks in the newly created clone.
				/// \param[out] VMap The value map between the loop and the new clone.
				/// \param LoopBlocks A helper for DFS-traversal of the loop.
				/// \param LVMap A value-map that maps instructions from the original loop to
				/// instructions in the last peeled-off iteration.
				static void cloneLoopBlocks(Loop *L, unsigned IterNumber, BasicBlock *InsertTop,
				BasicBlock *InsertBot, BasicBlock *Exit,
				SmallVectorImpl<BasicBlock *> &NewBlocks,
				LoopBlocksDFS &LoopBlocks, ValueToValueMapTy &VMap,
				ValueToValueMapTy &LVMap, LoopInfo *LI) {

				BasicBlock *Header = L->getHeader();
				BasicBlock *Latch = L->getLoopLatch();
				BasicBlock *PreHeader = L->getLoopPreheader();

				Function *F = Header->getParent();
				LoopBlocksDFS::RPOIterator BlockBegin = LoopBlocks.beginRPO();
				LoopBlocksDFS::RPOIterator BlockEnd = LoopBlocks.endRPO();
				Loop *ParentLoop = L->getParentLoop();

				// For each block in the original loop, create a new copy,
				// and update the value map with the newly created values.
				for (LoopBlocksDFS::RPOIterator BB = BlockBegin; BB != BlockEnd; ++BB) {
				BasicBlock *NewBB = CloneBasicBlock(*BB, VMap, ".peel", F);
				NewBlocks.push_back(NewBB);

				if (ParentLoop)
				ParentLoop->addBasicBlockToLoop(NewBB, *LI);

				VMap[*BB] = NewBB;
				}

				// Hook-up the control flow for the newly inserted blocks.
				// The new header is hooked up directly to the "top", which is either
				// the original loop preheader (for the first iteration) or the previous
				// iteration's exiting block (for every other iteration)
				InsertTop->getTerminator()->setSuccessor(0, cast<BasicBlock>(VMap[Header]));

				// Similarly, for the latch:
				// The original exiting edge is still hooked up to the loop exit.
				// The backedge now goes to the "bottom", which is either the loop's real
				// header (for the last peeled iteration) or the copied header of the next
				// iteration (for every other iteration)
				BranchInst *LatchBR =
				cast<BranchInst>(cast<BasicBlock>(VMap[Latch])->getTerminator());
				unsigned HeaderIdx = (LatchBR->getSuccessor(0) == Header ? 0 : 1);
				LatchBR->setSuccessor(HeaderIdx, InsertBot);
				LatchBR->setSuccessor(1 - HeaderIdx, Exit);

				// The new copy of the loop body starts with a bunch of PHI nodes
				// that pick an incoming value from either the preheader, or the previous
				// loop iteration. Since this copy is no longer part of the loop, we
				// resolve this statically:
				// For the first iteration, we use the value from the preheader directly.
				// For any other iteration, we replace the phi with the value generated by
				// the immediately preceding clone of the loop body (which represents
				// the previous iteration).
				for (BasicBlock::iterator I = Header->begin(); isa<PHINode>(I); ++I) {
				PHINode *NewPHI = cast<PHINode>(VMap[&*I]);
				if (IterNumber == 0) {
				VMap[&*I] = NewPHI->getIncomingValueForBlock(PreHeader);
				} else {
				Value *LatchVal = NewPHI->getIncomingValueForBlock(Latch);
				Instruction *LatchInst = dyn_cast<Instruction>(LatchVal);
				if (LatchInst && L->contains(LatchInst))
				VMap[&*I] = LVMap[LatchInst];
				else
				VMap[&*I] = LatchVal;
				}
				cast<BasicBlock>(VMap[Header])->getInstList().erase(NewPHI);
				}

				// Fix up the outgoing values - we need to add a value for the iteration
				// we've just created. Note that this must happen after the incoming
				// values are adjusted, since the value going out of the latch may also be
				// a value coming into the header.
				for (BasicBlock::iterator I = Exit->begin(); isa<PHINode>(I); ++I) {
				PHINode *PHI = cast<PHINode>(I);
				Value *LatchVal = PHI->getIncomingValueForBlock(Latch);
				Instruction *LatchInst = dyn_cast<Instruction>(LatchVal);
				if (LatchInst && L->contains(LatchInst))
				LatchVal = VMap[LatchVal];
				PHI->addIncoming(LatchVal, cast<BasicBlock>(VMap[Latch]));
				}

				// LastValueMap is updated with the values for the current loop
				// which are used the next time this function is called.
				for (const auto &KV : VMap)
				LVMap[KV.first] = KV.second;
				}

				/// \brief Peel off the first \p PeelCount iterations of loop \p L.
				///
				/// Note that this does not peel them off as a single straight-line block.
				/// Rather, each iteration is peeled off separately, and needs to check the
				/// exit condition.
				/// For loops that dynamically execute \p PeelCount iterations or less
				/// this provides a benefit, since the peeled off iterations, which account
				/// for the bulk of dynamic execution, can be further simplified by scalar
				/// optimizations.
				bool llvm::peelLoop(Loop *L, unsigned PeelCount, LoopInfo *LI,
				ScalarEvolution *SE, DominatorTree *DT,
				bool PreserveLCSSA) {
				if (!canPeel(L))
				return false;

				LoopBlocksDFS LoopBlocks(L);
				LoopBlocks.perform(LI);

				BasicBlock *Header = L->getHeader();
				BasicBlock *PreHeader = L->getLoopPreheader();
				BasicBlock *Latch = L->getLoopLatch();
				BasicBlock *Exit = L->getUniqueExitBlock();

				Function *F = Header->getParent();

				// Set up all the necessary basic blocks. It is convenient to split the
				// preheader into 3 parts - two blocks to anchor the peeled copy of the loop
				// body, and a new preheader for the "real" loop.

				// Peeling the first iteration transforms
				//
				// PreHeader:
				// ...
				// Header:
				// LoopBody
				// If (cond) goto Header
				// Exit:
				//
				// into
				//
				// InsertTop:
				// LoopBody
				// If (!cond) goto Exit
				// InsertBot:
				// NewPreHeader:
				// ...
				// Header:
				// LoopBody
				// If (cond) goto Header
				// Exit:
				//
				// Each following iteration will split the current bottom anchor in two,
				// and put the new copy of the loop body between these two blocks. That is,
				// after peeling another iteration from the example above, we'll split
				// InsertBot, and get:
				//
				// InsertTop:
				// LoopBody
				// If (!cond) goto Exit
				// InsertBot:
				// LoopBody
				// If (!cond) goto Exit
				// InsertBot.next:
				// NewPreHeader:
				// ...
				// Header:
				// LoopBody
				// If (cond) goto Header
				// Exit:

				BasicBlock *InsertTop = SplitEdge(PreHeader, Header, DT, LI);
				BasicBlock *InsertBot =
				SplitBlock(InsertTop, InsertTop->getTerminator(), DT, LI);
				BasicBlock *NewPreHeader =
				SplitBlock(InsertBot, InsertBot->getTerminator(), DT, LI);

				InsertTop->setName(Header->getName() + ".peel.begin");
				InsertBot->setName(Header->getName() + ".peel.next");
				NewPreHeader->setName(PreHeader->getName() + ".peel.newph");

				ValueToValueMapTy LVMap;

				// If we have branch weight information, we'll want to update it for the
				// newly created branches.
				BranchInst *LatchBR =
				cast<BranchInst>(cast<BasicBlock>(Latch)->getTerminator());
				unsigned HeaderIdx = (LatchBR->getSuccessor(0) == Header ? 0 : 1);

				uint64_t TrueWeight, FalseWeight;
				uint64_t ExitWeight = 0, BackEdgeWeight = 0;
				if (LatchBR->extractProfMetadata(TrueWeight, FalseWeight)) {
				ExitWeight = HeaderIdx ? TrueWeight : FalseWeight;
				BackEdgeWeight = HeaderIdx ? FalseWeight : TrueWeight;
				}

				// For each peeled-off iteration, make a copy of the loop.
				for (unsigned Iter = 0; Iter < PeelCount; ++Iter) {
				SmallVector<BasicBlock *, 8> NewBlocks;
				ValueToValueMapTy VMap;

				// The exit weight of the previous iteration is the header entry weight
				// of the current iteration. So this is exactly how many dynamic iterations
				// the current peeled-off static iteration uses up.
				// FIXME: due to the way the distribution is constructed, we need a
				// guard here to make sure we don't end up with non-positive weights.
				if (ExitWeight < BackEdgeWeight)
				BackEdgeWeight -= ExitWeight;
				else
				BackEdgeWeight = 1;
				[Inline comment] trentxintong: @mkuper any particular reasons why the backedge weight has to be 1 instead of 0?
				[Inline comment] mkuper (author): There was some concern (on IRC, I don't remember who expressed it, unfortunately) that other things that look at profile information will not be happy with 0 weights, since clang doesn't normally generate them (the whole "Laplace's Rule of Succession" adjustment). I haven't actually tested it, but it didn't seem like it should matter much - if we get here, the distribution isn't that realistic to begin with. So I preferred to play it safe.

				cloneLoopBlocks(L, Iter, InsertTop, InsertBot, Exit,
				NewBlocks, LoopBlocks, VMap, LVMap, LI);
				updateBranchWeights(InsertBot, cast<BranchInst>(VMap[LatchBR]), Iter,
				PeelCount, ExitWeight);

				InsertTop = InsertBot;
				InsertBot = SplitBlock(InsertBot, InsertBot->getTerminator(), DT, LI);
				InsertBot->setName(Header->getName() + ".peel.next");

				F->getBasicBlockList().splice(InsertTop->getIterator(),
				F->getBasicBlockList(),
				NewBlocks[0]->getIterator(), F->end());

				// Remap to use values from the current iteration instead of the
				// previous one.
				remapInstructionsInBlocks(NewBlocks, VMap);
				}

				// Now adjust the phi nodes in the loop header to get their initial values
				// from the last peeled-off iteration instead of the preheader.
				for (BasicBlock::iterator I = Header->begin(); isa<PHINode>(I); ++I) {
				PHINode *PHI = cast<PHINode>(I);
				Value *NewVal = PHI->getIncomingValueForBlock(Latch);
				Instruction *LatchInst = dyn_cast<Instruction>(NewVal);
				if (LatchInst && L->contains(LatchInst))
				NewVal = LVMap[LatchInst];

				PHI->setIncomingValue(PHI->getBasicBlockIndex(NewPreHeader), NewVal);
				}

				// Adjust the branch weights on the loop exit.
				if (ExitWeight) {
				MDBuilder MDB(LatchBR->getContext());
				MDNode *WeightNode =
				HeaderIdx ? MDB.createBranchWeights(ExitWeight, BackEdgeWeight)
				: MDB.createBranchWeights(BackEdgeWeight, ExitWeight);
				LatchBR->setMetadata(LLVMContext::MD_prof, WeightNode);
				}

				// If the loop is nested, we changed the parent loop, update SE.
				if (Loop *ParentLoop = L->getParentLoop())
				SE->forgetLoop(ParentLoop);

				NumPeeled++;

				return true;
				}
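
The decision made by computePeelCount at the top of this file can be sketched outside of LLVM as follows. This is a minimal illustration, not the patch's code: `PeelLimits` and `choosePeelCount` are hypothetical names, the limits stand in for `UnrollPeelMaxCount` and `UP.Threshold`, and the cost is computed in 64 bits here to avoid overflow, which the original does not do.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-ins for UnrollPeelMaxCount and UP.Threshold.
struct PeelLimits {
  unsigned MaxCount;
  unsigned Threshold;
};

// Returns the number of iterations to peel, or 0 if peeling is rejected.
// EstimatedTripCount plays the role of getLoopEstimatedTripCount(L).
unsigned choosePeelCount(std::optional<unsigned> EstimatedTripCount,
                         unsigned LoopSize, const PeelLimits &Limits) {
  // No profile data, or an estimate of zero: nothing to peel.
  if (!EstimatedTripCount || *EstimatedTripCount == 0)
    return 0;

  unsigned Count = *EstimatedTripCount;
  if (Count > Limits.MaxCount)
    return 0;

  // Peeling Count iterations leaves Count copies of the body plus the
  // original loop, hence the (Count + 1) factor in the size estimate.
  uint64_t Cost = static_cast<uint64_t>(LoopSize) * (Count + 1);
  return Cost <= Limits.Threshold ? Count : 0;
}
```

With a hypothetical limit of 7 iterations and a threshold of 150, a loop of size 10 with an estimated trip count of 3 costs 40 and is peeled, while a loop of size 50 costs 200 and is left alone.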

llvm/trunk/lib/Transforms/Utils/LoopUtils.cpp

 Optional<unsigned> llvm::getLoopEstimatedTripCount(Loop *L) {
   [...]

   // To estimate the number of times the loop body was executed, we want to
   // know the number of times the backedge was taken, vs. the number of times
   // we exited the loop.
   // The branch weights give us almost what we want, since they were adjusted
   // from the raw counts to provide a better probability estimate. Remove
   // the adjustment by subtracting 1 from both weights.
   uint64_t TrueVal, FalseVal;
-  if (!LatchBR->extractProfMetadata(TrueVal, FalseVal) || (TrueVal <= 1) ||
-      (FalseVal <= 1))
+  if (!LatchBR->extractProfMetadata(TrueVal, FalseVal))
     return None;

-  TrueVal -= 1;
-  FalseVal -= 1;
+  if (!TrueVal || !FalseVal)
+    return 0;

-  // Divide the count of the backedge by the count of the edge exiting the loop.
+  // Divide the count of the backedge by the count of the edge exiting the loop,
+  // rounding to nearest.
   if (LatchBR->getSuccessor(0) == L->getHeader())
-    return TrueVal / FalseVal;
+    return (TrueVal + (FalseVal / 2)) / FalseVal;
   else
-    return FalseVal / TrueVal;
+    return (FalseVal + (TrueVal / 2)) / TrueVal;
 }
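
The updated estimate can be sketched as a standalone function. This is only an illustration under assumptions: `LatchWeights` and `estimateTripCount` are hypothetical names, and the weights are passed in already sorted into backedge and exit, whereas the real code picks them by checking which latch successor is the header.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical container for the two latch branch weights.
struct LatchWeights {
  uint64_t BackEdge; // weight of the edge back to the loop header
  uint64_t Exit;     // weight of the edge leaving the loop
};

// Mirrors the new logic: no metadata gives "unknown", a zero weight gives an
// estimate of 0, otherwise backedge / exit rounded to nearest.
std::optional<unsigned>
estimateTripCount(const std::optional<LatchWeights> &W) {
  if (!W)
    return std::nullopt;
  if (W->BackEdge == 0 || W->Exit == 0)
    return 0u;
  // Adding half the divisor before dividing rounds to nearest instead of
  // truncating.
  return static_cast<unsigned>((W->BackEdge + W->Exit / 2) / W->Exit);
}
```

For the weights used in peel-loop-pgo.ll (3001 and 1001) this gives 3. For weights 5 and 2 it gives 3, where plain truncating division would give 2.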

llvm/trunk/test/Transforms/LoopUnroll/peel-loop-pgo.ll

				; RUN: opt < %s -S -debug-only=loop-unroll -loop-unroll -unroll-allow-peeling 2>&1 \| FileCheck %s
				; REQUIRES: asserts

				; Make sure we use the profile information correctly to peel-off 3 iterations
				; from the loop, and update the branch weights for the peeled loop properly.
				; CHECK: PEELING loop %for.body with iteration count 3!
				; CHECK-LABEL: @basic
				; CHECK: br i1 %{{.*}}, label %[[NEXT0:.*]], label %for.cond.for.end_crit_edge, !prof !1
				; CHECK: [[NEXT0]]:
				; CHECK: br i1 %{{.*}}, label %[[NEXT1:.*]], label %for.cond.for.end_crit_edge, !prof !2
				; CHECK: [[NEXT1]]:
				; CHECK: br i1 %{{.*}}, label %[[NEXT2:.*]], label %for.cond.for.end_crit_edge, !prof !3
				; CHECK: [[NEXT2]]:
				; CHECK: br i1 %{{.*}}, label %for.body, label %{{.*}}, !prof !4

				define void @basic(i32* %p, i32 %k) #0 !prof !0 {
				entry:
				%cmp3 = icmp slt i32 0, %k
				br i1 %cmp3, label %for.body.lr.ph, label %for.end

				for.body.lr.ph: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.lr.ph, %for.body
				%i.05 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.body ]
				%p.addr.04 = phi i32* [ %p, %for.body.lr.ph ], [ %incdec.ptr, %for.body ]
				%incdec.ptr = getelementptr inbounds i32, i32* %p.addr.04, i32 1
				store i32 %i.05, i32* %p.addr.04, align 4
				%inc = add nsw i32 %i.05, 1
				%cmp = icmp slt i32 %inc, %k
				br i1 %cmp, label %for.body, label %for.cond.for.end_crit_edge, !prof !1

				for.cond.for.end_crit_edge: ; preds = %for.body
				br label %for.end

				for.end: ; preds = %for.cond.for.end_crit_edge, %entry
				ret void
				}

				!0 = !{!"function_entry_count", i64 1}
				!1 = !{!"branch_weights", i32 3001, i32 1001}

				;CHECK: !1 = !{!"branch_weights", i32 900, i32 101}
				;CHECK: !2 = !{!"branch_weights", i32 540, i32 360}
				;CHECK: !3 = !{!"branch_weights", i32 162, i32 378}
				;CHECK: !4 = !{!"branch_weights", i32 560, i32 162}
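
The expected weights in the CHECK lines above can be reproduced with a small standalone model of the arithmetic in updateBranchWeights and the peeling loop in peelLoop. This is a sketch under assumptions: `PeeledWeights` and `distributeWeights` are hypothetical names, no LLVM types are involved, and the ratio is stored in a `float` variable so the rounding matches the `(float)` cast in the patch on typical targets.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct PeeledWeights {
  // One (fall-through, exit) pair per peeled iteration.
  std::vector<std::pair<uint64_t, uint64_t>> Peeled;
  uint64_t LoopBackEdge; // final backedge weight of the remaining loop
  uint64_t LoopExit;     // final exit weight of the remaining loop
};

PeeledWeights distributeWeights(uint64_t BackEdgeWeight, uint64_t ExitWeight,
                                unsigned PeelCount) {
  PeeledWeights R;
  for (unsigned Iter = 0; Iter < PeelCount; ++Iter) {
    // Each peeled copy uses up ExitWeight dynamic iterations; clamp to 1 so
    // the loop never ends up with a zero backedge weight.
    if (ExitWeight < BackEdgeWeight)
      BackEdgeWeight -= ExitWeight;
    else
      BackEdgeWeight = 1;

    if (ExitWeight) {
      // Linear drop-off with the iteration number, scaled by the 0.9 fudge
      // factor, truncated to an integer weight.
      float Ratio = static_cast<float>(PeelCount - Iter) / PeelCount;
      uint64_t FallThru = static_cast<uint64_t>(ExitWeight * (Ratio * 0.9));
      uint64_t IterExit = ExitWeight - FallThru;
      ExitWeight -= IterExit; // header weight of the next peeled copy
      R.Peeled.push_back({FallThru, IterExit});
    }
  }
  R.LoopBackEdge = BackEdgeWeight;
  R.LoopExit = ExitWeight;
  return R;
}
```

Calling `distributeWeights(3001, 1001, 3)` yields the pairs (900, 101), (540, 360) and (162, 378) for the peeled copies and (560, 162) for the remaining loop, which are the values checked as !1 through !4.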

llvm/trunk/test/Transforms/LoopUnroll/peel-loop.ll

				; RUN: opt < %s -S -loop-unroll -unroll-force-peel-count=3 -simplifycfg -instcombine \| FileCheck %s

				; Basic loop peeling - check that we can peel-off the first 3 loop iterations
				; when explicitly requested.
				; CHECK-LABEL: @basic
				; CHECK: %[[CMP0:.*]] = icmp sgt i32 %k, 0
				; CHECK: br i1 %[[CMP0]], label %[[NEXT0:.*]], label %for.end
				; CHECK: [[NEXT0]]:
				; CHECK: store i32 0, i32* %p, align 4
				; CHECK: %[[CMP1:.*]] = icmp eq i32 %k, 1
				; CHECK: br i1 %[[CMP1]], label %for.end, label %[[NEXT1:.*]]
				; CHECK: [[NEXT1]]:
				; CHECK: %[[INC1:.*]] = getelementptr inbounds i32, i32* %p, i64 1
				; CHECK: store i32 1, i32* %[[INC1]], align 4
				; CHECK: %[[CMP2:.*]] = icmp sgt i32 %k, 2
				; CHECK: br i1 %[[CMP2]], label %[[NEXT2:.*]], label %for.end
				; CHECK: [[NEXT2]]:
				; CHECK: %[[INC2:.*]] = getelementptr inbounds i32, i32* %p, i64 2
				; CHECK: store i32 2, i32* %[[INC2]], align 4
				; CHECK: %[[CMP3:.*]] = icmp eq i32 %k, 3
				; CHECK: br i1 %[[CMP3]], label %for.end, label %[[LOOP:.*]]
				; CHECK: [[LOOP]]:
				; CHECK: %[[IV:.*]] = phi i32 [ {{.*}}, %[[LOOP]] ], [ 3, %[[NEXT2]] ]

				define void @basic(i32* %p, i32 %k) #0 {
				entry:
				%cmp3 = icmp slt i32 0, %k
				br i1 %cmp3, label %for.body.lr.ph, label %for.end

				for.body.lr.ph: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.lr.ph, %for.body
				%i.05 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.body ]
				%p.addr.04 = phi i32* [ %p, %for.body.lr.ph ], [ %incdec.ptr, %for.body ]
				%incdec.ptr = getelementptr inbounds i32, i32* %p.addr.04, i32 1
				store i32 %i.05, i32* %p.addr.04, align 4
				%inc = add nsw i32 %i.05, 1
				%cmp = icmp slt i32 %inc, %k
				br i1 %cmp, label %for.body, label %for.cond.for.end_crit_edge

				for.cond.for.end_crit_edge: ; preds = %for.body
				br label %for.end

				for.end: ; preds = %for.cond.for.end_crit_edge, %entry
				ret void
				}

				; Make sure peeling works correctly when a value defined in a loop is used
				; in later code - we need to correctly plumb the phi depending on which
				; iteration is actually used.
				; CHECK-LABEL: @output
				; CHECK: %[[CMP0:.*]] = icmp sgt i32 %k, 0
				; CHECK: br i1 %[[CMP0]], label %[[NEXT0:.*]], label %for.end
				; CHECK: [[NEXT0]]:
				; CHECK: store i32 0, i32* %p, align 4
				; CHECK: %[[CMP1:.*]] = icmp eq i32 %k, 1
				; CHECK: br i1 %[[CMP1]], label %for.end, label %[[NEXT1:.*]]
				; CHECK: [[NEXT1]]:
				; CHECK: %[[INC1:.*]] = getelementptr inbounds i32, i32* %p, i64 1
				; CHECK: store i32 1, i32* %[[INC1]], align 4
				; CHECK: %[[CMP2:.*]] = icmp sgt i32 %k, 2
				; CHECK: br i1 %[[CMP2]], label %[[NEXT2:.*]], label %for.end
				; CHECK: [[NEXT2]]:
				; CHECK: %[[INC2:.*]] = getelementptr inbounds i32, i32* %p, i64 2
				; CHECK: store i32 2, i32* %[[INC2]], align 4
				; CHECK: %[[CMP3:.*]] = icmp eq i32 %k, 3
				; CHECK: br i1 %[[CMP3]], label %for.end, label %[[LOOP:.*]]
				; CHECK: [[LOOP]]:
				; CHECK: %[[IV:.*]] = phi i32 [ %[[IV:.*]], %[[LOOP]] ], [ 3, %[[NEXT2]] ]
				; CHECK: %ret = phi i32 [ 0, %entry ], [ 1, %[[NEXT0]] ], [ 2, %[[NEXT1]] ], [ 3, %[[NEXT2]] ], [ %[[IV]], %[[LOOP]] ]
				; CHECK: ret i32 %ret
				define i32 @output(i32* %p, i32 %k) #0 {
				entry:
				%cmp3 = icmp slt i32 0, %k
				br i1 %cmp3, label %for.body.lr.ph, label %for.end

				for.body.lr.ph: ; preds = %entry
				br label %for.body

				for.body: ; preds = %for.body.lr.ph, %for.body
				%i.05 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.body ]
				%p.addr.04 = phi i32* [ %p, %for.body.lr.ph ], [ %incdec.ptr, %for.body ]
				%incdec.ptr = getelementptr inbounds i32, i32* %p.addr.04, i32 1
				store i32 %i.05, i32* %p.addr.04, align 4
				%inc = add nsw i32 %i.05, 1
				%cmp = icmp slt i32 %inc, %k
				br i1 %cmp, label %for.body, label %for.cond.for.end_crit_edge

				for.cond.for.end_crit_edge: ; preds = %for.body
				br label %for.end

				for.end: ; preds = %for.cond.for.end_crit_edge, %entry
				%ret = phi i32 [ 0, %entry], [ %inc, %for.cond.for.end_crit_edge ]
				ret i32 %ret
				}
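
At source level, the shape that the @output CHECK lines describe can be approximated as below. This is a hand-written C++ analogue, not compiler output: `runOriginal`, `runPeeled` and `peelingPreservesBehavior` are hypothetical names, and the buffer is a fixed-size vector chosen only for the comparison.

```cpp
#include <cassert>
#include <vector>

// Analogue of @output before peeling: store i into p[i] for i in [0, K) and
// return the final induction value (0 if the loop is never entered).
int runOriginal(std::vector<int> &P, int K) {
  int Ret = 0;
  if (0 < K) {
    int I = 0;
    do {
      P[I] = I;
      ++I;
    } while (I < K);
    Ret = I;
  }
  return Ret;
}

// Analogue after peeling three iterations: each peeled copy stores a constant
// and still checks the exit condition before falling through to the next.
int runPeeled(std::vector<int> &P, int K) {
  if (K <= 0)
    return 0;
  P[0] = 0;
  if (K == 1)
    return 1;
  P[1] = 1;
  if (K <= 2)
    return 2;
  P[2] = 2;
  if (K == 3)
    return 3;
  int I = 3; // the remaining loop starts at the fourth iteration
  do {
    P[I] = I;
    ++I;
  } while (I < K);
  return I;
}

// Runs both versions on fresh buffers and compares results and stores.
bool peelingPreservesBehavior(int K) {
  assert(K <= 8 && "buffer below holds at most 8 elements");
  std::vector<int> A(8, -1), B(8, -1);
  return runOriginal(A, K) == runPeeled(B, K) && A == B;
}
```

The constants 1, 2 and 3 returned from the peeled copies correspond to the incoming values of the `%ret` phi in the CHECK lines, and the constant stores correspond to `store i32 0`, `store i32 1` and `store i32 2`.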

Revision Contents

Diff 79803

llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h

llvm/trunk/include/llvm/Transforms/Utils/UnrollLoop.h

llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp

llvm/trunk/lib/Transforms/Utils/CMakeLists.txt

llvm/trunk/lib/Transforms/Utils/LoopUnroll.cpp

llvm/trunk/lib/Transforms/Utils/LoopUnrollPeel.cpp

llvm/trunk/lib/Transforms/Utils/LoopUtils.cpp

llvm/trunk/test/Transforms/LoopUnroll/peel-loop-pgo.ll

llvm/trunk/test/Transforms/LoopUnroll/peel-loop.ll
