This is an archive of the discontinued LLVM Phabricator instance.

Differential D28368

Increases full-unroll threshold.
ClosedPublic

Authored by danielcdh on Jan 5 2017, 10:55 AM.

Details

Reviewers

chandlerc
mzolotukhin
davidxl
mkuper
hfinkel

Commits

rG7d230325efbd: Increases full-unroll threshold.
rL295538: Increases full-unroll threshold.

Summary

The default threshold for full unrolling is too conservative. This patch doubles the full-unroll threshold.

This change will affect the following speccpu2006 benchmarks (performance numbers were collected from Intel Sandybridge):

Performance:

403 0.11%
433 0.51%
445 0.48%
447 3.50%
453 1.49%
464 0.75%

Code size:

403 0.56%
433 0.96%
445 2.16%
447 2.96%
453 0.94%
464 8.02%

The compile-time overhead is similar to the code-size increase.

Diff Detail

Build Status

Buildable 4105
Build 4105: arc lint + arc unit

Event Timeline

danielcdh updated this revision to Diff 83274.Jan 5 2017, 10:55 AM

danielcdh retitled this revision to Give higher full-unroll boosting when the loop iteration is small.

danielcdh updated this object.

danielcdh added reviewers: mzolotukhin, davidxl, mkuper.

danielcdh added a subscriber: llvm-commits.

So 464.h264ref is getting a bit larger and faster or slower?

In D28368#637192, @hfinkel wrote:

So 464.h264ref is getting a bit larger and faster or slower?

Yes, it's 3% larger (text size). The performance difference is in noise range (it disappears in a different run).

In D28368#637335, @danielcdh wrote:

In D28368#637192, @hfinkel wrote:

So 464.h264ref is getting a bit larger and faster or slower?

Yes, it's 3% larger (text size). The performance difference is in noise range (it disappears in a different run).

Is that a concern, or is it just a small benchmark?

haicheng added a subscriber: haicheng.Jan 5 2017, 2:01 PM

I don't think it is a concern.

it is a small benchmark (1.2M text size)
there are a lot of countable loops in the code, thus it increased the most
the increased size does not change performance
the size will not change if optForSize() is true

In D28368#637452, @danielcdh wrote:

I don't think it is a concern.

it is a small benchmark (1.2M text size)

there are a lot of countable loops in the code, thus it increased the most

the increased size does not change performance

the size will not change if optForSize() is true

SGTM

lib/Transforms/Scalar/LoopUnrollPass.cpp
678	fully unroll -> fully unrolling
684	Can you please put the magic numbers here into cl::opt flags so that it is easy to experiment with tuning them later?

Hi Dehao,

Does your change impact the inline decision of 464.h264ref? If so, would you please let me know the functions that are no longer inlined?

Haicheng

Hi,

I don't think that's the right approach. We already have a lot of thresholds and I'd prefer to avoid adding another one as much as possible.

In this particular case it also seems that we can express the same 'bonus' with existing thresholds. There is unroll-max-iteration-count-to-analyze, which you might want to play with: if your loops have higher trip counts than this threshold, we don't try to predict the benefits of unrolling and behave conservatively. Increasing this threshold should help unroll more loops. If for some reason you don't want to change this parameter, you can get almost the same effect as you propose by just bumping up unroll-threshold.

Please also note that we already take into account the fact that unrolling removes branches: see getUnrolledLoopSize.

Also, please provide results of compile time testing for such changes.

Michael

In D28368#637473, @haicheng wrote:

Hi Dehao,

Does your change impact the inline decision of 464.h264ref? If so, would you please let me know the functions that are no longer inlined?

Haicheng

The inline decision of 464.h264ref did not change with this patch.

In D28368#637499, @mzolotukhin wrote:

Hi,

I don't think that's the right approach. We already have a lot of thresholds and I'd prefer to avoid adding another one as much as possible.

Sorry that I don't seem to quite follow the comment. We did not add a new threshold in this patch, or am I missing something? Can you help clarify?

In this particular case it also seems that we can express the same 'bonus' with existing thresholds. There is unroll-max-iteration-count-to-analyze, which you might want to play with: if your loops have higher trip counts than this threshold, we don't try to predict the benefits of unrolling and behave conservatively. Increasing this threshold should help unroll more loops. If for some reason you don't want to change this parameter, you can get almost the same effect as you propose by just bumping up unroll-threshold.

Please also note that we already take into account the fact that unrolling removes branches: see getUnrolledLoopSize.

Yes and we model that in the original getFullUnrollBoostingFactor. This patch tried to model the other benefit: expanding optimization scope.

Also, please provide results of compile time testing for such changes.

Michael

danielcdh updated this object.Jan 5 2017, 4:09 PM

danielcdh edited edge metadata.

update

In D28368#637541, @danielcdh wrote:

In D28368#637499, @mzolotukhin wrote:

Hi,

I don't think that's the right approach. We already have a lot of thresholds and I'd prefer to avoid adding another one as much as possible.

Sorry that I don't seem to quite follow the comment. We did not add a new threshold in this patch, or am I missing something? Can you help clarify?

Ahh, do you mean that we should not add a new option to tune this boosting factor? (Thanks Michael (mkuper@) for pointing this out).

If that's the case, I've updated the patch to make it tunable with unroll-max-iteration-count-to-analyze. PTAL.

Thanks,
Dehao

In this particular case it also seems that we can express the same 'bonus' with existing thresholds. There is unroll-max-iteration-count-to-analyze, which you might want to play with: if your loops have higher trip counts than this threshold, we don't try to predict the benefits of unrolling and behave conservatively. Increasing this threshold should help unroll more loops. If for some reason you don't want to change this parameter, you can get almost the same effect as you propose by just bumping up unroll-threshold.

Please also note that we already take into account the fact that unrolling removes branches: see getUnrolledLoopSize.

Yes and we model that in the original getFullUnrollBoostingFactor. This patch tried to model the other benefit: expanding optimization scope.

Also, please provide results of compile time testing for such changes.

Michael

lib/Transforms/Scalar/LoopUnrollPass.cpp
684	Updated to make it tunable with unroll-max-iteration-count-to-analyze

Sorry that I don't seem to quite follow the comment. We did not add a new threshold in this patch, or am I missing something? Can you help clarify?

Oh, my apologies. I implicitly agreed with what @hfinkel suggested and implied that we use a cl::opt parameter instead of the magic number 400.

This patch tried to model the other benefit: expanding optimization scope.

Ok, I understand the idea. However, it looks too vague to model, and I don't see a relation with tripcounts then. If a loop has a huge tripcount, and we completely unroll it, we'll get a huge single basic block with tons of optimization opportunities, right? How is it different for a loop with a smaller tripcount? Why does it depend on the tripcount at all, and why do we model it as a 400/TripCount?

I think that we either need to be more explicit about what benefits we expect from unrolling (see e.g. what LoopUnrollAnalyzer does), or just use generic thresholds like unroll-threshold.

Michael

In D28368#637564, @mzolotukhin wrote:

Sorry that I don't seem to quite follow the comment. We did not add a new threshold in this patch, or am I missing something? Can you help clarify?

Oh, my apologies. I implicitly agreed with what @hfinkel suggested and implied that we use a cl::opt parameter instead of the magic number 400.

This patch tried to model the other benefit: expanding optimization scope.

Ok, I understand the idea. However, it looks too vague to model, and I don't see a relation with tripcounts then. If a loop has a huge tripcount, and we completely unroll it, we'll get a huge single basic block with tons of optimization opportunities, right? How is it different for a loop with a smaller tripcount? Why does it depend on the tripcount at all, and why do we model it as a 400/TripCount?

I see your point. Yes, the optimization scope is not related to trip count, and it is confusing to relate one to the other.

The real motivation for this patch is to boost the threshold for full unrolling so that we can realize the performance benefits in our benchmarks. The initial thought was to simply boost unroll-threshold by a minimum of 2X for full unrolling (which is enough to realize the performance gain). But I thought this might be hard to get accepted upstream, as it seems too brute-force. So I thought of integrating trip_count into the model to limit the "relative code size increase".

Anyway, I think the patch needs to be updated to make it accurate. We could either use a constant minimum boosting factor, or still use the trip_count to limit "relative code size increase" but update the comment to make it accurate.

Any suggestions?

Thanks,
Dehao

I think that we either need to be more explicit about what benefits we expect from unrolling (see e.g. what LoopUnrollAnalyzer does), or just use generic thresholds like unroll-threshold.

Michael

The real motivation for this patch is to boost the threshold for full unrolling so that we can realize the performance benefits in our benchmarks. The initial thought was to simply boost unroll-threshold by a minimum of 2X for full unrolling (which is enough to realize the performance gain). But I thought this might be hard to get accepted upstream, as it seems too brute-force. So I thought of integrating trip_count into the model to limit the "relative code size increase".

I see. Just doubling the threshold indeed will be hard to upstream, and reasonably so. While it improves performance on your benchmarks, it also increases code-size/compile-time on many other tests. IMHO the ideal solution here would be to understand what differentiates loops in your benchmarks from other loops, then estimate how popular such cases are, and if they're popular enough, teach llvm to recognize them (i.e. see benefits of unrolling such loops). This way we'll only increase code-size/compile time in cases where we gain performance, and lose nothing in other cases.

Anyway, I think the patch needs to be updated to make it accurate. We could either use a constant minimum boosting factor, or still use the trip_count to limit "relative code size increase" but update the comment to make it accurate.

Any suggestions?

In D28368#637684, @mzolotukhin wrote:

The real motivation for this patch is to boost the threshold for full unrolling so that we can realize the performance benefits in our benchmarks. The initial thought was to simply boost unroll-threshold by a minimum of 2X for full unrolling (which is enough to realize the performance gain). But I thought this might be hard to get accepted upstream, as it seems too brute-force. So I thought of integrating trip_count into the model to limit the "relative code size increase".

I see. Just doubling the threshold indeed will be hard to upstream, and reasonably so. While it improves performance on your benchmarks, it also increases code-size/compile-time on many other tests. IMHO the ideal solution here would be to understand what differentiates loops in your benchmarks from other loops, then estimate how popular such cases are, and if they're popular enough, teach llvm to recognize them (i.e. see benefits of unrolling such loops). This way we'll only increase code-size/compile time in cases where we gain performance, and lose nothing in other cases.

I compared the profile between our internal benchmark and 464.h264ref, the only difference is that the fully unrolled loop showed high up in the profile of our benchmark, while the fully unrolled loop is cold in 464.h264ref, thus it has no performance impact.

We could use profile info to allow more aggressive threshold only for hot loops, so that code size increase can be avoided. But this requires BFI within the loop pass, which will be expensive (compile time overhead).

OTOH, if the heuristic is generally helping performance, I guess it would be worth trading off 3% code size/compile time for potentially better performance?

Thanks,
Dehao

Anyway, I think the patch needs to be updated to make it accurate. We could either use a constant minimum boosting factor, or still use the trip_count to limit "relative code size increase" but update the comment to make it accurate.

Any suggestions?

I compared the profile between our internal benchmark and 464.h264ref, the only difference is that the fully unrolled loop showed high up in the profile of our benchmark, while the fully unrolled loop is cold in 464.h264ref, thus it has no performance impact.

Could you tell what exactly happens to the loop after unrolling? Do we get the performance improvement from just removing branches, or does unrolling enable later optimizations (if so, which ones)?

We could use profile info to allow more aggressive threshold only for hot loops, so that code size increase can be avoided. But this requires BFI within the loop pass, which will be expensive (compile time overhead).

Using profile info in loop-unrolling is definitely worthwhile. Most of the loops we unroll are actually in the cold parts, so if we can avoid unrolling them, we can save some budget for more aggressive unrolling in hot regions (or just get smaller code and faster compilation).

OTOH, if the heuristic is generally helping performance...

What heuristic are you referring to here?

Michael

In D28368#637684, @mzolotukhin wrote:

The real motivation for this patch is to boost the threshold for full unrolling so that we can realize the performance benefits in our benchmarks. The initial thought was to simply boost unroll-threshold by a minimum of 2X for full unrolling (which is enough to realize the performance gain). But I thought this might be hard to get accepted upstream, as it seems too brute-force. So I thought of integrating trip_count into the model to limit the "relative code size increase".

I see. Just doubling the threshold indeed will be hard to upstream,...

Alternatively, maybe we can make the cost model more accurate. I observe that the cost model used by the unroller overestimates the cost of free (S/Z)EXT and unconditional branches.

In D28368#637730, @mzolotukhin wrote:

I compared the profile between our internal benchmark and 464.h264ref, the only difference is that the fully unrolled loop showed high up in the profile of our benchmark, while the fully unrolled loop is cold in 464.h264ref, thus it has no performance impact.

Could you tell what exactly happens to the loop after unrolling? Do we get the performance improvement from just removing branches, or does unrolling enable later optimizations (if so, which ones)?

It's from reduced branches as well as loop preparation code (dynamic instructions reduced from 179 to 167), which has already been captured by the unroll size analysis (boosting = rolled_cost/unroll_cost = 179/167). However, for that specific case, we need a threshold of ~200 to make the full unroll happen.
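
To make the arithmetic above concrete, here is a minimal sketch of a linear boosting factor. It is not the LLVM implementation: the function names, the percentage representation, the base threshold of 150, and the boost cap of 400% are all assumptions made for illustration; only the costs 179 and 167 and the ~200 figure come from this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: linear boost, as a percentage, equal to
// 100 * rolled / unrolled and capped at maxPercentBoost.
unsigned boostPercent(unsigned rolledDynamicCost, unsigned unrolledCost,
                      unsigned maxPercentBoost) {
  assert(maxPercentBoost >= 100 && "a cap below 100% would shrink the threshold");
  if (unrolledCost == 0)
    return maxPercentBoost;
  // Widen to 64 bits so the multiplication cannot overflow.
  std::uint64_t ratio = 100ull * rolledDynamicCost / unrolledCost;
  return static_cast<unsigned>(std::min<std::uint64_t>(ratio, maxPercentBoost));
}

// Hypothetical helper: true if the unrolled cost fits under the base
// threshold after the boost has been applied.
bool fitsBoostedThreshold(unsigned rolledDynamicCost, unsigned unrolledCost,
                          unsigned baseThreshold, unsigned maxPercentBoost) {
  unsigned boost = boostPercent(rolledDynamicCost, unrolledCost, maxPercentBoost);
  std::uint64_t boosted = static_cast<std::uint64_t>(baseThreshold) * boost / 100;
  return unrolledCost < boosted;
}
```

With the numbers quoted above, boostPercent(179, 167, 400) is 107, so an assumed base threshold of 150 is boosted only to 160, which is still below the unrolled cost of 167. A base threshold of 200 is boosted to 214, which is enough.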

We could use profile info to allow more aggressive threshold only for hot loops, so that code size increase can be avoided. But this requires BFI within the loop pass, which will be expensive (compile time overhead).

Using profile info in loop-unrolling is definitely worthwhile. Most of the loops we unroll are actually in the cold parts, so if we can avoid unrolling them, we can save some budget for more aggressive unrolling in hot regions (or just get smaller code and faster compilation).

I agree profile can help get a good balance here, but O2 build cannot benefit from it.

OTOH, if the heuristic is generally helping performance...

What heuristic are you referring to here?

Sorry, I meant the profile I proposed in this patch.

Thanks,
Dehao

Michael

In D28368#637912, @haicheng wrote:

In D28368#637684, @mzolotukhin wrote:

The real motivation for this patch is to boost the threshold for full unrolling so that we can realize the performance benefits in our benchmarks. The initial thought was to simply boost unroll-threshold by a minimum of 2X for full unrolling (which is enough to realize the performance gain). But I thought this might be hard to get accepted upstream, as it seems too brute-force. So I thought of integrating trip_count into the model to limit the "relative code size increase".

I see. Just doubling the threshold indeed will be hard to upstream,...

Alternatively, maybe we can make the cost model more accurate. I observe that the cost model used by the unroller overestimates the cost of free (S/Z)EXT and unconditional branches.

Agreed that we need a more accurate model, but the problem is that even if the model is 100% accurate, a linear boosting factor cannot boost the threshold enough for our case.

Thanks,
Dehao

Agreed that we need a more accurate model, but the problem is that even if the model is 100% accurate, a linear boosting factor cannot boost the threshold enough for our case.

One problem with current cost model is that it's used for estimating both code size and runtime performance. It might be worth checking if we can gain anything from separating these two metrics more clearly - I think it was discussed in the past, but no decision has been made.

I agree profile can help get a good balance here, but O2 build cannot benefit from it.

There are always cases where we generate sub-optimal code. For users striving for the utmost performance we provide higher optimization levels (+LTO, +PGO) and pragmas. We cannot just bump thresholds for every case we want to unroll/inline/whatever.

Sorry, I meant the profile I proposed in this patch.

Adding Constant/TripCount looks like simply bumping the threshold to me, except it also adds complexity to the code, so I'm not convinced we want this.

Michael

In D28368#638153, @mzolotukhin wrote:

Agreed that we need a more accurate model, but the problem is that even if the model is 100% accurate, a linear boosting factor cannot boost the threshold enough for our case.

One problem with current cost model is that it's used for estimating both code size and runtime performance. It might be worth checking if we can gain anything from separating these two metrics more clearly - I think it was discussed in the past, but no decision has been made.

The relationship between code size and runtime performance differs between the different unrollers.

In dynamic and partial unrolling, performance initially increases as code size increases (because dynamic branches are reduced), but past a certain threshold, performance starts to degrade as code size increases (due to increased i-cache misses, the loop body no longer fitting into the LSD, etc.). So a fixed threshold is usually helpful for finding the performance sweet spot.

In full unrolling, if the loop can be fully unrolled, it is unlikely to trigger the LSD (not enough trip count), nor will it affect i-cache misses (a fully unrolled loop is straight-line code with no temporal locality; even if it's embedded in an outer loop, the backedge of the outer loop should be easy to predict). So if we assume all backend optimizations are sane (e.g. SLP performs as well as the loop vectorizer, RA does a good job in large BBs, etc.), larger code size should always lead to better performance for full unrolling. So a threshold here purely limits the size of the text.

If my above analysis is reasonable, then I think the two types of unroller probably should not share the same threshold? And the full unroller may be better off with a larger threshold?

I agree profile can help get a good balance here, but O2 build cannot benefit from it.

There are always cases where we generate sub-optimal code. For users striving for the utmost performance we provide higher optimization levels (+LTO, +PGO) and pragmas. We cannot just bump thresholds for every case we want to unroll/inline/whatever.

Sounds reasonable. How about we bump the threshold in O3, so that people who do not have profiler can still choose to fully unroll more aggressively?

Thanks,
Dehao

Sorry, I meant the profile I proposed in this patch.

Adding Constant/TripCount looks like simply bumping the threshold to me, except it also adds complexity to the code, so I'm not convinced we want this.

Michael

In full unrolling, if the loop can be fully unrolled, it is unlikely to trigger the LSD (not enough trip count), nor will it affect i-cache misses (a fully unrolled loop is straight-line code with no temporal locality; even if it's embedded in an outer loop, the backedge of the outer loop should be easy to predict). So if we assume all backend optimizations are sane (e.g. SLP performs as well as the loop vectorizer, RA does a good job in large BBs, etc.), larger code size should always lead to better performance for full unrolling. So a threshold here purely limits the size of the text.

This is not exactly true in practice. If we just bump up the threshold, we'll see both performance improvements and regressions.

I think probably two types of unroller should not share the same threshold?

This makes sense. However, I prefer not to bloat our army of thresholds without a guaranteed benefit.

How about we bump the threshold in O3, so that people who do not have profiler can still choose to fully unroll more aggressively?

For the change like this please submit a separate patch and include as much testing data as you can (including but not limited to SPEC, LLVM-testsuite, etc.). Please include runtime performance, compile time, and binary sizes.

Thanks,
Michael

In D28368#641747, @mzolotukhin wrote:

In full unrolling, if the loop can be fully unrolled, it is unlikely to trigger the LSD (not enough trip count), nor will it affect i-cache misses (a fully unrolled loop is straight-line code with no temporal locality; even if it's embedded in an outer loop, the backedge of the outer loop should be easy to predict). So if we assume all backend optimizations are sane (e.g. SLP performs as well as the loop vectorizer, RA does a good job in large BBs, etc.), larger code size should always lead to better performance for full unrolling. So a threshold here purely limits the size of the text.

This is not exactly true in practice. If we just bump up the threshold, we'll see both performance improvements and regressions.

From our limited experiments, bumping up the full-unroll threshold by 2X only improves performance for both speccpu and our internal benchmarks. If we boost it by 10X, we do see perf regressions on some coding/decoding benchmarks. We root-caused the problem: SLP cannot vectorize fully unrolled code while the loop vectorizer can. @mkuper is working on SLP to solve it. Other than that, it appears even boosting the threshold by 10X is a pure win for performance.

Could you point us to the benchmarks where you observed regressions after boosting the full-unroll threshold? We would be happy to take a look and learn why performance gets worse and possibly improve it. Thanks!

I think probably two types of unroller should not share the same threshold?

This makes sense. However, I prefer not to bloat our army of thresholds without a guaranteed benefit.

What do you mean by "guaranteed benefit"?

If it means "positive speedup with no code size/compile time increase", this seems impossible as any threshold boost will lead to code size boost.

If it means "positive speedup" only, it seems to be already satisfied.

How about we bump the threshold in O3, so that people who do not have profiler can still choose to fully unroll more aggressively?

For the change like this please submit a separate patch and include as much testing data as you can (including but not limited to SPEC, LLVM-testsuite, etc.). Please include runtime performance, compile time, and binary sizes.

I'll send out a new patch for this if we decide to put this in O3. In the meantime, I collected more performance data:

update the data to remove the trip count logic and merely boost the fully unroll tripcount by 2X

benchmark	code size	compile time	performance
447.dealII	0.52%	-0.24%	-0.94%
453.povray	0.45%	-0.65%	3.00%
433.milc	0.20%	2.01%	0.47%
445.gobmk	0.32%	-1.12%	0.32%
403.gcc	0.05%	0.58%	0.25%
464.h264ref	4.04%	4.62%	0.28%

Building the LLVM test suite with and without the change affects only the following 3 binaries. No noticeable compile-time or run-time change has been observed.

binary	code size change
CMakeFiles/CheckTypeSize/CMAKE_SIZEOF_UNSIGNED_SHORT.bin	0.1%
CMakeFiles/feature_tests.bin	0%
CMakeFiles/TestEndianess.bin	0.1%

Thanks,
Dehao

Thanks,
Michael

Could you point us to the benchmarks where you observed regressions after boosting the full-unroll threshold?

I ran the standard LLVM testsuite in the past and I think I observed several runtime regressions. However my memory might play tricks on me, so if you've just remeasured it, you can ignore this.

What do you mean by "guaranteed benefit"?

I meant that while some compiletime/runtime tradeoff might be acceptable, we definitely need to be aware of it before we land such changes, and we do have to have numbers at hand for that. Ideally, that would be pure win (runtime performance improves, compile time and code size is the same), but yeah, unfortunately it's unfeasible.

update the data to remove the trip count logic and merely boost the fully unroll tripcount by 2X

What option do you mean by fully unroll tripcount threshold? -unroll-threshold?

use a separate FullThreshold for fully unroller.

In D28368#641996, @mzolotukhin wrote:

Could you point us to the benchmarks where you observed regressions after boosting the full-unroll threshold?

I ran the standard LLVM testsuite in the past and I think I observed several runtime regressions. However my memory might play tricks on me, so if you've just remeasured it, you can ignore this.

What do you mean by "guaranteed benefit"?

I meant that while some compiletime/runtime tradeoff might be acceptable, we definitely need to be aware of it before we land such changes, and we do have to have numbers at hand for that. Ideally, that would be pure win (runtime performance improves, compile time and code size is the same), but yeah, unfortunately it's unfeasible.

Definitely, the result I have so far includes speccpu and llvm testsuite. We also have internal benchmarks with all positive performance impact and < 0.1% mean size increase, unfortunately we cannot show them here. Let me know if there's any other benchmarks you would like me to test for perf/code_size impacts.

update the data to remove the trip count logic and merely boost the fully unroll tripcount by 2X

What option do you mean by fully unroll tripcount threshold? -unroll-threshold?

I should have uploaded the updated patch to make it clear, sorry about that. The above numbers are collected with the updated patch.

Thanks,
Dehao

ping...

Thanks,
Dehao

It looks like this makes UnrollingPreferences::Threshold essentially unused? We already have a separate threshold for runtime/partial unrolling; it just isn't exposed as a command-line option.

In D28368#648614, @efriedma wrote:

It looks like this makes UnrollingPreferences::Threshold essentially unused? We already have a separate threshold for runtime/partial unrolling; it just isn't exposed as a command-line option.

It's still used here:

// Check for explicit Count.
// 1st priority is unroll count set by "unroll-count" option.
bool UserUnrollCount = UnrollCount.getNumOccurrences() > 0;
if (UserUnrollCount) {
  UP.Count = UnrollCount;
  UP.AllowExpensiveTripCount = true;
  UP.Force = true;
  if (UP.AllowRemainder && getUnrolledLoopSize(LoopSize, UP) < UP.Threshold)
    return true;
}

You are right, partial unroll and runtime unroll use UP.PartialThreshold, which was set to the same value as UP.Threshold. Any recommendations on how to make this less confusing?

My suggestion:

Rename -unroll-threshold to -unroll-full-threshold.
Rename UP.Threshold to UP.FullThreshold.
Add an option -unroll-partial-threshold, and use it to initialize UP.PartialThreshold instead of unroll-full-threshold.
Make sure all the uses of UP.PartialThreshold and UP.FullThreshold make sense.
Separate out the change to modify the default full-unroll threshold into a different patch.
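
As a rough sketch of the split suggested above, the two thresholds can be modeled as independent fields, each overridden only by its own option. This is an illustration, not LLVM code: the struct, the function, and the default values are hypothetical, and the -1 "not set" sentinel simply mirrors the convention of the createLoopUnrollPass parameters.

```cpp
#include <cassert>

// Hypothetical stand-in for the two thresholds under discussion.
struct UnrollThresholds {
  unsigned FullThreshold;    // limit used for full unrolling
  unsigned PartialThreshold; // limit used for partial/runtime unrolling
};

// Apply per-kind overrides. A value of -1 means "flag not given", so the
// corresponding default is kept; each flag touches only its own field.
UnrollThresholds resolveThresholds(UnrollThresholds defaults,
                                   int fullFlag = -1, int partialFlag = -1) {
  assert(fullFlag >= -1 && partialFlag >= -1 && "flags must be -1 or a size");
  UnrollThresholds result = defaults;
  if (fullFlag != -1)
    result.FullThreshold = static_cast<unsigned>(fullFlag);
  if (partialFlag != -1)
    result.PartialThreshold = static_cast<unsigned>(partialFlag);
  return result;
}
```

Under this scheme, raising the full-unroll limit leaves partial and runtime unrolling untouched, which is the property the renaming is meant to make explicit.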

efriedma added a subscriber: zzheng.Jan 17 2017, 2:20 PM

I agree with the plan proposed by Eli.

One thing I'm not 100% certain about is "Rename -unroll-threshold to -unroll-full-threshold": I think we can keep '-unroll-threshold' for the full unroll threshold and add '-unroll-partial-threshold' for the partial unrolling case. It might be better because all current uses of the '-unroll-threshold' option will remain correct. But I don't feel strongly about it, so whatever name you choose is fine with me.

Michael

I chose to reuse Threshold for brevity. Let me know if anyone objects.

https://reviews.llvm.org/D28831 sent out to separate -unroll-partial-threshold.

Thanks,
Dehao

Rebase to only adjust the unroll threshold.

Harbormaster completed remote builds in B3012: Diff 84774.Jan 17 2017, 4:23 PM

ping... (the patch now became much simpler).

Thanks,
Dehao

ping...

ping....

Using the higher threshold makes sense at -O3. Not so sure about -O2... maybe ask on llvmdev?

The new threshold looks reasonable to me. Please update title and summary?

danielcdh retitled this revision from Give higher full-unroll boosting when the loop iteration is small. to Increases full-unroll threshold..Feb 10 2017, 3:52 PM

danielcdh edited the summary of this revision.

LGTM

This revision is now accepted and ready to land.Feb 10 2017, 5:08 PM

danielcdh edited the summary of this revision. Feb 13 2017, 8:28 AM

Move the threshold update to O3
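
A minimal sketch of what selecting the threshold by optimization level could look like. The function name and the baseline of 150 are assumptions for illustration; the thread itself only states that the full-unroll threshold is doubled and that the increase applies at O3.

```cpp
#include <cassert>

// Hypothetical helper: keep the baseline threshold up to -O2 and double it
// at -O3 and above. The baseline value is an assumed placeholder.
unsigned fullUnrollThreshold(int optLevel) {
  assert(optLevel >= 0 && "optimization level must be non-negative");
  const unsigned baseline = 150;
  return optLevel > 2 ? 2 * baseline : baseline;
}
```

Gating on the optimization level keeps code size and compile time unchanged for -O2 builds while letting -O3 users opt in to the more aggressive full unrolling.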

Herald added a subscriber: mehdi_amini. Feb 17 2017, 1:52 PM

danielcdh added a reviewer: chandlerc.Feb 17 2017, 1:53 PM

Harbormaster completed remote builds in B4096: Diff 88951.Feb 17 2017, 2:32 PM

LGTM, thanks for all the analysis!

lib/Transforms/Scalar/LoopUnrollPass.cpp
133–134	clang-format here?

Thanks for the reviews.

clang-format and rebase

Harbormaster completed remote builds in B4105: Diff 89008.Feb 17 2017, 7:58 PM

danielcdh closed this revision.Feb 17 2017, 7:58 PM

Revision Contents

Path

Size

include/

llvm/

Transforms/

Scalar.h

4 lines

Scalar/

LoopUnrollPass.h

13 lines

lib/

Passes/

PassBuilder.cpp

4 lines

Transforms/

IPO/

PassManagerBuilder.cpp

12 lines

Scalar/

LoopUnrollPass.cpp

49 lines

test/

Transforms/

LoopVectorize/

X86/

metadata-enable.ll

14 lines

Diff 89008

include/llvm/Transforms/Scalar.h

	Show First 20 Lines • Show All 175 Lines • ▼ Show 20 Lines
	// LoopInstSimplify - This pass simplifies instructions in a loop's body.			// LoopInstSimplify - This pass simplifies instructions in a loop's body.
	//			//
	Pass *createLoopInstSimplifyPass();			Pass *createLoopInstSimplifyPass();

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// LoopUnroll - This pass is a simple loop unrolling pass.			// LoopUnroll - This pass is a simple loop unrolling pass.
	//			//
	Pass *createLoopUnrollPass(int Threshold = -1, int Count = -1,			Pass *createLoopUnrollPass(int OptLevel = 2, int Threshold = -1, int Count = -1,
	int AllowPartial = -1, int Runtime = -1,			int AllowPartial = -1, int Runtime = -1,
	int UpperBound = -1);			int UpperBound = -1);
	// Create an unrolling pass for full unrolling that uses exact trip count only.			// Create an unrolling pass for full unrolling that uses exact trip count only.
	Pass *createSimpleLoopUnrollPass();			Pass *createSimpleLoopUnrollPass(int OptLevel);

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// LoopReroll - This pass is a simple loop rerolling pass.			// LoopReroll - This pass is a simple loop rerolling pass.
	//			//
	Pass *createLoopRerollPass();			Pass *createLoopRerollPass();

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	(367 lines not shown)
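
A note on the signature change above: because OptLevel is inserted as a new leading defaulted parameter, a caller that previously passed only a threshold positionally would still compile but would now be setting OptLevel instead. A minimal sketch of that effect follows; UnrollArgs, createBefore and createAfter are hypothetical stand-ins that only record their arguments, not the real LLVM functions.

```cpp
// Hypothetical record of what a factory call received; not an LLVM type.
struct UnrollArgs {
  int OptLevel;
  int Threshold;
  int Count;
};

// Shape of the factory before the diff: Threshold is the first parameter.
// OptLevel did not exist yet, so it is recorded as -1 here.
UnrollArgs createBefore(int Threshold = -1, int Count = -1) {
  return UnrollArgs{-1, Threshold, Count};
}

// Shape of the factory after the diff: OptLevel now leads the list.
UnrollArgs createAfter(int OptLevel = 2, int Threshold = -1, int Count = -1) {
  return UnrollArgs{OptLevel, Threshold, Count};
}
```

A call such as createBefore(200) sets Threshold to 200, whereas the textually identical createAfter(200) sets OptLevel to 200 and leaves Threshold at its -1 default. In-tree callers are updated by this diff; out-of-tree callers that pass positional arguments may need the same treatment.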

include/llvm/Transforms/Scalar/LoopUnrollPass.h

	(12 lines not shown)
	#include "llvm/Analysis/LoopInfo.h"			#include "llvm/Analysis/LoopInfo.h"
	#include "llvm/IR/PassManager.h"			#include "llvm/IR/PassManager.h"
	#include "llvm/Transforms/Scalar/LoopPassManager.h"			#include "llvm/Transforms/Scalar/LoopPassManager.h"

	namespace llvm {			namespace llvm {

	class LoopUnrollPass : public PassInfoMixin<LoopUnrollPass> {			class LoopUnrollPass : public PassInfoMixin<LoopUnrollPass> {
	const bool AllowPartialUnrolling;			const bool AllowPartialUnrolling;
				const int OptLevel;

	explicit LoopUnrollPass(bool AllowPartialUnrolling)			explicit LoopUnrollPass(bool AllowPartialUnrolling, int OptLevel)
	: AllowPartialUnrolling(AllowPartialUnrolling) {}			: AllowPartialUnrolling(AllowPartialUnrolling), OptLevel(OptLevel) {}

	public:			public:
	/// Create an instance of the loop unroll pass that will support both full			/// Create an instance of the loop unroll pass that will support both full
	/// and partial unrolling.			/// and partial unrolling.
	///			///
	/// This uses the target information (or flags) to control the thresholds for			/// This uses the target information (or flags) to control the thresholds for
	/// different unrolling stategies but supports all of them.			/// different unrolling stategies but supports all of them.
	static LoopUnrollPass create() {			static LoopUnrollPass create(int OptLevel = 2) {
	return LoopUnrollPass(/*AllowPartialUnrolling*/ true);			return LoopUnrollPass(/*AllowPartialUnrolling*/ true, OptLevel);
	}			}

	/// Create an instance of the loop unroll pass that only does full loop			/// Create an instance of the loop unroll pass that only does full loop
	/// unrolling.			/// unrolling.
	///			///
	/// This will disable any runtime or partial unrolling.			/// This will disable any runtime or partial unrolling.
	static LoopUnrollPass createFull() {			static LoopUnrollPass createFull(int OptLevel = 2) {
	return LoopUnrollPass(/*AllowPartialUnrolling*/ false);			return LoopUnrollPass(/*AllowPartialUnrolling*/ false, OptLevel);
	}			}

	PreservedAnalyses run(Loop &L, LoopAnalysisManager &AM,			PreservedAnalyses run(Loop &L, LoopAnalysisManager &AM,
	LoopStandardAnalysisResults &AR, LPMUpdater &U);			LoopStandardAnalysisResults &AR, LPMUpdater &U);
	};			};
	} // end namespace llvm			} // end namespace llvm

	#endif // LLVM_TRANSFORMS_SCALAR_LOOPUNROLLPASS_H			#endif // LLVM_TRANSFORMS_SCALAR_LOOPUNROLLPASS_H

lib/Passes/PassBuilder.cpp

(328 lines not shown)	PassBuilder::buildFunctionSimplificationPipeline(OptimizationLevel Level,
LPM1.addPass(LICMPass());		LPM1.addPass(LICMPass());
#if 0		#if 0
// The LoopUnswitch pass isn't yet ported to the new pass manager.		// The LoopUnswitch pass isn't yet ported to the new pass manager.
LPM1.addPass(LoopUnswitchPass(/* OptimizeForSize */ Level != O3));		LPM1.addPass(LoopUnswitchPass(/* OptimizeForSize */ Level != O3));
#endif		#endif
LPM2.addPass(IndVarSimplifyPass());		LPM2.addPass(IndVarSimplifyPass());
LPM2.addPass(LoopIdiomRecognizePass());		LPM2.addPass(LoopIdiomRecognizePass());
LPM2.addPass(LoopDeletionPass());		LPM2.addPass(LoopDeletionPass());
LPM2.addPass(LoopUnrollPass::createFull());		LPM2.addPass(LoopUnrollPass::createFull(Level));

// We provide the opt remark emitter pass for LICM to use. We only need to do		// We provide the opt remark emitter pass for LICM to use. We only need to do
// this once as it is immutable.		// this once as it is immutable.
FPM.addPass(RequireAnalysisPass<OptimizationRemarkEmitterAnalysis, Function>());		FPM.addPass(RequireAnalysisPass<OptimizationRemarkEmitterAnalysis, Function>());
FPM.addPass(createFunctionToLoopPassAdaptor(std::move(LPM1)));		FPM.addPass(createFunctionToLoopPassAdaptor(std::move(LPM1)));
FPM.addPass(SimplifyCFGPass());		FPM.addPass(SimplifyCFGPass());
FPM.addPass(InstCombinePass());		FPM.addPass(InstCombinePass());
FPM.addPass(createFunctionToLoopPassAdaptor(std::move(LPM2)));		FPM.addPass(createFunctionToLoopPassAdaptor(std::move(LPM2)));
(254 lines not shown)	PassBuilder::buildPerModuleDefaultPipeline(OptimizationLevel Level,
OptimizePM.addPass(InstCombinePass());		OptimizePM.addPass(InstCombinePass());

// Unroll small loops to hide loop backedge latency and saturate any parallel		// Unroll small loops to hide loop backedge latency and saturate any parallel
// execution resources of an out-of-order processor. We also then need to		// execution resources of an out-of-order processor. We also then need to
// clean up redundancies and loop invariant code.		// clean up redundancies and loop invariant code.
// FIXME: It would be really good to use a loop-integrated instruction		// FIXME: It would be really good to use a loop-integrated instruction
// combiner for cleanup here so that the unrolling and LICM can be pipelined		// combiner for cleanup here so that the unrolling and LICM can be pipelined
// across the loop nests.		// across the loop nests.
OptimizePM.addPass(createFunctionToLoopPassAdaptor(LoopUnrollPass::create()));		OptimizePM.addPass(createFunctionToLoopPassAdaptor(LoopUnrollPass::create(Level)));
OptimizePM.addPass(InstCombinePass());		OptimizePM.addPass(InstCombinePass());
OptimizePM.addPass(RequireAnalysisPass<OptimizationRemarkEmitterAnalysis, Function>());		OptimizePM.addPass(RequireAnalysisPass<OptimizationRemarkEmitterAnalysis, Function>());
OptimizePM.addPass(createFunctionToLoopPassAdaptor(LICMPass()));		OptimizePM.addPass(createFunctionToLoopPassAdaptor(LICMPass()));

// Now that we've vectorized and unrolled loops, we may have more refined		// Now that we've vectorized and unrolled loops, we may have more refined
// alignment information, try to re-derive it here.		// alignment information, try to re-derive it here.
OptimizePM.addPass(AlignmentFromAssumptionsPass());		OptimizePM.addPass(AlignmentFromAssumptionsPass());

(801 lines not shown)

lib/Transforms/IPO/PassManagerBuilder.cpp

(314 lines not shown)	void PassManagerBuilder::addFunctionSimplificationPasses(
addExtensionsToPM(EP_LateLoopOptimizations, MPM);		addExtensionsToPM(EP_LateLoopOptimizations, MPM);
MPM.add(createLoopDeletionPass()); // Delete dead loops		MPM.add(createLoopDeletionPass()); // Delete dead loops

if (EnableLoopInterchange) {		if (EnableLoopInterchange) {
MPM.add(createLoopInterchangePass()); // Interchange loops		MPM.add(createLoopInterchangePass()); // Interchange loops
MPM.add(createCFGSimplificationPass());		MPM.add(createCFGSimplificationPass());
}		}
if (!DisableUnrollLoops)		if (!DisableUnrollLoops)
MPM.add(createSimpleLoopUnrollPass()); // Unroll small loops		MPM.add(createSimpleLoopUnrollPass(OptLevel)); // Unroll small loops
addExtensionsToPM(EP_LoopOptimizerEnd, MPM);		addExtensionsToPM(EP_LoopOptimizerEnd, MPM);

if (OptLevel > 1) {		if (OptLevel > 1) {
MPM.add(createMergedLoadStoreMotionPass()); // Merge ld/st in diamonds		MPM.add(createMergedLoadStoreMotionPass()); // Merge ld/st in diamonds
MPM.add(NewGVN ? createNewGVNPass()		MPM.add(NewGVN ? createNewGVNPass()
: createGVNPass(DisableGVNLoadPRE)); // Remove redundancies		: createGVNPass(DisableGVNLoadPRE)); // Remove redundancies
}		}
MPM.add(createMemCpyOptPass()); // Remove memcpy / form memset		MPM.add(createMemCpyOptPass()); // Remove memcpy / form memset
(29 lines not shown)	if (BBVectorize) {
MPM.add(NewGVN		MPM.add(NewGVN
? createNewGVNPass()		? createNewGVNPass()
: createGVNPass(DisableGVNLoadPRE)); // Remove redundancies		: createGVNPass(DisableGVNLoadPRE)); // Remove redundancies
else		else
MPM.add(createEarlyCSEPass()); // Catch trivial redundancies		MPM.add(createEarlyCSEPass()); // Catch trivial redundancies

// BBVectorize may have significantly shortened a loop body; unroll again.		// BBVectorize may have significantly shortened a loop body; unroll again.
if (!DisableUnrollLoops)		if (!DisableUnrollLoops)
MPM.add(createLoopUnrollPass());		MPM.add(createLoopUnrollPass(OptLevel));
}		}
}		}

if (LoadCombine)		if (LoadCombine)
MPM.add(createLoadCombinePass());		MPM.add(createLoadCombinePass());

MPM.add(createAggressiveDCEPass()); // Delete dead instructions		MPM.add(createAggressiveDCEPass()); // Delete dead instructions
MPM.add(createCFGSimplificationPass()); // Merge & remove BBs		MPM.add(createCFGSimplificationPass()); // Merge & remove BBs
(229 lines not shown)	if (BBVectorize) {
MPM.add(NewGVN		MPM.add(NewGVN
? createNewGVNPass()		? createNewGVNPass()
: createGVNPass(DisableGVNLoadPRE)); // Remove redundancies		: createGVNPass(DisableGVNLoadPRE)); // Remove redundancies
else		else
MPM.add(createEarlyCSEPass()); // Catch trivial redundancies		MPM.add(createEarlyCSEPass()); // Catch trivial redundancies

// BBVectorize may have significantly shortened a loop body; unroll again.		// BBVectorize may have significantly shortened a loop body; unroll again.
if (!DisableUnrollLoops)		if (!DisableUnrollLoops)
MPM.add(createLoopUnrollPass());		MPM.add(createLoopUnrollPass(OptLevel));
}		}
}		}

addExtensionsToPM(EP_Peephole, MPM);		addExtensionsToPM(EP_Peephole, MPM);
MPM.add(createCFGSimplificationPass());		MPM.add(createCFGSimplificationPass());
addInstructionCombiningPass(MPM);		addInstructionCombiningPass(MPM);

if (!DisableUnrollLoops) {		if (!DisableUnrollLoops) {
MPM.add(createLoopUnrollPass()); // Unroll small loops		MPM.add(createLoopUnrollPass(OptLevel)); // Unroll small loops

// LoopUnroll may generate some redundency to cleanup.		// LoopUnroll may generate some redundency to cleanup.
addInstructionCombiningPass(MPM);		addInstructionCombiningPass(MPM);

// Runtime unrolling will introduce runtime check in loop prologue. If the		// Runtime unrolling will introduce runtime check in loop prologue. If the
// unrolled loop is a inner loop, then the prologue will be inside the		// unrolled loop is a inner loop, then the prologue will be inside the
// outer loop. LICM pass can help to promote the runtime check out if the		// outer loop. LICM pass can help to promote the runtime check out if the
// checked value is loop invariant.		// checked value is loop invariant.
(134 lines not shown)	void PassManagerBuilder::addLTOOptimizationPasses(legacy::PassManagerBase &PM) {

// More loops are countable; try to optimize them.		// More loops are countable; try to optimize them.
PM.add(createIndVarSimplifyPass());		PM.add(createIndVarSimplifyPass());
PM.add(createLoopDeletionPass());		PM.add(createLoopDeletionPass());
if (EnableLoopInterchange)		if (EnableLoopInterchange)
PM.add(createLoopInterchangePass());		PM.add(createLoopInterchangePass());

if (!DisableUnrollLoops)		if (!DisableUnrollLoops)
PM.add(createSimpleLoopUnrollPass()); // Unroll small loops		PM.add(createSimpleLoopUnrollPass(OptLevel)); // Unroll small loops
PM.add(createLoopVectorizePass(true, LoopVectorize));		PM.add(createLoopVectorizePass(true, LoopVectorize));
// The vectorizer may have significantly shortened a loop body; unroll again.		// The vectorizer may have significantly shortened a loop body; unroll again.
if (!DisableUnrollLoops)		if (!DisableUnrollLoops)
PM.add(createLoopUnrollPass());		PM.add(createLoopUnrollPass(OptLevel));

// Now that we've optimized loops (in particular loop induction variables),		// Now that we've optimized loops (in particular loop induction variables),
// we may have exposed more scalar opportunities. Run parts of the scalar		// we may have exposed more scalar opportunities. Run parts of the scalar
// optimizer again at this point.		// optimizer again at this point.
addInstructionCombiningPass(PM); // Initial cleanup		addInstructionCombiningPass(PM); // Initial cleanup
PM.add(createCFGSimplificationPass()); // if-convert		PM.add(createCFGSimplificationPass()); // if-convert
PM.add(createSCCPPass()); // Propagate exposed constants		PM.add(createSCCPPass()); // Propagate exposed constants
addInstructionCombiningPass(PM); // Clean up again		addInstructionCombiningPass(PM); // Clean up again
(185 lines not shown)

lib/Transforms/Scalar/LoopUnrollPass.cpp

(124 lines not shown)

/// A magic value for use with the Threshold parameter to indicate		/// A magic value for use with the Threshold parameter to indicate
/// that the loop unroll should be performed regardless of how much		/// that the loop unroll should be performed regardless of how much
/// code expansion would result.		/// code expansion would result.
static const unsigned NoThreshold = UINT_MAX;		static const unsigned NoThreshold = UINT_MAX;

/// Gather the various unrolling parameters based on the defaults, compiler		/// Gather the various unrolling parameters based on the defaults, compiler
/// flags, TTI overrides and user specified parameters.		/// flags, TTI overrides and user specified parameters.
static TargetTransformInfo::UnrollingPreferences gatherUnrollingPreferences(		static TargetTransformInfo::UnrollingPreferences gatherUnrollingPreferences(
Loop *L, const TargetTransformInfo &TTI, Optional<unsigned> UserThreshold,		Loop *L, const TargetTransformInfo &TTI, int OptLevel,
		chandlerc: clang-format here?
Optional<unsigned> UserCount, Optional<bool> UserAllowPartial,		Optional<unsigned> UserThreshold, Optional<unsigned> UserCount,
Optional<bool> UserRuntime, Optional<bool> UserUpperBound) {		Optional<bool> UserAllowPartial, Optional<bool> UserRuntime,
		Optional<bool> UserUpperBound) {
TargetTransformInfo::UnrollingPreferences UP;		TargetTransformInfo::UnrollingPreferences UP;

// Set up the defaults		// Set up the defaults
UP.Threshold = 150;		UP.Threshold = OptLevel > 2 ? 300 : 150;
UP.MaxPercentThresholdBoost = 400;		UP.MaxPercentThresholdBoost = 400;
UP.OptSizeThreshold = 0;		UP.OptSizeThreshold = 0;
UP.PartialThreshold = 150;		UP.PartialThreshold = 150;
UP.PartialOptSizeThreshold = 0;		UP.PartialOptSizeThreshold = 0;
UP.Count = 0;		UP.Count = 0;
UP.PeelCount = 0;		UP.PeelCount = 0;
UP.DefaultUnrollRuntimeCount = 8;		UP.DefaultUnrollRuntimeCount = 8;
UP.MaxCount = UINT_MAX;		UP.MaxCount = UINT_MAX;
(520 lines not shown)	static void SetLoopAlreadyUnrolled(Loop *L) {
L->setLoopID(NewLoopID);		L->setLoopID(NewLoopID);
}		}

// Computes the boosting factor for complete unrolling.		// Computes the boosting factor for complete unrolling.
// If fully unrolling the loop would save a lot of RolledDynamicCost, it would		// If fully unrolling the loop would save a lot of RolledDynamicCost, it would
// be beneficial to fully unroll the loop even if unrolledcost is large. We		// be beneficial to fully unroll the loop even if unrolledcost is large. We
// use (RolledDynamicCost / UnrolledCost) to model the unroll benefits to adjust		// use (RolledDynamicCost / UnrolledCost) to model the unroll benefits to adjust
// the unroll threshold.		// the unroll threshold.
static unsigned getFullUnrollBoostingFactor(const EstimatedUnrollCost &Cost,		static unsigned getFullUnrollBoostingFactor(const EstimatedUnrollCost &Cost,
		hfinkel: fully unroll -> fully unrolling
unsigned MaxPercentThresholdBoost) {		unsigned MaxPercentThresholdBoost) {
if (Cost.RolledDynamicCost >= UINT_MAX / 100)		if (Cost.RolledDynamicCost >= UINT_MAX / 100)
return 100;		return 100;
else if (Cost.UnrolledCost != 0)		else if (Cost.UnrolledCost != 0)
// The boosting factor is RolledDynamicCost / UnrolledCost		// The boosting factor is RolledDynamicCost / UnrolledCost
return std::min(100 * Cost.RolledDynamicCost / Cost.UnrolledCost,		return std::min(100 * Cost.RolledDynamicCost / Cost.UnrolledCost,
		hfinkel: Can you please put the magic numbers here into cl::opt flags so that it is easy to experiment with tuning them later?
		danielcdh (author): Updated to make it tunable with unroll-max-iteration-count-to-analyze
MaxPercentThresholdBoost);		MaxPercentThresholdBoost);
else		else
return MaxPercentThresholdBoost;		return MaxPercentThresholdBoost;
}		}

// Returns loop size estimation for unrolled loop.		// Returns loop size estimation for unrolled loop.
static uint64_t getUnrolledLoopSize(		static uint64_t getUnrolledLoopSize(
unsigned LoopSize,		unsigned LoopSize,
(230 lines not shown)	#endif
if (UP.Count < 2)		if (UP.Count < 2)
UP.Count = 0;		UP.Count = 0;
return ExplicitUnroll;		return ExplicitUnroll;
}		}

static bool tryToUnrollLoop(Loop *L, DominatorTree &DT, LoopInfo *LI,		static bool tryToUnrollLoop(Loop *L, DominatorTree &DT, LoopInfo *LI,
ScalarEvolution *SE, const TargetTransformInfo &TTI,		ScalarEvolution *SE, const TargetTransformInfo &TTI,
AssumptionCache &AC, OptimizationRemarkEmitter &ORE,		AssumptionCache &AC, OptimizationRemarkEmitter &ORE,
bool PreserveLCSSA,		bool PreserveLCSSA, int OptLevel,
Optional<unsigned> ProvidedCount,		Optional<unsigned> ProvidedCount,
Optional<unsigned> ProvidedThreshold,		Optional<unsigned> ProvidedThreshold,
Optional<bool> ProvidedAllowPartial,		Optional<bool> ProvidedAllowPartial,
Optional<bool> ProvidedRuntime,		Optional<bool> ProvidedRuntime,
Optional<bool> ProvidedUpperBound) {		Optional<bool> ProvidedUpperBound) {
DEBUG(dbgs() << "Loop Unroll: F[" << L->getHeader()->getParent()->getName()		DEBUG(dbgs() << "Loop Unroll: F[" << L->getHeader()->getParent()->getName()
<< "] Loop %" << L->getHeader()->getName() << "\n");		<< "] Loop %" << L->getHeader()->getName() << "\n");
if (HasUnrollDisablePragma(L))		if (HasUnrollDisablePragma(L))
return false;		return false;
if (!L->isLoopSimplifyForm()) {		if (!L->isLoopSimplifyForm()) {
DEBUG(		DEBUG(
dbgs() << " Not unrolling loop which is not in loop-simplify form.\n");		dbgs() << " Not unrolling loop which is not in loop-simplify form.\n");
return false;		return false;
}		}

unsigned NumInlineCandidates;		unsigned NumInlineCandidates;
bool NotDuplicatable;		bool NotDuplicatable;
bool Convergent;		bool Convergent;
TargetTransformInfo::UnrollingPreferences UP = gatherUnrollingPreferences(		TargetTransformInfo::UnrollingPreferences UP = gatherUnrollingPreferences(
L, TTI, ProvidedThreshold, ProvidedCount, ProvidedAllowPartial,		L, TTI, OptLevel, ProvidedThreshold, ProvidedCount, ProvidedAllowPartial,
ProvidedRuntime, ProvidedUpperBound);		ProvidedRuntime, ProvidedUpperBound);
// Exit early if unrolling is disabled.		// Exit early if unrolling is disabled.
if (UP.Threshold == 0 && (!UP.Partial || UP.PartialThreshold == 0))		if (UP.Threshold == 0 && (!UP.Partial || UP.PartialThreshold == 0))
return false;		return false;
unsigned LoopSize = ApproximateLoopSize(		unsigned LoopSize = ApproximateLoopSize(
L, NumInlineCandidates, NotDuplicatable, Convergent, TTI, &AC, UP.BEInsns);		L, NumInlineCandidates, NotDuplicatable, Convergent, TTI, &AC, UP.BEInsns);
DEBUG(dbgs() << " Loop Size = " << LoopSize << "\n");		DEBUG(dbgs() << " Loop Size = " << LoopSize << "\n");
if (NotDuplicatable) {		if (NotDuplicatable) {
(83 lines not shown)	static bool tryToUnrollLoop(Loop *L, DominatorTree &DT, LoopInfo *LI,

return true;		return true;
}		}

namespace {		namespace {
class LoopUnroll : public LoopPass {		class LoopUnroll : public LoopPass {
public:		public:
static char ID; // Pass ID, replacement for typeid		static char ID; // Pass ID, replacement for typeid
LoopUnroll(Optional<unsigned> Threshold = None,		LoopUnroll(int OptLevel = 2, Optional<unsigned> Threshold = None,
Optional<unsigned> Count = None,		Optional<unsigned> Count = None,
Optional<bool> AllowPartial = None, Optional<bool> Runtime = None,		Optional<bool> AllowPartial = None, Optional<bool> Runtime = None,
Optional<bool> UpperBound = None)		Optional<bool> UpperBound = None)
: LoopPass(ID), ProvidedCount(std::move(Count)),		: LoopPass(ID), OptLevel(OptLevel), ProvidedCount(std::move(Count)),
ProvidedThreshold(Threshold), ProvidedAllowPartial(AllowPartial),		ProvidedThreshold(Threshold), ProvidedAllowPartial(AllowPartial),
ProvidedRuntime(Runtime), ProvidedUpperBound(UpperBound) {		ProvidedRuntime(Runtime), ProvidedUpperBound(UpperBound) {
initializeLoopUnrollPass(*PassRegistry::getPassRegistry());		initializeLoopUnrollPass(*PassRegistry::getPassRegistry());
}		}

		int OptLevel;
Optional<unsigned> ProvidedCount;		Optional<unsigned> ProvidedCount;
Optional<unsigned> ProvidedThreshold;		Optional<unsigned> ProvidedThreshold;
Optional<bool> ProvidedAllowPartial;		Optional<bool> ProvidedAllowPartial;
Optional<bool> ProvidedRuntime;		Optional<bool> ProvidedRuntime;
Optional<bool> ProvidedUpperBound;		Optional<bool> ProvidedUpperBound;

bool runOnLoop(Loop *L, LPPassManager &) override {		bool runOnLoop(Loop *L, LPPassManager &) override {
if (skipLoop(L))		if (skipLoop(L))
return false;		return false;

Function &F = *L->getHeader()->getParent();		Function &F = *L->getHeader()->getParent();

auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();		auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
LoopInfo *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();		LoopInfo *LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
ScalarEvolution *SE = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();		ScalarEvolution *SE = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();
const TargetTransformInfo &TTI =		const TargetTransformInfo &TTI =
getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);		getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
auto &AC = getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);		auto &AC = getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
// For the old PM, we can't use OptimizationRemarkEmitter as an analysis		// For the old PM, we can't use OptimizationRemarkEmitter as an analysis
// pass. Function analyses need to be preserved across loop transformations		// pass. Function analyses need to be preserved across loop transformations
// but ORE cannot be preserved (see comment before the pass definition).		// but ORE cannot be preserved (see comment before the pass definition).
OptimizationRemarkEmitter ORE(&F);		OptimizationRemarkEmitter ORE(&F);
bool PreserveLCSSA = mustPreserveAnalysisID(LCSSAID);		bool PreserveLCSSA = mustPreserveAnalysisID(LCSSAID);

return tryToUnrollLoop(L, DT, LI, SE, TTI, AC, ORE, PreserveLCSSA,		return tryToUnrollLoop(L, DT, LI, SE, TTI, AC, ORE, PreserveLCSSA, OptLevel,
ProvidedCount, ProvidedThreshold,		ProvidedCount, ProvidedThreshold,
ProvidedAllowPartial, ProvidedRuntime,		ProvidedAllowPartial, ProvidedRuntime,
ProvidedUpperBound);		ProvidedUpperBound);
}		}

/// This transformation requires natural loop information & requires that		/// This transformation requires natural loop information & requires that
/// loop preheaders be inserted into the CFG...		/// loop preheaders be inserted into the CFG...
///		///
(9 lines not shown)

char LoopUnroll::ID = 0;		char LoopUnroll::ID = 0;
INITIALIZE_PASS_BEGIN(LoopUnroll, "loop-unroll", "Unroll loops", false, false)		INITIALIZE_PASS_BEGIN(LoopUnroll, "loop-unroll", "Unroll loops", false, false)
INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)		INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
INITIALIZE_PASS_DEPENDENCY(LoopPass)		INITIALIZE_PASS_DEPENDENCY(LoopPass)
INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)		INITIALIZE_PASS_DEPENDENCY(TargetTransformInfoWrapperPass)
INITIALIZE_PASS_END(LoopUnroll, "loop-unroll", "Unroll loops", false, false)		INITIALIZE_PASS_END(LoopUnroll, "loop-unroll", "Unroll loops", false, false)

Pass *llvm::createLoopUnrollPass(int Threshold, int Count, int AllowPartial,		Pass *llvm::createLoopUnrollPass(int OptLevel, int Threshold, int Count,
int Runtime, int UpperBound) {		int AllowPartial, int Runtime,
		int UpperBound) {
// TODO: It would make more sense for this function to take the optionals		// TODO: It would make more sense for this function to take the optionals
// directly, but that's dangerous since it would silently break out of tree		// directly, but that's dangerous since it would silently break out of tree
// callers.		// callers.
return new LoopUnroll(Threshold == -1 ? None : Optional<unsigned>(Threshold),		return new LoopUnroll(
		OptLevel, Threshold == -1 ? None : Optional<unsigned>(Threshold),
Count == -1 ? None : Optional<unsigned>(Count),		Count == -1 ? None : Optional<unsigned>(Count),
AllowPartial == -1 ? None		AllowPartial == -1 ? None : Optional<bool>(AllowPartial),
: Optional<bool>(AllowPartial),
Runtime == -1 ? None : Optional<bool>(Runtime),		Runtime == -1 ? None : Optional<bool>(Runtime),
UpperBound == -1 ? None : Optional<bool>(UpperBound));		UpperBound == -1 ? None : Optional<bool>(UpperBound));
}		}

Pass *llvm::createSimpleLoopUnrollPass() {		Pass *llvm::createSimpleLoopUnrollPass(int OptLevel) {
return llvm::createLoopUnrollPass(-1, -1, 0, 0, 0);		return llvm::createLoopUnrollPass(OptLevel, -1, -1, 0, 0, 0);
}		}
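
The -1 sentinel handling in createLoopUnrollPass can be sketched in isolation. This is only an illustration: it assumes C++17 and uses std::optional in place of llvm::Optional, and toOptionalUnsigned and toOptionalBool are hypothetical helper names that do not appear in the patch.

```cpp
#include <optional>

// Maps the legacy "-1 means not provided" convention onto an optional
// unsigned value (used for Threshold and Count in the diff).
std::optional<unsigned> toOptionalUnsigned(int V) {
  if (V == -1)
    return std::nullopt;
  return static_cast<unsigned>(V);
}

// Same convention for the boolean-like parameters (AllowPartial, Runtime,
// UpperBound): -1 is "not provided", 0 is false, anything else is true.
std::optional<bool> toOptionalBool(int V) {
  if (V == -1)
    return std::nullopt;
  return V != 0;
}
```

Under this mapping, the arguments (-1, -1, 0, 0, 0) that createSimpleLoopUnrollPass passes after OptLevel leave Threshold and Count unset while explicitly disabling partial, runtime and upper-bound unrolling.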

PreservedAnalyses LoopUnrollPass::run(Loop &L, LoopAnalysisManager &AM,		PreservedAnalyses LoopUnrollPass::run(Loop &L, LoopAnalysisManager &AM,
LoopStandardAnalysisResults &AR,		LoopStandardAnalysisResults &AR,
LPMUpdater &Updater) {		LPMUpdater &Updater) {
const auto &FAM =		const auto &FAM =
AM.getResult<FunctionAnalysisManagerLoopProxy>(L, AR).getManager();		AM.getResult<FunctionAnalysisManagerLoopProxy>(L, AR).getManager();
Function *F = L.getHeader()->getParent();		Function *F = L.getHeader()->getParent();
(15 lines not shown)	PreservedAnalyses LoopUnrollPass::run(Loop &L, LoopAnalysisManager &AM,

// The API here is quite complex to call, but there are only two interesting		// The API here is quite complex to call, but there are only two interesting
// states we support: partial and full (or "simple") unrolling. However, to		// states we support: partial and full (or "simple") unrolling. However, to
// enable these things we actually pass "None" in for the optional to avoid		// enable these things we actually pass "None" in for the optional to avoid
// providing an explicit choice.		// providing an explicit choice.
Optional<bool> AllowPartialParam, RuntimeParam, UpperBoundParam;		Optional<bool> AllowPartialParam, RuntimeParam, UpperBoundParam;
if (!AllowPartialUnrolling)		if (!AllowPartialUnrolling)
AllowPartialParam = RuntimeParam = UpperBoundParam = false;		AllowPartialParam = RuntimeParam = UpperBoundParam = false;
bool Changed = tryToUnrollLoop(&L, AR.DT, &AR.LI, &AR.SE, AR.TTI, AR.AC, *ORE,		bool Changed = tryToUnrollLoop(
/*PreserveLCSSA*/ true, /*Count*/ None,		&L, AR.DT, &AR.LI, &AR.SE, AR.TTI, AR.AC, *ORE,
/*Threshold*/ None, AllowPartialParam,		/*PreserveLCSSA*/ true, OptLevel, /*Count*/ None,
RuntimeParam, UpperBoundParam);		/*Threshold*/ None, AllowPartialParam, RuntimeParam, UpperBoundParam);
if (!Changed)		if (!Changed)
return PreservedAnalyses::all();		return PreservedAnalyses::all();

// The parent must not be damaged by unrolling!		// The parent must not be damaged by unrolling!
#ifndef NDEBUG		#ifndef NDEBUG
if (ParentL)		if (ParentL)
ParentL->verifyLoop();		ParentL->verifyLoop();
#endif		#endif
(53 lines not shown)
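
To make the effect of the new default concrete, the threshold selection and the boosting computation from this file can be sketched as a standalone fragment. The struct below is a simplified stand-in for the pass's EstimatedUnrollCost, defaultFullUnrollThreshold and boostedThreshold are hypothetical helper names, and boostedThreshold assumes the boost is applied as Threshold * Boost / 100, which is not part of the excerpt shown above.

```cpp
#include <algorithm>
#include <climits>

// Simplified stand-in for the pass's cost estimate; only the two fields
// used by the boosting computation are modelled.
struct EstimatedUnrollCost {
  unsigned UnrolledCost;
  unsigned RolledDynamicCost;
};

// Default full-unroll threshold as selected in this diff:
// 300 when OptLevel > 2, otherwise 150.
unsigned defaultFullUnrollThreshold(int OptLevel) {
  return OptLevel > 2 ? 300u : 150u;
}

// Mirrors the arithmetic of getFullUnrollBoostingFactor in the diff.
// The result is a percentage, capped at MaxPercentThresholdBoost.
unsigned fullUnrollBoostingFactor(const EstimatedUnrollCost &Cost,
                                  unsigned MaxPercentThresholdBoost) {
  // Guard against overflow in the multiplication below.
  if (Cost.RolledDynamicCost >= UINT_MAX / 100)
    return 100;
  if (Cost.UnrolledCost != 0)
    return std::min(100 * Cost.RolledDynamicCost / Cost.UnrolledCost,
                    MaxPercentThresholdBoost);
  return MaxPercentThresholdBoost;
}

// Assumption: the effective threshold is Threshold scaled by the boost
// percentage.
unsigned boostedThreshold(unsigned Threshold, unsigned BoostPercent) {
  return Threshold * BoostPercent / 100;
}
```

For example, with UnrolledCost = 100 and RolledDynamicCost = 250 the boost is 250 percent, so under the stated assumption the effective threshold would be 750 at -O3 and 375 at -O2.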

test/Transforms/LoopVectorize/X86/metadata-enable.ll

	; RUN: opt < %s -mcpu=corei7 -O1 -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=O1			; RUN: opt < %s -mcpu=corei7 -O1 -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=O1
	; RUN: opt < %s -mcpu=corei7 -O2 -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=O2			; RUN: opt < %s -mcpu=corei7 -O2 -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=O2
	; RUN: opt < %s -mcpu=corei7 -O3 -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=O3			; RUN: opt < %s -mcpu=corei7 -O3 -S -unroll-threshold=150 -unroll-allow-partial=0 | FileCheck %s --check-prefix=O3
				; RUN: opt < %s -mcpu=corei7 -O3 -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=O3DEFAULT
	; RUN: opt < %s -mcpu=corei7 -Os -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=Os			; RUN: opt < %s -mcpu=corei7 -Os -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=Os
	; RUN: opt < %s -mcpu=corei7 -Oz -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=Oz			; RUN: opt < %s -mcpu=corei7 -Oz -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=Oz
	; RUN: opt < %s -mcpu=corei7 -O1 -vectorize-loops -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=O1VEC			; RUN: opt < %s -mcpu=corei7 -O1 -vectorize-loops -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=O1VEC
	; RUN: opt < %s -mcpu=corei7 -Oz -vectorize-loops -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=OzVEC			; RUN: opt < %s -mcpu=corei7 -Oz -vectorize-loops -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=OzVEC
	; RUN: opt < %s -mcpu=corei7 -O1 -loop-vectorize -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=O1VEC2			; RUN: opt < %s -mcpu=corei7 -O1 -loop-vectorize -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=O1VEC2
	; RUN: opt < %s -mcpu=corei7 -Oz -loop-vectorize -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=OzVEC2			; RUN: opt < %s -mcpu=corei7 -Oz -loop-vectorize -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=OzVEC2
	; RUN: opt < %s -mcpu=corei7 -O3 -disable-loop-vectorization -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=O3DIS			; RUN: opt < %s -mcpu=corei7 -O3 -unroll-threshold=150 -disable-loop-vectorization -S -unroll-allow-partial=0 | FileCheck %s --check-prefix=O3DIS

	; This file tests the llvm.loop.vectorize.enable metadata forcing			; This file tests the llvm.loop.vectorize.enable metadata forcing
	; vectorization even when optimization levels are too low, or when			; vectorization even when optimization levels are too low, or when
	; vectorization is disabled.			; vectorization is disabled.

	target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"			target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
	target triple = "x86_64-unknown-linux-gnu"			target triple = "x86_64-unknown-linux-gnu"

	; O1-LABEL: @enabled(			; O1-LABEL: @enabled(
	; O1: store <4 x i32>			; O1: store <4 x i32>
	; O1: ret i32			; O1: ret i32
	; O2-LABEL: @enabled(			; O2-LABEL: @enabled(
	; O2: store <4 x i32>			; O2: store <4 x i32>
	; O2: ret i32			; O2: ret i32
	; O3-LABEL: @enabled(			; O3-LABEL: @enabled(
	; O3: store <4 x i32>			; O3: store <4 x i32>
	; O3: ret i32			; O3: ret i32
				; O3DEFAULT-LABEL: @enabled(
				; O3DEFAULT: store <4 x i32>
				; O3DEFAULT: ret i32
	; Pragma always wins!			; Pragma always wins!
	; O3DIS-LABEL: @enabled(			; O3DIS-LABEL: @enabled(
	; O3DIS: store <4 x i32>			; O3DIS: store <4 x i32>
	; O3DIS: ret i32			; O3DIS: ret i32
	; Os-LABEL: @enabled(			; Os-LABEL: @enabled(
	; Os: store <4 x i32>			; Os: store <4 x i32>
	; Os: ret i32			; Os: ret i32
	; Oz-LABEL: @enabled(			; Oz-LABEL: @enabled(
	[36 unchanged lines not shown]
	; O1-NOT: store <4 x i32>			; O1-NOT: store <4 x i32>
	; O1: ret i32			; O1: ret i32
	; O2-LABEL: @nopragma(			; O2-LABEL: @nopragma(
	; O2: store <4 x i32>			; O2: store <4 x i32>
	; O2: ret i32			; O2: ret i32
	; O3-LABEL: @nopragma(			; O3-LABEL: @nopragma(
	; O3: store <4 x i32>			; O3: store <4 x i32>
	; O3: ret i32			; O3: ret i32
				; O3DEFAULT-LABEL: @nopragma(
				; O3DEFAULT: store <4 x i32>
				; O3DEFAULT: ret i32
	; O3DIS-LABEL: @nopragma(			; O3DIS-LABEL: @nopragma(
	; O3DIS-NOT: store <4 x i32>			; O3DIS-NOT: store <4 x i32>
	; O3DIS: ret i32			; O3DIS: ret i32
	; Os-LABEL: @nopragma(			; Os-LABEL: @nopragma(
	; Os: store <4 x i32>			; Os: store <4 x i32>
	; Os: ret i32			; Os: ret i32
	; Oz-LABEL: @nopragma(			; Oz-LABEL: @nopragma(
	; Oz-NOT: store <4 x i32>			; Oz-NOT: store <4 x i32>
	[35 unchanged lines not shown]
	; O1-NOT: store <4 x i32>			; O1-NOT: store <4 x i32>
	; O1: ret i32			; O1: ret i32
	; O2-LABEL: @disabled(			; O2-LABEL: @disabled(
	; O2-NOT: store <4 x i32>			; O2-NOT: store <4 x i32>
	; O2: ret i32			; O2: ret i32
	; O3-LABEL: @disabled(			; O3-LABEL: @disabled(
	; O3-NOT: store <4 x i32>			; O3-NOT: store <4 x i32>
	; O3: ret i32			; O3: ret i32
				; O3DEFAULT-LABEL: @disabled(
				; O3DEFAULT: store <4 x i32>
				; O3DEFAULT: ret i32
	; O3DIS-LABEL: @disabled(			; O3DIS-LABEL: @disabled(
	; O3DIS-NOT: store <4 x i32>			; O3DIS-NOT: store <4 x i32>
	; O3DIS: ret i32			; O3DIS: ret i32
	; Os-LABEL: @disabled(			; Os-LABEL: @disabled(
	; Os-NOT: store <4 x i32>			; Os-NOT: store <4 x i32>
	; Os: ret i32			; Os: ret i32
	; Oz-LABEL: @disabled(			; Oz-LABEL: @disabled(
	; Oz-NOT: store <4 x i32>			; Oz-NOT: store <4 x i32>
	[38 unchanged lines not shown]


Revision Contents

Diff 89008

include/llvm/Transforms/Scalar.h

include/llvm/Transforms/Scalar/LoopUnrollPass.h

lib/Passes/PassBuilder.cpp

lib/Transforms/IPO/PassManagerBuilder.cpp

lib/Transforms/Scalar/LoopUnrollPass.cpp

test/Transforms/LoopVectorize/X86/metadata-enable.ll
