This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroll] Use the upper bound of the loop trip count to completely unroll loops
ClosedPublic

Authored by haicheng on Sep 20 2016, 9:52 PM.

Details

Summary

This patch tries to fully unroll loops having break statement like this

for (int i = 0; i < 8; i++) {                                                                     
    if (a[i] == value) {                                                                               
        found = true; 
        break;
    }                                                                                             
}

GCC can fully unroll such loops, but currently LLVM cannot because LLVM only supports loops having exact constant trip counts.

The upper bound of the trip count can be obtained from calling ScalarEvolution::getMaxBackedgeTakenCount(). Part of the patch is the refactoring work in SCEV to prevent duplicating code.

The feature of using the upper bound is enabled under the same circumstance when runtime unrolling is enabled since both are used to unroll loops without knowing the exact constant trip count.

The modified test/CodeGen/AMDGPU/tti-unroll-prefs.ll can be used as the test case of this patch.

Diff Detail

Repository
rL LLVM

Event Timeline

haicheng updated this revision to Diff 72006.Sep 20 2016, 9:52 PM
haicheng retitled this revision from to [LoopUnroll] Use the upper bound of the loop trip count to completely unroll loops.
haicheng updated this object.
haicheng set the repository for this revision to rL LLVM.
haicheng added a subscriber: llvm-commits.
mzolotukhin edited edge metadata.Sep 22 2016, 4:17 PM

Hi,

This makes total sense to me. But I think there is no need in introducing yet another knob in loop-unroller: have you measured the effect of the change on benchmarks (LLVM testsuite/SPEC/others)? I think it should be correct to unroll a loop with a small trip count even if it's not exactly known: the effect of this transformation would be not worse then from unrolling a similar loop with a constant, but equal to upper bound, trip-count. The benchmark results would help confirming or rejecting this assumption.

Thanks,
Michael

lib/Transforms/Scalar/LoopUnrollPass.cpp
934 ↗(On Diff #72006)

Unnecessary change.

ABataev added inline comments.
lib/Analysis/ScalarEvolution.cpp
5349 ↗(On Diff #72006)

const auto *MaxExitCount

lib/Transforms/Utils/LoopUnroll.cpp
543 ↗(On Diff #72006)

Maybe just NeedConditional = (UseUpperBound && j)?

haicheng marked 3 inline comments as done.Sep 23 2016, 8:58 AM

Hi,

This makes total sense to me. But I think there is no need in introducing yet another knob in loop-unroller: have you measured the effect of the change on benchmarks (LLVM testsuite/SPEC/others)? I think it should be correct to unroll a loop with a small trip count even if it's not exactly known: the effect of this transformation would be not worse then from unrolling a similar loop with a constant, but equal to upper bound, trip-count. The benchmark results would help confirming or rejecting this assumption.

Thanks,
Michael

Thank you for reviewing my patch, Michael.

My initial design was to completely unroll all loops having small trip count upper bounds whenever the exact trip counts are unknown, but I saw several regressions in spec200x and internal benchmarks (e.g. spec2000/eon -1.8%, spec2000/gcc -1.2%) running on a AArch64 device. One of the major reason was that I unrolled many loops with calls. As you may already know, the cost model of call is not that awesome. BasicTTIImplBase::getUnrollingPreferences() can help me check the call IRs in the loop and I need a boolean anyway to pass the check result so that I create a new entry in UnrollingPreferences.

If using exact trip count to unroll, the unrolled loop usually becomes a giant basic block which is preferable. However, if using the upper bound to unroll, the unrolled loop usually become a sequence of small basic blocks because it is not safe to merge loop blocks belonging to different iterations. Some of these blocks may not be executed during runtime. This is another reason that I think we may need to be more conservative to use upper bound to unroll loops.

I tried several different configurations and the patch I uploaded was the best I found. No noticeable regressions in the benchmarks I tested and there are several small wins in spec200x (e.g. spec2000/perlbmk +0.8%, spec2006/xalancbmk +0.7%) and many larger wins in our internal benchmarks which are much larger than spec2006.

Haicheng

haicheng updated this revision to Diff 72293.Sep 23 2016, 9:03 AM
haicheng edited edge metadata.

Address the comments from Michael and Alexey. Thank you very much.

Hi Haicheng,

Thanks for working on this, please find my answers below and some more remarks/nit-picks inline.

One of the major reason was that I unrolled many loops with calls. As you may already know, the cost model of call is not that awesome.

Maybe we just fix the cost model for calls instead :-) But I know, it might not be actually possible at all.

If using exact trip count to unroll, the unrolled loop usually becomes a giant basic block which is preferable. However, if using the upper bound to unroll, the unrolled loop usually become a sequence of small basic blocks because it is not safe to merge loop blocks belonging to different iterations. Some of these blocks may not be executed during runtime. This is another reason that I think we may need to be more conservative to use upper bound to unroll loops.

I see, this makes perfect sense to me. Indeed, having separate thresholds might be reasonable.

I tried several different configurations and the patch I uploaded was the best I found.

Have you tested it on any other architecture except AArch64?

Thanks,
Michael

lib/Analysis/ScalarEvolution.cpp
5316 ↗(On Diff #72293)

The first letter shouldn't be capitalized. I think the routine name also might be improved, but I don't have any better suggestions at the moment.

lib/Transforms/Scalar/LoopUnrollPass.cpp
766 ↗(On Diff #72293)

s/make/makes/

767 ↗(On Diff #72293)

s/staticaly/statically/

771 ↗(On Diff #72293)

This gets confusing. Can we somehow rename these variables so that it's clearer what's the difference between them?

779 ↗(On Diff #72293)

What if TripCount matches MaxTripCount? Will we think that we're using upper bound instead of the exact trip count in this case?

1007 ↗(On Diff #72293)

I'm not a big fan of reusing this variable. Initially it was supposed to come from UP and show if we're allowed to use upper-bound instead of exactly-known trip-count, but now we're also using it to communicate between various routines (and the interface was not specified). Can it be refactored somehow?

lib/Transforms/Utils/LoopUnroll.cpp
206 ↗(On Diff #72293)

Please add a comment about UseUpperBound. Actually, I don't think that's a good name for this argument, because this functions doesn't use upper bound for anything. What it needs is a flag indicating that conditional branches must be preserved - I'd suggest to reflect that in the argument name.

543 ↗(On Diff #72293)

Is it intentional that we can make NeedConditional true after it was set to false before that? From semantics it looks like this never happens (in this case we can assert it), but I'd like to make sure it was not overlooked. Did you intend to replace this if with else if as well?

test/CodeGen/AMDGPU/tti-unroll-prefs.ll
14–16 ↗(On Diff #72293)

We could've just passed -unroll-threshold to overcome this. It might make sense to do it even now, so that the test doesn't break if UnrollPreferences are changed.

haicheng updated this revision to Diff 73277.Oct 3 2016, 7:34 AM
haicheng marked 6 inline comments as done.

Address Michael's comments. Thank you.

haicheng added inline comments.Oct 3 2016, 7:58 AM
lib/Transforms/Scalar/LoopUnrollPass.cpp
779 ↗(On Diff #72293)

TripCount and MaxTripCount cannot both be non zero because MaxTripCount is computed only if TripCount is Zero. If one is non zero, the other one must be zero. If they are both zero, FullUnrollTripCount is zero and then we cannot enter here.

lib/Transforms/Utils/LoopUnroll.cpp
543 ↗(On Diff #72293)

I intended to replace if with else if. Using complete unroll should not enter the else if part. I think the change improves the readability.

test/CodeGen/AMDGPU/tti-unroll-prefs.ll
14–16 ↗(On Diff #72293)

This loop has a break statement so that we cannot get the exact trip cont. I think we cannot fully unroll the loop without my change. Passing -unroll-threshold cannot overcome this.

haicheng added inline comments.Oct 3 2016, 9:51 AM
lib/Transforms/Scalar/LoopUnrollPass.cpp
779 ↗(On Diff #72293)

I also added a sentence in the comment to describe this.

lib/Transforms/Utils/LoopUnroll.cpp
543 ↗(On Diff #72293)

Without the change of else if, the logic of my patch is also wrong.

Hi Michael,

Please see my inlined response. Thank you.

Hi Haicheng,

Thanks for working on this, please find my answers below and some more remarks/nit-picks inline.

One of the major reason was that I unrolled many loops with calls. As you may already know, the cost model of call is not that awesome.

Maybe we just fix the cost model for calls instead :-) But I know, it might not be actually possible at all.

It should be fixed, but I have no clue how to do that now. Can I leave it for the future?

If using exact trip count to unroll, the unrolled loop usually becomes a giant basic block which is preferable. However, if using the upper bound to unroll, the unrolled loop usually become a sequence of small basic blocks because it is not safe to merge loop blocks belonging to different iterations. Some of these blocks may not be executed during runtime. This is another reason that I think we may need to be more conservative to use upper bound to unroll loops.

I see, this makes perfect sense to me. Indeed, having separate thresholds might be reasonable.

I tried threshold 100 for using upper bound to unroll instead of the current default threshold 150, but I saw some regressions.

I have some other random thoughts about unroll threshold that not directly related to this patch and not tested with any benchmarks yet. I think we may want to encourage the unroller to unroll loops with larger trip count to reduce more loop overhead. For example, if we have a loop whose size is 75 and can be unrolled twice and we have another loop whose size is 21 but can be unrolled 8 times, we may prefer unrolling the latter more. It occurs to me because I noticed that GCC unrolls more small loops with high trip count than LLVM does and LLVM unrolls more large loops with small trip count than GCC does. Maybe we can use some equations like 100+10*trip_count to calculate the threshold. What do you think?

I tried several different configurations and the patch I uploaded was the best I found.

Have you tested it on any other architecture except AArch64?

I ran spec2000/2006 on x86 last week, no noticeable regressions. Some small improvements: spec2000/bzip2 +0.7, spec2006/dealII +2.1% gcc +0.8%.

Thanks,
Michael

Hi Michael,

Would you please take another look at my modified patch? Thank you in advance.

Haicheng

Hi Haicheng,

Please find a couple of minor comments inline.

Thanks,
Michael

lib/Transforms/Scalar/LoopUnrollPass.cpp
779 ↗(On Diff #72293)

Could you add an assert for this please?

95–102 ↗(On Diff #73277)

Do we really need two options here? Can the first one be replaced with (UnrollMaxUpperBound == 0)?

haicheng updated this revision to Diff 74193.Oct 10 2016, 6:19 PM
haicheng edited edge metadata.
haicheng marked 5 inline comments as done.

Address Michael's comments. Thank you.

mzolotukhin accepted this revision.Oct 10 2016, 6:38 PM
mzolotukhin edited edge metadata.

Hi Haicheng,

The patch looks mostly good to me modulo a couple of small remarks, please feel free to commit once they are addressed.

Thanks,
Michael

lib/Transforms/Scalar/LoopUnrollPass.cpp
768–769 ↗(On Diff #74193)

The assert might fail to catch a bug if the multiplication result overflows and wraps around to 0.

test/Transforms/LoopUnroll/AArch64/full-unroll-trip-count-upper-bound.ll
1–2 ↗(On Diff #74193)

Shouldn't we have -unroll-upper here? Actually, I think we need to check that a) a loop is not unrolled without this flag, and b) it's unrolled with it.

Also, what happened to the changes in the other (existing) test?

This revision is now accepted and ready to land.Oct 10 2016, 6:38 PM
haicheng updated this revision to Diff 74259.Oct 11 2016, 8:28 AM
haicheng edited edge metadata.
haicheng marked an inline comment as done.

Addressed more comments from Michael. I will commit tomorrow morning if no new comments coming today.

This revision was automatically updated to reflect the committed changes.
evstupac added inline comments.
llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
1027

Sorry for comment after approve.
What I see here is that TripCount could implicitly (as TripCount was passed to computeUnrollCount as a reference) become MaxTripCount. Here it is passed to UnrollLoop as TripCount.
Could you please add a comment to UnrollLoop that TripCount parameter is not always a real loop trip count, but sometimes MaxTripCount. Or rename TripCount to MaxTripCount in UnrollLoop parameters. There could be future errors if someone assumes that TripCount is exact loop trip count.

haicheng added inline comments.Nov 14 2016, 9:49 PM
llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
1027

I will modify the comments in a separate patch to make it clearer. However, you may need to know that even the trip counter is calculated by SCEV::getSmallConstantTripCount(), it may not be the exact times the loop header get executed. Please see the comments before llvm::UnrollLoop() in LoopUnroll.cpp.