This is an archive of the discontinued LLVM Phabricator instance.

[runtime-unroll] Remove restriction about unrolling multiple exit loops
AbandonedPublic

Authored by reames on Nov 15 2021, 1:54 PM.

Details

Summary

This change simply removes entirely a restriction in the runtime unrolling implementation which prevents us from unrolling loops with multiple exits in anything except the narrowest circumstances.

For context here, I think its important to note that this code originally didn't handle unrolling multiple exit loops (at all, it would crash), and then we selectively added one case with known motivation, and then iteratively widened the cases which could be done under force flags. At this point, there's no known correctness limitations left. I highlight this history mostly because it's tempting to think the existing heuristic is well justified, and frankly, it wasn't. It was simply us being conservative along the way.

For additional context, the unroll pass will do it's own cost model before invoking this code at all. That cost model is fairly coarse; it's entirely code size based. With the code here, we effectively had a two level cost model.

With all the context, why do I believe we should enable this? Because we're already runtime unrolling cases which are directly analogous to the multiple exit edge case. At the simplest, consider a C loop where all break statements are replaced with continues. (Ignore the fact this might not compute the same result.) We'd unroll the continue form, but not the break form.

Diff Detail

Event Timeline

reames created this revision.Nov 15 2021, 1:54 PM
reames requested review of this revision.Nov 15 2021, 1:54 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 15 2021, 1:54 PM

This looks reasonable to me on a conceptual level -- we have recently removed multi-exit restrictions for non-runtime unrolling as well.

Of course, this does have the potential to further exacerbate our problem with way too aggressive unrolling.

llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
418

The UnrollRuntimeMultiExit and UnrollRuntimeOtherExitPredictable options are now dead.

reames added inline comments.Nov 16 2021, 11:42 AM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
418

Er, yeah. I thought about removing them, but apparently didn't actually do so. Oops. :)

Once we have some sanity check benchmark scores, I'll update to remove these in the patch.

This looks reasonable to me on a conceptual level -- we have recently removed multi-exit restrictions for non-runtime unrolling as well.

Of course, this does have the potential to further exacerbate our problem with way too aggressive unrolling.

+1 to that, but i think unrolling is generally a good thing that we *should* be doing.

anna edited reviewers, added: anna; removed: annatom.Nov 23 2021, 6:47 AM
anna added a comment.Nov 23 2021, 7:02 AM

This change looks good to me and thank you for taking this through!

Philip, I think it is worth mentioning in the description for folks benchmarking this patch, that they can just work on ToT and compare baseline against -unroll-runtime-multi-exit=true (this is the option we are using for our benchmarking measurements).

llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
418

Maybe we should keep the UnrollRuntimeMultiExit option for a week or so (after the change lands), to facilitate triaging of functional/performance regressions.
As mentioned in the description, the original case supported is very narrow - the single exit block other than the latchExit is a deopt block.
So UnrollRuntimeMultiExit specified through command line with 0 or 1, will turn ON or OFF the option without profitability considerations.

Ran RawSpeed benchmark, i think i really like this, it appears to be a geomean -0.75% run time improvement.


Ran the test suite. Surprisingly, this appears to be nearly performance neutral.

tot-results  ...        diff

count 592.000000 ... 591.000000
mean 1472.037113 ... 0.000075
std 13074.744904 ... 0.021348
min 0.603100 ... -0.225786
25% 2.796850 ... -0.000778
50% 101.952583 ... 0.000079
75% 471.058909 ... 0.001108
max 217582.535333 ... 0.208422

I spot checked a couple of the worst regressions. The assembly is unchanged, and the test is simply noisy as all heck.

On compile time, I would say the results are inconclusive, but don't show any major slowdown.

tot-results  ...        diff

count 153.000000 ... 153.000000
mean 7.498642 ... 0.004975
std 11.700351 ... 0.035278
min 0.625100 ... -0.069157
25% 1.598500 ... -0.010616
50% 4.031300 ... 0.002407
75% 6.612500 ... 0.017538
max 92.241100 ... 0.202831

I will comment that I don't have much experience running test-suite, and ran this on an AWS instance. There's definitely room for error in the run configuration to dominate measurements. :)

Whitney added inline comments.Nov 24 2021, 8:48 AM
llvm/lib/Transforms/Utils/LoopUnrollRuntime.cpp
411
reames updated this revision to Diff 389579.Nov 24 2021, 11:33 AM
reames edited the summary of this revision. (Show Details)

Address a couple of minor review comments.

nikic added a comment.Nov 28 2021, 1:42 PM

As suspected, this has significant impact on code size even on CTMark: https://llvm-compile-time-tracker.com/compare.php?from=3608e18a946e77a474a468304b6c3904c55dbce0&to=71597b63b6b5e2c180f8417a7809d94637209c72&stat=size-text sqlite3 regresses by more than 2% and and SPASS with ThinLTO regresses by more than 3%. This is very likely not a good optimization tradeoff, especially for non-benchmarky applications (aka most of the code out there). I think we'll have to improve our heuristics or cost modelling if we want to enable this unconditionally.

There is compile-time impact as well, but I think it is directly related to the large code size increases: https://llvm-compile-time-tracker.com/compare.php?from=3608e18a946e77a474a468304b6c3904c55dbce0&to=71597b63b6b5e2c180f8417a7809d94637209c72&stat=instructions

bmahjour added a project: Restricted Project.Dec 1 2021, 8:23 AM
reames added a comment.Dec 2 2021, 9:16 AM

Just so folks know what to expect here, let me briefly summarize.

At this point, we have enough cases reported that regress than in order to drive this forward, I'd likely need to do some cost modeling changes. Heuristic changes are famously delicate and expensive time wise. Its really hard for me to justify that time as things stand.

I'm currently waiting on a couple of examples from my client where this causes regressions on internal workloads. I'm going to take a look at them to see if anything glaring falls out, but we're expecting the answer to simply be that we unrolled an unprofitable case. Given these couple of regressions seem to be exceptions, I'm probably going to end up recommending my client enable this downstream and close the upstream review.

In the meantime, if anyone wants to send me IR for an "interesting" regression, I'm happy to take a look at other sources to see if ideas fall out. I don't plan to proactively investigate test-suite results just because I've found in the past that the investigation cost is high, and the expected value in this case is quite low.

Analyzing regressions did turn up one interesting issue. After this change, we were aggressively unrolling short multiple exit loops because of a restriction in how we computed estimate trip counts. D115362 fixes that, and might help any profiled workload.

reames abandoned this revision.Feb 12 2022, 8:24 AM

I'm going to abandon this patch.

Finally got another set of compile time and performance data following the small loops change above, and the results were not particularly compelling. Performance didn't really move in any interesting way, and the compile time across the suite degraded by a couple of percent. I suspect there are things we could do about compile time, but given the lack of compelling performance gains, I lack the motivation to push this.

For anyone interested in following up, here are some ideas on possible compile time wins:

  • Prune loop exits during unrolling. Check that D116496, and D114039 have landed, then go looking for other examples. In particular, the case where an exit becomes unreachable is slightly hard, but possibly worthwhile.
  • Do incremental peephole during unrolling. As a degenerate example, an increment unrolled one thousand times and used exactly once at the end of the chain currently results in one thousand scev constructions. This is a quite noticeable waste.
  • Beyond peephole, both CSE and DSE are possible to do incrementally during unrolling. These will both cut down on memory traffic and IR size.
  • Revise the cost model to account for *benefit* from unrolling, not just *cost*. This is a much deeper change, but looks increasing necessary.