This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroll] Discount uniform instructions in cost models
AbandonedPublic

Authored by reames on Nov 14 2020, 10:09 AM.

Details

Summary

This patch updates the cost models for loop unrolling to discount the cost of a loop invariant expression on all but one iteration. The reasoning here is that such an expression (as determined by SCEV) will be CSEd once the loop is unrolled. Note that SCEVs reasoning will find things which can be invariant, not simply those outside the loop.

Note that unrolling currently has two cost models. One very very simple for partial and runtime unrolling, one much more complicated for full unrolling. The second was already handled, this case is the former.

Diff Detail

Event Timeline

reames created this revision.Nov 14 2020, 10:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 14 2020, 10:09 AM
reames requested review of this revision.Nov 14 2020, 10:09 AM
fhahn added a subscriber: fhahn.Nov 14 2020, 11:56 AM

Reviewers, I am not entirely sure this is the right approach. The obvious alternative is to generalize the full unrolling cost model to handle partial unrolling. As detailed below, the alternative looks non-trivial, and I decided against it. I do see the argument for that being the right approach though, so if you want to just reject this patch outright, I won't argue.

Agreed, this is going to be tricky, as the full unrolling cost-model tries hard to estimate what can be eliminated through full unrolling, which would need a lot of adjustments for partial unrolling. Moving forward with improving them separately seems reasonable to me for now.

llvm/lib/Analysis/LoopUnrollAnalyzer.cpp
40 ↗(On Diff #305320)

Unless I am missing something, It seems like there's a test case missing for this change? Would it be possible to add one?

This could also be split into 2 separate patches, to make it slightly easier to pin-point potential regressions, but I am not sure if that's worth it.

llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
794

LoopSizeEstimator iterates over the whole loop body on construction, but it might not be used. Would it be possible to change it to compute the size on demand and cache it for further accesses?

xbolva00 added inline comments.
llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
745

Typo

reames added inline comments.Nov 14 2020, 8:16 PM
llvm/lib/Analysis/LoopUnrollAnalyzer.cpp
40 ↗(On Diff #305320)

Yep, I missed a test, thanks for catching.

I wanted them grouped since the full unroll does use the estimator along one path. It felt "odd" to have some looks fully unrolled on uniform, and some not. No strong preference though.

llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
745

Will fix, thanks.

794

Can do. Need to do that for the follow on, I just tried to keep this one simple.

reames updated this revision to Diff 306946.Nov 22 2020, 4:21 PM

Address review comments

This looks sensible to me and a step in the right direction. Have you ran any benchmarks? Would be good to know, at least, that this isn't causing any obvious regressions.

This looks sensible to me and a step in the right direction. Have you ran any benchmarks? Would be good to know, at least, that this isn't causing any obvious regressions.

I haven't. I'm not particularly setup to do so easily either.

fhahn added a comment.Nov 23 2020, 1:24 PM

This looks sensible to me and a step in the right direction. Have you ran any benchmarks? Would be good to know, at least, that this isn't causing any obvious regressions.

I haven't. I'm not particularly setup to do so easily either.

It might be helpful to build llvm-test-suite and check the impact on the number of unrolled loops using statistics (-DTEST_SUITE_COLLECT_STATS)? That should give a rough idea on how broad the impact is going to be, without having actual HW for benchmarking.

This looks sensible to me and a step in the right direction. Have you ran any benchmarks? Would be good to know, at least, that this isn't causing any obvious regressions.

I haven't. I'm not particularly setup to do so easily either.

It might be helpful to build llvm-test-suite and check the impact on the number of unrolled loops using statistics (-DTEST_SUITE_COLLECT_STATS)? That should give a rough idea on how broad the impact is going to be, without having actual HW for benchmarking.

This seems like overkill. We have no reason to expect the impact from this to be high or negative for that matter. I would strongly prefer simply landing and watching the bots.

Whitney added inline comments.Dec 1 2020, 5:12 AM
llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
776

Do we need a test case for this case where the loop is not unrolled because the entire loop body is uniform?

reames added inline comments.Dec 1 2020, 9:08 AM
llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
776

Any suggestions on how to write one? This was mostly defensive programming. Every loop structure I can think of is either a) infinite, or b) involves at least two instructions. Given BEInst is configurable (why?) the later isn't quite enough to fully disallow the case.

Whitney added inline comments.Dec 1 2020, 9:24 AM
llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
776

I was having problem picturing this scenario, that's why was thinking a test case would help.
BEInsns is configureable due to it is compared with results from TTI?

792

Given that unsigned LoopSize = Metrics.NumInsts;, and NumInsts += TTI.getUserCost(&I, TargetTransformInfo::TCK_CodeSize);, should we increment Count by TTI.getUserCost(&I, TargetTransformInfo::TCK_CodeSize); instead of 1?

reames added inline comments.Dec 1 2020, 1:35 PM
llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
792

Yes, yes we should. Thank you for catching this!

reames planned changes to this revision.Dec 3 2020, 2:31 PM
reames updated this revision to Diff 309384.Dec 3 2020, 3:21 PM

Address review comment on costing.

I didn't add a separate test, mostly because it's not clear to me which cases the cost model returns a interesting result. If anyone has a suggestion, happy to add.

reames planned changes to this revision.May 14 2021, 8:06 AM

Appears to have bit rotten while waiting for review.

reames updated this revision to Diff 345597.May 14 2021, 7:08 PM

Refresh - still missing a dedicated test, expect that tomorrow.

reames updated this revision to Diff 345715.May 16 2021, 10:23 AM

Add a test to demonstrate effect.

reames edited the summary of this revision. (Show Details)May 17 2021, 3:12 PM

ping

fhahn accepted this revision.May 26 2021, 3:06 PM

LGTM, thanks!

This revision is now accepted and ready to land.May 26 2021, 3:06 PM
nikic added a comment.May 26 2021, 3:28 PM

The reasoning here is that such an expression (as determined by SCEV) will be CSEd once the loop is unrolled.

Is that really the case? I don't think unrolling itself does any CSE, and runtime unrolling is run at the very end of the pipeline, so we don't have any EarlyCSE or GVN passes after it.

Also, the converse question: Why have invariant instructions not been hoisted by LICM before unrolling is reached?

The reasoning here is that such an expression (as determined by SCEV) will be CSEd once the loop is unrolled.

Is that really the case? I don't think unrolling itself does any CSE, and runtime unrolling is run at the very end of the pipeline, so we don't have any EarlyCSE or GVN passes after it.

If this is a problem in practice, we can add simple CSE to the unroll transform.

Also, the converse question: Why have invariant instructions not been hoisted by LICM before unrolling is reached?

This uses SCEV's notion of invariant, not Loops. As an example, LIV udiv LIV2 is invariant, but can't be hoisted unless we can prove LIV2 != 0.

nikic added a comment.May 26 2021, 3:56 PM

The reasoning here is that such an expression (as determined by SCEV) will be CSEd once the loop is unrolled.

Is that really the case? I don't think unrolling itself does any CSE, and runtime unrolling is run at the very end of the pipeline, so we don't have any EarlyCSE or GVN passes after it.

If this is a problem in practice, we can add simple CSE to the unroll transform.

Without having a CSE after unrolling, it's not immediately clear to me why this change would make sense from a cost-modelling perspective. It models these instructions as free, but (without CSE) they will not be free.

Also, the converse question: Why have invariant instructions not been hoisted by LICM before unrolling is reached?

This uses SCEV's notion of invariant, not Loops. As an example, LIV udiv LIV2 is invariant, but can't be hoisted unless we can prove LIV2 != 0.

Unfortunately, using SCEVs notion of invariance is rather problematic here, because it requires creating SCEV expressions for every single instruction in the loop. This has predictable impact on compile-time: https://llvm-compile-time-tracker.com/compare.php?from=963495f0d4b5a0707f82b6c6454f42f3aa52da9b&to=39492e37b8285c2ccc8a58136191a011c590470c&stat=instructions

reames updated this revision to Diff 349617.Jun 3 2021, 11:19 AM

Restructured for compile time.

reames updated this revision to Diff 349665.Jun 3 2021, 1:06 PM

Further optimize for compile time.

@nikic Could you share new numbers for this version please?

nikic added a comment.Jun 3 2021, 2:11 PM

New compile-time: https://llvm-compile-time-tracker.com/compare.php?from=33e41eaecdd7f71184b0bc9dbfdc2892aa45534a&to=bdf43c851f0f5f99df40af83f2d696df8babe7a8&stat=instructions

It's worth mentioning that contrary to your first patch (which had near zero impact on CTMark codegen), I'm seeing some significant code size changes with the new version: https://llvm-compile-time-tracker.com/compare.php?from=33e41eaecdd7f71184b0bc9dbfdc2892aa45534a&to=bdf43c851f0f5f99df40af83f2d696df8babe7a8&stat=size-text This might indicate a bug in the new implementation.

reames abandoned this revision.Sun, Nov 26, 11:50 AM

Closing stale phabricator review. No longer actively working on this, if someone wants to pick it up, feel free.

Herald added a project: Restricted Project. · View Herald TranscriptSun, Nov 26, 11:50 AM
Herald added a subscriber: StephenFan. · View Herald Transcript