This is an archive of the discontinued LLVM Phabricator instance.

Reimplement heuristic for estimating complete-unroll optimization effects.
AbandonedPublic

Authored by mzolotukhin on Feb 23 2015, 12:57 PM.

Details

Reviewers
chandlerc
hfinkel
Summary

This patch reimplements heuristic that tries to estimate optimization beneftis
from complete loop unrolling.

Diff Detail

Event Timeline

mzolotukhin retitled this revision from to Reimplement heuristic for estimating complete-unroll optimization effects..
mzolotukhin updated this object.
mzolotukhin edited the test plan for this revision. (Show Details)
mzolotukhin added reviewers: hfinkel, chandlerc.
mzolotukhin added a subscriber: Unknown Object (MLST).Feb 23 2015, 1:00 PM
  • Address Hal's remarks.
chandlerc edited edge metadata.Feb 23 2015, 6:44 PM

Initial comments below. I'd also like to see some compile time impact measurement, but maybe wait to settle the design a bit first.

lib/Transforms/Scalar/LoopUnrollPass.cpp
528–532

This comment seems out of date.

534–535

There is essentially no documentation of "EarlyExitFound" or why this behavior is correct.

Generally, I'm worried about all the 'break' and loop control flow here. I think this is going to be a bit easier to understand if our early-stop points just return.

537

Rather than declaring this in the iteration loop, please declare it in simulateLoop and clear it on each iteration so that we re-use allocated storage. We're *very* likely to put roughly the same number of BBs on the worklist each trip through.

545–552

I think it would be more clear to inline the analyzeBlock and addBBSuccessors here... At the least, inlining addBBSuccessors avoids passing around the worklist.

555–562

Why not return?

564–587

I find it a bit surprising to do this caching up-front. I would expect it to work better to lazily populate the cache based on the expressions we see.

If that makes sense, I think it also makes sense to factor this into a completely separate change that just wraps up SCEV queries in a lazily populated cache, specifically for the purpose of simulation? Might make sense to fully sink this into SCEV, I don't know and would defer to others that know SCEV better.

mzolotukhin edited edge metadata.

Use range-based loops.

I missed that earlier, when addressing Hal's remarks.

Partly address Chandler's remarks:

  • Inline analyzeBlock and addBBSuccessors.
  • Hoist worklist declaration out of the loop.
  • Replace breaks with returns.

The rest is coming shortly.

Hi Chandler,

Thanks for the review, please find my replies inline.

lib/Transforms/Scalar/LoopUnrollPass.cpp
528–532

It was just weirdly formatted:)

534–535

I will add more comments regarding EarlyExitFound flag.

The logic behind it is the following: when at some point we managed to resolve conditional branch in some block, and this branch leads us out of the loop, we don't want to analyze further iterations - if we unroll the loop, they will become a dead code and will be removed. However, we do want to finish estimation of the current iteration - that's why we don't just return (otherwise, we'll miss the cost of blocks, remaining in the worklist).

537

Good point, thanks. Fixed.

545–552

Fixed, thanks.

555–562

Fixed, thanks.

564–587

Initially, I didn't want to mix working with SCEV with everything else, especially taking into account this data will be the same across all iterations and will be needed at every iteration. But now I'm thinking that moving this to visitGetElementPtr would simplify the code. I'll do that.

As for the second part - SCEV is already a cache (i.e. Value --> SCEVExpression map ). But what we store in our cache is the result of some additional computations, which are very task-specific (we're looking for GEPs, whose SCEV expressions contain at most one SCEVAddRec and this AddRec relates to the particular loop). These computations seem expensive enough to cache their results, but they are not general enough to move that to SCEV itself.

A few more code simplification comments just so you have them queued up.

lib/Transforms/Scalar/LoopUnrollPass.cpp
489

Just insert and continue here as well? (again, see below for why this might be ok)

494

Rather than keeping a variable, just insert this onto the worklist an 'continue'? (see below for why this might be better)

497

No need to handle this, just fall through two the successor loop.

498

Ok, I see what you're trying to do here. You're trying to check for us *definitely* exiting the loop after N iterations due to branching to the exit block.

But I don't think this code is correct at this point. While you know that if this basic block is dynamically executed on this iteration, the loop will be exited, you don't know that this block will be dynamically executed on this iteration.

This only seems important to handle cases where simulating N iterations proves something about the trip count that SCEV can't prove, which on the whole seems like a bug in SCEV. Is it really worth trying to handle it here? If so, you'll want to check that this block is the only loop exiting block or otherwise ensure that it is dynamically executed on every iteration without fail.

If you decide not to handle this, then that makes using 'continue' in my two comments above make sense.

508–513

Handle this before we process the terminators?

Thanks for the comments, I'll post updated patch shortly.

lib/Transforms/Scalar/LoopUnrollPass.cpp
489

Makes sense. It's a kind of leftover from inlining addBBSuccessors.

498

Oh, you are right. I incorrectly assumed that all blocks from worklist would be actually executed.

I think I'll just drop this then to keep the code simpler.

508–513

Makes sense.

mzolotukhin added inline comments.Feb 23 2015, 8:29 PM
lib/Transforms/Scalar/LoopUnrollPass.cpp
508–513

Actually, we can't do this before processing the terminators.

It's possible that we find nothing in one block, but find a lot in others - that means we should only perform this check when all blocks are processed, i.e. at the end of an iteration processing.

Addressing Chandler's remarks, continued:

  • Inline findConstantPointers into GEP-instructions visitor.
  • Inline simulateLoop into analyzeLoop + some other simplifications.
  • Don't try to handle early exits.

Hi Chandler,

I just discovered, that in this implementation of lazy caching we'll end up visiting 'bad' GEPs on every iteration. Do you think it's ok to have a set of such GEPs, which'll prevent us from doing the same useless analysis over and over again? By 'bad' GEPs here I mean instructions, which are not BaseAddress+Constant - we visit them, but don't add to the SCEVCache.

Michael

  • Clear SimplifiedValues at the beginning of each iteration.
  • Adjust tests to this change (our estimate lowered).
  • Add a missing sanity check.

Hi,

I also checked compile-time impact of the change, and didn't see any noticeable changes when building SPEC2006.

Below are the compile time results:

Test                   CompleteUnrollOn     CompleteUnrollOff
400.perlbench          23.55                23.55
401.bzip2              2.12                 2.10
403.gcc                76.42                76.00
429.mcf                0.46                 0.45
445.gobmk              17.93                17.92
456.hmmer              5.17                 5.17
458.sjeng              2.33                 2.30
462.libquantum         0.72                 0.71
464.h264ref            12.91                12.80
471.omnetpp            34.59                35.06
473.astar              0.96                 0.96
483.xalancbmk          177.89               178.46

I tend to think that the differences are just a noise.

  • Update tests.
  • Estimated DCE effect too.

Hi Chandler,

I added the DCE estimation part as well, so that you can review this altogether.

Thanks,
Michael

mzolotukhin abandoned this revision.Apr 2 2015, 7:28 PM

This version has became stale, so I rebased it, and created a new phabricator review for it: D8816.