This is an archive of the discontinued LLVM Phabricator instance.

Reimplement heuristic for estimating complete-unroll optimization effects.
AbandonedPublic

Authored by mzolotukhin on Feb 23 2015, 12:57 PM.

Details

Reviewers
chandlerc
hfinkel
Summary

This patch reimplements heuristic that tries to estimate optimization beneftis
from complete loop unrolling.

Diff Detail

Event Timeline

mzolotukhin retitled this revision from to Reimplement heuristic for estimating complete-unroll optimization effects..
mzolotukhin updated this object.
mzolotukhin edited the test plan for this revision. (Show Details)
mzolotukhin added reviewers: hfinkel, chandlerc.
mzolotukhin added a subscriber: Unknown Object (MLST).Feb 23 2015, 1:00 PM
  • Address Hal's remarks.
chandlerc edited edge metadata.Feb 23 2015, 6:44 PM

Initial comments below. I'd also like to see some compile time impact measurement, but maybe wait to settle the design a bit first.

lib/Transforms/Scalar/LoopUnrollPass.cpp
524–528

This comment seems out of date.

530–531

There is essentially no documentation of "EarlyExitFound" or why this behavior is correct.

Generally, I'm worried about all the 'break' and loop control flow here. I think this is going to be a bit easier to understand if our early-stop points just return.

533

Rather than declaring this in the iteration loop, please declare it in simulateLoop and clear it on each iteration so that we re-use allocated storage. We're *very* likely to put roughly the same number of BBs on the worklist each trip through.

539–544

Why not return?

541–548

I think it would be more clear to inline the analyzeBlock and addBBSuccessors here... At the least, inlining addBBSuccessors avoids passing around the worklist.

546–571

I find it a bit surprising to do this caching up-front. I would expect it to work better to lazily populate the cache based on the expressions we see.

If that makes sense, I think it also makes sense to factor this into a completely separate change that just wraps up SCEV queries in a lazily populated cache, specifically for the purpose of simulation? Might make sense to fully sink this into SCEV, I don't know and would defer to others that know SCEV better.

mzolotukhin edited edge metadata.

Use range-based loops.

I missed that earlier, when addressing Hal's remarks.

Partly address Chandler's remarks:

  • Inline analyzeBlock and addBBSuccessors.
  • Hoist worklist declaration out of the loop.
  • Replace breaks with returns.

The rest is coming shortly.

Hi Chandler,

Thanks for the review, please find my replies inline.

lib/Transforms/Scalar/LoopUnrollPass.cpp
524–528

It was just weirdly formatted:)

530–531

I will add more comments regarding EarlyExitFound flag.

The logic behind it is the following: when at some point we managed to resolve conditional branch in some block, and this branch leads us out of the loop, we don't want to analyze further iterations - if we unroll the loop, they will become a dead code and will be removed. However, we do want to finish estimation of the current iteration - that's why we don't just return (otherwise, we'll miss the cost of blocks, remaining in the worklist).

533

Good point, thanks. Fixed.

539–544

Fixed, thanks.

541–548

Fixed, thanks.

546–571

Initially, I didn't want to mix working with SCEV with everything else, especially taking into account this data will be the same across all iterations and will be needed at every iteration. But now I'm thinking that moving this to visitGetElementPtr would simplify the code. I'll do that.

As for the second part - SCEV is already a cache (i.e. Value --> SCEVExpression map ). But what we store in our cache is the result of some additional computations, which are very task-specific (we're looking for GEPs, whose SCEV expressions contain at most one SCEVAddRec and this AddRec relates to the particular loop). These computations seem expensive enough to cache their results, but they are not general enough to move that to SCEV itself.

A few more code simplification comments just so you have them queued up.

lib/Transforms/Scalar/LoopUnrollPass.cpp
521–526

Handle this before we process the terminators?

524

Just insert and continue here as well? (again, see below for why this might be ok)

530

Rather than keeping a variable, just insert this onto the worklist an 'continue'? (see below for why this might be better)

533

No need to handle this, just fall through two the successor loop.

533

Ok, I see what you're trying to do here. You're trying to check for us *definitely* exiting the loop after N iterations due to branching to the exit block.

But I don't think this code is correct at this point. While you know that if this basic block is dynamically executed on this iteration, the loop will be exited, you don't know that this block will be dynamically executed on this iteration.

This only seems important to handle cases where simulating N iterations proves something about the trip count that SCEV can't prove, which on the whole seems like a bug in SCEV. Is it really worth trying to handle it here? If so, you'll want to check that this block is the only loop exiting block or otherwise ensure that it is dynamically executed on every iteration without fail.

If you decide not to handle this, then that makes using 'continue' in my two comments above make sense.

Thanks for the comments, I'll post updated patch shortly.

lib/Transforms/Scalar/LoopUnrollPass.cpp
521–526

Makes sense.

524

Makes sense. It's a kind of leftover from inlining addBBSuccessors.

533

Oh, you are right. I incorrectly assumed that all blocks from worklist would be actually executed.

I think I'll just drop this then to keep the code simpler.

mzolotukhin added inline comments.Feb 23 2015, 8:29 PM
lib/Transforms/Scalar/LoopUnrollPass.cpp
521–526

Actually, we can't do this before processing the terminators.

It's possible that we find nothing in one block, but find a lot in others - that means we should only perform this check when all blocks are processed, i.e. at the end of an iteration processing.

Addressing Chandler's remarks, continued:

  • Inline findConstantPointers into GEP-instructions visitor.
  • Inline simulateLoop into analyzeLoop + some other simplifications.
  • Don't try to handle early exits.

Hi Chandler,

I just discovered, that in this implementation of lazy caching we'll end up visiting 'bad' GEPs on every iteration. Do you think it's ok to have a set of such GEPs, which'll prevent us from doing the same useless analysis over and over again? By 'bad' GEPs here I mean instructions, which are not BaseAddress+Constant - we visit them, but don't add to the SCEVCache.

Michael

  • Clear SimplifiedValues at the beginning of each iteration.
  • Adjust tests to this change (our estimate lowered).
  • Add a missing sanity check.

Hi,

I also checked compile-time impact of the change, and didn't see any noticeable changes when building SPEC2006.

Below are the compile time results:

Test                   CompleteUnrollOn     CompleteUnrollOff
400.perlbench          23.55                23.55
401.bzip2              2.12                 2.10
403.gcc                76.42                76.00
429.mcf                0.46                 0.45
445.gobmk              17.93                17.92
456.hmmer              5.17                 5.17
458.sjeng              2.33                 2.30
462.libquantum         0.72                 0.71
464.h264ref            12.91                12.80
471.omnetpp            34.59                35.06
473.astar              0.96                 0.96
483.xalancbmk          177.89               178.46

I tend to think that the differences are just a noise.

  • Update tests.
  • Estimated DCE effect too.

Hi Chandler,

I added the DCE estimation part as well, so that you can review this altogether.

Thanks,
Michael

mzolotukhin abandoned this revision.Apr 2 2015, 7:28 PM

This version has became stale, so I rebased it, and created a new phabricator review for it: D8816.