This patch reimplements heuristic that tries to estimate optimization beneftis
from complete loop unrolling.
Details
Diff Detail
Event Timeline
Initial comments below. I'd also like to see some compile time impact measurement, but maybe wait to settle the design a bit first.
lib/Transforms/Scalar/LoopUnrollPass.cpp | ||
---|---|---|
524–528 | This comment seems out of date. | |
530–531 | There is essentially no documentation of "EarlyExitFound" or why this behavior is correct. Generally, I'm worried about all the 'break' and loop control flow here. I think this is going to be a bit easier to understand if our early-stop points just return. | |
533 | Rather than declaring this in the iteration loop, please declare it in simulateLoop and clear it on each iteration so that we re-use allocated storage. We're *very* likely to put roughly the same number of BBs on the worklist each trip through. | |
539–544 | Why not return? | |
541–548 | I think it would be more clear to inline the analyzeBlock and addBBSuccessors here... At the least, inlining addBBSuccessors avoids passing around the worklist. | |
546–571 | I find it a bit surprising to do this caching up-front. I would expect it to work better to lazily populate the cache based on the expressions we see. If that makes sense, I think it also makes sense to factor this into a completely separate change that just wraps up SCEV queries in a lazily populated cache, specifically for the purpose of simulation? Might make sense to fully sink this into SCEV, I don't know and would defer to others that know SCEV better. |
Partly address Chandler's remarks:
- Inline analyzeBlock and addBBSuccessors.
- Hoist worklist declaration out of the loop.
- Replace breaks with returns.
The rest is coming shortly.
Hi Chandler,
Thanks for the review, please find my replies inline.
lib/Transforms/Scalar/LoopUnrollPass.cpp | ||
---|---|---|
524–528 | It was just weirdly formatted:) | |
530–531 | I will add more comments regarding EarlyExitFound flag. The logic behind it is the following: when at some point we managed to resolve conditional branch in some block, and this branch leads us out of the loop, we don't want to analyze further iterations - if we unroll the loop, they will become a dead code and will be removed. However, we do want to finish estimation of the current iteration - that's why we don't just return (otherwise, we'll miss the cost of blocks, remaining in the worklist). | |
533 | Good point, thanks. Fixed. | |
539–544 | Fixed, thanks. | |
541–548 | Fixed, thanks. | |
546–571 | Initially, I didn't want to mix working with SCEV with everything else, especially taking into account this data will be the same across all iterations and will be needed at every iteration. But now I'm thinking that moving this to visitGetElementPtr would simplify the code. I'll do that. As for the second part - SCEV is already a cache (i.e. Value --> SCEVExpression map ). But what we store in our cache is the result of some additional computations, which are very task-specific (we're looking for GEPs, whose SCEV expressions contain at most one SCEVAddRec and this AddRec relates to the particular loop). These computations seem expensive enough to cache their results, but they are not general enough to move that to SCEV itself. |
A few more code simplification comments just so you have them queued up.
lib/Transforms/Scalar/LoopUnrollPass.cpp | ||
---|---|---|
521–526 | Handle this before we process the terminators? | |
524 | Just insert and continue here as well? (again, see below for why this might be ok) | |
530 | Rather than keeping a variable, just insert this onto the worklist an 'continue'? (see below for why this might be better) | |
533 | No need to handle this, just fall through two the successor loop. | |
533 | Ok, I see what you're trying to do here. You're trying to check for us *definitely* exiting the loop after N iterations due to branching to the exit block. But I don't think this code is correct at this point. While you know that if this basic block is dynamically executed on this iteration, the loop will be exited, you don't know that this block will be dynamically executed on this iteration. This only seems important to handle cases where simulating N iterations proves something about the trip count that SCEV can't prove, which on the whole seems like a bug in SCEV. Is it really worth trying to handle it here? If so, you'll want to check that this block is the only loop exiting block or otherwise ensure that it is dynamically executed on every iteration without fail. If you decide not to handle this, then that makes using 'continue' in my two comments above make sense. |
Thanks for the comments, I'll post updated patch shortly.
lib/Transforms/Scalar/LoopUnrollPass.cpp | ||
---|---|---|
521–526 | Makes sense. | |
524 | Makes sense. It's a kind of leftover from inlining addBBSuccessors. | |
533 | Oh, you are right. I incorrectly assumed that all blocks from worklist would be actually executed. I think I'll just drop this then to keep the code simpler. |
lib/Transforms/Scalar/LoopUnrollPass.cpp | ||
---|---|---|
521–526 | Actually, we can't do this before processing the terminators. It's possible that we find nothing in one block, but find a lot in others - that means we should only perform this check when all blocks are processed, i.e. at the end of an iteration processing. |
Addressing Chandler's remarks, continued:
- Inline findConstantPointers into GEP-instructions visitor.
- Inline simulateLoop into analyzeLoop + some other simplifications.
- Don't try to handle early exits.
Hi Chandler,
I just discovered, that in this implementation of lazy caching we'll end up visiting 'bad' GEPs on every iteration. Do you think it's ok to have a set of such GEPs, which'll prevent us from doing the same useless analysis over and over again? By 'bad' GEPs here I mean instructions, which are not BaseAddress+Constant - we visit them, but don't add to the SCEVCache.
Michael
- Clear SimplifiedValues at the beginning of each iteration.
- Adjust tests to this change (our estimate lowered).
Hi,
I also checked compile-time impact of the change, and didn't see any noticeable changes when building SPEC2006.
Below are the compile time results:
Test CompleteUnrollOn CompleteUnrollOff 400.perlbench 23.55 23.55 401.bzip2 2.12 2.10 403.gcc 76.42 76.00 429.mcf 0.46 0.45 445.gobmk 17.93 17.92 456.hmmer 5.17 5.17 458.sjeng 2.33 2.30 462.libquantum 0.72 0.71 464.h264ref 12.91 12.80 471.omnetpp 34.59 35.06 473.astar 0.96 0.96 483.xalancbmk 177.89 178.46
I tend to think that the differences are just a noise.
Hi Chandler,
I added the DCE estimation part as well, so that you can review this altogether.
Thanks,
Michael
This version has became stale, so I rebased it, and created a new phabricator review for it: D8816.
Just insert and continue here as well? (again, see below for why this might be ok)