This patch adds the capability to estimate the effect of DCE on a completely unrolled loop.
That helps to improve the accuracy of the prediction, and thus makes the decision
of whether to completely unroll better informed.
Reviewers: hfinkel
Before we go here, I think you should handle dead CFG paths, which is a *much* simpler problem.
Specifically, you should simplify the condition for branches or the input for a switch, and only add a single successor when it folds to a constant.
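Concretely, something like this - a sketch only, with SimplifiedValues and BBWorklist standing in for whatever constant tracking and block queue the analysis already keeps:

  // Sketch: fold the terminator when its condition simplifies to a
  // constant, and queue just the taken successor in that case.
  if (auto *BI = dyn_cast<BranchInst>(BB->getTerminator())) {
    ConstantInt *KnownCond = nullptr;
    if (BI->isConditional())
      KnownCond = dyn_cast_or_null<ConstantInt>(
          SimplifiedValues.lookup(BI->getCondition()));
    if (!BI->isConditional() || KnownCond) {
      // Unconditional, or the condition folded: one real successor.
      unsigned Succ = KnownCond && KnownCond->isZero() ? 1 : 0;
      BBWorklist.push_back(BI->getSuccessor(Succ));
    } else {
      // Can't fold - both paths stay live.
      BBWorklist.push_back(BI->getSuccessor(0));
      BBWorklist.push_back(BI->getSuccessor(1));
    }
  }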
Once that's handled, I think this should work the other way. The problem with doing it as you have is that walking all the users of an instruction is very expensive (a linked-list traversal, too cache-hostile). Instead, when we first see an instruction in the body of the loop without side effects, we should add it to a dead set rather than accumulating its cost, and we should remove the operands of every instruction we see in the body of the loop from the dead set (and when an operand is removed, account for its cost there). On the last iteration, we can also remove all live-out instructions from the dead set using the LCSSA PHI nodes (or just not use the dead set for the last iteration).
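In rough pseudo-C++ (the walk order and the InstCost hook are hypothetical, not actual patch code):

  // Presume side-effect-free instructions dead until we see a use of them;
  // charge an instruction's cost only at the point a use shows up.
  SmallPtrSet<Instruction *, 16> DeadSet;
  for (Instruction &I : InstructionsInIterationOrder) {  // hypothetical walk
    if (!I.mayHaveSideEffects())
      DeadSet.insert(&I);      // presumed dead: don't accumulate its cost yet
    else
      Cost += InstCost(&I);    // side effects: always live, charge now
    for (Value *Op : I.operands())
      if (auto *OpI = dyn_cast<Instruction>(Op))
        if (DeadSet.erase(OpI))
          Cost += InstCost(OpI);  // a use appeared: it's live after all
  }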
Does that make sense?
Handling dead CFG paths only looks like a simple problem, but in fact, it's much trickier.
Let me start with an example:
  for (i = 0; i < 1000; i++) {
    a[i] = b[i] + c[i];
    if (d[i]) {
      // very expensive code - let's say 998 instructions.
    }
  }
The cost of the loop body here would be 1 + 1 + 998 = 1000, and the estimated cost of the original loop is TripCount * BodyCost = 10^6.
Suppose that d is filled with zeros, so if(d[i]) is always false and we never take the expensive path.
That means that after complete unrolling we'll end up with 1000 instructions: a[0] = b[0] + c[0], a[1] = b[1] + c[1], ...
That looks like a huge win - the cost of the unrolled loop is 10^3, while the cost of the original loop is 10^6. But what have we actually done? We significantly increased the code size and gained nothing in terms of performance - that expensive code would never have been executed in the original loop either! Things would get even worse if, e.g., d[] contained non-zeros too.
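To put rough numbers on it (same counting as above): the naive comparison says 10^6 -> 10^3, but the original loop only ever executes about 2 * 10^3 instructions - the test of d[i] plus the a[i] computation each iteration; the 998-instruction path never runs. So the real dynamic saving is the folded compare-and-branch at best, nowhere near three orders of magnitude.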
So, we can't simply fold the branch and take only one successor - it would be incorrect to compare the cost of the loop computed this way with the original cost. To be precise, that works well for a code-size estimate, but not for execution time (~performance). And the goal of the optimization is to improve performance - i.e., if in the completely unrolled loop we'd need to execute 20% fewer instructions in real time, then it's worth unrolling.
Having said that, it might be interesting to take branch folding into account, but that would need a much more complicated cost model (and thus would increase code complexity). Currently I'm inclined to put it off until we get a real use case where it can help. What do you think?
Now to the DCE part :)
So, you're basically suggesting that we add all instructions to the 'dead' set and then remove from it the instructions that have uses? Is that the idea? If so, I guess we'll end up with what we have now, just in the removal phase - when you remove an instruction's operands from the set, you also need to remove their operands, and their operands' operands, etc. - that's effectively the same linked-list traversal we have now.
There is a high chance that I misunderstood your suggestion here, so please correct me!
- Rebase.
- Adjust according to the new naming scheme.
- Adjust DCE test according to new naming scheme.
lib/Transforms/Scalar/LoopUnrollPass.cpp
Line 633: How about: // When there is no optimization opportunity in the first iteration, we won't find opportunities in later iterations because ...
Lines 635–637: I'm missing context. Where do the costs get computed?
test/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll
Line 7: This should be true for expressions also, not just loads.
Thanks, Gerolf,
I'll update the comments.
lib/Transforms/Scalar/LoopUnrollPass.cpp
Lines 635–637: On lines 594, 598.
lib/Transforms/Scalar/LoopUnrollPass.cpp
Lines 640–656: This seems like it may be somewhat slow, and I expect it to affect the computation relatively rarely - SimplifiedValues should have forward-pruned most of the dead instructions here? What about a slightly different approach: start only from the instructions we can't forward-simplify (things that SCEV simplifies, for example), and walk recursively up their operands, GC-ing everything whose use count in the loop reaches zero as a consequence. This seems like it should be faster in the common cases than walking the user lists of every instruction? As far as I can tell, we'd at most walk every operand twice (once to increment, once to decrement)...
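If I understand the idea right, a sketch of that two-pass scheme might look like this (InstCost and SimplifiedAway are invented names, and a real version would still need care for operands defined or used outside the loop):

  // Pass 1: count each instruction's uses inside the loop body.
  DenseMap<Instruction *, unsigned> UseCount;
  for (Instruction &I : InstructionsInLoopBody)   // hypothetical traversal
    for (Value *Op : I.operands())
      if (auto *OpI = dyn_cast<Instruction>(Op))
        ++UseCount[OpI];

  // Pass 2: seed with instructions the simulation folded away (e.g. via
  // SCEV) and release their operands; anything whose in-loop use count
  // drops to zero is dead as a consequence, so recurse into it.
  SmallVector<Instruction *, 16> Worklist(SimplifiedAway.begin(),
                                          SimplifiedAway.end());
  while (!Worklist.empty()) {
    Instruction *I = Worklist.pop_back_val();
    for (Value *Op : I->operands())
      if (auto *OpI = dyn_cast<Instruction>(Op))
        if (--UseCount[OpI] == 0 && !OpI->mayHaveSideEffects()) {
          Cost -= InstCost(OpI);  // GC: no remaining uses in the loop
          Worklist.push_back(OpI);
        }
  }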
- Rebase on master.
- Avoid visiting all basic blocks and all their instructions - instead, use a worklist.
Hi Chandler,
I chose a hybrid between how it was implemented before and what you suggested.
I did use a worklist to avoid one more visit of all instructions, but I decided not to count uses. I had two reasons in mind for that:
- It would work incorrectly (or would require some special handling) with uses outside the loop - we wouldn't count them and thus might consider an instruction dead while it's not.
- It would additionally complicate the code.
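To illustrate, the rough shape of the direction I took (a simplified sketch with invented names like SimplifiedAway, not the literal patch code):

  // Seed the worklist with instructions the simulation simplified away;
  // grow the dead set by re-checking each candidate's users directly
  // instead of maintaining use counts.
  SmallVector<Instruction *, 16> Worklist(SimplifiedAway.begin(),
                                          SimplifiedAway.end());
  SmallPtrSet<Instruction *, 16> DeadInstructions;
  while (!Worklist.empty()) {
    Instruction *I = Worklist.pop_back_val();
    for (Value *Op : I->operands()) {
      auto *OpI = dyn_cast<Instruction>(Op);
      if (!OpI || !L->contains(OpI) || OpI->mayHaveSideEffects() ||
          DeadInstructions.count(OpI))
        continue;
      // Dead only if every user was simplified away or is already known
      // dead. A user outside the loop fails this test and keeps the
      // instruction alive - the case plain use counting would get wrong.
      bool AllUsersDead = true;
      for (User *U : OpI->users()) {
        auto *UI = dyn_cast<Instruction>(U);
        if (!UI || (!SimplifiedValues.count(UI) &&
                    !DeadInstructions.count(UI))) {
          AllUsersDead = false;
          break;
        }
      }
      if (AllUsersDead) {
        DeadInstructions.insert(OpI);
        Worklist.push_back(OpI);  // its operands may now be dead too
      }
    }
  }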
What do you think? Does it sound reasonable?
Michael