This is an archive of the discontinued LLVM Phabricator instance.

Estimate DCE effect in heuristic for estimating complete-unroll optimization effects.
Needs Review · Public

Authored by mzolotukhin on Apr 2 2015, 7:30 PM.

Details

Reviewers
hfinkel
Summary

This patch adds the capability to estimate the effect of DCE on a completely unrolled loop.
That helps improve the accuracy of the prediction, and thus makes the decision
of whether to completely unroll better informed.

Diff Detail

Event Timeline

mzolotukhin updated this revision to Diff 23202.Apr 2 2015, 7:30 PM
mzolotukhin retitled this revision from to Estimate DCE effect in heuristic for estimating complete-unroll optimization effects..
mzolotukhin updated this object.
mzolotukhin edited the test plan for this revision. (Show Details)
mzolotukhin added reviewers: hfinkel, chandlerc.
mzolotukhin added a subscriber: Unknown Object (MLST).
chandlerc edited edge metadata.Apr 9 2015, 2:01 PM

Before we go here, I think you should handle dead CFG paths, which is a *much* simpler problem.

Specifically, you should simplify the condition for branches or the input for switches, and only add a single successor when it folds to a constant.
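
As a minimal sketch of that idea (not code from this patch): assume we keep the SimplifiedValues map mentioned later in this review, mapping original values to the constants they fold to, plus a hypothetical BBWorklist (say, a SetVector) of blocks still to be visited in the current simulated iteration.

  // Sketch only: SimplifiedValues and BBWorklist are assumed to exist in the
  // surrounding analysis; they are not names taken from this patch.
  if (auto *BI = dyn_cast<BranchInst>(BB->getTerminator())) {
    Constant *SimpleCond = nullptr;
    if (BI->isConditional())
      SimpleCond = SimplifiedValues.lookup(BI->getCondition());
    if (auto *CondC = dyn_cast_or_null<ConstantInt>(SimpleCond)) {
      // The condition folds to a constant, so only one successor is
      // reachable; queue just that block and never pay for the dead path.
      BBWorklist.insert(BI->getSuccessor(CondC->isZero() ? 1 : 0));
    } else {
      // Unconditional branch, or a condition we couldn't fold: all
      // successors stay live.
      for (BasicBlock *Succ : successors(BB))
        BBWorklist.insert(Succ);
    }
  }

With something like this, blocks behind a condition that folds to false are never visited, so their cost is never added.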

Once that's handled, I think this should work the other way. The problem with doing it as you have is that walking all the users of an instruction is very expensive (a linked-list traversal, so cache-hostile). Instead, when we first see a side-effect-free instruction in the body of the loop we should add it to a dead set rather than accumulating its cost, and we should remove all operands of all instructions we see in the body of the loop from the dead set (and if one was actually removed, account for its cost there). On the last iteration, we can also remove all live-out instructions from the dead set using the LCSSA PHI nodes (or just not use the dead set for the last iteration).
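
A rough sketch of that dead-set bookkeeping, assuming a per-iteration walk over the loop body, a TargetTransformInfo &TTI for per-instruction costs, and an UnrolledCost accumulator (all illustrative names, not taken from the patch):

  SmallPtrSet<Instruction *, 16> DeadInstructions;
  for (Instruction &I : *BB) {
    // Optimistically treat every side-effect-free instruction as dead
    // instead of charging its cost right away...
    if (!I.mayHaveSideEffects())
      DeadInstructions.insert(&I);
    else
      UnrolledCost += TTI.getUserCost(&I);
    // ...and resurrect anything that is actually used as an operand,
    // paying its cost at that point.
    for (Value *Op : I.operands())
      if (auto *OpI = dyn_cast<Instruction>(Op))
        if (DeadInstructions.erase(OpI))
          UnrolledCost += TTI.getUserCost(OpI);
  }
  // On the last iteration, live-out values (visible through the LCSSA PHI
  // nodes) would also have to be pulled out of the dead set.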

Does that make sense?

Handling dead CFG paths only looks like a simple problem, but in fact, it's much trickier.

Let me start with an example:

for(i = 0; i < 1000; i++) {
   a[i] = b[i] + c[i];
   if(d[i]) {
      // very expensive code - let's say 998 instructions.
   }
}

The cost of the loop body here would be 1 + 1 + 998 = 1000 (the assignment, the branch, and the guarded code), and the estimated cost of the original loop is TripCount * BodyCost = 10^6.
Suppose that d[] is filled with zeros, so if(d[i]) is always false and we never take the expensive path.
That means that after complete unrolling we'll end up with 1000 instructions: a[0] = b[0] + c[0], a[1] = b[1] + c[1], ...
That looks like a huge win - the cost of the unrolled loop is 10^3, while the cost of the original loop is 10^6. But what did we actually do? We significantly increased the code size and gained nothing in terms of performance - that expensive code would never be executed in the original loop either! Things would get even worse if, e.g., d[] contained non-zero values too.

So we can't simply fold the branch and take only one successor - it would be incorrect to compare the cost of the loop computed this way with the original cost. To be precise, that works well for a code-size estimate, but not for execution time (~performance). And the goal of the optimization is to improve performance - i.e. if in the completely unrolled loop we'd need to execute 20% fewer instructions at run time, then it's worth unrolling.

Having said that, it might be interesting to take branch folding into account, but that would need a much more complicated cost model (and thus would increase code complexity). Currently I'm inclined to put it off until we get a real use case where it can help. What do you think?

Now to the DCE part :)
So, you basically suggest adding all instructions to the 'dead' set and then removing from it the instructions that have uses? Is that the idea? If so, I guess we'll end up with what we have now, but in the removal phase - when you remove an instruction's operands from the set, you also need to remove their operands, and their operands' operands, etc. - that's effectively the same linked-list traversal as we have now.

There is a high chance that I misunderstood your suggestion here, so please correct me!

mzolotukhin edited edge metadata.

Rebase to trunk.

  • Rebase on top of the recent trunk.
  • Add a test.
mzolotukhin updated this revision to Diff 27263.Jun 5 2015, 9:13 PM
  • Rebase.
  • Adjust according to the new naming scheme.
  • Adjust DCE test according to new naming scheme.
Gerolf added a subscriber: Gerolf.Jul 7 2015, 10:20 PM
Gerolf added inline comments.
lib/Transforms/Scalar/LoopUnrollPass.cpp
590–591

How about: // When there is no optimization opportunity in the first iteration, we won't find opportunities in later iterations because ...

590–591

I'm missing context. Where do the costs get computed?

test/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll
6 ↗(On Diff #27263)

This should be true for expressions also, not just loads.

mzolotukhin marked an inline comment as done.Jul 7 2015, 10:28 PM

Thanks, Gerolf,
I'll update the comments.

lib/Transforms/Scalar/LoopUnrollPass.cpp
590–591

On lines 594, 598.

Was there an update?

Thanks
Gerolf

chandlerc added inline comments.Jul 15 2015, 5:20 PM
lib/Transforms/Scalar/LoopUnrollPass.cpp
593–609

This seems like it may be somewhat slow, and I expect it to impact the computation relatively rarely. SimplifiedValues should have forward-pruned most of the dead instructions here?

What about a slightly different approach:

  • Each time we simplify something whose operands are not simplified, add that instruction to a SimplifiedRootsSet and SimplifiedRootsWorklist.
  • Each time we actually count an instruction's cost, add it to a set of cost counted instructions, and increment a count of uses for each of its operands.
  • Here, for each instruction in the worklist, for each operand to that instruction, if the operand is in the set of cost counted instructions and not in the SimplifiedRootSet and has a zero count of uses, subtract its cost, decrement the use counts of all its operands, add it to the SimplifiedRootSet, and add it to the worklist.

This should only start from the instructions we can't forward-simplify (things that SCEV simplifies, for example), and walk recursively up their operands, GC-ing everything whose use count in the loop reaches zero as a consequence.

This seems like it should be faster in the common cases than walking the user lists of every instruction? As far as I can tell, we at most walk every operand twice (once to increment, once to decrement)....
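
A rough sketch of step 3 above, assuming the main walk has already filled in SimplifiedRoots (instructions we simplified whose operands were not), CostCounted (instructions whose cost was added to an UnrolledCost accumulator), and NumLiveUses (per-instruction count of uses by cost-counted instructions); all of these names are illustrative, not the patch's:

  SmallVector<Instruction *, 16> Worklist(SimplifiedRoots.begin(),
                                          SimplifiedRoots.end());
  while (!Worklist.empty()) {
    Instruction *I = Worklist.pop_back_val();
    for (Value *Op : I->operands()) {
      auto *OpI = dyn_cast<Instruction>(Op);
      // Only GC operands whose cost we actually counted, that are not
      // already roots, and that no cost-counted instruction still uses.
      if (!OpI || !CostCounted.count(OpI) || SimplifiedRoots.count(OpI) ||
          NumLiveUses.lookup(OpI) != 0)
        continue;
      UnrolledCost -= TTI.getUserCost(OpI);  // refund its cost
      for (Value *OpOp : OpI->operands())    // its operands each lose one use
        if (auto *OpOpI = dyn_cast<Instruction>(OpOp))
          --NumLiveUses[OpOpI];
      SimplifiedRoots.insert(OpI);
      Worklist.push_back(OpI);
    }
  }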

mzolotukhin marked an inline comment as done.
  • Rebase on master.
  • Avoid visiting all basic blocks and all their instructions - instead, use a worklist.

Hi Chandler,

I chose somewhat of a hybrid approach between the previous implementation and what you suggested.

I did use a worklist to avoid one more pass over all instructions, but I decided not to count uses (a rough sketch of the shape this takes follows the list below). I had two reasons in mind for that:

  1. It would work incorrectly (or would require some special handling) with uses outside the loop - we wouldn't count them and thus might think an instruction is dead when it isn't.
  2. It would additionally complicate the code.
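
One possible shape of that hybrid (illustrative names such as DeadRoots, not necessarily the patch's code): walk up from the instructions already known to be dead or simplified, and treat an operand as dead only once every one of its users inside the loop is itself known dead; a use outside the loop keeps it alive, which is why no in-loop use counter is maintained.

  SmallVector<Instruction *, 16> Worklist(DeadRoots.begin(), DeadRoots.end());
  SmallPtrSet<Instruction *, 16> DeadInstructions(DeadRoots.begin(),
                                                  DeadRoots.end());
  while (!Worklist.empty()) {
    Instruction *I = Worklist.pop_back_val();
    for (Value *Op : I->operands()) {
      auto *OpI = dyn_cast<Instruction>(Op);
      if (!OpI || !L->contains(OpI) || OpI->mayHaveSideEffects() ||
          DeadInstructions.count(OpI))
        continue;
      bool AllUsersDead = true;
      for (User *U : OpI->users()) {
        auto *UI = dyn_cast<Instruction>(U);
        // An out-of-loop user (or a non-instruction user) keeps OpI alive.
        if (!UI || !L->contains(UI) || !DeadInstructions.count(UI)) {
          AllUsersDead = false;
          break;
        }
      }
      if (AllUsersDead) {
        UnrolledCost -= TTI.getUserCost(OpI);
        DeadInstructions.insert(OpI);
        Worklist.push_back(OpI);
      }
    }
  }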

What do you think? Does it sound reasonable?

Michael