This is an archive of the discontinued LLVM Phabricator instance.

[Unroll] Implement a conservative and monotonically increasing cost tracking system during the full unroll heuristic analysis that avoids counting any instruction cost until that instruction becomes "live" through a side-effect or use outside the...
ClosedPublic

Authored by mzolotukhin on Aug 5 2015, 1:54 AM.

Details

Summary

...loop after the last iteration.

This is really hard to do correctly. The core problem is that we need to
model liveness through the induction PHIs from iteration to iteration in
order to get the correct results, and we need to correctly de-duplicate
the common subgraphs of instructions feeding some subset of the
induction PHIs. All of this can be driven either from a side effect at
some iteration or from the loop values used after the loop finishes.

This patch implements this by storing the forward-propagating analysis
of each instruction in a cache to recall whether it was free and whether
it has become live and thus counted toward the total unroll cost. Then,
at each sink for a value in the loop, we recursively walk back through
every value that feeds the sink, including looping back through the
iterations as needed, until we have marked the entire input graph as
live. Because we cache this, we never visit instructions more than twice

  • once when we analyze them and put them into the cache, and once when

we count their cost towards the unrolled loop. Also, because the cache
is only two bits and because we are dealing with relatively small
iteration counts, we can store all of this very densely in memory to
avoid this from becoming an excessively slow analysis.

The code here is still pretty gross. I would appreciate suggestions
about better ways to factor or split this up, I've stared too long at
the algorithmic side to really have a good sense of what the design
should probably look at.

Also, it might seem like we should do all of this bottom-up, but I think
that is a red herring. Specifically, the simplification power is *much*
greater working top-down. We can forward propagate very effectively,
even across strange and interesting recurrances around the backedge.
Because we use data to propagate, this doesn't cause a state space
explosion. Doing this level of constant folding, etc, would be very
expensive to do bottom-up because it wouldn't be until the last moment
that you could collapse everything. The current solution is essentially
a top-down simplification with a bottom-up cost accounting which seems
to get the best of both worlds. It makes the simplification incremental
and powerful while leaving everything dead until we *know* it is needed.

Finally, a core property of this approach is its *monotonicity*. At all
times, the current UnrolledCost is a conservatively low estimate. This
ensures that we will never early-exit from the analysis due to exceeding
a threshold when if we had continued, the cost would have gone back
below the threshold. These kinds of bugs can cause incredibly hard to
track down random changes to behavior.

We could use a techinque similar (but much simpler) within the inliner
as well to avoid considering speculated code in the inline cost.

Diff Detail

Repository
rL LLVM

Event Timeline

chandlerc updated this revision to Diff 31339.Aug 5 2015, 1:54 AM
chandlerc retitled this revision from to [Unroll] Implement a conservative and monotonically increasing cost tracking system during the full unroll heuristic analysis that avoids counting any instruction cost until that instruction becomes "live" through a side-effect or use outside the....
chandlerc updated this object.
chandlerc added a reviewer: mzolotukhin.
chandlerc added a subscriber: llvm-commits.
chandlerc updated this revision to Diff 31384.Aug 5 2015, 11:47 AM

Rebase after landing prerequisite patches and merge a missing commit into this
patch. Sorry for the broken original diff.

mzolotukhin edited edge metadata.Aug 5 2015, 3:23 PM

Hi Chandler,

Mostly the patch looks good, a couple of nit-picks inline.

But I think we need more tests for this. I think we can add debug prints and pin-point the cases we want to check with it in tests.

Thanks,
Michael

lib/Transforms/Scalar/LoopUnrollPass.cpp
596 ↗(On Diff #31384)

s/instrution/instruction/

616 ↗(On Diff #31384)

Maybe worth commenting here that only I and Iteration are actually used as a key. It might be not-obvious for someone looking at this code without context of this patch.

620 ↗(On Diff #31384)

s/free.a/free./

739 ↗(On Diff #31384)

I find having debug prints very convenient in this kind of analyses. Would it be possible to return them back?

We could report instructions that was simplified, and instructions that were proven to be live, for instance. Also, I think we can use this diagnostic in tests, as it gets harder and harder to write them.

749 ↗(On Diff #31384)

Is word 'either' redundant here, or is it just my English?:)

Hi Chandler,

I found that with this patch compile time is significantly worse on some benchmarks, a reduced testcase from one of them is attached. It seems like the problem isn’t in the patch itself, but in the way we perform actual unrolling - you can see that if you add ‘-debug’ flag. I can look into what’s bad there later, but if you’d like to poke it too, you are welcome:)

Hi Chandler,

This patch got stuck, but finally I got some time and investigated it more carefully. I found a bug in current implementation that explains everything, and as a consequence I constructed a simple test-case to demonstrate the problem:

int foo(char *a) {
  int i = 0;
  for (i = 0; i < 500; i++)
    if (a[i] == 0)
      return i;
  return 0;
}

The problem is that the only instructions we consider live are those having side effects. However, they might have no side effects (by LLVM's definition of side effects: mayWriteToMemory() || mayThrow() || !mayReturn()) and still affect program behavior, as in the case above.

Thanks,
Michael

lib/Transforms/Scalar/LoopUnrollPass.cpp
752–753 ↗(On Diff #31384)

Should we also take into account instructions with out-of-the-loop uses and early-exit instructions? Are there any other cases we want not to miss?

Hi Chander,

We briefly discussed it on IRC, but I'll duplicate my latest findings here too.

I tested this patch and found several issues, which lead to undesired unrolling in some cases, and thus, have significant compile-time impact for no performance benefit. With them fixed/worked-around, the compile time regressions seemed not that bad, but I'll need to remeasure it when we fix the issues in a proper way. With these issues worked-around I still saw some nice performance gains.

Here is the list of problems that I found:

  1. With this improved algorithm for finding dead instructions, we're now able to figure out that loop control flow becomes dead after unrolling. If a loop has a small body, then the control flow might be up to 50% of the loop of the body, but it doesn't seem reasonable to unroll such loops. For instance:
for (i = 0; i < 500; i++)
   a[i] += 1;

In this case unrolling removes nothing except the control flow, but it fools the current heuristic so the loop is unrolled. Such loops are pretty popular, so the compile time hit is severe if we unroll them. Performance gain is questionable, and probably we actually regress the performance in such cases.

  1. Currently simplifyInstWithSCEV returns true (meaning the corresponding instruction is simplified) for expressions in a form Address + ConstantOffset. However, unrolling doesn't necessarily leads to simplification of such instruction, so our estimate might be wrong here. For example:
for (int i=0; i < 16; i++)  {
   a[i][0]=b[S.y][S.x+i];
}

In this case index expressions take the most part of the loop body, but unrolling doesn't help to simplify them in contrast to our estimate.

  1. After we unrolled a loop we should make sure that we cleaned-up everything we expected to be simplified/dead, otherwise we will count it again when we analyze the parent loop.

Michael

mzolotukhin commandeered this revision.Mar 4 2016, 4:39 PM
mzolotukhin edited reviewers, added: chandlerc; removed: mzolotukhin.

I plan to rebase this patch and slightly update it.

mzolotukhin updated this revision to Diff 49862.Mar 4 2016, 4:44 PM
  • Rebase onto TOT.
  • Only work on inner loops: we can't accurately estimate cost of instructions from inner loops when analyzing an outer loop, and we won't be visiting inner loops later anyway, so even if unrolling of the outer loop could expose new opportunities in inner loops, we won't be catching them now. We can revisit this later.
  • Don't unroll loop with calls inside. Experiments showed some problems with such loops (our estimate is far from accurate for them), and disabling it doesn't hurt any test. Again, we can revisit it later if we have better way to estimate a cost of calls.
chandlerc accepted this revision.May 12 2016, 3:52 PM
chandlerc edited edge metadata.

OK, I've paged all this back into my brain.

The inner loop aspect makes more sense to me now. I'm still a bit dubious on the skipping calls, but I understand that at least we're not ready for them yet and its important to get this stuff moving, so it seems like a good incremental step.

Some comments below. Only the test for the inner loop case is really interesting. Feel free to submit with these comments addressed or post any updates with more questions if useful.

lib/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll
1 ↗(On Diff #49862)

This file is under lib, not test?

lib/Transforms/Scalar/LoopUnrollPass.cpp
241–244 ↗(On Diff #49862)

It would make slightly more sense to hoist this out to the caller... There is nothing about this routine that is specific to inner loops, its just that it isn't useful right now?

Not a big deal either way.

270 ↗(On Diff #49862)

Heh, now that you've taken this over, need to address your own comment here. ;]

290 ↗(On Diff #49862)

I'm fine with a comment, or extending the DenseMapInfo to support using .find_as(std::make_pair(I, Iteration)).

417 ↗(On Diff #49862)

I have no problem adding them back. Want to do it in a follow-up patch? In this patch?

430–431 ↗(On Diff #49862)

Not sure what this comment was referencing...

The idea is that all out-of-loop uses should be via a LCSSA phi node in the exit block(s) and we backwards saturate their cost there?

432–433 ↗(On Diff #49862)

Maybe leave a FIXME? I feel like we should make some effort to address this in follow-up commits. At the very least I suspect we want to handle intrinsics here.

This revision is now accepted and ready to land.May 12 2016, 3:52 PM
sanjoy added a subscriber: sanjoy.May 12 2016, 4:00 PM

Random drive-by comment inline

lib/Transforms/Scalar/LoopUnrollPass.cpp
432 ↗(On Diff #49862)

Nit: spelling

This revision was automatically updated to reflect the committed changes.
mzolotukhin marked 14 inline comments as done.May 12 2016, 6:59 PM

Hi,

I committed the patch in r269388. Along with the changes you requested I adjusted some thresholds in tests invocation and made one semantical change in visitPHINode. In the previous version we checked if (PN.getParent() == L->getHeader()) and early exited if it's true. That prevented us from running simplifyWithSCEV on the phi-value, so the analysis didn't get the information about this phi. Now we run the base visitor first to mine the data if we can, and only then we check if the phi is in the header.

Michael

lib/Transforms/LoopUnroll/full-unroll-heuristics-dce.ll
1 ↗(On Diff #49862)

Ouch, thanks for the catch!

lib/Transforms/Scalar/LoopUnrollPass.cpp
241–244 ↗(On Diff #49862)

We need to treat inner loops in a special way (e.g. using weights for basic blocks based on the profile info), but we're not doing it now, so I tend to think this function in current implementation isn't designed to work on outer loops.

270 ↗(On Diff #49862)

Fixed :)

417 ↗(On Diff #49862)

I'll add them back in another patch if I see a need in them.

430–431 ↗(On Diff #49862)

Yeah, I think I might misunderstand this part at that time.

chapuni added inline comments.
llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp
434

FYI, it seems msc doesn't like it.

bool Inserted = InstCostMap.insert({&I, (int)Iteration, (unsigned)IsFree,
                                           /*IsCounted*/ false})

would work for me.

mzolotukhin marked 5 inline comments as done.May 13 2016, 2:46 PM

Thanks, I reapplied the patch with your fix (r269486). Hopefully, the bots will be happy!

Michael