This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroll] Limit peeling to conds in BBs executed on every iteration.
AbandonedPublic

Authored by fhahn on Apr 6 2018, 8:36 AM.

Details

Summary

Peeling off iterations that only simplify a nested conditional is likely
to increase the code size without being likely to be beneficial.

Diff Detail

Event Timeline

fhahn created this revision.Apr 6 2018, 8:36 AM
junbuml added a subscriber: junbuml.Apr 6 2018, 9:14 AM

You get some of the benefit of peeling whether or not the condition dominates the latch: both the peeled iterations and the remaining loop become simpler because the check disappears, and presumably that codepath is executed at least some of the time. But maybe that doesn't help enough to be worth doing? I'd like to understand the motivation a bit more.

I also don't get it. If peeling can help to deal with a comparison like this

b = -1;
for (i = 0; i < N; i++) {
  if (something) {
    if (b > 0) break;
  }
  b = i;
}

Here b > 0 does not dominate the latch, but it is still profitable to peel away 1 iteration to get rid of it.

fhahn added a comment.Apr 9 2018, 2:37 AM

Thanks for having a look and sorry for not being clearer. Chad discovered a regression in SPEC2006's h264ref with LTO, caused by this change. The problem was that we peeled off an iteration of a big loop before LTO. That caused the function to be too big for the inliner during LTO, whereas it would be inlined before. We based the peeling decision on a nested condition. With this patch I tried to find a balance between increasing code size and benefits of peeling (simplifying nested conditionals are likely to have less positive impact than top-level ones).

@mcrosier did some more digging and found that we might just want to run simple unrolling before LTO and normal unrolling/peeling during LTO, which makes a sense to me. With that, we would not need this patch (or we could only consider top-level conditionals for "simple" peeling) and IMO that is what we should try to do.

Thanks for having a look and sorry for not being clearer. Chad discovered a regression in SPEC2006's h264ref with LTO, caused by this change. The problem was that we peeled off an iteration of a big loop before LTO. That caused the function to be too big for the inliner during LTO, whereas it would be inlined before. We based the peeling decision on a nested condition. With this patch I tried to find a balance between increasing code size and benefits of peeling (simplifying nested conditionals are likely to have less positive impact than top-level ones).

@mcrosier did some more digging and found that we might just want to run simple unrolling before LTO and normal unrolling/peeling during LTO, which makes a sense to me. With that, we would not need this patch (or we could only consider top-level conditionals for "simple" peeling) and IMO that is what we should try to do.

Prior to loop peeling the function we'd like to inlined in h264ref has a single use.

Currently, (non-simple) loop peeling will not peel a loop if it includes a function call that is likely to be inlined (i.e., is not marked with a noinline attribute, has internal linkage and has a single use). This is exactly the case we're dealing with in h264ref, except the function to be inlined isn't marked as internal until the LTO phase of compilation. Thus, one possible approach would be to defer peeling until the LTO phase.

Thanks for having a look and sorry for not being clearer. Chad discovered a regression in SPEC2006's h264ref with LTO, caused by this change. The problem was that we peeled off an iteration of a big loop before LTO. That caused the function to be too big for the inliner during LTO, whereas it would be inlined before. We based the peeling decision on a nested condition. With this patch I tried to find a balance between increasing code size and benefits of peeling (simplifying nested conditionals are likely to have less positive impact than top-level ones).

@mcrosier did some more digging and found that we might just want to run simple unrolling before LTO and normal unrolling/peeling during LTO, which makes a sense to me. With that, we would not need this patch (or we could only consider top-level conditionals for "simple" peeling) and IMO that is what we should try to do.

Quick update on this Florian. I think my initial analysis was a little off or rather the regression in h264ref was actually just noise. I tried reverting r327671 (your change to peeling) on ToT this afternoon to verify the regression and now the revert itself is causing a regression. Further, after r324557 (which changed the codegen optimization level when compiling with gold) was reverted in r329458, I'm now showing that h264ref is ahead by 3.19% when comparing ToT to Clang 6.0. Thus, I not sure this change is worth pursuing if the sole purpose is to address the h264ref regression (which doesn't appear to be a real regression after all).

Thanks for having a look and sorry for not being clearer. Chad discovered a regression in SPEC2006's h264ref with LTO, caused by this change. The problem was that we peeled off an iteration of a big loop before LTO. That caused the function to be too big for the inliner during LTO, whereas it would be inlined before. We based the peeling decision on a nested condition. With this patch I tried to find a balance between increasing code size and benefits of peeling (simplifying nested conditionals are likely to have less positive impact than top-level ones).

@mcrosier did some more digging and found that we might just want to run simple unrolling before LTO and normal unrolling/peeling during LTO, which makes a sense to me. With that, we would not need this patch (or we could only consider top-level conditionals for "simple" peeling) and IMO that is what we should try to do.

Prior to loop peeling the function we'd like to inlined in h264ref has a single use.

Currently, (non-simple) loop peeling will not peel a loop if it includes a function call that is likely to be inlined (i.e., is not marked with a noinline attribute, has internal linkage and has a single use). This is exactly the case we're dealing with in h264ref, except the function to be inlined isn't marked as internal until the LTO phase of compilation. Thus, one possible approach would be to defer peeling until the LTO phase.

Ignore the above comment. I wrote this earlier this morning, prior to identifying the regression was more likely just noise.

Thanks for having a look and sorry for not being clearer. Chad discovered a regression in SPEC2006's h264ref with LTO, caused by this change. The problem was that we peeled off an iteration of a big loop before LTO. That caused the function to be too big for the inliner during LTO, whereas it would be inlined before. We based the peeling decision on a nested condition. With this patch I tried to find a balance between increasing code size and benefits of peeling (simplifying nested conditionals are likely to have less positive impact than top-level ones).

@mcrosier did some more digging and found that we might just want to run simple unrolling before LTO and normal unrolling/peeling during LTO, which makes a sense to me. With that, we would not need this patch (or we could only consider top-level conditionals for "simple" peeling) and IMO that is what we should try to do.

Quick update on this Florian. I think my initial analysis was a little off or rather the regression in h264ref was actually just noise. I tried reverting r327671 (your change to peeling) on ToT this afternoon to verify the regression and now the revert itself is causing a regression. Further, after r324557 (which changed the codegen optimization level when compiling with gold) was reverted in r329458, I'm now showing that h264ref is ahead by 3.19% when comparing ToT to Clang 6.0. Thus, I not sure this change is worth pursuing if the sole purpose is to address the h264ref regression (which doesn't appear to be a real regression after all).

fhahn abandoned this revision.Apr 11 2018, 7:46 AM

Thanks for the update Chad. There is no need for this change then I think.

Thanks for the update Chad. There is no need for this change then I think.

I tend to agree. Thanks, Florian.