This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroll] By default disable unrolling when optimizing for size.
ClosedPublic

Authored by mzolotukhin on Aug 10 2016, 5:30 PM.

Details

Summary

In clang commit r268509 we started to invoke loop-unroll pass from the
driver even under -Os. However, we happen to not initialize optsize
thresholds properly, which si fixed with this change.

r268509 led to some big compile time regressions, because we started to
unroll some loops that we didn't unroll before. With this change I hope
to recover most of the regressions. We still are slightly slower than
before, because we do some checks here and there in loop-unrolling
before we bail out, but at least the slowdown is not that huge now.

Diff Detail

Repository
rL LLVM

Event Timeline

mzolotukhin retitled this revision from to [LoopUnroll] By default disable unrolling when optimizing for size..
mzolotukhin updated this object.
mzolotukhin added reviewers: hfinkel, chandlerc.
mzolotukhin added a subscriber: llvm-commits.
chandlerc accepted this revision.Aug 10 2016, 6:19 PM
chandlerc edited edge metadata.

Seems fine to mitigate compile time issues, but I think we need some kind of opt-size loop unrolling test cases that check we *do* unroll obviously benificial loops under optsize and we *don't* unroll obvious bad loops that would have been unrolled in a non-optsize function. LGTM, but whether in this commit or a follow-up, I'd like to get at least obvious cases here covered by tests.

include/llvm/CodeGen/BasicTTIImpl.h
284–288 ↗(On Diff #67639)

So this is just designed to fairly closely replicate the behavior before we put the pass into the pipeline?

It'd be good to have some kind of FIXME or whatever. Much like inlining, I would expect that sufficiently conservative unrolling will still be a win in opt-size mode.

lib/Transforms/Scalar/LoopUnrollPass.cpp
951–953 ↗(On Diff #67639)

Generally this makes sense, but if we're not doing partial unrolling, but we have a non-zero partial threshold, we won't early-exit even though threshold is zero... Maybe structure the checks so we only consider the partial threshold when considering partial unrolling?

This revision is now accepted and ready to land.Aug 10 2016, 6:19 PM
mzolotukhin added inline comments.Aug 10 2016, 6:38 PM
include/llvm/CodeGen/BasicTTIImpl.h
284–288 ↗(On Diff #67639)

So this is just designed to fairly closely replicate the behavior before we put the pass into the pipeline?

For now - basically yes.

It'd be good to have some kind of FIXME or whatever. Much like inlining, I would expect that sufficiently conservative unrolling will still be a win in opt-size mode.

Also, it's only a generic implementation, targets are free to override it with other values.

In general I agree that on Os (in contrast to Oz) we should allow some unrolling. But I prefer to be extra careful here not to do more harm than good - for now I'm just reproducing what we did before the change, while keeping the way clear for enabling it later.

lib/Transforms/Scalar/LoopUnrollPass.cpp
951–953 ↗(On Diff #67639)

This makes sense.

This revision was automatically updated to reflect the committed changes.
davide added a subscriber: davide.May 22 2017, 9:16 PM

This is, unfortunately, not ideal, as full unrolling can be beneficial in -Os.
Michael (@mzolotukhin) did you get a chance to re-evaluate this?
Do you have an example where this triggers?

Hi Davide,

I haven't reevaluated this. At that time, we spotted several compile-time regressions (see the table below), and my goal was simply to change the unrolling behavior back to the original. I didn't try to tune the thresholds in any way, so if anyone has a compelling data, we could definitely revisit the values.

SingleSource/Benchmarks/Stanford/Perm37.37%
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v433.95%
SingleSource/Regression/C/bigstack22.07%
MultiSource/Benchmarks/Olden/power/power13.49%
MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow12.68%
MultiSource/Benchmarks/Olden/health/health8.46%
MultiSource/Benchmarks/Olden/bh/bh7.00%
MultiSource/Benchmarks/mediabench/g721/g721encode/encode5.63%
MultiSource/Benchmarks/nbench/nbench5.29%
External/SPEC/CINT2006/462_libquantum/462_libquantum5.17%
External/SPEC/CFP2006/447_dealII/447_dealII4.74%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm2.38%
MultiSource/Benchmarks/mediabench/gsm/toast/toast2.28%
External/SPEC/CFP2006/433_milc/433_milc2.08%
MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame1.62%
MultiSource/Applications/JM/lencod/lencod1.61%
MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan1.48%
MultiSource/Benchmarks/Olden/bh/bh-1.67%

I looked at your benchmarks numbers, with particular attention to the worst regressions, and I'm not convinced these benchmarks are representative.
There are a couple of issues:

  1. They're lifetime is too short (basically, I wasn't able to find anything which lasted more than 0.05s). This makes pretty much impossible to profile and understand where the cycles are spent reliably.
  2. Some of these benchmarks are a little noisy (even after I pinned a thread to a CPU, disable ASLR and frequency scaling etc...)
[davide@cupiditate C]$ time ~/work/llvm/build-rel-noassert/bin/clang -Os bigstack.c

real    0m0.048s
user    0m0.035s
sys     0m0.013s

I think we should try to base our decisions (at least for compile time choices on more real-world/CPU intensive programs/workloads).

Side note: If I take clang.bc in an LTO build and run opt -O2 on it I see:

10.8451 (  1.8%)   0.2831 (  2.3%)  11.1282 (  1.8%)  11.0717 (  1.8%)  Unroll loops

So we just try to execute loop unroll to realize the unroll threshold is zero and bail it. Turns out this is 2% of the time for large testcases.
If we want to disable loop unroll for -Os (I think we shouldn't :) I'd rather disable it entirely instead of adjusting the thresholds to zero.

I looked at your benchmarks numbers, with particular attention to the worst regressions, and I'm not convinced these benchmarks are representative.

What about tramp3d-v4 and SPECs? As far as I remember, tramp3d-v4 compiles pretty long and is pretty stable (and it is a part of CTMark).

But anyway, I completely agree with:

I think we should try to base our decisions (at least for compile time choices on more real-world/CPU intensive programs/workloads).

I'll be happy to look at patches for tuning the Os thresholds! If noone picks it up, I might return to it at some point, but it's not in my near-term plans now.

Thanks,
Michael