This is an archive of the discontinued LLVM Phabricator instance.

Differential D23388

[LoopUnroll] By default disable unrolling when optimizing for size.
ClosedPublic

Authored by mzolotukhin on Aug 10 2016, 5:30 PM.

Details

Reviewers

chandlerc
hfinkel

Commits

rGbd63d436c148: [LoopUnroll] By default disable unrolling when optimizing for size.
rL279585: [LoopUnroll] By default disable unrolling when optimizing for size.

Summary

In clang commit r268509 we started to invoke the loop-unroll pass from the
driver even under -Os. However, we did not initialize the optsize
thresholds properly, which is fixed with this change.

r268509 led to some big compile time regressions, because we started to
unroll some loops that we didn't unroll before. With this change I hope
to recover most of the regressions. We still are slightly slower than
before, because we do some checks here and there in loop-unrolling
before we bail out, but at least the slowdown is not that huge now.
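
As a rough illustration of why the optsize thresholds matter, the following is a minimal sketch, not LLVM's actual code. The type `SizePrefs` and the functions `genericDefaults` and `selectForFunction` are hypothetical; only the field names mirror the `UnrollingPreferences` fields touched by this patch, the value 150 is an arbitrary placeholder, and the selection rule is an assumption about how a size-optimized function picks its thresholds.

```cpp
// Hypothetical stand-in for TargetTransformInfo::UnrollingPreferences;
// only the four fields touched by this patch are modeled.
struct SizePrefs {
  unsigned Threshold;
  unsigned OptSizeThreshold;
  unsigned PartialThreshold;
  unsigned PartialOptSizeThreshold;
};

// Generic defaults in the spirit of the patch: both size-optimized
// thresholds are explicitly zero. 150 is an arbitrary placeholder.
constexpr SizePrefs genericDefaults(unsigned MaxOps) {
  return SizePrefs{150, 0, MaxOps, 0};
}

// Assumed selection rule: a function optimized for size uses the
// OptSize variants in place of the regular thresholds.
constexpr SizePrefs selectForFunction(SizePrefs UP, bool OptForSize) {
  return OptForSize ? SizePrefs{UP.OptSizeThreshold, UP.OptSizeThreshold,
                                UP.PartialOptSizeThreshold,
                                UP.PartialOptSizeThreshold}
                    : UP;
}
```

Under these assumptions, `selectForFunction(genericDefaults(8), true)` yields zero for both effective thresholds, while the non-optsize case keeps the placeholder 150 and the `MaxOps` value 8.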

Diff Detail

Repository: rL LLVM

Event Timeline

mzolotukhin updated this revision to Diff 67639. Aug 10 2016, 5:30 PM

mzolotukhin retitled this revision to [LoopUnroll] By default disable unrolling when optimizing for size.

mzolotukhin updated this object.

mzolotukhin added reviewers: hfinkel, chandlerc.

mzolotukhin added a subscriber: llvm-commits.

Herald added a subscriber: mzolotukhin. Aug 10 2016, 5:30 PM

Seems fine to mitigate compile time issues, but I think we need some kind of opt-size loop unrolling test cases that check we *do* unroll obviously beneficial loops under optsize and we *don't* unroll obviously bad loops that would have been unrolled in a non-optsize function. LGTM, but whether in this commit or a follow-up, I'd like to get at least the obvious cases here covered by tests.

include/llvm/CodeGen/BasicTTIImpl.h
284–288 ↗	(On Diff #67639)	So this is just designed to fairly closely replicate the behavior before we put the pass into the pipeline? It'd be good to have some kind of FIXME or whatever. Much like inlining, I would expect that sufficiently conservative unrolling will still be a win in opt-size mode.
lib/Transforms/Scalar/LoopUnrollPass.cpp
951–953 ↗	(On Diff #67639)	Generally this makes sense, but if we're not doing partial unrolling, but we have a non-zero partial threshold, we won't early-exit even though threshold is zero... Maybe structure the checks so we only consider the partial threshold when considering partial unrolling?
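
The condition discussed in this comment can be sketched in isolation. This is a minimal illustration, assuming a hypothetical `ExitCheckPrefs` struct that models only the three fields involved; the predicate itself follows the check in the LoopUnrollPass.cpp diff shown later on this page, where the partial threshold is consulted only when partial unrolling is enabled.

```cpp
// Hypothetical stand-in for the three UnrollingPreferences fields
// that the early-exit check reads.
struct ExitCheckPrefs {
  unsigned Threshold;
  unsigned PartialThreshold;
  bool Partial;
};

// True when neither full nor partial unrolling can fire: the full
// threshold is zero, and partial unrolling is either disabled or has
// a zero threshold of its own.
constexpr bool unrollingDisabled(ExitCheckPrefs UP) {
  return UP.Threshold == 0 && (!UP.Partial || UP.PartialThreshold == 0);
}
```

With this structure, the case raised in the comment (partial unrolling off, non-zero partial threshold, zero full threshold) still exits early, because the non-zero partial threshold is ignored when `Partial` is false.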

This revision is now accepted and ready to land. Aug 10 2016, 6:19 PM

mzolotukhin added inline comments. Aug 10 2016, 6:38 PM

include/llvm/CodeGen/BasicTTIImpl.h
284–288 ↗	(On Diff #67639)
> So this is just designed to fairly closely replicate the behavior before we put the pass into the pipeline?
For now - basically yes.
> It'd be good to have some kind of FIXME or whatever. Much like inlining, I would expect that sufficiently conservative unrolling will still be a win in opt-size mode.
Also, it's only a generic implementation; targets are free to override it with other values. In general I agree that on Os (in contrast to Oz) we should allow some unrolling. But I prefer to be extra careful here not to do more harm than good: for now I'm just reproducing what we did before the change, while keeping the way clear for enabling it later.
lib/Transforms/Scalar/LoopUnrollPass.cpp
951–953 ↗	(On Diff #67639)	This makes sense.
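
The point that the generic implementation only supplies defaults, and that targets may override them, can be illustrated with a small sketch. Everything here is hypothetical: `OverridePrefs`, `GenericTarget`, `HypotheticalTarget`, `mayUnrollForSize`, and the values 16 and 8 are invented for illustration and do not correspond to any real LLVM target.

```cpp
// Hypothetical pair of size-optimized thresholds.
struct OverridePrefs {
  unsigned OptSizeThreshold;
  unsigned PartialOptSizeThreshold;
};

// Generic implementation: no unrolling when optimizing for size.
struct GenericTarget {
  static constexpr OverridePrefs unrollingPreferences() {
    return OverridePrefs{0, 0};
  }
};

// A made-up target that opts back in to conservative unrolling by
// hiding the generic defaults with its own values.
struct HypotheticalTarget : GenericTarget {
  static constexpr OverridePrefs unrollingPreferences() {
    return OverridePrefs{16, 8};
  }
};

// A target permits some unrolling under optsize if either of its
// size-optimized thresholds is non-zero.
template <typename Target>
constexpr bool mayUnrollForSize() {
  return Target::unrollingPreferences().OptSizeThreshold != 0 ||
         Target::unrollingPreferences().PartialOptSizeThreshold != 0;
}
```

In this sketch, `mayUnrollForSize<GenericTarget>()` is false and `mayUnrollForSize<HypotheticalTarget>()` is true, which is the kind of per-target opt-in the reply leaves room for.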

Closed by commit rL279585: [LoopUnroll] By default disable unrolling when optimizing for size. (authored by mzolotukhin). Aug 23 2016, 4:21 PM

This revision was automatically updated to reflect the committed changes.

This is, unfortunately, not ideal, as full unrolling can be beneficial in -Os.
Michael (@mzolotukhin) did you get a chance to re-evaluate this?
Do you have an example where this triggers?

Hi Davide,

I haven't reevaluated this. At that time, we spotted several compile-time regressions (see the table below), and my goal was simply to change the unrolling behavior back to the original. I didn't try to tune the thresholds in any way, so if anyone has compelling data, we could definitely revisit the values.

Benchmark	Compile-time change
SingleSource/Benchmarks/Stanford/Perm	37.37%
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4	33.95%
SingleSource/Regression/C/bigstack	22.07%
MultiSource/Benchmarks/Olden/power/power	13.49%
MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow	12.68%
MultiSource/Benchmarks/Olden/health/health	8.46%
MultiSource/Benchmarks/Olden/bh/bh	7.00%
MultiSource/Benchmarks/mediabench/g721/g721encode/encode	5.63%
MultiSource/Benchmarks/nbench/nbench	5.29%
External/SPEC/CINT2006/462_libquantum/462_libquantum	5.17%
External/SPEC/CFP2006/447_dealII/447_dealII	4.74%
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm	2.38%
MultiSource/Benchmarks/mediabench/gsm/toast/toast	2.28%
External/SPEC/CFP2006/433_milc/433_milc	2.08%
MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame	1.62%
MultiSource/Applications/JM/lencod/lencod	1.61%
MultiSource/Benchmarks/MiBench/automotive-susan/automotive-susan	1.48%
MultiSource/Benchmarks/Olden/bh/bh	-1.67%

I looked at your benchmark numbers, with particular attention to the worst regressions, and I'm not convinced these benchmarks are representative.
There are a couple of issues:

Their lifetime is too short (basically, I wasn't able to find anything which lasted more than 0.05s). This makes it pretty much impossible to profile them reliably and understand where the cycles are spent.
Some of these benchmarks are a little noisy (even after I pinned a thread to a CPU, disabled ASLR and frequency scaling, etc.).

[davide@cupiditate C]$ time ~/work/llvm/build-rel-noassert/bin/clang -Os bigstack.c

real    0m0.048s
user    0m0.035s
sys     0m0.013s

I think we should try to base our decisions (at least for compile-time choices) on more real-world/CPU-intensive programs/workloads.

Side note: If I take clang.bc in an LTO build and run opt -O2 on it I see:

10.8451 (  1.8%)   0.2831 (  2.3%)  11.1282 (  1.8%)  11.0717 (  1.8%)  Unroll loops

So we run loop unroll only to realize the unroll threshold is zero and bail out. It turns out this is 2% of the time for large test cases.
If we want to disable loop unroll for -Os (I think we shouldn't :) I'd rather disable it entirely instead of adjusting the thresholds to zero.

davide added subscribers: filcab, RKSimon, simon.f.whittaker. May 31 2017, 12:57 PM

> I looked at your benchmark numbers, with particular attention to the worst regressions, and I'm not convinced these benchmarks are representative.

What about tramp3d-v4 and SPECs? As far as I remember, tramp3d-v4 compiles pretty long and is pretty stable (and it is a part of CTMark).

But anyway, I completely agree with:

> I think we should try to base our decisions (at least for compile-time choices) on more real-world/CPU-intensive programs/workloads.

I'll be happy to look at patches for tuning the Os thresholds! If no one picks it up, I might return to it at some point, but it's not in my near-term plans now.

Thanks,
Michael

Revision Contents

Path	Size
llvm/trunk/include/llvm/CodeGen/BasicTTIImpl.h	6 lines
llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp	4 lines

Diff 69054

llvm/trunk/include/llvm/CodeGen/BasicTTIImpl.h

(279 preceding lines not shown; context: for (Loop::block_iterator I = L->block_begin(), E = L->block_end(); I != E;)
  }

  return;
  }
  }

  // Enable runtime and partial unrolling up to the specified size.
  UP.Partial = UP.Runtime = true;
- UP.PartialThreshold = UP.PartialOptSizeThreshold = MaxOps;
+ UP.PartialThreshold = MaxOps;
+
+ // Avoid unrolling when optimizing for size.
+ UP.OptSizeThreshold = 0;
+ UP.PartialOptSizeThreshold = 0;
  }

  /// @}

  /// \name Vector TTI Implementations
  /// @{

  unsigned getNumberOfRegisters(bool Vector) { return Vector ? 0 : 1; }
(674 following lines not shown)

llvm/trunk/lib/Transforms/Scalar/LoopUnrollPass.cpp

(942 preceding lines not shown; context: if (ExitingBlock) {)
  TripCount = SE->getSmallConstantTripCount(L, ExitingBlock);
  TripMultiple = SE->getSmallConstantTripMultiple(L, ExitingBlock);
  }

  TargetTransformInfo::UnrollingPreferences UP = gatherUnrollingPreferences(
  L, TTI, ProvidedThreshold, ProvidedCount, ProvidedAllowPartial,
  ProvidedRuntime);

+ // Exit early if unrolling is disabled.
+ if (UP.Threshold == 0 && (!UP.Partial || UP.PartialThreshold == 0))
+   return false;
+
  // If the loop contains a convergent operation, the prelude we'd add
  // to do the first few instructions before we hit the unrolled loop
  // is unsafe -- it adds a control-flow dependency to the convergent
  // operation. Therefore restrict remainder loop (try unrollig without).
  //
  // TODO: This is quite conservative. In practice, convergent_op()
  // is likely to be called unconditionally in the loop. In this
  // case, the program would be ill-formed (on most architectures)
(136 following lines not shown)