This is an archive of the discontinued LLVM Phabricator instance.

Enable LoopLoadElimination by default
ClosedPublic

Authored by anemet on Jan 18 2016, 12:15 PM.


Details

Reviewers

dberlin
hfinkel

Commits

rGdd9e637acaf0: Enable LoopLoadElimination by default
rL262250: Enable LoopLoadElimination by default

Summary

If you're interested in benchmarking this yourself, you can currently
enable the pass with -mllvm -enable-loop-load-elim. Please let me know if
you see any regressions.

I re-benchmarked this and results are similar to original results in
D13259:

On ARM64:

SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog -59.27%
SingleSource/Benchmarks/Polybench/stencils/adi                   -19.78%

On x86:

SingleSource/Benchmarks/Polybench/linear-algebra/solvers/dynprog  -27.14%

And of course the original ~20% gain on SPECint_2006/456.hmmer with Loop
Distribution.

In terms of compile time, there is a ~5% increase on both
SingleSource/Benchmarks/Misc/oourafft and
SingleSource/Benchmarks/Linpack/linpack-pc. These are both very tiny
loop-intensive programs where SCEV computations dominate compile time.

The reason that time spent in SCEV increases has to do with the design
of the old pass manager. If a transform pass does not preserve an
analysis we *invalidate* the analysis even if there was *no*
modification made by the transform pass.

This means that currently we don't take advantage of LLE and LV sharing
the same analysis (LAA) and unfortunately we recompute LAA *and* SCEV
for LLE.
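
The invalidation behavior described above can be pictured with a toy model. This is only a sketch: ToyPass, ToyLegacyPM and countRecomputations are hypothetical names, not LLVM classes, and the thread does not say which pass in the real pipeline fails to preserve SCEV and LAA. The model just shows that when invalidation is driven by the declared "preserved" set rather than by actual modification, a second pass pays for the analyses again.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical and are not LLVM's classes.
struct ToyPass {
  std::string Name;
  std::set<std::string> Required;   // analyses the pass asks for
  std::set<std::string> Preserved;  // analyses the pass declares preserved
  bool ModifiesIR;                  // whether this run actually changed anything
};

struct ToyLegacyPM {
  std::set<std::string> Available;  // analyses that currently have a live result
  int Recomputations = 0;

  void run(const std::vector<ToyPass> &Pipeline) {
    for (const ToyPass &P : Pipeline) {
      // Compute every required analysis that has no live result.
      for (const std::string &A : P.Required)
        if (Available.insert(A).second)
          ++Recomputations;
      // Drop everything the pass did not declare preserved. P.ModifiesIR is
      // deliberately ignored: that is the limitation being modeled.
      std::set<std::string> Kept;
      for (const std::string &A : Available)
        if (P.Preserved.count(A))
          Kept.insert(A);
      Available = Kept;
    }
  }
};

// Counts analysis computations for two passes that both require SCEV and LAA
// and that both leave the IR untouched.
inline int countRecomputations(bool FirstPassPreserves) {
  std::set<std::string> Needed = {"SCEV", "LAA"};
  std::set<std::string> Preserved =
      FirstPassPreserves ? Needed : std::set<std::string>{};
  std::vector<ToyPass> Pipeline = {
      {"LV", Needed, Preserved, /*ModifiesIR=*/false},
      {"LLE", Needed, {}, /*ModifiesIR=*/false},
  };
  ToyLegacyPM PM;
  PM.run(Pipeline);
  // The last pass preserves nothing, so no result survives the pipeline.
  assert(PM.Available.empty());
  return PM.Recomputations;
}
```

In this model, countRecomputations(false) performs four computations (SCEV and LAA twice each), while countRecomputations(true) performs two, even though neither pass changed anything in either case.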

(There should be a way to work around this limitation in the case of
SCEV and LAA since both compute things on demand and internally cache
their result. Thus we could pretend that transform passes preserve
these analyses and manually invalidate them upon actual modification.
On the other hand, the new pass manager is supposed to solve this, so I
am not sure if this is worthwhile.)
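
A minimal sketch of the workaround idea in the parenthetical above, assuming an analysis that computes on demand and caches internally. ToyLazyAnalysis and runToyTransform are hypothetical names; the real SCEV and LAA interfaces are different and are not reproduced here. The transform behaves as if it preserved the analysis and drops the cached entry only when it really modifies the loop.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-in for an on-demand, internally caching analysis.
class ToyLazyAnalysis {
  std::map<int, long> Cache;  // loop id -> cached result
  int Computations = 0;

public:
  long get(int LoopId) {
    auto It = Cache.find(LoopId);
    if (It != Cache.end())
      return It->second;  // cache hit: no recomputation
    ++Computations;
    long Result = static_cast<long>(LoopId) * 2;  // placeholder for real work
    Cache.emplace(LoopId, Result);
    return Result;
  }

  // Manual invalidation, called only after an actual modification.
  void invalidate(int LoopId) { Cache.erase(LoopId); }

  int computations() const { return Computations; }
};

// A transform that claims to preserve the analysis and invalidates by hand.
// Returns whether it modified the loop.
inline bool runToyTransform(ToyLazyAnalysis &A, int LoopId, bool WillModify) {
  (void)A.get(LoopId);  // query the analysis, as a real transform would
  if (WillModify)
    A.invalidate(LoopId);
  return WillModify;
}

// Runs two transforms over the same loop and reports how often the analysis
// had to be computed.
inline int computationsAfterTwoPasses(bool FirstPassModifies) {
  ToyLazyAnalysis A;
  runToyTransform(A, /*LoopId=*/0, FirstPassModifies);
  runToyTransform(A, /*LoopId=*/0, /*WillModify=*/false);
  assert(A.computations() >= 1);
  return A.computations();
}
```

With this scheme the second transform reuses the cached result whenever the first one left the loop alone, and recomputes only after a real change.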

Diff Detail

Repository: rL LLVM

Event Timeline

anemet updated this revision to Diff 45201. Jan 18 2016, 12:15 PM

anemet retitled this revision to Enable LoopLoadElimination by default.

anemet updated this object.

anemet added reviewers: dberlin, hfinkel.

anemet added a subscriber: llvm-commits.

Herald added subscribers: mehdi_amini, aemerson. Jan 18 2016, 12:15 PM

mssimpso added a subscriber: mssimpso. Jan 19 2016, 5:26 AM

Just to make sure I understand, LLE is effectively a specialized form of
PRE targeted at loops, but with the additional flexibility to introduce
runtime checks to resolve aliasing queries? Is that a correct rephrasing?

Have you examined how much of the benefit comes from the runtime
checks? I'd be very tempted to introduce an aggressive and a
non-aggressive form which differ only in whether they insert runtime
checks. I'm concerned by the potential code size and compile time
impact of aggressive code duplication.

Has this been stress tested by doing (for instance) a clang self host?
Sorry if this has been addressed earlier, I don't see it mentioned
specifically here and haven't read back through the history.
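
As a hedged illustration of what "PRE targeted at loops" means here, the following source-level sketch shows a loop whose load reads the value stored by the previous iteration, and an equivalent form where that value is carried in a scalar. The function names are made up for illustration, and the pass itself works on LLVM IR rather than on C++ source. Because the two vectors are distinct objects, no runtime alias check is needed in this particular case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Before: every iteration reloads A[i], which the previous iteration stored
// as A[i + 1].
inline void accumulateBefore(std::vector<int> &A, const std::vector<int> &B) {
  assert(B.size() + 1 >= A.size());
  for (std::size_t i = 0; i + 1 < A.size(); ++i)
    A[i + 1] = A[i] + B[i];
}

// After: the stored value is forwarded to the next iteration in a scalar, so
// the only remaining load of A is the initial one before the loop.
inline void accumulateAfter(std::vector<int> &A, const std::vector<int> &B) {
  assert(B.size() + 1 >= A.size());
  if (A.size() < 2)
    return;
  int Carried = A[0];  // initial load, placed before the loop
  for (std::size_t i = 0; i + 1 < A.size(); ++i) {
    Carried = Carried + B[i];
    A[i + 1] = Carried;  // the store stays; the in-loop load of A is gone
  }
}
```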

In D16300#330374, @reames wrote:

> Just to make sure I understand, LLE is effectively a specialized form of
> PRE targeted at loops, but with the additional flexibility to introduce
> runtime checks to resolve aliasing queries? Is that a correct rephrasing?

That is correct. The idea is also to allow PRE between non-adjacent iterations, so another way to look at the problem this pass solves is loop-aware scalar replacement (with versioning).

> Have you examined how much of the benefit comes from the runtime
> checks?

I would expect most of the cases require run-time checks right now because Load-PRE currently performs a similar optimization without checks during the canonicalization phase so it steals most of those opportunities. Long term, the hope is that loop Load-PRE will be removed in favor of this pass at which point those cases will fall on this pass too.

For the record, I did look at the benchmarks mentioned in the description, and out of those, hmmer and adi require memchecks while dynprog does not.

> I'd be very tempted to introduce an aggressive and a
> non-aggressive form which differ only in whether they insert runtime
> checks. I'm concerned by the potential code size and compile time
> impact of aggressive code duplication.

That's reasonable. Are you thinking of disabling if SizeLevel is set (i.e. -Os or -Oz) or something different?

> Has this been stress tested by doing (for instance) a clang self host?
> Sorry if this has been addressed earlier, I don't see it mentioned
> specifically here and haven't read back through the history.

Yes, I've done a self-host, but I never mentioned it.

Thanks for your feedback,
Adam
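
To make the role of the runtime checks ("memchecks") and versioning in the reply above more concrete, here is a hand-written sketch. It assumes a loop with a second store through a pointer C that may alias A; the names loopOriginal, rangesMayOverlap and loopVersioned are hypothetical, and the overlap test only mimics the idea of a runtime check, not the checks LLVM actually emits. If C could overwrite the element that the next iteration loads, forwarding is only valid when the two ranges are disjoint, so the loop is duplicated and the check selects a version.

```cpp
#include <cstddef>
#include <cstdint>

// Original loop. A must have N + 1 elements, C must have N elements.
// If C overlaps A, the store to C[i] may clobber A[i + 1] before the next
// iteration reads it.
inline void loopOriginal(int *A, int *C, std::size_t N) {
  for (std::size_t i = 0; i < N; ++i) {
    A[i + 1] = A[i] + 1;
    C[i] = 0;
  }
}

// Conservative runtime check: do the ranges A[0..N] and C[0..N-1] overlap?
inline bool rangesMayOverlap(const int *A, const int *C, std::size_t N) {
  auto ABegin = reinterpret_cast<std::uintptr_t>(A);
  auto AEnd = reinterpret_cast<std::uintptr_t>(A + N + 1);
  auto CBegin = reinterpret_cast<std::uintptr_t>(C);
  auto CEnd = reinterpret_cast<std::uintptr_t>(C + N);
  return ABegin < CEnd && CBegin < AEnd;
}

// Versioned loop: the check picks the untouched original loop or a copy in
// which the stored value is forwarded through a scalar.
inline void loopVersioned(int *A, int *C, std::size_t N) {
  if (N == 0)
    return;
  if (rangesMayOverlap(A, C, N)) {
    loopOriginal(A, C, N);  // fallback version, semantics unchanged
    return;
  }
  int Carried = A[0];  // initial load before the loop
  for (std::size_t i = 0; i < N; ++i) {
    Carried = Carried + 1;
    A[i + 1] = Carried;
    C[i] = 0;  // proven at run time not to touch A
  }
}
```

The cost of this approach is the duplicated loop body, which is the code-size concern raised in the review.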

FWIW: I agree this should be on by default and will dance with joy when we can remove the current inter-iteration LoadPRE (in favor of a more standard PRE that handles both loads and scalars).

That said, you mention "Long term, the hope is that loop Load-PRE will be removed in favor of this pass at which point those cases will fall on this pass too."

Out of curiosity, have you tried the pass with load pre disabled?
:)

In D16300#333762, @dberlin wrote:

> Out of curiosity, have you tried the pass with load pre disabled?

Thanks! You're making me curious too, I'll do a run. Obviously we only handle full redundancy and no ld->ld but it should be interesting to see how far we are.

If you only handle full redundancy, you'll have to explicitly comment out
the code in processNonLocalLoad.
It does not consider full availability phi insertion to be load PRE.

(GVN.cpp:1761)

It seems to me that partial redundancy means slightly different things between GVN Load-PRE and LLE.

In LLE, a load is partially redundant if the forwarding store does not dominate all the loop latches. To make it fully redundant, we would have to add loads on the unavailable paths.

I think that in GVN, the loopy case is always considered a partial redundancy case because you'd have to insert a load in the preheader. Let me know if I am getting this wrong.

So it seems to me that the LLE partial redundancy case is equivalent to the case in GVN Load-PRE when we need *more than one* load inserted.

Does this make sense?
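
A small sketch of the LLE notion of partial redundancy described above, with hypothetical function names. In the first loop the forwarding store is conditional, so it does not dominate the latch. The second function shows what making the load fully redundant would take: an extra load on the path where the store did not run. Per the earlier comment, the pass only handles full redundancy, so the second form illustrates the requirement and is not something the pass is claimed to produce.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The store to A[i + 1] happens only when Flag[i] is set, so the value that
// the next iteration loads is available on just one of the two paths.
inline void conditionalStoreLoop(std::vector<int> &A,
                                 const std::vector<bool> &Flag) {
  assert(Flag.size() + 1 >= A.size());
  for (std::size_t i = 0; i + 1 < A.size(); ++i) {
    int V = A[i];  // partially redundant load
    if (Flag[i])
      A[i + 1] = V + 1;  // forwarding store, does not dominate the latch
  }
}

// Equivalent loop with the value carried in a scalar. A load is inserted on
// the path where the store was skipped.
inline void conditionalStoreLoopForwarded(std::vector<int> &A,
                                          const std::vector<bool> &Flag) {
  assert(Flag.size() + 1 >= A.size());
  if (A.size() < 2)
    return;
  int Carried = A[0];  // initial load before the loop
  for (std::size_t i = 0; i + 1 < A.size(); ++i) {
    if (Flag[i]) {
      Carried = Carried + 1;
      A[i + 1] = Carried;  // value available from the store
    } else {
      Carried = A[i + 1];  // inserted load on the unavailable path
    }
  }
}
```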

anemet mentioned this in rL259861: [LoopLoadElim] Don't allow versioning when optForSize. Feb 4 2016, 5:18 PM

In D16300#330578, @anemet wrote:

> In D16300#330374, @reames wrote:
>
>> I'd be very tempted to introduce an aggressive and a
>> non-aggressive form which differ only in whether they insert runtime
>> checks. I'm concerned by the potential code size and compile time
>> impact of aggressive code duplication.
>
> That's reasonable. Are you thinking of disabling if SizeLevel is set (i.e. -Os or -Oz) or something different?

This was committed in rL259861.

Is this OK to go in now?

Thanks.

Ping.

Thanks for doing all this work!

I think this is fine to go in at this point, as long as you are willing to monitor it closely and be ready to disable it if serious regressions are discovered that can't be addressed quickly.

(It seems like the time regression is one that we expect to be addressed by the new pass manager, and we understand how to mitigate it if need be)

In D16300#363677, @dberlin wrote:

> Thanks for doing all this work!
>
> I think this is fine to go in at this point, as long as you are willing to monitor it closely and be ready to disable it if serious regressions are discovered that can't be addressed quickly.

Absolutely and thanks for the review!

I also plan to keep looking at the interaction with GVN Load-PRE, add support for the load-to-load forwarding and see if GVN can be simplified in turn.

Closed by commit rL262250: Enable LoopLoadElimination by default (authored by anemet). Feb 29 2016, 12:39 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                  Size
llvm/trunk/lib/Transforms/IPO/PassManagerBuilder.cpp  4 lines

Diff 49406

llvm/trunk/lib/Transforms/IPO/PassManagerBuilder.cpp

(98 lines not shown; context: static cl::opt<bool> EnableLoopDistribute()

     cl::desc("Enable the new, experimental LoopDistribution Pass"));

 static cl::opt<bool> EnableNonLTOGlobalsModRef(
     "enable-non-lto-gmr", cl::init(true), cl::Hidden,
     cl::desc(
         "Enable the GlobalsModRef AliasAnalysis outside of the LTO pipeline."));

 static cl::opt<bool> EnableLoopLoadElim(
-    "enable-loop-load-elim", cl::init(false), cl::Hidden,
-    cl::desc("Enable the new, experimental LoopLoadElimination Pass"));
+    "enable-loop-load-elim", cl::init(true), cl::Hidden,
+    cl::desc("Enable the LoopLoadElimination Pass"));

 static cl::opt<std::string> RunPGOInstrGen(
     "profile-generate", cl::init(""), cl::Hidden,
     cl::desc("Enable generation phase of PGO instrumentation and specify the "
              "path of profile data file"));

 static cl::opt<std::string> RunPGOInstrUse(
     "profile-use", cl::init(""), cl::Hidden, cl::value_desc("filename"),

(729 lines not shown)