This is an archive of the discontinued LLVM Phabricator instance.

LLE 6/6: Add LoopLoadElimination pass
ClosedPublic

Authored by anemet on Sep 29 2015, 10:28 AM.

Details

Summary

The goal of this pass is to perform store-to-load forwarding across the
backedge of a loop. E.g.:

for (i)
   A[i + 1] = A[i] + B[i]

=>

T = A[0]
for (i)
   T = T + B[i]
   A[i + 1] = T

The pass relies on loop dependence analysis via LoopAccessAnalisys to
find opportunities of loop-carried dependences with a distance of one
between a store and a load. Since it's using LoopAccessAnalysis, it was
easy to also add support for versioning away may-aliasing intervening
stores that would otherwise prevent this transformation.

This optimization is also performed by Load-PRE in GVN without the
option of multi-versioning. As was discussed with Daniel Berlin in
http://reviews.llvm.org/D9548, this is inferior to a more loop-aware
solution applied here. Hopefully, we will be able to remove some
complexity from GVN/MemorySSA as a consequence.

In the long run, we may want to extend this pass (or create a new one if
there is little overlap) to also eliminate loop-indepedent redundant
loads and store that *require* versioning due to may-aliasing
intervening stores/loads. I have some motivating cases for store
elimination. My plan right now is to wait for MemorySSA to come online
first rather than using memdep for this.

The main motiviation for this pass is the 456.hmmer loop in SPECint2006
where after distributing the original loop and vectorizing the top part,
we are left with the critical path exposed in the bottom loop. Being
able to promote the memory dependence into a register depedence (even
though the HW does perform store-to-load fowarding as well) results in a
major gain (~20%). This gain also transfers over to x86: it's
around 8-10%.

Right now the pass is off by default and can be enabled
with -enable-loop-load-elim. On the LNT testsuite, there are two
performance changes (negative number -> improvement):

  1. -28% in Polybench/linear-algebra/solvers/dynprog: the length of the critical paths is reduced
  2. +2% in Polybench/stencils/adi: Unfortunately, I couldn't reproduce this outside of LNT

The pass is scheduled after the loop vectorizer (which is after loop
distribution). The rational is to try to reuse LAA state, rather than
recomputing it. The order between LV and LLE is not critical because
normally LV does not touch scalar st->ld forwarding cases where
vectorizing would inhibit the CPU's st->ld forwarding to kick in.

LoopLoadElimination requires LAA to provide the full set of dependences
(including forward dependences). LAA is known to omit loop-independent
dependences in certain situations. The big comment before
removeDependencesFromMultipleStores explains why this should not occur
for the cases that we're interested in.

Diff Detail

Event Timeline

anemet updated this revision to Diff 36003.Sep 29 2015, 10:28 AM
anemet retitled this revision from to LLE 6/6: Add LoopLoadElimination pass.
anemet updated this object.
anemet added reviewers: hfinkel, dberlin.
anemet added a subscriber: llvm-commits.
anemet updated this object.Sep 29 2015, 10:30 AM
dberlin edited edge metadata.Oct 2 2015, 1:10 PM

Thanks so much for taking this on.

Out of curiosity, does this actually catch the PRE load-forwarding cases that currently are tested?

That is, tests/Transform/GVN/pre-load.ll tests 9 and 10 i believe.

lib/Transforms/Scalar/LoopLoadElimination.cpp
65

This and the checks below it really look like checks to see if the load and store are really the same type, and not related to the dependence distance.
Can this be abstracted out somewhere?

IE it makes more sense to be calling Cand.isSameTypeAs(SE) && Cand.iisDependenceDistanceOfOne(SE)

100

This is an interesting restriction, and i understand why it is here.
I haven't thought too hard about it, but wonder if it could not be gotten rid of in a lot of cases with code or guard insertion, PRE style.

137

You also could do the type filtering you are doing in isDependenceDistanceOfOne here, no?

anemet added a comment.Oct 5 2015, 4:17 PM

Thanks very much, Daniel, for looking at the patch.

Out of curiosity, does this actually catch the PRE load-forwarding cases that currently are tested?

That is, tests/Transform/GVN/pre-load.ll tests 9 and 10 i believe.

It catches the tests that check store-to-load forwarding. These are 6, 7 and 9.

Test 10 has both store-to-load and load-to-load forwarding. This pass catches st->ld but not ld->ld.

I guess in order to catch ld->ld as well we'd need LAA to also track RAR loop-carried dependences. That does not sound difficult.

lib/Transforms/Scalar/LoopLoadElimination.cpp
65

Actually after looking at this again, these checks are not necessary.

Since we only include store->load dependences of known types in the candidates, we can't have mismatching address space or type because that would yield an unknown dependence.

Sorry about the confusion, I'll change these into asserts.

Also as a follow-on, I am planning to make dependence distance an attribute of the dependence class and then this function would go away.

100

I think you're right. We should be able to enhance it further to also handle cases where to load is only partially redundant.

Needless to say, I am going for fully-redundant loads in this initial version.

137

See my explanation in isDependenceDistanceOfOne.

anemet updated this revision to Diff 36568.Oct 5 2015, 4:20 PM
anemet edited edge metadata.

Address dberlin's comments

Test 10 has both store-to-load and load-to-load forwarding. This pass catches st->ld but not ld->ld.
I guess in order to catch ld->ld as well we'd need LAA to also track RAR loop-carried dependences. That does not sound difficult.

Hi Adam,

I believe this is the case I'm looking at, from Livermore Loops, for the RAR case:

a[k] = b[k] + b[k+1];

If the loop cannot be vectorized because of other problems, the LLE pass should be able to catch it, too, like the GVN does today, so best of both worlds.

Regarding this patch, I have some small comments, but overall, looks great!

I'll merge these changes locally and see if I can come up with a few RAR examples to transform.

cheers,
--renato

lib/Transforms/Scalar/LoopLoadElimination.cpp
12

Maybe not necessary to specify store-to-load, here, since this could also do load-to-load in the future?

154

I'm assuming to add LAL deps we'd need to add the case here, and map the type together, so that you can use the right struct (StoreToLoadForwardingCandidate or LoadToLoadForwardingCandidate).

Or maybe just extend to ForwardingCandidate and add a type to it would be simpler.

195

Is this related to D13255?

320

Silly nitpick: I originally thought this function would actually find the forward dependencies, not just find the candidates. Maybe changing the name to findPointers... would be more clear.

369

would this be meaningful?

assert(PtrSCEV)
anemet marked 6 inline comments as done.Oct 26 2015, 8:31 PM

Test 10 has both store-to-load and load-to-load forwarding. This pass catches st->ld but not ld->ld.
I guess in order to catch ld->ld as well we'd need LAA to also track RAR loop-carried dependences. That does not sound difficult.

Hi Adam,

I believe this is the case I'm looking at, from Livermore Loops, for the RAR case:

a[k] = b[k] + b[k+1];

If the loop cannot be vectorized because of other problems, the LLE pass should be able to catch it, too, like the GVN does today, so best of both worlds.

Regarding this patch, I have some small comments, but overall, looks great!

I'll merge these changes locally and see if I can come up with a few RAR examples to transform.

Yes, you're right. It looks like we will need RAR for your case as well.

Adam

lib/Transforms/Scalar/LoopLoadElimination.cpp
12

Makes sense.

154

Yeah probably the latter since they should be pretty similar. But I could be wrong.

195

Yes, the comment is meant to explain why the issue I discovered in D13255 does not affect LLE.

320

This comment may have shifted or I am just confused. Can you please elaborate what you mean?

369

cast has an implicit assert, no?

rengolin added inline comments.Oct 27 2015, 5:32 AM
lib/Transforms/Scalar/LoopLoadElimination.cpp
320

Sorry. It's about:

computePointersWrittenOnForwardingPath(Candidates);

You don't use it elsewhere, and it doesn't seem important on its own. I'm guessing you pulled it out because of collectMemchecks' size. This nitpick was just a personal preference to call it "find" or "collect" instead of "compute", since that's all it's doing. But it's not important.

369

ah, yes, of course!

dberlin accepted this revision.Oct 27 2015, 8:14 AM
dberlin edited edge metadata.

I think, assuming all comments are addressed, that this looks fine to commit and iterate from there.

This revision is now accepted and ready to land.Oct 27 2015, 8:14 AM
anemet added inline comments.Oct 27 2015, 9:58 PM
lib/Transforms/Scalar/LoopLoadElimination.cpp
320

Ah, got it, either of the two is better than "compute", thanks.

While working on top of this patch, I noticed that not all DEBUG() messages are prefixed with "LLE:", which may confuse people using -debug-only=.

I'm confused.
AFAIK, we normally do *not* prefix the messages explicitly (i see this pass
does it in some places but not others, that should be changed to be not
prefixed, IMHO).

This pass has DEBUG_TYPE set to loop-load-elim at the top, so if you use
debug-only, you won't get them unless you specify debug-only=loop-load-elim.

I would suggest rather than prefix *anything*, anywhere, that if we want
prefixes, we just have the DEBUG macro output the debug type in front of
the message ;-)

Then it's automatic, not confusing, and clear where it came from.

I would suggest rather than prefix *anything*, anywhere, that if we want
prefixes, we just have the DEBUG macro output the debug type in front of
the message ;-)

Sounds goo to me.

hfinkel edited edge metadata.Nov 2 2015, 10:37 AM

LGTM too, except for:

I don't think we check anywhere to disqualify volatile/atomic loads/stores, and we need to do that (especially for volatiles, since you're changing the number of accesses, and I'd need to think more-carefully about atomics).

anemet added a comment.Nov 2 2015, 3:57 PM

LGTM too, except for:

I don't think we check anywhere to disqualify volatile/atomic loads/stores, and we need to do that (especially for volatiles, since you're changing the number of accesses, and I'd need to think more-carefully about atomics).

LAA stops the analysis unless the memops are all isSimple(). Let me add a comment about this.

junbuml added a subscriber: junbuml.Nov 3 2015, 6:49 AM
This revision was automatically updated to reflect the committed changes.
anemet marked 13 inline comments as done.