Page MenuHomePhabricator

Loop Versioning for LICM
ClosedPublic

Authored by ashutosh.nema on Apr 21 2015, 7:30 AM.

Details

Summary

I like to propose a new loop multi versioning optimization for LICM.

Cases where memory aliasing wasn't sure compiler became conservative and
didn't proceeds with invariant code motion. This results in possible
missed optimization opportunities.

Our main motivation is to exploit such cases and allow LICM optimization.

Most of the time when alias analysis is unsure about memory access and it
assumes may-alias. This un surety from alias analysis restrict LICM to proceed
further. Cases where alias analysis is unsure we like to use loop versioning
as an alternative.

Loop Versioning will creates version of the loop with aggressive alias and
the other with conservative (default) alias. Aggressive alias version of loop
will have all the memory access marked as no-alias. These two version of loop
will be preceded by a memory runtime check. This runtime check consists of bound
checks for all unique memory accessed in loop, and it ensures aliasing of memory.
Based on this check result at runtime any of the loops gets executed, if memory
is non aliased then aggressive aliasing loop gets executed, else when memory is
aliased then non aggressive aliased version gets executed.

Following are the top level steps:

  • Perform loop versioning feasibility check.
  • If loop is a candidate for versioning then create a memory bound check, by considering all the memory access in loop body.
  • Clone original loop and set all memory access as no-alias in new loop.
  • Set original loop & versioned loop as a branch target of runtime check result.
  • Call LICM::promoteLoopAccessesToScalars on aggressive alias versioned of loop.

Consider following test:

1 int foo(int * var1, int * var2, int * var3, unsigned itr) {
2 unsigned i = 0, j = 0;
3 for(; i < itr; i++) {
4 for(; j < itr; j++) {
5 var1[j] = itr + i;
6 var3[i] = var1[j] + var3[i];
7 }
8 }
9 }

At line #6 store to var3 can be moved out by LICM(promoteLoopAccessesToScalars)
but because of alias analysis un surety about memory access it unable to move it out.

After Loop versioning IR:

<Versioned Loop>
for.body3.loopVersion: ; preds = %for.body3.loopVersion.preheader, %for.body3.loopVersion

%indvars.iv.loopVersion = phi i64 [ %indvars.iv.next.loopVersion, %for.body3.loopVersion ], [ %2, %for.body3.loopVersion.preheader ]
%arrayidx.loopVersion = getelementptr inbounds i32* %var1, i64 %indvars.iv.loopVersion
store i32 %add, i32* %arrayidx.loopVersion, align 4, !tbaa !1, !alias.scope !11, !noalias !11
%indvars.iv.next.loopVersion = add nuw nsw i64 %indvars.iv.loopVersion, 1
%lftr.wideiv.loopVersion = trunc i64 %indvars.iv.loopVersion to i32
%exitcond.loopVersion = icmp eq i32 %lftr.wideiv.loopVersion, %0
br i1 %exitcond.loopVersion, label %for.inc11.loopexit38, label %for.body3.loopVersion
<Original Loop>
for.body3: ; preds = %for.body3.lr.ph, %for.body3

%indvars.iv = phi i64 [ %indvars.iv.next, %for.body3 ], [ %2, %for.body3.lr.ph ]
%arrayidx = getelementptr inbounds i32* %var1, i64 %indvars.iv
store i32 %add, i32* %arrayidx, align 4, !tbaa !1
%8 = load i32* %arrayidx7, align 4, !tbaa !1
%add8 = add nsw i32 %8, %add
store i32 %add8, i32* %arrayidx7, align 4, !tbaa !1
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%lftr.wideiv = trunc i64 %indvars.iv to i32
%exitcond = icmp eq i32 %lftr.wideiv, %0
br i1 %exitcond, label %for.inc11, label %for.body3
In versioned loop difference is visible, 1 store has moved out.

Following are some high level details about current implementation:

LoopVersioning
LoopVersioning is main class which holds multi versioning functionality.

LoopVersioning :: isLegalForVersioning
Its member to ‘LoopVersioning’
Does feasibility check for loop versioning.
a) Checks layout of loop.
b) Instruction level check.
c) memory checks.

LoopVersioning :: versionizeLoop
a) Clone original loop
b) Create a runtime memory check.
c) Add both loops under runtime check results target.

In this patch used maximum loop nest threshold as 2, and maximum number
of pointers in runtime memory check as 5, also these threshold are controlled
by command line.

During feasibility check, we compare possible invariant code motion with total
instruction of loop. If the comparison goes beyond certain threshold limit and
found profitable then only we do loop versioning.

Requesting to go through patch for detailed approach.
Suggestions are comments are welcome.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
hfinkel added inline comments.Dec 11 2015, 3:29 AM
docs/LangRef.rst
4386 ↗(On Diff #39332)

Add a space just before "(LICM)".

4387 ↗(On Diff #39332)

clone loop -> all other loop versions

4387 ↗(On Diff #39332)

original loop -> the original loop

4388 ↗(On Diff #39332)

It helps suggesting loop-versioning-licm pass that the loop should not be re-versioned. -> The loop-versioning-licm pass will not create additional loop versions of loops with this metadata.

lib/Transforms/Scalar/LoopVersioningLICM.cpp
12 ↗(On Diff #39332)

un surety -> uncertainty

12 ↗(On Diff #39332)

restrict LICM to proceed -> restricts LICM from proceeding

13 ↗(On Diff #39332)

Cases -> In cases

13 ↗(On Diff #39332)

like to -> will

16 ↗(On Diff #39332)

creates version -> create a version

16 ↗(On Diff #39332)

alias -> aliasing assumptions

ashutosh.nema marked 27 inline comments as done.

Thanks Hal, for again looking into this change.

Do the runtme checks inserted cover only the interactions between
the invariant access and the non-invariant accesses, or do they also
perform range overlap checks on the non-invariant accesses?

Runtime check also perform overlap checks between non-invariant
accesses. without checking all memory access we can’t make aggressive
aliasing assumptions.

In this patch, incorporated comments from Hal.

Hi Hal,

Is this looks OK now ?

Regards,
Ashutosh

Thanks Hal, for again looking into this change.

Do the runtme checks inserted cover only the interactions between
the invariant access and the non-invariant accesses, or do they also
perform range overlap checks on the non-invariant accesses?

Runtime check also perform overlap checks between non-invariant
accesses. without checking all memory access we can’t make aggressive
aliasing assumptions.

In that case, you should also add llvm.mem.parallel_loop_access metadata to the versioned loop. Otherwise, the vectorizer will add a duplicate set of checks should it decide to vectorize the versiond loop. We also need to add metadata to the original loop to disable vectorization (as we already know that the vectorization checks will have failed when that loop is reached).

lib/Transforms/IPO/PassManagerBuilder.cpp
109 ↗(On Diff #42692)

new, experimental -> experimental

(If something is experimental, and yet not new, something's probably wrong ;) )

lib/Transforms/Scalar/LoopVersioningLICM.cpp
5 ↗(On Diff #42692)

Remove this paragraph.

9 ↗(On Diff #42692)

Remove this line (it, and the paragraph above, are not needed because the problem is explained clearly in the next paragraph).

62 ↗(On Diff #42692)

These lines need to appear directly under the "The LLVM Compiler Infrastructure" line.

135 ↗(On Diff #42692)

This utility function is not right, we need to combine the loop metadata here with any other loop metadata which might also exist.

Please see the writeHintsToMetadata function in lib/Transforms/Vectorize/LoopVectorize.cpp and/or the relevant code in lib/Transforms/Utils/LoopUnrollRuntime.cpp near the end of the CloneLoopBlocks function.

As a general note, we really should refactor this into a common utility function.

154 ↗(On Diff #42692)

Add here:

AU.addPreserved<AAResultsWrapperPass>();
AU.addPreserved<GlobalsAAWrapperPass>();

(so that you don't kill off GlobalsAA here).

310 ↗(On Diff #42692)

typeCheck -> TypeCheck

323 ↗(On Diff #42692)

You're checking here that all points in the given alias set have the same type; why?

Dear All,

sorry for jumping in late. I wonder if there is a specific reason this pass is run as in the canonicalization phase as part of the inliner loop. Would it not make more sense to run it before the vectorizers?

Best,
Tobias

Hi Tobias,

I wonder if there is a specific reason this pass is run as in the
canonicalization phase as part of the inliner loop.

Not very clear, what you mean by 'inliner loop', is it inner loop ?

Would it not make more sense to run it before the vectorizers?

If you see pass manager its scheduled prior to Vectorizer.
As its LICM based, it’s not performing any vectorizer legality.
It helps in cases where LICM became conservative because of
memory aliasing uncertainty.

Thanks,
Ashutosh

Hi Hal,

Will work on your comments and come back.

Thanks,
Ashutosh

lib/Transforms/Scalar/LoopVersioningLICM.cpp
323 ↗(On Diff #42692)

Actually this is a pre-condition in LICM’s “promoteLoopAccessesToScalars”
where it expects all pointers in alias should have same type.

<LICM.cpp>
881 Check that all of the pointers in the alias set have the same type. We
882
cannot (yet) promote a memory location that is loaded and stored in
883 // different sizes.
884 if (SomePtr->getType() != ASIV->getType())
885 return Changed;

To confirm same behaviour we added this check.

Addressed Hal's comment.

mehdi_amini added inline comments.Jan 2 2016, 8:36 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
349 ↗(On Diff #43509)

Why is is done here and not where LICM is usually done? Line 419?
Did you try multiple placements? Do you have benchmarks results that drives this choice?

ashutosh.nema added inline comments.Jan 12 2016, 9:10 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
349 ↗(On Diff #43509)

We want to run this when inining is over, because after that we may see more accurate aliasing.
That’s why we placed it after ‘createBarrierNoopPass’. If we do this earlier probably of no-alias
aliasing assumptions in version loop, other optimizations may get some benefit.

This can be schedule later as well but I don’t see any benefit.

We have observed good gains in our internal benchmarks, mostly based on customer workloads.

mehdi_amini added inline comments.Jan 12 2016, 10:36 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
349 ↗(On Diff #43509)

I'm not sure I follow why you need the inliner to be fully complete over the whole module and not having processed the function on which you want to run LICM. What kind of aliasing will this avoid? Do you have an example?

Also, you are saying that you don't see any benefit of scheduling this later, can you elaborate with respect to the other run of LICM I pointed?

Also, I ask about benchmark with respect to the different possible placement for this pass, and you didn't answer this in particular.

grosser added inline comments.Jan 12 2016, 10:40 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
349 ↗(On Diff #43509)

One reason to run this late is that too early versioning may prevent both further inlining (due to increased code size) as well as versioning later a possibly larger loop.

Furthermore, to my understanding the inlining loop is indeed a canonicalization phase. We are planning to run Polly after the inliner loop for this very same reason.

mehdi_amini added inline comments.Jan 12 2016, 10:47 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
349 ↗(On Diff #43509)

Fair for the inliner loop. But what about the already existing LICM that already runs a little bit later?

grosser added inline comments.Jan 12 2016, 10:55 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
349 ↗(On Diff #43509)

Are you talking about the LICM pass within the inliner loop? It is to my understanding part of the canonicalization sequence that helps both to eliminate loops (in case all statements can be proven loop invariant) and which also helps SCEV by moving certain SCEVUnknowns further out of the tree. As the existing LICM does not add a large code size increase, it seems to be save to be run in the inliner loop.

(LICM is not a pure canonicalization hence it causes some troubles for Polly, but we are working on addressing them on our side).

mehdi_amini added inline comments.Jan 12 2016, 11:27 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
349 ↗(On Diff #43509)

No, not within the inliner loop, see line 419 (or 411 before the patch)

ashutosh.nema added inline comments.Jan 12 2016, 11:37 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
349 ↗(On Diff #43509)

We did tried multiple placements i.e. prior to inlining completion and post completion.
As this is an internal benchmark, we can’t disclose details. But with these placements we
did not noticed any degrade in our regular testing. Reason for running just after inlining
completion is other passes might gain from non-alias version of loop (As of now we do
not have any case where other optimization getting benefits from no-alias version of loop).

I think this is getting really close to being ready.

docs/LangRef.rst
4531 ↗(On Diff #43509)

For consistency with our other metadata (such as llvm.loop.unroll.disable), we should name this:

llvm.loop.licm_versioning.disable
4534 ↗(On Diff #43509)

This paragraph is too implementation-specific. How about this:

This metadata indicates that the loop should not be versioned for the purpose of enabling loop-invariant code motion (LICM). The metadata has a single operand which is the string `llvm.loop.licm_versioning.disable`.
include/llvm/Transforms/Utils/LoopUtils.h
380 ↗(On Diff #43509)

Remove space in between the * and TheLoop.

383 ↗(On Diff #43509)

How about calling this:

addStringMetadataToLoop
lib/Transforms/Scalar/LoopVersioningLICM.cpp
10 ↗(On Diff #43509)

This first statement is actually misleading; it should always return MayAlias when it is uncertain. How about saying this:

Then alias analysis is uncertain about the aliasing between any two accesses, it will return MayAlias.
12 ↗(On Diff #43509)

will -> might

16 ↗(On Diff #43509)

and the other -> in addition to the original

21 ↗(On Diff #43509)

"Based on this check result at runtime any of the loop gets executed," -> The result of the runtime check determines which of the loop versions is executed:

28 ↗(On Diff #43509)

LoopVersioningLICM -> LoopVersioningLICM's

30 ↗(On Diff #43509)

access -> accesses

31 ↗(On Diff #43509)

access -> accesses

32 ↗(On Diff #43509)

of runtime check -> of the runtime check

35 ↗(On Diff #43509)

transform -> transforms

95 ↗(On Diff #43509)

Please update to llvm.loop.licm_versioning.disable

100 ↗(On Diff #43509)

instruction -> instructions

103 ↗(On Diff #43509)

LoopVersioningLICM threshold minimum -> LoopVersioningLICM's minimum

104 ↗(On Diff #43509)

for -> of

105 ↗(On Diff #43509)

instruction -> instructions

105 ↗(On Diff #43509)

in a -> per

112 ↗(On Diff #43509)

LoopVersioningLICM -> LoopVersioningLICM's

118 ↗(On Diff #43509)

LoopVersioningLICM's maximum number of pointers per runtime check

302 ↗(On Diff #43509)

Why are you checking this?

386 ↗(On Diff #43509)

Shouldn't you use LAI->getNumRuntimePointerChecks() here instead of PointerSet.size()?

413 ↗(On Diff #43509)

Performing this check per instruction seems silly. Maybe move to legalLoopStructure?

632 ↗(On Diff #43509)

do-loop-versioning-licm -> loop-versioning-licm

(I often just use DEBUG_TYPE for this; there's no reason for them to be out-of-sync)

643 ↗(On Diff #43509)

do-loop-versioning-licm -> loop-versioning-licm (or DEBUG_TYPE)

mehdi_amini added inline comments.Jan 15 2016, 11:21 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
349 ↗(On Diff #43509)

Well as long as it is not enabled by default in the pipeline, it does not matter. It is worth noting that this question may come back later though.

hfinkel added inline comments.Jan 15 2016, 11:29 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
349 ↗(On Diff #43509)

I don't think that making functions larger prior to inlining is something we can justify (i.e. I'm sure that will cause regressions). Also, as functions get inlined, it is likely to turn out that aliasing that we could not reason about in the function in isolation can be reasoned about in the context of a particular caller. Making the decision to version early will be problematic in that sense as well (it would be nice to think that, in such a case, we'd statically evaluate the runtime condition one way or the other, but I don't think that can happen now either).

mehdi_amini added inline comments.Jan 15 2016, 11:42 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
349 ↗(On Diff #43509)

After talking on IRC, the answer to my original question about the LICM that runs later in the pipeline is that it runs after the vectorizer as part of post-vectorize cleanup. The current point of the multi-versioning LICM is different: it may help to allow better vectorization, so it makes sense to differentiate with the existing LICM.

hfinkel added inline comments.Jan 15 2016, 11:47 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
349 ↗(On Diff #43509)

Correct. One issue, however, is that unconditionally running LICM after the versioning pass is not ideal (in the compile-time sense), because LICM Is not super cheap.

Given that we broke LICM up into a set of utility functions (exposed in include/llvm/Transforms/Utils/LoopUtils.h), we should really call them from this pass only if we actually do anything.

anemet added inline comments.Jan 15 2016, 11:57 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
346–349 ↗(On Diff #43509)

@ashutosh.nema, We should document the outcome of the above discussion in a comment here. E.g.:

The reason we run LICMLVer right after the CGSCC pass manager so that later patches can take advantage of the new non-alias annotations.

And the reason we actually schedule an LICM at this point so that invariant access to global could be hoisted out of the loop which could allow further vectorizations.

@hfinkel, Please let me know if any of this is inaccurate.

ashutosh.nema added inline comments.Jan 18 2016, 10:34 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
346–349 ↗(On Diff #43509)

Sure Adam, will document key points in comments.

349 ↗(On Diff #43509)

LICM utilities can be called from LoopVersioningLICM, but we identified issues
running it, i.e. in few cases we noticed LoopInfo for clone is not getting updated,
which can make this optimization ineffective.
Understand its compile time expensive to run complete pass, but If we run LICM
pass later we will not see issues like LoopInfo get getting updated.
Also this is an optional pass, should be OK to run LICM.

anemet added inline comments.Jan 20 2016, 6:05 PM
lib/Transforms/Scalar/LoopVersioningLICM.cpp
555–575 ↗(On Diff #43509)

You probably want to add a comment saying that you can add no-alias between any pairs of memory operations because you ignore loops with must aliasing accesses. Otherwise I don't think this would be valid.

622–623 ↗(On Diff #43509)

Why is this necessary? LoopVersioning is supposed to update the DT.

Addressed Hal's & Adam's comments.

ashutosh.nema added inline comments.Jan 22 2016, 5:01 AM
lib/Transforms/Scalar/LoopVersioningLICM.cpp
302 ↗(On Diff #43509)

Loop trip count is required for runtime check generation, as the bound checks are based on this.

386 ↗(On Diff #43509)

‘LAI->getNumRuntimePointerChecks()’ returns number of runtime checks but here we are checking maximum possible pointers in runtime check.

632 ↗(On Diff #43509)

Tried making it 'loop-versioning-licm' but later found command line error mentioning option registered more than once.
So, made DEBUG_TYPE as “do-loop-versioning-licm”. Hoping this should be OK.

Hi Hal,

Is this Looks OK now ?

Regards,
Ashutosh

hfinkel added inline comments.Feb 2 2016, 9:44 PM
lib/Transforms/IPO/PassManagerBuilder.cpp
120 ↗(On Diff #45680)

"loop-versioning-licm" -> "enable-loop-versioning-licm"

lib/Transforms/Scalar/LoopVersioningLICM.cpp
303 ↗(On Diff #45680)

Okay, please explain this better in the comment. Something like:

// We need to be able to compute the loop trip count in order to generate the bound checks.
387 ↗(On Diff #45680)

Why do you care how many pointers there are? I'd think only the number of checks generated matters in terms of cost.

633 ↗(On Diff #45680)

No, please rename the other option (as I noted by the other option, putting adding enable- as a prefix would be natural). Then make this change.

Thanks Hal.

I'll work on your comments and come back.

lib/Transforms/Scalar/LoopVersioningLICM.cpp
633 ↗(On Diff #45680)

I did not understood this comment completely.

Are you saying use "enable-loop-versioning-licm" here, in DEBUG_TYPE & in PassManagerBuilder ?

hfinkel added inline comments.Feb 3 2016, 4:23 AM
lib/Transforms/Scalar/LoopVersioningLICM.cpp
633 ↗(On Diff #45680)

Use "enable-loop-versioning-licm" in PassManagerBuilder, and then use "loop-versioning-licm" in DEBUG_TYPE and here. Thanks!

Thanks Hal for clarification, will make these changes and come back.

Regards,
Ashutosh

Addressed Hal's comments.

Corrected include file order.

hfinkel accepted this revision.Feb 5 2016, 11:03 AM
hfinkel edited edge metadata.

LGTM, thanks!

This revision was automatically updated to reflect the committed changes.