This is an archive of the discontinued LLVM Phabricator instance.

[LICM] Add support of a new optimization case to Loop Versioning for LICM + code clean up
Needs ReviewPublic

Authored by eastig on Sep 26 2016, 11:56 AM.

Details

Summary

At the moment Loop Versioning for LICM does not support the following loops which, if versioned, give ~+18-40%% score improvement of benchmarks on Cortex-M7:

void mem_copy_01(char **dst, char **src, int bytes_count) {
    while (bytes_count--)
    {
      *((*dst)++) = *((*src)++);
    }
}

void mem_copy_02(char **dst, char *src, int bytes_count) {
    while (bytes_count--)
    {
      *((*dst)++) = *src++;
    }
}

void mem_copy_03(char *dst, char **src, int bytes_count) {
    while (bytes_count--)
    {
      *dst++ = *((*src)++);
    }
}

IR of mem_copy_01:

define void @mem_copy_01(i8** nocapture %dst, i8** nocapture %src, i32 %bytes_count) {
entry:
  %tobool2 = icmp eq i32 %bytes_count, 0
  br i1 %tobool2, label %while.end, label %while.body

while.body:                                       ; preds = %entry, %while.body
  %bytes_count.addr.03 = phi i32 [ %dec, %while.body ], [ %bytes_count, %entry ]
  %dec = add nsw i32 %bytes_count.addr.03, -1
  %0 = load i8*, i8** %src, align 4, !tbaa !3
  %incdec.ptr = getelementptr inbounds i8, i8* %0, i32 1
  store i8* %incdec.ptr, i8** %src, align 4, !tbaa !3
  %1 = load i8, i8* %0, align 1, !tbaa !7
  %2 = load i8*, i8** %dst, align 4, !tbaa !3
  %incdec.ptr1 = getelementptr inbounds i8, i8* %2, i32 1
  store i8* %incdec.ptr1, i8** %dst, align 4, !tbaa !3
  store i8 %1, i8* %2, align 1, !tbaa !7
  %tobool = icmp eq i32 %dec, 0
  br i1 %tobool, label %while.end, label %while.body

while.end:                                        ; preds = %while.body, %entry
  ret void
}

LoopAccessAnalysis can create aliasing checks for src and dst but not for *src and *dst because *src and *dst are loaded from memory. If we look at IR above we can notice how the pointers are defined and used (InvPtr - loop invariant pointer):

Ptr = Load(InvPtr)
NewPtr = GEP(Ptr, Const)
Store(NewPtr, InvPtr)
Mem_operations using Ptr
  1. If Ptr and InvPtr are not aliased at the iteration N then at the iteration N+1 the value of Ptr is the value defined by the GEP instruction.
  2. Without aliasing Ptr has values from [Ptr0, Ptr0 + (number_of_iterations-1) * type_size * GEP_index], where Ptr0 is Load(InvPtr) at the first iteration.

Absence of aliasing means:

4_or_8_bytes_aligned(Ptr0) != InvPtr                                          : iteration 1
4_or_8_bytes_aligned(Ptr0 + type_size*GEP_index) != InvPtr   : iteration 2
4_or_8_bytes_aligned(Ptr0 + 2*type_size*GEP_index) != InvPtr: iteration 3
...

Aligned Ptr0 is used because InvPtr is a pointer to a pointer and it's aligned either 4 or 8 bytes.
We can write a stricter check:
InvPtr is not in [4_or_8_bytes_aligned(Ptr0), Ptr0 + (number_of_iterations-1) * type_size * GEP_index]
which guarantees all checks above are satisfied.

We check only aliasing among pointers loaded from invariant locations and pointers to those locations which is enough to make decisions to move operations on invariant pointers out of a loop. As checks are for the purpose of LICM and don't cover all pointers combinations creation/adding of them can not be in LoopAccessAnalysis/LoopVersioning. LoopAccessAnalysis/LoopVersioning should provide a means of processing unrecognized pointers and adding checks for them.

Summary of changes:

  1. Clean up of the code of Loop Versioning for LICM. See comments to the changes below.
  2. LoopVersioning::versionLoop functions are changed to return BasicBlock where RT checks are inserted. The return basic block can be used for inserting additional checks.
  3. LoopAccessAnalysis can operate in 'PartialCheckingAllowed' state which mean to create RT checks for recognized pointers and collect unrecognized pointers. The unrecognized pointers can be processed by a user of LAA later.
  4. Recognition of the new optimization case is added to Loop Versioning for LICM.
  5. New tests are added.
  6. Old tests are updated.

Event Timeline

eastig updated this revision to Diff 72477.Sep 26 2016, 11:56 AM
eastig retitled this revision from to [LICM] Add support of a new optimization case to Loop Versioning for LICM + code clean up.
eastig updated this object.
eastig added reviewers: hfinkel, anemet, ashutosh.nema.
eastig added subscribers: llvm-commits, jmolloy.
eastig added inline comments.Sep 26 2016, 12:24 PM
lib/Transforms/Scalar/LoopVersioningLICM.cpp
17

We can mark only memory operation for which RT checks are created.

161

TargetLibraryInfo is not used in the code.

172

Removed unused fields or made them function scoped instead of class scoped.

189

Made class fields private because there is no need to have them public.

305

TypeCheck is needed when we with AliasSet with MustAlias. E.g. char * and char ** pointers may be aliased but they have different types.

502

Loop Versioning provides API for annotating a loop with 'no alias' metadata. It's better not to duplicate this functionality.

524

This is not safe. Replaced with unique_ptr.

test/Transforms/LoopVersioningLICM/loopversioningLICM1.ll
15

Meaningless metadata. The instruction is not aliased with itself by default. Also it's better not to hard-coded ids as they can be changed.

ashutosh.nema edited edge metadata.Sep 27 2016, 4:33 AM

Thanks Evgeny for working on this.

Currently when pointers bounds are unknown we don’t consider them
for alias check and we avoid versioning, but this change introduces partial
alias check, it may be risky for applications, wondering what’s the purpose partial
alias check.

lib/Transforms/Scalar/LoopVersioningLICM.cpp
403

This function looks for very specific pattern, need to generalize.

431

Why restricting to single load & its users as memory operations & GEP ?

446

Why one use ?

489

Current LoopVersioningLICM supports loop nest, i.e. it targets innermost loop.
This behavior should be keep supported.

496

Earlier collected write to memory as well, why don't want to consider them here ?

Thanks Evgeny for working on this.

Currently when pointers bounds are unknown we don’t consider them
for alias check and we avoid versioning, but this change introduces partial
alias check, it may be risky for applications, wondering what’s the purpose partial
alias check.

The purpose of the partial alias checking is to split responsibility between LoopAccessAnalysis and its user. LoopAccessAnalysis is a very powerful tool but it tries to cover all pointers in a loop which is not always possible. IMHO it's up to the user of LoopAccessAnalysis what to do with unrecognized pointers.
E.g. if there are three pointers:

LoopInvariantPointer
PointerA
PointerB

then LICM can move operations on LoopInvariantPointer out of a loop if it's proven that LoopInvariantPointer is not aliased with PointerA and PointerB. We don't need to check aliasing of PointerA and PointerB. At the moment if any of pointers are not recognized no checks are created. Why not to allow to reuse results of analysis? Actually I would like to have more control on which RT checks need to be finally created.

eastig added inline comments.Sep 27 2016, 8:15 AM
lib/Transforms/Scalar/LoopVersioningLICM.cpp
403

Yes, it's currently for very a specific pattern which we've got from benchmarks and real code.
To generalize the function I need to have more patterns.

431

Because this is the current pattern we want to recognized. More uses might create sophisticated DFG which is not worth to analyze. Such cases should be discovered first then get analyzed.

446

The same answer as above. Lack of real use cases.

489

The old cases work as before, nothing is changed.
The behaviour is only changed for the new cases. Loads are very heavy operations it is too dangerous to move them from an inner loop to an upper loop because the upper loop might have 1000 iterations and the inner loop might have 10. If there are aliased pointers a operation loading a pointer will be executed 11000 times instead of 10000 times: 1000 times in the RT checking basic block + 10000 times in the original loop. To make a proper decision an execution profile should be used.

496

LoopAccessAnalysis collects pointers in terms of Values. Writes are users of those pointers. So for each pointer in a loop we have an instruction defining it.

ashutosh.nema added inline comments.Oct 4 2016, 1:14 AM
lib/Transforms/Scalar/LoopVersioningLICM.cpp
489

Agree it does not change the existing behavior, but for new cases why you enforcing such restriction.

In your example it can be other way around as well where the inner loop has 1000 iteration and outer loop has 10 iterations and its actually beneficial to hoist load from inner load to outer lop. LoopVersioning LICM does not make any hoisting decision, actual decision & hoisting will be done later by LICM.

Are you getting issues/degrades by allowing inner loops ?

eastig added inline comments.Oct 4 2016, 1:57 AM
lib/Transforms/Scalar/LoopVersioningLICM.cpp
489

Are you getting issues/degrades by allowing inner loops ?

No, I have not seen any issues and performance regressions yet.
I finally made runs when the pass was enabled by default as a part of the middle-end. Before I had manually run opt with the pass and then llc.
I've got +20%...+48.5% performance improvement.
This restriction is fully based on my experience of implementing different optimizations for loop nests.
I will remove the restriction.

eastig updated this revision to Diff 73430.Oct 4 2016, 2:06 AM
eastig updated this object.
eastig edited edge metadata.

Removed the restriction not to apply the optimization in case of loop nest.

eastig marked an inline comment as done.Oct 10 2016, 9:11 AM

Ping

I just had a brief look through this, but I don't think LAA changes are correct (see the inline comment).
Also if there are non-functional changes here, could you split them into another review (this would make things easier to understand)?

Cheers,
Silviu

lib/Analysis/LoopAccessAnalysis.cpp
664

Won't this just make canCheckPtrAtRT say that it can ignore all pointers with unknown bounds?
I think this would be incorrect for all other LAA users.

eastig added inline comments.Oct 10 2016, 3:01 PM
lib/Analysis/LoopAccessAnalysis.cpp
664

They are not ignored. They are collected to be post-processed.

The default behaviour is not to create any RT checks if a pointer with unknown bounds has been found. I think this is too strict. What if I don't want to discard recognized pointers and RT checks for them. Any idea how to implement this without duplicating the code?

The default behaviour is not changed if a LAA user does not want to have information about pointers with unknown bounds. I don't how the default behaviour can be changed unintentionally.

LAA was originally developed for the Loop Vectorizer. But now it is suggested to be used as the loop memory dependence framework. The problem is that the Loop Vectorizer specifics are not removed. The current users of LLA, besides the Loop Vectorizer, are LoopDistribute and LoopLoadElimination passes. They are also for vectorizing loops.

anemet added inline comments.Oct 10 2016, 3:17 PM
lib/Analysis/LoopAccessAnalysis.cpp
664

It would be good if you could describe what behavior you're trying to achieve.

If I remember correctly both LDist and LLE post-process the full set of run-time checks, so you probably want something similar. (Sorry if that is what you're doing. I haven't looked at the patch because it has many unrelated changes.)

LAA was originally developed for the Loop Vectorizer. But now it is suggested to be used as the loop memory dependence framework. The problem is that the Loop Vectorizer specifics are not removed.

This is not accurate and it's misleading. Yes, the LV-specific APIs were retained but new generic API were developed. The LV-specific APIs are now typically formulated in terms of the generic APIs. It was done this way to allow incremental migration. If the LV-specific APIs are misleading we could move them to LV.

sbaranga added inline comments.Oct 11 2016, 2:21 AM
lib/Analysis/LoopAccessAnalysis.cpp
664

The result of this analysis is being cached (see LoopAccessLegacyAnalysis::getInfo) , so it will have whatever flags the first caller uses and subsequent calls won't recompute it. Since the other users of LAA don't know about this interface change, it will cause them to assume that the pointers with unknown bounds can be checked (even though we won't emit any checks for them).

It looks like for your case you could keep the CanDoRT = false; (so there's no interface change), record the pointers that cannot be checked and not clear the RTCheck when we can't do all the checks (a few lines below). You would probably need to also clear it at some other point (maybe at the start of canCheckPtrAtRT?).

eastig added inline comments.Oct 11 2016, 3:33 AM
lib/Analysis/LoopAccessAnalysis.cpp
664

Thank you, Silviu. Now I see the issue.

Put NFC of LoopVersioningLICM to https://reviews.llvm.org/D25464