This is an archive of the discontinued LLVM Phabricator instance.

[GVN] Improve alias analysis for parallel accesses
Abandoned · Public

Authored by alban.bridonneau on Aug 18 2021, 2:15 AM.

Details

Summary

This patch helps to eliminate more loads in loops with the omp simd pragma,
as part of the GVN pass.
This pragma indicates groups of loads/stores that we know are not
aliasing. Through better alias analysis, we can more accurately find
a memory access that provides the same data as the loads we want to
eliminate.

Diff Detail

Unit Tests: Failed

Event Timeline

alban.bridonneau requested review of this revision. · Aug 18 2021, 2:15 AM
Herald added a project: Restricted Project. · View Herald Transcript · Aug 18 2021, 2:15 AM

The build error seems unrelated to this patch. There is another build next to mine, for a different patch, with the same unit test failing.

Based on the current definition of llvm.loop.parallel_accesses it's not clear that it can be used to draw conclusions about aliasing. That metadata only provides guarantees about the lack of loop-carried dependencies. The loads/stores could still alias if they have loop-independent (i.e. intra-iteration) dependencies.
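To illustrate in IR terms (a hypothetical sketch, not taken from the patch): both accesses below belong to the access group that the loop declares parallel, yet nothing prevents %p and %q from holding the same address within a single iteration.

body:
  %v = load i32, i32* %p, align 4, !llvm.access.group !0
  store i32 %v, i32* %q, align 4, !llvm.access.group !0
  br i1 %done, label %exit, label %body, !llvm.loop !1

!0 = distinct !{}                            ; access group
!1 = distinct !{!1, !2}                      ; loop metadata of the annotated loop
!2 = !{!"llvm.loop.parallel_accesses", !0}   ; no loop-carried dependencies among group !0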

llvm/include/llvm/Analysis/MemoryDependenceAnalysis.h
467

belongToSameLoopParallelAccessGroup might be a more appropriate name.
Is the BB parameter really needed? If so, a comment should be added to describe what it is.

Thanks for the review. I started looking at the metadata definition and previous messages from the mailing list. It seems you're right, and the metadata doesn't give the information I thought it did.
I'd love to get some confirmation if someone knows for sure that this is an abuse of this particular metadata.
I'll come back to the IR of the original test case and see if there is some other information I should have used.

Meinersbur requested changes to this revision. · Aug 18 2021, 9:04 AM

llvm.loop.parallel_accesses is only relative to the loop that it is attached to. The information is not useful unless the loop under consideration is also passed in. That is, an access may be "parallel" in one loop but not in another. For instance:

for (int i = 0; i < n; ++i) {
  for (int j = 0; j < n; ++j) {
    A[j] = ...; // DepInst
    use(A[j]); // QueryInst
  }
}

Clearly, DepInst and QueryInst do alias every time, yet the j-loop can be executed in parallel, so llvm.loop.parallel_accesses can be attached to it. However, the i-loop cannot be executed in parallel.
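For reference, a rough sketch of how the metadata for that nest could be attached (identifiers are illustrative, not from the patch): the parallel_accesses entry sits only on the j-loop's !llvm.loop node, so nothing follows for queries made with respect to the i-loop.

j.body:
  store double %v, double* %a.j, align 8, !llvm.access.group !3   ; A[j] = ... (DepInst)
  %u = load double, double* %a.j, align 8, !llvm.access.group !3  ; use(A[j]) (QueryInst)
  br i1 %j.done, label %i.latch, label %j.body, !llvm.loop !0

i.latch:
  br i1 %i.done, label %exit, label %i.header, !llvm.loop !2

!0 = distinct !{!0, !1}                      ; j-loop: accesses in group !3 are parallel
!1 = !{!"llvm.loop.parallel_accesses", !3}
!2 = distinct !{!2}                          ; i-loop: no parallel_accesses entry
!3 = distinct !{}                            ; access group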

This revision now requires changes to proceed. · Aug 18 2021, 9:04 AM

Could I have some more information with regards to the proposed changes?

  • With regards to having the correct loop, I am guessing what we want is to use the loop directly related to the basic block of the query instruction. I am envisioning something like
BasicBlock *BB1 = QueryInst->getParent();
Loop *L = Loops.getLoopFor(BB1); // Loops being the LoopInfo analysis result

In which case, we can remove the BB argument to the function created in this patch, and we can also ensure that both the QueryInst and the DepInst are from the same basic block and the same loop. Does that match what you had in mind?

  • With regards to using this metadata for better alias analysis, does that mean the particular case in this patch is alright with you?

In which case, we can remove the BB argument to the function created in this patch, and we can also ensure that both the QueryInst and the DepInst are from the same basic block and the same loop. Does that match what you had in mind?

I think any use of llvm.loop.parallel_accesses that does not take into account which loop it is attached to cannot be correct. A simple mitigation is to test whether the access is parallel in all loops that contain it. Still, there are irreducible control flow and intra-iteration dependencies, as in your test case.

  • With regards to using this metadata for better alias analysis, does that mean the particular case in this patch is alright with you?

It would help if the test mentioned what it is trying to achieve. A pseudocode equivalent would also be helpful.

I think the test is incorrect. %gepout may alias with %gep1 in the same iteration when %out == %in. From the llvm.loop.parallel_accesses metadata one could at most infer that %gep1 does not alias with %gepout from other iterations of the omp.inner.for.body loop. Even that is probably not valid to infer. Strictly, the metadata only states that the memory accesses are not in the way of parallelizing/vectorizing the loop. It may well be that the store is a trivial store (it always writes the value the memory already contains, e.g. because one of %0/%1 is always 0[*]) and hence aliasing still might have occurred.

[*] It is controversial whether hardware has well-defined behavior for such writes.
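To make the concern concrete, here is a hypothetical reduction of that situation (not the actual test): if %out happens to equal %in, the store and the second load touch the same address in the same iteration, and the parallel annotation says nothing about that.

omp.inner.for.body:
  %gep1 = getelementptr inbounds i32, i32* %in, i64 %iv
  %gepout = getelementptr inbounds i32, i32* %out, i64 %iv
  %x1 = load i32, i32* %gep1, align 4, !llvm.access.group !0
  store i32 %x1, i32* %gepout, align 4, !llvm.access.group !0
  %x2 = load i32, i32* %gep1, align 4, !llvm.access.group !0   ; GVN would like to reuse %x1 here

!0 = distinct !{}   ; access group declared parallel on the loop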

Thank you for the clarifications!
I see now that the llvm.loop.parallel_accesses metadata can't be used in the way that we wanted. We have also gone back to the original IR and found that there is a more appropriate way to tackle our issue, so I'll close this review. I'll also add a bit more information about our case below, in case you're interested.

Our case is basically the one shown in the unit test:

%x1 = load i32, i32* @x    ; first load of @x
store i32 %v, i32* @y      ; intervening store (%v is some value)
%x2 = load i32, i32* @x    ; candidate for elimination

We should be able to eliminate the second load, but we can only do that if we know it does not alias with the store.
The issue we tried to solve is that the load is eliminated when the loop is written plainly, but when we use a simd pragma and annotate the loop as parallel, the load is no longer eliminated. So the pragma that we use to get better performance actually has the opposite effect.

We're going to work next on the Loop Vectorizer. When the loop is annotated parallel, the vectorizer doesn't create any runtime checks and just vectorizes. Without the parallel annotation, we create runtime memory checks, version the loop, and the Loop Vectorizer puts new metadata on the memory accesses (noalias/alias.scope). This is what we want to use in order to do more load elimination. Even though no checks are needed for vectorization itself, we might still want some level of runtime checking in order to further optimize the vectorized loop.
That's the basic idea anyway; we'll need to work on it to figure out exactly what to do.
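As a hedged illustration of the metadata we would then rely on (scope names here are invented, not the exact form the vectorizer emits): once the versioned loop carries scoped-noalias information, the store is known not to touch the scope the loads belong to, and GVN can forward %x1 to %x2.

%x1 = load i32, i32* @x, align 4, !alias.scope !0
store i32 %v, i32* @y, align 4, !noalias !0          ; %v is whatever value gets stored
%x2 = load i32, i32* @x, align 4, !alias.scope !0    ; can now be replaced by %x1

!0 = !{!1}                                   ; scope list containing scope !1
!1 = distinct !{!1, !2}                      ; alias scope for the @x accesses
!2 = distinct !{!2, !"versioning domain"}    ; scope domain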

alban.bridonneau abandoned this revision. · Aug 23 2021, 1:08 AM

Thanks @alban.bridonneau for the explanation.