This is an archive of the discontinued LLVM Phabricator instance.

[LoopCacheAnalysis] Enable delinearization of fixed sized arrays
ClosedPublic

Authored by congzhe on Mar 31 2022, 7:01 PM.

Details

Summary

This is a follow-up to the discussion during the LoopOptWG meeting.

Currently, loop cache cost (LCC) cannot analyze fixed-size arrays because it cannot delinearize them. This patch adds to LCC the capability to delinearize fixed-size arrays.
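
For illustration, a minimal loop nest over a fixed-size array of the kind this patch targets (the array shape below is an assumption chosen for the example, not taken from the patch's tests):

void foo() {
  static double A[256][2048];
  for (long i = 0; i < 256; ++i)
    for (long j = 0; j < 2048; ++j)
      A[i][j] = i + j; /* a fixed-size access that LCC previously could not delinearize */
}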

Please note that the function newly added in this patch, IndexedReference::tryDelinearizeFixedSize(), duplicates some logic from DependenceInfo::tryDelinearizeFixedSize() in DependenceAnalysis.cpp. In DependenceAnalysis we delinearize two memory accesses, Src and Dst, whereas in LCC we only work on one memory access, so I did not directly reuse DependenceInfo::tryDelinearizeFixedSize() in LCC. However, if desired, I could extract IndexedReference::tryDelinearizeFixedSize() as a utility function, move it to DependenceAnalysis.cpp, and reuse that utility in DependenceInfo::tryDelinearizeFixedSize(), thereby simplifying the latter. That refactoring would take some amount of work, so I would first like to ask for comments and feedback on the overall functionality of the current patch. If the current patch looks okay to everyone, I can move on with the aforementioned refactoring if needed.

Another note is that this patch does not perform range checks after delinearization (DependenceAnalysis performs them unless the --da-disable-delinearization-checks flag is set). Since the delinearization of parametric-size arrays in LCC never performed this check in the first place, for now I have omitted it for the delinearization of fixed-size arrays as well.

Diff Detail

Event Timeline

congzhe created this revision.Mar 31 2022, 7:01 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 31 2022, 7:01 PM
congzhe requested review of this revision.Mar 31 2022, 7:01 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 31 2022, 7:01 PM
congzhe updated this revision to Diff 419605.Mar 31 2022, 7:29 PM
congzhe edited the summary of this revision. (Show Details)
congzhe edited the summary of this revision. (Show Details)Mar 31 2022, 7:32 PM
congzhe added reviewers: bmahjour, Whitney, Restricted Project.
congzhe added a project: Restricted Project.
congzhe edited the summary of this revision. (Show Details)Mar 31 2022, 7:39 PM
congzhe edited the summary of this revision. (Show Details)
Meinersbur added inline comments.
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
146–147

In principle, each dimension can be dynamic or fixed size separately. For instance using C99 VLAs:

void func(int n, double A[][128][n]);

has 3 dimensions of which only the middle one is fixed size. However, DependenceInfo may not support them either?

llvm/lib/Analysis/LoopCacheAnalysis.cpp
345–346

Did you consider one of stripPointerCasts, stripPointerCastsAndAliases, stripPointerCastsSameRepresentation, stripPointerCastsForAliasAnalysis?

Meinersbur retitled this revision from [LoopCacheAnlysis] Enable delinearization of fixed sized arrays to [LoopCacheAnalysis] Enable delinearization of fixed sized arrays.Apr 5 2022, 11:14 AM
nikic added a subscriber: nikic.Apr 5 2022, 11:40 AM
nikic added inline comments.
llvm/lib/Analysis/LoopCacheAnalysis.cpp
335

So, I guess this is something of a pre-existing issue, but this API looks completely broken to me. Deriving information from a GEP source element type is blatantly illegal under IR semantics -- in particular, the array bounds specified therein may be completely arbitrary.

The correct way to derive such information is from SCEV addrec expressions. For example in your first test you have %arrayidx10 with SCEV {{(8 + %a)<nuw>,+,8192}<nuw><%for.body>,+,4}<nuw><%for.body4>, and the steps of those addrecs are semantically meaningful.
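
To illustrate, the steps of those addrecs already encode the geometry: the inner step 4 is the element size and the outer step 8192 = 2048 * 4 is the row size in bytes, so a loop nest roughly like the following (an assumed shape, not necessarily the actual test case) reproduces that expression:

void f(int a[][2048], long n, long m) {
  for (long i = 0; i < n; ++i)   /* outer addrec step: 2048 * 4 = 8192 bytes */
    for (long j = 0; j < m; ++j) /* inner addrec step: 4 bytes */
      a[i][j + 2] = 0;           /* the +2 elements account for the (8 + %a) start */
}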

Meinersbur added inline comments.Apr 5 2022, 4:55 PM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
335

It should be only a heuristic, exploiting the fact that these indices are usually something sensible from the source code. In the case of DependenceInfo, there is an additional check on whether the derived indices are actually legal (unless --da-disable-delinearization-checks=true). LoopCacheAnalysis is entirely a heuristic analysis, deriving a heuristic cost function without any claim about semantics.

The SCEV addrec approach is already used by LoopCacheAnalysis in the form of llvm::delinearize.

nikic added inline comments.Apr 6 2022, 1:10 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
335

Okay, this is slightly less bad than I thought then -- but GEP type structure still should not be used even for heuristic purposes.

If the SCEV addrec approach is already used in llvm::delinearize(), then why is this additional code needed? The addrecs involved in the motivational test cases contain all the necessary information, so there should be no need to look at GEP structure.

congzhe added inline comments.Apr 6 2022, 7:42 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
335

Thanks for the comment! The SCEV addrec approach that is currently used in llvm::delinearize() cannot handle fixed-size arrays; it can only handle parametric-size arrays. That is why we treat the delinearization of fixed-size and parametric-size arrays separately in DependenceAnalysis, and I followed the same approach here in LoopCacheAnalysis.

congzhe added inline comments.Apr 6 2022, 11:33 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
335

Could you please elaborate a bit more on the reason that "GEP type structure should still not be used for heuristic purposes"? Is it because in the longer term we envision GEP instructions will not have type information anymore?

nikic added inline comments.Apr 6 2022, 12:28 PM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
335

Could you please elaborate a bit more on the reason that "GEP type structure should still not be used for heuristic purposes"? Is it because in the longer term we envision GEP instructions will not have type information anymore?

That's part of it, but also more generally that you're relying on the frontend encoding things in a specific way here. Maybe the array is actually wrapped inside a struct. Maybe the array decays into a slice ([0 x ...] style) or a pointer. Maybe the user is an old-school C programmer and uses ary[y * SIZE + x] instead of ary[y][x]. Matching GEP structure locks you into matching a very specific code pattern, and does not support code that is semantically equivalent, but represented slightly differently in IR.
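
For example, these two accesses (a small sketch; SIZE and the function names are chosen just for illustration) are semantically equivalent but produce different GEP structure in the IR:

#define SIZE 128
void f(double ary[][SIZE], int y, int x) { ary[y][x] = 0.0; }  /* multi-dimensional GEP */
void g(double *ary, int y, int x) { ary[y * SIZE + x] = 0.0; } /* manually linearized index */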

You mention that the SCEV-based approach cannot handle fixed-sized arrays. Could you please explain in more detail why? It's not really obvious to me.

congzhe added inline comments.Apr 7 2022, 2:37 PM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
335

Thanks for your explanation on GEP type structures.

Currently the SCEV-based approach is more of a symbolic technique: if it does not find any parameters (symbols) in the SCEV expression, it just bails out. I'm looking into whether we could incorporate fixed-size array delinearization into the SCEV-based approach and trying to prototype a solution if possible. I will get back to this thread once I have some results.

That's part of it, but also more generally that you're relying on the frontend encoding things in a specific way here. Maybe the array is actually wrapped inside a struct. Maybe the array decays into a slice ([0 x ...] style) or a pointer. Maybe the user is an old-school C programmer and uses ary[y * SIZE + x] instead of ary[y][x]. Matching GEP structure locks you into matching a very specific code pattern, and does not support code that is semantically equivalent, but represented slightly differently in IR.

There are, and always will be, different optimizations depending on how the code is written, even when the variants are semantically equivalent. While I agree that it is nicer to canonicalize such differences away, it's an unrealistic goal to expect that everywhere.

I assume you are also referring to "Future: Type-less GEP" from https://llvm.org/devmtg/2022-04-03/slides/keynote.Opaque.Pointers.Are.Coming.pdf. I think there are problems with that (e.g., no inrange and inbounds qualifiers), but even if it happens, we could just as well add metadata to the memory accesses giving the optimizer hints about what the dimensions are.

You mention that the SCEV-based approach cannot handle fixed-sized arrays. Could you please explain in more detail why? It's not really obvious to me.

The current delinearizer assumes that the size of a dimension is represented by a SCEVUnknown, such as %n. When that's the case, the symbol will occur in every access function as long as the subscript is not zero (but it breaks when the actual size is an expression such as n+2). The stride depends on what the index expression is; e.g., A[2*i][j] has a stride twice as large as the "real" size of an inner subarray. What's worse is that different memory accesses may result in different interpretations of the array sizes for each access, which is hard to cope with. Symbolic sizes are comparatively stable.

At least, this is my assumption as to why the current algorithm is designed the way it is, and there might be a lot of room for improvement.
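
As a concrete sketch of the symbolic case described above (the function and variable names are illustrative): for a parametric array the size shows up as a SCEVUnknown in every access function, which is what the current delinearizer keys on:

void f(double *A, long n, long rows, long cols) {
  for (long i = 0; i < rows; ++i)
    for (long j = 0; j < cols; ++j)
      A[i * n + j] = 0.0; /* address = A + 8*(i*n + j); the symbol n can be factored out
                             to recover the sizes [*][n] and the subscripts [i][j] */
}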

bmahjour added a comment.EditedApr 11 2022, 1:23 PM

You mention that the SCEV-based approach cannot handle fixed-sized arrays. Could you please explain in more detail why? It's not really obvious to me.

To add to the above explanations, the algorithm for recovering subscripts (delinearization) is described in this paper, and the implementation in LLVM is based on it. In section 3.1 they describe a generic example where they try to recover the subscripts A[i+o0][j+o1][k+o2] for an access function that has been linearized in the IR. The original (linearized) polynomial in that example is (n2*(n1*o0 + o1) + o2) + n1*n2*i + n2*j + k. They apply a heuristic to recover the size of each dimension of the array (in this case A[][n1][n2]). They then take the polynomial and divide it by the size of the innermost dimension, n2. The remainder of that division (i.e., o2 + k) forms the subscript for the innermost dimension, and the quotient (i.e., n1*o0 + o1 + n1*i + j) is further divided by n1 to recover o1 + j for the middle dimension, and so on.
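
Spelling out those divisions with the same notation:

  P = n1*n2*i + n2*j + k + n2*(n1*o0 + o1) + o2
  P / n2        -> remainder: o2 + k  (innermost subscript), quotient: n1*i + j + n1*o0 + o1
  quotient / n1 -> remainder: o1 + j  (middle subscript),    quotient: o0 + i  (outermost subscript)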

If the sizes of each dimension of the array were compile-time constants, and some of the offsets (i.e., o0, o1, or o2) were also constant, those constants would get folded together in a way that makes them indistinguishable from each other, and the algorithm wouldn't be able to correctly recover the original subscripts. (E.g., if n1=10, n2=10, o0=1, o1=2, o2=3, then the original linearized expression would be 123 + 100*i + 10*j + k. Even if the algorithm knew the sizes of each dimension of the array (i.e., 10x10), dividing this expression by 10 would yield a remainder of 123 + k, which would not be a valid subscript for the inner dimension!)

Hopefully this example clarifies why the same algorithm cannot be used to delinearize subscripts in a fixed size array.

@nikic it may be a good idea to have a discussion about this (impact of opaque pointers on delinearization) at one of the upcoming LoopOptWG calls. If you care to join us, please let me know or add an agenda item to https://docs.google.com/document/d/1sdzoyB11s0ccTZ3fobqctDpgJmRoFcz0sviKxqczs4g/edit so we can discuss this in more depth. Thanks!

bmahjour added inline comments.Apr 11 2022, 2:07 PM
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
146–147

The existing delinearization algorithms only work if all sizes are parametric or if all are constant. A in the example above will be a plain double* pointer, so getIndexExpressionsFromGEP() won't be able to recover much from it.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
355

can use a range-based loop here

358

[nit] this assert should be done before the copy into Sizes.

392

Why do we need to store IsFixedSize as a data member?

congzhe updated this revision to Diff 422727.Apr 13 2022, 8:07 PM

Addressed comments from Michael @Meinersbur and Bardia @bmahjour.

If the current patch looks okay, the next step would be to do the refactoring, i.e., move IndexedReference::tryDelinearizeFixedSize(ScalarEvolution *SE, Instruction *Src, const SCEV *SrcAccessFn, SmallVectorImpl<const SCEV *> &SrcSubscripts) to DependenceInfo, and reuse it in both LoopCacheAnalysis.cpp and DependenceAnalysis.cpp.

Currently tryDelinearizeFixedSize() in DependenceAnalysis.cpp is used as tryDelinearizeFixedSize(Src, Dst, SrcAccessFn, DstAccessFn, SrcSubscripts, DstSubscripts). The refactoring would replace that with tryDelinearizeFixedSize(Src, SrcAccessFn, SrcSubscripts) and tryDelinearizeFixedSize(Dst, DstAccessFn, DstSubscripts).
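
A rough sketch of the shared declaration that refactoring implies (not a final interface, just the shape suggested above):

/* Hypothetical shared helper, called once per memory access: once for Src and once for
   Dst in DependenceAnalysis, and once per reference in LoopCacheAnalysis. */
bool tryDelinearizeFixedSize(ScalarEvolution *SE, Instruction *Inst,
                             const SCEV *AccessFn,
                             SmallVectorImpl<const SCEV *> &Subscripts);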

I am leaning towards doing the refactoring work in a separate patch after the current patch is landed, which may make the process cleaner and more trackable. Nevertheless, I am open to other approaches as well (such as putting the refactoring work in this patch too).

congzhe added inline comments.Apr 13 2022, 8:11 PM
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
146–147

Thanks for the explanation, that is exactly what would happen.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
355

Please correct me if I'm wrong -- here we need the index (i) to index into both Subscripts and SrcSizes. If we used a range-based loop like for (const SCEV *S : Subscripts), we would not be able to get the elements from SrcSizes?
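
For what it's worth, a sketch of how the two containers could be walked in lock-step without an explicit index, using llvm::zip from llvm/ADT/STLExtras.h (assuming Subscripts and SrcSizes always have matching lengths):

for (auto [Subscript, Size] : llvm::zip(Subscripts, SrcSizes)) {
  // use Subscript and Size together; no index variable needed
}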

358

Thanks for the comment, I've placed the assert before the copy into Sizes.

392

Thanks, you are right that it does not necessarily need to be a class member. I've updated it to be a local variable.

congzhe added inline comments.Apr 13 2022, 8:15 PM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
345–346

Thanks for the comment! Following your suggestion, the corresponding change in DependenceAnalysis.cpp was updated to use stripPointerCasts() in D123559, which has now landed, and I've made the same update here.

@congzhe Could you write here what your plans are? Continue with this patch? Improve the SCEV-based algorithm? Introduce array size hint metadata?

I think I will look into how I can improve the SCEV-based algorithm to incorporate fixed-size array delinearization.

I wonder if you could clarify what you meant by "continue with this patch"? I would certainly like to continue with the current patch if possible, but due to opaque pointers it seems that the current approach in this patch become fragile and not viable?

I would not block this based only on a possible future development, taking into account that there are already uses of getIndexExpressionsFromGEP (e.g. by DA) and there is a possible workaround.

Thanks Michael, that does sound good to me. I've already addressed the comments in this patch in its current form. Moving forward I guess we could do one of the following:

  1. go ahead trying to merge it and do the refactoring (as mentioned in https://reviews.llvm.org/D122857#3450657) in the next patch, or
  2. modify this patch to do refactoring in the current patch.

I am leaning towards (1) since it might be easier to track the changes, as compared to doing a whole lot of work in one patch. But I don't have a strong preference and I'll be glad to take either option (and any other possible option).

Meinersbur accepted this revision.Apr 25 2022, 10:18 AM

I am OK with doing the refactoring later.

This revision is now accepted and ready to land.Apr 25 2022, 10:18 AM
This revision was landed with ongoing or failed builds.Apr 29 2022, 1:19 PM
This revision was automatically updated to reflect the committed changes.

Landed this patch, will work on the refactoring in the next patch.

The test case added here appears to sometimes fail due to output ordering differences. E.g. https://lab.llvm.org/buildbot/#/builders/16/builds/28225

That failure is blamed on my revert commit -- but in my local build, the test isn't failing. Perhaps there's some nondeterminism in processing order? I might switch these to CHECK-DAG, though I don't know if that's just covering up some other underlying bug...

congzhe added a comment.EditedApr 29 2022, 3:13 PM

The test case added here appears to sometimes fail due to output ordering differences. E.g. https://lab.llvm.org/buildbot/#/builders/16/builds/28225

That failure is blamed on my revert commit -- but in my local build, the test isn't failing. Perhaps there's some nondeterminism in processing order? I might switch these to CHECK-DAG, though I don't know if that's just covering up some other underlying bug...

Thanks James, I've noticed that as well -- in my local build it does pass, but I should have used CHECK-DAG. Let me commit a small patch that does this.

Fixed in https://github.com/llvm/llvm-project/commit/97b8a54b25f326cac8324e0ee3adde271799951c

Is it possible to make the print order deterministic instead?

fhahn added a subscriber: fhahn.Apr 30 2022, 4:03 AM

CHECK-DAG works around the non-determinism. The analysis should probably be fixed to print results in a deterministic fashion.

@nikic @fhahn Thanks for the comments. I posted a patch, D124725, to fix the nondeterministic ordering; I'd appreciate your comments.