
Loop Cache Analysis
AcceptedPublic

Authored by etiotto on Jun 17 2019, 2:41 PM.

Details

Summary

Implement a new analysis to estimate the number of cache lines required by a loop nest.
The analysis is largely based on the following paper:

Compiler Optimizations for Improving Data Locality
By: Steve Carr, Katherine S. McKinley, Chau-Wen Tseng
http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf

The analysis considers temporal reuse (accesses to the same memory location) and spatial reuse (accesses to memory locations within a cache line). For simplicity, the analysis considers memory accesses in the innermost loop of a loop nest, and thus determines the number of cache lines used when a loop L in loop nest LN is placed in the innermost position.
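The two reuse kinds can be illustrated with a back-of-the-envelope model (a hypothetical helper for illustration only, not the patch's API; the 64-byte line size is an assumption):

```cpp
// Sketch: estimated cache lines touched by one reference in the innermost
// loop, as a function of its per-iteration stride in bytes.
static long cacheLinesUsed(long TripCount, long StrideBytes,
                           long LineSize = 64) {
  if (StrideBytes == 0) // temporal reuse: same location every iteration
    return 1;
  if (StrideBytes >= LineSize) // no reuse: a new line every iteration
    return TripCount;
  // spatial reuse: LineSize / StrideBytes consecutive iterations share a line
  return (TripCount * StrideBytes + LineSize - 1) / LineSize;
}
```

For a 1024-iteration loop over 8-byte elements, a loop-invariant reference costs ~1 line, a stride-1 reference costs 128 lines, and a reference striding a full line per iteration costs 1024 lines.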

The result of the analysis can be used to drive several transformations. As an example, loop interchange could use it to determine which loops in a perfect loop nest should be interchanged to maximize cache reuse. Similarly, loop distribution could be enhanced to take into consideration cache reuse between arrays when distributing a loop to eliminate vectorization-inhibiting dependencies.

The general approach taken to estimate the number of cache lines used by the memory references in the inner loop of a loop nest is:

  • Partition memory references that exhibit temporal or spatial reuse into reference groups.
  • For each loop L in the loop nest LN:
      a. Compute the cost of each reference group.
      b. Compute the 'cache cost' of the loop nest by summing up the reference group costs.

For further details of the algorithm please refer to the paper.
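The second step above can be sketched as follows (illustrative names, not the patch's types; the partitioning of step one is assumed to have already produced per-group costs):

```cpp
#include <numeric>
#include <vector>

// Each reference group carries the number of cache lines it needs when the
// loop under consideration is placed innermost.
struct RefGroupCost {
  long Lines; // cache lines used by this reference group
};

// The loop's cache cost is the sum of its reference-group costs.
static long computeLoopCacheCost(const std::vector<RefGroupCost> &Groups) {
  return std::accumulate(
      Groups.begin(), Groups.end(), 0L,
      [](long Acc, const RefGroupCost &G) { return Acc + G.Lines; });
}
```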

Diff Detail

Repository
rL LLVM

Event Timeline

Meinersbur added inline comments.Jun 17 2019, 4:51 PM
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
175

[suggestion] Make these magic numbers cl::opt options?

llvm/lib/Analysis/LoopCacheAnalysis.cpp
333

[style] Did you mean break/return here? The next iteration might add elements to it. I'd structure this as

if (!isOneDimensionalArray(AccessFn, L)) {
  ...
  break;
}

...

conforming to https://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to-simplify-code.

387

Why not use getBackedgeTakenCount()?

524–532

Since this only has code that executes in assert-builds, it should be guarded entirely by an LLVM_DEBUG.

591–592

If the intention is to provide analysis for innermost loops only, why this limitation? Could it just return an analysis result for each innermost loop?

If the analysis requires a global view to determine the cost for each loop, wouldn't a FunctionPass be more appropriate? Currently, it seems users first need to get the LoopCacheAnalysis for a topmost loop, then ask it for one of its nested loops.

Are such loop nests not analyzable at all?

while (repeat) {
  for (int i = ...)
    for (int j = ...)
      B[i][j] = ... A[i+1][j+1] ... // stencil
  for (int i = ...)
    for (int j = ...)
      A[i][j] = ... B[i+1][j+1] ... // stencil
}
llvm/test/Analysis/LoopCacheAnalysis/a.ll
1 ↗(On Diff #205185)

[serious] Can you give more descriptive file names than a.ll and b.ll etc.?

4 ↗(On Diff #205185)

[suggestion] Is it possible to leave the triple unspecified in the test case?

etiotto marked 16 inline comments as done.Jun 18 2019, 2:20 PM
etiotto added inline comments.
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
31–32

OK I'll use the Ty suffix.

175

Added cl::opt for the default loop trip count and the temporal reuse threshold (see the .cpp file)

llvm/lib/Analysis/LoopCacheAnalysis.cpp
64–65

This function attempts to retrieve the innermost loop in the given loop vector. It returns a nullptr if any loop in the supplied loop vector has more than one sibling. The loop vector is expected to contain loops collected in breadth-first order. I'll improve the comment.
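The behavior described above can be sketched like this (the structure and names here are assumptions for illustration, not the patch's code):

```cpp
#include <vector>

// Minimal stand-in for a loop-nest node: each loop knows its sub-loops.
struct LoopNode {
  std::vector<LoopNode *> SubLoops;
};

// Loops arrive in breadth-first order; reject the nest as soon as any loop
// has more than one sub-loop (multiple siblings), otherwise the last loop
// visited is the innermost one.
static LoopNode *getInnerMostLoop(const std::vector<LoopNode *> &BFSLoops) {
  LoopNode *InnerMost = nullptr;
  for (LoopNode *L : BFSLoops) {
    if (L->SubLoops.size() > 1)
      return nullptr; // not a single-innermost-loop nest
    InnerMost = L;    // in BFS order the last loop is the deepest
  }
  return InnerMost;
}
```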

387

I'll do that.

591–592

The current scope of this PR is to analyze loop nests that have a single innermost loop. The analysis returns a vector of loop costs, one for each loop in the nest. This is not a hard requirement; however, I would like to extend the scope in a future PR. One of the scenarios I had in mind, at least initially, as a consumer of this analysis is loop interchange, which operates on perfect nests, so the current implementation of the analysis is sufficient for that use.

If we were to make the analysis a function pass, that would force consumers to become function passes (or use the cached version of this analysis). That seems overly restrictive, especially considering that loop interchange is currently a loop pass.

llvm/test/Analysis/LoopCacheAnalysis/a.ll
1 ↗(On Diff #205185)

Ooops, yes I'll rename those files.

4 ↗(On Diff #205185)

Unfortunately it is not easily removed, because the target triple is necessary to determine the target architecture's cache line size, which is used in the analysis.

etiotto updated this revision to Diff 205434.Jun 18 2019, 2:23 PM
etiotto marked 4 inline comments as done.

Addressed review comment from M. Kruse.

xbolva00 added inline comments.
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
221

llvm::sort

llvm/lib/Analysis/LoopCacheAnalysis.cpp
366

llvm::all_of

fhahn added a subscriber: fhahn.Jun 18 2019, 2:54 PM
fhahn added inline comments.
llvm/lib/Analysis/LoopCacheAnalysis.cpp
591–592

I'll try to have a closer look in a few days, but I think for the use in LoopInterchange, it would not need to be an analysis pass (I suppose there won't be a big benefit to preserve this as an analysis?).

I think a lightweight interface to query the cost for certain valid permutations would be sufficient. I think it would be great if we only compute the information when directly required (i.e. we only need to compute the costs for loops we can interchange, for the permutations valid to interchange)

How would a pass use this analysis? It computes a cost for the current IR, but there is nothing to compare it to unless the transformation pass emits the transformed loop nest next to the original one so that the LoopCacheAnalysis can compute its cost.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
565

Would it be useful to make CacheCostTy an int64_t? At least with UBSan we could diagnose an overflow.

etiotto marked an inline comment as done.Jun 18 2019, 4:10 PM
etiotto added inline comments.
llvm/lib/Analysis/LoopCacheAnalysis.cpp
591–592

Looking forward to your comments. You are correct, we could teach loop interchange about the locality of accesses in a loop nest. There are several other loop transformations that can benefit from it. The paper that introduces the analysis (Compiler Optimizations for Improving Data Locality - http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf) illustrates how the analysis was used to guide several other loop transformations, namely loop reversal, loop fusion, and loop distribution. Given the applicability of the analysis to several transformations, I think it would make sense to centralize the code as an analysis pass.

etiotto updated this revision to Diff 205570.Jun 19 2019, 6:37 AM
etiotto marked 3 inline comments as done.
etiotto marked 2 inline comments as done.Jun 19 2019, 6:42 AM
etiotto added inline comments.
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
221

Yup agree, thanks for pointing this out.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
565

Ok.

> How would a pass use this analysis? It computes a cost for the current IR, but there is nothing to compare it to unless the transformation pass emits the transformed loop nest next to the original one so that the LoopCacheAnalysis can compute its cost.

The result of the analysis will be a vector of costs, one for each loop in a loop nest. The cost associated with a loop estimates the number of cache lines used if that loop were placed in the innermost position in the nest. Therefore, sorting the cost vector in descending order corresponds to minimizing the number of data cache misses in the nest. For example, loop distribution can use the sorted vector and work out a set of permutation moves (assuming legality constraints are satisfied) to maximize cache locality.
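A consumer's use of the sorted cost vector might look like this sketch (the pair-of-name-and-cost representation is an assumption made for illustration; the real analysis pairs Loop pointers with costs):

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Each entry pairs a loop (named here by a string) with the number of cache
// lines it would use in the innermost position.
using LoopCost = std::pair<std::string, long>;

// Sorting by descending cost yields the preferred outermost-to-innermost
// ordering: the costliest loop goes outermost, the cheapest innermost.
static std::vector<LoopCost> preferredOrder(std::vector<LoopCost> Costs) {
  std::stable_sort(Costs.begin(), Costs.end(),
                   [](const LoopCost &A, const LoopCost &B) {
                     return A.second > B.second;
                   });
  return Costs;
}
```

A transformation would then walk this ordering and apply only the permutation moves that its legality checks allow.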

etiotto marked 3 inline comments as done.Jun 21 2019, 12:12 PM
xusx595 added a subscriber: xusx595.Tue, Jul 2, 1:48 AM

@fhahn Do you still want to look over this patch?

llvm/include/llvm/Analysis/LoopCacheAnalysis.h
55

Should return size_t

159

It does not make sense to have owning pointers to SmallVectors. This just adds another level of indirection compared to containing a std::vector (or a SmallVector, to have a single allocation for all elements if none exceeds the small size) directly.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
52

Instead of the const_cast, why not remove the const from the parameter list?

56

WL is used as a stack, a (Small)Vector could do this as well.

103

Don’t “almost always” use auto

for (const SCEV *Subscript : R.subscripts)

You seem to have applied the coding standard for auto everywhere except in foreach-loops.

117

No need for Subscripts(), Sizes() in the member initializer list.

Prefer default member initializers at the declaration over IsValid(false), BasePointer(nullptr).

121

[subjective] I'd prefer a function that returns either a nullptr or an ErrorOr<IndexedReference> over an IsValid flag for every object.

146

Nice, this is where the coding standard recommends auto.

173–176

Some compilers may complain about empty statements in non-debug builds. Alternative:

LLVM_DEBUG({
  if (InSameCacheLine)
    dbgs().indent(2) << "Found spacial reuse.\n";
  else
    dbgs().indent(2) << "No spacial reuse.\n";
});
208

Prefer int over unsigned

499–500

The LLVM code base does not use const for stack variables.

538

Do you need wrapping behavior? int instead of unsigned.

591–592

> If we were to make the analysis a function pass, that would force consumers to become function passes (or use the cached version of this analysis). That seems overly restrictive, especially considering that loop interchange is currently a loop pass.

I don't think the new pass manager has this restriction. Passes can ask for analyses of any level using OuterAnalysisManagerProxy. The new pass manager caches everything.

etiotto marked 17 inline comments as done.Tue, Jul 9, 8:05 AM
etiotto added inline comments.
llvm/lib/Analysis/LoopCacheAnalysis.cpp
52

I dropped the const qualifier in the parameter declaration and adjusted the code accordingly.

103

OK I'll fix the use of auto in foreach-loops.

146

Thanks.

173–176

Agreed. It is cleaner the way you propose, as there is only one LLVM_DEBUG.

499–500

const-qualifying local variables prevents unintended modifications later in the function... but I do not feel strongly about it. I'll change it.

591–592

Yes, it is also my understanding (from the comments in PassManager.h) that the new pass manager allows an "inner" pass (e.g. a loop pass) to ask for an "outer" analysis (e.g. a function analysis). However, at least from the comments in PassManager.h, the inner pass cannot cause the outer analysis to run and can only rely on the cached version, which may give back a nullptr.

etiotto updated this revision to Diff 208684.Tue, Jul 9, 8:06 AM
etiotto marked 5 inline comments as done.

Addressed Michael Kruse comments.

fhahn added inline comments.Tue, Jul 9, 8:21 AM
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
220

Do you anticipate most users relying on the loops being ordered by cost? If not, it might be worth switching to a map to avoid using find_if in multiple places (even though not too common in user-written C/C++ code, I've encountered various (auto-generated) sources which have very deep loop nesting levels).

LoopInterchange, for example, would be interested in comparing just the total cache cost of various re-orderings of a loop nest.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
591–592

> Given the applicability of the analysis to several transformations, I think it would make sense to centralize the code as an analysis pass.

I agree it is a useful utility to have, I am just wondering what the benefit of exposing it as an analysis pass would be, as it is unlikely that the result would be used for most loops or could be cached between interested transforms frequently.

IMO it would be fine as a utility class/function, i.e. just provide a getLoopCacheCost() function that takes a root loop and maybe a potential re-ordering of loops, which the interested transforms can use only on the loops that they can transform. I think that would reduce the size of the patch a bit and focus on the important bits.

Meinersbur added inline comments.Tue, Jul 9, 11:23 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
591–592

I convinced myself that indeed a FunctionPass might not be that nice, because any LoopPass would need to preserve it. As @fhahn points out, wanting to preserve/reuse the analysis might be rare.

etiotto marked 6 inline comments as done.Tue, Jul 9, 4:14 PM
etiotto added inline comments.
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
220

I was envisioning Loop Interchange wanting to get a sorted vector of cache costs for a nest and then using it to determine the optimal reordering of the loops in the nest (the loop with the smallest cost in the innermost position, the one with the largest cost in the outermost position). Having the cache cost vector sorted is also handy when printing it.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
48

Will remove this static function and instead use the breadth_first(&Root) iterator from ADT/BreadthFirstIterator.h

591–592

I was thinking that having it as an analysis pass would allow loop transformations that do not modify the memory references in a loop nest to preserve the analysis...
I am OK with just adding a member function to the CacheCost class to construct and return the cache cost for a loop nest rooted at a given loop.
I will upstream a patch to make that change and, for the time being, avoid making this an analysis pass.

591–592

I will upstream a patch, as suggested by @fhahn, to just provide a member function in the CacheCost class that constructs and returns the cache cost for a loop nest rooted at a given loop.

etiotto updated this revision to Diff 208836.Tue, Jul 9, 4:18 PM
etiotto marked 2 inline comments as done.

Addressed suggestions from @fhahn and dropped the analysis pass. Instead I provided a static member function in the CacheCost class to compute the cache cost of a nest rooted at a given loop.

etiotto added a comment.EditedTue, Jul 9, 4:23 PM

Note: Because I removed the getLoops static function in the last patch, older comments in LoopCacheAnalysis.cpp are unfortunately no longer attached to the correct lines :-(

@fhahn @Meinersbur do you have further comments? If not can this be approved?

Meinersbur accepted this revision.Wed, Jul 17, 1:21 PM

LGTM, some nitpicks left.

llvm/include/llvm/Analysis/LoopCacheAnalysis.h
181

CacheCostTy is int64_t; did you mean to assign -1?

200

To not leak the internal list implementation, return ArrayRef<LoopCacheCostTy>.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
167
This revision is now accepted and ready to land.Wed, Jul 17, 1:21 PM
fhahn added a comment.Wed, Jul 17, 1:43 PM

I would like to take another look, but do not want to block this for too long. Please feel free to commit if you don't hear from me by Monday.

llvm/test/Analysis/LoopCacheAnalysis/loads-store.ll
4

We need to make sure that the PPC backend was built when running those tests. Usually we put target-specific tests in target subdirectories, with a local lit.local.cfg checking for the backend (e.g. see llvm/test/Transforms/LoopVectorize/PowerPC/lit.local.cfg).

greened added inline comments.Thu, Jul 18, 10:28 AM
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
76

Referencing a cl::opt defined in the .cpp? It's a little confusing.

107

Why is this part of IndexedReference?

llvm/lib/Analysis/LoopCacheAnalysis.cpp
44

What units is this in? What does "2" mean?

189

This seems a bit dangerous to me. Depending on the client, we might want to assume reuse if we don't know the distance. Could this function return a tribool of (yes/no/unknown)?
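The suggested tri-state result could be sketched as follows (the enum and function names are assumptions for illustration, not part of the patch):

```cpp
// Instead of collapsing "distance unknown" into "no reuse", report it
// explicitly and let each client pick an optimistic or pessimistic default.
enum class ReuseResult { Reuse, NoReuse, Unknown };

static bool assumeReuse(ReuseResult R, bool OptimisticClient) {
  switch (R) {
  case ReuseResult::Reuse:
    return true;
  case ReuseResult::NoReuse:
    return false;
  case ReuseResult::Unknown:
    return OptimisticClient; // client-chosen default when distance is unknown
  }
  return false;
}
```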

greened added inline comments.Thu, Jul 18, 5:53 PM
llvm/test/Analysis/LoopCacheAnalysis/loads-store.ll
1

I'd like to see block comments at the top of all of these tests explaining what they are testing. It will be much easier to understand what's going on when these tests fail.

fhahn added inline comments.Mon, Jul 22, 6:43 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
256

RefCost is guaranteed to be a SCEVConstant here, so it would be better to use cast<> instead. Or even better

if (const auto *ConstantCost = dyn_cast<SCEVConstant>(RefCost))
  return ConstantCost->getValue()->getSExtValue();

LLVM_DEBUG(dbgs().indent(4)
             << "RefCost is not a constant! Setting to RefCost=InvalidCost "
                "(invalid value).\n");
return CacheCost::InvalidCost;
267

It looks like this function does not capture much and might be better as a separate member function?

274

nit: Capitalize start of sentence.

280

nit: Capitalize start of sentence.

456

\n at the end?

471

This should be passed in I think and the users should request it via the pass manager.