Loop Cache Analysis
ClosedPublic

Authored by etiotto on Jun 17 2019, 2:41 PM.

Details

Summary

Implement a new analysis to estimate the number of cache lines required by a loop nest.
The analysis is largely based on the following paper:

Compiler Optimizations for Improving Data Locality
By: Steve Carr, Katherine S. McKinley, Chau-Wen Tseng
http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf

The analysis considers temporal reuse (accesses to the same memory location) and spatial reuse (accesses to memory locations within the same cache line). For simplicity, the analysis considers memory accesses in the innermost loop of a loop nest, and thus determines the number of cache lines used when the loop L in loop nest LN is placed in the innermost position.

The result of the analysis can be used to drive several transformations. As an example, loop interchange could use it to determine which loops in a perfect loop nest should be interchanged to maximize cache reuse. Similarly, loop distribution could be enhanced to take into consideration cache reuse between arrays when distributing a loop to eliminate vectorization-inhibiting dependencies.

The general approach taken to estimate the number of cache lines used by the memory references in the inner loop of a loop nest is:

  • Partition memory references that exhibit temporal or spatial reuse into reference groups.
  • For each loop L in the loop nest LN:
      a. Compute the cost of each reference group.
      b. Compute the 'cache cost' of the loop nest by summing up the reference group costs.

For further details of the algorithm please refer to the paper.

Diff Detail

Repository
rL LLVM

Event Timeline

Meinersbur added inline comments.Jul 8 2019, 5:07 PM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
121

[subjective] I'd prefer a function that returns either a nullptr/ErrorOr<IndexedReference> rather than an IsValid flag for every object.

146

Nice, this is where the coding standard recommends auto.

173–176

Some compilers may complain about empty statements in non-debug builds. Alternative:

LLVM_DEBUG({
  if (InSameCacheLine)
    dbgs().indent(2) << "Found spacial reuse.\n";
  else
    dbgs().indent(2) << "No spacial reuse.\n";
});
208

Prefer int over unsigned

538

Do you need wrapping behavior? int instead of unsigned.

etiotto marked 17 inline comments as done.Jul 9 2019, 8:05 AM
etiotto added inline comments.
llvm/lib/Analysis/LoopCacheAnalysis.cpp
52

I dropped the const qualifier in the parameter declaration and adjusted the code accordingly.

103

OK I'll fix the use of auto in foreach-loops.

146

Thanks.

173–176

Agreed. It is cleaner the way you propose, as there is only one LLVM_DEBUG.

499–500

const-qualifying local variables prevents unintended modifications later in the function... but I do not feel strongly about it. I'll change it.

591–592

Yes, it is also my understanding (from the comments in PassManager.h) that the new pass manager does allow an "inner" pass (e.g. a loop pass) to ask for an "outer" analysis (e.g. a function analysis). However, at least from the comments in PassManager.h, the inner pass cannot cause the outer analysis to run and can only rely on the cached version, which may give back a nullptr.

etiotto updated this revision to Diff 208684.Jul 9 2019, 8:06 AM
etiotto marked 5 inline comments as done.

Addressed Michael Kruse comments.

fhahn added inline comments.Jul 9 2019, 8:21 AM
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
220

Do you anticipate most users relying on the loops being ordered by cost? If not, it might be worth switching to a map to avoid using find_if in multiple places (even though not too common in user-written C/C++ code, I've encountered various (auto-generated) sources which have very deep loop nesting levels).

LoopInterchange, for example, would be interested in comparing just the total cache cost of various re-orderings of a loop nest.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
591–592

Given the applicability of the analysis to several transformations, I think it would make sense to centralize the code as an analysis pass.

I agree it is a useful utility to have, I am just wondering what the benefit of exposing it as an analysis pass would be, as it is unlikely that the result would be used for most loops or could be cached between interested transforms frequently.

IMO it would be fine as a utility class/function, i.e. just provide a getLoopCacheCost() function that takes a root loop and maybe a potential re-ordering of loops, which the interested transforms can use only on the loops that they can transform. I think that would reduce the size of the patch a bit and focus on the important bits.

Meinersbur added inline comments.Jul 9 2019, 11:23 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
591–592

I convinced myself that indeed a FunctionPass might not be that nice, because any LoopPass would need to preserve it. As @fhahn points out, wanting to preserve/reuse the analysis might be rare.

etiotto marked 6 inline comments as done.Jul 9 2019, 4:14 PM
etiotto added inline comments.
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
220

I was envisioning Loop Interchange wanting to get a sorted vector of cache costs for a nest and then using it to determine the optimal reordering of the loops in the nest (loop with the smallest cost in the innermost position, largest cost in the outermost position). Having the cache cost vector sorted is also handy when printing it.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
48

Will remove this static function and instead use the breadth_first(&Root) iterator from ADT/BreadthFirstIterator.h.

591–592

I was thinking that having it as an analysis pass would allow loop transformations that do not modify the memory references in a loop nest to preserve the analysis...
I am OK with just adding a member function to the CacheCost class to construct and return the cache cost for a loop nest rooted at a given loop.
I will upstream a patch to make that change and for the time being avoid making this an analysis pass.

591–592

I will upstream a patch as suggested by @fhahn to just provide a member function in the CacheCost class to construct and return the cache cost for a loop nest rooted at a given loop.

etiotto updated this revision to Diff 208836.Jul 9 2019, 4:18 PM
etiotto marked 2 inline comments as done.

Addressed suggestions from @fhahn and dropped the analysis pass. Instead I provided a static member function in the CacheCost class to compute the cache cost of a nest rooted at a given loop.

etiotto added a comment.EditedJul 9 2019, 4:23 PM

Note: Because I removed the getLoops static function in the last patch, older comments in LoopCacheAnalysis.cpp are unfortunately no longer attached to the correct lines :-(

@fhahn @Meinersbur do you have further comments? If not can this be approved?

Meinersbur accepted this revision.Jul 17 2019, 1:21 PM

LGTM, some nitpicks left.

llvm/include/llvm/Analysis/LoopCacheAnalysis.h
181

CacheCostTy is int64_t; did you mean to assign -1?

200

To avoid leaking the internal list implementation, return ArrayRef<LoopCacheCostTy>.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
167
This revision is now accepted and ready to land.Jul 17 2019, 1:21 PM
fhahn added a comment.Jul 17 2019, 1:43 PM

I would like to take another look, but do not want to block this for too long. Please feel free to commit if you don't hear from me by Monday.

llvm/test/Analysis/LoopCacheAnalysis/loads-store.ll
4

We need to make sure that the PPC backend was built when running those tests. Usually we put target-specific tests in target subdirectories, with a local lit.local.cfg checking for the backend (e.g. see llvm/test/Transforms/LoopVectorize/PowerPC/lit.local.cfg).

greened added inline comments.Jul 18 2019, 10:28 AM
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
76

Referencing a cl::opt defined in the .cpp? It's a little confusing.

107

Why is this part of IndexedReference?

llvm/lib/Analysis/LoopCacheAnalysis.cpp
44

What units is this in? What does "2" mean?

189

This seems a bit dangerous to me. Depending on the client, we might want to assume reuse if we don't know the distance. Could this function return a tribool of (yes/no/unknown)?

greened added inline comments.Jul 18 2019, 5:53 PM
llvm/test/Analysis/LoopCacheAnalysis/loads-store.ll
1

I'd like to see block comments at the top of all of these tests explaining what they are testing. It will be much easier to understand what's going on when these tests fail.

fhahn added inline comments.Jul 22 2019, 6:43 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
256

RefCost is guaranteed to be a SCEVConstant here, so it would be better to use cast<> instead. Or even better:

if (auto *ConstantCost = dyn_cast<SCEVConstant>(RefCost))
  return ConstantCost->getValue()->getSExtValue();

LLVM_DEBUG(dbgs().indent(4)
             << "RefCost is not a constant! Setting to RefCost=InvalidCost "
                "(invalid value).\n");
return CacheCost::InvalidCost;
267

It looks like this function does not capture much and might be better as a separate member function?

274

nit: Capitalize start of sentence.

280

nit: Capitalize start of sentence.

456

\n at the end?

471

This should be passed in I think and the users should request it via the pass manager.

etiotto updated this revision to Diff 211487.Jul 24 2019, 6:31 AM
etiotto marked 7 inline comments as done.

Addressing code review comments given by @Meinersbur and @fhahn.

etiotto marked 3 inline comments as done.Jul 24 2019, 6:41 AM

I would like to take another look, but do not want to block this for too long. Please feel free to commit if you don't hear from me by Monday.

Done.

etiotto updated this revision to Diff 211491.Jul 24 2019, 6:44 AM

Address remaining comments from @fhahn

etiotto marked 10 inline comments as done.Jul 24 2019, 9:39 AM
etiotto added inline comments.
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
76

I'll pass it as an argument.

107

I changed it into a static function in the .cpp file instead.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
44

I clarified the description.

267

OK I'll make it a private member function.

etiotto updated this revision to Diff 211532.Jul 24 2019, 9:41 AM
etiotto marked 4 inline comments as done.
greened added inline comments.Jul 24 2019, 10:13 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
44

I'm still a bit confused. Does this mean that a[i] and a[i+1] will be considered to have temporal reuse with a value of 2? Perhaps "temporal reuse" itself needs a definition. Are we talking about referencing the exact same memory location, the same cache line, the same page, ...? It seems odd to me that "temporal reuse" would mean anything other than accessing exactly the same memory location. Everything else I would consider to be "spatial reuse."

At the very least this deserves a longer comment block about what it means and its implications. Some clients want the definition of "temporal reuse" to be "access the exact same memory location", and this default value seems to mean something very different.

240

The debug message should reference MaxDistance and not hard-code the value 2.

etiotto marked 5 inline comments as done.Jul 24 2019, 2:42 PM
etiotto added inline comments.
llvm/lib/Analysis/LoopCacheAnalysis.cpp
44

That's correct: a[i] and a[i+1] are considered to have temporal reuse when the threshold is 2. A threshold of 1 would cause only references to the same memory location to have temporal reuse. The analysis attempts to implement the algorithm in the paper mentioned in the summary, and I agree that this is a bit confusing. I will add a comment to attempt to clarify this better.

189

If the dependence distance is unknown at compile time, the references are conservatively considered to have no spatial reuse, and consequently the analysis will overestimate the number of cache lines used by the loop (when it is in the innermost position in the nest).

For now I can return an Optional<bool> (I didn't find a tribool data type readily available in LLVM). This will make it easier to place references with unknown distance in the same reference group if we find a motivating test case that needs it.

etiotto updated this revision to Diff 211602.Jul 24 2019, 2:43 PM
etiotto marked 2 inline comments as done.
etiotto added a subscriber: reames.

Addressing review comment from @reames

I addressed all pending review comments, @fhahn @reames does it look ok to you guys now?

fhahn added a comment.Jul 30 2019, 2:58 AM

Can you re-run clang-format on the latest version of the patch? I think it would be good to get an in-tree user of this soon, to make sure the modeling works as expected on real hardware/benchmarks. Do you have a timeline to get this used? I think you mentioned LoopFusion as one of the first planned users?

llvm/include/llvm/Analysis/LoopCacheAnalysis.h
199

IIRC this returns the ordered loop costs, right? Please document.

llvm/test/Analysis/LoopCacheAnalysis/PowerPC/matvecmul.ll
17 ↗(On Diff #211602)

IIRC the costs should be ordered. If so, please use CHECK-NEXT for entries with differing costs (here and at other places).

etiotto updated this revision to Diff 212429.Jul 30 2019, 1:47 PM
etiotto marked 2 inline comments as done.

Run clang-format, address comments from @fhahn.

etiotto added a comment.EditedJul 30 2019, 1:51 PM

Can you re-run clang-format on the latest version of the patch? I think it would be good to get an in-tree user of this soon, to make sure the modeling works as expected on real hardware/benchmarks. Do you have a timeline to get this used? I think you mentioned LoopFusion as one of the first planned users?

I reformatted the patch and addressed inline comments. The paper reports that the analysis was used to guide both loop fusion and loop interchange. I used it myself profitably on loop interchange (not yet in tree), and the lit tests provided are based in part on that experiment.

If there are no further comments and/or concerns I'll commit the patch on Thursday.

fhahn added a comment.Jul 30 2019, 3:45 PM

Can you re-run clang-format on the latest version of the patch? I think it would be good to get an in-tree user of this soon, to make sure the modeling works as expected on real hardware/benchmarks. Do you have a timeline to get this used? I think you mentioned LoopFusion as one of the first planned users?

I reformatted the patch and addressed inline comments. The paper reports that the analysis was used to guide both loop fusion and loop interchange. I used it myself profitably on loop interchange (not yet in tree), and the lit tests provided are based in part on that experiment.

Right, are you planning on submitting a patch for loop interchange upstream?

I used the printer on some of the loop interchange tests, but the delinearization does not support some cases there yet. I hope I find some time to look into that next week and maybe also integrating it into loop interchange unless you plan to do so.

If there are no further comments and/or concerns I'll commit the patch on Thursday.

Right, are you planning on submitting a patch for loop interchange upstream?

I used the printer on some of the loop interchange tests, but the delinearization does not support some cases there yet. I hope I find some time to look into that next week and maybe also integrating it into loop interchange unless you plan to do so.

If you have time to integrate the analysis in loop interchange next week go ahead. I will need to work through other work in progress I am doing, and upstream those pieces before I can get to loop interchange. I think we can discuss details during the biweekly loop meetings.

greened added inline comments.Jul 31 2019, 12:49 PM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
44

So how will clients with different needs use this analysis? Some might want the definition in the paper but others might want to restrict it to exact memory locations only. A cl::opt doesn't allow configuration by clients.

etiotto updated this revision to Diff 213677.Aug 6 2019, 11:57 AM

Added a command line option to specify the threshold for temporal reuse.

etiotto marked 2 inline comments as done.Aug 6 2019, 11:59 AM
etiotto added inline comments.
llvm/lib/Analysis/LoopCacheAnalysis.cpp
44

I added an optional parameter to allow users to specify the temporal reuse threshold.

etiotto marked an inline comment as done.Aug 6 2019, 12:00 PM
etiotto updated this revision to Diff 214231.Aug 8 2019, 2:19 PM
Closed by commit rL368439: Loop Cache Analysis (authored by whitneyt). Aug 9 2019, 6:56 AM
This revision was automatically updated to reflect the committed changes.
etiotto updated this revision to Diff 214384.Aug 9 2019, 8:54 AM

/home/xbolva00/LLVM/llvm/lib/Analysis/LoopCacheAnalysis.cpp:110:14: warning: ‘llvm::raw_ostream& llvm::operator<<(llvm::raw_ostream&, const llvm::IndexedReference&)’ has not been declared within llvm
raw_ostream &llvm::operator<<(raw_ostream &OS, const IndexedReference &R) {
             ^~~~

In file included from /home/xbolva00/LLVM/llvm/lib/Analysis/LoopCacheAnalysis.cpp:28:0:
/home/xbolva00/LLVM/llvm/include/llvm/Analysis/LoopCacheAnalysis.h:45:23: note: only here as a friend
friend raw_ostream &operator<<(raw_ostream &OS, const IndexedReference &R);
                    ^~~~~~~~

/home/xbolva00/LLVM/llvm/lib/Analysis/LoopCacheAnalysis.cpp:443:14: warning: ‘llvm::raw_ostream& llvm::operator<<(llvm::raw_ostream&, const llvm::CacheCost&)’ has not been declared within llvm
raw_ostream &llvm::operator<<(raw_ostream &OS, const CacheCost &CC) {
             ^~~~

In file included from /home/xbolva00/LLVM/llvm/lib/Analysis/LoopCacheAnalysis.cpp:28:0:
/home/xbolva00/LLVM/llvm/include/llvm/Analysis/LoopCacheAnalysis.h:174:23: note: only here as a friend
friend raw_ostream &operator<<(raw_ostream &OS, const CacheCost &CC);
                    ^~~~~~~~

GCC 7, Linux.

etiotto updated this revision to Diff 214718.Aug 12 2019, 3:11 PM

Fix build warning for operator<< when using GCC 7.

I uploaded a fix for this warning. I use clang to build, sigh; apologies for the problem.

No problem, it was just a warning :)

fhahn added a comment.Aug 13 2019, 6:33 AM

I uploaded a fix for this warning. I use clang to build sigh, apologies for the problem.

If the patch has not been reverted, I think it would be better to create a new patch on Phabricator with the fix, so it is easier to see the change.

greened added inline comments.Aug 13 2019, 10:23 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
245

Can we add the value of MaxDistance to this message?

Oops, I missed that this landed already. Perhaps a later commit can improve the debug message.