This is an archive of the discontinued LLVM Phabricator instance.

Loop Cache Analysis
ClosedPublic

Authored by etiotto on Jun 17 2019, 2:41 PM.

Details

Summary

Implement a new analysis to estimate the number of cache lines required by a loop nest.
The analysis is largely based on the following paper:

Compiler Optimizations for Improving Data Locality
By: Steve Carr, Katherine S. McKinley, Chau-Wen Tseng
http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf

The analysis considers temporal reuse (accesses to the same memory location) and spatial reuse (accesses to memory locations within a cache line). For simplicity, the analysis considers memory accesses in the innermost loop of a loop nest, and thus determines the number of cache lines used when the loop L in loop nest LN is placed in the innermost position.

The result of the analysis can be used to drive several transformations. As an example, loop interchange could use it to determine which loops in a perfect loop nest should be interchanged to maximize cache reuse. Similarly, loop distribution could be enhanced to take into consideration cache reuse between arrays when distributing a loop to eliminate vectorization-inhibiting dependencies.

The general approach taken to estimate the number of cache lines used by the memory references in the inner loop of a loop nest is:

  • Partition memory references that exhibit temporal or spatial reuse into reference groups.
  • For each loop L in the loop nest LN: (a) compute the cost of each reference group; (b) compute the 'cache cost' of the loop nest by summing up the reference group costs.

For further details of the algorithm please refer to the paper.
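As a rough illustration of the model (a standalone sketch following the paper's cost formulas; none of these names come from the patch itself): a loop-invariant reference costs a single cache line, a reference whose innermost-loop stride is at least the cache line size costs one line per iteration, and a small-stride reference amortizes the line across consecutive iterations.

```cpp
#include <cstdint>

// Hypothetical sketch of the per-reference cost from the Carr/McKinley/Tseng
// model: number of cache lines touched by one reference when a given loop is
// innermost. Names are illustrative, not the actual LLVM API.
int64_t refCost(int64_t TripCount, int64_t Stride, int64_t CacheLineSize) {
  if (Stride == 0)             // loop-invariant: temporal reuse, one line
    return 1;
  if (Stride >= CacheLineSize) // no spatial reuse: a new line each iteration
    return TripCount;
  // Consecutive accesses share a line: spatial reuse (round up).
  return (TripCount * Stride + CacheLineSize - 1) / CacheLineSize;
}

// The loop's 'cache cost' is the sum of its reference-group costs.
int64_t loopCost(const int64_t *GroupCosts, int NumGroups) {
  int64_t Sum = 0;
  for (int I = 0; I < NumGroups; ++I)
    Sum += GroupCosts[I];
  return Sum;
}
```

For example, with a 64-byte cache line and 100 iterations, a stride-8 reference touches roughly 13 lines instead of 100.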

Diff Detail

Repository
rL LLVM

Event Timeline

etiotto created this revision.Jun 17 2019, 2:41 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 17 2019, 2:41 PM
etiotto edited the summary of this revision. (Show Details)Jun 17 2019, 2:44 PM

What do you plan as the first use of this transformation?

llvm/include/llvm/Analysis/LoopCacheAnalysis.h
30–31 ↗(On Diff #205185)

[style] The current code base most often uses a Ty suffix for typedefs.

174 ↗(On Diff #205185)

[suggestion] Make these magic numbers cl::opt options?

llvm/lib/Analysis/LoopCacheAnalysis.cpp
63–64 ↗(On Diff #205185)

Is passing a single innermost loop a precondition, or is it what this function computes? It seems the only loop that this function might return is Loops.back(), and it kind-of verifies that the loop vector is in the right order?!?

307 ↗(On Diff #205185)

[nit] double space

319 ↗(On Diff #205185)

[style] Subscripts.empty(). Sizes.empty()

332 ↗(On Diff #205185)

[style] Did you mean break/return here? The next iteration might add elements to it. I'd structure this as

if (!isOneDimensionalArray(AccessFn, L)) {
   ...
  break;
}

...

conforming to https://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to-simplify-code.

336 ↗(On Diff #205185)

[style] return all_of(Subscripts, ...

386 ↗(On Diff #205185)

Why not use getBackedgeTakenCount()?

523–531 ↗(On Diff #205185)

Since this only has code that executes in assert-builds, it should be guarded entirely by an LLVM_DEBUG.

590–591 ↗(On Diff #205185)

If the intention is to provide analysis for innermost loops only, why this limitation? Could it just return an analysis result for each innermost loop?

If the analysis requires a global view to determine the cost for each loop, wouldn't a FunctionPass be more appropriate? Currently, it seems users first need to get the LoopCacheAnalysis for a topmost loop, then ask it for one of its nested loops.

Are such loop nests not analyzable at all?

while (repeat) {
  for (int i = ...)
    for (int j = ...)
      B[i][j] = ... A[i+1][j+1] ... // stencil
  for (int i = ...)
    for (int j = ...)
      A[i][j] = ... B[i+1][j+1] ... // stencil
}
llvm/test/Analysis/LoopCacheAnalysis/a.ll
1 ↗(On Diff #205185)

[serious] Can you give more descriptive file names than a.ll and b.ll etc.?

4 ↗(On Diff #205185)

[suggestion] Is it possible to leave the triple unspecified in the test case?

etiotto marked 16 inline comments as done.Jun 18 2019, 2:20 PM
etiotto added inline comments.
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
30–31 ↗(On Diff #205185)

OK I'll use the Ty suffix.

174 ↗(On Diff #205185)

Added cl::opt for the default loop trip count and the temporal reuse threshold (see the .cpp file)

llvm/lib/Analysis/LoopCacheAnalysis.cpp
63–64 ↗(On Diff #205185)

This function attempts to retrieve the innermost loop in the given loop vector. It returns a nullptr if any loop in the supplied vector has more than one sibling. The loop vector is expected to contain loops collected in breadth-first order. I'll improve the comment.
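That contract could be sketched as follows (a toy stand-in for llvm::Loop; all names here are illustrative, not the actual implementation): in a breadth-first list of a simple nest, the innermost loop is the last element, and the lookup fails as soon as any loop has more than one subloop.

```cpp
#include <vector>

// Toy stand-in for llvm::Loop; illustrative only.
struct ToyLoop {
  std::vector<ToyLoop *> SubLoops;
};

// Sketch of the described contract: Loops is expected in breadth-first
// order; return the single innermost loop, or nullptr if some loop in the
// nest has siblings (i.e. the nest is not a simple chain).
ToyLoop *getInnerMostLoop(const std::vector<ToyLoop *> &Loops) {
  if (Loops.empty())
    return nullptr;
  for (ToyLoop *L : Loops)
    if (L->SubLoops.size() > 1)
      return nullptr; // more than one loop at some depth
  return Loops.back(); // BFS order puts the innermost loop last
}
```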

386 ↗(On Diff #205185)

I'll do that.

590–591 ↗(On Diff #205185)

The current scope of this PR is to analyze loop nests that have a single innermost loop. The analysis returns a vector of loop costs for each loop in a loop nest. This is not a hard requirement; however, I would like to extend the scope of the analysis in a future PR. One of the scenarios I had in mind, at least initially, as a consumer of this analysis is loop interchange, which operates on perfect nests; therefore the current implementation of the analysis is sufficient for that use.

If we were to make the analysis a function pass, that would force consumers to become function passes (or use the cached version of this analysis). That seems overly restrictive, especially considering that loop interchange is currently a loop pass.

llvm/test/Analysis/LoopCacheAnalysis/a.ll
1 ↗(On Diff #205185)

Ooops, yes I'll rename those files.

4 ↗(On Diff #205185)

Unfortunately it is not easily removed, because the target triple is necessary to determine the target architecture's cache line size, which is used in the analysis.

etiotto updated this revision to Diff 205434.Jun 18 2019, 2:23 PM
etiotto marked 4 inline comments as done.

Addressed review comments from M. Kruse.

xbolva00 added inline comments.
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
220 ↗(On Diff #205434)

llvm::sort

llvm/lib/Analysis/LoopCacheAnalysis.cpp
365 ↗(On Diff #205434)

llvm::all_of

fhahn added a subscriber: fhahn.Jun 18 2019, 2:54 PM
fhahn added inline comments.
llvm/lib/Analysis/LoopCacheAnalysis.cpp
590–591 ↗(On Diff #205185)

I'll try to have a closer look in a few days, but I think for the use in LoopInterchange, it would not need to be an analysis pass (I suppose there won't be a big benefit to preserve this as an analysis?).

I think a lightweight interface to query the cost for certain valid permutations would be sufficient. I think it would be great if we only compute the information when directly required (i.e. we only need to compute the costs for loops we can interchange, for the permutations valid to interchange)

How would a pass use this analysis? It computes a cost for the current IR, but there is nothing to compare it to unless the transformation pass emits the transformed loop nest next to the original one such that the LoopCacheAnalysis can compute its cost.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
564 ↗(On Diff #205434)

Would it be useful to make CacheCostTy an int64_t? At least with UBSan we could diagnose an overflow.

etiotto marked an inline comment as done.Jun 18 2019, 4:10 PM
etiotto added inline comments.
llvm/lib/Analysis/LoopCacheAnalysis.cpp
590–591 ↗(On Diff #205185)

Looking forward to your comments. You are correct, we could teach loop interchange about the locality of accesses in a loop nest. There are several other loop transformations that can benefit from it. The paper that introduces the analysis (Compiler Optimizations for Improving Data Locality - http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf) illustrates how the analysis was used to guide several other loop transformations, namely loop reversal, loop fusion, and loop distribution. Given the applicability of the analysis to several transformations, I think it would make sense to centralize the code as an analysis pass.

etiotto updated this revision to Diff 205570.Jun 19 2019, 6:37 AM
etiotto marked 3 inline comments as done.
etiotto marked 2 inline comments as done.Jun 19 2019, 6:42 AM
etiotto added inline comments.
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
220 ↗(On Diff #205434)

Yup agree, thanks for pointing this out.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
564 ↗(On Diff #205434)

Ok.

How would a pass use this analysis? It computes a cost for the current IR, but there is nothing to compare it to unless the transformation pass emits the transformed loop nest next to the original one such that the LoopCacheAnalysis can compute its cost.

The result of the analysis will be a vector of costs for each loop in a loop nest. The cost associated with a loop estimates the number of cache lines used if that loop were placed in the innermost position in the nest. Therefore, sorting the cost vector in descending order corresponds to minimizing the number of data cache misses in the nest. For example, loop distribution can use the sorted vector and work out a set of permutation moves (assuming legality constraints are satisfied) to maximize cache locality.
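The sorting step described above can be sketched like this (hypothetical names; the real analysis pairs Loop pointers with CacheCostTy values): ordering the (loop, cost) pairs by descending cost gives the nesting order from outermost to innermost, with the cheapest loop ending up innermost.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical (loop-id, cache cost) pair; illustrative stand-in for the
// patch's (Loop *, CacheCostTy). Sorting by descending cost yields the
// desired loop order: most expensive loop outermost, cheapest innermost.
using LoopCost = std::pair<std::string, int64_t>;

std::vector<LoopCost> orderByCost(std::vector<LoopCost> Costs) {
  std::stable_sort(Costs.begin(), Costs.end(),
                   [](const LoopCost &A, const LoopCost &B) {
                     return A.second > B.second; // higher cost -> outermore
                   });
  return Costs;
}
```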

etiotto marked 3 inline comments as done.Jun 21 2019, 12:12 PM
xusx595 added a subscriber: xusx595.Jul 2 2019, 1:48 AM

@fhahn Do you still want to look over this patch?

llvm/include/llvm/Analysis/LoopCacheAnalysis.h
54 ↗(On Diff #205570)

Should return size_t

158 ↗(On Diff #205570)

It does not make sense to have owning pointers to SmallVectors. This just adds another level of indirection compared to directly containing a std::vector (or a SmallVector, to have a single allocation for all elements if none exceeds the small size).

llvm/lib/Analysis/LoopCacheAnalysis.cpp
51 ↗(On Diff #205570)

Rather than using const_cast, why not remove the const from the parameter list?

55 ↗(On Diff #205570)

WL is used as a stack, a (Small)Vector could do this as well.

102 ↗(On Diff #205570)

Don’t “almost always” use auto

for (const SCEV *Subscript : R.subscripts)

You seem to have applied the coding standard for auto everywhere except in foreach-loops.

116 ↗(On Diff #205570)

No need for Subscripts(), Sizes()

Prefer default member initializers at the declaration over IsValid(false), BasePointer(nullptr).

120 ↗(On Diff #205570)

[subjective] I'd prefer a function that returns either a nullptr or an ErrorOr<IndexedReference> over an IsValid flag for every object.

145 ↗(On Diff #205570)

Nice, this is where the coding standard recommends auto.

172–175 ↗(On Diff #205570)

Some compilers may complain about empty statements in non-debug builds. Alternative:

LLVM_DEBUG({
  if (InSameCacheLine)
    dbgs().indent(2) << "Found spacial reuse.\n";
  else
    dbgs().indent(2) << "No spacial reuse.\n";
});
207 ↗(On Diff #205570)

Prefer int over unsigned

498–499 ↗(On Diff #205570)

The LLVM code base does not use const for stack variables.

537 ↗(On Diff #205570)

Do you need wrapping behavior? int instead of unsigned.

590–591 ↗(On Diff #205185)

If we were to make the analysis a function pass that would force consumers to become function passes (or use the cached version of this analysis). That seems overly restrictive especially considering that loop interchange is currently a loop pass.

I don't think the new pass manager has this restriction. Passes can ask for analyses of any level using OuterAnalysisManagerProxy. The new pass manager caches everything.

etiotto marked 17 inline comments as done.Jul 9 2019, 8:05 AM
etiotto added inline comments.
llvm/lib/Analysis/LoopCacheAnalysis.cpp
51 ↗(On Diff #205570)

I dropped the const qualifier in the parameter declaration and adjusted the code accordingly.

102 ↗(On Diff #205570)

OK I'll fix the use of auto in foreach-loops.

145 ↗(On Diff #205570)

Thanks.

172–175 ↗(On Diff #205570)

Agree. It is cleaner the way you propose, as there is only one LLVM_DEBUG.

498–499 ↗(On Diff #205570)

const-qualifying local variables preempts unintended modifications later in the function... but I do not feel strongly about it. I'll change it.

590–591 ↗(On Diff #205185)

Yes, it is also my understanding (from the comments in PassManager.h) that the new pass manager allows an "inner" pass (e.g. a loop pass) to ask for an "outer" analysis (e.g. a function analysis). However, at least from the comments in PassManager.h, the inner pass cannot cause the outer analysis to run and can only rely on the cached version, which may give back a nullptr.

etiotto updated this revision to Diff 208684.Jul 9 2019, 8:06 AM
etiotto marked 5 inline comments as done.

Addressed Michael Kruse comments.

fhahn added inline comments.Jul 9 2019, 8:21 AM
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
219 ↗(On Diff #205570)

Do you anticipate most users relying on the loops to be ordered by cost? If not, it might be worth switching to a map to avoid using find_if at multiple places (even though not too common in user-written C/C++ code, I've encountered various (auto-generated) sources, which have very deep nesting levels of loops).

LoopInterchange, for example, would be interested in comparing just the total cache cost of various re-orderings of a loop nest.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
590–591 ↗(On Diff #205185)

Given the applicability of the analysis to several transformations, I think it would make sense to centralize the code as an analysis pass.

I agree it is a useful utility to have, I am just wondering what the benefit of exposing it as an analysis pass would be, as it is unlikely that the result would be used for most loops or could be cached between interested transforms frequently.

IMO it would be fine as a utility class/function, i.e. just provide a getLoopCacheCost() function that takes a root loop and maybe a potential re-ordering of loops, which the interested transforms can use only on the loops that they can transform. I think that would reduce the size of the patch a bit and focus on the important bits.

Meinersbur added inline comments.Jul 9 2019, 11:23 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
590–591 ↗(On Diff #205185)

I convinced myself that indeed a FunctionPass might not be that nice, because any LoopPass would need to preserve it. As @fhahn points out, wanting to preserve/reuse the analysis might be rare.

etiotto marked 6 inline comments as done.Jul 9 2019, 4:14 PM
etiotto added inline comments.
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
219 ↗(On Diff #205570)

I was envisioning Loop Interchange wanting to get a sorted vector of cache costs for a nest and then using it to determine the optimal reordering of the loops in the nest (loops with smaller cost in the innermost positions, larger cost in the outermost positions). Having the cache cost vector sorted is also handy when printing it.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
47 ↗(On Diff #208684)

I will remove this static function and instead use the breadth_first(&Root) iterator from ADT/BreadthFirstIterator.h.

590–591 ↗(On Diff #205185)

I was thinking that having it as an analysis pass would allow loop transformations that do not modify the memory references in a loop nest to preserve the analysis...
I am OK with just adding a member function to the CacheCost class to construct and return the cache cost for a loop nest rooted at a given loop.
I will upstream a patch to make that change and, for the time being, avoid making this an analysis pass.

590–591 ↗(On Diff #205185)

I will upstream a patch, as suggested by @fhahn, to just provide a member function in the CacheCost class to construct and return the cache cost for a loop nest rooted at a given loop.

etiotto updated this revision to Diff 208836.Jul 9 2019, 4:18 PM
etiotto marked 2 inline comments as done.

Addressed suggestions from @fhahn and dropped the analysis pass. Instead I provided a static member function in the CacheCost class to compute the cache cost of a nest rooted at a given loop.

etiotto added a comment.EditedJul 9 2019, 4:23 PM

Note: Because I removed the getLoops static function in the last patch, older comments in LoopCacheAnalysis.cpp are unfortunately no longer attached to the correct lines :-(

@fhahn @Meinersbur do you have further comments? If not can this be approved?

Meinersbur accepted this revision.Jul 17 2019, 1:21 PM

LGTM, some nitpicks left.

llvm/include/llvm/Analysis/LoopCacheAnalysis.h
181 ↗(On Diff #208836)

CacheCostTy is int64_t; did you mean to assign -1?

200 ↗(On Diff #208836)

To not leak the internal list implementation, return ArrayRef<LoopCacheCostTy>.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
167 ↗(On Diff #208836)
This revision is now accepted and ready to land.Jul 17 2019, 1:21 PM
fhahn added a comment.Jul 17 2019, 1:43 PM

I would like to take another look, but do not want to block this for too long. Please feel free to commit if you don't hear from me by Monday.

llvm/test/Analysis/LoopCacheAnalysis/loads-store.ll
4 ↗(On Diff #208836)

We need to make sure that the PPC backend was built when running those tests. Usually we put target-specific tests in target subdirectories, with a local lit.local.cfg checking for the backend (e.g. see llvm/test/Transforms/LoopVectorize/PowerPC/lit.local.cfg).

greened added inline comments.Jul 18 2019, 10:28 AM
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
76 ↗(On Diff #208836)

Referencing a cl:opt defined in the .cpp? It's a little confusing.

107 ↗(On Diff #208836)

Why is this part of IndexedReference?

llvm/lib/Analysis/LoopCacheAnalysis.cpp
44 ↗(On Diff #208836)

What units is this in? What does "2" mean?

189 ↗(On Diff #208836)

This seems a bit dangerous to me. Depending on the client, we might want to assume reuse if we don't know the distance. Could this function return a tribool (yes/no/unknown)?

greened added inline comments.Jul 18 2019, 5:53 PM
llvm/test/Analysis/LoopCacheAnalysis/loads-store.ll
1 ↗(On Diff #208836)

I'd like to see block comments at the top of all of these tests explaining what they are testing. It will be much easier to understand what's going on when these tests fail.

fhahn added inline comments.Jul 22 2019, 6:43 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
256 ↗(On Diff #208836)

RefCost is guaranteed to be a SCEVConstant here, so it would be better to use cast<> instead. Or even better:

if (auto *ConstantCost = dyn_cast<SCEVConstant>(RefCost))
  return ConstantCost->getValue()->getSExtValue();

LLVM_DEBUG(dbgs().indent(4)
           << "RefCost is not a constant! Setting to RefCost=InvalidCost "
              "(invalid value).\n");
return CacheCost::InvalidCost;
267 ↗(On Diff #208836)

It looks like this function does not capture much and might be better as a separate member function?

274 ↗(On Diff #208836)

nit: Capitalize start of sentence.

280 ↗(On Diff #208836)

nit: Capitalize start of sentence.

456 ↗(On Diff #208836)

\n at the end?

471 ↗(On Diff #208836)

This should be passed in I think and the users should request it via the pass manager.

etiotto updated this revision to Diff 211487.Jul 24 2019, 6:31 AM
etiotto marked 7 inline comments as done.

Addressing code review comments given by @Meinersbur and @fhahn.

etiotto marked 3 inline comments as done.Jul 24 2019, 6:41 AM

I would like to take another look, but do not want to block this for too long. Please feel free to commit if you don't hear from me by Monday.

Done.

etiotto updated this revision to Diff 211491.Jul 24 2019, 6:44 AM

Address remaining comments from @fhahn

etiotto marked 10 inline comments as done.Jul 24 2019, 9:39 AM
etiotto added inline comments.
llvm/include/llvm/Analysis/LoopCacheAnalysis.h
76 ↗(On Diff #208836)

I'll pass it as an argument.

107 ↗(On Diff #208836)

I changed it into a static function in the .cpp file instead.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
44 ↗(On Diff #208836)

I clarified the description.

267 ↗(On Diff #208836)

OK I'll make it a private member function.

etiotto updated this revision to Diff 211532.Jul 24 2019, 9:41 AM
etiotto marked 4 inline comments as done.
greened added inline comments.Jul 24 2019, 10:13 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
44 ↗(On Diff #208836)

I'm still a bit confused. Does this mean that a[i] and a[i+1] will be considered to have temporal reuse with a value of 2? Perhaps "temporal reuse" itself needs a definition. Are we talking about referencing the exact same memory location, the same cache line, the same page, ...? It seems odd to me that "temporal reuse" would mean anything other than accessing exactly the same memory location. Everything else I would consider to be "spatial reuse."

At the very least this deserves a longer comment block about what it means and its implications. Some clients want the definition of "temporal reuse" to be "access the exact same memory location", and this default value seems to mean something very different.

239 ↗(On Diff #211532)

The debug message should reference MaxDistance and not hard-code the value 2.

etiotto marked 5 inline comments as done.Jul 24 2019, 2:42 PM
etiotto added inline comments.
llvm/lib/Analysis/LoopCacheAnalysis.cpp
44 ↗(On Diff #208836)

That's correct: a[i] and a[i+1] are considered to have temporal reuse when the threshold is 2. A threshold of 1 would cause only references to the same memory location to have temporal reuse. The analysis attempts to implement the algorithm in the paper mentioned in the summary, and I agree that this is a bit confusing. I will add a comment to attempt to clarify it better.
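Under that reading (my interpretation of the reply above; an illustrative sketch, not the in-tree code), the check amounts to comparing the absolute dependence distance against the threshold: distance 1 qualifies at threshold 2, while threshold 1 admits only distance 0.

```cpp
#include <cstdint>
#include <cstdlib>

// Sketch of the temporal-reuse test being discussed: two references reuse a
// location "in time" when their dependence distance along the loop is below
// the threshold. Illustrative only; the real check operates on SCEVs.
bool hasTemporalReuse(int64_t Distance, unsigned Threshold) {
  return static_cast<uint64_t>(std::llabs(Distance)) < Threshold;
}
```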

189 ↗(On Diff #208836)

If the dependence distance is unknown at compile time, the references are conservatively considered to have no spatial reuse, and consequently the analysis will overestimate the number of cache lines used by the loop (when it is in the innermost position in the nest).

For now I can return an Optional<bool> (I didn't find a tribool data type readily available in LLVM). This will make it easier to place references with 'unknown' distance in the same reference group if we find a motivating test case that needs it.
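A three-state result of that kind can be sketched with std::optional&lt;bool&gt; standing in for llvm::Optional (the distance argument here is a placeholder, not the real SCEV-based computation): nullopt means the distance is unknown at compile time, and each client decides how conservative to be.

```cpp
#include <optional>

// Tribool-style result: known distance -> definite yes/no; unknown distance
// -> nullopt, and the caller chooses the conservative interpretation.
// Illustrative stand-in for the llvm::Optional-based interface discussed.
std::optional<bool> isSpatialReuse(std::optional<long> Distance,
                                   long CacheLineSize) {
  if (!Distance)
    return std::nullopt; // distance not computable at compile time
  return *Distance < CacheLineSize;
}

// A conservative client: treat "unknown" as "no reuse", which overestimates
// the number of cache lines used.
bool assumeNoReuseIfUnknown(std::optional<bool> Reuse) {
  return Reuse.value_or(false);
}
```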

etiotto updated this revision to Diff 211602.Jul 24 2019, 2:43 PM
etiotto marked 2 inline comments as done.
etiotto added a subscriber: reames.

Addressing review comment from @reames

I addressed all pending review comments. @fhahn @reames, does it look OK to you now?

fhahn added a comment.Jul 30 2019, 2:58 AM

Can you re-run clang-format on the latest version of the patch? I think it would be good to get an in-tree user of this soon, to make sure the modeling works as expected on real hardware/benchmarks. Do you have a timeline to get this used? I think you mentioned LoopFusion as one of the first planned users?

llvm/include/llvm/Analysis/LoopCacheAnalysis.h
198 ↗(On Diff #211602)

IIRC this returns the ordered loop costs, right? Please document.

llvm/test/Analysis/LoopCacheAnalysis/PowerPC/matvecmul.ll
17 ↗(On Diff #211602)

IIRC the costs should be ordered. If so, please use CHECK-NEXT for entries with differing costs (here and at other places).

etiotto updated this revision to Diff 212429.Jul 30 2019, 1:47 PM
etiotto marked 2 inline comments as done.

Run clang-format, address comments from @fhahn.

etiotto added a comment.EditedJul 30 2019, 1:51 PM

Can you re-run clang-format on the latest version of the patch? I think it would be good to get an in-tree user of this soon, to make sure the modeling works as expected on real hardware/benchmarks. Do you have a timeline to get this used? I think you mentioned LoopFusion as one of the first planned users?

I reformatted the patch and addressed inline comments. The paper reports that the analysis was used to guide both loop fusion and loop interchange. I used it myself profitably on loop interchange (not yet in tree), and the lit tests provided are based in part on that experiment.

If there are no further comments and/or concerns I'll commit the patch on Thursday.

fhahn added a comment.Jul 30 2019, 3:45 PM

Can you re-run clang-format on the latest version of the patch? I think it would be good to get an in-tree user of this soon, to make sure the modeling works as expected on real hardware/benchmarks. Do you have a timeline to get this used? I think you mentioned LoopFusion as one of the first planned users?

I reformatted the patch and addressed inline comments. The paper reports that the analysis was used to guide both loop fusion and loop interchange. I used it myself profitably on loop interchange (not yet in tree), and the lit tests provided are based in part on that experiment.

Right, are you planning on submitting a patch for loop interchange upstream?

I used the printer on some of the loop interchange tests, but the delinearization does not support some cases there yet. I hope I find some time to look into that next week and maybe also integrating it into loop interchange unless you plan to do so.

If there are no further comments and/or concerns I'll commit the patch on Thursday.

Right, are you planning on submitting a patch for loop interchange upstream?

I used the printer on some of the loop interchange tests, but the delinearization does not support some cases there yet. I hope I find some time to look into that next week and maybe also integrating it into loop interchange unless you plan to do so.

If you have time to integrate the analysis into loop interchange next week, go ahead. I will need to work through the other work in progress I am doing, and upstream those pieces, before I can get to loop interchange. I think we can discuss the details during the biweekly loop meetings.

greened added inline comments.Jul 31 2019, 12:49 PM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
44 ↗(On Diff #208836)

So how will clients with different needs use this analysis? Some might want the definition in the paper but others might want to restrict it to exact memory locations only. A cl::opt doesn't allow configuration by clients.

etiotto updated this revision to Diff 213677.Aug 6 2019, 11:57 AM

Added a command line option to specify the threshold for temporal reuse.

etiotto marked 2 inline comments as done.Aug 6 2019, 11:59 AM
etiotto added inline comments.
llvm/lib/Analysis/LoopCacheAnalysis.cpp
44 ↗(On Diff #208836)

I added an optional parameter to allow users to specify the temporal reuse threshold.

etiotto marked an inline comment as done.Aug 6 2019, 12:00 PM
etiotto updated this revision to Diff 214231.Aug 8 2019, 2:19 PM
This revision was automatically updated to reflect the committed changes.
etiotto updated this revision to Diff 214384.Aug 9 2019, 8:54 AM

/home/xbolva00/LLVM/llvm/lib/Analysis/LoopCacheAnalysis.cpp:110:14: warning: ‘llvm::raw_ostream& llvm::operator<<(llvm::raw_ostream&, const llvm::IndexedReference&)’ has not been declared within llvm
raw_ostream &llvm::operator<<(raw_ostream &OS, const IndexedReference &R) {

^~~~

In file included from /home/xbolva00/LLVM/llvm/lib/Analysis/LoopCacheAnalysis.cpp:28:0:
/home/xbolva00/LLVM/llvm/include/llvm/Analysis/LoopCacheAnalysis.h:45:23: note: only here as a friend

friend raw_ostream &operator<<(raw_ostream &OS, const IndexedReference &R);
                    ^~~~~~~~

/home/xbolva00/LLVM/llvm/lib/Analysis/LoopCacheAnalysis.cpp:443:14: warning: ‘llvm::raw_ostream& llvm::operator<<(llvm::raw_ostream&, const llvm::CacheCost&)’ has not been declared within llvm
raw_ostream &llvm::operator<<(raw_ostream &OS, const CacheCost &CC) {

^~~~

In file included from /home/xbolva00/LLVM/llvm/lib/Analysis/LoopCacheAnalysis.cpp:28:0:
/home/xbolva00/LLVM/llvm/include/llvm/Analysis/LoopCacheAnalysis.h:174:23: note: only here as a friend

friend raw_ostream &operator<<(raw_ostream &OS, const CacheCost &CC);

GCC 7 Linux.

etiotto updated this revision to Diff 214718.Aug 12 2019, 3:11 PM

Fix build warning for operator<< when using GCC 7.

I uploaded a fix for this warning. I use clang to build, sigh; apologies for the problem.

No problem, it was just warning :)

fhahn added a comment.Aug 13 2019, 6:33 AM

I uploaded a fix for this warning. I use clang to build, sigh; apologies for the problem.

If the patch has not been reverted, I think it would be better to create a new patch on Phabricator with the fix, so it is easier to see the change.

greened added inline comments.Aug 13 2019, 10:23 AM
llvm/lib/Analysis/LoopCacheAnalysis.cpp
244 ↗(On Diff #214718)

Can we add the value of MaxDistance to this message?

Oops, I missed that this landed already. Perhaps a later commit can improve the debug message.

/home/xbolva00/LLVM/llvm-project/llvm/lib/Analysis/LoopCacheAnalysis.cpp 353 warn V612 An unconditional 'return' within a loop.
/home/xbolva00/LLVM/llvm-project/llvm/lib/Analysis/LoopCacheAnalysis.cpp 456 err V502 Perhaps the '?:' operator works in a different way than it was expected. The '?:' operator has a lower priority than the '==' operator.

Found by PVS Studio

I think it would be good to get an in-tree user of this soon, to make sure the modeling works as expected on real hardware/benchmarks. Do you have a timeline to get this used? I think you mentioned LoopFusion as one of the first planned users?

What about this? Or still dead code?

/home/xbolva00/LLVM/llvm-project/llvm/lib/Analysis/LoopCacheAnalysis.cpp 353 warn V612 An unconditional 'return' within a loop.

This indeed does not look right. There are breaks and returns in the loop, but no continue, meaning the loop only ever processes one element (in this case: the innermost loop). I think nothing other than the innermost loop is needed for delinearization, in which case it could be made a conditional instead of a for-statement.

/home/xbolva00/LLVM/llvm-project/llvm/lib/Analysis/LoopCacheAnalysis.cpp 456 err V502 Perhaps the '?:' operator works in a different way than it was expected. The '?:' operator has a lower priority than the '==' operator.

TRT(TRT == None ? Optional<unsigned>(TemporalReuseThreshold) : TRT),

Could add parens around TRT == None to silence the analyzer.
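For what it's worth, the precedence concern is benign: == binds tighter than ?:, so the expression already parses as intended and the parentheses only aid readability. A standalone illustration (hypothetical names, using std::optional in place of llvm::Optional):

```cpp
#include <optional>

// Demonstrates that `TRT == std::nullopt ? Default : *TRT` parses as
// `(TRT == std::nullopt) ? Default : *TRT`, because == has higher
// precedence than ?:. Parentheses would merely document this.
unsigned pickThreshold(std::optional<unsigned> TRT, unsigned Default) {
  return TRT == std::nullopt ? Default : *TRT;
}
```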

etiotto added a comment.EditedNov 4 2019, 1:14 PM

I have submitted a patch for review in https://reviews.llvm.org/D69821.