This is an archive of the discontinued LLVM Phabricator instance.

[NFC][LoopCacheAnalysis] Add a motivating test case for improved loop cache analysis cost calculation
ClosedPublic

Authored by congzhe on Mar 30 2022, 6:39 PM.

Details

Summary

Analysis of loop cache cost (LCC) is supposed to return an ordered vector of loops that reflects their optimal order in the loopnest based on cache locality (think of loop interchange, for example). The 1st loop in the vector should be placed as the outermost loop and the last loop should be placed as the innermost loop.

Motivation of this patch:
According to what we discussed in the last loopOptWG meeting, I've been improving LCC in various aspects. Currently LCC only considers estimating the number of cache lines that each loop accesses, and sorts the loops based on those numbers. This could result in inaccuracy, for exampe:

void foo(long n, long m, long o, int A[n][m][o]) {
  for (long j = 0; j < m; j++)
    for (long i = 0; i < n; i++)
      for (long k = 0; k < o; k++)
        A[2*i+3][2*j-4][2*k+7] = 1;
}

Consider cache locality, for this loopnest we would like to place the i-loop as the outermost loop, the j-loop as the middle loop, and the k-loop as the innermost loop. In other words, the final resulting vector should be [i-loop, j-loop, k-loop]. However, what the current LCC returns is [j-loop, i-loop, k-loop], which is incorrect if we used it as the cost model for loop interchange.

The reason for the incorrect result is that LCC only considers the "cost" of each loop and would output the following (vector of sorted loops), and hence could not determine whether the i-loop and the j-loop should be placed as the outermost loop, since both have the same cost.

Loop 'for.j' has cost = 1000000
Loop 'for.i' has cost = 1000000
Loop 'for.k' has cost = 60000

Furthermore, consider the motivating test case in this patch https://reviews.llvm.org/D120386. I will soon post another LCC patch that enables delinearization for fixed-size arrays, by then LCC could estimate the cost for loops in that motivating case but would output an incorrect result, meaning LCC is unable to place each loop at an ideal position in the loopnest due to exactly the same reason I described.

What this patch proposes is an improvement for LCC: we not only consider the "cost" (estimated number of cache lines we access), but also consider the "stride" and sort the loops based on both "cost" and "stride". For the above loopnest, with this patch LCC would give the following result

Loop 'for.i' has cost = 1000000 and stride = 80000
Loop 'for.j' has cost = 1000000 and stride = 800
Loop 'for.k' has cost = 60000 and stride = 8,

and each loop is now placed in the ideal position.

Note that in this example since the array is a parametric-sized array, currently LCC uses DefaultTripCount as an estimate of trip counts which is the reason for those "cost" numbers. Therefore we did the same for estimating "stride" numbers in this patch -- the actual strides for loops i, j, k are 8*m*o, 8*o, 8, respectively, and their corresponding estimations are 80000, 800 and 8, according to DefaultTripCount.

Diff Detail

Event Timeline

congzhe created this revision.Mar 30 2022, 6:39 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 30 2022, 6:39 PM
congzhe requested review of this revision.Mar 30 2022, 6:39 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 30 2022, 6:39 PM
congzhe retitled this revision from [LoopCacheAnalysis] Improve loop cache analysis results by taking strides into consideration to [LoopCacheAnalysis] Improve loop cache analysis results by taking memory access strides into consideration.Mar 30 2022, 6:56 PM
congzhe edited the summary of this revision. (Show Details)
congzhe edited the summary of this revision. (Show Details)
congzhe edited the summary of this revision. (Show Details)
congzhe edited the summary of this revision. (Show Details)Mar 30 2022, 7:05 PM
congzhe added reviewers: bmahjour, Whitney.
congzhe edited the summary of this revision. (Show Details)Mar 30 2022, 7:13 PM
congzhe edited the summary of this revision. (Show Details)Mar 30 2022, 7:31 PM
congzhe edited the summary of this revision. (Show Details)Mar 31 2022, 1:42 PM
congzhe added a reviewer: Restricted Project.
congzhe added a project: Restricted Project.Mar 31 2022, 1:57 PM

Thanks for looking into this @congzhe . As you described we don't seem to be able to distinguish between the cost of moving outer loops in nests more than 2 levels deep. The data locality paper tries to estimate the cost based on the estimated number of cache lines used when moving a loop into innermost level. Since the stride information (when it's less than the cache line size) is already used in RefCost functions, it makes more sense to somehow integrate it into the cost formula when it's larger than the cache line size. Treating stride as a separate component of the cost and associating it with the loop (as opposed to each reference group) makes the result of the analysis more complicated to consume and can cause ambiguity when dealing with multiple reference groups.

I think the proper way to solve this is to fold the "stride" information into the cost calculation. In computing the LoopCost(l) function, the paper already uses the notion of estimating the cache lines used based on the product of loop trip counts. I'd like to propose https://reviews.llvm.org/D123400 as a more desirable alternative as it tries to use the same concept to take the depth of the subscript dimensions into account when calculating each RefCost. Comments are welcome. FYI @etiotto.

congzhe updated this revision to Diff 426689.May 3 2022, 6:58 AM
congzhe retitled this revision from [LoopCacheAnalysis] Improve loop cache analysis results by taking memory access strides into consideration to [NFC][LoopCacheAnalysis] Add a motivating test case for improved loop cache analysis cost calculation.

@bmahjour Hi Bardia, I've updated this patch to a pure NFC patch with the motivating test case only. Looking forward to your comment :)

bmahjour added inline comments.May 3 2022, 7:20 AM
llvm/test/Analysis/LoopCacheAnalysis/PowerPC/single-store.ll
78

Could you please make it more clear how this test is different from the one above (ie add a comment to say that this is testing to make sure the analysis prints the loops in the expected order despite the original loop nest being in suboptimal order)?

88–90

Use CHECK and CHECK-NEXT

congzhe updated this revision to Diff 426727.May 3 2022, 8:36 AM

Thanks for the comments, I've updated the patch accordingly.

bmahjour accepted this revision.May 3 2022, 9:08 AM

LGTM

This revision is now accepted and ready to land.May 3 2022, 9:08 AM
This revision was landed with ongoing or failed builds.May 4 2022, 2:14 PM
This revision was automatically updated to reflect the committed changes.