Analysis of loop cache cost (LCC) is supposed to return an ordered vector of loops that reflects their optimal order in the loopnest based on cache locality (think of loop interchange, for example). The 1st loop in the vector should be placed as the outermost loop and the last loop should be placed as the innermost loop.
Motivation of this patch:
According to what we discussed in the last loopOptWG meeting, I've been improving LCC in various aspects. Currently LCC only considers estimating the number of cache lines that each loop accesses, and sorts the loops based on those numbers. This could result in inaccuracy, for exampe:
void foo(long n, long m, long o, int A[n][m][o]) { for (long j = 0; j < m; j++) for (long i = 0; i < n; i++) for (long k = 0; k < o; k++) A[2*i+3][2*j-4][2*k+7] = 1; }
Consider cache locality, for this loopnest we would like to place the i-loop as the outermost loop, the j-loop as the middle loop, and the k-loop as the innermost loop. In other words, the final resulting vector should be [i-loop, j-loop, k-loop]. However, what the current LCC returns is [j-loop, i-loop, k-loop], which is incorrect if we used it as the cost model for loop interchange.
The reason for the incorrect result is that LCC only considers the "cost" of each loop and would output the following (vector of sorted loops), and hence could not determine whether the i-loop and the j-loop should be placed as the outermost loop, since both have the same cost.
Loop 'for.j' has cost = 1000000 Loop 'for.i' has cost = 1000000 Loop 'for.k' has cost = 60000
Furthermore, consider the motivating test case in this patch https://reviews.llvm.org/D120386. I will soon post another LCC patch that enables delinearization for fixed-size arrays, by then LCC could estimate the cost for loops in that motivating case but would output an incorrect result, meaning LCC is unable to place each loop at an ideal position in the loopnest due to exactly the same reason I described.
What this patch proposes is an improvement for LCC: we not only consider the "cost" (estimated number of cache lines we access), but also consider the "stride" and sort the loops based on both "cost" and "stride". For the above loopnest, with this patch LCC would give the following result
Loop 'for.i' has cost = 1000000 and stride = 80000 Loop 'for.j' has cost = 1000000 and stride = 800 Loop 'for.k' has cost = 60000 and stride = 8,
and each loop is now placed in the ideal position.
Note that in this example since the array is a parametric-sized array, currently LCC uses DefaultTripCount as an estimate of trip counts which is the reason for those "cost" numbers. Therefore we did the same for estimating "stride" numbers in this patch -- the actual strides for loops i, j, k are 8*m*o, 8*o, 8, respectively, and their corresponding estimations are 80000, 800 and 8, according to DefaultTripCount.
Could you please make it more clear how this test is different from the one above (ie add a comment to say that this is testing to make sure the analysis prints the loops in the expected order despite the original loop nest being in suboptimal order)?