This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/test/Analysis/LoopCacheAnalysis/PowerPC/
-
test/
-
Analysis/
-
LoopCacheAnalysis/
-
PowerPC/
2
single-store.ll

Differential D122776

[NFC][LoopCacheAnalysis] Add a motivating test case for improved loop cache analysis cost calculation
ClosedPublic

Authored by congzhe on Mar 30 2022, 6:39 PM.

Download Raw Diff

Details

Reviewers

bmahjour
Whitney

Group Reviewers

Restricted Project

Commits

rG5e004fb78769: [LoopCacheAnalysis][NFC] Add a test case for improved loop cache analysis cost…

Summary

Analysis of loop cache cost (LCC) is supposed to return an ordered vector of loops that reflects their optimal order in the loopnest based on cache locality (think of loop interchange, for example). The 1st loop in the vector should be placed as the outermost loop and the last loop should be placed as the innermost loop.

Motivation of this patch:
According to what we discussed in the last loopOptWG meeting, I've been improving LCC in various aspects. Currently LCC only considers estimating the number of cache lines that each loop accesses, and sorts the loops based on those numbers. This could result in inaccuracy, for exampe:

void foo(long n, long m, long o, int A[n][m][o]) {
  for (long j = 0; j < m; j++)
    for (long i = 0; i < n; i++)
      for (long k = 0; k < o; k++)
        A[2*i+3][2*j-4][2*k+7] = 1;
}

Consider cache locality, for this loopnest we would like to place the i-loop as the outermost loop, the j-loop as the middle loop, and the k-loop as the innermost loop. In other words, the final resulting vector should be [i-loop, j-loop, k-loop]. However, what the current LCC returns is [j-loop, i-loop, k-loop], which is incorrect if we used it as the cost model for loop interchange.

The reason for the incorrect result is that LCC only considers the "cost" of each loop and would output the following (vector of sorted loops), and hence could not determine whether the i-loop and the j-loop should be placed as the outermost loop, since both have the same cost.

Loop 'for.j' has cost = 1000000
Loop 'for.i' has cost = 1000000
Loop 'for.k' has cost = 60000

Furthermore, consider the motivating test case in this patch https://reviews.llvm.org/D120386. I will soon post another LCC patch that enables delinearization for fixed-size arrays, by then LCC could estimate the cost for loops in that motivating case but would output an incorrect result, meaning LCC is unable to place each loop at an ideal position in the loopnest due to exactly the same reason I described.

What this patch proposes is an improvement for LCC: we not only consider the "cost" (estimated number of cache lines we access), but also consider the "stride" and sort the loops based on both "cost" and "stride". For the above loopnest, with this patch LCC would give the following result

Loop 'for.i' has cost = 1000000 and stride = 80000
Loop 'for.j' has cost = 1000000 and stride = 800
Loop 'for.k' has cost = 60000 and stride = 8,

and each loop is now placed in the ideal position.

Note that in this example since the array is a parametric-sized array, currently LCC uses DefaultTripCount as an estimate of trip counts which is the reason for those "cost" numbers. Therefore we did the same for estimating "stride" numbers in this patch -- the actual strides for loops i, j, k are 8*m*o, 8*o, 8, respectively, and their corresponding estimations are 80000, 800 and 8, according to DefaultTripCount.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

congzhe created this revision.Mar 30 2022, 6:39 PM

Herald added a project: Restricted Project. · View Herald TranscriptMar 30 2022, 6:39 PM

Herald added subscribers: hiraditya, nemanjai. · View Herald Transcript

congzhe requested review of this revision.Mar 30 2022, 6:39 PM

Herald added a project: Restricted Project. · View Herald TranscriptMar 30 2022, 6:39 PM

Herald added a subscriber: llvm-commits. · View Herald Transcript

Harbormaster completed remote builds in B157077: Diff 419304.Mar 30 2022, 6:43 PM

congzhe retitled this revision from [LoopCacheAnalysis] Improve loop cache analysis results by taking strides into consideration to [LoopCacheAnalysis] Improve loop cache analysis results by taking memory access strides into consideration.Mar 30 2022, 6:56 PM

congzhe edited the summary of this revision. (Show Details)

congzhe edited the summary of this revision. (Show Details)Mar 30 2022, 7:05 PM

congzhe added reviewers: bmahjour, Whitney.

congzhe edited the summary of this revision. (Show Details)Mar 30 2022, 7:13 PM

congzhe edited the summary of this revision. (Show Details)Mar 30 2022, 7:31 PM

congzhe edited the summary of this revision. (Show Details)Mar 31 2022, 1:42 PM

congzhe added a reviewer: Restricted Project.

congzhe added a project: Restricted Project.Mar 31 2022, 1:57 PM

bmahjour mentioned this in D123400: [LoopCacheAnalysis] Consider dimension depth of the subscript reference when calculating cost.Apr 8 2022, 9:18 AM

Thanks for looking into this @congzhe . As you described we don't seem to be able to distinguish between the cost of moving outer loops in nests more than 2 levels deep. The data locality paper tries to estimate the cost based on the estimated number of cache lines used when moving a loop into innermost level. Since the stride information (when it's less than the cache line size) is already used in RefCost functions, it makes more sense to somehow integrate it into the cost formula when it's larger than the cache line size. Treating stride as a separate component of the cost and associating it with the loop (as opposed to each reference group) makes the result of the analysis more complicated to consume and can cause ambiguity when dealing with multiple reference groups.

I think the proper way to solve this is to fold the "stride" information into the cost calculation. In computing the LoopCost(l) function, the paper already uses the notion of estimating the cache lines used based on the product of loop trip counts. I'd like to propose https://reviews.llvm.org/D123400 as a more desirable alternative as it tries to use the same concept to take the depth of the subscript dimensions into account when calculating each RefCost. Comments are welcome. FYI @etiotto.

@bmahjour Hi Bardia, I've updated this patch to a pure NFC patch with the motivating test case only. Looking forward to your comment :)

bmahjour added inline comments.May 3 2022, 7:20 AM

llvm/test/Analysis/LoopCacheAnalysis/PowerPC/single-store.ll
78	Could you please make it more clear how this test is different from the one above (ie add a comment to say that this is testing to make sure the analysis prints the loops in the expected order despite the original loop nest being in suboptimal order)?
88–90	Use CHECK and CHECK-NEXT

Harbormaster completed remote builds in B162449: Diff 426689.May 3 2022, 7:46 AM

Thanks for the comments, I've updated the patch accordingly.

LGTM

This revision is now accepted and ready to land.May 3 2022, 9:08 AM

Harbormaster completed remote builds in B162473: Diff 426727.May 3 2022, 9:49 AM

This revision was landed with ongoing or failed builds.May 4 2022, 2:14 PM

Closed by commit rG5e004fb78769: [LoopCacheAnalysis][NFC] Add a test case for improved loop cache analysis cost… (authored by congzhe). · Explain Why

This revision was automatically updated to reflect the committed changes.

congzhe added a commit: rG5e004fb78769: [LoopCacheAnalysis][NFC] Add a test case for improved loop cache analysis cost….

Revision Contents

Path

Size

llvm/

test/

Analysis/

LoopCacheAnalysis/

PowerPC/

single-store.ll

77 lines

Diff 427136

llvm/test/Analysis/LoopCacheAnalysis/PowerPC/single-store.ll

	Show First 20 Lines • Show All 69 Lines • ▼ Show 20 Lines

	for.end.loopexit: ; preds = %for.inci			for.end.loopexit: ; preds = %for.inci
	br label %for.end			br label %for.end

	for.end: ; preds = %for.end.loopexit, %for.cond1.preheader.lr.ph, %entry			for.end: ; preds = %for.end.loopexit, %for.cond1.preheader.lr.ph, %entry
	ret void			ret void
	}			}

				; Loop i is supposed to have the largest cost and be placed
				bmahjourUnsubmitted Not Done Reply Inline Actions Could you please make it more clear how this test is different from the one above (ie add a comment to say that this is testing to make sure the analysis prints the loops in the expected order despite the original loop nest being in suboptimal order)? bmahjour: Could you please make it more clear how this test is different from the one above (ie add a…
				; as the outermost loop. This test differs from foo() since the
				; loopnest has a suboptimal order j-i-k.
				; After D123400 we ensure that the order of loop cache analysis output
				; is loop i-j-k, despite the suboptimal order in the original loopnest.
				;
				; void foo(long n, long m, long o, int A[n][m][o]) {
				; for (long j = 0; j < m; j++)
				; for (long i = 0; i < n; i++)
				; for (long k = 0; k < o; k++)
				; A[2i+3][2j-4][2*k+7] = 1;
				; }

				bmahjourUnsubmitted Not Done Reply Inline Actions Use CHECK and CHECK-NEXT bmahjour: Use CHECK and CHECK-NEXT
				; CHECK: Loop 'for.i' has cost = 100000000
				; CHECK-NEXT: Loop 'for.j' has cost = 1000000
				; CHECK-NEXT: Loop 'for.k' has cost = 60000

				define void @foo2(i64 %n, i64 %m, i64 %o, i32* %A) {
				entry:
				%cmp32 = icmp sgt i64 %n, 0
				%cmp230 = icmp sgt i64 %m, 0
				%cmp528 = icmp sgt i64 %o, 0
				br i1 %cmp32, label %for.cond1.preheader.lr.ph, label %for.end

				for.cond1.preheader.lr.ph: ; preds = %entry
				br i1 %cmp230, label %for.j.preheader, label %for.end

				for.j.preheader: ; preds = %for.cond1.preheader.lr.ph
				br i1 %cmp528, label %for.j.preheader.split, label %for.end

				for.j.preheader.split: ; preds = %for.j.preheader
				br label %for.j

				for.i: ; preds = %for.inci, %for.j
				%i = phi i64 [ %inci, %for.inci ], [ 0, %for.j ]
				%mul8 = shl i64 %i, 1
				%add9 = add nsw i64 %mul8, 3
				%0 = mul i64 %add9, %m
				%sub = add i64 %0, -4
				%mul7 = mul nsw i64 %j, 2
				%tmp = add i64 %sub, %mul7
				%tmp27 = mul i64 %tmp, %o
				br label %for.k

				for.j: ; preds = %for.incj, %for.j.preheader.split
				%j = phi i64 [ %incj, %for.incj ], [ 0, %for.j.preheader.split ]
				br label %for.i

				for.k: ; preds = %for.k, %for.i
				%k = phi i64 [ 0, %for.i ], [ %inck, %for.k ]

				%mul = mul nsw i64 %k, 2
				%arrayidx.sum = add i64 %mul, 7
				%arrayidx10.sum = add i64 %arrayidx.sum, %tmp27
				%arrayidx11 = getelementptr inbounds i32, i32* %A, i64 %arrayidx10.sum
				store i32 1, i32* %arrayidx11, align 4

				%inck = add nsw i64 %k, 1
				%exitcond.us = icmp eq i64 %inck, %o
				br i1 %exitcond.us, label %for.inci, label %for.k

				for.incj: ; preds = %for.inci
				%incj = add nsw i64 %j, 1
				%exitcond54.us = icmp eq i64 %incj, %m
				br i1 %exitcond54.us, label %for.end.loopexit, label %for.j

				for.inci: ; preds = %for.k
				%inci = add nsw i64 %i, 1
				%exitcond55.us = icmp eq i64 %inci, %n
				br i1 %exitcond55.us, label %for.incj, label %for.i

				for.end.loopexit: ; preds = %for.incj
				br label %for.end

				for.end: ; preds = %for.end.loopexit, %for.cond1.preheader.lr.ph, %entry
				ret void
				}