This is an archive of the discontinued LLVM Phabricator instance.

Differential D73064

[LoopCacheAnalysis]: Add support for negative stride
Closed · Public

Authored by rcraik on Jan 20 2020, 1:04 PM.

Details

Reviewers

kbarton
jdoerfert
Meinersbur
bmahjour
etiotto
Whitney

Commits

rG1f5542006502: [LoopCacheAnalysis]: Add support for negative stride

Summary

LoopCacheAnalysis currently assumes the loop will be iterated over in
a forward direction. This patch addresses the issue by using the
absolute value of the stride when iterating backwards.

Note: this patch will treat negative and positive array access the
same, resulting in the same cost being calculated for single and
bi-directional access patterns. This should be improved in a
subsequent patch.
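
As a rough illustration of the approach, the sketch below models the consecutive-access check and the cost formula on plain integers rather than on SCEV expressions. The helper names (absStride, isConsecutiveNaive, isConsecutive, refCost) are hypothetical, and the 128-byte cache line and the trip counts of 1024 and 40961 are assumptions chosen so that the results line up with the CHECK values in compute-cost.ll. It is not the LLVM implementation.

```cpp
#include <cstdint>

// Hypothetical integer model of the patch; the real code works on SCEVs.

// Absolute value of a signed byte stride, returned as an unsigned value.
constexpr std::uint64_t absStride(std::int64_t stride) {
  return stride < 0 ? 0 - static_cast<std::uint64_t>(stride)
                    : static_cast<std::uint64_t>(stride);
}

// Without the absolute value, a stride of -8 reinterpreted as unsigned is a
// very large number, so an unsigned "stride < cache line size" test fails.
constexpr bool isConsecutiveNaive(std::int64_t stride, std::uint64_t cls) {
  return static_cast<std::uint64_t>(stride) < cls;
}

// With the absolute value, forward and backward strides are treated alike.
constexpr bool isConsecutive(std::int64_t stride, std::uint64_t cls) {
  return absStride(stride) < cls;
}

// RefCost = (TripCount * |Stride|) / CLS for consecutive accesses,
// otherwise RefCost = TripCount.
constexpr std::uint64_t refCost(std::int64_t stride, std::uint64_t tripCount,
                                std::uint64_t cls) {
  return isConsecutive(stride, cls) ? (absStride(stride) * tripCount) / cls
                                    : tripCount;
}
```

Under these assumptions, refCost(-8, 1024, 128) evaluates to 64 and refCost(-8, 40961, 128) to 2560, the same values a forward traversal with stride 8 would get.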

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

rcraik created this revision. Jan 20 2020, 1:04 PM

Herald added a project: Restricted Project. Jan 20 2020, 1:05 PM

Herald added subscribers: llvm-commits, jsji, hiraditya, nemanjai.

Ping?

Herald added a subscriber: wuzish. Jan 27 2020, 12:20 PM

LGTM, will wait for others to approve.

Correct me if I'm wrong but it seems you pretend negative steps are positive ones. While this is just a best-effort analysis, it seems wrong.
Take

for (i = N; i > 0; i--)
  A[i] = A[N - i];

which should have a different cost than

for (i = N; i > 0; i--)
  A[i] = A[i];

which is the same as

for (i = 0; i < N; i++)
  A[i] = A[i];

Can you add these examples to the tests please?

llvm/lib/Analysis/LoopCacheAnalysis.cpp
359	The example code is not really helpful here.
381	No braces.

Added the testcases suggested by @jdoerfert

jdoerfert added inline comments. Jan 31 2020, 10:30 AM

llvm/test/Analysis/LoopCacheAnalysis/PowerPC/compute-cost.ll
70	As I mentioned, this should not be scored as the ones below. Add a FIXME here to explain that. In the code we should also add a FIXME to indicate that this is not the best handling of negative strides.

I took each of these example loops and, with N = 1024, ran each loop 10 000 000 times. I ran each testcase 10 times and found no statistically significant difference in either the number of L1 cache load misses or the L1 cache miss rate between the testcases. This indicates that the cost of these loops is the same, which is what this patch implements.

llvm/lib/Analysis/LoopCacheAnalysis.cpp
381	I checked the LLVM Coding Standards, and it doesn't mention braces vs no braces for single statement if/for/etc. blocks. Since this statement covers multiple lines, I would rather keep the braces for clarity.

In D73064#1852211, @rcraik wrote:

I took each of these example loops and, with N = 1024, ran each loop 10 000 000 times. I ran each testcase 10 times and found no statistically significant difference in either the number of L1 cache load misses or the L1 cache miss rate between the testcases. This indicates that the cost of these loops is the same, which is what this patch implements.

Thanks for trying to verify this, but there is a problem with the setup. With N = 1024, the memory footprint M of each example loop is 1024 * sizeof(A[0]). I don't know what types you used, but let's assume double, so M = 1024 * sizeof(double) = 1024 * 8 = 8192 bytes. My laptop's L1 data cache is 32 KiB in size, which means the array fits into the L1 cache 4 times over. Once the loop has been traversed a single time, the L1 cache contains the whole array and there is little reason to assume eviction. Every subsequent traversal, so 10 000 000 - 1 of them, will most likely see only L1 cache hits no matter how you iterate over the array.
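
The footprint arithmetic can be checked with a few lines of code. This is a minimal sketch under the same assumptions as the comment above (8-byte elements, a 32 KiB L1 data cache); the constant and function names are made up for illustration.

```cpp
#include <cstddef>

// Assumed sizes: 8-byte elements (double) and a 32 KiB L1 data cache.
constexpr std::size_t kElemBytes = 8;
constexpr std::size_t kL1Bytes = 32 * 1024;

// Memory footprint of an array with numElems elements, in bytes.
constexpr std::size_t footprintBytes(std::size_t numElems) {
  return numElems * kElemBytes;
}

// How many whole copies of the array fit into the cache
// (0 means the array is larger than the cache).
constexpr std::size_t copiesInL1(std::size_t numElems) {
  return kL1Bytes / footprintBytes(numElems);
}
```

With N = 1024 the footprint is 8192 bytes and 32768 / 8192 = 4 copies fit, so repeated traversals stay in cache. An array of roughly 40960 doubles, the bound used in the added tests, occupies 327680 bytes and no longer fits in a cache of this size.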

llvm/lib/Analysis/LoopCacheAnalysis.cpp
359	Especially since it is way too complex. Why do we need the struct, the function, ...
381	FWIW, I went through the coding standard myself again. As far as I can tell, every conditional follows the same rule: if it's a single statement, no braces, otherwise braces. (The alternative case has braces if the consequence has, and vice versa.)

This revision now requires changes to proceed. Jan 31 2020, 1:19 PM

In D73064#1844826, @jdoerfert wrote:
Correct me if I'm wrong but it seems you pretend negative steps are positive ones. While this is just a best-effort analysis, it seems wrong.
Take
for (i = N; i > 0; i--)
  A[i] = A[N - i];
which should have a different cost than
for (i = N; i > 0; i--)
  A[i] = A[i];
which is the same as
for (i = 0; i < N; i++)
  A[i] = A[i];
Can you add these examples to the tests please?

Could you explain why you believe the first test should have a different cost than the second two? The first and last testcase both iterate over the array in a forwards direction (at least in terms of loads) while the middle one is the only one which goes backwards.
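
The direction of each access stream in the three loops can be made explicit. The sketch below uses hypothetical helper names and only compares the first and last index of each load and store stream, which is enough for these strictly monotonic accesses.

```cpp
#include <cstddef>

enum class Direction { Forward, Backward };

// Direction of a monotonic index stream, judged from its endpoints.
constexpr Direction directionOf(std::size_t first, std::size_t last) {
  return first <= last ? Direction::Forward : Direction::Backward;
}

// Loop 1: for (i = N; i > 0; i--) A[i] = A[N - i];
// Loads visit indices 0 .. N-1, stores visit N .. 1.
constexpr Direction loop1Loads(std::size_t n) { return directionOf(n - n, n - 1); }
constexpr Direction loop1Stores(std::size_t n) { return directionOf(n, 1); }

// Loop 2: for (i = N; i > 0; i--) A[i] = A[i];
// Loads and stores both visit N .. 1.
constexpr Direction loop2Loads(std::size_t n) { return directionOf(n, 1); }
constexpr Direction loop2Stores(std::size_t n) { return directionOf(n, 1); }

// Loop 3: for (i = 0; i < N; i++) A[i] = A[i];
// Loads and stores both visit 0 .. N-1.
constexpr Direction loop3Loads(std::size_t n) { return directionOf(0, n - 1); }
constexpr Direction loop3Stores(std::size_t n) { return directionOf(0, n - 1); }
```

For N = 1024, the loads of loops 1 and 3 run forward while the loads of loop 2 run backward; loop 1 is the only one whose loads and stores run in opposite directions.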

In D73064#1857187, @rcraik wrote:
Could you explain why you believe the first test should have a different cost than the second two? The first and last testcase both iterate over the array in a forwards direction (at least in terms of loads) while the middle one is the only one which goes backwards.

Going backwards is no different from going forward (assuming we ignore prefetchers). So I argue 2 and 3 are the same when it comes to cache behavior. In both we access the same memory twice in each iteration, but since it is the same memory there is actually only a single cache miss per iteration (assuming one array element per cache line for simplicity, otherwise we just get a factor everywhere). Now in version 1 we access two different elements in all but potentially one iteration (the middle) so we get two cache misses in each iteration.
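
The counting argument can be sketched as a tiny model. It assumes, as in the comment above, one array element per cache line, and additionally that no line survives in the cache from one iteration to the next (that is, the array is much larger than the cache). The names IndexFn, sameIndex, mirroredIndex and countMisses are hypothetical.

```cpp
#include <cstddef>

// Maps the loop counter i (and the bound n) to an array index.
using IndexFn = std::size_t (*)(std::size_t i, std::size_t n);

constexpr std::size_t sameIndex(std::size_t i, std::size_t) { return i; }
constexpr std::size_t mirroredIndex(std::size_t i, std::size_t n) { return n - i; }

// Models `for (i = n; i > 0; i--) A[store(i)] = A[load(i)];`.
// Each iteration costs one miss if the load and the store hit the same
// element, and two misses if they touch different elements.
constexpr std::size_t countMisses(std::size_t n, IndexFn load, IndexFn store) {
  std::size_t misses = 0;
  for (std::size_t i = n; i > 0; --i)
    misses += (load(i, n) == store(i, n)) ? 1 : 2;
  return misses;
}
```

For n = 1024, the model gives 1024 misses for A[i] = A[i] and 2047 for A[i] = A[N - i] (two per iteration except the middle one, where both indices are 512), which is the roughly 2x difference being argued for. The direction of traversal does not enter the count at all.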

Updated comments and testcases

Sorry for my delay, traveling...

LGTM. Thanks for the fixme and the test cases!

This revision is now accepted and ready to land. Feb 8 2020, 11:02 AM

Closed by commit rG1f5542006502: [LoopCacheAnalysis]: Add support for negative stride (authored by rcraik). Feb 10 2020, 10:28 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                            Size
llvm/lib/Analysis/LoopCacheAnalysis.cpp                         37 lines
llvm/test/Analysis/LoopCacheAnalysis/PowerPC/compute-cost.ll    120 lines

Diff 243628

llvm/lib/Analysis/LoopCacheAnalysis.cpp

@@ ... @@ static bool isOneDimensionalArray(const SCEV &AccessFn, const SCEV &ElemSize,
   const SCEV *Step = AR->getStepRecurrence(SE);
   if (isa<SCEVAddRecExpr>(Start) || isa<SCEVAddRecExpr>(Step))
     return false;

   // Check that start and increment are both invariant in the loop.
   if (!SE.isLoopInvariant(Start, &L) || !SE.isLoopInvariant(Step, &L))
     return false;

-  return AR->getStepRecurrence(SE) == &ElemSize;
+  const SCEV *StepRec = AR->getStepRecurrence(SE);
+  if (StepRec && SE.isKnownNegative(StepRec))
+    StepRec = SE.getNegativeSCEV(StepRec);
+
+  return StepRec == &ElemSize;
 }

 /// Compute the trip count for the given loop \p L. Return the SCEV expression
 /// for the trip count or nullptr if it cannot be computed.
 static const SCEV *computeTripCount(const Loop &L, ScalarEvolution &SE) {
   const SCEV *BackedgeTakenCount = SE.getBackedgeTakenCount(&L);
   if (isa<SCEVCouldNotCompute>(BackedgeTakenCount) ||
       !isa<SCEVConstant>(BackedgeTakenCount))
@@ ... @@ CacheCostTy IndexedReference::computeRefCost(const Loop &L,
   const SCEV *RefCost = TripCount;

   if (isConsecutive(L, CLS)) {
     const SCEV *Coeff = getLastCoefficient();
     const SCEV *ElemSize = Sizes.back();
     const SCEV *Stride = SE.getMulExpr(Coeff, ElemSize);
     const SCEV *CacheLineSize = SE.getConstant(Stride->getType(), CLS);
     Type *WiderType = SE.getWiderType(Stride->getType(), TripCount->getType());
-    Stride = SE.getNoopOrSignExtend(Stride, WiderType);
+    if (SE.isKnownNegative(Stride))
+      Stride = SE.getNegativeSCEV(Stride);
+    Stride = SE.getNoopOrAnyExtend(Stride, WiderType);
     TripCount = SE.getNoopOrAnyExtend(TripCount, WiderType);
     const SCEV *Numerator = SE.getMulExpr(Stride, TripCount);
     RefCost = SE.getUDivExpr(Numerator, CacheLineSize);

     LLVM_DEBUG(dbgs().indent(4)
                << "Access is consecutive: RefCost=(TripCount*Stride)/CLS="
                << *RefCost << "\n");
   } else
     LLVM_DEBUG(dbgs().indent(4)
                << "Access is not consecutive: RefCost=TripCount=" << *RefCost
                << "\n");

@@ ... @@ if (Subscripts.empty() || Sizes.empty() ||
       if (!isOneDimensionalArray(AccessFn, ElemSize, *L, SE)) {
         LLVM_DEBUG(dbgs().indent(2)
                    << "ERROR: failed to delinearize reference\n");
         Subscripts.clear();
         Sizes.clear();
         return false;
       }

+      // The array may be accessed in reverse, for example:
+      //   for (i = N; i > 0; i--)
+      //     A[i] = 0;
+      // In this case, reconstruct the access function using the absolute value
+      // of the step recurrence.
+      const SCEVAddRecExpr *AccessFnAR = dyn_cast<SCEVAddRecExpr>(AccessFn);
+      const SCEV *StepRec = AccessFnAR ? AccessFnAR->getStepRecurrence(SE) : nullptr;
+
+      if (StepRec && SE.isKnownNegative(StepRec))
+        AccessFn = SE.getAddRecExpr(AccessFnAR->getStart(),
+                                    SE.getNegativeSCEV(StepRec),
+                                    AccessFnAR->getLoop(),
+                                    AccessFnAR->getNoWrapFlags());
       const SCEV *Div = SE.getUDivExactExpr(AccessFn, ElemSize);
       Subscripts.push_back(Div);
       Sizes.push_back(ElemSize);
     }

     return all_of(Subscripts, [&](const SCEV *Subscript) {
       return isSimpleAddRecurrence(Subscript, L);
     });
   }

   return false;
 }

 bool IndexedReference::isLoopInvariant(const Loop &L) const {
   Value *Addr = getPointerOperand(&StoreOrLoadInst);
   assert(Addr != nullptr && "Expecting either a load or a store instruction");
   assert(SE.isSCEVable(Addr->getType()) && "Addr should be SCEVable");

@@ ... @@ bool IndexedReference::isConsecutive(const Loop &L, unsigned CLS) const {
   }

   // ...and the access stride is less than the cache line size.
   const SCEV *Coeff = getLastCoefficient();
   const SCEV *ElemSize = Sizes.back();
   const SCEV *Stride = SE.getMulExpr(Coeff, ElemSize);
   const SCEV *CacheLineSize = SE.getConstant(Stride->getType(), CLS);

+  Stride = SE.isKnownNegative(Stride) ? SE.getNegativeSCEV(Stride) : Stride;
   return SE.isKnownPredicate(ICmpInst::ICMP_ULT, Stride, CacheLineSize);
 }

 const SCEV *IndexedReference::getLastCoefficient() const {
   const SCEV *LastSubscript = getLastSubscript();
   assert(isa<SCEVAddRecExpr>(LastSubscript) &&
          "Expecting a SCEV add recurrence expression");
   const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(LastSubscript);
@@ ... @@ for (Instruction &I : *BB) {
       for (ReferenceGroupTy &RefGroup : RefGroups) {
         const IndexedReference &Representative = *RefGroup.front().get();
         LLVM_DEBUG({
           dbgs() << "References:\n";
           dbgs().indent(2) << *R << "\n";
           dbgs().indent(2) << Representative << "\n";
         });

+
+        // FIXME: Both positive and negative access functions will be placed
+        // into the same reference group, resulting in a bi-directional array
+        // access such as:
+        //   for (i = N; i > 0; i--)
+        //     A[i] = A[N - i];
+        // having the same cost calculation as a single dimention access pattern
+        //   for (i = 0; i < N; i++)
+        //     A[i] = A[i];
+        // when in actuality, depending on the array size, the first example
+        // should have a cost closer to 2x the second due to the two cache
+        // access per iteration from opposite ends of the array
         Optional<bool> HasTemporalReuse =
             R->hasTemporalReuse(Representative, TRT, InnerMostLoop, DI, AA);
         Optional<bool> HasSpacialReuse =
             R->hasSpacialReuse(Representative, CLS, AA);

         if ((HasTemporalReuse.hasValue() && *HasTemporalReuse) ||
             (HasSpacialReuse.hasValue() && *HasSpacialReuse)) {
           RefGroup.push_back(std::move(R));

llvm/test/Analysis/LoopCacheAnalysis/PowerPC/compute-cost.ll

@@ ... @@ for.body: ; preds = %for.cond
   %arrayidx = getelementptr inbounds %struct._Handleitem*, %struct._Handleitem** %blocks, i64 %idxprom
   store %struct._Handleitem* null, %struct._Handleitem** %arrayidx, align 8
   %inc = add nuw nsw i32 %i.0, 1
   br label %for.cond

 ; Exit blocks
 for.end: ; preds = %for.cond
   ret void
+}



+; Check IndexedReference::computeRefCost can handle negative stride
+
+; CHECK: Loop 'for.neg.cond' has cost = 64
+
+define void @handle_to_ptr_neg_stride(%struct._Handleitem** %blocks) {
+; Preheader:
+entry:
+  br label %for.neg.cond
+
+; Loop:
+for.neg.cond: ; preds = %for.neg.body, %entry
+  %i.0 = phi i32 [ 1023, %entry ], [ %dec, %for.neg.body ]
+  %cmp = icmp sgt i32 %i.0, 0
+  br i1 %cmp, label %for.neg.body, label %for.neg.end
+
+for.neg.body: ; preds = %for.neg.cond
+  %idxprom = zext i32 %i.0 to i64
+  %arrayidx = getelementptr inbounds %struct._Handleitem*, %struct._Handleitem** %blocks, i64 %idxprom
+  store %struct._Handleitem* null, %struct._Handleitem** %arrayidx, align 8
+  %dec = add nsw i32 %i.0, -1
+  br label %for.neg.cond
+
+; Exit blocks
+for.neg.end: ; preds = %for.neg.cond
+  ret void
+}



+; for (int i = 40960; i > 0; i--)
+;   B[i] = B[40960 - i];
+
+; FIXME: Currently negative access functions are treated the same as positive
+; access functions. When this is fixed this testcase should have a cost
+; approximately 2x higher.
+
+; CHECK: Loop 'for.cond2' has cost = 2560
+define void @Test2(double* %B) {
+entry:
+  br label %for.cond2
+
+for.cond2: ; preds = %for.body, %entry
+  %i.0 = phi i32 [ 40960, %entry ], [ %dec, %for.body ]
+  %cmp = icmp sgt i32 %i.0, 0
+  br i1 %cmp, label %for.body, label %for.end
+
+for.body: ; preds = %for.cond
+  %sub = sub nsw i32 40960, %i.0
+  %idxprom = sext i32 %sub to i64
+  %arrayidx = getelementptr inbounds double, double* %B, i64 %idxprom
+  %0 = load double, double* %arrayidx, align 8
+  %idxprom1 = sext i32 %i.0 to i64
+  %arrayidx2 = getelementptr inbounds double, double* %B, i64 %idxprom1
+  store double %0, double* %arrayidx2, align 8
+  %dec = add nsw i32 %i.0, -1
+  br label %for.cond2
+
+for.end: ; preds = %for.cond
+  ret void
+}



+; for (i = 40960; i > 0; i--)
+;   C[i] = C[i];
+
+; CHECK: Loop 'for.cond3' has cost = 2560
+define void @Test3(double** %C) {
+entry:
+  br label %for.cond3
+
+for.cond3: ; preds = %for.body, %entry
+  %i.0 = phi i32 [ 40960, %entry ], [ %dec, %for.body ]
+  %cmp = icmp sgt i32 %i.0, 0
+  br i1 %cmp, label %for.body, label %for.end
+
+for.body: ; preds = %for.cond
+  %idxprom = sext i32 %i.0 to i64
+  %arrayidx = getelementptr inbounds double*, double** %C, i64 %idxprom
+  %0 = load double*, double** %arrayidx, align 8
+  %idxprom1 = sext i32 %i.0 to i64
+  %arrayidx2 = getelementptr inbounds double*, double** %C, i64 %idxprom1
+  store double* %0, double** %arrayidx2, align 8
+  %dec = add nsw i32 %i.0, -1
+  br label %for.cond3
+
+for.end: ; preds = %for.cond
+  ret void
+}



+; for (i = 0; i < 40960; i++)
+;   D[i] = D[i];
+
+; CHECK: Loop 'for.cond4' has cost = 2560
+define void @Test4(double** %D) {
+entry:
+  br label %for.cond4
+
+for.cond4: ; preds = %for.body, %entry
+  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
+  %cmp = icmp slt i32 %i.0, 40960
+  br i1 %cmp, label %for.body, label %for.end
+
+for.body: ; preds = %for.cond
+  %idxprom = sext i32 %i.0 to i64
+  %arrayidx = getelementptr inbounds double*, double** %D, i64 %idxprom
+  %0 = load double*, double** %arrayidx, align 8
+  %idxprom1 = sext i32 %i.0 to i64
+  %arrayidx2 = getelementptr inbounds double*, double** %D, i64 %idxprom1
+  store double* %0, double** %arrayidx2, align 8
+  %inc = add nsw i32 %i.0, 1
+  br label %for.cond4
+
+for.end: ; preds = %for.cond
+  ret void
 }