This is an archive of the discontinued LLVM Phabricator instance.

[LoopCacheAnalysis] Consider dimension depth of the subscript reference when calculating cost
ClosedPublic

Authored by bmahjour on Apr 8 2022, 9:18 AM.

Details

Summary

As described in https://reviews.llvm.org/D122776, the current LoopCacheCost analysis is not able to determine a profitable nesting for the outer loops of a nest more than 2 levels deep. For example, consider the first loop in the existing LIT test llvm/test/Analysis/LoopCacheAnalysis/PowerPC/matvecmul.ll. The loop looks like this:

;   for (int k=1;k<nz;++k)
;      for (int j=1;j<ny;++j)
;        for (int i=1;i<nx;++i)
;          for (int l=1;l<nb;++l)
;            for (int m=1;m<nb;++m)
;                 y[k+1][j][i][l] = y[k+1][j][i][l] + b[k][j][i][m][l]*x[k][j][i][m]

and the costs for the k, j and i loops are all calculated to be 30000000000. The problem is that when considering a subject loop as the innermost loop of the nest, if the access pattern is not consecutive, the cost function returns the trip count of that loop as the estimated number of cache lines accessed. If the trip counts are equal (or unknown, in which case we assume a default value of 100), then the costs of the outer loops will be identical. The cost function needs to give more weight to the reference groups that use a function of the loop's IV as a subscript into outer dimensions. This patch does that by multiplying in the trip counts of the loops corresponding to subscripts that come between the subject loop's subscript and the innermost dimension (i.e., for a given reference group, the farther a subscript is from the innermost dimension, the higher the cost of moving the corresponding loop into the innermost position).
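The weighting described above can be sketched as a rough model (hypothetical Python, not the actual LoopCacheAnalysis implementation; the real analysis also divides consecutive accesses by the cache line size and folds in the trip counts of the remaining loops in the nest):

```python
# Hypothetical model of the per-reference-group cost idea in this patch.
# A reference is described by the loop IV used in each subscript,
# outermost dimension first, e.g. y[k][j][i] -> ['k', 'j', 'i'].

def ref_cost(subscript_ivs, trip_counts, candidate):
    """Estimated cache lines touched by one reference group when
    `candidate` is placed in the innermost position."""
    if candidate not in subscript_ivs:
        return 1  # loop-invariant reference: the same line is reused
    dim = subscript_ivs.index(candidate)
    innermost = len(subscript_ivs) - 1
    if dim == innermost:
        # Consecutive access; the real analysis divides this by the
        # cache line size.
        return trip_counts[candidate]
    # Non-consecutive: weight the trip count by the trip counts of the
    # loops whose subscripts lie between this dimension and the
    # innermost one, so outer-dimension strides cost more.
    cost = trip_counts[candidate]
    for d in range(dim + 1, innermost + 1):
        cost *= trip_counts[subscript_ivs[d]]
    return cost

trips = {'k': 100, 'j': 100, 'i': 100}
# For a reference y[k][j][i]: the farther a subscript is from the
# innermost dimension, the costlier it is to move its loop innermost.
costs = {L: ref_cost(['k', 'j', 'i'], trips, L) for L in 'kji'}
assert costs['k'] > costs['j'] > costs['i']
```

With equal trip counts, the outer loops now get distinguishable costs rather than the identical values described above.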

Diff Detail

Event Timeline

bmahjour created this revision. · Apr 8 2022, 9:18 AM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 8 2022, 9:18 AM
bmahjour requested review of this revision. · Apr 8 2022, 9:18 AM
etiotto accepted this revision. · Apr 11 2022, 7:22 AM

LGTM

This revision is now accepted and ready to land. · Apr 11 2022, 7:22 AM
amehsan added a subscriber: amehsan. Edited · Apr 11 2022, 8:12 AM

Hi Bardia

Thanks for your work on this. While I keep an eye on the work on loop interchange, I don't have enough of the details in mind to give a detailed review here. Just two things occurred to me that may be worth mentioning:

  1. Boiling down various parameters to one final "cost" value has a disadvantage. In fact, we discussed this with @congzhe as the first solution, but we discarded it because you essentially lose some information when you boil everything down to one number. The reason Congzhe decided to go with a two-component cost was that the extra detail carries more useful information and helps make more accurate decisions. Now, this was a theoretical concern and I am not sure whether it holds in practice, but that is the background on Congzhe's work.
  2. Wouldn't it be more appropriate to continue the discussion on Congzhe's patch rather than posting a parallel patch to address the same problem? I think that would be more constructive, given that Congzhe has already identified the problem, investigated some solutions, and developed a patch. I suggest we continue the discussion as review comments on his patch. Once there is agreement on the cost function, Congzhe can update the patch (assuming that is needed) to reflect the agreed-upon solution.
  1. Boiling down various parameters to one final "cost" value has a disadvantage. In fact, we discussed this with @congzhe as the first solution, but we discarded it because you essentially lose some information when you boil everything down to one number. The reason Congzhe decided to go with a two-component cost was that the extra detail carries more useful information and helps make more accurate decisions. Now, this was a theoretical concern and I am not sure whether it holds in practice, but that is the background on Congzhe's work.

The cost returned by the analysis must consider the impact of the outer-dimension strides incurred by each reference group; otherwise it will not be accurate. If, for whatever reason, we need extra details in the future, we could provide extra interfaces to the analysis to query those details.

  2. Wouldn't it be more appropriate to continue the discussion on Congzhe's patch rather than posting a parallel patch to address the same problem? I think that would be more constructive, given that Congzhe has already identified the problem, investigated some solutions, and developed a patch. I suggest we continue the discussion as review comments on his patch. Once there is agreement on the cost function, Congzhe can update the patch (assuming that is needed) to reflect the agreed-upon solution.

I explained my reasoning for needing a different approach in D122776. I just figured it would be a lot easier to show what I mean by posting a new patch instead of trying to describe it as a change request on top of D122776. I'm happy to continue further discussion on this wherever seems more appropriate to you or other reviewers.

I explained my reasoning for needing a different approach in D122776. I just figured it would be a lot easier to show what I mean by posting a new patch instead of trying to describe it as a change request on top of D122776. I'm happy to continue further discussion on this wherever seems more appropriate to you or other reviewers.

Thanks Bardia. Since this patch already has an LGTM, perhaps the easiest thing is to merge it. Just to clarify my comment: first, I want to appreciate your input to the discussions and your effort in reviewing the patches, looking into papers, and in general the feedback that you provide. I think it is more encouraging, once a patch is under discussion, to avoid creating parallel patches, in recognition of the effort that has gone into identifying the problem, investigating different solutions, and creating the initial patch. That is not to imply that reviewing a patch is an easy task; in some cases, as with what you have done, it involves reading a paper or some other significant contribution. But as long as the author of the original patch is active and receptive to feedback, I don't see a reason for parallel patches. I am open to alternative opinions from others.

congzhe accepted this revision. Edited · Apr 12 2022, 5:59 PM
  1. Boiling down various parameters to one final "cost" value has a disadvantage. In fact, we discussed this with @congzhe as the first solution, but we discarded it because you essentially lose some information when you boil everything down to one number. The reason Congzhe decided to go with a two-component cost was that the extra detail carries more useful information and helps make more accurate decisions. Now, this was a theoretical concern and I am not sure whether it holds in practice, but that is the background on Congzhe's work.

The cost returned by the analysis must consider the impact of the outer-dimension strides incurred by each reference group; otherwise it will not be accurate. If, for whatever reason, we need extra details in the future, we could provide extra interfaces to the analysis to query those details.

  2. Wouldn't it be more appropriate to continue the discussion on Congzhe's patch rather than posting a parallel patch to address the same problem? I think that would be more constructive, given that Congzhe has already identified the problem, investigated some solutions, and developed a patch. I suggest we continue the discussion as review comments on his patch. Once there is agreement on the cost function, Congzhe can update the patch (assuming that is needed) to reflect the agreed-upon solution.

I explained my reasoning for needing a different approach in D122776. I just figured it would be a lot easier to show what I mean by posting a new patch instead of trying to describe it as a change request on top of D122776. I'm happy to continue further discussion on this wherever seems more appropriate to you or other reviewers.

Thanks Bardia for this work; I think the approach in this patch does resolve the motivating problem described in D122776. Nevertheless, I would also like to clarify that my approach in D122776 does consider the outer-dimension strides incurred by each reference group (I should clarify this in D122776, though). Currently, loop cache analysis sums the costs of all reference groups for each loop to get the final cost (which is what your patch does as well). What I did was take the stride into account as a second component and, for each loop, take the maximum stride over all reference groups as the final stride, which presumably could resolve the motivating problem too.

After you land this patch, I hope I can get the test case in D122776 merged, since that is really the motivating test for this work. I could update the "CHECK: " lines according to the approach proposed in this patch, and update D122776 to a pure NFC patch that includes only that test. I look forward to your thoughts about it :)

What I did was take the stride into account as a second component and, for each loop, take the maximum stride over all reference groups as the final stride, which presumably could resolve the motivating problem too.

Treating stride as a secondary component is what I respectfully objected to and explained earlier. I'm not sure taking the maximum stride would give us what we need. For example, consider:

for (i)
  for (j)
    for (k)
       ... A[i][j][k]
       ... B[i][k][j]
       ... C[i][k][j]

the maximum stride will be the same for both the i-j-k and i-k-j configurations (despite the second one being more profitable), bringing us back to the original problem.

After you land this patch, I hope I can get the test case in D122776 merged, since that is really the motivating test for this work. I could update the "CHECK: " lines according to the approach proposed in this patch, and update D122776 to a pure NFC patch that includes only that test. I look forward to your thoughts about it :)

Doesn't llvm/test/Analysis/LoopCacheAnalysis/PowerPC/single-store.ll provide the same test coverage? Note that the analysis is not sensitive to the order of the loops within the loop nest, as it considers all permutations regardless of the original order.

What I did was take the stride into account as a second component and, for each loop, take the maximum stride over all reference groups as the final stride, which presumably could resolve the motivating problem too.

Treating stride as a secondary component is what I respectfully objected to and explained earlier. I'm not sure taking the maximum stride would give us what we need. For example, consider:

for (i)
  for (j)
    for (k)
       ... A[i][j][k]
       ... B[i][k][j]
       ... C[i][k][j]

the maximum stride will be the same for both the i-j-k and i-k-j configurations (despite the second one being more profitable), bringing us back to the original problem.

IMHO, for this case the cost of loop-k would be higher than that of loop-j (remember that we compare the cost first and then the stride), so loop cache analysis does output the i-k-j pattern.

After you land this patch, I hope I can get the test case in D122776 merged, since that is really the motivating test for this work. I could update the "CHECK: " lines according to the approach proposed in this patch, and update D122776 to a pure NFC patch that includes only that test. I look forward to your thoughts about it :)

Doesn't llvm/test/Analysis/LoopCacheAnalysis/PowerPC/single-store.ll provide the same test coverage? Note that the analysis is not sensitive to the order of the loops within the loop nest, as it considers all permutations regardless of the original order.

The test case in D122776 is the one that really shows the impact of our work, which is why I developed that test. Without our work, loop cache analysis would fail that test -- it would output the loop vector as [j, i, k], which is not the optimal access pattern.

The current llvm/test/Analysis/LoopCacheAnalysis/PowerPC/single-store.ll (and the tests updated in this patch) do not expose the problem we are working on. The current loop cache analysis already outputs the optimal access pattern for those tests, which makes it less clear why we want to improve loop cache analysis.

What I did was take the stride into account as a second component and, for each loop, take the maximum stride over all reference groups as the final stride, which presumably could resolve the motivating problem too.

Treating stride as a secondary component is what I respectfully objected to and explained earlier. I'm not sure taking the maximum stride would give us what we need. For example, consider:

for (i)
  for (j)
    for (k)
       ... A[i][j][k]
       ... B[i][k][j]
       ... C[i][k][j]

the maximum stride will be the same for both the i-j-k and i-k-j configurations (despite the second one being more profitable), bringing us back to the original problem.

IMHO, for this case the cost of loop-k would be higher than that of loop-j (remember that we compare the cost first and then the stride), so loop cache analysis does output the i-k-j pattern.

Sorry, I made a mistake in the example above. I meant to consider this example:

for (i)
  for (j)
    for (k)
       ... A[i][j][k]
       ... B[j][i][k]
       ... C[j][i][k]

Here the optimal order is j-i-k, but if we take the maximum among all reference groups, we'll end up with the same value for both the i-j-k and j-i-k configurations. With this patch, the j-loop will have a cost larger than the i-loop, and we get the optimal permutation:

Loop 'j' has cost = 201000000
Loop 'i' has cost = 102000000
Loop 'k' has cost = 90000
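The contrast can be made concrete with a small hypothetical model (all helper names, the element size, and the row-major layout are illustrative assumptions, not the analysis's actual code): a max-stride tie-break cannot separate i from j for this example, while a depth-weighted summed cost can.

```python
# Illustrative model only. References: A[i][j][k], B[j][i][k], C[j][i][k],
# assumed row-major N x N x N arrays, all trip counts equal to N.
N = 100
ELEM = 8  # assumed element size in bytes
refs = [list("ijk"), list("jik"), list("jik")]

def stride(subscripts, loop):
    """Byte stride induced by `loop` for one reference."""
    dim = subscripts.index(loop)
    return ELEM * N ** (len(subscripts) - 1 - dim)

# Max stride over all reference groups ties for i and j: each of them
# indexes the outermost dimension of some reference.
assert max(stride(r, 'i') for r in refs) == max(stride(r, 'j') for r in refs)

def weighted_cost(subscripts, trips, loop):
    """Trip count weighted by the trip counts of the dimensions between
    this subscript and the innermost one (the idea of this patch)."""
    dim = subscripts.index(loop)
    cost = trips[loop]
    for d in range(dim + 1, len(subscripts)):
        cost *= trips[subscripts[d]]
    return cost

trips = {'i': N, 'j': N, 'k': N}
total = lambda L: sum(weighted_cost(r, trips, L) for r in refs)
# Summing the weighted costs separates the outer loops: j is costlier
# than i, so the optimal permutation j-i-k is recovered.
assert total('j') > total('i') > total('k')
```

In this toy model total('j') is 2010000 and total('i') is 1020000; scaled by the remaining loop's trip count of 100, these line up with the 201000000 and 102000000 figures quoted above.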

After you land this patch, I hope I can get the test case in D122776 merged, since that is really the motivating test for this work. I could update the "CHECK: " lines according to the approach proposed in this patch, and update D122776 to a pure NFC patch that includes only that test. I look forward to your thoughts about it :)

Doesn't llvm/test/Analysis/LoopCacheAnalysis/PowerPC/single-store.ll provide the same test coverage? Note that the analysis is not sensitive to the order of the loops within the loop nest, as it considers all permutations regardless of the original order.

The test case in D122776 is the one that really shows the impact of our work, which is why I developed that test. Without our work, loop cache analysis would fail that test -- it would output the loop vector as [j, i, k], which is not the optimal access pattern.

The current llvm/test/Analysis/LoopCacheAnalysis/PowerPC/single-store.ll (and the tests updated in this patch) do not expose the problem we are working on. The current loop cache analysis already outputs the optimal access pattern for those tests, which makes it less clear why we want to improve loop cache analysis.

The current loop cache analysis outputs the loops in the correct order by luck (because it maintains the original breadth-first order, and that order just happens to be the optimal one), but it outputs the same cost value for the two outer loops, which is the root problem! By ensuring that a correct and distinguishable cost is associated with each loop, we also ensure that the optimal order is maintained. I do see your point in wanting to make sure the sort order is correct, but in that case we probably want to use CHECK-NEXT instead of CHECK-DAG for your test.

The current loop cache analysis outputs the loops in the correct order by luck (because it maintains the original breadth-first order, and that order just happens to be the optimal one), but it outputs the same cost value for the two outer loops, which is the root problem! By ensuring that a correct and distinguishable cost is associated with each loop, we also ensure that the optimal order is maintained. I do see your point in wanting to make sure the sort order is correct, but in that case we probably want to use CHECK-NEXT instead of CHECK-DAG for your test.

Sure, I will update the test case in D122776 to use CHECK-NEXT instead of CHECK-DAG, and update the cost numbers therein according to your improvement in this patch. Thanks again for all the discussions!

bmahjour updated this revision to Diff 423938. Edited · Apr 20 2022, 9:39 AM

Added a test for the multi-store case I mentioned above.

Hi Bardia @bmahjour, my apologies that I don't clearly remember the discussion during the loopopt meeting -- did you want me to commit this patch for you, or did you want me to keep an eye on it after you commit it?

bmahjour updated this revision to Diff 426460. · May 2 2022, 11:55 AM

Sorry for the delay; I didn't get a chance to commit this before I went on vacation.

I just noticed that after the fixed-size delinearization commit went in, this patch needed to be rebased and new values needed to be generated for the newly added LoopnestFixedSize.ll. I also noticed that this test exposes an issue with the analysis producing large cost values that can overflow the underlying int64_t type. This problem is not specific to this patch, as described in https://github.com/llvm/llvm-project/issues/55233. I think the overflow problem should be dealt with as a separate issue, and I will add it to the agenda for the next LoopOptWG call. In the meantime I've changed LoopnestFixedSize.ll to use smaller trip counts. @congzhe please let me know if you are ok with this change. We should recreate a test with large sizes as part of fixing 55233.
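The overflow concern is easy to reproduce with back-of-the-envelope numbers (the trip count and nest depth below are hypothetical, chosen only to illustrate the magnitude): an outer loop's cost multiplies several trip counts together, so the product can exceed the int64_t range used to accumulate costs.

```python
# Illustrative only: a product of a few large trip counts overflows the
# signed 64-bit range that the analysis uses for cost values.
INT64_MAX = 2**63 - 1

trip_count = 100_000   # hypothetical per-loop trip count
nest_depth = 4         # hypothetical nest depth
cost = trip_count ** nest_depth  # 10**20, fine in Python's big ints

assert cost > INT64_MAX  # would wrap around in int64_t arithmetic
```

Python integers are arbitrary-precision, so the sketch only demonstrates the magnitude; in C++ the same product would silently wrap.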

Sorry for the delay; I didn't get a chance to commit this before I went on vacation.

I just noticed that after the fixed-size delinearization commit went in, this patch needed to be rebased and new values needed to be generated for the newly added LoopnestFixedSize.ll. I also noticed that this test exposes an issue with the analysis producing large cost values that can overflow the underlying int64_t type. This problem is not specific to this patch, as described in https://github.com/llvm/llvm-project/issues/55233. I think the overflow problem should be dealt with as a separate issue, and I will add it to the agenda for the next LoopOptWG call. In the meantime I've changed LoopnestFixedSize.ll to use smaller trip counts. @congzhe please let me know if you are ok with this change. We should recreate a test with large sizes as part of fixing 55233.

Thanks, I agree that we can merge this patch and then resolve the overflow issue.

This revision was landed with ongoing or failed builds. · May 2 2022, 1:51 PM
This revision was automatically updated to reflect the committed changes.