This is an archive of the discontinued LLVM Phabricator instance.

Differential D21124

Cache aware Loop Cost Analysis
Needs Revision · Public

Authored by tvvikram on Jun 8 2016, 12:57 AM.

Details

Reviewers

anemet
amehsan
hfinkel
jfb

Summary

Implement an analysis pass that calculates loop cost based on cache data.

The patch groups memory references that would lie in the same cache line. Each group is then analysed with each loop of the nest considered in turn as the innermost loop. The penalty for a reference is:
a. 1, if the reference is invariant with the innermost loop,
b. TripCount for non-unit stride access,
c. TripCount / CacheLineSize for a unit-stride access.
Loop Cost is then calculated as the sum of the reference penalties times the product of the loop bounds of the outer loops. This loop cost can then be used as a profitability measure for cache reuse related optimizations. This is just a brief description; please refer to http://www.cs.utexas.edu/users/mckinley/papers/asplos-1994.pdf for the details.
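
The penalty rules above can be sketched in a few lines of standalone C++. This is only an illustration of the formula as stated in the summary, not code from the patch: the names `AccessKind`, `referencePenalty` and `loopCost` are hypothetical, and each reference is assumed to have already been classified as invariant, unit-stride or non-unit-stride.

```cpp
#include <cassert>
#include <vector>

// Hypothetical classification of one reference group with respect to the
// loop being treated as innermost; these names are not taken from the patch.
enum class AccessKind { Invariant, UnitStride, NonUnitStride };

// Penalty of one reference group, following cases a-c of the summary.
double referencePenalty(AccessKind Kind, double TripCount,
                        double CacheLineSize) {
  assert(CacheLineSize > 0 && "cache line size must be positive");
  switch (Kind) {
  case AccessKind::Invariant:
    return 1.0;
  case AccessKind::UnitStride:
    return TripCount / CacheLineSize;
  case AccessKind::NonUnitStride:
    return TripCount;
  }
  return TripCount; // Not reached; keeps compilers quiet.
}

// Loop cost = (sum of reference penalties) * (product of the trip counts of
// the other loops in the nest).
double loopCost(const std::vector<AccessKind> &Refs, double TripCount,
                const std::vector<double> &OtherTripCounts,
                double CacheLineSize) {
  double Sum = 0.0;
  for (AccessKind Kind : Refs)
    Sum += referencePenalty(Kind, TripCount, CacheLineSize);
  double Outer = 1.0;
  for (double TC : OtherTripCounts)
    Outer *= TC;
  return Sum * Outer;
}
```

As a worked case, take a row-major matrix multiply `C[i][j] += A[i][k] * B[k][j]` with an assumed trip count of 100 for every loop and a cache line of 4 words. With `k` innermost the references are invariant, unit-stride and non-unit-stride, giving (1 + 25 + 100) * 100 * 100 = 1,260,000. With `j` innermost they are unit-stride, invariant and unit-stride, giving (25 + 1 + 25) * 100 * 100 = 510,000, so under this model `j` is the cheaper innermost loop.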

Current drawbacks:
a. CacheLineSize is set statically.
b. Only perfect nests are handled.
c. Only a single basic block of the innermost loop is considered.
d. Register data and other cost-related information are not yet modelled.
e. References whose strides are <= CacheLineSize (CLS) are placed in the same reference group.

Diff Detail

Event Timeline

tvvikram updated this revision to Diff 60008. (Jun 8 2016, 12:57 AM)

tvvikram retitled this revision from to Cache aware Loop Cost Analysis.

tvvikram updated this object.

Herald added subscribers: mzolotukhin, sanjoy. (Jun 8 2016, 12:57 AM)

tvvikram added a subscriber: llvm-commits. (Jun 8 2016, 1:00 AM)

mssimpso added a subscriber: mssimpso. (Jun 8 2016, 4:33 AM)

tvvikram updated this object. (Jun 9 2016, 9:36 AM)

aprantl added a subscriber: aprantl. (Jun 9 2016, 1:01 PM)

tvvikram added reviewers: hfinkel, amehsan, jfb, anemet. (Jun 28 2016, 9:39 AM)

suyog added a subscriber: suyog. (Jun 29 2016, 2:21 PM)

Quick review, nothing technical.

include/llvm/Analysis/LoopCostAnalysis.h
43	Typo "resue"
46	Typo "implemenation"
50	`enum class` is nicer IMO. You can also rename it to Way instead of CacheWay.
67	This ctor isn't used. It looks like you could get rid of the default ctor and initCacheData if you have access to the target at construction time (do you? I'm not sure, it would be good to wire that up). If you keep this ctor, it should be written as: CacheData(unsigned LineSize, unsigned CacheSize, CacheData::CacheWay Associativity) : LineSize(LineSize), CacheSize(CacheSize), Associativity(Associativity) {} You can't use identifiers starting with `_[A-Z]`, and you can reuse the same name.
77	The three setters above aren't used, and don't seem useful given that you have initCacheData.
82	Could you address this TODO? That's the part I'm most interested in :) That could be a separate patch if you want to keep things minimal.
91	LLVM doesn't usually name things with "_t".
97	enum class here as well.
100	Column major is never used in the code. Is it going to be added later?
116	Weird format here. Could you run new code through `clang-format`?
116	Wouldn't `float` be sufficient for this purpose?
lib/Analysis/LoopCostAnalysis.cpp
22	?
35	Use a range-based for loop here. There are other places below where you should also use them.
161	Use a `constexpr` for this, or a `cl::opt` if you want to be able to set it from the command line.
165	You can `push_back({L, TripCount})`
193	"its access"
203	Delete?
204	Naming isn't consistent with other parts for the LLVM code (`I` and `BB`), which you seem to try to follow in general (e.g. `GEP`).
test/Analysis/CostModel/X86/matmul-perfect-loopCost.ll
23	Is your algorithm stable enough that every platform will produce exactly the same `double` result? You may want to print out the double differently so it's easier to ignore bits of lower significance.
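
For readers following the style comments above, here is a minimal sketch of how `CacheData` might look with the suggestions applied: an `enum class` named `Way`, a constructor using a member initializer list with parameters that reuse the member names, no unused setters, and a typed constant in place of the `STATIC_TRIP_COUNT` macro. It is an illustration only, not a later revision of the patch; the enumerator spellings and the name `StaticTripCount` are assumptions.

```cpp
#include <cassert>

// Sketch of CacheData with the review suggestions applied. Illustration
// only; enumerator names are hypothetical.
class CacheData {
public:
  enum class Way { Direct, Way2, Way4, Way8, Full };

  // Parameters reuse the member names; no identifiers of the form _[A-Z].
  CacheData(unsigned LineSize, unsigned CacheSize, Way Associativity)
      : LineSize(LineSize), CacheSize(CacheSize),
        Associativity(Associativity) {
    assert(LineSize != 0 && "cache line size must be non-zero");
  }

  unsigned getLineSize() const { return LineSize; }
  unsigned getCacheSize() const { return CacheSize; }
  Way getAssociativity() const { return Associativity; }

private:
  unsigned LineSize;  // The number of words in a cache line.
  unsigned CacheSize; // The size of the cache.
  Way Associativity;
};

// Typed replacement for the STATIC_TRIP_COUNT macro (hypothetical name).
constexpr unsigned StaticTripCount = 1000;
```

A caller would then write something like `CacheData Cache(4, 32768, CacheData::Way::Way8);`, where the numbers are placeholders rather than values taken from any target.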

In D21124#470684, @jfb wrote:

Quick review, nothing technical.

Thanks! I will address your comments; a few inline replies follow.

include/llvm/Analysis/LoopCostAnalysis.h
82	I will work on this and submit a separate patch.
91	"LoopNestType" is better?
100	It is used in LoopCostAnalysis.cpp:334. Can GEP represent column major accesses? If ColumnMajor producers like Fortran frontend convert column major accesses to row major accesses, then the entire enum Order will not be necessary.
116	For a loop with trip count 10^6 and a deep nesting of say 5 level nest, the algorithm computes a penalty of (10^6)^5 = 10^30 for the outermost level loop. Since it can easily reach FLOAT_MAX, I had kept it double.
lib/Analysis/LoopCostAnalysis.cpp
22	LoopCost computation works on loop nests. Some of the code below populates perfect nests from LoopInfo. FWIW, I think it can be moved to LoopUtils, so that other optimizations can reuse it.
test/Analysis/CostModel/X86/matmul-perfect-loopCost.ll
23	LSBs can vary, I guess. Will shorten to 4 decimal places.
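
The `float` versus `double` point can be checked with a small standalone sketch. It assumes IEEE-754 single and double precision, and the helper names `nestPenalty` and `fitsIn` are hypothetical. A penalty of 10^30 is still below the largest finite `float` (about 3.4 * 10^38), so the 5-level case fits, but two more levels at the same trip count overflow `float` while `double` has ample headroom.

```cpp
#include <cmath>
#include <limits>

// Product of identical trip counts over a nest of the given depth,
// accumulated in the requested floating-point type.
template <typename T> T nestPenalty(T TripCount, unsigned Depth) {
  T Penalty = 1;
  for (unsigned I = 0; I != Depth; ++I)
    Penalty *= TripCount;
  return Penalty;
}

// True if the accumulated penalty is still a finite value of type T.
template <typename T> bool fitsIn(T TripCount, unsigned Depth) {
  return std::isfinite(nestPenalty<T>(TripCount, Depth));
}
```

For example, `fitsIn<float>(1e6f, 5)` holds, `fitsIn<float>(1e6f, 7)` does not, and `fitsIn<double>(1e6, 7)` does.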

aemerson added a subscriber: aemerson. (Aug 23 2016, 8:45 AM)

Vikram, are you still interested in working on this?

In D21124#568350, @hfinkel wrote:

Vikram, are you still interested in working on this?

Yes. I will update the patch in a few days.

Subsumed by https://llvm.org/docs/CommandGuide/llvm-mca.html ? Can we close this?

This revision now requires changes to proceed. (May 7 2018, 10:14 PM)

Herald added a subscriber: mgorny. (May 7 2018, 10:14 PM)

llvm-mca doesn't subsume this patch.

In D21124#1091093, @tvvikram wrote:

llvm-mca doesn't subsume this patch.

Fair. Is this patch going somewhere, or is it abandoned?

In D21124#1091393, @jfb wrote:

In D21124#1091093, @tvvikram wrote:

llvm-mca doesn't subsume this patch.

Fair. Is this patch going somewhere, or is it abandoned?

I am interested (but someone should fund!).

Doesn't sound like I should expect progress for now. Happy to help review if something changes.

Revision Contents

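Path	Size
include/llvm/Analysis/LoopCostAnalysis.h	159 lines
include/llvm/Analysis/Passes.h	7 lines
include/llvm/InitializePasses.h	1 line
include/llvm/LinkAllPasses.h	1 line
lib/Analysis/Analysis.cpp	1 line
lib/Analysis/CMakeLists.txt	1 line
lib/Analysis/LoopCostAnalysis.cpp	382 lines
test/Analysis/CostModel/X86/matmul-perfect-loopCost.ll	102 lines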

Diff 60008

include/llvm/Analysis/LoopCostAnalysis.h

This file was added.

				//===----- LoopCostAnalysis.h - Analyse a loop for its cost ---------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file implements a set of routines that does the loop cost analysis.
				//
				// Prerequisite Reading:
				// Compiler Optimizations for Improving Data Locality [Carr-McKinley-Tseng]

				// High level details:
				// Look through the memory accesses and create groups of references such that
				// two references fall into different groups if they are accessed in different
				// cache lines. Each group is then analysed with respect to innermost loops
				// considering cache lines.

				// Penalty for the reference is:
				// a. 1 if the reference is invariant with the innermost loop,
				// b. TripCount for non-unit stride access,
				// c. TripCount/CacheLineSize for a unit-stride access.

				// Loop cost is the sum total of the reference penalties times the product of
				// the loop bounds of the outer loops.

				//===----------------------------------------------------------------------===//

				#include "llvm/Pass.h"
				#include "llvm/Analysis/LoopPass.h"
				#include "llvm/Analysis/ScalarEvolution.h"
				#include "llvm/Analysis/ScalarEvolutionExpressions.h"
				#include "llvm/Support/raw_ostream.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/ADT/SmallVector.h"
				#include "llvm/IR/Value.h"
				#include "llvm/IR/Instructions.h"

				namespace llvm {

				// This data is mainly used for cache resue (spatial/temporal) aware
				[Inline comment] jfb: Typo "resue"
				// calculations.
				// TODO: This should probably go to a separate file with separate
				// implemenation if the size grows.
				[Inline comment] jfb: Typo "implemenation"
				class CacheData {
				unsigned LineSize; // The number of words in a cache line.
				unsigned CacheSize; // The size of cache.
				enum CacheWay {
				[Inline comment] jfb: `enum class` is nicer IMO. You can also rename it to Way instead of CacheWay.
				DIRECT,
				WAY2,
				WAY4,
				WAY8,
				FULL
				} Associativity;

				// TODO: Need to add other cache details as required.

				public:
				CacheData() {}

				CacheData(unsigned _LS, unsigned _CS, CacheData::CacheWay _Assoc) {
				LineSize = _LS;
				CacheSize = _CS;
				Associativity = _Assoc;
				}
				[Inline comment] jfb: This ctor isn't used. It looks like you could get rid of the default ctor and initCacheData if you have access to the target at construction time (do you? I'm not sure, it would be good to wire that up). If you keep this ctor, it should be written as: CacheData(unsigned LineSize, unsigned CacheSize, CacheData::CacheWay Associativity) : LineSize(LineSize), CacheSize(CacheSize), Associativity(Associativity) {} You can't use identifiers starting with `_[A-Z]`, and you can reuse the same name.

				// Get/set member functions.
				void setLineSize(unsigned Size) { LineSize = Size; }
				unsigned getLineSize() { return LineSize; }

				void setCacheSize(unsigned Size) { CacheSize = Size; }
				unsigned getCacheSize() { return CacheSize; }

				void setAssociativity(CacheData::CacheWay Assoc) { Associativity = Assoc; }
				CacheData::CacheWay getAssociativity() { return Associativity; }
				[Inline comment] jfb: The three setters above aren't used, and don't seem useful given that you have initCacheData.

				void initCacheData() {
				// TODO: Get the cache data from the target architecture.
				// Set default values for generic data layout.
				LineSize = 4; // Statically setting for now.
				[Inline comment] jfb: Could you address this TODO? That's the part I'm most interested in :) That could be a separate patch if you want to keep things minimal.
				[Inline comment] tvvikram (author): I will work on this and submit a separate patch.
				}
				};

				// TODO: Add a class RegData that will be mainly used for reg reuse aware
				// calculations.

				// Ordered list of loops from outer loop to the innermost loop that form the
				// loop nest.
				typedef SmallVector<Loop *, 2> LoopNest_t;
				[Inline comment] jfb: LLVM doesn't usually name things with "_t".
				[Inline comment] tvvikram (author): "LoopNestType" is better?

				// This is a utility class that can be used to calculate cache aware loop costs
				// for a perfectly nested loop nest.
				class LoopCost {
				public:
				enum Order {
				[Inline comment] jfb: enum class here as well.
				COLUMNMAJOR = 0,
				ROWMAJOR
				} AccessOrder;
				[Inline comment] jfb: Column major is never used in the code. Is it going to be added later?
				[Inline comment] tvvikram (author): It is used in LoopCostAnalysis.cpp:334. Can GEP represent column major accesses? If ColumnMajor producers like Fortran frontend convert column major accesses to row major accesses, then the entire enum Order will not be necessary.

				private:
				CacheData Cache;
				ScalarEvolution *SCEV;

				/// List of loops in loopnest with their tripcounts.
				typedef std::pair<Loop*, unsigned> loopTripcount;
				SmallVector<loopTripcount, 2> LoopTripCounts;
				void setTripCounts(LoopNest_t LN);

				/// The reference groups - directly noting the GEPs instead of load/stores.
				SmallVector<GetElementPtrInst*, 2> ReferenceGroups;
				void CreateReferenceGroups(BasicBlock *bb);

				/// The loops and their calculated loop costs.
				std::map <Loop *, double> LoopCosts;
				[Inline comment] jfb: Weird format here. Could you run new code through `clang-format`?
				[Inline comment] jfb: Wouldn't `float` be sufficient for this purpose?
				[Inline comment] tvvikram (author): For a loop with trip count 10^6 and a deep nesting of say 5 level nest, the algorithm computes a penalty of (10^6)^5 = 10^30 for the outermost level loop. Since it can easily reach FLOAT_MAX, I had kept it double.

				bool ASTMatch(Value *Operand, PHINode *P);

				public:
				LoopCost(ScalarEvolution *_SCEV)
				: SCEV(_SCEV) {
				Cache.initCacheData();
				AccessOrder = ROWMAJOR;
				}

				// Given a perfect nest, calculate loop costs of the loops in the nest.
				void calculateLoopCosts(LoopNest_t LN);

				double getLoopCostOf(Loop *L);

				// Print routines.
				void printLoopCosts();
				void printTripCounts();
				void printReferenceGroups();
				};


				class LoopCostAnalysis : public FunctionPass {
				LoopCost *LC;

				public:
				static char ID;

				LoopCostAnalysis() : FunctionPass(ID) {
				initializeLoopCostAnalysisPass(*PassRegistry::getPassRegistry());
				}

				const LoopCost &getLoopCosts() const { return *LC; }

				bool runOnFunction(Function &F) override;

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequired<LoopInfoWrapperPass>();
				AU.addRequired<ScalarEvolutionWrapperPass>();
				AU.setPreservesAll();
				}
				};
				}

include/llvm/Analysis/Passes.h

(68 preceding lines not shown; context: namespace llvm {)
  //===--------------------------------------------------------------------===//
  //
  // Minor pass prototypes, allowing us to expose them through bugpoint and
  // analyze.
  FunctionPass *createInstCountPass();

  //===--------------------------------------------------------------------===//
  //
+ // createLoopCostAnalysisPass - This pass assigns a numerical cost to each
+ // loop in the loop nest considering cache and register data.
+ //
+ FunctionPass *createLoopCostAnalysisPass();
+
+ //===--------------------------------------------------------------------===//
+ //
  // createRegionInfoPass - This pass finds all single entry single exit regions
  // in a function and builds the region hierarchy.
  //
  FunctionPass *createRegionInfoPass();

  // Print module-level debug info metadata in human-readable form.
  ModulePass *createModuleDebugInfoPrinterPass();

(17 following lines not shown)

include/llvm/InitializePasses.h

(116 preceding lines not shown)
  void initializeDomOnlyViewerPass(PassRegistry&);
  void initializeDomPrinterPass(PassRegistry&);
  void initializeDomViewerPass(PassRegistry&);
  void initializeDominanceFrontierWrapperPassPass(PassRegistry&);
  void initializeDominatorTreeWrapperPassPass(PassRegistry&);
  void initializeEarlyIfConverterPass(PassRegistry&);
  void initializeEdgeBundlesPass(PassRegistry&);
  void initializeExpandPostRAPass(PassRegistry&);
+ void initializeLoopCostAnalysisPass(PassRegistry&);
  void initializeAAResultsWrapperPassPass(PassRegistry &);
  void initializeGCOVProfilerLegacyPassPass(PassRegistry&);
  void initializePGOInstrumentationGenLegacyPassPass(PassRegistry&);
  void initializePGOInstrumentationUseLegacyPassPass(PassRegistry&);
  void initializePGOIndirectCallPromotionLegacyPassPass(PassRegistry&);
  void initializeInstrProfilingLegacyPassPass(PassRegistry &);
  void initializeAddressSanitizerPass(PassRegistry&);
  void initializeAddressSanitizerModulePass(PassRegistry&);
(212 following lines not shown)

include/llvm/LinkAllPasses.h

(103 preceding lines not shown; context: ForcePassLinking() {)
  (void) llvm::createIPSCCPPass();
  (void) llvm::createInductiveRangeCheckEliminationPass();
  (void) llvm::createIndVarSimplifyPass();
  (void) llvm::createInstructionCombiningPass();
  (void) llvm::createInternalizePass();
  (void) llvm::createLCSSAPass();
  (void) llvm::createLICMPass();
  (void) llvm::createLazyValueInfoPass();
+ (void) llvm::createLoopCostAnalysisPass();
  (void) llvm::createLoopExtractorPass();
  (void) llvm::createLoopInterchangePass();
  (void) llvm::createLoopSimplifyPass();
  (void) llvm::createLoopSimplifyCFGPass();
  (void) llvm::createLoopStrengthReducePass();
  (void) llvm::createLoopRerollPass();
  (void) llvm::createLoopUnrollPass();
  (void) llvm::createLoopUnswitchPass();
(94 following lines not shown)

lib/Analysis/Analysis.cpp

(39 preceding lines not shown; context: void llvm::initializeAnalysis(PassRegistry &Registry) {)
  initializeDemandedBitsWrapperPassPass(Registry);
  initializeDivergenceAnalysisPass(Registry);
  initializeDominanceFrontierWrapperPassPass(Registry);
  initializeDomViewerPass(Registry);
  initializeDomPrinterPass(Registry);
  initializeDomOnlyViewerPass(Registry);
  initializePostDomViewerPass(Registry);
  initializeDomOnlyPrinterPass(Registry);
+ initializeLoopCostAnalysisPass(Registry);
  initializePostDomPrinterPass(Registry);
  initializePostDomOnlyViewerPass(Registry);
  initializePostDomOnlyPrinterPass(Registry);
  initializeAAResultsWrapperPassPass(Registry);
  initializeGlobalsAAWrapperPassPass(Registry);
  initializeIVUsersPass(Registry);
  initializeInstCountPass(Registry);
  initializeIntervalPartitionPass(Registry);
(71 following lines not shown)

lib/Analysis/CMakeLists.txt

(34 preceding lines not shown; context: add_llvm_library(LLVMAnalysis)
  Interval.cpp
  IntervalPartition.cpp
  IteratedDominanceFrontier.cpp
  LazyCallGraph.cpp
  LazyValueInfo.cpp
  Lint.cpp
  Loads.cpp
  LoopAccessAnalysis.cpp
+ LoopCostAnalysis.cpp
  LoopUnrollAnalyzer.cpp
  LoopInfo.cpp
  LoopPass.cpp
  LoopPassManager.cpp
  MemDepPrinter.cpp
  MemDerefPrinter.cpp
  MemoryBuiltins.cpp
  MemoryDependenceAnalysis.cpp
(32 following lines not shown)

lib/Analysis/LoopCostAnalysis.cpp

This file was added.

				//===----- LoopCostAnalysis.cpp - Analyse a loop for its cost -------------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				// This file implements a set of routines that does the loop cost analysis.
				// TODO: Extend the Loop Cost calculation to imperfect nests and more than one
				// basicblock in the innermost loop.
				//
				//===----------------------------------------------------------------------===//

				#include "llvm/Analysis/LoopCostAnalysis.h"

				using namespace llvm;

				#define DEBUG_TYPE "loop-cost"

				/* - Build/test perfect loop nests - should probably be moved to Loop Utils - */
				[Inline comment] jfb: ?
				[Inline comment] tvvikram (author): LoopCost computation works on loop nests. Some of the code below populates perfect nests from LoopInfo. FWIW, I think it can be moved to LoopUtils, so that other optimizations can reuse it.
				// TODO: Rotated loops forming perfect nests cannot be relied on and hence not
				// considered for now.
				bool IsRotatedLoop(Loop *L) {
				return (L->getHeader() != L->getExitingBlock()); // Probably rotated loop.
				}

				// Return true if all blocks of @L are perfectly nested under @L's subloop. @L's
				// header and latch are exempted though.
				bool BlocksPerfectlyNestedUnder(Loop *L) {
				assert ((L->getSubLoops().size() == 1) &&
				"Expected loop containing a single subloop!");
				Loop *Subloop = *L->begin();
				for (auto bb = L->block_begin(), bbe = L->block_end(); bb != bbe; ++bb) {
				[Inline comment] jfb: Use a range-based for loop here. There are other places below where you should also use them.
				if (*bb == L->getHeader() || *bb == L->getLoopLatch())
				continue;
				if (!Subloop->contains(*bb)) {
				// Ignore empty blocks.
				auto bbBr = dyn_cast<BranchInst>((*bb)->getTerminator());
				if ((*bb)->size() == 1 && (bbBr && bbBr->isUnconditional()))
				continue;
				return false;
				}
				}
				return true;
				}

				bool hasSimpleHeaderLatch(Loop *L) {
				// TODO: Unrotated loops form perfect nests only if loop's header and latch
				// contain only loop control updates and exit check. Any other instruction is
				// deemed to violate perfect nesting.
				return true;
				}

				bool
				PopulatePerfectLoopNestsUnder(Loop *L,
				SmallVectorImpl<LoopNest_t> &PerfectLoopNests) {
				// Base case: the innermost loop.
				if (L->begin() == L->end()) {
				if (!IsRotatedLoop(L) && hasSimpleHeaderLatch(L)) {
				LoopNest_t LN;
				LN.push_back(L);
				PerfectLoopNests.push_back(LN);
				return true;
				}
				return false;
				}

				// Iterate over subloops.
				bool PerfectSubnest = true;
				for (Loop::iterator i = L->begin(), e = L->end(); i != e; i++)
				PerfectSubnest &= PopulatePerfectLoopNestsUnder(*i, PerfectLoopNests);

				// If perfect and single subnest.
				if (PerfectSubnest && (L->getSubLoops().size() == 1) && !IsRotatedLoop(L) &&
				hasSimpleHeaderLatch(L) && BlocksPerfectlyNestedUnder(L)) {
				// Add this loop to perfect nest.
				LoopNest_t *LN = &PerfectLoopNests[PerfectLoopNests.size() - 1];
				LN->insert(LN->begin(), L);
				return true;
				}
				return false;
				}

				// Check if the loops in @LN form a perfect nest. Loops in @LN should be listed
				// from outermost loop to innermost loop of the loop nest.
				bool IsPerfectNest(LoopNest_t LN) {
				auto l = LN.rbegin(); // Start with innermost loop.
				Loop *Innermost = *l;
				if ((Innermost->begin() != Innermost->end()) || IsRotatedLoop(Innermost) ||
				!hasSimpleHeaderLatch(Innermost))
				return false;

				Loop *Subloop = Innermost;
				++l;
				for (auto le = LN.rend(); l != le; ++l) {
				Loop *L = *l;
				if ((L->getSubLoops().size() != 1) || (*L->begin() != Subloop) ||
				IsRotatedLoop(L) || !hasSimpleHeaderLatch(L) ||
				!BlocksPerfectlyNestedUnder(L))
				return false;
				Subloop = L;
				}
				return true;
				}

				// LoopCost implementation.
				// Print the loop costs of all the loops in the loopnest.
				void LoopCost::printLoopCosts() {
				DEBUG(
				dbgs() << "Printing Loop Costs: ";
				if (LoopCosts.empty())
				dbgs() << "(empty)";
				dbgs() << "\n";
				for (auto lc : LoopCosts) {
				dbgs() << "Loop: " << lc.first->getHeader()->getName();
				dbgs() << "\tCosts: " << lc.second << "\n";
				}
				);
				}

				// Print the reference groups.
				void LoopCost::printReferenceGroups() {
				DEBUG(
				dbgs() << "Printing Reference Groups (GEPs): ";
				if (ReferenceGroups.begin() == ReferenceGroups.end())
				dbgs() << "(empty)";
				dbgs() << "\n";
				for (auto ri : ReferenceGroups)
				dbgs() << "Ref group: " << *ri << "\n";
				);
				}

				// Print the loop trip counts of each of the loops in the loopnest.
				void LoopCost::printTripCounts() {
				DEBUG(
				dbgs() << "Printing Trip Counts: ";
				if (LoopTripCounts.empty())
				dbgs() << "(empty)";
				dbgs() << "\n";
				for (auto ltc : LoopTripCounts) {
				dbgs() << "Loop: " << ltc.first->getHeader()->getName();
				dbgs() << "\tTripCount: " << ltc.second << "\n";
				}
				);
				}

				// Return loop cost of @L. Return -1.0 if loop is not found in map.
				double LoopCost::getLoopCostOf(Loop *L) {
				auto LC = LoopCosts.find(L);
				if (LC == LoopCosts.end())
				return -1.0;
				return LC->second;
				}

				// Set the trip counts for each of the loops in the loopnest. If trip count for
				// a loop is not found, a normalized value based on the surrounding loops is
				// set or statically set to STATIC_TRIP_COUNT.
				void LoopCost::setTripCounts(LoopNest_t LN) {
				#define STATIC_TRIP_COUNT 1000
				[Inline comment] jfb: Use a `constexpr` for this, or a `cl::opt` if you want to be able to set it from the command line.

				for (auto L : LN) {
				unsigned TripCount = SCEV->getSmallConstantTripCount(L, L->getExitingBlock());
				LoopTripCounts.push_back(std::make_pair(L, (TripCount)));
				[Inline comment] jfb: You can `push_back({L, TripCount})`
				}

				// Normalize trip counts: reset trip counts if 0.
				// TODO: Can do better?
				for (auto li = LoopTripCounts.begin(), le = LoopTripCounts.end();
				li != le; li++) {
				unsigned TC = (*li).second;
				if (TC)
				continue;
				// TC is 0, reset to average of the surrounding loop's tripcount.
				auto prev = li, next = li;
				if (li == LoopTripCounts.begin()) {
				++next;
				(*li).second = (*next).second;
				} else if (li == (LoopTripCounts.end() - 1)) {
				--prev;
				(*li).second = (*prev).second;
				} else {
				--prev; ++next;
				(*li).second = ((*prev).second + (*next).second) / 2;
				}
				if ((*li).second == 0) // Still 0? Statically set!
				(*li).second = STATIC_TRIP_COUNT;
				}
				}

				// This routine creates different Reference (GEP) Groups based on GEP's access
				// w.r.t. CacheLine. A GEP access is added to an existing group if it's access
				[Inline comment] jfb: "its access"
				// lies in the same cache line; otherwise this GEP will form a new RefGroup.
				// Groups can be made based on the following:
				// a. A GEP having different number of operands will form a different group.
				// b. Only the last operand of GEP is analysed as that is what matters with the
				// cache line.
				// c. Two GEPs with their last operands - i and i + 4 falls into same RefGroup
				// if SCEV(i) - SCEV(i + 4) is a constant that is less than the cache line.
				// TODO: Add alignment info.
				void LoopCost::CreateReferenceGroups(BasicBlock *bb) {
				assert(bb);
				[Inline comment] jfb: Delete?
				for (auto &i : *bb) {
				[Inline comment] jfb: Naming isn't consistent with other parts for the LLVM code (`I` and `BB`), which you seem to try to follow in general (e.g. `GEP`).
				if (auto *GEP = dyn_cast<GetElementPtrInst>(&i)) {
				bool UniqueGEP = true;
				auto NumOps = GEP->getNumOperands();
				auto LastOp = GEP->getOperand(NumOps - 1);
				if (SCEV->isSCEVable(LastOp->getType())) {
				for (auto RG : ReferenceGroups) {
				auto RGNumOps = RG->getNumOperands();
				if (RGNumOps != NumOps)
				continue;

				// All n - 1 GEP operands must be equal.
				bool EqualOps = true;
				for (unsigned i = 0, e = NumOps - 1; (EqualOps && i != e); ++i)
				if (GEP->getOperand(i) != RG->getOperand(i))
				EqualOps = false;
				if (!EqualOps)
				continue;

				auto RGLastOp = RG->getOperand(RGNumOps - 1);
				if (!SCEV->isSCEVable(RGLastOp->getType()))
				continue;

				// Check if (LastOp - RGLastOp) lies within cache line.
				auto GEPSCEV = SCEV->getSCEV(LastOp);
				auto RGSCEV = SCEV->getSCEV(RGLastOp);
				auto DiffSCEV =
				dyn_cast<SCEVConstant>(SCEV->getMinusSCEV(GEPSCEV, RGSCEV));
				if (DiffSCEV->getValue()->getSExtValue() < Cache.getLineSize()) {
				UniqueGEP = false;
				break;
				}
				}
				}
				if (UniqueGEP)
				ReferenceGroups.push_back(GEP);
				}
				}
				}

				// Return true if the phinode is used in the computation of the operand.
				// TODO: Use SCEVTraversal or some similar thing.
				bool LoopCost::ASTMatch(Value *Operand, PHINode *P) {
				if (Operand == P)
				return true;

				auto I = dyn_cast<Instruction>(Operand);
				if (!I || isa<PHINode>(I))
				return false;

				for (unsigned i = 0, e = I->getNumOperands(); i != e; ++i)
				if (ASTMatch(I->getOperand(i), P))
				return true;
				return false;
				}

				BasicBlock *getInnerSingleBB(Loop *L) {
				BasicBlock *BB = nullptr;
				for (auto bb = L->block_begin(), bbe = L->block_end(); bb != bbe; ++bb) {
				if (*bb == L->getHeader() || *bb == L->getLoopLatch())
				continue;

				// Ignore empty blocks.
				auto bbBr = dyn_cast<BranchInst>((*bb)->getTerminator());
				if ((*bb)->size() == 1 && (bbBr && bbBr->isUnconditional()))
				continue;

				if (BB) // A BB was already found.
				return nullptr; // There are multiple BBs.

				BB = *bb;
				}
				return BB;
				}

				void LoopCost::calculateLoopCosts(LoopNest_t LN) {
				// Initialize cost of all loops in nest to -1.
				for (auto L : LN)
				LoopCosts[L] = -1.0;

				if (!IsPerfectNest(LN))
				return;

				// For now, restricting to a single bb in innermost loop. This will be
				// relaxed soon.
				BasicBlock *innermostBB = getInnerSingleBB(*LN.rbegin());
				if (!innermostBB)
				return;

				CreateReferenceGroups(innermostBB);
				setTripCounts(LN);

				// Calculate the cost of each loop in the nest, assuming in turn that the
				// loop is the innermost one.
				for (auto L : LN) {
				double ThisLoopCost = 0.0;

				PHINode *P = L->getCanonicalInductionVariable();
				if (!P) {
				DEBUG(dbgs() << "Could not find induction variable\n");
				continue;
				}
				DEBUG(dbgs() << "Loop: " << L->getHeader()->getName()
				<< " phi: " << *P << "\n");

				// Calculate penalties from other loops.
				double ThisLoopPenalty = 1.0;
				double OtherLoopPenalties = 1.0;
				for (auto ltc : LoopTripCounts) {
				if (ltc.first == L)
				ThisLoopPenalty = ltc.second;
				else
				OtherLoopPenalties = OtherLoopPenalties * ltc.second;
				}
				assert((ThisLoopPenalty > 0 && OtherLoopPenalties > 0)
				&& "Incorrect loop penalty");
				DEBUG(dbgs() << "ThisLoopPenalty: " << ThisLoopPenalty << "\n");
				DEBUG(dbgs() << "OtherLoopPenalties: " << OtherLoopPenalties << "\n");

				for (auto GEP : ReferenceGroups) {
				// Check stride of this GEP access. The position of use of 'P' is
				// checked in the operands of GEP. TODO: If the operands of GEP do not
				// use 'P' directly, the AST has to be traversed starting from each
				// operand of GEP in order to find the use of 'P'.
				DEBUG(dbgs() << "GEP: " << *GEP << "\n");
				double ThisRefPenalty = 1.0; // Is 1 for an invariant reference.
				for (unsigned ni = 1, ne = GEP->getNumIndices(); ni <= ne; ni++) {
				if (ASTMatch(GEP->getOperand(ni), P)) {
				// Check stride access.
				if ((AccessOrder == ROWMAJOR && ni == ne)
				|| (AccessOrder == COLUMNMAJOR && ni == 2)) { // ni == 1 is primary offset.
				// Contiguous locations.
				ThisRefPenalty = ThisLoopPenalty / Cache.getLineSize();
				} else {
				// Non-contiguous locations.
				ThisRefPenalty = ThisLoopPenalty;
				}
				DEBUG(dbgs() << "ThisRefPenalty: " << ThisRefPenalty << "\n");
				}
				}
				ThisLoopCost = ThisLoopCost + ThisRefPenalty * OtherLoopPenalties;
				DEBUG(dbgs() << "Accum cost of this loop: " << ThisLoopCost << "\n\n");
				}
				LoopCosts[L] = ThisLoopCost;
				}
				}

				bool LoopCostAnalysis::runOnFunction(Function &F) {
				auto LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
				auto SCEV = &getAnalysis<ScalarEvolutionWrapperPass>().getSE();
				LC = new LoopCost(SCEV);

				DEBUG(dbgs() << "Calculate LoopCosts in Function: " << F.getName() << "\n");
				SmallVector <LoopNest_t, 2> PerfectLoopNests;
				// Build perfect loop nests.
				for (auto L = LI->begin(), Le = LI->end(); L != Le; ++L)
				PopulatePerfectLoopNestsUnder(*L, PerfectLoopNests);

				for (auto PN : PerfectLoopNests)
				LC->calculateLoopCosts(PN);

				LC->printLoopCosts();
				LC->printTripCounts();
				return false;
				}

				char LoopCostAnalysis::ID = 0;

				auto PassDesc = "Experimental, Cache aware Loop Cost Analysis";
				INITIALIZE_PASS_BEGIN(LoopCostAnalysis, "loop-cost", PassDesc, false, true)
				INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)
				INITIALIZE_PASS_END(LoopCostAnalysis, "loop-cost", PassDesc, false, true)

				namespace llvm {
				FunctionPass *createLoopCostAnalysisPass() {
				return new LoopCostAnalysis();
				}
				}
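
As a rough illustration of the grouping rule in `CreateReferenceGroups`, the standalone sketch below models each reference by its leading operands and a constant last-index offset. Two references land in one group when the leading operands match and the offsets differ by less than the line size. `Ref`, `sameGroup`, and `groupRefs` are hypothetical names that do not appear in the patch, and taking the absolute value of the difference is an assumption of this sketch (the patch compares the signed difference).

```cpp
#include <cstdint>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for a GEP: all operands but the last are folded into
// Base, and the last operand is reduced to a constant offset.
struct Ref {
  std::string Base;
  int64_t LastIndex;
};

// Two references share a group when their leading operands match and their
// last indices differ by less than the line size. Using the absolute
// difference is an assumption of this sketch.
bool sameGroup(const Ref &A, const Ref &B, int64_t LineSize) {
  return A.Base == B.Base && std::llabs(A.LastIndex - B.LastIndex) < LineSize;
}

// Keep one representative per group, in the spirit of ReferenceGroups.
std::vector<Ref> groupRefs(const std::vector<Ref> &Refs, int64_t LineSize) {
  std::vector<Ref> Groups;
  for (const Ref &R : Refs) {
    bool Unique = true;
    for (const Ref &G : Groups) {
      if (sameGroup(R, G, LineSize)) {
        Unique = false;
        break;
      }
    }
    if (Unique)
      Groups.push_back(R);
  }
  return Groups;
}
```

With a line size of 4, the references `a+0`, `a+3`, `a+8`, and `b+0` would collapse to three representatives: `a+0` (which absorbs `a+3`), `a+8`, and `b+0`.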

test/Analysis/CostModel/X86/matmul-perfect-loopCost.ll

This file was added.

				; RUN: opt -loop-cost -debug %s -o /dev/null 2>&1 | FileCheck %s
				; Test Loop Cost Analysis of perfectly nested matrix multiplication code having
				; global arrays.
				; Generated from matmul-perfect.c:
				; #include <stdio.h>
				; #define SIZE 5000
				; int a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];
				; void matmul() {
				; int i, j, k;
				; for (i = 0; i < SIZE; i++) {
				; for (j = 0; j < SIZE; j++) {
				; for (k = 0; k < SIZE; k++) {
				; c[i][j] = c[i][j] + a[i][k] * b[k][j];
				; }
				; }
				; }
				; }
				; with clang -emit-llvm matmul-perfect.c -S -o matmul-perfect.clang.ll
				; then opt -mem2reg -loop-simplify -instcombine -instnamer -indvars -S -o matmul-perfect.opt.ll

				; CHECK: Loop: for.cond4 Costs: 1.563688e+11
				; CHECK-NEXT: Loop: for.cond1 Costs: 6.256252e+10
				; CHECK-NEXT: Loop: for.cond Costs: 2.501750e+11
				jfb (inline comment): Is your algorithm stable enough that every platform will produce exactly the same `double` result? You may want to print out the double differently so it's easier to ignore bits of lower significance.
				tvvikram (author, inline reply): LSBs can vary, I guess. Will shorten to 4 decimal places.
				; CHECK: Loop: for.cond TripCount: 5001
				; CHECK-NEXT: Loop: for.cond1 TripCount: 5001
				; CHECK-NEXT: Loop: for.cond4 TripCount: 5001

				; ModuleID = 'matmul-perfect.clang.bc'
				source_filename = "matmul-perfect.c"
				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				@c = common global [5000 x [5000 x i32]] zeroinitializer, align 16
				@a = common global [5000 x [5000 x i32]] zeroinitializer, align 16
				@b = common global [5000 x [5000 x i32]] zeroinitializer, align 16

				; Function Attrs: nounwind uwtable
				define void @matmul() #0 {
				entry:
				br label %for.cond

				for.cond: ; preds = %for.inc24, %entry
				%indvars.iv6 = phi i64 [ %indvars.iv.next7, %for.inc24 ], [ 0, %entry ]
				%exitcond8 = icmp ne i64 %indvars.iv6, 5000
				br i1 %exitcond8, label %for.body, label %for.end26

				for.body: ; preds = %for.cond
				br label %for.cond1

				for.cond1: ; preds = %for.inc21, %for.body
				%indvars.iv3 = phi i64 [ %indvars.iv.next4, %for.inc21 ], [ 0, %for.body ]
				%exitcond5 = icmp ne i64 %indvars.iv3, 5000
				br i1 %exitcond5, label %for.body3, label %for.end23

				for.body3: ; preds = %for.cond1
				br label %for.cond4

				for.cond4: ; preds = %for.inc, %for.body3
				%indvars.iv = phi i64 [ %indvars.iv.next, %for.inc ], [ 0, %for.body3 ]
				%exitcond = icmp ne i64 %indvars.iv, 5000
				br i1 %exitcond, label %for.body6, label %for.end

				for.body6: ; preds = %for.cond4
				%arrayidx8 = getelementptr inbounds [5000 x [5000 x i32]], [5000 x [5000 x i32]]* @c, i64 0, i64 %indvars.iv6, i64 %indvars.iv3
				%tmp = load i32, i32* %arrayidx8, align 4
				%arrayidx12 = getelementptr inbounds [5000 x [5000 x i32]], [5000 x [5000 x i32]]* @a, i64 0, i64 %indvars.iv6, i64 %indvars.iv
				%tmp1 = load i32, i32* %arrayidx12, align 4
				%arrayidx16 = getelementptr inbounds [5000 x [5000 x i32]], [5000 x [5000 x i32]]* @b, i64 0, i64 %indvars.iv, i64 %indvars.iv3
				%tmp2 = load i32, i32* %arrayidx16, align 4
				%mul = mul nsw i32 %tmp1, %tmp2
				%add = add nsw i32 %tmp, %mul
				%arrayidx20 = getelementptr inbounds [5000 x [5000 x i32]], [5000 x [5000 x i32]]* @c, i64 0, i64 %indvars.iv6, i64 %indvars.iv3
				store i32 %add, i32* %arrayidx20, align 4
				br label %for.inc

				for.inc: ; preds = %for.body6
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				br label %for.cond4

				for.end: ; preds = %for.cond4
				br label %for.inc21

				for.inc21: ; preds = %for.end
				%indvars.iv.next4 = add nuw nsw i64 %indvars.iv3, 1
				br label %for.cond1

				for.end23: ; preds = %for.cond1
				br label %for.inc24

				for.inc24: ; preds = %for.end23
				%indvars.iv.next7 = add nuw nsw i64 %indvars.iv6, 1
				br label %for.cond

				for.end26: ; preds = %for.cond
				ret void
				}

				attributes #0 = { nounwind uwtable }

				!llvm.ident = !{!0}

				!0 = !{!""}
