This is an archive of the discontinued LLVM Phabricator instance.

This cost model component is pretty interesting! It looks like something that could really be a generic pass that drives a traversal and queries op interfaces?

mlir/lib/Dialect/Async/Transforms/CostModel.cpp
2	(license header)
90–92
96	Seems like all these "estimateCostXXX" functions are the kind of thing that should be behind an op interface instead, shouldn't they?
mlir/lib/Dialect/Async/Transforms/CostModel.h
2	You're missing the license header, and the guard does not match the usual format I believe.
7	Please avoid using namespace directive in a header.
15	Can you make it a class and expand the doc? Right now the description says "Approximate cost of executing an op on a modern CPU" but the first member is a Builder, which is confusing to me. Edit: reading the implementation, I get a better feel for what is going on, but the comment stands :)

This seems important and cross-cutting enough to warrant an RFC on discourse, I am surprised @mehdi_amini has not asked for this already :)
Putting a blocker until a discussion has happened.

This revision now requires changes to proceed. Nov 24 2021, 3:07 AM

Address comments.

bakhtiyarneyman added inline comments. Nov 24 2021, 4:45 PM

mlir/lib/Dialect/Async/Transforms/CostModel.cpp
96	Yes, the code is written with that potential development in mind. But it is not clear that something as intrusive is necessary now. The concept needs to be validated first. The biggest question is whether the estimation precision of what we have now is sufficient. If it turns out to be insufficient, we might need to do better analysis. For instance, it's conceivable that the whole approach would need to be scrapped: it might be imperative to do the memory component of the analysis earlier and/or incrementally, before the tiling and fusion have been lowered to ops that make it difficult to guess the cache performance. Furthermore, it might be beneficial to make a first-class cost model that would help drive optimizations in other parts of the pipeline, most importantly the codegen. I also suspect that once we try to use op interfaces, the cost model will need to be reimagined as this first-class entity, which might mean designing it in a way that is useful for other architectures, e.g. GPUs. That in itself will take a quarter. So my preference is to keep this small and local to the AsyncParallelFor pass.
mlir/lib/Dialect/Async/Transforms/CostModel.h
2	It seems to be matching the PassDetail.h from the same directory, no?
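
For illustration only, the difference between the per-dialect "estimateCostXXX" helpers and an interface-based design can be sketched in plain C++ without MLIR. Everything below is hypothetical: `CostEstimable` stands in for an op interface, the three op structs and `estimateRegion` are made-up names, and only the cost constants (3 cycles for a division, 100 for a transcendental, bytes for a load) are taken from the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Plain-integer stand-in for the Cost class in the diff.
struct SimpleCost {
  uint64_t ram; // bytes moved
  uint64_t cpu; // approximate cycles
};

// Hypothetical analogue of an op interface: each op reports its own cost.
struct CostEstimable {
  virtual ~CostEstimable() = default;
  virtual SimpleCost estimateCost() const = 0;
};

struct LoadF32Op : CostEstimable {
  SimpleCost estimateCost() const override { return {4, 0}; }
};
struct DivOp : CostEstimable {
  SimpleCost estimateCost() const override { return {0, 3}; }
};
struct ExpOp : CostEstimable {
  SimpleCost estimateCost() const override { return {0, 100}; }
};

// The generic driver only traverses and sums; it has no per-dialect switch.
inline SimpleCost
estimateRegion(const std::vector<std::unique_ptr<CostEstimable>> &ops) {
  SimpleCost total{0, 0};
  for (const auto &op : ops) {
    SimpleCost c = op->estimateCost();
    total.ram += c.ram;
    total.cpu += c.cpu;
  }
  return total;
}
```

In this shape, adding a new op means implementing `estimateCost` on it rather than extending a central dispatch function, which is the trade-off the thread is weighing.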

Harbormaster completed remote builds in B135952: Diff 389628. Nov 24 2021, 4:54 PM

It would be great for the cost model to be part of core MLIR, defined with an op interface, but for it to be usable for all the potential targets it will require a lot of discussion (e.g. how can the cost model capture GPU memory coalescing? Should that be the concern of the cost model at all, or should it just track compute?). Let's start an RFC to figure out what it should look like. In the meantime we can hide the current crude implementation behind the AsyncBlockSizeComputationFunction API, and keep the implementation inside the TensorFlow compiler. We can tune it for the compilation pipeline we have right now, without needing to worry about the general case.

mlir/include/mlir/Dialect/Async/Passes.h
25	How about giving more control to the caller (similar to the `TileSizeComputationFunction` in linalg)? Something like this:

	struct AsyncBlockSize {
	  Value targetBlockSize; // must be a Value of index type
	  ... // some other values?
	};

	using AsyncBlockSizeComputationFunction =
	    std::function<AsyncBlockSize(OpBuilder &, scf::ParallelOp parallelLoop)>;

	createAsyncParallelForPass(bool asyncDispatch, int32_t numWorkerThreads,
	                           AsyncBlockSizeComputationFunction cost);

	A constant `int32_t targetBlockSize` will be just a trivial cost compute function. Also, the Async dialect can provide a default cost function based on the ops' properties.
mlir/include/mlir/Dialect/Async/Passes.td
34	and this will become a property of the caller pipeline
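
As a rough illustration of the callback idea in the `Passes.h` comment, the sketch below models the block-size function with plain integers so that it compiles without MLIR. `LoopInfo`, `BlockSizeFn`, `constantBlockSize`, and `costBasedBlockSize` are hypothetical names; the actual proposal returns an index-typed `Value` built with an `OpBuilder`, which this sketch does not attempt to reproduce.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for scf::ParallelOp: only what the callback needs.
struct LoopInfo {
  int64_t tripCount;
  int64_t bodyCostNs; // estimated cost of one iteration, in nanoseconds
};

// Same shape as the proposed AsyncBlockSizeComputationFunction, but it
// returns a plain integer instead of an index-typed Value.
using BlockSizeFn = std::function<int64_t(const LoopInfo &)>;

// A constant target block size is just a callback that ignores the loop.
inline BlockSizeFn constantBlockSize(int64_t targetBlockSize) {
  assert(targetBlockSize > 0 && "block size must be positive");
  return [targetBlockSize](const LoopInfo &) { return targetBlockSize; };
}

// A cost-driven callback: enough iterations per block to reach minTaskNs.
inline BlockSizeFn costBasedBlockSize(int64_t minTaskNs) {
  return [minTaskNs](const LoopInfo &loop) {
    assert(loop.bodyCostNs > 0 && "per-iteration cost must be positive");
    return (minTaskNs + loop.bodyCostNs - 1) / loop.bodyCostNs;
  };
}
```

With this shape the pass needs no special case for a fixed block size, and the cost model becomes a choice made by whoever assembles the pipeline.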

ezhulenev removed reviewers: ezhulenev, nicolasvasilache. Feb 17 2022, 10:08 AM

Revision Contents

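Path                                                      Size
mlir/include/mlir/Dialect/Async/Passes.h                  6 lines
mlir/include/mlir/Dialect/Async/Passes.td                 7 lines
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp    31 lines
mlir/lib/Dialect/Async/Transforms/CMakeLists.txt          4 lines
mlir/lib/Dialect/Async/Transforms/CostModel.h             78 lines
mlir/lib/Dialect/Async/Transforms/CostModel.cpp           232 lines
utils/bazel/llvm-project-overlay/mlir/BUILD.bazel         3 lines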

Diff 389628

mlir/include/mlir/Dialect/Async/Passes.h

	[13 lines not shown]
	#define MLIR_DIALECT_ASYNC_PASSES_H_			#define MLIR_DIALECT_ASYNC_PASSES_H_

	#include "mlir/Pass/Pass.h"			#include "mlir/Pass/Pass.h"

	namespace mlir {			namespace mlir {

	std::unique_ptr<Pass> createAsyncParallelForPass();			std::unique_ptr<Pass> createAsyncParallelForPass();

	std::unique_ptr<Pass> createAsyncParallelForPass(bool asyncDispatch,			std::unique_ptr<Pass>
	int32_t numWorkerThreads,			createAsyncParallelForPass(bool asyncDispatch, int32_t numWorkerThreads,
	int32_t targetBlockSize);			bool useCostModel, int32_t minTaskSize = 512 * 1024);

	std::unique_ptr<OperationPass<ModuleOp>> createAsyncToAsyncRuntimePass();			std::unique_ptr<OperationPass<ModuleOp>> createAsyncToAsyncRuntimePass();

	std::unique_ptr<Pass> createAsyncRuntimeRefCountingPass();			std::unique_ptr<Pass> createAsyncRuntimeRefCountingPass();

	std::unique_ptr<Pass> createAsyncRuntimeRefCountingOptPass();			std::unique_ptr<Pass> createAsyncRuntimeRefCountingOptPass();

	std::unique_ptr<Pass> createAsyncRuntimePolicyBasedRefCountingPass();			std::unique_ptr<Pass> createAsyncRuntimePolicyBasedRefCountingPass();

	[11 lines not shown]

mlir/include/mlir/Dialect/Async/Passes.td

[23 lines not shown]
Option<"asyncDispatch", "async-dispatch",		Option<"asyncDispatch", "async-dispatch",
"caller thread.">,		"caller thread.">,

Option<"numWorkerThreads", "num-workers",		Option<"numWorkerThreads", "num-workers",
"int32_t", /*default=*/"8",		"int32_t", /*default=*/"8",
"The number of available workers to execute async operations.">,		"The number of available workers to execute async operations.">,

Option<"minTaskSize", "min-task-size",		Option<"minTaskSize", "min-task-size",
"int32_t", /*default=*/"1000",		"int32_t", /*default=*/"1000",
"The minimum task size for sharding parallel operation.">		"The minimum task size for sharding parallel operation.">,

		Option<"useCostModel", "use-cost-model",
		"bool", /*default=*/"false",
		"Use experimental cost model to determine the parallelism granularity.">

];		];

let dependentDialects = [		let dependentDialects = [
"arith::ArithmeticDialect",		"arith::ArithmeticDialect",
"async::AsyncDialect",		"async::AsyncDialect",
"scf::SCFDialect"		"scf::SCFDialect"
];		];
}		}
[72 lines not shown]

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp

//===- AsyncParallelFor.cpp - Implementation of Async Parallel For --------===//		//===- AsyncParallelFor.cpp - Implementation of Async Parallel For --------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This file implements scf.parallel to scf.for + async.execute conversion pass.		// This file implements scf.parallel to scf.for + async.execute conversion pass.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

		#include "CostModel.h"
#include "PassDetail.h"		#include "PassDetail.h"
#include "mlir/Dialect/Arithmetic/IR/Arithmetic.h"		#include "mlir/Dialect/Arithmetic/IR/Arithmetic.h"
#include "mlir/Dialect/Async/IR/Async.h"		#include "mlir/Dialect/Async/IR/Async.h"
#include "mlir/Dialect/Async/Passes.h"		#include "mlir/Dialect/Async/Passes.h"
#include "mlir/Dialect/SCF/SCF.h"		#include "mlir/Dialect/SCF/SCF.h"
#include "mlir/Dialect/StandardOps/IR/Ops.h"		#include "mlir/Dialect/StandardOps/IR/Ops.h"
#include "mlir/IR/BlockAndValueMapping.h"		#include "mlir/IR/BlockAndValueMapping.h"
#include "mlir/IR/ImplicitLocOpBuilder.h"		#include "mlir/IR/ImplicitLocOpBuilder.h"
[67 lines not shown]
// call @parallel_compute_fn(%block_index, %block_size, ...)		// call @parallel_compute_fn(%block_index, %block_size, ...)
// }		// }
//		//
struct AsyncParallelForPass		struct AsyncParallelForPass
: public AsyncParallelForBase<AsyncParallelForPass> {		: public AsyncParallelForBase<AsyncParallelForPass> {
AsyncParallelForPass() = default;		AsyncParallelForPass() = default;

AsyncParallelForPass(bool asyncDispatch, int32_t numWorkerThreads,		AsyncParallelForPass(bool asyncDispatch, int32_t numWorkerThreads,
int32_t minTaskSize) {		bool useCostModel, int32_t minTaskSize) {
this->asyncDispatch = asyncDispatch;		this->asyncDispatch = asyncDispatch;
this->numWorkerThreads = numWorkerThreads;		this->numWorkerThreads = numWorkerThreads;
		this->useCostModel = useCostModel;
this->minTaskSize = minTaskSize;		this->minTaskSize = minTaskSize;
}		}

void runOnOperation() override;		void runOnOperation() override;
};		};

struct AsyncParallelForRewrite : public OpRewritePattern<scf::ParallelOp> {		struct AsyncParallelForRewrite : public OpRewritePattern<scf::ParallelOp> {
public:		public:
AsyncParallelForRewrite(MLIRContext *ctx, bool asyncDispatch,		AsyncParallelForRewrite(MLIRContext *ctx, bool asyncDispatch,
int32_t numWorkerThreads, int32_t minTaskSize)		int32_t numWorkerThreads, int32_t minTaskSize)
: OpRewritePattern(ctx), asyncDispatch(asyncDispatch),		: OpRewritePattern(ctx), asyncDispatch(asyncDispatch),
numWorkerThreads(numWorkerThreads), minTaskSize(minTaskSize) {}		numWorkerThreads(numWorkerThreads), minTaskSize(minTaskSize) {}

LogicalResult matchAndRewrite(scf::ParallelOp op,		LogicalResult matchAndRewrite(scf::ParallelOp op,
PatternRewriter &rewriter) const override;		PatternRewriter &rewriter) const override;

private:		private:
bool asyncDispatch;		bool asyncDispatch;
int32_t numWorkerThreads;		int32_t numWorkerThreads;
		bool useCostModel;
int32_t minTaskSize;		int32_t minTaskSize;
};		};

struct ParallelComputeFunctionType {		struct ParallelComputeFunctionType {
FunctionType type;		FunctionType type;
llvm::SmallVector<Value> captures;		llvm::SmallVector<Value> captures;
};		};

[541 lines not shown]
auto noOp = [&](OpBuilder &nestedBuilder, Location loc) {		auto noOp = [&](OpBuilder &nestedBuilder, Location loc) {
nestedBuilder.create<scf::YieldOp>(loc);		nestedBuilder.create<scf::YieldOp>(loc);
};		};

// Compute the parallel block size and dispatch concurrent tasks computing		// Compute the parallel block size and dispatch concurrent tasks computing
// results for each block.		// results for each block.
auto dispatch = [&](OpBuilder &nestedBuilder, Location loc) {		auto dispatch = [&](OpBuilder &nestedBuilder, Location loc) {
ImplicitLocOpBuilder nb(loc, nestedBuilder);		ImplicitLocOpBuilder nb(loc, nestedBuilder);

		// Create a parallel compute function that takes a block id and computes the
		// parallel operation body for a subset of iteration space.
		ParallelComputeFunction parallelComputeFunction =
		createParallelComputeFunction(op, rewriter);

// With large number of threads the value of creating many compute blocks		// With large number of threads the value of creating many compute blocks
// is reduced because the problem typically becomes memory bound. For small		// is reduced because the problem typically becomes memory bound. For small
// number of threads it helps with stragglers.		// number of threads it helps with stragglers.
float overshardingFactor = numWorkerThreads <= 4 ? 8.0		float overshardingFactor = numWorkerThreads <= 4 ? 8.0
: numWorkerThreads <= 8 ? 4.0		: numWorkerThreads <= 8 ? 4.0
: numWorkerThreads <= 16 ? 2.0		: numWorkerThreads <= 16 ? 2.0
: numWorkerThreads <= 32 ? 1.0		: numWorkerThreads <= 32 ? 1.0
: numWorkerThreads <= 64 ? 0.8		: numWorkerThreads <= 64 ? 0.8
: 0.6;		: 0.6;

// Do not overload worker threads with too many compute blocks.		// Do not overload worker threads with too many compute blocks.
Value maxComputeBlocks = b.create<arith::ConstantIndexOp>(		Value maxComputeBlocks = b.create<arith::ConstantIndexOp>(
std::max(1, static_cast<int>(numWorkerThreads * overshardingFactor)));		std::max(1, static_cast<int>(numWorkerThreads * overshardingFactor)));

// Target block size from the pass parameters.		// Target block size from the pass parameters.
Value minTaskSizeCst = b.create<arith::ConstantIndexOp>(minTaskSize);		Value minTaskSizeCst = b.create<arith::ConstantIndexOp>(minTaskSize);

// Compute parallel block size from the parallel problem size:		// Compute parallel block size from the parallel problem size:
// blockSize = min(tripCount,		// blockSize = min(tripCount,
// max(ceil_div(tripCount, maxComputeBlocks),		// max(ceil_div(tripCount, maxComputeBlocks),
// ceil_div(minTaskSize, bodySize)))		// ceil_div(minTaskSize, bodySize)))
		Value bodySize;
		if (useCostModel) {
		CostModel helper(b, *op);
		Cost cost = helper.estimateCost(op.getLoopBody());
		bodySize = helper.costToNanoseconds(cost);
		} else {
		bodySize = b.create<arith::ConstantIndexOp>(32);
		}
Value bs0 = b.create<arith::CeilDivSIOp>(tripCount, maxComputeBlocks);		Value bs0 = b.create<arith::CeilDivSIOp>(tripCount, maxComputeBlocks);
Value bs1 = b.create<arith::MaxSIOp>(bs0, minTaskSizeCst);		Value bs1 = b.create<arith::CeilDivSIOp>(minTaskSizeCst, bodySize);
Value blockSize = b.create<arith::MinSIOp>(tripCount, bs1);		Value bs2 = b.create<arith::MaxSIOp>(bs0, bs1);
		Value blockSize = b.create<arith::MinSIOp>(tripCount, bs2);
Value blockCount = b.create<arith::CeilDivSIOp>(tripCount, blockSize);		Value blockCount = b.create<arith::CeilDivSIOp>(tripCount, blockSize);

// Create a parallel compute function that takes a block id and computes the
// parallel operation body for a subset of iteration space.
ParallelComputeFunction parallelComputeFunction =
createParallelComputeFunction(op, rewriter);

// Dispatch parallel compute function using async recursive work splitting,		// Dispatch parallel compute function using async recursive work splitting,
// or by submitting compute task sequentially from a caller thread.		// or by submitting compute task sequentially from a caller thread.
if (asyncDispatch) {		if (asyncDispatch) {
doAsyncDispatch(b, rewriter, parallelComputeFunction, op, blockSize,		doAsyncDispatch(b, rewriter, parallelComputeFunction, op, blockSize,
blockCount, tripCounts);		blockCount, tripCounts);
} else {		} else {
doSequentialDispatch(b, rewriter, parallelComputeFunction, op, blockSize,		doSequentialDispatch(b, rewriter, parallelComputeFunction, op, blockSize,
blockCount, tripCounts);		blockCount, tripCounts);
[23 lines not shown]
}		}

std::unique_ptr<Pass> mlir::createAsyncParallelForPass() {		std::unique_ptr<Pass> mlir::createAsyncParallelForPass() {
return std::make_unique<AsyncParallelForPass>();		return std::make_unique<AsyncParallelForPass>();
}		}

std::unique_ptr<Pass> mlir::createAsyncParallelForPass(bool asyncDispatch,		std::unique_ptr<Pass> mlir::createAsyncParallelForPass(bool asyncDispatch,
int32_t numWorkerThreads,		int32_t numWorkerThreads,
		bool useCostModel,
int32_t minTaskSize) {		int32_t minTaskSize) {
return std::make_unique<AsyncParallelForPass>(asyncDispatch, numWorkerThreads,		return std::make_unique<AsyncParallelForPass>(asyncDispatch, numWorkerThreads,
minTaskSize);		useCostModel, minTaskSize);
}		}
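
The block-size formula in the comment above (`blockSize = min(tripCount, max(ceil_div(tripCount, maxComputeBlocks), ceil_div(minTaskSize, bodySize)))`) can be checked with plain integers. This is a minimal sketch and not the pass's code: `ceilDiv`, `overshardingFactor`, and `computeBlockSize` are names made up here, the sketch assumes non-negative inputs, and the pass itself emits `arith` ops on `index` values so that the computation happens at run time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Ceiling division for non-negative a and positive b (illustrative helper).
inline int64_t ceilDiv(int64_t a, int64_t b) {
  assert(a >= 0 && b > 0 && "sketch only handles non-negative operands");
  return (a + b - 1) / b;
}

// Same thresholds as the oversharding table in the diff.
inline double overshardingFactor(int32_t numWorkerThreads) {
  return numWorkerThreads <= 4    ? 8.0
         : numWorkerThreads <= 8  ? 4.0
         : numWorkerThreads <= 16 ? 2.0
         : numWorkerThreads <= 32 ? 1.0
         : numWorkerThreads <= 64 ? 0.8
                                  : 0.6;
}

// bodySize is the per-iteration cost: 32 in the fallback path of the diff,
// or the cost model's nanosecond estimate when the cost model is enabled.
inline int64_t computeBlockSize(int64_t tripCount, int32_t numWorkerThreads,
                                int64_t minTaskSize, int64_t bodySize) {
  int64_t maxComputeBlocks = std::max<int64_t>(
      1, static_cast<int64_t>(numWorkerThreads *
                              overshardingFactor(numWorkerThreads)));
  int64_t bs0 = ceilDiv(tripCount, maxComputeBlocks);
  int64_t bs1 = ceilDiv(minTaskSize, bodySize);
  return std::min(tripCount, std::max(bs0, bs1));
}
```

For example, with `tripCount = 10000`, 8 workers, `minTaskSize = 1000`, and `bodySize = 32`, the block size is 313, which yields 32 blocks. With `minTaskSize = 512 * 1024` (the default in the new `createAsyncParallelForPass` signature) the second term dominates and the whole loop runs as a single block.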

mlir/lib/Dialect/Async/Transforms/CMakeLists.txt

	add_mlir_dialect_library(MLIRAsyncTransforms			add_mlir_dialect_library(MLIRAsyncTransforms
	AsyncParallelFor.cpp			AsyncParallelFor.cpp
	AsyncRuntimeRefCounting.cpp			AsyncRuntimeRefCounting.cpp
	AsyncRuntimeRefCountingOpt.cpp			AsyncRuntimeRefCountingOpt.cpp
	AsyncToAsyncRuntime.cpp			AsyncToAsyncRuntime.cpp
				CostModel.cpp
	PassDetail.cpp			PassDetail.cpp

	ADDITIONAL_HEADER_DIRS			ADDITIONAL_HEADER_DIRS
	${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/Async			${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/Async

	DEPENDS			DEPENDS
	MLIRAsyncPassIncGen			MLIRAsyncPassIncGen

	LINK_LIBS PUBLIC			LINK_LIBS PUBLIC
	MLIRArithmetic			MLIRArithmetic
	MLIRAsync			MLIRAsync
	MLIRIR			MLIRIR
				MLIRMath
				MLIRMemref
	MLIRPass			MLIRPass
	MLIRSCF			MLIRSCF
	MLIRSCFToStandard			MLIRSCFToStandard
	MLIRStandard			MLIRStandard
	MLIRTransforms			MLIRTransforms
	MLIRTransformUtils			MLIRTransformUtils
				MLIRVector
	)			)

mlir/lib/Dialect/Async/Transforms/CostModel.h

This file was added.

//===- CostModel.h - Declaration of the cost model. --------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===////
//
// This file declares a cost model that drives deciding the parallelization
// granularity when lowering parallel operations into async tasks.
//
//===----------------------------------------------------------------------===//

#ifndef DIALECT_ASYNC_TRANSFORMS_COSTMODEL_H_
#define DIALECT_ASYNC_TRANSFORMS_COSTMODEL_H_

#include "mlir/IR/ImplicitLocOpBuilder.h"

namespace mlir {

// Cost measuring unit.
//
// For CPU roughly corresponds to cycles. For RAM roughly corresponds to bytes.
using InverseThroughput = Value;

// Approximate cost of executing an op on a modern CPU.
//
// Encapsulates the leaf nodes of the IR that is being emitted as the cost
// estimation progresses. The leaf nodes correspond to CPU and RAM
// inverse-throughputs.
class Cost {
private:
  InverseThroughput ram;
  InverseThroughput cpu;
  ImplicitLocOpBuilder builder;

public:
  Cost(ImplicitLocOpBuilder &builder, size_t ramCost = 0, size_t cpuCost = 0);

  Cost(ImplicitLocOpBuilder &builder, InverseThroughput &ramCost,
       InverseThroughput &cpuCost);

  InverseThroughput &getRAM();
  InverseThroughput &getCPU();

  Cost &operator*=(const size_t &rhs);
  Cost operator*(const size_t &rhs);
  Cost &operator+=(const Cost &rhs);
  Cost operator+(const Cost &rhs);
};

// Estimates execution time for an op on a modern CPU.
//
// Errs on the lower end, but is not strictly a lower bound estimate. Targeting
// being within an order of magnitude of the correct value.
class CostModel {
private:
  ImplicitLocOpBuilder builder;
  Operation &rootOp;

public:
  CostModel(ImplicitLocOpBuilder &builder, Operation &rootOp)
      : builder(builder), rootOp(rootOp) {}

  ImplicitLocOpBuilder &getBuilder();
  bool isBeforeRoot(Operation &op);
  bool isBeforeRoot(Value &value);
  Optional<Value> getIterations(Value lowerBound, Value upperBound, Value step);
  Cost newCost(size_t ramCost, size_t cpuCost);
  Cost newCost(Value ramCost, Value cpuCost);
  Cost zeroCost();
  Cost estimateCost(Operation &op);
  Cost estimateCost(Region &region);
  Value costToNanoseconds(Cost cost);
};

} // namespace mlir
#endif // DIALECT_ASYNC_TRANSFORMS_COSTMODEL_H_

mlir/lib/Dialect/Async/Transforms/CostModel.cpp

This file was added.

//===- CostModel.cpp - Implementation of the cost model. --------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This file implements a cost model that drives deciding the parallelization
// granularity when lowering parallel operations into async tasks.
//
//===----------------------------------------------------------------------===//

#include "CostModel.h"
#include "mlir/Dialect/Arithmetic/IR/Arithmetic.h"
#include "mlir/Dialect/Math/IR/Math.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Dialect/SCF/SCF.h"
#include "mlir/Dialect/Vector/VectorOps.h"

namespace mlir {

Cost::Cost(ImplicitLocOpBuilder &builder, size_t ramCost, size_t cpuCost)
    : builder(builder), ram(builder.create<arith::ConstantIndexOp>(ramCost)),
      cpu(builder.create<arith::ConstantIndexOp>(cpuCost)) {}

Cost::Cost(ImplicitLocOpBuilder &builder, InverseThroughput &ramCost,
           InverseThroughput &cpuCost)
    : builder(builder), ram(ramCost), cpu(cpuCost) {}

InverseThroughput &Cost::getRAM() { return ram; }
InverseThroughput &Cost::getCPU() { return cpu; }

Cost &Cost::operator*=(const size_t &rhs) {
  auto factor = builder.create<arith::ConstantIndexOp>(rhs);
  auto scale = [&](InverseThroughput resource) {
    return builder.create<arith::MulIOp>(resource, factor);
  };
  ram = scale(ram);
  cpu = scale(cpu);
  return *this;
}

Cost Cost::operator*(const size_t &rhs) {
  Cost ret = *this;
  ret *= rhs;
  return ret;
}

Cost &Cost::operator+=(const Cost &rhs) {
  ram = builder.create<arith::AddIOp>(ram, rhs.ram);
  cpu = builder.create<arith::AddIOp>(cpu, rhs.cpu);
  return *this; // return the result by reference
}

Cost Cost::operator+(const Cost &rhs) {
  Cost ret = *this;
  ret += rhs;
  return ret;
}

Cost estimateCostVector(CostModel &helper, Operation &op) {
  auto newCost = [&](Value value) {
    if (auto type = value.getType().dyn_cast<ShapedType>()) {
      if (type.hasStaticShape()) {
        return helper.newCost(/* ram */ type.getNumElements() *
                                  type.getElementTypeBitWidth() / 8,
                              /* cpu */ 0);
      }
    }
    return helper.zeroCost();
  };
  if (auto loadOp = dyn_cast<vector::LoadOp>(op)) {
    return newCost(loadOp.getResult());
  }
  if (auto storeOp = dyn_cast<vector::StoreOp>(op)) {
    return newCost(storeOp.getOperand(0));
  }
  return helper.zeroCost();
}

Cost estimateCostMemref(CostModel &helper, Operation &op) {
  auto newCost = [&](Value value) {
    return helper.newCost(/* ram */ value.getType().getIntOrFloatBitWidth() / 8,
                          /* cpu */ 0);
  };
  if (auto loadOp = dyn_cast<memref::LoadOp>(op)) {
    return newCost(loadOp.getResult());
  }
  if (auto storeOp = dyn_cast<memref::StoreOp>(op)) {
    return newCost(storeOp.getOperand(0));
  }
  return helper.zeroCost();
}

mehdi_amini (Done), suggested edit:

  Cost estimateCostArith(CostModel &helper, Operation &op) {
-   if (dyn_cast<arith::DivUIOp>(op) || dyn_cast<arith::DivSIOp>(op) ||
-       dyn_cast<arith::CeilDivSIOp>(op) || dyn_cast<arith::FloorDivSIOp>(op) ||
-       dyn_cast<arith::RemSIOp>(op) || dyn_cast<arith::RemUIOp>(op)) {
+   if (isa<arith::DivUIOp, arith::DivSIOp, arith::CeilDivSIOp,
+           arith::FloorDivSIOp, arith::RemSIOp, arith::RemUIOp>(op)) {
      return helper.newCost(/* ram */ 0, /* cpu */ 3);

Cost estimateCostArith(CostModel &helper, Operation &op) {
  if (isa<arith::DivUIOp, arith::DivSIOp, arith::CeilDivSIOp,
          arith::FloorDivSIOp, arith::RemSIOp, arith::RemUIOp>(op)) {
    return helper.newCost(/* ram */ 0, /* cpu */ 3);
  }
  return helper.newCost(/* ram */ 0, /* cpu */ 1);
}

Cost estimateCostMath(CostModel &helper, Operation &op) {
  if (isa<math::AtanOp, math::Atan2Op, math::CosOp, math::SinOp, math::ExpOp,
          math::Exp2Op, math::ExpM1Op, math::LogOp, math::Log10Op,
          math::Log1pOp, math::Log2Op, math::PowFOp, math::RsqrtOp,
          math::SqrtOp, math::TanhOp>(op)) {
    return helper.newCost(/* ram */ 0, /* cpu */ 100);
  }
  return helper.newCost(/* ram */ 0, /* cpu */ 1);
}

Cost estimateCostSCF(CostModel &helper, Operation &op) {
  auto scaleCost = [&](Value iterations, Cost cost) {
    return helper.newCost(
        helper.getBuilder().create<arith::MulIOp>(cost.getRAM(), iterations),
        helper.getBuilder().create<arith::MulIOp>(cost.getCPU(), iterations));
  };
  if (auto forOp = dyn_cast<scf::ForOp>(op)) {
    Cost cost = helper.estimateCost(forOp.getLoopBody());
    if (auto iterations = helper.getIterations(
            forOp.lowerBound(), forOp.upperBound(), forOp.step())) {
      return scaleCost(*iterations, cost);
    }
    return cost;
  }
  if (auto parallelOp = dyn_cast<scf::ParallelOp>(op)) {
    Cost cost = helper.estimateCost(parallelOp.getLoopBody());
    for (auto &inductionVariableDomain : llvm::enumerate(
             llvm::zip(parallelOp.lowerBound(), parallelOp.upperBound(),
                       parallelOp.step()))) {
      Value lb, ub, step;
      std::tie(lb, ub, step) = inductionVariableDomain.value();
      if (auto iterations = helper.getIterations(lb, ub, step)) {
        cost = scaleCost(*iterations, cost);
      }
    }
    return cost;
  }
  return helper.zeroCost();
}

ImplicitLocOpBuilder &CostModel::getBuilder() { return builder; }

bool CostModel::isBeforeRoot(Operation &op) {
  return op.getBlock() == rootOp.getBlock() && op.isBeforeInBlock(&rootOp);
}

bool CostModel::isBeforeRoot(Value &value) {
  return isBeforeRoot(*value.getDefiningOp());
}

Optional<Value> CostModel::getIterations(Value lowerBound, Value upperBound,
                                         Value step) {
  if (isBeforeRoot(lowerBound) && isBeforeRoot(upperBound) &&
      isBeforeRoot(step)) {
    return (Value)builder.create<arith::CeilDivSIOp>(
        builder.create<arith::SubIOp>(upperBound, lowerBound), step);
  }
  return llvm::None;
}

Cost CostModel::newCost(size_t ramCost, size_t cpuCost) {
  return Cost(builder, ramCost, cpuCost);
}

Cost CostModel::newCost(Value ramCost, Value cpuCost) {
  return Cost(builder, ramCost, cpuCost);
}

Cost CostModel::zeroCost() { return Cost(builder, 0, 0); }

Cost CostModel::estimateCost(Operation &op) {
  MLIRContext *ctx = op.getContext();
  if (op.getDialect() == ctx->getLoadedDialect("scf")) {
    return estimateCostSCF(*this, op);
  }
  if (op.getDialect() == ctx->getLoadedDialect("memref")) {
    return estimateCostMemref(*this, op);
  }
  if (op.getDialect() == ctx->getLoadedDialect("vector")) {
    return estimateCostVector(*this, op);
  }
  if (op.getDialect() == ctx->getLoadedDialect("arith")) {
    return estimateCostArith(*this, op);
  }
  if (op.getDialect() == ctx->getLoadedDialect("math")) {
    return estimateCostMath(*this, op);
  }
  return zeroCost();
}

Cost CostModel::estimateCost(Region &region) {
  Cost cost = zeroCost();
  for (auto &op : region.getOps()) {
    cost += estimateCost(op);
  }
  return cost;
}

Value CostModel::costToNanoseconds(Cost cost) {
  // Assume that RAM throughput is 16 GB/s and that CPU runs at 3 GHz.
  auto ramRuntime = builder.create<arith::DivUIOp>(
      cost.getRAM(), builder.create<arith::ConstantIndexOp>(16));
  auto cpuRuntime = builder.create<arith::DivUIOp>(
      cost.getCPU(), builder.create<arith::ConstantIndexOp>(3));
  auto max = [&](Value x, Value y) {
    return builder.create<arith::MaxSIOp>(x, y);
  };
  return max(max(ramRuntime, cpuRuntime),
             builder.create<arith::ConstantIndexOp>(1));
}

} // namespace mlir
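
To make the arithmetic in `getIterations` and `costToNanoseconds` concrete, here is a plain-integer sketch of the same calculation. It is not the code under review: `PlainCost` and `tripCount` are hypothetical names, the values are computed directly rather than emitted as `arith` ops, and the sketch assumes a positive step with `ub >= lb`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Plain-integer stand-in for Cost: bytes of RAM traffic and CPU cycles.
struct PlainCost {
  uint64_t ramBytes;
  uint64_t cpuCycles;

  PlainCost &operator+=(const PlainCost &rhs) {
    ramBytes += rhs.ramBytes;
    cpuCycles += rhs.cpuCycles;
    return *this;
  }
  // Scaling by a loop trip count multiplies both components.
  PlainCost &operator*=(uint64_t iterations) {
    ramBytes *= iterations;
    cpuCycles *= iterations;
    return *this;
  }
};

// ceil((ub - lb) / step), as in getIterations.
inline uint64_t tripCount(int64_t lb, int64_t ub, int64_t step) {
  assert(step > 0 && ub >= lb && "sketch assumes a forward, non-empty-safe loop");
  return static_cast<uint64_t>((ub - lb + step - 1) / step);
}

// 16 GB/s is 16 bytes per nanosecond; 3 GHz is 3 cycles per nanosecond.
// The slower resource bounds the runtime, and the result is at least 1 ns.
inline uint64_t costToNanoseconds(const PlainCost &cost) {
  uint64_t ramRuntime = cost.ramBytes / 16;
  uint64_t cpuRuntime = cost.cpuCycles / 3;
  return std::max(std::max(ramRuntime, cpuRuntime), uint64_t{1});
}
```

For a loop body with one f32 load, one f32 store, one `math.exp`, and one `arith` op, the per-iteration cost under the constants in this diff is 8 bytes and 101 cycles. Over 1000 iterations that is 8000 bytes and 101000 cycles, so the estimate is CPU-bound at 33666 ns (the RAM side would be 500 ns).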

utils/bazel/llvm-project-overlay/mlir/BUILD.bazel

[2,008 lines not shown]
cc_library(		cc_library(
hdrs = ["include/mlir/Dialect/Async/Passes.h"],		hdrs = ["include/mlir/Dialect/Async/Passes.h"],
includes = ["include"],		includes = ["include"],
deps = [		deps = [
":Analysis",		":Analysis",
":ArithmeticDialect",		":ArithmeticDialect",
":Async",		":Async",
":AsyncPassIncGen",		":AsyncPassIncGen",
":IR",		":IR",
		":MathDialect",
		":MemRefDialect",
":Pass",		":Pass",
":SCFDialect",		":SCFDialect",
":SCFToStandard",		":SCFToStandard",
":StandardOps",		":StandardOps",
":Support",		":Support",
":TransformUtils",		":TransformUtils",
":Transforms",		":Transforms",
":TransformsPassIncGen",		":TransformsPassIncGen",
		":VectorOps",
"//llvm:Core",		"//llvm:Core",
"//llvm:Support",		"//llvm:Support",
],		],
)		)

cc_library(		cc_library(
name = "AffineUtils",		name = "AffineUtils",
srcs = glob(		srcs = glob(
[5,611 lines not shown]

Implement a cost model to drive the lowering of scf.parallel. (Needs Review, Public)