Depends On D110680
Event Timeline
This cost model component is pretty interesting! It looks like something that could really be a generic pass that drives a traversal and queries op interfaces?
mlir/lib/Dialect/Async/Transforms/CostModel.cpp
- Line 1: (license header)
- Lines 89–91
- Line 95: All of these "estimateCostXXX" helpers seem like the kind of thing that should be behind an op interface instead, shouldn't they? (See the sketch after these inline comments.)

mlir/lib/Dialect/Async/Transforms/CostModel.h
- Line 1: You're missing the license header, and the guard does not match the usual format, I believe.
- Line 6: Please avoid a using-namespace directive in a header.
- Line 14: Can you make it a class and expand the doc? Edit: reading the implementation, I get a better feel for what is going on, but the comment stands :)
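For illustration, a minimal C++ sketch (not the code under review) of the "generic traversal + per-op cost query" shape suggested above. The hard-coded TypeSwitch dispatch is a stand-in for the hypothetical op interface: each Case is exactly what a per-op interface method would replace. Include paths, op choices, and cost weights are assumptions and depend on the MLIR revision.

    #include "llvm/ADT/TypeSwitch.h"
    #include "mlir/Dialect/MemRef/IR/MemRef.h"
    #include "mlir/Dialect/SCF/SCF.h"
    #include "mlir/IR/Operation.h"

    namespace {

    // Crude per-operation weight; this dispatch is the part that an op
    // interface (each op reporting its own estimated cost) would replace.
    int64_t estimateOpCost(mlir::Operation *op) {
      return llvm::TypeSwitch<mlir::Operation *, int64_t>(op)
          .Case<mlir::memref::LoadOp, mlir::memref::StoreOp>(
              [](auto) { return 4; }) // assume memory accesses dominate
          .Default([](mlir::Operation *) { return 1; });
    }

    // Accumulates the estimated cost of one iteration of a parallel loop body.
    int64_t estimateBodyCost(mlir::scf::ParallelOp parallelOp) {
      int64_t cost = 0;
      parallelOp.getBody()->walk(
          [&](mlir::Operation *op) { cost += estimateOpCost(op); });
      return cost;
    }

    } // namespace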
This seems important and cross-cutting enough to warrant an RFC on Discourse; I am surprised @mehdi_amini has not asked for this already :)
Putting a blocker on this until that discussion has happened.
mlir/lib/Dialect/Async/Transforms/CostModel.cpp
- Line 95: Yes, the code is written with that potential development in mind, but it is not clear that something so intrusive is necessary yet. The concept needs to be validated first. The biggest question is whether the estimation precision of what we have now is sufficient. If it turns out to be insufficient, we might need better analysis. For instance, it is conceivable that the whole approach would need to be scrapped: it might be necessary to do the memory component of the analysis earlier and/or incrementally, before tiling and fusion have been lowered to ops that make it difficult to guess cache performance. It might also be beneficial to build a first-class cost model that helps drive optimizations in other parts of the pipeline, most importantly codegen. I suspect that once we try to use op interfaces, the cost model needs to be reimagined as that first-class entity, which might mean designing it in a way that is useful for other architectures as well, e.g. GPUs. That in itself will take a quarter. So my preference is to keep this small and local to the AsyncParallelFor pass.

mlir/lib/Dialect/Async/Transforms/CostModel.h
- Line 1: It seems to match the PassDetail.h from the same directory, no?
It would be great to have the cost model be a part of core MLIR, defined with an op interface, but making it usable for all the potential targets will require a lot of discussion (e.g. how a cost model can capture GPU memory coalescing, whether that should be the cost model's concern at all, or whether it should just track compute). Let's start an RFC to figure out how it should look. In the meantime we can hide the current crude implementation behind the AsyncBlockSizeComputationFunction API and keep the implementation inside the TensorFlow compiler. We can tune it for the compilation pipeline we have right now, without needing to worry about the general case.
mlir/include/mlir/Dialect/Async/Passes.h
- Line 25: How about giving more control to the caller (similar to the TileSizeComputationFunction in linalg)? Something like this:

      struct AsyncBlockSize {
        Value targetBlockSize;  // must be a Value of index type
        ...                     // some other values?
      };

      using AsyncBlockSizeComputationFunction =
          std::function<AsyncBlockSize(OpBuilder &, scf::ParallelOp parallelLoop)>;

      createAsyncParallelForPass(bool asyncDispatch, int32_t numWorkerThreads,
                                 AsyncBlockSizeComputationFunction cost);

  A constant int32_t targetBlockSize then becomes just a trivial cost-compute function. The Async dialect can also provide a default cost function based on op properties. (See the usage sketch after these inline comments.)

mlir/include/mlir/Dialect/Async/Passes.td
- Line 34: ...and this will become a property of the caller's pipeline.
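To make the proposal concrete, a hypothetical usage sketch, assuming the API lands exactly as written above: AsyncBlockSize, AsyncBlockSizeComputationFunction, and the createAsyncParallelForPass overload are taken from the proposal and are not existing API, and the constant op spelling (arith::ConstantIndexOp here) depends on the MLIR revision.

    #include <functional>

    #include "mlir/Dialect/Arithmetic/IR/Arithmetic.h"
    #include "mlir/Dialect/SCF/SCF.h"
    #include "mlir/IR/Builders.h"
    #include "mlir/IR/Value.h"

    // Restated from the proposal above; not an existing MLIR type.
    struct AsyncBlockSize {
      mlir::Value targetBlockSize; // must be a Value of index type
    };

    using AsyncBlockSizeComputationFunction =
        std::function<AsyncBlockSize(mlir::OpBuilder &, mlir::scf::ParallelOp)>;

    // Trivial cost-compute function: always requests a fixed target block size.
    static AsyncBlockSize constantBlockSize(mlir::OpBuilder &builder,
                                            mlir::scf::ParallelOp parallelLoop) {
      AsyncBlockSize blockSize;
      blockSize.targetBlockSize = builder.create<mlir::arith::ConstantIndexOp>(
          parallelLoop.getLoc(), /*value=*/512);
      return blockSize;
    }

    // The caller's pipeline would then register it via the proposed overload:
    //   pm.addPass(createAsyncParallelForPass(/*asyncDispatch=*/true,
    //                                         /*numWorkerThreads=*/8,
    //                                         constantBlockSize));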