This is an archive of the discontinued LLVM Phabricator instance.


Differential D104850

[mlir:Async] Implement recursive async work splitting for scf.parallel operation (async-parallel-for pass)
ClosedPublic

Authored by ezhulenev on Jun 24 2021, 5:27 AM.


Details

Reviewers

herhut
mehdi_amini

Commits

rG86ad0af87054: [mlir:Async] Implement recursive async work splitting for scf.parallel…

Summary

Depends On D104780

Recursive work splitting instead of sequential async tasks submission gives ~20%-30% speedup in microbenchmarks.
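The two dispatch strategies compared above can be sketched in plain C++. This is only an illustration, not the code generated by the pass: `TaskQueue` is a hypothetical single-threaded stand-in for the async runtime, and `ComputeBlockFn`, `asyncDispatch` and `seqDispatch` are names assumed for this sketch. With recursive splitting the caller submits only a few tasks itself, and each submitted task keeps splitting its own half of the block range.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical single-threaded stand-in for an async runtime: `submit` queues
// a task (playing the role of async.execute) and `drain` runs queued tasks,
// including tasks submitted while draining, until the queue is empty.
struct TaskQueue {
  std::deque<std::function<void()>> tasks;
  int64_t submitted = 0;

  void submit(std::function<void()> task) {
    tasks.push_back(std::move(task));
    ++submitted;
  }

  void drain() {
    while (!tasks.empty()) {
      std::function<void()> task = std::move(tasks.front());
      tasks.pop_front();
      task();
    }
  }
};

// Hypothetical stand-in for the parallel compute function: processes the
// single compute block with the given index.
using ComputeBlockFn = std::function<void(int64_t blockIndex)>;

// Recursive dispatch over the block range [blockStart, blockEnd): hand the
// second half of the range to a new task until one block is left, then
// compute that block in the current task.
void asyncDispatch(TaskQueue &queue, int64_t blockStart, int64_t blockEnd,
                   const ComputeBlockFn &compute) {
  assert(blockStart < blockEnd && "block range must not be empty");
  while (blockEnd - blockStart > 1) {
    int64_t mid = blockStart + (blockEnd - blockStart) / 2;
    queue.submit([&queue, mid, blockEnd, &compute] {
      asyncDispatch(queue, mid, blockEnd, compute);
    });
    blockEnd = mid;
  }
  compute(blockStart);
}

// Sequential dispatch: the caller submits one task per block in a plain loop.
void seqDispatch(TaskQueue &queue, int64_t blockCount,
                 const ComputeBlockFn &compute) {
  for (int64_t i = 0; i < blockCount; ++i)
    queue.submit([i, &compute] { compute(i); });
}
```

For 8 blocks, the caller of `asyncDispatch` submits 3 tasks before computing block 0 itself, while `seqDispatch` submits all 8 from the caller.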

Algorithm outline:

1. Collapse scf.parallel dimensions into a single dimension.
2. Compute the block size for the parallel operations from the 1d problem size.
3. Launch parallel tasks.
4. Each parallel task reconstructs its own bounds in the original multi-dimensional iteration space.
5. Each parallel task computes the original parallel operation body using an scf.for loop nest.
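The outline above can be sketched in plain C++ for the two-dimensional case. This is a simplified model rather than the pass itself: `computeBlock2d` is a hypothetical name, the block size is taken as a given parameter (how the pass derives it is not shown here), and the body receives coordinates in the trip-count space; the original induction variable would be `lowerBound[d] + coord[d] * step[d]`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Converts a one-dimensional index in [0, product(tripCounts)) into
// multi-dimensional coordinates; the last dimension varies fastest.
std::vector<int64_t> delinearize(int64_t index,
                                 const std::vector<int64_t> &tripCounts) {
  assert(!tripCounts.empty() && "tripCounts must not be empty");
  std::vector<int64_t> coords(tripCounts.size());
  for (std::size_t i = tripCounts.size(); i-- > 0;) {
    coords[i] = index % tripCounts[i];
    index /= tripCounts[i];
  }
  return coords;
}

// Runs `body` for every point of one compute block of a 2-D iteration space
// that has been collapsed into a single dimension of size
// tripCounts[0] * tripCounts[1].
void computeBlock2d(int64_t blockIndex, int64_t blockSize,
                    const std::vector<int64_t> &tripCounts,
                    const std::function<void(int64_t, int64_t)> &body) {
  assert(tripCounts.size() == 2 && blockSize > 0);
  int64_t tripCount = tripCounts[0] * tripCounts[1];

  // One-dimensional bounds of the block: [blockFirstIndex, blockLastIndex].
  int64_t blockFirstIndex = blockIndex * blockSize;
  assert(blockFirstIndex < tripCount && "block index out of range");
  int64_t blockLastIndex =
      std::min(blockFirstIndex + blockSize, tripCount) - 1;

  // Reconstruct the block bounds in the original 2-D space.
  std::vector<int64_t> first = delinearize(blockFirstIndex, tripCounts);
  std::vector<int64_t> last = delinearize(blockLastIndex, tripCounts);

  for (int64_t i = first[0]; i <= last[0]; ++i) {
    // Only the first outer iteration starts the inner loop at the block start,
    // and only the last outer iteration stops it at the block end.
    int64_t jBegin = (i == first[0]) ? first[1] : 0;
    int64_t jEnd = (i == last[0]) ? last[1] + 1 : tripCounts[1];
    for (int64_t j = jBegin; j < jEnd; ++j)
      body(i, j);
  }
}
```

Calling `computeBlock2d` once per block index in `[0, ceil(tripCount / blockSize))` visits every point of the iteration space exactly once, including the final, possibly shorter, block.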

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ezhulenev created this revision. · Jun 24 2021, 5:27 AM

Herald added subscribers: dcaballe, cota, teijeong and 16 others. · Jun 24 2021, 5:27 AM

ezhulenev requested review of this revision. · Jun 24 2021, 5:27 AM

Herald added a project: Restricted Project. · Jun 24 2021, 5:27 AM

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

ezhulenev edited the summary of this revision. · Jun 24 2021, 5:31 AM

ezhulenev added reviewers: herhut, mehdi_amini.

Harbormaster completed remote builds in B110805: Diff 354222. · Jun 24 2021, 6:09 AM

Very neat! Just some nits.

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
137	Longer term this could become an op, as this functionality is needed frequently. Not here, though.
209	Why is this needed?
230	I see the convenience of this but I'd prefer if there was no invisible side-effect on offset.
242	`b.create<ConstantIndexOp>(0)`
246	I wonder whether `ArrayRef` would be the better abstraction here.
267	how about `max(blockFirstIndex + blockSize, tripCount)`?
278	nit: double compute?
282	nit: contains, multiple
322	nit: or
387	Is this needed?
415	nit: 'ConstantIndexOp'
446	nit: the second half
453	nit: use `start`?
475	nit: use `start`?
507	`ConstantIndexOp`
544	`ConstantIndexOp`
682	Some more ConstantIndexOp opportunities
690	Maybe use ceil_div instead of divup?
697	Out of curiosity: Why do you use a signed ceildiv everywhere? Both operands are known to be positive here and in most of the rest of this file. Wouldn't the unsigned one suffice here?
714	nit replaced.
mlir/test/Dialect/Async/async-parallel-for-async-dispatch.mlir
26	This test is super sparse. Why even capture `S` and `E` here, as they are not used? Unless you want to ensure that the right block numbers are passed, but that needs a less sparse test.
mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir
6	Is this broken by this change? Is that because we now pass groups through functions?

Resolve comments

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
209	Removed. Leftovers that I forgot to cleanup.
230	I added a note that it updates the offset, but don't see any other way to keep the uses one-liners and make it more explicit.
242	Cool, didn't know it exists. Updated all constants.
246	Updated all callsites.
697	Step can be negative, so I used SignedCeilDiv there for correctness, and everywhere else just for "consistency" :)
mlir/test/Dialect/Async/async-parallel-for-async-dispatch.mlir
26	I added a bit more details, but still decided not to add checks for all of the generated IR, just checked the main structure. It's just too many checks, and I find it easier to rely on the mlir integration tests that run the code.
mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir
6	There is a bug in reference counting optimization, it incorrectly removes a pair of addRef<->dropRef and destroys the group too early. Will send a fix in one of the next PRs.
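The reply about line 697 (signed ceil division because the step can be negative) can be illustrated with a small C++ sketch. The helper names `ceilDivSigned` and `tripCount` are assumptions made for this example; the pass itself emits `SignedCeilDivIOp` operations rather than calling functions like these.

```cpp
#include <cassert>
#include <cstdint>

// Signed ceiling division: rounds the exact quotient toward +infinity.
int64_t ceilDivSigned(int64_t x, int64_t y) {
  assert(y != 0 && "division by zero");
  int64_t q = x / y; // C++ integer division truncates toward zero.
  // Truncation rounds down only when the exact quotient is positive and not
  // an integer, i.e. when the operands have the same sign and the remainder
  // is non-zero.
  if (x % y != 0 && ((x < 0) == (y < 0)))
    ++q;
  return q;
}

// Number of iterations of a loop running from `lb` towards `ub` with `step`,
// following the tripCount = ceildiv(upperBound - lowerBound, step) formula.
int64_t tripCount(int64_t lb, int64_t ub, int64_t step) {
  return ceilDivSigned(ub - lb, step);
}
```

With a negative step both `ub - lb` and `step` are negative, so an unsigned interpretation of the operands would not give the intended count, whereas the signed version handles both directions.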

ezhulenev added a child revision: D104891: [mlir:Async] Remove async operations if it is statically known that the parallel operation has a single compute block. · Jun 24 2021, 5:48 PM

Revert accidental change

Harbormaster completed remote builds in B110931: Diff 354408. · Jun 24 2021, 6:25 PM

Thanks!

This revision is now accepted and ready to land. · Jun 25 2021, 5:19 AM

Closed by commit rG86ad0af87054: [mlir:Async] Implement recursive async work splitting for scf.parallel… (authored by ezhulenev). · Jun 25 2021, 10:34 AM

This revision was automatically updated to reflect the committed changes.

ezhulenev added a commit: rG86ad0af87054: [mlir:Async] Implement recursive async work splitting for scf.parallel….

Revision Contents

Path                                                                                  Size
mlir/include/mlir/Dialect/Async/Passes.h                                              4 lines
mlir/include/mlir/Dialect/Async/Passes.td                                             21 lines
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp                                753 lines
mlir/test/Dialect/Async/async-parallel-for-async-dispatch.mlir                        72 lines
mlir/test/Dialect/Async/async-parallel-for-seq-dispatch.mlir                          39 lines
    (listed together with async-parallel-for.mlir)
mlir/test/Dialect/Async/async-parallel-for.mlir                                       (no size listed)
mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir     8 lines
mlir/test/Integration/Dialect/Async/CPU/microbench-scf-async-parallel-for.mlir        57 lines
    (listed together with microbench-linalg-async-parallel-for.mlir)
mlir/test/Integration/Dialect/Async/CPU/test-async-parallel-for-1d.mlir               17 lines
mlir/test/Integration/Dialect/Async/CPU/test-async-parallel-for-2d.mlir               17 lines

Diff 354408

mlir/include/mlir/Dialect/Async/Passes.h

	(13 lines not shown)
	#define MLIR_DIALECT_ASYNC_PASSES_H_			#define MLIR_DIALECT_ASYNC_PASSES_H_

	#include "mlir/Pass/Pass.h"			#include "mlir/Pass/Pass.h"

	namespace mlir {			namespace mlir {

	std::unique_ptr<Pass> createAsyncParallelForPass();			std::unique_ptr<Pass> createAsyncParallelForPass();

	std::unique_ptr<Pass> createAsyncParallelForPass(int numWorkerThreads);			std::unique_ptr<Pass> createAsyncParallelForPass(bool asyncDispatch,
				int32_t numWorkerThreads,
				int32_t targetBlockSize);

	std::unique_ptr<OperationPass<ModuleOp>> createAsyncToAsyncRuntimePass();			std::unique_ptr<OperationPass<ModuleOp>> createAsyncToAsyncRuntimePass();

	std::unique_ptr<Pass> createAsyncRuntimeRefCountingPass();			std::unique_ptr<Pass> createAsyncRuntimeRefCountingPass();

	std::unique_ptr<Pass> createAsyncRuntimeRefCountingOptPass();			std::unique_ptr<Pass> createAsyncRuntimeRefCountingOptPass();

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	(10 lines not shown)

mlir/include/mlir/Dialect/Async/Passes.td

	//===-- Passes.td - Async pass definition file -------------- tablegen --===//			//===-- Passes.td - Async pass definition file -------------- tablegen --===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef MLIR_DIALECT_ASYNC_PASSES			#ifndef MLIR_DIALECT_ASYNC_PASSES
	#define MLIR_DIALECT_ASYNC_PASSES			#define MLIR_DIALECT_ASYNC_PASSES

	include "mlir/Pass/PassBase.td"			include "mlir/Pass/PassBase.td"

	def AsyncParallelFor : Pass<"async-parallel-for"> {			def AsyncParallelFor : Pass<"async-parallel-for"> {
	let summary = "Convert scf.parallel operations to multiple async regions "			let summary = "Convert scf.parallel operations to multiple async compute ops "
	"executed concurrently for non-overlapping iteration ranges";			"executed concurrently for non-overlapping iteration ranges";
	let constructor = "mlir::createAsyncParallelForPass()";			let constructor = "mlir::createAsyncParallelForPass()";

	let options = [			let options = [
	Option<"numConcurrentAsyncExecute", "num-concurrent-async-execute",			Option<"asyncDispatch", "async-dispatch",
	"int32_t", /*default=*/"4",			"bool", /*default=*/"true",
	"The number of async.execute operations that will be used for concurrent "			"Dispatch async compute tasks using recursive work splitting. If `false` "
	"loop execution.">			"async compute tasks will be launched using simple for loop in the "
				"caller thread.">,

				Option<"numWorkerThreads", "num-workers",
				"int32_t", /*default=*/"8",
				"The number of available workers to execute async operations.">,

				Option<"targetBlockSize", "target-block-size",
				"int32_t", /*default=*/"1000",
				"The target block size for sharding parallel operation.">
	];			];

	let dependentDialects = ["async::AsyncDialect", "scf::SCFDialect"];			let dependentDialects = ["async::AsyncDialect", "scf::SCFDialect"];
	}			}

	def AsyncToAsyncRuntime : Pass<"async-to-async-runtime", "ModuleOp"> {			def AsyncToAsyncRuntime : Pass<"async-to-async-runtime", "ModuleOp"> {
	let summary = "Lower high level async operations (e.g. async.execute) to the"			let summary = "Lower high level async operations (e.g. async.execute) to the"
	"explicit async.runtime and async.coro operations";			"explicit async.runtime and async.coro operations";
	let constructor = "mlir::createAsyncToAsyncRuntimePass()";			let constructor = "mlir::createAsyncToAsyncRuntimePass()";
	let dependentDialects = ["async::AsyncDialect"];			let dependentDialects = ["async::AsyncDialect"];
	(27 lines not shown)

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp

	//===- AsyncParallelFor.cpp - Implementation of Async Parallel For --------===//			//===- AsyncParallelFor.cpp - Implementation of Async Parallel For --------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file implements scf.parallel to src.for + async.execute conversion pass.			// This file implements scf.parallel to scf.for + async.execute conversion pass.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "PassDetail.h"			#include "PassDetail.h"
	#include "mlir/Dialect/Async/IR/Async.h"			#include "mlir/Dialect/Async/IR/Async.h"
	#include "mlir/Dialect/Async/Passes.h"			#include "mlir/Dialect/Async/Passes.h"
	#include "mlir/Dialect/SCF/SCF.h"			#include "mlir/Dialect/SCF/SCF.h"
	#include "mlir/Dialect/StandardOps/IR/Ops.h"			#include "mlir/Dialect/StandardOps/IR/Ops.h"
	#include "mlir/IR/BlockAndValueMapping.h"			#include "mlir/IR/BlockAndValueMapping.h"
				#include "mlir/IR/ImplicitLocOpBuilder.h"
	#include "mlir/IR/PatternMatch.h"			#include "mlir/IR/PatternMatch.h"
	#include "mlir/Transforms/GreedyPatternRewriteDriver.h"			#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
				#include "mlir/Transforms/RegionUtils.h"

	using namespace mlir;			using namespace mlir;
	using namespace mlir::async;			using namespace mlir::async;

	#define DEBUG_TYPE "async-parallel-for"			#define DEBUG_TYPE "async-parallel-for"

	namespace {			namespace {

	// Rewrite scf.parallel operation into multiple concurrent async.execute			// Rewrite scf.parallel operation into multiple concurrent async.execute
	// operations over non overlapping subranges of the original loop.			// operations over non overlapping subranges of the original loop.
	//			//
	// Example:			// Example:
	//			//
	// scf.for (%i, %j) = (%lbi, %lbj) to (%ubi, %ubj) step (%si, %sj) {			// scf.parallel (%i, %j) = (%lbi, %lbj) to (%ubi, %ubj) step (%si, %sj) {
	// "do_some_compute"(%i, %j): () -> ()			// "do_some_compute"(%i, %j): () -> ()
	// }			// }
	//			//
	// Converted to:			// Converted to:
	//			//
	// %c0 = constant 0 : index			// // Parallel compute function that executes the parallel body region for
	// %c1 = constant 1 : index			// // a subset of the parallel iteration space defined by the one-dimensional
	//			// // compute block index.
	// // Compute blocks sizes for each induction variable.			// func parallel_compute_function(%block_index : index, %block_size : index,
	// %num_blocks_i = ... : index			// <parallel operation properties>, ...) {
	// %num_blocks_j = ... : index			// // Compute multi-dimensional loop bounds for %block_index.
	// %block_size_i = ... : index			// %block_lbi, %block_lbj = ...
	// %block_size_j = ... : index			// %block_ubi, %block_ubj = ...
	//			//
	// // Create an async group to track async execute ops.			// // Clone parallel operation body into the scf.for loop nest.
	// %group = async.create_group			// scf.for %i = %block_lbi to %block_ubi {
	//			// scf.for %j = %block_lbj to %block_ubj {
	// scf.for %bi = %c0 to %num_blocks_i step %c1 {
	// %block_start_i = ... : index
	// %block_end_i = ... : index
	//
	// scf.for %bj = %c0 to %num_blocks_j step %c1 {
	// %block_start_j = ... : index
	// %block_end_j = ... : index
	//
	// // Execute the body of original parallel operation for the current
	// // block.
	// %token = async.execute {
	// scf.for %i = %block_start_i to %block_end_i step %si {
	// scf.for %j = %block_start_j to %block_end_j step %sj {
	// "do_some_compute"(%i, %j): () -> ()			// "do_some_compute"(%i, %j): () -> ()
	// }			// }
	// }			// }
	// }			// }
	//			//
	// // Add produced async token to the group.			// And a dispatch function depending on the `asyncDispatch` option.
	// async.add_to_group %token, %group			//
				// When async dispatch is on: (pseudocode)
				//
				// %block_size = ... compute parallel compute block size
				// %block_count = ... compute the number of compute blocks
				//
				// func @async_dispatch(%block_start : index, %block_end : index, ...) {
				// // Keep splitting block range until we reached a range of size 1.
				// while (%block_end - %block_start > 1) {
				// %mid_index = block_start + (block_end - block_start) / 2;
				// async.execute { call @async_dispatch(%mid_index, %block_end); }
				// %block_end = %mid_index
	// }			// }
				//
				// // Call parallel compute function for a single block.
				// call @parallel_compute_fn(%block_start, %block_size, ...);
	// }			// }
	//			//
	// // Await completion of all async.execute operations.			// // Launch async dispatch for [0, block_count) range.
	// async.await_all %group			// call @async_dispatch(%c0, %block_count);
				//
				// When async dispatch is off:
	//			//
	// In this example outer loop launches inner block level loops as separate async			// %block_size = ... compute parallel compute block size
	// execute operations which will be executed concurrently.			// %block_count = ... compute the number of compute blocks
	//			//
	// At the end it waits for the completiom of all async execute operations.			// scf.for %block_index = %c0 to %block_count {
				// call @parallel_compute_fn(%block_index, %block_size, ...)
				// }
	//			//
				struct AsyncParallelForPass
				: public AsyncParallelForBase<AsyncParallelForPass> {
				AsyncParallelForPass() = default;

				AsyncParallelForPass(bool asyncDispatch, int32_t numWorkerThreads,
				int32_t targetBlockSize) {
				this->asyncDispatch = asyncDispatch;
				this->numWorkerThreads = numWorkerThreads;
				this->targetBlockSize = targetBlockSize;
				}

				void runOnOperation() override;
				};

	struct AsyncParallelForRewrite : public OpRewritePattern<scf::ParallelOp> {			struct AsyncParallelForRewrite : public OpRewritePattern<scf::ParallelOp> {
	public:			public:
	AsyncParallelForRewrite(MLIRContext *ctx, int numConcurrentAsyncExecute)			AsyncParallelForRewrite(MLIRContext *ctx, bool asyncDispatch,
	: OpRewritePattern(ctx),			int32_t numWorkerThreads, int32_t targetBlockSize)
	numConcurrentAsyncExecute(numConcurrentAsyncExecute) {}			: OpRewritePattern(ctx), asyncDispatch(asyncDispatch),
				numWorkerThreads(numWorkerThreads), targetBlockSize(targetBlockSize) {}

	LogicalResult matchAndRewrite(scf::ParallelOp op,			LogicalResult matchAndRewrite(scf::ParallelOp op,
	PatternRewriter &rewriter) const override;			PatternRewriter &rewriter) const override;

	private:			private:
	int numConcurrentAsyncExecute;			// The maximum number of tasks per worker thread when sharding parallel op.
				static constexpr int32_t kMaxOversharding = 4;

				bool asyncDispatch;
				int32_t numWorkerThreads;
				int32_t targetBlockSize;
	};			};

	struct AsyncParallelForPass			struct ParallelComputeFunctionType {
	: public AsyncParallelForBase<AsyncParallelForPass> {			FunctionType type;
	AsyncParallelForPass() = default;			llvm::SmallVector<Value> captures;
	AsyncParallelForPass(int numWorkerThreads) {			};
	assert(numWorkerThreads >= 1);
	numConcurrentAsyncExecute = numWorkerThreads;			struct ParallelComputeFunction {
	}			FuncOp func;
	void runOnOperation() override;			llvm::SmallVector<Value> captures;
	};			};

	} // namespace			} // namespace

	LogicalResult			// Converts one-dimensional iteration index in the [0, tripCount) interval
	AsyncParallelForRewrite::matchAndRewrite(scf::ParallelOp op,			// into multidimensional iteration coordinate.
	PatternRewriter &rewriter) const {			static SmallVector<Value> delinearize(ImplicitLocOpBuilder &b, Value index,
				[inline comment] herhut: Longer term this could become an op, as this functionality is needed frequently. Not here, though.
	// We do not currently support rewrite for parallel op with reductions.			ArrayRef<Value> tripCounts) {
	if (op.getNumReductions() != 0)			SmallVector<Value> coords(tripCounts.size());
	return failure();			assert(!tripCounts.empty() && "tripCounts must be not empty");

				for (ssize_t i = tripCounts.size() - 1; i >= 0; --i) {
				coords[i] = b.create<SignedRemIOp>(index, tripCounts[i]);
				index = b.create<SignedDivIOp>(index, tripCounts[i]);
				}

	MLIRContext *ctx = op.getContext();			return coords;
	Location loc = op.getLoc();			}

	// Index constants used below.			// Returns a function type and implicit captures for a parallel compute
	auto indexTy = IndexType::get(ctx);			// function. We'll need a list of implicit captures to setup block and value
	auto zero = IntegerAttr::get(indexTy, 0);			// mapping when we'll clone the body of the parallel operation.
	auto one = IntegerAttr::get(indexTy, 1);			static ParallelComputeFunctionType
	auto c0 = rewriter.create<ConstantOp>(loc, indexTy, zero);			getParallelComputeFunctionType(scf::ParallelOp op, PatternRewriter &rewriter) {
	auto c1 = rewriter.create<ConstantOp>(loc, indexTy, one);			// Values implicitly captured by the parallel operation.
				llvm::SetVector<Value> captures;
	// Shorthand for signed integer ceil division operation.			getUsedValuesDefinedAbove(op.region(), op.region(), captures);
	auto divup = [&](Value x, Value y) -> Value {
	return rewriter.create<SignedCeilDivIOp>(loc, x, y);			llvm::SmallVector<Type> inputs;
	};			inputs.reserve(2 + 4 * op.getNumLoops() + captures.size());

				Type indexTy = rewriter.getIndexType();

				// One-dimensional iteration space defined by the block index and size.
				inputs.push_back(indexTy); // blockIndex
				inputs.push_back(indexTy); // blockSize

				// Multi-dimensional parallel iteration space defined by the loop trip counts.
				for (unsigned i = 0; i < op.getNumLoops(); ++i)
				inputs.push_back(indexTy); // loop tripCount

				// Parallel operation lower bound, upper bound and step.
				for (unsigned i = 0; i < op.getNumLoops(); ++i) {
				inputs.push_back(indexTy); // lower bound
				inputs.push_back(indexTy); // upper bound
				inputs.push_back(indexTy); // step
				}

	// Compute trip count for each loop induction variable:			// Types of the implicit captures.
	// tripCount = divUp(upperBound - lowerBound, step);			for (Value capture : captures)
	SmallVector<Value, 4> tripCounts(op.getNumLoops());			inputs.push_back(capture.getType());
	for (size_t i = 0; i < op.getNumLoops(); ++i) {
	auto lb = op.lowerBound()[i];			// Convert captures to vector for later convenience.
	auto ub = op.upperBound()[i];			SmallVector<Value> capturesVector(captures.begin(), captures.end());
	auto step = op.step()[i];			return {rewriter.getFunctionType(inputs, TypeRange()), capturesVector};
	auto range = rewriter.create<SubIOp>(loc, ub, lb);
	tripCounts[i] = divup(range, step);
	}			}

	// The target number of concurrent async.execute ops.			// Create a parallel compute function from the parallel operation.
	auto numExecuteOps = rewriter.create<ConstantOp>(			static ParallelComputeFunction
	loc, indexTy, IntegerAttr::get(indexTy, numConcurrentAsyncExecute));			createParallelComputeFunction(scf::ParallelOp op, PatternRewriter &rewriter) {
				OpBuilder::InsertionGuard guard(rewriter);
	// Blocks sizes configuration for each induction variable.			ImplicitLocOpBuilder b(op.getLoc(), rewriter);

	// We try to use maximum available concurrency in outer dimensions first			ModuleOp module = op->getParentOfType<ModuleOp>();
	// (assuming that parallel induction variables are corresponding to some
	// multidimensional access, e.g. in (%d0, %d1, ..., %dn) = (<from>) to (<to>)			ParallelComputeFunctionType computeFuncType =
	// we will try to parallelize iteration along the %d0. If %d0 is too small,			getParallelComputeFunctionType(op, rewriter);
	// we'll parallelize iteration over %d1, and so on.
	SmallVector<Value, 4> targetNumBlocks(op.getNumLoops());			FunctionType type = computeFuncType.type;
	SmallVector<Value, 4> blockSize(op.getNumLoops());			FuncOp func = FuncOp::create(op.getLoc(), "parallel_compute_fn", type);
	SmallVector<Value, 4> numBlocks(op.getNumLoops());			func.setPrivate();

	// Compute block size and number of blocks along the first induction variable.			// Insert function into the module symbol table and assign it unique name.
	targetNumBlocks[0] = numExecuteOps;			SymbolTable symbolTable(module);
	blockSize[0] = divup(tripCounts[0], targetNumBlocks[0]);			symbolTable.insert(func);
	numBlocks[0] = divup(tripCounts[0], blockSize[0]);			rewriter.getListener()->notifyOperationInserted(func);

	// Assign remaining available concurrency to other induction variables.			// Create function entry block.
	for (size_t i = 1; i < op.getNumLoops(); ++i) {			Block *block = b.createBlock(&func.getBody(), func.begin(), type.getInputs());
				[inline comment] herhut: Why is this needed?
				[inline comment] ezhulenev: Removed. Leftovers that I forgot to cleanup.
	targetNumBlocks[i] = divup(targetNumBlocks[i - 1], numBlocks[i - 1]);			b.setInsertionPointToEnd(block);
	blockSize[i] = divup(tripCounts[i], targetNumBlocks[i]);
	numBlocks[i] = divup(tripCounts[i], blockSize[i]);			unsigned offset = 0; // argument offset for arguments decoding
	}
				// Returns `numArguments` arguments starting from `offset` and updates offset
	// Total number of async compute blocks.			// by moving forward to the next argument.
	Value totalBlocks = numBlocks[0];			auto getArguments = [&](unsigned numArguments) -> ArrayRef<Value> {
	for (size_t i = 1; i < op.getNumLoops(); ++i)			auto args = block->getArguments();
	totalBlocks = rewriter.create<MulIOp>(loc, totalBlocks, numBlocks[i]);			auto slice = args.drop_front(offset).take_front(numArguments);
				offset += numArguments;
	// Create an async.group to wait on all async tokens from async execute ops.			return {slice.begin(), slice.end()};
	auto group =			};
	rewriter.create<CreateGroupOp>(loc, GroupType::get(ctx), totalBlocks);
				// Block iteration position defined by the block index and size.
	// Build a scf.for loop nest from the parallel operation.			Value blockIndex = block->getArgument(offset++);
				Value blockSize = block->getArgument(offset++);
	// Lower/upper bounds for nest block level computations.
	SmallVector<Value, 4> blockLowerBounds(op.getNumLoops());			// Constants used below.
	SmallVector<Value, 4> blockUpperBounds(op.getNumLoops());			Value c0 = b.create<ConstantIndexOp>(0);
	SmallVector<Value, 4> blockInductionVars(op.getNumLoops());			Value c1 = b.create<ConstantIndexOp>(1);

				[inline comment] herhut: I see the convenience of this but I'd prefer if there was no invisible side-effect on offset.
				[inline comment] ezhulenev: I added a note that it updates the offset, but don't see any other way to keep the uses one-liners and make it more explicit.
				// Multi-dimensional parallel iteration space defined by the loop trip counts.
				ArrayRef<Value> tripCounts = getArguments(op.getNumLoops());

				// Compute a product of trip counts to get the size of the flattened
				// one-dimensional iteration space.
				Value tripCount = tripCounts[0];
				for (unsigned i = 1; i < tripCounts.size(); ++i)
				tripCount = b.create<MulIOp>(tripCount, tripCounts[i]);

				// Parallel operation lower bound and step.
				ArrayRef<Value> lowerBound = getArguments(op.getNumLoops());
				offset += op.getNumLoops(); // skip upper bound arguments
				[inline comment] herhut: `b.create<ConstantIndexOp>(0)`
				[inline comment] ezhulenev: Cool, didn't know it exists. Updated all constants.
				ArrayRef<Value> step = getArguments(op.getNumLoops());

				// Remaining arguments are implicit captures of the parallel operation.
				ArrayRef<Value> captures = getArguments(block->getNumArguments() - offset);
				[inline comment] herhut: I wonder whether `ArrayRef` would be the better abstraction here.
				[inline comment] ezhulenev: Updated all callsites.

				// Find one-dimensional iteration bounds: [blockFirstIndex, blockLastIndex]:
				// blockFirstIndex = blockIndex * blockSize
				Value blockFirstIndex = b.create<MulIOp>(blockIndex, blockSize);

				// The last one-dimensional index in the block defined by the `blockIndex`:
				// blockLastIndex = min(blockFirstIndex + blockSize, tripCount) - 1
				Value blockEnd0 = b.create<AddIOp>(blockFirstIndex, blockSize);
				Value blockEnd1 = b.create<CmpIOp>(CmpIPredicate::sge, blockEnd0, tripCount);
				Value blockEnd2 = b.create<SelectOp>(blockEnd1, tripCount, blockEnd0);
				Value blockLastIndex = b.create<SubIOp>(blockEnd2, c1);

				// Convert one-dimensional indices to multi-dimensional coordinates.
				auto blockFirstCoord = delinearize(b, blockFirstIndex, tripCounts);
				auto blockLastCoord = delinearize(b, blockLastIndex, tripCounts);

				// Compute loops upper bounds derived from the block last coordinates:
				// blockEndCoord[i] = blockLastCoord[i] + 1
				//
				// Block first and last coordinates can be the same along the outer compute
				// dimension when inner compute dimension contains multiple blocks.
				[inline comment] herhut: how about `max(blockFirstIndex + blockSize, tripCount)`?
				SmallVector<Value> blockEndCoord(op.getNumLoops());
				for (size_t i = 0; i < blockLastCoord.size(); ++i)
				blockEndCoord[i] = b.create<AddIOp>(blockLastCoord[i], c1);

				// Construct a loop nest out of scf.for operations that will iterate over
				// all coordinates in [blockFirstCoord, blockLastCoord] range.
	using LoopBodyBuilder =			using LoopBodyBuilder =
	std::function<void(OpBuilder &, Location, Value, ValueRange)>;			std::function<void(OpBuilder &, Location, Value, ValueRange)>;
	using LoopBuilder = std::function<LoopBodyBuilder(size_t loopIdx)>;			using LoopNestBuilder = std::function<LoopBodyBuilder(size_t loopIdx)>;

				// Parallel region induction variables computed from the multi-dimensional
				[inline comment] herhut: nit: double compute?
				// iteration coordinate using parallel operation bounds and step:
				//
				// computeBlockInductionVars[loopIdx] =
				// lowerBound[loopIdx] + blockCoord[loopIdx] * step[loopIdx]
				[inline comment] herhut: nit: contains, multiple
				SmallVector<Value> computeBlockInductionVars(op.getNumLoops());

				// We need to know if we are in the first or last iteration of the
				// multi-dimensional loop for each loop in the nest, so we can decide what
				// loop bounds should we use for the nested loops: bounds defined by compute
				// block interval, or bounds defined by the parallel operation.
				//
				// Example: 2d parallel operation
				// i j
				// loop sizes: [50, 50]
				// first coord: [25, 25]
				// last coord: [30, 30]
				//
				// If `i` is equal to 25 then iteration over `j` should start at 25, when `i`
				// is between 25 and 30 it should start at 0. The upper bound for `j` should
				// be 50, except when `i` is equal to 30, then it should also be 30.
				//
				// Value at ith position specifies if all loops in [0, i) range of the loop
				// nest are in the first/last iteration.
				SmallVector<Value> isBlockFirstCoord(op.getNumLoops());
				SmallVector<Value> isBlockLastCoord(op.getNumLoops());

	// Builds inner loop nest inside async.execute operation that does all the			// Builds inner loop nest inside async.execute operation that does all the
	// work concurrently.			// work concurrently.
	LoopBuilder workLoopBuilder = [&](size_t loopIdx) -> LoopBodyBuilder {			LoopNestBuilder workLoopBuilder = [&](size_t loopIdx) -> LoopBodyBuilder {
	return [&, loopIdx](OpBuilder &b, Location loc, Value iv, ValueRange args) {			return [&, loopIdx](OpBuilder &nestedBuilder, Location loc, Value iv,
	blockInductionVars[loopIdx] = iv;			ValueRange args) {
				ImplicitLocOpBuilder nb(loc, nestedBuilder);

				// Compute induction variable for `loopIdx`.
				computeBlockInductionVars[loopIdx] = nb.create<AddIOp>(
				lowerBound[loopIdx], nb.create<MulIOp>(iv, step[loopIdx]));

				// Check if we are inside first or last iteration of the loop.
				isBlockFirstCoord[loopIdx] =
				nb.create<CmpIOp>(CmpIPredicate::eq, iv, blockFirstCoord[loopIdx]);
				isBlockLastCoord[loopIdx] =
				nb.create<CmpIOp>(CmpIPredicate::eq, iv, blockLastCoord[loopIdx]);

				// Check if the previous loop is in its first or last iteration.
				[inline comment] herhut: nit: or
				if (loopIdx > 0) {
				isBlockFirstCoord[loopIdx] = nb.create<AndOp>(
				isBlockFirstCoord[loopIdx], isBlockFirstCoord[loopIdx - 1]);
				isBlockLastCoord[loopIdx] = nb.create<AndOp>(
				isBlockLastCoord[loopIdx], isBlockLastCoord[loopIdx - 1]);
				}

	// Continue building async loop nest.			// Keep building loop nest.
	if (loopIdx < op.getNumLoops() - 1) {			if (loopIdx < op.getNumLoops() - 1) {
	b.create<scf::ForOp>(			// Select nested loop lower/upper bounds depending on our position in
	loc, blockLowerBounds[loopIdx + 1], blockUpperBounds[loopIdx + 1],			// the multi-dimensional iteration space.
	op.step()[loopIdx + 1], ValueRange(), workLoopBuilder(loopIdx + 1));			auto lb = nb.create<SelectOp>(isBlockFirstCoord[loopIdx],
	b.create<scf::YieldOp>(loc);			blockFirstCoord[loopIdx + 1], c0);

				auto ub = nb.create<SelectOp>(isBlockLastCoord[loopIdx],
				blockEndCoord[loopIdx + 1],
				tripCounts[loopIdx + 1]);

				nb.create<scf::ForOp>(lb, ub, c1, ValueRange(),
				workLoopBuilder(loopIdx + 1));
				nb.create<scf::YieldOp>(loc);
	return;			return;
	}			}

	// Copy the body of the parallel op with new loop bounds.			// Copy the body of the parallel op into the inner-most loop.
	BlockAndValueMapping mapping;			BlockAndValueMapping mapping;
	mapping.map(op.getInductionVars(), blockInductionVars);			mapping.map(op.getInductionVars(), computeBlockInductionVars);
				mapping.map(computeFuncType.captures, captures);

	for (auto &bodyOp : op.getLoopBody().getOps())			for (auto &bodyOp : op.getLoopBody().getOps())
	b.clone(bodyOp, mapping);			nb.clone(bodyOp, mapping);
	};			};
	};			};

	// Builds a loop nest that does async execute op dispatching.			b.create<scf::ForOp>(blockFirstCoord[0], blockEndCoord[0], c1, ValueRange(),
	LoopBuilder asyncLoopBuilder = [&](size_t loopIdx) -> LoopBodyBuilder {			workLoopBuilder(0));
	return [&, loopIdx](OpBuilder &b, Location loc, Value iv, ValueRange args) {			b.create<ReturnOp>(ValueRange());
	auto lb = op.lowerBound()[loopIdx];
	auto ub = op.upperBound()[loopIdx];
	auto step = op.step()[loopIdx];

	// Compute lower bound for the current block:
	// blockLowerBound = iv * blockSize * step + lowerBound
	auto s0 = b.create<MulIOp>(loc, iv, blockSize[loopIdx]);
	auto s1 = b.create<MulIOp>(loc, s0, step);
	auto s2 = b.create<AddIOp>(loc, s1, lb);
	blockLowerBounds[loopIdx] = s2;

	// Compute upper bound for the current block:
	// blockUpperBound = min(upperBound,
	// blockLowerBound + blockSize * step)
	auto e0 = b.create<MulIOp>(loc, blockSize[loopIdx], step);
	auto e1 = b.create<AddIOp>(loc, e0, s2);
	auto e2 = b.create<CmpIOp>(loc, CmpIPredicate::slt, e1, ub);
	auto e3 = b.create<SelectOp>(loc, e2, e1, ub);
	blockUpperBounds[loopIdx] = e3;

	// Continue building async dispatch loop nest.			return {func, std::move(computeFuncType.captures)};
	if (loopIdx < op.getNumLoops() - 1) {
	b.create<scf::ForOp>(loc, c0, numBlocks[loopIdx + 1], c1, ValueRange(),
	asyncLoopBuilder(loopIdx + 1));
	b.create<scf::YieldOp>(loc);
	return;
	}			}

	// Build the inner loop nest that will do the actual work inside the			// Creates recursive async dispatch function for the given parallel compute
	// `async.execute` body region.			// function. Dispatch function keeps splitting block range into halves until it
				// reaches a single block, and then executes it inline.
				//
				// Function pseudocode (mix of C++ and MLIR):
				//
				// func @async_dispatch(%block_start : index, %block_end : index, ...) {
				//
				// // Keep splitting block range until we reached a range of size 1.
				// while (%block_end - %block_start > 1) {
				// %mid_index = %block_start + (%block_end - %block_start) / 2;
				// async.execute { call @async_dispatch(%mid_index, %block_end); }
				// %block_end = %mid_index
				// }
				//
				// // Call parallel compute function for a single block.
				// call @parallel_compute_fn(%block_start, %block_size, ...);
				// }
				//
				static FuncOp createAsyncDispatchFunction(ParallelComputeFunction &computeFunc,
				PatternRewriter &rewriter) {
				OpBuilder::InsertionGuard guard(rewriter);
				Location loc = computeFunc.func.getLoc();
				ImplicitLocOpBuilder b(loc, rewriter);
				herhut: Is this needed?

				ModuleOp module = computeFunc.func->getParentOfType<ModuleOp>();

				ArrayRef<Type> computeFuncInputTypes =
				computeFunc.func.type().cast<FunctionType>().getInputs();

				// Compared to the parallel compute function async dispatch function takes
				// additional !async.group argument. Also instead of a single `blockIndex` it
				// takes `blockStart` and `blockEnd` arguments to define the range of
				// dispatched blocks.
				SmallVector<Type> inputTypes;
				inputTypes.push_back(async::GroupType::get(rewriter.getContext()));
				inputTypes.push_back(rewriter.getIndexType()); // add blockStart argument
				inputTypes.append(computeFuncInputTypes.begin(), computeFuncInputTypes.end());

				FunctionType type = rewriter.getFunctionType(inputTypes, TypeRange());
				FuncOp func = FuncOp::create(loc, "async_dispatch_fn", type);
				func.setPrivate();

				// Insert function into the module symbol table and assign it unique name.
				SymbolTable symbolTable(module);
				symbolTable.insert(func);
				rewriter.getListener()->notifyOperationInserted(func);

				// Create function entry block.
				Block *block = b.createBlock(&func.getBody(), func.begin(), type.getInputs());
				b.setInsertionPointToEnd(block);

				herhut: nit: 'ConstantIndexOp'
				Type indexTy = b.getIndexType();
				Value c1 = b.create<ConstantIndexOp>(1);
				Value c2 = b.create<ConstantIndexOp>(2);

				// Get the async group that will track async dispatch completion.
				Value group = block->getArgument(0);

				// Get the block iteration range: [blockStart, blockEnd)
				Value blockStart = block->getArgument(1);
				Value blockEnd = block->getArgument(2);

				// Create a work splitting while loop for the [blockStart, blockEnd) range.
				SmallVector<Type> types = {indexTy, indexTy};
				SmallVector<Value> operands = {blockStart, blockEnd};

				// Create a recursive dispatch loop.
				scf::WhileOp whileOp = b.create<scf::WhileOp>(types, operands);
				Block *before = b.createBlock(&whileOp.before(), {}, types);
				Block *after = b.createBlock(&whileOp.after(), {}, types);

				// Setup dispatch loop condition block: decide if we need to go into the
				// `after` block and launch one more async dispatch.
				{
				b.setInsertionPointToEnd(before);
				Value start = before->getArgument(0);
				Value end = before->getArgument(1);
				Value distance = b.create<SubIOp>(end, start);
				Value dispatch = b.create<CmpIOp>(CmpIPredicate::sgt, distance, c1);
				b.create<scf::ConditionOp>(dispatch, before->getArguments());
				}

				herhut: nit: the second half
				// Setup the async dispatch loop body: recursively call dispatch function
				// for the second half of the original range and go to the next iteration.
				{
				b.setInsertionPointToEnd(after);
				Value start = after->getArgument(0);
				Value end = after->getArgument(1);
				Value distance = b.create<SubIOp>(end, start);
				herhut: nit: use `start`?
				Value halfDistance = b.create<SignedDivIOp>(distance, c2);
				Value midIndex = b.create<AddIOp>(start, halfDistance);

				// Call parallel compute function inside the async.execute region.
	auto executeBodyBuilder = [&](OpBuilder &executeBuilder,			auto executeBodyBuilder = [&](OpBuilder &executeBuilder,
	Location executeLoc,			Location executeLoc, ValueRange executeArgs) {
	ValueRange executeArgs) {			// Update the original `blockStart` and `blockEnd` with new range.
	executeBuilder.create<scf::ForOp>(executeLoc, blockLowerBounds[0],			SmallVector<Value> operands{block->getArguments().begin(),
	blockUpperBounds[0], op.step()[0],			block->getArguments().end()};
	ValueRange(), workLoopBuilder(0));			operands[1] = midIndex;
				operands[2] = end;

				executeBuilder.create<CallOp>(executeLoc, func.sym_name(),
				func.getCallableResults(), operands);
	executeBuilder.create<async::YieldOp>(executeLoc, ValueRange());			executeBuilder.create<async::YieldOp>(executeLoc, ValueRange());
	};			};

	auto execute = b.create<ExecuteOp>(			// Create async.execute operation to dispatch half of the block range.
	loc, /*resultTypes=*/TypeRange(), /*dependencies=*/ValueRange(),			auto execute = b.create<ExecuteOp>(TypeRange(), ValueRange(), ValueRange(),
	/*operands=*/ValueRange(), executeBodyBuilder);			executeBodyBuilder);
	auto rankType = IndexType::get(ctx);			b.create<AddToGroupOp>(indexTy, execute.token(), group);
	b.create<AddToGroupOp>(loc, rankType, execute.token(), group.result());			b.create<scf::YieldOp>(ValueRange({start, midIndex}));
				herhut: nit: use `start`?
	b.create<scf::YieldOp>(loc);			}

				// After dispatching async operations to process the tail of the block range
				// call the parallel compute function for the first block of the range.
				b.setInsertionPointAfter(whileOp);

				// Drop async dispatch specific arguments: async group, block start and end.
				auto forwardedInputs = block->getArguments().drop_front(3);
				SmallVector<Value> computeFuncOperands = {blockStart};
				computeFuncOperands.append(forwardedInputs.begin(), forwardedInputs.end());

				b.create<CallOp>(computeFunc.func.sym_name(),
				computeFunc.func.getCallableResults(), computeFuncOperands);
				b.create<ReturnOp>(ValueRange());

				return func;
				}

				// Launch async dispatch of the parallel compute function.
				static void doAsyncDispatch(ImplicitLocOpBuilder &b, PatternRewriter &rewriter,
				ParallelComputeFunction &parallelComputeFunction,
				scf::ParallelOp op, Value blockSize,
				Value blockCount,
				const SmallVector<Value> &tripCounts) {
				MLIRContext *ctx = op->getContext();

				// Add one more level of indirection to dispatch parallel compute functions
				// using async operations and recursive work splitting.
				FuncOp asyncDispatchFunction =
				createAsyncDispatchFunction(parallelComputeFunction, rewriter);

				Value c0 = b.create<ConstantIndexOp>(0);
				herhut: `ConstantIndexOp`
				Value c1 = b.create<ConstantIndexOp>(1);

				// Create an async.group to wait on all async tokens from the concurrent
				// execution of multiple parallel compute functions. First block will be
				// executed synchronously in the caller thread.
				Value groupSize = b.create<SubIOp>(blockCount, c1);
				Value group = b.create<CreateGroupOp>(GroupType::get(ctx), groupSize);

				// Pack the async dispatch function operands to launch the work splitting.
				SmallVector<Value> asyncDispatchOperands = {group, c0, blockCount, blockSize};
				asyncDispatchOperands.append(tripCounts);
				asyncDispatchOperands.append(op.lowerBound().begin(), op.lowerBound().end());
				asyncDispatchOperands.append(op.upperBound().begin(), op.upperBound().end());
				asyncDispatchOperands.append(op.step().begin(), op.step().end());
				asyncDispatchOperands.append(parallelComputeFunction.captures);

				// Launch async dispatch function for [0, blockCount) range.
				b.create<CallOp>(asyncDispatchFunction.sym_name(),
				asyncDispatchFunction.getCallableResults(),
				asyncDispatchOperands);

				// Wait for the completion of all parallel compute operations.
				b.create<AwaitAllOp>(group);
				}

				// Dispatch parallel compute functions by submitting all async compute tasks
				// from a simple for loop in the caller thread.
				static void
				doSequantialDispatch(ImplicitLocOpBuilder &b, PatternRewriter &rewriter,
				ParallelComputeFunction &parallelComputeFunction,
				scf::ParallelOp op, Value blockSize, Value blockCount,
				const SmallVector<Value> &tripCounts) {
				MLIRContext *ctx = op->getContext();

				FuncOp compute = parallelComputeFunction.func;

				Value c0 = b.create<ConstantIndexOp>(0);
				herhut: `ConstantIndexOp`
				Value c1 = b.create<ConstantIndexOp>(1);

				// Create an async.group to wait on all async tokens from the concurrent
				// execution of multiple parallel compute functions. First block will be
				// executed synchronously in the caller thread.
				Value groupSize = b.create<SubIOp>(blockCount, c1);
				Value group = b.create<CreateGroupOp>(GroupType::get(ctx), groupSize);

				// Call parallel compute function for all blocks.
				using LoopBodyBuilder =
				std::function<void(OpBuilder &, Location, Value, ValueRange)>;

				// Returns parallel compute function operands to process the given block.
				auto computeFuncOperands = [&](Value blockIndex) -> SmallVector<Value> {
				SmallVector<Value> computeFuncOperands = {blockIndex, blockSize};
				computeFuncOperands.append(tripCounts);
				computeFuncOperands.append(op.lowerBound().begin(), op.lowerBound().end());
				computeFuncOperands.append(op.upperBound().begin(), op.upperBound().end());
				computeFuncOperands.append(op.step().begin(), op.step().end());
				computeFuncOperands.append(parallelComputeFunction.captures);
				return computeFuncOperands;
	};			};

				// Induction variable is the index of the block: [0, blockCount).
				LoopBodyBuilder loopBuilder = [&](OpBuilder &loopBuilder, Location loc,
				Value iv, ValueRange args) {
				ImplicitLocOpBuilder nb(loc, loopBuilder);

				// Call parallel compute function inside the async.execute region.
				auto executeBodyBuilder = [&](OpBuilder &executeBuilder,
				Location executeLoc, ValueRange executeArgs) {
				executeBuilder.create<CallOp>(executeLoc, compute.sym_name(),
				compute.getCallableResults(),
				computeFuncOperands(iv));
				executeBuilder.create<async::YieldOp>(executeLoc, ValueRange());
				};

				// Create async.execute operation to launch parallel compute function.
				auto execute = nb.create<ExecuteOp>(TypeRange(), ValueRange(), ValueRange(),
				executeBodyBuilder);
				nb.create<AddToGroupOp>(rewriter.getIndexType(), execute.token(), group);
				nb.create<scf::YieldOp>();
	};			};

	// Start building a loop nest from the first induction variable.			// Iterate over all compute blocks and launch parallel compute operations.
	rewriter.create<scf::ForOp>(loc, c0, numBlocks[0], c1, ValueRange(),			b.create<scf::ForOp>(c1, blockCount, c1, ValueRange(), loopBuilder);
	asyncLoopBuilder(0));
				// Call parallel compute function for the first block in the caller thread.
				b.create<CallOp>(compute.sym_name(), compute.getCallableResults(),
				computeFuncOperands(c0));

				// Wait for the completion of all async compute operations.
				b.create<AwaitAllOp>(group);
				}

				LogicalResult
				AsyncParallelForRewrite::matchAndRewrite(scf::ParallelOp op,
				PatternRewriter &rewriter) const {
				// We do not currently support rewrite for parallel op with reductions.
				if (op.getNumReductions() != 0)
				return failure();

				ImplicitLocOpBuilder b(op.getLoc(), rewriter);

	// Wait for the completion of all subtasks.			// Compute trip count for each loop induction variable:
	rewriter.create<AwaitAllOp>(loc, group.result());			// tripCount = ceil_div(upperBound - lowerBound, step);
				SmallVector<Value> tripCounts(op.getNumLoops());
				for (size_t i = 0; i < op.getNumLoops(); ++i) {
				auto lb = op.lowerBound()[i];
				auto ub = op.upperBound()[i];
				auto step = op.step()[i];
				auto range = b.create<SubIOp>(ub, lb);
				tripCounts[i] = b.create<SignedCeilDivIOp>(range, step);
				}

				// Compute a product of trip counts to get the 1-dimensional iteration space
				// for the scf.parallel operation.
				Value tripCount = tripCounts[0];
				for (size_t i = 1; i < tripCounts.size(); ++i)
				tripCount = b.create<MulIOp>(tripCount, tripCounts[i]);

				// Do not overload worker threads with too many compute blocks.
				Value maxComputeBlocks =
				b.create<ConstantIndexOp>(numWorkerThreads * kMaxOversharding);

				// Target block size from the pass parameters.
				Value targetComputeBlockSize = b.create<ConstantIndexOp>(targetBlockSize);

				// Compute parallel block size from the parallel problem size:
				// blockSize = min(tripCount,
				// max(ceil_div(tripCount, maxComputeBlocks),
				// targetComputeBlockSize))
				Value bs0 = b.create<SignedCeilDivIOp>(tripCount, maxComputeBlocks);
				Value bs1 = b.create<CmpIOp>(CmpIPredicate::sge, bs0, targetComputeBlockSize);
				Value bs2 = b.create<SelectOp>(bs1, bs0, targetComputeBlockSize);
				Value bs3 = b.create<CmpIOp>(CmpIPredicate::sle, tripCount, bs2);
				Value blockSize = b.create<SelectOp>(bs3, tripCount, bs2);
				Value blockCount = b.create<SignedCeilDivIOp>(tripCount, blockSize);

				// Create a parallel compute function that takes a block id and computes the
				// parallel operation body for a subset of iteration space.
				ParallelComputeFunction parallelComputeFunction =
				createParallelComputeFunction(op, rewriter);

				// Dispatch parallel compute function using async recursive work splitting, or
				// by submitting compute task sequentially from a caller thread.
				if (asyncDispatch) {
				doAsyncDispatch(b, rewriter, parallelComputeFunction, op, blockSize,
				blockCount, tripCounts);
				} else {
				doSequantialDispatch(b, rewriter, parallelComputeFunction, op, blockSize,
				blockCount, tripCounts);
				}

	// Erase the original parallel operation.			// Parallel operation was replaced with a block iteration loop.
	rewriter.eraseOp(op);			rewriter.eraseOp(op);

	return success();			return success();
	}			}

	void AsyncParallelForPass::runOnOperation() {			void AsyncParallelForPass::runOnOperation() {
	MLIRContext *ctx = &getContext();			MLIRContext *ctx = &getContext();

	RewritePatternSet patterns(ctx);			RewritePatternSet patterns(ctx);
	patterns.add<AsyncParallelForRewrite>(ctx, numConcurrentAsyncExecute);			patterns.add<AsyncParallelForRewrite>(ctx, asyncDispatch, numWorkerThreads,
				targetBlockSize);

	if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))			if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))
	signalPassFailure();			signalPassFailure();
	}			}

	std::unique_ptr<Pass> mlir::createAsyncParallelForPass() {			std::unique_ptr<Pass> mlir::createAsyncParallelForPass() {
	return std::make_unique<AsyncParallelForPass>();			return std::make_unique<AsyncParallelForPass>();
	}			}

	std::unique_ptr<Pass> mlir::createAsyncParallelForPass(int numWorkerThreads) {			std::unique_ptr<Pass> createAsyncParallelForPass(bool asyncDispatch,
	return std::make_unique<AsyncParallelForPass>(numWorkerThreads);			int32_t numWorkerThreads,
				int32_t targetBlockSize) {
				herhut: Some more ConstantIndexOp opportunities
				return std::make_unique<AsyncParallelForPass>(asyncDispatch, numWorkerThreads,
				targetBlockSize);
	}			}
				herhut: Maybe use ceil_div instead of divup?
				herhut: nit replaced.
				herhut: Out of curiosity: Why do you use a signed ceildiv everywhere? Both operands are known to be positive here and in most of the rest of this file. Wouldn't the unsigned one suffice here?
				ezhulenev (author): Step can be negative, so I used SignedCeilDiv there for correctness, and everywhere else just for "consistency" :)
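
The block size formula in `matchAndRewrite` above can be checked by hand with a small host-side sketch. This is not code from the patch: `ceilDiv`, `BlockPlan`, and `planBlocks` are hypothetical names, the value of `kMaxOversharding` is not shown in this part of the diff so it is passed in as a parameter, and the sketch assumes a positive trip count (the pass itself emits `SignedCeilDivIOp` because a step may be negative).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Ceiling division for a >= 0 and b > 0. Hypothetical helper; the pass emits
// SignedCeilDivIOp instead of computing this on the host.
inline int64_t ceilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }

// Hypothetical result type: how many iterations go into one compute block
// and how many compute blocks are needed to cover the 1-d iteration space.
struct BlockPlan {
  int64_t blockSize;
  int64_t blockCount;
};

// Mirrors the formula from the diff:
//   blockSize = min(tripCount,
//                   max(ceil_div(tripCount, maxComputeBlocks),
//                       targetComputeBlockSize))
// `maxOversharding` stands in for kMaxOversharding (value assumed by caller).
inline BlockPlan planBlocks(int64_t tripCount, int64_t numWorkerThreads,
                            int64_t maxOversharding, int64_t targetBlockSize) {
  assert(tripCount > 0 && "sketch assumes a non-empty iteration space");
  const int64_t maxComputeBlocks = numWorkerThreads * maxOversharding;
  const int64_t bs =
      std::max(ceilDiv(tripCount, maxComputeBlocks), targetBlockSize);
  const int64_t blockSize = std::min(tripCount, bs);
  return {blockSize, ceilDiv(tripCount, blockSize)};
}
```

With 8 worker threads, an assumed oversharding factor of 4, and a target block size of 1000, a trip count of 2500 gives three blocks of at most 1000 iterations, while a trip count of 100 collapses into a single block that runs in the caller thread.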

mlir/test/Dialect/Async/async-parallel-for-async-dispatch.mlir

This file was added.

				// RUN: mlir-opt %s -split-input-file -async-parallel-for=async-dispatch=true \
				// RUN: | FileCheck %s

				// CHECK-LABEL: @loop_1d
				func @loop_1d(%arg0: index, %arg1: index, %arg2: index, %arg3: memref<?xf32>) {
				// CHECK: %[[GROUP:.*]] = async.create_group
				// CHECK: call @async_dispatch_fn
				// CHECK: async.await_all %[[GROUP]]
				scf.parallel (%i) = (%arg0) to (%arg1) step (%arg2) {
				%one = constant 1.0 : f32
				memref.store %one, %arg3[%i] : memref<?xf32>
				}
				return
				}

				// CHECK-LABEL: func private @parallel_compute_fn
				// CHECK: scf.for
				// CHECK: memref.store

				// CHECK-LABEL: func private @async_dispatch_fn
				// CHECK-SAME: (
				// CHECK-SAME: %[[GROUP:arg0]]: !async.group,
				// CHECK-SAME: %[[BLOCK_START:arg1]]: index
				// CHECK-SAME: %[[BLOCK_END:arg2]]: index
				// CHECK-SAME: )
				// CHECK: %[[C1:.*]] = constant 1 : index
				herhut: This test is super sparse. Why even capture `S` and `E` here, as they are not used? Unless you want to ensure that the right block numbers are passed, but that needs a less sparse test.
				ezhulenev (author): I added a bit more details, but still decided not to add checks for all of the generated IR, just checked the main structure. It's just too many checks, and I find it easier to rely on the mlir integration tests that run the code.
				// CHECK: %[[C2:.*]] = constant 2 : index
				// CHECK: scf.while (%[[S0:.*]] = %[[BLOCK_START]],
				// CHECK-SAME: %[[E0:.*]] = %[[BLOCK_END]])
				// While loop `before` block decides if we need to dispatch more tasks.
				// CHECK: {
				// CHECK: %[[DIFF0:.*]] = subi %[[E0]], %[[S0]]
				// CHECK: %[[COND:.*]] = cmpi sgt, %[[DIFF0]], %[[C1]]
				// CHECK: scf.condition(%[[COND]])
				// While loop `after` block splits the range in half and submits async task
				// to process the second half using the call to the same dispatch function.
				// CHECK: } do {
				// CHECK: ^bb0(%[[S1:.*]]: index, %[[E1:.*]]: index):
				// CHECK: %[[DIFF1:.*]] = subi %[[E1]], %[[S1]]
				// CHECK: %[[HALF:.*]] = divi_signed %[[DIFF1]], %[[C2]]
				// CHECK: %[[MID:.*]] = addi %[[S1]], %[[HALF]]
				// CHECK: %[[TOKEN:.*]] = async.execute
				// CHECK: call @async_dispatch_fn
				// CHECK: async.add_to_group
				// CHECK: scf.yield %[[S1]], %[[MID]]
				// CHECK: }
				// After async dispatch the first block is processed in the caller thread.
				// CHECK: call @parallel_compute_fn(%[[BLOCK_START]]

				// -----

				// CHECK-LABEL: @loop_2d
				func @loop_2d(%arg0: index, %arg1: index, %arg2: index, // lb, ub, step
				%arg3: index, %arg4: index, %arg5: index, // lb, ub, step
				%arg6: memref<?x?xf32>) {
				// CHECK: %[[GROUP:.*]] = async.create_group
				// CHECK: call @async_dispatch_fn
				// CHECK: async.await_all %[[GROUP]]
				scf.parallel (%i0, %i1) = (%arg0, %arg3) to (%arg1, %arg4)
				step (%arg2, %arg5) {
				%one = constant 1.0 : f32
				memref.store %one, %arg6[%i0, %i1] : memref<?x?xf32>
				}
				return
				}

				// CHECK-LABEL: func private @parallel_compute_fn
				// CHECK: scf.for
				// CHECK: scf.for
				// CHECK: memref.store

				// CHECK-LABEL: func private @async_dispatch_fn
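
The `scf.while` structure checked above for `@async_dispatch_fn` can be modelled on the host to see which blocks end up being executed. The sketch below is a sequential model, not the generated code: the `async.execute` launch is replaced by a direct recursive call (an assumption made to keep the example deterministic), and `DispatchTrace` and `asyncDispatch` are hypothetical names. It assumes a non-empty block range.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical bookkeeping: which blocks were passed to the compute function
// and how many times the dispatch loop would have launched an async task.
struct DispatchTrace {
  std::vector<int64_t> executedBlocks;
  int64_t asyncTasks = 0;
};

// Sequential model of the recursive dispatch function for the block range
// [blockStart, blockEnd). Each loop iteration hands the second half of the
// range to a (modelled) async task and keeps the first half.
inline void asyncDispatch(int64_t blockStart, int64_t blockEnd,
                          DispatchTrace &trace) {
  while (blockEnd - blockStart > 1) {
    const int64_t mid = blockStart + (blockEnd - blockStart) / 2;
    ++trace.asyncTasks;                 // stands in for async.execute
    asyncDispatch(mid, blockEnd, trace);
    blockEnd = mid;                     // scf.yield %start, %mid
  }
  // A single block is left: call the parallel compute function inline.
  trace.executedBlocks.push_back(blockStart);
}
```

Every call to the dispatch function executes exactly one block, so a range of `blockCount` blocks launches `blockCount - 1` tasks. That matches the group size of `blockCount - 1` used when the `async.group` is created in `doAsyncDispatch`.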

mlir/test/Dialect/Async/async-parallel-for-seq-dispatch.mlir

This file was moved from mlir/test/Dialect/Async/async-parallel-for.mlir.

	// RUN: mlir-opt %s -async-parallel-for | FileCheck %s			// RUN: mlir-opt %s -split-input-file -async-parallel-for=async-dispatch=false \
				// RUN: | FileCheck %s --dump-input=always

				// The structure of @parallel_compute_fn is checked in the async dispatch test.
				// Here we only check the structure of the sequential dispatch loop.

	// CHECK-LABEL: @loop_1d			// CHECK-LABEL: @loop_1d
	func @loop_1d(%arg0: index, %arg1: index, %arg2: index, %arg3: memref<?xf32>) {			func @loop_1d(%arg0: index, %arg1: index, %arg2: index, %arg3: memref<?xf32>) {
	// CHECK: %[[GROUP:.*]] = async.create_group			// CHECK: %[[GROUP:.*]] = async.create_group
	// CHECK: scf.for			// CHECK: scf.for
	// CHECK: %[[TOKEN:.*]] = async.execute {			// CHECK: %[[TOKEN:.*]] = async.execute
	// CHECK: scf.for			// CHECK: call @parallel_compute_fn
	// CHECK: memref.store
	// CHECK: async.yield			// CHECK: async.yield
	// CHECK: }
	// CHECK: async.add_to_group %[[TOKEN]], %[[GROUP]]			// CHECK: async.add_to_group %[[TOKEN]], %[[GROUP]]
				// CHECK: call @parallel_compute_fn
	// CHECK: async.await_all %[[GROUP]]			// CHECK: async.await_all %[[GROUP]]
	scf.parallel (%i) = (%arg0) to (%arg1) step (%arg2) {			scf.parallel (%i) = (%arg0) to (%arg1) step (%arg2) {
	%one = constant 1.0 : f32			%one = constant 1.0 : f32
	memref.store %one, %arg3[%i] : memref<?xf32>			memref.store %one, %arg3[%i] : memref<?xf32>
	}			}

	return			return
	}			}

				// CHECK-LABEL: func private @parallel_compute_fn
				// CHECK: scf.for
				// CHECK: memref.store

				// -----

	// CHECK-LABEL: @loop_2d			// CHECK-LABEL: @loop_2d
	func @loop_2d(%arg0: index, %arg1: index, %arg2: index, // lb, ub, step			func @loop_2d(%arg0: index, %arg1: index, %arg2: index, // lb, ub, step
	%arg3: index, %arg4: index, %arg5: index, // lb, ub, step			%arg3: index, %arg4: index, %arg5: index, // lb, ub, step
	%arg6: memref<?x?xf32>) {			%arg6: memref<?x?xf32>) {
	// CHECK: %[[GROUP:.*]] = async.create_group			// CHECK: %[[GROUP:.*]] = async.create_group
	// CHECK: scf.for			// CHECK: scf.for
	// CHECK: scf.for			// CHECK: %[[TOKEN:.*]] = async.execute
	// CHECK: %[[TOKEN:.*]] = async.execute {			// CHECK: call @parallel_compute_fn
	// CHECK: scf.for
	// CHECK: scf.for
	// CHECK: memref.store
	// CHECK: async.yield			// CHECK: async.yield
	// CHECK: }
	// CHECK: async.add_to_group %[[TOKEN]], %[[GROUP]]			// CHECK: async.add_to_group %[[TOKEN]], %[[GROUP]]
				// CHECK: call @parallel_compute_fn
	// CHECK: async.await_all %[[GROUP]]			// CHECK: async.await_all %[[GROUP]]
	scf.parallel (%i0, %i1) = (%arg0, %arg3) to (%arg1, %arg4)			scf.parallel (%i0, %i1) = (%arg0, %arg3) to (%arg1, %arg4)
	step (%arg2, %arg5) {			step (%arg2, %arg5) {
	%one = constant 1.0 : f32			%one = constant 1.0 : f32
	memref.store %one, %arg6[%i0, %i1] : memref<?x?xf32>			memref.store %one, %arg6[%i0, %i1] : memref<?x?xf32>
	}			}

	return			return
	}			}

				// CHECK-LABEL: func private @parallel_compute_fn
				// CHECK: scf.for
				// CHECK: scf.for
				// CHECK: memref.store
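
The two nested `scf.for` loops checked in `@parallel_compute_fn` pick their bounds from the block's first and last coordinates, as described in the 2d example in `AsyncParallelFor.cpp` (loop sizes `[50, 50]`, first coord `[25, 25]`, last coord `[30, 30]`). The sketch below reproduces that bound selection on the host. It is an illustration under assumptions: `visitBlock2d` is a hypothetical name, the row-major delinearization of the block range is assumed rather than taken from this part of the diff, and the loops run in trip-count coordinates (the pass additionally maps each index to `lowerBound + iv * step`).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Visits all iterations of one compute block in a rows x cols iteration space
// and returns their linearized indices. The block covers the inclusive linear
// range [blockFirstIndex, blockLastIndex].
inline std::vector<int64_t> visitBlock2d(int64_t rows, int64_t cols,
                                         int64_t blockFirstIndex,
                                         int64_t blockLastIndex) {
  assert(blockFirstIndex <= blockLastIndex && blockLastIndex < rows * cols);

  // Delinearize the first and last index of the block (row-major, assumed).
  const int64_t first0 = blockFirstIndex / cols, first1 = blockFirstIndex % cols;
  const int64_t last0 = blockLastIndex / cols, last1 = blockLastIndex % cols;

  std::vector<int64_t> visited;
  for (int64_t i = first0; i < last0 + 1; ++i) {
    // Inner bounds come from the block only in the first/last outer
    // iteration; otherwise the full [0, cols) range is used.
    const int64_t lb = (i == first0) ? first1 : 0;
    const int64_t ub = (i == last0) ? last1 + 1 : cols;
    for (int64_t j = lb; j < ub; ++j)
      visited.push_back(i * cols + j);
  }
  return visited;
}
```

For the example from the source comment, `visitBlock2d(50, 50, 25 * 50 + 25, 30 * 50 + 30)` starts the inner loop at 25 only when `i == 25`, stops it at 30 (inclusive) only when `i == 30`, and covers the contiguous linear range in between exactly once.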

mlir/test/Dialect/Async/async-parallel-for.mlir

This file was moved to mlir/test/Dialect/Async/async-parallel-for-seq-dispatch.mlir.

mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir

This file was copied to mlir/test/Integration/Dialect/Async/CPU/microbench-scf-async-parallel-for.mlir.

	// RUN: mlir-opt %s \			// RUN: mlir-opt %s \
	// RUN: -linalg-tile-to-parallel-loops="linalg-tile-sizes=256" \			// RUN: -convert-linalg-to-parallel-loops \
	// RUN: -async-parallel-for="num-concurrent-async-execute=4" \			// RUN: -async-parallel-for \
	// RUN: -async-to-async-runtime \			// RUN: -async-to-async-runtime \
	// RUN: -async-runtime-ref-counting \			// RUN: -async-runtime-ref-counting \
	// RUN: -async-runtime-ref-counting-opt \			// FIXME: -async-runtime-ref-counting-opt \
				herhut: Is this broken by this change? Is that because we now pass groups through functions?
				ezhulenev (author): There is a bug in reference counting optimization, it incorrectly removes a pair of addRef<->dropRef and destroys the group too early. Will send a fix in one of the next PRs.
	// RUN: -convert-async-to-llvm \			// RUN: -convert-async-to-llvm \
	// RUN: -lower-affine \
	// RUN: -convert-linalg-to-loops \
	// RUN: -convert-scf-to-std \			// RUN: -convert-scf-to-std \
	// RUN: -std-expand \			// RUN: -std-expand \
	// RUN: -convert-vector-to-llvm \			// RUN: -convert-vector-to-llvm \
	// RUN: -convert-std-to-llvm \			// RUN: -convert-std-to-llvm \
	// RUN: | mlir-cpu-runner \			// RUN: | mlir-cpu-runner \
	// RUN: -e entry -entry-point-result=void -O3 \			// RUN: -e entry -entry-point-result=void -O3 \
	// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \			// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
	// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_c_runner_utils%shlibext\			// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_c_runner_utils%shlibext\
	[... 114 lines not shown ...]

mlir/test/Integration/Dialect/Async/CPU/microbench-scf-async-parallel-for.mlir

This file was copied from mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir.

// RUN: mlir-opt %s \		// RUN: mlir-opt %s \
// RUN: -linalg-tile-to-parallel-loops="linalg-tile-sizes=256" \		// RUN: -async-parallel-for \
// RUN: -async-parallel-for="num-concurrent-async-execute=4" \
// RUN: -async-to-async-runtime \		// RUN: -async-to-async-runtime \
// RUN: -async-runtime-ref-counting \		// RUN: -async-runtime-ref-counting \
// RUN: -async-runtime-ref-counting-opt \		// FIXME: -async-runtime-ref-counting-opt \
		// RUN: -convert-async-to-llvm \
		// RUN: -convert-linalg-to-loops \
		// RUN: -convert-scf-to-std \
		// RUN: -std-expand \
		// RUN: -convert-vector-to-llvm \
		// RUN: -convert-std-to-llvm \
		// RUN: | mlir-cpu-runner \
		// RUN: -e entry -entry-point-result=void -O3 \
		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_c_runner_utils%shlibext\
		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext \
		// RUN: | FileCheck %s --dump-input=always

		// RUN: mlir-opt %s \
		// RUN: -async-parallel-for=async-dispatch=false \
		// RUN: -async-to-async-runtime \
		// RUN: -async-runtime-ref-counting \
		// FIXME: -async-runtime-ref-counting-opt \
// RUN: -convert-async-to-llvm \		// RUN: -convert-async-to-llvm \
// RUN: -lower-affine \
// RUN: -convert-linalg-to-loops \		// RUN: -convert-linalg-to-loops \
// RUN: -convert-scf-to-std \		// RUN: -convert-scf-to-std \
// RUN: -std-expand \		// RUN: -std-expand \
// RUN: -convert-vector-to-llvm \		// RUN: -convert-vector-to-llvm \
// RUN: -convert-std-to-llvm \		// RUN: -convert-std-to-llvm \
// RUN: | mlir-cpu-runner \		// RUN: | mlir-cpu-runner \
// RUN: -e entry -entry-point-result=void -O3 \		// RUN: -e entry -entry-point-result=void -O3 \
// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
[... 10 lines not shown ...]
// RUN: -e entry -entry-point-result=void -O3 \		// RUN: -e entry -entry-point-result=void -O3 \
// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_c_runner_utils%shlibext\		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_c_runner_utils%shlibext\
// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext \		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext \
// RUN: | FileCheck %s --dump-input=always		// RUN: | FileCheck %s --dump-input=always

#map0 = affine_map<(d0, d1) -> (d0, d1)>		#map0 = affine_map<(d0, d1) -> (d0, d1)>

func @linalg_generic(%lhs: memref<?x?xf32>,		func @scf_parallel(%lhs: memref<?x?xf32>,
%rhs: memref<?x?xf32>,		%rhs: memref<?x?xf32>,
%sum: memref<?x?xf32>) {		%sum: memref<?x?xf32>) {
linalg.generic {		%c0 = constant 0 : index
indexing_maps = [#map0, #map0, #map0],		%c1 = constant 1 : index
iterator_types = ["parallel", "parallel"]
}		%d0 = memref.dim %lhs, %c0 : memref<?x?xf32>
ins(%lhs, %rhs : memref<?x?xf32>, memref<?x?xf32>)		%d1 = memref.dim %lhs, %c1 : memref<?x?xf32>
outs(%sum : memref<?x?xf32>)
{		scf.parallel (%i, %j) = (%c0, %c0) to (%d0, %d1) step (%c1, %c1) {
^bb0(%lhs_in: f32, %rhs_in: f32, %sum_out: f32):		%lv = memref.load %lhs[%i, %j] : memref<?x?xf32>
%0 = addf %lhs_in, %rhs_in : f32		%rv = memref.load %lhs[%i, %j] : memref<?x?xf32>
linalg.yield %0 : f32		%r = addf %lv, %rv : f32
		memref.store %r, %sum[%i, %j] : memref<?x?xf32>
}		}

return		return
}		}

func @entry() {		func @entry() {
%f1 = constant 1.0 : f32		%f1 = constant 1.0 : f32
%f4 = constant 4.0 : f32		%f4 = constant 4.0 : f32
[... 11 lines not shown ...]

linalg.fill(%f1, %LHS10) : f32, memref<1x10xf32>		linalg.fill(%f1, %LHS10) : f32, memref<1x10xf32>
linalg.fill(%f1, %RHS10) : f32, memref<1x10xf32>		linalg.fill(%f1, %RHS10) : f32, memref<1x10xf32>

%LHS = memref.cast %LHS10 : memref<1x10xf32> to memref<?x?xf32>		%LHS = memref.cast %LHS10 : memref<1x10xf32> to memref<?x?xf32>
%RHS = memref.cast %RHS10 : memref<1x10xf32> to memref<?x?xf32>		%RHS = memref.cast %RHS10 : memref<1x10xf32> to memref<?x?xf32>
%DST = memref.cast %DST10 : memref<1x10xf32> to memref<?x?xf32>		%DST = memref.cast %DST10 : memref<1x10xf32> to memref<?x?xf32>

call @linalg_generic(%LHS, %RHS, %DST)		call @scf_parallel(%LHS, %RHS, %DST)
: (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()		: (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()

// CHECK: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]		// CHECK: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
%U = memref.cast %DST10 : memref<1x10xf32> to memref<*xf32>		%U = memref.cast %DST10 : memref<1x10xf32> to memref<*xf32>
call @print_memref_f32(%U): (memref<*xf32>) -> ()		call @print_memref_f32(%U): (memref<*xf32>) -> ()

memref.dealloc %LHS10: memref<1x10xf32>		memref.dealloc %LHS10: memref<1x10xf32>
memref.dealloc %RHS10: memref<1x10xf32>		memref.dealloc %RHS10: memref<1x10xf32>
[... 10 lines not shown ...]
%LHS0 = memref.cast %LHS1024 : memref<1024x1024xf32> to memref<?x?xf32>		%LHS0 = memref.cast %LHS1024 : memref<1024x1024xf32> to memref<?x?xf32>
%RHS0 = memref.cast %RHS1024 : memref<1024x1024xf32> to memref<?x?xf32>		%RHS0 = memref.cast %RHS1024 : memref<1024x1024xf32> to memref<?x?xf32>
%DST0 = memref.cast %DST1024 : memref<1024x1024xf32> to memref<?x?xf32>		%DST0 = memref.cast %DST1024 : memref<1024x1024xf32> to memref<?x?xf32>

//		//
// Warm up.		// Warm up.
//		//

call @linalg_generic(%LHS0, %RHS0, %DST0)		call @scf_parallel(%LHS0, %RHS0, %DST0)
: (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()		: (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()

//		//
// Measure execution time.		// Measure execution time.
//		//

%t0 = call @rtclock() : () -> f64		%t0 = call @rtclock() : () -> f64
scf.for %i = %c0 to %cM step %c1 {		scf.for %i = %c0 to %cM step %c1 {
call @linalg_generic(%LHS0, %RHS0, %DST0)		call @scf_parallel(%LHS0, %RHS0, %DST0)
: (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()		: (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()
}		}
%t1 = call @rtclock() : () -> f64		%t1 = call @rtclock() : () -> f64
%t1024 = subf %t1, %t0 : f64		%t1024 = subf %t1, %t0 : f64

// Print timings.		// Print timings.
vector.print %t1024 : f64		vector.print %t1024 : f64

[... 12 lines not shown ...]

mlir/test/Integration/Dialect/Async/CPU/test-async-parallel-for-1d.mlir

	// RUN: mlir-opt %s -async-parallel-for \			// RUN: mlir-opt %s -async-parallel-for \
	// RUN: -async-to-async-runtime \			// RUN: -async-to-async-runtime \
	// RUN: -async-runtime-ref-counting \			// RUN: -async-runtime-ref-counting \
	// RUN: -async-runtime-ref-counting-opt \			// FIXME: -async-runtime-ref-counting-opt \
				// RUN: -convert-async-to-llvm \
				// RUN: -convert-scf-to-std \
				// RUN: -convert-std-to-llvm \
				// RUN: | mlir-cpu-runner \
				// RUN: -e entry -entry-point-result=void -O0 \
				// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
				// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\
				// RUN: | FileCheck %s --dump-input=always

				// RUN: mlir-opt %s -async-parallel-for="async-dispatch=false \
				// RUN: num-workers=20 \
				// RUN: target-block-size=3" \
				// RUN: -async-to-async-runtime \
				// RUN: -async-runtime-ref-counting \
				// FIXME: -async-runtime-ref-counting-opt \
	// RUN: -convert-async-to-llvm \			// RUN: -convert-async-to-llvm \
	// RUN: -convert-scf-to-std \			// RUN: -convert-scf-to-std \
	// RUN: -convert-std-to-llvm \			// RUN: -convert-std-to-llvm \
	// RUN: | mlir-cpu-runner \			// RUN: | mlir-cpu-runner \
	// RUN: -e entry -entry-point-result=void -O0 \			// RUN: -e entry -entry-point-result=void -O0 \
	// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \			// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
	// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\			// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\
	// RUN: | FileCheck %s --dump-input=always			// RUN: | FileCheck %s --dump-input=always
	[... 57 lines not shown ...]

mlir/test/Integration/Dialect/Async/CPU/test-async-parallel-for-2d.mlir

	// RUN: mlir-opt %s -async-parallel-for \			// RUN: mlir-opt %s -async-parallel-for \
	// RUN: -async-to-async-runtime \			// RUN: -async-to-async-runtime \
	// RUN: -async-runtime-ref-counting \			// RUN: -async-runtime-ref-counting \
	// RUN: -async-runtime-ref-counting-opt \			// FIXME: -async-runtime-ref-counting-opt \
				// RUN: -convert-async-to-llvm \
				// RUN: -convert-scf-to-std \
				// RUN: -convert-std-to-llvm \
				// RUN: | mlir-cpu-runner \
				// RUN: -e entry -entry-point-result=void -O0 \
				// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
				// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\
				// RUN: | FileCheck %s --dump-input=always

				// RUN: mlir-opt %s -async-parallel-for="async-dispatch=false \
				// RUN: num-workers=20 \
				// RUN: target-block-size=3" \
				// RUN: -async-to-async-runtime \
				// RUN: -async-runtime-ref-counting \
				// FIXME: -async-runtime-ref-counting-opt \
	// RUN: -convert-async-to-llvm \			// RUN: -convert-async-to-llvm \
	// RUN: -convert-scf-to-std \			// RUN: -convert-scf-to-std \
	// RUN: -convert-std-to-llvm \			// RUN: -convert-std-to-llvm \
	// RUN: | mlir-cpu-runner \			// RUN: | mlir-cpu-runner \
	// RUN: -e entry -entry-point-result=void -O0 \			// RUN: -e entry -entry-point-result=void -O0 \
	// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \			// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
	// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\			// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\
	// RUN: | FileCheck %s --dump-input=always			// RUN: | FileCheck %s --dump-input=always
	[... 84 lines not shown ...]
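
The revision summary outlines the approach these tests exercise: collapse the scf.parallel dimensions into one, pick a block size, launch tasks, and have each task recover its position in the original multi-dimensional space. The Python sketch below illustrates that idea only, under stated assumptions. The names `delinearize`, `run_block`, and `parallel_for` are hypothetical, the block size is passed in directly rather than derived from `num-workers` and `target-block-size` as the pass does, and a thread pool stands in for the async runtime. The actual pass emits Async dialect operations, not Python.

```python
import math
import threading
from concurrent.futures import ThreadPoolExecutor


def delinearize(flat, dims):
    """Map a flat index back to row-major coordinates in `dims`."""
    coords = []
    for size in reversed(dims):
        coords.append(flat % size)
        flat //= size
    return tuple(reversed(coords))


def run_block(block, block_size, dims, body):
    """Run `body` sequentially over one block of the collapsed 1-D space."""
    total = math.prod(dims)
    start = block * block_size
    end = min(start + block_size, total)
    for flat in range(start, end):
        body(*delinearize(flat, dims))


def parallel_for(dims, body, block_size, num_workers=4):
    """Hypothetical driver: recursively split the block range across a pool."""
    total = math.prod(dims)
    if total == 0:
        return
    num_blocks = -(-total // block_size)  # ceiling division
    futures = []
    lock = threading.Lock()

    with ThreadPoolExecutor(max_workers=num_workers) as pool:

        def dispatch(first, last):
            # Hand the upper half of [first, last) to another task, keep
            # halving the lower half, then run the single remaining block.
            while last - first > 1:
                mid = first + (last - first) // 2
                with lock:
                    futures.append(pool.submit(dispatch, mid, last))
                last = mid
            run_block(first, block_size, dims, body)

        dispatch(0, num_blocks)

        # Tasks may enqueue more tasks, so drain until no new futures appear.
        done = 0
        while True:
            with lock:
                if done == len(futures):
                    break
                future = futures[done]
            future.result()
            done += 1
```

Calling `parallel_for((3, 4), body, block_size=5)` splits the 12 collapsed iterations into three blocks and invokes `body(i, j)` once per point. Because each task submits half of its range before doing its own work, task submission spreads across workers instead of being serialized on the caller, which is the effect the summary credits for the speedup.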

Revision Contents

Diff 354408

mlir/include/mlir/Dialect/Async/Passes.h

mlir/include/mlir/Dialect/Async/Passes.td

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp

mlir/test/Dialect/Async/async-parallel-for-async-dispatch.mlir

mlir/test/Dialect/Async/async-parallel-for-seq-dispatch.mlir

mlir/test/Dialect/Async/async-parallel-for.mlir

mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir

mlir/test/Integration/Dialect/Async/CPU/microbench-scf-async-parallel-for.mlir

mlir/test/Integration/Dialect/Async/CPU/test-async-parallel-for-1d.mlir

mlir/test/Integration/Dialect/Async/CPU/test-async-parallel-for-2d.mlir
