This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

mlir/include/mlir/Dialect/Async/
    Passes.h
    Passes.td
mlir/lib/Dialect/Async/Transforms/
    AsyncParallelFor.cpp (inline comments: 25/25)
mlir/test/Dialect/Async/
    async-parallel-for-async-dispatch.mlir (inline comments: 2/2)
    async-parallel-for-seq-dispatch.mlir
    async-parallel-for.mlir
mlir/test/Integration/Dialect/Async/CPU/
    microbench-linalg-async-parallel-for.mlir (inline comments: 2/2)
    microbench-scf-async-parallel-for.mlir
    test-async-parallel-for-1d.mlir
    test-async-parallel-for-2d.mlir

Differential D104850

[mlir:Async] Implement recursive async work splitting for scf.parallel operation (async-parallel-for pass)
Closed · Public

Authored by ezhulenev on Jun 24 2021, 5:27 AM.

Details

Reviewers

herhut
mehdi_amini

Commits

rG86ad0af87054: [mlir:Async] Implement recursive async work splitting for scf.parallel…

Summary

Depends On D104780

Recursive work splitting instead of sequential async tasks submission gives ~20%-30% speedup in microbenchmarks.
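
The two dispatch strategies can be sketched in plain C++. This is an illustration only, not the code the pass generates: `ComputeFn`, `seqDispatch`, and `asyncDispatch` are hypothetical names, and `std::async` with futures stands in for `async.execute` and the async group that the pass uses to wait for completion.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-in for the outlined parallel compute function: it
// processes the single compute block identified by blockIndex.
using ComputeFn = std::function<void(int64_t blockIndex)>;

// Sequential dispatch: the caller thread submits one task per block.
void seqDispatch(int64_t blockCount, const ComputeFn &compute) {
  std::vector<std::future<void>> tasks;
  for (int64_t i = 0; i < blockCount; ++i)
    tasks.push_back(
        std::async(std::launch::async, [i, &compute] { compute(i); }));
  for (auto &task : tasks)
    task.get();
}

// Recursive work splitting: keep handing the upper half of the block range
// to a new task until the current task is left with a single block.
void asyncDispatch(int64_t blockStart, int64_t blockEnd,
                   const ComputeFn &compute) {
  assert(blockStart <= blockEnd && "invalid block range");
  if (blockStart == blockEnd)
    return; // Empty range, nothing to compute.

  std::vector<std::future<void>> tasks;
  while (blockEnd - blockStart > 1) {
    int64_t mid = blockStart + (blockEnd - blockStart) / 2;
    // The new task recursively splits [mid, blockEnd) on its own thread.
    tasks.push_back(std::async(std::launch::async, [mid, blockEnd, &compute] {
      asyncDispatch(mid, blockEnd, compute);
    }));
    blockEnd = mid;
  }

  // Exactly one block is left for the current task.
  compute(blockStart);
  for (auto &task : tasks)
    task.get();
}
```

With recursive splitting the submission work itself is spread over the workers, so the caller submits roughly log2(blockCount) tasks instead of blockCount. That is one plausible reading of where the reported speedup comes from.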

Algorithm outline:

1. Collapse scf.parallel dimensions into a single dimension
2. Compute the block size for the parallel operations from the 1d problem size
3. Launch parallel tasks
4. Each parallel task reconstructs its own bounds in the original multi-dimensional iteration space
5. Each parallel task computes the original parallel operation body using scf.for loop nest
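
The per-task part of the outline (collapsing, bounds reconstruction, and the loop nest) can be sketched in plain C++. This is a minimal sketch under stated assumptions: `Coord`, `BlockBounds`, `blockBounds`, and `forEachInBlock` are hypothetical names, the block size is taken as an input rather than computed, and task launching is left out. The block end is clamped with `min`, which is what the compare-and-select sequence in the diff computes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Coord = std::vector<int64_t>;

// Collapse the per-dimension trip counts into one 1-D problem size.
int64_t flattenedTripCount(const Coord &tripCounts) {
  int64_t total = 1;
  for (int64_t t : tripCounts)
    total *= t;
  return total;
}

// Convert a 1-D index in [0, tripCount) into a multi-dimensional coordinate.
// The innermost dimension varies fastest.
Coord delinearize(int64_t index, const Coord &tripCounts) {
  assert(!tripCounts.empty() && "tripCounts must not be empty");
  Coord coords(tripCounts.size());
  for (size_t i = tripCounts.size(); i-- > 0;) {
    coords[i] = index % tripCounts[i];
    index /= tripCounts[i];
  }
  return coords;
}

// First and last (inclusive) coordinates covered by one compute block.
struct BlockBounds {
  Coord first;
  Coord last;
};

BlockBounds blockBounds(int64_t blockIndex, int64_t blockSize,
                        const Coord &tripCounts) {
  int64_t tripCount = flattenedTripCount(tripCounts);
  int64_t firstIndex = blockIndex * blockSize;
  assert(firstIndex < tripCount && "block starts past the iteration space");
  // Clamp the block end to the problem size, then make it inclusive.
  int64_t lastIndex = std::min(firstIndex + blockSize, tripCount) - 1;
  return {delinearize(firstIndex, tripCounts),
          delinearize(lastIndex, tripCounts)};
}

// Visit every coordinate of the block with a loop nest. A nested loop uses
// the block bounds only while all outer loops sit on the block's first
// (or last) coordinate; otherwise it covers the full [0, tripCount) range.
void forEachInBlock(const BlockBounds &block, const Coord &tripCounts,
                    const std::function<void(const Coord &)> &body) {
  Coord coord(tripCounts.size());
  std::function<void(size_t, bool, bool)> nest = [&](size_t dim, bool onFirst,
                                                     bool onLast) {
    int64_t lb = onFirst ? block.first[dim] : 0;
    int64_t ub = onLast ? block.last[dim] + 1 : tripCounts[dim];
    for (int64_t iv = lb; iv < ub; ++iv) {
      coord[dim] = iv;
      bool first = onFirst && iv == block.first[dim];
      bool last = onLast && iv == block.last[dim];
      if (dim + 1 < tripCounts.size())
        nest(dim + 1, first, last);
      else
        body(coord);
    }
  };
  nest(0, true, true);
}
```

The coordinates visited here are positions in the trip-count space. The original induction variable for each dimension would then be `lowerBound + coord * step`, as the diff computes inside the loop body builder.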

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ezhulenev created this revision. Jun 24 2021, 5:27 AM

Herald added subscribers: dcaballe, cota, teijeong and 16 others. Jun 24 2021, 5:27 AM

ezhulenev requested review of this revision. Jun 24 2021, 5:27 AM

Herald added a project: Restricted Project. Jun 24 2021, 5:27 AM

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

ezhulenev edited the summary of this revision. Jun 24 2021, 5:31 AM

ezhulenev added reviewers: herhut, mehdi_amini.

Harbormaster completed remote builds in B110805: Diff 354222. Jun 24 2021, 6:09 AM

Very neat! Just some nits.

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
129	Longer term this could become an op, as this functionality is needed frequently. Not here, though.
187	Why is this needed?
208	I see the convenience of this but I'd prefer if there was no invisible side-effect on offset.
220	`b.create<ConstantIndexOp>(0)`
224	I wonder whether `ArrayRef` would be the better abstraction here.
245	how about `max(blockFirstIndex + blockSize, tripCount)`?
256	nit: double compute?
260	nit: contains, multiple
315	nit: or
383	Is this needed?
411	nit: 'ConstantIndexOp'
442	nit: the second half
449	nit: use `start`?
469	nit: use `start`?
501	`ConstantIndexOp`
538	`ConstantIndexOp`
624	Some more ConstantIndexOp opportunities
632	Maybe use ceil_div instead of divup?
639	Out of curiosity: Why do you use a signed ceildiv everywhere? Both operands are known to be positive here and in most of the rest of this file. Wouldn't the unsigned one suffice here?
656	nit replaced.
mlir/test/Dialect/Async/async-parallel-for-async-dispatch.mlir
26	This test is super sparse. Why even capture `S` and `E` here, as they are not used? Unless you want to ensure that the right block numbers are passed, but that needs a less sparse test.
mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir
6	Is this broken by this change? Is that because we now pass groups through functions?

Resolve comments

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
187	Removed. Leftovers that I forgot to cleanup.
208	I added a note that it updates the offset, but don't see any other way to keep the uses one-liners and make it more explicit
220	Cool, didn't know it exists. Updated all constants.
224	Updated all callsites.
639	Step can be negative, so I used SignedCeilDiv there for correctness, and everywhere else just for "consistency" :)
mlir/test/Dialect/Async/async-parallel-for-async-dispatch.mlir
26	I added a bit more details, but still decided not to add checks for all of the generated IR, just checked the main structure. It's just too many checks, and I find it easier to rely on the mlir integration tests that run the code.
mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir
6	There is a bug in reference counting optimization, it incorrectly removes a pair of addRef<->dropRef and destroys the group too early. Will send a fix in one of the next PRs.
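
The signed versus unsigned ceildiv exchange on line 639 can be illustrated with a small C++ sketch. It takes the author's statement that the step can be negative at face value. `ceilDivSigned` and `tripCount` are hypothetical helper names, not functions from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Signed ceiling division: rounds the exact quotient toward +infinity.
int64_t ceilDivSigned(int64_t x, int64_t y) {
  assert(y != 0 && "division by zero");
  int64_t q = x / y; // C++ integer division truncates toward zero.
  int64_t r = x % y; // The remainder takes the sign of x.
  // Truncation rounded down only if the exact quotient is positive and not
  // an integer, which is the case when r is nonzero and has the sign of y.
  if (r != 0 && ((r < 0) == (y < 0)))
    ++q;
  return q;
}

// Trip count of a loop running from lb towards ub (exclusive) in increments
// of step. Works for negative steps because the division is signed.
int64_t tripCount(int64_t lb, int64_t ub, int64_t step) {
  return ceilDivSigned(ub - lb, step);
}
```

For example, a loop from 10 down to 0 (exclusive) with step -3 visits 10, 7, 4, 1. Both `ub - lb` and `step` are negative there, so an unsigned division would reinterpret them as very large values and yield a wrong trip count.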

ezhulenev added a child revision: D104891: [mlir:Async] Remove async operations if it is statically known that the parallel operation has a single compute block. Jun 24 2021, 5:48 PM

Revert accidental change

Harbormaster completed remote builds in B110931: Diff 354408. Jun 24 2021, 6:25 PM

Thanks!

This revision is now accepted and ready to land. Jun 25 2021, 5:19 AM

Closed by commit rG86ad0af87054: [mlir:Async] Implement recursive async work splitting for scf.parallel… (authored by ezhulenev). Jun 25 2021, 10:34 AM

This revision was automatically updated to reflect the committed changes.

ezhulenev added a commit: rG86ad0af87054: [mlir:Async] Implement recursive async work splitting for scf.parallel….

Revision Contents

Path                                                                                   Size
mlir/include/mlir/Dialect/Async/Passes.h                                               2 lines
mlir/include/mlir/Dialect/Async/Passes.td                                              21 lines
mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp                                 747 lines
mlir/test/Dialect/Async/async-parallel-for-async-dispatch.mlir                         57 lines
mlir/test/Dialect/Async/async-parallel-for-seq-dispatch.mlir
    (from async-parallel-for.mlir)                                                     36 lines
mlir/test/Dialect/Async/async-parallel-for.mlir
mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir      8 lines
mlir/test/Integration/Dialect/Async/CPU/microbench-scf-async-parallel-for.mlir
    (from microbench-linalg-async-parallel-for.mlir)                                   57 lines
mlir/test/Integration/Dialect/Async/CPU/test-async-parallel-for-1d.mlir                17 lines
mlir/test/Integration/Dialect/Async/CPU/test-async-parallel-for-2d.mlir                17 lines

Diff 354546

mlir/include/mlir/Dialect/Async/Passes.h

	Show All 13 Lines
	#define MLIR_DIALECT_ASYNC_PASSES_H_			#define MLIR_DIALECT_ASYNC_PASSES_H_

	#include "mlir/Pass/Pass.h"			#include "mlir/Pass/Pass.h"

	namespace mlir {			namespace mlir {

	std::unique_ptr<Pass> createAsyncParallelForPass();			std::unique_ptr<Pass> createAsyncParallelForPass();

	std::unique_ptr<Pass> createAsyncParallelForPass(int numWorkerThreads);

	std::unique_ptr<OperationPass<ModuleOp>> createAsyncToAsyncRuntimePass();			std::unique_ptr<OperationPass<ModuleOp>> createAsyncToAsyncRuntimePass();

	std::unique_ptr<Pass> createAsyncRuntimeRefCountingPass();			std::unique_ptr<Pass> createAsyncRuntimeRefCountingPass();

	std::unique_ptr<Pass> createAsyncRuntimeRefCountingOptPass();			std::unique_ptr<Pass> createAsyncRuntimeRefCountingOptPass();

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Registration			// Registration
	Show All 9 Lines

mlir/include/mlir/Dialect/Async/Passes.td

	//===-- Passes.td - Async pass definition file -------------- tablegen --===//			//===-- Passes.td - Async pass definition file -------------- tablegen --===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef MLIR_DIALECT_ASYNC_PASSES			#ifndef MLIR_DIALECT_ASYNC_PASSES
	#define MLIR_DIALECT_ASYNC_PASSES			#define MLIR_DIALECT_ASYNC_PASSES

	include "mlir/Pass/PassBase.td"			include "mlir/Pass/PassBase.td"

	def AsyncParallelFor : Pass<"async-parallel-for"> {			def AsyncParallelFor : Pass<"async-parallel-for"> {
	let summary = "Convert scf.parallel operations to multiple async regions "			let summary = "Convert scf.parallel operations to multiple async compute ops "
	"executed concurrently for non-overlapping iteration ranges";			"executed concurrently for non-overlapping iteration ranges";
	let constructor = "mlir::createAsyncParallelForPass()";			let constructor = "mlir::createAsyncParallelForPass()";

	let options = [			let options = [
	Option<"numConcurrentAsyncExecute", "num-concurrent-async-execute",			Option<"asyncDispatch", "async-dispatch",
	"int32_t", /*default=*/"4",			"bool", /*default=*/"true",
	"The number of async.execute operations that will be used for concurrent "			"Dispatch async compute tasks using recursive work splitting. If `false` "
	"loop execution.">			"async compute tasks will be launched using simple for loop in the "
				"caller thread.">,

				Option<"numWorkerThreads", "num-workers",
				"int32_t", /*default=*/"8",
				"The number of available workers to execute async operations.">,

				Option<"targetBlockSize", "target-block-size",
				"int32_t", /*default=*/"1000",
				"The target block size for sharding parallel operation.">
	];			];

	let dependentDialects = ["async::AsyncDialect", "scf::SCFDialect"];			let dependentDialects = ["async::AsyncDialect", "scf::SCFDialect"];
	}			}

	def AsyncToAsyncRuntime : Pass<"async-to-async-runtime", "ModuleOp"> {			def AsyncToAsyncRuntime : Pass<"async-to-async-runtime", "ModuleOp"> {
	let summary = "Lower high level async operations (e.g. async.execute) to the"			let summary = "Lower high level async operations (e.g. async.execute) to the"
	"explicit async.runtime and async.coro operations";			"explicit async.runtime and async.coro operations";
	let constructor = "mlir::createAsyncToAsyncRuntimePass()";			let constructor = "mlir::createAsyncToAsyncRuntimePass()";
	let dependentDialects = ["async::AsyncDialect"];			let dependentDialects = ["async::AsyncDialect"];
	Show All 27 Lines

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp

	//===- AsyncParallelFor.cpp - Implementation of Async Parallel For --------===//			//===- AsyncParallelFor.cpp - Implementation of Async Parallel For --------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file implements scf.parallel to src.for + async.execute conversion pass.			// This file implements scf.parallel to scf.for + async.execute conversion pass.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "PassDetail.h"			#include "PassDetail.h"
	#include "mlir/Dialect/Async/IR/Async.h"			#include "mlir/Dialect/Async/IR/Async.h"
	#include "mlir/Dialect/Async/Passes.h"			#include "mlir/Dialect/Async/Passes.h"
	#include "mlir/Dialect/SCF/SCF.h"			#include "mlir/Dialect/SCF/SCF.h"
	#include "mlir/Dialect/StandardOps/IR/Ops.h"			#include "mlir/Dialect/StandardOps/IR/Ops.h"
	#include "mlir/IR/BlockAndValueMapping.h"			#include "mlir/IR/BlockAndValueMapping.h"
				#include "mlir/IR/ImplicitLocOpBuilder.h"
	#include "mlir/IR/PatternMatch.h"			#include "mlir/IR/PatternMatch.h"
	#include "mlir/Transforms/GreedyPatternRewriteDriver.h"			#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
				#include "mlir/Transforms/RegionUtils.h"

	using namespace mlir;			using namespace mlir;
	using namespace mlir::async;			using namespace mlir::async;

	#define DEBUG_TYPE "async-parallel-for"			#define DEBUG_TYPE "async-parallel-for"

	namespace {			namespace {

	// Rewrite scf.parallel operation into multiple concurrent async.execute			// Rewrite scf.parallel operation into multiple concurrent async.execute
	// operations over non overlapping subranges of the original loop.			// operations over non overlapping subranges of the original loop.
	//			//
	// Example:			// Example:
	//			//
	// scf.for (%i, %j) = (%lbi, %lbj) to (%ubi, %ubj) step (%si, %sj) {			// scf.parallel (%i, %j) = (%lbi, %lbj) to (%ubi, %ubj) step (%si, %sj) {
	// "do_some_compute"(%i, %j): () -> ()			// "do_some_compute"(%i, %j): () -> ()
	// }			// }
	//			//
	// Converted to:			// Converted to:
	//			//
	// %c0 = constant 0 : index			// // Parallel compute function that executes the parallel body region for
	// %c1 = constant 1 : index			// // a subset of the parallel iteration space defined by the one-dimensional
	//			// // compute block index.
	// // Compute blocks sizes for each induction variable.			// func parallel_compute_function(%block_index : index, %block_size : index,
	// %num_blocks_i = ... : index			// <parallel operation properties>, ...) {
	// %num_blocks_j = ... : index			// // Compute multi-dimensional loop bounds for %block_index.
	// %block_size_i = ... : index			// %block_lbi, %block_lbj = ...
	// %block_size_j = ... : index			// %block_ubi, %block_ubj = ...
	//			//
	// // Create an async group to track async execute ops.			// // Clone parallel operation body into the scf.for loop nest.
	// %group = async.create_group			// scf.for %i = %block_lbi to %block_ubi {
	//			// scf.for %j = %block_lbj to %block_ubj {
	// scf.for %bi = %c0 to %num_blocks_i step %c1 {
	// %block_start_i = ... : index
	// %block_end_i = ... : index
	//
	// scf.for %bj = %c0 to %num_blocks_j step %c1 {
	// %block_start_j = ... : index
	// %block_end_j = ... : index
	//
	// // Execute the body of original parallel operation for the current
	// // block.
	// %token = async.execute {
	// scf.for %i = %block_start_i to %block_end_i step %si {
	// scf.for %j = %block_start_j to %block_end_j step %sj {
	// "do_some_compute"(%i, %j): () -> ()			// "do_some_compute"(%i, %j): () -> ()
	// }			// }
	// }			// }
	// }			// }
	//			//
	// // Add produced async token to the group.			// And a dispatch function depending on the `asyncDispatch` option.
	// async.add_to_group %token, %group			//
				// When async dispatch is on: (pseudocode)
				//
				// %block_size = ... compute parallel compute block size
				// %block_count = ... compute the number of compute blocks
				//
				// func @async_dispatch(%block_start : index, %block_end : index, ...) {
				// // Keep splitting block range until we reached a range of size 1.
				// while (%block_end - %block_start > 1) {
				// %mid_index = block_start + (block_end - block_start) / 2;
				// async.execute { call @async_dispatch(%mid_index, %block_end); }
				// %block_end = %mid_index
	// }			// }
				//
				// // Call parallel compute function for a single block.
				// call @parallel_compute_fn(%block_start, %block_size, ...);
	// }			// }
	//			//
	// // Await completion of all async.execute operations.			// // Launch async dispatch for [0, block_count) range.
	// async.await_all %group			// call @async_dispatch(%c0, %block_count);
				//
				// When async dispatch is off:
	//			//
	// In this example outer loop launches inner block level loops as separate async			// %block_size = ... compute parallel compute block size
	// execute operations which will be executed concurrently.			// %block_count = ... compute the number of compute blocks
	//			//
	// At the end it waits for the completiom of all async execute operations.			// scf.for %block_index = %c0 to %block_count {
				// call @parallel_compute_fn(%block_index, %block_size, ...)
				// }
	//			//
				struct AsyncParallelForPass
				: public AsyncParallelForBase<AsyncParallelForPass> {
				AsyncParallelForPass() = default;
				void runOnOperation() override;
				};

	struct AsyncParallelForRewrite : public OpRewritePattern<scf::ParallelOp> {			struct AsyncParallelForRewrite : public OpRewritePattern<scf::ParallelOp> {
	public:			public:
	AsyncParallelForRewrite(MLIRContext *ctx, int numConcurrentAsyncExecute)			AsyncParallelForRewrite(MLIRContext *ctx, bool asyncDispatch,
	: OpRewritePattern(ctx),			int32_t numWorkerThreads, int32_t targetBlockSize)
	numConcurrentAsyncExecute(numConcurrentAsyncExecute) {}			: OpRewritePattern(ctx), asyncDispatch(asyncDispatch),
				numWorkerThreads(numWorkerThreads), targetBlockSize(targetBlockSize) {}

	LogicalResult matchAndRewrite(scf::ParallelOp op,			LogicalResult matchAndRewrite(scf::ParallelOp op,
	PatternRewriter &rewriter) const override;			PatternRewriter &rewriter) const override;

	private:			private:
	int numConcurrentAsyncExecute;			// The maximum number of tasks per worker thread when sharding parallel op.
				static constexpr int32_t kMaxOversharding = 4;

				bool asyncDispatch;
				int32_t numWorkerThreads;
				int32_t targetBlockSize;
	};			};

	struct AsyncParallelForPass			struct ParallelComputeFunctionType {
	: public AsyncParallelForBase<AsyncParallelForPass> {			FunctionType type;
	AsyncParallelForPass() = default;			llvm::SmallVector<Value> captures;
	AsyncParallelForPass(int numWorkerThreads) {			};
	assert(numWorkerThreads >= 1);
	numConcurrentAsyncExecute = numWorkerThreads;			struct ParallelComputeFunction {
	}			FuncOp func;
	void runOnOperation() override;			llvm::SmallVector<Value> captures;
	};			};

	} // namespace			} // namespace

	LogicalResult			// Converts one-dimensional iteration index in the [0, tripCount) interval
	AsyncParallelForRewrite::matchAndRewrite(scf::ParallelOp op,			// into multidimensional iteration coordinate.
	PatternRewriter &rewriter) const {			static SmallVector<Value> delinearize(ImplicitLocOpBuilder &b, Value index,
				herhut: Longer term this could become an op, as this functionality is needed frequently. Not here, though.
	// We do not currently support rewrite for parallel op with reductions.			const SmallVector<Value> &tripCounts) {
	if (op.getNumReductions() != 0)			SmallVector<Value> coords(tripCounts.size());
	return failure();			assert(!tripCounts.empty() && "tripCounts must be not empty");

				for (ssize_t i = tripCounts.size() - 1; i >= 0; --i) {
				coords[i] = b.create<SignedRemIOp>(index, tripCounts[i]);
				index = b.create<SignedDivIOp>(index, tripCounts[i]);
				}

	MLIRContext *ctx = op.getContext();			return coords;
	Location loc = op.getLoc();			}

	// Index constants used below.			// Returns a function type and implicit captures for a parallel compute
	auto indexTy = IndexType::get(ctx);			// function. We'll need a list of implicit captures to setup block and value
	auto zero = IntegerAttr::get(indexTy, 0);			// mapping when we'll clone the body of the parallel operation.
	auto one = IntegerAttr::get(indexTy, 1);			static ParallelComputeFunctionType
	auto c0 = rewriter.create<ConstantOp>(loc, indexTy, zero);			getParallelComputeFunctionType(scf::ParallelOp op, PatternRewriter &rewriter) {
	auto c1 = rewriter.create<ConstantOp>(loc, indexTy, one);			// Values implicitly captured by the parallel operation.
				llvm::SetVector<Value> captures;
	// Shorthand for signed integer ceil division operation.			getUsedValuesDefinedAbove(op.region(), op.region(), captures);
	auto divup = [&](Value x, Value y) -> Value {
	return rewriter.create<SignedCeilDivIOp>(loc, x, y);			llvm::SmallVector<Type> inputs;
	};			inputs.reserve(2 + 4 * op.getNumLoops() + captures.size());

				Type indexTy = rewriter.getIndexType();

				// One-dimensional iteration space defined by the block index and size.
				inputs.push_back(indexTy); // blockIndex
				inputs.push_back(indexTy); // blockSize

				// Multi-dimensional parallel iteration space defined by the loop trip counts.
				for (unsigned i = 0; i < op.getNumLoops(); ++i)
				inputs.push_back(indexTy); // loop tripCount

				// Parallel operation lower bound, upper bound and step.
				for (unsigned i = 0; i < op.getNumLoops(); ++i) {
				inputs.push_back(indexTy); // lower bound
				inputs.push_back(indexTy); // upper bound
				inputs.push_back(indexTy); // step
				}

	// Compute trip count for each loop induction variable:			// Types of the implicit captures.
	// tripCount = divUp(upperBound - lowerBound, step);			for (Value capture : captures)
	SmallVector<Value, 4> tripCounts(op.getNumLoops());			inputs.push_back(capture.getType());
	for (size_t i = 0; i < op.getNumLoops(); ++i) {
	auto lb = op.lowerBound()[i];			// Convert captures to vector for later convenience.
	auto ub = op.upperBound()[i];			SmallVector<Value> capturesVector(captures.begin(), captures.end());
	auto step = op.step()[i];			return {rewriter.getFunctionType(inputs, TypeRange()), capturesVector};
	auto range = rewriter.create<SubIOp>(loc, ub, lb);
	tripCounts[i] = divup(range, step);
	}			}

	// The target number of concurrent async.execute ops.			// Create a parallel compute function from the parallel operation.
	auto numExecuteOps = rewriter.create<ConstantOp>(			static ParallelComputeFunction
	loc, indexTy, IntegerAttr::get(indexTy, numConcurrentAsyncExecute));			createParallelComputeFunction(scf::ParallelOp op, PatternRewriter &rewriter) {
				OpBuilder::InsertionGuard guard(rewriter);
	// Blocks sizes configuration for each induction variable.			ImplicitLocOpBuilder b(op.getLoc(), rewriter);

	// We try to use maximum available concurrency in outer dimensions first			ModuleOp module = op->getParentOfType<ModuleOp>();
	// (assuming that parallel induction variables are corresponding to some			b.setInsertionPointToStart(&module->getRegion(0).front());
				herhut: Why is this needed?
				ezhulenev: Removed. Leftovers that I forgot to cleanup.
	// multidimensional access, e.g. in (%d0, %d1, ..., %dn) = (<from>) to (<to>)
	// we will try to parallelize iteration along the %d0. If %d0 is too small,			ParallelComputeFunctionType computeFuncType =
	// we'll parallelize iteration over %d1, and so on.			getParallelComputeFunctionType(op, rewriter);
	SmallVector<Value, 4> targetNumBlocks(op.getNumLoops());
	SmallVector<Value, 4> blockSize(op.getNumLoops());			FunctionType type = computeFuncType.type;
	SmallVector<Value, 4> numBlocks(op.getNumLoops());			FuncOp func = FuncOp::create(op.getLoc(), "parallel_compute_fn", type);
				func.setPrivate();
	// Compute block size and number of blocks along the first induction variable.
	targetNumBlocks[0] = numExecuteOps;			// Insert function into the module symbol table and assign it unique name.
	blockSize[0] = divup(tripCounts[0], targetNumBlocks[0]);			SymbolTable symbolTable(module);
	numBlocks[0] = divup(tripCounts[0], blockSize[0]);			symbolTable.insert(func);
				rewriter.getListener()->notifyOperationInserted(func);
	// Assign remaining available concurrency to other induction variables.
	for (size_t i = 1; i < op.getNumLoops(); ++i) {			// Create function entry block.
	targetNumBlocks[i] = divup(targetNumBlocks[i - 1], numBlocks[i - 1]);			Block *block = b.createBlock(&func.getBody(), func.begin(), type.getInputs());
	blockSize[i] = divup(tripCounts[i], targetNumBlocks[i]);			b.setInsertionPointToEnd(block);
	numBlocks[i] = divup(tripCounts[i], blockSize[i]);
	}			unsigned offset = 0; // argument offset for arguments decoding

	// Total number of async compute blocks.			// Load multiple arguments into values vector.
	Value totalBlocks = numBlocks[0];			auto getArguments = [&](unsigned num_arguments) -> SmallVector<Value> {
				herhut: I see the convenience of this but I'd prefer if there was no invisible side-effect on offset.
				ezhulenev: I added a note that it updates the offset, but don't see any other way to keep the uses one-liners and make it more explicit
	for (size_t i = 1; i < op.getNumLoops(); ++i)			SmallVector<Value> values(num_arguments);
	totalBlocks = rewriter.create<MulIOp>(loc, totalBlocks, numBlocks[i]);			for (unsigned i = 0; i < num_arguments; ++i)
				values[i] = block->getArgument(offset++);
	// Create an async.group to wait on all async tokens from async execute ops.			return values;
	auto group =			};
	rewriter.create<CreateGroupOp>(loc, GroupType::get(ctx), totalBlocks);
				// Block iteration position defined by the block index and size.
	// Build a scf.for loop nest from the parallel operation.			Value blockIndex = block->getArgument(offset++);
				Value blockSize = block->getArgument(offset++);
	// Lower/upper bounds for nest block level computations.
	SmallVector<Value, 4> blockLowerBounds(op.getNumLoops());			// Constants used below.
	SmallVector<Value, 4> blockUpperBounds(op.getNumLoops());			Value c0 = b.create<ConstantOp>(b.getIndexAttr(0));
				herhut: `b.create<ConstantIndexOp>(0)`
				ezhulenev: Cool, didn't know it exists. Updated all constants.
	SmallVector<Value, 4> blockInductionVars(op.getNumLoops());			Value c1 = b.create<ConstantOp>(b.getIndexAttr(1));

				// Multi-dimensional parallel iteration space defined by the loop trip counts.
				SmallVector<Value> tripCounts = getArguments(op.getNumLoops());
				herhut: I wonder whether `ArrayRef` would be the better abstraction here.
				ezhulenev: Updated all callsites.

				// Compute a product of trip counts to get the size of the flattened
				// one-dimensional iteration space.
				Value tripCount = tripCounts[0];
				for (unsigned i = 1; i < tripCounts.size(); ++i)
				tripCount = b.create<MulIOp>(tripCount, tripCounts[i]);

				// Parallel operation lower bound, upper bound and step.
				SmallVector<Value> lowerBound = getArguments(op.getNumLoops());
				SmallVector<Value> upperBound = getArguments(op.getNumLoops());
				SmallVector<Value> step = getArguments(op.getNumLoops());

				// Remaining arguments are implicit captures of the parallel operation.
				SmallVector<Value> captures = getArguments(block->getNumArguments() - offset);

				// Find one-dimensional iteration bounds: [blockFirstIndex, blockLastIndex]:
				// blockFirstIndex = blockIndex * blockSize
				Value blockFirstIndex = b.create<MulIOp>(blockIndex, blockSize);

				// The last one-dimensional index in the block defined by the `blockIndex`:
				// blockLastIndex = max((blockIndex + 1) * blockSize, tripCount) - 1
				herhut: how about `max(blockFirstIndex + blockSize, tripCount)`?
				Value blockEnd0 = b.create<AddIOp>(blockIndex, c1);
				Value blockEnd1 = b.create<MulIOp>(blockEnd0, blockSize);
				Value blockEnd2 = b.create<CmpIOp>(CmpIPredicate::sge, blockEnd1, tripCount);
				Value blockEnd3 = b.create<SelectOp>(blockEnd2, tripCount, blockEnd1);
				Value blockLastIndex = b.create<SubIOp>(blockEnd3, c1);

				// Convert one-dimensional indices to multi-dimensional coordinates.
				auto blockFirstCoord = delinearize(b, blockFirstIndex, tripCounts);
				auto blockLastCoord = delinearize(b, blockLastIndex, tripCounts);

				// Compute compute loops upper bounds from the block last coordinates:
				herhut: nit: double compute?
				// blockEndCoord[i] = blockLastCoord[i] + 1
				//
				// Block first and last coordinates can be the same along the outer compute
				// dimension when inner compute dimension containts multple blocks.
				herhut: nit: contains, multiple
				SmallVector<Value> blockEndCoord(op.getNumLoops());
				for (size_t i = 0; i < blockLastCoord.size(); ++i)
				blockEndCoord[i] = b.create<AddIOp>(blockLastCoord[i], c1);

				// Construct a loop nest out of scf.for operations that will iterate over
				// all coordinates in [blockFirstCoord, blockLastCoord] range.
	using LoopBodyBuilder =			using LoopBodyBuilder =
	std::function<void(OpBuilder &, Location, Value, ValueRange)>;			std::function<void(OpBuilder &, Location, Value, ValueRange)>;
	using LoopBuilder = std::function<LoopBodyBuilder(size_t loopIdx)>;			using LoopNestBuilder = std::function<LoopBodyBuilder(size_t loopIdx)>;

				// Parallel region induction variables computed from the multi-dimensional
				// iteration coordinate using parallel operation bounds and step:
				//
				// computeBlockInductionVars[loopIdx] =
				// lowerBound[loopIdx] + blockCoord[loopIdx] * step[loopIdx]
				SmallVector<Value> computeBlockInductionVars(op.getNumLoops());

				// We need to know if we are in the first or last iteration of the
				// multi-dimensional loop for each loop in the nest, so we can decide what
				// loop bounds should we use for the nested loops: bounds defined by compute
				// block interval, or bounds defined by the parallel operation.
				//
				// Example: 2d parallel operation
				// i j
				// loop sizes: [50, 50]
				// first coord: [25, 25]
				// last coord: [30, 30]
				//
				// If `i` is equal to 25 then iteration over `j` should start at 25, when `i`
				// is between 25 and 30 it should start at 0. The upper bound for `j` should
				// be 50, except when `i` is equal to 30, then it should also be 30.
				//
				// Value at ith position specifies if all loops in [0, i) range of the loop
				// nest are in the first/last iteration.
				SmallVector<Value> isBlockFirstCoord(op.getNumLoops());
				SmallVector<Value> isBlockLastCoord(op.getNumLoops());

	// Builds inner loop nest inside async.execute operation that does all the			// Builds inner loop nest inside async.execute operation that does all the
	// work concurrently.			// work concurrently.
	LoopBuilder workLoopBuilder = [&](size_t loopIdx) -> LoopBodyBuilder {			LoopNestBuilder workLoopBuilder = [&](size_t loopIdx) -> LoopBodyBuilder {
	return [&, loopIdx](OpBuilder &b, Location loc, Value iv, ValueRange args) {			return [&, loopIdx](OpBuilder &nestedBuilder, Location loc, Value iv,
	blockInductionVars[loopIdx] = iv;			ValueRange args) {
				ImplicitLocOpBuilder nb(loc, nestedBuilder);

				// Compute induction variable for `loopIdx`.
				computeBlockInductionVars[loopIdx] = nb.create<AddIOp>(
				lowerBound[loopIdx], nb.create<MulIOp>(iv, step[loopIdx]));

				// Check if we are inside first or last iteration of the loop.
				isBlockFirstCoord[loopIdx] =
				nb.create<CmpIOp>(CmpIPredicate::eq, iv, blockFirstCoord[loopIdx]);
				isBlockLastCoord[loopIdx] =
				nb.create<CmpIOp>(CmpIPredicate::eq, iv, blockLastCoord[loopIdx]);

				// Check if the previous loop is in its first of last iteration.
				herhut: nit: or
				if (loopIdx > 0) {
				isBlockFirstCoord[loopIdx] = nb.create<AndOp>(
				isBlockFirstCoord[loopIdx], isBlockFirstCoord[loopIdx - 1]);
				isBlockLastCoord[loopIdx] = nb.create<AndOp>(
				isBlockLastCoord[loopIdx], isBlockLastCoord[loopIdx - 1]);
				}

	// Continue building async loop nest.			// Keep building loop nest.
	if (loopIdx < op.getNumLoops() - 1) {			if (loopIdx < op.getNumLoops() - 1) {
	b.create<scf::ForOp>(			// Select nested loop lower/upper bounds depending on our position in
	loc, blockLowerBounds[loopIdx + 1], blockUpperBounds[loopIdx + 1],			// the multi-dimensional iteration space.
	op.step()[loopIdx + 1], ValueRange(), workLoopBuilder(loopIdx + 1));			auto lb = nb.create<SelectOp>(isBlockFirstCoord[loopIdx],
	b.create<scf::YieldOp>(loc);			blockFirstCoord[loopIdx + 1], c0);

				auto ub = nb.create<SelectOp>(isBlockLastCoord[loopIdx],
				blockEndCoord[loopIdx + 1],
				tripCounts[loopIdx + 1]);

				nb.create<scf::ForOp>(lb, ub, c1, ValueRange(),
				workLoopBuilder(loopIdx + 1));
				nb.create<scf::YieldOp>(loc);
	return;			return;
	}			}

	// Copy the body of the parallel op with new loop bounds.			// Copy the body of the parallel op into the inner-most loop.
	BlockAndValueMapping mapping;			BlockAndValueMapping mapping;
	mapping.map(op.getInductionVars(), blockInductionVars);			mapping.map(op.getInductionVars(), computeBlockInductionVars);
				mapping.map(computeFuncType.captures, captures);

	for (auto &bodyOp : op.getLoopBody().getOps())			for (auto &bodyOp : op.getLoopBody().getOps())
	b.clone(bodyOp, mapping);			nb.clone(bodyOp, mapping);
	};			};
	};			};

	// Builds a loop nest that does async execute op dispatching.			b.create<scf::ForOp>(blockFirstCoord[0], blockEndCoord[0], c1, ValueRange(),
	LoopBuilder asyncLoopBuilder = [&](size_t loopIdx) -> LoopBodyBuilder {			workLoopBuilder(0));
	return [&, loopIdx](OpBuilder &b, Location loc, Value iv, ValueRange args) {			b.create<ReturnOp>(ValueRange());
	auto lb = op.lowerBound()[loopIdx];
	auto ub = op.upperBound()[loopIdx];
	auto step = op.step()[loopIdx];

	// Compute lower bound for the current block:
	// blockLowerBound = iv * blockSize * step + lowerBound
	auto s0 = b.create<MulIOp>(loc, iv, blockSize[loopIdx]);
	auto s1 = b.create<MulIOp>(loc, s0, step);
	auto s2 = b.create<AddIOp>(loc, s1, lb);
	blockLowerBounds[loopIdx] = s2;

	// Compute upper bound for the current block:
	// blockUpperBound = min(upperBound,
	// blockLowerBound + blockSize * step)
	auto e0 = b.create<MulIOp>(loc, blockSize[loopIdx], step);
	auto e1 = b.create<AddIOp>(loc, e0, s2);
	auto e2 = b.create<CmpIOp>(loc, CmpIPredicate::slt, e1, ub);
	auto e3 = b.create<SelectOp>(loc, e2, e1, ub);
	blockUpperBounds[loopIdx] = e3;

	// Continue building async dispatch loop nest.			return {func, std::move(computeFuncType.captures)};
	if (loopIdx < op.getNumLoops() - 1) {			}
	b.create<scf::ForOp>(loc, c0, numBlocks[loopIdx + 1], c1, ValueRange(),
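The bound selection in the new loop nest can be hard to follow from the builder code alone. Below is a minimal sketch in plain C++ (no MLIR), assuming row-major delinearization of the collapsed 1-d index. The `delinearize` and `visitBlock` helpers are hypothetical names introduced only for illustration and are not part of the patch. The `outerFirst` / `outerLast` flags play the role of `isBlockFirstCoord` / `isBlockLastCoord`: an inner loop starts at the block's first coordinate only while all enclosing loops are in their first iteration, and stops at the block's end coordinate only while all enclosing loops are in their last iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

using Coord = std::vector<int64_t>;

// Hypothetical helper: converts a linear index into row-major coordinates.
Coord delinearize(int64_t index, const Coord &tripCounts) {
  Coord coord(tripCounts.size());
  for (size_t i = tripCounts.size(); i-- > 0;) {
    coord[i] = index % tripCounts[i];
    index /= tripCounts[i];
  }
  return coord;
}

// Visits every point whose linear index lies in [firstIndex, lastIndex]
// (both inclusive), using a loop nest with per-dimension bound selection.
void visitBlock(int64_t firstIndex, int64_t lastIndex, const Coord &tripCounts,
                const std::function<void(const Coord &)> &body) {
  assert(!tripCounts.empty() && firstIndex <= lastIndex);
  const Coord first = delinearize(firstIndex, tripCounts);
  const Coord last = delinearize(lastIndex, tripCounts);
  Coord iv(tripCounts.size());

  std::function<void(size_t, bool, bool)> loop = [&](size_t dim,
                                                     bool outerFirst,
                                                     bool outerLast) {
    // Start at the block's first coordinate only if every enclosing loop is
    // in its first iteration; otherwise start from zero.
    int64_t lb = outerFirst ? first[dim] : 0;
    // Stop at the block's end coordinate only if every enclosing loop is in
    // its last iteration; otherwise run to the full trip count.
    int64_t ub = outerLast ? last[dim] + 1 : tripCounts[dim];
    for (int64_t i = lb; i < ub; ++i) {
      iv[dim] = i;
      bool isFirst = outerFirst && i == first[dim];
      bool isLast = outerLast && i == last[dim];
      if (dim + 1 < tripCounts.size())
        loop(dim + 1, isFirst, isLast);
      else
        body(iv);
    }
  };
  loop(0, /*outerFirst=*/true, /*outerLast=*/true);
}
```

For a 4x5 iteration space and the linear range [7, 12], the sketch visits `(1,2) (1,3) (1,4) (2,0) (2,1) (2,2)`: the inner loop starts at 2 only in the first row of the block and stops after 2 only in the last row.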
	asyncLoopBuilder(loopIdx + 1));			// Creates recursive async dispatch function for the given parallel compute
	b.create<scf::YieldOp>(loc);			// function. Dispatch function keeps splitting block range into halves until it
	return;			// reaches a single block, and then executes it inline.
				//
				// Function pseudocode (mix of C++ and MLIR):
				//
				// func @async_dispatch(%block_start : index, %block_end : index, ...) {
				//
				// // Keep splitting block range until we reached a range of size 1.
				// while (%block_end - %block_start > 1) {
				// %mid_index = block_start + (block_end - block_start) / 2;
				// async.execute { call @async_dispatch(%mid_index, %block_end); }
				// %block_end = %mid_index
				// }
				//
				// // Call parallel compute function for a single block.
				// call @parallel_compute_fn(%block_start, %block_size, ...);
				// }
				//
				static FuncOp createAsyncDispatchFunction(ParallelComputeFunction &computeFunc,
				PatternRewriter &rewriter) {
				OpBuilder::InsertionGuard guard(rewriter);
				Location loc = computeFunc.func.getLoc();
				ImplicitLocOpBuilder b(loc, rewriter);

				ModuleOp module = computeFunc.func->getParentOfType<ModuleOp>();
				b.setInsertionPointToStart(&module->getRegion(0).front());
				herhut: Is this needed?

				ArrayRef<Type> computeFuncInputTypes =
				computeFunc.func.type().cast<FunctionType>().getInputs();

				// Compared to the parallel compute function async dispatch function takes
				// additional !async.group argument. Also instead of a single `blockIndex` it
				// takes `blockStart` and `blockEnd` arguments to define the range of
				// dispatched blocks.
				SmallVector<Type> inputTypes;
				inputTypes.push_back(async::GroupType::get(rewriter.getContext()));
				inputTypes.push_back(rewriter.getIndexType()); // add blockStart argument
				inputTypes.append(computeFuncInputTypes.begin(), computeFuncInputTypes.end());

				FunctionType type = rewriter.getFunctionType(inputTypes, TypeRange());
				FuncOp func = FuncOp::create(loc, "async_dispatch_fn", type);
				func.setPrivate();

				// Insert function into the module symbol table and assign it unique name.
				SymbolTable symbolTable(module);
				symbolTable.insert(func);
				rewriter.getListener()->notifyOperationInserted(func);

				// Create function entry block.
				Block *block = b.createBlock(&func.getBody(), func.begin(), type.getInputs());
				b.setInsertionPointToEnd(block);

				Type indexTy = b.getIndexType();
				Value c1 = b.create<ConstantOp>(b.getIndexAttr(1));
				herhut: nit: 'ConstantIndexOp'
				Value c2 = b.create<ConstantOp>(b.getIndexAttr(2));

				// Get the async group that will track async dispatch completion.
				Value group = block->getArgument(0);

				// Get the block iteration range: [blockStart, blockEnd)
				Value blockStart = block->getArgument(1);
				Value blockEnd = block->getArgument(2);

				// Create a work splitting while loop for the [blockStart, blockEnd) range.
				SmallVector<Type> types = {indexTy, indexTy};
				SmallVector<Value> operands = {blockStart, blockEnd};

				// Create a recursive dispatch loop.
				scf::WhileOp whileOp = b.create<scf::WhileOp>(types, operands);
				Block *before = b.createBlock(&whileOp.before(), {}, types);
				Block *after = b.createBlock(&whileOp.after(), {}, types);

				// Setup dispatch loop condition block: decide if we need to go into the
				// `after` block and launch one more async dispatch.
				{
				b.setInsertionPointToEnd(before);
				Value start = before->getArgument(0);
				Value end = before->getArgument(1);
				Value distance = b.create<SubIOp>(end, start);
				Value dispatch = b.create<CmpIOp>(CmpIPredicate::sgt, distance, c1);
				b.create<scf::ConditionOp>(dispatch, before->getArguments());
	}			}

	// Build the inner loop nest that will do the actual work inside the			// Setup the async dispatch loop body: recursively call dispatch function
	// `async.execute` body region.			// for second the half of the original range and go to the next iteration.
				herhut: nit: the second half
				{
				b.setInsertionPointToEnd(after);
				Value start = after->getArgument(0);
				Value end = after->getArgument(1);
				Value distance = b.create<SubIOp>(end, start);
				Value halfDistance = b.create<SignedDivIOp>(distance, c2);
				Value midIndex = b.create<AddIOp>(after->getArgument(0), halfDistance);
				herhut: nit: use `start`?

				// Call parallel compute function inside the async.execute region.
	auto executeBodyBuilder = [&](OpBuilder &executeBuilder,			auto executeBodyBuilder = [&](OpBuilder &executeBuilder,
	Location executeLoc,			Location executeLoc, ValueRange executeArgs) {
	ValueRange executeArgs) {			// Update the original `blockStart` and `blockEnd` with new range.
	executeBuilder.create<scf::ForOp>(executeLoc, blockLowerBounds[0],			SmallVector<Value> operands{block->getArguments().begin(),
	blockUpperBounds[0], op.step()[0],			block->getArguments().end()};
	ValueRange(), workLoopBuilder(0));			operands[1] = midIndex;
				operands[2] = end;

				executeBuilder.create<CallOp>(executeLoc, func.sym_name(),
				func.getCallableResults(), operands);
	executeBuilder.create<async::YieldOp>(executeLoc, ValueRange());			executeBuilder.create<async::YieldOp>(executeLoc, ValueRange());
	};			};

	auto execute = b.create<ExecuteOp>(			// Create async.execute operation to dispatch half of the block range.
	loc, /resultTypes=/TypeRange(), /dependencies=/ValueRange(),			auto execute = b.create<ExecuteOp>(TypeRange(), ValueRange(), ValueRange(),
	/operands=/ValueRange(), executeBodyBuilder);			executeBodyBuilder);
	auto rankType = IndexType::get(ctx);			b.create<AddToGroupOp>(indexTy, execute.token(), group);
	b.create<AddToGroupOp>(loc, rankType, execute.token(), group.result());			b.create<scf::YieldOp>(ValueRange({after->getArgument(0), midIndex}));
				herhut: nit: use `start`?
	b.create<scf::YieldOp>(loc);			}

				// After dispatching async operations to process the tail of the block range
				// call the parallel compute function for the first block of the range.
				b.setInsertionPointAfter(whileOp);

				// Drop async dispatch specific arguments: async group, block start and end.
				auto forwardedInputs = block->getArguments().drop_front(3);
				SmallVector<Value> computeFuncOperands = {blockStart};
				computeFuncOperands.append(forwardedInputs.begin(), forwardedInputs.end());

				b.create<CallOp>(computeFunc.func.sym_name(),
				computeFunc.func.getCallableResults(), computeFuncOperands);
				b.create<ReturnOp>(ValueRange());

				return func;
				}
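As a rough illustration of what the generated `async_dispatch_fn` does at run time, here is a minimal sketch in standard C++. It is not the MLIR lowering: `std::async` stands in for `async.execute`, and the hypothetical `TaskGroup` class stands in for `!async.group`. The names `TaskGroup`, `asyncDispatch` and `runBlocks` are assumptions made for this example only.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <future>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for !async.group: collects tasks and waits for all.
class TaskGroup {
public:
  void add(std::future<void> task) {
    std::lock_guard<std::mutex> lock(mu);
    tasks.push_back(std::move(task));
  }

  // Running tasks may still add new tasks, so keep draining until empty.
  void awaitAll() {
    for (;;) {
      std::future<void> task;
      {
        std::lock_guard<std::mutex> lock(mu);
        if (tasks.empty())
          return;
        task = std::move(tasks.back());
        tasks.pop_back();
      }
      task.wait();
    }
  }

private:
  std::mutex mu;
  std::vector<std::future<void>> tasks;
};

// Keeps halving [blockStart, blockEnd): the second half is dispatched
// asynchronously, the first half is split further by the current thread.
void asyncDispatch(TaskGroup &group, int64_t blockStart, int64_t blockEnd,
                   const std::function<void(int64_t)> &computeBlock) {
  assert(blockStart < blockEnd);
  while (blockEnd - blockStart > 1) {
    int64_t mid = blockStart + (blockEnd - blockStart) / 2;
    group.add(std::async(std::launch::async,
                         [&group, mid, blockEnd, &computeBlock] {
                           asyncDispatch(group, mid, blockEnd, computeBlock);
                         }));
    blockEnd = mid;
  }
  // A single block is left: compute it inline.
  computeBlock(blockStart);
}

// Launches the splitting for [0, blockCount) and waits for completion.
void runBlocks(int64_t blockCount,
               const std::function<void(int64_t)> &computeBlock) {
  TaskGroup group;
  asyncDispatch(group, 0, blockCount, computeBlock);
  group.awaitAll();
}
```

With this shape the caller submits only about log2(blockCount) tasks itself; the remaining submissions happen inside the already dispatched tasks.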

				// Launch async dispatch of the parallel compute function.
				static void doAsyncDispatch(ImplicitLocOpBuilder &b, PatternRewriter &rewriter,
				ParallelComputeFunction &parallelComputeFunction,
				scf::ParallelOp op, Value blockSize,
				Value blockCount,
				const SmallVector<Value> &tripCounts) {
				MLIRContext *ctx = op->getContext();

				// Add one more level of indirection to dispatch parallel compute functions
				// using async operations and recursive work splitting.
				FuncOp asyncDispatchFunction =
				createAsyncDispatchFunction(parallelComputeFunction, rewriter);

				Value c0 = b.create<ConstantOp>(b.getIndexAttr(0));
				herhut: `ConstantIndexOp`
				Value c1 = b.create<ConstantOp>(b.getIndexAttr(1));

				// Create an async.group to wait on all async tokens from the concurrent
				// execution of multiple parallel compute functions. First block will be
				// executed synchronously in the caller thread.
				Value groupSize = b.create<SubIOp>(blockCount, c1);
				Value group = b.create<CreateGroupOp>(GroupType::get(ctx), groupSize);

				// Pack the async dispatch function operands to launch the work splitting.
				SmallVector<Value> asyncDispatchOperands = {group, c0, blockCount, blockSize};
				asyncDispatchOperands.append(tripCounts);
				asyncDispatchOperands.append(op.lowerBound().begin(), op.lowerBound().end());
				asyncDispatchOperands.append(op.upperBound().begin(), op.upperBound().end());
				asyncDispatchOperands.append(op.step().begin(), op.step().end());
				asyncDispatchOperands.append(parallelComputeFunction.captures);

				// Launch async dispatch function for [0, blockCount) range.
				b.create<CallOp>(asyncDispatchFunction.sym_name(),
				asyncDispatchFunction.getCallableResults(),
				asyncDispatchOperands);

				// Wait for the completion of all parallel compute operations.
				b.create<AwaitAllOp>(group);
				}
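The group size of `blockCount - 1` follows from the splitting scheme: every invocation of the dispatch function computes exactly one block, and only the root invocation is not wrapped in an async task. A small sketch, assuming the same halving rule as the pseudocode above, counts the launches directly. `countAsyncLaunches` is a hypothetical helper for this check, not something the patch defines.

```cpp
#include <cassert>
#include <cstdint>

// Counts how many async tasks the recursive splitting of
// [blockStart, blockEnd) would launch, without running anything.
int64_t countAsyncLaunches(int64_t blockStart, int64_t blockEnd) {
  assert(blockStart < blockEnd);
  int64_t launches = 0;
  while (blockEnd - blockStart > 1) {
    int64_t mid = blockStart + (blockEnd - blockStart) / 2;
    // One task for the second half, plus whatever that task launches itself.
    launches += 1 + countAsyncLaunches(mid, blockEnd);
    blockEnd = mid;
  }
  return launches;
}
```

For any range of `n >= 1` blocks this returns `n - 1`, which matches the number of tokens the group is created to track.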

				// Dispatch parallel compute functions by submitting all async compute tasks
				// from a simple for loop in the caller thread.
				static void
				doSequantialDispatch(ImplicitLocOpBuilder &b, PatternRewriter &rewriter,
				ParallelComputeFunction &parallelComputeFunction,
				scf::ParallelOp op, Value blockSize, Value blockCount,
				const SmallVector<Value> &tripCounts) {
				MLIRContext *ctx = op->getContext();

				FuncOp compute = parallelComputeFunction.func;

				Value c0 = b.create<ConstantOp>(b.getIndexAttr(0));
				herhut: `ConstantIndexOp`
				Value c1 = b.create<ConstantOp>(b.getIndexAttr(1));

				// Create an async.group to wait on all async tokens from the concurrent
				// execution of multiple parallel compute functions. First block will be
				// executed synchronously in the caller thread.
				Value groupSize = b.create<SubIOp>(blockCount, c1);
				Value group = b.create<CreateGroupOp>(GroupType::get(ctx), groupSize);

				// Call parallel compute function for all blocks.
				using LoopBodyBuilder =
				std::function<void(OpBuilder &, Location, Value, ValueRange)>;

				// Returns parallel compute function operands to process the given block.
				auto computeFuncOperands = [&](Value blockIndex) -> SmallVector<Value> {
				SmallVector<Value> computeFuncOperands = {blockIndex, blockSize};
				computeFuncOperands.append(tripCounts);
				computeFuncOperands.append(op.lowerBound().begin(), op.lowerBound().end());
				computeFuncOperands.append(op.upperBound().begin(), op.upperBound().end());
				computeFuncOperands.append(op.step().begin(), op.step().end());
				computeFuncOperands.append(parallelComputeFunction.captures);
				return computeFuncOperands;
	};			};

				// Induction variable is the index of the block: [0, blockCount).
				LoopBodyBuilder loopBuilder = [&](OpBuilder &loopBuilder, Location loc,
				Value iv, ValueRange args) {
				ImplicitLocOpBuilder nb(loc, loopBuilder);

				// Call parallel compute function inside the async.execute region.
				auto executeBodyBuilder = [&](OpBuilder &executeBuilder,
				Location executeLoc, ValueRange executeArgs) {
				executeBuilder.create<CallOp>(executeLoc, compute.sym_name(),
				compute.getCallableResults(),
				computeFuncOperands(iv));
				executeBuilder.create<async::YieldOp>(executeLoc, ValueRange());
	};			};

	// Start building a loop nest from the first induction variable.			// Create async.execute operation to launch parallel compute function.
	rewriter.create<scf::ForOp>(loc, c0, numBlocks[0], c1, ValueRange(),			auto execute = nb.create<ExecuteOp>(TypeRange(), ValueRange(), ValueRange(),
	asyncLoopBuilder(0));			executeBodyBuilder);
				nb.create<AddToGroupOp>(rewriter.getIndexType(), execute.token(), group);
				nb.create<scf::YieldOp>();
				};

	// Wait for the completion of all subtasks.			// Iterate over all compute blocks and launch parallel compute operations.
	rewriter.create<AwaitAllOp>(loc, group.result());			b.create<scf::ForOp>(c1, blockCount, c1, ValueRange(), loopBuilder);

	// Erase the original parallel operation.			// Call parallel compute function for the first block in the caller thread.
				b.create<CallOp>(compute.sym_name(), compute.getCallableResults(),
				computeFuncOperands(c0));

				// Wait for the completion of all async compute operations.
				b.create<AwaitAllOp>(group);
				}
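For comparison, the sequential dispatch strategy can be sketched in standard C++ as follows. This is only an analogy: `std::async` stands in for `async.execute`, a vector of futures stands in for the async group, and `sequentialDispatch` is a hypothetical name chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <future>
#include <vector>

// Submits blocks [1, blockCount) as async tasks from the caller thread,
// computes block 0 inline, then waits for all submitted tasks.
void sequentialDispatch(int64_t blockCount,
                        const std::function<void(int64_t)> &computeBlock) {
  assert(blockCount >= 1);
  std::vector<std::future<void>> group;
  for (int64_t i = 1; i < blockCount; ++i)
    group.push_back(std::async(std::launch::async,
                               [&computeBlock, i] { computeBlock(i); }));

  // The first block runs in the caller thread.
  computeBlock(0);

  // Counterpart of async.await_all.
  for (std::future<void> &task : group)
    task.wait();
}
```

Here the caller thread performs all `blockCount - 1` submissions itself, which is the cost the recursive splitting avoids.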

				LogicalResult
				AsyncParallelForRewrite::matchAndRewrite(scf::ParallelOp op,
				PatternRewriter &rewriter) const {
				// We do not currently support rewrite for parallel op with reductions.
				if (op.getNumReductions() != 0)
				return failure();

				ImplicitLocOpBuilder b(op.getLoc(), rewriter);

				// Compute trip count for each loop induction variable:
				// tripCount = ceil_div(upperBound - lowerBound, step);
				SmallVector<Value> tripCounts(op.getNumLoops());
				for (size_t i = 0; i < op.getNumLoops(); ++i) {
				auto lb = op.lowerBound()[i];
				auto ub = op.upperBound()[i];
				auto step = op.step()[i];
				auto range = b.create<SubIOp>(ub, lb);
				tripCounts[i] = b.create<SignedCeilDivIOp>(range, step);
				}

				// Compute a product of trip counts to get the 1-dimensional iteration space
				// for the scf.parallel operation.
				Value tripCount = tripCounts[0];
				for (size_t i = 1; i < tripCounts.size(); ++i)
				tripCount = b.create<MulIOp>(tripCount, tripCounts[i]);

				auto indexTy = b.getIndexType();

				// Do not overload worker threads with too many compute blocks.
				Value maxComputeBlocks = b.create<ConstantOp>(
				indexTy, b.getIndexAttr(numWorkerThreads * kMaxOversharding));
				herhut: Some more ConstantIndexOp opportunities

				// Target block size from the pass parameters.
				Value targetComputeBlockSize =
				b.create<ConstantOp>(indexTy, b.getIndexAttr(targetBlockSize));

				// Compute parallel block size from the parallel problem size:
				// blockSize = min(tripCount,
				// max(divup(tripCount, maxComputeBlocks),
				herhut: Maybe use ceil_div instead of divup?
				// targetComputeBlockSize))
				Value bs0 = b.create<SignedCeilDivIOp>(tripCount, maxComputeBlocks);
				Value bs1 = b.create<CmpIOp>(CmpIPredicate::sge, bs0, targetComputeBlockSize);
				Value bs2 = b.create<SelectOp>(bs1, bs0, targetComputeBlockSize);
				Value bs3 = b.create<CmpIOp>(CmpIPredicate::sle, tripCount, bs2);
				Value blockSize = b.create<SelectOp>(bs3, tripCount, bs2);
				Value blockCount = b.create<SignedCeilDivIOp>(tripCount, blockSize);
				herhut: Out of curiosity: Why do you use a signed ceildiv everywhere? Both operands are known to be positive here and in most of the rest of this file. Wouldn't the unsigned one suffice here?
				ezhulenev (author): Step can be negative, so I used SignedCeilDiv there for correctness, and everywhere else just for "consistency" :)

				// Create a parallel compute function that takes a block id and computes the
				// parallel operation body for a subset of iteration space.
				ParallelComputeFunction parallelComputeFunction =
				createParallelComputeFunction(op, rewriter);

				// Dispatch parallel compute function using async recursive work splitting, or
				// by submitting compute tasks sequentially from a caller thread.
				if (asyncDispatch) {
				doAsyncDispatch(b, rewriter, parallelComputeFunction, op, blockSize,
				blockCount, tripCounts);
				} else {
				doSequantialDispatch(b, rewriter, parallelComputeFunction, op, blockSize,
				blockCount, tripCounts);
				}

				// Parallel operation was replaces with a block iteration loop.
				herhut: nit replaced.
	rewriter.eraseOp(op);			rewriter.eraseOp(op);

	return success();			return success();
	}			}
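The trip count and block size arithmetic emitted by the rewrite can be checked with ordinary integers. The sketch below mirrors the formulas from the comments (`tripCount = ceil_div(upperBound - lowerBound, step)` and `blockSize = min(tripCount, max(ceil_div(tripCount, maxComputeBlocks), targetComputeBlockSize))`). The function and struct names are hypothetical. `maxComputeBlocks` is passed in as a plain parameter because the value of `kMaxOversharding` is not shown in this part of the diff, and a non-empty iteration space is assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Signed ceiling division; also handles a negative step, as discussed in
// the review thread above.
int64_t ceilDivSigned(int64_t a, int64_t b) {
  assert(b != 0);
  int64_t q = a / b; // C++ integer division truncates toward zero.
  if (a % b != 0 && ((a < 0) == (b < 0)))
    ++q;
  return q;
}

int64_t computeTripCount(int64_t lowerBound, int64_t upperBound,
                         int64_t step) {
  return ceilDivSigned(upperBound - lowerBound, step);
}

struct BlockPartition {
  int64_t blockSize;
  int64_t blockCount;
};

// `tripCount` is the product of the per-dimension trip counts.
BlockPartition computeBlockPartition(int64_t tripCount,
                                     int64_t maxComputeBlocks,
                                     int64_t targetBlockSize) {
  assert(tripCount > 0 && maxComputeBlocks > 0 && targetBlockSize > 0);
  int64_t bs0 = ceilDivSigned(tripCount, maxComputeBlocks);
  int64_t bs1 = std::max(bs0, targetBlockSize);
  int64_t blockSize = std::min(tripCount, bs1);
  return {blockSize, ceilDivSigned(tripCount, blockSize)};
}
```

For example, a trip count of 1000 with at most 32 compute blocks and a target block size of 10 gives blocks of 32 iterations, and 32 blocks in total (the last one is partially filled).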

	void AsyncParallelForPass::runOnOperation() {			void AsyncParallelForPass::runOnOperation() {
	MLIRContext *ctx = &getContext();			MLIRContext *ctx = &getContext();

	RewritePatternSet patterns(ctx);			RewritePatternSet patterns(ctx);
	patterns.add<AsyncParallelForRewrite>(ctx, numConcurrentAsyncExecute);			patterns.add<AsyncParallelForRewrite>(ctx, asyncDispatch, numWorkerThreads,
				targetBlockSize);

	if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))			if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))
	signalPassFailure();			signalPassFailure();
	}			}

	std::unique_ptr<Pass> mlir::createAsyncParallelForPass() {			std::unique_ptr<Pass> mlir::createAsyncParallelForPass() {
	return std::make_unique<AsyncParallelForPass>();			return std::make_unique<AsyncParallelForPass>();
	}			}

	std::unique_ptr<Pass> mlir::createAsyncParallelForPass(int numWorkerThreads) {
	return std::make_unique<AsyncParallelForPass>(numWorkerThreads);
	}

mlir/test/Dialect/Async/async-parallel-for-async-dispatch.mlir

This file was added.

				// RUN: mlir-opt %s -split-input-file -async-parallel-for=async-dispatch=true \
				// RUN: | FileCheck %s

				// CHECK-LABEL: @loop_1d
				func @loop_1d(%arg0: index, %arg1: index, %arg2: index, %arg3: memref<?xf32>) {
				// CHECK: %[[GROUP:.*]] = async.create_group
				// CHECK: call @async_dispatch_fn
				// CHECK: async.await_all %[[GROUP]]
				scf.parallel (%i) = (%arg0) to (%arg1) step (%arg2) {
				%one = constant 1.0 : f32
				memref.store %one, %arg3[%i] : memref<?xf32>
				}
				return
				}

				// CHECK-LABEL: func private @parallel_compute_fn
				// CHECK: scf.for
				// CHECK: memref.store

				// CHECK-LABEL: func private @async_dispatch_fn
				// CHECK-SAME: %[[GROUP:arg0]]: !async.group,
				// CHECK-SAME: %[[BLOCK_START:arg1]]: index
				// CHECK-SAME: %[[BLOCK_END:arg2]]: index

				// CHECK: scf.while (%[[S:.*]] = %[[BLOCK_START]],
				// CHECK-SAME: %[[E:.*]] = %[[BLOCK_END]])
				herhut: This test is super sparse. Why even capture `S` and `E` here, as they are not used? Unless you want to ensure that the right block numbers are passed, but that needs a less sparse test.
				ezhulenev (author): I added a bit more details, but still decided not to add checks for all of the generated IR, just checked the main structure. It's just too many checks, and I find it easier to rely on the mlir integration tests that run the code.
				// CHECK: } do {
				// CHECK: %[[TOKEN:.*]] = async.execute
				// CHECK: call @async_dispatch_fn
				// CHECK: async.add_to_group
				// CHECK: }

				// CHECK: call @parallel_compute_fn(%[[BLOCK_START]]

				// -----

				// CHECK-LABEL: @loop_2d
				func @loop_2d(%arg0: index, %arg1: index, %arg2: index, // lb, ub, step
				%arg3: index, %arg4: index, %arg5: index, // lb, ub, step
				%arg6: memref<?x?xf32>) {
				// CHECK: %[[GROUP:.*]] = async.create_group
				// CHECK: call @async_dispatch_fn
				// CHECK: async.await_all %[[GROUP]]
				scf.parallel (%i0, %i1) = (%arg0, %arg3) to (%arg1, %arg4)
				step (%arg2, %arg5) {
				%one = constant 1.0 : f32
				memref.store %one, %arg6[%i0, %i1] : memref<?x?xf32>
				}
				return
				}

				// CHECK-LABEL: func private @parallel_compute_fn
				// CHECK: scf.for
				// CHECK: scf.for
				// CHECK: memref.store

				// CHECK-LABEL: func private @async_dispatch_fn

mlir/test/Dialect/Async/async-parallel-for-seq-dispatch.mlir

This file was moved from mlir/test/Dialect/Async/async-parallel-for.mlir.

	// RUN: mlir-opt %s -async-parallel-for | FileCheck %s			// RUN: mlir-opt %s -split-input-file -async-parallel-for=async-dispatch=false \
				// RUN: | FileCheck %s --dump-input=always

	// CHECK-LABEL: @loop_1d			// CHECK-LABEL: @loop_1d
	func @loop_1d(%arg0: index, %arg1: index, %arg2: index, %arg3: memref<?xf32>) {			func @loop_1d(%arg0: index, %arg1: index, %arg2: index, %arg3: memref<?xf32>) {
	// CHECK: %[[GROUP:.*]] = async.create_group			// CHECK: %[[GROUP:.*]] = async.create_group
	// CHECK: scf.for			// CHECK: scf.for
	// CHECK: %[[TOKEN:.*]] = async.execute {			// CHECK: %[[TOKEN:.*]] = async.execute
	// CHECK: scf.for			// CHECK: call @parallel_compute_fn
	// CHECK: memref.store
	// CHECK: async.yield			// CHECK: async.yield
	// CHECK: }
	// CHECK: async.add_to_group %[[TOKEN]], %[[GROUP]]			// CHECK: async.add_to_group %[[TOKEN]], %[[GROUP]]
				// CHECK: call @parallel_compute_fn
	// CHECK: async.await_all %[[GROUP]]			// CHECK: async.await_all %[[GROUP]]
	scf.parallel (%i) = (%arg0) to (%arg1) step (%arg2) {			scf.parallel (%i) = (%arg0) to (%arg1) step (%arg2) {
	%one = constant 1.0 : f32			%one = constant 1.0 : f32
	memref.store %one, %arg3[%i] : memref<?xf32>			memref.store %one, %arg3[%i] : memref<?xf32>
	}			}

	return			return
	}			}

				// CHECK-LABEL: func private @parallel_compute_fn
				// CHECK: scf.for
				// CHECK: memref.store

				// -----

	// CHECK-LABEL: @loop_2d			// CHECK-LABEL: @loop_2d
	func @loop_2d(%arg0: index, %arg1: index, %arg2: index, // lb, ub, step			func @loop_2d(%arg0: index, %arg1: index, %arg2: index, // lb, ub, step
	%arg3: index, %arg4: index, %arg5: index, // lb, ub, step			%arg3: index, %arg4: index, %arg5: index, // lb, ub, step
	%arg6: memref<?x?xf32>) {			%arg6: memref<?x?xf32>) {
	// CHECK: %[[GROUP:.*]] = async.create_group			// CHECK: %[[GROUP:.*]] = async.create_group
	// CHECK: scf.for			// CHECK: scf.for
	// CHECK: scf.for			// CHECK: %[[TOKEN:.*]] = async.execute
	// CHECK: %[[TOKEN:.*]] = async.execute {			// CHECK: call @parallel_compute_fn
	// CHECK: scf.for
	// CHECK: scf.for
	// CHECK: memref.store
	// CHECK: async.yield			// CHECK: async.yield
	// CHECK: }
	// CHECK: async.add_to_group %[[TOKEN]], %[[GROUP]]			// CHECK: async.add_to_group %[[TOKEN]], %[[GROUP]]
				// CHECK: call @parallel_compute_fn
	// CHECK: async.await_all %[[GROUP]]			// CHECK: async.await_all %[[GROUP]]
	scf.parallel (%i0, %i1) = (%arg0, %arg3) to (%arg1, %arg4)			scf.parallel (%i0, %i1) = (%arg0, %arg3) to (%arg1, %arg4)
	step (%arg2, %arg5) {			step (%arg2, %arg5) {
	%one = constant 1.0 : f32			%one = constant 1.0 : f32
	memref.store %one, %arg6[%i0, %i1] : memref<?x?xf32>			memref.store %one, %arg6[%i0, %i1] : memref<?x?xf32>
	}			}

	return			return
	}			}

				// CHECK-LABEL: func private @parallel_compute_fn
				// CHECK: scf.for
				// CHECK: scf.for
				// CHECK: memref.store

mlir/test/Dialect/Async/async-parallel-for.mlir

This file was moved to mlir/test/Dialect/Async/async-parallel-for-seq-dispatch.mlir.

mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir

This file was copied to mlir/test/Integration/Dialect/Async/CPU/microbench-scf-async-parallel-for.mlir.

	// RUN: mlir-opt %s \			// RUN: mlir-opt %s \
	// RUN: -linalg-tile-to-parallel-loops="linalg-tile-sizes=256" \			// RUN: -convert-linalg-to-parallel-loops \
	// RUN: -async-parallel-for="num-concurrent-async-execute=4" \			// RUN: -async-parallel-for \
	// RUN: -async-to-async-runtime \			// RUN: -async-to-async-runtime \
	// RUN: -async-runtime-ref-counting \			// RUN: -async-runtime-ref-counting \
	// RUN: -async-runtime-ref-counting-opt \			// FIXME: -async-runtime-ref-counting-opt \
				herhut: Is this broken by this change? Is that because we now pass groups through functions?
				ezhulenev (author): There is a bug in reference counting optimization, it incorrectly removes a pair of addRef<->dropRef and destroys the group too early. Will send a fix in one of the next PRs.
	// RUN: -convert-async-to-llvm \			// RUN: -convert-async-to-llvm \
	// RUN: -lower-affine \
	// RUN: -convert-linalg-to-loops \
	// RUN: -convert-scf-to-std \			// RUN: -convert-scf-to-std \
	// RUN: -std-expand \			// RUN: -std-expand \
	// RUN: -convert-vector-to-llvm \			// RUN: -convert-vector-to-llvm \
	// RUN: -convert-std-to-llvm \			// RUN: -convert-std-to-llvm \
	// RUN: | mlir-cpu-runner \			// RUN: | mlir-cpu-runner \
	// RUN: -e entry -entry-point-result=void -O3 \			// RUN: -e entry -entry-point-result=void -O3 \
	// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \			// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
	// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_c_runner_utils%shlibext\			// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_c_runner_utils%shlibext\
	[114 lines not shown]

mlir/test/Integration/Dialect/Async/CPU/microbench-scf-async-parallel-for.mlir

This file was copied from mlir/test/Integration/Dialect/Async/CPU/microbench-linalg-async-parallel-for.mlir.

// RUN: mlir-opt %s \		// RUN: mlir-opt %s \
// RUN: -linalg-tile-to-parallel-loops="linalg-tile-sizes=256" \		// RUN: -async-parallel-for \
// RUN: -async-parallel-for="num-concurrent-async-execute=4" \
// RUN: -async-to-async-runtime \		// RUN: -async-to-async-runtime \
// RUN: -async-runtime-ref-counting \		// RUN: -async-runtime-ref-counting \
// RUN: -async-runtime-ref-counting-opt \		// FIXME: -async-runtime-ref-counting-opt \
		// RUN: -convert-async-to-llvm \
		// RUN: -convert-linalg-to-loops \
		// RUN: -convert-scf-to-std \
		// RUN: -std-expand \
		// RUN: -convert-vector-to-llvm \
		// RUN: -convert-std-to-llvm \
		// RUN: | mlir-cpu-runner \
		// RUN: -e entry -entry-point-result=void -O3 \
		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_c_runner_utils%shlibext\
		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext \
		// RUN: | FileCheck %s --dump-input=always

		// RUN: mlir-opt %s \
		// RUN: -async-parallel-for=async-dispatch=false \
		// RUN: -async-to-async-runtime \
		// RUN: -async-runtime-ref-counting \
		// FIXME: -async-runtime-ref-counting-opt \
// RUN: -convert-async-to-llvm \		// RUN: -convert-async-to-llvm \
// RUN: -lower-affine \
// RUN: -convert-linalg-to-loops \		// RUN: -convert-linalg-to-loops \
// RUN: -convert-scf-to-std \		// RUN: -convert-scf-to-std \
// RUN: -std-expand \		// RUN: -std-expand \
// RUN: -convert-vector-to-llvm \		// RUN: -convert-vector-to-llvm \
// RUN: -convert-std-to-llvm \		// RUN: -convert-std-to-llvm \
// RUN: | mlir-cpu-runner \		// RUN: | mlir-cpu-runner \
// RUN: -e entry -entry-point-result=void -O3 \		// RUN: -e entry -entry-point-result=void -O3 \
// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
[10 lines not shown]
// RUN: -e entry -entry-point-result=void -O3 \		// RUN: -e entry -entry-point-result=void -O3 \
// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_c_runner_utils%shlibext\		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_c_runner_utils%shlibext\
// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext \		// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext \
// RUN: | FileCheck %s --dump-input=always		// RUN: | FileCheck %s --dump-input=always

#map0 = affine_map<(d0, d1) -> (d0, d1)>		#map0 = affine_map<(d0, d1) -> (d0, d1)>

func @linalg_generic(%lhs: memref<?x?xf32>,		func @scf_parallel(%lhs: memref<?x?xf32>,
%rhs: memref<?x?xf32>,		%rhs: memref<?x?xf32>,
%sum: memref<?x?xf32>) {		%sum: memref<?x?xf32>) {
linalg.generic {		%c0 = constant 0 : index
indexing_maps = [#map0, #map0, #map0],		%c1 = constant 1 : index
iterator_types = ["parallel", "parallel"]
}		%d0 = memref.dim %lhs, %c0 : memref<?x?xf32>
ins(%lhs, %rhs : memref<?x?xf32>, memref<?x?xf32>)		%d1 = memref.dim %lhs, %c1 : memref<?x?xf32>
outs(%sum : memref<?x?xf32>)
{		scf.parallel (%i, %j) = (%c0, %c0) to (%d0, %d1) step (%c1, %c1) {
^bb0(%lhs_in: f32, %rhs_in: f32, %sum_out: f32):		%lv = memref.load %lhs[%i, %j] : memref<?x?xf32>
%0 = addf %lhs_in, %rhs_in : f32		%rv = memref.load %lhs[%i, %j] : memref<?x?xf32>
linalg.yield %0 : f32		%r = addf %lv, %rv : f32
		memref.store %r, %sum[%i, %j] : memref<?x?xf32>
}		}

return		return
}		}

func @entry() {		func @entry() {
%f1 = constant 1.0 : f32		%f1 = constant 1.0 : f32
%f4 = constant 4.0 : f32		%f4 = constant 4.0 : f32
[11 lines not shown]

linalg.fill(%f1, %LHS10) : f32, memref<1x10xf32>		linalg.fill(%f1, %LHS10) : f32, memref<1x10xf32>
linalg.fill(%f1, %RHS10) : f32, memref<1x10xf32>		linalg.fill(%f1, %RHS10) : f32, memref<1x10xf32>

%LHS = memref.cast %LHS10 : memref<1x10xf32> to memref<?x?xf32>		%LHS = memref.cast %LHS10 : memref<1x10xf32> to memref<?x?xf32>
%RHS = memref.cast %RHS10 : memref<1x10xf32> to memref<?x?xf32>		%RHS = memref.cast %RHS10 : memref<1x10xf32> to memref<?x?xf32>
%DST = memref.cast %DST10 : memref<1x10xf32> to memref<?x?xf32>		%DST = memref.cast %DST10 : memref<1x10xf32> to memref<?x?xf32>

call @linalg_generic(%LHS, %RHS, %DST)		call @scf_parallel(%LHS, %RHS, %DST)
: (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()		: (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()

// CHECK: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]		// CHECK: [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
%U = memref.cast %DST10 : memref<1x10xf32> to memref<*xf32>		%U = memref.cast %DST10 : memref<1x10xf32> to memref<*xf32>
call @print_memref_f32(%U): (memref<*xf32>) -> ()		call @print_memref_f32(%U): (memref<*xf32>) -> ()

memref.dealloc %LHS10: memref<1x10xf32>		memref.dealloc %LHS10: memref<1x10xf32>
memref.dealloc %RHS10: memref<1x10xf32>		memref.dealloc %RHS10: memref<1x10xf32>
[... 10 unchanged lines not shown, in func @entry() ...]
%LHS0 = memref.cast %LHS1024 : memref<1024x1024xf32> to memref<?x?xf32>		%LHS0 = memref.cast %LHS1024 : memref<1024x1024xf32> to memref<?x?xf32>
%RHS0 = memref.cast %RHS1024 : memref<1024x1024xf32> to memref<?x?xf32>		%RHS0 = memref.cast %RHS1024 : memref<1024x1024xf32> to memref<?x?xf32>
%DST0 = memref.cast %DST1024 : memref<1024x1024xf32> to memref<?x?xf32>		%DST0 = memref.cast %DST1024 : memref<1024x1024xf32> to memref<?x?xf32>

//		//
// Warm up.		// Warm up.
//		//

call @linalg_generic(%LHS0, %RHS0, %DST0)		call @scf_parallel(%LHS0, %RHS0, %DST0)
: (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()		: (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()

//		//
// Measure execution time.		// Measure execution time.
//		//

%t0 = call @rtclock() : () -> f64		%t0 = call @rtclock() : () -> f64
scf.for %i = %c0 to %cM step %c1 {		scf.for %i = %c0 to %cM step %c1 {
call @linalg_generic(%LHS0, %RHS0, %DST0)		call @scf_parallel(%LHS0, %RHS0, %DST0)
: (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()		: (memref<?x?xf32>, memref<?x?xf32>, memref<?x?xf32>) -> ()
}		}
%t1 = call @rtclock() : () -> f64		%t1 = call @rtclock() : () -> f64
%t1024 = subf %t1, %t0 : f64		%t1024 = subf %t1, %t0 : f64

// Print timings.		// Print timings.
vector.print %t1024 : f64		vector.print %t1024 : f64

[... 12 unchanged lines not shown ...]

mlir/test/Integration/Dialect/Async/CPU/test-async-parallel-for-1d.mlir

	// RUN: mlir-opt %s -async-parallel-for \			// RUN: mlir-opt %s -async-parallel-for \
	// RUN: -async-to-async-runtime \			// RUN: -async-to-async-runtime \
	// RUN: -async-runtime-ref-counting \			// RUN: -async-runtime-ref-counting \
	// RUN: -async-runtime-ref-counting-opt \			// FIXME: -async-runtime-ref-counting-opt \
				// RUN: -convert-async-to-llvm \
				// RUN: -convert-scf-to-std \
				// RUN: -convert-std-to-llvm \
				// RUN: | mlir-cpu-runner \
				// RUN: -e entry -entry-point-result=void -O0 \
				// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
				// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\
				// RUN: | FileCheck %s --dump-input=always

				// RUN: mlir-opt %s -async-parallel-for="async-dispatch=false \
				// RUN: num-workers=20 \
				// RUN: target-block-size=1" \
				// RUN: -async-to-async-runtime \
				// RUN: -async-runtime-ref-counting \
				// FIXME: -async-runtime-ref-counting-opt \
	// RUN: -convert-async-to-llvm \			// RUN: -convert-async-to-llvm \
	// RUN: -convert-scf-to-std \			// RUN: -convert-scf-to-std \
	// RUN: -convert-std-to-llvm \			// RUN: -convert-std-to-llvm \
	// RUN: | mlir-cpu-runner \			// RUN: | mlir-cpu-runner \
	// RUN: -e entry -entry-point-result=void -O0 \			// RUN: -e entry -entry-point-result=void -O0 \
	// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \			// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
	// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\			// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\
	// RUN: | FileCheck %s --dump-input=always			// RUN: | FileCheck %s --dump-input=always
	[... 57 unchanged lines not shown ...]

mlir/test/Integration/Dialect/Async/CPU/test-async-parallel-for-2d.mlir

	// RUN: mlir-opt %s -async-parallel-for \			// RUN: mlir-opt %s -async-parallel-for \
	// RUN: -async-to-async-runtime \			// RUN: -async-to-async-runtime \
	// RUN: -async-runtime-ref-counting \			// RUN: -async-runtime-ref-counting \
	// RUN: -async-runtime-ref-counting-opt \			// FIXME: -async-runtime-ref-counting-opt \
				// RUN: -convert-async-to-llvm \
				// RUN: -convert-scf-to-std \
				// RUN: -convert-std-to-llvm \
				// RUN: | mlir-cpu-runner \
				// RUN: -e entry -entry-point-result=void -O0 \
				// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
				// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\
				// RUN: | FileCheck %s --dump-input=always

				// RUN: mlir-opt %s -async-parallel-for="async-dispatch=false \
				// RUN: num-workers=20 \
				// RUN: target-block-size=1" \
				// RUN: -async-to-async-runtime \
				// RUN: -async-runtime-ref-counting \
				// FIXME: -async-runtime-ref-counting-opt \
	// RUN: -convert-async-to-llvm \			// RUN: -convert-async-to-llvm \
	// RUN: -convert-scf-to-std \			// RUN: -convert-scf-to-std \
	// RUN: -convert-std-to-llvm \			// RUN: -convert-std-to-llvm \
	// RUN: | mlir-cpu-runner \			// RUN: | mlir-cpu-runner \
	// RUN: -e entry -entry-point-result=void -O0 \			// RUN: -e entry -entry-point-result=void -O0 \
	// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \			// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_runner_utils%shlibext \
	// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\			// RUN: -shared-libs=%mlir_integration_test_dir/libmlir_async_runtime%shlibext\
	// RUN: | FileCheck %s --dump-input=always			// RUN: | FileCheck %s --dump-input=always
	[... 84 unchanged lines not shown ...]

Diff 354546
