Details

Reviewers

ezhulenev
mehdi_amini

Commits

rGec0e4545caa1: Make AsyncParallelForRewrite parameterizable with a cost model which drives…

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

bakhtiyarneyman created this revision. · Dec 8 2021, 8:54 PM

Herald added subscribers: sdasgup3, wenzhicui, wrengr and 20 others. · Dec 8 2021, 8:54 PM

bakhtiyarneyman requested review of this revision. · Dec 8 2021, 8:54 PM

Herald added a project: Restricted Project. · Dec 8 2021, 8:54 PM

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

The revision title mentions "AsyncParallelFor pass" but this isn't touching the pass itself right?

mlir/include/mlir/Dialect/Async/Transforms.h
35	Can you improve the doc?

Harbormaster completed remote builds in B138355: Diff 393022. · Dec 8 2021, 10:50 PM

Improve docs.

In D115423#3181902, @mehdi_amini wrote:

The revision title mentions "AsyncParallelFor pass" but this isn't touching the pass itself right?

Correct.

bakhtiyarneyman retitled this revision from "Make AsyncParallelFor pass parameterizable with a cost model which drives deciding the parallelization granularity." to "Make AsyncParallelForRewrite parameterizable with a cost model which drives deciding the parallelization granularity." · Dec 9 2021, 2:50 PM

Harbormaster completed remote builds in B138534: Diff 393292. · Dec 9 2021, 3:02 PM

ezhulenev accepted this revision. · Dec 9 2021, 3:19 PM

This revision is now accepted and ready to land. · Dec 9 2021, 3:19 PM

mehdi_amini accepted this revision. · Dec 9 2021, 4:47 PM

mehdi_amini added inline comments.

mlir/include/mlir/Dialect/Async/Transforms.h
23

Change the comment.

Harbormaster completed remote builds in B138577: Diff 393353. · Dec 9 2021, 5:55 PM

bakhtiyarneyman marked an inline comment as done. · Dec 9 2021, 6:00 PM

Rebase. Note that this made some of the loop unrolling logic dynamic which seeped all the way into loop building.

Harbormaster completed remote builds in B139138: Diff 394128. · Dec 13 2021, 8:58 PM

ezhulenev added inline comments. · Dec 14 2021, 4:28 AM

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp

821

Computing numUnrollableLoops as Value (compute function argument) + runtime scf.if in the loop nest prevents LLVM from loop unrolling and vectorization, and it leads to large regressions in: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/mlir/tfrt/benchmarks/compute_function_benchmark.cc

name                                 old cpu/op  new cpu/op  delta
BM_cpurt_Fresh2/4/process_time       12.3µs ± 9%  12.3µs ±10%      ~     (p=0.961 n=17+18)
BM_cpurt_Fresh2/8/process_time       12.3µs ± 8%  12.5µs ± 7%      ~     (p=0.245 n=18+17)
BM_cpurt_Factorized0/0/process_time  21.1µs ± 4%  21.2µs ± 4%      ~     (p=0.499 n=19+18)
BM_cpurt_Factorized0/4/process_time  21.5µs ± 7%  35.9µs ± 6%   +66.93%  (p=0.000 n=19+20)
BM_cpurt_Factorized0/8/process_time  21.0µs ± 3%  35.7µs ± 5%   +69.52%  (p=0.000 n=17+20)
BM_cpurt_Factorized1/0/process_time  9.35µs ± 6%  9.31µs ± 3%      ~     (p=0.732 n=18+17)
BM_cpurt_Factorized1/4/process_time  35.9µs ± 4%  46.4µs ± 5%   +29.04%  (p=0.000 n=18+19)
BM_cpurt_Factorized1/8/process_time  36.2µs ± 3%  46.3µs ± 6%   +28.17%  (p=0.000 n=18+19)
BM_cpurt_Factorized2/0/process_time  86.7µs ± 8%  86.3µs ± 8%      ~     (p=0.832 n=18+17)
BM_cpurt_Factorized2/4/process_time   124µs ± 7%   442µs ± 3%  +255.67%  (p=0.000 n=18+18)
BM_cpurt_Factorized2/8/process_time   159µs ± 6%   459µs ± 6%  +187.73%  (p=0.000 n=17+18)

numUnrollableLoops should be known at compile time; with a dynamic minTaskSize the structure should look like this:

ParallelComputeFunction unrollableParallelComputeFunction =
        createParallelComputeFunction(op, staticBounds, numUnrollableLoops,
                                      rewriter);

ParallelComputeFunction defaultParallelComputeFunction =
        createParallelComputeFunction(op, staticBounds, numUnrollableLoops,
                                      rewriter);

b.create<scf::IfOp>(..) {
  dispatch unrollableParallelComputeFunction
} else {
  dispatch defaultParallelComputeFunction
}

There is no real need for, or benefit from, computing numUnrollableLoops as a Value, because unrolling/vectorization can only happen if the trip counts (loop bounds) are known at compile time.
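The restructuring suggested in this comment amounts to hoisting a loop-invariant branch out of the loop nest. As a rough analogy in plain C++ (this is not the MLIR pass; `kInner`, `sumAligned`, `sumGeneric` and `sumBlock` are names invented for this sketch), one variant has an inner loop with a compile-time trip count that the compiler is free to unroll or vectorize, the other keeps run-time bounds, and a single branch outside the loops selects between them:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical compile-time inner trip count, standing in for a statically
// known inner dimension of the parallel loop.
constexpr std::size_t kInner = 8;

// Aligned variant: the inner loop always runs from 0 to kInner, so its bounds
// are constants and the compiler may unroll or vectorize it.
long sumAligned(const long *data, std::size_t rows) {
  long total = 0;
  for (std::size_t i = 0; i < rows; ++i)
    for (std::size_t j = 0; j < kInner; ++j)
      total += data[i * kInner + j];
  return total;
}

// Generic variant: the trip count is only known at run time.
long sumGeneric(const long *data, std::size_t count) {
  long total = 0;
  for (std::size_t k = 0; k < count; ++k)
    total += data[k];
  return total;
}

// The run-time check is made once, outside of both loop nests, instead of
// being re-evaluated inside the loops.
long sumBlock(const std::vector<long> &data, std::size_t begin,
              std::size_t count) {
  assert(begin + count <= data.size());
  const long *base = data.data() + begin;
  if (count % kInner == 0)
    return sumAligned(base, count / kInner);
  return sumGeneric(base, count);
}
```

Calling `sumBlock(v, 0, 16)` takes the aligned path, while `sumBlock(v, 0, 20)` falls back to the generic one; both produce the same sums a single generic loop would.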

Address comments: lift the scf.if out of the loops.

bakhtiyarneyman marked an inline comment as done. · Dec 14 2021, 5:51 PM

bakhtiyarneyman added inline comments.

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
821	Done, PTAL.

Harbormaster completed remote builds in B139337: Diff 394428. · Dec 14 2021, 6:05 PM

mehdi_amini added inline comments. · Dec 14 2021, 6:07 PM

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
829	You're creating a value `dynamicShouldUnroll` here which isn't used in the else branch below, can you sink this in the `then` branch?

Elide dynamic check if possible.

bakhtiyarneyman marked an inline comment as done. · Dec 14 2021, 6:23 PM

Harbormaster completed remote builds in B139349: Diff 394441. · Dec 14 2021, 6:37 PM

ezhulenev accepted this revision. · Dec 15 2021, 2:05 AM

Build is failing: https://buildkite.com/llvm-project/premerge-checks/builds/70123#b97ba1d4-10ba-4287-899a-0bcf5260aefe

ezhulenev added inline comments. · Dec 15 2021, 2:14 AM

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp
843	nit: I'd move this lambda close to the `dispatchNotUnrollable` to reduce the nesting, and put similar things together, although it's used only inside one branch of the if. I think it's ok to put `createParallelComputeFunction` inside the lambda, so you don't create the aligned compute function if you don't need it.

Address nit.

Harbormaster completed remote builds in B139540: Diff 394693. · Dec 15 2021, 5:45 PM

Fix minor C++ compilation issue.

Harbormaster completed remote builds in B139545: Diff 394699. · Dec 15 2021, 6:09 PM

Same thing for another lambda.

Harbormaster completed remote builds in B139566: Diff 394732. · Dec 15 2021, 8:04 PM

Move minTaskSize definition earlier to avoid contaminating the analysis results with the partially executed rewrite logic.

Harbormaster completed remote builds in B139779: Diff 395038. · Dec 16 2021, 6:36 PM

Closed by commit rGec0e4545caa1: Make AsyncParallelForRewrite parameterizable with a cost model which drives… (authored by bakhtiyar <bakhtiyar@x.team>, committed by ezhulenev). · Dec 19 2021, 8:41 AM

This revision was automatically updated to reflect the committed changes.

ezhulenev added a commit: rGec0e4545caa1: Make AsyncParallelForRewrite parameterizable with a cost model which drives….

Diff 394699

mlir/include/mlir/Dialect/Async/Passes.h

	[15 lines hidden]
	#include "mlir/Pass/Pass.h"			#include "mlir/Pass/Pass.h"

	namespace mlir {			namespace mlir {

	std::unique_ptr<Pass> createAsyncParallelForPass();			std::unique_ptr<Pass> createAsyncParallelForPass();

	std::unique_ptr<Pass> createAsyncParallelForPass(bool asyncDispatch,			std::unique_ptr<Pass> createAsyncParallelForPass(bool asyncDispatch,
	int32_t numWorkerThreads,			int32_t numWorkerThreads,
	int32_t targetBlockSize);			int32_t minTaskSize);

	std::unique_ptr<OperationPass<ModuleOp>> createAsyncToAsyncRuntimePass();			std::unique_ptr<OperationPass<ModuleOp>> createAsyncToAsyncRuntimePass();

	std::unique_ptr<Pass> createAsyncRuntimeRefCountingPass();			std::unique_ptr<Pass> createAsyncRuntimeRefCountingPass();

	std::unique_ptr<Pass> createAsyncRuntimeRefCountingOptPass();			std::unique_ptr<Pass> createAsyncRuntimeRefCountingOptPass();

	std::unique_ptr<Pass> createAsyncRuntimePolicyBasedRefCountingPass();			std::unique_ptr<Pass> createAsyncRuntimePolicyBasedRefCountingPass();
	[12 lines hidden]

mlir/include/mlir/Dialect/Async/Transforms.h

This file was added.

//===- Transforms.h - Async dialect transformation utilities ----*- C++ -*-===//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

// This header file defines transformations on Async operations.

//===----------------------------------------------------------------------===//

#ifndef MLIR_DIALECT_ASYNC_TRANSFORMS_H_

#define MLIR_DIALECT_ASYNC_TRANSFORMS_H_

#include "mlir/Dialect/SCF/SCF.h"

#include "mlir/IR/ImplicitLocOpBuilder.h"

namespace mlir {

namespace async {

/// Emit the IR to compute the minimum number of iterations of scf.parallel body

/// that would be viable for a single parallel task. Allows the user to avoid

/// incurring the overheads of spawning costly parallel tasks in absence of

/// sufficient amount of parallelizable work.

///

/// Must return an index type.

using AsyncMinTaskSizeComputationFunction =

std::function<Value(ImplicitLocOpBuilder, scf::ParallelOp)>;

mehdi_amini (inline comment on the doc comment above, marked Done), suggested change:

- /// Compute the minimum number of iterations of scf.parallel body that would be

+ /// Emit the IR to compute at runtime the minimum number of iterations of scf.parallel body that would be

/// viable for a single parallel task. Allows the user to avoid incurring the

/// Add a pattern to the given pattern list to lower scf.parallel to async

/// operations.

void populateAsyncParallelForPatterns(

RewritePatternSet &patterns, bool asyncDispatch, int32_t numWorkerThreads,

AsyncMinTaskSizeComputationFunction computeMinTaskSize);

} // namespace async

} // namespace mlir

#endif // MLIR_DIALECT_ASYNC_TRANSFORMS_H_
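The header above replaces a fixed `minTaskSize` integer with a callback, so the caller decides how the threshold is derived. The following is a plain C++ analogy of that parameterization, not MLIR code: `LoopInfo`, `MinTaskSizeFn`, `constantMinTaskSize`, `costBasedMinTaskSize` and the cost units are all assumptions made up for this sketch, and the real callback emits IR that produces an index `Value` rather than returning an integer.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical summary of what a cost model might know about a loop body.
struct LoopInfo {
  int64_t costPerIteration; // estimated cost of one iteration, arbitrary units
};

// Plain-integer analogue of a "compute min task size" callback.
using MinTaskSizeFn = std::function<int64_t(const LoopInfo &)>;

// Fixed threshold that ignores the loop, similar in spirit to a pass option
// that is turned into a constant.
MinTaskSizeFn constantMinTaskSize(int64_t minTaskSize) {
  assert(minTaskSize > 0);
  return [minTaskSize](const LoopInfo &) { return minTaskSize; };
}

// Hypothetical cost-driven policy: run enough iterations per task to reach
// a target amount of work, i.e. ceil_div(targetCostPerTask, costPerIteration).
MinTaskSizeFn costBasedMinTaskSize(int64_t targetCostPerTask) {
  assert(targetCostPerTask > 0);
  return [targetCostPerTask](const LoopInfo &loop) {
    assert(loop.costPerIteration > 0);
    return (targetCostPerTask + loop.costPerIteration - 1) /
           loop.costPerIteration;
  };
}
```

With this shape, a cheap loop body yields a large minimum task size and an expensive body yields a small one, which is the kind of decision the callback lets a client make without changing the rewrite pattern itself.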

mlir/lib/Dialect/Async/Transforms/AsyncParallelFor.cpp

//===- AsyncParallelFor.cpp - Implementation of Async Parallel For --------===//		//===- AsyncParallelFor.cpp - Implementation of Async Parallel For --------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This file implements scf.parallel to scf.for + async.execute conversion pass.		// This file implements scf.parallel to scf.for + async.execute conversion pass.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "PassDetail.h"		#include "PassDetail.h"
#include "mlir/Dialect/Arithmetic/IR/Arithmetic.h"		#include "mlir/Dialect/Arithmetic/IR/Arithmetic.h"
#include "mlir/Dialect/Async/IR/Async.h"		#include "mlir/Dialect/Async/IR/Async.h"
#include "mlir/Dialect/Async/Passes.h"		#include "mlir/Dialect/Async/Passes.h"
		#include "mlir/Dialect/Async/Transforms.h"
#include "mlir/Dialect/SCF/SCF.h"		#include "mlir/Dialect/SCF/SCF.h"
#include "mlir/Dialect/StandardOps/IR/Ops.h"		#include "mlir/Dialect/StandardOps/IR/Ops.h"
#include "mlir/IR/BlockAndValueMapping.h"		#include "mlir/IR/BlockAndValueMapping.h"
#include "mlir/IR/ImplicitLocOpBuilder.h"		#include "mlir/IR/ImplicitLocOpBuilder.h"
#include "mlir/IR/Matchers.h"		#include "mlir/IR/Matchers.h"
#include "mlir/IR/PatternMatch.h"		#include "mlir/IR/PatternMatch.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"		#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
#include "mlir/Transforms/RegionUtils.h"		#include "mlir/Transforms/RegionUtils.h"
[75 lines hidden]	AsyncParallelForPass(bool asyncDispatch, int32_t numWorkerThreads,
this->minTaskSize = minTaskSize;		this->minTaskSize = minTaskSize;
}		}

void runOnOperation() override;		void runOnOperation() override;
};		};

struct AsyncParallelForRewrite : public OpRewritePattern<scf::ParallelOp> {		struct AsyncParallelForRewrite : public OpRewritePattern<scf::ParallelOp> {
public:		public:
AsyncParallelForRewrite(MLIRContext *ctx, bool asyncDispatch,		AsyncParallelForRewrite(
int32_t numWorkerThreads, int32_t minTaskSize)		MLIRContext *ctx, bool asyncDispatch, int32_t numWorkerThreads,
		AsyncMinTaskSizeComputationFunction computeMinTaskSize)
: OpRewritePattern(ctx), asyncDispatch(asyncDispatch),		: OpRewritePattern(ctx), asyncDispatch(asyncDispatch),
numWorkerThreads(numWorkerThreads), minTaskSize(minTaskSize) {}		numWorkerThreads(numWorkerThreads),
		computeMinTaskSize(computeMinTaskSize) {}

LogicalResult matchAndRewrite(scf::ParallelOp op,		LogicalResult matchAndRewrite(scf::ParallelOp op,
PatternRewriter &rewriter) const override;		PatternRewriter &rewriter) const override;

private:		private:
bool asyncDispatch;		bool asyncDispatch;
int32_t numWorkerThreads;		int32_t numWorkerThreads;
int32_t minTaskSize;		AsyncMinTaskSizeComputationFunction computeMinTaskSize;
};		};

struct ParallelComputeFunctionType {		struct ParallelComputeFunctionType {
FunctionType type;		FunctionType type;
SmallVector<Value> captures;		SmallVector<Value> captures;
};		};

// Helper struct to parse parallel compute function argument list.		// Helper struct to parse parallel compute function argument list.
[119 lines hidden]	static ParallelComputeFunction createParallelComputeFunction(
ImplicitLocOpBuilder b(op.getLoc(), rewriter);		ImplicitLocOpBuilder b(op.getLoc(), rewriter);

ModuleOp module = op->getParentOfType<ModuleOp>();		ModuleOp module = op->getParentOfType<ModuleOp>();

ParallelComputeFunctionType computeFuncType =		ParallelComputeFunctionType computeFuncType =
getParallelComputeFunctionType(op, rewriter);		getParallelComputeFunctionType(op, rewriter);

FunctionType type = computeFuncType.type;		FunctionType type = computeFuncType.type;
FuncOp func = FuncOp::create(op.getLoc(), "parallel_compute_fn", type);		FuncOp func = FuncOp::create(op.getLoc(),
		numBlockAlignedInnerLoops > 0
		? "parallel_compute_fn_with_aligned_loops"
		: "parallel_compute_fn",
		type);
func.setPrivate();		func.setPrivate();

// Insert function into the module symbol table and assign it unique name.		// Insert function into the module symbol table and assign it unique name.
SymbolTable symbolTable(module);		SymbolTable symbolTable(module);
symbolTable.insert(func);		symbolTable.insert(func);
rewriter.getListener()->notifyOperationInserted(func);		rewriter.getListener()->notifyOperationInserted(func);

// Create function entry block.		// Create function entry block.
[483 lines hidden]	auto dispatch = [&](OpBuilder &nestedBuilder, Location loc) {
ParallelComputeFunctionBounds staticBounds = {		ParallelComputeFunctionBounds staticBounds = {
integerConstants(tripCounts),		integerConstants(tripCounts),
integerConstants(op.lowerBound()),		integerConstants(op.lowerBound()),
integerConstants(op.upperBound()),		integerConstants(op.upperBound()),
integerConstants(op.step()),		integerConstants(op.step()),
};		};

// Find how many inner iteration dimensions are statically known, and their		// Find how many inner iteration dimensions are statically known, and their
// product is smaller than the `512`. We aling the parallel compute block		// product is smaller than the `512`. We align the parallel compute block
// size by the product of statically known dimensions, so that we can		// size by the product of statically known dimensions, so that we can
// guarantee that the inner loops executes from 0 to the loop trip counts		// guarantee that the inner loops executes from 0 to the loop trip counts
// and we can elide dynamic loop boundaries, and give LLVM an opportunity to		// and we can elide dynamic loop boundaries, and give LLVM an opportunity to
// unroll the loops. The constant `512` is arbitrary, it should depend on		// unroll the loops. The constant `512` is arbitrary, it should depend on
// how many iterations LLVM will typically decide to unroll.		// how many iterations LLVM will typically decide to unroll.
static constexpr int64_t maxIterations = 512;		static constexpr int64_t maxIterations = 512;

// The number of inner loops with statically known number of iterations less		// The number of inner loops with statically known number of iterations less
[24 lines hidden]	float overshardingFactor = numWorkerThreads <= 4 ? 8.0
: numWorkerThreads <= 32 ? 1.0		: numWorkerThreads <= 32 ? 1.0
: numWorkerThreads <= 64 ? 0.8		: numWorkerThreads <= 64 ? 0.8
: 0.6;		: 0.6;

// Do not overload worker threads with too many compute blocks.		// Do not overload worker threads with too many compute blocks.
Value maxComputeBlocks = b.create<arith::ConstantIndexOp>(		Value maxComputeBlocks = b.create<arith::ConstantIndexOp>(
std::max(1, static_cast<int>(numWorkerThreads * overshardingFactor)));		std::max(1, static_cast<int>(numWorkerThreads * overshardingFactor)));

// Target block size from the pass parameters.
Value minTaskSizeCst = b.create<arith::ConstantIndexOp>(minTaskSize);

// Compute parallel block size from the parallel problem size:		// Compute parallel block size from the parallel problem size:
// blockSize = min(tripCount,		// blockSize = min(tripCount,
// max(ceil_div(tripCount, maxComputeBlocks),		// max(ceil_div(tripCount, maxComputeBlocks),
// ceil_div(minTaskSize, bodySize)))		// minTaskSize))
Value bs0 = b.create<arith::CeilDivSIOp>(tripCount, maxComputeBlocks);		Value bs0 = b.create<arith::CeilDivSIOp>(tripCount, maxComputeBlocks);
Value bs1 = b.create<arith::MaxSIOp>(bs0, minTaskSizeCst);		Value minTaskSize = computeMinTaskSize(b, op);
		Value bs1 = b.create<arith::MaxSIOp>(bs0, minTaskSize);
Value blockSize = b.create<arith::MinSIOp>(tripCount, bs1);		Value blockSize = b.create<arith::MinSIOp>(tripCount, bs1);

// Align the block size to be a multiple of the statically known number		ParallelComputeFunction notUnrollableParallelComputeFunction =
// of iterations in the inner loops.		createParallelComputeFunction(op, staticBounds, 0, rewriter);
if (numUnrollableLoops > 0 && minTaskSize >= maxIterations) {
Value numIters = b.create<arith::ConstantIndexOp>(
numIterations[op.getNumLoops() - numUnrollableLoops]);
Value bs2 = b.create<arith::MulIOp>(
b.create<arith::CeilDivSIOp>(blockSize, numIters), numIters);
blockSize = b.create<arith::MinSIOp>(tripCount, bs2);
} else {
// Reset the number of unrollable loops if we didn't align the block size.
numUnrollableLoops = 0;
}

// Compute the number of parallel compute blocks.		// Dispatch parallel compute function using async recursive work splitting,
Value blockCount = b.create<arith::CeilDivSIOp>(tripCount, blockSize);		// or by submitting compute task sequentially from a caller thread.
		auto doDispatch = asyncDispatch ? doAsyncDispatch : doSequentialDispatch;

// Create a parallel compute function that takes a block id and computes		// Create a parallel compute function that takes a block id and computes
// the parallel operation body for a subset of iteration space.		// the parallel operation body for a subset of iteration space.
ParallelComputeFunction parallelComputeFunction =
		// Compute the number of parallel compute blocks.
		Value blockCount = b.create<arith::CeilDivSIOp>(tripCount, blockSize);

		// Unroll when numUnrollableLoops > 0 && blockSize >= maxIterations.
		bool staticShouldUnroll = numUnrollableLoops > 0;
		auto dispatchNotUnrollable = [&](OpBuilder &nestedBuilder, Location loc) {
		ImplicitLocOpBuilder nb(loc, nestedBuilder);
		doDispatch(b, rewriter, notUnrollableParallelComputeFunction, op,
		blockSize, blockCount, tripCounts);
		nb.create<scf::YieldOp>();
		};

		if (staticShouldUnroll) {
		Value dynamicShouldUnroll = b.create<arith::CmpIOp>(
		arith::CmpIPredicate::sge, blockSize,
		b.create<arith::ConstantIndexOp>(maxIterations));

		ParallelComputeFunction unrollableParallelComputeFunction =
createParallelComputeFunction(op, staticBounds, numUnrollableLoops,		createParallelComputeFunction(op, staticBounds, numUnrollableLoops,
rewriter);		rewriter);

// Dispatch parallel compute function using async recursive work splitting,		auto dispatchUnrollable = [&](OpBuilder &nestedBuilder, Location loc) {
// or by submitting compute task sequentially from a caller thread.		ImplicitLocOpBuilder nb(loc, nestedBuilder);
if (asyncDispatch) {		// Align the block size to be a multiple of the statically known
doAsyncDispatch(b, rewriter, parallelComputeFunction, op, blockSize,		// number of iterations in the inner loops.
blockCount, tripCounts);		Value numIters = nb.create<arith::ConstantIndexOp>(
} else {		numIterations[op.getNumLoops() - numUnrollableLoops]);
doSequentialDispatch(b, rewriter, parallelComputeFunction, op, blockSize,		Value alignedBlockSize = nb.create<arith::MulIOp>(
blockCount, tripCounts);		nb.create<arith::CeilDivSIOp>(blockSize, numIters), numIters);
}		doDispatch(b, rewriter, unrollableParallelComputeFunction, op,
		alignedBlockSize, blockCount, tripCounts);
		return nb.create<scf::YieldOp>();
		};

		b.create<scf::IfOp>(TypeRange(), dynamicShouldUnroll, dispatchUnrollable,
		dispatchNotUnrollable);
nb.create<scf::YieldOp>();		nb.create<scf::YieldOp>();
		} else {
		dispatchNotUnrollable(nb, loc);
		}
};		};

// Replace the `scf.parallel` operation with the parallel compute function.		// Replace the `scf.parallel` operation with the parallel compute function.
b.create<scf::IfOp>(TypeRange(), isZeroIterations, noOp, dispatch);		b.create<scf::IfOp>(TypeRange(), isZeroIterations, noOp, dispatch);

// Parallel operation was replaced with a block iteration loop.		// Parallel operation was replaced with a block iteration loop.
rewriter.eraseOp(op);		rewriter.eraseOp(op);

return success();		return success();
}		}

void AsyncParallelForPass::runOnOperation() {		void AsyncParallelForPass::runOnOperation() {
MLIRContext *ctx = &getContext();		MLIRContext *ctx = &getContext();

RewritePatternSet patterns(ctx);		RewritePatternSet patterns(ctx);
patterns.add<AsyncParallelForRewrite>(ctx, asyncDispatch, numWorkerThreads,		populateAsyncParallelForPatterns(
minTaskSize);		patterns, asyncDispatch, numWorkerThreads,
		[&](ImplicitLocOpBuilder builder, scf::ParallelOp op) {
		return builder.create<arith::ConstantIndexOp>(minTaskSize);
		});
if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))		if (failed(applyPatternsAndFoldGreedily(getOperation(), std::move(patterns))))
signalPassFailure();		signalPassFailure();
}		}

std::unique_ptr<Pass> mlir::createAsyncParallelForPass() {		std::unique_ptr<Pass> mlir::createAsyncParallelForPass() {
return std::make_unique<AsyncParallelForPass>();		return std::make_unique<AsyncParallelForPass>();
}		}

std::unique_ptr<Pass> mlir::createAsyncParallelForPass(bool asyncDispatch,		std::unique_ptr<Pass> mlir::createAsyncParallelForPass(bool asyncDispatch,
int32_t numWorkerThreads,		int32_t numWorkerThreads,
int32_t minTaskSize) {		int32_t minTaskSize) {
return std::make_unique<AsyncParallelForPass>(asyncDispatch, numWorkerThreads,		return std::make_unique<AsyncParallelForPass>(asyncDispatch, numWorkerThreads,
minTaskSize);		minTaskSize);
}		}

		void mlir::async::populateAsyncParallelForPatterns(
		RewritePatternSet &patterns, bool asyncDispatch, int32_t numWorkerThreads,
		AsyncMinTaskSizeComputationFunction computeMinTaskSize) {
		MLIRContext *ctx = patterns.getContext();
		patterns.add<AsyncParallelForRewrite>(ctx, asyncDispatch, numWorkerThreads,
		computeMinTaskSize);
		}
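The block-size arithmetic in the new side of this diff can be followed more easily with concrete integers. The sketch below mirrors the formulas from the listing (the oversharding factor, `blockSize = min(tripCount, max(ceil_div(tripCount, maxComputeBlocks), minTaskSize))`, and the rounding of the block size up to a multiple of the inner iteration count), but it is an illustration only: `ceilDiv`, `BlockPlan`, `planBlocks` and `alignBlockSize` are hypothetical helper names, and the real pass emits `arith` operations computing these values at run time instead of evaluating them in C++.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Ceiling division for a non-negative numerator and positive denominator.
int64_t ceilDiv(int64_t a, int64_t b) {
  assert(a >= 0 && b > 0);
  return (a + b - 1) / b;
}

struct BlockPlan {
  int64_t blockSize;  // iterations handled by one compute block
  int64_t blockCount; // number of compute blocks to dispatch
};

// Mirrors the formulas in the listing. The zero-trip-count case is assumed to
// be handled separately by the caller, as the diff does with isZeroIterations.
BlockPlan planBlocks(int64_t tripCount, int32_t numWorkerThreads,
                     int64_t minTaskSize) {
  assert(tripCount > 0 && numWorkerThreads > 0 && minTaskSize > 0);
  float overshardingFactor = numWorkerThreads <= 4    ? 8.0f
                             : numWorkerThreads <= 32 ? 1.0f
                             : numWorkerThreads <= 64 ? 0.8f
                                                      : 0.6f;
  // Do not overload worker threads with too many compute blocks.
  int64_t maxComputeBlocks =
      std::max(1, static_cast<int>(numWorkerThreads * overshardingFactor));
  int64_t bs0 = ceilDiv(tripCount, maxComputeBlocks);
  int64_t bs1 = std::max(bs0, minTaskSize);
  int64_t blockSize = std::min(tripCount, bs1);
  return {blockSize, ceilDiv(tripCount, blockSize)};
}

// Rounds the block size up to a multiple of the statically known number of
// inner-loop iterations, as done on the unrollable dispatch path.
int64_t alignBlockSize(int64_t blockSize, int64_t numIters) {
  return ceilDiv(blockSize, numIters) * numIters;
}
```

For example, 1000 iterations on 4 worker threads with a minimum task size of 10 gives 32 blocks of 32 iterations, whereas 100 iterations with a minimum task size of 1000 collapses into a single block covering the whole loop.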

mlir/test/Dialect/Async/async-parallel-for-compute-fn.mlir

	[94 lines hidden]
	// CHECK-SAME: %[[LB0:arg[0-9]+]]: index,			// CHECK-SAME: %[[LB0:arg[0-9]+]]: index,
	// CHECK-SAME: %[[LB1:arg[0-9]+]]: index,			// CHECK-SAME: %[[LB1:arg[0-9]+]]: index,
	// CHECK-SAME: %[[UB0:arg[0-9]+]]: index,			// CHECK-SAME: %[[UB0:arg[0-9]+]]: index,
	// CHECK-SAME: %[[UB1:arg[0-9]+]]: index,			// CHECK-SAME: %[[UB1:arg[0-9]+]]: index,
	// CHECK-SAME: %[[STEP0:arg[0-9]+]]: index,			// CHECK-SAME: %[[STEP0:arg[0-9]+]]: index,
	// CHECK-SAME: %[[STEP1:arg[0-9]+]]: index,			// CHECK-SAME: %[[STEP1:arg[0-9]+]]: index,
	// CHECK-SAME: %[[MEMREF:arg[0-9]+]]: memref<?x10xf32>			// CHECK-SAME: %[[MEMREF:arg[0-9]+]]: memref<?x10xf32>
	// CHECK-SAME: ) {			// CHECK-SAME: ) {
				// CHECK: scf.for %[[I:arg[0-9]+]]
				// CHECK: select
				// CHECK: scf.for %[[J:arg[0-9]+]]
				// CHECK: memref.store

				// CHECK-LABEL: func private @parallel_compute_fn_with_aligned_loops(
				// CHECK-SAME: %[[BLOCK_INDEX:arg[0-9]+]]: index,
				// CHECK-SAME: %[[BLOCK_SIZE:arg[0-9]+]]: index,
				// CHECK-SAME: %[[TRIP_COUNT0:arg[0-9]+]]: index,
				// CHECK-SAME: %[[TRIP_COUNT1:arg[0-9]+]]: index,
				// CHECK-SAME: %[[LB0:arg[0-9]+]]: index,
				// CHECK-SAME: %[[LB1:arg[0-9]+]]: index,
				// CHECK-SAME: %[[UB0:arg[0-9]+]]: index,
				// CHECK-SAME: %[[UB1:arg[0-9]+]]: index,
				// CHECK-SAME: %[[STEP0:arg[0-9]+]]: index,
				// CHECK-SAME: %[[STEP1:arg[0-9]+]]: index,
				// CHECK-SAME: %[[MEMREF:arg[0-9]+]]: memref<?x10xf32>
				// CHECK-SAME: ) {
	// CHECK: %[[C0:.*]] = arith.constant 0 : index			// CHECK: %[[C0:.*]] = arith.constant 0 : index
	// CHECK: %[[C1:.*]] = arith.constant 1 : index			// CHECK: %[[C1:.*]] = arith.constant 1 : index
	// CHECK: %[[C10:.*]] = arith.constant 10 : index			// CHECK: %[[C10:.*]] = arith.constant 10 : index
	// CHECK: scf.for %[[I:arg[0-9]+]]			// CHECK: scf.for %[[I:arg[0-9]+]]
	// CHECK-NOT: select			// CHECK-NOT: select
	// CHECK: scf.for %[[J:arg[0-9]+]] = %c0 to %c10 step %c1			// CHECK: scf.for %[[J:arg[0-9]+]] = %c0 to %c10 step %c1
	No newline at end of file

utils/bazel/llvm-project-overlay/mlir/BUILD.bazel

	[2,011 lines hidden]
	)			)

	cc_library(			cc_library(
	name = "AsyncTransforms",			name = "AsyncTransforms",
	srcs = glob([			srcs = glob([
	"lib/Dialect/Async/Transforms/*.cpp",			"lib/Dialect/Async/Transforms/*.cpp",
	"lib/Dialect/Async/Transforms/*.h",			"lib/Dialect/Async/Transforms/*.h",
	]),			]),
	hdrs = ["include/mlir/Dialect/Async/Passes.h"],			hdrs = [
				"include/mlir/Dialect/Async/Passes.h",
				"include/mlir/Dialect/Async/Transforms.h",
				],
	includes = ["include"],			includes = ["include"],
	deps = [			deps = [
	":Analysis",			":Analysis",
	":ArithmeticDialect",			":ArithmeticDialect",
	":Async",			":Async",
	":AsyncPassIncGen",			":AsyncPassIncGen",
	":IR",			":IR",
	":Pass",			":Pass",
	[5,898 lines hidden]

This is an archive of the discontinued LLVM Phabricator instance.

Make AsyncParallelForRewrite parameterizable with a cost model which drives deciding the parallelization granularity.
Closed · Public
