This is an archive of the discontinued LLVM Phabricator instance.

Differential D130139

[mlir][linalg] Add tile_size option to `structured.tile_to_foreach_thread_op`
ClosedPublic

Authored by christopherbate on Jul 19 2022, 7:43 PM.


Details

Reviewers

bondhugula
nicolasvasilache

Commits

rG297ba167ded0: [mlir][linalg] Add tile_size option to `structured.tile_to_foreach_thread_op`

Summary

This change modifies `structured.tile_to_foreach_thread_op` so that
it accepts either `tile_sizes` or `num_threads` parameters. If
`tile_sizes` are specified, then the number of threads required is
derived from the tile sizes rather than the other way around. In both cases,
more aggressive folding of loop parameters is enabled during the
transformation, allowing for the potential elimination of `affine.min`
and `affine.max` operations in the static shape case when calculating
the final adjusted tile size.
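
To make the two modes concrete, here is a minimal sketch in plain C++ (no MLIR dependency) of the static-shape arithmetic. The `ceilDiv` relation for `num_threads` comes from the op description in this diff. The inverse relation used for `tile_sizes`, the helper names `tileSizesFromNumThreads` and `numThreadsFromTileSizes`, and the dimension sizes in the note below are assumptions for illustration, not the implementation in `Tiling.cpp`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Ceiling division for a non-negative numerator and a positive denominator.
int64_t ceilDiv(int64_t a, int64_t b) {
  assert(a >= 0 && b > 0 && "ceilDiv expects a >= 0 and b > 0");
  return (a + b - 1) / b;
}

// num_threads mode: tileSize[i] = ceilDiv(dimSize[i], numThreads[i]).
// A zero or missing entry leaves the dimension untiled (tile = full size).
std::vector<int64_t> tileSizesFromNumThreads(
    const std::vector<int64_t> &dimSizes,
    const std::vector<int64_t> &numThreads) {
  std::vector<int64_t> tileSizes;
  tileSizes.reserve(dimSizes.size());
  for (std::size_t i = 0; i < dimSizes.size(); ++i) {
    bool untiled = i >= numThreads.size() || numThreads[i] == 0;
    tileSizes.push_back(untiled ? dimSizes[i]
                                : ceilDiv(dimSizes[i], numThreads[i]));
  }
  return tileSizes;
}

// tile_sizes mode (assumed inverse relation, hypothetical helper):
// numThreads[i] = ceilDiv(dimSize[i], tileSize[i]).
// A zero or missing entry is kept as 0, meaning "do not tile this dimension".
std::vector<int64_t> numThreadsFromTileSizes(
    const std::vector<int64_t> &dimSizes,
    const std::vector<int64_t> &tileSizes) {
  std::vector<int64_t> numThreads;
  numThreads.reserve(dimSizes.size());
  for (std::size_t i = 0; i < dimSizes.size(); ++i) {
    bool untiled = i >= tileSizes.size() || tileSizes[i] == 0;
    numThreads.push_back(untiled ? 0 : ceilDiv(dimSizes[i], tileSizes[i]));
  }
  return numThreads;
}
```

With hypothetical dimension sizes 100x200x300, `num_threads [10, 20]` gives tile sizes 10, 10 and 300 (the third dimension is untiled), while `tile_sizes [10, 20, 0]` gives 10 and 10 threads and leaves the third dimension untiled.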

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

christopherbate created this revision. Jul 19 2022, 7:43 PM

Herald added a reviewer: bondhugula. Jul 19 2022, 7:43 PM

Herald added a project: Restricted Project.

Herald added subscribers: bzcheeseman, sdasgup3, Groverkss and 20 others.

christopherbate requested review of this revision. Jul 19 2022, 7:43 PM

Herald added a reviewer: nicolasvasilache. Jul 19 2022, 7:43 PM

Herald added a project: Restricted Project.

Herald added subscribers: limo1996, stephenneuendorffer, nicolasvasilache.

christopherbate added inline comments. Jul 19 2022, 7:50 PM

mlir/lib/Dialect/Affine/IR/AffineOps.cpp
725	Without this change an error will be produced because we give an OpFoldResult vector that comes directly from I64ArrayAttr that belongs to an op attribute. Explicitly convert to index type.
mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
252–253	You can't use RewriterBase in the body-creation lambda, so I moved to the non-lambda creation form and manually move the insertion point below.

christopherbate mentioned this in D129335: [mlir][SCF] Tile with TilingInterface using `scf.foreach_thread`. Jul 19 2022, 7:52 PM

Harbormaster completed remote builds in B176405: Diff 446019. Jul 19 2022, 8:04 PM

Thanks much!

mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
642	This should say 'num_threads', also `0` is not a valid num_threads
663–664	Ah, very cool, I didn't realize the custom assembly format allows ternary expressions now.
mlir/lib/Dialect/Affine/IR/AffineOps.cpp
725	can you add this as a comment in the code to explain why the cast?
mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
195	nit: typo/grammo.
252–253	Yes, this is annoying and I had the same issue recently. Another possibility is to explicitly cast the OpBuilder as a RewriterBase inside the lambda when you control the call site, you can do that here if you prefer (fine to leave as is). In any case, can you please add a comment explaining this?
307	Making this discovery more powerful is going to be painful with SSA values. OTOH we know by construction in the "tile_size case" that we don't need the max. I would just add a bool passed at the call site, true for the "tile_size case" and false for the "num_threads case". Then this discovery can refine the "num_threads case".
mlir/test/Dialect/Linalg/tile-to-foreach-thread.mlir
47	Nice test case!
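
The point raised on line 307 of `Tiling.cpp` can be checked with integer arithmetic. The sketch below is plain C++ and assumes a zero loop offset and static sizes. It mirrors the predicate `tileSize * (numThreads - 1) < iterationSize` from the diff; the helper names `lastTileStartsInBounds` and `tileSizeForThread` are made up for illustration and do not appear in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Ceiling division for a non-negative numerator and a positive denominator.
int64_t ceilDiv(int64_t a, int64_t b) {
  assert(a >= 0 && b > 0 && "ceilDiv expects a >= 0 and b > 0");
  return (a + b - 1) / b;
}

// True if the last thread's tile starts strictly inside the iteration space,
// i.e. tileSize * (numThreads - 1) < iterationSize.
bool lastTileStartsInBounds(int64_t tileSize, int64_t numThreads,
                            int64_t iterationSize) {
  return tileSize * (numThreads - 1) < iterationSize;
}

// Size of the tile owned by `threadId`, assuming the loop starts at offset 0:
// max(0, min(iterationSize - threadId * tileSize, tileSize)).
// The outer max only matters when the tile starts out of bounds.
int64_t tileSizeForThread(int64_t threadId, int64_t tileSize,
                          int64_t iterationSize) {
  int64_t offset = threadId * tileSize;
  int64_t residual = std::min(iterationSize - offset, tileSize);
  return std::max<int64_t>(0, residual);
}
```

When the thread count is derived as `ceilDiv(size, tileSize)`, we have `numThreads - 1 < size / tileSize`, so the predicate holds for any positive size and the clamp to zero is redundant. When `num_threads` is given directly, for example 8 threads over 10 iterations, the nominal tile size is 2, the last thread would start at offset 14, and the clamp is what turns its size of -4 into 0.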

This revision is now accepted and ready to land. Jul 20 2022, 1:13 AM

Address comments.

christopherbate marked 2 inline comments as done. Jul 20 2022, 12:36 PM

christopherbate added inline comments.

mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
642	"also 0 is not a valid num_threads": But in the doc that you wrote above, you state "Zero tile sizes indicate that the dimension is not tiled, and can be thought of as tiling by the full size of data." I had assumed you meant that `0` is a sentinel value indicating to skip that dimension, regardless of whether it is specified in `num_threads` or `tile_size`. If you specify `num_threads` then the derived tile size can not be zero. Otherwise you can't handle ops that have a reduction dimension that appears before a parallel dimension that you would like to tile.

nicolasvasilache added inline comments. Jul 20 2022, 12:49 PM

mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
642	You're right, I confused myself over nothing, please ignore.

Harbormaster completed remote builds in B176573: Diff 446238. Jul 20 2022, 12:49 PM

Rebase

Herald added a subscriber: mgorny. Jul 21 2022, 8:27 AM

Harbormaster completed remote builds in B176781: Diff 446517. Jul 21 2022, 8:59 AM

Closed by commit rG297ba167ded0: [mlir][linalg] Add tile_size option to `structured.tile_to_foreach_thread_op` (authored by christopherbate). Jul 21 2022, 9:36 AM

This revision was automatically updated to reflect the committed changes.

christopherbate added a commit: rG297ba167ded0: [mlir][linalg] Add tile_size option to `structured.tile_to_foreach_thread_op`.

Revision Contents

Path                                                                   Size
mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td    43 lines
mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h               9 lines
mlir/lib/Dialect/Affine/IR/AffineOps.cpp                               9 lines
mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp            20 lines
mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt                      2 lines
mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp                          231 lines
mlir/test/Dialect/Linalg/tile-to-foreach-thread.mlir                   140 lines

Diff 446539

mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td


	def TileToForeachThreadOp :			def TileToForeachThreadOp :
	Op<Transform_Dialect, "structured.tile_to_foreach_thread_op",			Op<Transform_Dialect, "structured.tile_to_foreach_thread_op",
	[FunctionalStyleTransformOpTrait,			[FunctionalStyleTransformOpTrait,
	MemoryEffectsOpInterface,			MemoryEffectsOpInterface,
	TransformEachOpTrait,			TransformEachOpTrait,
	TransformOpInterface]> {			TransformOpInterface]> {
	let description = [{			let description = [{
	Tile a TilingInterface `op` to a tiled `scf.foreach_thread`, applying			Tile a TilingInterface op to a tiled `scf.foreach_thread`. Tiling is
	tiling by `num_threads`.			applied by either specifying `num_threads` or `tile_size`. If `num_threads`
				is specified, then the tile size for each dimension `i` is calculated
				dynamically via `ceilDiv(dimSize[i], num_threads[i])`.
	If non-empty, the `thread_dim_mapping` is added as an attribute to the			If non-empty, the `thread_dim_mapping` is added as an attribute to the
	resulting `scf.foreach_thread`.			resulting `scf.foreach_thread`.
	Zero tile sizes indicate that the dimension is not tiled, and can be thought			Zero tile sizes indicate that the dimension is not tiled and can be
	of as tiling by the full size of data.			thought of as tiling by the full size of data.
	It is the user's responsibility to ensure that `num_threads` is a valid			It is the user's responsibility to ensure that `num_threads/tile_sizes` is
	tiling specification (i.e. that only tiles parallel dimensions, e.g. in the			a valid tiling specification (i.e. that only tiles parallel dimensions,
	Linalg case).			e.g. in the Linalg case).

	#### Return modes			#### Return modes

	This operation ignores ops that do not implement the TilingInterface and			This operation ignores ops that do not implement the TilingInterface and
	drops them in the return.			drops them in the return.

	If all the operations referred to by the `target` PDLOperation tile			If all the operations referred to by the `target` PDLOperation tile
	successfully, the transform succeeds.			successfully, the transform succeeds.
	Otherwise the transform silently fails.			Otherwise the transform silently fails.

	The 2 returned handles point to only the subset of successfully produced			The two returned handles point to only the subset of successfully produced
	tiled operations, which can all be empty.			tiled operations, which can all be empty.

	These 2 returned handles point to:			These two returned handles point to:
	- the new scf.foreach_thread op,			- the new scf.foreach_thread op,
	- the tiled op that implements TilingInterface.			- the tiled op that implements TilingInterface.

				### Example using `num_threads`

				```
				%0 = pdl_match @match_matmul in %arg1
				%3:2 = transform.structured.tile_to_foreach_thread_op %0 num_threads [10, 20]
				```

				### Example using `tile_sizes`

				```
				%0 = pdl_match @match_matmul in %arg1
				%3:2 = transform.structured.tile_to_foreach_thread_op %0 tile_sizes [10, 20, 0]
				```
	}];			}];

	let arguments = (ins PDL_Operation:$target,			let arguments = (ins PDL_Operation:$target,
	// TODO: dynamic number of threads.			// TODO: dynamic number of threads.
	DefaultValuedAttr<I64ArrayAttr, "{}">:$num_threads,			OptionalAttr<DefaultValuedAttr<I64ArrayAttr, "{}">>:$num_threads,
				OptionalAttr<DefaultValuedAttr<I64ArrayAttr, "{}">>:$tile_sizes,
	OptionalAttr<I64ArrayAttr>:$thread_dim_mapping);			OptionalAttr<I64ArrayAttr>:$thread_dim_mapping);
	let results = (outs PDL_Operation:$foreach_thread_op,			let results = (outs PDL_Operation:$foreach_thread_op,
	PDL_Operation:$tiled_op);			PDL_Operation:$tiled_op);

	let assemblyFormat = [{			let assemblyFormat = [{
	$target $num_threads (`(` `mapped` `to` `dims` $thread_dim_mapping^ `)`)?			$target (`num_threads` $num_threads^) : (`tile_sizes` $tile_sizes)?
	attr-dict			(`(` `mapped` `to` `dims` $thread_dim_mapping^ `)`)? attr-dict
	}];			}];

	let extraClassDeclaration = [{			let extraClassDeclaration = [{
	::mlir::DiagnosedSilenceableFailure applyToOne(			::mlir::DiagnosedSilenceableFailure applyToOne(
	::mlir::TilingInterface target,			::mlir::TilingInterface target,
	::llvm::SmallVectorImpl<::mlir::Operation *> &results,			::llvm::SmallVectorImpl<::mlir::Operation *> &results,
	::mlir::transform::TransformState &state);			::mlir::transform::TransformState &state);
	}];			}];

mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h

	/// It is the user's responsibility to ensure that `numThreads` is a			/// It is the user's responsibility to ensure that `numThreads` is a
	/// valid tiling specification (i.e. that only tiles parallel			/// valid tiling specification (i.e. that only tiles parallel
	/// dimensions, e.g. in the Linalg case).			/// dimensions, e.g. in the Linalg case).
	struct ForeachThreadTilingResult {			struct ForeachThreadTilingResult {
	Operation *tileOp;			Operation *tileOp;
	Operation *tiledOp;			Operation *tiledOp;
	};			};
	FailureOr<ForeachThreadTilingResult>			FailureOr<ForeachThreadTilingResult>
	tileToForeachThreadOp(OpBuilder &builder, TilingInterface op,			tileToForeachThreadOp(RewriterBase &builder, TilingInterface op,
	ArrayRef<OpFoldResult> numThreads,			ArrayRef<OpFoldResult> numThreads,
	ArrayRef<int64_t> threadDimMapping = {});			ArrayRef<int64_t> threadDimMapping = {});

				/// Same as `tileToForeachThreadOp`, but calculate the number of threads
				/// required using the given tileSizes.
				FailureOr<ForeachThreadTilingResult>
				tileToForeachThreadOpUsingTileSizes(RewriterBase &builder, TilingInterface op,
				ArrayRef<OpFoldResult> tileSizes,
				ArrayRef<int64_t> threadDimMapping = {});

	/// All indices returned by IndexOp should be invariant with respect to tiling.			/// All indices returned by IndexOp should be invariant with respect to tiling.
	/// Therefore, if an operation is tiled, we have to transform the indices			/// Therefore, if an operation is tiled, we have to transform the indices
	/// accordingly, i.e. offset them by the values of the corresponding induction			/// accordingly, i.e. offset them by the values of the corresponding induction
	/// variables that are captured implicitly in the body of the op.			/// variables that are captured implicitly in the body of the op.
	///			///
	/// Example. `linalg.generic` before tiling:			/// Example. `linalg.generic` before tiling:
	///			///
	/// #id_2d = (i, j) -> (i, j)			/// #id_2d = (i, j) -> (i, j)

mlir/lib/Dialect/Affine/IR/AffineOps.cpp

[...]	static void materializeConstants(OpBuilder &b, Location loc,
SmallVectorImpl<Value> &actualValues) {		SmallVectorImpl<Value> &actualValues) {
actualValues.reserve(values.size());		actualValues.reserve(values.size());
auto *dialect = b.getContext()->getLoadedDialect<AffineDialect>();		auto *dialect = b.getContext()->getLoadedDialect<AffineDialect>();
for (OpFoldResult ofr : values) {		for (OpFoldResult ofr : values) {
if (auto value = ofr.dyn_cast<Value>()) {		if (auto value = ofr.dyn_cast<Value>()) {
actualValues.push_back(value);		actualValues.push_back(value);
continue;		continue;
}		}
constants.push_back(dialect->materializeConstant(b, ofr.get<Attribute>(),		// Since we are directly specifying `index` as the result type, we need to
		// ensure the provided attribute is also an index type. Otherwise, the
		// AffineDialect materializer will create invalid `arith.constant`
		// operations if the provided Attribute is any other kind of integer.
		constants.push_back(dialect->materializeConstant(
		b, b.getIndexAttr(ofr.get<Attribute>().cast<IntegerAttr>().getInt()),
b.getIndexType(), loc));		b.getIndexType(), loc));
actualValues.push_back(constants.back()->getResult(0));		actualValues.push_back(constants.back()->getResult(0));
}		}
}		}

/// Create an operation of the type provided as template argument and attempt to		/// Create an operation of the type provided as template argument and attempt to
/// fold it immediately. The operation is expected to have a builder taking		/// fold it immediately. The operation is expected to have a builder taking
/// arbitrary `leadingArguments`, followed by a list of Value-typed `operands`.		/// arbitrary `leadingArguments`, followed by a list of Value-typed `operands`.
/// The operation is also expected to always produce a single result. Return an		/// The operation is also expected to always produce a single result. Return an

mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	DiagnosedSilenceableFailure transform::TileToForeachThreadOp::applyToOne(			DiagnosedSilenceableFailure transform::TileToForeachThreadOp::applyToOne(
	TilingInterface target, SmallVectorImpl<Operation *> &results,			TilingInterface target, SmallVectorImpl<Operation *> &results,
	transform::TransformState &state) {			transform::TransformState &state) {
	IRRewriter rewriter(getContext());			IRRewriter rewriter(getContext());
	rewriter.setInsertionPoint(target);			rewriter.setInsertionPoint(target);
	auto maybeThreadDimMappingAttr = getThreadDimMapping();			auto maybeThreadDimMappingAttr = getThreadDimMapping();
	FailureOr<ForeachThreadTilingResult> tilingResult =			auto dimMapping =
	linalg::tileToForeachThreadOp(			llvm::to_vector(maybeThreadDimMappingAttr
	rewriter, target, getAsOpFoldResult(getNumThreads()),
	maybeThreadDimMappingAttr
	? extractFromI64ArrayAttr(*maybeThreadDimMappingAttr)			? extractFromI64ArrayAttr(*maybeThreadDimMappingAttr)
	: ArrayRef<int64_t>{});			: ArrayRef<int64_t>{});

				FailureOr<ForeachThreadTilingResult> tilingResult = failure();
				if (Optional<ArrayAttr> numThreads = getNumThreads())
				tilingResult = linalg::tileToForeachThreadOp(
				rewriter, target, getAsOpFoldResult(*numThreads), dimMapping);

				if (Optional<ArrayAttr> tileSizes = getTileSizes())
				tilingResult = linalg::tileToForeachThreadOpUsingTileSizes(
				rewriter, target, getAsOpFoldResult(*tileSizes), dimMapping);

	if (failed(tilingResult))			if (failed(tilingResult))
	return emitDefaultSilenceableFailure(target);			return emitDefaultSilenceableFailure(target);
	rewriter.replaceOp(target, tilingResult->tileOp->getResults());			rewriter.replaceOp(target, tilingResult->tileOp->getResults());
	results.assign({tilingResult->tileOp, tilingResult->tiledOp});			results.assign({tilingResult->tileOp, tilingResult->tiledOp});
	return DiagnosedSilenceableFailure(success());			return DiagnosedSilenceableFailure(success());
	}			}

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt

[...]	add_mlir_dialect_library(MLIRLinalgTransforms
MLIRLinalgPassIncGen		MLIRLinalgPassIncGen

LINK_LIBS PUBLIC		LINK_LIBS PUBLIC
MLIRAffineDialect		MLIRAffineDialect
MLIRAffineUtils		MLIRAffineUtils
MLIRAnalysis		MLIRAnalysis
MLIRArithmeticDialect		MLIRArithmeticDialect
MLIRArithmeticTransforms		MLIRArithmeticTransforms
		MLIRArithmeticUtils
MLIRBufferizationDialect		MLIRBufferizationDialect
MLIRBufferizationTransforms		MLIRBufferizationTransforms
MLIRComplexDialect		MLIRComplexDialect
		MLIRDialectUtils
MLIRFuncDialect		MLIRFuncDialect
MLIRFuncToLLVM		MLIRFuncToLLVM
MLIRFuncTransforms		MLIRFuncTransforms
MLIRInferTypeOpInterface		MLIRInferTypeOpInterface
MLIRIR		MLIRIR
MLIRMemRefDialect		MLIRMemRefDialect
MLIRLinalgDialect		MLIRLinalgDialect
MLIRLinalgAnalysis		MLIRLinalgAnalysis

mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp

//===- Tiling.cpp - Implementation of linalg Tiling -----------------------===//		//===- Tiling.cpp - Implementation of linalg Tiling -----------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This file implements the linalg dialect Tiling pass.		// This file implements the linalg dialect Tiling pass.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include <utility>		#include <utility>

#include "PassDetail.h"		#include "PassDetail.h"
		#include "mlir/Dialect/Arithmetic/Utils/Utils.h"
#include "mlir/Dialect/ControlFlow/IR/ControlFlowOps.h"		#include "mlir/Dialect/ControlFlow/IR/ControlFlowOps.h"
#include "mlir/Dialect/Linalg/IR/Linalg.h"		#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Linalg/Passes.h"		#include "mlir/Dialect/Linalg/Passes.h"
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"		#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
#include "mlir/Dialect/Linalg/Utils/Utils.h"		#include "mlir/Dialect/Linalg/Utils/Utils.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"		#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Dialect/SCF/Transforms/Transforms.h"		#include "mlir/Dialect/SCF/Transforms/Transforms.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"		#include "mlir/Dialect/Tensor/IR/Tensor.h"
[...]	createMatchingParallelSubsetInsertOp(OpBuilder &b, Location loc,
tensor::ExtractSliceOp subsetExtractOp,		tensor::ExtractSliceOp subsetExtractOp,
Value source, Value dest) {		Value source, Value dest) {
b.create<tensor::ParallelInsertSliceOp>(		b.create<tensor::ParallelInsertSliceOp>(
loc, source, dest, subsetExtractOp.getMixedOffsets(),		loc, source, dest, subsetExtractOp.getMixedOffsets(),
subsetExtractOp.getMixedSizes(), subsetExtractOp.getMixedStrides());		subsetExtractOp.getMixedSizes(), subsetExtractOp.getMixedStrides());
}		}

/// Build an `affine_max` of all the `vals`.		/// Build an `affine_max` of all the `vals`.
static Value buildMax(OpBuilder &b, Location loc, ValueRange vals) {		static OpFoldResult buildMax(OpBuilder &b, Location loc,
		ArrayRef<OpFoldResult> vals) {
		SmallVector<Value> args = getValueOrCreateConstantIndexOp(b, loc, vals);
return b.createOrFold<AffineMaxOp>(		return b.createOrFold<AffineMaxOp>(
loc, AffineMap::getMultiDimIdentityMap(vals.size(), loc.getContext()),		loc, AffineMap::getMultiDimIdentityMap(vals.size(), loc.getContext()),
vals);		args);
}		}

/// Build an `affine_min` of all the `vals`.		/// Returns true if the maximum tile offset `tileSize * numThreads-1` is less
static Value buildMin(OpBuilder &b, Location loc, ValueRange vals) {		/// than `iterationSize`.
return b.createOrFold<AffineMinOp>(		static bool canOmitTileOffsetInBoundsCheck(OpFoldResult tileSize,
loc, AffineMap::getMultiDimIdentityMap(vals.size(), loc.getContext()),		OpFoldResult numThreads,
vals);		OpFoldResult iterationSize) {
		Optional<int64_t> tileSizeConst = getConstantIntValue(tileSize);
		Optional<int64_t> numThreadsConst = getConstantIntValue(numThreads);
		Optional<int64_t> iterSizeConst = getConstantIntValue(iterationSize);
		if (!tileSizeConst || !numThreadsConst || !iterSizeConst)
		return false;
		return *tileSizeConst * (*numThreadsConst - 1) < *iterSizeConst;
}		}

FailureOr<ForeachThreadTilingResult>		/// Rewrite a TilingInterface `op` to a tiled `scf.foreach_thread`. The
linalg::tileToForeachThreadOp(OpBuilder &b, TilingInterface op,		/// tiling is specified by the number of tiles/threads `numThreads` and the
ArrayRef<OpFoldResult> numThreads,		/// optional nominal tile size `nominalTileSizes`. If `nominalTilSizes` is
ArrayRef<int64_t> threadDimMapping) {		/// not specified, then it is derived from `numThreads` as `ceilDiv(dimSize[i],
		/// numThreads[i])`. If non-empty, the `threadDimMapping` is added as an
		/// attribute to the resulting `scf.foreach_thread`. A zero tile sizes indicate
		/// that the dimension is not tiled, and can be thought of as tiling by the full
		/// size of data.
		/// It is the user's responsibility to ensure that `numThreads` is a valid
		/// tiling specification (i.e. that only tiles parallel dimensions, e.g. in the
		/// Linalg case). If `omitTileOffsetBoundsCheck` is true, then the function will
		/// assume that `tileSize[i] * (numThread[i] -1) <= dimSize[i]` holds.
		static FailureOr<ForeachThreadTilingResult> tileToForeachThreadOpImpl(
		RewriterBase &b, TilingInterface op, ArrayRef<OpFoldResult> numThreads,
		Optional<ArrayRef<OpFoldResult>> nominalTileSizes,
		ArrayRef<int64_t> threadDimMapping, bool omitTileOffsetBoundsCheck) {
Location loc = op->getLoc();		Location loc = op->getLoc();
OpBuilder::InsertionGuard g(b);		OpBuilder::InsertionGuard g(b);
SmallVector<Range> loopRanges = op.getIterationDomain(b);		SmallVector<Range> loopRanges = op.getIterationDomain(b);
if (loopRanges.empty())		if (loopRanges.empty())
return op->emitOpError("expected non-empty loop ranges");		return op->emitOpError("expected non-empty loop ranges");
auto hasStrideOne = [](Range r) { return !isConstantIntValue(r.stride, 1); };		auto hasStrideOne = [](Range r) { return !isConstantIntValue(r.stride, 1); };
if (llvm::any_of(loopRanges, hasStrideOne))		if (llvm::any_of(loopRanges, hasStrideOne))
return op->emitOpError("only stride-1 supported atm");		return op->emitOpError("only stride-1 supported atm");
[...]	static FailureOr<ForeachThreadTilingResult> tileToForeachThreadOpImpl(
SmallVector<Value> materializedNonZeroNumThreads =		SmallVector<Value> materializedNonZeroNumThreads =
llvm::to_vector(llvm::map_range(nonZeroNumThreads, [&](OpFoldResult ofr) {		llvm::to_vector(llvm::map_range(nonZeroNumThreads, [&](OpFoldResult ofr) {
ImplicitLocOpBuilder ilocb(loc, b);		ImplicitLocOpBuilder ilocb(loc, b);
return materializeOpFoldResult(ilocb, ofr);		return materializeOpFoldResult(ilocb, ofr);
}));		}));

Value zero = b.create<arith::ConstantIndexOp>(loc, 0);		Value zero = b.create<arith::ConstantIndexOp>(loc, 0);
Operation *tiledOp = nullptr;		Operation *tiledOp = nullptr;

		// Create the ForeachThreadOp. We don't use the lambda body-builder
		// version because we require the use of RewriterBase in the body, so we
		// manually move the insertion point to the body below.
scf::ForeachThreadOp foreachThreadOp = b.create<scf::ForeachThreadOp>(		scf::ForeachThreadOp foreachThreadOp = b.create<scf::ForeachThreadOp>(
loc, materializedNonZeroNumThreads, threadDimMapping,		loc, op->getResultTypes(), ValueRange(materializedNonZeroNumThreads),
[&](OpBuilder &b, Location loc, ValueRange threadIds) {		threadDimMapping);

		// Fill out the ForeachThreadOp body.
		b.setInsertionPointToStart(foreachThreadOp.getBody(0));
		ValueRange threadIds = foreachThreadOp.getThreadIndices();
int64_t nLoops = loopRanges.size();		int64_t nLoops = loopRanges.size();
SmallVector<OpFoldResult> tiledOffsets, tiledSizes;		SmallVector<OpFoldResult> tiledOffsets, tiledSizes;
tiledOffsets.reserve(nLoops);		tiledOffsets.reserve(nLoops);
tiledSizes.reserve(nLoops);		tiledSizes.reserve(nLoops);
for (unsigned loopIdx = 0, threadIdIdx = 0; loopIdx < nLoops;		for (unsigned loopIdx = 0, threadIdIdx = 0; loopIdx < nLoops; ++loopIdx) {
++loopIdx) {
bool overflow = loopIdx >= numThreads.size();		bool overflow = loopIdx >= numThreads.size();
bool isZero = !overflow && isConstantIntValue(numThreads[loopIdx], 0);		bool isZero = !overflow && isConstantIntValue(numThreads[loopIdx], 0);
// Degenerate case: take the whole domain.		// Degenerate case: take the whole domain.
if (overflow || isZero) {		if (overflow || isZero) {
tiledOffsets.push_back(loopRanges[loopIdx].offset);		tiledOffsets.push_back(loopRanges[loopIdx].offset);
tiledSizes.push_back(loopRanges[loopIdx].size);		tiledSizes.push_back(loopRanges[loopIdx].size);
continue;		continue;
}		}

// Tiled case: compute the offset and size.		// Tiled case: compute the offset and size.
AffineExpr i, j, M, N, O;		AffineExpr i, j, M, N, O;
bindDims(b.getContext(), i, j);		bindDims(b.getContext(), i, j);
bindSymbols(b.getContext(), M, N, O);		bindSymbols(b.getContext(), M, N, O);
Value size = loopRanges[loopIdx].size;		Value size = loopRanges[loopIdx].size;
Value offset = loopRanges[loopIdx].offset;		Value offset = loopRanges[loopIdx].offset;
Value threadId = threadIds[threadIdIdx];		Value threadId = threadIds[threadIdIdx];
// TODO: more aggressive foldings.
// Symbolic fixed max size per thread.		// Symbolic fixed max size per thread.
// TODO: floor + 0/1 depending on case for better load-balancing.		// TODO: floor + 0/1 depending on case for better load-balancing.
Value maxSizePerThread = b.createOrFold<AffineApplyOp>(		OpFoldResult tileSizePerThread =
loc, M.ceilDiv(N),		nominalTileSizes.hasValue()
ValueRange{size, materializedNonZeroNumThreads[threadIdIdx]});		? (*nominalTileSizes)[loopIdx]
		: makeComposedFoldedAffineApply(
		b, loc, M.ceilDiv(N),
		ArrayRef<OpFoldResult>{size, nonZeroNumThreads[threadIdIdx]});

// Dynamic offset shifted by threadId * maxSizePerThread.		// Dynamic offset shifted by threadId * maxSizePerThread.
Value offsetPerThread = b.createOrFold<AffineApplyOp>(		OpFoldResult offsetPerThread = makeComposedFoldedAffineApply(
loc, i + j * M, ValueRange{offset, threadId, maxSizePerThread});		b, loc, i + j * M, {offset, threadId, tileSizePerThread});
// Dynamic upper-bound depending on the threadId.		// Dynamic upper-bound depending on the threadId.
Value sizeMinusOffsetPerThread = b.createOrFold<AffineApplyOp>(		OpFoldResult residualTileSize = makeComposedFoldedAffineApply(
loc, -i + M, ValueRange{offsetPerThread, size});		b, loc, i + j * M - N,
Value tileSizePerThread = buildMin(		{offset, nonZeroNumThreads[threadIdIdx], tileSizePerThread, size});
b, loc, ValueRange{sizeMinusOffsetPerThread, maxSizePerThread});		if (!isConstantIntValue(residualTileSize, 0)) {
		OpFoldResult sizeMinusOffsetPerThread = makeComposedFoldedAffineApply(
		b, loc, -i + M, {offsetPerThread, size});
		tileSizePerThread = makeComposedFoldedAffineMin(
		b, loc, AffineMap::getMultiDimIdentityMap(2, b.getContext()),
		ArrayRef<OpFoldResult>{sizeMinusOffsetPerThread, tileSizePerThread});
		}

tiledOffsets.push_back(offsetPerThread);		tiledOffsets.push_back(offsetPerThread);
// TODO: if tileSizePerThread <= 0 early exit.		// TODO: if tileSizePerThread <= 0 early exit.
tiledSizes.push_back(		if (!omitTileOffsetBoundsCheck &&
buildMax(b, loc, ValueRange{zero, tileSizePerThread}));		!canOmitTileOffsetInBoundsCheck(tileSizePerThread,
		nicolasvasilache (inline comment, marked done): Making this discovery more powerful is going to be painful with SSA values. OTOH we know by construction in the "tile_size case" that we don't need the max. I would just add a bool passed at the call site, true for the "tile_size case" and false for the "num_threads case". Then this discovery can refine the "num_threads case".
		nonZeroNumThreads[threadIdIdx], size))
		tileSizePerThread = buildMax(b, loc, {zero, tileSizePerThread});

		tiledSizes.push_back(tileSizePerThread);
++threadIdIdx;		++threadIdIdx;
}		}

SmallVector<Operation *> tiledOps =		SmallVector<Operation *> tiledOps =
op.getTiledImplementation(b, destOperands, tiledOffsets, tiledSizes,		op.getTiledImplementation(b, destOperands, tiledOffsets, tiledSizes,
/*tileDestOperands=*/true);		/*tileDestOperands=*/true);
assert(tiledOps.size() == 1 && "expected a single produced tiled op");		assert(tiledOps.size() == 1 && "expected a single produced tiled op");
tiledOp = tiledOps.front();		tiledOp = tiledOps.front();

auto tilingInterfaceOp = dyn_cast<TilingInterface>(tiledOp);		auto tilingInterfaceOp = dyn_cast<TilingInterface>(tiledOp);
assert(tilingInterfaceOp &&		assert(tilingInterfaceOp && "Tiled op does not implement TilingInterface");
"Tiled op does not implement TilingInterface");

auto tiledDestOperands = tilingInterfaceOp.getDestinationOperands(b);		auto tiledDestOperands = tilingInterfaceOp.getDestinationOperands(b);

// Create terminator with parallel subset insert operations.		// Create terminator with parallel subset insert operations.
auto performConcurrentlyOp = b.create<scf::PerformConcurrentlyOp>(loc);		b.setInsertionPointToStart(foreachThreadOp.getTerminator().getBody());
OpBuilder::InsertionGuard g(b);		for (auto it : llvm::zip(tiledDestOperands, tilingInterfaceOp->getResults(),
b.setInsertionPointToStart(performConcurrentlyOp.getBody());
for (auto it :
llvm::zip(tiledDestOperands, tilingInterfaceOp->getResults(),
destOperands)) {		destOperands)) {
createMatchingParallelSubsetInsertOp(		createMatchingParallelSubsetInsertOp(
b, loc,		b, loc, cast<tensor::ExtractSliceOp>(std::get<0>(it).getDefiningOp()),
cast<tensor::ExtractSliceOp>(std::get<0>(it).getDefiningOp()),
std::get<1>(it), std::get<2>(it));		std::get<1>(it), std::get<2>(it));
}		}
});
return ForeachThreadTilingResult{foreachThreadOp, tiledOp};		return ForeachThreadTilingResult{foreachThreadOp, tiledOp};
}		}
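
The per-thread offset and size arithmetic in the new column above can be followed with plain integers. The sketch below is a hypothetical model, not code from the patch: `ThreadTile`, `ceilDiv`, and `computeThreadTile` are illustrative names, the affine expressions are replaced by `int64_t` arithmetic, and only the `omitTileOffsetBoundsCheck` flag is modelled (the additional `canOmitTileOffsetInBoundsCheck` discovery is left out).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical integer model of one tiled loop dimension.
struct ThreadTile {
  int64_t offset; // start of this thread's tile
  int64_t size;   // number of iterations assigned to this thread
};

// Ceiling division, assuming a non-negative numerator and positive divisor.
inline int64_t ceilDiv(int64_t a, int64_t b) {
  assert(a >= 0 && b > 0 && "sketch assumes a >= 0 and b > 0");
  return (a + b - 1) / b;
}

// Integer counterpart of the expressions in the diff:
//   offsetPerThread  = offset + threadId * tileSize
//   residualTileSize = offset + numThreads * tileSize - size
//   size is min(size - offsetPerThread, tileSize) unless the residual is 0,
//   and is clamped with max(0, ...) unless the caller omits the bounds check.
inline ThreadTile computeThreadTile(int64_t offset, int64_t size,
                                    int64_t numThreads, int64_t tileSize,
                                    int64_t threadId, bool omitBoundsCheck) {
  int64_t offsetPerThread = offset + threadId * tileSize;
  int64_t residual = offset + numThreads * tileSize - size;
  int64_t sizePerThread = tileSize;
  if (residual != 0) // tiles do not divide the range evenly: keep the min
    sizePerThread = std::min(size - offsetPerThread, tileSize);
  if (!omitBoundsCheck) // num_threads case: a thread may start past the end
    sizePerThread = std::max<int64_t>(0, sizePerThread);
  return {offsetPerThread, sizePerThread};
}
```

With the static shapes used in the tests, dimension 0 (size 100, 10 threads, tile 10) has a zero residual, so neither the min nor the max is needed. With a size of 5 split across 4 threads (tile 2), thread 3 starts at offset 6 and would get a size of -1 without the clamp, which is why the num_threads path keeps the max.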

		FailureOr<ForeachThreadTilingResult>
		linalg::tileToForeachThreadOp(RewriterBase &b, TilingInterface op,
		ArrayRef<OpFoldResult> numThreads,
		ArrayRef<int64_t> threadDimMapping) {
		return tileToForeachThreadOpImpl(b, op, numThreads, /*nominalTileSizes=*/None,
		threadDimMapping,
		/*omitTileOffsetBoundsCheck=*/false);
		}

		FailureOr<ForeachThreadTilingResult>
		linalg::tileToForeachThreadOpUsingTileSizes(
		RewriterBase &b, TilingInterface op, ArrayRef<OpFoldResult> tileSizes,
		ArrayRef<int64_t> threadDimMapping) {
		SmallVector<Range> loopRanges = op.getIterationDomain(b);
		unsigned nLoops = loopRanges.size();
		SmallVector<OpFoldResult> numThreads;
		numThreads.reserve(nLoops);
		AffineExpr s0, s1;
		bindSymbols(b.getContext(), s0, s1);
		AffineExpr divExpr = s0.ceilDiv(s1);
		for (const auto &it : llvm::zip(tileSizes, loopRanges)) {
		OpFoldResult numTiles = std::get<0>(it);
		if (!isConstantIntValue(numTiles, 0))
		numTiles = makeComposedFoldedAffineApply(
		b, op.getLoc(), divExpr, {std::get<1>(it).size, std::get<0>(it)});
		numThreads.push_back(numTiles);
		}
		return tileToForeachThreadOpImpl(b, op, numThreads,
		/*nominalTileSizes=*/tileSizes,
		threadDimMapping,
		/*omitTileOffsetBoundsCheck=*/true);
		}
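
As a rough illustration of how `tileToForeachThreadOpUsingTileSizes` derives the thread count from the tile sizes, the following sketch redoes the loop with constant integers. `numThreadsFromTileSizes` is a hypothetical helper, not part of the patch; it assumes static loop sizes and equal-length inputs, and it passes a zero tile size through unchanged as the diff does (presumably marking a dimension that is not tiled).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constant-only counterpart of the numThreads loop in the diff:
// numThreads[i] = ceil(loopSizes[i] / tileSizes[i]), with a zero tile size
// forwarded as zero instead of being divided by.
inline std::vector<int64_t>
numThreadsFromTileSizes(const std::vector<int64_t> &loopSizes,
                        const std::vector<int64_t> &tileSizes) {
  assert(loopSizes.size() == tileSizes.size() &&
         "sketch assumes one tile size per loop");
  std::vector<int64_t> numThreads;
  numThreads.reserve(loopSizes.size());
  for (std::size_t i = 0; i < loopSizes.size(); ++i) {
    int64_t tile = tileSizes[i];
    if (tile == 0) {
      numThreads.push_back(0); // zero is forwarded as-is
      continue;
    }
    numThreads.push_back((loopSizes[i] + tile - 1) / tile);
  }
  return numThreads;
}
```

For the 100x300 result of the static matmul test and tile sizes [10, 21], this yields 10 and 15 threads, which agrees with the `arith.constant 10` and `arith.constant 15` bounds checked in `matmul_tile_size_static`.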

// Insert a tile `source` into the destination tensor `dest`. The position at		// Insert a tile `source` into the destination tensor `dest`. The position at
// which the tile is inserted (as well as size of tile) is taken from a given		// which the tile is inserted (as well as size of tile) is taken from a given
// ExtractSliceOp `sliceOp`.		// ExtractSliceOp `sliceOp`.
static Value insertSliceIntoTensor(RewriterBase &b, Location loc,		static Value insertSliceIntoTensor(RewriterBase &b, Location loc,
tensor::ExtractSliceOp sliceOp, Value source,		tensor::ExtractSliceOp sliceOp, Value source,
Value dest) {		Value dest) {
return b.create<tensor::InsertSliceOp>(		return b.create<tensor::InsertSliceOp>(
loc, sliceOp.getSource().getType(), source, dest, sliceOp.getOffsets(),		loc, sliceOp.getSource().getType(), source, dest, sliceOp.getOffsets(),

mlir/test/Dialect/Linalg/tile-to-foreach-thread.mlir

// RUN: mlir-opt %s --test-transform-dialect-interpreter -canonicalize | FileCheck %s		// RUN: mlir-opt %s --test-transform-dialect-interpreter -canonicalize -split-input-file | FileCheck %s

// Offset per thread:		// Offset per thread:
// CHECK-DAG: affine_map<(d0)[s0] -> (d0 * (s0 ceildiv 10))>		// CHECK-DAG: affine_map<(d0)[s0] -> (d0 * (s0 ceildiv 10))>
// Per thread tile size.		// Per thread tile size.
// CHECK-DAG: affine_map<(d0)[s0] -> (-(d0 * (s0 ceildiv 10)) + s0, s0 ceildiv 10)>		// CHECK-DAG: affine_map<(d0)[s0] -> (-(d0 * (s0 ceildiv 10)) + s0, s0 ceildiv 10)>
// CHECK-DAG: affine_map<(d0)[s0] -> (d0 * (s0 ceildiv 20))>		// CHECK-DAG: affine_map<(d0)[s0] -> (d0 * (s0 ceildiv 20))>
// CHECK-DAG: affine_map<(d0)[s0] -> (-(d0 * (s0 ceildiv 20)) + s0, s0 ceildiv 20)>		// CHECK-DAG: affine_map<(d0)[s0] -> (-(d0 * (s0 ceildiv 20)) + s0, s0 ceildiv 20)>

		// CHECK-NEXT: } {thread_dim_mapping = [1, 0]}
return %0 : tensor<?x?xf32>		return %0 : tensor<?x?xf32>
}		}

transform.with_pdl_patterns {		transform.with_pdl_patterns {
^bb0(%arg0: !pdl.operation):		^bb0(%arg0: !pdl.operation):
transform.sequence %arg0 {		transform.sequence %arg0 {
^bb1(%arg1: !pdl.operation):		^bb1(%arg1: !pdl.operation):
%0 = transform.structured.match ops{["linalg.matmul"]} in %arg1		%0 = transform.structured.match ops{["linalg.matmul"]} in %arg1
%1:2 = transform.structured.tile_to_foreach_thread_op %0 [10, 20] (mapped to dims [1, 0])		%1:2 = transform.structured.tile_to_foreach_thread_op %0 num_threads [10, 20] (mapped to dims [1, 0])
}		}
}		}
}		}

		// -----

		// Tests that dimension 0 can eliminate affine.min/max, dimension 1 cannot.
		nicolasvasilache (inline comment): Nice test case!

		// CHECK-DAG: #[[$map0:.+]] = affine_map<(d0) -> (d0 * -15 + 300, 15)>
		// CHECK-DAG: #[[$map1:.+]] = affine_map<(d0) -> (0, d0)>
		// CHECK-DAG: #[[$map2:.+]] = affine_map<(d0) -> (d0 * 10)>
		// CHECK-DAG: #[[$map3:.+]] = affine_map<(d0) -> (d0 * 15)>

		// CHECK-LABEL: matmul_static(
		// CHECK-SAME: %[[A:[0-9a-z]+]]: tensor
		// CHECK-SAME: %[[B:[0-9a-z]+]]: tensor
		// CHECK-SAME: %[[C:[0-9a-z]+]]: tensor
		func.func @matmul_static(%A: tensor<100x200xf32>, %B: tensor<200x300xf32>, %C: tensor<100x300xf32>) -> tensor<100x300xf32> {
		// CHECK-DAG: %[[c10:.+]] = arith.constant 10 : index
		// CHECK-DAG: %[[c21:.+]] = arith.constant 21 : index
		// CHECK: scf.foreach_thread (%[[IV0:.+]], %[[IV1:.+]]) in (%[[c10]], %[[c21]])
		// CHECK: %[[TSMIN:.+]] = affine.min #[[$map0]](%[[IV1]])
		// CHECK: %[[TS:.+]] = affine.max #[[$map1]](%[[TSMIN]])
		// CHECK-NOT: affine.min
		// CHECK-NOT: affine.max
		// CHECK: %[[LB0:.+]] = affine.apply #[[$map2]](%[[IV0]])
		// CHECK: %[[tA:.+]] = tensor.extract_slice %[[A]][%[[LB0]], 0] [10, 200] [1, 1] :
		// CHECK: %[[LB1:.+]] = affine.apply #[[$map3]](%[[IV1]])
		// CHECK: %[[tB:.+]] = tensor.extract_slice %[[B]][0, %[[LB1]]] [200, %[[TS]]] [1, 1] :
		// CHECK: %[[LB0:.+]] = affine.apply #[[$map2]](%[[IV0]])
		// CHECK: %[[LB1:.+]] = affine.apply #[[$map3]](%[[IV1]])
		// CHECK: %[[tC:.+]] = tensor.extract_slice %[[C]][%[[LB0]], %[[LB1]]] [10, %[[TS]]] [1, 1] :
		// CHECK: linalg.matmul
		// CHECK: scf.foreach_thread.perform_concurrently
		// CHECK-NEXT: tensor.parallel_insert_slice
		%0 = linalg.matmul ins(%A, %B : tensor<100x200xf32>, tensor<200x300xf32>)
		outs(%C : tensor<100x300xf32>) -> (tensor<100x300xf32>)
		return %0 : tensor<100x300xf32>
		}

		transform.with_pdl_patterns {
		^bb0(%arg0: !pdl.operation):
		transform.sequence %arg0 {
		^bb1(%arg1: !pdl.operation):
		%0 = transform.structured.match ops{["linalg.matmul"]} in %arg1
		%1:2 = transform.structured.tile_to_foreach_thread_op %0 num_threads [10, 21]
		}
		}


		// -----

		// CHECK-DAG: #[[$map0:.+]] = affine_map<()[s0] -> (s0 ceildiv 10)>
		// CHECK-DAG: #[[$map1:.+]] = affine_map<()[s0] -> (s0 ceildiv 20)>
		// CHECK-DAG: #[[$map2:.+]] = affine_map<(d0)[s0] -> (d0 * -10 + s0, 10)>
		// CHECK-DAG: #[[$map4:.+]] = affine_map<(d0)[s0] -> (d0 * -20 + s0, 20)>
		// CHECK-DAG: #[[$map5:.+]] = affine_map<(d0) -> (d0 * 10)>
		// CHECK-DAG: #[[$map6:.+]] = affine_map<(d0) -> (d0 * 20)>

		// CHECK-LABEL: matmul_tile_size_dynamic(
		// CHECK-SAME: %[[A:[0-9a-z]+]]: tensor<?x?xf32>
		// CHECK-SAME: %[[B:[0-9a-z]+]]: tensor<?x?xf32>
		// CHECK-SAME: %[[C:[0-9a-z]+]]: tensor<?x?xf32>
		func.func @matmul_tile_size_dynamic(%A: tensor<?x?xf32>, %B: tensor<?x?xf32>, %C: tensor<?x?xf32>) -> tensor<?x?xf32> {
		// CHECK: %[[M:.+]] = tensor.dim %[[A]], %c0 :
		// CHECK: %[[N:.+]] = tensor.dim %[[B]], %c1 :
		// CHECK: %[[NT0:.+]] = affine.apply #map0()[%[[M]]]
		// CHECK: %[[NT1:.+]] = affine.apply #map1()[%[[N]]]
		// CHECK: %[[M:.+]] = tensor.dim %[[A]], %c0 :
		// CHECK: %[[N:.+]] = tensor.dim %[[B]], %c1 :
		// CHECK: scf.foreach_thread (%[[IV0:.+]], %[[IV1:.+]]) in (%[[NT0]], %[[NT1]])
		// CHECK: %[[TS0:.+]] = affine.min #[[$map2]](%[[IV0]])[%[[M]]]
		// CHECK: %[[TS1:.+]] = affine.min #[[$map4]](%[[IV1]])[%[[N]]]
		// CHECK: %[[LB0:.+]] = affine.apply #[[$map5]](%[[IV0]])
		// CHECK: tensor.extract_slice %[[A]]
		// CHECK: %[[LB1:.+]] = affine.apply #[[$map6]](%[[IV1]])
		// CHECK: tensor.extract_slice %[[B]]
		// CHECK: %[[LB0:.+]] = affine.apply #[[$map5]](%[[IV0]])
		// CHECK: %[[LB1:.+]] = affine.apply #[[$map6]](%[[IV1]])
		// CHECK: tensor.extract_slice %[[C]]
		// CHECK: linalg.matmul
		// CHECK: scf.foreach_thread.perform_concurrently
		// CHECK-NEXT: tensor.parallel_insert_slice
		%0 = linalg.matmul ins(%A, %B : tensor<?x?xf32>, tensor<?x?xf32>)
		outs(%C : tensor<?x?xf32>) -> (tensor<?x?xf32>)
		return %0 : tensor<?x?xf32>
		}

		transform.with_pdl_patterns {
		^bb0(%arg0: !pdl.operation):
		transform.sequence %arg0 {
		^bb1(%arg1: !pdl.operation):
		%0 = transform.structured.match ops{["linalg.matmul"]} in %arg1
		%1:2 = transform.structured.tile_to_foreach_thread_op %0 tile_sizes [10, 20]
		}
		}

		// -----

		// Tests that dimension 0 can eliminate affine.min/max, dimension 1 cannot.

		// CHECK-DAG: #[[$map0:.+]] = affine_map<(d0) -> (d0 * -21 + 300, 21)>
		// CHECK-DAG: #[[$map2:.+]] = affine_map<(d0) -> (d0 * 10)>
		// CHECK-DAG: #[[$map3:.+]] = affine_map<(d0) -> (d0 * 21)>

		// CHECK-LABEL: matmul_tile_size_static(
		// CHECK-SAME: %[[A:[0-9a-z]+]]: tensor
		// CHECK-SAME: %[[B:[0-9a-z]+]]: tensor
		// CHECK-SAME: %[[C:[0-9a-z]+]]: tensor
		func.func @matmul_tile_size_static(%A: tensor<100x200xf32>, %B: tensor<200x300xf32>, %C: tensor<100x300xf32>) -> tensor<100x300xf32> {
		// CHECK-DAG: %[[c10:.+]] = arith.constant 10 :
		// CHECK-DAG: %[[c15:.+]] = arith.constant 15 :
		// CHECK: scf.foreach_thread (%[[IV0:.+]], %[[IV1:.+]]) in (%[[c10]], %[[c15]])
		// CHECK: %[[TS:.+]] = affine.min #[[$map0]](%[[IV1]])
		// CHECK-NOT: affine.max
		// CHECK-NOT: affine.min
		// CHECK: %[[LB0:.+]] = affine.apply #[[$map2]](%[[IV0]])
		// CHECK: %[[tA:.+]] = tensor.extract_slice %[[A]][%[[LB0]], 0] [10, 200] [1, 1] :
		// CHECK: %[[LB1:.+]] = affine.apply #[[$map3]](%[[IV1]])
		// CHECK: %[[tB:.+]] = tensor.extract_slice %[[B]][0, %[[LB1]]] [200, %[[TS]]] [1, 1] :
		// CHECK: %[[LB0:.+]] = affine.apply #[[$map2]](%[[IV0]])
		// CHECK: %[[LB1:.+]] = affine.apply #[[$map3]](%[[IV1]])
		// CHECK: %[[tC:.+]] = tensor.extract_slice %[[C]][%[[LB0]], %[[LB1]]] [10, %[[TS]]] [1, 1] :
		// CHECK: linalg.matmul
		// CHECK: scf.foreach_thread.perform_concurrently
		// CHECK-NEXT: tensor.parallel_insert_slice
		%0 = linalg.matmul ins(%A, %B : tensor<100x200xf32>, tensor<200x300xf32>)
		outs(%C : tensor<100x300xf32>) -> (tensor<100x300xf32>)
		return %0 : tensor<100x300xf32>
		}

		transform.with_pdl_patterns {
		^bb0(%arg0: !pdl.operation):
		transform.sequence %arg0 {
		^bb1(%arg1: !pdl.operation):
		%0 = transform.structured.match ops{["linalg.matmul"]} in %arg1
		%1:2 = transform.structured.tile_to_foreach_thread_op %0 tile_sizes [10, 21]
		}
		}

Revision Contents

Diff 446539

mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td

mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h

mlir/lib/Dialect/Affine/IR/AffineOps.cpp

mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp

mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt

mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp

mlir/test/Dialect/Linalg/tile-to-foreach-thread.mlir