This is an archive of the discontinued LLVM Phabricator instance.

Differential D130139

[mlir][linalg] Add tile_size option to `structured.tile_to_foreach_thread_op`
ClosedPublic

Authored by christopherbate on Jul 19 2022, 7:43 PM.

Details

Reviewers

bondhugula
nicolasvasilache

Commits

rG297ba167ded0: [mlir][linalg] Add tile_size option to `structured.tile_to_foreach_thread_op`

Summary

This change modifies structured.tile_to_foreach_thread_op so that
it accepts either tile_sizes or num_threads parameters. If
tile_sizes are specified, then the number of threads required is
derived from the tile sizes rather than the other way around. In both cases,
more aggressive folding of loop parameters is enabled during the
transformation, allowing for the potential elimination of affine.min
and affine.max operations in the static shape case when calculating
the final adjusted tile size.
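
As a rough sketch of the arithmetic described above (plain C++ on integers, not the MLIR implementation): the thread count is the ceiling of the iteration size over the tile size, and each thread's tile is clamped with a min against the remaining extent and a max against zero. The names `ceilDiv`, `numThreadsFromTileSize`, `tileSizeFromNumThreads`, and `tileForThread` are hypothetical helpers introduced for illustration, and the iteration domain is assumed to start at offset 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helpers for illustration only; none of these are MLIR APIs.

// Ceiling division for a non-negative dividend and a positive divisor.
inline int64_t ceilDiv(int64_t a, int64_t b) {
  assert(a >= 0 && b > 0 && "expected non-negative dividend, positive divisor");
  return (a + b - 1) / b;
}

// `tile_sizes` case: derive the thread count from the tile size.
// A tile size of 0 means "do not tile this dimension" and is passed through.
inline int64_t numThreadsFromTileSize(int64_t iterationSize, int64_t tileSize) {
  return tileSize == 0 ? 0 : ceilDiv(iterationSize, tileSize);
}

// `num_threads` case: derive the maximum tile size per thread.
inline int64_t tileSizeFromNumThreads(int64_t iterationSize,
                                      int64_t numThreads) {
  return ceilDiv(iterationSize, numThreads);
}

struct Tile {
  int64_t offset;
  int64_t size;
};

// Offset and size of the tile owned by `threadId`, assuming the iteration
// domain starts at 0. The min handles the partial last tile; the max handles
// threads whose offset lies past the end of the domain.
inline Tile tileForThread(int64_t iterationSize, int64_t tileSize,
                          int64_t threadId) {
  int64_t offset = threadId * tileSize;
  int64_t remaining = iterationSize - offset;
  int64_t size = std::max<int64_t>(0, std::min(remaining, tileSize));
  return Tile{offset, size};
}
```

For an iteration size of 100 and a tile size of 30, this gives 4 threads, with the last thread covering offset 90 and size 10. With 8 threads over an iteration size of 10, the derived tile size is 2 and thread 7 would start past the end of the domain, so the max against zero yields an empty tile.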

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

christopherbate created this revision. Jul 19 2022, 7:43 PM

Herald added a reviewer: bondhugula. Jul 19 2022, 7:43 PM

Herald added a project: Restricted Project.

Herald added subscribers: bzcheeseman, sdasgup3, Groverkss and 20 others.

christopherbate requested review of this revision. Jul 19 2022, 7:43 PM

Herald added a reviewer: nicolasvasilache. Jul 19 2022, 7:43 PM

Herald added a project: Restricted Project.

Herald added subscribers: limo1996, stephenneuendorffer, nicolasvasilache.

christopherbate added inline comments. Jul 19 2022, 7:50 PM

mlir/lib/Dialect/Affine/IR/AffineOps.cpp
725	Without this change an error will be produced because we give an OpFoldResult vector that comes directly from I64ArrayAttr that belongs to an op attribute. Explicitly convert to index type.
mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
237	You can't use RewriterBase in the body-creation lambda, so I moved to the non-lambda creation form and manually move the insertion point below.

christopherbate mentioned this in D129335: [mlir][SCF] Tile with TilingInterface using `scf.foreach_thread`. Jul 19 2022, 7:52 PM

Harbormaster completed remote builds in B176405: Diff 446019. Jul 19 2022, 8:04 PM

Thanks much!

mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
593	This should say 'num_threads', also `0` is not a valid num_threads
612	Ah, very cool, I didn't realize the custom assembly format allows ternary expressions now.
mlir/lib/Dialect/Affine/IR/AffineOps.cpp
725	can you add this as a comment in the code to explain why the cast?
mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
195	nit: typo/grammo.
237	Yes, this is annoying and I had the same issue recently. Another possibility is to explicitly cast the OpBuilder as a RewriterBase inside the lambda when you control the call site, you can do that here if you prefer (fine to leave as is). In any case, can you please add a comment explaining this?
292	Making this discovery more powerful is going to be painful with SSA values. OTOH we know by construction in the "tile_size case" that we don't need the max. I would just add a bool passed at the call site, true for the "tile_size case" and false for the "num_threads case". Then this discovery can refine the "num_threads case".
mlir/test/Dialect/Linalg/tile-to-foreach-thread.mlir
53	Nice test case!
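
The static check discussed in the Tiling.cpp comments above can be sketched in isolation, assuming `std::optional<int64_t>` as a stand-in for a size that may or may not be a compile-time constant. The function name `canOmitLowerClamp` is hypothetical; it only mirrors the condition `tileSize * (numThreads - 1) < iterationSize` from the diff and conservatively answers false when any input is unknown.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the check in the diff: returns true only when all
// three quantities are known constants and the largest tile offset,
// tileSize * (numThreads - 1), is provably inside the iteration domain. In
// that case the per-thread tile size can never be negative, so the max
// against zero is not needed.
inline bool canOmitLowerClamp(std::optional<int64_t> tileSize,
                              std::optional<int64_t> numThreads,
                              std::optional<int64_t> iterationSize) {
  // Any dynamic (unknown) value forces the conservative answer.
  if (!tileSize || !numThreads || !iterationSize)
    return false;
  assert(*tileSize >= 0 && "tile size must be non-negative");
  return *tileSize * (*numThreads - 1) < *iterationSize;
}
```

With a tile size of 13, 8 threads, and an iteration size of 100, the largest offset is 91, so the clamp to zero is unnecessary. With a tile size of 2, 8 threads, and an iteration size of 10, the largest offset is 14 and the clamp must stay.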

This revision is now accepted and ready to land. Jul 20 2022, 1:13 AM

Address comments.

christopherbate marked 2 inline comments as done. Jul 20 2022, 12:36 PM

christopherbate added inline comments.

mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
593	"also 0 is not a valid num_threads": But in the doc that you wrote above, you state "Zero tile sizes indicate that the dimension is not tiled, and can be thought of as tiling by the full size of data." I had assumed you meant that `0` is a sentinel value indicating to skip that dimension, regardless of whether it is specified in `num_threads` or `tile_size`. If you specify `num_threads` then the derived tile size cannot be zero. Otherwise you can't handle ops that have a reduction dimension that appears before a parallel dimension that you would like to tile.

nicolasvasilache added inline comments. Jul 20 2022, 12:49 PM

mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td
593	You're right, I confused myself over nothing, please ignore.

Harbormaster completed remote builds in B176573: Diff 446238. Jul 20 2022, 12:49 PM

Rebase

Herald added a subscriber: mgorny. Jul 21 2022, 8:27 AM

Harbormaster completed remote builds in B176781: Diff 446517. Jul 21 2022, 8:59 AM

Closed by commit rG297ba167ded0: [mlir][linalg] Add tile_size option to `structured.tile_to_foreach_thread_op` (authored by christopherbate). Jul 21 2022, 9:36 AM

This revision was automatically updated to reflect the committed changes.

christopherbate added a commit: rG297ba167ded0: [mlir][linalg] Add tile_size option to `structured.tile_to_foreach_thread_op`.

Revision Contents

Path

Size

mlir/

include/

mlir/

Dialect/

Linalg/

TransformOps/

LinalgTransformOps.td

19 lines

Transforms/

Transforms.h

9 lines

lib/

Dialect/

Affine/

IR/

AffineOps.cpp

5 lines

Linalg/

TransformOps/

LinalgTransformOps.cpp

20 lines

Transforms/

Tiling.cpp

215 lines

test/

Dialect/

Linalg/

tile-to-foreach-thread.mlir

163 lines

Diff 446019

mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td

Show First 20 Lines • Show All 580 Lines • ▼ Show 20 Lines	let description = [{
Otherwise the transform silently fails.		Otherwise the transform silently fails.

The 2 returned handles point to only the subset of successfully produced		The 2 returned handles point to only the subset of successfully produced
tiled operations, which can all be empty.		tiled operations, which can all be empty.

These 2 returned handles point to:		These 2 returned handles point to:
- the new scf.foreach_thread op,		- the new scf.foreach_thread op,
- the tiled op that implements TilingInterface.		- the tiled op that implements TilingInterface.

		### Example using `num_threads`
		```
		%0 = pdl_match @match_matmul in %arg1
		%3:2 = transform.structured.tile_to_foreach %0 tile_sizes [10, 20, 0]
		nicolasvasilache: This should say 'num_threads', also `0` is not a valid num_threads
		christopherbate: "also 0 is not a valid num_threads": But in the doc that you wrote above, you state "Zero tile sizes indicate that the dimension is not tiled, and can be thought of as tiling by the full size of data." I had assumed you meant that `0` is a sentinel value indicating to skip that dimension, regardless of whether it is specified in `num_threads` or `tile_size`. If you specify `num_threads` then the derived tile size cannot be zero. Otherwise you can't handle ops that have a reduction dimension that appears before a parallel dimension that you would like to tile.
		nicolasvasilache: You're right, I confused myself over nothing, please ignore.
		```

		### Example using `tile_sizes`
		```
		%0 = pdl_match @match_matmul in %arg1
		%3:2 = transform.structured.tile_to_foreach %0 tile_sizes [10, 20, 0]
		```
}];		}];

let arguments = (ins PDL_Operation:$target,		let arguments = (ins PDL_Operation:$target,
// TODO: dynamic number of threads.		// TODO: dynamic number of threads.
DefaultValuedAttr<I64ArrayAttr, "{}">:$num_threads,		OptionalAttr<DefaultValuedAttr<I64ArrayAttr, "{}">>:$num_threads,
		OptionalAttr<DefaultValuedAttr<I64ArrayAttr, "{}">>:$tile_sizes,
OptionalAttr<I64ArrayAttr>:$thread_dim_mapping);		OptionalAttr<I64ArrayAttr>:$thread_dim_mapping);
let results = (outs PDL_Operation:$foreach_thread_op,		let results = (outs PDL_Operation:$foreach_thread_op,
PDL_Operation:$tiled_op);		PDL_Operation:$tiled_op);

let assemblyFormat = [{		let assemblyFormat = [{
$target $num_threads (`(` `mapped` `to` `dims` $thread_dim_mapping^ `)`)?		$target (`num_threads` $num_threads^) : (`tile_sizes` $tile_sizes)?
		nicolasvasilache: Ah, very cool, I didn't realize the custom assembly format allows ternary expressions now.
attr-dict		(`(` `mapped` `to` `dims` $thread_dim_mapping^ `)`)? attr-dict
}];		}];

let extraClassDeclaration = [{		let extraClassDeclaration = [{
::mlir::DiagnosedSilenceableFailure applyToOne(		::mlir::DiagnosedSilenceableFailure applyToOne(
::mlir::TilingInterface target,		::mlir::TilingInterface target,
::llvm::SmallVectorImpl<::mlir::Operation *> &results,		::llvm::SmallVectorImpl<::mlir::Operation *> &results,
::mlir::transform::TransformState &state);		::mlir::transform::TransformState &state);
}];		}];
▲ Show 20 Lines • Show All 44 Lines • Show Last 20 Lines

mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h

	Show First 20 Lines • Show All 525 Lines • ▼ Show 20 Lines
	/// It is the user's responsibility to ensure that `numThreads` is a			/// It is the user's responsibility to ensure that `numThreads` is a
	/// valid tiling specification (i.e. that only tiles parallel			/// valid tiling specification (i.e. that only tiles parallel
	/// dimensions, e.g. in the Linalg case).			/// dimensions, e.g. in the Linalg case).
	struct ForeachThreadTilingResult {			struct ForeachThreadTilingResult {
	Operation *tileOp;			Operation *tileOp;
	Operation *tiledOp;			Operation *tiledOp;
	};			};
	FailureOr<ForeachThreadTilingResult>			FailureOr<ForeachThreadTilingResult>
	tileToForeachThreadOp(OpBuilder &builder, TilingInterface op,			tileToForeachThreadOp(RewriterBase &builder, TilingInterface op,
	ArrayRef<OpFoldResult> numThreads,			ArrayRef<OpFoldResult> numThreads,
	ArrayRef<int64_t> threadDimMapping = {});			ArrayRef<int64_t> threadDimMapping = {});

				/// Same as `tileToForeachThreadOp`, but calculate the number of threads
				/// required using the given tileSizes.
				FailureOr<ForeachThreadTilingResult>
				tileToForeachThreadOpUsingTileSizes(RewriterBase &builder, TilingInterface op,
				ArrayRef<OpFoldResult> tileSizes,
				ArrayRef<int64_t> threadDimMapping = {});

	/// All indices returned by IndexOp should be invariant with respect to tiling.			/// All indices returned by IndexOp should be invariant with respect to tiling.
	/// Therefore, if an operation is tiled, we have to transform the indices			/// Therefore, if an operation is tiled, we have to transform the indices
	/// accordingly, i.e. offset them by the values of the corresponding induction			/// accordingly, i.e. offset them by the values of the corresponding induction
	/// variables that are captured implicitly in the body of the op.			/// variables that are captured implicitly in the body of the op.
	///			///
	/// Example. `linalg.generic` before tiling:			/// Example. `linalg.generic` before tiling:
	///			///
	/// #id_2d = (i, j) -> (i, j)			/// #id_2d = (i, j) -> (i, j)
	▲ Show 20 Lines • Show All 1,085 Lines • Show Last 20 Lines

mlir/lib/Dialect/Affine/IR/AffineOps.cpp

Show First 20 Lines • Show All 715 Lines • ▼ Show 20 Lines	static void materializeConstants(OpBuilder &b, Location loc,
SmallVectorImpl<Value> &actualValues) {		SmallVectorImpl<Value> &actualValues) {
actualValues.reserve(values.size());		actualValues.reserve(values.size());
auto *dialect = b.getContext()->getLoadedDialect<AffineDialect>();		auto *dialect = b.getContext()->getLoadedDialect<AffineDialect>();
for (OpFoldResult ofr : values) {		for (OpFoldResult ofr : values) {
if (auto value = ofr.dyn_cast<Value>()) {		if (auto value = ofr.dyn_cast<Value>()) {
actualValues.push_back(value);		actualValues.push_back(value);
continue;		continue;
}		}
constants.push_back(dialect->materializeConstant(b, ofr.get<Attribute>(),		constants.push_back(dialect->materializeConstant(
		b, b.getIndexAttr(ofr.get<Attribute>().cast<IntegerAttr>().getInt()),
		christopherbate: Without this change an error will be produced because we give an OpFoldResult vector that comes directly from I64ArrayAttr that belongs to an op attribute. Explicitly convert to index type.
		nicolasvasilache: can you add this as a comment in the code to explain why the cast?
b.getIndexType(), loc));		b.getIndexType(), loc));
actualValues.push_back(constants.back()->getResult(0));		actualValues.push_back(constants.back()->getResult(0));
}		}
}		}

/// Create an operation of the type provided as template argument and attempt to		/// Create an operation of the type provided as template argument and attempt to
/// fold it immediately. The operation is expected to have a builder taking		/// fold it immediately. The operation is expected to have a builder taking
/// arbitrary `leadingArguments`, followed by a list of Value-typed `operands`.		/// arbitrary `leadingArguments`, followed by a list of Value-typed `operands`.
/// The operation is also expected to always produce a single result. Return an		/// The operation is also expected to always produce a single result. Return an
▲ Show 20 Lines • Show All 3,278 Lines • Show Last 20 Lines

mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp

	Show First 20 Lines • Show All 794 Lines • ▼ Show 20 Lines
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	DiagnosedSilenceableFailure transform::TileToForeachThreadOp::applyToOne(			DiagnosedSilenceableFailure transform::TileToForeachThreadOp::applyToOne(
	TilingInterface target, SmallVectorImpl<Operation *> &results,			TilingInterface target, SmallVectorImpl<Operation *> &results,
	transform::TransformState &state) {			transform::TransformState &state) {
	IRRewriter rewriter(getContext());			IRRewriter rewriter(getContext());
	rewriter.setInsertionPoint(target);			rewriter.setInsertionPoint(target);
	auto maybeThreadDimMappingAttr = getThreadDimMapping();			auto maybeThreadDimMappingAttr = getThreadDimMapping();
	FailureOr<ForeachThreadTilingResult> tilingResult =			auto dimMapping =
	linalg::tileToForeachThreadOp(			llvm::to_vector(maybeThreadDimMappingAttr
	rewriter, target, getAsOpFoldResult(getNumThreads()),
	maybeThreadDimMappingAttr
	? extractFromI64ArrayAttr(*maybeThreadDimMappingAttr)			? extractFromI64ArrayAttr(*maybeThreadDimMappingAttr)
	: ArrayRef<int64_t>{});			: ArrayRef<int64_t>{});

				FailureOr<ForeachThreadTilingResult> tilingResult = failure();
				if (Optional<ArrayAttr> numThreads = getNumThreads())
				tilingResult = linalg::tileToForeachThreadOp(
				rewriter, target, getAsOpFoldResult(*numThreads), dimMapping);

				if (Optional<ArrayAttr> tileSizes = getTileSizes())
				tilingResult = linalg::tileToForeachThreadOpUsingTileSizes(
				rewriter, target, getAsOpFoldResult(*tileSizes), dimMapping);

	if (failed(tilingResult))			if (failed(tilingResult))
	return emitDefaultSilenceableFailure(target);			return emitDefaultSilenceableFailure(target);
	rewriter.replaceOp(target, tilingResult->tileOp->getResults());			rewriter.replaceOp(target, tilingResult->tileOp->getResults());
	results.assign({tilingResult->tileOp, tilingResult->tiledOp});			results.assign({tilingResult->tileOp, tilingResult->tiledOp});
	return DiagnosedSilenceableFailure(success());			return DiagnosedSilenceableFailure(success());
	}			}

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	▲ Show 20 Lines • Show All 66 Lines • Show Last 20 Lines

mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp

//===- Tiling.cpp - Implementation of linalg Tiling -----------------------===//		//===- Tiling.cpp - Implementation of linalg Tiling -----------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This file implements the linalg dialect Tiling pass.		// This file implements the linalg dialect Tiling pass.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include <utility>		#include <utility>

#include "PassDetail.h"		#include "PassDetail.h"
		#include "mlir/Dialect/Arithmetic/Utils/Utils.h"
#include "mlir/Dialect/ControlFlow/IR/ControlFlowOps.h"		#include "mlir/Dialect/ControlFlow/IR/ControlFlowOps.h"
#include "mlir/Dialect/Linalg/IR/Linalg.h"		#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Linalg/Passes.h"		#include "mlir/Dialect/Linalg/Passes.h"
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"		#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
#include "mlir/Dialect/Linalg/Utils/Utils.h"		#include "mlir/Dialect/Linalg/Utils/Utils.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"		#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Dialect/SCF/Transforms/Transforms.h"		#include "mlir/Dialect/SCF/Transforms/Transforms.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"		#include "mlir/Dialect/Tensor/IR/Tensor.h"
▲ Show 20 Lines • Show All 153 Lines • ▼ Show 20 Lines	createMatchingParallelSubsetInsertOp(OpBuilder &b, Location loc,
tensor::ExtractSliceOp subsetExtractOp,		tensor::ExtractSliceOp subsetExtractOp,
Value source, Value dest) {		Value source, Value dest) {
b.create<tensor::ParallelInsertSliceOp>(		b.create<tensor::ParallelInsertSliceOp>(
loc, source, dest, subsetExtractOp.getMixedOffsets(),		loc, source, dest, subsetExtractOp.getMixedOffsets(),
subsetExtractOp.getMixedSizes(), subsetExtractOp.getMixedStrides());		subsetExtractOp.getMixedSizes(), subsetExtractOp.getMixedStrides());
}		}

/// Build an `affine_max` of all the `vals`.		/// Build an `affine_max` of all the `vals`.
static Value buildMax(OpBuilder &b, Location loc, ValueRange vals) {		static OpFoldResult buildMax(OpBuilder &b, Location loc,
		ArrayRef<OpFoldResult> vals) {
		SmallVector<Value> args = getValueOrCreateConstantIndexOp(b, loc, vals);
return b.createOrFold<AffineMaxOp>(		return b.createOrFold<AffineMaxOp>(
loc, AffineMap::getMultiDimIdentityMap(vals.size(), loc.getContext()),		loc, AffineMap::getMultiDimIdentityMap(vals.size(), loc.getContext()),
vals);		args);
}		}

/// Build an `affine_min` of all the `vals`.		/// Returns true if the maximum tile offset `tileSize` * `numThreads-1` less
static Value buildMin(OpBuilder &b, Location loc, ValueRange vals) {		/// than is `iterationSize`.
		nicolasvasilache: nit: typo/grammo.
return b.createOrFold<AffineMinOp>(		static bool canOmitTileOffsetInBoundsCheck(OpFoldResult tileSize,
loc, AffineMap::getMultiDimIdentityMap(vals.size(), loc.getContext()),		OpFoldResult numThreads,
vals);		OpFoldResult iterationSize) {
		Optional<int64_t> tileSizeConst = getConstantIntValue(tileSize);
		Optional<int64_t> numThreadsConst = getConstantIntValue(numThreads);
		Optional<int64_t> iterSizeConst = getConstantIntValue(iterationSize);
		if (!tileSizeConst || !numThreadsConst || !iterSizeConst)
		return false;
		return *tileSizeConst * (*numThreadsConst - 1) < *iterSizeConst;
}		}

FailureOr<ForeachThreadTilingResult>		static FailureOr<ForeachThreadTilingResult>
linalg::tileToForeachThreadOp(OpBuilder &b, TilingInterface op,		tileToForeachThreadOpImpl(RewriterBase &b, TilingInterface op,
ArrayRef<OpFoldResult> numThreads,		ArrayRef<OpFoldResult> numThreads,
		Optional<ArrayRef<OpFoldResult>> maxTileSizes,
ArrayRef<int64_t> threadDimMapping) {		ArrayRef<int64_t> threadDimMapping) {
Location loc = op->getLoc();		Location loc = op->getLoc();
OpBuilder::InsertionGuard g(b);		OpBuilder::InsertionGuard g(b);
SmallVector<Range> loopRanges = op.getIterationDomain(b);		SmallVector<Range> loopRanges = op.getIterationDomain(b);
if (loopRanges.empty())		if (loopRanges.empty())
return op->emitOpError("expected non-empty loop ranges");		return op->emitOpError("expected non-empty loop ranges");
auto hasStrideOne = [](Range r) { return !isConstantIntValue(r.stride, 1); };		auto hasStrideOne = [](Range r) { return !isConstantIntValue(r.stride, 1); };
if (llvm::any_of(loopRanges, hasStrideOne))		if (llvm::any_of(loopRanges, hasStrideOne))
return op->emitOpError("only stride-1 supported atm");		return op->emitOpError("only stride-1 supported atm");
Show All 9 Lines	tileToForeachThreadOpImpl(RewriterBase &b, TilingInterface op,
SmallVector<Value> materializedNonZeroNumThreads =		SmallVector<Value> materializedNonZeroNumThreads =
llvm::to_vector(llvm::map_range(nonZeroNumThreads, [&](OpFoldResult ofr) {		llvm::to_vector(llvm::map_range(nonZeroNumThreads, [&](OpFoldResult ofr) {
ImplicitLocOpBuilder ilocb(loc, b);		ImplicitLocOpBuilder ilocb(loc, b);
return materializeOpFoldResult(ilocb, ofr);		return materializeOpFoldResult(ilocb, ofr);
}));		}));

Value zero = b.create<arith::ConstantIndexOp>(loc, 0);		Value zero = b.create<arith::ConstantIndexOp>(loc, 0);
Operation *tiledOp = nullptr;		Operation *tiledOp = nullptr;
scf::ForeachThreadOp foreachThreadOp = b.create<scf::ForeachThreadOp>(		scf::ForeachThreadOp foreachThreadOp = b.create<scf::ForeachThreadOp>(
		christopherbate: You can't use RewriterBase in the body-creation lambda, so I moved to the non-lambda creation form and manually move the insertion point below.
		nicolasvasilache: Yes, this is annoying and I had the same issue recently. Another possibility is to explicitly cast the OpBuilder as a RewriterBase inside the lambda when you control the call site, you can do that here if you prefer (fine to leave as is). In any case, can you please add a comment explaining this?
loc, materializedNonZeroNumThreads, threadDimMapping,		loc, op->getResultTypes(), ValueRange(materializedNonZeroNumThreads),
[&](OpBuilder &b, Location loc, ValueRange threadIds) {		threadDimMapping);

		// Create the ForeachThreadOp body.
		b.setInsertionPointToStart(foreachThreadOp.getBody(0));
		ValueRange threadIds = foreachThreadOp.getThreadIndices();
int64_t nLoops = loopRanges.size();		int64_t nLoops = loopRanges.size();
SmallVector<OpFoldResult> tiledOffsets, tiledSizes;		SmallVector<OpFoldResult> tiledOffsets, tiledSizes;
tiledOffsets.reserve(nLoops);		tiledOffsets.reserve(nLoops);
tiledSizes.reserve(nLoops);		tiledSizes.reserve(nLoops);
for (unsigned loopIdx = 0, threadIdIdx = 0; loopIdx < nLoops;		for (unsigned loopIdx = 0, threadIdIdx = 0; loopIdx < nLoops; ++loopIdx) {
++loopIdx) {
bool overflow = loopIdx >= numThreads.size();		bool overflow = loopIdx >= numThreads.size();
bool isZero = !overflow && isConstantIntValue(numThreads[loopIdx], 0);		bool isZero = !overflow && isConstantIntValue(numThreads[loopIdx], 0);
// Degenerate case: take the whole domain.		// Degenerate case: take the whole domain.
if (overflow || isZero) {		if (overflow || isZero) {
tiledOffsets.push_back(loopRanges[loopIdx].offset);		tiledOffsets.push_back(loopRanges[loopIdx].offset);
tiledSizes.push_back(loopRanges[loopIdx].size);		tiledSizes.push_back(loopRanges[loopIdx].size);
continue;		continue;
}		}

// Tiled case: compute the offset and size.		// Tiled case: compute the offset and size.
AffineExpr i, j, M, N, O;		AffineExpr i, j, M, N, O;
bindDims(b.getContext(), i, j);		bindDims(b.getContext(), i, j);
bindSymbols(b.getContext(), M, N, O);		bindSymbols(b.getContext(), M, N, O);
Value size = loopRanges[loopIdx].size;		Value size = loopRanges[loopIdx].size;
Value offset = loopRanges[loopIdx].offset;		Value offset = loopRanges[loopIdx].offset;
Value threadId = threadIds[threadIdIdx];		Value threadId = threadIds[threadIdIdx];
// TODO: more aggressive foldings.		// TODO: more aggressive foldings.
// Symbolic fixed max size per thread.		// Symbolic fixed max size per thread.
// TODO: floor + 0/1 depending on case for better load-balancing.		// TODO: floor + 0/1 depending on case for better load-balancing.
Value maxSizePerThread = b.createOrFold<AffineApplyOp>(		OpFoldResult tileSizePerThread =
loc, M.ceilDiv(N),		maxTileSizes.hasValue()
ValueRange{size, materializedNonZeroNumThreads[threadIdIdx]});		? (*maxTileSizes)[loopIdx]
		: makeComposedFoldedAffineApply(
		b, loc, M.ceilDiv(N),
		ArrayRef<OpFoldResult>{size, nonZeroNumThreads[threadIdIdx]});

// Dynamic offset shifted by threadId * maxSizePerThread.		// Dynamic offset shifted by threadId * maxSizePerThread.
Value offsetPerThread = b.createOrFold<AffineApplyOp>(		OpFoldResult offsetPerThread = makeComposedFoldedAffineApply(
loc, i + j * M, ValueRange{offset, threadId, maxSizePerThread});		b, loc, i + j * M, {offset, threadId, tileSizePerThread});
// Dynamic upper-bound depending on the threadId.		// Dynamic upper-bound depending on the threadId.
Value sizeMinusOffsetPerThread = b.createOrFold<AffineApplyOp>(		OpFoldResult maxOffsetPerThread = makeComposedFoldedAffineApply(
loc, -i + M, ValueRange{offsetPerThread, size});		b, loc, i + j * M - N,
Value tileSizePerThread = buildMin(		{offset, nonZeroNumThreads[threadIdIdx], tileSizePerThread, size});
b, loc, ValueRange{sizeMinusOffsetPerThread, maxSizePerThread});		if (!isConstantIntValue(maxOffsetPerThread, 0)) {
		OpFoldResult sizeMinusOffsetPerThread = makeComposedFoldedAffineApply(
		b, loc, -i + M, {offsetPerThread, size});
		tileSizePerThread = makeComposedFoldedAffineMin(
		b, loc, AffineMap::getMultiDimIdentityMap(2, b.getContext()),
		ArrayRef<OpFoldResult>{sizeMinusOffsetPerThread, tileSizePerThread});
		}

tiledOffsets.push_back(offsetPerThread);		tiledOffsets.push_back(offsetPerThread);
// TODO: if tileSizePerThread <= 0 early exit.		// TODO: if tileSizePerThread <= 0 early exit.
tiledSizes.push_back(		if (!canOmitTileOffsetInBoundsCheck(tileSizePerThread,
		nicolasvasilache: Making this discovery more powerful is going to be painful with SSA values. OTOH we know by construction in the "tile_size case" that we don't need the max. I would just add a bool passed at the call site, true for the "tile_size case" and false for the "num_threads case". Then this discovery can refine the "num_threads case".
buildMax(b, loc, ValueRange{zero, tileSizePerThread}));		nonZeroNumThreads[threadIdIdx], size)) {
		tileSizePerThread = buildMax(b, loc, {zero, tileSizePerThread});
		}

		tiledSizes.push_back(tileSizePerThread);
++threadIdIdx;		++threadIdIdx;
}		}

SmallVector<Operation *> tiledOps =		SmallVector<Operation *> tiledOps =
op.getTiledImplementation(b, destOperands, tiledOffsets, tiledSizes,		op.getTiledImplementation(b, destOperands, tiledOffsets, tiledSizes,
/*tileDestOperands=*/true);		/*tileDestOperands=*/true);
assert(tiledOps.size() == 1 && "expected a single produced tiled op");		assert(tiledOps.size() == 1 && "expected a single produced tiled op");
tiledOp = tiledOps.front();		tiledOp = tiledOps.front();

auto tilingInterfaceOp = dyn_cast<TilingInterface>(tiledOp);		auto tilingInterfaceOp = dyn_cast<TilingInterface>(tiledOp);
assert(tilingInterfaceOp &&		assert(tilingInterfaceOp && "Tiled op does not implement TilingInterface");
"Tiled op does not implement TilingInterface");

auto tiledDestOperands = tilingInterfaceOp.getDestinationOperands(b);		auto tiledDestOperands = tilingInterfaceOp.getDestinationOperands(b);

// Create terminator with parallel subset insert operations.		// Create terminator with parallel subset insert operations.
auto performConcurrentlyOp = b.create<scf::PerformConcurrentlyOp>(loc);		b.setInsertionPointToStart(foreachThreadOp.getTerminator().getBody());
OpBuilder::InsertionGuard g(b);		for (auto it : llvm::zip(tiledDestOperands, tilingInterfaceOp->getResults(),
b.setInsertionPointToStart(performConcurrentlyOp.getBody());
for (auto it :
llvm::zip(tiledDestOperands, tilingInterfaceOp->getResults(),
destOperands)) {		destOperands)) {
createMatchingParallelSubsetInsertOp(		createMatchingParallelSubsetInsertOp(
b, loc,		b, loc, cast<tensor::ExtractSliceOp>(std::get<0>(it).getDefiningOp()),
cast<tensor::ExtractSliceOp>(std::get<0>(it).getDefiningOp()),
std::get<1>(it), std::get<2>(it));		std::get<1>(it), std::get<2>(it));
}		}
});
return ForeachThreadTilingResult{foreachThreadOp, tiledOp};		return ForeachThreadTilingResult{foreachThreadOp, tiledOp};
}		}

		FailureOr<ForeachThreadTilingResult>
		linalg::tileToForeachThreadOp(RewriterBase &b, TilingInterface op,
		ArrayRef<OpFoldResult> numThreads,
		ArrayRef<int64_t> threadDimMapping) {
		return tileToForeachThreadOpImpl(b, op, numThreads, /*maxTileSizes=*/None,
		threadDimMapping);
		}

		FailureOr<ForeachThreadTilingResult>
		linalg::tileToForeachThreadOpUsingTileSizes(
		RewriterBase &b, TilingInterface op, ArrayRef<OpFoldResult> tileSizes,
		ArrayRef<int64_t> threadDimMapping) {
		SmallVector<Range> loopRanges = op.getIterationDomain(b);
		unsigned nLoops = loopRanges.size();
		SmallVector<OpFoldResult> numThreads;
		numThreads.reserve(nLoops);
		AffineExpr s0, s1;
		bindSymbols(b.getContext(), s0, s1);
		AffineExpr divExpr = s0.ceilDiv(s1);
		for (const auto &it : llvm::zip(tileSizes, loopRanges)) {
		OpFoldResult numTiles = std::get<0>(it);
		if (!isConstantIntValue(numTiles, 0))
		numTiles = makeComposedFoldedAffineApply(
		b, op.getLoc(), divExpr, {std::get<1>(it).size, std::get<0>(it)});
		numThreads.push_back(numTiles);
		}
		return tileToForeachThreadOpImpl(b, op, numThreads,
		/*maxTileSizes=*/tileSizes,
		threadDimMapping);
		}

// Insert a tile `source` into the destination tensor `dest`. The position at		// Insert a tile `source` into the destination tensor `dest`. The position at
// which the tile is inserted (as well as size of tile) is taken from a given		// which the tile is inserted (as well as size of tile) is taken from a given
// ExtractSliceOp `sliceOp`.		// ExtractSliceOp `sliceOp`.
static Value insertSliceIntoTensor(RewriterBase &b, Location loc,		static Value insertSliceIntoTensor(RewriterBase &b, Location loc,
tensor::ExtractSliceOp sliceOp, Value source,		tensor::ExtractSliceOp sliceOp, Value source,
Value dest) {		Value dest) {
return b.create<tensor::InsertSliceOp>(		return b.create<tensor::InsertSliceOp>(
loc, sliceOp.getSource().getType(), source, dest, sliceOp.getOffsets(),		loc, sliceOp.getSource().getType(), source, dest, sliceOp.getOffsets(),
(384 lines not shown)

mlir/test/Dialect/Linalg/tile-to-foreach-thread.mlir

// RUN: mlir-opt %s --test-transform-dialect-interpreter -canonicalize | FileCheck %s		// RUN: mlir-opt %s --test-transform-dialect-interpreter -canonicalize -split-input-file | FileCheck %s

// Offset per thread:		// Offset per thread:
// CHECK-DAG: affine_map<(d0)[s0] -> (d0 * (s0 ceildiv 10))>		// CHECK-DAG: affine_map<(d0)[s0] -> (d0 * (s0 ceildiv 10))>
// Per thread tile size.		// Per thread tile size.
// CHECK-DAG: affine_map<(d0)[s0] -> (-(d0 * (s0 ceildiv 10)) + s0, s0 ceildiv 10)>		// CHECK-DAG: affine_map<(d0)[s0] -> (-(d0 * (s0 ceildiv 10)) + s0, s0 ceildiv 10)>
// CHECK-DAG: affine_map<(d0)[s0] -> (d0 * (s0 ceildiv 20))>		// CHECK-DAG: affine_map<(d0)[s0] -> (d0 * (s0 ceildiv 20))>
// CHECK-DAG: affine_map<(d0)[s0] -> (-(d0 * (s0 ceildiv 20)) + s0, s0 ceildiv 20)>		// CHECK-DAG: affine_map<(d0)[s0] -> (-(d0 * (s0 ceildiv 20)) + s0, s0 ceildiv 20)>

(28 unchanged lines not shown)
pdl.pattern @match_linalg_matmul : benefit(1) {		pdl.pattern @match_linalg_matmul : benefit(1) {
%0 = operands		%0 = operands
%1 = types		%1 = types
%2 = operation "linalg.matmul"(%0 : !pdl.range<value>) -> (%1 : !pdl.range<type>)		%2 = operation "linalg.matmul"(%0 : !pdl.range<value>) -> (%1 : !pdl.range<type>)
rewrite %2 with "transform.dialect"		rewrite %2 with "transform.dialect"
}		}
transform.sequence %arg0 {		transform.sequence %arg0 {
^bb1(%arg1: !pdl.operation):		^bb1(%arg1: !pdl.operation):
%0 = pdl_match @match_linalg_matmul in %arg1		%0 = pdl_match @match_linalg_matmul in %arg1
%1:2 = transform.structured.tile_to_foreach_thread_op %0 [10, 20] (mapped to dims [1, 0])		%1:2 = transform.structured.tile_to_foreach_thread_op %0 num_threads [10, 20] (mapped to dims [1, 0])
}		}
}		}
}		}

		// -----

		// Tests that dimension 0 can eliminate affine.min/max, dimension 1 cannot.
		nicolasvasilache: Nice test case!

		// CHECK-DAG: #[[$map0:.+]] = affine_map<(d0) -> (d0 * -15 + 300, 15)>
		// CHECK-DAG: #[[$map1:.+]] = affine_map<(d0) -> (0, d0)>
		// CHECK-DAG: #[[$map2:.+]] = affine_map<(d0) -> (d0 * 10)>
		// CHECK-DAG: #[[$map3:.+]] = affine_map<(d0) -> (d0 * 15)>

		// CHECK-LABEL: matmul_static(
		// CHECK-SAME: %[[A:[0-9a-z]+]]: tensor
		// CHECK-SAME: %[[B:[0-9a-z]+]]: tensor
		// CHECK-SAME: %[[C:[0-9a-z]+]]: tensor
		func.func @matmul_static(%A: tensor<100x200xf32>, %B: tensor<200x300xf32>, %C: tensor<100x300xf32>) -> tensor<100x300xf32> {
		// CHECK-DAG: %[[c10:.+]] = arith.constant 10 : index
		// CHECK-DAG: %[[c21:.+]] = arith.constant 21 : index
		// CHECK: scf.foreach_thread (%[[IV0:.+]], %[[IV1:.+]]) in (%[[c10]], %[[c21]])
		// CHECK: %[[TSMIN:.+]] = affine.min #[[$map0]](%[[IV1]])
		// CHECK: %[[TS:.+]] = affine.max #[[$map1]](%[[TSMIN]])
		// CHECK-NOT: affine.min
		// CHECK-NOT: affine.max
		// CHECK: %[[LB0:.+]] = affine.apply #[[$map2]](%[[IV0]])
		// CHECK: %[[tA:.+]] = tensor.extract_slice %[[A]][%[[LB0]], 0] [10, 200] [1, 1] :
		// CHECK: %[[LB1:.+]] = affine.apply #[[$map3]](%[[IV1]])
		// CHECK: %[[tB:.+]] = tensor.extract_slice %[[B]][0, %[[LB1]]] [200, %[[TS]]] [1, 1] :
		// CHECK: %[[LB0:.+]] = affine.apply #[[$map2]](%[[IV0]])
		// CHECK: %[[LB1:.+]] = affine.apply #[[$map3]](%[[IV1]])
		// CHECK: %[[tC:.+]] = tensor.extract_slice %[[C]][%[[LB0]], %[[LB1]]] [10, %[[TS]]] [1, 1] :
		// CHECK: linalg.matmul
		// CHECK: scf.foreach_thread.perform_concurrently
		// CHECK-NEXT: tensor.parallel_insert_slice
		%0 = linalg.matmul ins(%A, %B : tensor<100x200xf32>, tensor<200x300xf32>)
		outs(%C : tensor<100x300xf32>) -> (tensor<100x300xf32>)
		return %0 : tensor<100x300xf32>
		}

		transform.with_pdl_patterns {
		^bb0(%arg0: !pdl.operation):
		pdl.pattern @match_linalg_matmul : benefit(1) {
		%0 = operands
		%1 = types
		%2 = operation "linalg.matmul"(%0 : !pdl.range<value>) -> (%1 : !pdl.range<type>)
		rewrite %2 with "transform.dialect"
		}
		transform.sequence %arg0 {
		^bb1(%arg1: !pdl.operation):
		%0 = pdl_match @match_linalg_matmul in %arg1
		%1:2 = transform.structured.tile_to_foreach_thread_op %0 num_threads [10, 21]
		}
		}


		// -----

		// CHECK-DAG: #[[$map0:.+]] = affine_map<()[s0] -> (s0 ceildiv 10)>
		// CHECK-DAG: #[[$map1:.+]] = affine_map<()[s0] -> (s0 ceildiv 20)>
		// CHECK-DAG: #[[$map2:.+]] = affine_map<(d0)[s0] -> (d0 * -10 + s0, 10)>
		// CHECK-DAG: #[[$map3:.+]] = affine_map<(d0) -> (0, d0)>
		// CHECK-DAG: #[[$map4:.+]] = affine_map<(d0)[s0] -> (d0 * -20 + s0, 20)>
		// CHECK-DAG: #[[$map5:.+]] = affine_map<(d0) -> (d0 * 10)>
		// CHECK-DAG: #[[$map6:.+]] = affine_map<(d0) -> (d0 * 20)>

		// CHECK-LABEL: matmul_tile_size_dynamic(
		// CHECK-SAME: %[[A:[0-9a-z]+]]: tensor<?x?xf32>
		// CHECK-SAME: %[[B:[0-9a-z]+]]: tensor<?x?xf32>
		// CHECK-SAME: %[[C:[0-9a-z]+]]: tensor<?x?xf32>
		func.func @matmul_tile_size_dynamic(%A: tensor<?x?xf32>, %B: tensor<?x?xf32>, %C: tensor<?x?xf32>) -> tensor<?x?xf32> {
		// CHECK: %[[M:.+]] = tensor.dim %[[A]], %c0 :
		// CHECK: %[[N:.+]] = tensor.dim %[[B]], %c1 :
		// CHECK: %[[NT0:.+]] = affine.apply #map0()[%[[M]]]
		// CHECK: %[[NT1:.+]] = affine.apply #map1()[%[[N]]]
		// CHECK: %[[M:.+]] = tensor.dim %[[A]], %c0 :
		// CHECK: %[[N:.+]] = tensor.dim %[[B]], %c1 :
		// CHECK: scf.foreach_thread (%[[IV0:.+]], %[[IV1:.+]]) in (%[[NT0]], %[[NT1]])
		// CHECK: %[[TSMin:.+]] = affine.min #[[$map2]](%[[IV0]])[%[[M]]]
		// CHECK: %[[TS0:.+]] = affine.max #[[$map3]](%[[TSMin]])
		// CHECK: %[[TSMin:.+]] = affine.min #[[$map4]](%[[IV1]])[%[[N]]]
		// CHECK: %[[TS1:.+]] = affine.max #[[$map3]](%[[TSMin]])
		// CHECK: %[[LB0:.+]] = affine.apply #[[$map5]](%[[IV0]])
		// CHECK tensor.extract_slice %[[A]]
		// CHECK: %[[LB1:.+]] = affine.apply #[[$map6]](%[[IV1]])
		// CHECK tensor.extract_slice %[[B]]
		// CHECK: %[[LB0:.+]] = affine.apply #[[$map5]](%[[IV0]])
		// CHECK: %[[LB1:.+]] = affine.apply #[[$map6]](%[[IV1]])
		// CHECK tensor.extract_slice %[[C]]
		// CHECK: linalg.matmul
		// CHECK: scf.foreach_thread.perform_concurrently
		// CHECK-NEXT: tensor.parallel_insert_slice
		%0 = linalg.matmul ins(%A, %B : tensor<?x?xf32>, tensor<?x?xf32>)
		outs(%C : tensor<?x?xf32>) -> (tensor<?x?xf32>)
		return %0 : tensor<?x?xf32>
		}

		transform.with_pdl_patterns {
		^bb0(%arg0: !pdl.operation):
		pdl.pattern @match_linalg_matmul : benefit(1) {
		%0 = operands
		%1 = types
		%2 = operation "linalg.matmul"(%0 : !pdl.range<value>) -> (%1 : !pdl.range<type>)
		rewrite %2 with "transform.dialect"
		}
		transform.sequence %arg0 {
		^bb1(%arg1: !pdl.operation):
		%0 = pdl_match @match_linalg_matmul in %arg1
		%1:2 = transform.structured.tile_to_foreach_thread_op %0 tile_sizes [10, 20]
		}
		}

		// -----

		// Tests that dimension 0 can eliminate affine.min/max, dimension 1 cannot.

		// CHECK-DAG: #[[$map0:.+]] = affine_map<(d0) -> (d0 * -21 + 300, 21)>
		// CHECK-DAG: #[[$map1:.+]] = affine_map<(d0) -> (0, d0)>
		// CHECK-DAG: #[[$map2:.+]] = affine_map<(d0) -> (d0 * 10)>
		// CHECK-DAG: #[[$map3:.+]] = affine_map<(d0) -> (d0 * 21)>

		// CHECK-LABEL: matmul_tile_size_static(
		// CHECK-SAME: %[[A:[0-9a-z]+]]: tensor
		// CHECK-SAME: %[[B:[0-9a-z]+]]: tensor
		// CHECK-SAME: %[[C:[0-9a-z]+]]: tensor
		func.func @matmul_tile_size_static(%A: tensor<100x200xf32>, %B: tensor<200x300xf32>, %C: tensor<100x300xf32>) -> tensor<100x300xf32> {
		// CHECK-DAG: %[[c10:.+]] = arith.constant 10 :
		// CHECK-DAG: %[[c15:.+]] = arith.constant 15 :
		// CHECK: scf.foreach_thread (%[[IV0:.+]], %[[IV1:.+]]) in (%[[c10]], %[[c15]])
		// CHECK: %[[TSMIN:.+]] = affine.min #[[$map0]](%[[IV1]])
		// CHECK: %[[TS:.+]] = affine.max #[[$map1]](%[[TSMIN]])
		// CHECK-NOT: affine.min
		// CHECK-NOT: affine.max
		// CHECK: %[[LB0:.+]] = affine.apply #[[$map2]](%[[IV0]])
		// CHECK: %[[tA:.+]] = tensor.extract_slice %[[A]][%[[LB0]], 0] [10, 200] [1, 1] :
		// CHECK: %[[LB1:.+]] = affine.apply #[[$map3]](%[[IV1]])
		// CHECK: %[[tB:.+]] = tensor.extract_slice %[[B]][0, %[[LB1]]] [200, %[[TS]]] [1, 1] :
		// CHECK: %[[LB0:.+]] = affine.apply #[[$map2]](%[[IV0]])
		// CHECK: %[[LB1:.+]] = affine.apply #[[$map3]](%[[IV1]])
		// CHECK: %[[tC:.+]] = tensor.extract_slice %[[C]][%[[LB0]], %[[LB1]]] [10, %[[TS]]] [1, 1] :
		// CHECK: linalg.matmul
		// CHECK: scf.foreach_thread.perform_concurrently
		// CHECK-NEXT: tensor.parallel_insert_slice
		%0 = linalg.matmul ins(%A, %B : tensor<100x200xf32>, tensor<200x300xf32>)
		outs(%C : tensor<100x300xf32>) -> (tensor<100x300xf32>)
		return %0 : tensor<100x300xf32>
		}

		transform.with_pdl_patterns {
		^bb0(%arg0: !pdl.operation):
		pdl.pattern @match_linalg_matmul : benefit(1) {
		%0 = operands
		%1 = types
		%2 = operation "linalg.matmul"(%0 : !pdl.range<value>) -> (%1 : !pdl.range<type>)
		rewrite %2 with "transform.dialect"
		}
		transform.sequence %arg0 {
		^bb1(%arg1: !pdl.operation):
		%0 = pdl_match @match_linalg_matmul in %arg1
		%1:2 = transform.structured.tile_to_foreach_thread_op %0 tile_sizes [10, 21]
		}
		}
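
The static test cases can be checked by hand. The sketch below assumes that the per-thread tile size is the composition of the `affine.min` and `affine.max` maps in the CHECK lines, that is `max(0, min(size - tile * iv, tile))`. Under that assumption it shows why dimension 0 would be expected to fold to a constant while dimension 1 would not. `tileSizeForThread` and `allTilesFull` are hypothetical helpers written for this illustration, not MLIR APIs.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: number of elements handled by thread `iv` along one
// dimension, computed as max(0, min(size - tile * iv, tile)).
int64_t tileSizeForThread(int64_t size, int64_t tile, int64_t iv) {
  const int64_t remaining = size - tile * iv;
  return std::max<int64_t>(0, std::min<int64_t>(remaining, tile));
}

// Hypothetical helper: true when every thread in [0, numThreads) receives a
// full tile, in which case the min/max bound is redundant.
bool allTilesFull(int64_t size, int64_t tile, int64_t numThreads) {
  for (int64_t iv = 0; iv < numThreads; ++iv)
    if (tileSizeForThread(size, tile, iv) != tile)
      return false;
  return true;
}
```

With size 100 and tile 10, all 10 threads get a full tile, so the bound can be dropped. With size 300 and tile 21, the last of the 15 threads gets only 6 elements, so the `affine.min`/`affine.max` pair has to stay. The `num_threads [10, 21]` case behaves similarly: the derived tile size is 15, and thread 20 starts at offset 300 and receives 0 elements.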

Revision Contents

Diff 446019

mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td

mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h

mlir/lib/Dialect/Affine/IR/AffineOps.cpp

mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp

mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp

mlir/test/Dialect/Linalg/tile-to-foreach-thread.mlir
