This is an archive of the discontinued LLVM Phabricator instance.

Differential D158478

[mlir][linalg] Enable parallel partial reduction tiling with multiple dims
Accepted · Public

Authored by qedawkins on Aug 21 2023, 7:56 PM.

Details

Reviewers

nicolasvasilache
antiagainst
mravishankar
Groverkss

Summary

This extends transform.structured.tile_reduction_using_forall to
operations with multiple reduction dimensions as implied by the thread
counts. This enables reduction splitting strategies for operations with
higher dimensionality.
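
To make the summary concrete, here is a plain C++ sketch (not MLIR, and not code from this revision) of what a parallel partial reduction over two reduction dimensions computes. The names `Problem`, `makeExample`, `reduceDirect`, and `reduceSplit` are hypothetical, and the sketch assumes the thread counts divide the reduction extents evenly. Each `(tj, tk)` pair plays the role of one `scf.forall` iteration writing its own slot of a partial-result buffer, and a final merge reduces over the two thread dimensions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical problem: out[i] = sum over j, k of a[j][k] * b[i][j][k],
// i.e. iterator types ["parallel", "reduction", "reduction"].
struct Problem {
  int I, J, K;
  std::vector<int64_t> a; // J * K elements, row-major
  std::vector<int64_t> b; // I * J * K elements, row-major
};

// Deterministic integer data so the two strategies can be compared exactly.
Problem makeExample(int I, int J, int K) {
  Problem p{I, J, K, std::vector<int64_t>(J * K),
            std::vector<int64_t>(I * J * K)};
  for (int n = 0; n < J * K; ++n)
    p.a[n] = n % 7 - 3;
  for (int n = 0; n < I * J * K; ++n)
    p.b[n] = n % 5 + 1;
  return p;
}

// Reference: reduce both reduction dimensions in one pass.
std::vector<int64_t> reduceDirect(const Problem &p) {
  std::vector<int64_t> out(p.I); // zero-initialized
  for (int i = 0; i < p.I; ++i)
    for (int j = 0; j < p.J; ++j)
      for (int k = 0; k < p.K; ++k)
        out[i] += p.a[j * p.K + k] * p.b[(i * p.J + j) * p.K + k];
  return out;
}

// Split both reduction dimensions across threadsJ x threadsK workers.
std::vector<int64_t> reduceSplit(const Problem &p, int threadsJ, int threadsK) {
  assert(threadsJ > 0 && threadsK > 0);
  // Assumption of this sketch: thread counts divide the extents evenly.
  assert(p.J % threadsJ == 0 && p.K % threadsK == 0);
  const int tileJ = p.J / threadsJ;
  const int tileK = p.K / threadsK;

  // Partial results of shape I x threadsJ x threadsK, filled with the
  // additive identity (zero).
  std::vector<int64_t> partial(p.I * threadsJ * threadsK);

  // Each (tj, tk) pair is independent and writes only its own slots.
  for (int tj = 0; tj < threadsJ; ++tj)
    for (int tk = 0; tk < threadsK; ++tk)
      for (int i = 0; i < p.I; ++i)
        for (int j = tj * tileJ; j < (tj + 1) * tileJ; ++j)
          for (int k = tk * tileK; k < (tk + 1) * tileK; ++k)
            partial[(i * threadsJ + tj) * threadsK + tk] +=
                p.a[j * p.K + k] * p.b[(i * p.J + j) * p.K + k];

  // Merge step: reduce the two thread dimensions back to the output shape.
  std::vector<int64_t> out(p.I);
  for (int i = 0; i < p.I; ++i)
    for (int tj = 0; tj < threadsJ; ++tj)
      for (int tk = 0; tk < threadsK; ++tk)
        out[i] += partial[(i * threadsJ + tj) * threadsK + tk];
  return out;
}
```

With `makeExample(4, 32, 128)` and thread counts 2 and 4, the partial buffer has the same `4x2x4` shape as the `linalg.fill` result checked in the second new test, and both functions return the same vector.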

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

qedawkins created this revision. Aug 21 2023, 7:56 PM

Herald added a project: Restricted Project. Aug 21 2023, 7:56 PM

Herald added subscribers: bviyer, Moerafaat, bzcheeseman and 24 others.

qedawkins requested review of this revision. Aug 21 2023, 7:56 PM

Herald added a reviewer: nicolasvasilache. Aug 21 2023, 7:56 PM

Herald added a project: Restricted Project.

Herald added subscribers: limo1996, stephenneuendorffer, nicolasvasilache.

qedawkins added reviewers: antiagainst, mravishankar. Aug 21 2023, 7:57 PM

Harbormaster completed remote builds in B253971: Diff 552192. Aug 21 2023, 8:11 PM

mravishankar added a reviewer: Groverkss. Aug 23 2023, 1:59 PM

Groverkss added inline comments. Aug 23 2023, 10:13 PM

mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
663	You can use .empty()
713	Why not use an ArrayRef?
725	nit: Don't use auto here.
731–736	It's not very clear what exactly is happening here. Could you add more explanation?
809	You can use `indVars` here.

Address comments

qedawkins marked 4 inline comments as done. Aug 23 2023, 10:46 PM

qedawkins added inline comments.

mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
731–736	Let me know if the explanation here makes sense.
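
The slicing discussed at lines 731–736 can be illustrated with a small standalone sketch. It is not the MLIR code: `Slice` and `initOperandSlice` are hypothetical names, thread ids are concrete integers rather than SSA values, and the sketch assumes (as in both new tests) that the partial tensor has one dimension per `num_threads` entry. Tiled reduction dimensions get size 1 at the offset of the owning thread id, and every other dimension keeps its full extent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical plain-data stand-in for the offsets and sizes passed to
// tensor.extract_slice.
struct Slice {
  std::vector<int64_t> offsets;
  std::vector<int64_t> sizes;
};

// partialShape: shape of the forall's shared output (the partial tensor).
// tiledReductionDims[n]: a reduction dimension that is split across threads.
// reductionInductionVarIndices[n]: which forall induction variable owns it.
// threadIds: concrete values of the forall induction variables.
Slice initOperandSlice(const std::vector<int64_t> &partialShape,
                       const std::vector<int> &tiledReductionDims,
                       const std::vector<int> &reductionInductionVarIndices,
                       const std::vector<int64_t> &threadIds) {
  assert(tiledReductionDims.size() == reductionInductionVarIndices.size());
  Slice s;
  // Start from "take everything": zero offsets, full sizes.
  s.offsets.assign(partialShape.size(), 0);
  s.sizes = partialShape;
  for (std::size_t n = 0; n < tiledReductionDims.size(); ++n) {
    const int redDim = tiledReductionDims[n];
    const int indVarIdx = reductionInductionVarIndices[n];
    assert(redDim >= 0 &&
           static_cast<std::size_t>(redDim) < partialShape.size());
    assert(indVarIdx >= 0 &&
           static_cast<std::size_t>(indVarIdx) < threadIds.size());
    // Along a tiled reduction dimension each thread owns exactly one slot.
    s.sizes[redDim] = 1;
    s.offsets[redDim] = threadIds[indVarIdx];
  }
  return s;
}
```

For `num_threads = [4, 2]` and thread ids `(z, y)`, this yields offsets `[0, y]` and sizes `[4, 1]`, which matches the first `tensor.extract_slice` checked in the new test.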

Harbormaster completed remote builds in B254544: Diff 552995. Aug 23 2023, 11:19 PM

@Groverkss gentle bump on the review here when you have time.

LGTM

This revision is now accepted and ready to land. Aug 29 2023, 8:59 AM

nicolasvasilache added inline comments. Aug 29 2023, 9:10 AM

mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
652	Can we write this with LinalgOp::getReductionDims and a follow-up filter? It is unclear to me whether this custom logic has something load-bearing offhand.

qedawkins added inline comments. Aug 29 2023, 9:14 AM

mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
652	It's both getting the reduction dims and also identifying which thread counts in the `scf.forall` correspond to reduction dimensions. I can do this but the logic here will look quite similar. I'll also add a comment.
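
The bookkeeping described in this reply can be sketched in standalone C++. This only mirrors the loop in the diff with plain data: `IteratorKind`, `TiledReductionInfo`, and `collectTiledReductionDims` are hypothetical names, and thread counts are plain integers rather than `OpFoldResult`s. A reduction dimension is recorded only when its thread count is nonzero, together with its position among the nonzero thread counts, which is its index among the `scf.forall` induction variables.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the loop iterator types of the op being tiled.
enum class IteratorKind { Parallel, Reduction };

struct TiledReductionInfo {
  // Loop dimensions that are reductions and have a nonzero thread count.
  std::vector<int> tiledReductionDims;
  // For each such dimension, its position among the forall induction vars.
  std::vector<int> reductionInductionVarIndices;
};

// A thread count of 0, or a missing trailing entry, means "do not tile".
TiledReductionInfo
collectTiledReductionDims(const std::vector<IteratorKind> &iterators,
                          const std::vector<int64_t> &numThreads) {
  // Assumption of this sketch: never more thread counts than loops.
  assert(numThreads.size() <= iterators.size());
  TiledReductionInfo info;
  int nonZeroIdx = 0; // counts nonzero thread counts seen so far
  for (std::size_t idx = 0; idx < iterators.size(); ++idx) {
    const bool tiled = idx < numThreads.size() && numThreads[idx] != 0;
    if (tiled && iterators[idx] == IteratorKind::Reduction) {
      info.tiledReductionDims.push_back(static_cast<int>(idx));
      info.reductionInductionVarIndices.push_back(nonZeroIdx);
    }
    // Tiled parallel dimensions also consume an induction variable.
    nonZeroIdx += tiled ? 1 : 0;
  }
  return info;
}
```

For iterator types `[parallel, reduction, reduction]`, thread counts `[4, 2]` give tiled reduction dims `{1}` owned by induction variable `1`, while `[0, 2, 4]` gives dims `{1, 2}` owned by induction variables `{0, 1}`. A version built on `getReductionDims` plus a filter would still need the running count of nonzero entries, which is why the author expects the logic to look similar.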

thanks for generalizing this transform!

mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp
725	Can we extract this in a meaningfully named helper function? With one single reduction dimension the intent was clear but now the nesting is deeper than I'd like (in an already too long function).
mlir/test/Dialect/Linalg/transform-tile-reduction.mlir
407	it is weird to me that you only need to specify 2 entries in num_threads here, I would have expected you'd need `[0, 4, 2]` (like in your second test below). Are / were we somehow too permissive in the specification? Would be good to tighten the verifier to force alignment of number of dimensions on the rank of the linalg op when appropriate. It would be good to also have a `[red, par, red]` test for the "interleaved parallel" case.

qedawkins added inline comments. Aug 30 2023, 7:26 AM

mlir/test/Dialect/Linalg/transform-tile-reduction.mlir
407	This is intentional, as I'm trying to tile parallel dimensions as well as reductions here. As far as I could tell, this was never explicitly prohibited by the pattern and I find it convenient to be able to tile both at the same time (and otherwise avoid nested foralls which interact poorly with distribution later on). The interleaved parallel case is a good idea though, will add a test for it. In terms of forcing rank to align, unlike the scf.for version of this pattern, additional tile sizes require corresponding entries in the mapping which restricts the mapping options for distribution. For example, now I need to distribute explicitly along `gpu.thread<x>` in addition to `gpu.thread<z>` and `gpu.thread<y>` if I want to tile the parallel and first reduction dimensions only, and adding more dimensions requires going to linearized thread indices which don't work well when we are intentionally avoiding distribution along a specific dimension (e.g. `x` for later use with warp distribution patterns).

nicolasvasilache added inline comments. Sep 6 2023, 8:02 AM

mlir/test/Dialect/Linalg/transform-tile-reduction.mlir
407	Re tiling parallel and reduction at once, this is a great idea indeed, thanks for pushing on this. I was thinking that we could insert a `<none>` or `<seq>` mapping kind to allow us to skip or lower to loops but if this is too tedious for now let's table it.
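
To illustrate the `[4, 2]` versus `[0, 2, 4]` discussion above, here is a hedged standalone sketch of how the thread counts relate to the rank of the `scf.forall` and to the shape of the partial-result tensor. The shape rule is an assumption inferred from the two CHECK'd `linalg.fill` shapes in this revision (`tensor<4x2xf32>` and `tensor<4x2x4xf32>`), not taken from the implementation. It further assumes the output is indexed by the parallel dimensions in order. `IteratorKind`, `forallRank`, and `partialResultShape` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the loop iterator types.
enum class IteratorKind { Parallel, Reduction };

// Each nonzero thread count becomes one forall induction variable, and
// therefore needs one entry in the mapping attribute.
std::size_t forallRank(const std::vector<int64_t> &numThreads) {
  std::size_t rank = 0;
  for (int64_t n : numThreads)
    rank += (n != 0) ? 1 : 0;
  return rank;
}

// Assumed shape rule: parallel dims keep their extent, tiled reduction dims
// contribute one slot per thread, untiled reduction dims are reduced away
// entirely inside each thread.
std::vector<int64_t>
partialResultShape(const std::vector<int64_t> &loopExtents,
                   const std::vector<IteratorKind> &iterators,
                   const std::vector<int64_t> &numThreads) {
  assert(loopExtents.size() == iterators.size());
  assert(numThreads.size() <= iterators.size());
  std::vector<int64_t> shape;
  for (std::size_t idx = 0; idx < iterators.size(); ++idx) {
    const bool tiled = idx < numThreads.size() && numThreads[idx] != 0;
    if (iterators[idx] == IteratorKind::Parallel)
      shape.push_back(loopExtents[idx]);
    else if (tiled)
      shape.push_back(numThreads[idx]);
  }
  return shape;
}
```

Both test configurations have a forall rank of 2, which is why each needs exactly two mapping entries (`#gpu.thread<z>` and `#gpu.thread<y>`). Padding `num_threads` to the full rank with nonzero values would require a third mapping entry, which is the restriction the author describes.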

Revision Contents

Path	Size
mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp	63 lines
mlir/test/Dialect/Linalg/transform-tile-reduction.mlir	118 lines

Diff 552995

mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp

FailureOr<linalg::ForallReductionTilingResult> linalg::tileReductionUsingForall(

SmallVector<Range> iterationDomain = tilingInterfaceOp.getIterationDomain(b);		SmallVector<Range> iterationDomain = tilingInterfaceOp.getIterationDomain(b);
if (op->getNumResults() != 1)		if (op->getNumResults() != 1)
return b.notifyMatchFailure(		return b.notifyMatchFailure(
op, "don't support ops with multiple results for now");		op, "don't support ops with multiple results for now");

SmallVector<utils::IteratorType> iterators =		SmallVector<utils::IteratorType> iterators =
tilingInterfaceOp.getLoopIteratorTypes();		tilingInterfaceOp.getLoopIteratorTypes();
SmallVector<unsigned> redDims;
linalgOp.getReductionDims(redDims);
if (redDims.size() != 1)
return b.notifyMatchFailure(
op, "only support ops with one reduction dimension.");
if (!tileSizes.empty() && tileSizes.size() != numThreads.size())		if (!tileSizes.empty() && tileSizes.size() != numThreads.size())
return b.notifyMatchFailure(op, "if tile sizes are present it must have as "		return b.notifyMatchFailure(op, "if tile sizes are present it must have as "
"many elements as number of threads");		"many elements as number of threads");
int reductionDim = static_cast<int>(redDims.front());

if (redDims.front() >= numThreads.size())		SmallVector<int> tiledReductionDims, reductionInductionVarIndices;
		int64_t nonZeroTileIdx = 0;
		for (auto [idx, iteratorType] :
		llvm::enumerate(tilingInterfaceOp.getLoopIteratorTypes())) {
		bool isNonZeroTileSize =
		idx < numThreads.size() && !isConstantIntValue(numThreads[idx], 0);
		if (iteratorType == utils::IteratorType::reduction && isNonZeroTileSize) {
		tiledReductionDims.push_back(idx);
		reductionInductionVarIndices.push_back(nonZeroTileIdx);
		}
		nonZeroTileIdx += isNonZeroTileSize;
		}

		if (tiledReductionDims.empty()) {
return b.notifyMatchFailure(		return b.notifyMatchFailure(
op, "reduction dimension must be mapped to threads");		op, "at least one reduction dimension must be mapped to threads");
		}

// 1. Create the inital tensor value.		// 1. Create the inital tensor value.
FailureOr<Operation *> identityTensor =		FailureOr<Operation *> identityTensor =
op.generateInitialTensorForPartialReduction(b, loc, numThreads,		op.generateInitialTensorForPartialReduction(b, loc, numThreads,
reductionDim);		tiledReductionDims);
if (failed(identityTensor))		if (failed(identityTensor))
return b.notifyMatchFailure(op,		return b.notifyMatchFailure(op,
"cannot create a tensor of identity value.");		"cannot create a tensor of identity value.");

// Gather destination tensors.		// Gather destination tensors.
SmallVector<Value> dest;		SmallVector<Value> dest;
if (failed(tensor::getOrCreateDestinations(b, loc, op, dest)))		if (failed(tensor::getOrCreateDestinations(b, loc, op, dest)))
return b.notifyMatchFailure(op, "failed to get destination tensors");		return b.notifyMatchFailure(op, "failed to get destination tensors");

Operation *tiledOp = nullptr;		Operation *tiledOp = nullptr;

SmallVector<OpFoldResult> nonZeroNumThreads =		SmallVector<OpFoldResult> nonZeroNumThreads =
llvm::to_vector(llvm::make_filter_range(numThreads, [](OpFoldResult ofr) {		llvm::to_vector(llvm::make_filter_range(numThreads, [](OpFoldResult ofr) {
return !isConstantIntValue(ofr, 0);		return !isConstantIntValue(ofr, 0);
}));		}));
SmallVector<Value> materializedNonZeroNumThreads =		SmallVector<Value> materializedNonZeroNumThreads =
getValueOrCreateConstantIndexOp(b, loc, nonZeroNumThreads);		getValueOrCreateConstantIndexOp(b, loc, nonZeroNumThreads);

// 2. Create the ForallOp with an empty region.		// 2. Create the ForallOp with an empty region.
scf::ForallOp forallOp = b.create<scf::ForallOp>(		scf::ForallOp forallOp = b.create<scf::ForallOp>(
loc, getAsOpFoldResult(materializedNonZeroNumThreads),		loc, getAsOpFoldResult(materializedNonZeroNumThreads),
(*identityTensor)->getResults(), mapping);		(*identityTensor)->getResults(), mapping);
		SmallVector<Value> threadIds = forallOp.getInductionVars();

// 3. Calculate the tile offsets and sizes for the subsequent loop that will		// 3. Calculate the tile offsets and sizes for the subsequent loop that will
// be nested under `forallOp`.		// be nested under `forallOp`.
SmallVector<OpFoldResult> tiledOffsets, tiledSizes;		SmallVector<OpFoldResult> tiledOffsets, tiledSizes;
calculateTileOffsetsAndSizes(b, loc, forallOp, numThreads, iterationDomain,		calculateTileOffsetsAndSizes(b, loc, forallOp, numThreads, iterationDomain,
/*omitTileOffsetBoundsCheck =*/false,		/*omitTileOffsetBoundsCheck =*/false,
/*nominalTileSizes=*/std::nullopt, tiledOffsets,		/*nominalTileSizes=*/std::nullopt, tiledOffsets,
tiledSizes);		tiledSizes);

// 4. Clone the tileable op and update its destination operands to use the		// 4. Clone the tileable op and update its destination operands to use the
// output bbArgs of the ForallOp.		// output bbArgs of the ForallOp.
SmallVector<Value> tilingResults;		SmallVector<Value> tilingResults;
ArrayRef<BlockArgument> destBbArgs = forallOp.getOutputBlockArguments();		ArrayRef<BlockArgument> destBbArgs = forallOp.getOutputBlockArguments();
{		{
// 4.a. RAII guard, inserting within forallOp, before terminator.		// 4.a. RAII guard, inserting within forallOp, before terminator.
OpBuilder::InsertionGuard g(b);		OpBuilder::InsertionGuard g(b);
b.setInsertionPoint(forallOp.getTerminator());		b.setInsertionPoint(forallOp.getTerminator());

		llvm::SmallDenseSet<int> reductionIndexSet(tiledReductionDims.begin(),
		tiledReductionDims.end());

SmallVector<Value> tiledDpsInitOperands;		SmallVector<Value> tiledDpsInitOperands;
for (OpOperand *initOperand : destinationStyleOp.getDpsInitOperands()) {		for (OpOperand *initOperand : destinationStyleOp.getDpsInitOperands()) {
auto *it = llvm::find(dest, initOperand->get());		auto *it = llvm::find(dest, initOperand->get());
assert(it != dest.end() && "dest operand not found in dest");		assert(it != dest.end() && "dest operand not found in dest");
unsigned destNum = std::distance(dest.begin(), it);		unsigned destNum = std::distance(dest.begin(), it);
SmallVector<OpFoldResult> strides(numThreads.size(), b.getIndexAttr(1));		SmallVector<OpFoldResult> strides(numThreads.size(), b.getIndexAttr(1));
SmallVector<OpFoldResult> outOffsets(numThreads.size(),		SmallVector<OpFoldResult> outOffsets(numThreads.size(),
b.getIndexAttr(0));		b.getIndexAttr(0));
SmallVector<OpFoldResult> sizes = tiledSizes;		SmallVector<OpFoldResult> sizes(tiledSizes.begin(),
sizes[reductionDim] = b.getIndexAttr(1);		tiledSizes.begin() + numThreads.size());
outOffsets[reductionDim] = forallOp.getInductionVars().front();		for (auto [indVarIdx, redDim] :
		llvm::zip_equal(reductionInductionVarIndices, tiledReductionDims)) {
		sizes[redDim] = b.getIndexAttr(1);
		outOffsets[redDim] = threadIds[indVarIdx];
		}
		// Here we are just slicing along tiled reduction dimensions
		// so that the shape of the output of the cloned op matches
		// that of the original op. This enables generating the tiled
		// implementation in the next step, which includes parallel dimension
		// tiling.
		for (int i = 0, e = numThreads.size(); i < e; ++i) {
		if (!reductionIndexSet.contains(i) &&
		!isConstantIntValue(numThreads[i], 0))
		sizes[i] = tensor::getMixedSize(b, loc, destBbArgs[destNum], i);
		}
// TODO: use SubsetExtractOpInterface once it is available.		// TODO: use SubsetExtractOpInterface once it is available.
tiledDpsInitOperands.push_back(b.create<tensor::ExtractSliceOp>(		tiledDpsInitOperands.push_back(b.create<tensor::ExtractSliceOp>(
loc, cast<RankedTensorType>(initOperand->get().getType()),		loc, cast<RankedTensorType>(initOperand->get().getType()),
destBbArgs[destNum], outOffsets, sizes, strides));		destBbArgs[destNum], outOffsets, sizes, strides));
}		}

// 4.b. Clone the op and update init operands.		// 4.b. Clone the op and update init operands.
// We cannot use a IRMapping here because it can replace		// We cannot use a IRMapping here because it can replace
[22 lines not shown]
if (tileSizes.empty()) {
tilingResults = tilingResult->tiledValues;		tilingResults = tilingResult->tiledValues;
} else {		} else {
LinalgTilingOptions options;		LinalgTilingOptions options;
FailureOr<TiledLinalgOp> maybeTiled = tileLinalgOpImpl<scf::ForOp>(		FailureOr<TiledLinalgOp> maybeTiled = tileLinalgOpImpl<scf::ForOp>(
b, cast<LinalgOp>(clonedOp), tileSizes, options);		b, cast<LinalgOp>(clonedOp), tileSizes, options);
if (failed(maybeTiled))		if (failed(maybeTiled))
return b.notifyMatchFailure(op, "failed tileLinalgOpImpl");		return b.notifyMatchFailure(op, "failed tileLinalgOpImpl");

SmallVector<Value> ids = forallOp.getInductionVars();		mapLoopToProcessorIds(cast<scf::ForOp>(maybeTiled->loops.back()),
mapLoopToProcessorIds(cast<scf::ForOp>(maybeTiled->loops.back()), ids,		threadIds, materializedNonZeroNumThreads);
materializedNonZeroNumThreads);
if (maybeTiled->loops.size() != 1) {		if (maybeTiled->loops.size() != 1) {
return clonedOp->emitError("expected a single produced loop");		return clonedOp->emitError("expected a single produced loop");
}		}
tiledOp = maybeTiled->op;		tiledOp = maybeTiled->op;
tilingResults = maybeTiled->loops.front()->getResults();		tilingResults = maybeTiled->loops.front()->getResults();
}		}

b.eraseOp(clonedOp);		b.eraseOp(clonedOp);
}		}

// 6. Insert the partial reductions back into a new tensor.		// 6. Insert the partial reductions back into a new tensor.
for (auto [index, result, bbArg] : llvm::zip(		for (auto [index, result, bbArg] : llvm::zip(
llvm::seq<unsigned>(0, dest.size()), tilingResults, destBbArgs)) {		llvm::seq<unsigned>(0, dest.size()), tilingResults, destBbArgs)) {
// 6.a. Partial subset information is inserted just before the terminator.		// 6.a. Partial subset information is inserted just before the terminator.
OpBuilder::InsertionGuard g(b);		OpBuilder::InsertionGuard g(b);
b.setInsertionPoint(forallOp.getTerminator());		b.setInsertionPoint(forallOp.getTerminator());

SmallVector<OpFoldResult> resultOffsets, resultSizes;		SmallVector<OpFoldResult> resultOffsets, resultSizes;
if (failed(tilingInterfaceOp.getResultTilePosition(		if (failed(tilingInterfaceOp.getResultTilePosition(
b, index, tiledOffsets, tiledSizes, resultOffsets, resultSizes)))		b, index, tiledOffsets, tiledSizes, resultOffsets, resultSizes)))
return op->emitOpError("output offsets couldn't be calculated");		return op->emitOpError("output offsets couldn't be calculated");
SmallVector<OpFoldResult> resultOffsetsRank, resultSizesRank;		SmallVector<OpFoldResult> resultOffsetsRank, resultSizesRank;
int64_t offIdx = 0;		int64_t offIdx = 0;
int64_t sizeIdx = 0;		int64_t sizeIdx = 0;
		int64_t reductionIdx = 0;
for (int64_t i = 0, e = numThreads.size(); i < e; ++i) {		for (int64_t i = 0, e = numThreads.size(); i < e; ++i) {
if (i == reductionDim) {		if (tiledReductionDims[reductionIdx] == i) {
resultOffsetsRank.push_back(forallOp.getInductionVars().front());		resultOffsetsRank.push_back(
		threadIds[reductionInductionVarIndices[reductionIdx++]]);
resultSizesRank.push_back(b.getIndexAttr(1));		resultSizesRank.push_back(b.getIndexAttr(1));
continue;		continue;
}		}
resultOffsetsRank.push_back(resultOffsets[offIdx++]);		resultOffsetsRank.push_back(resultOffsets[offIdx++]);
resultSizesRank.push_back(resultSizes[sizeIdx++]);		resultSizesRank.push_back(resultSizes[sizeIdx++]);
}		}
SmallVector<OpFoldResult> strides(resultSizesRank.size(),		SmallVector<OpFoldResult> strides(resultSizesRank.size(),
b.getIndexAttr(1));		b.getIndexAttr(1));

// 6.b. Parallel insertions are inserted at the end of the combining		// 6.b. Parallel insertions are inserted at the end of the combining
// terminator.		// terminator.
b.setInsertionPointToEnd(forallOp.getTerminator().getBody());		b.setInsertionPointToEnd(forallOp.getTerminator().getBody());
b.create<tensor::ParallelInsertSliceOp>(		b.create<tensor::ParallelInsertSliceOp>(
loc, result, bbArg, resultOffsetsRank, resultSizesRank, strides);		loc, result, bbArg, resultOffsetsRank, resultSizesRank, strides);
}		}

// 7. Merge the partial reductions.		// 7. Merge the partial reductions.
b.setInsertionPointAfter(forallOp);		b.setInsertionPointAfter(forallOp);
Operation *mergeOp =		Operation *mergeOp =
op.mergeReductions(b, loc, forallOp->getResults(), reductionDim);		op.mergeReductions(b, loc, forallOp->getResults(), tiledReductionDims);
b.replaceOp(op, mergeOp->getResults());		b.replaceOp(op, mergeOp->getResults());

// 8. Return.		// 8. Return.
ForallReductionTilingResult results;		ForallReductionTilingResult results;
results.initialOp = *identityTensor;		results.initialOp = *identityTensor;
results.loops = forallOp;		results.loops = forallOp;
results.parallelTiledOp = tiledOp;		results.parallelTiledOp = tiledOp;
results.mergeOp = mergeOp;		results.mergeOp = mergeOp;

mlir/test/Dialect/Linalg/transform-tile-reduction.mlir

	// CHECK: %[[F:.*]] = linalg.fill ins(%{{.*}} : f32) outs(%{{.*}} : tensor<4096x2x64xf32>) -> tensor<4096x2x64xf32>			// CHECK: %[[F:.*]] = linalg.fill ins(%{{.*}} : f32) outs(%{{.*}} : tensor<4096x2x64xf32>) -> tensor<4096x2x64xf32>
	// CHECK: %[[L0:.*]] = scf.for %{{.*}} = %{{.*}} to %{{.*}} step %{{.*}} iter_args(%[[ARG3:.*]] = %[[F]]) -> (tensor<4096x2x64xf32>)			// CHECK: %[[L0:.*]] = scf.for %{{.*}} = %{{.*}} to %{{.*}} step %{{.*}} iter_args(%[[ARG3:.*]] = %[[F]]) -> (tensor<4096x2x64xf32>)
	// CHECK: %[[L1:.*]] = scf.for %{{.*}} = %{{.*}} to %{{.*}} step %{{.*}} iter_args(%[[ARG4:.*]] = %[[ARG3]]) -> (tensor<4096x2x64xf32>)			// CHECK: %[[L1:.*]] = scf.for %{{.*}} = %{{.*}} to %{{.*}} step %{{.*}} iter_args(%[[ARG4:.*]] = %[[ARG3]]) -> (tensor<4096x2x64xf32>)
	// CHECK: %[[OUT:.*]] = linalg.generic {indexing_maps = [{{.*}}, {{.*}}, {{.*}}], iterator_types = ["parallel", "parallel", "parallel"]} ins(%{{.*}}, %{{.*}}: tensor<2x64xf32>, tensor<4096x2x64xf32>) outs(%{{.*}}: tensor<4096x2x64xf32>)			// CHECK: %[[OUT:.*]] = linalg.generic {indexing_maps = [{{.*}}, {{.*}}, {{.*}}], iterator_types = ["parallel", "parallel", "parallel"]} ins(%{{.*}}, %{{.*}}: tensor<2x64xf32>, tensor<4096x2x64xf32>) outs(%{{.*}}: tensor<4096x2x64xf32>)
	// CHECK: scf.yield %[[OUT]] : tensor<4096x2x64xf32>			// CHECK: scf.yield %[[OUT]] : tensor<4096x2x64xf32>
	// CHECK: scf.yield %[[L1]] : tensor<4096x2x64xf32>			// CHECK: scf.yield %[[L1]] : tensor<4096x2x64xf32>
	// CHECK: %[[OUT2:.*]] = linalg.generic {indexing_maps = [{{.*}}, {{.*}}], iterator_types = ["parallel", "reduction", "reduction"]} ins(%{{.*}} : tensor<4096x2x64xf32>) outs(%{{.*}} : tensor<4096xf32>)			// CHECK: %[[OUT2:.*]] = linalg.generic {indexing_maps = [{{.*}}, {{.*}}], iterator_types = ["parallel", "reduction", "reduction"]} ins(%{{.*}} : tensor<4096x2x64xf32>) outs(%{{.*}} : tensor<4096xf32>)
	// CHECK: return %[[OUT2]] : tensor<4096xf32>			// CHECK: return %[[OUT2]] : tensor<4096xf32>

				// -----

				#map = affine_map<(d0, d1, d2) -> (d1, d2)>
				#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
				#map2 = affine_map<(d0, d1, d2) -> (d0)>
				module {
				func.func @reduction_tile_multiple_reduction_parallel(%arg0: tensor<32x128xf32>, %arg1: tensor<4x32x128xf32>, %arg2: tensor<4xf32>) -> tensor<4xf32> {
				%0 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "reduction", "reduction"]} ins(%arg0, %arg1 : tensor<32x128xf32>, tensor<4x32x128xf32>) outs(%arg2 : tensor<4xf32>) {
				^bb0(%in: f32, %in_0: f32, %out: f32):
				%1 = arith.mulf %in, %in_0 : f32
				%2 = arith.addf %1, %out : f32
				linalg.yield %2 : f32
				} -> tensor<4xf32>
				return %0 : tensor<4xf32>
				}
				transform.sequence failures(propagate) {
				^bb0(%arg0: !transform.any_op):
				%0 = transform.structured.match ops{["linalg.generic"]} in %arg0 : (!transform.any_op) -> !transform.any_op
				%loop, %1, %2, %3 = transform.structured.tile_reduction_using_forall %0 by num_threads = [4, 2], tile_sizes = [], mapping = [#gpu.thread<z>, #gpu.thread<y>] : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op, !transform.any_op)
				}
				}

				// CHECK-DAG: #[[MAP:.*]] = affine_map<(d0) -> (d0 * 16)>
				// CHECK-DAG: #[[MAP1:.*]] = affine_map<(d0, d1, d2) -> (d1, d2)>
				// CHECK-DAG: #[[MAP2:.*]] = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
				// CHECK-DAG: #[[MAP3:.*]] = affine_map<(d0, d1, d2) -> (d0)>
				// CHECK-DAG: #[[MAP4:.*]] = affine_map<(d0, d1) -> (d0, d1)>
				// CHECK-DAG: #[[MAP5:.*]] = affine_map<(d0, d1) -> (d0)>
				// CHECK: func @reduction_tile_multiple_reduction_parallel(%[[ARG0:.+]]: tensor<32x128xf32>, %[[ARG1:.+]]: tensor<4x32x128xf32>, %[[ARG2:.+]]: tensor<4xf32>
				// CHECK: %[[F:.*]] = linalg.fill ins(%{{.*}} : f32) outs(%{{.*}} : tensor<4x2xf32>) -> tensor<4x2xf32>
				// CHECK: %[[L:.*]] = scf.forall (%[[Z:.+]], %[[Y:.+]]) in (4, 2) shared_outs(%[[ARG5:.+]] = %[[F]]) -> (tensor<4x2xf32>) {
				// CHECK: %[[ER:.+]] = tensor.extract_slice %[[ARG5]][0, %[[Y]]] [4, 1] [1, 1] : tensor<4x2xf32> to tensor<4xf32>
				// CHECK: %[[OFF:.+]] = affine.apply #[[MAP]](%[[Y]])
				// CHECK: %[[ESIN0:.+]] = tensor.extract_slice %[[ARG0]][%[[OFF]], 0] [16, 128] [1, 1] : tensor<32x128xf32> to tensor<16x128xf32>
				// CHECK: %[[ESIN1:.+]] = tensor.extract_slice %[[ARG1]][%[[Z]], %[[OFF]], 0] [1, 16, 128] [1, 1, 1] : tensor<4x32x128xf32> to tensor<1x16x128xf32>
				// CHECK: %[[EP:.+]] = tensor.extract_slice %[[ER]][%[[Z]]] [1] [1] : tensor<4xf32> to tensor<1xf32>
				// CHECK: %[[PARTIAL:.+]] = linalg.generic
				// CHECK-SAME: indexing_maps = [#[[MAP1]], #[[MAP2]], #[[MAP3]]]
				// CHECK-SAME: iterator_types = ["parallel", "reduction", "reduction"]
				// CHECK-SAME: ins(%[[ESIN0]], %[[ESIN1]] : tensor<16x128xf32>, tensor<1x16x128xf32>)
				// CHECK-SAME: outs(%[[EP]] : tensor<1xf32>) {
				// CHECK: arith.mulf
				// CHECK: arith.addf
				// CHECK: linalg.yield
				// CHECK: } -> tensor<1xf32>
				// CHECK: scf.forall.in_parallel {
				// CHECK: tensor.parallel_insert_slice %[[PARTIAL]] into %[[ARG5]][%[[Z]], %[[Y]]] [1, 1] [1, 1] : tensor<1xf32> into tensor<4x2xf32>
				// CHECK: }
				// CHECK: } {mapping = [#gpu.thread<z>, #gpu.thread<y>]}
				// CHECK: %[[R:.*]] = linalg.generic
				// CHECK-SAME: indexing_maps = [#[[MAP4]], #[[MAP5]]]
				// CHECK-SAME: iterator_types = ["parallel", "reduction"]
				// CHECK-SAME: ins(%[[L]] : tensor<4x2xf32>)
				// CHECK-SAME: outs(%[[ARG2]] : tensor<4xf32>) {
				// CHECK: arith.addf
				// CHECK: linalg.yield
				// CHECK: } -> tensor<4xf32>
				// CHECK: return %[[R]] : tensor<4xf32>


				// -----

				#map = affine_map<(d0, d1, d2) -> (d1, d2)>
				#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
				#map2 = affine_map<(d0, d1, d2) -> (d0)>
				module {
				func.func @reduction_tile_multiple_reduction_parallel_all_dims(%arg0: tensor<32x128xf32>, %arg1: tensor<4x32x128xf32>, %arg2: tensor<4xf32>) -> tensor<4xf32> {
				%0 = linalg.generic {indexing_maps = [#map, #map1, #map2], iterator_types = ["parallel", "reduction", "reduction"]} ins(%arg0, %arg1 : tensor<32x128xf32>, tensor<4x32x128xf32>) outs(%arg2 : tensor<4xf32>) {
				^bb0(%in: f32, %in_0: f32, %out: f32):
				%1 = arith.mulf %in, %in_0 : f32
				%2 = arith.addf %1, %out : f32
				linalg.yield %2 : f32
				} -> tensor<4xf32>
				return %0 : tensor<4xf32>
				}
				transform.sequence failures(propagate) {
				^bb0(%arg0: !transform.any_op):
				%0 = transform.structured.match ops{["linalg.generic"]} in %arg0 : (!transform.any_op) -> !transform.any_op
				%loop, %1, %2, %3 = transform.structured.tile_reduction_using_forall %0 by num_threads = [0, 2, 4], tile_sizes = [], mapping = [#gpu.thread<z>, #gpu.thread<y>] : (!transform.any_op) -> (!transform.any_op, !transform.any_op, !transform.any_op, !transform.any_op)
				}
				}

				// CHECK-DAG: #[[MAP:.*]] = affine_map<(d0) -> (d0 * 16)>
				// CHECK-DAG: #[[MAP1:.*]] = affine_map<(d0) -> (d0 * 32)>
				// CHECK-DAG: #[[MAP2:.*]] = affine_map<(d0, d1, d2) -> (d1, d2)>
				// CHECK-DAG: #[[MAP3:.*]] = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
				// CHECK-DAG: #[[MAP4:.*]] = affine_map<(d0, d1, d2) -> (d0)>
				// CHECK: func @reduction_tile_multiple_reduction_parallel_all_dims(%[[ARG0:.+]]: tensor<32x128xf32>, %[[ARG1:.+]]: tensor<4x32x128xf32>, %[[ARG2:.+]]: tensor<4xf32>
				// CHECK: %[[F:.*]] = linalg.fill ins(%{{.*}} : f32) outs(%{{.*}} : tensor<4x2x4xf32>) -> tensor<4x2x4xf32>
				// CHECK: %[[L:.*]] = scf.forall (%[[Z:.+]], %[[Y:.+]]) in (2, 4) shared_outs(%[[ARG5:.+]] = %[[F]]) -> (tensor<4x2x4xf32>) {
				// CHECK: %[[ER:.+]] = tensor.extract_slice %[[ARG5]][0, %[[Z]], %[[Y]]] [4, 1, 1] [1, 1, 1] : tensor<4x2x4xf32> to tensor<4xf32>
				// CHECK: %[[OFFZ:.+]] = affine.apply #[[MAP]](%[[Z]])
				// CHECK: %[[OFFY:.+]] = affine.apply #[[MAP1]](%[[Y]])
				// CHECK: %[[ESIN0:.+]] = tensor.extract_slice %[[ARG0]][%[[OFFZ]], %[[OFFY]]] [16, 32] [1, 1] : tensor<32x128xf32> to tensor<16x32xf32>
				// CHECK: %[[ESIN1:.+]] = tensor.extract_slice %[[ARG1]][0, %[[OFFZ]], %[[OFFY]]] [4, 16, 32] [1, 1, 1] : tensor<4x32x128xf32> to tensor<4x16x32xf32>
				// CHECK: %[[PARTIAL:.+]] = linalg.generic
				// CHECK-SAME: indexing_maps = [#[[MAP2]], #[[MAP3]], #[[MAP4]]]
				// CHECK-SAME: iterator_types = ["parallel", "reduction", "reduction"]
				// CHECK-SAME: ins(%[[ESIN0]], %[[ESIN1]] : tensor<16x32xf32>, tensor<4x16x32xf32>)
				// CHECK-SAME: outs(%[[ER]] : tensor<4xf32>) {
				// CHECK: arith.mulf
				// CHECK: arith.addf
				// CHECK: linalg.yield
				// CHECK: } -> tensor<4xf32>
				// CHECK: scf.forall.in_parallel {
				// CHECK: tensor.parallel_insert_slice %[[PARTIAL]] into %[[ARG5]][0, %[[Z]], %[[Y]]] [4, 1, 1] [1, 1, 1] : tensor<4xf32> into tensor<4x2x4xf32>
				// CHECK: }
				// CHECK: } {mapping = [#gpu.thread<z>, #gpu.thread<y>]}
				// CHECK: %[[R:.*]] = linalg.generic
				// CHECK-SAME: indexing_maps = [#[[MAP3]], #[[MAP4]]]
				// CHECK-SAME: iterator_types = ["parallel", "reduction", "reduction"]
				// CHECK-SAME: ins(%[[L]] : tensor<4x2x4xf32>)
				// CHECK-SAME: outs(%[[ARG2]] : tensor<4xf32>) {
				// CHECK: arith.addf
				// CHECK: linalg.yield
				// CHECK: } -> tensor<4xf32>
				// CHECK: return %[[R]] : tensor<4xf32>
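
The per-thread input slices checked above (offsets scaled by 16 and 32, sizes 16 and 32) are consistent with splitting each tiled extent evenly across its thread count. The following standalone sketch shows that arithmetic under the assumption of a ceiling-division split with a clamped last tile. `Tile`, `ceilDiv`, and `threadTile` are hypothetical names and this is not the body of `calculateTileOffsetsAndSizes`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical result type: the slice of one loop dimension owned by a thread.
struct Tile {
  int64_t offset;
  int64_t size;
};

int64_t ceilDiv(int64_t a, int64_t b) {
  assert(a >= 0 && b > 0);
  return (a + b - 1) / b;
}

// Assumed rule: every thread gets ceil(extent / numThreads) elements, and the
// last tiles are clamped so they never run past the end of the dimension.
Tile threadTile(int64_t extent, int64_t numThreads, int64_t threadId) {
  assert(numThreads > 0 && threadId >= 0 && threadId < numThreads);
  const int64_t tileSize = ceilDiv(extent, numThreads);
  const int64_t offset = threadId * tileSize;
  const int64_t size =
      std::max<int64_t>(0, std::min(tileSize, extent - offset));
  return {offset, size};
}
```

For the test shapes, an extent of 32 split over 2 threads gives tiles of 16 at offset `16 * id`, and an extent of 128 split over 4 threads gives tiles of 32 at offset `32 * id`. The clamp only matters when the extent is not a multiple of the thread count.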