This revision adds a new op, map_nested_foreach_thread_to_gpu_threads, to the transform dialect. The op searches for scf.foreach_thread ops nested inside a gpu.launch op and distributes them onto GPU threads by rewriting their induction variables to gpu.thread_id ops.
Loop mapping is explicit and given by the map_nested_foreach_thread_to_gpu_threads op. The mapping is one-to-one (one loop iteration per thread), therefore the loops disappear.
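A minimal sketch of the intended usage follows; the transform.gpu namespace and the blockDim / thread_dim_mapping attribute spellings are assumptions here and the exact names may differ:

  // Inside a transform sequence; %arg0 is the payload root.
  %launch = transform.structured.match ops{["gpu.launch"]} in %arg0
  transform.gpu.map_nested_foreach_thread_to_gpu_threads %launch { blockDim = [32, 4, 1] }

Given a loop such as

  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%tdx = %c32, %tdy = %c4, %tdz = %c1) {
    scf.foreach_thread (%i, %j) in (%c32, %c4) {
      %v = memref.load %src[%i, %j] : memref<32x4xf32>
      memref.store %v, %dst[%i, %j] : memref<32x4xf32>
    } {thread_dim_mapping = [0, 1]}
    gpu.terminator
  }

the loop body is inlined into the gpu.launch region and the induction variables are replaced by thread ids:

  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%tdx = %c32, %tdy = %c4, %tdz = %c1) {
    %tidx = gpu.thread_id x
    %tidy = gpu.thread_id y
    %v = memref.load %src[%tidx, %tidy] : memref<32x4xf32>
    memref.store %v, %dst[%tidx, %tidy] : memref<32x4xf32>
    gpu.terminator
  }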
For the time being, trip counts that are dynamic or larger than the thread sizes are not supported. However, the compiler can support these cases in the future by generating a loop inside the kernel with static cyclic scheduling.
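For illustration, such a cyclic schedule could look like the sketch below, where each thread strides over the trip count by the block size. This is hypothetical output, not something this revision generates:

  %tidx = gpu.thread_id x
  %bdimx = gpu.block_dim x
  // Thread t would handle iterations t, t + blockDim.x, t + 2*blockDim.x, ...
  scf.for %i = %tidx to %trip_count step %bdimx {
    %v = memref.load %src[%i] : memref<?xf32>
    memref.store %v, %dst[%i] : memref<?xf32>
  }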
The current mechanism allows scf.foreach_thread ops to be siblings or nested. When they are nested, there cannot be any interleaving code between the loops, as shown in the sketch below.
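For example (attribute spelling assumed as above):

  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%tdx = %c32, %tdy = %c4, %tdz = %c1) {
    // Siblings: each loop is distributed independently.
    scf.foreach_thread (%i, %j) in (%c7, %c9) {
      %s = arith.addi %i, %j : index
    } {thread_dim_mapping = [1, 0]}
    // Nested: the inner loop must immediately follow the outer one,
    // with no other code in between.
    scf.foreach_thread (%i) in (%c4) {
      scf.foreach_thread (%j) in (%c12) {
        %p = arith.muli %i, %j : index
      } {thread_dim_mapping = [0]}
    } {thread_dim_mapping = [1]}
    gpu.terminator
  }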
Note: there is no translation_info attribute in MLIR.