This is an archive of the discontinued LLVM Phabricator instance.

Differential D137413

[mlir] Introduce device mapper attribute for `thread_dim_map` and `mapped to dims`
ClosedPublic

Authored by guraypp on Nov 4 2022, 5:25 AM.

Details

Reviewers

ftynse
nicolasvasilache
rriddle
bondhugula
ThomasRaoux
herhut
aaron.ballman

Commits

rG6663f3470417: [mlir] Introduce device mapper attribute for `thread_dim_map` and `mapped to…

Summary

scf.foreach_thread maps its loops to processors via an integer array (see the example below), and a lowering can use this mapping. However, expressing the mapping as an integer array is confusing, especially when there are multiple levels of parallelism. In addition, the op does not verify the integer array. This change introduces a device mapping attribute to make the mapping descriptive and verifiable, and makes the GPU transform dialect use it.

scf.foreach_thread (%i, %j) in (%c1, %c2) {
	scf.foreach_thread (%i2, %j2) in (%c1, %c2)
	{...} { thread_dim_mapping = [0, 1]}
} { thread_dim_mapping = [0, 1]}
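
To illustrate why an unverified integer array is fragile, here is a minimal, self-contained C++ sketch of the kind of check a lowering has to perform on its own. It does not use MLIR. The helper `permute` and its convention (entry i sends loop i to position mapping[i]) are assumptions made for illustration, not the op's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper, not MLIR API: reorders `values` so that
// result[mapping[i]] == values[i]. Returns std::nullopt when `mapping` is not
// a permutation of 0 .. values.size() - 1.
std::optional<std::vector<std::string>>
permute(const std::vector<std::string> &values,
        const std::vector<std::int64_t> &mapping) {
  if (values.size() != mapping.size())
    return std::nullopt;
  const std::int64_t rank = static_cast<std::int64_t>(values.size());
  std::vector<std::string> result(values.size());
  std::vector<bool> seen(values.size(), false);
  for (std::size_t i = 0; i < mapping.size(); ++i) {
    const std::int64_t target = mapping[i];
    // Reject out-of-range entries and entries that were already used.
    if (target < 0 || target >= rank || seen[target])
      return std::nullopt;
    result[target] = values[i];
    seen[target] = true;
  }
  // Each target was unique and in range, so every slot has been written.
  for (bool filled : seen) {
    assert(filled);
    (void)filled;
  }
  return result;
}
```

Under this assumed convention, `permute({"i", "j"}, {1, 0})` swaps the two entries, while `{0, 0}` and `{0, 2}` are rejected. Note that nothing in the integers themselves says which level of parallelism they refer to, which is the ambiguity the summary describes.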

This change first introduces DeviceMappingInterface, an attribute interface. scf.foreach_thread defines its mapping via this interface. A lowering must define its own attributes and implement this interface as well. This gives us clear validation.

The change also introduces two new attributes, #gpu.thread<x/y/z> and #gpu.block<x/y/z>. After this change, the above code prints as below, which clarifies the loop mappings. The change also implements consumption of these two new attributes by the transform dialect, which binds the outermost loops to thread blocks and the innermost loops to threads.

scf.foreach_thread (%i, %j) in (%c1, %c2) {
	scf.foreach_thread (%i2, %j2) in (%c1, %c2)
	{...} { thread_dim_mapping = [#gpu.thread<x>, #gpu.thread<y>]}
} { thread_dim_mapping = [#gpu.block<x>, #gpu.block<y>]}
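
A self-contained C++ sketch of why typed mapping attributes are easier to validate than bare integers. It models the two attribute kinds with plain enums and does not use MLIR. `MappingAttr`, `toString`, `verifyMapping`, and the specific rules checked (one entry per loop, a single level per op, no repeated dimension) are assumptions for illustration, not the verifier introduced by this revision.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <string>
#include <vector>

// Toy stand-ins for the attributes discussed above (hypothetical types).
enum class Dim { X = 0, Y = 1, Z = 2 };
enum class Level { Block, Thread };

struct MappingAttr {
  Level level;
  Dim dim;
};

// Renders an entry in a form resembling the printed attribute.
std::string toString(const MappingAttr &attr) {
  static const char *const dims[] = {"x", "y", "z"};
  const int index = static_cast<int>(attr.dim);
  assert(index >= 0 && index <= 2 && "unknown dimension");
  const std::string name =
      attr.level == Level::Block ? "#gpu.block<" : "#gpu.thread<";
  return name + dims[index] + ">";
}

// Returns an error message when the mapping is rejected, std::nullopt when it
// passes. The rules are illustrative assumptions, not the real verifier.
std::optional<std::string>
verifyMapping(const std::vector<MappingAttr> &mapping, std::size_t rank) {
  if (mapping.size() != rank)
    return std::string("mapping must have one entry per loop");
  std::set<Dim> seen;
  for (const MappingAttr &attr : mapping) {
    if (attr.level != mapping.front().level)
      return std::string("all entries must target the same level");
    if (!seen.insert(attr.dim).second)
      return "duplicate mapping entry " + toString(attr);
  }
  return std::nullopt;
}
```

Because each entry names both the level and the dimension, a mistake such as mapping two loops to `#gpu.thread<x>` can be reported with a readable message instead of being silently misinterpreted.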

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

guraypp created this revision. Nov 4 2022, 5:25 AM

Herald added a reviewer: rriddle. Nov 4 2022, 5:25 AM

Herald added a reviewer: bondhugula.

Herald added a reviewer: ThomasRaoux.

Herald added a project: Restricted Project.

Herald added subscribers: Moerafaat, zero9178, bzcheeseman and 22 others.

guraypp requested review of this revision. Nov 4 2022, 5:25 AM

Herald added a reviewer: herhut. Nov 4 2022, 5:25 AM

Herald added a project: Restricted Project.

Herald added a subscriber: stephenneuendorffer.

guraypp added a comment. Nov 4 2022, 5:27 AM

This comment was removed by guraypp.

guraypp mentioned this in D136851: [mlir] Use processing unit names for `thread_dim_map` and `mapped to dims`. Nov 4 2022, 5:27 AM

guraypp edited the summary of this revision. Nov 4 2022, 5:32 AM

Harbormaster completed remote builds in B196118: Diff 473199. Nov 4 2022, 6:07 AM

nicolasvasilache added a comment.

Please move DeviceMappingInterface to SCF.

guraypp mentioned this in D137424: [mlir][transform] Introduce `gpu.map_foreach`. Nov 4 2022, 7:27 AM

This goes in the right direction IMO. I don't have a strong preference on where the interface should live. Putting it in SCF makes it a bit more "private"; I'm not sure whether we have layering issues with the GPU dialect depending on SCF for these interface purposes.

mlir/include/mlir/Dialect/GPU/TransformOps/GPUDeviceMapper.td
1	The naming is weird. "Mapper" implies there is a process of mapping. I'd rather call this `GPUDeviceMappingAttr.td`.
49–60	This is unused so let's not commit it.
mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
483	Can't we keep the default value here?
483	Also, we can use `TypedArrayAttrBase` here to specify that only `DeviceMappingAttr` are accepted.
mlir/include/mlir/Interfaces/DeviceMappingInterface.h
1	Nit: 80 cols.
mlir/include/mlir/Interfaces/DeviceMappingInterface.td
1	Nit: 80 cols
26	???
mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
174	Nit: expand `auto` unless the type is obvious from line context.
179	Nit: error messages must start with a small letter, unlike comments. Same above.
359	Nit: prefer C++-style casts.
mlir/lib/Dialect/SCF/IR/SCF.cpp
1120	Debug leftover?
mlir/lib/Dialect/SCF/Transforms/BufferizableOpInterfaceImpl.cpp
1147–1155	Can't we just use a `foreachThreadOp.getMapping().value_or({})` for the last argument instead? If not, wrap multi-line bodies into braces.
mlir/lib/Dialect/Utils/StaticValueUtils.cpp
70–74	`llvm::to_vector(arrayAttr.getAsRange<Attribute>())` is shorter and likely even removes the need for this function entirely. Just use that at the call sites.
mlir/lib/Interfaces/DeviceMappingInterface.cpp
1–2	This should have been one line.
mlir/test/Dialect/GPU/transform-gpu-failing.mlir
276	Please add the newline.
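
The `value_or` suggestion for BufferizableOpInterfaceImpl.cpp above can be sketched in standalone C++. This uses `std::optional` as a stand-in for the optional mapping returned by the op; `Mapping`, `mappingOrEmptyVerbose`, `mappingOrEmpty`, and `formsAgree` are hypothetical names, and the sketch spells out the fallback type rather than relying on a bare `{}`.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Stand-in for an array of mapping attributes (hypothetical type).
using Mapping = std::vector<std::string>;

// Verbose form: branch on whether a mapping is present.
Mapping mappingOrEmptyVerbose(const std::optional<Mapping> &mapping) {
  if (mapping.has_value())
    return *mapping;
  return Mapping{};
}

// Compact form: fall back to an empty mapping in a single expression.
Mapping mappingOrEmpty(const std::optional<Mapping> &mapping) {
  return mapping.value_or(Mapping{});
}

// Checks that both forms produce the same result for a given input.
bool formsAgree(const std::optional<Mapping> &mapping) {
  const bool same = mappingOrEmptyVerbose(mapping) == mappingOrEmpty(mapping);
  assert(same && "both forms are expected to be equivalent");
  return same;
}
```

The compact form removes the multi-line conditional the reviewer points at, at the cost of constructing the fallback value even when it is not needed.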

herhut added a comment.

I agree that this goes in the right direction, however, I am unsure what this direction is? Is there a writeup of where this should be going? That would make reviewing these changes easier.

Also, it would be nice to unify this with the mapping attributes on the scf.parallel operation, so that we have one way of doing this. Having a different encoding was fine for the initial experimentation but now that this turns into an interface, we should ensure it works with other scf operations, as well.

guraypp marked 14 inline comments as done. Nov 7 2022, 8:40 AM

guraypp added inline comments.

mlir/include/mlir/Dialect/SCF/IR/SCFOps.td
483	I removed this because we don't have a meaningful default value. I can introduce a new DeviceMappingAttr like `#scf.not_mapped` and make it the default value of `$mapping` as follow-up work.

guraypp edited the summary of this revision. Nov 7 2022, 8:41 AM

In D137413#3912032, @herhut wrote:

I agree that this goes in the right direction, however, I am unsure what this direction is? Is there a writeup of where this should be going? That would make reviewing these changes easier.

Also, it would be nice to unify this with the mapping attributes on the scf.parallel operation, so that we have one way of doing this. Having a different encoding was fine for the initial experimentation but now that this turns into an interface, we should ensure it works with other scf operations, as well.

When I originally implemented D136851, which just changed mappings from integers to strings, I thought it was NFC, so we don't have a write-up. The idea for the attribute interface originated in D136851, so I created this revision. I added more explanation to make things clear.

I think it is a good idea to make the mapping attribute of scf.parallel implement DeviceMappingAttrInterface. I can work on that as my next task.

address ftynse comments

Herald added a reviewer: aaron.ballman. Nov 7 2022, 8:52 AM

Harbormaster completed remote builds in B196505: Diff 473700. Nov 7 2022, 9:44 AM

In D137413#3908170, @nicolasvasilache wrote:

Please move DeviceMappingInterface to SCF.

+1 here. This looks very SCF specific at this point.

I'm also curious on what the direction is.

mlir/lib/IR/Builders.cpp
12	This looks unnecessary.

This revision now requires changes to proceed. Nov 7 2022, 3:42 PM

In D137413#3912032, @herhut wrote:

I agree that this goes in the right direction, however, I am unsure what this direction is? Is there a writeup of where this should be going? That would make reviewing these changes easier.

This is a refactoring related to https://discourse.llvm.org/t/rfc-parallel-abstraction-for-tensors-and-buffers/62607 and the follow-up thread_dim_mapping attribute that was added as we implemented the transforms for e2e execution.
We have now reached the limits of the array-of-integers encoding, which this PR addresses.

Also, it would be nice to unify this with the mapping attributes on the scf.parallel operation, so that we have one way of doing this. Having a different encoding was fine for the initial experimentation but now that this turns into an interface, we should ensure it works with other scf operations, as well.

+1 this would be a nice followup.

Thanks for addressing @ftynse 's comments.
Please move the interface to SCF and then we can iterate.

Moved DeviceMappingInterface to SCF.

guraypp edited the summary of this revision. Nov 8 2022, 10:49 AM

Harbormaster completed remote builds in B196747: Diff 474047. Nov 8 2022, 11:08 AM

Thanks @guraypp, our existing transformations should now be much more intuitive to use!

(I'll defer LGTM to others now)

mlir/include/mlir/Dialect/SCF/IR/DeviceMappingInterface.h
1 ↗	(On Diff #474047)	Cast Interfaces for MLIR?
mlir/include/mlir/Dialect/SCF/IR/DeviceMappingInterface.td
1 ↗	(On Diff #474047)	Data layout interfaces?

Address @rriddle comments, rebase, add description to gpu device mapping attributes

Harbormaster completed remote builds in B196857: Diff 474199. Nov 9 2022, 2:27 AM

bazel fix

Harbormaster completed remote builds in B196865: Diff 474213. Nov 9 2022, 3:50 AM

minor: fix comments

Harbormaster completed remote builds in B196874: Diff 474226. Nov 9 2022, 5:24 AM

rebase

Harbormaster completed remote builds in B196920: Diff 474293. Nov 9 2022, 9:31 AM

ftynse accepted this revision. Nov 10 2022, 8:02 AM

ftynse added inline comments.

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
364	Nit: please expand auto systematically unless the type is obvious from statement context (the RHS is a constructor or a cast) or obnoxious/impossible to spell (lambdas, some iterators).
397–398	Nit: this rewrapping looks spurious.

address @ftynse comments

guraypp edited the summary of this revision. Nov 10 2022, 8:52 AM

guraypp edited the summary of this revision. Nov 10 2022, 8:55 AM

Harbormaster completed remote builds in B197091: Diff 474560. Nov 10 2022, 9:12 AM

This revision was not accepted when it landed; it landed in state Needs Review. Nov 10 2022, 11:45 PM

Closed by commit rG6663f3470417: [mlir] Introduce device mapper attribute for `thread_dim_map` and `mapped to… (authored by guraypp).

This revision was automatically updated to reflect the committed changes.

guraypp added a commit: rG6663f3470417: [mlir] Introduce device mapper attribute for `thread_dim_map` and `mapped to….

guraypp mentioned this in D137891: [mlir][transform] Make `tile_to_foreach_thread_op` builder to use ArrayAttr. Nov 12 2022, 4:02 AM

guraypp mentioned this in rGd93be483eaf5: [mlir][transform] Make `tile_to_foreach_thread_op` builder to use ArrayAttr. Nov 12 2022, 10:27 AM

Revision Contents

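Path	Size
mlir/include/mlir/Dialect/GPU/IR/GPUDialect.h	2 lines
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td	1 line
mlir/include/mlir/Dialect/GPU/TransformOps/CMakeLists.txt	5 lines
mlir/include/mlir/Dialect/GPU/TransformOps/GPUDeviceMapper.td	63 lines
mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td	4 lines
mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h	5 lines
mlir/include/mlir/Dialect/SCF/IR/SCFOps.td	26 lines
mlir/include/mlir/Dialect/Utils/StaticValueUtils.h	2 lines
mlir/include/mlir/Interfaces/CMakeLists.txt	6 lines
mlir/include/mlir/Interfaces/DeviceMappingInterface.h	22 lines
mlir/include/mlir/Interfaces/DeviceMappingInterface.td	30 lines
mlir/lib/Dialect/GPU/TransformOps/CMakeLists.txt	5 lines
mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp	41 lines
mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp	8 lines
mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp	7 lines
mlir/lib/Dialect/SCF/IR/SCF.cpp	50 lines
mlir/lib/Dialect/SCF/Transforms/BufferizableOpInterfaceImpl.cpp	16 lines
mlir/lib/Dialect/Utils/StaticValueUtils.cpp	9 lines
mlir/lib/IR/Builders.cpp	1 line
mlir/lib/Interfaces/CMakeLists.txt	1 line
mlir/lib/Interfaces/DeviceMappingInterface.cpp	18 lines
mlir/test/Dialect/GPU/transform-gpu-failing.mlir	25 lines
mlir/test/Dialect/GPU/transform-gpu.mlir	16 lines
mlir/test/Dialect/Linalg/tile-to-foreach-thread.mlir	6 lines
mlir/test/Dialect/SCF/one-shot-bufferize-tensor-copy-insertion.mlir	4 lines
mlir/test/Dialect/SCF/ops.mlir	4 lines
mlir/test/lib/Dialect/Tensor/TestTensorTransforms.cpp	3 lines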

Diff 473199

mlir/include/mlir/Dialect/GPU/IR/GPUDialect.h

	Show First 20 Lines • Show All 166 Lines • ▼ Show 20 Lines
	} // namespace mlir			} // namespace mlir

	#include "mlir/Dialect/GPU/IR/GPUOpsEnums.h.inc"			#include "mlir/Dialect/GPU/IR/GPUOpsEnums.h.inc"

	#include "mlir/Dialect/GPU/IR/GPUOpsDialect.h.inc"			#include "mlir/Dialect/GPU/IR/GPUOpsDialect.h.inc"

	#include "mlir/Dialect/GPU/IR/GPUOpInterfaces.h.inc"			#include "mlir/Dialect/GPU/IR/GPUOpInterfaces.h.inc"

				#include "mlir/Interfaces/DeviceMappingInterface.h"

	#define GET_ATTRDEF_CLASSES			#define GET_ATTRDEF_CLASSES
	#include "mlir/Dialect/GPU/IR/GPUOpsAttributes.h.inc"			#include "mlir/Dialect/GPU/IR/GPUOpsAttributes.h.inc"

	#define GET_OP_CLASSES			#define GET_OP_CLASSES
	#include "mlir/Dialect/GPU/IR/GPUOps.h.inc"			#include "mlir/Dialect/GPU/IR/GPUOps.h.inc"

	#endif // MLIR_DIALECT_GPU_IR_GPUDIALECT_H			#endif // MLIR_DIALECT_GPU_IR_GPUDIALECT_H

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td

	Show All 10 Lines
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef GPU_OPS			#ifndef GPU_OPS
	#define GPU_OPS			#define GPU_OPS

	include "mlir/Dialect/DLTI/DLTIBase.td"			include "mlir/Dialect/DLTI/DLTIBase.td"
	include "mlir/Dialect/GPU/IR/GPUBase.td"			include "mlir/Dialect/GPU/IR/GPUBase.td"
	include "mlir/Dialect/GPU/IR/ParallelLoopMapperAttr.td"			include "mlir/Dialect/GPU/IR/ParallelLoopMapperAttr.td"
				include "mlir/Dialect/GPU/TransformOps/GPUDeviceMapper.td"
	include "mlir/IR/EnumAttr.td"			include "mlir/IR/EnumAttr.td"
	include "mlir/IR/FunctionInterfaces.td"			include "mlir/IR/FunctionInterfaces.td"
	include "mlir/IR/SymbolInterfaces.td"			include "mlir/IR/SymbolInterfaces.td"
	include "mlir/Interfaces/DataLayoutInterfaces.td"			include "mlir/Interfaces/DataLayoutInterfaces.td"
	include "mlir/Interfaces/InferIntRangeInterface.td"			include "mlir/Interfaces/InferIntRangeInterface.td"
	include "mlir/Interfaces/InferTypeOpInterface.td"			include "mlir/Interfaces/InferTypeOpInterface.td"
	include "mlir/Interfaces/SideEffectInterfaces.td"			include "mlir/Interfaces/SideEffectInterfaces.td"

	▲ Show 20 Lines • Show All 1,300 Lines • Show Last 20 Lines

mlir/include/mlir/Dialect/GPU/TransformOps/CMakeLists.txt

	set(LLVM_TARGET_DEFINITIONS GPUTransformOps.td)			set(LLVM_TARGET_DEFINITIONS GPUTransformOps.td)
	mlir_tablegen(GPUTransformOps.h.inc -gen-op-decls)			mlir_tablegen(GPUTransformOps.h.inc -gen-op-decls)
	mlir_tablegen(GPUTransformOps.cpp.inc -gen-op-defs)			mlir_tablegen(GPUTransformOps.cpp.inc -gen-op-defs)
	add_public_tablegen_target(MLIRGPUTransformOpsIncGen)			add_public_tablegen_target(MLIRGPUTransformOpsIncGen)

	add_mlir_doc(GPUTransformOps GPUTransformOps Dialects/ -gen-op-doc)			add_mlir_doc(GPUTransformOps GPUTransformOps Dialects/ -gen-op-doc)

				set(LLVM_TARGET_DEFINITIONS GPUDeviceMapper.td)
				mlir_tablegen(GPUDeviceMapperEnums.h.inc -gen-enum-decls)
				mlir_tablegen(GPUDeviceMapperEnums.cpp.inc -gen-enum-defs)
				add_public_tablegen_target(MLIRGPUDeviceMapperEnumsGen)

mlir/include/mlir/Dialect/GPU/TransformOps/GPUDeviceMapper.td

This file was added.

				//===-- GPUDeviceMapper.td - Attribute definition ----------- tablegen --===//
				ftynse: The naming is weird. "Mapper" implies there is a process of mapping. I'd rather call this `GPUDeviceMappingAttr.td`.
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// Defines the attribute used to map loops to gpu.
				//
				//===----------------------------------------------------------------------===//

				#ifndef GPU_DEVICE_MAPPER_ATTR
				#define GPU_DEVICE_MAPPER_ATTR

				include "mlir/Dialect/GPU/IR/GPUBase.td"
				include "mlir/IR/EnumAttr.td"
				include "mlir/Interfaces/DeviceMappingInterface.td"

				def DimX : I64EnumAttrCase<"DimX", 0, "x">;
				def DimY : I64EnumAttrCase<"DimY", 1, "y">;
				def DimZ : I64EnumAttrCase<"DimZ", 2, "z">;

				def ThreadsEnum : I64EnumAttr<"Threads", "threads for loop mapping", [
				DimX, DimY, DimZ]> {
				let cppNamespace = "::mlir::gpu";
				}

				def GPUThreadMappingAttr
				: GPU_Attr<"GPUThreadMapping", "thread", [ DeviceMappingAttrInterface ]> {
				let parameters = (ins
				EnumParameter<ThreadsEnum>:$thread
				);
				let assemblyFormat = "`<` params `>`";
				}

				def BlocksEnum : I64EnumAttr<"Blocks", "threads for loop mapping", [
				DimX, DimY, DimZ]> {
				let cppNamespace = "::mlir::gpu";
				}

				def GPUBlockMappingAttr : GPU_Attr<"GPUBlockMapping", "block", [ DeviceMappingAttrInterface ] > {
				let parameters = (ins
				EnumParameter<BlocksEnum>:$block
				);
				let assemblyFormat = "`<` params `>`";
				}

				def GlobalEnum : I64EnumAttr<"Global", "threads+blocks for loop mapping", [
				DimX, DimY, DimZ]> {
				let cppNamespace = "::mlir::gpu";
				}

				def GPUGlobalMappingAttr
				: GPU_Attr<"GPUGlobalMapping", "global", [ DeviceMappingAttrInterface ]> {
				let parameters = (ins
				EnumParameter<ThreadsEnum>:$global
				);
				let assemblyFormat = "`<` params `>`";
				}
				ftynse: This is unused so let's not commit it.


				#endif // GPU_DEVICE_MAPPER_ATTR

mlir/include/mlir/Dialect/Linalg/TransformOps/LinalgTransformOps.td

Show First 20 Lines • Show All 721 Lines • ▼ Show 20 Lines	let description = [{
```		```
}];		}];

let arguments = (ins PDL_Operation:$target,		let arguments = (ins PDL_Operation:$target,
Variadic<PDL_Operation>:$num_threads,		Variadic<PDL_Operation>:$num_threads,
Variadic<PDL_Operation>:$tile_sizes,		Variadic<PDL_Operation>:$tile_sizes,
DefaultValuedAttr<I64ArrayAttr, "{}">:$static_num_threads,		DefaultValuedAttr<I64ArrayAttr, "{}">:$static_num_threads,
DefaultValuedAttr<I64ArrayAttr, "{}">:$static_tile_sizes,		DefaultValuedAttr<I64ArrayAttr, "{}">:$static_tile_sizes,
OptionalAttr<I64ArrayAttr>:$thread_dim_mapping);		OptionalAttr<ArrayAttr>:$mapping);
let results = (outs PDL_Operation:$foreach_thread_op,		let results = (outs PDL_Operation:$foreach_thread_op,
PDL_Operation:$tiled_op);		PDL_Operation:$tiled_op);
let assemblyFormat = [{		let assemblyFormat = [{
$target oilist(		$target oilist(
`num_threads` custom<DynamicIndexList>($num_threads,		`num_threads` custom<DynamicIndexList>($num_threads,
$static_num_threads,		$static_num_threads,
"ShapedType::kDynamicSize") \|		"ShapedType::kDynamicSize") \|
`tile_sizes` custom<DynamicIndexList>($tile_sizes,		`tile_sizes` custom<DynamicIndexList>($tile_sizes,
$static_tile_sizes,		$static_tile_sizes,
"ShapedType::kDynamicSize"))		"ShapedType::kDynamicSize"))
(`(` `mapped` `to` `dims` $thread_dim_mapping^ `)`)? attr-dict		(`(` `mapping` `=` $mapping^ `)`)? attr-dict
}];		}];
let hasVerifier = 1;		let hasVerifier = 1;

let extraClassDeclaration = [{		let extraClassDeclaration = [{
::mlir::DiagnosedSilenceableFailure apply(		::mlir::DiagnosedSilenceableFailure apply(
::mlir::transform::TransformResults &transformResults,		::mlir::transform::TransformResults &transformResults,
::mlir::transform::TransformState &state);		::mlir::transform::TransformState &state);

▲ Show 20 Lines • Show All 117 Lines • Show Last 20 Lines

mlir/include/mlir/Dialect/Linalg/Transforms/Transforms.h

	Show All 13 Lines
	#include "mlir/Conversion/VectorToSCF/VectorToSCF.h"			#include "mlir/Conversion/VectorToSCF/VectorToSCF.h"
	#include "mlir/Dialect/Linalg/Utils/Utils.h"			#include "mlir/Dialect/Linalg/Utils/Utils.h"
	#include "mlir/Dialect/MemRef/IR/MemRef.h"			#include "mlir/Dialect/MemRef/IR/MemRef.h"
	#include "mlir/Dialect/SCF/Utils/Utils.h"			#include "mlir/Dialect/SCF/Utils/Utils.h"
	#include "mlir/Dialect/Tensor/IR/Tensor.h"			#include "mlir/Dialect/Tensor/IR/Tensor.h"
	#include "mlir/Dialect/Utils/StaticValueUtils.h"			#include "mlir/Dialect/Utils/StaticValueUtils.h"
	#include "mlir/Dialect/Vector/Transforms/VectorTransforms.h"			#include "mlir/Dialect/Vector/Transforms/VectorTransforms.h"
	#include "mlir/Dialect/X86Vector/Transforms.h"			#include "mlir/Dialect/X86Vector/Transforms.h"
				#include "mlir/IR/BuiltinAttributes.h"
	#include "mlir/IR/PatternMatch.h"			#include "mlir/IR/PatternMatch.h"
	#include "mlir/Interfaces/TilingInterface.h"			#include "mlir/Interfaces/TilingInterface.h"
	#include "mlir/Transforms/DialectConversion.h"			#include "mlir/Transforms/DialectConversion.h"
	#include "llvm/ADT/SmallBitVector.h"			#include "llvm/ADT/SmallBitVector.h"
	#include "llvm/ADT/SmallSet.h"			#include "llvm/ADT/SmallSet.h"

	namespace mlir {			namespace mlir {
	namespace bufferization {			namespace bufferization {
	▲ Show 20 Lines • Show All 401 Lines • ▼ Show 20 Lines
	/// (i.e. that only tiles parallel dimensions, e.g. in the Linalg case).			/// (i.e. that only tiles parallel dimensions, e.g. in the Linalg case).
	struct ForeachThreadTilingResult {			struct ForeachThreadTilingResult {
	Operation *tileOp;			Operation *tileOp;
	Operation *tiledOp;			Operation *tiledOp;
	};			};
	FailureOr<ForeachThreadTilingResult>			FailureOr<ForeachThreadTilingResult>
	tileToForeachThreadOp(RewriterBase &builder, TilingInterface op,			tileToForeachThreadOp(RewriterBase &builder, TilingInterface op,
	ArrayRef<OpFoldResult> numThreads,			ArrayRef<OpFoldResult> numThreads,
	ArrayRef<int64_t> threadDimMapping = {});			ArrayRef<Attribute> threadDimMapping = {});

	/// Same as `tileToForeachThreadOp`, but calculate the number of threads			/// Same as `tileToForeachThreadOp`, but calculate the number of threads
	/// required using the given tileSizes.			/// required using the given tileSizes.
	FailureOr<ForeachThreadTilingResult>			FailureOr<ForeachThreadTilingResult>
	tileToForeachThreadOpUsingTileSizes(RewriterBase &builder, TilingInterface op,			tileToForeachThreadOpUsingTileSizes(RewriterBase &builder, TilingInterface op,
	ArrayRef<OpFoldResult> tileSizes,			ArrayRef<OpFoldResult> tileSizes,
	ArrayRef<int64_t> threadDimMapping = {});			ArrayRef<Attribute> threadDimMapping = {});

	/// All indices returned by IndexOp should be invariant with respect to			/// All indices returned by IndexOp should be invariant with respect to
	/// tiling. Therefore, if an operation is tiled, we have to transform the			/// tiling. Therefore, if an operation is tiled, we have to transform the
	/// indices accordingly, i.e. offset them by the values of the corresponding			/// indices accordingly, i.e. offset them by the values of the corresponding
	/// induction variables that are captured implicitly in the body of the op.			/// induction variables that are captured implicitly in the body of the op.
	///			///
	/// Example. `linalg.generic` before tiling:			/// Example. `linalg.generic` before tiling:
	///			///
	▲ Show 20 Lines • Show All 571 Lines • Show Last 20 Lines

mlir/include/mlir/Dialect/SCF/IR/SCFOps.td

Show First 20 Lines • Show All 372 Lines • ▼ Show 20 Lines	let description = [{
used. This ensures that memory side effects of a thread are not visible to		used. This ensures that memory side effects of a thread are not visible to
other threads (or in the parent body), apart from explicitly shared tensors.		other threads (or in the parent body), apart from explicitly shared tensors.

The name "thread" conveys the fact that the parallel execution is mapped		The name "thread" conveys the fact that the parallel execution is mapped
(i.e. distributed) to a set of virtual threads of execution, one function		(i.e. distributed) to a set of virtual threads of execution, one function
application per thread. Further lowerings are responsible for specifying		application per thread. Further lowerings are responsible for specifying
how this is materialized on concrete hardware resources.		how this is materialized on concrete hardware resources.

An optional thread_dim_mapping index array attribute specifies for each		An optional mapping is an attribute array that specifies processing units
virtual thread dimension, how it remaps 1-1 to a set of concrete processing		with their dimension, how it remaps 1-1 to a set of concrete processing
element resources (e.g. a CUDA grid dimension or a level of concrete nested		element resources (e.g. a CUDA grid dimension or a level of concrete nested
async parallelism). At this time, the specification is backend-dependent and		async parallelism). At this time, the specification is backend-dependent and
is not verified by the op, beyond being an index array attribute.		is not verified by the op, beyond being an index array attribute.
It is the reponsibility of the lowering to interpret the index array in the		It is the reponsibility of the lowering to interpret the index array in the
context of the concrete target the op is lowered to, or to ignore it when		context of the concrete target the op is lowered to, or to ignore it when
the specification is ill-formed or unsupported for a particular target.		the specification is ill-formed or unsupported for a particular target.

The only allowed terminator is `scf.foreach_thread.perform_concurrently`.		The only allowed terminator is `scf.foreach_thread.perform_concurrently`.
▲ Show 20 Lines • Show All 44 Lines • ▼ Show 20 Lines	%matmul_and_pointwise:2 = scf.foreach_thread (%thread_id_1, %thread_id_2) in
tensor<?xT> into tensor<?xT>		tensor<?xT> into tensor<?xT>
}		}
}		}
// Implicit synchronization point.		// Implicit synchronization point.
// Sequential context.		// Sequential context.
//		//
```		```

Example with thread_dim_mapping attribute:		Example with mapping attribute:

```mlir		```mlir
//		//
// Sequential context.		// Sequential context.
//		//
%matmul_and_pointwise:2 = scf.foreach_thread (%thread_id_1, %thread_id_2) in		%matmul_and_pointwise:2 = scf.foreach_thread (%thread_id_1, %thread_id_2) in
(%num_threads_1, %numthread_id_2) shared_outs(...)		(%num_threads_1, %numthread_id_2) shared_outs(...)
-> (tensor<?x?xT>, tensor<?xT>) {		-> (tensor<?x?xT>, tensor<?xT>) {
//		//
// Parallel context, each thread with id = (%thread_id_2, %thread_id_1)		// Parallel context, each thread with id = (%thread_id_2, %thread_id_1)
// runs its version of the code.		// runs its version of the code.
//		//
scf.foreach_thread.perform_concurrently {		scf.foreach_thread.perform_concurrently {
...		...
}		}
} { thread_dim_mapping = [1, 0] }		} { mapping = [#gpu.thread<x>, #gpu.thread<y>] }
// Implicit synchronization point.		// Implicit synchronization point.
// Sequential context.		// Sequential context.
//		//
```		```

Example with privatized tensors:		Example with privatized tensors:

```mlir		```mlir
%t0 = ...		%t0 = ...
%t1 = ...		%t1 = ...
%r = scf.foreach_thread ... shared_outs(%o = t0) -> tensor<?xf32> {		%r = scf.foreach_thread ... shared_outs(%o = t0) -> tensor<?xf32> {
// %t0 and %t1 are privatized. %t0 is definitely copied for each thread		// %t0 and %t1 are privatized. %t0 is definitely copied for each thread
// because the scf.foreach_thread op's %t0 use bufferizes to a memory		// because the scf.foreach_thread op's %t0 use bufferizes to a memory
// write. In the absence of other conflicts, %t1 is copied only if there		// write. In the absence of other conflicts, %t1 is copied only if there
// are uses of %t1 in the body that bufferize to a memory read and to a		// are uses of %t1 in the body that bufferize to a memory read and to a
// memory write.		// memory write.
"some_use"(%t0)		"some_use"(%t0)
"some_use"(%t1)		"some_use"(%t1)
}		}
```		```
}];		}];
let arguments = (ins Variadic<Index>:$num_threads,		let arguments = (ins Variadic<Index>:$num_threads,
Variadic<AnyRankedTensor>:$outputs,		Variadic<AnyRankedTensor>:$outputs,
DefaultValuedAttr<I64ArrayAttr, "{}">:$thread_dim_mapping);		OptionalAttr<ArrayAttr>:$mapping);
		ftynse: Can't we keep the default value here?
		guraypp: I removed this because we don't have a meaningful default value. I can introduce a new DeviceMappingAttr like `#scf.not_mapped` and make it the default value of `$mapping` as follow-up work.
		ftynse: Also, we can use `TypedArrayAttrBase` here to specify that only `DeviceMappingAttr` are accepted.

let results = (outs Variadic<AnyType>:$results);		let results = (outs Variadic<AnyType>:$results);
let regions = (region SizedRegion<1>:$region);		let regions = (region SizedRegion<1>:$region);

let hasCustomAssemblyFormat = 1;		let hasCustomAssemblyFormat = 1;
let hasVerifier = 1;		let hasVerifier = 1;

// The default builder does not add the proper body BBargs, roll our own.		// The default builder does not add the proper body BBargs, roll our own.
let skipDefaultBuilders = 1;		let skipDefaultBuilders = 1;
let builders = [		let builders = [
// Bodyless builder, outputs must be specified.		// Bodyless builder, outputs must be specified.
OpBuilder<(ins "ValueRange":$outputs, "ValueRange":$num_threads,		OpBuilder<(ins "ValueRange":$outputs, "ValueRange":$num_threads,
CArg<"ArrayRef<int64_t>", "{}">:$thread_dim_mapping)>,		CArg<"ArrayRef<Attribute>", "{}">:$mapping)>,
// Builder that takes a bodyBuilder lambda.		// Builder that takes a bodyBuilder lambda.
OpBuilder<(ins "ValueRange":$outputs, "ValueRange":$num_threads,		OpBuilder<(ins "ValueRange":$outputs, "ValueRange":$num_threads,
"ArrayRef<int64_t>":$thread_dim_mapping,		"ArrayRef<Attribute>":$mapping,
"function_ref<void(OpBuilder &, Location, ValueRange)>":$bodyBuilder)>		"function_ref<void(OpBuilder &, Location, ValueRange)>":$bodyBuilder)>
];		];
let extraClassDeclaration = [{		let extraClassDeclaration = [{
int64_t getRank() { return getNumThreads().size(); }		int64_t getRank() { return getNumThreads().size(); }

OpResult getTiedOpResult(OpOperand *opOperand) {		OpResult getTiedOpResult(OpOperand *opOperand) {
assert(opOperand->getOperandNumber() >= getRank() && "invalid operand");		assert(opOperand->getOperandNumber() >= getRank() && "invalid operand");
return getOperation()->getOpResult(		return getOperation()->getOpResult(
Show All 22 Lines	::mlir::Value getThreadIndex(int64_t idx) {
return getThreadIndices()[idx];		return getThreadIndices()[idx];
}		}

::mlir::Block::BlockArgListType getRegionOutArgs() {		::mlir::Block::BlockArgListType getRegionOutArgs() {
return getBody()->getArguments().drop_front(getRank());		return getBody()->getArguments().drop_front(getRank());
}		}

/// Return the thread indices in the order specified by the		/// Return the thread indices in the order specified by the
/// thread_dim_mapping attribute. Return failure is		/// given mapping argument. Return failure is
/// thread_dim_mapping is not a valid permutation.		/// mapping is not a valid permutation.
FailureOr<SmallVector<Value>> getPermutedThreadIndices();		FailureOr<SmallVector<Value>> getPermutedThreadIndices(ArrayRef<int64_t> mapping);

/// Return the number of threads in the order specified by the		/// Return the number of threads in the order specified by the
/// thread_dim_mapping attribute.		/// given mapping argument.
/// Return failure is thread_dim_mapping is not a valid permutation.		/// Return failure is mapping is not a valid permutation.
FailureOr<SmallVector<OpFoldResult>> getPermutedNumThreads(OpBuilder &b);		FailureOr<SmallVector<OpFoldResult>> getPermutedNumThreads(OpBuilder &b, ArrayRef<int64_t> mapping);

// The ensureTerminator method generated by SingleBlockImplicitTerminator is		// The ensureTerminator method generated by SingleBlockImplicitTerminator is
// unaware of the fact that our terminator also needs a region to be		// unaware of the fact that our terminator also needs a region to be
// well-formed. We override it here to ensure that we do the right thing.		// well-formed. We override it here to ensure that we do the right thing.
static void ensureTerminator(Region &region, OpBuilder &builder, Location loc);		static void ensureTerminator(Region &region, OpBuilder &builder, Location loc);

PerformConcurrentlyOp getTerminator();		PerformConcurrentlyOp getTerminator();
}];		}];
▲ Show 20 Lines • Show All 536 Lines • Show Last 20 Lines
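
Note on the new `mapping` argument of `getPermutedThreadIndices` and `getPermutedNumThreads`: both are documented to fail when `mapping` is not a valid permutation. The diff does not show their bodies, so the following is only a standalone sketch of one way such a check can work. It assumes that element `i` is placed at position `mapping[i]`. The helper name `permuteByMapping` and the use of `std::optional` in place of `FailureOr` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in, not the MLIR implementation.
// Places vals[i] at position perm[i]. Returns std::nullopt unless perm is a
// valid permutation of 0..n-1 (same length, in range, no duplicates).
template <typename T>
std::optional<std::vector<T>>
permuteByMapping(const std::vector<T> &vals, const std::vector<int64_t> &perm) {
  if (vals.size() != perm.size())
    return std::nullopt;
  std::vector<T> result(vals.size());
  std::vector<bool> seen(vals.size(), false);
  for (std::size_t i = 0; i < perm.size(); ++i) {
    int64_t idx = perm[i];
    // Reject out-of-range or repeated target positions.
    if (idx < 0 || static_cast<std::size_t>(idx) >= vals.size() || seen[idx])
      return std::nullopt;
    result[idx] = vals[i];
    seen[idx] = true;
  }
  // Equal sizes plus distinct in-range targets imply every slot was filled.
  return result;
}
```

Under this assumed convention, a repeated entry such as `{0, 0, 1}` is rejected rather than silently overwriting a slot.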

mlir/include/mlir/Dialect/Utils/StaticValueUtils.h

	Show First 20 Lines • Show All 51 Lines • ▼ Show 20 Lines
	void dispatchIndexOpFoldResults(ArrayRef<OpFoldResult> ofrs,			void dispatchIndexOpFoldResults(ArrayRef<OpFoldResult> ofrs,
	SmallVectorImpl<Value> &dynamicVec,			SmallVectorImpl<Value> &dynamicVec,
	SmallVectorImpl<int64_t> &staticVec,			SmallVectorImpl<int64_t> &staticVec,
	int64_t sentinel);			int64_t sentinel);

	/// Extract int64_t values from the assumed ArrayAttr of IntegerAttr.			/// Extract int64_t values from the assumed ArrayAttr of IntegerAttr.
	SmallVector<int64_t, 4> extractFromI64ArrayAttr(Attribute attr);			SmallVector<int64_t, 4> extractFromI64ArrayAttr(Attribute attr);

				/// Extract Attribute values from the assumed ArrayAttr of IntegerAttr.
				SmallVector<Attribute> extractFromAttributeArrayAttr(ArrayAttr arrayAttr);
	/// Given a value, try to extract a constant Attribute. If this fails, return			/// Given a value, try to extract a constant Attribute. If this fails, return
	/// the original value.			/// the original value.
	OpFoldResult getAsOpFoldResult(Value val);			OpFoldResult getAsOpFoldResult(Value val);

	/// Given an array of values, try to extract a constant Attribute from each			/// Given an array of values, try to extract a constant Attribute from each
	/// value. If this fails, return the original value.			/// value. If this fails, return the original value.
	SmallVector<OpFoldResult> getAsOpFoldResult(ValueRange values);			SmallVector<OpFoldResult> getAsOpFoldResult(ValueRange values);

	Show All 25 Lines

mlir/include/mlir/Interfaces/CMakeLists.txt

	Show All 18 Lines
	mlir_tablegen(DataLayoutAttrInterface.cpp.inc -gen-attr-interface-defs)			mlir_tablegen(DataLayoutAttrInterface.cpp.inc -gen-attr-interface-defs)
	mlir_tablegen(DataLayoutOpInterface.h.inc -gen-op-interface-decls)			mlir_tablegen(DataLayoutOpInterface.h.inc -gen-op-interface-decls)
	mlir_tablegen(DataLayoutOpInterface.cpp.inc -gen-op-interface-defs)			mlir_tablegen(DataLayoutOpInterface.cpp.inc -gen-op-interface-defs)
	mlir_tablegen(DataLayoutTypeInterface.h.inc -gen-type-interface-decls)			mlir_tablegen(DataLayoutTypeInterface.h.inc -gen-type-interface-decls)
	mlir_tablegen(DataLayoutTypeInterface.cpp.inc -gen-type-interface-defs)			mlir_tablegen(DataLayoutTypeInterface.cpp.inc -gen-type-interface-defs)
	add_public_tablegen_target(MLIRDataLayoutInterfacesIncGen)			add_public_tablegen_target(MLIRDataLayoutInterfacesIncGen)
	add_dependencies(mlir-generic-headers MLIRDataLayoutInterfacesIncGen)			add_dependencies(mlir-generic-headers MLIRDataLayoutInterfacesIncGen)

				set(LLVM_TARGET_DEFINITIONS DeviceMappingInterface.td)
				mlir_tablegen(DeviceMappingAttrInterface.h.inc -gen-attr-interface-decls)
				mlir_tablegen(DeviceMappingAttrInterface.cpp.inc -gen-attr-interface-defs)
				add_public_tablegen_target(MLIRDeviceMappingInterfacesIncGen)
				add_dependencies(mlir-generic-headers MLIRDeviceMappingInterfacesIncGen)

	add_mlir_doc(DataLayoutInterfaces			add_mlir_doc(DataLayoutInterfaces
	DataLayoutAttrInterface			DataLayoutAttrInterface
	Interfaces/			Interfaces/
	-gen-attr-interface-docs)			-gen-attr-interface-docs)

	add_mlir_doc(DataLayoutInterfaces			add_mlir_doc(DataLayoutInterfaces
	DataLayoutTypeInterface			DataLayoutTypeInterface
	Interfaces/			Interfaces/
	-gen-type-interface-docs)			-gen-type-interface-docs)

	add_mlir_doc(DataLayoutInterfaces			add_mlir_doc(DataLayoutInterfaces
	DataLayoutOpInterface			DataLayoutOpInterface
	Interfaces/			Interfaces/
	-gen-op-interface-docs)			-gen-op-interface-docs)

mlir/include/mlir/Interfaces/DeviceMappingInterface.h

This file was added.

				//===- DeviceMappingInterface.h - Cast Interfaces for MLIR --- C++ --===//
				ftynse (inline comment, Done): Nit: 80 cols.
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file contains the definitions of the cast interfaces defined in
				// `DeviceMappingInterface.td`.
				//
				//===----------------------------------------------------------------------===//

				#ifndef MLIR_DEVICEMAPPINGINTERFACE_H
				#define MLIR_DEVICEMAPPINGINTERFACE_H

				#include "mlir/IR/OpDefinition.h"

				/// Include the generated interface declarations.
				#include "mlir/Interfaces/DeviceMappingAttrInterface.h.inc"

				#endif // MLIR_DEVICEMAPPINGINTERFACE_H

mlir/include/mlir/Interfaces/DeviceMappingInterface.td

This file was added.

				//===- DeviceMappingInterface.td - Data layout interfaces ----- tablegen --===//
				ftynse (inline comment, Done): Nit: 80 cols.
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// Defines the interfaces for the device mapping specification for the loops.
				//
				//===----------------------------------------------------------------------===//

				#ifndef MLIR_DEVICEMAPPINGINTERFACE
				#define MLIR_DEVICEMAPPINGINTERFACE

				include "mlir/IR/OpBase.td"

				//===----------------------------------------------------------------------===//
				// Attribute interfaces
				//===----------------------------------------------------------------------===//

				def DeviceMappingAttrInterface : AttrInterface<"DeviceMappingAttrInterface"> {
				let cppNamespace = "::mlir";
				let description = [{
				Attribute interface describing a loop in an op can be mapped to a device.
				...
				ftynse (inline comment, Done): ???
				}];
				}

				#endif // MLIR_DEVICEMAPPINGINTERFACE

mlir/lib/Dialect/GPU/TransformOps/CMakeLists.txt

	add_mlir_dialect_library(MLIRGPUTransformOps			add_mlir_dialect_library(MLIRGPUTransformOps
	GPUTransformOps.cpp			GPUTransformOps.cpp

	ADDITIONAL_HEADER_DIRS			ADDITIONAL_HEADER_DIRS
	${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/GPU/TransformOps			${MLIR_MAIN_INCLUDE_DIR}/mlir/Dialect/GPU/TransformOps
				${MLIR_MAIN_INCLUDE_DIR}/mlir/Interfaces

	DEPENDS			DEPENDS
	MLIRGPUTransformOpsIncGen			MLIRGPUTransformOpsIncGen
				MLIRDeviceMappingInterfacesIncGen
				MLIRGPUDeviceMapperEnumsGen

	LINK_LIBS PUBLIC			LINK_LIBS PUBLIC
	MLIRIR			MLIRIR
	MLIRGPUTransforms			MLIRGPUTransforms
	MLIRParser			MLIRParser
	MLIRPDLDialect			MLIRPDLDialect
	MLIRSideEffectInterfaces			MLIRSideEffectInterfaces
	MLIRTransformDialect			MLIRTransformDialect
	MLIRGPUOps			MLIRGPUOps
	)			)

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp

Show All 9 Lines

#include "mlir/Dialect/Arith/IR/Arith.h"		#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"		#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/GPU/TransformOps/GPUTransformOps.h"		#include "mlir/Dialect/GPU/TransformOps/GPUTransformOps.h"
#include "mlir/Dialect/PDL/IR/PDL.h"		#include "mlir/Dialect/PDL/IR/PDL.h"
#include "mlir/Dialect/SCF/IR/SCF.h"		#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/Dialect/Transform/IR/TransformDialect.h"		#include "mlir/Dialect/Transform/IR/TransformDialect.h"
#include "mlir/Dialect/Transform/IR/TransformInterfaces.h"		#include "mlir/Dialect/Transform/IR/TransformInterfaces.h"
		#include "mlir/IR/Builders.h"
		#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Diagnostics.h"		#include "mlir/IR/Diagnostics.h"
#include "mlir/IR/Value.h"		#include "mlir/IR/Value.h"
#include "llvm/ADT/None.h"		#include "llvm/ADT/None.h"
#include "llvm/ADT/Optional.h"		#include "llvm/ADT/Optional.h"

using namespace mlir;		using namespace mlir;
using namespace mlir::gpu;		using namespace mlir::gpu;
using namespace mlir::transform;		using namespace mlir::transform;
▲ Show 20 Lines • Show All 135 Lines • ▼ Show 20 Lines	if (foreachThreadOp.getNumResults() > 0)
return transformOp.emitSilenceableError()		return transformOp.emitSilenceableError()
<< "only bufferized scf.foreach_thread lowers to gpu.block_id";		<< "only bufferized scf.foreach_thread lowers to gpu.block_id";
if (foreachThreadOp.getNumThreads().size() > 3)		if (foreachThreadOp.getNumThreads().size() > 3)
return transformOp.emitSilenceableError()		return transformOp.emitSilenceableError()
<< "scf.foreach_thread with rank > 3 does not lower to gpu.block_id";		<< "scf.foreach_thread with rank > 3 does not lower to gpu.block_id";

// Step 0. Outline the compute workload region and set up the workload		// Step 0. Outline the compute workload region and set up the workload
// operands.		// operands.
		SmallVector<int64_t> mapping;
		if (!foreachThreadOp.getMapping().has_value())
		return transformOp.emitSilenceableError() << "Mapping must be present";
		for (auto map : *foreachThreadOp.getMapping()) {
		ftynse (inline comment, Done): Nit: expand `auto` unless the type is obvious from line context.
		if (auto blockMap = map.dyn_cast<GPUBlockMappingAttr>()) {
		mapping.push_back((int64_t)blockMap.getBlock());
		} else {
		return transformOp.emitSilenceableError()
		<< "Mapping must be #gpu.block<x/y/z/>";
		ftynse (inline comment, Done): Nit: error messages must start with a small letter, unlike comments. Same above.
		}
		}

FailureOr<SmallVector<OpFoldResult>> potentialGridDim =		FailureOr<SmallVector<OpFoldResult>> potentialGridDim =
foreachThreadOp.getPermutedNumThreads(rewriter);		foreachThreadOp.getPermutedNumThreads(rewriter, mapping);

if (failed(potentialGridDim) ||		if (failed(potentialGridDim) ||
llvm::any_of(*potentialGridDim, [](OpFoldResult ofr) {		llvm::any_of(*potentialGridDim, [](OpFoldResult ofr) {
return !getConstantIntValue(ofr).has_value();		return !getConstantIntValue(ofr).has_value();
})) {		})) {
return transformOp.emitSilenceableError() << "unsupported dynamic gridDim";		return transformOp.emitSilenceableError() << "unsupported dynamic gridDim";
}		}

Show All 9 Lines	DiagnosedSilenceableFailure mlir::transform::gpu::mapForeachToBlocksImpl(
Block *targetBlock = foreachThreadOp->getBlock();		Block *targetBlock = foreachThreadOp->getBlock();
Block::iterator insertionPoint = Block::iterator(foreachThreadOp);		Block::iterator insertionPoint = Block::iterator(foreachThreadOp);
Block &sourceBlock = foreachThreadOp.getRegion().front();		Block &sourceBlock = foreachThreadOp.getRegion().front();
targetBlock->getOperations().splice(insertionPoint,		targetBlock->getOperations().splice(insertionPoint,
sourceBlock.getOperations());		sourceBlock.getOperations());

// Step 2. RAUW thread indices to thread ops.		// Step 2. RAUW thread indices to thread ops.
SmallVector<Value> threadIndices =		SmallVector<Value> threadIndices =
*foreachThreadOp.getPermutedThreadIndices();		*foreachThreadOp.getPermutedThreadIndices(mapping);
assert(blockOps.size() == 3 && "3 block id ops are required");		assert(blockOps.size() == 3 && "3 block id ops are required");
for (auto [blockIdx, blockOp] : llvm::zip(threadIndices, blockOps)) {		for (auto [blockIdx, blockOp] : llvm::zip(threadIndices, blockOps)) {
Value val = blockIdx;		Value val = blockIdx;
Value blkOp = blockOp;		Value blkOp = blockOp;
if (!val)		if (!val)
continue;		continue;
for (Operation *user : llvm::make_early_inc_range(val.getUsers()))		for (Operation *user : llvm::make_early_inc_range(val.getUsers()))
user->replaceUsesOfWith(val, blkOp);		user->replaceUsesOfWith(val, blkOp);
Show All 20 Lines	DiagnosedSilenceableFailure mlir::transform::gpu::findTopLevelForeachThreadOp(

if (walkResult.wasInterrupted())		if (walkResult.wasInterrupted())
return transformOp.emitSilenceableError()		return transformOp.emitSilenceableError()
<< "could not find a unique topLevel scf.foreach_thread";		<< "could not find a unique topLevel scf.foreach_thread";
return DiagnosedSilenceableFailure::success();		return DiagnosedSilenceableFailure::success();
}		}

/// This is a helper that is only used in		/// This is a helper that is only used in
/// rewriteTopLevelForeachThreadToGpuBlocks. It generates GPU dialects block_id.		/// rewriteTopLevelForeachThreadToGpuBlocks. It generates GPU dialects
		/// block_id.
static void generateGpuBlockIds(RewriterBase &rewriter,		static void generateGpuBlockIds(RewriterBase &rewriter,
scf::ForeachThreadOp foreachOp,		scf::ForeachThreadOp foreachOp,
SmallVectorImpl<Value> &blockOps) {		SmallVectorImpl<Value> &blockOps) {
Location loc = foreachOp->getLoc();		Location loc = foreachOp->getLoc();
OpBuilder::InsertionGuard guard(rewriter);		OpBuilder::InsertionGuard guard(rewriter);
rewriter.setInsertionPoint(foreachOp);		rewriter.setInsertionPoint(foreachOp);
IndexType indexType = rewriter.getIndexType();		IndexType indexType = rewriter.getIndexType();
SmallVector<Dimension> gpuDims{Dimension::x, Dimension::y, Dimension::z};		SmallVector<Dimension> gpuDims{Dimension::x, Dimension::y, Dimension::z};
▲ Show 20 Lines • Show All 48 Lines • ▼ Show 20 Lines	if (getGenerateGpuLaunch()) {
topLevelForeachThreadOp = cast<scf::ForeachThreadOp>(newForeachThreadOp);		topLevelForeachThreadOp = cast<scf::ForeachThreadOp>(newForeachThreadOp);
}		}

SmallVector<int64_t> gridDim = extractFromI64ArrayAttr(getGridDim());		SmallVector<int64_t> gridDim = extractFromI64ArrayAttr(getGridDim());
diag = mlir::transform::gpu::mapForeachToBlocksImpl(		diag = mlir::transform::gpu::mapForeachToBlocksImpl(
rewriter, topLevelForeachThreadOp, generateGpuBlockIds, gridDim,		rewriter, topLevelForeachThreadOp, generateGpuBlockIds, gridDim,
transformOp);		transformOp);
if (diag.succeeded()) {		if (diag.succeeded()) {
		gridDim.resize(3, 1);
diag = alterGpuLaunch(rewriter, gpuLaunch,		diag = alterGpuLaunch(rewriter, gpuLaunch,
cast<TransformOpInterface>(getOperation()),		cast<TransformOpInterface>(getOperation()),
gridDim[0], gridDim[1], gridDim[2]);		gridDim[0], gridDim[1], gridDim[2]);
}		}

results.assign({gpuLaunch});		results.assign({gpuLaunch});
return diag;		return diag;
}		}
Show All 24 Lines	static DiagnosedSilenceableFailure rewriteOneForeachThreadToGpuThreads(
if (foreachThreadOp.getNumResults() > 0)		if (foreachThreadOp.getNumResults() > 0)
return failureHelper(		return failureHelper(
"only bufferized scf.foreach_thread lowers to gpu.thread_id");		"only bufferized scf.foreach_thread lowers to gpu.thread_id");

if (foreachThreadOp.getNumThreads().size() > 3)		if (foreachThreadOp.getNumThreads().size() > 3)
return failureHelper(		return failureHelper(
"scf.foreach_thread with rank > 3 does not lower to gpu.thread_id");		"scf.foreach_thread with rank > 3 does not lower to gpu.thread_id");

auto potentialBlockDim = foreachThreadOp.getPermutedNumThreads(rewriter);		SmallVector<int64_t> mapping;
		if (!foreachThreadOp.getMapping().has_value())
		return failureHelper("Mapping must be present");
		for (auto map : *foreachThreadOp.getMapping()) {
		if (auto threadMap = map.dyn_cast<GPUThreadMappingAttr>()) {
		mapping.push_back((int64_t)threadMap.getThread());
		ftynse (inline comment, Done): Nit: prefer C++-style casts.
		} else {
		return failureHelper("Mapping must be #gpu.thread<x/y/z/>");
		}
		}
		auto potentialBlockDim =
		ftynse (inline comment, Not Done): Nit: please expand auto systematically unless the type is obvious from statement context (the RHS is a constructor or a cast) or obnoxious/impossible to spell (lambdas, some iterators).
		foreachThreadOp.getPermutedNumThreads(rewriter, mapping);
if (failed(potentialBlockDim) ||		if (failed(potentialBlockDim) ||
llvm::any_of(*potentialBlockDim, [](OpFoldResult ofr) {		llvm::any_of(*potentialBlockDim, [](OpFoldResult ofr) {
return !getConstantIntValue(ofr).has_value();		return !getConstantIntValue(ofr).has_value();
})) {		})) {
return failureHelper("unsupported dynamic blockdim size");		return failureHelper("unsupported dynamic blockdim size");
}		}

SmallVector<int64_t> blockDim =		SmallVector<int64_t> blockDim =
llvm::to_vector(llvm::map_range(*potentialBlockDim, [](OpFoldResult ofr) {		llvm::to_vector(llvm::map_range(*potentialBlockDim, [](OpFoldResult ofr) {
return getConstantIntValue(ofr).value();		return getConstantIntValue(ofr).value();
}));		}));
		blockDim.resize(3, 1);

// Step 1. Create the gpu.thread ops		// Step 1. Create the gpu.thread ops
Location loc = foreachThreadOp.getLoc();		Location loc = foreachThreadOp.getLoc();
IndexType indexType = rewriter.getIndexType();		IndexType indexType = rewriter.getIndexType();

SmallVector<Dimension> gpuDims{Dimension::x, Dimension::y, Dimension::z};		SmallVector<Dimension> gpuDims{Dimension::x, Dimension::y, Dimension::z};
SmallVector<Value> threadOps;		SmallVector<Value> threadOps;
for (int64_t idx : llvm::seq<int64_t>(0, blockDim.size())) {		for (int64_t idx : llvm::seq<int64_t>(0, blockDim.size())) {
threadOps.push_back(		threadOps.push_back(
rewriter.create<ThreadIdOp>(loc, indexType, gpuDims[idx]));		rewriter.create<ThreadIdOp>(loc, indexType, gpuDims[idx]));
}		}
// Step 2. Maybe create conditionals to predicate the region.		// Step 2. Maybe create conditionals to predicate the region.
Value predicate;		Value predicate;
for (auto [threadId, blockDim, globalBlockDim] :		for (auto [threadId, blockDim, globalBlockDim] :
llvm::zip(threadOps, blockDim, globalBlockDims)) {		llvm::zip(threadOps, blockDim, globalBlockDims)) {
if (blockDim > globalBlockDim) {		if (blockDim > globalBlockDim) {
return failureHelper(		return failureHelper(
"The requested GPU threads are fewer than the number of loop trip "		"The requested GPU threads are fewer than the number of loop trip "
"counts. Try to tile scf.foreach_thread before mapping or set small "		"counts. Try to tile scf.foreach_thread before mapping or set "
		"small "
"blockDim.");		"blockDim.");
		ftynse (inline comment, Not Done): Nit: this rewrapping looks spurious.
}		}
if (blockDim == globalBlockDim)		if (blockDim == globalBlockDim)
continue;		continue;
Value blockIdx = rewriter.create<arith::ConstantIndexOp>(loc, blockDim);		Value blockIdx = rewriter.create<arith::ConstantIndexOp>(loc, blockDim);
Value tmpPredicate = rewriter.create<arith::CmpIOp>(		Value tmpPredicate = rewriter.create<arith::CmpIOp>(
loc, arith::CmpIPredicate::ult, threadId, blockIdx);		loc, arith::CmpIPredicate::ult, threadId, blockIdx);
predicate =		predicate =
predicate ? rewriter.create<arith::AndIOp>(loc, predicate, tmpPredicate)		predicate ? rewriter.create<arith::AndIOp>(loc, predicate, tmpPredicate)
Show All 17 Lines	if (predicate) {
insertionPoint = Block::iterator(foreachThreadOp);		insertionPoint = Block::iterator(foreachThreadOp);
}		}
Block &sourceBlock = foreachThreadOp.getRegion().front();		Block &sourceBlock = foreachThreadOp.getRegion().front();
targetBlock->getOperations().splice(insertionPoint,		targetBlock->getOperations().splice(insertionPoint,
sourceBlock.getOperations());		sourceBlock.getOperations());

// Step 4. RAUW thread indices to thread ops.		// Step 4. RAUW thread indices to thread ops.
SmallVector<Value> threadIndices =		SmallVector<Value> threadIndices =
*foreachThreadOp.getPermutedThreadIndices();		*foreachThreadOp.getPermutedThreadIndices(mapping);
for (auto [threadIdx, threadOp] : llvm::zip(threadIndices, threadOps)) {		for (auto [threadIdx, threadOp] : llvm::zip(threadIndices, threadOps)) {
Value val = threadIdx;		Value val = threadIdx;
Value op = threadOp;		Value op = threadOp;
if (!val)		if (!val)
continue;		continue;
for (Operation *user : llvm::make_early_inc_range(val.getUsers())) {		for (Operation *user : llvm::make_early_inc_range(val.getUsers())) {
user->replaceUsesOfWith(val, op);		user->replaceUsesOfWith(val, op);
}		}
▲ Show 20 Lines • Show All 96 Lines • Show Last 20 Lines
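
Note on the mapping translation loops above: the inline comments ask for explicit types instead of `auto`, C++-style casts, and error messages that start with a lowercase letter. The standalone sketch below shows roughly what that shape could look like. It does not use MLIR. `Dim`, `BlockMapping`, `ThreadMapping`, `MappingAttr`, and `blockMappingToIndices` are hypothetical stand-ins for the attribute classes, and the numbering x=0, y=1, z=2 is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the #gpu.block / #gpu.thread mapping attributes.
enum class Dim : int64_t { X = 0, Y = 1, Z = 2 }; // numbering is an assumption
struct BlockMapping { Dim dim; };
struct ThreadMapping { Dim dim; };
using MappingAttr = std::variant<BlockMapping, ThreadMapping>;

// Converts a list of block mappings to integer dimensions. On failure, stores
// a lowercase diagnostic in `error` and returns std::nullopt.
std::optional<std::vector<int64_t>>
blockMappingToIndices(const std::optional<std::vector<MappingAttr>> &mapping,
                      std::string &error) {
  if (!mapping.has_value()) {
    error = "mapping must be present";
    return std::nullopt;
  }
  std::vector<int64_t> indices;
  for (const MappingAttr &attr : *mapping) { // explicit type, no `auto`
    const BlockMapping *block = std::get_if<BlockMapping>(&attr);
    if (!block) {
      error = "mapping must be a block mapping";
      return std::nullopt;
    }
    indices.push_back(static_cast<int64_t>(block->dim)); // C++-style cast
  }
  return indices;
}
```

The thread-mapping path would follow the same pattern with `ThreadMapping` in place of `BlockMapping`.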

mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp

Show All 13 Lines
#include "mlir/Dialect/GPU/IR/GPUDialect.h"		#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/Linalg/IR/Linalg.h"		#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"		#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
#include "mlir/Dialect/PDL/IR/PDL.h"		#include "mlir/Dialect/PDL/IR/PDL.h"
#include "mlir/Dialect/PDL/IR/PDLTypes.h"		#include "mlir/Dialect/PDL/IR/PDLTypes.h"
#include "mlir/Dialect/SCF/Transforms/TileUsingInterface.h"		#include "mlir/Dialect/SCF/Transforms/TileUsingInterface.h"
#include "mlir/Dialect/Transform/IR/TransformDialect.h"		#include "mlir/Dialect/Transform/IR/TransformDialect.h"
#include "mlir/Dialect/Transform/IR/TransformInterfaces.h"		#include "mlir/Dialect/Transform/IR/TransformInterfaces.h"
		#include "mlir/Dialect/Utils/StaticValueUtils.h"
		#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/Interfaces/TilingInterface.h"		#include "mlir/Interfaces/TilingInterface.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"		#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
#include "llvm/ADT/StringSet.h"		#include "llvm/ADT/StringSet.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"

using namespace mlir;		using namespace mlir;
using namespace mlir::linalg;		using namespace mlir::linalg;
using namespace mlir::transform;		using namespace mlir::transform;
▲ Show 20 Lines • Show All 1,292 Lines • ▼ Show 20 Lines	if (!tilableOp) {
<< "only TilingInterface ops are supported";		<< "only TilingInterface ops are supported";
diag.attachNote(target->getLoc()) << "target op";		diag.attachNote(target->getLoc()) << "target op";
return diag;		return diag;
}		}
rewriter.setInsertionPoint(tilableOp);		rewriter.setInsertionPoint(tilableOp);
auto maybeThreadDimMappingAttr = threadDimMapping;		auto maybeThreadDimMappingAttr = threadDimMapping;
auto dimMapping = llvm::to_vector(		auto dimMapping = llvm::to_vector(
maybeThreadDimMappingAttr		maybeThreadDimMappingAttr
? extractFromI64ArrayAttr(*maybeThreadDimMappingAttr)		? extractFromAttributeArrayAttr(*maybeThreadDimMappingAttr)
: ArrayRef<int64_t>{});		: ArrayRef<Attribute>{});

FailureOr<linalg::ForeachThreadTilingResult> tilingResult = failure();		FailureOr<linalg::ForeachThreadTilingResult> tilingResult = failure();
if (!mixedNumThreads.empty()) {		if (!mixedNumThreads.empty()) {
tilingResult = linalg::tileToForeachThreadOp(rewriter, tilableOp,		tilingResult = linalg::tileToForeachThreadOp(rewriter, tilableOp,
numThreads, dimMapping);		numThreads, dimMapping);
} else {		} else {
tilingResult = linalg::tileToForeachThreadOpUsingTileSizes(		tilingResult = linalg::tileToForeachThreadOpUsingTileSizes(
rewriter, tilableOp, tileSizes, dimMapping);		rewriter, tilableOp, tileSizes, dimMapping);
Show All 16 Lines	DiagnosedSilenceableFailure transform::TileToForeachThreadOp::apply(
ArrayRef<Operation *> targets = state.getPayloadOps(getTarget());		ArrayRef<Operation *> targets = state.getPayloadOps(getTarget());

// Result payload ops.		// Result payload ops.
SmallVector<Operation *> tileOps;		SmallVector<Operation *> tileOps;
SmallVector<Operation *> tiledOps;		SmallVector<Operation *> tiledOps;

DiagnosedSilenceableFailure diag = tileToForeachThreadOpImpl(		DiagnosedSilenceableFailure diag = tileToForeachThreadOpImpl(
rewriter, state, cast<TransformOpInterface>(getOperation()), targets,		rewriter, state, cast<TransformOpInterface>(getOperation()), targets,
getMixedNumThreads(), getMixedTileSizes(), getThreadDimMapping(), tileOps,		getMixedNumThreads(), getMixedTileSizes(), getMapping(), tileOps,
tiledOps);		tiledOps);

if (!diag.succeeded())		if (!diag.succeeded())
return diag;		return diag;

transformResults.set(getForeachThreadOp().cast<OpResult>(), tileOps);		transformResults.set(getForeachThreadOp().cast<OpResult>(), tileOps);
transformResults.set(getTiledOp().cast<OpResult>(), tiledOps);		transformResults.set(getTiledOp().cast<OpResult>(), tiledOps);

▲ Show 20 Lines • Show All 269 Lines • Show Last 20 Lines

mlir/lib/Dialect/Linalg/Transforms/Tiling.cpp

Show All 18 Lines
#include "mlir/Dialect/Linalg/IR/Linalg.h"		#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"		#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"		#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Dialect/SCF/Transforms/Transforms.h"		#include "mlir/Dialect/SCF/Transforms/Transforms.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"		#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/Dialect/Utils/IndexingUtils.h"		#include "mlir/Dialect/Utils/IndexingUtils.h"
#include "mlir/IR/AffineExpr.h"		#include "mlir/IR/AffineExpr.h"
#include "mlir/IR/AffineMap.h"		#include "mlir/IR/AffineMap.h"
		#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/Transforms/FoldUtils.h"		#include "mlir/Transforms/FoldUtils.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"		#include "mlir/Transforms/GreedyPatternRewriteDriver.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include <utility>		#include <utility>

namespace mlir {		namespace mlir {
#define GEN_PASS_DEF_LINALGTILINGPASS		#define GEN_PASS_DEF_LINALGTILINGPASS
#include "mlir/Dialect/Linalg/Passes.h.inc"		#include "mlir/Dialect/Linalg/Passes.h.inc"
▲ Show 20 Lines • Show All 186 Lines • ▼ Show 20 Lines
/// size of data.		/// size of data.
/// It is the user's responsibility to ensure that `numThreads` is a valid		/// It is the user's responsibility to ensure that `numThreads` is a valid
/// tiling specification (i.e. that only tiles parallel dimensions, e.g. in the		/// tiling specification (i.e. that only tiles parallel dimensions, e.g. in the
/// Linalg case). If `omitTileOffsetBoundsCheck` is true, then the function will		/// Linalg case). If `omitTileOffsetBoundsCheck` is true, then the function will
/// assume that `tileSize[i] * (numThread[i] -1) <= dimSize[i]` holds.		/// assume that `tileSize[i] * (numThread[i] -1) <= dimSize[i]` holds.
static FailureOr<ForeachThreadTilingResult> tileToForeachThreadOpImpl(		static FailureOr<ForeachThreadTilingResult> tileToForeachThreadOpImpl(
RewriterBase &b, TilingInterface op, ArrayRef<OpFoldResult> numThreads,		RewriterBase &b, TilingInterface op, ArrayRef<OpFoldResult> numThreads,
Optional<ArrayRef<OpFoldResult>> nominalTileSizes,		Optional<ArrayRef<OpFoldResult>> nominalTileSizes,
ArrayRef<int64_t> threadDimMapping, bool omitTileOffsetBoundsCheck) {		ArrayRef<Attribute> threadDimMapping, bool omitTileOffsetBoundsCheck) {
Location loc = op->getLoc();		Location loc = op->getLoc();
OpBuilder::InsertionGuard g(b);		OpBuilder::InsertionGuard g(b);
SmallVector<Range> loopRanges = op.getIterationDomain(b);		SmallVector<Range> loopRanges = op.getIterationDomain(b);
if (loopRanges.empty())		if (loopRanges.empty())
return op->emitOpError("expected non-empty loop ranges");		return op->emitOpError("expected non-empty loop ranges");
auto hasStrideOne = [](Range r) { return !isConstantIntValue(r.stride, 1); };		auto hasStrideOne = [](Range r) { return !isConstantIntValue(r.stride, 1); };
if (llvm::any_of(loopRanges, hasStrideOne))		if (llvm::any_of(loopRanges, hasStrideOne))
return op->emitOpError("only stride-1 supported atm");		return op->emitOpError("only stride-1 supported atm");
▲ Show 20 Lines • Show All 120 Lines • ▼ Show 20 Lines	b.create<tensor::ParallelInsertSliceOp>(loc, std::get<1>(it),
resultSizes, strides);		resultSizes, strides);
}		}
return ForeachThreadTilingResult{foreachThreadOp, tiledOp};		return ForeachThreadTilingResult{foreachThreadOp, tiledOp};
}		}

FailureOr<ForeachThreadTilingResult>		FailureOr<ForeachThreadTilingResult>
linalg::tileToForeachThreadOp(RewriterBase &b, TilingInterface op,		linalg::tileToForeachThreadOp(RewriterBase &b, TilingInterface op,
ArrayRef<OpFoldResult> numThreads,		ArrayRef<OpFoldResult> numThreads,
ArrayRef<int64_t> threadDimMapping) {		ArrayRef<Attribute> threadDimMapping) {
return tileToForeachThreadOpImpl(b, op, numThreads, /*nominalTileSizes=*/None,		return tileToForeachThreadOpImpl(b, op, numThreads, /*nominalTileSizes=*/None,
threadDimMapping,		threadDimMapping,
/*omitTileOffsetBoundsCheck=*/false);		/*omitTileOffsetBoundsCheck=*/false);
}		}

FailureOr<ForeachThreadTilingResult>		FailureOr<ForeachThreadTilingResult>
linalg::tileToForeachThreadOpUsingTileSizes(		linalg::tileToForeachThreadOpUsingTileSizes(
RewriterBase &b, TilingInterface op, ArrayRef<OpFoldResult> tileSizes,		RewriterBase &b, TilingInterface op, ArrayRef<OpFoldResult> tileSizes,
ArrayRef<int64_t> threadDimMapping) {		ArrayRef<Attribute> threadDimMapping) {
SmallVector<Range> loopRanges = op.getIterationDomain(b);		SmallVector<Range> loopRanges = op.getIterationDomain(b);
unsigned nLoops = loopRanges.size();		unsigned nLoops = loopRanges.size();
SmallVector<OpFoldResult> numThreads;		SmallVector<OpFoldResult> numThreads;
numThreads.reserve(nLoops);		numThreads.reserve(nLoops);
AffineExpr s0, s1;		AffineExpr s0, s1;
bindSymbols(b.getContext(), s0, s1);		bindSymbols(b.getContext(), s0, s1);
AffineExpr divExpr = s0.ceilDiv(s1);		AffineExpr divExpr = s0.ceilDiv(s1);
for (const auto &it : llvm::zip(tileSizes, loopRanges)) {		for (const auto &it : llvm::zip(tileSizes, loopRanges)) {
▲ Show 20 Lines • Show All 358 Lines • Show Last 20 Lines

mlir/lib/Dialect/SCF/IR/SCF.cpp

//===- SCF.cpp - Structured Control Flow Operations -----------------------===//		//===- SCF.cpp - Structured Control Flow Operations -----------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "mlir/Dialect/SCF/IR/SCF.h"		#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/Dialect/Arith/IR/Arith.h"		#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Arith/Utils/Utils.h"		#include "mlir/Dialect/Arith/Utils/Utils.h"
#include "mlir/Dialect/Bufferization/IR/Bufferization.h"		#include "mlir/Dialect/Bufferization/IR/Bufferization.h"
#include "mlir/Dialect/ControlFlow/IR/ControlFlowOps.h"		#include "mlir/Dialect/ControlFlow/IR/ControlFlowOps.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"		#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"		#include "mlir/Dialect/Tensor/IR/Tensor.h"
		#include "mlir/IR/Attributes.h"
#include "mlir/IR/BlockAndValueMapping.h"		#include "mlir/IR/BlockAndValueMapping.h"
#include "mlir/IR/FunctionInterfaces.h"		#include "mlir/IR/FunctionInterfaces.h"
#include "mlir/IR/Matchers.h"		#include "mlir/IR/Matchers.h"
#include "mlir/IR/PatternMatch.h"		#include "mlir/IR/PatternMatch.h"
		#include "mlir/Interfaces/DeviceMappingInterface.h"
#include "mlir/Support/MathExtras.h"		#include "mlir/Support/MathExtras.h"
#include "mlir/Transforms/InliningUtils.h"		#include "mlir/Transforms/InliningUtils.h"
		#include "llvm/ADT/TypeSwitch.h"
		#include "llvm/Support/raw_ostream.h"

using namespace mlir;		using namespace mlir;
using namespace mlir::scf;		using namespace mlir::scf;

#include "mlir/Dialect/SCF/IR/SCFOpsDialect.cpp.inc"		#include "mlir/Dialect/SCF/IR/SCFOpsDialect.cpp.inc"

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// SCFDialect Dialect Interfaces		// SCFDialect Dialect Interfaces
▲ Show 20 Lines • Show All 1,076 Lines • ▼ Show 20 Lines	LogicalResult ForeachThreadOp::verify() {
for (int64_t i = 0; i < getRank(); ++i)		for (int64_t i = 0; i < getRank(); ++i)
if (!body->getArgument(i).getType().isIndex())		if (!body->getArgument(i).getType().isIndex())
return emitOpError("expects ")		return emitOpError("expects ")
<< i << "-th block argument to be an index";		<< i << "-th block argument to be an index";
for (unsigned i = 0; i < getOutputs().size(); ++i)		for (unsigned i = 0; i < getOutputs().size(); ++i)
if (body->getArgument(i + getRank()).getType() != getOutputs()[i].getType())		if (body->getArgument(i + getRank()).getType() != getOutputs()[i].getType())
return emitOpError("type mismatch between ")		return emitOpError("type mismatch between ")
<< i << "-th output and corresponding block argument";		<< i << "-th output and corresponding block argument";
		if (getMapping().has_value())
		for (auto map : getMapping().value()) {
		map.dump();
		ftynse (Done): Debug leftover?
		if (!isa<DeviceMappingAttrInterface>(map))
		return emitOpError()
		<< getMappingAttrName() << " is not device mapping attribute";
		}

return success();		return success();
}		}

void ForeachThreadOp::print(OpAsmPrinter &p) {		void ForeachThreadOp::print(OpAsmPrinter &p) {
p << " (";		p << " (";
llvm::interleaveComma(getThreadIndices(), p);		llvm::interleaveComma(getThreadIndices(), p);
p << ") in (";		p << ") in (";
▲ Show 20 Lines • Show All 73 Lines • ▼ Show 20 Lines	result.addAttribute("operand_segment_sizes",
static_cast<int32_t>(outOperands.size())}));		static_cast<int32_t>(outOperands.size())}));
return success();		return success();
}		}

// Bodyless builder, outputs must be specified.		// Bodyless builder, outputs must be specified.
void ForeachThreadOp::build(mlir::OpBuilder &builder,		void ForeachThreadOp::build(mlir::OpBuilder &builder,
mlir::OperationState &result, ValueRange outputs,		mlir::OperationState &result, ValueRange outputs,
ValueRange numThreads,		ValueRange numThreads,
ArrayRef<int64_t> threadDimMapping) {		ArrayRef<Attribute> threadDimMapping) {
result.addOperands(numThreads);		result.addOperands(numThreads);
result.addOperands(outputs);		result.addOperands(outputs);
result.addAttribute(ForeachThreadOp::getThreadDimMappingAttrName(result.name),		result.addAttribute(ForeachThreadOp::getMappingAttrName(result.name),
builder.getI64ArrayAttr(threadDimMapping));		builder.getArrayAttr(threadDimMapping));

result.addAttribute(		result.addAttribute(
"operand_segment_sizes",		"operand_segment_sizes",
builder.getDenseI32ArrayAttr({static_cast<int32_t>(numThreads.size()),		builder.getDenseI32ArrayAttr({static_cast<int32_t>(numThreads.size()),
static_cast<int32_t>(outputs.size())}));		static_cast<int32_t>(outputs.size())}));
result.addTypes(TypeRange(outputs));		result.addTypes(TypeRange(outputs));

Region *bodyRegion = result.addRegion();		Region *bodyRegion = result.addRegion();
OpBuilder::InsertionGuard g(builder);		OpBuilder::InsertionGuard g(builder);
Show All 10 Lines	bodyBlock.addArguments(
TypeRange(outputs),		TypeRange(outputs),
SmallVector<Location>(outputs.size(), result.location));		SmallVector<Location>(outputs.size(), result.location));
ForeachThreadOp::ensureTerminator(*bodyRegion, builder, result.location);		ForeachThreadOp::ensureTerminator(*bodyRegion, builder, result.location);
}		}

// Builder that takes a bodyBuilder lambda.		// Builder that takes a bodyBuilder lambda.
void ForeachThreadOp::build(		void ForeachThreadOp::build(
mlir::OpBuilder &builder, mlir::OperationState &result, ValueRange outputs,		mlir::OpBuilder &builder, mlir::OperationState &result, ValueRange outputs,
ValueRange numThreads, ArrayRef<int64_t> threadDimMapping,		ValueRange numThreads, ArrayRef<Attribute> threadDimMapping,
function_ref<void(OpBuilder &, Location, ValueRange)> bodyBuilder) {		function_ref<void(OpBuilder &, Location, ValueRange)> bodyBuilder) {
result.addOperands(numThreads);		result.addOperands(numThreads);
result.addOperands(outputs);		result.addOperands(outputs);
result.addAttribute(ForeachThreadOp::getThreadDimMappingAttrName(result.name),		result.addAttribute(ForeachThreadOp::getMappingAttrName(result.name),
builder.getI64ArrayAttr(threadDimMapping));		builder.getArrayAttr(threadDimMapping));
result.addAttribute(		result.addAttribute(
"operand_segment_sizes",		"operand_segment_sizes",
builder.getDenseI32ArrayAttr({static_cast<int32_t>(numThreads.size()),		builder.getDenseI32ArrayAttr({static_cast<int32_t>(numThreads.size()),
static_cast<int32_t>(outputs.size())}));		static_cast<int32_t>(outputs.size())}));
result.addTypes(TypeRange(outputs));		result.addTypes(TypeRange(outputs));

Region *bodyRegion = result.addRegion();		Region *bodyRegion = result.addRegion();
OpBuilder::InsertionGuard g(builder);		OpBuilder::InsertionGuard g(builder);
▲ Show 20 Lines • Show All 54 Lines • ▼ Show 20 Lines	static FailureOr<SmallVector<T>> permute(const SmallVector<T> &vals,
return result;		return result;
}		}

/// Helper to get apply the `thread_dim_mapping` permutation of a		/// Helper to get apply the `thread_dim_mapping` permutation of a
/// `foreachThreadOp` to `values`.		/// `foreachThreadOp` to `values`.
template <typename T>		template <typename T>
static FailureOr<SmallVector<T>>		static FailureOr<SmallVector<T>>
getValuesPermutedByThreadMapping(scf::ForeachThreadOp foreachThreadOp,		getValuesPermutedByThreadMapping(scf::ForeachThreadOp foreachThreadOp,
const SmallVector<T> &values) {		const SmallVector<T> &values,
		ArrayRef<int64_t> mapping) {
// Apply mapping permutation if specified.		// Apply mapping permutation if specified.
auto mapping = foreachThreadOp.getThreadDimMapping();		auto maybePermuted = permute(values, mapping);
if (mapping && !mapping.empty()) {
auto maybePermuted = permute(values, extractFromI64ArrayAttr(mapping));
if (failed(maybePermuted))		if (failed(maybePermuted))
return foreachThreadOp->emitError("invalid permutation");		return foreachThreadOp->emitError("invalid permutation");
return *maybePermuted;		return *maybePermuted;
}
return values;		return values;
}		}

/// Return the thread indices in the order specified by the thread_dim_mapping		/// Return the thread indices in the order specified by the thread_dim_mapping
/// attribute. Return failure is thread_dim_mapping is not a valid permutation.		/// attribute. Return failure is thread_dim_mapping is not a valid permutation.
FailureOr<SmallVector<Value>> ForeachThreadOp::getPermutedThreadIndices() {		FailureOr<SmallVector<Value>>
		ForeachThreadOp::getPermutedThreadIndices(ArrayRef<int64_t> mapping) {
SmallVector<Value> threadCountValues = this->getThreadIndices();		SmallVector<Value> threadCountValues = this->getThreadIndices();
threadCountValues.resize(3, Value());		return getValuesPermutedByThreadMapping(*this, threadCountValues, mapping);
return getValuesPermutedByThreadMapping(*this, threadCountValues);
}		}

/// Return the number of threads in the order specified by the		/// Return the number of threads in the order specified by the
/// thread_dim_mapping attribute.		/// thread_dim_mapping attribute.
/// Return failure is thread_dim_mapping is not a valid permutation.		/// Return failure is thread_dim_mapping is not a valid permutation.
FailureOr<SmallVector<OpFoldResult>>		FailureOr<SmallVector<OpFoldResult>>
ForeachThreadOp::getPermutedNumThreads(OpBuilder &b) {		ForeachThreadOp::getPermutedNumThreads(OpBuilder &b,
		ArrayRef<int64_t> mapping) {
SmallVector<OpFoldResult> threadCountValues = this->getNumThreads();		SmallVector<OpFoldResult> threadCountValues = this->getNumThreads();
threadCountValues.resize(3, b.getIndexAttr(1));		return getValuesPermutedByThreadMapping(*this, threadCountValues, mapping);
return getValuesPermutedByThreadMapping(*this, threadCountValues);
}		}

ForeachThreadOp mlir::scf::getForeachThreadOpThreadIndexOwner(Value val) {		ForeachThreadOp mlir::scf::getForeachThreadOpThreadIndexOwner(Value val) {
auto tidxArg = val.dyn_cast<BlockArgument>();		auto tidxArg = val.dyn_cast<BlockArgument>();
if (!tidxArg)		if (!tidxArg)
return ForeachThreadOp();		return ForeachThreadOp();
assert(tidxArg.getOwner() && "unlinked block argument");		assert(tidxArg.getOwner() && "unlinked block argument");
auto *containingOp = tidxArg.getOwner()->getParentOp();		auto *containingOp = tidxArg.getOwner()->getParentOp();
▲ Show 20 Lines • Show All 2,199 Lines • Show Last 20 Lines
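
The `mapping` check added to `ForeachThreadOp::verify()` in this file rejects any array element that does not implement the device mapping interface. A minimal standalone sketch of that idea, without MLIR, follows. All type and function names here (`Attr`, `DeviceMappingLike`, `ThreadMapping`, `PlainInteger`, `verifyMapping`) are hypothetical stand-ins, and `dynamic_cast` is swapped in for MLIR's `isa<>`; this is an illustration under those assumptions, not the patch's code.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a generic attribute.
struct Attr {
  virtual ~Attr() = default;
};

// Hypothetical stand-in for the device mapping interface.
struct DeviceMappingLike {
  virtual ~DeviceMappingLike() = default;
  virtual int mappingId() const = 0;
};

// An attribute that implements the interface (think "thread x/y/z").
struct ThreadMapping : Attr, DeviceMappingLike {
  explicit ThreadMapping(int id) : id(id) {}
  int mappingId() const override { return id; }
  int id;
};

// An attribute that does not implement it (think "plain integer").
struct PlainInteger : Attr {
  explicit PlainInteger(long v) : value(v) {}
  long value;
};

// Returns an error message if any element lacks the interface,
// and std::nullopt if the whole array is acceptable.
std::optional<std::string>
verifyMapping(const std::vector<const Attr *> &mapping) {
  for (const Attr *attr : mapping)
    if (dynamic_cast<const DeviceMappingLike *>(attr) == nullptr)
      return std::string("mapping is not device mapping attribute");
  return std::nullopt;
}
```

With this shape, an array of two `ThreadMapping` elements verifies, while an array that mixes in a `PlainInteger` yields the error string. An empty array verifies trivially.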

mlir/lib/Dialect/SCF/Transforms/BufferizableOpInterfaceImpl.cpp

Show All 9 Lines

#include "mlir/Dialect/Bufferization/IR/BufferizableOpInterface.h"		#include "mlir/Dialect/Bufferization/IR/BufferizableOpInterface.h"
#include "mlir/Dialect/Bufferization/IR/Bufferization.h"		#include "mlir/Dialect/Bufferization/IR/Bufferization.h"
#include "mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h"		#include "mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"		#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Dialect/SCF/IR/SCF.h"		#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"		#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/Dialect/Utils/StaticValueUtils.h"		#include "mlir/Dialect/Utils/StaticValueUtils.h"
		#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/IR/Dialect.h"		#include "mlir/IR/Dialect.h"
#include "mlir/IR/Operation.h"		#include "mlir/IR/Operation.h"
#include "mlir/IR/PatternMatch.h"		#include "mlir/IR/PatternMatch.h"
		#include "llvm/ADT/None.h"

using namespace mlir;		using namespace mlir;
using namespace mlir::bufferization;		using namespace mlir::bufferization;
using namespace mlir::scf;		using namespace mlir::scf;

namespace mlir {		namespace mlir {
namespace scf {		namespace scf {
namespace {		namespace {
▲ Show 20 Lines • Show All 1,107 Lines • ▼ Show 20 Lines	for (const auto &it :
Value bufferAsTensor =		Value bufferAsTensor =
rewriter.create<ToTensorOp>(foreachThreadOp.getLoc(), buffer);		rewriter.create<ToTensorOp>(foreachThreadOp.getLoc(), buffer);
bbArg.replaceAllUsesWith(bufferAsTensor);		bbArg.replaceAllUsesWith(bufferAsTensor);
}		}

// Create new ForeachThreadOp without any results and drop the automatically		// Create new ForeachThreadOp without any results and drop the automatically
// introduced terminator.		// introduced terminator.
rewriter.setInsertionPoint(foreachThreadOp);		rewriter.setInsertionPoint(foreachThreadOp);
auto newForeachThreadOp = rewriter.create<ForeachThreadOp>(		ForeachThreadOp newForeachThreadOp;
		if (foreachThreadOp.getMapping().has_value())
		newForeachThreadOp = rewriter.create<ForeachThreadOp>(
foreachThreadOp.getLoc(), /*outputs=*/ValueRange(),		foreachThreadOp.getLoc(), /*outputs=*/ValueRange(),
foreachThreadOp.getNumThreads(),		foreachThreadOp.getNumThreads(),
extractFromI64ArrayAttr(foreachThreadOp.getThreadDimMapping()));		foreachThreadOp.getMapping().value());
		else
		newForeachThreadOp = rewriter.create<ForeachThreadOp>(
		foreachThreadOp.getLoc(), /*outputs=*/ValueRange(),
		foreachThreadOp.getNumThreads());
		ftynse (Done): Can't we just use a `foreachThreadOp.getMapping().value_or({})` for the last argument instead? If not, wrap multi-line bodies into braces.
newForeachThreadOp.getBody()->getTerminator()->erase();		newForeachThreadOp.getBody()->getTerminator()->erase();

// Move over block contents of the old op.		// Move over block contents of the old op.
SmallVector<Value> replacementBbArgs;		SmallVector<Value> replacementBbArgs;
replacementBbArgs.append(		replacementBbArgs.append(
newForeachThreadOp.getBody()->getArguments().begin(),		newForeachThreadOp.getBody()->getArguments().begin(),
newForeachThreadOp.getBody()->getArguments().end());		newForeachThreadOp.getBody()->getArguments().end());
replacementBbArgs.append(foreachThreadOp.getOutputs().size(), Value());		replacementBbArgs.append(foreachThreadOp.getOutputs().size(), Value());
▲ Show 20 Lines • Show All 65 Lines • Show Last 20 Lines
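
The inline comment in this file suggests collapsing the two `create` branches into one call that supplies a default when the mapping is absent. A rough sketch of that refactoring with `std::optional` follows. `buildWithMapping`, `buildBranching`, and `buildWithValueOr` are hypothetical helpers, and a vector of strings stands in for the attribute array. The default is spelled with an explicit type because a bare `{}` may not deduce when `value_or` is a template, so the exact spelling in the patch could differ.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

using Mapping = std::vector<std::string>;

// Hypothetical stand-in for the op builder: it accepts a possibly
// empty mapping and simply returns what it was given.
Mapping buildWithMapping(const Mapping &mapping) { return mapping; }

// Two-branch form, mirroring the shape of the code under review.
Mapping buildBranching(const std::optional<Mapping> &mapping) {
  if (mapping.has_value()) {
    return buildWithMapping(mapping.value());
  }
  return buildWithMapping(Mapping{});
}

// Single-call form: fall back to an empty mapping when none is set.
Mapping buildWithValueOr(const std::optional<Mapping> &mapping) {
  return buildWithMapping(mapping.value_or(Mapping{}));
}
```

Both helpers return the same result for a present and for an absent mapping, so the single-call form removes the duplicated argument list without changing behavior in this sketch.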

mlir/lib/Dialect/Utils/StaticValueUtils.cpp

	Show First 20 Lines • Show All 59 Lines • ▼ Show 20 Lines
	/// Extract int64_t values from the assumed ArrayAttr of IntegerAttr.			/// Extract int64_t values from the assumed ArrayAttr of IntegerAttr.
	SmallVector<int64_t, 4> extractFromI64ArrayAttr(Attribute attr) {			SmallVector<int64_t, 4> extractFromI64ArrayAttr(Attribute attr) {
	return llvm::to_vector<4>(			return llvm::to_vector<4>(
	llvm::map_range(attr.cast<ArrayAttr>(), [](Attribute a) -> int64_t {			llvm::map_range(attr.cast<ArrayAttr>(), [](Attribute a) -> int64_t {
	return a.cast<IntegerAttr>().getInt();			return a.cast<IntegerAttr>().getInt();
	}));			}));
	}			}

				/// Extract int64_t values from the assumed ArrayAttr of IntegerAttr.
				SmallVector<Attribute> extractFromAttributeArrayAttr(ArrayAttr arrayAttr) {
				SmallVector<Attribute> result;
				result.reserve(arrayAttr.size());
				for (auto attr : arrayAttr.getAsRange<Attribute>())
				result.push_back(attr);
				return result;
				ftynse (Not Done): `llvm::to_vector(arrayAttr.getAsRange<Attribute>())` is shorter and likely even removes the need for this function entirely. Just use that at the call sites.
				}

	/// Given a value, try to extract a constant Attribute. If this fails, return			/// Given a value, try to extract a constant Attribute. If this fails, return
	/// the original value.			/// the original value.
	OpFoldResult getAsOpFoldResult(Value val) {			OpFoldResult getAsOpFoldResult(Value val) {
	if (!val)			if (!val)
	return OpFoldResult();			return OpFoldResult();
	Attribute attr;			Attribute attr;
	if (matchPattern(val, m_Constant(&attr)))			if (matchPattern(val, m_Constant(&attr)))
	return attr;			return attr;
	▲ Show 20 Lines • Show All 65 Lines • Show Last 20 Lines

mlir/lib/IR/Builders.cpp

	//===- Builders.cpp - Helpers for constructing MLIR Classes ---------------===//			//===- Builders.cpp - Helpers for constructing MLIR Classes ---------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "mlir/IR/Builders.h"			#include "mlir/IR/Builders.h"
	#include "mlir/IR/AffineExpr.h"			#include "mlir/IR/AffineExpr.h"
	#include "mlir/IR/AffineMap.h"			#include "mlir/IR/AffineMap.h"
				#include "mlir/IR/Attributes.h"
				rriddle (Not Done): This looks unnecessary.
	#include "mlir/IR/BlockAndValueMapping.h"			#include "mlir/IR/BlockAndValueMapping.h"
	#include "mlir/IR/BuiltinTypes.h"			#include "mlir/IR/BuiltinTypes.h"
	#include "mlir/IR/Dialect.h"			#include "mlir/IR/Dialect.h"
	#include "mlir/IR/IntegerSet.h"			#include "mlir/IR/IntegerSet.h"
	#include "mlir/IR/Matchers.h"			#include "mlir/IR/Matchers.h"
	#include "mlir/IR/SymbolTable.h"			#include "mlir/IR/SymbolTable.h"
	#include "llvm/Support/raw_ostream.h"			#include "llvm/Support/raw_ostream.h"

	▲ Show 20 Lines • Show All 505 Lines • Show Last 20 Lines

mlir/lib/Interfaces/CMakeLists.txt

Show All 9 Lines	set(LLVM_OPTIONAL_SOURCES
InferTypeOpInterface.cpp		InferTypeOpInterface.cpp
LoopLikeInterface.cpp		LoopLikeInterface.cpp
ParallelCombiningOpInterface.cpp		ParallelCombiningOpInterface.cpp
ShapedOpInterfaces.cpp		ShapedOpInterfaces.cpp
SideEffectInterfaces.cpp		SideEffectInterfaces.cpp
TilingInterface.cpp		TilingInterface.cpp
VectorInterfaces.cpp		VectorInterfaces.cpp
ViewLikeInterface.cpp		ViewLikeInterface.cpp
		DeviceMappingInterface.cpp
)		)

function(add_mlir_interface_library name)		function(add_mlir_interface_library name)
add_mlir_library(MLIR${name}		add_mlir_library(MLIR${name}
${name}.cpp		${name}.cpp

ADDITIONAL_HEADER_DIRS		ADDITIONAL_HEADER_DIRS
${MLIR_MAIN_INCLUDE_DIR}/mlir/Interfaces		${MLIR_MAIN_INCLUDE_DIR}/mlir/Interfaces
Show All 26 Lines

mlir/lib/Interfaces/DeviceMappingInterface.cpp

This file was added.

				//===- DeviceMappingInterface.cpp
				//-------------------------------------------------===//
				ftynse (Done): This should have been one line.
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "mlir/Interfaces/DeviceMappingInterface.h"

				using namespace mlir;

				//===----------------------------------------------------------------------===//
				// Table-generated class definitions
				//===----------------------------------------------------------------------===//

				#include "mlir/Interfaces/DeviceMappingAttrInterface.cpp.inc"

mlir/test/Dialect/GPU/transform-gpu-failing.mlir

Show All 20 Lines	func.func @map_nested_foreach_to_threads_excessive_threads(%x: memref<2 x 32 x f32>, %y: memref<2 x 32 x f32>, %t: memref<32 x f32>, %alpha : f32, %stream : !gpu.async.token) -> memref<2 x 32 x f32> {
%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)		%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)
threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)		threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)
{		{
scf.foreach_thread (%i, %j) in (%c7, %c900) {		scf.foreach_thread (%i, %j) in (%c7, %c900) {
%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>		%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>
%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>		%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>		memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>
} {thread_dim_mapping = [1, 0, 2]}		} { mapping = [#gpu.thread<y>, #gpu.thread<x>] }
gpu.terminator		gpu.terminator
}		}

%name2 = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)		%name2 = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)
threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)		threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)
{		{
scf.foreach_thread (%i, %j) in (%c7, %c9) {		scf.foreach_thread (%i, %j) in (%c7, %c9) {
%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>		%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>
%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>		%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>		memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>
} {thread_dim_mapping = [1, 0, 2]}		} { mapping = [#gpu.thread<y>, #gpu.thread<x>] }
gpu.terminator		gpu.terminator
}		}

return %y : memref<2 x 32 x f32>		return %y : memref<2 x 32 x f32>
}		}
transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb1(%arg0: !pdl.operation):		^bb1(%arg0: !pdl.operation):
%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0		%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0
Show All 12 Lines	func.func @map_nested_foreach_to_threads_fewer_threads(%x: memref<2 x 32 x f32>, %y: memref<2 x 32 x f32>, %t: memref<32 x f32>, %alpha : f32, %stream : !gpu.async.token) -> memref<2 x 32 x f32> {
%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)		%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)
threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)		threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)
{		{
scf.foreach_thread (%i, %j) in (%c7, %c900) {		scf.foreach_thread (%i, %j) in (%c7, %c900) {
%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>		%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>
%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>		%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>		memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>
} {thread_dim_mapping = [1, 0, 2]}		} { mapping = [#gpu.thread<y>, #gpu.thread<x>] }
gpu.terminator		gpu.terminator
}		}

%name2 = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)		%name2 = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)
threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)		threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)
{		{
scf.foreach_thread (%i, %j) in (%c7, %c9) {		scf.foreach_thread (%i, %j) in (%c7, %c9) {
%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>		%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>
%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>		%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>		memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>
} {thread_dim_mapping = [1, 0, 2]}		} { mapping = [#gpu.thread<y>, #gpu.thread<x>] }
gpu.terminator		gpu.terminator
}		}

return %y : memref<2 x 32 x f32>		return %y : memref<2 x 32 x f32>
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb1(%arg0: !pdl.operation):		^bb1(%arg0: !pdl.operation):
Show All 10 Lines	func.func @map_nested_foreach_to_threads_dynamic_trip_count(%x: memref<2 x 32 x f32>, %y: memref<2 x 32 x f32>, %t: memref<32 x f32>, %alpha : f32, %stream : !gpu.async.token, %c9 : index, %c7 : index) -> memref<2 x 32 x f32> {
%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)		%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)
threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)		threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)
{		{
scf.foreach_thread (%i, %j) in (%c7, %c900) {		scf.foreach_thread (%i, %j) in (%c7, %c900) {
%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>		%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>
%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>		%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>		memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>
} {thread_dim_mapping = [1, 0, 2]}		} { mapping = [#gpu.thread<y>, #gpu.thread<x>] }
gpu.terminator		gpu.terminator
}		}
return %y : memref<2 x 32 x f32>		return %y : memref<2 x 32 x f32>
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb1(%arg0: !pdl.operation):		^bb1(%arg0: !pdl.operation):
%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0		%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0
// expected-error @below {{unsupported dynamic blockdim size}}		// expected-error @below {{unsupported dynamic blockdim size}}
transform.gpu.map_nested_foreach_to_threads %funcop { blockDim = [128, 4, 1] }		transform.gpu.map_nested_foreach_to_threads %funcop { blockDim = [128, 4, 1] }
}		}

// -----		// -----

func.func @map_nested_foreach_to_threads_4d_loop(%x: memref<2x32x32x32xf32>, %y: memref<2x32x32x32xf32>, %stream : !gpu.async.token) -> memref<2x32x32x32xf32> {		func.func @map_nested_foreach_to_threads_4d_loop(%x: memref<2x32x32x32xf32>, %y: memref<2x32x32x32xf32>, %stream : !gpu.async.token) -> memref<2x32x32x32xf32> {
%one = arith.constant 1 : index		%one = arith.constant 1 : index
%c2 = arith.constant 1 : index		%c2 = arith.constant 1 : index
%c32 = arith.constant 32 : index		%c32 = arith.constant 32 : index
%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)		%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)
threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)		threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)
{		{
scf.foreach_thread (%i, %j, %k, %l) in (%c2, %c32,%c32,%c32) {		scf.foreach_thread (%i, %j, %k, %l) in (%c2, %c32,%c32,%c32) {
%4 = memref.load %x[%i, %j, %k, %l] : memref<2x32x32x32xf32>		%4 = memref.load %x[%i, %j, %k, %l] : memref<2x32x32x32xf32>
memref.store %4, %y[%i, %j, %k, %l] : memref<2x32x32x32xf32>		memref.store %4, %y[%i, %j, %k, %l] : memref<2x32x32x32xf32>
} {thread_dim_mapping = [1, 0, 2]}		} { mapping = [#gpu.thread<y>, #gpu.thread<x>, #gpu.thread<z>] }
gpu.terminator		gpu.terminator
}		}
return %y : memref<2x32x32x32xf32>		return %y : memref<2x32x32x32xf32>
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb1(%arg0: !pdl.operation):		^bb1(%arg0: !pdl.operation):
%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0		%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0
▲ Show 20 Lines • Show All 49 Lines • ▼ Show 20 Lines	func.func @map_foreach_to_blocks_not_unique(%x: memref<2 x 32 x f32>, %y: memref<2 x 32 x f32>, %t: memref<32 x f32>, %alpha : f32, %stream : !gpu.async.token) -> memref<2 x 32 x f32> {
%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)		%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)
threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)		threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)
{		{
scf.foreach_thread (%i, %j) in (%c7, %c900) {		scf.foreach_thread (%i, %j) in (%c7, %c900) {
%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>		%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>
%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>		%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>		memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>
} {thread_dim_mapping = [1, 0, 2]}		} { mapping = [#gpu.thread<y>, #gpu.thread<x>] }

scf.foreach_thread (%i, %j) in (%c7, %c9) {		scf.foreach_thread (%i, %j) in (%c7, %c9) {
%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>		%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>
%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>		%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>		memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>
} {thread_dim_mapping = [1, 0, 2]}		} { mapping = [#gpu.thread<y>, #gpu.thread<x>] }
gpu.terminator		gpu.terminator
}		}

return %y : memref<2 x 32 x f32>		return %y : memref<2 x 32 x f32>
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb0(%arg0: !pdl.operation):		^bb0(%arg0: !pdl.operation):
Show All 11 Lines	func.func @map_foreach_to_blocks_large_loop(%x: memref<2 x 32 x f32>, %y: memref<2 x 32 x f32>, %t: memref<32 x f32>, %alpha : f32, %stream : !gpu.async.token) -> memref<2 x 32 x f32> {
%c9 = arith.constant 9 : index		%c9 = arith.constant 9 : index
%c7 = arith.constant 7 : index		%c7 = arith.constant 7 : index

scf.foreach_thread (%i, %j) in (%c7, %c65537) {		scf.foreach_thread (%i, %j) in (%c7, %c65537) {
%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>		%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>
%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>		%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>		memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>
} {thread_dim_mapping = [0, 1, 2]}		} { mapping = [#gpu.thread<x>, #gpu.thread<y>] }

scf.foreach_thread (%i, %j) in (%c7, %c9) {		scf.foreach_thread (%i, %j) in (%c7, %c9) {
%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>		%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>
%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>		%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>		memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>
} {thread_dim_mapping = [1, 0, 2]}		} { mapping = [#gpu.thread<y>, #gpu.thread<x>] }

return %y : memref<2 x 32 x f32>		return %y : memref<2 x 32 x f32>
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb0(%arg0: !pdl.operation):		^bb0(%arg0: !pdl.operation):
%funcop = transform.structured.match ops{["func.func"]} in %arg0		%funcop = transform.structured.match ops{["func.func"]} in %arg0
// expected-error @below {{could not find a unique topLevel scf.foreach_thread}}		// expected-error @below {{could not find a unique topLevel scf.foreach_thread}}
%1 = transform.gpu.map_foreach_to_blocks %funcop { generate_gpu_launch }		%1 = transform.gpu.map_foreach_to_blocks %funcop { generate_gpu_launch }
}		}

// -----		// -----

func.func @map_foreach_to_blocks_large_loop(%x: memref<2 x 32 x f32>, %y: memref<2 x 32 x f32>, %t: memref<32 x f32>, %alpha : f32, %stream : !gpu.async.token) -> memref<2 x 32 x f32> {		func.func @map_foreach_to_blocks_large_loop(%x: memref<2 x 32 x f32>, %y: memref<2 x 32 x f32>, %t: memref<32 x f32>, %alpha : f32, %stream : !gpu.async.token) -> memref<2 x 32 x f32> {
%one = arith.constant 1 : index		%one = arith.constant 1 : index
%c65535 = arith.constant 65535 : index		%c65535 = arith.constant 65535 : index
scf.foreach_thread (%i, %j) in (%c65535, %c65535) {		scf.foreach_thread (%i, %j) in (%c65535, %c65535) {
%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>		%4 = memref.load %x[%i, %j] : memref<2 x 32 x f32>
%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>		%5 = memref.load %y[%i, %j] : memref<2 x 32 x f32>
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>		memref.store %6, %y[%i, %j] : memref<2 x 32 x f32>
} {thread_dim_mapping = [0, 1, 2]}		} { mapping = [#gpu.block<x>, #gpu.block<y>] }
return %y : memref<2 x 32 x f32>		return %y : memref<2 x 32 x f32>
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb0(%arg0: !pdl.operation):		^bb0(%arg0: !pdl.operation):
%funcop = transform.structured.match ops{["func.func"]} in %arg0		%funcop = transform.structured.match ops{["func.func"]} in %arg0
// expected-error @below {{Trying to launch a GPU kernel with gridDim = (65535, 65535, 1) blockDim = (1, 1, 1). It is larger than the limits.}}		// expected-error @below {{Trying to launch a GPU kernel with gridDim = (65535, 65535, 1) blockDim = (1, 1, 1). It is larger than the limits.}}
%1 = transform.gpu.map_foreach_to_blocks %funcop { generate_gpu_launch }		%1 = transform.gpu.map_foreach_to_blocks %funcop { generate_gpu_launch }
}		}

// -----		// -----
		No newline at end of file
		ftynse: Please add the newline.

mlir/test/Dialect/GPU/transform-gpu.mlir

... (18 lines not shown)	// CHECK: memref.load %[[ARGY]][%[[BLKX]], %[[BLKY]]]
%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)		%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)
threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)		threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)
{		{
scf.foreach_thread (%i, %j) in (%c7, %c9) {		scf.foreach_thread (%i, %j) in (%c7, %c9) {
%4 = memref.load %x[%i, %j] : !type		%4 = memref.load %x[%i, %j] : !type
%5 = memref.load %y[%i, %j] : !type		%5 = memref.load %y[%i, %j] : !type
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j] : !type		memref.store %6, %y[%i, %j] : !type
} {thread_dim_mapping = [0, 1, 2]}		} { mapping = [#gpu.block<x>, #gpu.block<y>]}
gpu.terminator		gpu.terminator
}		}
return %y : !type		return %y : !type
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb1(%arg0: !pdl.operation):		^bb1(%arg0: !pdl.operation):
%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0		%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0
transform.gpu.map_foreach_to_blocks %funcop { blockDim = [12, 9, 1]}		transform.gpu.map_foreach_to_blocks %funcop { gridDim = [12, 9]}
}		}

// -----		// -----

!type = memref<2 x 32 x f32>		!type = memref<2 x 32 x f32>
!type1d = memref<32 x f32>		!type1d = memref<32 x f32>

// CHECK-LABEL: func.func @saxpy2d(		// CHECK-LABEL: func.func @saxpy2d(
... (23 lines not shown)	// CHECK: gpu.barrier
%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)		%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)
threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)		threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)
{		{
scf.foreach_thread (%i, %j) in (%c7, %c9) {		scf.foreach_thread (%i, %j) in (%c7, %c9) {
%4 = memref.load %x[%i, %j] : !type		%4 = memref.load %x[%i, %j] : !type
%5 = memref.load %y[%i, %j] : !type		%5 = memref.load %y[%i, %j] : !type
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j] : !type		memref.store %6, %y[%i, %j] : !type
} {thread_dim_mapping = [1, 0, 2]}		} { mapping = [#gpu.thread<y>, #gpu.thread<x>]}
scf.foreach_thread (%i) in (%c12) {		scf.foreach_thread (%i) in (%c12) {
%7 = memref.load %t[%i] : !type1d		%7 = memref.load %t[%i] : !type1d
%8 = arith.addf %alpha, %7 : f32		%8 = arith.addf %alpha, %7 : f32
memref.store %8, %t[%i] : !type1d		memref.store %8, %t[%i] : !type1d
} {thread_dim_mapping = [0, 1, 2]}		} {mapping = [#gpu.thread<x>] }
gpu.terminator		gpu.terminator
}		}
return %y : !type		return %y : !type
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb1(%arg0: !pdl.operation):		^bb1(%arg0: !pdl.operation):
%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0		%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0
transform.gpu.map_nested_foreach_to_threads %funcop { blockDim = [12, 9, 1] }		transform.gpu.map_nested_foreach_to_threads %funcop { blockDim = [12, 9] }
}		}

// -----		// -----

!type4d = memref<32x64x4x32xf32>		!type4d = memref<32x64x4x32xf32>

// CHECK-LABEL: func.func @saxpy4d(		// CHECK-LABEL: func.func @saxpy4d(
// CHECK-SAME: %[[ARGX:[0-9a-z]+]]: memref<32x64x4x32xf32>		// CHECK-SAME: %[[ARGX:[0-9a-z]+]]: memref<32x64x4x32xf32>
... (14 lines not shown)
// CHECK: memref.load %[[ARGX]][%[[BLKX]], %[[BLKY]], %[[TIDY]], %[[TIDX]]]		// CHECK: memref.load %[[ARGX]][%[[BLKX]], %[[BLKY]], %[[TIDY]], %[[TIDX]]]
// CHECK: memref.load %[[ARGY]][%[[BLKX]], %[[BLKY]], %[[TIDY]], %[[TIDX]]]		// CHECK: memref.load %[[ARGY]][%[[BLKX]], %[[BLKY]], %[[TIDY]], %[[TIDX]]]
scf.foreach_thread (%i, %j) in (%c32, %c64) {		scf.foreach_thread (%i, %j) in (%c32, %c64) {
scf.foreach_thread (%k, %l) in (%c4, %c32) {		scf.foreach_thread (%k, %l) in (%c4, %c32) {
%4 = memref.load %x[%i, %j, %k, %l] : !type4d		%4 = memref.load %x[%i, %j, %k, %l] : !type4d
%5 = memref.load %y[%i, %j, %k, %l] : !type4d		%5 = memref.load %y[%i, %j, %k, %l] : !type4d
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j, %k, %l] : !type4d		memref.store %6, %y[%i, %j, %k, %l] : !type4d
} {thread_dim_mapping = [1, 0, 2]}		} { mapping = [#gpu.thread<y>, #gpu.thread<x>] }
} {thread_dim_mapping = [0, 1, 2]}		} { mapping = [#gpu.block<x>, #gpu.block<y>] }
return %y : !type4d		return %y : !type4d
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb1(%arg0: !pdl.operation):		^bb1(%arg0: !pdl.operation):
%funcop = transform.structured.match ops{["func.func"]} in %arg0		%funcop = transform.structured.match ops{["func.func"]} in %arg0
%gpuLaunch = transform.gpu.map_foreach_to_blocks %funcop { generate_gpu_launch }		%gpuLaunch = transform.gpu.map_foreach_to_blocks %funcop { generate_gpu_launch }
transform.gpu.map_nested_foreach_to_threads %gpuLaunch { blockDim = [32, 4, 1] }		transform.gpu.map_nested_foreach_to_threads %gpuLaunch { blockDim = [32, 4, 1] }
... (15 lines not shown)	// CHECK: return
%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)		%name = gpu.launch async[%stream] blocks(%arg3, %arg4, %arg5) in (%arg9 = %one, %arg10 = %one, %arg11 = %one)
threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)		threads(%arg6, %arg7, %arg8) in (%arg12 = %one, %arg13 = %one, %arg14 = %one)
{		{
scf.foreach_thread (%i, %j) in (%c7, %c9) {		scf.foreach_thread (%i, %j) in (%c7, %c9) {
%4 = memref.load %x[%i, %j] : !type		%4 = memref.load %x[%i, %j] : !type
%5 = memref.load %y[%i, %j] : !type		%5 = memref.load %y[%i, %j] : !type
%6 = math.fma %alpha, %4, %5 : f32		%6 = math.fma %alpha, %4, %5 : f32
memref.store %6, %y[%i, %j] : !type		memref.store %6, %y[%i, %j] : !type
} {thread_dim_mapping = [1, 0, 2]}		} { mapping = [#gpu.thread<y>, #gpu.thread<x>] }
gpu.terminator		gpu.terminator
}		}
return %y : !type		return %y : !type
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb1(%arg0: !pdl.operation):		^bb1(%arg0: !pdl.operation):
%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0		%funcop = transform.structured.match ops{["gpu.launch"]} in %arg0
transform.gpu.map_nested_foreach_to_threads %funcop { blockDim = [12, 9, 1], syncAfterDistribute = false }		transform.gpu.map_nested_foreach_to_threads %funcop { blockDim = [12, 9, 1], syncAfterDistribute = false }
}		}

mlir/test/Dialect/Linalg/tile-to-foreach-thread.mlir

... (20 lines not shown)	// CHECK-SAME: %[[C:[0-9a-z]+]]: tensor<?x?xf32>
// CHECK: %[[tC:.*]] = tensor.extract_slice %[[C_BLK]]{{.*}} : tensor<?x?xf32> to tensor<?x?xf32>		// CHECK: %[[tC:.*]] = tensor.extract_slice %[[C_BLK]]{{.*}} : tensor<?x?xf32> to tensor<?x?xf32>
// CHECK: %[[RES:.*]] = linalg.matmul		// CHECK: %[[RES:.*]] = linalg.matmul
// CHECK-SAME: ins(%[[tA]], %[[tB]] : tensor<?x?xf32>, tensor<?x?xf32>)		// CHECK-SAME: ins(%[[tA]], %[[tB]] : tensor<?x?xf32>, tensor<?x?xf32>)
// CHECK-SAME: outs(%[[tC]] : tensor<?x?xf32>) -> tensor<?x?xf32>		// CHECK-SAME: outs(%[[tC]] : tensor<?x?xf32>) -> tensor<?x?xf32>
// CHECK: scf.foreach_thread.perform_concurrently {		// CHECK: scf.foreach_thread.perform_concurrently {
// CHECK-NEXT: tensor.parallel_insert_slice %[[RES]] into %[[C_BLK]]{{.*}} :		// CHECK-NEXT: tensor.parallel_insert_slice %[[RES]] into %[[C_BLK]]{{.*}} :
// CHECK-SAME: tensor<?x?xf32> into tensor<?x?xf32>		// CHECK-SAME: tensor<?x?xf32> into tensor<?x?xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: } {thread_dim_mapping = [1, 0]}		// CHECK-NEXT: } {mapping = [#gpu.thread<y>, #gpu.thread<x>]}
%0 = linalg.matmul ins(%A, %B : tensor<?x?xf32>, tensor<?x?xf32>)		%0 = linalg.matmul ins(%A, %B : tensor<?x?xf32>, tensor<?x?xf32>)
outs(%C : tensor<?x?xf32>) -> (tensor<?x?xf32>)		outs(%C : tensor<?x?xf32>) -> (tensor<?x?xf32>)
return %0 : tensor<?x?xf32>		return %0 : tensor<?x?xf32>
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb1(%arg1: !pdl.operation):		^bb1(%arg1: !pdl.operation):
%0 = transform.structured.match ops{["linalg.matmul"]} in %arg1		%0 = transform.structured.match ops{["linalg.matmul"]} in %arg1
%1:2 = transform.structured.tile_to_foreach_thread_op %0 num_threads [10, 20] (mapped to dims [1, 0])		%1:2 = transform.structured.tile_to_foreach_thread_op %0 num_threads [10, 20] (mapping = [ #gpu.thread<y>, #gpu.thread<x> ] )
}		}
}		}

// -----		// -----

// Tests that dimension 0 can eliminate affine.min/max, dimension 1 cannot.		// Tests that dimension 0 can eliminate affine.min/max, dimension 1 cannot.

// CHECK-DAG: #[[$map0:.+]] = affine_map<(d0) -> (d0 * -15 + 300, 15)>		// CHECK-DAG: #[[$map0:.+]] = affine_map<(d0) -> (d0 * -15 + 300, 15)>
... (125 lines not shown)	%result = linalg.generic {indexing_maps = [
linalg.yield %2 : f32		linalg.yield %2 : f32
} -> tensor<4xf32>		} -> tensor<4xf32>
return %result : tensor<4xf32>		return %result : tensor<4xf32>
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb1(%arg1: !pdl.operation):		^bb1(%arg1: !pdl.operation):
%0 = transform.structured.match ops{["linalg.generic"]} in %arg1		%0 = transform.structured.match ops{["linalg.generic"]} in %arg1
%1:2 = transform.structured.tile_to_foreach_thread_op %0 num_threads [2] (mapped to dims [0])		%1:2 = transform.structured.tile_to_foreach_thread_op %0 num_threads [2] ( mapping = [#gpu.thread<x>])
}		}
}		}
// CHECK-DAG: #[[$map0:.+]] = affine_map<(d0) -> (d0 * 2)>		// CHECK-DAG: #[[$map0:.+]] = affine_map<(d0) -> (d0 * 2)>

// CHECK-LABEL: extract_source(		// CHECK-LABEL: extract_source(
// CHECK: %[[C2:.*]] = arith.constant 2 : index		// CHECK: %[[C2:.*]] = arith.constant 2 : index
// CHECK: scf.foreach_thread (%[[ARG:.*]]) in (%[[C2]]) shared_outs(%{{.*}} = %{{.*}}) -> (tensor<4xf32>) {		// CHECK: scf.foreach_thread (%[[ARG:.*]]) in (%[[C2]]) shared_outs(%{{.*}} = %{{.*}}) -> (tensor<4xf32>) {
// CHECK: %[[OFF:.*]] = affine.apply #[[$map0]](%[[ARG]])		// CHECK: %[[OFF:.*]] = affine.apply #[[$map0]](%[[ARG]])
... (154 lines not shown)

mlir/test/Dialect/SCF/one-shot-bufferize-tensor-copy-insertion.mlir

... (122 lines not shown)	%result = scf.foreach_thread (%thread_idx) in (%num_threads) shared_outs(%o = %out) -> tensor<100xf32> {
// CHECK: tensor.extract_slice		// CHECK: tensor.extract_slice
// CHECK: scf.foreach_thread.perform_concurrently		// CHECK: scf.foreach_thread.perform_concurrently
// CHECK: tensor.parallel_insert_slice %{{.*}} into %[[o]]		// CHECK: tensor.parallel_insert_slice %{{.*}} into %[[o]]
%1 = tensor.extract_slice %in[%thread_idx][1][1] : tensor<100xf32> to tensor<1xf32>		%1 = tensor.extract_slice %in[%thread_idx][1][1] : tensor<100xf32> to tensor<1xf32>
scf.foreach_thread.perform_concurrently {		scf.foreach_thread.perform_concurrently {
tensor.parallel_insert_slice %1 into %o[%thread_idx][1][1] :		tensor.parallel_insert_slice %1 into %o[%thread_idx][1][1] :
tensor<1xf32> into tensor<100xf32>		tensor<1xf32> into tensor<100xf32>
}		}
// CHECK: } {thread_dim_mapping = [5]}		// CHECK: } {mapping = [#gpu.thread<x>]}
} {thread_dim_mapping = [5]}		} {mapping = [#gpu.thread<x>]}
return		return
}		}

mlir/test/Dialect/SCF/ops.mlir

... (332 lines not shown)	func.func @simple_example(%in: tensor<100xf32>, %out: tensor<100xf32>) {
return		return
}		}

// CHECK-LABEL: func.func @elide_terminator		// CHECK-LABEL: func.func @elide_terminator
func.func @elide_terminator() -> () {		func.func @elide_terminator() -> () {
%num_threads = arith.constant 100 : index		%num_threads = arith.constant 100 : index

// CHECK: scf.foreach_thread		// CHECK: scf.foreach_thread
// CHECK-NEXT: } {thread_dim_mapping = [42]}		// CHECK-NEXT: } {mapping = [#gpu.thread<x>]}
// CHECK-NEXT: return		// CHECK-NEXT: return
scf.foreach_thread (%thread_idx) in (%num_threads) {		scf.foreach_thread (%thread_idx) in (%num_threads) {
scf.foreach_thread.perform_concurrently {		scf.foreach_thread.perform_concurrently {
}		}
} {thread_dim_mapping = [42]}		} {mapping = [#gpu.thread<x>]}
return		return
}		}

// CHECK-LABEL: @switch		// CHECK-LABEL: @switch
func.func @switch(%arg0: index) -> i32 {		func.func @switch(%arg0: index) -> i32 {
// CHECK: %{{.*}} = scf.index_switch %arg0 -> i32		// CHECK: %{{.*}} = scf.index_switch %arg0 -> i32
%0 = scf.index_switch %arg0 -> i32		%0 = scf.index_switch %arg0 -> i32
// CHECK-NEXT: case 2 {		// CHECK-NEXT: case 2 {
... (27 lines not shown)

mlir/test/lib/Dialect/Tensor/TestTensorTransforms.cpp

... (10 lines not shown)
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "mlir/Dialect/Arith/IR/Arith.h"		#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/Dialect/Linalg/IR/Linalg.h"		#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/SCF/IR/SCF.h"		#include "mlir/Dialect/SCF/IR/SCF.h"
#include "mlir/Dialect/Tensor/IR/Tensor.h"		#include "mlir/Dialect/Tensor/IR/Tensor.h"
#include "mlir/Dialect/Tensor/Transforms/TransformUtils.h"		#include "mlir/Dialect/Tensor/Transforms/TransformUtils.h"
#include "mlir/Dialect/Tensor/Transforms/Transforms.h"		#include "mlir/Dialect/Tensor/Transforms/Transforms.h"
		#include "mlir/IR/BuiltinAttributes.h"
#include "mlir/Pass/Pass.h"		#include "mlir/Pass/Pass.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"		#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;		using namespace mlir;

namespace {		namespace {
struct TestTensorTransforms		struct TestTensorTransforms
: public PassWrapper<TestTensorTransforms, OperationPass<>> {		: public PassWrapper<TestTensorTransforms, OperationPass<>> {
... (185 lines not shown)	struct RewriteExtractSliceFromCollapseShapeUsingScfForeach
RewriteExtractSliceFromCollapseShapeUsingScfForeach(MLIRContext *context)		RewriteExtractSliceFromCollapseShapeUsingScfForeach(MLIRContext *context)
: RewriteExtractSliceFromCollapseShapeBase(context) {}		: RewriteExtractSliceFromCollapseShapeBase(context) {}
LogicalResult emitReplacement(tensor::ExtractSliceOp op, Value dest,		LogicalResult emitReplacement(tensor::ExtractSliceOp op, Value dest,
tensor::ExtractSliceFromCollapseHelper &helper,		tensor::ExtractSliceFromCollapseHelper &helper,
PatternRewriter &rewriter) const override {		PatternRewriter &rewriter) const override {
Location loc = op.getLoc();		Location loc = op.getLoc();
auto foreachOp = rewriter.create<scf::ForeachThreadOp>(		auto foreachOp = rewriter.create<scf::ForeachThreadOp>(
loc, /*outputs=*/dest, /*numThreads=*/helper.getIterationSpaceSizes(),		loc, /*outputs=*/dest, /*numThreads=*/helper.getIterationSpaceSizes(),
/*threadDimMapping=*/ArrayRef<int64_t>{},		/*threadDimMapping=*/ArrayRef<Attribute>{},
[&](OpBuilder &nestedBuilder, Location loc, ValueRange regionArgs) {		[&](OpBuilder &nestedBuilder, Location loc, ValueRange regionArgs) {
unsigned numThreadIdRegionArgs =		unsigned numThreadIdRegionArgs =
helper.getIterationSpaceSizes().size();		helper.getIterationSpaceSizes().size();
unsigned numOutputRegionArgs =		unsigned numOutputRegionArgs =
regionArgs.size() - numThreadIdRegionArgs;		regionArgs.size() - numThreadIdRegionArgs;
ValueRange outputIvs = regionArgs.take_front(numThreadIdRegionArgs);		ValueRange outputIvs = regionArgs.take_front(numThreadIdRegionArgs);
ValueRange outputArgs = regionArgs.take_back(numOutputRegionArgs);		ValueRange outputArgs = regionArgs.take_back(numOutputRegionArgs);
assert(outputArgs.size() == 1 &&		assert(outputArgs.size() == 1 &&
... (50 lines not shown)


Diff 473199
