This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Use processing unit names for `thread_dim_map` and `mapped to dims`
Abandoned (Public)

Authored by guraypp on Oct 27 2022, 8:59 AM.

Details

Summary

Currently, `thread_dim_map` and the `mapped to dims` array are used to map to dims, which is very confusing in many ways. This change uses meaningful names in these structures.

For now there are `thread_x/y/z` and `block_x/y/z`. They cannot be mixed in the same `foreach_thread`, but mixing them or using them together may become possible in some cases in the future.

Other than the renaming, the change is almost NFC.

Diff Detail

Event Timeline

guraypp created this revision. Oct 27 2022, 8:59 AM
guraypp requested review of this revision. Oct 27 2022, 8:59 AM
guraypp edited the summary of this revision. Oct 27 2022, 9:00 AM
guraypp updated this revision to Diff 471225. Oct 27 2022, 10:42 AM
guraypp edited the summary of this revision.

update the description

guraypp updated this revision to Diff 471419.Oct 28 2022, 12:14 AM

remove comments

Have you considered making it an actual attribute in the GPU dialect, e.g., an enum attribute? Passing strings around feels a lot like unsafe JSON even if it is a readability improvement.

mlir/include/mlir/Dialect/Utils/StaticValueUtils.h
61

Nit: do not specify the number of stack elements in the vector unless you have a strong reason to.

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
298–299

This seems orthogonal to the purpose of the change, can you please comment why this is necessary?

mlir/lib/Dialect/SCF/IR/SCF.cpp
1073–1081

Why does it have to be a bit map? Are we expecting to use a mapping that is simultaneously thread-x, thread-y and block-z, for example?

nicolasvasilache added a comment (edited). Oct 28 2022, 1:27 AM

Have you considered making it an actual attribute in the GPU dialect, e.g., an enum attribute? Passing strings around feels a lot like unsafe JSON even if it is a readability improvement.

One aspect here is that the op semantics and transformations should be retargetable.
Is there a way to have something like a BaseForeachThreadTargetEnumAttr for which the GPU dialect could define Thread/Block, the IREE xxx dialect could define Workgroupyy, and a future ACME dialect could define its own?

A (far from ideal) solution would be to have users define their own names "foo", "bar", "baz" and explicitly have the map_to_target specify the map ["foo" -> threadIdx.x ...].
This is so ugly that I don't think it should be considered, but it should give a feeling for what a preferred way to interact with the system could look like.

guraypp updated this revision to Diff 471436.Oct 28 2022, 1:28 AM

remove size of the vector

Have you considered making it an actual attribute in the GPU dialect, e.g., an enum attribute? Passing strings around feels a lot like unsafe JSON even if it is a readability improvement.

I totally agree with you. Actually, I created an enum array at the beginning. But `foreach_thread` is manually parsed, printed, and verified, so I had to parse the enum array manually as well, which also looked unsafe to me. Therefore, I chose the string array option.
But if you have a strong argument, I can work on changing it to an enum.

mlir/include/mlir/Dialect/Utils/StaticValueUtils.h
61

I removed the size here.

mlir/lib/Dialect/GPU/TransformOps/GPUTransformOps.cpp
298–299

Good catch :) This is a minor bug fix.

Without this line, blockDim always required 3 dimensions, which was annoying when foreach_thread contains fewer than 3 loops. This line sets unused dimensions to 1.
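
A minimal C++ sketch of the padding described above; the variable name blockDims is an assumption for illustration, not the patch's actual code.

#include "llvm/ADT/SmallVector.h"

// Hypothetical: foreach_thread maps only two loops, so only two sizes
// are provided by the user.
llvm::SmallVector<int64_t> blockDims = {32, 8};
// Pad the unused trailing dimensions to 1, yielding {32, 8, 1}.
blockDims.resize(3, /*Value=*/1);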

mlir/lib/Dialect/SCF/IR/SCF.cpp
1073–1081

Exactly. One can think about mixing blocks and threads as below. This will be my future work.

loop1 --> blockIdx.x, threadIdx.x
loop2 --> blockIdx.y, threadIdx.y

Is there a way to have something like a BaseForeachThreadTargetEnumAttr for which the GPU dialect could define Thread/Block, the IREE xxx dialect could define Workgroupyy, and a future ACME dialect could define its own?

If you don't constrain the kind of attributes in the op verifier, any downstream client can use whatever they want there. Lowerings can then check if they support the specific attribute kind and fail early if they don't. If you want more guarantees, introduce an attribute interface. There is no need to invent new infrastructure layers here.
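
A rough C++ sketch of that early-exit check inside a lowering pattern; the attribute class name MyAcceleratorMappingAttr is made up for illustration.

// Hypothetical: a lowering pattern checks whether it supports the
// specific mapping attribute kind and fails to match early otherwise.
if (!mappingAttr.isa<MyAcceleratorMappingAttr>())
  return rewriter.notifyMatchFailure(op, "unsupported mapping attribute kind");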

Cool, thanks for the guidance, I had not followed that part of the stack.

Then I have a proposal. As Alex said, let's relax the verifier of foreach_thread, for example as follows (here I improved the syntax a bit).

scf.foreach_thread (%bi, %bj)  {
  scf.foreach_thread (%ti, %tj) {
  }  {map = "unit", dimensions = ["x", "y"]}
}  {map = "group", dimensions = ["x", "y"]}

Next, let's add a new generic op to the GPU dialect. One can map a named parallelism unit to blocks and threads.

// transform dialect for GPU
transform.gpu.map_foreach %op "group" to blocks
transform.gpu.map_foreach %op "unit" to threads

Another accelerator's transform dialect might use only the outer level of parallelism, for example.

// transform dialect for other accelerator
transform.other_accelerator.map_foreach %op "group" to threads

I will also need to change tile_to_foreach_thread_op to something like the following.

transform.structured.tile_to_foreach_thread_op %op (map = "unit", dimensions = ["x", "y"])
ftynse added a comment. Nov 2 2022, 5:24 AM

I think having string attributes and then binding them to something in extra operations is overkill, and it also doesn't advance us in the right direction. If you watch some of the earlier MLIR presentations, we specifically wanted to avoid MLIR becoming a "JSON of compiler IRs". String attributes loosely bound by an operation located elsewhere have exactly that JSON feeling for me.

What I propose concretely is to introduce a DeviceMappingAttrInterface, initially with no methods at all, just as a unification mechanism. Then, have a GPUThreadMappingAttr and a GPUBlockMappingAttr, potentially sharing parts of the implementation, that implement the DeviceMappingAttrInterface. These can print as #gpu.threads<y, z> and #gpu.blocks<x, y>, respectively. Other accelerators can then introduce their own attributes, e.g. #other_accelerator.mapping<parallel_hw_feature_no_7>, that also implement the interface.

The verifier of foreach_thread can then check that the mapping attribute is a DeviceMappingAttrInterface (or an array thereof, depending on the desired model for mapping to more than one parallel HW dimension). Initially, the lowering pattern to GPU can do an additional isa<GPUThreadMappingAttr, GPUBlockMappingAttr>() on the attribute and fail to match otherwise, with a helpful debug message. Other accelerators can do the same. In the slightly longer term, we can consider moving HW-specific parts of the lowering into interface methods so we can keep a common lowering, but this is not a priority.
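
A minimal C++ sketch of the verifier side of this proposal. DeviceMappingAttrInterface, GPUThreadMappingAttr, GPUBlockMappingAttr, and the mapping array are the names suggested above, taken here as assumptions rather than an existing API at the time of this review.

// Sketch only: accept any attribute implementing the proposed interface;
// HW-specific narrowing happens in the lowering patterns.
LogicalResult verifyMapping(scf::ForeachThreadOp op, ArrayAttr mapping) {
  for (Attribute attr : mapping)
    if (!attr.isa<DeviceMappingAttrInterface>())
      return op.emitOpError("mapping entries must implement "
                            "DeviceMappingAttrInterface");
  return success();
}

// The GPU lowering additionally narrows to the attribute kinds it supports:
if (!attr.isa<GPUThreadMappingAttr, GPUBlockMappingAttr>())
  return rewriter.notifyMatchFailure(op, "unsupported mapping attribute");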

@ftynse thanks for the clear explanation. As it is different from this current PR, I implemented it in D137413. Let me know what you think.

nicolasvasilache resigned from this revision. Nov 10 2022, 8:35 AM

I think we can now abandon this?

guraypp abandoned this revision. Nov 10 2022, 8:41 AM

We implemented a better solution in https://reviews.llvm.org/D137413